In this post, we derive the bias-variance decomposition of the mean squared error for regression.
In regression analysis, it is common to decompose the observed value $y$ as follows:

$$y = f(x) + \epsilon \tag{0}$$
where the true regression $f(x)$ is regarded as a constant given $x$. The error (or data noise) $\epsilon$ is independent of $x$, under the assumption that $\epsilon$ follows a Gaussian distribution with mean 0 and variance $\sigma^2$. Note that (0) is only a description of the data. When we replace $f$ with its estimate $\hat{f}$, (0) turns into a more practical form:

$$y = \hat{f}(x) + r \tag{1}$$
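where the estimate $\hat{f}(x)$ of the true regression is regarded as non-constant given $x$ (it depends on the training sample), and the residual $r$ describes the gap between $y$ and $\hat{f}(x)$. Based on (0) and (1), we can make the following observations:

$$\mathbb{E}[f(x)] = f(x), \qquad \mathrm{Var}[f(x)] = 0 \tag{2}$$

$$\mathbb{E}[\hat{f}(x)] \neq \hat{f}(x), \qquad \mathrm{Var}[\hat{f}(x)] \neq 0 \quad \text{in general} \tag{3}$$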
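Using (2) and (3), we can show why minimizing the mean squared error is useful for regression problems. For the derivation, we need a few more identities related to expectation and variance. Given any two independent random variables $x$, $y$ (generic symbols here, not the data above) and a constant $c$, we have:

$$\mathbb{E}[xy] = \mathbb{E}[x]\,\mathbb{E}[y] \tag{4}$$

$$\mathrm{Var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 \tag{5}$$

$$\mathbb{E}[c - x] = c - \mathbb{E}[x], \qquad \mathrm{Var}[c - x] = \mathrm{Var}[x] \tag{6}$$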
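Identities of this kind are easy to sanity-check numerically. The sketch below is only an illustration: the distributions, the constant, and the names `u`, `v`, `c`, and `check_identities` are arbitrary choices, not part of the derivation. It compares both sides of three standard facts by sampling: the expectation of a product of independent variables factorizes, variance equals the mean of the square minus the square of the mean, and subtracting a variable from a constant leaves its variance unchanged.

```python
import numpy as np


def check_identities(n=1_000_000, seed=0):
    """Estimate both sides of three expectation/variance identities by sampling."""
    rng = np.random.default_rng(seed)
    u = rng.normal(loc=1.0, scale=2.0, size=n)  # arbitrary choice: E[u] = 1, Var[u] = 4
    v = rng.uniform(0.0, 3.0, size=n)           # drawn independently of u
    c = 5.0                                     # arbitrary constant

    return {
        # E[uv] vs. E[u] E[v]; relies on u and v being independent
        "product": (np.mean(u * v), np.mean(u) * np.mean(v)),
        # Var[u] vs. E[u^2] - E[u]^2
        "variance": (np.var(u), np.mean(u**2) - np.mean(u) ** 2),
        # Var[c - u] vs. Var[u]; a shift and a sign flip do not change variance
        "shift": (np.var(c - u), np.var(u)),
    }


if __name__ == "__main__":
    for name, (lhs, rhs) in check_identities().items():
        print(f"{name:>8}: lhs={lhs:.4f}  rhs={rhs:.4f}")
```

The product identity only holds up to sampling error, so the two sides agree approximately rather than exactly; the other two are algebraic and agree up to floating-point rounding.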
We begin with the definition of the mean squared error, which we can write as an expected value:

$$\mathrm{MSE} = \mathbb{E}[r^2] = \mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr]$$
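Substituting (0) for $y$ and expanding the square, we get

$$\mathrm{MSE} = \mathbb{E}\bigl[(f(x) + \epsilon - \hat{f}(x))^2\bigr] = \mathbb{E}[\epsilon^2] + 2\,\mathbb{E}\bigl[\epsilon\,(f(x) - \hat{f}(x))\bigr] + \mathbb{E}\bigl[(f(x) - \hat{f}(x))^2\bigr]$$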
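Note that $\mathbb{E}\bigl[\epsilon\,(f(x) - \hat{f}(x))\bigr] = \mathbb{E}[\epsilon]\,\mathbb{E}[f(x) - \hat{f}(x)] = 0$ because $\epsilon$ is independent of $f(x) - \hat{f}(x)$ and $\mathbb{E}[\epsilon] = 0$. Writing each remaining expectation of a square as a variance plus a squared mean, and using the fact that $f(x)$ is constant given $x$, we obtain

$$\mathrm{MSE} = \mathrm{Var}[\epsilon] + \mathrm{Var}[f(x) - \hat{f}(x)] + \bigl(\mathbb{E}[f(x) - \hat{f}(x)]\bigr)^2 = \sigma^2 + \mathrm{Var}[\hat{f}(x)] + \bigl(f(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2 \tag{7}$$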
We thus reach our final form (7), which is the sum of the data noise variance $\sigma^2$, the prediction variance $\mathrm{Var}[\hat{f}(x)]$, and the squared prediction bias $\bigl(f(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2$. This result is the bias-variance decomposition.
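The decomposition can also be checked by simulation. The following is a minimal sketch under assumptions that are not in the derivation itself: the true regression is taken to be `sin(x)`, the model is a cubic polynomial fitted with `np.polyfit`, and the test point `x0`, noise level `sigma`, and sample sizes are arbitrary. Each trial draws a fresh training set, fits the model, and predicts at `x0`; the squared error against a fresh noisy observation is then compared with the sum of noise variance, prediction variance, and squared bias.

```python
import numpy as np


def f_true(x):
    """Hypothetical true regression, chosen only for illustration."""
    return np.sin(x)


def simulate(x0=1.0, sigma=0.5, n_train=30, degree=3, n_trials=5000, seed=0):
    """Monte Carlo estimate of the MSE at x0 and of its three components."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_trials)
    sq_errs = np.empty(n_trials)
    for t in range(n_trials):
        # Fresh training set: y = f(x) + noise
        x = rng.uniform(0.0, 3.0, n_train)
        y = f_true(x) + rng.normal(0.0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coefs, x0)
        # Fresh noisy observation at x0, independent of the training noise
        y0 = f_true(x0) + rng.normal(0.0, sigma)
        sq_errs[t] = (y0 - preds[t]) ** 2
    return {
        "mse": sq_errs.mean(),
        "noise": sigma**2,
        "variance": preds.var(),
        "bias_sq": (f_true(x0) - preds.mean()) ** 2,
    }


if __name__ == "__main__":
    out = simulate()
    total = out["noise"] + out["variance"] + out["bias_sq"]
    print(f"MSE estimate:             {out['mse']:.4f}")
    print(f"noise + variance + bias²: {total:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `n_trials` grows.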
Why does variance matter in regression?
Lowering the prediction bias certainly gives the model higher accuracy on the training dataset; however, to obtain similar performance outside of the training dataset, we want to prevent the model from overfitting it. Given that the true regression has zero variance, a robust model should have a prediction variance that is as small as possible, and this is consistent with the objective of minimizing the mean squared error.
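To make the trade-off concrete, the sketch below compares a rigid model with a flexible one. Everything in it is an assumption for illustration: the true regression `sin(x)`, the polynomial degrees 1 and 5, the test point, and the helper name `bias_variance`. For each degree it refits the model on many independent training sets and estimates the squared bias and the variance of the prediction at a single point.

```python
import numpy as np


def f_true(x):
    """Hypothetical true regression, chosen only for illustration."""
    return np.sin(x)


def bias_variance(degree, x0=1.5, sigma=0.5, n_train=30, n_trials=2000, seed=0):
    """Estimate squared bias and variance of a polynomial fit's prediction at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Each trial uses an independent training set of the same size
        x = rng.uniform(0.0, 3.0, n_train)
        y = f_true(x) + rng.normal(0.0, sigma, n_train)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x0)
    bias_sq = (f_true(x0) - preds.mean()) ** 2
    return bias_sq, preds.var()


if __name__ == "__main__":
    for degree in (1, 5):
        bias_sq, variance = bias_variance(degree)
        print(f"degree {degree}: bias² = {bias_sq:.4f}, variance = {variance:.4f}")
```

Under these assumptions, the straight line cannot follow the curvature of `sin(x)` and so carries the larger squared bias, while the degree-5 polynomial tracks the training noise more closely and so carries the larger prediction variance.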