In this post, we derive the bias-variance decomposition of the mean squared error for regression.


Bias-variance tradeoff

Regression Decomposition

In regression analysis, it is common to decompose the observed value $y$ as follows:

$$y = f(x) + \epsilon \tag{0}$$

where the true regression $f(x)$ is regarded as a constant given $x$. The error (or data noise) $\epsilon$ is independent of $f(x)$ and is assumed to follow a Gaussian distribution with mean 0 and variance $\sigma^2$. Note that (0) is only a description of the data. When we replace $f(x)$ with its estimate $\hat{f}(x)$, (0) turns into a more practical form:

$$y = \hat{f}(x) + r \tag{1}$$

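where the estimate of the true regression $\hat{f}(x)$ is regarded as a non-constant given $x$, and the residual $r$ describes the gap between $y$ and $\hat{f}(x)$. Based on (0) and (1), we can make the following observations:

$$\mathbb{E}[\epsilon] = 0, \qquad \mathrm{Var}[\epsilon] = \mathbb{E}[\epsilon^2] = \sigma^2 \tag{2}$$

$$\mathbb{E}[f(x)] = f(x), \qquad \mathrm{Var}[f(x)] = 0 \tag{3}$$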
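As a quick sanity check of the data model in (0), the sketch below draws many noisy observations at a single input and confirms that the noise has roughly zero mean and the assumed variance. The linear `true_regression`, the input `x0`, and the noise level `sigma` are illustrative assumptions, not values taken from this post.

```python
import random
import statistics

def true_regression(x):
    # Hypothetical true regression f(x); any fixed function would do here.
    return 2.0 * x + 1.0

def observe(x, sigma, rng):
    # Equation (0): y = f(x) + eps, with eps drawn from N(0, sigma^2).
    return true_regression(x) + rng.gauss(0.0, sigma)

rng = random.Random(0)  # fixed seed so the run is repeatable
x0, sigma = 1.5, 0.5    # assumed input and noise standard deviation
ys = [observe(x0, sigma, rng) for _ in range(100_000)]

# Given x0, f(x0) is a constant, so y - f(x0) recovers the noise.
noise = [y - true_regression(x0) for y in ys]
noise_mean = statistics.fmean(noise)
noise_var = statistics.pvariance(noise)

print(f"mean of noise     ~ {noise_mean:.4f} (expected 0)")
print(f"variance of noise ~ {noise_var:.4f} (expected {sigma ** 2})")
```

Because $f(x)$ is fixed once $x$ is given, all of the spread in the simulated $y$ values comes from the noise term.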

Bias-variance decomposition

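Using (2) and (3), we can show why minimizing the mean squared error is useful for regression problems. For the derivation, we need a few more identities related to expectation and variance. Given any two independent random variables $u$ and $v$, and a constant $c$, we have:

$$
\begin{aligned}
\mathbb{E}[c] &= c, & \mathbb{E}[u + v] &= \mathbb{E}[u] + \mathbb{E}[v], & \mathbb{E}[uv] &= \mathbb{E}[u]\,\mathbb{E}[v], \\
\mathrm{Var}[c] &= 0, & \mathrm{Var}[u + c] &= \mathrm{Var}[u], & \mathrm{Var}[cu] &= c^2\,\mathrm{Var}[u], \\
\mathrm{Var}[u] &= \mathbb{E}[u^2] - \mathbb{E}[u]^2
\end{aligned}
\tag{4}
$$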
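The expectation and variance identities used in the derivation can be checked numerically. The following is a minimal sketch; the two sampling distributions and the constant `c` are arbitrary choices made only for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 100_000
c = 3.0

# Two independent samples; the distributions are arbitrary assumptions.
u = [rng.gauss(1.0, 2.0) for _ in range(n)]
v = [rng.uniform(0.0, 4.0) for _ in range(n)]

mean_u = statistics.fmean(u)
mean_v = statistics.fmean(v)
var_u = statistics.pvariance(u)

# E[uv] = E[u] E[v] holds (approximately, in a sample) for independent u, v.
mean_uv = statistics.fmean([a * b for a, b in zip(u, v)])

# Var[u + c] = Var[u]: shifting by a constant does not change the spread.
var_shifted = statistics.pvariance([a + c for a in u])

# Var[u] = E[u^2] - E[u]^2.
var_from_moments = statistics.fmean([a * a for a in u]) - mean_u ** 2

print(f"E[uv] ~ {mean_uv:.3f}, E[u]E[v] ~ {mean_u * mean_v:.3f}")
print(f"Var[u + c] ~ {var_shifted:.3f}, Var[u] ~ {var_u:.3f}")
print(f"E[u^2] - E[u]^2 ~ {var_from_moments:.3f}")
```

The product identity only holds up to sampling error, while the two variance identities hold up to floating-point rounding for any sample.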

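We begin with the definition of the mean squared error and rewrite it as an expected value:

$$\mathrm{MSE} = \mathbb{E}\big[(y - \hat{f}(x))^2\big] \tag{5}$$

By expanding $y$ using (0), we get

$$\mathrm{MSE} = \mathbb{E}\big[(\epsilon + f(x) - \hat{f}(x))^2\big] = \mathbb{E}[\epsilon^2] + 2\,\mathbb{E}\big[\epsilon\,(f(x) - \hat{f}(x))\big] + \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] \tag{6}$$

Note that $\mathbb{E}[\epsilon\,(f(x) - \hat{f}(x))] = \mathbb{E}[\epsilon]\,\mathbb{E}[f(x) - \hat{f}(x)] = 0$ because $\epsilon$ is independent of $f(x)$ and $\hat{f}(x)$, and $\mathbb{E}[\epsilon] = 0$. Applying the identity $\mathbb{E}[u^2] = \mathrm{Var}[u] + \mathbb{E}[u]^2$ to the two remaining terms, and using the fact that $f(x)$ is a constant, gives

$$\mathrm{MSE} = \mathrm{Var}[\epsilon] + \mathrm{Var}[f(x) - \hat{f}(x)] + \big(\mathbb{E}[f(x) - \hat{f}(x)]\big)^2 = \sigma^2 + \mathrm{Var}[\hat{f}(x)] + \big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2 \tag{7}$$

We thus reach our final form (7), which is the sum of the data noise variance $\sigma^2$, the prediction variance $\mathrm{Var}[\hat{f}(x)]$, and the squared prediction bias $\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2$. This result is the bias-variance decomposition.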
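The decomposition can also be checked by simulation. The sketch below is only an illustration under stated assumptions: it fixes a single input, takes the true regression value there to be `f_x = 2.0`, and uses a deliberately biased toy estimator (a shrunken sample mean with the hypothetical factor `shrink`) that is refit on a fresh training set in every trial. None of these choices come from this post.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable
f_x = 2.0       # assumed true regression value f(x) at one fixed input
sigma = 1.0     # assumed noise standard deviation
n_train = 5     # observations per training set
shrink = 0.8    # hypothetical shrinkage factor; makes the estimator biased
trials = 100_000

predictions = []
squared_errors = []
for _ in range(trials):
    # Fit: a shrunken sample mean of n_train noisy observations of f(x).
    train = [f_x + rng.gauss(0.0, sigma) for _ in range(n_train)]
    f_hat = shrink * statistics.fmean(train)
    # Evaluate against a fresh observation y = f(x) + eps, as in (0).
    y = f_x + rng.gauss(0.0, sigma)
    predictions.append(f_hat)
    squared_errors.append((y - f_hat) ** 2)

mse = statistics.fmean(squared_errors)                  # left side of (7)
pred_var = statistics.pvariance(predictions)            # prediction variance
bias_sq = (f_x - statistics.fmean(predictions)) ** 2    # squared bias
decomposed = sigma ** 2 + pred_var + bias_sq            # right side of (7)

print(f"simulated MSE             ~ {mse:.3f}")
print(f"noise + variance + bias^2 ~ {decomposed:.3f}")
```

For this toy estimator the three terms have closed forms (noise 1.0, variance 0.8² · 1/5 = 0.128, squared bias (2 − 1.6)² = 0.16), so both printed values should land near 1.288, up to Monte Carlo error.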

Why does variance matter in regression?

Lowering the prediction bias certainly gives the model higher accuracy on the training dataset; however, to obtain similar performance outside of the training dataset, we want to prevent the model from overfitting it. Given that the true regression has zero variance, a robust model should keep its prediction variance as small as possible, which is consistent with the objective of minimizing the mean squared error.
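To see how the variance and bias terms pull against each other, here is a minimal sketch using closed-form values for a toy estimator: the sample mean of `n_train` noisy observations at one input, multiplied by a hypothetical factor `shrink`. The estimator and all default values are assumptions chosen for illustration. Its prediction variance is `shrink**2 * sigma**2 / n_train` and its squared bias is `((1 - shrink) * f_x)**2`.

```python
def mse_terms(shrink, f_x=2.0, sigma=1.0, n_train=5):
    # Closed-form terms for the toy estimator shrink * mean(y_1, ..., y_n),
    # where each y_i = f_x + eps_i and eps_i has variance sigma^2.
    noise = sigma ** 2
    variance = shrink ** 2 * sigma ** 2 / n_train
    bias_sq = ((1.0 - shrink) * f_x) ** 2
    return noise, variance, bias_sq

for shrink in (0.0, 0.5, 0.95, 1.0):
    noise, variance, bias_sq = mse_terms(shrink)
    total = noise + variance + bias_sq
    print(f"shrink={shrink:4.2f}  variance={variance:.4f}  "
          f"bias^2={bias_sq:.4f}  MSE={total:.4f}")
```

With `shrink=0.0` the prediction variance is zero but the squared bias dominates, while `shrink=1.0` is unbiased with the largest variance of the four settings. In this toy setting, a slightly shrunken estimator (`shrink=0.95`) gives a lower total than either extreme, which suggests that the mean squared error rewards small variance only as long as the bias it costs stays smaller.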