#### Abstract

In this post, we derive the bias-variance decomposition of the mean squared error for regression.

### Regression Decomposition

In regression analysis, it's common to decompose the observed value as follows:

$$ y_x = f_x + \epsilon_x \tag{0} $$

where the true regression $f_x$ is regarded as a constant given $x$. The error (or data noise) $\epsilon_{x}$ is independent of $f_x$ and is assumed to follow a Gaussian distribution with mean 0 and variance $\sigma^2$. Note that (0) is only a description of the data. When we replace $f_x$ with its estimate $\hat{f}_x$, (0) turns into a more practical form:

$$ y_x = \hat{f}_x + r_x \tag{1} $$

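where the estimate of the true regression, $\hat{f}_x$, is not a constant given $x$, and the residual $r_x$ describes the gap between $y_x$ and $\hat{f}_x$. Based on (0) and (1), we can make the following observations:

$$ \mathrm{E} \big[ \epsilon_{x} \big] = 0, \qquad \mathrm{Var} \big[ \epsilon_{x} \big] = \mathrm{E} \big[ \epsilon_{x}^2 \big] = \sigma^2 \tag{2} $$

$$ r_x = y_x - \hat{f}_x = f_x + \epsilon_{x} - \hat{f}_x \tag{3} $$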

### Bias-variance decomposition

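Using (2) and (3), we can show why minimizing the mean squared error is useful for a regression problem. For the derivation, we need a few more identities related to expectation and variance. Given any two independent random variables $X$ and $Y$, and a constant $c$, we have:

$$
\begin{aligned}
\mathrm{E} \big[ c \big] &= c \\
\mathrm{E} \big[ cX + Y \big] &= c \, \mathrm{E} \big[ X \big] + \mathrm{E} \big[ Y \big] \\
\mathrm{E} \big[ XY \big] &= \mathrm{E} \big[ X \big] \mathrm{E} \big[ Y \big] \\
\mathrm{Var} \big[ X \big] &= \mathrm{E} \big[ X^2 \big] - \mathrm{E} \big[ X \big]^2
\end{aligned}
\tag{4}
$$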
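The expectation and variance identities used in this derivation can be checked numerically. The snippet below is only a sanity-check sketch: the two distributions, the constant `c`, and the sample size are arbitrary choices made for illustration and do not come from the derivation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
c = 3.0  # arbitrary constant

# Two independent random variables (distributions chosen arbitrarily).
x = rng.normal(loc=1.0, scale=2.0, size=n)
y = rng.uniform(low=-1.0, high=3.0, size=n)

# Linearity of expectation: E[cX + Y] = c E[X] + E[Y]
lhs_linear = np.mean(c * x + y)
rhs_linear = c * np.mean(x) + np.mean(y)

# Product rule under independence: E[XY] = E[X] E[Y]
# (holds only approximately for a finite sample)
lhs_product = np.mean(x * y)
rhs_product = np.mean(x) * np.mean(y)

# Variance identity: Var[X] = E[X^2] - E[X]^2
lhs_var = np.var(x)
rhs_var = np.mean(x ** 2) - np.mean(x) ** 2

print("linearity:", lhs_linear, rhs_linear)
print("product  :", lhs_product, rhs_product)
print("variance :", lhs_var, rhs_var)
```

The linearity and variance identities hold for the sample averages up to floating-point error, while the product rule holds only up to sampling noise, because two independent samples still have a small nonzero empirical covariance.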

Beginning with the definition of the mean squared error, we can rewrite it as an expected value:

$$ \mathrm{MSE} = \mathrm{E} \big[ (y_x - \hat{f}_x)^2 \big] \tag{5} $$

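By substituting $y_x = f_x + \epsilon_{x}$ from (0) and expanding $(y_x - \hat{f}_x)^2$, we get

$$
\begin{aligned}
\mathrm{E} \big[ (y_x - \hat{f}_x)^2 \big] &= \mathrm{E} \big[ (f_x + \epsilon_{x} - \hat{f}_x)^2 \big] \\
&= f_x^2 + \mathrm{E} \big[ \epsilon_{x}^2 \big] + \mathrm{E} \big[ \hat{f}_x^2 \big] + 2 f_x \mathrm{E} \big[ \epsilon_{x} \big] - 2 f_x \mathrm{E} \big[ \hat{f}_x \big] - 2 \mathrm{E} \big[ \epsilon_{x} \hat{f}_x \big]
\end{aligned}
\tag{6}
$$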

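Note that $2 \mathrm{E} \big[ \epsilon_{x} \hat{f}_x \big] = 2 \mathrm{E} \big[ \epsilon_{x} \big] \mathrm{E} \big[ \hat{f}_x \big] = 0$ because $\epsilon_{x}$ is independent of $\hat{f}_x$ and $\mathrm{E} \big[ \epsilon_{x} \big] = 0$. For the same reason, $2 f_x \mathrm{E} \big[ \epsilon_{x} \big] = 0$. Using $\mathrm{E} \big[ \epsilon_{x}^2 \big] = \sigma^2$ and $\mathrm{E} \big[ \hat{f}_x^2 \big] = \mathrm{Var} \big[ \hat{f}_x \big] + \mathrm{E} \big[ \hat{f}_x \big]^2$, the remaining terms become

$$
\begin{aligned}
\mathrm{E} \big[ (y_x - \hat{f}_x)^2 \big] &= \mathrm{E} \big[ \epsilon_{x}^2 \big] + \mathrm{E} \big[ \hat{f}_x^2 \big] + f_x^2 - 2 f_x \mathrm{E} \big[ \hat{f}_x \big] \\
&= \sigma^2 + \Big( \mathrm{E} \big[ \hat{f}_x^2 \big] - \mathrm{E} \big[ \hat{f}_x \big]^2 \Big) + \Big( f_x^2 - 2 f_x \mathrm{E} \big[ \hat{f}_x \big] + \mathrm{E} \big[ \hat{f}_x \big]^2 \Big) \\
&= \sigma^2 + \mathrm{Var} \big[ \hat{f}_x \big] + \big( f_x - \mathrm{E} \big[ \hat{f}_x \big] \big)^2
\end{aligned}
\tag{7}
$$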

We thus reach our final form (7), which is the sum of the data noise variance $\sigma^2$, the prediction variance $\mathrm{Var} \big[ \hat{f}_x \big]$, and the squared prediction bias $(f_x - \mathrm{E} \big[ \hat{f}_x \big])^2$. This result is the bias-variance decomposition.
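The decomposition can also be checked with a small Monte Carlo simulation. The sketch below is an illustration under assumed settings, not part of the derivation: the true regression `f` (a sine), the noise level `sigma`, the query point `x0`, and the cubic polynomial estimator are all hypothetical choices. Each trial draws a fresh training set, fits the estimator, and compares its prediction at `x0` with a fresh noisy observation there.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # Hypothetical true regression function, chosen only for illustration.
    return np.sin(x)

sigma = 0.5        # assumed noise standard deviation
x0 = 1.0           # fixed query point x
n_train = 30       # training points per trial
n_trials = 20_000  # number of independent training sets
degree = 3         # polynomial degree of the estimator (assumed)

preds = np.empty(n_trials)  # \hat{f}_x, one per training set
obs = np.empty(n_trials)    # fresh observation y_x = f_x + eps_x, one per trial

for t in range(n_trials):
    x_train = rng.uniform(0.0, 3.0, size=n_train)
    y_train = f(x_train) + rng.normal(0.0, sigma, size=n_train)
    coeffs = np.polyfit(x_train, y_train, degree)
    preds[t] = np.polyval(coeffs, x0)
    # The noise at x0 is drawn independently of the training data.
    obs[t] = f(x0) + rng.normal(0.0, sigma)

mse = np.mean((obs - preds) ** 2)        # left-hand side: E[(y_x - \hat{f}_x)^2]
noise = sigma ** 2                       # data noise variance
variance = np.var(preds)                 # Var[\hat{f}_x]
bias_sq = (f(x0) - np.mean(preds)) ** 2  # (f_x - E[\hat{f}_x])^2

print("MSE                     :", mse)
print("noise + variance + bias²:", noise + variance + bias_sq)
```

The two printed numbers should agree up to sampling error, which shrinks as `n_trials` grows.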

#### Why does variance matter in regression?

Lowering the prediction bias certainly gives the model higher accuracy on the training dataset; however, to obtain similar performance outside the training dataset, we want to prevent the model from overfitting it. Given that the true regression has zero variance, a robust model should have as small a prediction variance as possible, which is consistent with the objective of minimizing the mean squared error.
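To make the role of prediction variance concrete, the sketch below estimates the squared bias and the variance of polynomial fits of different degrees at a single query point. Everything here is an assumption made for illustration: the helper `prediction_stats`, the sine-shaped true regression, the noise level, and the chosen degrees are hypothetical and not taken from the derivation above.

```python
import numpy as np

def prediction_stats(degree, n_trials=2000, n_train=30, sigma=0.5, x0=1.0, seed=0):
    """Estimate squared bias and variance of a polynomial fit at x0.

    The true regression is assumed to be sin(x); each trial uses a
    freshly sampled training set of n_train noisy points on [0, 3].
    """
    rng = np.random.default_rng(seed)
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x_train = rng.uniform(0.0, 3.0, size=n_train)
        y_train = np.sin(x_train) + rng.normal(0.0, sigma, size=n_train)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x0)
    bias_sq = (np.sin(x0) - preds.mean()) ** 2
    return bias_sq, preds.var()

for degree in (1, 3, 7):
    bias_sq, variance = prediction_stats(degree)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the more flexible fit tends to have a lower bias but a higher prediction variance, so a low bias alone does not guarantee a low mean squared error on unseen data.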