The analysis-of-variance (ANOVA) approach, whose main purpose is to assess the quality of the estimated regression, is based on the so-called partition of the sums of squares, whose formula is as follows [Walpole et al., p. 415]:
\[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \hspace{1cm} (1.1) \]
In short form, the above formula is written as

SST = SSR + SSE

where SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the error sum of squares.
The purpose of this short article is to provide the proof for the above formula — the so-called partition of sums of squares — for the case of regression involving a single independent variable x.
First of all, we start from the expansion of the left-hand side of formula (1.1),
\[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n}\left[(\hat{y}_i - \bar{y}) + \hat{\epsilon}_{i}\right]^2 \hspace{1cm} (1.2) \]
where the residuals are indicated by the epsilon character, \(\hat{\epsilon}_i = y_i - \hat{y}_i\). Recall that the formula for the fitted regression line (involving a single independent variable x) is
\[ \hat{y} = a + bx \hspace{1cm} (1.3) \]
Parameters a and b are estimated by the method of least squares, which minimizes the error sum of squares SSE; at the minimum, the derivatives of SSE with respect to a and b are both equal to zero:
\[ SSE = \sum_{i=1}^{n}\hat{\epsilon}_{i}^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - a - bx_i)^2 \hspace{1cm} (1.4) \]

\[ \frac{\partial (SSE)}{\partial a} = -2 \sum_{i=1}^{n}(y_i - a - bx_i) = -2 \sum_{i=1}^{n} \hat{\epsilon}_i = 0 \implies \bar{y} = a + b\bar{x} \hspace{1cm} (1.5) \]

\[ \frac{\partial (SSE)}{\partial b} = -2 \sum_{i=1}^{n}(y_i - \hat{y}_i)x_i = -2 \sum_{i=1}^{n} \hat{\epsilon}_i x_i = 0 \implies \sum_{i=1}^{n} \hat{\epsilon}_i x_i = 0 \hspace{1cm} (1.6) \]
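For completeness, solving the normal equations (1.5) and (1.6) jointly gives the familiar closed-form least-squares estimates (a standard result, not derived explicitly in this article):

```latex
\[ b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x} \]
```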
Equation (1.2) can be rewritten, expanding the square, as:

\[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}\hat{\epsilon}_{i}^2 + 2 \sum_{i=1}^{n} (\hat{y}_i - \bar{y})\hat{\epsilon}_{i} \hspace{1cm} (1.7) \]

Substituting \(\hat{y}_i = a + bx_i\) from (1.3), the cross term becomes

\[ 2 \sum_{i=1}^{n} (\hat{y}_i - \bar{y})\hat{\epsilon}_{i} = 2(a - \bar{y}) \sum_{i=1}^{n} \hat{\epsilon}_{i} + 2b \sum_{i=1}^{n} x_i \hat{\epsilon}_{i} \]
Recalling equation (1.6),

\[ \sum_{i=1}^{n} x_i \hat{\epsilon}_{i} = 0 \]
Furthermore, equation (1.5) implies that the sum of the residuals is exactly zero:

\[ \sum_{i=1}^{n} \hat{\epsilon}_{i} = \sum_{i=1}^{n} (y_i - a - bx_i) = 0 \]
We replace those values in (1.7): the cross term vanishes and we get

\[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}\hat{\epsilon}_{i}^2 \]
which is exactly what we wanted to prove.
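As a sanity check, the identity can also be verified numerically. The sketch below (the data points are illustrative, not taken from the article) fits a least-squares line in plain Python and confirms that SST equals SSR plus SSE to machine precision:

```python
# Numerical check of the partition of sums of squares, SST = SSR + SSE,
# for a simple least-squares regression. Data values are illustrative only.

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares estimates obtained from the normal equations (1.5)-(1.6):
# b = Sxy / Sxx and a = y_bar - b * x_bar
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

y_hat = [a + b * x for x in xs]  # fitted values

sst = sum((y - y_bar) ** 2 for y in ys)               # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # regression sum of squares
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # error sum of squares

print(f"SST       = {sst:.6f}")
print(f"SSR + SSE = {ssr + sse:.6f}")
assert abs(sst - (ssr + sse)) < 1e-9  # the partition holds
```

The two residual conditions used in the proof, \(\sum \hat{\epsilon}_i = 0\) and \(\sum x_i \hat{\epsilon}_i = 0\), can be checked on the same fit in exactly the same way.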
References
- Walpole R.E., Myers R.H., Myers S.L., Ye K. Probability & Statistics for Engineers & Scientists, Eighth Edition. Pearson Prentice Hall, 2007. ISBN 0-13-187711-9.

