
Why is the denominator $N-p-1$ in estimation of variance?

$\begingroup$

I was recently going through the book The Elements of Statistical Learning by Hastie et al. In this book, while explaining the ordinary least squares model, the authors assume that $y_i \in \mathbb{R}$ represents the observed variables, $\hat{y}_i$ represents the model output, and $\mathbf{x}_i \in \mathbb{R}^{p+1}$ represents the inputs. If the $y_i$s are assumed to be uncorrelated and to have constant variance $\sigma^2$, then the unbiased estimate of the variance is $\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}\left( y_i - \hat{y}_i \right)^2$. Note that $p$ denotes the number of input features, so each $\mathbf{x}_i$ is $(p+1)$-dimensional once the intercept term is included. My question is: why is the factor in the denominator $N-p-1$ when estimating the variance of the $y_i$s, i.e. $\hat{\sigma}^2$? From my understanding, if the $y_i$s are real numbers with constant variance, the factor should be $N-1$.

$\endgroup$

3 Answers

$\begingroup$

The current accepted answer is flawed, as it implicitly assumes that the error of the model $\varepsilon$ is Gaussian (otherwise you need not have $\sum(y_i-\hat{y}_i)^2\sim\sigma^2\chi^2_{N-p-1}$).

Here's a proof with the general assumptions that $y = X\beta + \varepsilon$, that $\hat\beta = (X^TX)^{-1}X^Ty$ is the least-squares estimate, and that $\varepsilon$ has mean $0$ and variance $\sigma^2 I_N$.

First note that $\sum(y_i-\hat{y}_i)^2=\|y-X\hat\beta\|^2$.

We have $$\begin{align} y-X\hat\beta &= X\beta +\varepsilon -X(X^TX)^{-1}X^T(X\beta +\varepsilon)\\ &=X\beta +\varepsilon - X\beta -X(X^TX)^{-1}X^T\varepsilon\\ &= (I_N-H)\varepsilon\end{align}$$ where $H=X(X^TX)^{-1}X^T$ is the hat matrix. It's easy to check that $H^T=H$ and $H^2=H$ (indeed the hat matrix is merely the orthogonal projection on $\operatorname{Im}X$).
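If you want to verify these two properties numerically, here is a minimal NumPy sketch; the design matrix, the sizes $N=50$, $p=3$, and the seed are arbitrary choices for illustration, not anything from the book:

```python
import numpy as np

# Minimal sketch: check H^T = H and H^2 = H for a random design matrix.
# N, p, and the seed are arbitrary illustrative choices.
rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # N x (p+1), intercept included

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix

print(np.allclose(H, H.T))    # True: H is symmetric
print(np.allclose(H @ H, H))  # True: H is idempotent
```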

Hence $\begin{aligned}[t]E( \|y-X\hat\beta\|^2) &= E(\varepsilon^T(I_N-H)^T (I_N-H)\varepsilon)=E(\varepsilon^T(I_N-H)\varepsilon) \end{aligned}$

Note that $\varepsilon^T(I_N-H)\varepsilon=\sum_{i,j} \varepsilon_i\varepsilon_j (\delta_{ij}-H_{ij})$, thus $$E(\varepsilon^T(I_N-H)\varepsilon)=\sum_{i,j} \sigma^2\delta_{ij} (\delta_{ij}-H_{ij})=\sigma^2(N-\operatorname{tr}H)$$

Note that $\operatorname{tr}H =\operatorname{tr}(X(X^TX)^{-1}X^T)=\operatorname{tr}(X^TX(X^TX)^{-1})=\operatorname{tr}(I_{p+1})=p+1$, using the cyclic property of the trace.

Putting everything together, $E( \|y-X\hat\beta\|^2)=\sigma^2(N-p-1)$, so dividing the residual sum of squares by $N-p-1$ yields an unbiased estimator of $\sigma^2$.
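As a sanity check on this result, here is a small Monte Carlo sketch that deliberately draws non-Gaussian (uniform) errors rescaled to variance $\sigma^2$; all sizes, coefficients, and the seed are made-up illustrative values:

```python
import numpy as np

# Monte Carlo sketch: E||y - X beta_hat||^2 should be sigma^2 * (N - p - 1)
# even for non-Gaussian errors. Uniform errors on [-a, a] have variance a^2/3,
# so a = sigma * sqrt(3) gives variance sigma^2.
rng = np.random.default_rng(1)
N, p, sigma = 40, 2, 1.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)

rss = []
for _ in range(20000):
    eps = rng.uniform(-sigma * np.sqrt(3), sigma * np.sqrt(3), size=N)
    y = X @ beta + eps
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss.append(np.sum((y - X @ beta_hat) ** 2))

print(np.mean(rss))            # close to sigma^2 * (N - p - 1)
print(sigma**2 * (N - p - 1))  # 2.25 * 37 = 83.25
```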

$\endgroup$

$\begingroup$

You can show that $\sum(y_i-\hat{y}_i)^2\sim\sigma^2\chi^2_{N-p-1}$ (under Gaussian errors). Since the expectation of a $\chi^2_{N-p-1}$ random variable is $N-p-1$, it follows that $\mathbb{E}\left(\frac{1}{N-p-1}\sum(y_i-\hat{y}_i)^2\right)=\sigma^2$.

$N-p-1$ is in the denominator to make the estimator unbiased.
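For what it's worth, here is a quick simulation sketch of that distributional claim; the sizes and parameters are made up, and the errors are drawn Gaussian because the $\chi^2$ result requires it:

```python
import numpy as np
from scipy import stats

# Sketch: with Gaussian errors, RSS / sigma^2 should follow chi^2_{N-p-1}.
rng = np.random.default_rng(2)
N, p, sigma = 30, 4, 2.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)

samples = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    samples.append(np.sum((y - X @ beta_hat) ** 2) / sigma**2)

# Kolmogorov-Smirnov test against chi^2_{N-p-1}: a large p-value is expected.
print(stats.kstest(samples, stats.chi2(df=N - p - 1).cdf))
print(np.mean(samples), N - p - 1)  # sample mean close to N - p - 1 = 25
```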

$\endgroup$

$\begingroup$

To answer the question without using the Gaussian assumption or the additive-model assumption used in the previous answers, here is my take:

The assumptions made are that the observations $y_i$ are uncorrelated with mean zero and constant variance $\sigma^2$, and that the $x_i$ are fixed (non-random).

From the previous section we have that $\hat{y} = X(X^TX)^{-1}X^Ty = Hy$, where the hat matrix can easily be shown to satisfy $H^T=H^2=H$.

Also, we can rewrite $\sum_{i=1}^N(y_i - \hat{y}_i)^2$ as $(y-\hat{y})^T(y-\hat{y})$, from which it follows that

\begin{align*} \mathbb{E}[\sum_{i=1}^N(y_i - \hat{y_i})^2] &= \mathbb{E}[(y-Hy)^T(y-Hy)] = \mathbb{E}[((I-H)y)^T((I-H)y)] \\ &= \mathbb{E}[y^T(I-H)(I-H)y] = \mathbb{E}[y^T(I-H)y] \\ &= \mathbb{E}[\sum_{i,j}y_iy_j(\delta_{ij}-H_{i,j})] \\ &= \sum_{i,j}\mathbb{E}[y_iy_j\delta_{ij}]- \sum_{i,j}\mathbb{E}[y_iy_jH_{i,j}] \\ &= \sum_{i}\mathbb{E}[y_i^2]- \sum_{i,j}\mathbb{E}[y_iy_j]H_{i,j} \end{align*}

The hat matrix can be taken out of the expectation since it is non-random, as the $x_i$ are.

In addition, the $y_i$ are uncorrelated with mean zero and variance $\sigma^2$, so $\mathbb{E}[y_iy_j] = \sigma^2\delta_{ij}$; thus, using the cyclic property of the trace:\begin{align*} \mathbb{E}\left[\sum_{i=1}^N(y_i - \hat{y}_i)^2\right] &= \sum_{i}\sigma^2 - \sum_{i}\sigma^2H_{i,i} = \sigma^2[N - \operatorname{trace}(H)] \\ &= \sigma^2[N - \operatorname{trace}(X(X^TX)^{-1}X^T)] \\ &= \sigma^2[N - \operatorname{trace}(X^TX(X^TX)^{-1})] \\ &= \sigma^2[N - \operatorname{trace}(I_{p+1})] \\ &= \sigma^2[N - p - 1] \end{align*}

So finally we have that $\hat{\sigma}^2 = \frac{1}{N - p - 1}\sum_{i=1}^N(y_i - \hat{y}_i)^2$ is an unbiased estimator of $\sigma^2$, since \begin{align} \mathbb{E}[\hat{\sigma}^2] = \mathbb{E}\left[\frac{1}{N - p - 1}\sum_{i=1}^N(y_i - \hat{y}_i)^2\right] = \sigma^2. \end{align}
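To tie this back to the original question, here is a short simulation sketch (all sizes and parameters are arbitrary) contrasting the $N-p-1$ divisor with the $N-1$ divisor proposed in the question; the latter comes out biased low by the factor $(N-p-1)/(N-1)$:

```python
import numpy as np

# Sketch: dividing the RSS by N - p - 1 is (approximately, over many draws)
# unbiased for sigma^2, while dividing by N - 1 underestimates it.
rng = np.random.default_rng(3)
N, p, sigma = 25, 5, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)

est_correct, est_naive = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    est_correct.append(rss / (N - p - 1))
    est_naive.append(rss / (N - 1))

print(np.mean(est_correct))  # close to sigma^2 = 1.0
print(np.mean(est_naive))    # close to sigma^2 * (N - p - 1)/(N - 1) ~ 0.79
```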

$\endgroup$
