\(\newcommand{\E}{\mathbb{E}}\)Recently I stumbled across a tweet asking why we divide by \(n - 1\) instead of \(n\) if we want our variance estimator to be unbiased. Somewhat embarrassingly, I had no idea why. I’m sure that if we chug through the algebra, we’ll be able to derive it. And sure enough, a quick google search reveals a number of excellent articles that bravely did the math™.

If, however, you’re lazy like me and refuse to do the algebra, then this post is for you. The goal of this post is to use as little algebra as possible to prove that

\[\sigma^2 = \frac{n}{n - 1} \cdot \E~\hat\sigma^2,\]

where \(\hat \sigma^2\) is the sample variance and \(\sigma^2\) is the population variance. Some knowledge of linear algebra and triangles will be required though.

Setup

First, let’s formalize \(\hat \sigma^2\) and \(\sigma^2\) a little bit. Let \(X_n\) be an \(n\)-dimensional random vector consisting of i.i.d. random variables \((X_1, \ldots, X_n)\). The sample variance is then

\[\hat \sigma^2 = \frac{1}{n} \sum_i (X_i - \bar X)^2,\]

where \(\bar X = \frac{1}{n} \sum_i X_i\) is the sample mean of \(X_n\). Meanwhile, the population variance is

\[\sigma^2 = \E (X - \mu)^2,\]

where \(\mu = \E(X)\) is the population mean and \(X\) denotes any one of the (identically distributed) \(X_i\).
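
Before the proof, here’s a quick numerical sanity check of the claim (not part of the argument; it assumes numpy, and the particular numbers \(n = 5\), \(\mu = 3\), \(\sigma^2 = 4\) are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5              # sample size; a small n makes the bias easy to see
trials = 200_000   # number of simulated samples
mu, sigma2 = 3.0, 4.0

# Draw `trials` independent samples of size n from a N(mu, sigma2) population.
X = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))

# Sample variance with the 1/n convention used above
# (same as np.var(X, axis=1, ddof=0)).
sigma2_hat = ((X - X.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print(sigma2_hat.mean())                # ~ (n - 1) / n * sigma2 = 3.2
print(n / (n - 1) * sigma2_hat.mean())  # ~ sigma2 = 4.0
```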

Proof

I’ll try to be as succinct as possible. Note that it suffices to prove that

\[\sigma^2 = n \sigma^2 - n \cdot \E~\hat \sigma^2.\]
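
In case that equivalence isn’t obvious, it’s just a rearrangement of the target identity:

\[\begin{align*} \sigma^2 = n \sigma^2 - n \cdot \E~\hat\sigma^2 \;\iff\; n \cdot \E~\hat\sigma^2 = (n - 1)\,\sigma^2 \;\iff\; \sigma^2 = \frac{n}{n - 1} \cdot \E~\hat\sigma^2. \end{align*}\]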

Next, introduce a new estimator that is almost exactly the same as \(\hat \sigma^2\), except that we center using the true mean \(\mu\):

\[\tilde \sigma^2 = \frac{1}{n} \sum_i (X_i - \mu)^2.\]

It is pretty easy to see that \(\tilde \sigma^2\) is an unbiased estimator of \(\sigma^2\): by linearity of expectation, \(\E~\tilde\sigma^2 = \frac{1}{n} \sum_i \E (X_i - \mu)^2 = \sigma^2\). We now express \(\hat \sigma^2\) and \(\tilde \sigma^2\) in geometry-friendly terms, along with some judicious renaming (here \(1_n\) denotes the all-ones vector in \(\mathbb{R}^n\)):

\[\begin{align*} A^2 = n \hat \sigma^2 &= \| X_n - 1_n \bar X \|^2 \\ C^2 = n \tilde \sigma^2 &= \| X_n - 1_n \mu \|^2, \end{align*}\]

and observe that \((X_n, 1_n \bar X, 1_n \mu)\) are co-planar. Now complete the triangle with

\[B^2 = n \cdot (\bar X - \mu)^2 = \| 1_n \bar X - 1_n \mu \|^2.\]
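
The key geometric fact, used in the next step, is an orthogonality relation that can be checked directly from the definitions:

\[\langle X_n - 1_n \bar X,\; 1_n \bar X - 1_n \mu \rangle = (\bar X - \mu) \sum_i (X_i - \bar X) = (\bar X - \mu)(n \bar X - n \bar X) = 0.\]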

This is a right triangle: \(A\) is perpendicular to \(B\), since \(1_n \bar X\) is the orthogonal projection of \(X_n\) onto the span of \(1_n\). Now use the Pythagorean Theorem to conclude that

\[n \cdot (\bar X - \mu)^2 = n \tilde \sigma^2 - n \hat \sigma^2.\]

Taking expectations on both sides, the RHS becomes

\[n \sigma^2 - n \cdot \E~\hat \sigma^2,\]

and the LHS simply becomes

\[\begin{align*} n \cdot \E (\bar X - \mu)^2 = n \cdot \mathrm{var}(\bar X) = \sigma^2. && \blacksquare \end{align*}\]
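
(The very last equality is the usual variance-of-the-mean computation, which is where independence gets used:

\[n \cdot \mathrm{var}(\bar X) = n \cdot \mathrm{var}\!\left(\frac{1}{n} \sum_i X_i\right) = n \cdot \frac{1}{n^2} \sum_i \mathrm{var}(X_i) = \frac{1}{n} \cdot n \sigma^2 = \sigma^2.)\]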

Final Thoughts

I think what’s neat about this geometric interpretation is that it captures quite nicely the insight that \(1_n \mu\) can never be closer to \(X_n\) than \(1_n \bar X\) is. Furthermore, if you crudely think of \(X_n\) as fixed and \(\mu\) as randomly wiggling around \(\bar X\), you can visualize the process as a right triangle whose side \(B\) is randomly wiggling around \(B = 0\). For such a randomized triangle, the Pythagorean Theorem immediately shows that the variance of \(B\) is exactly how much longer \(C^2\) is than \(A^2\) on average. This neat little geometric fact, which links the variance of one term to the bias between two other terms, arises not just here but more broadly in all bias-variance decompositions!
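
To make that last sentence concrete, here is the generic bias-variance decomposition for an estimator \(\hat\theta\) of a target \(\theta\) (the notation \(\hat\theta, \theta\) is mine, not something used above):

\[\E (\hat\theta - \theta)^2 = \underbrace{(\E \hat\theta - \theta)^2}_{\text{bias}^2} + \underbrace{\E (\hat\theta - \E \hat\theta)^2}_{\text{variance}},\]

which is again a right triangle, this time in the \(L^2\) sense: the fluctuation \(\hat\theta - \E \hat\theta\) has mean zero and is therefore orthogonal (uncorrelated) to the constant \(\E \hat\theta - \theta\).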

Afterthought: (n - 1) Degrees of Freedom?

A common refrain I see for justifying the \((n - 1)\) term is that we “lose a degree of freedom”. This is technically true, in that you can fully recover \(X_n\) simply by knowing \(X_1, \ldots, X_{n-1}\) and the sample mean \(\bar X\), so centering \(X_n\) using \(\bar X\) loses one degree of randomness compared to centering using \(\mu\), which retains the original \(n\) degrees of randomness. As tempting as it is to accept this as a valid explanation, I think it is actually a bad one, because it doesn’t tell me why losing \(1\) degree of freedom out of \(n\) translates precisely into a multiplicative correction by \(n / (n - 1)\), other than a fuzzy sense that it seems like the right correction to do. From a pedagogical perspective, I think appealing to degrees of freedom in fact introduces confusion by giving us the illusion of understanding what’s going on when we don’t really. Of course, if there’s some deep insight from the degrees-of-freedom perspective that I’m not seeing and someone is kind enough to walk me through it, I’ll happily retract my critique.


(Calling \(A\), \(B\), and \(C\) perpendicular is a slight abuse of notation, since they are the lengths of the line segments and not the line segments themselves, but I think you get what I mean.)