Statistical inference

Statistical inference is the process of leveraging a mathematical modeling apparatus to make deductions about quantities estimated from data;

Modeling and inference involve


In order to say something about the uncertainty of estimates (of unknown parameters and thereby predictions) we need

That’s where statistical models come in.

To keep it simple for now, lets assume that our data is an independent and identically distributed sub-sample from a Gaussian population:

\[ Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2), \quad i=1,\dots,n. \]


What does that mean, independent and identically distributed?

It means that

In other words, \[ \begin{aligned} p(Y_i = y_i \mid Y_j = y_j, \mu, \sigma^2) &= p(Y_i \mid \mu, \sigma^2), \quad \forall i \ne j, \quad \mbox{or} \\ p(Y_i = y_i, Y_j = y_j \mid \mu, \sigma^2) &= p(Y_i = y_i \mid \mu, \sigma^2) \cdot p(Y_j = y_j \mid \mu, \sigma^2), \end{aligned} \] similarly.


What does that mean, Gaussian or normally distributed \(\mathcal{N}(\mu, \sigma^2)\)?.

It means that

A “bell-shaped” curve characteristic of a Gaussian, with \(\mu=12\), \(\sigma^2=25^2\).

y <- seq(-63, 87, length=1000)
plot(y, dnorm(y, 12, 25), type="l", lwd=2)
abline(v=12, col="gray")
abline(v=qnorm(c(0.975,0.025),12,25), col="red", lty=2)

An estimate

A simple way to estimate the Gaussian mean parameter \(\mu\) from data is through the empirical average.

If we regard our data as IID Gaussian, then \(\mathbb{E}\{Y_i\} = \mu\).

What does that mean for \(\bar{Y}\)? \[ \mathbb{E}\{\bar{Y}\} = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\{ Y_i \} = \frac{1}{n} \cdot n \mu = \mu. \]


An estimator \(\hat{\theta}_n \equiv \hat{\theta}(y_1,\dots, y_n)\) is said to be unbiased for a parameter \(\theta\) if \[ \mathbb{E}\{\hat{\theta}_n\} = \theta. \]

A frequency argument

Notice what we’re doing here.

Implicitly we are making a frequency argument, which is at the heart of most of classical (Frequentist) statistics.

What other properties should a good estimator have?


They should be efficient

What is the variance of our estimator \(\bar{Y}\) of \(\mu\)?

\[ \mathbb{V}\mathrm{ar}\{ \bar{Y} \} = \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}\mathrm{ar}\{ Y_i \} = \frac{\sigma^2}{n}, \] since \(\mathbb{V}\mathrm{ar}\{a X + bY + c\} = a^2 \mathbb{V}\mathrm{ar}\{X\} + b^2 \mathbb{V}\mathrm{ar}\{Y\}\) as long as \(X\) and \(Y\) are independent.

Sampling distribution

So now we know the mean and variance of \(\bar{Y}\), our estimator. \[ \mathbb{E}\{\bar{Y}\} = \mu \quad \mbox{ and } \quad \mathbb{V}\mathrm{ar}\{\bar{Y}\} = \frac{\sigma^2}{n} \]

Now, \(\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i\), and a result from probability is that the sum of independent normal RVs is also normal.


This is a remarkable result. \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n). \]

It says that, even though (almost certainly) you do not know the true population parameters \(\mu\) and \(\sigma^2\), if your population is normal, then the average you calculate from \(n\) data points is “close” to \(\mu\)

Now before we get too excited, observe that \(\sigma^2\) is on the scale of \((Y_i - \mu)^2\),

Principles of Estimation

Method of moments

It seemed sensible to estimate the population mean, \(\mu\) with the sample mean \(\bar{Y}\).

  • This is, in fact, a principled mode of estimating parameters from data
  • and can be extended to other parameters.

The \(p^{\mathrm{th}}\) moment of a RV \(Y\) is \(\mathbb{E}\{Y^p\}\).


  • IID \(Y_1, \dots, Y_n\), sampled from a distribution with
  • \(d\)-dimensional parameter \(\theta = (\theta_1, \dots, \theta_d)\),

the method of moments (MoM) estimator \(\hat{\theta}_n\) may be calculated by solving the following system of \(d\) equations: \[ \frac{1}{n}\sum_{i=1}^n Y_i^p = \mathbb{E}\{Y^p\}, \quad p=1, \dots, d. \]

MoM for Gaussian data

When \(Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\), we have \(\theta = (\mu, \sigma^2)\), and the system of equations is \[ \begin{aligned} \frac{1}{n}\sum_{i=1}^n Y_i &= \hat{\mu} \\ \frac{1}{n}\sum_{i=1}^n Y_i^2 &= \hat{\sigma}^2 + \hat{\mu}^2, \end{aligned} \]

  • where the second equation utilizes that \(\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{E}\{Y^2\} - \mathbb{E}\{Y\}^2\).
  • We’ve already seen the first equation – that’s our unbiased estimator for the mean.

MoM for the Gaussian variance \(\sigma^2\)

What about that second equation?

Plugging in \(\hat{\mu}\) gives \[ \begin{aligned} \frac{1}{n}\sum_{i=1}^n Y_i^2 &= \hat{\sigma}^2 + \bar{Y}^2 \\ \hat{\sigma}^2 &= \frac{\sum_{i=1}^n Y_i^2 - n \bar{Y}^2}{n} \\ &= \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n}. \end{aligned} \]

So now we have the MoM estimator for \((\mu, \sigma^2)\).

  • We’ve seen some properties of \(\hat{\mu}\);
  • what about \(\hat{\sigma}^2\)?


Is the MoM estimator for \(\sigma^2\) unbiased?

\[ \begin{aligned} \mathbb{E}\{\hat{\sigma}^2\} &= \mathbb{E} \left\{\frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})^2 \right\} \\ & \;\, \vdots \\ % &= \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - 2 \sum_{i=1}^n Y_i \bar{Y} - \sum_{i=1}^n \bar{Y}^2 \right\} \\ % &= \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - 2 n \bar{Y}^2 - n \bar{Y}^2 \right\} % = \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - n \bar{Y}^2 \right\} \\ % &= \mathbb{E}\{Y_i^2\} - \mathbb{E}\{\bar{Y}^2\} = \sigma^2 + \mu^2 - \frac{\sigma^2}{n} - \mu^2 \\ &= \frac{n-1}{n} \sigma^2 \end{aligned} \]


  • But it is asymptotically unbiased.

Bias correction

And that’s the reason folks use the following instead: \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2. \]

You can justify the change in denominator on more intuitive grounds.

Observe how the data is being used twice when estimating discrepancy:

  1. To estimate the mean, via \(\bar{Y}\).
  2. To estimate deviations around the mean, via \((Y_i - \bar{Y})^2\).

Because you’ve used the data twice, you’ve lost “one” opportunity (in 1.) to observe variability (in 2.).

Degrees of freedom

We call that “loosing a degree of freedom (DoF)”.

To see why a DoF is lost, consider the following special case.

If \(n=1\), then \(\bar{Y} = Y_1\), and \(\hat{\sigma}^2 = (Y_1 - Y_1)^2 = 0\).

  • But that doesn’t make any sense.
  • How could our estimated variance (sensibly) be zero?
  • Because by using \(Y_1\) to estimate \(\mu\) we lost the opportunity to use it again to estimate \(\sigma^2\).
  • We lost a DoF.

\(S^2\) is better because you end up with \(0/0\) which is undefined, which tells you

  • when \(n=1\) you can have no concept of variance.

Chi-squared distributions

Working out the distribution of \(S^2\) requires the following result from probability.

If \(Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\), then

  • \((Y_i - \mu)^2/\sigma^2 \sim \chi^2_1\), a chi-squared distribution with \(\nu=1\) DoFs.
  • \(\sum_{i=1}^n (Y_i - \mu)^2/\sigma^2 \sim \chi^2_n\).

The chi-squared distribution is positively skewed distribution with

  • mean \(\nu\).
  • and variance \(2\nu\).

Here is the chi-squared density with \(\nu = 81\).

y2 <- seq(40, 140, length=1000)
plot(y2, dchisq(y2, 81), type="l", lwd=2)

Sampling distribution of \(\sigma^2\)

Now, it turns out that we can write \[ \sum_{i=1}^n \frac{(Y_i - \mu)^2}{\sigma^2} = \frac{(n-1) S^2}{\sigma^2} + \frac{n(\bar{Y} - \mu)^2}{\sigma^2}. \]

That’s useful because

  • the LHS is our \(\chi^2_n\) RV
  • and the second RV on the RHS is clearly \(\chi^2_1\) since \(\bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n)\).

Therefore (waving hands a little, because to do it right would involve a big digression on moment generating functions)

\[ \frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1} \quad \mbox{ or } \quad S^2 \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}. \]


The likelihood flips around the relationship between the sampling model (distributional assumptions about how the data were generated) and its parameters.

  • It is a function that maps the parameter space onto the real line, giving higher values to settings of the parameters which support the observations in the data.

\[ L(\theta) \equiv L(\theta; y) = P_\theta(Y_1 = y_1, \dots, Y_n = y_n) \]

  • In the case of IID data with density \(f_\theta (y_i)\), the likelihood is

\[ L(\theta) = \prod_{i=1}^n P_\theta (Y_i = y_i) = \prod_{i=1}^n f_\theta (y_i). \]

What would that look like for our Gaussian example?

Maximum Likelihood

A maximum likelihood estimator (MLE) is calculated by finding the value of \(\theta\) which maximizes \(L(\theta)\).

Typically, one works with the log likelihood in search of the MLE

  • because it is easier mathematically, especially with IID data

\[ \ell(\theta) = \log L(\theta ; y) = \sum_{i=1}^n \log f_\theta (y_i). \]

  • The typical tactic is to maximize via derivatives,
  • i.e., to solve to so-called likelihood equations

\[ \nabla \ell(\theta) = 0. \]

MLE for Gaussian data

Consider our motivating Gaussian example. The log likelihood is \[ \ell(\mu, \sigma^2) = \log L(\theta ; y) = c + - \frac{n}{2} \log \sigma^2 - \frac{1}{2} \sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2}. \]

Now differentiatiating and setting equal to zero gives:

\[ \begin{aligned} \frac{\partial \ell}{\partial \mu} \Big{|}_{\hat{\mu}} &= 0 & \Rightarrow && \hat{\mu} &= \bar{y}. \\ \frac{\partial \ell}{\partial \sigma^2} \Big{|}_{\hat{\mu},\hat{\sigma}^2} &= 0 & \Rightarrow && \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2. \\ \end{aligned} \]

  • These are the same as our MoM estimators,
  • unbiased and biased respectively.

Remarkably general framework

The cool thing about the MLE (as compared to MoM) is that it is remarkably generic.

  • Those results that we got for the Gaussian
    • e.g., that \(\hat{\mu} \equiv \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n)\)
    • i.e., that the estimate is centered around its true value and it is efficient, in the sense that variance is going down with \(n\),
  • work for data under other sampling models too.

But that’s certainly not obvious at first glance.

The general result,

  • the so-called asymptotic distribution of the MLE
  • is similar to the central limit theorem, which is probably the most important result in all of statistics.

Central limit theorem

The central limit theorem (CLT) says that sums of IID RVs are eventually Gaussian no matter what their original distribution.

Let \(Y_1, \dots, Y_n\) be a sequence of IID RVs with

  • mean \(\mu = \mathbb{E}\{Y_i\}\) and variance \(\sigma^2 = \mathbb{V}\mathrm{ar}\{Y_i\}\).

Now let \(S_n = Y_1 + \cdots + Y_n\).
\[ \mbox{If } \quad Z_n = \frac{S_n - n \mu}{\sigma \sqrt{n}} \quad \mbox{ then } \quad Z_n \stackrel{d}{\rightarrow} \mathcal{N}(0,1) \quad \mbox{ as } \quad n \rightarrow \infty. \]

Or in other words, if \(\bar{Y}_n = S_n/n\), then \[ \bar{Y}_n \sim \mathcal{N}(\mu, \sigma^2/n) \quad \mbox{ as } \quad n \rightarrow \infty. \]

  • in shorthand: \(\bar{Y}_n \sim \mathcal{AN}(\mu, \sigma^2/n)\).

Illustrating the CLT

For example, exponentially distributed RVs don’t look very normal.

y <- seq(0, 5, length=1000)
plot(y, dexp(y), type="l", lwd=2)

y2 <- colMeans(matrix(rexp(2*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 2")
lines(y, dnorm(y, 1, 1/sqrt(2)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(5*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 5")
lines(y, dnorm(y, 1, 1/sqrt(5)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(10*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 10")
lines(y, dnorm(y, 1, 1/sqrt(10)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(100*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 100")
lines(y, dnorm(y, 1, 1/sqrt(100)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(1000*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 1000")
lines(y, dnorm(y, 1, 1/sqrt(1000)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")