Statistical inference

Statistical inference is the process of leveraging a mathematical modeling apparatus to make deductions about quantities estimated from data;

  • about performing the calculations required to make "good" estimates;
  • understanding their properties, and how "good" they are relative to alternatives;
  • and in particular understanding how those estimated quantities could have differed if you saw a different sample of data from the same population.

Modeling and inference involve

  • equations,
  • probability distributions,
  • and unknown parameters to those equations and distributions which can be estimated from data.


In order to say something about the uncertainty of estimates (of unknown parameters and thereby predictions) we need

  • some assumptions about how the data relate to the population.

That's where statistical models come in.

  • We need to invest in a "cartoon" version of how the data were generated,
  • one which is convenient to work with mathematically.

To keep it simple for now, lets assume that our data is an independent and identically distributed sub-sample from a Gaussian population:

\[ Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2), \quad i=1,\dots,n. \]


What does that mean, independent and identically distributed?

It means that

  • each observation comes from the same population
  • and therefore is a random quantity from the same distribution (\(\mathcal{N}(\mu, \sigma^2)\)),
  • and that given a form of that distribution (i.e., \(\mu\) and \(\sigma^2\)), knowing the value of one of them \(Y_j = y_j\), say, does not help you narrow down values for another, e.g., \(Y_i\).

In other words, \[ \begin{aligned} p(Y_i = y_i \mid Y_j = y_j, \mu, \sigma^2) &= p(Y_i \mid \mu, \sigma^2), \quad \forall i \ne j, \quad \mbox{or} \\ p(Y_i = y_i, Y_j = y_j \mid \mu, \sigma^2) &= p(Y_i = y_i \mid \mu, \sigma^2) \cdot p(Y_j = y_j \mid \mu, \sigma^2), \end{aligned} \] similarly.


What does that mean, Gaussian or normally distributed \(\mathcal{N}(\mu, \sigma^2)\)?.

It means that

  • \(\mathbb{E}\{Y_i\} = \mu\)
  • \(\mathbb{V}\mathrm{ar}\{Y_i\} = \sigma^2\)
  • and that the probability density of a particular \(y\)-value follows \[ p(Y_i = y_i) = f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{- \frac{(y_i - \mu)^2}{\sigma^2} \right\} \]
  • and together that means that \[ p(\mu - 1.96 \sigma < Y_i < \mu + 1.96\sigma) = 0.95. \]
  • Also observe that the density is symmetric, with most probable value \(f(\mu)\).

A "bell-shaped" curve characteristic of a Gaussian, with \(\mu=12\), \(\sigma^2=25^2\).

y <- seq(-63, 87, length=1000)
plot(y, dnorm(y, 12, 25), type="l", lwd=2)
abline(v=12, col="gray")
abline(v=qnorm(c(0.975,0.025),12,25), col="red", lty=2)

An estimate

A simple way to estimate the Gaussian mean parameter \(\mu\) from data is through the empirical average.

  • The estimate is \(\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\).
  • The corresponding estimator is \(\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i\).

If we regard our data as IID Gaussian, then \(\mathbb{E}\{Y_i\} = \mu\).

What does that mean for \(\bar{Y}\)? \[ \mathbb{E}\{\bar{Y}\} = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\{ Y_i \} = \frac{1}{n} \cdot n \mu = \mu. \]

  • That's good right?


An estimator \(\hat{\theta}_n \equiv \hat{\theta}(y_1,\dots, y_n)\) is said to be unbiased for a parameter \(\theta\) if \[ \mathbb{E}\{\hat{\theta}_n\} = \theta. \]

  • In the case of our parameter \(\mu\), we have established that \(\hat{\mu}_n = \bar{Y}\) is an unbiased estimator.

  • Having a biased estimator is usually bad (not always; why?).

  • Bias would mean that, on average, what we estimate from data is not the same as what we are targeting in the population.

A frequency argument

Notice what we're doing here.

Implicitly we are making a frequency argument, which is at the heart of most of classical (Frequentist) statistics.

  • By taking expectations of \(\hat{\theta}_n\) we are imagining what sort of estimates we would get, on average, with similar data sets (re-)sampled from the population.

  • If \(\hat{\theta}_n\) is unbiased, then on average we get \(\theta\).
  • In our example, \(\bar{Y}\) is unbiased,
    • on average it is \(\mu\).

What other properties should a good estimator have?


They should be efficient

  • which basically means that, by similar frequency arguments, they should have low variance.
  • Or in other words, they should make the greatest use of the limited amount of observed data, \(n\).

What is the variance of our estimator \(\bar{Y}\) of \(\mu\)?

\[ \mathbb{V}\mathrm{ar}\{ \bar{Y} \} = \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}\mathrm{ar}\{ Y_i \} = \frac{\sigma^2}{n}, \] since \(\mathbb{V}\mathrm{ar}\{a X + bY + c\} = a^2 \mathbb{V}\mathrm{ar}\{X\} + b^2 \mathbb{V}\mathrm{ar}\{Y\}\) as long as \(X\) and \(Y\) are independent.

  • So the variance decreases linearly as data is added. Pretty efficient, eh?

Sampling distribution

So now we know the mean and variance of \(\bar{Y}\), our estimator. \[ \mathbb{E}\{\bar{Y}\} = \mu \quad \mbox{ and } \quad \mathbb{V}\mathrm{ar}\{\bar{Y}\} = \frac{\sigma^2}{n} \]

Now, \(\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i\), and a result from probability is that the sum of independent normal RVs is also normal.

  • Normal distributions are uniquely determined by their mean and variance,
    • and these are determined identically by parameters, typically notated as \(\mu\) and \(\sigma^2\).
  • Since we know the mean and the variance of \(\bar{Y}\), above, it must be that \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n). \]
  • This is the sampling distribution of \(\bar{Y}\).


This is a remarkable result. \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n). \]

It says that, even though (almost certainly) you do not know the true population parameters \(\mu\) and \(\sigma^2\), if your population is normal, then the average you calculate from \(n\) data points is "close" to \(\mu\)

  • on average it is \(\mu\), averaging over random data sets looking like \(Y_1, Y_2, \dots, Y_n\),
  • and the amount which it deviates from \(\mu\) goes down as \(n\) goes up.

Now before we get too excited, observe that \(\sigma^2\) is on the scale of \((Y_i - \mu)^2\),

  • so on the scale of \(Y_i\), uncertainty really goes down by \(\sigma/\sqrt{n}\).
  • That means "diminishing returns" as \(n\) is increased.

Principles of Estimation

Method of moments

It seemed sensible to estimate the population mean, \(\mu\) with the sample mean \(\bar{Y}\).

  • This is, in fact, a principled mode of estimating parameters from data
  • and can be extended to other parameters.

The \(p^{\mathrm{th}}\) moment of a RV \(Y\) is \(\mathbb{E}\{Y^p\}\).


  • IID \(Y_1, \dots, Y_n\), sampled from a distribution with
  • \(d\)-dimensional parameter \(\theta = (\theta_1, \dots, \theta_d)\),

the method of moments (MoM) estimator \(\hat{\theta}_n\) may be calculated by solving the following system of \(d\) equations: \[ \frac{1}{n}\sum_{i=1}^n Y_i^p = \mathbb{E}\{Y^p\}, \quad p=1, \dots, d. \]

MoM for Gaussian data

When \(Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\), we have \(\theta = (\mu, \sigma^2)\), and the system of equations is \[ \begin{aligned} \frac{1}{n}\sum_{i=1}^n Y_i &= \hat{\mu} \\ \frac{1}{n}\sum_{i=1}^n Y_i^2 &= \hat{\sigma}^2 + \hat{\mu}^2, \end{aligned} \]

  • where the second equation utilizes that \(\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{E}\{Y^2\} - \mathbb{E}\{Y\}^2\).
  • We've already seen the first equation – that's our unbiased estimator for the mean.

MoM for the Gaussian variance \(\sigma^2\)

What about that second equation?

Plugging in \(\hat{\mu}\) gives \[ \begin{aligned} \frac{1}{n}\sum_{i=1}^n Y_i^2 &= \hat{\sigma}^2 + \bar{Y}^2 \\ \hat{\sigma}^2 &= \frac{\sum_{i=1}^n Y_i^2 - n \bar{Y}^2}{n} \\ &= \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n}. \end{aligned} \]

So now we have the MoM estimator for \((\mu, \sigma^2)\).

  • We've seen some properties of \(\hat{\mu}\);
  • what about \(\hat{\sigma}^2\)?


Is the MoM estimator for \(\sigma^2\) unbiased?

\[ \begin{aligned} \mathbb{E}\{\hat{\sigma}^2\} &= \mathbb{E} \left\{\frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})^2 \right\} \\ & \;\, \vdots \\ % &= \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - 2 \sum_{i=1}^n Y_i \bar{Y} - \sum_{i=1}^n \bar{Y}^2 \right\} \\ % &= \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - 2 n \bar{Y}^2 - n \bar{Y}^2 \right\} % = \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - n \bar{Y}^2 \right\} \\ % &= \mathbb{E}\{Y_i^2\} - \mathbb{E}\{\bar{Y}^2\} = \sigma^2 + \mu^2 - \frac{\sigma^2}{n} - \mu^2 \\ &= \frac{n-1}{n} \sigma^2 \end{aligned} \]


  • But it is asymptotically unbiased.

Bias correction

And that's the reason folks use the following instead: \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2. \]

You can justify the change in denominator on more intuitive grounds.

Observe how the data is being used twice when estimating discrepancy:

  1. To estimate the mean, via \(\bar{Y}\).
  2. To estimate deviations around the mean, via \((Y_i - \bar{Y})^2\).

Because you've used the data twice, you've lost "one" opportunity (in 1.) to observe variability (in 2.).

Degrees of freedom

We call that "loosing a degree of freedom (DoF)".

To see why a DoF is lost, consider the following special case.

If \(n=1\), then \(\bar{Y} = Y_1\), and \(\hat{\sigma}^2 = (Y_1 - Y_1)^2 = 0\).

  • But that doesn't make any sense.
  • How could our estimated variance (sensibly) be zero?
  • Because by using \(Y_1\) to estimate \(\mu\) we lost the opportunity to use it again to estimate \(\sigma^2\).
  • We lost a DoF.

\(S^2\) is better because you end up with \(0/0\) which is undefined, which tells you

  • when \(n=1\) you can have no concept of variance.

Chi-squared distributions

Working out the distribution of \(S^2\) requires the following result from probability.

If \(Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\), then

  • \((Y_i - \mu)^2/\sigma^2 \sim \chi^2_1\), a chi-squared distribution with \(\nu=1\) DoFs.
  • \(\sum_{i=1}^n (Y_i - \mu)^2/\sigma^2 \sim \chi^2_n\).

The chi-squared distribution is positively skewed distribution with

  • mean \(\nu\).
  • and variance \(2\nu\).

Here is the chi-squared density with \(\nu = 81\).

y2 <- seq(40, 140, length=1000)
plot(y2, dchisq(y2, 81), type="l", lwd=2)

Sampling distribution of \(\sigma^2\)

Now, it turns out that we can write \[ \sum_{i=1}^n \frac{(Y_i - \mu)^2}{\sigma^2} = \frac{(n-1) S^2}{\sigma^2} + \frac{n(\bar{Y} - \mu)^2}{\sigma^2}. \]

That's useful because

  • the LHS is our \(\chi^2_n\) RV
  • and the second RV on the RHS is clearly \(\chi^2_1\) since \(\bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n)\).

Therefore (waving hands a little, because to do it right would involve a big digression on moment generating functions)

\[ \frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1} \quad \mbox{ or } \quad S^2 \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}. \]


The likelihood flips around the relationship between the sampling model (distributional assumptions about how the data were generated) and its parameters.

  • It is a function that maps the parameter space onto the real line, giving higher values to settings of the parameters which support the observations in the data.

\[ L(\theta) \equiv L(\theta; y) = P_\theta(Y_1 = y_1, \dots, Y_n = y_n) \]

  • In the case of IID data with density \(f_\theta (y_i)\), the likelihood is

\[ L(\theta) = \prod_{i=1}^n P_\theta (Y_i = y_i) = \prod_{i=1}^n f_\theta (y_i). \]

What would that look like for our Gaussian example?

Maximum Likelihood

A maximum likelihood estimator (MLE) is calculated by finding the value of \(\theta\) which maximizes \(L(\theta)\).

Typically, one works with the log likelihood in search of the MLE

  • because it is easier mathematically, especially with IID data

\[ \ell(\theta) = \log L(\theta ; y) = \sum_{i=1}^n \log f_\theta (y_i). \]

  • The typical tactic is to maximize via derivatives,
  • i.e., to solve to so-called likelihood equations

\[ \nabla \ell(\theta) = 0. \]

MLE for Gaussian data

Consider our motivating Gaussian example. The log likelihood is \[ \ell(\mu, \sigma^2) = \log L(\theta ; y) = c + - \frac{n}{2} \log \sigma^2 - \frac{1}{2} \sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2}. \]

Now differentiatiating and setting equal to zero gives:

\[ \begin{aligned} \frac{\partial \ell}{\partial \mu} \Big{|}_{\hat{\mu}} &= 0 & \Rightarrow && \hat{\mu} &= \bar{y}. \\ \frac{\partial \ell}{\partial \sigma^2} \Big{|}_{\hat{\mu},\hat{\sigma}^2} &= 0 & \Rightarrow && \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2. \\ \end{aligned} \]

  • These are the same as our MoM estimators,
  • unbiased and biased respectively.

Remarkably general framework

The cool thing about the MLE (as compared to MoM) is that it is remarkably generic.

  • Those results that we got for the Gaussian
    • e.g., that \(\hat{\mu} \equiv \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n)\)
    • i.e., that the estimate is centered around its true value and it is efficient, in the sense that variance is going down with \(n\),
  • work for data under other sampling models too.

But that's certainly not obvious at first glance.

The general result,

  • the so-called asymptotic distribution of the MLE
  • is similar to the central limit theorem, which is probably the most important result in all of statistics.

Central limit theorem

The central limit theorem (CLT) says that sums of IID RVs are eventually Gaussian no matter what their original distribution.

Let \(Y_1, \dots, Y_n\) be a sequence of IID RVs with

  • mean \(\mu = \mathbb{E}\{Y_i\}\) and variance \(\sigma^2 = \mathbb{V}\mathrm{ar}\{Y_i\}\).

Now let \(S_n = Y_1 + \cdots + Y_n\).
\[ \mbox{If } \quad Z_n = \frac{S_n - n \mu}{\sigma \sqrt{n}} \quad \mbox{ then } \quad Z_n \stackrel{d}{\rightarrow} \mathcal{N}(0,1) \quad \mbox{ as } \quad n \rightarrow \infty. \]

Or in other words, if \(\bar{Y}_n = S_n/n\), then \[ \bar{Y}_n \sim \mathcal{N}(\mu, \sigma^2/n) \quad \mbox{ as } \quad n \rightarrow \infty. \]

  • in shorthand: \(\bar{Y}_n \sim \mathcal{AN}(\mu, \sigma^2/n)\).

Illustrating the CLT

For example, exponentially distributed RVs don't look very normal.

y <- seq(0, 5, length=1000)
plot(y, dexp(y), type="l", lwd=2)

y2 <- colMeans(matrix(rexp(2*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 2")
lines(y, dnorm(y, 1, 1/sqrt(2)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(5*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 5")
lines(y, dnorm(y, 1, 1/sqrt(5)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(10*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 10")
lines(y, dnorm(y, 1, 1/sqrt(10)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(100*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 100")
lines(y, dnorm(y, 1, 1/sqrt(100)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(1000*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 1000")
lines(y, dnorm(y, 1, 1/sqrt(1000)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

Estimating the mean

So the sample average is a good estimate of the mean of a population,

  • no matter how the data are generated, as long as
  • it is IID
  • and the sample size \(n\) is large enough.

The CLT is a version of a more specific set of law of large numbers results,

  • which basically says that averages converge to means "in probability".

But we're not just interested in population mean (parameters).

Asymptotic distribution of the MLE


  • \(\theta\) be a \(d\)-dimensional parameter to a family of distributions emitting log likelihood \(\ell\), and \(y_1, \dots, y_n\) be modeled as IID from that family
  • \(\hat{\theta}_n = \hat{\theta}_n(y_1, \dots, y_n)\) be the \(d\)-dimensional vector value that maximizes \(\ell(\theta; y_1, \dots, y_n)\).
    • Usually found as the solution to the likelihood equations \[ \frac{\partial \ell}{\partial \theta_j} \Big{|}_{\hat{\theta}} = 0, \quad j=1,\dots, d \]

Now, let \(\theta_0\) be the true (but unknown) value of \(\theta\) generating the data.

Under some regularity conditions \[ \hat{\theta}_n \sim \mathcal{AN}_d \left[\theta_0, \frac{1}{n} i_1^{-1}(\theta_0) \right]. \]

Fisher information

The covariance matrix of the derivative of the log likelihood \(\ell(\theta; Y_1, \dots, Y_n)\) is called the Fisher information (FI) matrix.

\[ i(\theta) = \mathbb{C}\mathrm{ov} \{\nabla \ell(\theta; Y)\} = - \mathbb{E} \{\nabla \nabla^\top \ell(\theta, Y) \} \]

  • When the data is IID \(\ell(\theta; Y_1, \dots, Y_n) = \sum_{i=1}^n \ell(\theta; Y_i)\).
  • In that case \(i(\theta) = i_n(\theta) = n i_1 (\theta)\) where \(i_1\) is the FI of a single observation.

Recall that the second derivative (Hessian in the vectorized case) is telling you something about the curvature.

  • So the FI is the expected curvature.
  • The more peaked the likelihood, the higher the curvature,
  • therefore the lower \(i_1^{-1}(\theta)\) is, and thus the lower the variance of \(\hat{\theta}_n\).

Gaussian FI

For the \(n=1\) case, letting \(y = y_1 = \bar{y}\), we have \[ \begin{aligned} \frac{\partial \ell}{\partial \mu} &= \frac{y - \mu}{\sigma^2} & \mbox{and} && \frac{\partial \ell}{\partial \sigma^2} &= - \frac{1}{2\sigma^2} + \frac{(y - \mu)^2}{2(\sigma^2)^2} \end{aligned} \]

Now, for second derivatives \[ \begin{aligned} \frac{\partial^2 \ell}{\partial \mu^2} &= - \frac{1}{\sigma^2} & \mbox{and} && \frac{\partial^2 \ell}{\partial \sigma^2} &= \frac{1}{2(\sigma^2)^2} - \frac{(y-\mu)^2}{(\sigma^2)^3} \\ \frac{\partial^2 \ell}{\partial \mu \partial \sigma^2} &= - \frac{y-\mu}{(\sigma^2)^2} \end{aligned} \]

Next we turn \(y\)s into \(Y\)s, take expectations, and negate.

Gaussian FI (2)

We get \[ i_1(\theta) = - \mathbb{E} \{\nabla \nabla^\top \ell(\theta, Y_1)\} = \left[ \begin{array}{cc} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2(\sigma^2)^2} \end{array} \right] \]

So what did we learn?

  • Actually, \(\hat{\mu}\) and \(\hat{\sigma}^2\) are independent!
  • \(i^{-1}(\mu)/n = \sigma^2/n\) so \(\hat{\mu} \sim \mathcal{AN}(\mu, \sigma^2/n)\)
    • In fact, we knew that already and that it's exact!
  • \(\hat{\sigma}^2 \sim \mathcal{AN}(\sigma^2, 2\sigma^4/n)\)
    • How does that compare to \(\hat{\sigma}^2 \sim \frac{\sigma^2}n \chi^2_n\)?

Asymptotic versus actual

true <- 25^2*rchisq(1000, 81)/81; o <- order(true)
hist(true, freq=FALSE, main="", ylim=c(0, 0.006))
lines(true[o], dnorm(true[o], 25^2, sqrt(25^4/81)), col=2, lty=2, lwd=2)

Variance of the mean

So, we have estimators for both parameters, \(\mu\) and \(\sigma^2\) \[ \hat{\mu} = \bar{Y} \quad \mbox{ and } \quad \hat{\sigma}^2 = s^2 \] and we know their sampling distributions.

You may have noticed that the distribution for \(\bar{Y}\) depends on \(\sigma^2\), and we don't know what that is! \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n). \] but we do know \(s^2 \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}\).

  • How do we combine those two?


We have what's called a scale mixture of normals.

Reversing the logic a bit, we have \[ \sigma^2 \sim (n-1) s^2 / \chi^2_{n-1} \] - giving us a way to generate random \(\sigma^2\) values to plug into the sampling distribution of \(\bar{Y}\) \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n). \] - which in turn, for each \(\sigma^2\) thus generated, gives us a distribution for \(\bar{Y}\) that depends on only one unknown quantity, \(\mu\).

By repeating that process we are essentially integrating over \(\sigma^2\),

  • something we can do analytically, or via Monte Carlo.

Student-\(t\) distribution

A statistician working for Guiness, who published under the pseudonym "Student", showed that this scale mixture has a convenient closed form.

The Student-\(t\) distribution is shaped like the Gaussian

  • but has heavier tails
  • due to \(\chi^2\) distributed uncertainty in the variance.

The Student-\(t\) that we want to use for the sampling distribution of \(\bar{Y}\) is \[ \frac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1} \quad \Leftrightarrow \quad \bar{Y} \sim t_{n-1}(\mu, s^2/n). \]