## Statistical inference

Statistical inference is the process of leveraging a mathematical modeling apparatus to make deductions about quantities estimated from data. It is about

• performing the calculations required to make “good” estimates;
• understanding their properties, and how “good” they are relative to alternatives;
• and in particular understanding how those estimated quantities could have differed if you saw a different sample of data from the same population.

Modeling and inference involve

• equations,
• probability distributions,
• and unknown parameters to those equations and distributions which can be estimated from data.

## Modeling

In order to say something about the uncertainty of estimates (of unknown parameters and thereby predictions) we need

• some assumptions about how the data relate to the population.

That’s where statistical models come in.

• We need to invest in a “cartoon” version of how the data were generated,
• one which is convenient to work with mathematically.

To keep it simple for now, let’s assume that our data is an independent and identically distributed (IID) sub-sample from a Gaussian population:

$Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2), \quad i=1,\dots,n.$
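To make the model concrete, here is a minimal R sketch that simulates one such data set; the sample size and the values $$\mu = 12$$ and $$\sigma = 25$$ are hypothetical choices (matching the density plotted later).

n <- 100
y <- rnorm(n, mean=12, sd=25)   # one IID Gaussian sample of size n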

## IID

What does that mean, independent and identically distributed?

It means that

• each observation comes from the same population
• and therefore is a random quantity from the same distribution ($$\mathcal{N}(\mu, \sigma^2)$$),
• and that given a form of that distribution (i.e., $$\mu$$ and $$\sigma^2$$), knowing the value of one of them $$Y_j = y_j$$, say, does not help you narrow down values for another, e.g., $$Y_i$$.

In other words, \begin{aligned} p(Y_i = y_i \mid Y_j = y_j, \mu, \sigma^2) &= p(Y_i = y_i \mid \mu, \sigma^2), \quad \forall i \ne j, \quad \mbox{or equivalently} \\ p(Y_i = y_i, Y_j = y_j \mid \mu, \sigma^2) &= p(Y_i = y_i \mid \mu, \sigma^2) \cdot p(Y_j = y_j \mid \mu, \sigma^2), \end{aligned} and similarly for larger collections of observations.

## Gaussian

What does that mean, Gaussian or normally distributed, $$\mathcal{N}(\mu, \sigma^2)$$?

It means that

• $$\mathbb{E}\{Y_i\} = \mu$$
• $$\mathbb{V}\mathrm{ar}\{Y_i\} = \sigma^2$$
• and that the probability density of a particular $$y$$-value follows $p(Y_i = y_i) = f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{- \frac{(y_i - \mu)^2}{2\sigma^2} \right\}$
• and together that means that $p(\mu - 1.96 \sigma < Y_i < \mu + 1.96\sigma) = 0.95.$
• Also observe that the density is symmetric about $$\mu$$, where it attains its maximum $$f(\mu)$$; i.e., $$\mu$$ is the most probable value.
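These facts are easy to check numerically in R:

qnorm(0.975)                 # about 1.96: the 97.5% quantile of N(0,1)
pnorm(1.96) - pnorm(-1.96)   # about 0.95: mass within 1.96 sd of the mean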

A “bell-shaped” curve characteristic of a Gaussian, with $$\mu=12$$, $$\sigma^2=25^2$$.

y <- seq(-63, 87, length=1000)
plot(y, dnorm(y, 12, 25), type="l", lwd=2)                   # the density itself
abline(v=12, col="gray")                                     # the mean, mu = 12
abline(v=qnorm(c(0.975, 0.025), 12, 25), col="red", lty=2)   # central 95% limits

## An estimate

A simple way to estimate the Gaussian mean parameter $$\mu$$ from data is through the empirical average.

• The estimate is $$\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$$.
• The corresponding estimator is $$\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i$$.

If we regard our data as IID Gaussian, then $$\mathbb{E}\{Y_i\} = \mu$$.

What does that mean for $$\bar{Y}$$? $\mathbb{E}\{\bar{Y}\} = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\{ Y_i \} = \frac{1}{n} \cdot n \mu = \mu.$

• That’s good, right?
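As a sanity check, here is a small Monte Carlo sketch of that expectation, with hypothetical values $$\mu = 12$$, $$\sigma = 25$$, and $$n = 30$$: averaging many $$\bar{Y}$$ values over re-sampled data sets should land close to $$\mu$$.

ybars <- replicate(10000, mean(rnorm(30, 12, 25)))   # 10000 re-sampled Ybar's
mean(ybars)                                          # close to mu = 12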

## Unbiased

An estimator $$\hat{\theta}_n \equiv \hat{\theta}(Y_1,\dots, Y_n)$$ is said to be unbiased for a parameter $$\theta$$ if $\mathbb{E}\{\hat{\theta}_n\} = \theta.$

• In the case of our parameter $$\mu$$, we have established that $$\hat{\mu}_n = \bar{Y}$$ is an unbiased estimator.

• Having a biased estimator is usually bad (not always; why?).

• Bias would mean that, on average, what we estimate from data is not the same as what we are targeting in the population.

## A frequency argument

Notice what we’re doing here.

Implicitly we are making a frequency argument, which is at the heart of most of classical (Frequentist) statistics.

• By taking expectations of $$\hat{\theta}_n$$ we are imagining what sort of estimates we would get, on average, with similar data sets (re-)sampled from the population.

• If $$\hat{\theta}_n$$ is unbiased, then on average we get $$\theta$$.
• In our example, $$\bar{Y}$$ is unbiased,
• on average it is $$\mu$$.

What other properties should a good estimator have?

## Efficiency

A good estimator should be efficient,

• which basically means that, by similar frequency arguments, they should have low variance.
• Or in other words, they should make the greatest use of the limited amount of observed data, $$n$$.

What is the variance of our estimator $$\bar{Y}$$ of $$\mu$$?

$\mathbb{V}\mathrm{ar}\{ \bar{Y} \} = \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}\mathrm{ar}\{ Y_i \} = \frac{\sigma^2}{n},$ since $$\mathbb{V}\mathrm{ar}\{a X + bY + c\} = a^2 \mathbb{V}\mathrm{ar}\{X\} + b^2 \mathbb{V}\mathrm{ar}\{Y\}$$ as long as $$X$$ and $$Y$$ are independent.

• So the variance decreases like $$1/n$$ as data is added. Pretty efficient, eh?
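A quick Monte Carlo sketch of that claim, again with hypothetical $$\mu = 12$$, $$\sigma = 25$$, and $$n = 30$$:

ybars <- replicate(10000, mean(rnorm(30, 12, 25)))
var(ybars)   # close to sigma^2/n
25^2 / 30    # = 20.8333...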

## Sampling distribution

So now we know the mean and variance of $$\bar{Y}$$, our estimator. $\mathbb{E}\{\bar{Y}\} = \mu \quad \mbox{ and } \quad \mathbb{V}\mathrm{ar}\{\bar{Y}\} = \frac{\sigma^2}{n}$

Now, $$\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i$$, and a result from probability is that sums (and scalar multiples) of independent normal RVs are also normal.

• Normal distributions are uniquely determined by their mean and variance,
• and these are precisely the two parameters, typically notated as $$\mu$$ and $$\sigma^2$$.
• Since we know the mean and the variance of $$\bar{Y}$$, above, it must be that $\bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n).$
• This is the sampling distribution of $$\bar{Y}$$.

## Woah!

This is a remarkable result. $\bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n).$

It says that, even though (almost certainly) you do not know the true population parameters $$\mu$$ and $$\sigma^2$$, if your population is normal, then the average you calculate from $$n$$ data points is “close” to $$\mu$$

• on average it is $$\mu$$, averaging over random data sets looking like $$Y_1, Y_2, \dots, Y_n$$,
• and the amount which it deviates from $$\mu$$ goes down as $$n$$ goes up.

Now before we get too excited, observe that $$\sigma^2$$ is on the scale of $$(Y_i - \mu)^2$$,

• so on the scale of $$Y_i$$, uncertainty really goes down as $$\sigma/\sqrt{n}$$.
• That means “diminishing returns” as $$n$$ is increased.
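For example (with a hypothetical $$\sigma = 25$$), quadrupling $$n$$ only halves the standard error:

25 / sqrt(c(10, 40, 160, 640))   # each entry is half the previous one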

# Principles of Estimation

## Method of moments

It seemed sensible to estimate the population mean, $$\mu$$, with the sample mean $$\bar{Y}$$.

• This is, in fact, a principled mode of estimating parameters from data
• and can be extended to other parameters.

The $$p^{\mathrm{th}}$$ moment of a RV $$Y$$ is $$\mathbb{E}\{Y^p\}$$.

For

• IID $$Y_1, \dots, Y_n$$, sampled from a distribution with
• $$d$$-dimensional parameter $$\theta = (\theta_1, \dots, \theta_d)$$,

the method of moments (MoM) estimator $$\hat{\theta}_n$$ may be calculated by solving the following system of $$d$$ equations for $$\theta$$, whose right-hand sides are functions of $$\theta$$: $\frac{1}{n}\sum_{i=1}^n Y_i^p = \mathbb{E}\{Y^p\}, \quad p=1, \dots, d.$

## MoM for Gaussian data

When $$Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$$, we have $$\theta = (\mu, \sigma^2)$$, and the system of equations is \begin{aligned} \frac{1}{n}\sum_{i=1}^n Y_i &= \hat{\mu} \\ \frac{1}{n}\sum_{i=1}^n Y_i^2 &= \hat{\sigma}^2 + \hat{\mu}^2, \end{aligned}

• where the second equation utilizes that $$\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{E}\{Y^2\} - \mathbb{E}\{Y\}^2$$.
• We’ve already seen the first equation – that’s our unbiased estimator for the mean.

## MoM for the Gaussian variance $$\sigma^2$$

What about that second equation?

Plugging in $$\hat{\mu}$$ gives \begin{aligned} \frac{1}{n}\sum_{i=1}^n Y_i^2 &= \hat{\sigma}^2 + \bar{Y}^2 \\ \hat{\sigma}^2 &= \frac{\sum_{i=1}^n Y_i^2 - n \bar{Y}^2}{n} \\ &= \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n}. \end{aligned}

So now we have the MoM estimator for $$(\mu, \sigma^2)$$.

• We’ve seen some properties of $$\hat{\mu}$$;
• what about $$\hat{\sigma}^2$$?
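In R, computing the MoM estimates from a (hypothetical, simulated) sample is a one-liner each:

y <- rnorm(100, 12, 25)        # pretend these are the observed data
muhat <- mean(y)               # solves the first moment equation
s2hat <- mean(y^2) - muhat^2   # solves the second; equals mean((y - muhat)^2)
c(muhat, s2hat)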

## Bias?

Is the MoM estimator for $$\sigma^2$$ unbiased?

\begin{aligned} \mathbb{E}\{\hat{\sigma}^2\} &= \mathbb{E} \left\{\frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})^2 \right\} = \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - 2 \sum_{i=1}^n Y_i \bar{Y} + n \bar{Y}^2 \right\} \\ &= \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - n \bar{Y}^2 \right\} = \mathbb{E}\{Y_i^2\} - \mathbb{E}\{\bar{Y}^2\} \\ &= \sigma^2 + \mu^2 - \left(\frac{\sigma^2}{n} + \mu^2 \right) = \frac{n-1}{n} \sigma^2 \end{aligned}

Nope!

• But it is asymptotically unbiased.
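A Monte Carlo sketch of the bias, with hypothetical $$n = 5$$ and $$\sigma = 25$$, so that $$\frac{n-1}{n}\sigma^2 = 500$$ rather than $$625$$:

s2hats <- replicate(10000, { y <- rnorm(5, 12, 25); mean((y - mean(y))^2) })
mean(s2hats)   # close to 500 = (4/5) * 625, not 625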

## Bias correction

And that’s the reason folks use the following instead: $S^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2.$
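Incidentally, $$S^2$$ (with the $$n-1$$ denominator) is exactly what R’s built-in var computes:

y <- rnorm(10, 12, 25)
var(y)                                   # S^2
sum((y - mean(y))^2) / (length(y) - 1)   # identical by definition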

You can justify the change in denominator on more intuitive grounds.

Observe how the data is being used twice when estimating discrepancy:

1. To estimate the mean, via $$\bar{Y}$$.
2. To estimate deviations around the mean, via $$(Y_i - \bar{Y})^2$$.

Because you’ve used the data twice, you’ve lost “one” opportunity (in 1.) to observe variability (in 2.).

## Degrees of freedom

We call that “losing a degree of freedom (DoF)”.

To see why a DoF is lost, consider the following special case.

If $$n=1$$, then $$\bar{Y} = Y_1$$, and $$\hat{\sigma}^2 = (Y_1 - Y_1)^2 = 0$$.

• But that doesn’t make any sense.
• How could our estimated variance (sensibly) be zero?
• Because by using $$Y_1$$ to estimate $$\mu$$ we lost the opportunity to use it again to estimate $$\sigma^2$$.
• We lost a DoF.

$$S^2$$ is better because here you end up with $$0/0$$, which is undefined, and that tells you

• when $$n=1$$ you can have no concept of variance.

## Chi-squared distributions

Working out the distribution of $$S^2$$ requires the following result from probability.

If $$Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$$, then

• $$(Y_i - \mu)^2/\sigma^2 \sim \chi^2_1$$, a chi-squared distribution with $$\nu=1$$ DoF.
• $$\sum_{i=1}^n (Y_i - \mu)^2/\sigma^2 \sim \chi^2_n$$.

The chi-squared distribution is a positively skewed distribution with

• mean $$\nu$$
• and variance $$2\nu$$.
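Both moments are easy to check by simulation, here with $$\nu = 81$$, matching the density below.

z <- rchisq(100000, df=81)
mean(z)   # about 81 (nu)
var(z)    # about 162 (2 * nu)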

Here is the chi-squared density with $$\nu = 81$$.

y2 <- seq(40, 140, length=1000)
plot(y2, dchisq(y2, 81), type="l", lwd=2)   # chi-squared density with 81 DoF

## Sampling distribution of $$S^2$$

Now, it turns out that we can write $\sum_{i=1}^n \frac{(Y_i - \mu)^2}{\sigma^2} = \frac{(n-1) S^2}{\sigma^2} + \frac{n(\bar{Y} - \mu)^2}{\sigma^2}.$

That’s useful because

• the LHS is our $$\chi^2_n$$ RV
• and the second RV on the RHS is clearly $$\chi^2_1$$ since $$\bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n)$$.

Therefore (waving hands a little, because to do it right would involve a big digression on moment generating functions)

$\frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1} \quad \mbox{ or } \quad S^2 \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}.$
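Here is a Monte Carlo sketch of that result, with hypothetical $$n = 10$$ and $$\sigma = 25$$: simulated $$S^2$$ values should match the scaled $$\chi^2_{n-1}$$ density (scaled via the change of variables $$s = \sigma^2 x / (n-1)$$).

n <- 10; sigma <- 25
s2 <- replicate(10000, var(rnorm(n, 12, sigma)))   # simulated S^2 values
hist(s2, main="", freq=FALSE, xlab="S^2")
x <- seq(min(s2), max(s2), length=1000)
lines(x, dchisq((n-1)*x/sigma^2, df=n-1)*(n-1)/sigma^2, col="red")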

## Likelihood

The likelihood flips around the relationship between the sampling model (distributional assumptions about how the data were generated) and its parameters.

• It is a function that maps the parameter space onto the real line, giving higher values to settings of the parameters which support the observations in the data.

$L(\theta) \equiv L(\theta; y) = P_\theta(Y_1 = y_1, \dots, Y_n = y_n)$

• In the case of IID data with density $$f_\theta (y_i)$$, the likelihood is

$L(\theta) = \prod_{i=1}^n P_\theta (Y_i = y_i) = \prod_{i=1}^n f_\theta (y_i).$

What would that look like for our Gaussian example?
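As a preview in code (the math is developed next), here is a sketch of the Gaussian likelihood in R, where lik is our own hypothetical helper:

lik <- function(mu, s2, y) prod(dnorm(y, mu, sqrt(s2)))   # product of densities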

## Maximum Likelihood

A maximum likelihood estimator (MLE) is calculated by finding the value of $$\theta$$ which maximizes $$L(\theta)$$.

Typically, one works with the log likelihood in search of the MLE

• because it is easier mathematically, especially with IID data

$\ell(\theta) = \log L(\theta ; y) = \sum_{i=1}^n \log f_\theta (y_i).$

• The typical tactic is to maximize via derivatives,
• i.e., to solve the so-called likelihood equations

$\nabla \ell(\theta) = 0.$

## MLE for Gaussian data

Consider our motivating Gaussian example. The log likelihood is $\ell(\mu, \sigma^2) = \log L(\theta ; y) = c - \frac{n}{2} \log \sigma^2 - \frac{1}{2} \sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2},$ where $$c$$ is a constant free of $$(\mu, \sigma^2)$$.

Now differentiating and setting equal to zero gives:

\begin{aligned} \frac{\partial \ell}{\partial \mu} \Big|_{\hat{\mu}} &= 0 & \Rightarrow && \hat{\mu} &= \bar{y}, \\ \frac{\partial \ell}{\partial \sigma^2} \Big|_{\hat{\mu},\hat{\sigma}^2} &= 0 & \Rightarrow && \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2. \end{aligned}

• These are the same as our MoM estimators,
• unbiased and biased respectively.
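As a numerical check on those closed forms, one can maximize the log likelihood directly with optim; this sketch uses a log parameterization of $$\sigma^2$$ (a common trick to keep it positive, not anything from the derivation above).

y <- rnorm(100, 12, 25)   # hypothetical data
nll <- function(par)      # negative log likelihood, with sigma^2 = exp(par[2])
  -sum(dnorm(y, par[1], sqrt(exp(par[2])), log=TRUE))
out <- optim(c(0, 5), nll)          # Nelder-Mead from an arbitrary start
c(out$par[1], exp(out$par[2]))      # numerical MLEs ...
c(mean(y), mean((y - mean(y))^2))   # ... agree with the closed forms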

## Remarkably general framework

The cool thing about the MLE (as compared to MoM) is that it is remarkably generic.

• Those results that we got for the Gaussian
• e.g., that $$\hat{\mu} \equiv \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n)$$
• i.e., that the estimate is centered around its true value and it is efficient, in the sense that variance is going down with $$n$$,
• work for data under other sampling models too.

But that’s certainly not obvious at first glance.

The general result,

• the so-called asymptotic distribution of the MLE
• is similar to the central limit theorem, which is probably the most important result in all of statistics.

## Central limit theorem

The central limit theorem (CLT) says that (suitably standardized) sums of IID RVs are eventually Gaussian, no matter what their original distribution.

Let $$Y_1, \dots, Y_n$$ be a sequence of IID RVs with

• mean $$\mu = \mathbb{E}\{Y_i\}$$ and finite variance $$\sigma^2 = \mathbb{V}\mathrm{ar}\{Y_i\}$$.

Now let $$S_n = Y_1 + \cdots + Y_n$$.
$\mbox{If } \quad Z_n = \frac{S_n - n \mu}{\sigma \sqrt{n}} \quad \mbox{ then } \quad Z_n \stackrel{d}{\rightarrow} \mathcal{N}(0,1) \quad \mbox{ as } \quad n \rightarrow \infty.$

Or in other words, if $$\bar{Y}_n = S_n/n$$, then approximately $\bar{Y}_n \sim \mathcal{N}(\mu, \sigma^2/n) \quad \mbox{ for large } n.$

• in shorthand: $$\bar{Y}_n \sim \mathcal{AN}(\mu, \sigma^2/n)$$.

## Illustrating the CLT

For example, exponentially distributed RVs don’t look very normal.

y <- seq(0, 5, length=1000)
plot(y, dexp(y), type="l", lwd=2)   # Exp(1) density: mean 1, variance 1

y2 <- colMeans(matrix(rexp(2*1000), ncol=1000))   # 1000 averages of samples of size 2
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 2")
lines(y, dnorm(y, 1, 1/sqrt(2)), col="red")       # asymptotic normal: N(1, 1/2)
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(5*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 5")
lines(y, dnorm(y, 1, 1/sqrt(5)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(10*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 10")
lines(y, dnorm(y, 1, 1/sqrt(10)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(100*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 100")
lines(y, dnorm(y, 1, 1/sqrt(100)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")

y2 <- colMeans(matrix(rexp(1000*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 1000")
lines(y, dnorm(y, 1, 1/sqrt(1000)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")