Department of Statistics, Virginia Tech

Statistical **inference** is the process of leveraging a mathematical modeling apparatus to make deductions about quantities estimated from data;

- about performing the calculations required to make “good” estimates;
- understanding their properties, and how “good” they are relative to alternatives;
- and in particular understanding how those estimated quantities could have differed if you saw a different sample of data
*from the same population*.

Modeling and inference involve

- equations,
- probability distributions,
- and unknown parameters to those equations and distributions which can be estimated from data.

In order to say something about the uncertainty of estimates (of unknown parameters and thereby predictions) we need

- some assumptions about how the data relate to the population.

That’s where statistical models come in.

- We need to invest in a “cartoon” version of how the data were generated,
- one which is convenient to work with mathematically.

To keep it simple for now, lets assume that our data is an **independent and identically distributed** sub-sample from a Gaussian population:

\[ Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2), \quad i=1,\dots,n. \]

What does that mean, **independent and identically distributed**?

It means that

- each observation comes from the same population
- and therefore is a random quantity from the same distribution (\(\mathcal{N}(\mu, \sigma^2)\)),
- and that given a form of that distribution (i.e., \(\mu\) and \(\sigma^2\)), knowing the value of one of them \(Y_j = y_j\), say, does not help you narrow down values for another, e.g., \(Y_i\).

In other words, \[ \begin{aligned} p(Y_i = y_i \mid Y_j = y_j, \mu, \sigma^2) &= p(Y_i \mid \mu, \sigma^2), \quad \forall i \ne j, \quad \mbox{or} \\ p(Y_i = y_i, Y_j = y_j \mid \mu, \sigma^2) &= p(Y_i = y_i \mid \mu, \sigma^2) \cdot p(Y_j = y_j \mid \mu, \sigma^2), \end{aligned} \] similarly.

What does that mean, **Gaussian** or **normally distributed** \(\mathcal{N}(\mu, \sigma^2)\)?.

It means that

- \(\mathbb{E}\{Y_i\} = \mu\)
- \(\mathbb{V}\mathrm{ar}\{Y_i\} = \sigma^2\)
- and that the probability
**density**of a particular \(y\)-value follows \[ p(Y_i = y_i) = f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{- \frac{(y_i - \mu)^2}{\sigma^2} \right\} \] - and together that means that \[ p(\mu - 1.96 \sigma < Y_i < \mu + 1.96\sigma) = 0.95. \]
- Also observe that the density is symmetric, with most probable value \(f(\mu)\).

A “bell-shaped” curve characteristic of a Gaussian, with \(\mu=12\), \(\sigma^2=25^2\).

```
y <- seq(-63, 87, length=1000)
plot(y, dnorm(y, 12, 25), type="l", lwd=2)
abline(v=12, col="gray")
abline(v=qnorm(c(0.975,0.025),12,25), col="red", lty=2)
```

A simple way to estimate the Gaussian mean parameter \(\mu\) from data is through the empirical average.

- The
**estimate**is \(\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\). - The corresponding
**estimator**is \(\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i\).

If we regard our data as IID Gaussian, then \(\mathbb{E}\{Y_i\} = \mu\).

What does that mean for \(\bar{Y}\)? \[ \mathbb{E}\{\bar{Y}\} = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\{ Y_i \} = \frac{1}{n} \cdot n \mu = \mu. \]

- That’s good right?

An estimator \(\hat{\theta}_n \equiv \hat{\theta}(y_1,\dots, y_n)\) is said to be **unbiased** for a parameter \(\theta\) if \[
\mathbb{E}\{\hat{\theta}_n\} = \theta.
\]

In the case of our parameter \(\mu\), we have established that \(\hat{\mu}_n = \bar{Y}\) is an unbiased estimator.

Having a biased estimator is usually bad (not always; why?).

Bias would mean that, on average, what we estimate from data is not the same as what we are targeting in the population.

Notice what we’re doing here.

Implicitly we are making a frequency argument, which is at the heart of most of classical (Frequentist) statistics.

By taking expectations of \(\hat{\theta}_n\) we are imagining what sort of estimates we would get, on average, with similar data sets (re-)sampled from the population.

- If \(\hat{\theta}_n\) is unbiased, then on average we get \(\theta\).
- In our example, \(\bar{Y}\) is unbiased,
- on average it is \(\mu\).

What other properties should a good estimator have?

They should be **efficient**

- which basically means that, by similar frequency arguments, they should have
**low variance**. - Or in other words, they should make the greatest use of the limited amount of observed data, \(n\).

What is the variance of our estimator \(\bar{Y}\) of \(\mu\)?

\[ \mathbb{V}\mathrm{ar}\{ \bar{Y} \} = \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}\mathrm{ar}\{ Y_i \} = \frac{\sigma^2}{n}, \] since \(\mathbb{V}\mathrm{ar}\{a X + bY + c\} = a^2 \mathbb{V}\mathrm{ar}\{X\} + b^2 \mathbb{V}\mathrm{ar}\{Y\}\) as long as \(X\) and \(Y\) are independent.

- So the variance decreases linearly as data is added. Pretty efficient, eh?

So now we know the mean and variance of \(\bar{Y}\), our estimator. \[ \mathbb{E}\{\bar{Y}\} = \mu \quad \mbox{ and } \quad \mathbb{V}\mathrm{ar}\{\bar{Y}\} = \frac{\sigma^2}{n} \]

Now, \(\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i\), and a result from probability is that the sum of independent normal RVs is also normal.

- Normal distributions are uniquely determined by their mean and variance,
- and these are determined identically by parameters, typically notated as \(\mu\) and \(\sigma^2\).

- Since we know the mean and the variance of \(\bar{Y}\), above, it must be that \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n). \]
- This is the
**sampling distribution**of \(\bar{Y}\).

This is a remarkable result. \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n). \]

It says that, even though (almost certainly) you do not know the true population parameters \(\mu\) and \(\sigma^2\), if your population is normal, then the average you calculate from \(n\) data points is “close” to \(\mu\)

- on average it is \(\mu\), averaging over random data sets looking like \(Y_1, Y_2, \dots, Y_n\),
- and the amount which it deviates from \(\mu\) goes down as \(n\) goes up.

Now before we get too excited, observe that \(\sigma^2\) is on the scale of \((Y_i - \mu)^2\),

- so on the scale of \(Y_i\), uncertainty really goes down by \(\sigma/\sqrt{n}\).
- That means “diminishing returns” as \(n\) is increased.

It seemed sensible to estimate the population mean, \(\mu\) with the sample mean \(\bar{Y}\).

- This is, in fact, a principled mode of estimating parameters from data
- and can be extended to other parameters.

The \(p^{\mathrm{th}}\) **moment** of a RV \(Y\) is \(\mathbb{E}\{Y^p\}\).

For

- IID \(Y_1, \dots, Y_n\), sampled from a distribution with
- \(d\)-dimensional parameter \(\theta = (\theta_1, \dots, \theta_d)\),

the **method of moments (MoM)** estimator \(\hat{\theta}_n\) may be calculated by solving the following system of \(d\) equations: \[
\frac{1}{n}\sum_{i=1}^n Y_i^p = \mathbb{E}\{Y^p\}, \quad p=1, \dots, d.
\]

When \(Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\), we have \(\theta = (\mu, \sigma^2)\), and the system of equations is \[ \begin{aligned} \frac{1}{n}\sum_{i=1}^n Y_i &= \hat{\mu} \\ \frac{1}{n}\sum_{i=1}^n Y_i^2 &= \hat{\sigma}^2 + \hat{\mu}^2, \end{aligned} \]

- where the second equation utilizes that \(\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{E}\{Y^2\} - \mathbb{E}\{Y\}^2\).
- We’ve already seen the first equation – that’s our unbiased estimator for the mean.

What about that second equation?

Plugging in \(\hat{\mu}\) gives \[ \begin{aligned} \frac{1}{n}\sum_{i=1}^n Y_i^2 &= \hat{\sigma}^2 + \bar{Y}^2 \\ \hat{\sigma}^2 &= \frac{\sum_{i=1}^n Y_i^2 - n \bar{Y}^2}{n} \\ &= \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n}. \end{aligned} \]

So now we have the MoM estimator for \((\mu, \sigma^2)\).

- We’ve seen some properties of \(\hat{\mu}\);
- what about \(\hat{\sigma}^2\)?

Is the MoM estimator for \(\sigma^2\) unbiased?

\[ \begin{aligned} \mathbb{E}\{\hat{\sigma}^2\} &= \mathbb{E} \left\{\frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})^2 \right\} \\ & \;\, \vdots \\ % &= \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - 2 \sum_{i=1}^n Y_i \bar{Y} - \sum_{i=1}^n \bar{Y}^2 \right\} \\ % &= \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - 2 n \bar{Y}^2 - n \bar{Y}^2 \right\} % = \frac{1}{n} \mathbb{E} \left\{\sum_{i=1}^n Y_i^2 - n \bar{Y}^2 \right\} \\ % &= \mathbb{E}\{Y_i^2\} - \mathbb{E}\{\bar{Y}^2\} = \sigma^2 + \mu^2 - \frac{\sigma^2}{n} - \mu^2 \\ &= \frac{n-1}{n} \sigma^2 \end{aligned} \]

Nope!

- But it is
**asymptotically unbiased**.

And that’s the reason folks use the following instead: \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2. \]

You can justify the change in denominator on more intuitive grounds.

Observe how the data is being used twice when estimating discrepancy:

- To estimate the mean, via \(\bar{Y}\).
- To estimate deviations around the mean, via \((Y_i - \bar{Y})^2\).

Because you’ve used the data twice, you’ve lost “one” opportunity (in 1.) to observe variability (in 2.).

We call that “loosing a **degree of freedom (DoF)**”.

To see why a DoF is lost, consider the following special case.

If \(n=1\), then \(\bar{Y} = Y_1\), and \(\hat{\sigma}^2 = (Y_1 - Y_1)^2 = 0\).

- But that doesn’t make any sense.
- How could our estimated variance (sensibly) be zero?
- Because by using \(Y_1\) to estimate \(\mu\) we lost the opportunity to use it again to estimate \(\sigma^2\).
- We lost a DoF.

\(S^2\) is better because you end up with \(0/0\) which is undefined, which tells you

- when \(n=1\) you can have no concept of variance.

Working out the distribution of \(S^2\) requires the following result from probability.

If \(Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)\), then

- \((Y_i - \mu)^2/\sigma^2 \sim \chi^2_1\), a chi-squared distribution with \(\nu=1\) DoFs.
- \(\sum_{i=1}^n (Y_i - \mu)^2/\sigma^2 \sim \chi^2_n\).

The chi-squared distribution is positively skewed distribution with

- mean \(\nu\).
- and variance \(2\nu\).

Here is the chi-squared density with \(\nu = 81\).

```
y2 <- seq(40, 140, length=1000)
plot(y2, dchisq(y2, 81), type="l", lwd=2)
```

Now, it turns out that we can write \[ \sum_{i=1}^n \frac{(Y_i - \mu)^2}{\sigma^2} = \frac{(n-1) S^2}{\sigma^2} + \frac{n(\bar{Y} - \mu)^2}{\sigma^2}. \]

That’s useful because

- the LHS is our \(\chi^2_n\) RV
- and the second RV on the RHS is clearly \(\chi^2_1\) since \(\bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n)\).

Therefore (waving hands a little, because to do it right would involve a big digression on **moment generating functions**)

\[ \frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1} \quad \mbox{ or } \quad S^2 \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}. \]

The **likelihood** flips around the relationship between the sampling model (distributional assumptions about how the data were generated) and its parameters.

- It is a function that maps the parameter space onto the real line, giving higher values to settings of the parameters which support the observations in the data.

\[ L(\theta) \equiv L(\theta; y) = P_\theta(Y_1 = y_1, \dots, Y_n = y_n) \]

- In the case of IID data with density \(f_\theta (y_i)\), the likelihood is

\[ L(\theta) = \prod_{i=1}^n P_\theta (Y_i = y_i) = \prod_{i=1}^n f_\theta (y_i). \]

What would that look like for our Gaussian example?

A **maximum likelihood estimator (MLE)** is calculated by finding the value of \(\theta\) which maximizes \(L(\theta)\).

Typically, one works with the log likelihood in search of the MLE

- because it is easier mathematically, especially with IID data

\[ \ell(\theta) = \log L(\theta ; y) = \sum_{i=1}^n \log f_\theta (y_i). \]

- The typical tactic is to maximize via derivatives,
- i.e., to solve to so-called
**likelihood equations**

\[ \nabla \ell(\theta) = 0. \]

Consider our motivating Gaussian example. The log likelihood is \[ \ell(\mu, \sigma^2) = \log L(\theta ; y) = c + - \frac{n}{2} \log \sigma^2 - \frac{1}{2} \sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2}. \]

Now differentiatiating and setting equal to zero gives:

\[ \begin{aligned} \frac{\partial \ell}{\partial \mu} \Big{|}_{\hat{\mu}} &= 0 & \Rightarrow && \hat{\mu} &= \bar{y}. \\ \frac{\partial \ell}{\partial \sigma^2} \Big{|}_{\hat{\mu},\hat{\sigma}^2} &= 0 & \Rightarrow && \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2. \\ \end{aligned} \]

- These are the same as our MoM estimators,
- unbiased and biased respectively.

The cool thing about the MLE (as compared to MoM) is that it is remarkably generic.

- Those results that we got for the Gaussian
- e.g., that \(\hat{\mu} \equiv \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n)\)
- i.e., that the estimate is centered around its true value and it is efficient, in the sense that variance is going down with \(n\),

- work for data under other sampling models too.

But that’s certainly not obvious at first glance.

The general result,

- the so-called
**asymptotic distribution of the MLE** - is similar to the central limit theorem, which is probably the most important result in all of statistics.

The **central limit theorem (CLT)** says that sums of IID RVs are eventually Gaussian no matter what their original distribution.

Let \(Y_1, \dots, Y_n\) be a sequence of IID RVs with

- mean \(\mu = \mathbb{E}\{Y_i\}\) and variance \(\sigma^2 = \mathbb{V}\mathrm{ar}\{Y_i\}\).

Now let \(S_n = Y_1 + \cdots + Y_n\).

\[
\mbox{If } \quad Z_n = \frac{S_n - n \mu}{\sigma \sqrt{n}} \quad \mbox{ then } \quad Z_n \stackrel{d}{\rightarrow} \mathcal{N}(0,1) \quad \mbox{ as } \quad n \rightarrow \infty.
\]

Or in other words, if \(\bar{Y}_n = S_n/n\), then \[ \bar{Y}_n \sim \mathcal{N}(\mu, \sigma^2/n) \quad \mbox{ as } \quad n \rightarrow \infty. \]

- in shorthand: \(\bar{Y}_n \sim \mathcal{AN}(\mu, \sigma^2/n)\).

For example, exponentially distributed RVs don’t look very normal.

```
y <- seq(0, 5, length=1000)
plot(y, dexp(y), type="l", lwd=2)
```

```
y2 <- colMeans(matrix(rexp(2*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 2")
lines(y, dnorm(y, 1, 1/sqrt(2)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")
```

```
y2 <- colMeans(matrix(rexp(5*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 5")
lines(y, dnorm(y, 1, 1/sqrt(5)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")
```

```
y2 <- colMeans(matrix(rexp(10*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 10")
lines(y, dnorm(y, 1, 1/sqrt(10)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")
```

```
y2 <- colMeans(matrix(rexp(100*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 100")
lines(y, dnorm(y, 1, 1/sqrt(100)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")
```

```
y2 <- colMeans(matrix(rexp(1000*1000), ncol=1000))
hist(y2, main="", freq=FALSE, xlab="averages of 1000 samples of size 1000")
lines(y, dnorm(y, 1, 1/sqrt(1000)), col="red")
legend("topright", "AN", lty=1, col=2, bty="n")
```