Department of Statistics, Virginia Tech

Let \(Z \sim \mathcal{N}(0, 1)\). Now,

\[ \mathbb{P}(-1.96 < Z < 1.96) = \mathbb{P}(Z < 1.96) - \mathbb{P}(Z < -1.96) \approx 0.95. \]

`pnorm(1.96) - pnorm(-1.96)`

`## [1] 0.9500042`

These are the **central 95% quantiles** of a standard normal distribution.

```
z <- seq(-3, 3, length=1000)
plot(z, dnorm(z), type="l", lwd=2, main="standard normal", ylab="N(0,1) density")
abline(v=c(-1,1)*1.96, lty=2, col=2)
legend("bottom", "95% quantiles", lty=2, col=2)
```

Of course, there are uncountably many such 95% intervals, but centrality

- which in this case means 2.5% omitted on the left and 2.5% on the right

makes this one unique.
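For instance, chopping 1% off the left tail and 4% off the right also captures 95%, but yields a wider interval than the central one:

```r
## an asymmetric 95% interval: 1% omitted on the left, 4% on the right
asym <- qnorm(c(0.01, 0.96))
pnorm(asym[2]) - pnorm(asym[1])  ## still 0.95
diff(asym)                       ## wider than 2*1.96 = 3.92
```

For a symmetric, unimodal density like the standard normal, the central interval is the shortest one achieving the desired coverage.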

The `q*` functions in R will give you the precise quantile values.

`qnorm(c(0.025, 0.975))`

`## [1] -1.959964 1.959964`

`pnorm(qnorm(0.975)) - pnorm(qnorm(0.025))`

`## [1] 0.95`

90% and 50% are also common.

```
plot(z, dnorm(z), type="l", lwd=2, main="standard normal", ylab="N(0,1) density")
abline(v=qnorm(c(0.05, 0.95)), lty=2, col=3)
abline(v=qnorm(c(0.25, 0.75)), lty=2, col=4)
legend("topright", c("90%", "50%"), lty=2, col=c(3:4))
```

To illustrate the “long run” thinking behind the frequency argument for CIs, consider the following MC experiment.

- The true mean is \(\mu = 5\), which we will not presume to know.
- The true variance is \(\sigma^2 = 2\), which we will assume we know (for now).
- The sample size is \(n = 30\).
- We repeatedly (a thousand times) draw samples \(Y_1, \dots, Y_{30}\) from \(\mathcal{N}(5, 2^2)\),
- calculate \(\bar{y}\) for each sample,
- and the 95% CI associated with each \(\bar{y}\).

```
ys <- matrix(rnorm(30*1000, 5, 2), ncol=30)
ybars <- rowMeans(ys)
q95 <- qnorm(0.975)
ybar.se <- 2/sqrt(30)
CIs <- cbind(ybars - q95*ybar.se, ybars + q95*ybar.se)
```

Some are in and some are out.

```
inside <- which(5 > CIs[,1] & 5 < CIs[,2])
CIs[inside[1],]
```

`## [1] 4.369601 5.800957`

`CIs[setdiff(1:1000, inside)[1],]`

`## [1] 5.129830 6.561185`

About how often are they in?

`length(inside)/nrow(CIs)`

`## [1] 0.943`

In this MC experiment it looks like our **empirical coverage** is very close to the actual, desired, or **nominal coverage** of 95%. Since we generated the data so that it followed the assumptions (Gaussian, etc.), we shouldn’t be surprised by this result.

- But real data never really conform to our assumptions
- and we don’t know what the truth is
- and we don’t have replicates (at least not without more work).

It also helped that we pretended the variance was known

- and we used the right value!

If we use the wrong variance, all bets are off.

```
ybar.se <- 1/sqrt(30) ## instead of 2/sqrt(30)
CIs <- cbind(ybars - q95*ybar.se, ybars + q95*ybar.se)
inside <- which(5 > CIs[,1] & 5 < CIs[,2])
length(inside)/nrow(CIs)
```

`## [1] 0.669`

- Woah, that’s way too small!
- Make the standard error (\(\sigma/\sqrt{n}\)) too big and the interval will way over-cover.
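The over-coverage direction is easy to check by rerunning the same experiment with a standard error that is too big (a self-contained sketch, doubling the true \(\sigma = 2\)):

```r
## same MC experiment, but with an inflated standard error
ys <- matrix(rnorm(30*1000, 5, 2), ncol=30)
ybars <- rowMeans(ys)
q95 <- qnorm(0.975)
ybar.se <- 4/sqrt(30)  ## too big: the true sigma is 2
CIs <- cbind(ybars - q95*ybar.se, ybars + q95*ybar.se)
inside <- which(5 > CIs[,1] & 5 < CIs[,2])
length(inside)/nrow(CIs)
```

The empirical coverage comes out at essentially one, far above the nominal 95%.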

So if we assume we know that \(\sigma^2 = 25^2\) in our motivating example, we can easily calculate a CI for SAP ROE.

```
nucleus <- read.csv("../data/nucleus.csv")
sap.ybar <- mean(nucleus$ROE)
n <- length(nucleus$ROE)
sap.ybar + c(-1,1)*q95*25/sqrt(n)
```

`## [1] 7.192693 18.081381`

It is easy to see that the industry average is well within that range.

`mean(nucleus$IndustryROE)`

`## [1] 15.69877`

- But there is more to it than that.

Here is the chi-squared density with \(\nu = 80\).

```
y2 <- seq(40, 140, length=1000)
plot(y2, dchisq(y2, 80), type="l", lwd=2, main="Chi-squared DoF=80")
```

The distribution for \(\bar{Y}\) depends on \(\sigma^2\): \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n) \quad \mbox{ and we know that } \quad s^2 \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}. \]
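That claimed distribution for \(s^2\) is easy to verify by Monte Carlo (a sketch with \(n = 30\) and \(\sigma = 2\), matching the experiment above):

```r
## sample variances of N(0, sigma^2) data vs the scaled chi-squared
n <- 30; sigma <- 2
s2 <- apply(matrix(rnorm(n*10000, 0, sigma), ncol=n), 1, var)
mean(s2)                          ## close to sigma^2 = 4
sigma^2*qchisq(0.95, n-1)/(n-1)   ## chi-squared-based 95th percentile
quantile(s2, 0.95)                ## empirical version; should agree
```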

- How do we combine those two?

Reversing the logic a bit, we have \[ \sigma^2 \sim (n-1) s^2 / \chi^2_{n-1}, \]

- giving us a way to generate random \(\sigma^2\) values to plug into the sampling distribution of \(\bar{Y}\) \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n), \]
- which in turn, for each \(\sigma^2\) thus generated, gives us a distribution for \(\bar{Y}\) that depends on only one unknown quantity, \(\mu\).

By repeating that process we are essentially integrating over \(\sigma^2\),

- which we can illustrate by Monte Carlo.

Suppose \(\mu = 12\).

```
ybar <- seq(2, 22, length=1000)
sap.y <- nucleus$ROE
sigma2 <- 80* var(sap.y) / rchisq(10, 80)
plot(ybar, dnorm(ybar, 12, sqrt(sigma2[1]/81)), type="l", lwd=2, ylim=c(0,0.17), main="scale mixture")
for(i in 2:10) lines(ybar, dnorm(ybar, 12, sqrt(sigma2[i]/81)), type="l", lwd=2)
```

Now converting to draws for \(\bar{Y}\).

```
sigma2 <- 80*var(sap.y) / rchisq(1000, 80)  ## sigma^2 ~ (n-1)s^2/chisq, as above
hist(rnorm(1000, 12, sqrt(sigma2/81)), freq=FALSE, main="scale mixture", xlab="y-bar")
```

A statistician working for Guinness, William Sealy Gosset, who published under the pseudonym “Student”, showed that this scale mixture has a convenient closed form.

\[ \frac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1} \quad \Leftrightarrow \quad \bar{Y} \sim t_{n-1}(\mu, s^2/n). \]

```
hist(rnorm(1000, 12, sqrt(sigma2/81)), freq=FALSE, main="scale mixture", xlab="y-bar")
sdybar <- sqrt(var(sap.y)/81)
lines(ybar, dt(((ybar-12)/sdybar), 80)/sdybar, col=2, lwd=2, lty=2)
legend("topright", "Student-t", col=2, lty=2, lwd=2)
```

Returning to our Monte Carlo experiment:

```
qt95 <- qt(0.975, 30-1)
s <- apply(ys, 1, sd)
ybar.tse <- s/sqrt(30)
CIs <- cbind(ybars - qt95*ybar.tse, ybars + qt95*ybar.tse)
inside <- which(5 > CIs[,1] & 5 < CIs[,2])
length(inside)/nrow(CIs)
```

`## [1] 0.942`

- Pretty good.
- Sometimes a little high or low, but nothing systematic.
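That bounce around the nominal level can be seen directly by repeating the whole experiment a handful of times (a sketch, same settings as above):

```r
## five independent replications of the t-interval coverage experiment
coverage <- replicate(5, {
  ys <- matrix(rnorm(30*1000, 5, 2), ncol=30)
  ybars <- rowMeans(ys)
  ses <- apply(ys, 1, sd)/sqrt(30)
  CIs <- cbind(ybars - qt(0.975, 29)*ses, ybars + qt(0.975, 29)*ses)
  mean(5 > CIs[,1] & 5 < CIs[,2])
})
coverage  ## estimates scattered around 0.95
```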

Returning to our SAP ROE example, but now abandoning our \(\sigma^2 = 25^2\) assumption, we obtain the following CI.

```
sap.se <- sd(nucleus$ROE)/sqrt(n)
qt95 <- qt(0.975, n-1)
sap.ybar + c(-1,1)*qt95*sap.se
```

`## [1] 6.963478 18.310596`

Wider than what we had before.

`sap.ybar + c(-1,1)*q95*25/sqrt(n)`

`## [1] 7.192693 18.081381`

R provides a simple command to calculate a CI, which gives the same result.

```
rbind(sap.ybar + c(-1,1)*qt95*sap.se,
confint(lm(nucleus$ROE~1)))
```

```
##                 2.5 %  97.5 %
##              6.963478 18.3106
## (Intercept)  6.963478 18.3106
```

Actually, we are asking R to fit an ordinary least squares regression (a linear model, or `lm`) with an intercept only,

- and then asking for the CI (`confint`) on that intercept.
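For a one-sample mean, `t.test` produces the same interval as `confint` on an intercept-only `lm`; a quick sketch on simulated data (the same check would apply to `nucleus$ROE`):

```r
## t.test and an intercept-only lm give identical 95% CIs
y <- rnorm(30, 5, 2)
rbind(t.test(y)$conf.int, confint(lm(y ~ 1)))
```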