## Quantiles

Let $$Z \sim \mathcal{N}(0, 1)$$. Now,

$\mathbb{P}(-1.96 < Z < 1.96) = \mathbb{P}(Z < 1.96) - \mathbb{P}(Z < -1.96) \approx 0.95.$

pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042

These are the central 95% quantiles of a standard normal distribution.

z <- seq(-3, 3, length=1000)
plot(z, dnorm(z), type="l", lwd=2, main="standard normal", ylab="N(0,1) density")
abline(v=c(-1,1)*1.96, lty=2, col=2)
legend("bottom", "95% quantiles", lty=2, col=2)

Of course, there are uncountably many such 95% intervals, but centrality

• which in this case means 2.5% omitted on the left and 2.5% on the right

makes it unique.
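Centrality also happens to minimize length here: any interval of the form $$[\Phi^{-1}(a), \Phi^{-1}(a + 0.95)]$$ with $$0 < a < 0.05$$ captures 95% probability, and a quick sketch (the grid of $$a$$ values is arbitrary) shows that the central choice $$a = 0.025$$ gives the shortest one.

```r
## width of the interval [qnorm(a), qnorm(a + 0.95)] as a function of a
a <- seq(0.001, 0.049, length=99)
widths <- qnorm(a + 0.95) - qnorm(a)
a[which.min(widths)]  ## the central choice
## [1] 0.025
```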

The q* functions in R will give you the precise quantile values.

qnorm(c(0.025, 0.975))
## [1] -1.959964  1.959964
pnorm(qnorm(0.975)) - pnorm(qnorm(0.025))
## [1] 0.95

90% and 50% are also common.
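For reference, their quantile values are as follows.

```r
qnorm(c(0.05, 0.95))  ## central 90%
## [1] -1.644854  1.644854
qnorm(c(0.25, 0.75))  ## central 50%
## [1] -0.6744898  0.6744898
```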

plot(z, dnorm(z), type="l", lwd=2, main="standard normal", ylab="N(0,1) density")
abline(v=qnorm(c(0.05, 0.95)), lty=2, col=3)
abline(v=qnorm(c(0.25, 0.75)), lty=2, col=4)
legend("topright", c("90%", "50%"), lty=2, col=c(3:4))

## Monte Carlo coverage of a Confidence Interval

To illustrate the “long run” thinking behind the frequency argument for CIs, consider the following MC experiment.

• The true mean is $$\mu = 5$$, which we will not presume to know.
• The true variance is $$\sigma^2 = 2$$, which we will assume we know (for now).
• The sample size is $$n = 30$$.
• We draw samples $$Y_1, \dots, Y_{30}$$ from $$\mathcal{N}(5, 2^2)$$ a thousand times,
• calculate the $$\bar{y}$$’s,
• and the 95% CI associated with each $$\bar{y}$$.
ys <- matrix(rnorm(30*1000, 5, 2), ncol=30)
ybars <- rowMeans(ys)
q95 <- qnorm(0.975)
ybar.se <- 2/sqrt(30)
CIs <- cbind(ybars - q95*ybar.se, ybars + q95*ybar.se)

Some are in and some are out.

inside <- which(5 > CIs[,1] & 5 < CIs[,2])
CIs[inside[1],]
## [1] 4.369601 5.800957
CIs[setdiff(1:1000, inside)[1],]
## [1] 5.129830 6.561185

About how often are they in?

length(inside)/nrow(CIs)
## [1] 0.943

In this MC experiment it looks like our empirical coverage is very close to the actual, desired, or nominal coverage of 95%. Since we generated the data so that it followed the assumptions (Gaussian, etc.), we shouldn’t be surprised by this result.
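How close should we expect it to be? The empirical coverage is itself a Monte Carlo estimate with binomial sampling error; a quick sketch of its standard error under the nominal 0.95:

```r
B <- 1000                    ## number of MC replicates
se.cov <- sqrt(0.95*0.05/B)  ## binomial SE of the empirical coverage
c(0.95 - 2*se.cov, 0.95 + 2*se.cov)  ## 0.943 sits comfortably inside this band
```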

• But real data never really conform to our assumptions
• and we don’t know what the truth is
• and we don’t have replicates (at least not without more work).

And because we pretended the variance was known

• and we used the right value!

If we use the wrong variance, all bets are off.

ybar.se <- 1/sqrt(30)  ## instead of 2/sqrt(30)
CIs <- cbind(ybars - q95*ybar.se, ybars + q95*ybar.se)
inside <- which(5 > CIs[,1] & 5 < CIs[,2])
length(inside)/nrow(CIs)
## [1] 0.669
• Whoa, that’s way too small!
• Make the standard error ($$\sigma/\sqrt{n}$$) too big, on the other hand, and the interval will badly over-cover.
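Here is the other direction, as a fresh, self-contained sketch of the same experiment with the standard error doubled rather than halved (the seed is arbitrary, just for reproducibility).

```r
set.seed(1)  ## arbitrary seed
ys <- matrix(rnorm(30*1000, 5, 2), ncol=30)
ybars <- rowMeans(ys)
q95 <- qnorm(0.975)
ybar.se <- 4/sqrt(30)  ## instead of 2/sqrt(30)
CIs <- cbind(ybars - q95*ybar.se, ybars + q95*ybar.se)
mean(5 > CIs[,1] & 5 < CIs[,2])  ## essentially 1: massive over-coverage
```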

## SAP ROE example

So if we assume we know that $$\sigma^2 = 25^2$$ in our motivating example, we can easily calculate a CI for SAP ROE.

nucleus <- read.csv("../data/nucleus.csv")
sap.ybar <- mean(nucleus$ROE)
n <- length(nucleus$ROE)
sap.ybar + c(-1,1)*q95*25/sqrt(n)
## [1]  7.192693 18.081381

It is easy to see that the industry average is well within that range.

mean(nucleus$IndustryROE)
## [1] 15.69877

• But there is more to it than that.

## Combining $$s$$ and $$\bar{Y}$$ for the Student-$$t$$

Here is the chi-squared density with $$\nu = 80$$.

y2 <- seq(40, 140, length=1000)
plot(y2, dchisq(y2, 80), type="l", lwd=2, main="Chi-squared DoF=80")

The distribution for $$\bar{Y}$$ depends on $$\sigma^2$$

$\bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n) \quad \mbox{ and we know that } \quad s^2 \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}.$

• How do we combine those two?

Reversing the logic a bit, we have

$\sigma^2 \sim (n-1) s^2 / \chi^2_{n-1}$

• giving us a way to generate random $$\sigma^2$$ values to plug into the sampling distribution of $$\bar{Y}$$

$\bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n).$

• which in turn, for each $$\sigma^2$$ thus generated, gives us a distribution for $$\bar{Y}$$ that depends on only one unknown quantity, $$\mu$$.

By repeating that process we are essentially integrating over $$\sigma^2$$,

• which we can illustrate by Monte Carlo.

Suppose $$\mu = 12$$.

ybar <- seq(2, 22, length=1000)
sap.y <- nucleus$ROE
sigma2 <- 80*var(sap.y) / rchisq(10, 80)
plot(ybar, dnorm(ybar, 12, sqrt(sigma2[1]/81)), type="l", lwd=2, ylim=c(0,0.17), main="scale mixture")
for(i in 2:10) lines(ybar, dnorm(ybar, 12, sqrt(sigma2[i]/81)), type="l", lwd=2)

Now converting to draws for $$\bar{Y}$$.

sigma2 <- (var(sap.y)/80) * rchisq(1000, 80)
hist(rnorm(1000, 12, sqrt(sigma2/81)), freq=FALSE, main="scale mixture", xlab="y-bar")

A statistician working for Guinness (William Sealy Gosset), who published under the pseudonym “Student”, showed that this scale mixture has a convenient closed form.

$\frac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1} \quad \Leftrightarrow \quad \bar{Y} \sim t_{n-1}(\mu, s^2/n).$

hist(rnorm(1000, 12, sqrt(sigma2/81)), freq=FALSE, main="scale mixture", xlab="y-bar")
sdybar <- sqrt(var(sap.y)/81)
lines(ybar, dt(((ybar-12)/sdybar), 80)/sdybar, col=2, lwd=2, lty=2)
legend("topright", "Student-t", col=2, lty=2, lwd=2)
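Student’s identity can also be checked by simulation. Here is a sketch with generic stand-in values ($$\mu = 0$$, $$s^2 = 1$$, $$n = 81$$, so 80 degrees of freedom) rather than the SAP numbers; the seed is arbitrary.

```r
set.seed(1)                              ## arbitrary seed
n <- 81
sigma2 <- (n-1) / rchisq(100000, n-1)    ## sigma^2 ~ (n-1) s^2 / chisq, s^2 = 1
yb <- rnorm(100000, 0, sqrt(sigma2/n))   ## ybar | sigma^2 ~ N(mu, sigma^2/n)
t.draws <- yb / (1/sqrt(n))              ## standardize by s/sqrt(n)
quantile(t.draws, 0.975)                 ## should be near qt(0.975, 80)
qt(0.975, n-1)
```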

## CI with unknown variance

Returning to our Monte Carlo experiment:

qt95 <- qt(0.975, 30-1)
s <- apply(ys, 1, sd)
ybar.tse <- s/sqrt(30)
CIs <- cbind(ybars - qt95*ybar.tse, ybars + qt95*ybar.tse)
inside <- which(5 > CIs[,1] & 5 < CIs[,2])
length(inside)/nrow(CIs)
## [1] 0.942
• Pretty good.
• Sometimes a little high or low, but nothing systematic.
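That “nothing systematic” claim can itself be checked by repeating the whole experiment a handful of times (the seed and the five repetitions are arbitrary choices).

```r
set.seed(1)  ## arbitrary seed
coverage <- replicate(5, {
  ys <- matrix(rnorm(30*1000, 5, 2), ncol=30)
  half <- qt(0.975, 29) * apply(ys, 1, sd)/sqrt(30)  ## t-based half-width
  mean(abs(rowMeans(ys) - 5) < half)                 ## fraction covering mu = 5
})
coverage  ## each run close to 0.95, wobbling a little either way
```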

Returning to our SAP ROE example, but now abandoning our $$\sigma^2 = 25^2$$ assumption, we obtain the following CI.

sap.se <- sd(nucleus$ROE)/sqrt(n)
qt95 <- qt(0.975, n-1)
sap.ybar + c(-1,1)*qt95*sap.se
## [1] 6.963478 18.310596

Wider than what we had before.

sap.ybar + c(-1,1)*q95*25/sqrt(n)
## [1] 7.192693 18.081381

R provides a simple command to calculate a CI, which gives the same result.

rbind(sap.ybar + c(-1,1)*qt95*sap.se, confint(lm(nucleus$ROE~1)))
##                2.5 %  97.5 %
##             6.963478 18.3106
## (Intercept) 6.963478 18.3106

Actually, we are asking R to fit an ordinary least squares regression (a linear model, or lm) with an intercept only.

• And then asking for the CI (confint) on the intercept.
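To see that equivalence concretely, here is a self-contained sketch with simulated data (arbitrary parameters) standing in for nucleus.csv: the intercept-only fit recovers the sample mean, and confint matches the hand calculation.

```r
set.seed(1)              ## arbitrary seed and parameters
y <- rnorm(50, 10, 3)
fit <- lm(y ~ 1)
coef(fit)                ## the intercept is just mean(y)
n <- length(y)
mean(y) + c(-1,1)*qt(0.975, n-1)*sd(y)/sqrt(n)
confint(fit)             ## identical to the hand calculation above
```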