Quantiles

Let \(Z \sim \mathcal{N}(0, 1)\). Now,

\[ \mathbb{P}(-1.96 < Z < 1.96) = \mathbb{P}(Z < 1.96) - \mathbb{P}(Z < -1.96) \approx 0.95. \]

pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042

The endpoints \(\pm 1.96\) are the 2.5% and 97.5% quantiles, delimiting the central 95% of a standard normal distribution.

z <- seq(-3, 3, length=1000)
plot(z, dnorm(z), type="l", lwd=2, main="standard normal", ylab="N(0,1) density")
abline(v=c(-1,1)*1.96, lty=2, col=2) 
legend("bottom", "95% quantiles", lty=2, col=2)

Of course, there are uncountably many such 95% intervals, but centrality makes this one unique.
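
For instance, here is a quick check: any interval of the form (qnorm(a), qnorm(a + 0.95)) captures 95% probability, but the central choice a = 0.025 is the narrowest.

a <- c(0.01, 0.025, 0.04)
pnorm(qnorm(a + 0.95)) - pnorm(qnorm(a))  ## each has exactly 95% coverage
qnorm(a + 0.95) - qnorm(a)                ## the central (a = 0.025) interval is shortest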

The q* functions in R will give you the precise quantile values.

qnorm(c(0.025, 0.975))
## [1] -1.959964  1.959964
pnorm(qnorm(0.975)) - pnorm(qnorm(0.025))
## [1] 0.95

90% and 50% intervals are also common.

plot(z, dnorm(z), type="l", lwd=2, main="standard normal", ylab="N(0,1) density")
abline(v=qnorm(c(0.05, 0.95)), lty=2, col=3) 
abline(v=qnorm(c(0.25, 0.75)), lty=2, col=4) 
legend("topright", c("90%", "50%"), lty=2, col=c(3:4))

Monte Carlo coverage of a confidence interval

To illustrate the “long run” thinking behind the frequency argument for CIs, consider the following MC experiment.

ys <- matrix(rnorm(30*1000, 5, 2), ncol=30)  ## 1000 samples of size n=30 from N(5, 2^2)
ybars <- rowMeans(ys)                        ## one sample mean per row
q95 <- qnorm(0.975)
ybar.se <- 2/sqrt(30)                        ## true standard error: sigma=2 is known
CIs <- cbind(ybars - q95*ybar.se, ybars + q95*ybar.se)

Some intervals capture the true mean \(\mu = 5\) and some do not.

inside <- which(5 > CIs[,1] & 5 < CIs[,2])  ## indices of intervals containing mu=5
CIs[inside[1],]                             ## the first interval that captures 5
## [1] 4.369601 5.800957
CIs[setdiff(1:1000, inside)[1],]            ## the first interval that misses
## [1] 5.129830 6.561185

Empirically, how often do they capture it?

length(inside)/nrow(CIs)
## [1] 0.943

In this MC experiment it looks like our empirical coverage is very close to the actual, desired, or nominal coverage of 95%. Since we generated the data so that it followed the assumptions (Gaussian, etc.), and because we pretended the variance was known, we shouldn’t be surprised by this result.
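
As a sanity check, the Monte Carlo standard error of an empirical coverage proportion estimated from 1000 intervals is roughly

sqrt(0.95 * 0.05 / 1000)  ## about 0.007, so 0.943 is well within noise of 0.95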

If we use the wrong variance, however, all bets are off.

ybar.se <- 1/sqrt(30)  ## instead of 2/sqrt(30)
CIs <- cbind(ybars - q95*ybar.se, ybars + q95*ybar.se)
inside <- which(5 > CIs[,1] & 5 < CIs[,2])
length(inside)/nrow(CIs)
## [1] 0.669
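
We can back this number out theoretically as a side calculation: the interval covers whenever \(|\bar{Y} - 5| < 1.96/\sqrt{30}\), while the true standard deviation of \(\bar{Y}\) is \(2/\sqrt{30}\), so the coverage is \(\mathbb{P}(|Z| < 1.96/2)\).

pnorm(q95/2) - pnorm(-q95/2)  ## about 0.673, agreeing with the simulation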

SAP ROE example

So if we assume we know that \(\sigma^2 = 25^2\) in our motivating example, we can easily calculate a CI for SAP ROE.

nucleus <- read.csv("../data/nucleus.csv")
sap.ybar <- mean(nucleus$ROE)      ## sample mean of SAP ROE
n <- length(nucleus$ROE)           ## n = 81
sap.ybar + c(-1,1)*q95*25/sqrt(n)  ## 95% CI with sigma = 25 assumed known
## [1]  7.192693 18.081381

It is easy to see that the industry average is well within that range.

mean(nucleus$IndustryROE)
## [1] 15.69877

Combining \(s\) and \(\bar{Y}\) for the Student-\(t\)

Here is the chi-squared density with \(\nu = 80\), i.e., \(n - 1\) for the SAP data.

y2 <- seq(40, 140, length=1000)
plot(y2, dchisq(y2, 80), type="l", lwd=2, main="Chi-squared DoF=80")

The distribution for \(\bar{Y}\) depends on \(\sigma^2\): \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n) \quad \mbox{ and we know that } \quad s^2 \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}. \]
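
A small simulation sketch, under the same Gaussian assumptions, confirms the claim about \(s^2\) (the names m and sig are just for this check, to avoid clobbering n above):

m <- 30; sig <- 2                              ## sample size and sd for this check only
s2 <- apply(matrix(rnorm(m*1000, 0, sig), ncol=m), 1, var)
hist((m-1)*s2/sig^2, freq=FALSE, breaks=30, main="chi-squared check")
xs <- seq(5, 70, length=500)
lines(xs, dchisq(xs, m-1), col=2, lwd=2)       ## chi-squared density with m-1 DoF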

Reversing the logic a bit, we have \[ \sigma^2 \sim (n-1) s^2 / \chi^2_{n-1}, \] giving us a way to generate random \(\sigma^2\) values to plug into the sampling distribution of \(\bar{Y}\), \[ \bar{Y} \sim \mathcal{N}(\mu, \sigma^2/n), \] which in turn, for each \(\sigma^2\) thus generated, gives us a distribution for \(\bar{Y}\) that depends on only one unknown quantity, \(\mu\).

By repeating that process we are essentially integrating over \(\sigma^2\), producing what is called a scale mixture of normals.

Suppose \(\mu = 12\).

ybar <- seq(2, 22, length=1000)
sap.y <- nucleus$ROE
sigma2 <- 80*var(sap.y) / rchisq(10, 80)  ## ten draws of sigma^2 ~ (n-1) s^2 / chisq_{n-1}
plot(ybar, dnorm(ybar, 12, sqrt(sigma2[1]/81)), type="l", lwd=2, ylim=c(0,0.17), main="scale mixture")
for(i in 2:10) lines(ybar, dnorm(ybar, 12, sqrt(sigma2[i]/81)), type="l", lwd=2)

Now we convert these into 1000 draws for \(\bar{Y}\), one for each draw of \(\sigma^2\).

sigma2 <- 80*var(sap.y) / rchisq(1000, 80)  ## 1000 draws of sigma^2, as above
hist(rnorm(1000, 12, sqrt(sigma2/81)), freq=FALSE, main="scale mixture", xlab="y-bar")

A statistician working for Guinness, William Sealy Gosset, who published under the pseudonym “Student”, showed that this scale mixture has a convenient closed form.

\[ \frac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1} \quad \Leftrightarrow \quad \bar{Y} \sim t_{n-1}(\mu, s^2/n). \]

hist(rnorm(1000, 12, sqrt(sigma2/81)), freq=FALSE, main="scale mixture", xlab="y-bar")
sdybar <- sqrt(var(sap.y)/81)  ## s/sqrt(n)
lines(ybar, dt((ybar-12)/sdybar, 80)/sdybar, col=2, lwd=2, lty=2)  ## location-scale t density
legend("topright", "Student-t", col=2, lty=2, lwd=2)

CI with unknown variance

Returning to our Monte Carlo experiment:

qt95 <- qt(0.975, 30-1)   ## t quantile replaces the normal one
s <- apply(ys, 1, sd)     ## estimated sd for each sample
ybar.tse <- s/sqrt(30)    ## estimated standard error
CIs <- cbind(ybars - qt95*ybar.tse, ybars + qt95*ybar.tse)
inside <- which(5 > CIs[,1] & 5 < CIs[,2])
length(inside)/nrow(CIs)
## [1] 0.942

Returning to our SAP ROE example, but now abandoning our \(\sigma^2 = 25^2\) assumption, we obtain the following CI.

sap.se <- sd(nucleus$ROE)/sqrt(n)
qt95 <- qt(0.975, n-1)
sap.ybar + c(-1,1)*qt95*sap.se
## [1]  6.963478 18.310596

Wider than what we had before, both because the Student-\(t\) quantile (about 1.99) exceeds 1.96 and because the sample standard deviation turns out to be a bit larger than the assumed 25.

sap.ybar + c(-1,1)*q95*25/sqrt(n)
## [1]  7.192693 18.081381

R provides a simple command to calculate a CI, which gives the same result.

rbind(sap.ybar + c(-1,1)*qt95*sap.se,
    confint(lm(nucleus$ROE~1)))
##                2.5 %  97.5 %
##             6.963478 18.3106
## (Intercept) 6.963478 18.3106

Actually, we are asking R to fit an ordinary least squares regression (a linear model, or lm) with an intercept only; the fitted intercept is just \(\bar{y}\) with standard error \(s/\sqrt{n}\), which is why confint reproduces the \(t\)-interval above.
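
A quick check of that connection:

coef(lm(nucleus$ROE ~ 1))  ## the (Intercept) estimate equals mean(nucleus$ROE), i.e., sap.ybar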