I thought that's what we were doing

So far we have been fitting lines to points.

  • We haven't really been doing statistics because we haven't really introduced any probabilities.
  • We don't really have a model.
  • And we'll need a model if we are going to perform inference:
    • to get a sense of estimation uncertainty,
    • to test for the usefulness of explanatory variables (\(x\)),
    • and to put "error bars" on our predictions from the line(s).

Prediction and the modeling goal

A prediction rule is any function where you input an \(x\) and it outputs \(\hat{y} \equiv \hat{y}(x)\) as a predicted response at \(x\).

The least squares line generates a prediction rule: \[ \hat{y} = b_0 + b_1 x. \]

  • The rule tells us what to do when a new \(x\) comes along;
  • run it through the formula and obtain a guess \(\hat{y}(x)\).

\(\hat{y}\) is just what we expect for a given \(x\);

  • it is not going to be a perfect prediction.
  • If we actually sample a \(y\)-value at \(x\) we will not get exactly \(\hat{y}\).
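
As a minimal illustration, a prediction rule in R is just a function; the coefficients plugged in below are the house-price estimates reported later in these notes.

```r
# A prediction rule: x in, yhat out. The least squares line is one such rule.
# The coefficients here are from the house-price fit quoted later in these notes.
yhat <- function(x, b0, b1) b0 + b1 * x
yhat(2, b0 = 38.885, b1 = 35.386)
## [1] 109.657
```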


Forecast accuracy

We need to devise a notion of forecast accuracy.

  • How sure are we about our forecast?
  • Or how different could \(Y(x)\) be from what we expect?

Forecasts are useless without some kind of uncertainty qualification/quantification.

One method is to specify a range of \(Y(x)\) values that are likely at a given \(x\) value,

  • a so-called prediction interval, quickly summarizing the distribution of \(Y(x)\).

Full accounting of uncertainty

To construct a "proper" prediction interval we will have to:

  1. obtain estimates of the coefficients, \(b_0\) and \(b_1\), generating the "fitted" line
  2. assess the likely range of residual values corresponding to a \(Y(x)\) that has not yet been observed, centered at the fitted line \(\hat{y}(x) = b_0 + b_1 x\).
  3. acknowledge that the "fitted line" may be fooled by particular realizations of the residuals
    • i.e., that our estimated coefficients \(b_0\) and \(b_1\) are random.

We know how to do Step 1 already.

  • Steps 2 and 3 require investing in a probability model.

Simple linear regression (SLR) model

Here it is: \[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2), \quad \mbox{ for } \quad i=1,\dots, n. \]

Similar, but with important differences.

  • It is a model, so we are assuming this relationship holds for some true but unknown values of \(\beta_0, \beta_1\).
  • Greek letters remind us that they are not the same as the LS estimates \(b_0\) or \(b_1\)
    • or any other estimate for that matter.
  • The errors \(\varepsilon_i\) are independent, additive, "idiosyncratic noise".
  • Their distribution is known up to its spread \(\sigma^2\).
  • Greek letters remind us that \(\varepsilon_i\) is not \(e_i\).

Data generating mechanism

Before looking at any data, the model specifies,

  • how \(Y\) varies with \(x\) on average,
  • and the influence of factors other than \(x\), namely \(\varepsilon\) which is independent of \(x\).

We think about the data as being one possible realization of data that could have been generated from the model \[ Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2). \]

  • \(\sigma^2\) controls the dispersion of \(Y_i\) around \(\beta_0 + \beta_1 x_i\).
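
To make the data generating mechanism concrete, here is a quick simulation sketch; the "true" parameter values are made up for illustration.

```r
# Draw one possible realization from the SLR model with assumed true values.
set.seed(42)
n <- 50
beta0 <- 1; beta1 <- 2; sigma <- 0.5           # made-up "truth"
x <- runif(n)
y <- beta0 + beta1 * x + rnorm(n, 0, sigma)    # Y_i | x_i ~ N(beta0 + beta1 x_i, sigma^2)
coef(lm(y ~ x))    # the LS estimates land near the true (1, 2)
```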


Conditional variance

In this setup \[ Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2). \]

The conditional variance is \[ \mathbb{V}\mathrm{ar}\{Y \mid x\} = \mathbb{V}\mathrm{ar}\{ \beta_0 + \beta_1 x + \varepsilon \mid x\} = \mathbb{V}\mathrm{ar}\{\varepsilon\} = \sigma^2. \]

  • We customarily write \(x\) on the right-hand-side of the conditioning bar even though it is not really necessary
    • for all sorts of reasons.

Analogous to our sliced-boxplots (house price v. size) example

  • \(\sigma^2 < \mathbb{V}\mathrm{ar}\{Y\}\) if \(x\) and \(Y\) are related.
  • The bigger \(R^2 = 1 - \sigma^2/\mathbb{V}\mathrm{ar}\{Y\}\) the more \(x\) matters.
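
The in-sample analogue of this identity is \(R^2 = 1 - \mathrm{SSE}/\mathrm{SST}\); a quick check with the house data used in these notes:

```r
# R^2 = 1 - SSE/SST, computed two ways on the house price data.
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
reg <- lm(price ~ size)
c(summary(reg)$r.squared,
  1 - sum(reg$resid^2) / sum((price - mean(price))^2))   # both 0.8267
```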


How do we estimate the unknown parameters, \(\beta_0\), \(\beta_1\) and \(\sigma^2\)?

  • Well, we have one way already – well, three in fact! – but it wasn't very statistical.
  • Now that we have a model we have more options.

How would we estimate the parameters by maximum likelihood?

  • Start by writing down the likelihood.

\[ L(\beta_0, \beta_1, \sigma^2) = \cdots \]

Maximum likelihood coefficients

The log likelihood is

\[ \ell(\beta_0, \beta_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]

We can see already that maximizing over \(\beta_0\) and \(\beta_1\) amounts to minimizing \[ \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2, \] since the \(-\frac{1}{2\sigma^2}\) constant out front does not affect where the maximum occurs.

  • In other words, we have the least squares problem all over again.
  • The MLE is \(\hat{\beta}_0 = b_0\) and \(\hat{\beta}_1 = b_1\).
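
As a sanity check (on toy data, not the house example), numerically maximizing the likelihood recovers the LS coefficients:

```r
# Numerical MLE vs. least squares; par[3] is log(sigma) so sigma stays positive.
set.seed(1)
x <- runif(30); y <- 1 + 2 * x + rnorm(30, 0, 0.3)   # toy data
nll <- function(par) {   # negative log likelihood of the SLR model
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}
fit <- optim(c(0, 0, 0), nll)
rbind(mle = fit$par[1:2], ls = as.numeric(coef(lm(y ~ x))))   # agree to optim's tolerance
```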

Maximum likelihood variance

All that remains is \(\sigma^2\), however an analogy works there too.

Let \(\hat{\mu}_i \equiv \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\), and write

\[ \ell(\hat{\beta}_0, \hat{\beta}_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \hat{\mu}_i)^2. \]

Now we can see what will happen when we differentiate w.r.t. \(\sigma^2\).

  • The same thing as last time.
  • From that we deduce the following estimate, where \(\hat{\varepsilon}_i \equiv e_i\).

\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{\varepsilon}_i^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

Bias and DoF

Just like before this is a biased estimate

  • and the bias correction involves adjusting for degrees of freedom.
  • Two DoFs are lost in this case because two parameters (\(\beta_0\) and \(\beta_1\)) were estimated from the data before that same data was again used to estimate the variance.

For that reason we usually take \[ \hat{\sigma}^2 = s^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2. \]

  • The proof that this is unbiased is similar to the one in the likelihood supplement.
  • The intuition here is that lines are defined by two points, so you need at least three points for variance to be properly defined.

Residual standard error

size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
summary(reg <- lm(price ~ size))
## Call:
## lm(formula = price ~ size)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.425  -8.618   0.575  10.766  18.498 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   38.885      9.094   4.276 0.000903 ***
## size          35.386      4.494   7.874 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 14.14 on 13 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.8133 
## F-statistic:    62 on 1 and 13 DF,  p-value: 2.66e-06

Residual standard error (2)

R calls \(s = \sqrt{s^2} = \sqrt{\hat{\sigma}^2}\) the residual standard error.

  • Remember, standard error just means estimated standard deviation,
  • so residual standard error is the estimated standard deviation of the residuals.
s <- summary(reg)$sigma
n <- length(size)
c(s, sqrt(sum(reg$resid^2)/(n-2)))
## [1] 14.1384 14.1384


We have estimates now, and they are not the same as the true parameters:

  • \(\beta_0\) is not \(b_0\), \(\beta_1\) is not \(b_1\) and \(\varepsilon_i\) is not \(e_i\).

Estimation uncertainty

How much do our estimates depend on the particular random sample we happen to observe?

Before deriving this sampling distribution, let's do some Monte Carlo experiments.

  • Randomly draw different samples of \(Y_i\)-values at some fixed \(x_i\)'s,
  • and for each sample compute the estimates \(b_0\), \(b_1\) and \(s\).

If the estimates don't vary much from sample to sample, then it doesn't matter which sample you happen to observe.

  • If the estimates do vary a lot, then it matters!

Code for our first experiment is given in lmmc.R.
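
lmmc.R is not reproduced here, but a minimal sketch of the same idea, with an assumed true line \(y = 2 + 0.5x\) and \(\sigma = 1\), might look like:

```r
# Repeatedly sample Y at fixed x's and refit; summarize the estimates.
set.seed(99)
x <- seq(1, 10, length = 10)
mc <- replicate(1000, {
  y <- 2 + 0.5 * x + rnorm(length(x))   # fresh sample from the assumed truth
  fit <- lm(y ~ x)
  c(coef(fit), s = summary(fit)$sigma)
})
rowMeans(mc)      # centered near the true (2, 0.5, 1)
apply(mc, 1, sd)  # sample-to-sample variability of b0, b1, s
```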

MC sampling distribution of lines

The least squares (LS) lines are much closer to the true line when \(n=50\);

  • when \(n=5\), some lines are close, others aren't – we need to get lucky!

Also notice that the LS lines are, more often than not,

  • closer to the true line near the middle of the cloud of data.
  • They get farther apart away from the middle of the cloud.
  • This is as true for small samples \((n=5)\) as it is for large \((n=50)\).

A Monte Carlo demonstration in linear.R.

MC sampling distribution recap

What did we just do?

  • We "imagined", through simulation, the sampling distribution of a LS line.

What happens in real life?

  • We just get one data set, and we don't know the true generating model.
  • But we can still "imagine", and we can do it without simulation!
  • Keep that hyperbolic shape in mind, it'll be back.

Sampling distn of the estimators

We have several options here, e.g.,

  • study \(b_0\), \(b_1\) and \(s^2\) directly, or
  • use asymptotic normality of MLE.

For now we are going to take the first route, and we're going to be a little quick-and-dirty about it.

  • We will revisit these results with greater theoretical rigor when we get to multiple linear regression, which will require a more general treatment.

The program involves finding standard errors for the estimators:

  • Find the standard errors of the LS coefficients, \(b_0\) and \(b_1\).
  • Find the standard error of \(s^2\).
  • Combine these into a standard error for \(Y(x)\) at a new \(x\).

Sampling distribution for the slope

On your homework you are asked to show that

\[ \mathbb{V}\mathrm{ar}\{b_1\} = \frac{\sigma^2}{(n-1) s_x^2} \quad \mbox{where} \quad s_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2. \]

Slope uncertainty depends on three factors

  1. sample size \((n)\)
  2. error variance \((\sigma^2 \equiv \sigma_\varepsilon^2)\)
  3. the spread of the inputs \((s_x^2)\).

It turns out (suspend disbelief) that …

\[ b_1 \sim \mathcal{N}(\beta_1, \sigma_{b_1}^2) \quad \mbox{ where } \quad \sigma_{b_1}^2 = \frac{\sigma^2}{(n-1) s_x^2} \]
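
A quick simulation check of the variance formula (the true values below are made up):

```r
# Monte Carlo variance of b1 vs. the formula sigma^2 / ((n-1) s_x^2).
set.seed(7)
n <- 25; x <- rnorm(n); sigma <- 2
b1 <- replicate(5000, coef(lm(rnorm(n, 1 + 3 * x, sigma) ~ x))[2])
c(mc = var(b1), formula = sigma^2 / ((n - 1) * var(x)))   # close agreement
```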

Sampling distribution for the intercept

This one is rather easier. \[ \mathbb{V}\mathrm{ar}\{b_0\} = \mathbb{V}\mathrm{ar}\{\bar{Y} - \bar{x} b_1\} = \mathbb{V}\mathrm{ar}(\bar{Y}) + \bar{x}^2\mathbb{V}\mathrm{ar}\{b_1\} - 2\bar{x} \mathbb{C}\mathrm{ov}(\bar{Y},b_1) \]

  • since \(\bar{Y}\) and \(b_1\) are uncorrelated: the slope wouldn't change if you shifted the data up and down.
  • Uncertainty depends on the same three factors.

So \[ b_0 \sim \mathcal{N}(\beta_0, \sigma_{b_0}^2) \quad \mbox{ where } \quad \sigma_{b_0}^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{(n-1) s_x^2} \right) \]
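
The same kind of simulation check works for the intercept (again with made-up true values):

```r
# Monte Carlo variance of b0 vs. sigma^2 (1/n + xbar^2 / ((n-1) s_x^2)).
set.seed(8)
n <- 25; x <- rnorm(n, mean = 3); sigma <- 2
b0 <- replicate(5000, coef(lm(rnorm(n, 1 + 3 * x, sigma) ~ x))[1])
c(mc = var(b0), formula = sigma^2 * (1/n + mean(x)^2 / ((n - 1) * var(x))))
```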

Standard errors of the coefficients

These formulas aren't especially practical since they involve unknown \(\sigma^2\).

  • Solution: use \(s^2\) instead: \[ s_{b_1} = \sqrt{\frac{s^2}{(n-1)s_x^2}} \quad\quad s_{b_0} = \sqrt{s^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{(n-1)s^2_x}\right)} \]

  • These are the standard errors for the regression coefficients.

That changes the sampling distribution, and we'll need to use the fact that \[ s^2 \sim \frac{\sigma^2}{n-2} \chi^2_{n-2}. \]

  • This is the natural analogue of our earlier result for estimating a mean, where the degrees of freedom were \(n-1\).
  • Proving it is a little more complicated and involves Cochran's Theorem.
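
Short of a proof, a quick simulation (with made-up true values) is consistent with both the unbiasedness of \(s^2\) and the \(\chi^2_{n-2}\) scaling:

```r
# (n-2) s^2 / sigma^2 should behave like a chi^2_{n-2} draw.
set.seed(9)
n <- 10; x <- runif(n); sigma <- 1.5
s2 <- replicate(5000, summary(lm(rnorm(n, 2 + x, sigma) ~ x))$sigma^2)
c(mean(s2), sigma^2)                          # unbiased: both near 2.25
c(var((n - 2) * s2 / sigma^2), 2 * (n - 2))   # chi^2_{n-2} variance: 2(n-2) = 16
```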

Sampling distribution

Now we are ready to summarize our results for the sampling distribution of the coefficients.

\[ b_0 \sim t_{n-2}(\beta_0, s_{b_0}^2) \quad \mbox{ and } \quad b_1 \sim t_{n-2}(\beta_1, s_{b_1}^2) \]

These distributions form the basis for

  1. confidence intervals for \(\beta_0\) and \(\beta_1\),
  2. hypothesis tests about these coefficients, and their \(p\)-values,
  3. and the derivation of the predictive distribution for \(Y(x)\).

Let's look at the first two and return to the third in a moment.

All of the relevant info is in summary

##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 38.88468   9.093904 4.275906 9.027119e-04
## size        35.38596   4.494083 7.873901 2.659867e-06
sb0 <- sqrt(s^2*((1/n)+mean(size)^2/((n-1)*var(size))))
sb1 <- sqrt(s^2/((n-1)*var(size)))
c(sb0, sb1)
## [1] 9.093904 4.494083

Confidence intervals

b0 <- 38.885
b1 <- 35.386
t025 <- qt(.975, df=n-2) 
rbind(c(b0 - sb0*t025, b0 + sb0*t025), b1 + c(-1,1)*sb1*t025)
##          [,1]     [,2]
## [1,] 19.23882 58.53118
## [2,] 25.67712 45.09488
confint(reg, level=0.95)
##                2.5 %   97.5 %
## (Intercept) 19.23850 58.53087
## size        25.67709 45.09484


The summary also provides information about \(t\)-tests under the hypotheses

\[ \begin{aligned} \mathcal{H}_0 &: \beta_j = 0 \\ \mathcal{H}_1 &: \beta_j \ne 0 & \mbox{for any $j$; for now $j \in \{0,1\}$.} \end{aligned} \]

Compare to the summary output above:

c(b0/sb0, 2*pt(-abs(b0/sb0), df=n-2))
## [1] 4.2759413858 0.0009026534
c(b1/sb1, 2*pt(-abs(b1/sb1), df=n-2))
## [1] 7.873909e+00 2.659836e-06

CAPM hypothesis tests

Let's do another example:

  • revisit the CAPM regression for the Windsor fund.

Does Windsor have a non-zero intercept; i.e., does it make/lose money independent of the market?

\[ \begin{aligned} \mathcal{H}_0 &: \beta_0 = 0 && \mbox{and there is no free money} \\ \mathcal{H}_1 &: \beta_0 \ne 0 && \mbox{and Windsor makes (or loses) money regardless of the market} \end{aligned} \]

  • Recall that the estimate of the intercept, \(b_0\), is the "alpha" of the fund.

The summary of the regression in R:

mfund <- read.csv("mfunds.csv")
summary(capm <- lm(mfund$windsor ~ mfund$valmrkt))
## Call:
## lm(formula = mfund$windsor ~ mfund$valmrkt)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.069920 -0.010772 -0.000192  0.009178  0.062547 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.003647   0.001409   2.588   0.0105 *  
## mfund$valmrkt 0.935717   0.029150  32.100   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.01872 on 178 degrees of freedom
## Multiple R-squared:  0.8527, Adjusted R-squared:  0.8519 
## F-statistic:  1030 on 1 and 178 DF,  p-value: < 2.2e-16

Windsor has alpha

By hand:

(zb0 <- 0.003647/.001409)
## [1] 2.588361
2*pt(-abs(zb0), df=178)
## [1] 0.0104401

It turns out that we reject the null at \(\alpha = 0.05\) with a \(p\)-value of \(0.0104\).

  • So Windsor does have an "alpha" over the market.

What about beta?

Looking now at the slope, this is a rare case where the null hypothesis is not zero:

\[ \begin{aligned} \mathcal{H}_0 &: \beta_1 = 1 && \mbox{Windsor is just the market (+ alpha)} \\ \mathcal{H}_1 &: \beta_1 \ne 1 && \mbox{Windsor softens or exaggerates the market moves} \end{aligned} \]

asking whether Windsor moves differently than the market (e.g., more conservatively).

  • Recall that the estimate of the slope \(b_1\) is the "beta" of the stock.

This time R's \(t\)/\(p\)-values are not what we want.

zb1 <- (0.935717 - 1)/0.029150
2*pt(-abs(zb1), df=178)
## [1] 0.02871786
  • Reject the null at 5%; Windsor is more conservative than the market.


The conditional forecasting problem:

  • Given a new input \(x_f\) and sample data \(\{x_i, y_i\}_{i=1}^n\) predict the "future" observation \(Y_f = Y(x_f)\).

The solution is to use our LS fitted value: \(\hat{Y}_f = b_0 + b_1 x_f\).

  • That's the easy bit.

The hard (and very important!) part of forecasting is assessing uncertainty about predictions.

  • Both \(Y_f\) and \(\hat{Y}_f\) are capitalized for a reason: they are random, and they are not the same thing.

Predictive error

When predicting with \(\hat{Y}_f\) the prediction error is \[ e_f = Y_f - \hat{Y}_f = Y_f - b_0 - b_1 x_f \]

  • we know \(x_f\), we don't know \(Y_f\) and everything else is random.

Decomposing predictive error

We can decompose \(e_f\) into two sources of error:

  • inherent idiosyncratic randomness (due to \(\varepsilon\))
  • estimation error in the intercept and slope estimates
    • i.e., discrepancy between our fitted line and the truth.

\[ \begin{aligned} e_f &= Y_f - \hat{Y}_f = (Y_f - \mathbb{E}\{Y_f \mid x_f\}) + (\mathbb{E}\{Y_f \mid x_f\} - \hat{Y}_f) \\ &= \varepsilon_f + (\mathbb{E}\{Y_f \mid x_f\} - \hat{Y}_f) = \mathrm{idiosyncratic} + \mathrm{fit} \\ &= \varepsilon_f + (\beta_0 - b_0) + (\beta_1 - b_1)x_f. \end{aligned} \]

The variance is \[ \mathbb{V}\mathrm{ar}\{e_f\} = \mathbb{V}\mathrm{ar}\{\varepsilon_f\} + \mathbb{V}\mathrm{ar}\{\mathbb{E}\{Y_f \mid x_f\} - \hat{Y}_f\} = \sigma^2 + \mathbb{V}\mathrm{ar}\{\hat{Y}_f\}. \]

  • So what about the fit error \(\mathbb{V}\mathrm{ar}\{\hat{Y}_f\}\)?

Variance of fit error

By variances of linear combinations of RVs:

\[ \mathbb{V}\mathrm{ar}\{b_0 + b_1 x_f\} = \mathbb{V}\mathrm{ar}\{b_0\} + x_f^2 \mathbb{V}\mathrm{ar}\{b_1\} + 2x_f\mathbb{C}\mathrm{ov}(b_0,b_1) \]

  • Those first two terms are easy; what about the last one?

Intuitively, the intercept and slope are negatively correlated

  • If the slope estimate is too high, the intercept will be too low.

It turns out that \[ \mathbb{C}\mathrm{ov}(b_0, b_1) = - \sigma^2 \left(\frac{\bar{x}}{(n-1) s_x^2} \right). \]
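
This covariance formula can also be checked by simulation (made-up true values again):

```r
# Monte Carlo covariance of (b0, b1) vs. -sigma^2 xbar / ((n-1) s_x^2).
set.seed(10)
n <- 20; x <- rnorm(n, mean = 2); sigma <- 1
bs <- replicate(5000, coef(lm(rnorm(n, 1 + 2 * x, sigma) ~ x)))
c(mc = cov(bs[1, ], bs[2, ]), formula = -sigma^2 * mean(x) / ((n - 1) * var(x)))
```

Note the negative sign: with \(\bar{x} > 0\), samples that tilt the slope up pull the intercept down.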

Hyperbolic shape

Combining terms and a little simplification yields

\[ \begin{aligned} \mathbb{V}\mathrm{ar}\{\hat{Y}_f\} = \mathbb{V}\mathrm{ar}\{b_0 + b_1 x_f\} &= \mathbb{V}\mathrm{ar}\{b_0\} + x_f^2 \mathbb{V}\mathrm{ar}\{b_1\} + 2x_f\mathbb{C}\mathrm{ov}(b_0,b_1) \\ &= \sigma^2 \left[\frac{1}{n} + \frac{(x_f - \bar{x})^2}{(n-1) s_x^2} \right]. \end{aligned} \]

Going back to the expression for the \(\mathbb{V}\mathrm{ar}\{ e_f \}\) we obtain

\[ \begin{aligned} \mathbb{V}\mathrm{ar}\{e_f\} &= \sigma^2 + \mathbb{V}\mathrm{ar}\{\hat{Y}_f\}\\ &= \sigma^2 \left[1 + \frac{1}{n} + \frac{(x_f - \bar{x})^2}{(n-1) s_x^2} \right]. \end{aligned} \]

Predictive standard error

Finally, plugging in our estimate \(s^2\) for \(\sigma^2\) gives what we call \(s_{\mathrm{pred}}^2\): \[ s_{\mathrm{pred}}^2 = s^2 \left[1 + \frac{1}{n} + \frac{(x_f - \bar{x})^2}{(n-1) s_x^2} \right]. \]

  • The square root is \(s_{\mathrm{pred}}\), the predictive standard error: \[ s_{\mathrm{pred}} = \sqrt{s^2 + s_{\mathrm{fit}}^2}, \quad \mbox{where} \quad s_{\mathrm{fit}}^2 = s^2 \left[ \frac{1}{n} + \frac{(x_f - \bar{x})^2}{(n-1) s_x^2} \right] \mbox{ estimates } \mathbb{V}\mathrm{ar}\{\hat{Y}_f\}. \]

A large predictive error variance (high uncertainty) comes from

  • Large \(s\) (i.e., large \(\varepsilon\)'s)
  • Small \(n\) (not enough data).
  • Small \(s_x\) (not enough observed spread in covariates).
  • Large \(x_f - \bar{x}\) (predicting far from the middle of the data).

Hyperbolic shape

For \(x_f\) far from \(\bar{x}\) the predictive variance grows quadratically in \(x_f - \bar{x}\), and taking square roots gives our hyperbola-shaped error bars.

Predicting in R

Back to our house price example:

xf <- data.frame(size=c(1.48,2,3))
predict(reg, newdata=xf, interval="prediction")
##         fit       lwr      upr
## 1  91.25591  59.50249 123.0093
## 2 109.65661  78.07862 141.2346
## 3 145.04257 111.58989 178.4952
  • interval="prediction" gives lwr and upr, otherwise we just get fit
  • \(s_{\mathrm{pred}}\) is not shown in this output

\(s_{\mathrm{fit}}\) and \(s_\mathrm{pred}\) in R

p <- predict(reg, newdata=xf, se.fit=TRUE)
sfit <- p$se.fit
rbind(sfit, hand=s*sqrt(1/n + (xf$size-mean(size))^2/((n-1)*var(size))))
##            1        2        3
## sfit 4.01762 3.709547 6.315213
## hand 4.01762 3.709547 6.315213
spred <- sqrt(s^2+sfit^2)
b <- reg$coef
b[1] + b[2]*xf[1,] + c(0,-1,1)*qt(.975, df=n-2)*spred[1]
## [1]  91.25591  59.50249 123.00932