## I thought that's what we were doing

So far we have been fitting lines to points.

• We haven't really been doing statistics because we haven't really introduced any probabilities.
• We don't really have a model.
• And we'll need a model if we are going to perform inference:
• to get a sense of estimation uncertainty,
• to test for the usefulness of explanatory variables ($$x$$),
• and to put "error bars" on our predictions from the line(s).

## Prediction and the modeling goal

A prediction rule is any function where you input an $$x$$ and it outputs $$\hat{y} \equiv \hat{y}(x)$$ as a predicted response at $$x$$.

The least squares line generates a prediction rule: $\hat{y} = b_0 + b_1 x.$

• The rule tells us what to do when a new $$x$$ comes along;
• run it through the formula and obtain a guess $$\hat{y}(x)$$.

$$\hat{y}$$ is just what we expect for a given $$x$$;

• it is not going to be a perfect prediction.
• If we actually sample a $$y$$-value at $$x$$ we will not get exactly $$\hat{y}$$.

## Forecast accuracy

We need to devise a notion of forecast accuracy.

• How sure are we about our forecast?
• Or how different could $$Y(x)$$ be from what we expect?

Forecasts are useless without some kind of uncertainty qualification/quantification.

One method is to specify a range of $$Y(x)$$ values that are likely at a given $$x$$ value,

• a so-called prediction interval, quickly summarizing the distribution of $$Y(x)$$.

## Full accounting of uncertainty

To construct a "proper" prediction interval we will have to:

1. obtain estimates of the coefficients, $$b_0$$ and $$b_1$$, generating the "fitted" line
2. assess the likely range of residual values corresponding to a $$Y(x)$$ that has not yet been observed, centered at the fitted line $$\hat{Y}(x) = b_0 + b_1 x$$.
3. acknowledge that the "fitted line" may be fooled by particular realizations of the residuals
• i.e., that our estimated coefficients $$b_0$$ and $$b_1$$ are random.

We know how to do Step 1 already.

• Steps 2 and 3 require investing in a probability model.

## Simple linear regression (SLR) model

Here it is: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2), \quad \mbox{ for } \quad i=1,\dots, n.$

Similar, but with important differences.

• It is a model, so we are assuming this relationship holds for some true but unknown values of $$\beta_0, \beta_1$$.
• Greek letters remind us that they are not the same as the LS estimates $$b_0$$ or $$b_1$$
• or any other estimate for that matter.
• The errors $$\varepsilon_i$$ are independent, additive, "idiosyncratic noise".
• Their distribution is known up to its spread $$\sigma^2$$.
• Greek letters remind us that $$\varepsilon_i$$ is not $$e_i$$.

## Data generating mechanism

Before looking at any data, the model specifies,

• how $$Y$$ varies with $$x$$ on average,
• and the influence of factors other than $$x$$, namely $$\varepsilon$$ which is independent of $$x$$.

We think about the data as being one possible realization of data that could have been generated from the model $Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2).$

• $$\sigma^2$$ controls the dispersion of $$Y_i$$ around $$\beta_0 + \beta_1 x_i$$.
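
To make the mechanism concrete, here is a minimal sketch that simulates one such realization; the values $$\beta_0 = 40$$, $$\beta_1 = 35$$, and $$\sigma = 15$$ are made up for illustration.

# simulate one possible realization from the SLR model
# (beta0=40, beta1=35, sigma=15 are made-up illustrative values)
nn <- 15
xx <- runif(nn, 0.8, 3.5)                     # inputs treated as fixed and known
yy <- 40 + 35*xx + rnorm(nn, mean=0, sd=15)   # Y_i | x_i ~ N(beta0 + beta1*x_i, sigma^2)
plot(xx, yy)
abline(a=40, b=35)                            # the true line, unknown in practice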

## Conditional variance

In this setup $Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2).$

The conditional variance is $\mathbb{V}\mathrm{ar}\{Y \mid x\} = \mathbb{V}\mathrm{ar}\{ \beta_0 + \beta_1 x + \varepsilon \mid x\} = \mathbb{V}\mathrm{ar}\{\varepsilon\} = \sigma^2.$

• We customarily write $$x$$ on the right-hand-side of the conditioning bar even though it is not really necessary
• for all sorts of reasons.

Analogous to our sliced-boxplots (house price v. size) example

• $$\sigma^2 < \mathbb{V}\mathrm{ar}\{Y\}$$ if $$x$$ and $$Y$$ are related.
• The bigger $$R^2 = 1 - \sigma^2/\mathbb{V}\mathrm{ar}\{Y\}$$ the more $$x$$ matters.
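
A quick simulated illustration of that comparison, with made-up true values:

# marginal spread of Y versus the conditional spread sigma^2 (illustrative values)
set.seed(7)
sigma <- 1
xx <- runif(10000, 0, 3)
yy <- 1 + 2*xx + rnorm(10000, sd=sigma)
c(sigma^2, var(yy))    # sigma^2 is much smaller than Var{Y} here
1 - sigma^2/var(yy)    # the R^2-style summary from above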

## Estimation

How do we estimate the unknown parameters, $$\beta_0$$, $$\beta_1$$ and $$\sigma^2$$?

• Well we have one way already – well, three in fact! – but it wasn't very statistical.
• Now that we have a model we have more options.

How would we estimate the parameters by maximum likelihood?

• Start by writing down the likelihood.

$L(\beta_0, \beta_1, \sigma^2) = \cdots$

## Maximum likelihood coefficients

The log likelihood is

$\ell(\beta_0, \beta_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$

We can see already that the derivatives with respect to $$\beta_0$$ and $$\beta_1$$ involve only the last term, $-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2,$ and when we set those derivatives to zero the constant $$-1/(2\sigma^2)$$ out front cancels.

• In other words, we have the least squares problem all over again.
• The MLE is $$\hat{\beta}_0 = b_0$$ and $$\hat{\beta}_1 = b_1$$.
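
As a sanity check (on made-up data), numerically maximizing the likelihood over $$\beta_0$$ and $$\beta_1$$ recovers the lm() coefficients, since for fixed $$\sigma^2$$ it is the same as minimizing the sum of squared errors.

# minimize the SSE numerically and compare to least squares (made-up data)
set.seed(5)
xx <- runif(30)
yy <- 1 + 2*xx + rnorm(30)
sse <- function(b) sum((yy - b[1] - b[2]*xx)^2)
optim(c(0, 0), sse)$par   # approximately (b0, b1)
coef(lm(yy ~ xx))         # the LS solution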

## Maximum likelihood variance

All that remains is $$\sigma^2$$; an analogy works there too.

Let $$\hat{\mu}_i \equiv \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$, and write

$\ell(\hat{\beta}_0, \hat{\beta}_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \hat{\mu}_i)^2.$

Now we can see what will happen when we differentiate w.r.t. $$\sigma^2$$.

• The same thing as last time.
• From that we deduce the following estimate, where $$\hat{\varepsilon}_i \equiv e_i$$.

$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{\varepsilon}_i^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

## Bias and DoF

Just like before this is a biased estimate

• and the bias correction involves adjusting for degrees of freedom.
• Two DoFs are lost in this case because two parameters ($$\beta_0$$ and $$\beta_1$$) were estimated from the data before that same data was again used to estimate the variance.

For that reason we usually take $\hat{\sigma}^2 = s^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$

• The proof that this is unbiased is similar to the one in the likelihood supplement.
• The intuition here is that lines are defined by two points, so you need at least three points for variance to be properly defined.
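
A small Monte Carlo check of the bias, with made-up true values $$\beta_0 = 1$$, $$\beta_1 = 2$$, and $$\sigma^2 = 4$$:

# dividing by n underestimates sigma^2 on average; dividing by n-2 does not
set.seed(1)
nn <- 10
mle <- unb <- rep(NA, 10000)
for(m in 1:10000) {
  xx <- runif(nn)
  yy <- 1 + 2*xx + rnorm(nn, sd=2)
  ee <- resid(lm(yy ~ xx))
  mle[m] <- sum(ee^2)/nn
  unb[m] <- sum(ee^2)/(nn-2)
}
c(mean(mle), mean(unb))   # compare both averages to the true sigma^2 = 4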

## Residual standard error

size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
summary(reg <- lm(price ~ size))
##
## Call:
## lm(formula = price ~ size)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -30.425  -8.618   0.575  10.766  18.498
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   38.885      9.094   4.276 0.000903 ***
## size          35.386      4.494   7.874 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.14 on 13 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.8133
## F-statistic:    62 on 1 and 13 DF,  p-value: 2.66e-06

## Residual standard error (2)

R calls $$s = \sqrt{s^2} = \sqrt{\hat{\sigma}^2}$$ the residual standard error.

• Remember, standard error just means estimated standard deviation,
• so residual standard error is the estimated standard deviation of the residuals.
s <- summary(reg)$sigma
n <- length(size)
c(s, sqrt(sum(reg$resid^2)/(n-2)))
## [1] 14.1384 14.1384

## Caution

We have estimates now, and they are not the same as the true parameters:

• $$\beta_0$$ is not $$b_0$$, $$\beta_1$$ is not $$b_1$$ and $$\varepsilon_i$$ is not $$e_i$$.

## Estimation uncertainty

How much do our estimates depend on the particular random sample we happen to observe?

Before deriving this sampling distribution, let's do some Monte Carlo experiments.

• Randomly draw different sample $$Y_i$$-values at some $$x_i$$'s
• and for each sample compute the estimates $$b_0$$, $$b_1$$, and $$s$$.

If the estimates don't vary much from sample to sample, then it doesn't matter which sample you happen to observe.

• If the estimates do vary a lot, then it matters!

Code for our first experiment is given in lmmc.R.
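
lmmc.R is not reproduced here, but a minimal sketch of that kind of experiment (with made-up true values $$\beta_0 = 1$$, $$\beta_1 = 2$$, $$\sigma = 1$$) might look like the following.

# draw many fresh Y samples at the same x's and overlay the LS lines
set.seed(2)
nn <- 50                        # try nn = 5 versus nn = 50
xx <- runif(nn, 0, 3)
plot(NULL, xlim=c(0, 3), ylim=c(-2, 10), xlab="x", ylab="y")
for(m in 1:100) {
  yy <- 1 + 2*xx + rnorm(nn)    # one Monte Carlo sample from the model
  abline(lm(yy ~ xx), col=8)    # its LS line, in gray
}
abline(a=1, b=2, lwd=2)         # the true line on top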

## MC sampling distribution of lines

The least squares (LS) lines are much closer to the true line when $$n=50$$;

• when $$n=5$$, some lines are close, others aren't – we need to get lucky!

Also notice that the LS lines are, more often than not,

• closer to the true line near the middle of the cloud of data.
• They get farther apart away from the middle of the cloud.
• This is as true for small samples $$(n=5)$$ as it is for large $$(n=50)$$.

A Monte Carlo demonstration in linear.R.

## MC sampling distribution recap

What did we just do?

• We "imagined", through simulation, the sampling distribution of a LS line.

What happens in real life?

• We just get one data set, and we don't know the true generating model.
• But we can still "imagine", and we can do it without simulation!
• Keep that hyperbolic shape in mind, it'll be back.

## Sampling distn of the estimators

We have several options here, e.g.,

• study $$b_0$$, $$b_1$$ and $$s^2$$ directly, or
• use asymptotic normality of MLE.

For now we are going to take the first route, and we're going to be a little quick-and-dirty about it.

• We will revisit these results with greater theoretical rigor when we get to multiple linear regression, which will require a more general treatment.

The program involves finding standard errors for the estimators:

• Find the standard errors of the LS coefficients, $$b_0$$ and $$b_1$$.
• Find the standard error of $$s^2$$.
• Combine those into a standard error for $$Y(x)$$ at a new $$x$$.

## Sampling distribution for the slope

On your homework you are asked to show that

$\mathbb{V}\mathrm{ar}\{b_1\} = \frac{\sigma^2}{(n-1) s_x^2} \quad \mbox{where} \quad s_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2.$

Slope uncertainty depends on three factors

1. sample size $$(n)$$
2. error variance $$(\sigma^2 \equiv \sigma_\varepsilon^2)$$
3. the spread of the inputs $$(s_x^2)$$.

It turns out (suspend disbelief) that …

$b_1 \sim \mathcal{N}(\beta_1, \sigma_{b_1}^2) \quad \mbox{ where } \quad \sigma_{b_1}^2 = \frac{\sigma^2}{(n-1) s_x^2}$
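
A quick Monte Carlo check of the variance formula, on a made-up design:

# empirical variance of b1 across samples versus sigma^2/((n-1)*s_x^2)
set.seed(3)
nn <- 20; sigma <- 2
xx <- runif(nn)
slopes <- replicate(5000, {
  yy <- 1 + 2*xx + rnorm(nn, sd=sigma)
  coef(lm(yy ~ xx))[2]
})
c(var(slopes), sigma^2/((nn-1)*var(xx)))   # these should be close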

## Sampling distribution for the intercept

This one is rather easier. $\mathbb{V}\mathrm{ar}\{b_0\} = \mathbb{V}\mathrm{ar}\{\bar{Y} - \bar{x} b_1\} = \mathbb{V}\mathrm{ar}(\bar{Y}) + \bar{x}^2\mathbb{V}\mathrm{ar}\{b_1\} - 2\bar{x} \mathbb{C}\mathrm{ov}(\bar{Y},b_1)$

• since $$\bar{Y}$$ and $$b_1$$ are uncorrelated: the slope wouldn't change if you shifted the data up and down.
• Plugging in $$\mathbb{V}\mathrm{ar}\{\bar{Y}\} = \sigma^2/n$$ and $$\mathbb{C}\mathrm{ov}(\bar{Y}, b_1) = 0$$ gives the result below; uncertainty depends on the same three factors.

So $b_0 \sim \mathcal{N}(\beta_0, \sigma_{b_0}^2) \quad \mbox{ where } \quad \sigma_{b_0}^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{(n-1) s_x^2} \right)$

## Standard errors of the coefficients

These formulas aren't especially practical since they involve unknown $$\sigma^2$$.

• Solution: use $$s^2$$ instead: $s_{b_1} = \sqrt{\frac{s^2}{(n-1)s_x^2}} \quad\quad s_{b_0} = \sqrt{s^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{(n-1)s^2_x}\right)}$

• These are the standard errors for the regression coefficients.

That changes the sampling distribution, and we'll need to use the fact that $s^2 \sim \frac{\sigma^2}{n-2} \chi^2_{n-2}.$

• As an analogue of our previous result for estimating a mean, where $$s^2 \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}$$, this should look plausible.
• Proving it is a little more complicated and involves Cochran's Theorem.
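
If you would rather be convinced by simulation than by Cochran's Theorem, a sketch (again with a made-up design) is below.

# compare simulated s^2 values to the sigma^2/(n-2) chi^2_{n-2} law
set.seed(4)
nn <- 20; sigma <- 2
xx <- runif(nn)
s2mc <- replicate(5000, {
  yy <- 1 + 2*xx + rnorm(nn, sd=sigma)
  sum(resid(lm(yy ~ xx))^2)/(nn-2)
})
qqplot(sigma^2/(nn-2)*qchisq(ppoints(5000), df=nn-2), s2mc,
  xlab="theoretical quantiles", ylab="simulated s^2")
abline(0, 1)   # points should hug this line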

## Sampling distribution

Now we are ready to summarize our results for the sampling distribution of the coefficients.

$b_0 \sim t_{n-2}(\beta_0, s_{b_0}^2) \quad \mbox{ and } \quad b_1 \sim t_{n-2}(\beta_1, s_{b_1}^2)$

These distributions form the basis for

1. confidence intervals for $$\beta_0$$ and $$\beta_1$$
2. hypothesis tests about these coefficients, and their $$p$$-values,
3. and the derivation of the predictive distribution for $$Y(x)$$.

Let's look at the first two and return to the third in a moment.

summary(reg)$coefficients
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 38.88468   9.093904 4.275906 9.027119e-04
## size        35.38596   4.494083 7.873901 2.659867e-06
sb0 <- sqrt(s^2*((1/n)+mean(size)^2/((n-1)*var(size))))
sb1 <- sqrt(s^2/((n-1)*var(size)))
c(sb0, sb1)
## [1] 9.093904 4.494083

## Confidence intervals

b0 <- 38.885
b1 <- 35.386
t025 <- qt(.975, df=n-2)
rbind(c(b0 - sb0*t025, b0 + sb0*t025), b1 + c(-1,1)*sb1*t025)
##          [,1]     [,2]
## [1,] 19.23882 58.53118
## [2,] 25.67712 45.09488
confint(reg, level=0.95)
##                2.5 %   97.5 %
## (Intercept) 19.23850 58.53087
## size        25.67709 45.09484

## $$t$$-tests

The summary also provides information about $$t$$-tests under the hypotheses

\begin{aligned} \mathcal{H}_0 &: \beta_j = 0 \\ \mathcal{H}_1 &: \beta_j \ne 0 && \mbox{for any } j \mbox{; for now } j \in \{0,1\}. \end{aligned}

Compare to the summary output above:

c(b0/sb0, 2*pt(-abs(b0/sb0), df=n-2))
## [1] 4.2759413858 0.0009026534
c(b1/sb1, 2*pt(-abs(b1/sb1), df=n-2))
## [1] 7.873909e+00 2.659836e-06

## CAPM hypothesis tests

Let's do another example:

• revisit the CAPM regression for the Windsor fund.

Does Windsor have a non-zero intercept; i.e., does it make/lose money independent of the market?

\begin{aligned} \mathcal{H}_0 &: \beta_0 = 0 && \mbox{and there is no free money} \\ \mathcal{H}_1 &: \beta_0 \ne 0 && \mbox{and Windsor makes (or loses) money regardless of the market} \end{aligned}

• Recall that the estimate of the intercept, $$b_0$$, is the "alpha" of the fund.

The summary of the regression in R:

mfund <- read.csv("mfunds.csv")
capm <- lm(mfund$windsor ~ mfund$valmrkt)
summary(capm)
##
## Call:
## lm(formula = mfund$windsor ~ mfund$valmrkt)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.069920 -0.010772 -0.000192  0.009178  0.062547
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.003647   0.001409   2.588   0.0105 *
## mfund$valmrkt  0.935717   0.029150  32.100   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01872 on 178 degrees of freedom
## Multiple R-squared:  0.8527, Adjusted R-squared:  0.8519
## F-statistic:  1030 on 1 and 178 DF,  p-value: < 2.2e-16

## Windsor has alpha

By hand:

zb0 <- 0.003647/.001409
zb0
## [1] 2.588361
2*pt(-abs(zb0), df=178)
## [1] 0.0104401

It turns out that we reject the null at $$\alpha = 0.05$$, with a $$p$$-value of $$0.0104$$.

• So Windsor does have an "alpha" over the market.

## What about beta?

Looking now at the slope, this is a rare case where the null hypothesis is not zero:

\begin{aligned} \mathcal{H}_0 &: \beta_1 = 1 && \mbox{Windsor is just the market (+ alpha)} \\ \mathcal{H}_1 &: \beta_1 \ne 1 && \mbox{Windsor softens or exaggerates the market moves} \end{aligned}

asking whether Windsor moves differently than the market (e.g., more conservatively).

• Recall that the estimate of the slope $$b_1$$ is the "beta" of the stock.

This time R's $$t$$/$$p$$-values are not what we want.

zb1 <- (0.935717 - 1)/0.029150
2*pt(-abs(zb1), df=178)
## [1] 0.02871786
• Reject the null at 5%; Windsor is more conservative than the market.
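
An equivalent route with the same numbers is to check whether 1 falls inside the 95% confidence interval for the slope:

# 95% CI for beta1; the hypothesized value 1 falls outside it
0.935717 + c(-1, 1)*qt(.975, df=178)*0.029150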

## Forecasting

The conditional forecasting problem:

• Given a new input $$x_f$$ and sample data $$\{x_i, y_i\}_{i=1}^n$$ predict the "future" observation $$Y_f = Y(x_f)$$.

The solution is to use our LS fitted value: $$\hat{Y}_f = b_0 + b_1 x_f$$.

• That's the easy bit.

The hard (and very important!) part of forecasting is assessing uncertainty about predictions.

• both $$Y_f$$ and $$\hat{Y}_f$$ are capitalized for a reason, and they are not the same thing.

## Predictive error

When predicting with $$\hat{Y}_f$$ the prediction error is $e_f = Y_f - \hat{Y}_f = Y_f - b_0 - b_1 x_f$

• we know $$x_f$$, we don't know $$Y_f$$ and everything else is random.

## Decomposing predictive error

We can decompose $$e_f$$ into two sources of error:

• inherent idiosyncratic randomness (due to $$\varepsilon$$)
• estimation error in the intercept and slope estimates
• i.e., discrepancy between our fitted line and the truth.

\begin{aligned} e_f &= Y_f - \hat{Y}_f = (Y_f - \mathbb{E}\{Y_f \mid x_f\}) + (\mathbb{E}\{Y_f \mid x_f\} - \hat{Y}_f) \\ &= \varepsilon_f + (\mathbb{E}\{Y_f \mid x_f\} - \hat{Y}_f) = \mathrm{idiosyncratic} + \mathrm{fit} \\ &= \varepsilon_f + (\beta_0 - b_0) + (\beta_1 - b_1)x_f. \end{aligned}

The variance is $\mathbb{V}\mathrm{ar}\{e_f\} = \mathbb{V}\mathrm{ar}\{\varepsilon_f\} + \mathbb{V}\mathrm{ar}\{\mathbb{E}\{Y_f \mid x_f\} - \hat{Y}_f\} = \sigma^2 + \mathbb{V}\mathrm{ar}\{\hat{Y}_f\},$ since the new error $$\varepsilon_f$$ is independent of the data used to fit the line.

• So what about the fit error $$\mathbb{V}\mathrm{ar}\{\hat{Y}_f\}$$?

## Variance of fit error

By variances of linear combinations of RVs:

$\mathbb{V}\mathrm{ar}\{b_0 + b_1 x_f\} = \mathbb{V}\mathrm{ar}\{b_0\} + x_f^2 \mathbb{V}\mathrm{ar}\{b_1\} + 2x_f\mathbb{C}\mathrm{ov}(b_0,b_1)$

• Those first two terms are easy; what about the last one?

Intuitively (when $$\bar{x} > 0$$), the intercept and slope estimates are negatively correlated

• If the slope estimate is too high, the intercept will be too low.

It turns out that $\mathbb{C}\mathrm{ov}(b_0, b_1) = - \sigma^2 \left(\frac{\bar{x}}{(n-1) s_x^2} \right).$
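
A Monte Carlo spot-check of that covariance, once more on a made-up design:

# empirical covariance of (b0, b1) across samples versus the formula
set.seed(6)
nn <- 20; sigma <- 2
xx <- runif(nn, 1, 3)
B <- replicate(5000, {
  yy <- 1 + 2*xx + rnorm(nn, sd=sigma)
  coef(lm(yy ~ xx))
})
c(cov(B[1,], B[2,]), -sigma^2*mean(xx)/((nn-1)*var(xx)))   # should be close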

## Hyperbolic shape

Combining terms and a little simplification yields

\begin{aligned} \mathbb{V}\mathrm{ar}\{\hat{Y}_f\} = \mathbb{V}\mathrm{ar}\{b_0 + b_1 x_f\} &= \mathbb{V}\mathrm{ar}\{b_0\} + x_f^2 \mathbb{V}\mathrm{ar}\{b_1\} + 2x_f\mathbb{C}\mathrm{ov}(b_0,b_1) \\ &= \sigma^2 \left[\frac{1}{n} + \frac{(x_f - \bar{x})^2}{(n-1) s_x^2} \right]. \end{aligned}

Going back to the expression for the $$\mathbb{V}\mathrm{ar}\{ e_f \}$$ we obtain

\begin{aligned} \mathbb{V}\mathrm{ar}\{e_f\} &= \sigma^2 + \mathbb{V}\mathrm{ar}\{\hat{Y}_f\}\\ &= \sigma^2 \left[1 + \frac{1}{n} + \frac{(x_f - \bar{x})^2}{(n-1) s_x^2} \right]. \end{aligned}

## Predictive standard error

Finally, plugging in our estimate $$s^2$$ for $$\sigma^2$$ gives what we call $$s_{\mathrm{pred}}^2$$: $s_{\mathrm{pred}}^2 = s^2 \left[1 + \frac{1}{n} + \frac{(x_f - \bar{x})^2}{(n-1) s_x^2} \right].$

• The square root is $$s_{\mathrm{pred}}$$, the predictive standard error: $s_{\mathrm{pred}} = \sqrt{s^2 + s_{\mathrm{fit}}^2}, \quad \mbox{where} \quad s_{\mathrm{fit}}^2 = s^2 \left[\frac{1}{n} + \frac{(x_f - \bar{x})^2}{(n-1) s_x^2} \right]$ is the estimated variance of the fitted value $$\hat{Y}_f$$.

A large predictive error variance (high uncertainty) comes from

• Large $$s$$ (i.e., large $$\varepsilon$$'s)
• Small $$n$$ (not enough data).
• Small $$s_x$$ (not enough observed spread in covariates).
• Large $$x_f - \bar{x}$$ (predicting far from the middle of the data).

## Hyperbolic shape

For $$x_f$$ far from $$\bar{x}$$ the variance grows quadratically in $$(x_f - \bar{x})$$, and we get our hyperbola-shaped error bars.
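
You can draw those bands with the house-price fit from earlier (this assumes reg, size, and price are still in the workspace):

# prediction bands over a grid of sizes; they flare away from mean(size)
grid <- data.frame(size=seq(0.5, 4, length=100))
pp <- predict(reg, newdata=grid, interval="prediction")
plot(size, price)
abline(reg)                            # the fitted line
lines(grid$size, pp[,"lwr"], lty=2)    # lower prediction band
lines(grid$size, pp[,"upr"], lty=2)    # upper prediction band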

## Predicting in R

Back to our house price example:

xf <- data.frame(size=c(1.48,2,3))
predict(reg, newdata=xf, interval="prediction")
##         fit       lwr      upr
## 1  91.25591  59.50249 123.0093
## 2 109.65661  78.07862 141.2346
## 3 145.04257 111.58989 178.4952
• interval="prediction" gives lwr and upr, otherwise we just get fit
• $$s_{\mathrm{pred}}$$ is not shown in this output

## $$s_{\mathrm{fit}}$$ and $$s_\mathrm{pred}$$ in R

p <- predict(reg, newdata=xf, se.fit=TRUE)
sfit <- p$se.fit
rbind(sfit, hand=s*sqrt(1/n + (xf$size-mean(size))^2/((n-1)*var(size))))
##            1        2        3
## sfit 4.01762 3.709547 6.315213
## hand 4.01762 3.709547 6.315213
spred <- sqrt(s^2+sfit^2)
b <- reg\$coef
b[1] + b[2]*xf[1,] + c(0,-1,1)*qt(.975, df=n-2)*spred[1]
## [1]  91.25591  59.50249 123.00932