I thought that’s what we were doing

So far we have been fitting lines to points.

Prediction and the modeling goal

A prediction rule is any function where you input an \(x\) and it outputs \(\hat{y} \equiv \hat{y}(x)\) as a predicted response at \(x\).

The least squares line generates a prediction rule: \[ \hat{y} = b_0 + b_1 x. \]

\(\hat{y}\) is just what we expect for a given \(x\); by itself it says nothing about how accurate that forecast is.
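As a small sketch of the prediction rule in R (the \(x\) and \(y\) values here are made up for illustration):

```r
# Fit a least squares line and use it as a prediction rule yhat = b0 + b1*x.
# The data below are toy numbers, not the house data used later.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)
b0 <- coef(fit)[1]  # intercept
b1 <- coef(fit)[2]  # slope
xnew <- 2.5
yhat <- b0 + b1 * xnew                        # predicted response at xnew
predict(fit, newdata = data.frame(x = xnew))  # same number via predict()
```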


Forecast accuracy

We need to devise a notion of forecast accuracy.

Forecasts are useless without some kind of uncertainty qualification/quantification.

One method is to specify a range of \(Y(x)\) values that are likely at a given \(x\) value: a prediction interval.

Full accounting of uncertainty

To construct a “proper” prediction interval we will have to:

  1. obtain estimates of the coefficients, \(b_0\) and \(b_1\), generating the “fitted” line
  2. assess the likely range of residual values corresponding to a \(Y(x)\) that has not yet been observed, centered at the fitted line \(\hat{y}(x) = b_0 + b_1 x\).
  3. acknowledge that the “fitted line” may be fooled by particular realizations of the residuals
    • i.e., that our estimated coefficients \(b_0\) and \(b_1\) are random.

We know how to do Step 1 already.
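Step 1 can be sketched directly from the familiar least squares formulas; the check against `lm` below uses the house price v. size data that appears later in these notes.

```r
# Least squares coefficients by hand, checked against lm().
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
b1 <- cov(size, price) / var(size)   # slope
b0 <- mean(price) - b1 * mean(size)  # intercept
coef(lm(price ~ size))               # should match (b0, b1)
```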

Simple linear regression (SLR) model

Here it is: \[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2), \quad \mbox{ for } \quad i=1,\dots, n. \]

Similar, but with important differences.

Data generating mechanism

Before looking at any data, the model specifies how the data will be generated.

We think about the data as being one possible realization of data that could have been generated from the model \[ Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2). \]
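A sketch of this data generating mechanism in R; the "true" values of \(\beta_0\), \(\beta_1\), and \(\sigma\) below are made up for illustration.

```r
# Draw one possible realization of data from the SLR model.
set.seed(1)
n <- 50
x <- runif(n, 0, 3)
beta0 <- 40; beta1 <- 35; sigma <- 14          # hypothetical truth
y <- beta0 + beta1 * x + rnorm(n, 0, sigma)    # Y_i | x_i ~ N(beta0 + beta1*x_i, sigma^2)
# re-running the line above yields a different realization each time
```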


Conditional variance

In this setup \[ Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2). \]

The conditional variance is \[ \mathbb{V}\mathrm{ar}\{Y \mid x\} = \mathbb{V}\mathrm{ar}\{ \beta_0 + \beta_1 x + \varepsilon \mid x\} = \mathbb{V}\mathrm{ar}\{\varepsilon\} = \sigma^2. \]

Analogous to our sliced-boxplots (house price v. size) example
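The "slices" can be checked empirically: simulating many \(Y\) draws at two different \(x\) values (with made-up parameter values) gives roughly the same variance, \(\sigma^2\), in each slice.

```r
# Empirical check that Var{Y | x} = sigma^2 regardless of x.
set.seed(2)
sigma <- 2
y_at_1 <- 10 + 3 * 1 + rnorm(1e5, 0, sigma)  # slice at x = 1
y_at_5 <- 10 + 3 * 5 + rnorm(1e5, 0, sigma)  # slice at x = 5
var(y_at_1); var(y_at_5)                     # both near sigma^2 = 4
```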


How do we estimate the unknown parameters, \(\beta_0\), \(\beta_1\) and \(\sigma^2\)?

How would we estimate the parameters by maximum likelihood?

\[ L(\beta_0, \beta_1, \sigma^2) = \cdots \]

Maximum likelihood coefficients

The log likelihood is

\[ \ell(\beta_0, \beta_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]

We can see already that the derivatives with respect to \(\beta_0\) and \(\beta_1\) are, up to the constant \(-1/(2\sigma^2)\) out front, the same as the derivatives of \[ \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2, \] and when we set them to zero that constant cancels. So the maximum likelihood coefficients are exactly the least squares coefficients \(b_0\) and \(b_1\).
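This equivalence can be checked numerically: maximizing the log likelihood with a generic optimizer recovers the `lm` coefficients. The starting values below are an arbitrary choice for illustration.

```r
# Numerical MLE for (beta0, beta1, log sigma), compared with lm().
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
negll <- function(par) {  # par = (beta0, beta1, log sigma)
  -sum(dnorm(price, par[1] + par[2] * size, exp(par[3]), log = TRUE))
}
start <- c(mean(price), 0, log(sd(price)))
mle <- optim(start, negll, method = "BFGS", control = list(maxit = 500))$par
mle[1:2]                 # approximately the least squares coefficients
coef(lm(price ~ size))
```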

Maximum likelihood variance

All that remains is \(\sigma^2\); however, an analogy works there too.

Let \(\hat{\mu}_i \equiv \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\), and write

\[ \ell(\hat{\beta}_0, \hat{\beta}_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \hat{\mu}_i)^2. \]

Now we can see what will happen when we differentiate w.r.t. \(\sigma^2\).

\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{\varepsilon}_i^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

Bias and DoF

Just like before, this is a biased estimate of \(\sigma^2\); here two degrees of freedom are used up by estimating \(\beta_0\) and \(\beta_1\).

For that reason we usually take \[ \hat{\sigma}^2 = s^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2. \]
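The bias can be seen by Monte Carlo: averaging the \(n\)-divisor estimate over many simulated datasets (with made-up true parameter values) comes in below \(\sigma^2\), while the \((n-2)\)-divisor version does not.

```r
# Monte Carlo sketch of the bias in the n-divisor variance estimate.
set.seed(3)
n <- 10; beta0 <- 1; beta1 <- 2; sigma2 <- 4
x <- seq(0, 1, length = n)
mle <- unb <- rep(NA, 1000)
for (r in 1:1000) {
  y <- beta0 + beta1 * x + rnorm(n, 0, sqrt(sigma2))
  ss <- sum(resid(lm(y ~ x))^2)
  mle[r] <- ss / n        # biased: expectation is (n-2)/n * sigma2
  unb[r] <- ss / (n - 2)  # unbiased
}
mean(mle); mean(unb)      # roughly 3.2 v. 4
```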

Residual standard error

size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
summary(reg <- lm(price ~ size))
## Call:
## lm(formula = price ~ size)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.425  -8.618   0.575  10.766  18.498 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   38.885      9.094   4.276 0.000903 ***
## size          35.386      4.494   7.874 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 14.14 on 13 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.8133 
## F-statistic:    62 on 1 and 13 DF,  p-value: 2.66e-06

Residual standard error (2)

R calls \(s = \sqrt{s^2} = \sqrt{\hat{\sigma}^2}\) the residual standard error.

s <- summary(reg)$sigma
n <- length(size)
c(s, sqrt(sum(reg$resid^2)/(n-2)))
## [1] 14.1384 14.1384


We have estimates now, and they are not the same as the true parameters.

Estimation uncertainty

How much do our estimates depend on the particular random sample we happen to observe?

Before deriving this sampling distribution, let's do some Monte Carlo experiments.

If the estimates don’t vary much from sample to sample, then it doesn’t matter which sample you happen to observe.

Code for our first experiment is given in lmmc.R.
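The exact contents of lmmc.R are not shown here, but an experiment of that kind might look like the following sketch: refit the line to many datasets simulated from the same (made-up) truth, and examine the spread of the coefficient estimates.

```r
# Monte Carlo sketch: how much do b0 and b1 vary from sample to sample?
set.seed(4)
n <- 15; beta0 <- 40; beta1 <- 35; sigma <- 14  # hypothetical truth
x <- runif(n, 0.5, 3.5)                          # fixed design across reps
B <- 1000
coefs <- matrix(NA, B, 2)
for (b in 1:B) {
  y <- beta0 + beta1 * x + rnorm(n, 0, sigma)   # new realization each rep
  coefs[b, ] <- coef(lm(y ~ x))
}
apply(coefs, 2, sd)  # sampling variability of the two estimates
```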