Department of Statistics, Virginia Tech

So far we have been fitting lines to points.

- We haven’t really been doing statistics because we haven’t really introduced any probabilities.
- We don’t really have a **model**.
- And we’ll need a model if we are going to perform inference:
- to get a sense of estimation uncertainty,
- to test for the usefulness of explanatory variables (\(x\)),
- and to put “error bars” on our predictions from the line(s).

A **prediction rule** is any function where you input an \(x\) and it outputs \(\hat{y} \equiv \hat{y}(x)\) as a predicted response at \(x\).

The least squares line generates a prediction rule: \[ \hat{y} = b_0 + b_1 x. \]

- The rule tells us what to do when a **new \(x\)** comes along;
- run it through the formula and obtain a guess \(\hat{y}(x)\).
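As a sketch, the prediction rule is just a function of \(x\). The coefficient values below are hypothetical stand-ins for whatever the least squares fit produces.

```
b0 <- 38.885; b1 <- 35.386     ## hypothetical LS coefficient values
yhat <- function(x) b0 + b1*x  ## the prediction rule
yhat(2)                        ## guess for the response at a new x = 2
```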

\(\hat{y}\) is just **what we expect** for a given \(x\);

- it is not going to be a perfect prediction.
- If we actually sample a \(y\)-value at \(x\) we will not get exactly \(\hat{y}\).

We need to devise a notion of **forecast accuracy**.

- How sure are we about our forecast?
- Or how different could \(Y(x)\) be from what we expect?

Forecasts are useless without some kind of uncertainty qualification/quantification.

One method is to specify a range of \(Y(x)\) values that are likely at a given \(x\) value,

- a so-called **prediction interval**, quickly summarizing the distribution of \(Y(x)\).
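As a preview, R can produce such an interval directly with `predict(..., interval="prediction")`; where the interval comes from is worked out in what follows. Here it is on the house-price data fit later in these notes:

```
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
fit <- lm(price ~ size)
## 95% prediction interval for Y(x) at a new x = 2
predict(fit, newdata=data.frame(size=2), interval="prediction")
```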

To construct a “proper” prediction interval we will have to:

- obtain estimates of the coefficients, \(b_0\) and \(b_1\), generating the “fitted” line;
- assess the likely range of residual values corresponding to a \(Y(x)\) that has **not** yet been observed, centered at the fitted line \(\hat{y}(x) = b_0 + b_1 x\);
- acknowledge that the “fitted line” may be fooled by particular realizations of the residuals,
  - i.e., that our estimated coefficients \(b_0\) and \(b_1\) are random.

We know how to do Step 1 already.

- Steps 2 and 3 require investing in a **probability model**.

Here it is: \[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2), \quad \mbox{ for } \quad i=1,\dots, n. \]

Similar, but with important differences.

- It is a **model**, so we are *assuming* this relationship holds for some **true but unknown** values of \(\beta_0, \beta_1\).
- Greek letters remind us that they are not the same as the LS estimates \(b_0\) or \(b_1\),
  - or any other estimate for that matter.

- The errors \(\varepsilon_i\) are independent, additive, “idiosyncratic noise”.
- Their distribution is *known* up to its spread \(\sigma^2\).
- Greek letters remind us that \(\varepsilon_i\) is not \(e_i\).

Before looking at any data, the model specifies,

- how \(Y\) varies with \(x\) *on average*,
- and the influence of factors other than \(x\), namely \(\varepsilon\), which is independent of \(x\).

We think about the data as being one possible realization of data that *could* have been generated from the model \[
Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2).
\]

- \(\sigma^2\) controls the **dispersion** of \(Y_i\) around \(\beta_0 + \beta_1 x_i\).
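The “one possible realization” idea is easy to sketch by simulation. The “true” values below are made up for illustration.

```
## one possible realization from the model, with made-up "true" values
set.seed(1)
n <- 100; beta0 <- 2; beta1 <- 3; sigma <- 0.5
x <- runif(n)
y <- beta0 + beta1*x + rnorm(n, sd=sigma)  ## iid N(0, sigma^2) errors
coef(lm(y ~ x))  ## estimates land near, but not exactly at, (2, 3)
```

Re-running without the seed gives a different realization, and different estimates, each time.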

In this setup \[ Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2). \]

The conditional variance is \[ \mathbb{V}\mathrm{ar}\{Y \mid x\} = \mathbb{V}\mathrm{ar}\{ \beta_0 + \beta_1 x + \varepsilon \mid x\} = \mathbb{V}\mathrm{ar}\{\varepsilon\} = \sigma^2. \]

- We customarily write \(x\) on the right-hand-side of the conditioning bar even though it is not really necessary
- for all sorts of reasons.
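A quick Monte Carlo check of this fact, with made-up parameter values: the variance of \(Y\) at a fixed \(x\) is \(\sigma^2\), no matter which \(x\) we condition on.

```
## Var{Y | x} = sigma^2 regardless of x: check by simulation
set.seed(42)
beta0 <- 2; beta1 <- 3; sigma <- 0.5
for(x in c(0, 1.5, 10))
  print(var(beta0 + beta1*x + rnorm(1e5, sd=sigma)))  ## all near 0.25
```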

Analogous to our sliced-boxplots (house price v. size) example

- \(\sigma^2 < \mathbb{V}\mathrm{ar}\{Y\}\) if \(x\) and \(Y\) are related.
- The bigger \(R^2 = 1 - \sigma^2/\mathbb{V}\mathrm{ar}\{Y\}\), the more \(x\) matters.
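The plug-in version of this formula, with sample variances standing in for \(\sigma^2\) and \(\mathbb{V}\mathrm{ar}\{Y\}\), recovers the \(R^2\) that `lm` reports. A check on the house-price data used later in these notes:

```
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
reg <- lm(price ~ size)
## plug-in version of R^2 = 1 - sigma^2/Var{Y}
c(1 - var(reg$resid)/var(price), summary(reg)$r.squared)  ## agree
```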

How do we estimate the unknown parameters, \(\beta_0\), \(\beta_1\) and \(\sigma^2\)?

- Well we have one way already – well, three in fact! – but it wasn’t very statistical.
- Now that we have a model we have more options.

How would we estimate the parameters by maximum likelihood?

- Start by writing down the likelihood.

\[ L(\beta_0, \beta_1, \sigma^2) = \cdots \]

The log likelihood is

\[ \ell(\beta_0, \beta_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]
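The log likelihood above translates directly into an R function (taking \(c = 0\)); evaluated at the MLE it reproduces what `logLik` reports for the house-price fit used later in these notes.

```
## the log likelihood above, as a function of the parameters (c = 0)
ll <- function(beta0, beta1, sigma2, x, y)
  sum(dnorm(y, mean=beta0 + beta1*x, sd=sqrt(sigma2), log=TRUE))

size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
reg <- lm(price ~ size); n <- length(price)
c(ll(coef(reg)[1], coef(reg)[2], sum(reg$resid^2)/n, size, price),
  logLik(reg))  ## agree when evaluated at the MLE
```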

We can see already that the derivatives with respect to \(\beta_0\) and \(\beta_1\) are the same, up to sign, as the derivatives of \[ \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2, \] and when we set them to zero the \(-1/2\sigma^2\) constant out front will cancel.

- In other words, we have the least squares problem all over again.
- The MLE is \(\hat{\beta}_0 = b_0\) and \(\hat{\beta}_1 = b_1\).
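We can confirm the agreement numerically: the textbook least squares formulas match `lm`’s coefficients, computed here on the house-price data used later in these notes.

```
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
b1 <- cov(size, price)/var(size)   ## least squares slope
b0 <- mean(price) - b1*mean(size)  ## least squares intercept
rbind(manual=c(b0, b1), lm=coef(lm(price ~ size)))  ## identical rows
```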

All that remains is \(\sigma^2\); however, an analogous argument works there too.

Let \(\hat{\mu}_i \equiv \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\), and write

\[ \ell(\hat{\beta}_0, \hat{\beta}_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \hat{\mu}_i)^2. \]

Now we can see what will happen when we differentiate w.r.t. \(\sigma^2\).

- The same thing as last time.
- From that we deduce the following estimate, where \(\hat{\varepsilon}_i \equiv e_i\).

\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{\varepsilon}_i^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

Just like before, this is a biased estimate

- and the bias correction involves adjusting for degrees of freedom.
- Two DoFs are lost in this case because two parameters (\(\beta_0\) and \(\beta_1\)) were estimated from the data before that same data was again used to estimate the variance.

For that reason we usually take \[ \hat{\sigma}^2 = s^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2. \]

- The proof that this is unbiased is similar to the one in the likelihood supplement.
- The intuition here is that lines are defined by two points, so you need at least three points for variance to be properly defined.

```
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
summary(reg <- lm(price ~ size))
```

```
##
## Call:
## lm(formula = price ~ size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.425 -8.618 0.575 10.766 18.498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.885 9.094 4.276 0.000903 ***
## size 35.386 4.494 7.874 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.14 on 13 degrees of freedom
## Multiple R-squared: 0.8267, Adjusted R-squared: 0.8133
## F-statistic: 62 on 1 and 13 DF, p-value: 2.66e-06
```

R calls \(s = \sqrt{s^2} = \sqrt{\hat{\sigma}^2}\) the **residual standard error**.

- Remember, standard error just means estimated standard deviation,
- so residual standard error is the estimated standard deviation of the residuals.

```
s <- summary(reg)$sigma
n <- length(size)
c(s, sqrt(sum(reg$resid^2)/(n-2)))
```

`## [1] 14.1384 14.1384`
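For comparison, the uncorrected MLE divides by \(n\) instead of \(n-2\) and so comes out slightly smaller:

```
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
reg <- lm(price ~ size); n <- length(size)
## biased MLE (divide by n) vs. bias-corrected (divide by n - 2)
c(sqrt(sum(reg$resid^2)/n), sqrt(sum(reg$resid^2)/(n-2)))
```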

We have estimates now, and they are *not* the same as the true parameters:

- \(\beta_0\) is not \(b_0\), \(\beta_1\) is not \(b_1\) and \(\varepsilon_i\) is not \(e_i\).

How much do our estimates depend on the particular random sample we happen to observe?

Before deriving the sampling distribution of these estimates, let’s do some Monte Carlo experiments.

- Randomly draw different sample \(Y_i\)-values at some \(x_i\)’s
- and for each sample compute the estimates \(b_0\), \(b_1\) and \(s\).

If the estimates don’t vary much from sample to sample, then it doesn’t matter which sample you happen to observe.

- If the estimates do vary a lot, then it matters!

*Code for our first experiment is given in lmmc.R.*
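The file lmmc.R is not reproduced here, but a minimal sketch of such an experiment, with made-up true parameter values, might look like:

```
## Monte Carlo: refit the line to fresh Y samples at the same x's
set.seed(1)
beta0 <- 2; beta1 <- 3; sigma <- 0.5  ## made-up "truth"
x <- runif(50)
B <- 1000
est <- matrix(NA, B, 3, dimnames=list(NULL, c("b0", "b1", "s")))
for(b in 1:B) {
  y <- beta0 + beta1*x + rnorm(50, sd=sigma)  ## new sample, same x's
  fit <- lm(y ~ x)
  est[b,] <- c(coef(fit), summary(fit)$sigma)
}
apply(est, 2, sd)  ## sample-to-sample spread of the estimates
```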