## I thought that’s what we were doing

So far we have been fitting lines to points.

• We haven’t really been doing statistics because we haven’t really introduced any probabilities.
• We don’t really have a model.
• And we’ll need a model if we are going to perform inference:
• to get a sense of estimation uncertainty,
• to test for the usefulness of explanatory variables ($$x$$),
• and to put “error bars” on our predictions from the line(s).

## Prediction and the modeling goal

A prediction rule is any function where you input an $$x$$ and it outputs $$\hat{y} \equiv \hat{y}(x)$$ as a predicted response at $$x$$.

The least squares line generates a prediction rule: $\hat{y} = b_0 + b_1 x.$

• The rule tells us what to do when a new $$x$$ comes along;
• run it through the formula and obtain a guess $$\hat{y}(x)$$.

$$\hat{y}$$ is just what we expect for a given $$x$$;

• it is not going to be a perfect prediction.
• If we actually sample a $$y$$-value at $$x$$ we will not get exactly $$\hat{y}$$.
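As a tiny R sketch of the idea (the coefficient values here are made up; we fit a real line to data later in these notes):

```r
# a prediction rule is just a function: feed in x, get back yhat(x)
b0 <- 40; b1 <- 35          # hypothetical least squares estimates
yhat <- function(x) b0 + b1 * x
yhat(2)                     # the guess for the response at x = 2
```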

## Forecast accuracy

We need to devise a notion of forecast accuracy.

• How sure are we about our forecast?
• Or how different could $$Y(x)$$ be from what we expect?

Forecasts are useless without some kind of uncertainty qualification/quantification.

One method is to specify a range of $$Y(x)$$ values that are likely at a given $$x$$ value,

• a so-called prediction interval, quickly summarizing the distribution of $$Y(x)$$.
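Once a line has been fit, R's `predict()` can produce such an interval; a minimal sketch, assuming the `reg` object fitted to the house-price data later in these notes:

```r
# 95% prediction interval for Y(x) at a new x (here, size = 2)
predict(reg, newdata = data.frame(size = 2), interval = "prediction", level = 0.95)
```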

## Full accounting of uncertainty

To construct a “proper” prediction interval we will have to:

1. obtain estimates of the coefficients, $$b_0$$ and $$b_1$$, generating the “fitted” line
2. assess the likely range of residual values corresponding to a $$Y(x)$$ that has not yet been observed, centered at the fitted line $$\hat{y}(x) = b_0 + b_1 x$$.
3. acknowledge that the “fitted line” may be fooled by particular realizations of the residuals
• i.e., that our estimated coefficients $$b_0$$ and $$b_1$$ are random.

We know how to do Step 1 already.

• Steps 2 and 3 require investing in a probability model.

## Simple linear regression (SLR) model

Here it is: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2), \quad \mbox{ for } \quad i=1,\dots, n.$

Similar, but with important differences.

• It is a model, so we are assuming this relationship holds for some true but unknown values of $$\beta_0, \beta_1$$.
• Greek letters remind us that they are not the same as the LS estimates $$b_0$$ or $$b_1$$
• or any other estimate for that matter.
• The errors $$\varepsilon_i$$ are independent, additive, “idiosyncratic noise”.
• Their distribution is known up to its spread $$\sigma^2$$.
• Greek letters remind us that $$\varepsilon_i$$ is not $$e_i$$.

## Data generating mechanism

Before looking at any data, the model specifies,

• how $$Y$$ varies with $$x$$ on average,
• and the influence of factors other than $$x$$, namely $$\varepsilon$$ which is independent of $$x$$.

We think about the data as being one possible realization of data that could have been generated from the model $Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2).$

• $$\sigma^2$$ controls the dispersion of $$Y_i$$ around $$\beta_0 + \beta_1 x_i$$.
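A minimal R sketch of one draw from this mechanism, with made-up values for $$\beta_0$$, $$\beta_1$$, and $$\sigma$$:

```r
# one possible realization of n responses at fixed x values
set.seed(1)
n <- 100
x <- runif(n, 0, 3)                          # the x's are treated as fixed
beta0 <- 40; beta1 <- 35; sigma <- 15        # hypothetical true parameters
y <- beta0 + beta1 * x + rnorm(n, 0, sigma)  # Y_i | x_i ~ N(beta0 + beta1*x_i, sigma^2)
```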

## Conditional variance

In this setup $Y_i \mid x_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2).$

The conditional variance is $\mathbb{V}\mathrm{ar}\{Y \mid x\} = \mathbb{V}\mathrm{ar}\{ \beta_0 + \beta_1 x + \varepsilon \mid x\} = \mathbb{V}\mathrm{ar}\{\varepsilon\} = \sigma^2.$

• We customarily write $$x$$ on the right-hand side of the conditioning bar even though it is not really necessary
• for all sorts of reasons.

Analogous to our sliced-boxplots (house price v. size) example:

• $$\sigma^2 < \mathbb{V}\mathrm{ar}\{Y\}$$ if $$x$$ and $$Y$$ are related.
• The bigger $$R^2 = 1 - \sigma^2/\mathbb{V}\mathrm{ar}\{Y\}$$, the more $$x$$ matters.
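Continuing the simulation sketch above, we can check this numerically; the sample $$R^2$$ from the fitted line should be close to the population version built from the true $$\sigma^2$$:

```r
# compare the population-style R^2 with the usual sample R^2
fit <- lm(y ~ x)
1 - sigma^2 / var(y)        # population-style: 1 - sigma^2 / Var(Y)
summary(fit)$r.squared      # sample R^2; similar when n is large
```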

## Estimation

How do we estimate the unknown parameters, $$\beta_0$$, $$\beta_1$$ and $$\sigma^2$$?

• Well we have one way already – well, three in fact! – but it wasn’t very statistical.
• Now that we have a model we have more options.

How would we estimate the parameters by maximum likelihood?

• Start by writing down the likelihood.

$L(\beta_0, \beta_1, \sigma^2) = \cdots$
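Writing it out, since the $$Y_i$$ are independent normals given the $$x_i$$, the likelihood is just the product of the $$n$$ normal densities:

$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right\}.$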

## Maximum likelihood coefficients

The log likelihood is

$\ell(\beta_0, \beta_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$

We can see already that the derivatives with respect to $$\beta_0$$ and $$\beta_1$$ are the same as the derivatives of $-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2,$ and when we set them to zero the $$-1/2\sigma^2$$ constant out front will cancel.

• In other words, we have the least squares problem all over again.
• The MLE is $$\hat{\beta}_0 = b_0$$ and $$\hat{\beta}_1 = b_1$$.
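As a numerical sanity check, one can maximize the log likelihood directly with R's general-purpose optimizer; a sketch using the simulated `x` and `y` from the data-generating sketch above:

```r
# negative log likelihood of the SLR model; par = (beta0, beta1, log sigma)
negll <- function(par, x, y) {
  -sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
}
start <- c(mean(y), 0, log(sd(y)))                 # rough but sensible starting values
mle <- optim(start, negll, x = x, y = y)
rbind(optim = mle$par[1:2], ls = coef(lm(y ~ x)))  # the two rows should nearly agree
```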

## Maximum likelihood variance

All that remains is $$\sigma^2$$; however, an analogy works there too.

Let $$\hat{\mu}_i \equiv \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$, and write

$\ell(\hat{\beta}_0, \hat{\beta}_1, \sigma^2) = c -\frac{n}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \hat{\mu}_i)^2.$

Now we can see what will happen when we differentiate w.r.t. $$\sigma^2$$.

• The same thing as last time.
• From that we deduce the following estimate, where $$\hat{\varepsilon}_i \equiv e_i$$.
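Explicitly, setting the derivative to zero,

$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (y_i - \hat{\mu}_i)^2 = 0,$

and solving for $$\sigma^2$$: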

$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{\varepsilon}_i^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

## Bias and DoF

Just like before, this is a biased estimate

• and the bias correction involves adjusting for degrees of freedom.
• Two DoFs are lost in this case because two parameters ($$\beta_0$$ and $$\beta_1$$) were estimated from the data before that same data was again used to estimate the variance.

For that reason we usually take $\hat{\sigma}^2 = s^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$

• The proof that this is unbiased is similar to the one in the likelihood supplement.
• The intuition here is that lines are defined by two points, so you need at least three points for variance to be properly defined.

## Residual standard error

```r
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
summary(reg <- lm(price ~ size))
```

```
## 
## Call:
## lm(formula = price ~ size)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max
## -30.425  -8.618   0.575  10.766  18.498
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   38.885      9.094   4.276 0.000903 ***
## size          35.386      4.494   7.874 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.14 on 13 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.8133
## F-statistic:    62 on 1 and 13 DF,  p-value: 2.66e-06
```

## Residual standard error (2)

R calls $$s = \sqrt{s^2} = \sqrt{\hat{\sigma}^2}$$ the residual standard error.

• Remember, standard error just means estimated standard deviation,
• so residual standard error is the estimated standard deviation of the residuals.
```r
s <- summary(reg)$sigma
n <- length(size)
c(s, sqrt(sum(reg$resid^2)/(n-2)))
```

```
## [1] 14.1384 14.1384
```
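For comparison, the (biased) maximum likelihood estimate divides by $$n$$ rather than $$n-2$$, so it comes out a bit smaller:

```r
# MLE of sigma (divide by n) versus the bias-corrected s (divide by n - 2)
c(mle = sqrt(mean(reg$resid^2)), s = s)
```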

## Caution

We have estimates now, and they are not the same as the true parameters:

• $$\beta_0$$ is not $$b_0$$, $$\beta_1$$ is not $$b_1$$ and $$\varepsilon_i$$ is not $$e_i$$.

## Estimation uncertainty

How much do our estimates depend on the particular random sample we happen to observe?

Before deriving this sampling distribution, let's do some Monte Carlo experiments.

• Randomly draw different sample $$Y_i$$-values at some $$x_i$$’s
• and for each sample compute the estimates $$b_0$$, $$b_1$$, and $$s$$.

If the estimates don’t vary much from sample to sample, then it doesn’t matter which sample you happen to observe.

• If the estimates do vary a lot, then it matters!

Code for our first experiment is given in lmmc.R.
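A minimal sketch of such an experiment (not necessarily what lmmc.R does; the true parameter values here are made up):

```r
# repeatedly draw Y's at fixed x's from the SLR model and refit the line each time
set.seed(1)
beta0 <- 40; beta1 <- 35; sigma <- 15
x <- runif(20, 0, 3)                          # the x's stay fixed across samples
est <- t(replicate(1000, {
  y <- beta0 + beta1 * x + rnorm(length(x), 0, sigma)
  fit <- lm(y ~ x)
  c(coef(fit), sigma.hat = summary(fit)$sigma)
}))
apply(est, 2, sd)   # how much b0, b1, and s vary from sample to sample
```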