## Goals

Least squares offers the most generic form of fitting for supervised learning.

We will focus on linear least squares to start out.

• It provides a conceptually simple method for investigating a functional relationship between one or more factors and an outcome of interest.

• The relationship is expressed in the form of an equation, our model, connecting the response or dependent variable to one or more explanatory variables.

We will start with just two variables.

• Don't worry, we will get fancy in due course.

## Vocabulary

Our two variables will generically be called $$X$$ and $$Y$$.

$$Y$$: output, dependent variable, response, outcome

$$X$$: input, independent variable, explanatory variable, covariate, factor

Sometimes the words covariate and factor have special meaning, and sometimes people also refer to $$Y$$ as a covariate.

• Covariate generally means a variable that "correlates" with some other variable.
• The term factor can refer to a categorical variable,
• in which case we talk about the factor taking on levels.
• At least that's how it is in R.

## Warm-up

As a bit of a warm-up, let's think about how relationships between two variables, $$X$$ and $$Y$$, can be expressed by probabilistic constructs,

• and then we can see how that is related to regression and forecasting.

Consider a regression of house price on size

• price is our response, $$Y$$
• size is our explanatory variable, $$X$$

Our interest is in the distribution of $$Y$$, not alone (i.e., not marginally),

• but as $$X$$ varies (i.e., conditionally);
• because we think that $$X$$ might be useful for predicting $$Y$$.

## Conditional distributions

Regression is really about modeling the conditional distribution of $$Y$$ given $$X$$. $\mathbb{P}(Y \mid X)$

If we have data composed of $$(x_i, y_i)$$ pairs, $$i=1, \dots, n$$,

• where these quantities have been observed to co-occur,
• then we can use these to learn about $$\mathbb{P}(Y \mid X)$$.

We will find it useful to build a model for these conditional distributions,

• but for now let's just explore the potential intuitively.

A conditional distribution can be obtained by "slicing" the $$X$$-$$Y$$ point cloud.

• the marginal distribution ignores the slices.
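
The slicing idea can be sketched in R as follows. This is only an illustration on simulated data: the variables `size` and `price` below are generated from a made-up linear trend (they are not the course data), and the bin width of 0.5 is an arbitrary choice.

```r
# Simulated stand-in for house data (hypothetical numbers, not the course data set).
set.seed(1)
n <- 200
size <- runif(n, 0.5, 3.5)                    # assumed units: thousands of sq. ft.
price <- 35 + 40 * size + rnorm(n, sd = 10)   # made-up linear trend plus noise

# Slice the X axis into bins of width 0.5; each bin gives one conditional sample.
slices <- cut(size, breaks = seq(0.5, 3.5, by = 0.5), include.lowest = TRUE)

# Marginal distribution (ignores size) next to the conditional ones (one per slice).
par(mfrow = c(1, 2))
boxplot(price, main = "marginal", ylab = "price")
boxplot(price ~ slices, main = "conditional", xlab = "size", ylab = "price")

# Conditional medians: one point forecast per slice.
cond_med <- tapply(price, slices, median)
cond_med
```

Each box in the right-hand panel summarizes $$Y$$ within one slice of $$X$$; the left-hand panel pools all slices together.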

## Conditional v. marginal distribution

The conditional distributions answer the forecasting problem:

• if I know that a house is between 1 and 1.5 thousand sq.ft.,
• then the conditional distribution (second boxplot) gives me a point forecast (the median) and a prediction interval.
• The conditional means/medians seem to line up along a "regression line".
• The conditional distributions have much smaller dispersion than the marginal distribution.
• Apparently $\mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}.$

## When is $$X$$ useful for predicting $$Y$$?

This suggests two general points.

• If $$X$$ has no forecasting power, then the marginal and conditionals will be the same. In particular $\mathbb{V}\mathrm{ar}\{Y \mid X\} \approx \mathbb{V}\mathrm{ar}\{ Y\}$

• If $$X$$ has some forecasting information, then
• the conditional means will be different from the marginal or overall mean $\mathbb{E}\{Y \mid X\} \ne \mathbb{E}\{ Y\},$
• and the conditional variance of $$Y$$ given $$X$$ will be (substantially) less than the marginal variance of $$Y$$ $\mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}.$
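
A rough numerical check of these two points, on simulated data. Everything here is an assumption made for illustration: the two responses `y_info` and `y_none` are invented, and the average within-slice variance is used as a crude stand-in for $$\mathbb{V}\mathrm{ar}\{Y \mid X\}$$.

```r
set.seed(2)
n <- 500
x <- runif(n, 0, 3)
y_info <- 35 + 40 * x + rnorm(n, sd = 10)   # Y depends on X (hypothetical model)
y_none <- 95 + rnorm(n, sd = 10)            # Y generated without reference to X

# Three slices of X, each of width 1.
slices <- cut(x, breaks = 0:3, include.lowest = TRUE)

# Average within-slice variance: a crude proxy for Var{Y | X}.
cond_var <- function(y) mean(tapply(y, slices, var))

# Informative X: conditional variance is much smaller than the marginal variance.
c(marginal = var(y_info), conditional = cond_var(y_info))

# Uninformative X: the two are about the same.
c(marginal = var(y_none), conditional = cond_var(y_none))
```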

Intuition from an example where $$X$$ has no predictive power.

Note the code supporting the house price example(s) is in cndprice.R.

## A line-based conditional distribution

How do we get away from grouping the data artificially?

• We're going to "fit" a line instead, a representation of $$Y \mid X$$ via $$y=ax+b$$,
• and check if the predictions from the line (representing our conditional distribution) lead to a reduction in variance compared to having no line,
• i.e., just using plain averages $$\bar{Y}$$, ignoring $$X$$.

There are three steps

1. Finding a line that makes sense;
2. Arguing that it is a good line;
3. Deriving its statistical properties,
• and subsequently determining if $$X$$ is useful for predicting $$Y$$.
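
As a preview of the comparison described above, here is a minimal sketch on simulated data. The line $$y = ax + b$$ is picked by hand (the values of `a` and `b` are assumptions, not estimates), and the spread around it is compared with the spread around the plain average.

```r
set.seed(3)
n <- 200
x <- runif(n, 0.5, 3.5)
y <- 35 + 40 * x + rnorm(n, sd = 10)        # hypothetical data

a <- 40; b <- 35                             # a hand-picked candidate line y = a x + b

spread_no_line <- mean((y - mean(y))^2)      # average squared deviation from the mean
spread_line    <- mean((y - (a * x + b))^2)  # average squared deviation from the line

c(no_line = spread_no_line, line = spread_line)
```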

## Covariance

First we need a little review on concepts relating two random quantities, $$X$$ and $$Y$$.

Recall the definition of covariance $\mathbb{C}\mathrm{ov}(X,Y) = \mathbb{E}\{(X - \mathbb{E}\{X\})(Y - \mathbb{E}\{Y\}) \}.$

• Linear combinations factor as $$\mathbb{C}\mathrm{ov}(aX+b,cY+d) = ac\mathbb{C}\mathrm{ov}(X,Y)$$.
• Variance is a special case: $$\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{C}\mathrm{ov}(Y,Y)$$,
• and $$\mathbb{V}\mathrm{ar}\{aY\} = a^2 \mathbb{V}\mathrm{ar}\{Y\}$$.

The sample analog is calculated as $s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$

Covariance measures how $$X$$ and $$Y$$ vary with each other around their means.
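
The sample formula and the properties above can be checked numerically. The data below are simulated purely for illustration, and the constants `a`, `b`, `cc`, `d` are arbitrary.

```r
set.seed(4)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
n <- length(x)

# Sample covariance, written out as in the formula above.
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
all.equal(s_xy, cov(x, y))          # agrees with the built-in cov()

# Linear combinations: Cov(aX + b, cY + d) = ac Cov(X, Y).
a <- 3; b <- -1; cc <- 0.5; d <- 7
all.equal(cov(a * x + b, cc * y + d), a * cc * cov(x, y))

# Variance as a special case of covariance.
all.equal(cov(y, y), var(y))
all.equal(var(a * y), a^2 * var(y))
```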

## Correlation

Correlation is standardized covariance: $\mathbb{C}\mathrm{orr}(X,Y) = \frac{\mathbb{C}\mathrm{ov}(X,Y)}{\sqrt{\mathbb{V}\mathrm{ar}\{X\} \mathbb{V}\mathrm{ar}\{Y\}}}.$

The correlation is scale-invariant (or scale-free), which means the units of measurement don't matter:

• It is always true that $$-1 \leq \mathbb{C}\mathrm{orr}(X,Y) \leq 1$$.
• It gives the direction ($$+$$ or $$-$$) and the strength ($$0 \rightarrow 1$$) of a linear relationship between $$X$$ and $$Y$$.

The sample analog is $r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{1}{n-1} \sum_{i=1}^n \frac{(x_i - \bar{x})}{s_x} \frac{(y_i - \bar{y})}{s_y}.$
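
A quick check of the sample formula and of scale invariance, again on simulated data. The unit conversions below (a factor of 0.3048 and a division by 1000) are arbitrary examples.

```r
set.seed(5)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
n <- length(x)

# Correlation as standardized covariance.
r_xy <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_xy, cor(x, y))

# Equivalent form: average product of standardized values.
r_z <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)
all.equal(r_z, r_xy)

# Changing units rescales covariance but leaves correlation unchanged.
all.equal(cor(0.3048 * x, y / 1000), cor(x, y))
```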

## Linear correlation

Correlation only measures linear relationships.

• $$\mathbb{C}\mathrm{orr}(X,Y) = 0$$ does not mean the variables are unrelated!

See corr.R.
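
A small self-contained illustration (not the contents of corr.R, just a sketch of the same idea): here $$Y$$ is an exact function of $$X$$, yet the correlation is essentially zero because the relationship is not linear.

```r
x <- seq(-1, 1, length.out = 101)   # grid symmetric about zero
y <- x^2                            # Y is completely determined by X

cor(x, y)                           # essentially zero, up to rounding error
plot(x, y, pch = 20)
```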

## Correlation and regression

"Imagine" that $$Y = b_0 + b_1 X + e$$,

• where $$e$$ is some "idiosyncratic noise" that is independent of $$X$$.

Now, $\begin{aligned} \mathbb{C}\mathrm{ov}(X,Y) &= \mathbb{C}\mathrm{ov}(X,b_0 + b_1 X + e) \\ &= \mathbb{C}\mathrm{ov}(X, b_1 X) \\ &= b_1 \mathbb{V}\mathrm{ar}\{X\} \\ \mbox{so } \quad b_1 &= \frac{\mathbb{C}\mathrm{ov}(X,Y)}{\mathbb{V}\mathrm{ar}\{X\}}. \\ \mbox{In sample terms, } \quad b_1 &= \frac{s_{xy}}{s_x^2}. \end{aligned}$

• Remember that for later.
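
To see the formula at work, we can simulate from the "imagined" model and recover the slope from the sample covariance and variance. The coefficients `b0 = 35` and `b1 = 40` and the noise level are made-up values; `lm()` is R's built-in linear model fit, used here only as a cross-check.

```r
set.seed(6)
n <- 1000
b0 <- 35; b1 <- 40                   # hypothetical "true" coefficients
x <- runif(n, 0.5, 3.5)
e <- rnorm(n, sd = 10)               # noise generated independently of x
y <- b0 + b1 * x + e

b1_hat <- cov(x, y) / var(x)         # s_xy / s_x^2
b1_hat                               # close to b1, up to sampling error

# Cross-check against the slope reported by lm().
all.equal(b1_hat, unname(coef(lm(y ~ x))[2]))
```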

## Slope and correlation

Here is a more "interpretable" version in terms of correlations: $\begin{aligned} \mathbb{C}\mathrm{orr}(X,Y) &= \frac{b_1 \mathbb{V}\mathrm{ar}\{X\}}{\sqrt{\mathbb{V}\mathrm{ar}\{X\} \mathbb{V}\mathrm{ar}\{Y\}}} = b_1 \frac{\sqrt{\mathbb{V}\mathrm{ar}\{X\}}}{\sqrt{\mathbb{V}\mathrm{ar}\{Y\}}}. \\ \mbox{In sample terms, } \quad r_{xy} &= b_1 \frac{s_x}{s_y} \quad \mbox{ so } \quad b_1 = r_{xy} \frac{s_y}{s_x}. \end{aligned}$

Apparently, the slope from noisy data pairs $$(x_1,y_1), \dots, (x_n, y_n)$$ is: $b_1 = r_{xy} \frac{s_y}{s_x} = \mbox{corr } \times \frac{\mathrm{rise}}{\mathrm{run}}.$

• or correlation times "units $$Y$$" per "units $$X$$".
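
The two expressions for the slope can be compared directly. The data are simulated (hypothetical coefficients), and the rescaling of $$Y$$ by 1000 is an arbitrary change of units meant to show that the slope carries units while the correlation does not.

```r
set.seed(7)
x <- runif(100, 0.5, 3.5)
y <- 35 + 40 * x + rnorm(100, sd = 10)

b1_cov  <- cov(x, y) / var(x)            # covariance form
b1_corr <- cor(x, y) * sd(y) / sd(x)     # correlation times "rise over run"
all.equal(b1_cov, b1_corr)

# Change the units of Y: the correlation stays put, the slope rescales.
y_k <- y / 1000
all.equal(cor(x, y_k), cor(x, y))
all.equal(cov(x, y_k) / var(x), b1_cov / 1000)
```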

## Wage data

Data:

• Greenberg and Kosters, 1970, Rand Corporation.

## Pretty good fit

My eyeballs are pretty good.

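```r
# Assumes `size` and `price` are already loaded (see cndprice.R).
plot(size, price, pch=20)
# Eyeballed line: intercept 35, slope 40, drawn in red.
abline(35, 40, col=2)
```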

## What is a good line?

Can we do better than the eyeball method?

We desire a strategy for estimating the slope and intercept parameters in the model $\hat{y} = b_0 + b_1 x.$

That involves

• choosing a criterion,
• i.e., quantifying how good a line is relative to the data,
• and matching that criterion with a solution,
• i.e., finding the best line subject to that criterion.

## Fitted values and residuals

Although there are lots of ways to choose a criterion,

• only a small handful lead to solutions that are "easy" to compute
• and which have nice statistical properties.

Most reasonable criteria involve measuring the amount by which the fitted value

• obtained from the line for each point in the data, $$\hat{y}_i = b_0 + b_1 x_i$$

differs from the observed value of the response in the data, $$y_i$$.

• This amount is called the residual: $$e_i = y_i - \hat{y}_i$$.
• Good lines produce small residuals.

The residual $$e_i$$ is the discrepancy between the fitted $$\hat{y}_i$$ and observed $$y_i$$ values.

• Note that we can write $$y_i = \hat{y}_i + (y_i - \hat{y}_i) = \hat{y}_i + e_i$$.
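
In R, fitted values and residuals for any candidate line take one line each. The sketch below uses simulated data and a hand-picked line (`b0 = 35`, `b1 = 40` are assumptions, not estimates).

```r
set.seed(8)
x <- runif(20, 0.5, 3.5)
y <- 35 + 40 * x + rnorm(20, sd = 10)    # hypothetical data

b0 <- 35; b1 <- 40                        # a candidate line, chosen by hand
y_hat <- b0 + b1 * x                      # fitted values
e <- y - y_hat                            # residuals

all.equal(y, y_hat + e)                   # observed = fitted + residual

plot(x, y, pch = 20)
abline(b0, b1, col = 2)
segments(x, y, x, y_hat, col = "gray")    # each vertical segment is one residual
```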

## Least squares

A reasonable goal is to minimize the size of all residuals:

• If they were all zero we would have a perfect line.
• There is a trade-off between moving closer to some points and at the same time moving away from other points.

Since some residuals are positive and some are negative, we need one more ingredient.

• $$|e_i|$$ treats positives and negatives equally.
• So does $$e_i^2$$ which is easier to work with mathematically.

The method of least squares chooses $$b_0$$ and $$b_1$$ to minimize $$\sum_{i=1}^n e_i^2$$.
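
The criterion is easy to code up and compare across candidate lines. The following is a sketch on simulated data: the "eyeball" line is hand-picked, `optim()` is used as a generic numerical minimizer (one of several ways to do this, chosen here for illustration), and `lm()` is R's built-in least squares fit.

```r
set.seed(9)
x <- runif(100, 0.5, 3.5)
y <- 35 + 40 * x + rnorm(100, sd = 10)    # hypothetical data

# Least squares criterion: sum of squared residuals for the line b[1] + b[2] * x.
sse <- function(b) sum((y - (b[1] + b[2] * x))^2)

eyeball     <- c(35, 40)                                   # a hand-picked line
numeric_fit <- optim(c(0, 0), sse, method = "BFGS")$par    # minimize sse numerically
ls_fit      <- unname(coef(lm(y ~ x)))                     # built-in least squares

# Intercept, slope, and criterion value for each line.
rbind(eyeball = c(eyeball, sse(eyeball)),
      numeric = c(numeric_fit, sse(numeric_fit)),
      lm      = c(ls_fit, sse(ls_fit)))
```

The numerical minimizer and `lm()` should agree closely, and neither can do worse on this criterion than the eyeballed line.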