## Goals

Least squares offers arguably the most generic form of fitting for supervised learning.

We will focus on linear least squares to start out.

• It provides a conceptually simple method for investigating a functional relationship between one or more factors and an outcome of interest.

• The relationship is expressed in the form of an equation, our model, connecting the response or dependent variable to one or more explanatory variables.

We will start with just two variables.

• Don’t worry, we will get fancy in due course.

## Vocabulary

Our two variables will generically be called $$X$$ and $$Y$$.

$$Y$$: output, dependent variable, response, outcome

$$X$$: input, independent variable, explanatory variable, covariate, factor

Sometimes the words covariate and factor have special meaning, and sometimes people also refer to $$Y$$ as a covariate.

• Covariate generally means a variable that “correlates” with some other variable.
• The term factor is often reserved for a categorical variable,
• where we talk about a factor taking on levels.
• At least that’s how it is in R.

# Forecasting potential

## Warm-up

As a bit of a warm-up, let's think about how relationships between two variables, $$X$$ and $$Y$$, can be expressed by probabilistic constructs,

• and then we can see how that is related to regression and forecasting.

Consider a regression of house price on size

• price is our response, $$Y$$
• size is our explanatory variable, $$X$$

Our interest is in the distribution of $$Y$$, not alone (i.e., not marginally),

• but as $$X$$ varies (i.e., conditionally);
• because we think that $$X$$ might be useful for predicting $$Y$$.

## Conditional distributions

Regression is really about modeling the conditional distribution of $$Y$$ given $$X$$: $\mathbb{P}(Y \mid X).$

If we have data comprising $$(x_i, y_i)$$ pairs, $$i=1, \dots, n$$,

• where these quantities have been observed to co-occur,
• then we can use these to learn about $$\mathbb{P}(Y \mid X)$$.

We will find it useful to build a model for these conditional distributions,

• but for now let's just explore the potential intuitively.

A conditional distribution can be obtained by “slicing” the $$X$$-$$Y$$ point cloud.

• the marginal distribution ignores the slices.
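The "slicing" idea can be sketched with a small simulation. The numbers below are made up for illustration (the real house-price example lives in the course's `cndprice.R`), and the sketch is in Python rather than R:

```python
import random
import statistics

random.seed(42)

# Simulated (size, price) pairs: price depends linearly on size plus noise.
# These numbers are hypothetical, chosen only to mimic the house-price setup.
n = 500
size = [random.uniform(1.0, 3.0) for _ in range(n)]        # thousands of sq.ft.
price = [50 + 70 * s + random.gauss(0, 15) for s in size]  # thousands of dollars

# "Slice" the point cloud: condition on size falling in half-unit bins.
bins = [(1.0, 1.5), (1.5, 2.0), (2.0, 2.5), (2.5, 3.0)]
for lo, hi in bins:
    ys = [y for x, y in zip(size, price) if lo <= x < hi]
    print(f"size in [{lo}, {hi}): median price {statistics.median(ys):6.1f}, "
          f"sd {statistics.stdev(ys):5.1f}")

# The marginal distribution ignores the slices.
print(f"marginal:            median price {statistics.median(price):6.1f}, "
      f"sd {statistics.stdev(price):5.1f}")
```

Each slice plays the role of one boxplot in the conditional-versus-marginal comparison: the slice medians drift upward with size, while each slice's spread is much smaller than the marginal spread.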

## Conditional v. marginal distribution

The conditional distributions answer the forecasting problem:

• if I know that a house is between 1 and 1.5 thousand sq.ft.,
• then the conditional distribution (second boxplot) gives me a point forecast (the median) and a prediction interval.
• The conditional means/medians seem to line up along a “regression line”.
• The conditional distributions have much smaller dispersion than the marginal distribution.
• Apparently $\mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}.$

## When is $$X$$ useful for predicting $$Y$$?

This suggests two general points.

• If $$X$$ has no forecasting power, then the marginal and conditionals will be the same. In particular $\mathbb{V}\mathrm{ar}\{Y \mid X\} \approx \mathbb{V}\mathrm{ar}\{ Y\}$

• If $$X$$ has some forecasting information, then
• the conditional means will be different from the marginal or overall mean $\mathbb{E}\{Y \mid X\} \ne \mathbb{E}\{ Y\},$
• and the conditional variance of $$Y$$ given $$X$$ will be (substantially) less than the marginal variance of $$Y$$ $\mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}.$

Intuition from an example where $$X$$ has no predictive power.

Note the code supporting the house price example(s) is in cndprice.R.
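The no-predictive-power case is easy to simulate. Below is a minimal Python sketch (the course's own code is in R) using entirely made-up data where $$Y$$ is generated without reference to $$X$$:

```python
import random
import statistics

random.seed(1)

# Y generated independently of X, so X carries no forecasting information.
n = 2000
x = [random.uniform(0, 10) for _ in range(n)]
y = [random.gauss(5, 2) for _ in range(n)]   # ignores x entirely

marginal_var = statistics.variance(y)

# Conditional variances within X-slices should be about the same
# as the marginal variance: Var{Y|X} ~ Var{Y}.
for lo in range(0, 10, 2):
    ys = [yi for xi, yi in zip(x, y) if lo <= xi < lo + 2]
    print(f"x in [{lo}, {lo + 2}): var(Y|X) = {statistics.variance(ys):.2f}")

print(f"marginal var(Y) = {marginal_var:.2f}")
```

Every slice has roughly the same center and spread as the full point cloud, so conditioning on $$X$$ buys nothing.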

# Fitting lines to points

## A line-based conditional distribution

How do we get away from grouping the data artificially?

• We’re going to “fit” a line instead, a representation of $$Y \mid X$$ via $$y=ax+b$$,
• and check if the predictions from the line (representing our conditional distribution) lead to a reduction in variance compared to having no line,
• i.e., just using plain averages $$\bar{Y}$$, ignoring $$X$$.

There are three steps:

1. Finding a line that makes sense;
2. Arguing that it is a good line;
3. Deriving its statistical properties,
• and subsequently determining if $$X$$ is useful for predicting $$Y$$.
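As a preview of these steps, here is a Python sketch on made-up data. It jumps ahead by using the textbook least-squares formulas (slope $$= s_{xy}/s_x^2$$, intercept $$= \bar{y} - \text{slope} \cdot \bar{x}$$), which the covariance material below helps motivate:

```python
import random
import statistics

random.seed(7)

# Made-up (x, y) data with a genuine linear relationship plus noise.
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [2 + 3 * xi + random.gauss(0, 4) for xi in x]

# Step 1: find a sensible line via the textbook least-squares formulas.
xbar, ybar = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
a = sxy / statistics.variance(x)   # slope
b = ybar - a * xbar                # intercept

# Steps 2-3, informally: does the line reduce variance relative to
# just using the plain average ybar, ignoring x?
resid = [yi - (a * xi + b) for xi, yi in zip(x, y)]
print(f"fitted line: y = {a:.2f} x + {b:.2f}")
print(f"var of residuals: {statistics.variance(resid):.2f}  "
      f"vs marginal var(Y): {statistics.variance(y):.2f}")
```

The residual variance is far smaller than the marginal variance of $$Y$$, which is exactly the signal that $$X$$ is useful for predicting $$Y$$.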

## Covariance

First we need a little review on concepts relating two random quantities, $$X$$ and $$Y$$.

Recall the definition of covariance $\mathbb{C}\mathrm{ov}(X,Y) = \mathbb{E}\{(X - \mathbb{E}\{X\})(Y - \mathbb{E}\{Y\}) \}.$

• Linear combinations factor as $$\mathbb{C}\mathrm{ov}(aX+b,cY+d) = ac\mathbb{C}\mathrm{ov}(X,Y)$$.
• Variance is a special case: $$\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{C}\mathrm{ov}(Y,Y)$$,
• and $$\mathbb{V}\mathrm{ar}\{aY\} = a^2 \mathbb{V}\mathrm{ar}\{Y\}$$.

The sample analog is calculated as $s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$

Covariance measures how $$X$$ and $$Y$$ vary with each other around their means.
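The sample formula and the linear-combination property above can be checked directly in code. A Python sketch on simulated data (the shifts $$b$$ and $$d$$ cancel in the deviations from the means, so the identity holds exactly even for the sample version):

```python
import random
import statistics

random.seed(0)

def sample_cov(x, y):
    """Sample covariance: s_xy = (1/(n-1)) * sum (x_i - xbar)(y_i - ybar)."""
    n = len(x)
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

x = [random.gauss(0, 1) for _ in range(1000)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

s_xy = sample_cov(x, y)
print(f"s_xy = {s_xy:.3f}")

# Linear combinations factor: Cov(aX+b, cY+d) = a*c*Cov(X,Y).
a, b, c, d = 2.0, 5.0, -3.0, 1.0
lhs = sample_cov([a * xi + b for xi in x], [c * yi + d for yi in y])
print(f"Cov(aX+b, cY+d) = {lhs:.3f},  a*c*s_xy = {a * c * s_xy:.3f}")

# Variance as a special case: Cov(X, X) = Var(X).
print(f"Cov(X, X) = {sample_cov(x, x):.3f}, var(X) = {statistics.variance(x):.3f}")
```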

## Correlation

Correlation is standardized covariance: $\mathbb{C}\mathrm{orr}(X,Y) = \frac{\mathbb{C}\mathrm{ov}(X,Y)}{\sqrt{\mathbb{V}\mathrm{ar}\{X\} \mathbb{V}\mathrm{ar}\{Y\}}}.$

The correlation is scale-invariant (or scale-free), which means the units of measurement don’t matter:

• It is always true that $$-1 \leq \mathbb{C}\mathrm{orr}(X,Y) \leq 1$$.
• It gives the direction ($$+$$ or $$-$$) and the strength ($$0 \rightarrow 1$$) of a linear relationship between $$X$$ and $$Y$$.

The sample analog is $r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{1}{n-1} \sum_{i=1}^n \frac{(x_i - \bar{x})}{s_x} \frac{(y_i - \bar{y})}{s_y}.$
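The sample correlation and its scale invariance can likewise be verified numerically. A Python sketch on made-up data, with arbitrary unit conversions standing in for a change of measurement scale:

```python
import random
import statistics

random.seed(3)

x = [random.gauss(10, 2) for _ in range(500)]
y = [3 * xi + random.gauss(0, 5) for xi in x]

# r_xy = s_xy / (s_x * s_y)
n = len(x)
xbar, ybar = statistics.mean(x), statistics.mean(y)
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
r_xy = s_xy / (statistics.stdev(x) * statistics.stdev(y))
print(f"r_xy = {r_xy:.3f}")

# Scale invariance: rescaling the units (e.g. feet -> meters on x,
# a currency conversion on y) leaves the correlation unchanged.
x_m = [0.3048 * xi for xi in x]
y_e = [0.92 * yi for yi in y]
xm_bar, ye_bar = statistics.mean(x_m), statistics.mean(y_e)
s2 = sum((u - xm_bar) * (v - ye_bar) for u, v in zip(x_m, y_e)) / (n - 1)
r2 = s2 / (statistics.stdev(x_m) * statistics.stdev(y_e))
print(f"after unit changes: r = {r2:.3f}")
```

The two correlations agree to floating-point precision, and both necessarily fall in $$[-1, 1]$$.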