## Goals

Least squares offers the most generic form of fitting for supervised learning.

We will focus on linear least squares to start out.

• It provides a conceptually simple method for investigating a functional relationship between one or more factors and an outcome of interest.

• The relationship is expressed in the form of an equation, our model, connecting the response or dependent variable to one or more explanatory variables.

• Don’t worry, we will get fancy in due course.

## Vocabulary

Our two variables will generically be called $$X$$ and $$Y$$.

$$Y$$: output, dependent variable, response, outcome

$$X$$: input, independent variable, explanatory variable, covariate, factor

Sometimes the words covariate and factor have special meaning, and sometimes people also refer to $$Y$$ as a covariate.

• Covariate generally means a variable that “correlates” with some other variable.
• The term factor is often used for a categorical variable,
• where we talk about a factor taking on levels.
• At least that’s how it is in R.

# Forecasting potential

## Warm-up

As a bit of a warm-up, let’s think about how relationships between two variables, $$X$$ and $$Y$$, can be expressed by probabilistic constructs,

• and then we can see how that is related to regression and forecasting.

Consider a regression of house price on size

• price is our response, $$Y$$
• size is our explanatory variable, $$X$$

Our interest is in the distribution of $$Y$$, not alone (i.e., not marginally),

• but as $$X$$ varies (i.e., conditionally);
• because we think that $$X$$ might be useful for predicting $$Y$$.

## Conditional distributions

Regression is really about modeling the conditional distribution of $$Y$$ given $$X$$: $\mathbb{P}(Y \mid X).$

If we have data consisting of $$(x_i, y_i)$$ pairs, $$i=1, \dots, n$$,

• where these quantities have been observed to co-occur,
• then we can use these to learn about $$\mathbb{P}(Y \mid X)$$.

We will find it useful to build a model for these conditional distributions,

• but for now let’s just explore the potential intuitively.

A conditional distribution can be obtained by “slicing” the $$X$$-$$Y$$ point cloud.

• the marginal distribution ignores the slices.
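The slicing idea can be sketched in a few lines of Python. This is only an illustration: the data are made up, the slice width of 0.5 (thousand sq. ft.) is an arbitrary choice, and the names `slice_index`, `slices` and `cond_median` are hypothetical, not taken from cndprice.R.

```python
import statistics

# Hypothetical (size in thousand sq. ft., price in $1000s) pairs, invented for illustration.
sizes = [0.8, 0.9, 1.1, 1.2, 1.4, 1.6, 1.7, 1.9, 2.1, 2.3, 2.4, 2.8]
prices = [70, 82, 95, 101, 118, 130, 128, 150, 160, 171, 180, 210]

WIDTH = 0.5  # assumed slice width; any other binning would do


def slice_index(x, width=WIDTH):
    """Index k of the half-open slice [k*width, (k+1)*width) that contains x."""
    return int(x // width)


# Group the y-values by the slice their x-value falls in.
slices = {}
for x, y in zip(sizes, prices):
    slices.setdefault(slice_index(x), []).append(y)

# Marginal summary: ignores the slices entirely.
print("marginal median:", statistics.median(prices))

# Conditional summaries: one median and one (min, max) range per slice.
cond_median = {k: statistics.median(ys) for k, ys in slices.items()}
for k in sorted(slices):
    lo, hi = k * WIDTH, (k + 1) * WIDTH
    ys = slices[k]
    print(f"size in [{lo}, {hi}): n={len(ys)}, "
          f"median={cond_median[k]}, range=({min(ys)}, {max(ys)})")
```

Each slice gives its own summary of $$Y$$, which is a crude estimate of the conditional distribution for that range of $$X$$; the marginal median pools all slices together.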

## Conditional v. marginal distribution

The conditional distributions answer the forecasting problem:

• if I know that a house is between 1 and 1.5 thousand sq.ft.,
• then the conditional distribution (second boxplot) gives me a point forecast (the median) and a prediction interval.
• The conditional means/medians seem to line up along a “regression line”.
• The conditional distributions have much smaller dispersion than the marginal distribution.
• Apparently $\mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}.$

## When is $$X$$ useful for predicting $$Y$$?

This suggests two general points.

• If $$X$$ has no forecasting power, then the marginal and conditionals will be the same. In particular $\mathbb{V}\mathrm{ar}\{Y \mid X\} \approx \mathbb{V}\mathrm{ar}\{ Y\}.$

• If $$X$$ has some forecasting information, then
• the conditional means will be different from the marginal or overall mean $\mathbb{E}\{Y \mid X\} \ne \mathbb{E}\{ Y\},$
• and the conditional variance of $$Y$$ given $$X$$ will be (substantially) less than the marginal variance of $$Y$$ $\mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}.$
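The two points above can be checked numerically with a small Python sketch. Everything here is an assumption made for illustration: the same twelve made-up $$Y$$ values are arranged into three slices of $$X$$ in two different ways, and the helper `within_slice_variance` is a hypothetical name. The variance uses the divisor $$n$$ (`statistics.pvariance`) so that the within-slice and marginal figures are directly comparable.

```python
import statistics


def within_slice_variance(groups):
    """Average of the variances computed inside each (equal-sized) slice."""
    return statistics.mean(statistics.pvariance(g) for g in groups)


# Same twelve y-values in both cases, so the marginal variance is identical.
# Informative X: each slice holds similar y-values, and slice means differ.
informative = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
# Uninformative X: every slice spans the full range and has the same mean.
uninformative = [[10, 33, 21, 22], [11, 32, 20, 23], [12, 13, 30, 31]]

all_y = [y for g in informative for y in g]
marginal = statistics.pvariance(all_y)

within_informative = within_slice_variance(informative)
within_uninformative = within_slice_variance(uninformative)

print("marginal variance:          ", marginal)
print("within-slice, informative:  ", within_informative)
print("within-slice, uninformative:", within_uninformative)
```

With the informative arrangement the slice means differ and the within-slice variance is a small fraction of the marginal variance. With the uninformative arrangement all slice means equal the overall mean, and the average within-slice variance matches the marginal variance.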

Intuition from an example where $$X$$ has no predictive power.

Note the code supporting the house price example(s) is in cndprice.R.

# Fitting lines to points

## A line-based conditional distribution

How do we get away from grouping the data artificially?

• We’re going to “fit” a line instead, a representation of $$Y \mid X$$ via $$y=ax+b$$,
• and check if the predictions from the line (representing our conditional distribution) lead to a reduction in variance compared to having no line,
• i.e., just using plain averages $$\bar{Y}$$, ignoring $$X$$.

There are three steps:

1. Finding a line that makes sense;
2. Arguing that it is a good line;
3. Deriving its statistical properties,
• and subsequently determining if $$X$$ is useful for predicting $$Y$$.
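As a rough preview of the first two steps, the Python sketch below fits a line and compares the squared error around it with the squared error around $$\bar{Y}$$. The data are made up, and the least-squares choice of $$a$$ and $$b$$ is assumed here rather than derived; step 3, the statistical properties, is not covered.

```python
import statistics

# Hypothetical (x, y) data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
xbar, ybar = statistics.mean(x), statistics.mean(y)

# Step 1: pick a line y = a*x + b. The least-squares slope and intercept
# are ASSUMED here; the text has not yet argued for this choice.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
sxx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
a = sxy / sxx
b = ybar - a * xbar

# Step 2: compare squared error around the line with squared error
# around the plain average, which ignores x.
ss_line = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
ss_mean = sum((yi - ybar) ** 2 for yi in y)

print(f"line: y = {a:.3f} x + {b:.3f}")
print(f"sum of squares around the line: {ss_line:.3f}")
print(f"sum of squares around ybar:     {ss_mean:.3f}")
```

If the sum of squares around the line is much smaller than the sum of squares around $$\bar{Y}$$, the line is capturing variation in $$Y$$ that the plain average cannot.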

## Covariance

First we need a little review on concepts relating two random quantities, $$X$$ and $$Y$$.

Recall the definition of covariance $\mathbb{C}\mathrm{ov}(X,Y) = \mathbb{E}\{(X - \mathbb{E}\{X\})(Y - \mathbb{E}\{Y\}) \}.$

• Linear combinations factor as $$\mathbb{C}\mathrm{ov}(aX+b,cY+d) = ac\mathbb{C}\mathrm{ov}(X,Y)$$.
• Variance is a special case: $$\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{C}\mathrm{ov}(Y,Y)$$,
• and $$\mathbb{V}\mathrm{ar}\{aY\} = a^2 \mathbb{V}\mathrm{ar}\{Y\}$$.

The sample analog is calculated as $s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$

Covariance measures how $$X$$ and $$Y$$ vary with each other around their means.
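The sample covariance and the linear-combination rule can be checked with a short Python sketch. The data and the constants `a`, `b`, `c`, `d` are arbitrary values chosen for illustration, and `sample_cov` is a hypothetical helper written directly from the formula for $$s_{xy}$$.

```python
import statistics


def sample_cov(xs, ys):
    """Sample covariance s_xy with the 1/(n-1) divisor."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / (n - 1)


# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

# Cov(aX + b, cY + d) should equal a*c*Cov(X, Y); the shifts b and d drop out.
a, b, c, d = 2.0, 5.0, -3.0, 1.0
lhs = sample_cov([a * xi + b for xi in x], [c * yi + d for yi in y])
rhs = a * c * sample_cov(x, y)
print("Cov(aX+b, cY+d):", lhs)
print("a*c*Cov(X, Y):  ", rhs)

# Variance as a special case: Cov(Y, Y) equals the sample variance of Y.
print("Cov(Y, Y):", sample_cov(y, y), " Var(Y):", statistics.variance(y))
```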

## Correlation

Correlation is standardized covariance: $\mathbb{C}\mathrm{orr}(X,Y) = \frac{\mathbb{C}\mathrm{ov}(X,Y)}{\sqrt{\mathbb{V}\mathrm{ar}\{X\} \mathbb{V}\mathrm{ar}\{Y\}}}.$

The correlation is scale-invariant (or scale-free), which means the units of measurement don’t matter.

• It is always true that $$-1 \leq \mathbb{C}\mathrm{orr}(X,Y) \leq 1$$.
• It gives the direction ($$+$$ or $$-$$) and the strength ($$0 \rightarrow 1$$) of a linear relationship between $$X$$ and $$Y$$.

The sample analog is $r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{1}{n-1} \sum_{i=1}^n \frac{(x_i - \bar{x})}{s_x} \frac{(y_i - \bar{y})}{s_y}.$
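Scale invariance can be seen in a small Python sketch. The house data are made up, and `sample_corr` is a hypothetical helper implementing $$r_{xy} = s_{xy} / (s_x s_y)$$. Changing the units of either variable leaves the correlation unchanged, while flipping the sign of one variable flips the sign of the correlation.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance s_xy with the 1/(n-1) divisor."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / (n - 1)


def sample_corr(xs, ys):
    """Sample correlation r_xy = s_xy / (s_x * s_y)."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


# Hypothetical data: size in thousand sq. ft., price in $1000s.
size_k = [0.8, 1.2, 1.6, 2.1, 2.8]
price_k = [70.0, 101.0, 130.0, 160.0, 210.0]

# The same houses measured in sq. ft. and in dollars.
size_sqft = [1000.0 * s for s in size_k]
price_usd = [1000.0 * p for p in price_k]

r = sample_corr(size_k, price_k)
r_rescaled = sample_corr(size_sqft, price_usd)
r_flipped = sample_corr(size_k, [-p for p in price_k])

print("r in original units:", r)
print("r after rescaling:  ", r_rescaled)
print("r with Y negated:   ", r_flipped)
```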