Least squares offers the most generic form of fitting for supervised learning.

We will focus on linear least squares to start out.

We will start with just two variables.


Our two variables will generically be called \(X\) and \(Y\).

\(Y\): output, dependent variable, response, outcome

\(X\): input, independent variable, explanatory variable, covariate, factor

Sometimes the words covariate and factor have special meaning, and sometimes people also refer to \(Y\) as a covariate.

Forecasting potential


As a bit of a warm-up, let's think about how relationships between two variables, \(X\) and \(Y\), can be expressed by probabilistic constructs,

  • and then we can see how that is related to regression and forecasting.

Consider a regression of house price on size

  • price is our response, \(Y\)
  • size is our explanatory variable, \(X\)

Our interest is in the distribution of \(Y\), not alone (i.e., not marginally),

  • but as \(X\) varies (i.e., conditionally);
  • because we think that \(X\) might be useful for predicting \(Y\).

Conditional distributions

Regression is really about modeling the conditional distribution of \(Y\) given \(X\). \[ \mathbb{P}(Y \mid X) \]

If we have data consisting of \((x_i, y_i)\) pairs, \(i=1, \dots, n\),

  • where each pair of values was observed together,
  • then we can use these to learn about \(\mathbb{P}(Y \mid X)\).

We will find it useful to build a model for these conditional distributions,

  • but for now let's just explore the potential intuitively.

A conditional distribution can be obtained by “slicing” the \(X\)-\(Y\) point cloud.

  • the marginal distribution ignores the slices.
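The slicing idea can be sketched in a few lines of Python. This is only an illustration: the data are simulated stand-ins (not the house data behind cndprice.R), and the names `size`, `price`, `edges`, and `slice_medians` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data (an assumption, not the real house data):
# size in thousands of sq. ft., price in thousands of dollars.
size = rng.uniform(0.5, 3.5, 500)
price = 40 + 70 * size + rng.normal(0, 20, 500)

# Slice the X axis into bins of width 0.5.
edges = np.arange(0.5, 4.0, 0.5)       # 0.5, 1.0, ..., 3.5
slice_id = np.digitize(size, edges)    # 1 for [0.5, 1.0), 2 for [1.0, 1.5), ...

# Summaries of Y within each slice approximate the conditional distributions.
slice_medians = {}
for k in range(1, len(edges)):
    y_slice = price[slice_id == k]
    slice_medians[k] = np.median(y_slice)
    print(f"size in [{edges[k-1]:.1f}, {edges[k]:.1f}): "
          f"n={y_slice.size}, median price={slice_medians[k]:.1f}")

# Ignoring the slices gives the marginal summary.
marginal_median = np.median(price)
print(f"all sizes: n={price.size}, median price={marginal_median:.1f}")
```

Each printed row summarizes one slice (one conditional distribution); the last row ignores the slices entirely.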

Conditional v. marginal distribution

The conditional distributions answer the forecasting problem:

  • if I know that a house is between 1 and 1.5 thousand sq.ft.,
    • then the conditional distribution (second boxplot) gives me a point forecast (the median) and a prediction interval.
  • The conditional means/medians seem to line up along a “regression line”.
  • The conditional distributions have much smaller dispersion than the marginal distribution.
  • Apparently \[ \mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}. \]
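To make the forecasting reading concrete, here is a minimal sketch that pulls a point forecast and a prediction interval out of a single slice, then compares the conditional and marginal variances. The data are simulated (an assumption), and the 90% level and the use of empirical quantiles for the interval are illustrative choices, not something fixed by the example above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in data (an assumption): size in thousands of sq. ft.,
# price in thousands of dollars.
size = rng.uniform(0.5, 3.5, 2000)
price = 40 + 70 * size + rng.normal(0, 20, 2000)

# Condition on the house being between 1 and 1.5 thousand sq. ft.
in_slice = (size >= 1.0) & (size < 1.5)
y_slice = price[in_slice]

# Point forecast: the conditional median.
point_forecast = np.median(y_slice)

# Prediction interval: empirical 5% and 95% quantiles of the slice.
lo, hi = np.quantile(y_slice, [0.05, 0.95])

# Dispersion within the slice versus dispersion overall.
var_conditional = np.var(y_slice, ddof=1)
var_marginal = np.var(price, ddof=1)

print(f"forecast {point_forecast:.1f}, 90% interval [{lo:.1f}, {hi:.1f}]")
print(f"Var(Y | slice) = {var_conditional:.1f}, Var(Y) = {var_marginal:.1f}")
```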

When is \(X\) useful for predicting \(Y\)?

This suggests two general points.

  • If \(X\) has no forecasting power, then the marginal and conditionals will be the same. In particular \[ \mathbb{V}\mathrm{ar}\{Y \mid X\} \approx \mathbb{V}\mathrm{ar}\{ Y\} \]

  • If \(X\) has some forecasting information, then
    • the conditional means will be different from the marginal or overall mean \[ \mathbb{E}\{Y \mid X\} \ne \mathbb{E}\{ Y\}, \]
    • and the conditional variance of \(Y\) given \(X\) will be (substantially) less than the marginal variance of \(Y\) \[ \mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}. \]
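The two points are linked by the law of total variance, \(\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{E}\{\mathbb{V}\mathrm{ar}\{Y \mid X\}\} + \mathbb{V}\mathrm{ar}\{\mathbb{E}\{Y \mid X\}\}\): the more the conditional means move with \(X\), the less variance is left within each slice. A small simulation contrasts the two cases. Everything here is an assumption made for illustration: the simulated variables, the slice \([0.8, 1)\), and the helper `slice_summary`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

x = rng.uniform(0, 1, n)
y_unrelated = rng.normal(0, 1, n)             # generated independently of x
y_related = 3 * x + rng.normal(0, 0.25, n)    # depends on x

def slice_summary(x, y, lo, hi):
    """Mean and sample variance of y among points with lo <= x < hi."""
    y_slice = y[(x >= lo) & (x < hi)]
    return y_slice.mean(), y_slice.var(ddof=1)

results = {}
for label, y in [("unrelated", y_unrelated), ("related", y_related)]:
    cond_mean, cond_var = slice_summary(x, y, 0.8, 1.0)
    results[label] = {
        "cond_mean": cond_mean,
        "marg_mean": y.mean(),
        "var_ratio": cond_var / y.var(ddof=1),  # Var(Y | slice) / Var(Y)
    }
    print(label, results[label])
```

A variance ratio near 1 with matching means is the "no forecasting power" case; a ratio well below 1 with a shifted conditional mean is the informative case.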

Intuition from an example where \(X\) has no predictive power.

Note the code supporting the house price example(s) is in cndprice.R.

Fitting lines to points

A line-based conditional distribution

How do we get away from grouping the data artificially?

  • We’re going to “fit” a line instead, a representation of \(Y \mid X\) via \(y=ax+b\),
  • and check if the predictions from the line (representing our conditional distribution) lead to a reduction in variance compared to having no line,
    • i.e., just using plain averages \(\bar{Y}\), ignoring \(X\).

There are three steps:

  1. Finding a line that makes sense;
  2. Arguing that it is a good line;
  3. Deriving its statistical properties,
    • and subsequently determining if \(X\) is useful for predicting \(Y\).
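As a preview of where these steps lead, the sketch below fits a line and checks whether it reduces the spread of the prediction errors relative to plain averaging. How to find a good line has not been covered yet, so the call to `np.polyfit` (an ordinary least-squares fit) is a stand-in assumption, as are the simulated data and all variable names.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated stand-in data (an assumption): size in thousands of sq. ft.,
# price in thousands of dollars.
size = rng.uniform(0.5, 3.5, 200)
price = 40 + 70 * size + rng.normal(0, 20, 200)

# Fit y = a*x + b; np.polyfit returns the highest-degree coefficient first,
# so the slope a comes before the intercept b.
a, b = np.polyfit(size, price, 1)

# Prediction errors using the line versus using the plain average of Y.
resid_line = price - (a * size + b)
resid_mean = price - price.mean()

# Mean squared error under each predictor.
mse_line = np.mean(resid_line ** 2)
mse_mean = np.mean(resid_mean ** 2)

print(f"fitted line: y = {a:.1f} x + {b:.1f}")
print(f"mean squared error, line: {mse_line:.1f}; plain average: {mse_mean:.1f}")
```

If \(X\) carried no information about \(Y\), the two mean squared errors would be nearly equal.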


First we need a little review on concepts relating two random quantities, \(X\) and \(Y\).

Recall the definition of covariance \[ \mathbb{C}\mathrm{ov}(X,Y) = \mathbb{E}\{(X - \mathbb{E}\{X\})(Y - \mathbb{E}\{Y\}) \}. \]

  • Linear combinations factor as \(\mathbb{C}\mathrm{ov}(aX+b,cY+d) = ac\mathbb{C}\mathrm{ov}(X,Y)\).
  • Variance is a special case: \(\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{C}\mathrm{ov}(Y,Y)\),
  • and \(\mathbb{V}\mathrm{ar}\{aY\} = a^2 \mathbb{V}\mathrm{ar}\{Y\}\).
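These identities also hold exactly for the sample versions, which makes them easy to check numerically. A minimal sketch, assuming simulated data and arbitrary constants `a`, `b`, `c`, `d`; the helper `cov` is just a thin wrapper around `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

a, b, c, d = 2.0, 5.0, -3.0, 1.0   # arbitrary constants for the check

def cov(u, v):
    """Sample covariance (n - 1 denominator) from the 2x2 matrix of np.cov."""
    return np.cov(u, v)[0, 1]

# Cov(aX + b, cY + d) = a c Cov(X, Y): shifts drop out, scales multiply.
lhs = cov(a * x + b, c * y + d)
rhs = a * c * cov(x, y)
print(lhs, rhs)

# Var(Y) = Cov(Y, Y) and Var(aY) = a^2 Var(Y).
var_y = np.var(y, ddof=1)
print(cov(y, y), var_y)
print(np.var(a * y, ddof=1), a ** 2 * var_y)
```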

The sample analog of the covariance is calculated as \[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}). \]

Covariance measures how \(X\) and \(Y\) vary with each other around their means.
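A tiny worked example of the sample covariance, with made-up numbers chosen so the arithmetic can be followed by hand. The deviations from the means are \((-2,-1,0,1,2)\) for \(x\) and \((-2,0,1,0,1)\) for \(y\), so the products sum to 6 and \(s_{xy} = 6/4 = 1.5\).

```python
import numpy as np

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = x.size

# Direct translation of the formula with the n - 1 denominator.
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
print(s_xy)   # 1.5

# np.cov uses the same n - 1 denominator by default.
s_xy_numpy = np.cov(x, y)[0, 1]
print(s_xy_numpy)
```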


Correlation is standardized covariance: \[ \mathbb{C}\mathrm{orr}(X,Y) = \frac{\mathbb{C}\mathrm{ov}(X,Y)}{\sqrt{\mathbb{V}\mathrm{ar}\{X\} \mathbb{V}\mathrm{ar}\{Y\}}}. \]

The correlation is scale-invariant (or scale-free), which means the units of measurement don’t matter:

  • It is always true that \(-1 \leq \mathbb{C}\mathrm{orr}(X,Y) \leq 1\).
  • Its sign (\(+\) or \(-\)) gives the direction, and its magnitude (from 0 to 1) gives the strength, of a linear relationship between \(X\) and \(Y\).

The sample analog is \[ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{1}{n-1} \sum_{i=1}^n \frac{(x_i - \bar{x})}{s_x} \frac{(y_i - \bar{y})}{s_y}, \] where \(s_x\) and \(s_y\) are the sample standard deviations of the \(x_i\) and \(y_i\).
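The sample correlation and its scale invariance can be checked in the same way. This is a sketch on simulated data (an assumption); the unit changes (thousands of sq. ft. to sq. ft., thousands of dollars to dollars) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated stand-in data (an assumption): size in thousands of sq. ft.,
# price in thousands of dollars.
size = rng.uniform(0.5, 3.5, 300)
price = 40 + 70 * size + rng.normal(0, 20, 300)

# r_xy = s_xy / (s_x * s_y), all with the n - 1 denominator.
s_xy = np.cov(size, price)[0, 1]
s_x = np.std(size, ddof=1)
s_y = np.std(price, ddof=1)
r_xy = s_xy / (s_x * s_y)

# The same quantity from NumPy's correlation matrix.
r_numpy = np.corrcoef(size, price)[0, 1]

# Changing units rescales the covariance but leaves the correlation unchanged.
size_sqft = 1000 * size
price_dollars = 1000 * price
s_xy_rescaled = np.cov(size_sqft, price_dollars)[0, 1]
r_rescaled = np.corrcoef(size_sqft, price_dollars)[0, 1]

print(f"r = {r_xy:.4f}, numpy: {r_numpy:.4f}, after unit change: {r_rescaled:.4f}")
print(f"covariance before: {s_xy:.1f}, after unit change: {s_xy_rescaled:.1f}")
```

The covariance grows by a factor of \(10^6\) under this change of units, while the correlation does not move.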