Least squares offers the most generic form of fitting for supervised learning.

We will focus on linear least squares to start out.

  • It provides a conceptually simple method for investigating a functional relationship between one or more factors and an outcome of interest.

  • The relationship is expressed in the form of an equation, our model, connecting the response or dependent variable to one or more explanatory variables.

We will start with just two variables.

  • Don't worry, we will get fancy in due course.


Our two variables will generically be called \(X\) and \(Y\).

\(Y\): output, dependent variable, response, outcome

\(X\): input, independent variable, explanatory variable, covariate, factor

Sometimes the words covariate and factor have special meaning, and sometimes people also refer to \(Y\) as a covariate.

  • Covariate generally means a variable that "correlates" with some other variable.
  • The term factor can be for a categorical variable,
    • where we talk about a factor taking on levels.
    • At least that's how it is in R.

Forecasting potential


As a bit of a warm-up, let's think about how relationships between two variables, \(X\) and \(Y\), can be expressed by probabilistic constructs,

  • and then we can see how that is related to regression and forecasting.

Consider a regression of house price on size

  • price is our response, \(Y\)
  • size is our explanatory variable, \(X\)

Our interest is in the distribution of \(Y\), not alone (i.e., not marginally),

  • but as \(X\) varies (i.e., conditionally);
  • because we think that \(X\) might be useful for predicting \(Y\).

Conditional distributions

Regression is really about modeling the conditional distribution of \(Y\) given \(X\). \[ \mathbb{P}(Y \mid X) \]

If we have data comprised of \((x_i, y_i)\) pairs, \(i=1, \dots, n\),

  • where these quantities have been observed to co-occur,
  • then we can use these to learn about \(\mathbb{P}(Y \mid X)\).

We will find it useful to build a model for these conditional distributions,

  • but for now let's just explore the potential intuitively.

A conditional distribution can be obtained by "slicing" the \(X\)-\(Y\) point cloud.

  • the marginal distribution ignores the slices.
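A minimal sketch of slicing, using invented size/price numbers (the real example's code is in cndprice.R):

```r
## simulated sizes (Ksqft) and prices (K$); numbers are invented
## purely to illustrate slicing, not taken from the real data
set.seed(1)
size <- runif(200, 0.5, 3.5)
price <- 35 + 40*size + rnorm(200, sd=15)

## conditional: slice the X axis and look at Y within each slice
slices <- cut(size, breaks=seq(0.5, 3.5, by=0.5))
boxplot(price ~ slices, xlab="size slice", ylab="price")

## marginal: one boxplot, ignoring the slices
boxplot(price, ylab="price")
```

Each sliced boxplot is much tighter than the marginal one, which is the whole point of conditioning.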

Conditional v. marginal distribution

The conditional distributions answer the forecasting problem:

  • if I know that a house is between 1 and 1.5 thousand sq.ft.,
    • then the conditional distribution (second boxplot) gives me a point forecast (the median) and a prediction interval.
  • The conditional means/medians seem to line up along a "regression line".
  • The conditional distributions have much smaller dispersion than the marginal distribution.
  • Apparently \[ \mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}. \]

When is \(X\) useful for predicting \(Y\)?

This suggests two general points.

  • If \(X\) has no forecasting power, then the marginal and conditionals will be the same. In particular \[ \mathbb{V}\mathrm{ar}\{Y \mid X\} \approx \mathbb{V}\mathrm{ar}\{ Y\} \]

  • If \(X\) has some forecasting information, then
    • the conditional means will be different than the marginal or overall mean \[ \mathbb{E}\{Y \mid X\} \ne \mathbb{E}\{ Y\}, \]
    • and the conditional variance of \(Y\) given \(X\) will be (substantially) less than the marginal variance of \(Y\) \[ \mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}. \]

Intuition from an example where \(X\) has no predictive power.
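Since the data behind that example live in cndprice.R, here is a hedged stand-in simulation in which \(Y\) is generated with no reference to \(X\) at all:

```r
## simulated stand-in: Y is drawn without looking at X
set.seed(2)
x <- runif(500, 0, 10)
y <- rnorm(500, mean=100, sd=15)

slices <- cut(x, breaks=5)
cond.var <- tapply(y, slices, var)

round(cond.var / var(y), 2)   # all ratios near 1: slicing buys nothing
```

Here the conditional and marginal variances roughly agree, as the first bullet above predicts.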

Note the code supporting the house price example(s) is in cndprice.R.

Fitting lines to points

A line-based conditional distribution

How do we get away from grouping the data artificially?

  • We're going to "fit" a line instead, a representation of \(Y \mid X\) via \(y=ax+b\),
  • and check if the predictions from the line (representing our conditional distribution) lead to a reduction in variance compared to having no line,
    • i.e., just using plain averages \(\bar{Y}\), ignoring \(X\).

There are three steps

  1. Finding a line that makes sense;
  2. Arguing that it is a good line;
  3. Deriving its statistical properties,
    • and subsequently determining if \(X\) is useful for predicting \(Y\).


First we need a little review on concepts relating two random quantities, \(X\) and \(Y\).

Recall the definition of covariance \[ \mathbb{C}\mathrm{ov}(X,Y) = \mathbb{E}\{(X - \mathbb{E}\{X\})(Y - \mathbb{E}\{Y\}) \}. \]

  • Linear combinations factor as \(\mathbb{C}\mathrm{ov}(aX+b,cY+d) = ac\mathbb{C}\mathrm{ov}(X,Y)\).
  • Variance is a special case: \(\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{C}\mathrm{ov}(Y,Y)\),
  • and \(\mathbb{V}\mathrm{ar}\{aY\} = a^2 \mathbb{V}\mathrm{ar}\{Y\}\).

The sample analog is calculated as \[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}). \]

Covariance measures how \(X\) and \(Y\) vary with each other around their means.
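As a sanity check, the sample formula can be computed by hand and compared with R's built-in cov() (toy numbers, chosen only for illustration):

```r
## toy numbers, purely for illustration
x <- c(1, 2, 4, 5, 7)
y <- c(2, 4, 5, 4, 8)
n <- length(x)

s.xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
s.xy
## [1] 4.65
cov(x, y)   # built-in agrees
## [1] 4.65
```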


Correlation is standardized covariance: \[ \mathbb{C}\mathrm{orr}(X,Y) = \frac{\mathbb{C}\mathrm{ov}(X,Y)}{\sqrt{\mathbb{V}\mathrm{ar}\{X\} \mathbb{V}\mathrm{ar}\{Y\}}}. \]

The correlation is scale-invariant (or scale-free), which means the units of measurement don't matter:

  • It is always true that \(-1 \leq \mathbb{C}\mathrm{orr}(X,Y) \leq 1\).
  • It gives the direction (\(+\) or \(-\)) and the strength (\(0 \rightarrow 1\)) of a linear relationship between \(X\) and \(Y\).

The sample analog is \[ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{1}{n-1} \sum_{i=1}^n \frac{(x_i - \bar{x})}{s_x} \frac{(y_i - \bar{y})}{s_y}. \]
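Scale invariance is easy to verify numerically; any positive rescaling (and shifting) of \(X\) or \(Y\) leaves the correlation alone. The data here are invented just for the check:

```r
## invented data; rescaling X or Y leaves r unchanged
set.seed(3)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

r <- cor(x, y)
r.scaled <- cor(1000 * x + 32, 0.5 * y - 7)   # change both "units"
c(r, r.scaled)   # same number (up to rounding)
```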

Linear correlation

Correlation only measures linear relationships.

  • \(\mathbb{C}\mathrm{orr}(X,Y) = 0\) does not mean the variables are unrelated!

See corr.R.
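One small example in that spirit: a perfect, purely quadratic relationship can have correlation of essentially zero.

```r
x <- seq(-1, 1, length=101)   # symmetric about zero
y <- x^2                      # perfect (but nonlinear) relationship

cor(x, y)   # essentially zero: there is no *linear* trend to measure
```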

Correlation and regression

"Imagine" that \(Y = b_0 + b_1 X + e\),

  • where \(e\) is some "idiosyncratic noise" that is independent of \(X\).

Now, using \(\mathbb{C}\mathrm{ov}(X, b_0) = 0\) and \(\mathbb{C}\mathrm{ov}(X, e) = 0\), \[ \begin{aligned} \mathbb{C}\mathrm{ov}(X,Y) &= \mathbb{C}\mathrm{ov}(X,b_0 + b_1 X + e) \\ &= \mathbb{C}\mathrm{ov}(X, b_1 X) \\ &= b_1 \mathbb{V}\mathrm{ar}\{X\}, \\ \mbox{so } \quad b_1 &= \frac{\mathbb{C}\mathrm{ov}(X,Y)}{\mathbb{V}\mathrm{ar}\{X\}}. \\ \mbox{In sample terms, } \quad b_1 &= \frac{s_{xy}}{s_x^2}. \end{aligned} \]

  • Remember that for later.

Slope and correlation

Here is a more "interpretable" version in terms of correlations: \[ \begin{aligned} \mathbb{C}\mathrm{orr}(X,Y) &= b_1 \frac{\sqrt{\mathbb{V}\mathrm{ar}\{X\}}}{\sqrt{\mathbb{V}\mathrm{ar}\{Y\}}}. \\ \mbox{In sample terms, } \quad r_{xy} &= b_1 \frac{s_x}{s_y} \quad \mbox{ so } \quad b_1 = r_{xy} \frac{s_y}{s_x}. \end{aligned} \]

Apparently, the slope estimated from noisy data pairs \((x_1,y_1), \dots, (x_n, y_n)\) is: \[ b_1 = r_{xy} \frac{s_y}{s_x} = \mbox{corr } \times \frac{\mathrm{rise}}{\mathrm{run}}. \]

  • or correlation times "units \(Y\)" per "units \(X\)".
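The two slope formulas are algebraically identical, and a quick numerical check on invented pairs confirms it:

```r
## invented pairs, just to verify the algebra numerically
set.seed(4)
x <- rnorm(100, mean=5)
y <- 3 + 2 * x + rnorm(100)

b1.cov <- cov(x, y) / var(x)          # Cov(X,Y)/Var{X}
b1.cor <- cor(x, y) * sd(y) / sd(x)   # r_xy * s_y / s_x
c(b1.cov, b1.cor)                     # the two formulas agree
```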

Wage data


  • Greenberg and Kosters, 1970, Rand Corporation.
  • 39 demographic groupings of 6000 households with the male head earning less than $15,000 annually in 1966.


Goal:

  • Estimate the relationship between pay and labor supply.
  • Use this information to influence social policy decisions and the debate on a guaranteed national wage.

Possible solution:

  • Fit a linear relationship summarizing effect of pay on labor supply for the working poor.

Estimating the slope in R

Read in the data and change to \(X\) and \(Y\) for convenience.

D <- read.csv("wages.csv")
X <- D$RATE  ## pay rate (column name assumed)
Y <- D$HRS   ## hours worked

Use correlation to calculate the slope of the line

b1 <- cor(X,Y) * sd(Y) / sd(X)
## [1] 80.93679
  • So every extra dollar per hour of pay is associated with \(\sim 81\) more hours of work.

Visualizing the data

It is probably best to start with a scatterplot of the data, but the slides looked prettier this way.

plot(X, Y, xlab="rate", ylab="hours")

Intercept of the line

We have the slope, what about the intercept?

  • The intercept coefficient \(b_0\) determines where the line crosses the \(y\)-axis.
  • That's hard to visualize, especially when the data are far from the \(y\)-axis.
  • Is there another point we'd like the line to go through?
    • What about the middle of the cloud of data: \((\bar{x}, \bar{y})\)?

Solve, \[ \begin{aligned} \bar{y} &= b_0 + b_1 \bar{x} \\ \mbox{so } \quad b_0 &= \bar{y} - b_1 \bar{x}. \end{aligned} \]

b0 <- mean(Y) - b1 * mean(X)
## [1] 1913.009

Visualizing the line

plot(X, Y, xlab="rate", ylab="hours")
abline(b0, b1, col=2)


Here is a summary of what we know.

  • Rather than a smattering of points to stare at, we can predict

\[ \mathrm{hours} = 1913 + 81 \times \mathrm{rate} + e. \]
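Point forecasts from this line are just arithmetic (with \(e\) set to its expected value of zero); for example, at an assumed pay rate of $3 per hour, using the rounded coefficients above:

```r
b0 <- 1913; b1 <- 81   # rounded estimates from the wage fit above
rate <- 3              # an assumed pay rate, in dollars per hour
b0 + b1 * rate         # point forecast of hours worked
## [1] 2156
```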

Some questions

  • Is this a good line?
  • Does it hold outside the range of the data?
  • Is pay rate the only variable of importance?

To answer these questions we'll need to work a little harder.

The best fitting line

"Housing data"

For the next several slides we're going to use some toy data with a house pricing "story", just to fix ideas.

size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)
data.frame(size=size, price=price)[1:10,]
##    size price
## 1   0.8    70
## 2   0.9    83
## 3   1.0    74
## 4   1.1    93
## 5   1.4    89
## 6   1.4    58
## 7   1.5    85
## 8   1.6   114
## 9   1.8    95
## 10  2.0   100


Before you do anything, plot the data in the \(X \times Y\) plane.

plot(size, price, pch=20)

Linear relationship

My eyes tell me that a linear relationship is plausible

  • as size goes up, price goes up.

And I can even guess what that relationship might be.

  • I used the "eyeball" method, and came up with \[ b_0 = 35 \quad \mbox{ and } \quad b_1 = 40. \]

  • And once I'm happy with those values I can use them to make a prediction.

For example, given a house with size \(x = 2.2\) (thousand sqft), the predicted price would be \[ \hat{y}(x) = 35 + 40 \times 2.2 = 123 \quad \mbox{ (thousand dollars). } \]

  • \(b_1\) must have units that convert "Ksqft" into "K$"
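The same prediction in R, using the eyeballed coefficients:

```r
b0 <- 35; b1 <- 40             # the eyeballed coefficients
yhat <- function(x) b0 + b1 * x
yhat(2.2)                      # price forecast (K$) for a 2.2 Ksqft house
## [1] 123
```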

Pretty good fit

My eyeballs are pretty good.

plot(size, price, pch=20)
abline(35, 40, col=2)

What is a good line?

Can we do better than the eyeball method?

We desire a strategy for estimating the slope and intercept parameters in the model \[ \hat{y} = b_0 + b_1 x. \]

That involves

  • choosing a criterion,
    • i.e., quantifying how good a line is relative to the data;
  • and matching that criterion with a solution,
    • i.e., finding the best line subject to that criterion.

Fitted values and residuals

Although there are lots of ways to choose a criterion

  • only a small handful lead to solutions that are "easy" to compute
  • and which have nice statistical properties.

Most reasonable criteria involve measuring the amount by which the fitted value

  • obtained from the line for each point in the data, \(\hat{y}_i = b_0 + b_1 x_i\)

differs from the observed value of the response in the data, \(y_i\).

  • This amount is called the residual: \(e_i = y_i - \hat{y}_i\).
  • Good lines produce small residuals.

The residual \(e_i\) is the discrepancy between the fitted \(\hat{y}_i\) and observed \(y_i\) values.

  • Note that we can write \(y_i = \hat{y}_i + (y_i - \hat{y}_i) = \hat{y}_i + e_i\).
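Using the toy housing data and the eyeballed line from earlier, the fitted values, residuals, and the decomposition look like:

```r
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)

yhat <- 35 + 40 * size    # fitted values from the eyeballed line
e <- price - yhat         # residuals

all(price == yhat + e)    # the decomposition is exact
## [1] TRUE
```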

Least squares

A reasonable goal is to minimize the size of all residuals:

  • If they were all zero we would have a perfect line.
  • There is a trade-off: moving the line closer to some points necessarily moves it away from others.

Since some residuals are positive and some are negative, we need one more ingredient.

  • \(|e_i|\) treats positives and negatives equally.
  • So does \(e_i^2\), which is easier to work with mathematically.

The method of least squares chooses \(b_0\) and \(b_1\) to minimize \(\sum_{i=1}^n e_i^2\).
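To see the criterion in action, we can compare the sum of squared residuals for the eyeball line against R's lm(), which computes exactly this least squares solution:

```r
size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,1.8,2,2.4,2.5,2.7,3.2,3.5)
price <- c(70,83,74,93,89,58,85,114,95,100,138,111,124,161,172)

## sum of squared residuals for a candidate line
sse <- function(b0, b1) sum((price - (b0 + b1 * size))^2)

sse(35, 40)                       # the eyeball line
fit <- lm(price ~ size)           # lm() minimizes this criterion
sse(coef(fit)[1], coef(fit)[2])   # necessarily no larger
```

By construction, no choice of \((b_0, b_1)\) can beat the lm() fit on this criterion.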