Department of Statistics, Virginia Tech

Least squares offers the most generic form of fitting for supervised learning.

We will focus on **linear least squares** to start out.

It provides a conceptually simple method for investigating a functional relationship between one or more **factors** and an **outcome** of interest. The relationship is expressed in the form of an equation, our **model**, connecting the **response** or **dependent variable** to one or more **explanatory variables**.

We will start with just **two** variables.

- Don’t worry, we will get fancy in due course.

Our two variables will generically be called \(X\) and \(Y\).

\(Y\): output, dependent variable, response, outcome

\(X\): input, independent variable, explanatory variable, covariate, factor

Sometimes the words **covariate** and **factor** have special meaning, and sometimes people also refer to \(Y\) as a covariate.

- Covariate generally means a variable that “correlates” with some other variable.
- The term factor can be used for a categorical variable,
  - where we talk about a factor taking on **levels**.
  - At least that’s how it is in R.

As a bit of a warm-up, let’s think about how relationships between two variables, \(X\) and \(Y\), can be expressed by probabilistic constructs,

- and then we can see how that is related to regression and forecasting.

Consider a regression of house **price** on **size**

- price is our response, \(Y\)
- size is our explanatory variable, \(X\)

Our interest is in the distribution of \(Y\), not alone (i.e., not **marginally**),

- but as \(X\) varies (i.e., **conditionally**);
- because we think that \(X\) might be useful for predicting \(Y\).

Regression is really about modeling the conditional distribution of \(Y\) given \(X\). \[ \mathbb{P}(Y \mid X) \]

If we have data comprised of \((x_i, y_i)\) pairs, \(i=1, \dots, n\),

- where these quantities have been observed to co-occur,
- then we can use these to learn about \(\mathbb{P}(Y \mid X)\).

We will find it useful to build a model for these conditional distributions,

- but for now let’s just explore the potential intuitively.

A **conditional distribution** can be obtained by “slicing” the \(X\)-\(Y\) point cloud.

- The **marginal** distribution ignores the slices.
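A minimal sketch of the slicing idea, using a handful of made-up (size, price) pairs rather than the data behind the house price example. The `houses` values, the `slice_by_size` helper, and the 0.5 bin width are all assumptions chosen for illustration.

```python
import math
import statistics
from collections import defaultdict

# Hypothetical (size, price) pairs, invented purely for illustration:
# size in thousands of sq.ft., price in thousands of dollars.
houses = [
    (0.8, 70), (0.9, 80), (1.1, 95), (1.3, 105), (1.4, 110), (1.6, 120),
    (1.8, 140), (1.9, 135), (2.2, 160), (2.4, 170), (2.6, 190), (2.9, 200),
]
width = 0.5  # assumed slice width, in thousands of sq.ft.


def slice_by_size(pairs, width):
    """Group prices by the size bin [lower, lower + width) they fall in."""
    slices = defaultdict(list)
    for size, price in pairs:
        lower = math.floor(size / width) * width
        slices[lower].append(price)
    return dict(sorted(slices.items()))


# Marginal view: ignore size entirely.
prices = [price for _, price in houses]
marginal_var = statistics.variance(prices)
print(f"marginal: median={statistics.median(prices)}, var={marginal_var:.1f}")

# Conditional view: one distribution of price per slice of size.
slices = slice_by_size(houses, width)
for lower, sliced in slices.items():
    print(f"size in [{lower}, {lower + width}): "
          f"median={statistics.median(sliced)}, "
          f"var={statistics.variance(sliced):.1f}")
```

Each slice plays the role of one boxplot: its median is a point forecast for houses in that size range, and its spread can be compared with the spread of all prices taken together.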

The conditional distributions answer the forecasting problem:

- if I know that a house is between 1 and 1.5 thousand sq.ft.,
- then the conditional distribution (second boxplot) gives me a point forecast (the median) and a prediction interval.

- The conditional means/medians seem to line up along a “regression line”.
- The conditional distributions have much smaller dispersion than the marginal distribution.
- Apparently \[ \mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}. \]

This suggests two general points.

- If \(X\) has no forecasting power, then the marginal and conditional distributions will be the same. In particular, \[ \mathbb{V}\mathrm{ar}\{Y \mid X\} \approx \mathbb{V}\mathrm{ar}\{ Y\}. \]
- If \(X\) has some forecasting information, then
  - the conditional means will differ from the marginal or overall mean, \[ \mathbb{E}\{Y \mid X\} \ne \mathbb{E}\{ Y\}, \]
  - and the conditional variance of \(Y\) given \(X\) will be (substantially) less than the marginal variance of \(Y\), \[ \mathbb{V}\mathrm{ar}\{Y \mid X\} \ll \mathbb{V}\mathrm{ar}\{ Y\}. \]
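One way to tie these two points together is the law of total variance, a standard identity that is not stated in the notes themselves but is consistent with them:

\[
\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{E}\{\mathbb{V}\mathrm{ar}\{Y \mid X\}\} + \mathbb{V}\mathrm{ar}\{\mathbb{E}\{Y \mid X\}\}.
\]

If the conditional mean \(\mathbb{E}\{Y \mid X\}\) does not change with \(X\), the second term is zero and the average conditional variance equals the marginal variance. The more the conditional means move with \(X\), the larger the second term, and the smaller the conditional variance must be on average. Note that the comparison holds on average over \(X\), not necessarily for every individual value of \(X\).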

Intuition from an example where \(X\) has no predictive power.

*Note the code supporting the house price example(s) is in cndprice.R.*

How do we get away from grouping the data artificially?

- We’re going to “fit” a line instead, a representation of \(Y \mid X\) via \(y=ax+b\),
- and check if the predictions from the line (representing our conditional distribution) lead to a reduction in variance compared to having no line,
- i.e., just using plain averages \(\bar{Y}\), ignoring \(X\).
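The comparison described above can be sketched as follows. The data are made up for illustration, and the line is chosen by the usual least squares formulas for the slope and intercept, which is an assumption here about how the line will eventually be picked.

```python
import statistics

# Hypothetical data, invented for illustration only:
# size in thousands of sq.ft., price in thousands of dollars.
sizes = [0.8, 0.9, 1.1, 1.3, 1.4, 1.6, 1.8, 1.9, 2.2, 2.4, 2.6, 2.9]
prices = [70, 80, 95, 105, 110, 120, 140, 135, 160, 170, 190, 200]

x_bar = statistics.mean(sizes)
y_bar = statistics.mean(prices)

# Least squares slope a and intercept b for y = a*x + b (assumed choice of line).
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices))
sum_xx = sum((x - x_bar) ** 2 for x in sizes)
a = sum_xy / sum_xx
b = y_bar - a * x_bar

# Spread around the line versus spread around the plain average.
# Both use the n - 1 divisor so the two numbers are directly comparable.
residuals = [y - (a * x + b) for x, y in zip(sizes, prices)]
var_around_line = sum(r ** 2 for r in residuals) / (len(prices) - 1)
var_around_mean = statistics.variance(prices)

print(f"line: y = {a:.2f} x + {b:.2f}")
print(f"variance around the mean: {var_around_mean:.1f}")
print(f"variance around the line: {var_around_line:.1f}")
```

If the line carries information about \(Y\), the variance around the line comes out well below the variance around \(\bar{Y}\).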

There are three steps:

- Finding a line that makes sense;
- Arguing that it is a good line;
- Deriving its statistical properties,
  - and subsequently determining if \(X\) is useful for predicting \(Y\).

First we need a little review on concepts relating two random quantities, \(X\) and \(Y\).

Recall the definition of **covariance** \[
\mathbb{C}\mathrm{ov}(X,Y) = \mathbb{E}\{(X - \mathbb{E}\{X\})(Y - \mathbb{E}\{Y\}) \}.
\]

- Linear combinations factor as \(\mathbb{C}\mathrm{ov}(aX+b,cY+d) = ac\mathbb{C}\mathrm{ov}(X,Y)\).
- Variance is a special case: \(\mathbb{V}\mathrm{ar}\{Y\} = \mathbb{C}\mathrm{ov}(Y,Y)\),
- and \(\mathbb{V}\mathrm{ar}\{aY\} = a^2 \mathbb{V}\mathrm{ar}\{Y\}\).
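The linear-combination rule follows directly from the definition, because the shifts \(b\) and \(d\) cancel when each variable is centered at its mean. A short sketch of the steps:

\[
\begin{aligned}
\mathbb{C}\mathrm{ov}(aX+b,cY+d)
&= \mathbb{E}\{(aX + b - \mathbb{E}\{aX+b\})(cY + d - \mathbb{E}\{cY+d\})\} \\
&= \mathbb{E}\{a(X - \mathbb{E}\{X\}) \, c(Y - \mathbb{E}\{Y\})\} \\
&= ac \, \mathbb{C}\mathrm{ov}(X,Y).
\end{aligned}
\]

Taking both arguments to be \(aY\) gives \(\mathbb{V}\mathrm{ar}\{aY\} = \mathbb{C}\mathrm{ov}(aY,aY) = a^2 \mathbb{V}\mathrm{ar}\{Y\}\).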

The sample analog is calculated as \[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}). \]
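A minimal sketch of the sample covariance, with a numerical check of the linear-combination rule. The helper name `sample_cov` and the toy data are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def sample_cov(xs, ys):
    """Sample covariance s_xy with the 1/(n-1) normalization."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical toy data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_cov(x, y)
print(s_xy)  # 1.5

# Shifting and rescaling: Cov(aX + b, cY + d) = a * c * Cov(X, Y).
a, b, c, d = 2, 1, -3, 4
shifted = sample_cov([a * xi + b for xi in x], [c * yi + d for yi in y])
print(shifted, a * c * s_xy)  # -9.0 -9.0
```

Here \(\bar{x} = 3\) and \(\bar{y} = 4\), the products of deviations sum to 6, and dividing by \(n - 1 = 4\) gives 1.5.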

Covariance measures how \(X\) and \(Y\) vary with each other around their means.

**Correlation** is standardized covariance: \[
\mathbb{C}\mathrm{orr}(X,Y) = \frac{\mathbb{C}\mathrm{ov}(X,Y)}{\sqrt{\mathbb{V}\mathrm{ar}\{X\} \mathbb{V}\mathrm{ar}\{Y\}}}.
\]

The correlation is scale-invariant (or scale-free), which means the units of measurement don’t matter:

- It is always true that \(-1 \leq \mathbb{C}\mathrm{orr}(X,Y) \leq 1\).
- It gives the direction (\(+\) or \(-\)) and the strength (\(0 \rightarrow 1\)) of a **linear** relationship between \(X\) and \(Y\).

The sample analog is \[ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{1}{n-1} \sum_{i=1}^n \frac{(x_i - \bar{x})}{s_x} \frac{(y_i - \bar{y})}{s_y}. \]
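A small sketch of the sample correlation, with a check that changing the units of measurement leaves it unchanged. The helper names `sample_cov` and `sample_corr`, the toy data, and the rescaling constants are assumptions for illustration.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance s_xy with the 1/(n-1) normalization."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_corr(xs, ys):
    """Sample correlation r_xy = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_cov(xs, xs))  # sample standard deviation of x
    s_y = math.sqrt(sample_cov(ys, ys))  # sample standard deviation of y
    return sample_cov(xs, ys) / (s_x * s_y)


# Hypothetical toy data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = sample_corr(x, y)

# Change of units: multiply x by 1000 and y by 0.001.
r_rescaled = sample_corr([1000 * xi for xi in x], [0.001 * yi for yi in y])

# Flipping the sign of one variable flips the direction of the relationship.
r_flipped = sample_corr(x, [-yi for yi in y])

print(round(r, 4), round(r_rescaled, 4), round(r_flipped, 4))
```

Positive rescaling leaves \(r_{xy}\) unchanged, while multiplying one variable by a negative constant flips its sign but not its magnitude.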