# Least squares regression

Consider paired data $$(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$$, and suppose we plot these points on $$(x,y)$$ axes and ask: what is the best-fitting line capturing the pattern in these data?

• The characterization of that line is referred to as a regression of $$Y$$ on $$X$$.

The regression of $$Y$$ on $$X$$ is a linear regression if the regression equation is of the form

$\mathbb{E}\{Y \mid X = x\} = \alpha + \beta x$

• for some unknown constants $$\alpha$$, called the $$y$$-intercept, and $$\beta$$, called the slope.

How do we estimate those coefficients from data?

• Among all possible values $$a$$ for $$\alpha$$ and $$b$$ for $$\beta$$, we need a criterion that judges the goodness of fit of the line those estimates prescribe, relative to the data we have observed.
• And we need that criterion to be not only intuitive, but also easy to optimize and to have nice statistical properties.

Most reasonable criteria involve measuring the amount by which the fitted value

• obtained from the line for each point in the data, $$\hat{Y}_i = a + b X_i$$,

differs from the observed value of the response in the data, $$Y_i$$.

• This amount is called the residual: $$e_i = Y_i - \hat{Y}_i$$.
• The residual is the discrepancy between the fitted and observed values.
• Good lines produce small residuals.

A reasonable goal is to minimize the size of all residuals.

• If they were all zero we would have a perfect line.
• There is a trade-off between moving closer to some points and at the same time moving away from others.

Since some residuals are positive and some are negative, we need one more ingredient.

• $$|e_i|$$ treats positives and negatives equally.
• So does $$e_i^2$$, which is easier to work with mathematically.

The method of least squares chooses $$a$$ and $$b$$ to minimize the sum of squared residuals:

$\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - [a + b X_i])^2.$

The solution, obtained using calculus, was derived in class. The results are provided below.

\begin{aligned} a &= \bar{Y} - b \bar{X} \\ b &= \frac{n \sum_{i=1}^n X_i Y_i - \left( \sum_{i=1}^n X_i\right)\left( \sum_{i=1}^n Y_i \right) }{ n \sum_{i=1}^n X_i^2 - \left(\sum_{i=1}^n X_i \right)^2} \\ & = \frac{\sum_{i=1}^n (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^n (\bar{X} - X_i)^2} = \frac{\mathbb{C}\mathrm{ov}(X, Y)}{\mathbb{V}\mathrm{ar}(X)} = \mathbb{C}\mathrm{or}(X,Y) \frac{\mathrm{sd}(Y)}{\mathrm{sd}(X)}. \end{aligned}

• Observe that the slope is (linear) correlation, times rise over run (just like in middle school).
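Since the expressions for $$b$$ above are all algebraically equivalent, they are easy to check numerically. Here is a small sketch, on simulated data of our own choosing (not the employee data used later), confirming that they agree with each other and with R's lm:

```r
## sanity check, on simulated data, that the slope formulas above agree
set.seed(42)
xx <- rnorm(20)
yy <- 1 + 2*xx + rnorm(20, sd=0.5)
n <- length(xx)
## the raw-sums formula for b
b.sums <- (n*sum(xx*yy) - sum(xx)*sum(yy)) / (n*sum(xx^2) - sum(xx)^2)
b.cov  <- cov(xx, yy)/var(xx)          ## Cov(X,Y)/Var(X)
b.cor  <- cor(xx, yy)*sd(yy)/sd(xx)    ## Cor(X,Y) * sd(Y)/sd(X)
a.hat  <- mean(yy) - b.cov*mean(xx)    ## the intercept
c(b.sums, b.cov, b.cor)                ## all three agree
c(a.hat, coef(lm(yy ~ xx)))            ## and match lm's intercept and slope
```

Notice that the sample covariance and variance each carry an $$n-1$$ denominator that cancels in the ratio, which is why the raw-sums and moment-based versions coincide exactly.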

## Testing the slope

Therefore, tests of correlation can be used to test hypotheses about the slope, $$\beta$$. As long as we are willing to assume that the residual $$Y - \mathbb{E}\{Y|X\}$$ is independent of $$X$$, then a Spearman rank correlation test can be used on the pairs $$(X_i, U_i)$$, where

$U_i = Y_i - \beta_0 X_i, \quad i=1,\dots, n,$

for a particular value $$\beta_0$$ of $$\beta$$ as specified by the null hypothesis.

• Curiously, the test does not directly involve a calculation of the least squares estimate $$b$$, described above.

A two-tailed test may be prescribed as

\begin{aligned} \mathcal{H}_0 &: \; \beta = \beta_0 \\ \mathcal{H}_1 &: \; \beta \ne \beta_0. \end{aligned}

• The test may be conducted by calculating $$\rho$$ on the $$(X_i, U_i)$$ pairs above, and rejecting if the $$p$$-value of the corresponding two-tailed Spearman rank correlation test is below the specified level, e.g., $$0.05$$.

A lower-tailed test is

\begin{aligned} \mathcal{H}_0 &: \; \beta \geq \beta_0 \\ \mathcal{H}_1 &: \; \beta < \beta_0. \end{aligned}

• The test may be conducted by calculating $$\rho$$ on the $$(X_i, U_i)$$ pairs above, and rejecting if the $$p$$-value of the corresponding lower-tailed Spearman rank correlation test is below the specified level, e.g., $$0.05$$.

Similarly, an upper-tailed test involves

\begin{aligned} \mathcal{H}_0 &: \; \beta \leq \beta_0 \\ \mathcal{H}_1 &: \; \beta > \beta_0. \end{aligned}

• The test may be conducted by calculating $$\rho$$ on the $$(X_i, U_i)$$ pairs above, and rejecting if the $$p$$-value of the corresponding upper-tailed Spearman rank correlation test is below the specified level, e.g., $$0.05$$.

### Employee benefits (redux, redux)

Metro officials gathered aggregate data summarizing the number of employees who receive transit benefits and the average price the employer spends per employee on those benefits. It is claimed that an employer saves an average of 0.25 dollars per employee registered. At the 5% level, what can we conclude?

emp <- data.frame(number=c(173, 149, 124, 64, 88, 113, 142, 27, 39, 51),
price=c(2.14, 2.39, 2.19, 2.56, 2.44, 2.29, 2.18, 2.55, 2.32, 2.27))

Here is how R automates the least squares analysis.

fit <- lm(price ~ number, data=emp)
a <- as.numeric(coef(fit)[1])
b <- as.numeric(coef(fit)[2])
c(intercept=a, slope=b)
##    intercept        slope
##  2.514580295 -0.001871962

But we can calculate it ourselves using the formulas above.

• First the slope.
x <- emp$number
y <- emp$price
cor(x,y)*sd(y)/sd(x)
## [1] -0.001871962
• Then the intercept.
mean(y) - b*mean(x)
## [1] 2.51458

So the fitted values are given by

$\hat{y} = 2.51 - 0.001871 x$

• and the interpretation is that the average price drops by 0.001871 (dollars) per additional employee registered.

Let's take a look at a scatter plot with the best-fitting line on it.

plot(x, y, main="Bulk Discount?", xlab="number", ylab="price")
abline(fit)
legend("topright", c("LS line"), lty=1)

Now we wish to test that $$\beta = - 0.25$$, whereas the evidence above (via $$b = -0.001871$$) is that $$\beta$$ is much smaller than that in absolute value.

• However, note that this test does not involve the least squares $$b$$-value we calculated above.

First calculate the $$U_i$$ values.

beta <- - 0.25
u <- y - beta*x
u
##  [1] 45.39 39.64 33.19 18.56 24.44 30.54 37.68  9.30 12.07 15.02

Now calculate Spearman’s-$$\rho$$ on the resulting $$(X_i, U_i)$$ pairs.

n <- length(x)
rx <- rank(x)
ru <- rank(u)
numer <- sum(rx*ru) - n*((n+1)/2)^2
denomx <- sqrt(sum(rx^2) - n*((n+1)/2)^2)
denomy <- sqrt(sum(ru^2) - n*((n+1)/2)^2)
rho <- numer/(denomx*denomy)
rho
## [1] 1
• Or more simply as:
cor(x, u, method="spearman")
## [1] 1
• Obviously with a correlation estimated to be 1 we can't possibly accept the null hypothesis, but we'll carry on with the calculations anyhow.

Now let's calculate the $$p$$-value.

library(SuppDists)
2*min(pSpearman(rho, n), 1-pSpearman(rho, n))
## [1] 5.511464e-07
• Or more simply as:
cor.test(x, u, method="spearman")
##
##  Spearman's rank correlation rho
##
## data:  x and u
## S = 3.6637e-14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
##   1
• with essentially the same conclusion: $$\rho = 1$$ and a vanishingly small $$p$$-value.

## Confidence interval for the slope

Whereas tests on the slope usually use Spearman’s $$\rho$$, a typical confidence interval uses Kendall’s $$\tau$$.

• Again, the CI does not directly involve a calculation of the least squares slope parameter, $$b$$.

For each pair of points $$(X_i, Y_i)$$ and $$(X_j, Y_j)$$, such that $$i < j$$ and $$X_i \ne X_j$$, compute the “two-point slope”,

$S_{ij} = \frac{Y_j - Y_i}{X_j - X_i}.$

• Recall that the pair is called concordant when $$S_{ij} > 0$$ and discordant when $$S_{ij} < 0$$; when $$S_{ij} = 0$$ we count it as half concordant and half discordant.
• But we don’t actually need to count the number of concordant and discordant pairs in this setup.

Let $$N$$ be the number of slopes computed, i.e., ignoring pairs with $$i \geq j$$ and ties in the $$X$$s. Order the slopes obtained and let

$S^{(1)} \leq S^{(2)} \leq \cdots \leq S^{(N)}$

denote the ordered slopes.

For a $$1- \alpha$$ confidence level, find $$w_{1-\alpha/2}$$, the $$1-\alpha/2$$ quantile of $$T = N_c - N_d$$, the difference between the number of concordant and discordant pairs,

• using Table A11 in the text, or
• via $$\tau = (N_c - N_d)/(N_c + N_d)$$ using the qKendall function in the SuppDists library in R.

Let $$r$$ and $$s$$ be given by

\begin{aligned} r &= \frac{1}{2} (N - w_{1-\alpha/2}) \\ s &= \frac{1}{2} (N + w_{1-\alpha/2}) + 1 = N + 1 - r. \end{aligned}

Round $$r$$ down and $$s$$ up to the next integer if they are not already integers. The $$1-\alpha$$ confidence interval for $$\beta$$ is given by the interval $$(S^{(r)}, S^{(s)})$$. That is,

$\mathbb{P}( S^{(r)} < \beta < S^{(s)}) \geq 1-\alpha.$

### Employee benefits (confidence interval)

How about a confidence interval for the slope?

• (Again, this will not directly involve the calculated least squares $$b$$-value.)

First calculate the “two-point slopes”.

S <- outer(y,y,'-')/outer(x,x,'-')
S <- as.numeric(S[upper.tri(S)])
S
##  [1] -0.0104166667 -0.0010204082  0.0080000000 -0.0038532110 -0.0020000000
##  [6] -0.0061666667 -0.0035294118 -0.0008196721 -0.0069444444 -0.0050000000
## [11] -0.0025000000  0.0027777778 -0.0090909091 -0.0055102041 -0.0060000000
## [16] -0.0012903226  0.0300000000 -0.0005555556 -0.0048717949 -0.0048148148
## [21] -0.0037931034 -0.0028082192 -0.0013114754 -0.0037113402  0.0002702703
## [26] -0.0018032787 -0.0030232558 -0.0032173913 -0.0013432836  0.0006363636
## [31] -0.0015294118  0.0096000000  0.0024489796 -0.0004054054 -0.0013592233
## [36] -0.0191666667 -0.0010655738  0.0012244898 -0.0010958904  0.0223076923
## [41]  0.0045945946  0.0003225806 -0.0009890110 -0.0116666667 -0.0041666667
• Let's check that we have the right number of these.
c(length(S), choose(n, 2), length(unique(S)))
## [1] 45 45 45
• It is easy to see that we have no ties in $$S$$, above. If we did we would have to be more careful. (See the bespoke library function below.)
• There are also no duplicate $$X$$-values, otherwise we would have Inf’s in the $$S$$-values.

Now we can sort them.

S <- sort(S)

Now calculate $$r$$ and $$s$$ from the Kendall-$$\tau$$ null distribution.

N <- length(S)
w <- qKendall(0.975, n)*length(S)  ## the extra multiplication is to put it back on the T=Nc-Nd scale
r <- floor(0.5*(N-w))
s <- ceiling(N+1-r)
c(r,s)
## [1] 12 34

And then we can grab those values from the sorted $$S$$. The 95% confidence interval for $$\beta$$ is:

c(S[r], S[s])
## [1] -0.0041666667 -0.0004054054

Is our estimate of $$b$$ inside?

b
## [1] -0.001871962
• Yes, it is!
• And the CI doesn’t include zero, so we would reject the hypothesis that $$\beta = 0$$ at the 5% level.

## Reporting nonparametric estimates of the coefficients

To actually report nonparametric estimates of the coefficients $$\alpha$$ and $$\beta$$,

• as opposed to their least squares estimates $$a$$ and $$b$$,

the Kendall’s $$\tau$$ calculations above suggest taking the sample median analogue:

\begin{aligned} b_1 &= \mathrm{median}\{ S_{ij} \} \\ a_1 &= \mathrm{median}\{ Y_i \} - b_1 \, \mathrm{median}\{ X_i \}. \end{aligned}

• These estimates will be less sensitive to outlying $$Y$$-values “pulled” by the square in the least squares criteria.

### Employee benefits (nonparametric estimates)

Here is our nonparametric estimate of the slope.

b1 <- median(S)
b1
## [1] -0.001529412
• Note that this lies comfortably inside the interval we found above.
c(S[r], S[s])
## [1] -0.0041666667 -0.0004054054

How does the line with that slope compare to the one we calculated via least-squares above?

• It makes sense to calculate the intercept using medians too.
a1 <- median(y) - b1 * median(x)
a1
## [1] 2.458706
plot(x, y, main="Bulk Discount?", xlab="number", ylab="price")
abline(fit)
abline(a1, b1, col=2, lty=2)
legend("topright", c("LS line", "NP line"), lty=1:2, col=1:2)

• Not too different, really.

## Library

There are no libraries that I know of which perform precisely these computations. But it would be pretty easy to make our own.

First, consider testing hypotheses.

np.slope.test <- function(x, y, beta0, alternative=c("two.sided", "less", "greater"), ...)
{
  alternative <- match.arg(alternative)
  u <- y - beta0*x
  phi <- cor.test(x, u, alternative=alternative, method="spearman", ...)$p.value
  rho <- cor(x, u, method="spearman")
  return(list(beta0=beta0, rho=rho, alternative=alternative, p.value=phi))
}

• The "..." argument allows arguments like exact and continuity to be passed through to cor.test if desired.
• Note that since we are using library subroutines, the results won't be exactly the same as if we did the calculation "by hand", following the description above or in the text.

Let's try it out.

np.slope.test(x, y, -0.25)
## $beta0
## [1] -0.25
##
## $rho
## [1] 1
##
## $alternative
## [1] "two.sided"
##
## $p.value
## [1] 0

• Perfect.

How about for the confidence interval?

np.slope.CI <- function(x, y, alpha=0.05)
{
  S <- outer(y,y,'-')/outer(x,x,'-')
  S <- as.numeric(S[upper.tri(S)])
  NcPNd <- length(S)                 ## calculate before removing ties
  S <- S[is.finite(S)]               ## for ties in the Xs
  S <- sort(S)
  n <- length(x)
  w <- qKendall(1-alpha/2, n)*NcPNd  ## back on the T=Nc-Nd scale
  N <- length(S)
  r <- floor(0.5*(N-w))
  s <- ceiling(N+1-r)
  beta.hat <- median(S)
  return(list(interval=c(S[r], S[s]), rs=c(r,s), level=1-alpha, beta.hat=beta.hat))
}

• Note how the code above is careful about ties in the $$X$$s.

Let's try it out.

np.slope.CI(x, y)
## $interval
## [1] -0.0041666667 -0.0004054054
##
## $rs
## [1] 12 34
##
## $level
## [1] 0.95
##
## \$beta.hat
## [1] -0.001529412
• Nice.