Department of Statistics, Virginia Tech

Consider paired data \((X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\) and suppose we plotted them on \((x,y)\) axes and asked: what is the best-fitting line capturing the pattern in these data?

- The characterization of that line is referred to as a **regression** of \(Y\) on \(X\).

The regression of \(Y\) on \(X\) is a **linear regression** if the regression equation is of the form

\[ \mathbb{E}\{Y \mid X = x\} = \alpha + \beta x \]

- for some unknown constants \(\alpha\), called the \(y\)-*intercept*, and \(\beta\), called the *slope*.

How do we estimate those coefficients from data?

- Among all possible values of \(a\) for \(\alpha\) and \(b\) for \(\beta\) we need a **criterion** which judges the goodness-of-fit of the line those estimates prescribe, relative to the data we have observed.

- And we need that criterion to be not only intuitive, but also easy to optimize, with nice statistical properties.

Most reasonable criteria involve measuring the amount by which the **fitted value**

- obtained from the line for each point in the data, \(\hat{Y}_i = a + b X_i\),

differs from the observed value of the response in the data, \(Y_i\).

- This amount is called the **residual**: \(e_i = Y_i - \hat{Y}_i\).
- The residual is the discrepancy between the fitted and observed values.
- Good lines produce small residuals.

A reasonable goal is to minimize the size of *all* residuals.

- If they were all zero we would have a perfect line.
- There is a trade-off between moving closer to some points and at the same time moving away from others.

Since some residuals are positive and some are negative, we need one more ingredient.

- \(|e_i|\) treats positives and negatives equally.
- So does \(e_i^2\), which is easier to work with mathematically.

The method of **least squares** chooses \(a\) and \(b\) to minimize the sum of squared residuals:

\[ \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - [a + b X_i])^2. \]

The solution, obtained using calculus, was derived in class. The results are provided below.

\[ \begin{aligned} a &= \bar{Y} - b \bar{X} \\ b &= \frac{n \sum_{i=1}^n X_i Y_i - \left( \sum_{i=1}^n X_i\right)\left( \sum_{i=1}^n Y_i \right) }{ n \sum_{i=1}^n X_i^2 - \left(\sum_{i=1}^n X_i \right)^2} \\ &= \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\mathbb{C}\mathrm{ov}(X, Y)}{\mathbb{V}\mathrm{ar}(X)} = \mathbb{C}\mathrm{or}(X,Y) \frac{\mathrm{sd}(Y)}{\mathrm{sd}(X)}. \end{aligned} \]

- Observe that the slope is (linear) correlation, times rise over run (just like in middle school).
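As a quick sanity check (not part of the original derivation), we can minimize the sum of squared residuals numerically with `optim` and compare against the closed form above; the data here are simulated purely for illustration.

```r
## sanity check: numerically minimize the sum of squared residuals
## (simulated data; any paired sample would do)
set.seed(1)
X <- runif(20, 0, 10)
Y <- 1 + 2*X + rnorm(20)
sse <- function(par) sum((Y - (par[1] + par[2]*X))^2)
num <- optim(c(0, 0), sse)$par
## closed-form least squares solution from the formulas above
b <- cor(X, Y)*sd(Y)/sd(X)
a <- mean(Y) - b*mean(X)
rbind(numerical=num, closed.form=c(a, b))
```

The two rows should agree to several decimal places, up to the tolerance of the numerical optimizer.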

Therefore, tests of correlation can be used to test hypotheses about the slope, \(\beta\). As long as we are willing to assume that the residual \(Y - \mathbb{E}\{Y|X\}\) is independent of \(X\), the Spearman rank correlation test can be used on pairs \((X_i, U_i)\) where

\[ U_i = Y_i - \beta_0 X_i, \quad i=1,\dots, n, \]

for a particular value \(\beta_0\) of \(\beta\) as specified by the null hypothesis.

- Curiously, the test does not directly involve a calculation of the least squares estimate \(b\), described above.

A **two-tailed** test may be prescribed as

\[ \begin{aligned} \mathcal{H}_0 &: \; \beta = \beta_0 \\ \mathcal{H}_1 &: \; \beta \ne \beta_0. \end{aligned} \]

- The test may be conducted by calculating \(\rho\) on the \((X_i, U_i)\) pairs above, and rejecting if the \(p\)-value of the corresponding two-tailed Spearman rank correlation test is below the specified level, e.g., \(0.05\).

A **lower-tailed** test is

\[ \begin{aligned} \mathcal{H}_0 &: \; \beta \geq \beta_0 \\ \mathcal{H}_1 &: \; \beta < \beta_0. \end{aligned} \]

- The test may be conducted by calculating \(\rho\) on the \((X_i, U_i)\) pairs above, and rejecting if the \(p\)-value of the corresponding lower-tailed Spearman rank correlation test is below the specified level, e.g., \(0.05\).

Similarly, an **upper-tailed** test involves

\[ \begin{aligned} \mathcal{H}_0 &: \; \beta \leq \beta_0 \\ \mathcal{H}_1 &: \; \beta > \beta_0. \end{aligned} \]

- The test may be conducted by calculating \(\rho\) on the \((X_i, U_i)\) pairs above, and rejecting if the \(p\)-value of the corresponding upper-tailed Spearman rank correlation test is below the specified level, e.g., \(0.05\).

Metro officials gathered aggregate data summarizing, for each employer, the number of employees receiving transit benefits and the average price the employer pays per employee for those benefits. It is claimed that an employer saves an average of 0.25 (dollars) per employee registered. At a 5% level, what can we conclude?

```
emp <- data.frame(number=c(173, 149, 124, 64, 88, 113, 142, 27, 39, 51),
                  price=c(2.14, 2.39, 2.19, 2.56, 2.44, 2.29, 2.18, 2.55, 2.32, 2.27))
```

Here is how R automates the least squares analysis.

```
fit <- lm(price ~ number, data=emp)
a <- as.numeric(coef(fit)[1])
b <- as.numeric(coef(fit)[2])
c(intercept=a, slope=b)
```

```
## intercept slope
## 2.514580295 -0.001871962
```

But we can calculate it ourselves using the formulas above.

- First the slope.

```
x <- emp$number
y <- emp$price
cor(x,y)*sd(y)/sd(x)
```

`## [1] -0.001871962`

- Then the intercept.

`mean(y) - b*mean(x)`

`## [1] 2.51458`

So the fitted values are given by

\[ \hat{y} = 2.5146 - 0.001872\, x \]

- and the interpretation is that the average price drops by about 0.001872 dollars for each additional employee registered.
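To make that interpretation concrete, the fitted line can also be used for prediction. Re-fitting on the same data, the predicted average price for a hypothetical employer with 100 registered employees works out to about \$2.33:

```r
## predict from the least squares fit (self-contained refit of the same data)
emp <- data.frame(number=c(173, 149, 124, 64, 88, 113, 142, 27, 39, 51),
                  price=c(2.14, 2.39, 2.19, 2.56, 2.44, 2.29, 2.18, 2.55, 2.32, 2.27))
fit <- lm(price ~ number, data=emp)
predict(fit, newdata=data.frame(number=100))  ## about 2.33
```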

Let's take a look at a scatter plot with the best-fitting line on it.

```
plot(x, y, main="Bulk Discount?", xlab="number", ylab="price")
abline(fit)
legend("topright", c("LS line"), lty=1)
```

Now we wish to test that \(\beta = -0.25\), whereas the evidence above (via \(b = -0.001872\)) is that \(\beta\) is much smaller than that in absolute value.

- However, note that this test does not involve the least squares \(b\)-value we calculated above.

First calculate the \(U_i\) values.

```
beta <- - 0.25
u <- y - beta*x
u
```

`## [1] 45.39 39.64 33.19 18.56 24.44 30.54 37.68 9.30 12.07 15.02`

Now calculate Spearman’s-\(\rho\) on the resulting \((X_i, U_i)\) pairs.

```
n <- length(x)
rx <- rank(x)
ru <- rank(u)
numer <- sum(rx*ru) - n*((n+1)/2)^2
denomx <- sqrt(sum(rx^2) - n*((n+1)/2)^2)
denomy <- sqrt(sum(ru^2) - n*((n+1)/2)^2)
rho <- numer/(denomx*denomy)
rho
```

`## [1] 1`

- Or more simply as:

`cor(x, u, method="spearman")`

`## [1] 1`

- Obviously with a correlation estimated to be 1 we can't possibly accept the null, but we'll carry on with the calculations anyhow.

Now let's calculate the \(p\)-value.

```
library(SuppDists)
2*min(pSpearman(rho, n), 1-pSpearman(rho, n))
```

`## [1] 5.511464e-07`

- Or more simply as:

`cor.test(x, u, method="spearman")`

```
##
## Spearman's rank correlation rho
##
## data: x and u
## S = 3.6637e-14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 1
```

- with nearly identical results.

Whereas tests on the slope usually use Spearman’s \(\rho\), a typical confidence interval uses Kendall’s \(\tau\).

- Again, the CI does not directly involve a calculation of the least squares slope parameter, \(b\).

For each pair of points \((X_i, Y_i)\) and \((X_j, Y_j)\), such that \(i < j\) and \(X_i \ne X_j\), compute the “two-point slope”,

\[ S_{ij} = \frac{Y_j - Y_i}{X_j - X_i}. \]

- Recall that the pair is called concordant when \(S_{ij} > 0\) and discordant when \(S_{ij} < 0\); otherwise we count it as half concordant and discordant (\(S_{ij} = 0\)).
- But we don’t actually need to count the number of concordant and discordant pairs in this setup.

Let \(N\) be the number of slopes computed, i.e., ignoring pairs with \(i \geq j\) and ties in the \(X\)s. Order the slopes obtained and let

\[ S^{(1)} \leq S^{(2)} \leq \cdots \leq S^{(N)} \]

denote the ordered slopes.

For a \(1- \alpha\) confidence level, find \(w_{1-\alpha/2}\), the \(1-\alpha/2\) quantile of \(T = N_c - N_d\), the difference between the number of concordant and discordant pairs,

- via \(\tau = (N_c - N_d)/(N_c + N_d)\) using the `qKendall` function in the `SuppDists` library in R.

Let \(r\) and \(s\) be given by

\[ \begin{aligned} r &= \frac{1}{2} (N - w_{1-\alpha/2}) \\ s &= \frac{1}{2} (N + w_{1-\alpha/2}) + 1 = N + 1 - r. \end{aligned} \]

Round \(r\) down and \(s\) up to the next integer if they are not already integers. The \(1-\alpha\) confidence interval for \(\beta\) is given by the interval \((S^{(r)}, S^{(s)})\). That is,

\[ \mathbb{P}( S^{(r)} < \beta < S^{(s)}) \geq 1-\alpha. \]

How about a confidence interval for the slope?

- (Again, this will not directly involve the calculated least squares \(b\)-value.)

First calculate the “two-point slopes”.

```
S <- outer(y,y,'-')/outer(x,x,'-')
S <- as.numeric(S[upper.tri(S)])
S
```

```
## [1] -0.0104166667 -0.0010204082 0.0080000000 -0.0038532110 -0.0020000000
## [6] -0.0061666667 -0.0035294118 -0.0008196721 -0.0069444444 -0.0050000000
## [11] -0.0025000000 0.0027777778 -0.0090909091 -0.0055102041 -0.0060000000
## [16] -0.0012903226 0.0300000000 -0.0005555556 -0.0048717949 -0.0048148148
## [21] -0.0037931034 -0.0028082192 -0.0013114754 -0.0037113402 0.0002702703
## [26] -0.0018032787 -0.0030232558 -0.0032173913 -0.0013432836 0.0006363636
## [31] -0.0015294118 0.0096000000 0.0024489796 -0.0004054054 -0.0013592233
## [36] -0.0191666667 -0.0010655738 0.0012244898 -0.0010958904 0.0223076923
## [41] 0.0045945946 0.0003225806 -0.0009890110 -0.0116666667 -0.0041666667
```

- Let's check that we have the right number of these.

`c(length(S), choose(n, 2), length(unique(S)))`

`## [1] 45 45 45`

- It is easy to see that we have no ties in \(S\), above. If we did we would have to be more careful. (See the bespoke library function below.)
- There are also no duplicate \(X\)-values; otherwise we would have `Inf`'s in the \(S\)-values.

Now we can sort them.

`S <- sort(S)`

Now calculate \(r\) and \(s\) from the Kendall-\(\tau\) null distribution.

```
N <- length(S)
w <- qKendall(0.975, n)*length(S) ## the extra multiplication is to put it back on the T=Nc-Nd scale
r <- floor(0.5*(N-w))
s <- ceiling(N+1-r)
c(r,s)
```

`## [1] 12 34`

And then we can grab those values from the sorted \(S\). The 95% confidence interval for \(\beta\) is:

`c(S[r], S[s])`

`## [1] -0.0041666667 -0.0004054054`

Is our estimate of \(b\) inside?

`b`

`## [1] -0.001871962`

- Yes, it is!
- And the CI doesn’t include zero, so we would reject the hypothesis that \(\beta = 0\) at the 5% level.

To actually report nonparametric estimates of the coefficients \(\alpha\) and \(\beta\),

- as opposed to their least squares estimates \(a\) and \(b\),

the Kendall’s \(\tau\) calculations above suggest taking the sample median analogue:

\[ \begin{aligned} b_1 &= \mathrm{median}\{ S_{ij} \} \quad \mbox{the median of the $S_{ij}$'s} \\ a_1 &= \mathrm{median}\{ Y_i \} - b_1\, \mathrm{median}\{ X_i \} \quad \mbox{similarly.} \end{aligned} \]

- These estimates will be less sensitive to outlying \(Y\)-values, which are "pulled" by the square in the least squares criterion.
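A tiny simulated illustration of that robustness (not part of the transit example): corrupting a single response drags the least squares slope far from the true value of 0.5, while the median of the two-point slopes barely moves.

```r
## robustness to one outlying Y-value (simulated data, true slope 0.5)
set.seed(42)
x0 <- 1:20
y0 <- 2 + 0.5*x0 + rnorm(20, sd=0.2)
y0[20] <- y0[20] + 30                       ## corrupt one response
S0 <- outer(y0, y0, '-')/outer(x0, x0, '-') ## two-point slopes
S0 <- S0[upper.tri(S0)]
c(LS=unname(coef(lm(y0 ~ x0))[2]), NP=median(S0))
```

The least squares slope is inflated well above 0.5; the median-based estimate stays close to it.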

Here is our nonparametric estimate of the slope.

`b1 <- median(S)`

- Note that this lies inside, toward the middle of, the interval we found above.

`c(S[r], S[s])`

`## [1] -0.0041666667 -0.0004054054`

How does the line with that slope compare to the one we calculated via least-squares above?

- It makes sense to calculate the intercept using medians too.

```
a1 <- median(y) - b1 * median(x)
a1
```

`## [1] 2.458706`

```
plot(x, y, main="Bulk Discount?", xlab="number", ylab="price")
abline(fit)
abline(a1, b1, col=2, lty=2)
legend("topright", c("LS line", "NP line"), lty=1:2, col=1:2)
```

- Not too different, really.

There are no libraries that I know of which perform precisely these computations. But it would be pretty easy to make our own.

First, consider testing hypotheses.

```
np.slope.test <- function(x, y, beta0, alternative=c("two.sided", "less", "greater"), ...)
{
  alternative <- match.arg(alternative)
  u <- y - beta0*x  ## remove the hypothesized slope
  phi <- cor.test(x, u, alternative=alternative, method="spearman", ...)$p.value
  rho <- cor(x, u, method="spearman")
  return(list(beta0=beta0, rho=rho, alternative=alternative, p.value=phi))
}
```

- The “`...`” argument allows arguments like `exact` and `continuity` to be passed through to `cor.test` if desired.
- Note that since we are using library subroutines, the results won't be exactly the same as if we did the calculation “by hand”, following the description above or in the text.

Let's try it out.

`np.slope.test(x, y, -0.25)`

```
## $beta0
## [1] -0.25
##
## $rho
## [1] 1
##
## $alternative
## [1] "two.sided"
##
## $p.value
## [1] 0
```

- Perfect.

How about for the confidence interval?

```
np.slope.CI <- function(x, y, alpha=0.05)
{
  S <- outer(y,y,'-')/outer(x,x,'-')
  S <- as.numeric(S[upper.tri(S)])
  NcPNd <- length(S)   ## calculate before removing ties
  S <- S[is.finite(S)] ## for ties in the Xs
  S <- sort(S)
  n <- length(x)
  w <- qKendall(1-alpha/2, n)*NcPNd  ## back on the T=Nc-Nd scale
  N <- length(S)
  r <- floor(0.5*(N-w))
  s <- ceiling(N+1-r)
  beta.hat <- median(S)
  return(list(interval=c(S[r], S[s]), rs=c(r,s), level=1-alpha, beta.hat=beta.hat))
}
```

- Note how the code above is careful about ties in the \(X\)s.

Let's try it out.

`np.slope.CI(x, y)`

```
## $interval
## [1] -0.0041666667 -0.0004054054
##
## $rs
## [1] 12 34
##
## $level
## [1] 0.95
##
## $beta.hat
## [1] -0.001529412
```

- Nice.