Department of Statistics, Virginia Tech

This homework is due on **Tuesday, December 5th at 2pm** (the start of class). Please turn in all your work. This homework primarily covers rank-based correlation and linear regression.

*A husband and a wife go bowling together and they kept their scores for 10 games to see if there was a relationship between their scores. The scores were recorded in order as follows.*

```
bowling <- data.frame(husband=c(147, 158, 131, 142, 183, 151, 196, 129, 155, 158),
                      wife=c(122, 128, 125, 123, 115, 120, 108, 143, 124, 123))
```

- (8 pts)
*Compute Kendall’s \(\tau\) “by hand”.*

To aid in counting the number of concordant and discordant pairs, it helps to order by the \(X\)’s, which we will take to be `husband`.

```
o <- order(bowling$husband)
bowling[o,]
```

```
## husband wife
## 8 129 143
## 3 131 125
## 4 142 123
## 1 147 122
## 6 151 120
## 9 155 124
## 2 158 128
## 10 158 123
## 5 183 115
## 7 196 108
```

- Notice that there is a tie in the \(X\) values shown in the first column. We will need to skip this pair when counting below.

Now, an easy way to count the number of concordant and discordant pairs is to count, for each value in the second, \(Y\) column (`wife`), how many are above or below that value.

```
n <- nrow(bowling)
xo <- bowling$husband[o]
yo <- bowling$wife[o]
concord <- discord <- rep(NA, n)
for(i in 1:(n-1)) {
  xu <- xo[i] != xo[(i+1):n]  ## for skipping X ties
  yties <- sum((yo[i] == yo[(i+1):n]) * xu)
  concord[i] <- sum((yo[i] < yo[(i+1):n]) * xu) + yties/2
  discord[i] <- sum((yo[i] > yo[(i+1):n]) * xu) + yties/2
}
cbind(bowling[o, ], concord, discord)
```

```
## husband wife concord discord
## 8 129 143 0.0 9.0
## 3 131 125 1.0 7.0
## 4 142 123 2.5 4.5
## 1 147 122 3.0 3.0
## 6 151 120 3.0 2.0
## 9 155 124 1.0 3.0
## 2 158 128 0.0 2.0
## 10 158 123 0.0 2.0
## 5 183 115 0.0 1.0
## 7 196 108 NA NA
```

Translating those numbers into \(N_c\) and \(N_d\) proceeds as follows, and then we compute our estimate of \(\tau\).

```
Nc <- sum(concord, na.rm=TRUE)
Nd <- sum(discord, na.rm=TRUE)
tau <- (Nc - Nd) / (Nc + Nd)
tau
```

`## [1] -0.5227273`
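As a sanity check (not asked for by the question), the hand calculation can be compared with R's built-in `cor`, which computes the tie-corrected (tau-b) version of Kendall's \(\tau\); because there is only one tied pair in each of \(X\) and \(Y\) here, it should agree with the half-tie accounting above.

```r
## compare the "by hand" tau with R's tie-corrected (tau-b) estimate
bowling <- data.frame(husband=c(147, 158, 131, 142, 183, 151, 196, 129, 155, 158),
                      wife=c(122, 128, 125, 123, 115, 120, 108, 143, 124, 123))
cor(bowling$husband, bowling$wife, method="kendall")
```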

- (7 pts)
*Test the hypothesis of independence using a two-tailed test based on a., “by hand”.*

The hypotheses are

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{$X$ and $Y$ are independent} \\ \mathcal{H}_1 &: \; \mbox{Pairs of observations tend to be concordant, or tend to be discordant.} \end{aligned} \]

Since there were ties in the \(Y\)-values, we’ll need to use the CLT approximation to calculate the \(p\)-value.

```
z1 <- (Nc - Nd + 1) * sqrt(18)/sqrt(n*(n-1)*(2*n+5))
z2 <- (Nc - Nd - 1) * sqrt(18)/sqrt(n*(n-1)*(2*n+5))
2 * min(pnorm(z1), pnorm(z2, lower.tail=FALSE))
```

`## [1] 0.04909798`

- Reject the null hypothesis at the 5% level (barely) and conclude that pairs of observations tend to be concordant, or tend to be discordant, which means there is some correlation between the scores.

- (5 pts)
*Calculate a. and b. “using a software library”.*

We can get both with one command.

`cor.test(bowling$husband, bowling$wife, method="kendall", exact=FALSE, continuity=TRUE)`

```
##
## Kendall's rank correlation tau
##
## data: bowling$husband and bowling$wife
## z = -1.9835, p-value = 0.04731
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.5227273
```

- Identical for the estimate of \(\tau\);
- very similar for the \(p\)-value, leading to the same conclusion.

*A new worker is assigned to a machine that manufactures bolts. Each day a sample of bolts is examined and the percent defective is recorded. Do the following data indicate a significant improvement over time for that worker?*

`bolts <- data.frame(day=1:13, def=c(6.1, 7.5, 7.7, 5.9, 5.2, 6.1, 5.3, 4.5, 4.9, 4.6, 3.0, 4.0, 3.7))`

You should perform the following tasks “by hand” or “using a software library”, as you please.

- (5 pts)
*Calculate numerical summaries of your data, this includes at least finding the mean, the median and standard deviation for the* `def` *variable*.

First consider the output of the `summary` command.

`summary(bolts$def)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.500 5.200 5.269 6.100 7.700
```

The estimated standard deviation is given as follows.

`sd(bolts$def)`

`## [1] 1.386473`

- On their own these do not tell us much about the relationships in the data.
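As a quick check (not required), the reported standard deviation can be recomputed "by hand" from its definition, \(s = \sqrt{\sum_i (x_i - \bar{x})^2/(n-1)}\).

```r
## recompute the sample standard deviation from its definition
def <- c(6.1, 7.5, 7.7, 5.9, 5.2, 6.1, 5.3, 4.5, 4.9, 4.6, 3.0, 4.0, 3.7)
s <- sqrt(sum((def - mean(def))^2) / (length(def) - 1))
s
```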

- (5 pts)
*Construct a helpful plot to investigate relationships in your data, perhaps starting with a scatterplot. Comment on what you observe; do the data appear to be related?*

The desired scatterplot is shown below.

`plot(bolts$day, bolts$def, xlab="day", ylab="percent defective", main="Improvement over time?")`

- Indeed, it looks like there is a fairly substantial downward trend: percent defective (`def`) decreases as `day` increases.

- (5 pts)
*Perform an appropriate test of association using Spearman’s \(\rho\).*

The hypotheses are:

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{Day and defective percentage are mutually independent or there is a} \\ & \;\;\;\; \mbox{tendency for larger values of day to be paired with larger defective percentages } (\rho \geq 0) \\ \mathcal{H}_1 &: \; \mbox{There is a tendency for smaller values of day to be paired with larger defective percentages } (\rho < 0). \end{aligned} \]

We will perform the test using software.

`cor.test(bolts$day, bolts$def, alternative="less", method="spearman", exact=FALSE)`

```
##
## Spearman's rank correlation rho
##
## data: bolts$day and bolts$def
## S = 693.45, p-value = 1.05e-05
## alternative hypothesis: true rho is less than 0
## sample estimates:
## rho
## -0.9050903
```

- Resoundingly reject \(\mathcal{H}_0\) at the 5% level and conclude that there is a significant improvement over time for that worker.
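The reported \(\rho\) can be double-checked "by hand" as the Pearson correlation of the ranks; `rank` assigns average (mid-)ranks to the tied `def` values (6.1 appears twice), matching what `cor.test` does internally.

```r
## Spearman's rho is the Pearson correlation of the (mid-)ranks
day <- 1:13
def <- c(6.1, 7.5, 7.7, 5.9, 5.2, 6.1, 5.3, 4.5, 4.9, 4.6, 3.0, 4.0, 3.7)
rho <- cor(rank(day), rank(def))
rho
```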

- (5 pts)
*Perform an appropriate test of association using Kendall’s \(\tau\).*

The hypotheses are:

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{Day and defective percentage are mutually independent or pairs of} \\ & \;\;\;\; \mbox{observations tend to be concordant } (\tau \geq 0) \\ \mathcal{H}_1 &: \; \mbox{Day and defective percentage tend to be discordant } (\tau < 0). \end{aligned} \]

Again, we will perform the test using software. There are ties in the \(Y\)-values, so we will have to use the CLT version.

`cor.test(bolts$day, bolts$def, alternative="less", method="kendall", exact=FALSE)`

```
##
## Kendall's rank correlation tau
##
## data: bolts$day and bolts$def
## z = -3.484, p-value = 0.000247
## alternative hypothesis: true tau is less than 0
## sample estimates:
## tau
## -0.7354992
```

- Again, we resoundingly reject \(\mathcal{H}_0\) at the 5% level and conclude that there is a significant improvement over time for that worker.
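Likewise, the estimate of \(\tau\) can be double-checked directly with `cor`, which applies the same tie correction (tau-b) that `cor.test` uses for its estimate.

```r
## tie-corrected Kendall's tau directly from the data
day <- 1:13
def <- c(6.1, 7.5, 7.7, 5.9, 5.2, 6.1, 5.3, 4.5, 4.9, 4.6, 3.0, 4.0, 3.7)
tau <- cor(day, def, method="kendall")
tau
```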

*A driver kept track of the number of miles she traveled and the number of gallons put in the tank each time she bought gasoline.*

```
mpg <- data.frame(miles=c(142, 116, 194, 250, 88, 157, 225, 159, 43, 208),
                  gallons=c(11.1, 5.7, 14.2, 15.8, 7.5, 12.5, 17.9, 8.8, 3.4, 15.2))
```

- (5 pts)
*Draw a diagram (e.g., a scatterplot) showing these points, using gallons as the \(x\)-axis. Ideally this should be performed using software.*

See the solution to part c. below, where the least squares regression line is added to the plot.

- (7 pts)
*Estimate \(a\) and \(b\) in a linear model, \(y=a+bx\), using the method of least squares.*

- (4 pts)
*“by hand”*

The slope is correlation times rise over run.

```
b <- cor(mpg$gallons, mpg$miles) * sd(mpg$miles) / sd(mpg$gallons)
b
```

`## [1] 12.61741`

The intercept is calculated by putting a line with that slope through the middle of the cloud of points.

```
a <- mean(mpg$miles) - b * mean(mpg$gallons)
a
```

`## [1] 16.75887`
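Two quick sanity checks on the "by hand" fit (not part of the question): least-squares residuals sum to zero, and the fitted line passes through the point of means \((\bar{x}, \bar{y})\).

```r
## verify defining properties of the least-squares line
miles <- c(142, 116, 194, 250, 88, 157, 225, 159, 43, 208)
gallons <- c(11.1, 5.7, 14.2, 15.8, 7.5, 12.5, 17.9, 8.8, 3.4, 15.2)
b <- cor(gallons, miles) * sd(miles) / sd(gallons)
a <- mean(miles) - b * mean(gallons)
sum(miles - (a + b * gallons))         ## zero, up to floating point
mean(miles) - (a + b * mean(gallons))  ## zero by construction of a
```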

- (3 pts)
*“using a software library”*

The `lm` command in R gives the following.

```
fit <- lm(miles ~ gallons, data=mpg)
fit
```

```
##
## Call:
## lm(formula = miles ~ gallons, data = mpg)
##
## Coefficients:
## (Intercept) gallons
## 16.76 12.62
```

- Identical (up to rounding).

- (3 pts)
*Plot the least squares regression line on the diagram of part a., ideally using software.*

The plot from part a. and the line from this part (c.) are shown below.

```
plot(mpg$gallons, mpg$miles, xlab="gallons", ylab="miles", main="Miles per Gallon")
abline(a, b)
legend("topleft", "Least Squares line", lty=1)
```

- There appears to be an increasing relationship.

- (7 pts)
*Suppose the EPA estimated this car’s mileage at 18 miles per gallon. Test the null hypothesis that this figure applies to this particular car and driver. Use the test for the slope.*

- (4 pts)
*“by hand”*

We can convert the question at hand into the following two-tailed set of hypotheses.

\[ \begin{aligned} \mathcal{H}_0 &: \; \beta = 18 \\ \mathcal{H}_1 &: \; \beta \ne 18. \end{aligned} \]

To make our lives a little easier when performing the test, let’s define \(X\) and \(Y\) variables.

```
y <- mpg$miles
x <- mpg$gallons
```

The test is performed by first calculating \(U_i = Y_i - \beta_0 X_i\) for our \(\beta_0 = 18\).

```
beta0 <- 18
u <- y - beta0*x
u
```

`## [1] -57.8 13.4 -61.6 -34.4 -47.0 -68.0 -97.2 0.6 -18.2 -65.6`

Now calculate Spearman’s-\(\rho\) on the resulting \((X_i, U_i)\) pairs.

```
n <- length(x)
rx <- rank(x)
ru <- rank(u)
rbind(rx, ru)
```

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## rx 5 2 7 9 3 6 10 4 1 8
## ru 5 10 4 7 6 2 1 9 8 3
```

- First, observe that there are no ties in the ranks, which will impact how we calculate the \(p\)-value below.

… Continuing with the calculations …

```
numer <- sum(rx*ru) - n*((n+1)/2)^2
denomx <- sqrt(sum(rx^2) - n*((n+1)/2)^2)
denomy <- sqrt(sum(ru^2) - n*((n+1)/2)^2)
rho <- numer/(denomx*denomy)
rho
```

`## [1] -0.7090909`

Now let’s calculate the \(p\)-value. Since there are no ties in the ranks, we can use the exact calculation.

```
library(SuppDists)
2*min(pSpearman(rho, n), 1-pSpearman(rho, n))
```

`## [1] 0.02675209`

- We reject the null hypothesis at the 5% level and conclude that the EPA figure of 18 miles per gallon does not apply to this particular car and driver.

- (3 pts)
*“using software”*

Here is a version that uses the \(U\) values we calculated above.

`cor.test(x, u, method="spearman")`

```
##
## Spearman's rank correlation rho
##
## data: x and u
## S = 282, p-value = 0.02751
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.7090909
```

Alternatively, we can borrow the library function we created in the lecture notes.

```
np.slope.test <- function(x, y, beta0, alternative=c("two.sided", "less", "greater"), ...)
{
  alternative <- match.arg(alternative)
  u <- y - beta0*x
  phi <- cor.test(x, u, alternative, method="spearman", ...)$p.value
  rho <- cor(x, u, method="spearman")
  return(list(beta0=beta0, rho=rho, alternative=alternative, p.value=phi))
}
```

When applied to our `mpg` data we get the following.

`np.slope.test(mpg$gallons, mpg$miles, 18)`

```
## $beta0
## [1] 18
##
## $rho
## [1] -0.7090909
##
## $alternative
## [1] "two.sided"
##
## $p.value
## [1] 0.02751412
```

- In both cases: very similar result; same conclusion.

- (5 pts)
*Calculate a 95% confidence interval for the slope.*

Since the question didn’t specify “by hand” or “using a software library”, we’ll borrow the library function we wrote in lecture.

```
np.slope.CI <- function(x, y, alpha=0.05)  ## requires SuppDists for qKendall
{
  S <- outer(y, y, '-') / outer(x, x, '-')
  S <- as.numeric(S[upper.tri(S)])
  NcPNd <- length(S)    ## count all pairs before removing ties
  S <- S[is.finite(S)]  ## for ties in the Xs
  S <- sort(S)
  n <- length(x)
  w <- qKendall(1-alpha/2, n)*NcPNd  ## back on the T=Nc-Nd scale
  N <- length(S)
  r <- floor(0.5*(N-w))
  s <- ceiling(N+1-r)
  beta.hat <- median(S)
  return(list(interval=c(S[r], S[s]), rs=c(r,s), level=1-alpha, beta.hat=beta.hat))
}
```

Here it is applied to our `mpg` data.

`np.slope.CI(mpg$gallons, mpg$miles)`

```
## $interval
## [1] 8.934426 16.693548
##
## $rs
## [1] 12 34
##
## $level
## [1] 0.95
##
## $beta.hat
## [1] 13.17308
```

- (3 pts)
*Provide the estimate for the slope, obtained from the median of the slopes from part e. Solve this “by hand”.*

The “software library” answer is reported above, but we were asked to do it “by hand”.

First calculate the two-point slopes.

```
S <- outer(y,y,'-')/outer(x,x,'-')
S <- as.numeric(S[upper.tri(S)])
S
```

```
## [1] 4.8148148 16.7741935 9.1764706 22.9787234 13.2673267
## [6] 35.0000000 15.0000000 -15.5555556 15.8208955 19.5180723
## [11] 10.7142857 6.0294118 21.7647059 28.1818182 13.8000000
## [16] 12.2058824 8.9344262 8.3783784 -11.9047619 13.1730769
## [21] 12.5925926 -7.3913043 13.8709677 6.4814815 13.0000000
## [26] 54.6153846 -0.5405405 7.2527473 12.8571429 31.7391304
## [31] 13.9814815 16.6935484 10.9756098 12.5274725 12.5517241
## [36] 21.4814815 16.0975610 9.6842105 14.0000000 70.0000000
## [41] 15.5844156 18.8888889 6.2962963 7.6562500 13.9830508
```

- Note that there are no `Inf` values, so nothing to remove.

And then take the median of those slopes as our estimate for \(\beta\).

```
b.np <- median(S)
b.np
```

`## [1] 13.17308`

- This agrees with our result above.
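A further check (not asked for): by the defining property of the median, roughly half of the 45 two-point slopes should lie below `b.np` and half above.

```r
## the median slope splits the two-point slopes in half
miles <- c(142, 116, 194, 250, 88, 157, 225, 159, 43, 208)
gallons <- c(11.1, 5.7, 14.2, 15.8, 7.5, 12.5, 17.9, 8.8, 3.4, 15.2)
S <- outer(miles, miles, '-') / outer(gallons, gallons, '-')
S <- as.numeric(S[upper.tri(S)])
b.np <- median(S)
c(below=sum(S < b.np), above=sum(S > b.np))
```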

The question didn’t ask for this, but we can follow the same formula for the intercept as above, except with medians instead of means.

```
a.np <- median(y) - b.np * median(x)
a.np
```

`## [1] 2.557692`

*A random sample of American colleges and universities resulted in the following numbers of students and faculty (Spring 1973).*

```
univ <- data.frame(name=c("American International", "Bethany Nazarene", "Carlow", "David Lipscomb",
                          "Florida International", "Heidelberg", "Lake Erie", "Mary Harin Baylor",
                          "Newburry", "St. Ambrose", "Smith", "Texas Women's", "Wofford"),
                   students=c(2546, 1355, 87, 1858, 4500, 1141, 784, 1063, 753, 1189, 2755, 5602, 988),
                   faculty=c(129, 75, 87, 99, 300, 109, 77, 64, 61, 90, 240, 300, 73))
```

- (5 pts)
*Draw a diagram (e.g., a scatterplot) showing these points, using* `faculty` *as the \(x\)-axis. Ideally this should be performed using software.*

See the solution to part c. below, where the least squares regression line is added to the plot.

- (7 pts)
*Estimate \(a\) and \(b\) in a linear model, \(y=a+bx\), using the method of least squares.*

- (4 pts)
*“by hand”*

The slope is correlation times rise over run.

```
b <- cor(univ$faculty, univ$students) * sd(univ$students) / sd(univ$faculty)
b
```

`## [1] 16.8282`

The intercept is calculated by putting a line with that slope through the middle of the cloud of points.

```
a <- mean(univ$students) - b * mean(univ$faculty)
a
```

`## [1] -311.8655`

- (3 pts)
*“using a software library”*

The `lm` command in R gives the following.

```
fit <- lm(students ~ faculty, data=univ)
fit
```

```
##
## Call:
## lm(formula = students ~ faculty, data = univ)
##
## Coefficients:
## (Intercept) faculty
## -311.87 16.83
```

- (3 pts)
*Plot the least squares regression line on the diagram of part a., ideally using software.*

The plot from part a. and the line from this part (c.) are shown below.

```
plot(univ$faculty, univ$students, xlab="faculty", ylab="students", main="Student-to-Faculty ratio")
abline(a, b)
legend("topleft", "Least Squares line", lty=1)
```

- There appears to be an increasing relationship.

- (7 pts)
*Test the hypothesis that an increase of one faculty member is accompanied by an average increase of 15 students.*

- (4 pts)
*“by hand”*

We can convert the question at hand into the following two-tailed set of hypotheses.

\[ \begin{aligned} \mathcal{H}_0 &: \; \beta = 15 \\ \mathcal{H}_1 &: \; \beta \ne 15. \end{aligned} \]

To make our lives a little easier when performing the test, let’s define \(X\) and \(Y\) variables.

```
y <- univ$students
x <- univ$faculty
```

The test is performed by first calculating \(U_i = Y_i - \beta_0 X_i\) for our \(\beta_0 = 15\).

```
beta0 <- 15
u <- y - beta0*x
u
```

```
## [1] 611 230 -1218 373 0 -494 -371 103 -162 -161 -845
## [12] 1102 -107
```

Now calculate Spearman’s-\(\rho\) on the resulting \((X_i, U_i)\) pairs.

```
n <- length(x)
rx <- rank(x)
ru <- rank(u)
rbind(rx, ru)
```

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## rx 10 4 6 8 12.5 9 5 2 1 7 11 12.5 3
## ru 12 10 1 11 8.0 3 4 9 5 6 2 13.0 7
```

- First, observe that there are ties in the ranks, which will impact how we calculate the \(p\)-value below.

… Continuing with the calculations …

```
numer <- sum(rx*ru) - n*((n+1)/2)^2
denomx <- sqrt(sum(rx^2) - n*((n+1)/2)^2)
denomy <- sqrt(sum(ru^2) - n*((n+1)/2)^2)
rho <- numer/(denomx*denomy)
rho
```

`## [1] 0.1898214`

Now let’s calculate the \(p\)-value. Since there were ties we must use the CLT approximation.

`2*pnorm(-abs(rho)*sqrt(length(x)-1))`

`## [1] 0.5108206`

- We fail to reject the null hypothesis: there is not enough evidence against the claim that an increase of one faculty member is accompanied by an average increase of 15 students.

- (3 pts)
*“using software”*

Here is a version that uses the \(U\) values we calculated above.

`cor.test(x, u, method="spearman", exact=FALSE)`

```
##
## Spearman's rank correlation rho
##
## data: x and u
## S = 294.91, p-value = 0.5345
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1898214
```

Alternatively, we can borrow the library function we created, which we don’t need to paste in again since we have it above. When applied to our `univ` data we get the following.

`np.slope.test(univ$faculty, univ$students, 15, exact=FALSE)`

```
## $beta0
## [1] 15
##
## $rho
## [1] 0.1898214
##
## $alternative
## [1] "two.sided"
##
## $p.value
## [1] 0.5345087
```

- In both cases: very similar result; same conclusion.

- (5 pts)
*Calculate a 95% confidence interval for the slope.*

Since the question didn’t specify “by hand” or “using a software library”, we’ll borrow the library function we wrote in lecture, which is already pasted above. Here it is applied to our `univ` data.

`np.slope.CI(univ$faculty, univ$students)`

```
## $interval
## [1] 9.613636 23.356021
##
## $rs
## [1] 22 56
##
## $level
## [1] 0.95
##
## $beta.hat
## [1] 17.43791
```

- (3 pts)
*Provide the estimate for the slope, obtained from the median of the slopes from part e. Solve this “by hand”.*

The “software library” answer is reported above, but we were asked to do it “by hand”.

First calculate the two-point slopes.

```
S <- outer(y,y,'-')/outer(x,x,'-')
S <- as.numeric(S[upper.tri(S)])
S
```

```
## [1] 22.055556 58.547619 -105.666667 22.933333 20.958333
## [6] 147.583333 11.426901 13.977778 20.718310 13.144279
## [11] 70.250000 -6.294118 47.909091 -71.700000 17.586387
## [16] 33.884615 -285.500000 -69.700000 48.818182 16.663677
## [21] 11.156250 22.815385 26.545455 -42.434783 22.714286
## [26] 14.563559 1.733333 -21.461538 26.367647 43.000000
## [31] -25.615385 29.078947 15.677824 8.083333 1.937500
## [36] 103.333333 34.794872 -11.066667 367.333333 74.333333
## [41] 15.766667 -2.526316 31.153846 4.846154 15.034483
## [46] 1.882883 8.484848 17.437908 6.361702 29.083333
## [51] 12.320611 12.092025 9.613636 11.184358 10.440000
## [56] 17.871345 18.875556 25.892019 18.626866 -Inf
## [61] 23.356021 21.605381 19.233051 20.288703 21.014286
## [66] 47.450000 27.821429 183.500000 -64.357143 33.461538
## [71] 15.471366 4.250000 -51.000000 -8.333333 19.583333
## [76] 11.823529 10.580838 20.325991
```

- Note that there is one `-Inf` value, arising from a tie in the \(X\)’s (two schools have 300 faculty), so we have to get rid of it before we go further.

`S <- S[is.finite(S)]`

Now we can take the median of those slopes as our estimate for \(\beta\).

```
b.np <- median(S)
b.np
```

`## [1] 17.43791`

- This agrees with our result above.

The question didn’t ask for this, but we can follow the same formula for the intercept as above, except with medians instead of means.

```
a.np <- median(y) - b.np * median(x)
a.np
```

`## [1] -380.4118`