Instructions

This homework is due on Tuesday, December 5th at 2pm (the start of class). Please turn in all your work. This homework primarily covers rank-based correlation and linear regression.

Problem 1: Couples’ bowling (20 points)

A husband and wife went bowling together and kept their scores for 10 games to see if there was a relationship between their scores. The scores were recorded in order as follows.

bowling <- data.frame(husband=c(147, 158, 131, 142, 183, 151, 196, 129, 155, 158),
    wife=c(122, 128, 125, 123, 115, 120, 108, 143, 124, 123))
  a. (8 pts) Compute Kendall’s \(\tau\) “by hand”.

To aid in counting the number of concordant and discordant pairs, it helps to order by the \(X\)’s, which we take to be husband.

o <- order(bowling$husband)
bowling[o,]
##    husband wife
## 8      129  143
## 3      131  125
## 4      142  123
## 1      147  122
## 6      151  120
## 9      155  124
## 2      158  128
## 10     158  123
## 5      183  115
## 7      196  108
  • Notice that there is a tie in the \(X\) values shown in the first column (husband = 158 twice). We will need to skip this pair when counting below.

Now, an easy way to count the concordant and discordant pairs is, for each value in the second (\(Y\), wife) column, to count how many of the \(Y\) values below it in the table are larger (these pairs are concordant, since the \(X\)’s increase down the table) or smaller (discordant).

n <- nrow(bowling)
xo <- bowling$husband[o] ## X values, sorted
yo <- bowling$wife[o]    ## Y values, in X-sorted order
concord <- discord <- rep(NA, n)
for(i in 1:(n-1)) {
    xu <- xo[i] != xo[(i+1):n] ## indicator for skipping X ties
    yties <- sum((yo[i] == yo[(i+1):n]) * xu)
    concord[i] <- sum((yo[i] < yo[(i+1):n])*xu) + yties/2 ## Y ties count half
    discord[i] <- sum((yo[i] > yo[(i+1):n])*xu) + yties/2 ## toward each side
}
cbind(bowling[o, ], concord, discord)
##    husband wife concord discord
## 8      129  143     0.0     9.0
## 3      131  125     1.0     7.0
## 4      142  123     2.5     4.5
## 1      147  122     3.0     3.0
## 6      151  120     3.0     2.0
## 9      155  124     1.0     3.0
## 2      158  128     0.0     2.0
## 10     158  123     0.0     2.0
## 5      183  115     0.0     1.0
## 7      196  108      NA      NA

Translating those counts into \(N_c\) and \(N_d\) proceeds as follows, yielding our estimate of \(\tau\).

Nc <- sum(concord, na.rm=TRUE)
Nd <- sum(discord, na.rm=TRUE)
tau <- (Nc - Nd) / (Nc + Nd)
tau
## [1] -0.5227273
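
As an optional sanity check (a minimal sketch, with throwaway names Nc2 and Nd2), the same counts can be reproduced by brute force over all pairs, using the same conventions as above: pairs tied in \(X\) are skipped, and pairs tied only in \(Y\) contribute half to each count.

Nc2 <- Nd2 <- 0
for(i in 1:(n-1)) {
    for(j in (i+1):n) {
        dx <- bowling$husband[i] - bowling$husband[j]
        dy <- bowling$wife[i] - bowling$wife[j]
        if(dx == 0) next        ## skip the pair tied in X
        if(dy == 0) {           ## Y ties count half toward each
            Nc2 <- Nc2 + 1/2
            Nd2 <- Nd2 + 1/2
        } else if(dx*dy > 0) {
            Nc2 <- Nc2 + 1      ## X and Y move together
        } else {
            Nd2 <- Nd2 + 1      ## X and Y move oppositely
        }
    }
}
c(Nc2, Nd2) ## matches (Nc, Nd) = (10.5, 33.5) above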
  b. (7 pts) Test the hypothesis of independence using a two-tailed test based on part a., “by hand”.

The hypotheses are

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{$X$ and $Y$ are independent} \\ \mathcal{H}_1 &: \; \mbox{Pairs of observations tend to be concordant, or tend to be discordant.} \end{aligned} \]

Since there are ties in the data (in both the \(X\)- and \(Y\)-values), we’ll need to use the CLT approximation, with a continuity correction, to calculate the \(p\)-value.

z1 <- (Nc - Nd + 1) * sqrt(18)/sqrt(n*(n-1)*(2*n+5))
z2 <- (Nc - Nd - 1) * sqrt(18)/sqrt(n*(n-1)*(2*n+5))
2 * min(pnorm(z1), pnorm(z2, lower.tail=FALSE))
## [1] 0.04909798
  • Reject the null hypothesis at the 5% level (barely) and conclude that pairs of observations tend to be concordant, or tend to be discordant; since \(\hat{\tau} < 0\) here, the scores are negatively correlated.
  c. (5 pts) Calculate a. and b. “using a software library”.

We can get both with one command.

cor.test(bowling$husband, bowling$wife, method="kendall", exact=FALSE, continuity=TRUE)
## 
##  Kendall's rank correlation tau
## 
## data:  bowling$husband and bowling$wife
## z = -1.9835, p-value = 0.04731
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.5227273
  • Identical for the estimate of \(\tau\);
  • very similar for the \(p\)-value, leading to the same conclusion.

Problem 2: Defective bolts (20 points)

A new worker is assigned to a machine that manufactures bolts. Each day a sample of bolts is examined and the percent defective is recorded. Do the following data indicate a significant improvement over time for that worker?

bolts <- data.frame(day=1:13, def=c(6.1, 7.5, 7.7, 5.9, 5.2, 6.1, 5.3, 4.5, 4.9, 4.6, 3.0, 4.0, 3.7))

You should perform the following tasks “by hand” or “using a software library”, as you please.

  a. (5 pts) Calculate numerical summaries of your data, including at least the mean, median, and standard deviation of the def variable.

First consider the output of the summary command.

summary(bolts$def)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.500   5.200   5.269   6.100   7.700

The estimated standard deviation is given as follows.

sd(bolts$def)
## [1] 1.386473
  • On their own these do not tell us much about the relationships in the data.
  b. (5 pts) Construct a helpful plot to investigate relationships in your data, perhaps starting with a scatterplot. Comment on what you observe; do the data appear to be related?

The desired scatterplot is shown below.

plot(bolts$day, bolts$def, xlab="day", ylab="percent defective", main="Improvement over time?")

  • Indeed, it looks like there is a fairly substantial downward trend: percent defective decreases as day increases.
  c. (5 pts) Perform an appropriate test of association using Spearman’s \(\rho\).

The hypotheses are:

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{Day and defective percentage are mutually independent or there is a} \\ & \;\;\;\; \mbox{tendency for larger values of day to be paired with larger defective percentages } (\rho \geq 0) \\ \mathcal{H}_1 &: \; \mbox{There is a tendency for smaller values of day to be paired with larger defective percentages } (\rho < 0). \end{aligned} \]

We will perform the test using software.

cor.test(bolts$day, bolts$def, alternative="less", method="spearman", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  bolts$day and bolts$def
## S = 693.45, p-value = 1.05e-05
## alternative hypothesis: true rho is less than 0
## sample estimates:
##        rho 
## -0.9050903
  • Resoundingly reject \(\mathcal{H}_0\) at the 5% level and conclude that there is a significant improvement over time for that worker.
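
Though not required, Spearman’s \(\rho\) can be cross-checked “by hand”: it is just Pearson’s correlation computed on the ranks. A minimal sketch, using throwaway names rd and rdef:

rd <- rank(bolts$day)
rdef <- rank(bolts$def) ## midranks handle the tied 6.1 values
cor(rd, rdef)           ## matches cor.test's rho: -0.9050903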
  d. (5 pts) Perform an appropriate test of association using Kendall’s \(\tau\).

The hypotheses are:

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{Day and defective percentage are mutually independent or pairs of} \\ & \;\;\;\; \mbox{observations tend to be concordant} (\tau \geq 0) \\ \mathcal{H}_1 &: \; \mbox{Day and defective percentage tend to be discordant} (\tau < 0). \end{aligned} \]

Again, we will perform the test using software. There are ties in the data (def = 6.1 appears twice), so we will have to use the CLT version. (A continuity correction could also be requested via continuity=TRUE, as in Problem 1.)

cor.test(bolts$day, bolts$def, alternative="less", method="kendall", exact=FALSE)
## 
##  Kendall's rank correlation tau
## 
## data:  bolts$day and bolts$def
## z = -3.484, p-value = 0.000247
## alternative hypothesis: true tau is less than 0
## sample estimates:
##        tau 
## -0.7354992
  • Again, we resoundingly reject \(\mathcal{H}_0\) at the 5% level and conclude that there is a significant improvement over time for that worker.
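
As with Spearman’s \(\rho\), the Kendall point estimate alone can be cross-checked with cor, which returns the same ties-adjusted value.

cor(bolts$day, bolts$def, method="kendall") ## -0.7354992, as above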

Problem 3: Miles per gallon (30 points)

A driver kept track of the number of miles she traveled and the number of gallons put in the tank each time she bought gasoline.

mpg <- data.frame(miles=c(142, 116, 194, 250, 88, 157, 225, 159, 43, 208),
    gallons=c(11.1, 5.7, 14.2, 15.8, 7.5, 12.5, 17.9, 8.8, 3.4, 15.2))
  a. (5 pts) Draw a diagram (e.g., a scatterplot) showing these points, using gallons as the \(x\)-axis. Ideally this should be performed using software.

See the solution to part c. below, where the least squares regression line has been added to the plot.

  b. (7 pts) Estimate \(a\) and \(b\) in a linear model, \(y=a+bx\), using the method of least squares.

    i. (4 pts) “by hand”

The slope is the sample correlation times the ratio of standard deviations (“rise over run”): \(b = r \, s_y / s_x\).

b <- cor(mpg$gallons, mpg$miles) * sd(mpg$miles) / sd(mpg$gallons)
b
## [1] 12.61741

The intercept is found by passing the line with that slope through the point of averages: \(a = \bar{y} - b\bar{x}\).

a <- mean(mpg$miles) - b * mean(mpg$gallons)
a
## [1] 16.75887
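
As a quick sanity check of the intercept formula, the fitted line passes through the point of averages.

a + b * mean(mpg$gallons) ## recovers mean(mpg$miles) = 158.2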
    ii. (3 pts) “using a software library”

The lm command in R gives the following.

fit <- lm(miles ~ gallons, data=mpg)
fit
## 
## Call:
## lm(formula = miles ~ gallons, data = mpg)
## 
## Coefficients:
## (Intercept)      gallons  
##       16.76        12.62
  • Identical (up to rounding).
  c. (3 pts) Plot the least squares regression line on the diagram of part a., ideally using software.

The plot from part a. and the line from this part (c.) are shown below.

plot(mpg$gallons, mpg$miles, xlab="gallons", ylab="miles", main="Miles per Gallon")
abline(a, b)
legend("topleft", "Least Squares line", lty=1)

  • There appears to be an increasing relationship.
  d. (7 pts) Suppose the EPA estimated this car’s mileage at 18 miles per gallon. Test the null hypothesis that this figure applies to this particular car and driver. Use the test for the slope.
    i. (4 pts) “by hand”

We can convert the question at hand into the following two-tailed set of hypotheses.

\[ \begin{aligned} \mathcal{H}_0 &: \; \beta = 18 \\ \mathcal{H}_1 &: \; \beta \ne 18. \end{aligned} \]

To make our lives a little easier when performing the test, let’s define \(X\) and \(Y\) variables.

y <- mpg$miles
x <- mpg$gallons

The test is performed by first calculating \(U_i = Y_i - \beta_0 X_i\) for our \(\beta_0 = 18\).

beta0 <- 18
u <- y - beta0*x
u
##  [1] -57.8  13.4 -61.6 -34.4 -47.0 -68.0 -97.2   0.6 -18.2 -65.6

Now calculate Spearman’s \(\rho\) on the resulting \((X_i, U_i)\) pairs.

n <- length(x)
rx <- rank(x)
ru <- rank(u)
rbind(rx, ru)
##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## rx    5    2    7    9    3    6   10    4    1     8
## ru    5   10    4    7    6    2    1    9    8     3
  • First, observe that there are no ties in the ranks, which will impact how we calculate the \(p\)-value below.

… Continuing with the calculations …

numer <- sum(rx*ru) - n*((n+1)/2)^2
denomx <- sqrt(sum(rx^2) - n*((n+1)/2)^2)
denomy <- sqrt(sum(ru^2) - n*((n+1)/2)^2)
rho <- numer/(denomx*denomy)
rho
## [1] -0.7090909
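
Equivalently, since the by-hand formula above is just Pearson’s correlation applied to the ranks, the value can be cross-checked in one line.

cor(rx, ru) ## same as the by-hand rho: -0.7090909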

Now let’s calculate the \(p\)-value. Since there are no ties in the ranks, we can use the exact calculation.

library(SuppDists)
2*min(pSpearman(rho, n), 1-pSpearman(rho, n))
## [1] 0.02675209
  • We reject the null hypothesis at the 5% level and conclude that the mileage is not 18 miles per gallon for this particular car and driver.
    ii. (3 pts) “using software”

Here is a version that uses the \(U\) values we calculated above.

cor.test(x, u, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  x and u
## S = 282, p-value = 0.02751
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.7090909

Alternatively, we can borrow the library function we created in the lecture notes.

np.slope.test <- function(x, y, beta0, alternative=c("two.sided", "less", "greater"), ...)
{
    alternative <- match.arg(alternative)
    u <- y - beta0*x ## the U values under H0: beta = beta0
    phi <- cor.test(x, u, alternative, method="spearman", ...)$p.value
    rho <- cor(x, u, method="spearman")
    return(list(beta0=beta0, rho=rho, alternative=alternative, p.value=phi))
}

When applied to our mpg data we get the following.

np.slope.test(mpg$gallons, mpg$miles, 18)
## $beta0
## [1] 18
## 
## $rho
## [1] -0.7090909
## 
## $alternative
## [1] "two.sided"
## 
## $p.value
## [1] 0.02751412
  • In both cases: very similar result; same conclusion.
  e. (5 pts) Calculate a 95% confidence interval for the slope.

Since the question didn’t specify “by hand” or “using a software library”, we’ll borrow the library function we wrote in lecture.

np.slope.CI <- function(x, y, alpha=0.05)
{
    S <- outer(y, y, '-')/outer(x, x, '-') ## all two-point slopes
    S <- as.numeric(S[upper.tri(S)])
    NcPNd <- length(S)                ## Nc + Nd, counted before dropping ties
    S <- S[is.finite(S)]              ## remove infinities from ties in the Xs
    S <- sort(S)
    n <- length(x)
    w <- qKendall(1-alpha/2, n)*NcPNd ## SuppDists quantile, back on the T = Nc - Nd scale
    N <- length(S)
    r <- floor(0.5*(N-w))             ## lower order statistic
    s <- ceiling(N+1-r)               ## upper order statistic
    beta.hat <- median(S)
    return(list(interval=c(S[r], S[s]), rs=c(r,s), level=1-alpha, beta.hat=beta.hat))
}

Here it is applied to our mpg data.

np.slope.CI(mpg$gallons, mpg$miles)
## $interval
## [1]  8.934426 16.693548
## 
## $rs
## [1] 12 34
## 
## $level
## [1] 0.95
## 
## $beta.hat
## [1] 13.17308
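
As an optional check (a minimal sketch using a throwaway variable S.chk), the rs indices reported above are positions in the sorted list of two-point slopes, so the interval endpoints can be verified directly.

S.chk <- outer(mpg$miles, mpg$miles, '-') / outer(mpg$gallons, mpg$gallons, '-')
S.chk <- sort(S.chk[upper.tri(S.chk)]) ## no ties in gallons, so all finite
S.chk[c(12, 34)]                       ## matches the reported interval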
  f. (3 pts) Provide the estimate for the slope, obtained from the median of the slopes from part e. Solve this “by hand”.

The “software library” answer is reported above, but we were asked to do it “by hand”.

First calculate the two-point slopes.

S <- outer(y,y,'-')/outer(x,x,'-')
S <- as.numeric(S[upper.tri(S)])
S
##  [1]   4.8148148  16.7741935   9.1764706  22.9787234  13.2673267
##  [6]  35.0000000  15.0000000 -15.5555556  15.8208955  19.5180723
## [11]  10.7142857   6.0294118  21.7647059  28.1818182  13.8000000
## [16]  12.2058824   8.9344262   8.3783784 -11.9047619  13.1730769
## [21]  12.5925926  -7.3913043  13.8709677   6.4814815  13.0000000
## [26]  54.6153846  -0.5405405   7.2527473  12.8571429  31.7391304
## [31]  13.9814815  16.6935484  10.9756098  12.5274725  12.5517241
## [36]  21.4814815  16.0975610   9.6842105  14.0000000  70.0000000
## [41]  15.5844156  18.8888889   6.2962963   7.6562500  13.9830508
  • Note that there are no infinite values (gallons has no ties), so there is nothing to remove.

And then take the median of those slopes as our estimate for \(\beta\).

b.np <- median(S)
b.np
## [1] 13.17308
  • This agrees with our result above.

The question didn’t ask for this, but we can follow the same formula for the intercept as above, except with medians instead of means.

a.np <- median(y) - b.np * median(x)
a.np
## [1] 2.557692

Problem 4: Student-to-faculty ratio (30 points)

A random sample of American colleges and universities resulted in the following numbers of students and faculty (Spring 1973).

univ <- data.frame(name=c("American International", "Bethany Nazarene", "Carlow", "David Lipscomb", 
    "Florida International", "Heidelberg", "Lake Erie", "Mary Harin Baylor", "Newburry", "St. Ambrose", 
    "Smith", "Texas Women's", "Wofford"),
    students=c(2546, 1355, 87, 1858, 4500, 1141, 784, 1063, 753, 1189, 2755, 5602, 988),
    faculty=c(129, 75, 87, 99, 300, 109, 77, 64, 61, 90, 240, 300, 73))
  a. (5 pts) Draw a diagram (e.g., a scatterplot) showing these points, using faculty as the \(x\)-axis. Ideally this should be performed using software.

See the solution to part c. below, where the least squares regression line has been added to the plot.

  b. (7 pts) Estimate \(a\) and \(b\) in a linear model, \(y=a+bx\), using the method of least squares.
    i. (4 pts) “by hand”

The slope is again the sample correlation times the ratio of standard deviations (“rise over run”): \(b = r \, s_y / s_x\).

b <- cor(univ$faculty, univ$students) * sd(univ$students) / sd(univ$faculty)
b
## [1] 16.8282

The intercept is found by passing the line with that slope through the point of averages: \(a = \bar{y} - b\bar{x}\).

a <- mean(univ$students) - b * mean(univ$faculty)
a
## [1] -311.8655
    ii. (3 pts) “using a software library”

The lm command in R gives the following.

fit <- lm(students ~ faculty, data=univ)
fit
## 
## Call:
## lm(formula = students ~ faculty, data = univ)
## 
## Coefficients:
## (Intercept)      faculty  
##     -311.87        16.83
  c. (3 pts) Plot the least squares regression line on the diagram of part a., ideally using software.

The plot from part a. and the line from this part (c.) are shown below.

plot(univ$faculty, univ$students, xlab="faculty", ylab="students", main="Student-to-Faculty ratio")
abline(a, b)
legend("topleft", "Least Squares line", lty=1)

  • There appears to be an increasing relationship.
  d. (7 pts) Test the hypothesis that an increase of one faculty member is accompanied by an average increase of 15 students.

    i. (4 pts) “by hand”

We can convert the question at hand into the following two-tailed set of hypotheses.

\[ \begin{aligned} \mathcal{H}_0 &: \; \beta = 15 \\ \mathcal{H}_1 &: \; \beta \ne 15. \end{aligned} \]

To make our lives a little easier when performing the test, let’s define \(X\) and \(Y\) variables.

y <- univ$students
x <- univ$faculty

The test is performed by first calculating \(U_i = Y_i - \beta_0 X_i\) for our \(\beta_0 = 15\).

beta0 <- 15
u <- y - beta0*x
u
##  [1]   611   230 -1218   373     0  -494  -371   103  -162  -161  -845
## [12]  1102  -107

Now calculate Spearman’s \(\rho\) on the resulting \((X_i, U_i)\) pairs.

n <- length(x)
rx <- rank(x)
ru <- rank(u)
rbind(rx, ru)
##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## rx   10    4    6    8 12.5    9    5    2    1     7    11  12.5     3
## ru   12   10    1   11  8.0    3    4    9    5     6     2  13.0     7
  • First, observe that there are ties in the ranks, which will impact how we calculate the \(p\)-value below.

… Continuing with the calculations …

numer <- sum(rx*ru) - n*((n+1)/2)^2
denomx <- sqrt(sum(rx^2) - n*((n+1)/2)^2)
denomy <- sqrt(sum(ru^2) - n*((n+1)/2)^2)
rho <- numer/(denomx*denomy)
rho
## [1] 0.1898214

Now let’s calculate the \(p\)-value. Since there are ties in the ranks, we must use the CLT approximation.

2*pnorm(-abs(rho)*sqrt(length(x)-1))
## [1] 0.5108206
  • We fail to reject the null hypothesis: there is not enough evidence to conclude that the average increase in students per additional faculty member differs from 15.
    ii. (3 pts) “using software”

Here is a version that uses the \(U\) values we calculated above.

cor.test(x, u, method="spearman", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  x and u
## S = 294.91, p-value = 0.5345
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1898214

Alternatively, we can borrow the library function we created, which we don’t need to paste in again since we have it above. When applied to our univ data we get the following.

np.slope.test(univ$faculty, univ$students, 15, exact=FALSE)
## $beta0
## [1] 15
## 
## $rho
## [1] 0.1898214
## 
## $alternative
## [1] "two.sided"
## 
## $p.value
## [1] 0.5345087
  • In both cases: very similar result; same conclusion.
  e. (5 pts) Calculate a 95% confidence interval for the slope.

Since the question didn’t specify “by hand” or “using a software library”, we’ll borrow the library function we wrote in lecture, and which is already pasted above. Here it is applied to our univ data.

np.slope.CI(univ$faculty, univ$students)
## $interval
## [1]  9.613636 23.356021
## 
## $rs
## [1] 22 56
## 
## $level
## [1] 0.95
## 
## $beta.hat
## [1] 17.43791
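
The same optional endpoint check as in Problem 3 works here (again with a throwaway S.chk), with the extra step of dropping the infinite slope caused by the tie in the faculty counts; see part f. below.

S.chk <- outer(univ$students, univ$students, '-') / outer(univ$faculty, univ$faculty, '-')
S.chk <- S.chk[upper.tri(S.chk)]
S.chk <- sort(S.chk[is.finite(S.chk)]) ## drop the -Inf from the faculty tie
S.chk[c(22, 56)]                       ## matches the reported interval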
  f. (3 pts) Provide the estimate for the slope, obtained from the median of the slopes from part e. Solve this “by hand”.

The “software library” answer is reported above, but we were asked to do it “by hand”.

First calculate the two-point slopes.

S <- outer(y,y,'-')/outer(x,x,'-')
S <- as.numeric(S[upper.tri(S)])
S
##  [1]   22.055556   58.547619 -105.666667   22.933333   20.958333
##  [6]  147.583333   11.426901   13.977778   20.718310   13.144279
## [11]   70.250000   -6.294118   47.909091  -71.700000   17.586387
## [16]   33.884615 -285.500000  -69.700000   48.818182   16.663677
## [21]   11.156250   22.815385   26.545455  -42.434783   22.714286
## [26]   14.563559    1.733333  -21.461538   26.367647   43.000000
## [31]  -25.615385   29.078947   15.677824    8.083333    1.937500
## [36]  103.333333   34.794872  -11.066667  367.333333   74.333333
## [41]   15.766667   -2.526316   31.153846    4.846154   15.034483
## [46]    1.882883    8.484848   17.437908    6.361702   29.083333
## [51]   12.320611   12.092025    9.613636   11.184358   10.440000
## [56]   17.871345   18.875556   25.892019   18.626866        -Inf
## [61]   23.356021   21.605381   19.233051   20.288703   21.014286
## [66]   47.450000   27.821429  183.500000  -64.357143   33.461538
## [71]   15.471366    4.250000  -51.000000   -8.333333   19.583333
## [76]   11.823529   10.580838   20.325991
  • Note that there is one infinite value (a -Inf, from the tie in the faculty counts), so we have to remove it before going further.
S <- S[is.finite(S)]

Now we can take the median of those slopes as our estimate for \(\beta\).

b.np <- median(S)
b.np
## [1] 17.43791
  • This agrees with our result above.

The question didn’t ask for this, but we can follow the same formula for the intercept as above, except with medians instead of means.

a.np <- median(y) - b.np * median(x)
a.np
## [1] -380.4118