Kendall’s \(\tau\)

Kendall’s \(\tau\) measures correlation by counting pairs of observations \((X_i, Y_i)\) and \((X_j, Y_j)\): a pair is concordant when the two coordinates agree in their comparison (both lesser or both greater), and discordant when they disagree. I.e., a pair is concordant when

\[ \frac{Y_j - Y_i}{X_j - X_i} > 0 \quad \mbox{ as long as } X_i \ne X_j, \]

and discordant when

\[ \frac{Y_j - Y_i}{X_j - X_i} < 0 \quad \mbox{ as long as } X_i \ne X_j. \]

Kendall’s \(\tau\) is the difference between the number of concordant pairs \(N_c\) in the sample and the number of discordant pairs \(N_d\), divided by the total number of pairs \({n \choose 2} = n (n-1) /2\):

\[ \tau = \frac{N_c - N_d}{n(n-1)/2}. \]
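For instance, here is a brute-force count on a small made-up sample (the vectors x and y below are hypothetical, purely for illustration):

x <- c(1, 3, 2, 5, 4)   ## made-up data with no ties
y <- c(2, 1, 4, 5, 3)
n <- length(x)
Nc <- Nd <- 0
for(i in 1:(n-1)) {
    for(j in (i+1):n) {
        s <- sign(x[j] - x[i]) * sign(y[j] - y[i])
        if(s > 0) Nc <- Nc + 1        ## concordant pair
        else if(s < 0) Nd <- Nd + 1   ## discordant pair
    }
}
(Nc - Nd) / (n*(n-1)/2)
## [1] 0.4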

In the case of ties in the \(Y\)-values, we make the following adjustment. When

\[ \frac{Y_j - Y_i}{X_j - X_i} = 0 \quad \mbox{ as long as } X_i \ne X_j \]

we count it as half concordant and half discordant, i.e., adding \(1/2\) to both \(N_c\) and \(N_d\). Whenever \(X_i = X_j\), the comparison is void and contributes to neither \(N_c\) nor \(N_d\). The definition of \(\tau\) is adjusted accordingly:

\[ \tau = \frac{N_c - N_d}{N_c + N_d}. \]

Another description for how to calculate \(\tau\) comes from the documentation for pKendall in R (from the SuppDists package):

\[ \tau = 1 - \frac{4s}{n(n-1)}. \]

Substituting \(N_c = n(n-1)/2 - N_d\) (valid when there are no ties) into the definition above gives \(\tau = 1 - 4 N_d / (n(n-1))\), so \(s\) plays the role of \(N_d\).

In R we may simply calculate cor(x, y, method="kendall").
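Continuing the made-up sample above, we can check that all three routes to \(\tau\) agree:

(Nc - Nd) / (n*(n-1)/2)        ## the definition
## [1] 0.4
1 - 4*Nd / (n*(n-1))           ## pKendall-style, taking s = Nd
## [1] 0.4
cor(x, y, method="kendall")    ## the built-in
## [1] 0.4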

In order to turn a measure of \(\tau\) into a test for the strength of the association we need the null distribution. This is calculated in the same spirit as our other rank-based tests (e.g., Wilcoxon and squared ranks).

Again, we know how to do that now, but we won’t reinvent the wheel. Instead we appeal to a normal approximation:

\[ \begin{aligned} && Z &= \tau \frac{3 \sqrt{n(n-1)}}{\sqrt{2(2n+5)}} \sim \mathcal{N}(0,1), \\ \mbox{or equivalently (and slightly more simply) } && Z &= (N_c - N_d) \frac{\sqrt{18}}{\sqrt{n(n-1)(2n+5)}} \sim \mathcal{N}(0,1). \end{aligned} \]
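The two forms agree because \(\tau = (N_c - N_d)/(n(n-1)/2)\); a quick check on the made-up sample from above:

tau.toy <- (Nc - Nd) / (n*(n-1)/2)
tau.toy * 3 * sqrt(n*(n-1)) / sqrt(2*(2*n+5))    ## first form
## [1] 0.9797959
(Nc - Nd) * sqrt(18) / sqrt(n*(n-1)*(2*n+5))     ## second form
## [1] 0.9797959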

The hypotheses are specified as follows, and the specific verbiage here is due to the insensitivity of \(\tau\) (like \(\rho\)) to some types of dependence.

For a two-tailed test we consider

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{The $X_i$ and $Y_i$ are independent.} \\ \mathcal{H}_1 &: \; \mbox{Pairs of observations either tend to be concordant or tend to be discordant.} \end{aligned} \]

The \(p\)-value is calculated in the usual way as

\[ \varphi = 2 \min\{\mathbb{P}(T \leq \tau), \mathbb{P}(T > \tau) \} \quad \mbox{ or } \quad \varphi \approx 2\min\left\{\mathbb{P}\left(Z \leq \frac{(N_c - N_d+1)\sqrt{18}}{\sqrt{n(n-1)(2n+5)}}\right), \mathbb{P}\left(Z \geq \frac{(N_c - N_d-1)\sqrt{18}}{\sqrt{n(n-1)(2n+5)}}\right)\right\}. \]

For a lower-tailed test (for negative correlation), we have

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{The $X_i$ and $Y_i$ are independent.} \\ \mathcal{H}_1 &: \; \mbox{Pairs of observations tend to be discordant.} \end{aligned} \]

The \(p\)-value is calculated as

\[ \varphi = \mathbb{P}(T \leq \tau) \quad \mbox{ or } \quad \varphi \approx\mathbb{P}\left(Z \leq \frac{(N_c - N_d+1)\sqrt{18}}{\sqrt{n(n-1)(2n+5)}}\right). \]

For an upper-tailed test (for positive correlation), we consider

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{The $X_i$ and $Y_i$ are mutually independent.} \\ \mathcal{H}_1 &: \; \mbox{Pairs of observations tend to be concordant.} \end{aligned} \]

The \(p\)-value is calculated as

\[ \varphi = \mathbb{P}(T > \tau) \quad \mbox{ or } \quad \varphi \approx \mathbb{P}\left(Z \geq \frac{(N_c - N_d-1)\sqrt{18}}{\sqrt{n(n-1)(2n+5)}}\right). \]
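To collect the formulas above in one place, here is a small helper of my own (a sketch; the name tau.test and its interface are hypothetical, not from any package):

## a sketch implementing the normal approximation with the
## +/- 1 continuity correction from the formulas above
tau.test <- function(Nc, Nd, n, alternative=c("two.sided", "less", "greater"))
{
    alternative <- match.arg(alternative)
    denom <- sqrt(n*(n-1)*(2*n+5))
    zl <- (Nc - Nd + 1)*sqrt(18)/denom   ## lower-tail statistic
    zu <- (Nc - Nd - 1)*sqrt(18)/denom   ## upper-tail statistic
    switch(alternative,
        less = pnorm(zl),
        greater = pnorm(zu, lower.tail=FALSE),
        two.sided = 2*min(pnorm(zl), pnorm(zu, lower.tail=FALSE)))
}

We will get a chance to try it out below.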

Employee benefits (redux)

Metro officials gathered aggregate data summarizing the number of employees that receive transit benefits and the average price that the employer pays per employee for those benefits. Is there evidence to suggest that the average price per employee drops as the number of employees using the service increases (i.e., an employer discount)?

emp <- data.frame(number=c(173, 149, 124, 64, 88, 113, 142, 27, 39, 51),
    price=c(2.14, 2.39, 2.19, 2.56, 2.44, 2.29, 2.18, 2.55, 2.32, 2.27))

First, let’s present the data in a way that makes it easy to eyeball the number of concordant and discordant pairs.

o <- order(emp$number)
emp[o,]
##    number price
## 8      27  2.55
## 9      39  2.32
## 10     51  2.27
## 4      64  2.56
## 5      88  2.44
## 6     113  2.29
## 3     124  2.19
## 7     142  2.18
## 2     149  2.39
## 1     173  2.14

Now, an easy way to count the concordant and discordant pairs is to count, for each value in the second (\(Y\)) column, how many of the values beneath it in the sorted table are larger (concordant) or smaller (discordant).

n <- nrow(emp)
yo <- emp$price[o]  ## Y-values sorted by X
concord <- discord <- rep(NA, n)
for(i in 1:(n-1)) {
    yties <- sum(yo[i] == yo[(i+1):n])                ## Y-ties count half-and-half
    concord[i] <- sum(yo[i] < yo[(i+1):n]) + yties/2
    discord[i] <- sum(yo[i] > yo[(i+1):n]) + yties/2
}
cbind(emp[o, ], concord, discord)
##    number price concord discord
## 8      27  2.55       1       8
## 9      39  2.32       3       5
## 10     51  2.27       4       3
## 4      64  2.56       0       6
## 5      88  2.44       0       5
## 6     113  2.29       1       3
## 3     124  2.19       1       2
## 7     142  2.18       1       1
## 2     149  2.39       0       1
## 1     173  2.14      NA      NA
  • Notice that there are no \(X\)-ties in the data, otherwise we would have had to count more carefully.

Translating those numbers into \(N_c\) and \(N_d\) proceeds as follows, and then we have our estimate of \(\tau\).

Nc <- sum(concord, na.rm=TRUE)
Nd <- sum(discord, na.rm=TRUE)
tau <- (Nc - Nd) / (Nc + Nd)
tau
## [1] -0.5111111
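As a quick sanity check: with no ties, every one of the \({10 \choose 2} = 45\) pairs should land in one bucket or the other.

c(Nc, Nd, Nc + Nd, choose(nrow(emp), 2))
## [1] 11 34 45 45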

Let’s check that against R’s cor command.

cor(emp$number, emp$price, method="kendall")
## [1] -0.5111111
  • Nice! (Which one is easier?)
  • So the estimate of \(\tau\) is negative, suggesting there may indeed be some kind of employer discount.

Let’s see if that result is statistically significant by performing a lower-tailed test.

library(SuppDists)
pKendall(tau, nrow(emp))
## [1] 0.02331129
  • We reject the null hypothesis at the 5% level and conclude that there is indeed an employer discount.
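For comparison, the hypothetical tau.test sketch from earlier (i.e., the continuity-corrected normal approximation) gives about 0.0245, close to the exact calculation:

tau.test(Nc, Nd, nrow(emp), alternative="less")   ## approximately 0.0245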

R has a built-in function called cor.test which can perform a test based on Kendall’s \(\tau\).

cor.test(emp$number, emp$price, alternative="less", method="kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  emp$number and emp$price
## T = 11, p-value = 0.02331
## alternative hypothesis: true tau is less than 0
## sample estimates:
##        tau 
## -0.5111111
  • Bingo.

Rating of service versus commute time (redux)

From a Metro survey, officials analyzed the overall rating that passengers gave to the transit system and their most popular time of commute. Below is sample data from passengers. Note that the higher the rating, the more satisfied the customer was with the transit system. Is there a significant correlation between the two variables at the five percent level of significance?

survey <- data.frame(
    time=c("7:00", "7:20", "8:00", "7:30", "8:00", "8:30", "8:45", "7:25", "9:00", "8:00", "7:55", 
        "7:15", "7:45", "9:00"),
    rating=c(5.5, 8.5, 10, 8.5, 7.0, 6.5, 7.0, 9.0, 4.5, 7.5, 8.5, 6.5, 8.0, 10))

Again, the times are in a “string” format, so we will need to make a little conversion into something “rankable”.

ptime <- as.POSIXct(strptime(survey$time, "%H:%M"))

Since we have \(X\)-ties this time around, let’s see how counting by hand would work in that case.

o <- order(ptime)
n <- length(ptime)
yo <- survey$rating[o]  ## ratings sorted by time
xo <- ptime[o]
concord <- discord <- rep(NA, n)
for(i in 1:(n-1)) {
    xu <- xo[i] != xo[(i+1):n]                     ## FALSE voids comparisons with X-ties
    yties <- sum((yo[i] == yo[(i+1):n]) * xu)      ## Y-ties count half-and-half
    concord[i] <- sum((yo[i] < yo[(i+1):n])*xu) + yties/2
    discord[i] <- sum((yo[i] > yo[(i+1):n])*xu) + yties/2
}
cbind(survey[o,], concord, discord)
##    time rating concord discord
## 1  7:00    5.5    12.0     1.0
## 12 7:15    6.5    10.5     1.5
## 2  7:20    8.5     4.0     7.0
## 8  7:25    9.0     2.0     8.0
## 4  7:30    8.5     2.5     6.5
## 13 7:45    8.0     3.0     5.0
## 11 7:55    8.5     2.0     5.0
## 3  8:00   10.0     0.5     3.5
## 5  8:00    7.0     1.5     2.5
## 10 8:00    7.5     1.0     3.0
## 6  8:30    6.5     2.0     1.0
## 7  8:45    7.0     1.0     1.0
## 9  9:00    4.5     0.0     0.0
## 14 9:00   10.0      NA      NA

Translating those numbers into \(N_c\) and \(N_d\) proceeds as follows, and then we have our estimate of \(\tau\).

Nc <- sum(concord, na.rm=TRUE)
Nd <- sum(discord, na.rm=TRUE)
tau <- (Nc - Nd) / (Nc + Nd)
tau
## [1] -0.03448276

Now we can calculate our \(\tau\) (in the simple way).

cor(as.numeric(ptime), survey$rating, method="kendall")
## [1] -0.03488608
  • Not identical, but very close. (The difference appears to be that cor computes the \(\tau_b\) variant, which divides \(N_c - N_d\) by \(\sqrt{(n(n-1)/2 - T_X)\,(n(n-1)/2 - T_Y)}\), where \(T_X\) and \(T_Y\) count the tied pairs in each coordinate, rather than by \(N_c + N_d\).)
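If that explanation is right, we should be able to reproduce cor’s number by hand:

Tx <- sum(choose(table(as.numeric(ptime)), 2))   ## tied pairs among the times
Ty <- sum(choose(table(survey$rating), 2))       ## tied pairs among the ratings
(Nc - Nd) / sqrt((n*(n-1)/2 - Tx) * (n*(n-1)/2 - Ty))
## [1] -0.03488608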

Looks like a pretty small correlation, but we can test whether it is small enough to be considered zero.

The \(p\)-value of the two-tailed test may be calculated as follows.

n <- nrow(survey)
2*min(pKendall(tau, n), 1-pKendall(tau, n))
## [1] 0.9144913
  • Not enough evidence to reject the null hypothesis that time and rating are mutually independent at the 5% level.

Since there were ties in the ranks it is better to use the normal approximation.

z1 <- (Nc - Nd + 1) * sqrt(18)/sqrt(n*(n-1)*(2*n+5))
z2 <- (Nc - Nd - 1) * sqrt(18)/sqrt(n*(n-1)*(2*n+5))
2 * min(pnorm(z1), pnorm(z2, lower.tail=FALSE))
## [1] 0.912814
  • In either case, there is not enough evidence to reject the null hypothesis that time and rating are mutually independent.

Here is a library version.

cor.test(as.numeric(ptime), survey$rating, method="kendall", exact=FALSE, continuity=TRUE)
## 
##  Kendall's rank correlation tau
## 
## data:  as.numeric(ptime) and survey$rating
## z = -0.11135, p-value = 0.9113
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.03488608
  • Very close. Notice that we used continuity=TRUE.
  • Observe that we would get (nearly) the same result as continuity=FALSE if we did the following.
z <- (Nc-Nd) * sqrt(18)/sqrt(n*(n-1)*(2*n+5))
2 * pnorm(-abs(z))
## [1] 0.8695464
cor.test(as.numeric(ptime), survey$rating, method="kendall", exact=FALSE)$p.value
## [1] 0.8673479
  • Very close.

Test for trend (redux)

One application of Kendall’s \(\tau\) applies to a single sample \(Y_1, \dots, Y_n\) where the other set of coordinates, the \(X_i\)’s, are derived from the order (or time) at which the \(Y_i\) samples were collected. Then a calculation of \(\tau\), and an associated test, can be interpreted as a test for trend, similar to the Cox & Stuart test.
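For a generic series \(y_1, \dots, y_n\) in time order the recipe is simply the following (with a made-up series y for illustration):

y <- c(3.1, 2.8, 3.6, 4.0, 3.3)   ## hypothetical series, in time order
cor(seq_along(y), y, method="kendall")
## [1] 0.4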

Measuring precipitation

The following values record annual precipitation over the span of 1950–1968 inclusive.

precip <- c(45.25, 45.83, 41.77, 36.26, 45.27, 52.25, 35.37, 57.16, 35.38, 58.32, 41.05, 22.72, 
    45.73, 37.90, 41.72, 36.07, 49.83, 36.24, 39.90)

So the idea is to create an \(X\) to go with these \(Y\)s. Quite simply …

year <- 1950:1968

Skipping the plot to look for trend, we jump right to the Kendall’s \(\tau\) calculation.

tau <- cor(year, precip, method="kendall")
tau
## [1] -0.1345029
  • With a number like \(\tau = -0.13\) above we might be tempted to conclude a (modestly strong) negative correlation, although notice that this is smaller in magnitude than our Spearman’s \(\rho=-0.25\). But both measurements are without a benchmark unless we consider their sampling distribution.

The \(p\)-value associated with a two-tailed test may be calculated as follows.

2*min(pKendall(tau, length(precip)), 1-pKendall(tau, length(precip)))
## [1] 0.4467731
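Cross-checking with the (uncorrected) normal approximation from earlier gives a similar \(p\)-value of roughly 0.42:

n <- length(precip)
z <- tau * 3 * sqrt(n*(n-1)) / sqrt(2*(2*n+5))
2 * pnorm(-abs(z))   ## roughly 0.42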

How about using the library?

cor.test(year, precip, method="kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  year and precip
## T = 74, p-value = 0.4467
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.1345029
  • A nearly identical result.