Department of Statistics, Virginia Tech

Here we introduce non-parametric measures of correlation and related tests for independence, starting with a simple extension of ordinary (Pearson) correlation.

This is “ordinary” correlation: the covariance between two samples normalized (divided) by the product of their sample standard deviations. It is usually denoted by \(r\) or \(r_{xy}\). Suppose we have sample pairs \((X_i, Y_i)\) for \(i=1, \dots, n\). Then Pearson’s correlation coefficient is given by the following.

\[ r = \frac{\sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})}{\left[\sum_{i=1}^n (X_i - \bar{X})^2 \sum_{i=1}^n(Y_i - \bar{Y})^2 \right]^{\frac{1}{2}}}\equiv \frac{\mathbb{C}\mathrm{ov}(X, Y)}{\sqrt{\mathbb{V}\mathrm{ar}(X) \mathbb{V}\mathrm{ar}(Y) }} \]

- It is calculated by `cor(X, Y)` in R.

Observe that \(-1 \leq r \leq 1\) with the sign giving the direction of the relationship and the magnitude giving the strength.
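As a quick sanity check, the formula above can be computed by hand and compared against `cor`; the data below are arbitrary illustrative values.

```
# Pearson's r from the definition, compared to R's built-in cor()
x <- c(1.2, 3.4, 2.2, 5.0, 4.1)
y <- c(2.0, 3.9, 2.5, 6.1, 4.0)
num <- sum((x - mean(x)) * (y - mean(y)))
den <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
num/den - cor(x, y)  # zero, up to floating point
```

The \(n-1\) factors in the sample covariance and standard deviations cancel, which is why the ratio of raw sums matches `cor` exactly.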

- It is a measure of **linear correlation** because the stronger the association, the more the plotted pairs \((X_i, Y_i)\) tend to fall along a straight line.

```
library(mvtnorm)  # for rmvnorm

## bivariate normal draws at four correlation levels
par(mfrow=c(2,2), mai=c(.5,.5,.1,.1))
plot(rmvnorm(200, rep(0,2), matrix(c(1,1,1,1), ncol=2)), pch=20,
     xlab="", ylab="", xlim=c(-3,3), ylim=c(-3,3))
text(x=-2, y=2.5, "corr = 1", col=2, cex=1.3)
plot(rmvnorm(200, rep(0,2), matrix(c(1,.5,.5,1), ncol=2)), pch=20,
     xlab="", ylab="", xlim=c(-3,3), ylim=c(-3,3))
text(x=-2, y=2.5, "corr = .5", col=2, cex=1.3)
plot(rmvnorm(200, rep(0,2), matrix(c(1,.8,.8,1), ncol=2)), pch=20,
     xlab="", ylab="", xlim=c(-3,3), ylim=c(-3,3))
text(x=-2, y=2.5, "corr = .8", col=2, cex=1.3)
plot(rmvnorm(200, rep(0,2), matrix(c(1,-.8,-.8,1), ncol=2)), pch=20,
     xlab="", ylab="", xlim=c(-3,3), ylim=c(-3,3))
text(x=2, y=2.5, "corr = -.8", col=2, cex=1.3)
```

- It can fail spectacularly when the data do not exhibit a linear relationship, or when there are outliers.

```
par(mfrow=c(1,2), mai=c(.5,.5,.5,.1))
## a strong quadratic relationship, yet near-zero Pearson correlation
x <- rnorm(200)
y <- -x^2 + rnorm(200, 0, .2)
z <- cbind(x, y)
plot(z, pch=20, xlab="", ylab="", main=paste("corr =", round(cor(z[,1], z[,2]), 2)))
## a single outlier inflates the correlation of otherwise uncorrelated data
z <- rbind(rmvnorm(199, c(0,0)), c(20,20))
plot(z, pch=20, xlab="", ylab="", main=paste("corr =", round(cor(z[,1], z[,2]), 2)))
```

Although Pearson’s \(r\) is intuitive, testing whether the strength of an association is significant is hard because the null distribution of the statistic has no simple closed form. That is remedied by working with ranks instead.

Spearman’s rank correlation is basically that: Pearson’s correlation based on the ranks \(R(X_i)\) and \(R(Y_i)\) of \(X_i\) and \(Y_i\) rather than on the raw values themselves, and with ties in rank handled in the usual way.

- Note that the ranks are calculated on the \(X\)s and \(Y\)s separately, not in a combined sample.

Spearman’s correlation is called \(\rho\), the Greek equivalent of \(r\). It works out to be the following.

\[ \rho = \frac{\sum_{i=1}^n R(X_i) R(Y_i) - n \left(\frac{n+1}{2}\right)^2}{ \left(\sum_{i=1}^n R(X_i)^2 - n\left(\frac{n+1}{2}\right)^2 \right)^\frac{1}{2} \left(\sum_{i=1}^n R(Y_i)^2 - n\left(\frac{n+1}{2}\right)^2 \right)^\frac{1}{2}} \]

In class we saw that

- there is a simplification in the case of no tied ranks,
- and that \(\rho =\) `cor(rank(x), rank(y))`.

In order to turn an observed \(\rho\) into a test for the strength of the association we need the null distribution. This is calculated in the same spirit as our other rank-based tests (e.g., Wilcoxon and squared ranks).

- I.e., the mass function counts the number of permutations of the paired \(\{1, \dots, n\}\) ranks that yield a particular \(\rho\) value, divided by the total number of such permutations.

We know how to do that now, but we won’t reinvent the wheel.

- Table A10 in the text provides evaluations of the CDF for \(n \leq 30\).
- We can use the CLT to get that \(Z = \rho \sqrt{n-1}\) is approximately \(\mathcal{N}(0,1)\) under \(\mathcal{H}_0\) for \(n > 30\).
- An add-on package `SuppDists` contains a `pSpearman` function.
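To get a feel for the quality of the normal approximation at a modest sample size, we can evaluate both lower tails at \(n = 10\), \(\rho = -0.6\) (the values from the Metro example below); the `requireNamespace` guard simply skips the exact calculation if `SuppDists` is not installed.

```
# compare the CLT tail probability with the exact one from SuppDists
n <- 10; rho <- -0.6
clt <- pnorm(rho * sqrt(n - 1))           # about 0.036
if (requireNamespace("SuppDists", quietly = TRUE)) {
  exact <- SuppDists::pSpearman(rho, n)   # about 0.037
  print(c(exact = exact, clt = clt))
}
```

Even at \(n = 10\), well below the \(n > 30\) rule of thumb, the two tail probabilities are close.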

When there are ties in the ranks the CLT method is usually preferred, since the averaged ranks complicate calculation of the exact null CDF.

The **hypotheses** are specified as follows, and the specific verbiage here is due to the insensitivity of \(\rho\) to some types of dependence.

For a two-tailed test we consider

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{The $X_i$ and $Y_i$ are mutually independent.} \\ \mathcal{H}_1 &: \; \mbox{Either (a) there is a tendency for the larger values of $X$ to be paired with larger values of $Y$, or} \\ & \;\;\;\; \mbox{(b) larger values of $X$ with smaller values of $Y$, and vice versa with smaller and larger switched.} \end{aligned} \]

The \(p\)-value is calculated in the usual way as

\[ \varphi = 2 \min\{\mathbb{P}(R \leq \rho), \mathbb{P}(R > \rho) \} \quad \mbox{ or } \quad \varphi \approx 2 \mathbb{P}(Z \leq - |\rho|\sqrt{n-1}) \]

- depending on whether the exact or CLT approximated null distribution is preferred, respectively.

For a lower-tailed test (for negative correlation), we have

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{The $X_i$ and $Y_i$ are mutually independent.} \\ \mathcal{H}_1 &: \; \mbox{There is a tendency for the smaller values of $X$ to be paired with larger values of $Y$, and vice versa.} \end{aligned} \]

The \(p\)-value is calculated as

\[ \varphi = \mathbb{P}(R \leq \rho) \quad \mbox{ or } \quad \varphi \approx \mathbb{P}(Z \leq \rho\sqrt{n-1}). \]

For an upper-tailed test (for positive correlation), we consider

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{The $X_i$ and $Y_i$ are mutually independent.} \\ \mathcal{H}_1 &: \; \mbox{There is a tendency for the larger values of $X$ to be paired with larger values of $Y$, and vice versa.} \end{aligned} \]

The \(p\)-value is calculated as

\[ \varphi = \mathbb{P}(R > \rho) \quad \mbox{ or } \quad \varphi \approx \mathbb{P}(Z \geq \rho\sqrt{n-1}). \]
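The three CLT-based recipes above can be collected into one small helper. The function name `spearman_clt_test` is purely illustrative (not from the text or any library); the `alternative` argument mirrors `cor.test`’s naming.

```
# CLT-based p-value for Spearman's rho (illustrative helper, not a built-in)
spearman_clt_test <- function(x, y, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  n <- length(x)
  rho <- cor(rank(x), rank(y))   # Spearman's rho as correlation of ranks
  z <- rho * sqrt(n - 1)         # approximately N(0,1) under the null
  p <- switch(alternative,
    two.sided = 2 * pnorm(-abs(z)),
    less      = pnorm(z),
    greater   = pnorm(z, lower.tail = FALSE))
  list(rho = rho, z = z, p.value = p)
}
```

The worked examples that follow effectively carry out these same calculations step by step.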

Metro officials gathered aggregate data on the number of employees receiving transit benefits and the average price the employer pays per employee for those benefits. Is there evidence to suggest that the average price per employee drops as the number of employees using the service increases (i.e., an employer discount)?

```
emp <- data.frame(number=c(173, 149, 124, 64, 88, 113, 142, 27, 39, 51),
                  price=c(2.14, 2.39, 2.19, 2.56, 2.44, 2.29, 2.18, 2.55, 2.32, 2.27))
```

First, let’s confirm that the (complicated) formula for \(\rho\) agrees with our characterization of \(\rho\) as the correlation of the ranks.

```
rx <- rank(emp$number)
ry <- rank(emp$price)
n <- nrow(emp)
numer <- sum(rx*ry) - n*((n+1)/2)^2
denomx <- sqrt(sum(rx^2) - n*((n+1)/2)^2)
denomy <- sqrt(sum(ry^2) - n*((n+1)/2)^2)
rho <- numer/(denomx*denomy)
rho
```

`## [1] -0.6`

`cor(rank(emp$number), rank(emp$price))`

`## [1] -0.6`

- Nice! (Which one is easier?)
- So the rank correlation is negative, which suggests some kind of employer discount.

Let’s see if that result is statistically significant by performing a lower-tailed test.

```
library(SuppDists)
pSpearman(rho, n)
```

`## [1] 0.03671324`

- We reject the null hypothesis at the 5% level and conclude that there is an employer discount.

R has a built-in function called `cor.test` which can perform a test on Spearman’s \(\rho\). It uses slightly different calculations, but mostly yields very similar \(p\)-values.

`cor.test(emp$number, emp$price, alternative="less", method="spearman")`

```
##
## Spearman's rank correlation rho
##
## data: emp$number and emp$price
## S = 264, p-value = 0.03656
## alternative hypothesis: true rho is less than 0
## sample estimates:
## rho
## -0.6
```

- Very close, but not exact.

From a Metro survey, officials analyzed the overall rating that passengers gave to the transit system and their most popular time of commute. Below is sample data from passengers. Note that the higher the rating, the more satisfied that the customer was with the transit system. Is there a significant correlation in the two variables at the five percent level of significance?

```
survey <- data.frame(
  time=c("7:00", "7:20", "8:00", "7:30", "8:00", "8:30", "8:45", "7:25", "9:00", "8:00", "7:55",
         "7:15", "7:45", "9:00"),
  rating=c(5.5, 8.5, 10, 8.5, 7.0, 6.5, 7.0, 9.0, 4.5, 7.5, 8.5, 6.5, 8.0, 10))
```

Notice that the times are in a “string” format, so we will need to make a little conversion into something “`rank`able”.

`ptime <- as.POSIXct(strptime(survey$time, "%H:%M"))`

Now we can calculate our \(\rho\) (in the simple way).

```
rho <- cor(rank(ptime), rank(survey$rating))
rho
```

`## [1] 0.00779512`

- Looks like a pretty small correlation, but we can test whether it is indistinguishable from zero.

The \(p\)-value of the two-tailed test may be calculated as follows.

```
n <- nrow(survey)
2*min(pSpearman(rho, n), 1-pSpearman(rho, n))
```

`## [1] 0.9940636`

- (A \(p\)-value of basically one.)
- Not enough evidence to reject the null hypothesis that time and rating are mutually independent at the 5% level.

Now, there were some ties in the ranks (just look at the `time` variable). Let’s do the CLT version instead.
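The way `rank` handles ties (midranks) is easy to see on a toy vector: tied values share the average of the ranks they would jointly occupy.

```
rank(c(7.0, 8.5, 8.5, 10))  # the two 8.5's split ranks 2 and 3
## [1] 1.0 2.5 2.5 4.0
```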

```
z <- rho * sqrt(n-1)
2*pnorm(-abs(z))
```

`## [1] 0.9775778`

- In either case, there is not enough evidence to reject the null hypothesis that time and rating are mutually independent.

Here is a library version.

`cor.test(as.numeric(ptime), survey$rating, method="spearman", exact=FALSE, continuity=FALSE)`

```
##
## Spearman's rank correlation rho
##
## data: as.numeric(ptime) and survey$rating
## S = 451.45, p-value = 0.9789
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.00779512
```

- Again, almost exactly the same result.

One application of Spearman’s \(\rho\) applies to a single sample \(Y_1, \dots, Y_n\) where the other set of coordinates, the \(X_i\)’s, are derived from the order (or time) at which the \(Y_i\) samples were collected. Then a calculation of \(\rho\), and an associated test, can be interpreted as a test for trend, similar to the Cox & Stuart test.

- In fact, tests for trend based on ranks are generally more powerful than Cox & Stuart style tests.
- Nevertheless, rank-based tests for trend are not as widely applicable as Cox & Stuart tests.
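Since the \(X\)s are just the collection order, a trend test of this kind needs only the single sample. Here is a minimal wrapper around `cor.test`; the name `trend_test` is my own, not a built-in, and with `exact=FALSE` the \(p\)-value comes from an approximation, so it will differ slightly from exact-table values.

```
# Rank-based test for monotone trend: correlate y with its collection order.
trend_test <- function(y, alternative = "two.sided") {
  cor.test(seq_along(y), y, method = "spearman",
           alternative = alternative, exact = FALSE)
}
```

Applied to the precipitation series below, this reproduces the same \(\rho\) as constructing the `year` variable by hand.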

The following values record annual precipitation over the span of 1950–1968 inclusive.

```
precip <- c(45.25, 45.83, 41.77, 36.26, 45.27, 52.25, 35.37, 57.16, 35.38, 58.32, 41.05, 22.72,
            45.73, 37.90, 41.72, 36.07, 49.83, 36.24, 39.90)
```

So the idea is to create an \(X\) to go with these \(Y\)s. Quite simply …

`year <- 1950:1968`

Of course, it always makes sense to visualize data when checking for trend.

`plot(year, precip, main="precipitation over the years")`

- Doesn’t look like it has any trend to me. Let’s test.

```
rho <- cor(rank(year), rank(precip))
rho
```

`## [1] -0.245614`

- With a number like \(\rho = -0.25\) above we might be tempted to conclude a modest negative correlation. But that measurement lacks a benchmark unless we consider its sampling distribution.

The \(p\)-value associated with a two-tailed test may be calculated as follows.

`2*min(pSpearman(rho, length(precip)), 1-pSpearman(rho, length(precip)))`

`## [1] 0.3076766`

How about using the library?

`cor.test(year, precip, method="spearman")`

```
##
## Spearman's rank correlation rho
##
## data: year and precip
## S = 1420, p-value = 0.3094
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.245614
```

- A nearly identical result.