Department of Statistics, Virginia Tech

The tests presented here compare samples from multiple populations (first two, and then many) and check whether they have the same distribution. A difference between these and, say, the methods based on contingency tables is that they use the entire sample, rather than a binarization or classification of it into discrete categories. Therefore, they generally have higher power.

The test presented here is known as both the Mann–Whitney and the Wilcoxon test, with each author presenting equivalent forms around the same time. It determines whether two (ordinal) samples came from the same population.

The data consist of two (ordinal) random samples

\[ X_1, X_2, \dots, X_{n_x} \quad \quad Y_1, Y_2, \dots, Y_{n_y} \]

from two populations, or from a randomized set of two blocks from one population (under different treatments). We assume independent sampling, and mutual independence between the two samples.

The test works with the ranks, from 1 to \(N\), of those observations under the combined sample of size \(N = n_x + n_y\). Let \(R(X_i)\) and \(R(Y_j)\) denote the ranks assigned to \(X_i\) and \(Y_j\) for all \(i\) and \(j\).

- If several sample values are equal to each other (tied), assign to each the average of the ranks that would have been assigned to them had there been no ties.
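In R, the `rank` function performs exactly this averaging by default (its `ties.method` is `"average"`); a quick check:

```r
# Tied values receive the average of the ranks they would have occupied:
# the two 12s share ranks 2 and 3, so each gets 2.5
rank(c(10, 12, 12, 15))
## [1] 1.0 2.5 2.5 4.0
```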

There are three **test statistics**, depending on the nature of the sample(s) and depending on how you wish to carry out the calculations for the test.

If there are no ties, or just a few ties, the sum of the ranks assigned to the sample from population 1 can be used as a test statistic.

\[ T = \sum_{i=1}^{n_x} R(X_i) \]

The distribution of this statistic was discussed at a high level in class; see ranks.R for a bespoke implementation. A slight modification of \(T\), namely \(T' = T - n_x(n_x + 1)/2\), has a so-called Wilcoxon-test distribution with parameters \(n_x\) and \(n_y\); for example, see `? rwilcox` in R.
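For small samples this distribution can be checked by brute force, enumerating all \(\binom{N}{n_x}\) ways of assigning ranks to the \(X\) sample (a quick sketch, separate from the ranks.R implementation):

```r
# Exact null distribution of T' = T - nx*(nx+1)/2 by enumeration,
# compared against R's built-in Wilcoxon distribution (dwilcox)
nx <- 3; ny <- 4; N <- nx + ny
combos <- combn(N, nx)                      # every possible set of X ranks
tprime <- colSums(combos) - nx*(nx + 1)/2   # T' for each assignment
tab <- table(tprime)/ncol(combos)           # exact null probabilities
all.equal(as.numeric(tab), dwilcox(as.numeric(names(tab)), nx, ny))
## [1] TRUE
```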

If there are many ties, then we subtract off the mean from \(T\) and divide by its standard deviation to get

\[ T_1 = \frac{T - n_x \frac{N+1}{2}}{\sqrt{\frac{n_x n_y}{N(N-1)} \sum_{i=1}^N R_i^2 - \frac{n_x n_y(N+1)^2}{4(N-1)}}}, \]

where \(\sum_{i=1}^N R_i^2\) refers to the sum of the squares of all \(N\) ranks or average ranks actually used in both samples. Then the Central Limit Theorem says that \(T_1 \sim \mathcal{N}(0,1)\), asymptotically.
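The \(T_1\) calculation can be wrapped in a small helper function (a sketch; the name `t1stat` is ours):

```r
# T1 statistic for the Mann-Whitney-Wilcoxon test in the presence of ties
t1stat <- function(x, y) {
  nx <- length(x); ny <- length(y); N <- nx + ny
  r <- rank(c(x, y))              # average ranks under the combined sample
  t <- sum(r[seq_along(x)])       # rank sum T for the X sample
  num <- t - nx*(N + 1)/2
  den <- sqrt(nx*ny/(N*(N - 1)) * sum(r^2) - nx*ny*(N + 1)^2/(4*(N - 1)))
  num/den
}
```

Applied to the MD versus DC data later in these notes, `t1stat(MD, DC)` reproduces the \(t_1\) computed there step by step.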

We have the usual three types of test: two-tailed, lower-tailed and upper-tailed, by which to compare the distribution functions \(F_X\) and \(F_Y\).

For a **two-tailed** test, we have

\[ \begin{aligned} \mathcal{H}_0 &: \; F_X(x) = F_Y(x) \; \mbox{ for all $x$} \\ \mathcal{H}_1 &: \; F_X(x) \ne F_Y(x) \; \mbox{ for some $x$}. \end{aligned} \]

When using the \(T'\) statistic, via the observed value \(t'\) when there are few ties, the \(p\)-value may be calculated as

\[ \varphi = 2 \left\{ \begin{array}{ll} \mathbb{P}(T' \geq t'-1) & \mbox{ if } t' > \frac{n_x n_y}{2}, \mbox{ i.e., above the null mean} \\ \mathbb{P}(T' \leq t') & \mbox{otherwise} \end{array} \right. \]
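This formula can be evaluated with `pwilcox` (a sketch; the helper name `phi_two_tailed` is ours, and we use the fact that \(T'\) is integer-valued, so \(\mathbb{P}(T' \geq t'-1) = \mathbb{P}(T' > t'-2)\)):

```r
# Two-tailed p-value from the observed t' when there are few ties
phi_two_tailed <- function(tprime, nx, ny) {
  if (tprime > nx*ny/2) {
    # P(T' >= t'-1) = P(T' > t'-2), the upper tail above the null mean
    2*pwilcox(tprime - 2, nx, ny, lower.tail = FALSE)
  } else {
    2*pwilcox(tprime, nx, ny)
  }
}
```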

When using \(T_1\), via observed \(t_1\), the normal approximation gives

\[ \varphi = 2 \min \{ \mathbb{P}(T_1 \leq t_1), \mathbb{P}(T_1 \geq t_1)\}. \]

For a **lower-tailed** test the hypotheses are

\[ \begin{aligned} \mathcal{H}_0 &: \; F_X(x) \leq F_Y(x) \quad \mbox{ or equivalently } \quad \mathbb{E}\{X\} \geq \mathbb{E} \{Y\}. \\ \mathcal{H}_1 &: \; F_X(x) > F_Y(x) \quad \mbox{ or equivalently } \quad \mathbb{E}\{X\} < \mathbb{E} \{Y\}. \end{aligned} \]

This is a little confusing because the form of the hypotheses looks opposite to that of other lower-tailed tests, which is why the explanation in terms of expectations is helpful. If, at a particular value \(x\), the distribution function \(F_X(x)\) sits below \(F_Y(x)\), then less of \(X\)'s probability mass lies at or below \(x\), so most of the distribution of \(X\) is to the *right* of \(Y\)'s.

Using \(T'\) or \(T_1\), the \(p\)-value may be calculated as

\[ \varphi = \mathbb{P}(T' \leq t') \quad \mbox{ or } \quad \varphi = \mathbb{P}(T_1 \leq t_1). \]

Reversing things, for an **upper-tailed** test, the hypotheses are

\[ \begin{aligned} \mathcal{H}_0 &: \; F_X(x) \geq F_Y(x) \quad \mbox{ or equivalently } \quad \mathbb{E}\{X\} \leq \mathbb{E} \{Y\}. \\ \mathcal{H}_1 &: \; F_X(x) < F_Y(x) \quad \mbox{ or equivalently } \quad \mathbb{E}\{X\} > \mathbb{E} \{Y\}. \end{aligned} \]

Using \(T'\) or \(T_1\), the \(p\)-value may be calculated as

\[ \varphi = \mathbb{P}(T' \geq t'-1) \quad \mbox{ or } \quad \varphi = \mathbb{P}(T_1 \geq t_1). \]

Metro is one of the few transit systems that still produces fare tokens. A survey found that a significant number of Washington passengers still use fare tokens, but not Virginia passengers. To validate this, the total numbers of tokens used in a sample of DC fare boxes and a sample of Virginia fare boxes were compared after six months. At \(\alpha = 0.05\), is there enough evidence supporting the claim that fewer tokens are used on Virginia bus routes than on Washington DC bus routes?

```
VA <- c(53, 62, 58, 61, 43, 36, 50, 52, 41, 46)
DC <- c(67, 65, 82, 44, 59, 56, 93)
```

We will perform a lower-tailed test. First, let's extract the sample sizes, treating VA as \(X\) and DC as \(Y\).

```
nx <- length(VA)
ny <- length(DC)
```

The ranks for the combined sample may be computed as follows.

`r <- rank(c(VA, DC))`

To get the ranks from the \(X\) population we can use the `seq_along` function.

```
rx <- r[seq_along(VA)]
rx
```

`## [1] 8 13 10 12 3 1 6 7 2 5`

Then, \(t\) is obtained as the sum of those numbers.

```
t <- sum(rx)
t
```

`## [1] 67`

Let's see how many ties there are.

```
ties <- length(r) - length(unique(r))
ties
```

`## [1] 0`

- None, so we don’t need to use the normal approximation.

And then we can make the slight adjustment to use the `pwilcox` distribution function.

```
tprime <- t - nx*(nx+1)/2
tprime
```

`## [1] 12`

The \(p\)-value may be calculated as follows.

`pwilcox(tprime, nx, ny)`

`## [1] 0.01249486`

- We reject the null hypothesis, and conclude that fewer tokens were used on VA Metrobus routes than on DC routes.

Using the library function.

`wilcox.test(VA, DC, alternative="less")`

```
##
## Wilcoxon rank sum test
##
## data: VA and DC
## W = 12, p-value = 0.01249
## alternative hypothesis: true location shift is less than 0
```

- Identical result.

Referring to the scenario in the previous example, suppose that the study was expanded to compare fare boxes in Washington DC versus Maryland Metrobus routes. At \(\alpha = 0.05\), is there enough evidence to suggest that the number of tokens used is significantly different?

```
DC <- c(72, 45, 73, 88, 45, 27, 95, 76, 78, 79, 71, 92, 93, 20, 85, 94, 69, 95, 86, 99, 97, 81, 91,
        70, 75, 82, 98, 96, 87, 100)
MD <- c(45, 74, 52, 57, 77, 31, 64, 45, 68, 20, 84, 20, 45, 35, 15, 83, 20, 67, 80, 89, 61, 17, 45, 90)
```

Here we'll do a two-tailed test. First, let's extract the sample sizes, treating MD as \(X\) and DC as \(Y\).

```
nx <- length(MD)
ny <- length(DC)
```

The ranks for the combined sample may be computed as follows.

`r <- rank(c(MD, DC))`

To get the ranks from the \(X\) population we can use the `seq_along` function.

`rx <- r[seq_along(MD)]`

Then, \(t\) is obtained as the sum of those numbers.

```
t <- sum(rx)
t
```

`## [1] 442.5`

Let's see how many ties there are.

```
ties <- length(r) - length(unique(r))
ties
```

`## [1] 9`

Substantial, so we will use \(T_1\) and the normal approximation. First, some auxiliary calculations.

```
N <- nx + ny
r2sum <- sum(r^2)
r2sum
```

`## [1] 53932`

Then finishing off the \(T_1\) calculation.

```
t1 <- t - nx*(N+1)/2
t1 <- t1/sqrt(nx*ny/((N*(N-1)))*r2sum - nx*ny*(N+1)^2/(4*(N-1)))
t1
```

`## [1] -3.789513`

And then the \(p\)-value calculation from a standard normal.

`2*min(pnorm(t1), pnorm(t1, lower.tail=FALSE))`

`## [1] 0.0001509431`

- Reject the null hypothesis and conclude that the number of tokens used in DC is different from MD.

How about with the library function?

`wilcox.test(MD, DC, exact=FALSE, correct=FALSE)`

```
##
## Wilcoxon rank sum test
##
## data: MD and DC
## W = 142.5, p-value = 0.0001509
## alternative hypothesis: true location shift is not equal to 0
```

- Same result; specifying `exact=FALSE` prevents a warning.
- Giving `correct=TRUE`, the default, results in a continuity correction, but the result is almost identical in this case.

Here we generalize to multiple samples. The data consist of \(k\) (mutually) independent random samples of possibly different sizes. Denote the \(i^\mathrm{th}\) random sample of size \(n_i\) by \(X_{i1}, X_{i2}, \dots, X_{in_i}\). Let \(N\) denote the total number of observations, \(N=\sum_{i=1}^k n_i\).

As with the Wilcoxon test, the first step is to rank the collection of \(N\) observations from all samples, and then let \(R(X_{ij})\) represent the rank assigned to \(X_{ij}\). Then, let

\[ R_i = \sum_{j=1}^{n_i} R(X_{ij}), \quad i=1,\dots,k. \]

- Again, when there are ties, assign the average rank of each of the tied observations.

The **test statistic** is defined as

\[ T = \frac{1}{S^2}\left(\sum_{i=1}^k \frac{R_i^2}{n_i} - \frac{N(N+1)^2}{4} \right) \quad \mbox{ where } \quad S^2 = \frac{1}{N-1}\left(\sum_{ij} R(X_{ij})^2 - N \frac{(N+1)^2}{4} \right). \]

If there are no ties, then it turns out that \(S^2 = N(N+1)/12\).
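This identity is easy to check numerically, since without ties the combined ranks are exactly \(1, \dots, N\):

```r
# With no ties the ranks are 1:N, and S^2 reduces to N(N+1)/12
N <- 14
r <- 1:N
s2 <- (sum(r^2) - N*(N + 1)^2/4)/(N - 1)
all.equal(s2, N*(N + 1)/12)
## [1] TRUE
```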

Under the null hypothesis, given below, \(T \sim \chi^2_{k-1}\) asymptotically.

Since there are multiple populations, only a two-tailed test makes sense. The hypotheses are

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{All of the $k$ population distribution functions are identical.} \\ \mathcal{H}_1 &: \; \mbox{At least one of the populations tends to yield larger observations} \\ & \;\;\;\; \mbox{than at least one of the other populations.} \end{aligned} \]

The \(p\)-value is approximately equal to \(\varphi = \mathbb{P}(T \geq t)\) under that \(\chi^2_{k-1}\) distribution.

- If the null hypothesis is rejected, Wilcoxon tests on pairs of samples can commence to find out where the differences are.

When trains end their service in the evening, customers can still get back home using a series of Metro buses. However, doing so can be a big expense. Customers were asked on the survey how much they pay to get home using Metro buses when they miss their last train. Below is the data from three different train lines. Is there evidence to conclude a significant difference in the amount paid on Metro bus to get home?

```
lines <- list(one=c(10.4, 11.8, 16.4, 17.6, 14.5),
              two=c(7.9, 10.1, 8.7, 5.9, 13.5),
              three=c(12.6, 10.5, 12.9, 11.2))
```

First let's calculate the \(n_i\) and \(N\).

```
n <- sapply(lines, length)
N <- sum(n)
```

Then calculate the global ranks.

`r <- rank(unlist(lines))`

Then get the sum of the ranks for each sample.

```
k <- length(lines)
g <- factor(rep.int(seq_len(k), n))
rs <- tapply(r, g, sum)
rs
```

```
## 1 2 3
## 52 21 32
```

Now we can evaluate the test statistic.

```
s2 <- (sum(r^2) - N*(N+1)^2/4)/(N-1)
t <- (sum(rs^2/n) - N*(N+1)^2/4)/s2
t
```

`## [1] 5.571429`

And then we can calculate the \(p\)-value.

`pchisq(t, k-1, lower.tail=FALSE)`

`## [1] 0.06168501`

- We fail to reject \(\mathcal{H}_0\): there is not enough evidence that the amount paid on Metrobus to get home differs depending on the train line used.

Results from a library function?

`kruskal.test(lines)`

```
##
## Kruskal-Wallis rank sum test
##
## data: lines
## Kruskal-Wallis chi-squared = 5.5714, df = 2, p-value = 0.06169
```

- Identical result.

Refer to the scenario in the previous example. Suppose that the study was conducted using passengers from four rail lines. Conduct the same test again using this sample data at the five percent level of significance.

```
lines <- list(one=c(10.5, 10.5, 10.7, 10.9, 11.9),
              two=c(10.2, 10.4, 10.6, 11.6, 11.8),
              three=c(10.5, 11.2, 12.6, 12.9),
              four=c(10.5, 11.0, 11.1, 11.3, 11.4, 11.7, 12.1, 12.2, 12.3, 12.4, 12.8, 13.2, 13.3, 13.5, 14.5))
```

First let's calculate the \(n_i\) and \(N\).

```
n <- sapply(lines, length)
N <- sum(n)
```

Then calculate the global ranks.

`r <- rank(unlist(lines))`

Then get the sum of the ranks for each sample.

```
k <- length(lines)
g <- factor(rep.int(seq_len(k), n))
rs <- tapply(r, g, sum)
rs
```

```
## 1 2 3 4
## 44.0 42.0 64.5 284.5
```

Now we can evaluate the test statistic.

```
s2 <- (sum(r^2) - N*(N+1)^2/4)/(N-1)
t <- (sum(rs^2/n) - N*(N+1)^2/4)/s2
t
```

`## [1] 9.002576`

And then we can calculate the \(p\)-value.

`pchisq(t, k-1, lower.tail=FALSE)`

`## [1] 0.02925666`

- Reject \(\mathcal{H}_0\) and conclude that there is a significant difference in the amount paid on Metrobus to get home depending on the train line that you use.

`kruskal.test(lines)`

```
##
## Kruskal-Wallis rank sum test
##
## data: lines
## Kruskal-Wallis chi-squared = 9.0026, df = 3, p-value = 0.02926
```

- Identical result.

How do we resolve the multiple comparisons? Consider using the `wilcox.test` function on all pairs.

```
data.frame(
  t12=wilcox.test(lines[[1]], lines[[2]], exact=FALSE)$p.value,
  t13=wilcox.test(lines[[1]], lines[[3]], exact=FALSE)$p.value,
  t14=wilcox.test(lines[[1]], lines[[4]], exact=FALSE)$p.value,
  t23=wilcox.test(lines[[2]], lines[[3]], exact=FALSE)$p.value,
  t24=wilcox.test(lines[[2]], lines[[4]], exact=FALSE)$p.value,
  t34=wilcox.test(lines[[3]], lines[[4]], exact=FALSE)$p.value)
```

```
## t12 t13 t14 t23 t24 t34
## 1 0.6751736 0.2622393 0.01437618 0.2703441 0.02909633 0.6169206
```

- Looks like line four is different from lines one and two, but lines one, two, and three are not distinguishable from one another.
- But you have to be careful when performing so many multiple comparisons; at the 5% level we expect 1 false positive for every 20 tests.
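One way to guard against this is the built-in `pairwise.wilcox.test`, which adjusts the \(p\)-values for multiplicity (Holm's method by default); a sketch using the four-line data from above:

```r
# Pairwise Wilcoxon tests with Holm's adjustment (the default) to control
# the family-wise error rate across the six comparisons
lines <- list(one=c(10.5, 10.5, 10.7, 10.9, 11.9),
              two=c(10.2, 10.4, 10.6, 11.6, 11.8),
              three=c(10.5, 11.2, 12.6, 12.9),
              four=c(10.5, 11.0, 11.1, 11.3, 11.4, 11.7, 12.1, 12.2,
                     12.3, 12.4, 12.8, 13.2, 13.3, 13.5, 14.5))
x <- unlist(lines)                                             # pooled data
g <- factor(rep.int(seq_along(lines), sapply(lines, length)))  # group labels
pairwise.wilcox.test(x, g, exact = FALSE)
```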