Department of Statistics, Virginia Tech

The tests presented here compare samples from multiple (first two and then many) populations and checks if they have the same variance. These form the basis of many similar tests that are each a little too complicated to perform by hand. We’ll see the basics of how it works, and then look at a couple of libraries that automate the fancier tests.

The data consist of two mutually independent random samples

\[ X_1, X_2, \dots, X_{n_x} \quad \mbox{ and } \quad Y_1, Y_2, \dots, Y_{n_y}. \]

The first step is to convert each \(X_i\) and \(Y_i\) into its absolute deviation from the mean using

\[ U_i = | X_i - \mu_x|,\; i=1,\dots,n_x \quad \mbox{ and } \quad V_j = | Y_j - \mu_y |,\; j=1,\dots,n_y \]

where \(\mu_x\) and \(\mu_y\) are the population means of \(X\) and \(Y\). If these are unknown, then approximate with \(\bar{x}\) for \(\mu_x\) and \(\bar{y}\) for \(\mu_y\).

Assign ranks from \(1\) to \(N=n_x + n_y\) to the combined sample of \(U\)s and \(V\)s, using average ranks in the case of ties, and let \(R(U_i)\) and \(R(V_i)\) denote these values.

If there are no ties, then the typical **test statistic** is

\[ T = \sum_{i=1}^{n_x} [R(U_i)]^2, \]

however the **null distribution** of this statistic is not easily computed. Table A9 provides some calculations for \(n_x,n_y \leq 10\), and for larger values an approximation based on a Gaussian distribution is provided.

However, since in the case of ties we will need to make an adjustment – and essentially use a Gaussian approximation via the CLT anyways – we will present the following statistic \(T_1\) and use it as our main test statistic in examples and homework problems.

- This statistic and null distribution are essentially equivalent to the Gaussian approximation mentioned above.

To obtain \(T_1\) subtract of the mean of \(T\) and divide by the standard deviation to get

\[ T_1 = \frac{T - n \overline{R^2}}{\left[\frac{n_x n_y}{N(N-1)} \sum_{i=1}^N R_i^4 - \frac{n_x n_y}{N-1} (\overline{R^2})^2 \right]^{\frac{1}{2}}} \]

where \(\overline{R^2}\) represents the average of the squared ranks of both samples combined

\[ \overline{R^2} = \frac{1}{N} \left\{\sum_{i=1}^{n_x} [R(U_i)]^2 + \sum_{j=1}^{n_y} [R(V_j)]^2 \right\}, \]

and \(\sum R_i^4\) represents the sum of the ranks raised to the fourth power

\[ \sum_{i=1}^N R_i^4 = \sum_{i=1}^{n_x} [R(U_i)]^4 + \sum_{j=1}^{n_y} [R(V_j)]^4. \]

The **null distribution** is standard normal: \(T_1 \sim \mathcal{N}(0,1)\).

The **hypotheses** are broken down into the usual three cases.

Two-tailed tests involve

\[ \begin{aligned} \mathcal{H}_0 : & \; \mbox{$X$ and $Y$ are identically distributed except for possibly different means} \\ \mathcal{H}_1 : & \; \mathbb{V}\mathrm{ar}(X) \ne \mathbb{V}\mathrm{ar}(Y) \end{aligned} \]

Using \(T_1 \sim \mathcal{N}(0,1)\), the \(p\)-value is calculated in the usual way

\[ \varphi = 2 \mathbb{P}(T_1 < -|t_1|) \quad \mbox{ or } \quad \varphi = 2 \min\{\mathbb{P}(T_1 \leq t_1), \mathbb{P}(T_1 \geq t_1) \}. \]

A lower-tailed test involves

\[ \begin{aligned} \mathcal{H}_0 : & \; \mbox{$X$ and $Y$ are identically distributed except for possibly different means and } \mathbb{V}\mathrm{ar}(X) \geq \mathbb{V}\mathbb{ar}(Y) \\ \mathcal{H}_1 : & \; \mathbb{V}\mathrm{ar}(X) < \mathbb{V}\mathrm{ar}(Y) \end{aligned} \]

The \(p\)-value is \(\varphi = \mathbb{P}(T_1 \leq t_1)\).

An upper-tailed test involves

\[ \begin{aligned} \mathcal{H}_0 : & \; \mbox{$X$ and $Y$ are identically distributed except for possibly different means and } \mathbb{V}\mathrm{ar}(X) \leq \mathbb{V}\mathbb{ar}(Y) \\ \mathcal{H}_1 : & \; \mathbb{V}\mathrm{ar}(X) > \mathbb{V}\mathrm{ar}(Y) \end{aligned} \]

The \(p\)-value is \(\varphi = \mathbb{P}(T_1 \geq t_1)\).

Metro has, lately, seen an increase in support for allowing bikes on trains at all hours because of passenger complaints. Now, Metro officials want to determine whether particular Metro lines see more bikes. Suppose that the red and green lines are tested since they contain stops nearest to popular trails. At the five-percent level of significance can we conclude there is equal variability in the two train lines?

```
red <- c(67, 65, 82, 44, 59, 56, 93)
green <- c(53, 62, 58, 61, 43, 36, 50, 52, 41, 46)
```

Here we perform a two-tailed test, with the hypotheses above: treating the red line as \(X\) and green line as \(Y\).

First calculate the averages to use in place of the means.

```
x <- red
y <- green
xbar <- mean(x)
ybar <- mean(y)
```

Now we can calculate the \(U\)s and \(V\)s.

```
u <- abs(x - xbar)
v <- abs(y - ybar)
```

Next calculate the ranks of the combined sample.

`r <- rank(c(u, v))`

- We could check for ties, but we are going to use \(T_1\) regardless, so it doesn’t matter.

Lets calculate \(t\), the observed value of \(T\).

```
ru <- r[seq_along(u)]
rv <- r[-seq_along(u)]
t <- sum(ru^2)
t
```

`## [1] 968`

Now lets calculate the standardized version, \(t_1\) for \(T_1\). First some auxiliary calculations …

```
nx <- length(x)
ny <- length(y)
N <- nx + ny
r2bar <- (sum(ru^2) + sum(rv^2))/N
r4sum <- sum(r^4)
```

We’re ready to go.

```
t1 <- (t - nx*r2bar)/sqrt(nx*ny/(N*(N-1))*r4sum - nx*ny/(N-1)*r2bar^2)
t1
```

`## [1] 1.227762`

And the \(p\)-value may be calculated as follows.

`2*pnorm(-abs(t1))`

`## [1] 0.2195363`

- We fail to reject \(\mathcal{H}_0\) and conclude that the level of variability in the number of bikes present on the red line is equivalent to that on the green line.

Doors are supposed to be open for an average of 13 seconds for both lines. Because the green line does not experience as much congestion as the red line, Metro officials want to determine whether the variability of train waiting times at Chinatown Metro station is higher on the green line than the red line. Determine this at the five percent level of significance.

```
green2 <- c(11.9, 12.1, 10.7, 10.9, 13.5)
red2 <- c(10.2, 12.8, 10.5, 10.5, 13.2, 11.0, 11.7, 11.3, 11.4, 10.6, 12.2, 11.1, 12.3, 10.5, 13.3)
```

We will treat the green line as \(X\) and the red line as \(Y\) and perform an upper-tailed test. Lets re-code the data and adjust by the absolute deviations from (estimated) means.

```
x <- green2
y <- red2
xbar <- mean(x)
ybar <- mean(y)
u <- abs(x - xbar)
v <- abs(y - ybar)
```

Now calculate the ranks and the observed value of \(t\).

```
r <- rank(c(u, v))
ru <- r[seq_along(u)]
rv <- r[-seq_along(u)]
t <- sum(ru^2)
t
```

`## [1] 696`

Finally, lets calculate the standardized version \(t_1\) via the auxiliary calculations.

```
nx <- length(x)
ny <- length(y)
N <- nx + ny
r2bar <- (sum(ru^2) + sum(rv^2))/N
r4sum <- sum(r^4)
t1 <- (t - nx*r2bar)/sqrt(nx*ny/(N*(N-1))*r4sum - nx*ny/(N-1)*r2bar^2)
t1
```

`## [1] -0.08498553`

We are ready to calculate the \(p\)-value for this upper-tailed test.

`pnorm(t1, lower.tail=FALSE)`

`## [1] 0.5338636`

- Fail to reject the null hypothesis and conclude that the level of variability in the distribution of train-waiting times is equivalent on both Metro rail lines.

There is no software library that I am aware of which automates this test, but we can easily write one of our own by cutting-and-pasting from above. We’ll see below that there is a library performing a very similar suite of tests.

```
sqranks.test.approx <- function(x, y, alternative=c("two.sided", "greater", "less"))
{
## check for partial matches of the alternative argument
alternative <- match.arg(alternative)
## calculate the absolute discrepancies from the (estimated) mean
xbar <- mean(x)
ybar <- mean(y)
u <- abs(x - xbar)
v <- abs(y - ybar)
## calculate the observed ranks, and t
r <- rank(c(u, v))
ru <- r[seq_along(u)]
rv <- r[-seq_along(u)]
t <- sum(ru^2)
## calculate the standardized version, t1
nx <- length(x)
ny <- length(y)
N <- nx + ny
r2bar <- (sum(ru^2) + sum(rv^2))/N
r4sum <- sum(r^4)
t1 <- (t - nx*r2bar)/sqrt(nx*ny/(N*(N-1))*r4sum - nx*ny/(N-1)*r2bar^2)
## calculate the p-value
if(alternative == "two.sided")
phi <- 2*pnorm(-abs(t1))
else if(alternative == "less")
phi <- pnorm(t1)
else if(alternative == "greater")
phi <- pnorm(t1, lower.tail=FALSE)
## return the stats and p-value
return(list(t=t, t1=t1, p.value=phi))
}
```

Now lets try it out on our two examples.

`sqranks.test.approx(red, green)`

```
## $t
## [1] 968
##
## $t1
## [1] 1.227762
##
## $p.value
## [1] 0.2195363
```

`sqranks.test.approx(green2, red2, "greater")`

```
## $t
## [1] 696
##
## $t1
## [1] -0.08498553
##
## $p.value
## [1] 0.5338636
```

- Perfect.
- Also see ranks.R for the exact version.

A related test is called the “Ansari–Bradley” two-sample test for a difference in scale parameters. It targets the ratio of the “scales” (i.e., variances) of the two populations. Here it is on our first two-tailed example.

`ansari.test(red, green)`

```
##
## Ansari-Bradley test
##
## data: red and green
## AB = 30, p-value = 0.5896
## alternative hypothesis: true ratio of scales is not equal to 1
```

- Different \(p\)-value but sample conclusion.

Here it is on our second upper-tail test.

`ansari.test(green2, red2, "greater")`

```
## Warning in ansari.test.default(green2, red2, "greater"): cannot compute
## exact p-value with ties
```

```
##
## Ansari-Bradley test
##
## data: green2 and red2
## AB = 29, p-value = 0.6043
## alternative hypothesis: true ratio of scales is greater than 1
```

- Again, different details same conclusion, except with a warning about ties.

Finally, another similar one is the “Mood” test.

`mood.test(red, green)`

```
##
## Mood two-sample test of scale
##
## data: red and green
## Z = 0.7836, p-value = 0.4333
## alternative hypothesis: two.sided
```

`mood.test(green2, red2, "greater")`

```
##
## Mood two-sample test of scale
##
## data: green2 and red2
## Z = -0.43188, p-value = 0.6671
## alternative hypothesis: greater
```

- As above; not identical \(p\)-values, but similar result.

When there are two or more samples we can generalize things in the following way.

The data consist of \(k\) (mutually) independent random samples of possibly different sizes. Denote the \(i^\mathrm{th}\) random sample of size \(n_i\) by \(X_{i1}, X_{i2}, \dots, X_{in_i}\). Let \(N\) denote the total number of observations, \(N=\sum_{i=1}^k n_i\). Then calculate the absolute deviation between each sample and its (estimated) mean: \(U_{ij} = | X_{ij} - \bar{X}_i |\). Rank the combined sample of \(N\) absolute discrepancies, obtaining \(R(U_{ij})\). Finally, calculate the sum of squares of the ranks for each

\[ S_i = \sum_{j=1}^{n_i} R(U_{ij})^2 \]

- Thus, \(S_1\) corresponds to our \(T\) in the two-sample case.

The **test statistic** is

\[ T_2 = \frac{1}{D^2} \left[\sum_{i=1}^k \frac{S_i^2}{n_i} - N(\bar{S})^2 \right] \]

where

\[ \bar{S} = \frac{1}{N} \sum_{j=1}^k S_j \quad \mbox{ and } \quad D^2 = \frac{1}{N-1} \left[ \sum_{i=1}^N R_i^4 - N(\bar{S})^2 \right]. \]

- If there are no ties then \(D^2\) and \(\bar{S}\) simplify to

\[ D^2 = N(N+1)(2N+1)(8N+11)/180 \quad \mbox{ and } \quad \bar{S} = (N+1)(2N+1)/6. \]

The **null distribution** is approximately chi-squared with \(k-1\) degrees of freedom.

Only two-tailed hypotheses make sense, which may be framed as follows.

\[ \begin{aligned} \mathcal{H}_0 &: \; \mbox{All $k$ populations are identical, except for possibly different means} \\ \mathcal{H}_1 &: \; \mbox{Some of the population variances are not equal to each other} \end{aligned} \]

As usual, the \(p\)-value is calculated as \(\varphi = \mathbb{P}(T_2 > t_2)\) under that \(\chi^2_{k-1}\) null distribution.

Suppose that the red, green and blue lines are tested in the bike and metro example. At the 0.05 level of significance can we conclude there is equal variability in the three train lines?

```
bikes <- list(green=c(53, 62, 58, 61, 43, 36, 50, 52),
red=c(67, 65, 82, 44, 59, 56, 93),
blue=c(42, 44, 72, 64, 49, 50, 91))
```

First, lets summarize the sizes of the samples.

```
k <- length(bikes)
n <- sapply(bikes, length)
N <- sum(n)
```

Lets calculate absolute deviations from respective (estimated) means.

`U <- lapply(bikes, function(x) { abs(x - mean(x)) })`

Next calculate the ranks of the combined samples.

`r <- rank(unlist(U))`

Then, taking a page out of the Kruskal–Wallis test code, we can calculate the sum of the squared ranks for each sample as follows.

```
g <- factor(rep.int(seq_len(k), n))
s <- tapply(r, g, function(x) { sum(x^2) })
s
```

```
## 1 2 3
## 798 1410 1587
```

Next calculate the observed values of auxiliary \(D^2\) and \(\bar{S}\) quantities.

```
sbar <- sum(s)/N
d2 <- (sum(r^4) - N*sbar^2)/(N-1)
c(sbar, d2)
```

`## [1] 172.5 23655.5`

Finally, we are ready to calculate \(t_2\), the observed value of \(T_2\).

```
t2 <- (sum(s^2/n) - N*sbar^2)/d2
t2
```

`## [1] 2.907267`

The \(p\)-value is calculated as follows.

`pchisq(t2, k-1, lower.tail=FALSE)`

`## [1] 0.2337195`

- We fail to reject the null hypothesis at the 5% level and conclude that the level of variability in the number of bikes present on the three lines is equivalent.

Again, there is no library that I am aware of that implements this exact test, but we could easily write one ourselves.

```
sqranks.2plus.test <- function(xlist)
{
## extract dimensions
k <- length(xlist)
n <- sapply(xlist, length)
N <- sum(n)
## calculate absolute deviations from (estimated) mean and then rank them
U <- lapply(xlist, function(x) { abs(x - mean(x)) })
r <- rank(unlist(U))
## calculate s
g <- factor(rep.int(seq_len(k), n))
s <- tapply(r, g, function(x) { sum(x^2) })
## calculate t2 via auxiliary quantities
sbar <- sum(s)/N
d2 <- (sum(r^4) - N*sbar^2)/(N-1)
t2 <- (sum(s^2/n) - N*sbar^2)/d2
## finish with p-value
phi <- pchisq(t2, k-1, lower.tail=FALSE)
return(list(tstat=t2, p.value=phi))
}
```

Lets try it.

`sqranks.2plus.test(bikes)`

```
## $tstat
## [1] 2.907267
##
## $p.value
## [1] 0.2337195
```

- Same result.

We can also use it for our two-tailed two-sample test.

`sqranks.2plus.test(list(red, green))`

```
## $tstat
## [1] 1.5074
##
## $p.value
## [1] 0.2195363
```

- The \(p\)-value is different but the conclusion is the same.

A very similar test for the same (two-plus sample) situation is called the “Fligner-Killeen” test. This test also calculates sums of squared ranks, however it additionally performs an intermediate quantiles calculation which offers a degree of robustness. The details are beyond the scope of the class.

Here it is on the three-sample bikes example.

`fligner.test(bikes)`

```
##
## Fligner-Killeen test of homogeneity of variances
##
## data: bikes
## Fligner-Killeen:med chi-squared = 1.1722, df = 2, p-value = 0.5565
```

- Again, different test statistic and \(p\)-value, but same conclusion.

And on the first two-sample test that we did at the start.

`fligner.test(list(red, green))`

```
##
## Fligner-Killeen test of homogeneity of variances
##
## data: list(red, green)
## Fligner-Killeen:med chi-squared = 1.1914, df = 1, p-value = 0.2751
```

- Again, different stat and \(p\)-value, but same conclusion.