Median test

The median test is an application of Pearson’s Chi-squared test (and/or the test for independence) in which the “rows” are the \(c\) populations and the two “columns” count how many observations from each population fall above and below the grand median, respectively. The words “rows” and “columns” are in quotes because the table is usually presented transposed: two rows and \(c\) columns, one for each of the \(c\) populations. I think that is because a short and fat table looks better on the page than a tall and skinny one does.

The data consist of ordinal random samples of size \(n_i\) from each of \(c\) populations, \(i=1, \dots, c\). The median of the combined sample of all \(N = n_1 + \dots + n_c\) observations, called the grand median, is determined first. Let \(O_{1i}\) be the observed number of observations from the \(i^\mathrm{th}\) sample that exceed the grand median, and let \(O_{2i} = n_i - O_{1i}\) be the number that are less than or equal to that value. It is customary to arrange these observed counts into a contingency table in the following way.

| Sample | 1 | 2 | \(\cdots\) | \(c\) | Total |
|---|---|---|---|---|---|
| \(>\) Median | \(O_{11}\) | \(O_{12}\) | \(\cdots\) | \(O_{1c}\) | \(a\) |
| \(\leq\) Median | \(O_{21}\) | \(O_{22}\) | \(\cdots\) | \(O_{2c}\) | \(b\) |
| Total | \(n_1\) | \(n_2\) | \(\cdots\) | \(n_c\) | \(N\) |

The test statistic \(T\) is the same as for the Chi-squared test; however, when you consider that \(O_{2i} = n_i - O_{1i}\) and that there are only two rows, some simplifications are possible. Also, be careful to transpose your \(r\)’s and \(c\)’s.

\[ T = \frac{N^2}{ab} \sum_{i=1}^c \frac{\left(O_{1i} - \frac{n_ia}{N} \right)^2}{n_i} \]
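
To see where the simplification comes from, note that in the full \(2 \times c\) table the expected counts are \(E_{1i} = n_i a/N\) and \(E_{2i} = n_i b/N\), and that \(O_{2i} - E_{2i} = -(O_{1i} - E_{1i})\). The two terms for population \(i\) therefore collapse into one:

\[ \frac{(O_{1i} - E_{1i})^2}{E_{1i}} + \frac{(O_{2i} - E_{2i})^2}{E_{2i}} = (O_{1i} - E_{1i})^2 \left( \frac{N}{n_i a} + \frac{N}{n_i b} \right) = \frac{N^2}{ab} \cdot \frac{\left(O_{1i} - \frac{n_i a}{N}\right)^2}{n_i}, \]

using \(a + b = N\). Summing over \(i\) gives the expression for \(T\) above.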

The null distribution is \(T \sim \chi^2_{c-1}\), asymptotically. This also agrees with the special case of the Chi-squared test.

The hypotheses are as follows.

\[ \begin{aligned} \mathcal{H}_0: & \; \mbox{All $c$ populations have the same median.} \\ \mathcal{H}_1: & \; \mbox{At least two of the populations have different medians.} \end{aligned} \]

Since it is basically a Chi-squared test, the \(p\)-value is unchanged: \(\varphi = \mathbb{P}(T \geq t)\).

If the null hypothesis is rejected, one can always pair off the populations, repeating the test as a \(2 \times 2\) comparison, in order to determine which pairs actually differ from one another.

Chemical catalysts

Three catalysts are used in a chemical process; a control (no catalyst) is also included. The following are yield data from the process.

chem <- list(
    control=c(74.5, 76.1, 75.9, 78.1, 76.2),
    cat1=c(77.5, 82.0, 80.6, 84.9, 81.0),
    cat2=c(81.5, 82.3, 81.4, 79.5, 83.0),
    cat3=c(78.1, 80.2, 81.5, 83.0, 82.1))

A difference in population medians could be interpreted as a difference in the value of the catalysts used. Determine whether the catalysts and the control all have the same median.

To demonstrate the sapply function, which will be useful later, let’s use it to extract the length of each element of the list.

n <- sapply(chem, length)
n
## control    cat1    cat2    cat3 
##       5       5       5       5

Now, the first step in performing the test is determining the overall median.

grandmed <- median(unlist(chem))
grandmed
## [1] 80.8

With that we can create the table. First let’s calculate the “greater” row and the “lesser” row.

Og <- sapply(chem, function(x) { sum(x > grandmed) } )
Oleq <- n - Og
O <- rbind(Og, Oleq)
rownames(O) <- c("> Med", "<= Med")
O
##        control cat1 cat2 cat3
## > Med        0    3    4    3
## <= Med       5    2    1    2

Then we can calculate the row sums.

a <- sum(Og)
b <- sum(Oleq)
N <- sum(n)

Now we can dress up the table for inspection, although we don’t need it for anything specifically.

tab <- rbind(Og, Oleq, n)
tab <- cbind(tab, c(a, b, N))
rownames(tab)[nrow(tab)] <- colnames(tab)[ncol(tab)] <- "tot"
tab
##      control cat1 cat2 cat3 tot
## Og         0    3    4    3  10
## Oleq       5    2    1    2  10
## tot        5    5    5    5  20

Now we are ready to calculate the test statistic. The formula above says to make the following calculation.

t <- (N^2 / (a*b)) * sum((Og - n*a/N)^2/n)
t
## [1] 7.2

But if we forgot that formula and wanted to use the old Chi-squared test one (just remembering to invert rows and columns), we’d get the same result:

E <- outer(c(a,b), n/N)
t <- sum((O-E)^2/E)
t
## [1] 7.2
  • Woot!
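
As one further sanity check (not part of the original write-up), we could hand the \(2 \times c\) table O directly to R’s built-in chisq.test. With correct=FALSE, which disables the continuity correction that only matters for \(2 \times 2\) tables, the statistic should again come out to 7.2, although R may warn that some expected counts are small.

## sanity check: built-in chi-squared test of independence on the 2 x c table
chisq.test(O, correct=FALSE)$statistic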

Now we are ready to calculate the \(p\)-value. The degrees of freedom is the length of our list minus one.

df <- length(chem) - 1

The \(p\)-value is calculated as follows.

pchisq(t, df, lower.tail=FALSE)
## [1] 0.06578905
  • We fail to reject \(\mathcal{H}_0\) at the 5% level: there is not enough evidence that the medians differ, so we conclude that the catalysts perform similarly.

Shelf height

The effect of shelf height on the supermarket sales of canned dog food is investigated. An experiment was conducted at a small supermarket on the sales of a single brand of dog food, involving three levels of shelf height: knee level, waist level and eye level. Sales, in hundreds of dollars, of the dog food per day for the three shelf heights are provided below. Are the median sales for all shelf heights the same at a 0.05 significance level?

height <- list(
    knee=c(77, 82, 86, 78, 81, 86, 77, 81),
    waist=c(88, 94, 93, 90, 91, 94, 90, 87),
    eye=c(85, 85, 87, 81, 80, 79, 87, 93))

Basically we cut-and-paste from above, replacing chem with height.

Let’s extract the length of each element of the list.

n <- sapply(height, length)
n
##  knee waist   eye 
##     8     8     8

The first step is to determine the overall median.

grandmed <- median(unlist(height))
grandmed
## [1] 86

Now we can create the table. First let’s calculate the “greater” row and the “lesser” row.

Og <- sapply(height, function(x) { sum(x > grandmed) } )
Oleq <- n - Og
O <- rbind(Og, Oleq)
rownames(O) <- c("> Med", "<= Med")
O
##        knee waist eye
## > Med     0     8   3
## <= Med    8     0   5

Then we can calculate the row sums.

a <- sum(Og)
b <- sum(Oleq)
N <- sum(n)

Now we can dress up the table for inspection, although we don’t need it for anything specifically.

tab <- rbind(Og, Oleq, n)
tab <- cbind(tab, c(a, b, N))
rownames(tab)[nrow(tab)] <- colnames(tab)[ncol(tab)] <- "tot"
tab
##      knee waist eye tot
## Og      0     8   3  11
## Oleq    8     0   5  13
## tot     8     8   8  24

Now we are ready to calculate the test statistic.

t <- (N^2 / (a*b)) * sum((Og - n*a/N)^2/n)
t
## [1] 16.44755

Alternatively, we could use the old Chi-squared test one (just remembering to invert rows and columns).

E <- outer(c(a,b), n/N)
t <- sum((O-E)^2/E)
t
## [1] 16.44755
  • Boom!

Now we are ready to calculate the \(p\)-value. The degrees of freedom is the length of our list minus one.

df <- length(height) - 1

The \(p\)-value is calculated as follows.

pchisq(t, df, lower.tail=FALSE)
## [1] 0.0002682004
  • Reject \(\mathcal{H}_0\) at the 5% level and conclude that the median sales for all shelf heights are not the same.

Since we rejected the null hypothesis, we can embark on some pairwise comparisons to determine which heights, in fact, have different medians. But before we do that, which would mean lots more cutting-and-pasting, let’s make our own library function.

Library routine

There is not, to my knowledge, a built-in library routine for the median test in R. But it should be pretty easy to write one based on the steps above. The function below assumes that \(x\) is a list of ordinal observations.

median.test <- function(x) 
  {
    ## summarize lengths of populations
    n <- sapply(x, length)

    ## calculate the grand median and build the observation table
    grandmed <- median(unlist(x))
    Og <- sapply(x, function(x) { sum(x > grandmed) } )
    Oleq <- n - Og
    O <- rbind(Og, Oleq)

    ## calculate the marginal totals of the table
    a <- sum(Og)
    b <- sum(Oleq)
    N <- sum(n)
    
    ## build the table for visualization
    tab <- rbind(Og, Oleq, n)
    tab <- cbind(tab, c(a, b, N))
    rownames(tab)[nrow(tab)] <- colnames(tab)[ncol(tab)] <- "tot"

    ## calculate the test statistic, null distribution and p-value 
    t <- (N^2 / (a*b)) * sum((Og - n*a/N)^2/n)
    df <- length(x) - 1
    phi <- pchisq(t, df, lower.tail=FALSE)

    ## return info
    return(list(grand.med=grandmed, table=tab, t.stat=t, p.value=phi))
}

Now we can test it out on our two examples. First on the chem catalyst data.

median.test(chem)
## $grand.med
## [1] 80.8
## 
## $table
##      control cat1 cat2 cat3 tot
## Og         0    3    4    3  10
## Oleq       5    2    1    2  10
## tot        5    5    5    5  20
## 
## $t.stat
## [1] 7.2
## 
## $p.value
## [1] 0.06578905
  • Yup.

Then on the shelf height data.

median.test(height)
## $grand.med
## [1] 86
## 
## $table
##      knee waist eye tot
## Og      0     8   3  11
## Oleq    8     0   5  13
## tot     8     8   8  24
## 
## $t.stat
## [1] 16.44755
## 
## $p.value
## [1] 0.0002682004
  • Perfect.

Multiple comparisons: tests on shelf height

What about comparing the first two heights, knee-level and waist-level?

median.test(height[1:2])
## $grand.med
## [1] 86.5
## 
## $table
##      knee waist tot
## Og      0     8   8
## Oleq    8     0   8
## tot     8     8  16
## 
## $t.stat
## [1] 16
## 
## $p.value
## [1] 6.334248e-05
  • Easy reject of the null: knee-level and waist-level have different medians.

What about the first and third heights, knee-level and eye-level?

median.test(height[c(1,3)])
## $grand.med
## [1] 81.5
## 
## $table
##      knee eye tot
## Og      3   5   8
## Oleq    5   3   8
## tot     8   8  16
## 
## $t.stat
## [1] 1
## 
## $p.value
## [1] 0.3173105
  • Not enough evidence to reject the null: these two have equal medians.

Finally, how about the second and third heights, waist-level and eye-level?

median.test(height[2:3])
## $grand.med
## [1] 87.5
## 
## $table
##      waist eye tot
## Og       7   1   8
## Oleq     1   7   8
## tot      8   8  16
## 
## $t.stat
## [1] 9
## 
## $p.value
## [1] 0.002699796
  • Reject the null: waist and eye level have different medians.

Chi-square goodness-of-fit test

So far we have been performing tests concerning specific (yet sometimes relative) features of distributions of random quantities. Suppose instead that we simultaneously wanted to be more specific (a single population having a certain distribution) and more all-encompassing: not just the median or certain relative comparisons, but the entire distribution. That’s what the goodness-of-fit test is for. It is originally due to Karl Pearson, whose name is attached to all of the tests above, but this is the one for which he is most famous.

The data consist of \(N\) independent observations of a random variable \(X\) grouped into \(c\) classes and arranged into a \(1 \times c\) contingency table.

| Class | 1 | 2 | \(\cdots\) | \(c\) | Total |
|---|---|---|---|---|---|
| Observed frequencies | \(O_1\) | \(O_2\) | \(\cdots\) | \(O_c\) | \(N\) |

The test statistic is the same as the ones above, except that we have a presumed distribution for \(X\) in our null hypothesis, specified as the probabilities \(p_j^*\) for \(j=1,\dots,c\). Define the expected number of observations in class \(j\) as \(E_j = p_j^* \cdot N\), and then calculate the test statistic as

\[ T = \sum_{j=1}^c \frac{(O_j - E_j)^2}{E_j}, \quad \mbox {just as before, but with fewer indices.} \]

Again, as before, the null distribution of \(T\) is asymptotically \(\chi^2_{(c-1)}\). Rules-of-thumb for the accuracy of the approximation include checking that \(N \geq 10\), \(c \geq 3\), \(N^2/c \geq 10\), and \(E_j \geq 0.25\) for \(j=1,\dots,c\); a typical remedy when this last requirement fails is to reduce \(c\) by merging categories.
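
Since these checks are mechanical, here is a small helper that evaluates them for a vector of observed counts O and a vector of hypothesized probabilities pstar. This is only a sketch of my own (the name gof.rules is made up), not part of the standard presentation.

gof.rules <- function(O, pstar)
  {
    ## sample size, number of classes, and expected counts under pstar
    N <- sum(O)
    ncls <- length(O)
    E <- pstar*N

    ## evaluate each rule-of-thumb as TRUE/FALSE
    c(N.ge.10=(N >= 10), ncls.ge.3=(ncls >= 3),
      N2.over.ncls.ge.10=(N^2/ncls >= 10), all.E.ge.0.25=all(E >= 0.25))
}

For instance, gof.rules(c(50, 45, 5), c(0.3, 0.6, 0.1)), applied to the security access card example below, should return TRUE for all four conditions.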

The hypotheses are:

\[ \begin{aligned} \mathcal{H}_0 :& \; \mathbb{P}(X \mbox{ is in class } j) = p_j^* \mbox{ for } j=1,\dots, c \\ \mathcal{H}_1 :& \; \mathbb{P}(X \mbox{ is in class } j) \ne p_j^* \mbox{ for at least one class.} \end{aligned} \]

The \(p\)-value is approximately \(\varphi = \mathbb{P}(T \geq t)\).

Sometimes the distribution of \(X\), described by the probabilities \(p_j^*\), has parameters that must be estimated. For example, think about a Gaussian distribution, which would have two parameters, \(\mu\) and \(\sigma^2\). In that case, the degrees of freedom in the asymptotic null distribution must be adjusted to \(c-k-1\), where \(k\) is the number of estimated parameters (\(k=2\), and thus \(c-3\) degrees of freedom, in the Gaussian case).

Security access cards

A company makes security access cards. The company claims that 30% of the cards are low security, 60% are medium security and 10% are high security. Suppose a random sample of 100 cards has found that 50 are low security, 45 are medium security and 5 are high security. Is this consistent with the company’s claim? Use a 0.05 level of significance.

The null hypothesis \(\mathcal{H}_0\) is that \(\mathbb{P}(\mathrm{low}) = 0.3\), \(\mathbb{P}(\mathrm{med}) = 0.6\), and \(\mathbb{P}(\mathrm{high}) = 0.1\), and the alternative \(\mathcal{H}_1\) is that at least one of those proportions is different.

Here are those probabilities in R.

pstar <- c(0.3, 0.6, 0.1)

Here is the data in R.

O <- c(50, 45, 5)
N <- sum(O)
N
## [1] 100

The expected frequencies under \(\mathcal{H}_0\) may be calculated as follows.

E <- pstar*N
E
## [1] 30 60 10

Then we can calculate the test statistic as usual.

t <- sum((O - E)^2/E)
t
## [1] 19.58333

The degrees of freedom of the null \(\chi^2\) distribution is

df <- length(O) - 1

so the \(p\)-value may be calculated as follows.

pchisq(t, df, lower.tail=FALSE)
## [1] 5.591563e-05
  • This is an easy reject of \(\mathcal{H}_0\) at the 5% level, so we conclude that at least one of the proportions is different from what the security company claims.

The built-in chisq.test function in R can handle this case if we provide it with a single vector of observed counts and a vector of probabilities.

chisq.test(O, p=pstar)
## 
##  Chi-squared test for given probabilities
## 
## data:  O
## X-squared = 19.583, df = 2, p-value = 5.592e-05
  • Essentially the same result.

Hardy–Weinberg equilibrium

If gene frequencies are in equilibrium, the genotypes AA, Aa and aa occur in a population with frequencies \((1 - \theta)^2\), \(2\theta(1 - \theta)\), and \(\theta^2\), according to the Hardy–Weinberg law. In a sample from Hong Kong in 1937, blood types occurred with the following frequencies.

O <- c(M=342, MN=500, N=187)
N <- sum(O)
N
## [1] 1029

To perform the test we first need to obtain an estimate for \(\theta\). A simple way is by maximum likelihood. The log-likelihood is

\[ \ell = 2\cdot 342 \log (1-\theta) + 500 \log (2\theta(1-\theta)) + 2 \cdot 187 \log(\theta). \]

The derivative with respect to \(\theta\) is

\[ \ell' = - \frac{2\cdot 342 + 500}{1-\theta} + \frac{500 + 2 \cdot 187}{\theta}. \]

Setting that equal to zero and solving gives \(\hat{\theta} = (500 + 2 \cdot 187)/(2 \cdot 1029) \approx 0.4247\). Now, using that number we can establish \(p^*\) as

thetahat <- 0.4247
pstar <- c((1-thetahat)^2, 2*thetahat*(1-thetahat), thetahat^2)
pstar
## [1] 0.3309701 0.4886598 0.1803701
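
As an aside (not part of the original derivation), we could have skipped the calculus and found \(\hat{\theta}\) numerically, for example by minimizing the negative log-likelihood with R’s optimize.

## negative of the log-likelihood written out above
negll <- function(theta)
    -(2*342*log(1-theta) + 500*log(2*theta*(1-theta)) + 2*187*log(theta))

## numerical minimizer; should land very near 0.4247
optimize(negll, interval=c(0.01, 0.99))$minimum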

Now we can calculate our expected frequencies under \(p^*\).

E <- pstar*N
E
## [1] 340.5682 502.8310 185.6008

Finally, the degrees of freedom of the null distribution are \(c - k - 1\), where \(c = 3\) is the number of categories and \(k = 1\) is the number of parameters estimated.

df <- length(O) - 1 - 1
df
## [1] 1

With that we can complete the test by calculating the test statistic and the \(p\)-value.

t <- sum((O - E)^2/E)
t
## [1] 0.03250557
pchisq(t, df, lower.tail=FALSE)
## [1] 0.8569225
  • So we fail to reject \(\mathcal{H}_0\) and conclude that the probabilities are as hypothesized by the Hardy–Weinberg equilibrium.

We can’t directly use the chisq.test function here because it won’t subtract off \(k\) from the degrees of freedom, but if we were keen we could certainly write our own version!
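
Here is one possible version, following the same template as the median.test routine above. The function name gof.test and its argument list are my own invention; it merely packages up the steps we just performed, with an extra argument k for the number of estimated parameters.

gof.test <- function(O, pstar, k=0)
  {
    ## expected counts under the hypothesized probabilities
    N <- sum(O)
    E <- pstar*N

    ## test statistic, adjusted degrees of freedom, and p-value
    t <- sum((O - E)^2/E)
    df <- length(O) - k - 1
    phi <- pchisq(t, df, lower.tail=FALSE)

    ## return info
    return(list(expected=E, t.stat=t, df=df, p.value=phi))
}

Calling gof.test(O, pstar) on the security access card data, or gof.test(O, pstar, k=1) on the Hardy–Weinberg data, should reproduce the test statistics and \(p\)-values computed above.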