# Proportion test

A random sample of $$n_1$$ observations is drawn from one population, and classified into one of two groups; and the same is done from a second population of size $$n_2$$.

Class 1 Class 2 Total
Population 1 $$O_{11}$$ $$O_{12}$$ $$n_1$$
Population 2 $$O_{21}$$ $$O_{22}$$ $$n_2$$
Total $$C_1$$ $$C_2$$ $$N = n_1 + n_2$$

The test statistic is defined to be $$T_1 = 0$$ if any of the column totals $$C_1$$ or $$C_2$$ is zero; otherwise $T_1 = \frac{\sqrt{N} (O_{11}O_{22} - O_{12}O_{21})}{\sqrt{n_1 n_2 C_1 C_2}}.$

$$T_1$$ is asymptotically standard normal when we assert the null hypothesis that the probability of class 1 for the two populations are the same. Consequently, $$T_1^2 \sim \chi^2_1$$, which is useful in two-tailed tests,

• so the proportion test is sometimes called the Chi-squared test for differences in probabilities.

## Bug bites and plague

To assess the association between a certain bug bite and a plague a group of subjects is categorized based on disease and exposure status. A sample of 100 subjects is collected. 40 subjects reported having a bug bite and had plague, where 10 had plague but no bug bite. Additionally 20 had the bug bite and no plague and 30 did not have either one.

Test whether the proportion of plague is the same for people with and without the bug bite. In other words, perform a two-tailed test of the hypotheses $$\mathcal{H}_0: p_1 = p_2$$, versus $$\mathcal{H}_1: p_1 \ne p_2$$.

Here is what our contingency table looks like in R

O <- rbind(c(40, 20), c(10, 30))
O
##      [,1] [,2]
## [1,]   40   20
## [2,]   10   30
n <- rowSums(O)
N <- sum(n)
C <- colSums(O)

and in a pretty table.

Plague Yes Plague No Total
Bite Yes 40 20 60
Bite No 10 30 40
Total 50 50 100

The observed value of the test statistic may be calculated in R as follows.

t1 <- sqrt(N) * (O[1,1]*O[2,2] - O[1,2]*O[2,1]) / sqrt(n[1]*n[2]*C[1]*C[2])
t1
## [1] 4.082483

Since this is a two-tailed test, we have a choice about whether to leverage asymptotic standard normality of $$T_1$$ or Chi-squared for $$T_1^2$$ when calculating the $$p$$-value.

2 * pnorm(-abs(t1))
## [1] 4.455709e-05
pchisq(t1^2, 1, lower.tail=FALSE)
## [1] 4.455709e-05
• But, of course, they give the same result.
• At the 5% level this is an easy reject of $$\mathcal{H}_0$$. We conclude that the proportion of plague is not the same for people with and without the bug bite.

The library function prop.test can be used as an automation tool.

prop.test(O, correct=FALSE)
##
##  2-sample test for equality of proportions without continuity
##  correction
##
## data:  O
## X-squared = 16.667, df = 1, p-value = 4.456e-05
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.2371271 0.5962063
## sample estimates:
##    prop 1    prop 2
## 0.6666667 0.2500000
• The result is identical.

## Diet and heart disease

A study is conducted to assess the association between cholesterol in diet and coronary heart disease. The study looked at subjects with a high cholesterol diet, of which 11 had the disease and 4 did not. Also, information was collected on subjects with a low cholesterol diet, of which 2 had the disease and 6 did not.

Is the probability of having coronary heart disease higher in patients with high cholesterol diet?

Here we are looking at an an upper-tailed test of $$\mathcal{H}_0: p_1 \leq p_2$$, versus $$\mathcal{H}_1: p_1 > p_2$$.

The contingency table is comprised of the following entries

O <- rbind(c(11, 4), c(2, 6))
n <- rowSums(O)
N <- sum(n)
C <- colSums(O)

and displayed as follows.

Disease Yes Disease No Total
High Cholesterol Yes 11 4 15
High Cholesterol No 2 6 8
Total 13 10 23

Cutting and pasting from above, the observed value of the test statistic may be calculated in R as follows.

t1 <- sqrt(N) * (O[1,1]*O[2,2] - O[1,2]*O[2,1]) / sqrt(n[1]*n[2]*C[1]*C[2])
t1
## [1] 2.227048

For this upper-tailed test, the $$p$$-value is calculated via $$\varphi = \mathbb{P}(Z \geq t_1)$$, which we can obtain in R as follows.

pnorm(t1, lower.tail=FALSE)
## [1] 0.01297203
• Therefore we reject $$\mathcal{H}_0$$ at the 5% level and conclude that the probability of having coronary heart disease is higher in patients with a high cholesterol diet.

The library call below verifies this result.

prop.test(O, alternative="greater", correct=FALSE)
## Warning in prop.test(O, alternative = "greater", correct = FALSE): Chi-
## squared approximation may be incorrect
##
##  2-sample test for equality of proportions without continuity
##  correction
##
## data:  O
## X-squared = 4.9597, df = 1, p-value = 0.01297
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.1691941 1.0000000
## sample estimates:
##    prop 1    prop 2
## 0.7333333 0.2500000
• However, notice the warning that we get; maybe we don’t have enough data to trust the asymptotic approximation.

# Fisher’s Exact Test

Fisher’s exact test also involves 2x2 contingency tables on $$N$$ observations, but the difference is that we assume that the row and column sums are fixed in advance, rather than being random, so that the table can be represented as follows.

Col 1 Col 2 Total
Row 1 $$x$$ $$r-x$$ $$r$$
Row 2 $$c-x$$ $$N-r-c+x$$ $$N-r$$
Total $$c$$ $$N-c$$ $$N$$
• I.e., each observation is classified into exactly one cell.

The observed test statistic is $$t_2 = x$$, the number of observations in the sell in row 1, column 1, and $$T_2 \sim \mathrm{HyperGeom}(N,r,c)$$, where the hypergeometric mass function is given by $P(T_2 = x) = f_{N,r,c}(x) = \frac{{r \choose x} {N-r \choose c-x}}{{N \choose c}} \quad \mbox{ for } x=0,1,\dots,\min\{r,c\}$ and takes on a mass of zero for all other values of $$x$$.

• Careful, this parameterization agrees with Wikipedia, but not with R’s *hyper functions.
• To use phyper, for example, specify (m=r, n=N-r, k=c).

The $$p$$-values are calculated in a similar fashion to those for the binomial test, yet using the hypergeometric distribution instead.

## Gender bias in hiring (upper-tailed)

Twenty six newly hired business majors, 13 males and 13 females, are assigned by the bank president for their new jobs. 15 of the new jobs are tellers and 11 are account representatives. We are interested in determining whether females are more likely to be assigned to the account representative job.

Suppose 5 females are assigned as tellers. What can you conclude?

A lower-tailed test seems appropriate if we are concerned that there are too few female tellers (i.e., that they are more likely to be assigned the account representative job), and where the table is set up so that females are the second row and teller is in the second column. So we have $$\mathcal{H}_0: p_1 \geq p_2$$, versus $$\mathcal{H}_1: p_1 < p_2$$.

So the contingency table looks as follows.

Account Rep Teller Total
Male 3 10 13
Female 8 5 13
Total 11 15 26

And from the table we can extract out the relevant quantities in R.

N <- 26
r <- 13
c <- 11
t2 <- x <- 3

To calculate the $$p$$-value we compute $$\mathbb{P}(T_2 \leq t_2)$$ under a hypergeometric distribution with $$N=26$$, $$r=13$$ and $$c= 11$$.

phyper(t2, r, N-r, c)
## [1] 0.05535065
• Not enough evidence to reject $$\mathcal{H}_0$$ at the 5% level, so we conclude that female employees are not more likely to be assigned the account representative job.

The library function fisher.test provides a nice automation. To use it we have to give it the full contingency table.

fisher.test(rbind(c(x, r-x),c(c-x, N-r-c+x)), alternative="less")
##
##  Fisher's Exact Test for Count Data
##
## data:  rbind(c(x, r - x), c(c - x, N - r - c + x))
## p-value = 0.05535
## alternative hypothesis: true odds ratio is less than 1
## 95 percent confidence interval:
##  0.000000 1.038739
## sample estimates:
## odds ratio
##  0.2013596

## Gene testing (two-tailed)

The alcohol dehydrogenase gene was sequenced in several individuals of three species of Drosophila. Varying sites were classified as synonymous (the nucleotide variation does not change an amino acid) or amino acid replacements, and they were also classified as polymorphic (varying within a species) or fixed differences between species. The two nominal variables are thus synonymicity (“synonymous” or “replacement”) and fixity (“polymorphic” or “fixed”).

In the absence of natural selection, the ratio of synonymous to replacement sites should be the same for polymorphisms and fixed differences. There were 43 synonymous polymorphisms, 2 replacement polymorphisms, 17 synonymous fixed differences, and 7 replacement fixed differences.

Here, the contingency table looks like

synonymous replacement Total
polymorphisms 43 2 45
fixed 17 7 24
Total 60 9 69

In R the relevant quantities are:

N <- 69
r <- 45
c <- 60
t2 <- x <- 43

The mean of our hypergrometric is $$cr/N = 39.13$$, so $$t_2$$ is above that value. Cutting and pasting from any of our many two-tailed binomial examples earlier in class, but replacing binomials with the appropriate hypergeometric gives the following $$p$$-value calculation.

lhs <- seq(0, floor(c*r/N))
eps <- sqrt(.Machine\$double.eps) ## numerical wiggle room
tL <- sum(dhyper(lhs, r, N-r, c) <= dhyper(t2, r, N-r, c) * (1 + eps))
phyper(tL - 1, r, N-r, c) + phyper(t2 - 1, r, N-r, c, lower.tail=FALSE)
## [1] 0.00665313
• Therefore we reject the null hypothesis at the 5% level; indeed there is a significant difference in synonymous/replacement ratio between polymorphisms and fixed differences.

The library function agrees:

fisher.test(rbind(c(x, r-x),c(c-x, N-r-c+x)))
##
##  Fisher's Exact Test for Count Data
##
## data:  rbind(c(x, r - x), c(c - x, N - r - c + x))
## p-value = 0.006653
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   1.437432 92.388001
## sample estimates:
## odds ratio
##   8.540913