Department of Statistics, Virginia Tech

A random sample of \(n_1\) observations is drawn from one population, and classified into one of two groups; and the same is done from a second population of size \(n_2\).

Class 1 | Class 2 | Total | |
---|---|---|---|

Population 1 |
\(O_{11}\) | \(O_{12}\) | \(n_1\) |

Population 2 |
\(O_{21}\) | \(O_{22}\) | \(n_2\) |

Total |
\(C_1\) | \(C_2\) | \(N = n_1 + n_2\) |

The test statistic is defined to be \(T_1 = 0\) if any of the column totals \(C_1\) or \(C_2\) is zero; otherwise \[ T_1 = \frac{\sqrt{N} (O_{11}O_{22} - O_{12}O_{21})}{\sqrt{n_1 n_2 C_1 C_2}}. \]

\(T_1\) is asymptotically standard normal when we assert the null hypothesis that the probability of class 1 for the two populations are the same. Consequently, \(T_1^2 \sim \chi^2_1\), which is useful in two-tailed tests,

- so the proportion test is sometimes called the
*Chi-squared test for differences in probabilities*.

To assess the association between a certain bug bite and a plague a group of subjects is categorized based on disease and exposure status. A sample of 100 subjects is collected. 40 subjects reported having a bug bite and had plague, where 10 had plague but no bug bite. Additionally 20 had the bug bite and no plague and 30 did not have either one.

Test whether the proportion of plague is the same for people with and without the bug bite. In other words, perform a two-tailed test of the hypotheses \(\mathcal{H}_0: p_1 = p_2\), versus \(\mathcal{H}_1: p_1 \ne p_2\).

Here is what our contingency table looks like in R

```
O <- rbind(c(40, 20), c(10, 30))
O
```

```
## [,1] [,2]
## [1,] 40 20
## [2,] 10 30
```

```
n <- rowSums(O)
N <- sum(n)
C <- colSums(O)
```

and in a pretty table.

Plague Yes | Plague No | Total | |
---|---|---|---|

Bite Yes |
40 | 20 | 60 |

Bite No |
10 | 30 | 40 |

Total |
50 | 50 | 100 |

The observed value of the test statistic may be calculated in R as follows.

```
t1 <- sqrt(N) * (O[1,1]*O[2,2] - O[1,2]*O[2,1]) / sqrt(n[1]*n[2]*C[1]*C[2])
t1
```

`## [1] 4.082483`

Since this is a two-tailed test, we have a choice about whether to leverage asymptotic standard normality of \(T_1\) or Chi-squared for \(T_1^2\) when calculating the \(p\)-value.

`2 * pnorm(-abs(t1))`

`## [1] 4.455709e-05`

`pchisq(t1^2, 1, lower.tail=FALSE)`

`## [1] 4.455709e-05`

- But, of course, they give the same result.
- At the 5% level this is an easy reject of \(\mathcal{H}_0\). We conclude that the proportion of plague is not the same for people with and without the bug bite.

The library function `prop.test`

can be used as an automation tool.

`prop.test(O, correct=FALSE)`

```
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: O
## X-squared = 16.667, df = 1, p-value = 4.456e-05
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.2371271 0.5962063
## sample estimates:
## prop 1 prop 2
## 0.6666667 0.2500000
```

- The result is identical.

A study is conducted to assess the association between cholesterol in diet and coronary heart disease. The study looked at subjects with a high cholesterol diet, of which 11 had the disease and 4 did not. Also, information was collected on subjects with a low cholesterol diet, of which 2 had the disease and 6 did not.

Is the probability of having coronary heart disease higher in patients with high cholesterol diet?

Here we are looking at an an upper-tailed test of \(\mathcal{H}_0: p_1 \leq p_2\), versus \(\mathcal{H}_1: p_1 > p_2\).

The contingency table is comprised of the following entries

```
O <- rbind(c(11, 4), c(2, 6))
n <- rowSums(O)
N <- sum(n)
C <- colSums(O)
```

and displayed as follows.

Disease Yes | Disease No | Total | |
---|---|---|---|

High Cholesterol Yes |
11 | 4 | 15 |

High Cholesterol No |
2 | 6 | 8 |

Total |
13 | 10 | 23 |

Cutting and pasting from above, the observed value of the test statistic may be calculated in R as follows.

```
t1 <- sqrt(N) * (O[1,1]*O[2,2] - O[1,2]*O[2,1]) / sqrt(n[1]*n[2]*C[1]*C[2])
t1
```

`## [1] 2.227048`

For this upper-tailed test, the \(p\)-value is calculated via \(\varphi = \mathbb{P}(Z \geq t_1)\), which we can obtain in R as follows.

`pnorm(t1, lower.tail=FALSE)`

`## [1] 0.01297203`

- Therefore we reject \(\mathcal{H}_0\) at the 5% level and conclude that the probability of having coronary heart disease is higher in patients with a high cholesterol diet.

The library call below verifies this result.

`prop.test(O, alternative="greater", correct=FALSE)`

```
## Warning in prop.test(O, alternative = "greater", correct = FALSE): Chi-
## squared approximation may be incorrect
```

```
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: O
## X-squared = 4.9597, df = 1, p-value = 0.01297
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.1691941 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.7333333 0.2500000
```

- However, notice the warning that we get; maybe we don’t have enough data to trust the asymptotic approximation.

Fisher’s exact test also involves 2x2 contingency tables on \(N\) observations, but the difference is that we assume that the row and column sums are fixed in advance, rather than being random, so that the table can be represented as follows.

Col 1 | Col 2 | Total | |
---|---|---|---|

Row 1 |
\(x\) | \(r-x\) | \(r\) |

Row 2 |
\(c-x\) | \(N-r-c+x\) | \(N-r\) |

Total |
\(c\) | \(N-c\) | \(N\) |

- I.e., each observation is classified into exactly one cell.

The observed test statistic is \(t_2 = x\), the number of observations in the sell in row 1, column 1, and \(T_2 \sim \mathrm{HyperGeom}(N,r,c)\), where the hypergeometric mass function is given by \[ P(T_2 = x) = f_{N,r,c}(x) = \frac{{r \choose x} {N-r \choose c-x}}{{N \choose c}} \quad \mbox{ for } x=0,1,\dots,\min\{r,c\} \] and takes on a mass of zero for all other values of \(x\).

- Careful, this parameterization agrees with Wikipedia, but not with R’s
`*hyper`

functions.

- To use
`phyper`

, for example, specify`(m=r, n=N-r, k=c)`

.

The \(p\)-values are calculated in a similar fashion to those for the binomial test, yet using the hypergeometric distribution instead.

Twenty six newly hired business majors, 13 males and 13 females, are assigned by the bank president for their new jobs. 15 of the new jobs are tellers and 11 are account representatives. We are interested in determining whether females are more likely to be assigned to the account representative job.

Suppose 5 females are assigned as tellers. What can you conclude?

A lower-tailed test seems appropriate if we are concerned that there are too few female tellers (i.e., that they are more likely to be assigned the account representative job), and where the table is set up so that females are the second row and teller is in the second column. So we have \(\mathcal{H}_0: p_1 \geq p_2\), versus \(\mathcal{H}_1: p_1 < p_2\).

So the contingency table looks as follows.

Account Rep | Teller | Total | |
---|---|---|---|

Male |
3 | 10 | 13 |

Female |
8 | 5 | 13 |

Total |
11 | 15 | 26 |

And from the table we can extract out the relevant quantities in R.

```
N <- 26
r <- 13
c <- 11
t2 <- x <- 3
```

To calculate the \(p\)-value we compute \(\mathbb{P}(T_2 \leq t_2)\) under a hypergeometric distribution with \(N=26\), \(r=13\) and \(c= 11\).

`phyper(t2, r, N-r, c)`

`## [1] 0.05535065`

- Not enough evidence to reject \(\mathcal{H}_0\) at the 5% level, so we conclude that female employees are not more likely to be assigned the account representative job.

The library function `fisher.test`

provides a nice automation. To use it we have to give it the full contingency table.

`fisher.test(rbind(c(x, r-x),c(c-x, N-r-c+x)), alternative="less")`

```
##
## Fisher's Exact Test for Count Data
##
## data: rbind(c(x, r - x), c(c - x, N - r - c + x))
## p-value = 0.05535
## alternative hypothesis: true odds ratio is less than 1
## 95 percent confidence interval:
## 0.000000 1.038739
## sample estimates:
## odds ratio
## 0.2013596
```

- leading to identical results.

The alcohol dehydrogenase gene was sequenced in several individuals of three species of Drosophila. Varying sites were classified as synonymous (the nucleotide variation does not change an amino acid) or amino acid replacements, and they were also classified as polymorphic (varying within a species) or fixed differences between species. The two nominal variables are thus synonymicity (“synonymous” or “replacement”) and fixity (“polymorphic” or “fixed”).

In the absence of natural selection, the ratio of synonymous to replacement sites should be the same for polymorphisms and fixed differences. There were 43 synonymous polymorphisms, 2 replacement polymorphisms, 17 synonymous fixed differences, and 7 replacement fixed differences.

Here, the contingency table looks like

synonymous | replacement | Total | |
---|---|---|---|

polymorphisms |
43 | 2 | 45 |

fixed |
17 | 7 | 24 |

Total |
60 | 9 | 69 |

In R the relevant quantities are:

```
N <- 69
r <- 45
c <- 60
t2 <- x <- 43
```

The mean of our hypergrometric is \(cr/N = 39.13\), so \(t_2\) is above that value. Cutting and pasting from any of our many two-tailed binomial examples earlier in class, but replacing binomials with the appropriate hypergeometric gives the following \(p\)-value calculation.

```
lhs <- seq(0, floor(c*r/N))
eps <- sqrt(.Machine$double.eps) ## numerical wiggle room
tL <- sum(dhyper(lhs, r, N-r, c) <= dhyper(t2, r, N-r, c) * (1 + eps))
phyper(tL - 1, r, N-r, c) + phyper(t2 - 1, r, N-r, c, lower.tail=FALSE)
```

`## [1] 0.00665313`

- Therefore we reject the null hypothesis at the 5% level; indeed there is a significant difference in synonymous/replacement ratio between polymorphisms and fixed differences.

The library function agrees:

`fisher.test(rbind(c(x, r-x),c(c-x, N-r-c+x)))`

```
##
## Fisher's Exact Test for Count Data
##
## data: rbind(c(x, r - x), c(c - x, N - r - c + x))
## p-value = 0.006653
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.437432 92.388001
## sample estimates:
## odds ratio
## 8.540913
```