Department of Statistics, Virginia Tech

This homework is due on **Thursday, September 14th at 2pm** (the start of class). Please turn in all your work. The purpose of this homework is to refresh concepts learned in previous statistics courses.

*An experiment consists of tossing a die and then flipping a coin once if the number on the die is even. If the number on the die is odd, the coin is flipped twice. Using the notation 4H, for example, to denote the outcome that the die comes up 4 and then the coin comes up heads, and 3HT to denote the outcome that the die comes up 3 followed by a head and then a tail on the coin, show all the elements of the sample space. Hint: There are 18.*

\[ S=\{1\mathrm{HH},1\mathrm{HT},1\mathrm{TH},1\mathrm{TT},2\mathrm{H},2\mathrm{T}, 3\mathrm{HH},3\mathrm{HT},3\mathrm{TH},3\mathrm{TT},4\mathrm{H},4\mathrm{T}, 5\mathrm{HH},5\mathrm{HT},5\mathrm{TH},5\mathrm{TT},6\mathrm{H},6\mathrm{T}\} \]
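As a sanity check, the sample space can be enumerated in R. This is just a sketch (the `outcomes` name is for illustration only): one flip for even faces, two flips for odd faces.

```
# Even die face: one coin flip; odd face: two flips.
outcomes <- unlist(lapply(1:6, function(d) {
  flips <- if (d %% 2 == 0) c("H", "T") else c("HH", "HT", "TH", "TT")
  paste0(d, flips)
}))
length(outcomes)  # 18, matching the hint
```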

*In the field of quality control the science of statistics is often used to determine if a process is “out of control”. Suppose the process is, indeed, out of control and 20% of items produced are defective.*

*If three items arrive off the process line in succession, what is the probability that all three are defective?*

Assuming the three items are independent,

\[ \mathbb{P}(D_1\cap D_2\cap D_3)=\mathbb{P}(D_1) \cdot \mathbb{P}(D_2) \cdot \mathbb{P}(D_3)=(0.2)^3=0.008 \]

*If four items arrive in succession, what is the probability that three are defective?*

\[ \mathbb{P}(\mbox{three out of four are defective})={4\choose3}(0.2)^3(1-0.2)=0.0256 \]
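Both answers can be verified numerically in R; a quick sketch, assuming independent items:

```
p <- 0.2           # defect probability when the process is out of control
p^3                # P(all three defective) = 0.008
dbinom(3, 4, p)    # P(exactly three of four defective) = 0.0256
```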

*The probability that a patient recovers from a rare blood disease is 0.4. If 15 people are known to have contracted this disease, what is the probability that*

*at least 10 survive?*

\[ \mathbb{P}(X\geq10)=\sum_{x=10}^{15}{15\choose x}(0.4)^x(1-0.4)^{15-x}=0.0338 \]

*from 3 to 8 survive?*

\[ \mathbb{P}(3\leq X\leq8)=\sum_{x=3}^{8}{15\choose x}(0.4)^x(1-0.4)^{15-x}=0.8778 \]

*exactly 5 survive?*

\[ \mathbb{P}(X=5)={15\choose 5}(0.4)^5(1-0.4)^{15-5}=0.1859 \]
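The three probabilities above can be double-checked in R with `pbinom` and `dbinom`, taking \(X\sim\mathrm{Bin}(15, 0.4)\) as the number of survivors:

```
# X ~ Bin(15, 0.4): number of survivors out of 15 patients
pbinom(9, 15, 0.4, lower.tail=FALSE)       # P(X >= 10)     ~ 0.0338
pbinom(8, 15, 0.4) - pbinom(2, 15, 0.4)    # P(3 <= X <= 8) ~ 0.8778
dbinom(5, 15, 0.4)                         # P(X = 5)       ~ 0.1859
```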

*A multiple-choice quiz has 200 questions, each with 4 possible answers of which only 1 is the correct answer. Suppose that a student has no knowledge on 80 of the 200 problems and therefore will guess.*

If \(T\) is the (random) number of correct answers, then \(T\sim\mathrm{Bin}(80, 0.25)\).

```
n <- 80
p <- 0.25
```

*Calculate the probability that*

*guesswork in the 80 problems will yield less than 20 correct answers.*

```
plt80 <- pbinom(19, n, p)
plt80
```

`## [1] 0.4571637`

Therefore \(\mathbb{P}(T < 20) = \mathbb{P}(T \leq 19) \approx\) 0.4572.

*guesswork in the 80 problems will yield more than 40 correct answers.*

```
pgt40 <- 1 - pbinom(40, n, p)
pgt40
```

`## [1] 4.174442e-07`

Therefore \(\mathbb{P}(T > 40) = 1 - \mathbb{P}(T \leq 40) \approx\) 0.

*guesswork yields from 25 to 30 correct answers inclusive.*

```
pf25t30 <- pbinom(30, n, p) - pbinom(24, n, p)
pf25t30
```

`## [1] 0.1192705`

Therefore \(\mathbb{P}(25 \leq T \leq 30) = \mathbb{P}(T \leq 30) - \mathbb{P}(T \leq 24) \approx\) 0.1193.

*The following data represent the running times of films produced by motion picture companies.*

`times <- c(103, 94, 110, 87, 98, 97, 82, 123, 92, 175, 88, 118)`

*Assume a normal distribution and do the following:*

*Find a 95% confidence interval for the mean of film running times.*

The CI for the mean follows \(\bar{x} \pm t^{n-1}_{\alpha/2}\times s/\sqrt{n}\), so in R we have:

```
xbar <- mean(times)
s <- sd(times)
n <- length(times)
se <- s/sqrt(n)
alpha <- 0.05
tq <- qt(1-alpha/2, n-1)
CIm <- xbar + c(-1,1) * tq * se
CIm
```

`## [1] 89.59785 121.56882`

So the 95% CI for the mean is [89.5979, 121.5688].

*Find a 95% confidence interval for the variance of running times.*

The CI for the variance follows

\[ \left[\frac{(n-1)s^2}{\chi^{2}_{1-\alpha/2,\,n-1}}, \frac{(n-1)s^2}{\chi^{2}_{\alpha/2,\,n-1}}\right], \]

where \(\chi^{2}_{q,\,n-1}\) denotes the \(q\)-quantile of the chi-squared distribution with \(n-1\) degrees of freedom (as computed by `qchisq`), so in R we have:

```
cq1 <- qchisq(alpha/2, n-1)
cq2 <- qchisq(1-alpha/2, n-1)
CIv <- (n-1)*s^2 / c(cq2, cq1)
CIv
```

`## [1] 317.6506 1824.7841`

So the 95% CI for the variance is [317.6506, 1824.7841].

*You now find that these data were collected from two different companies as follows:*

`times2 <- list(C1=times[1:5], C2=times[6:length(times)])`

*Find a 90% confidence interval for the difference between the average running times of films produced by the two companies under the assumption of unknown but equal variances.*

The difference in means is estimated by the difference in averages,

```
m1 <- mean(times2$C1)
m2 <- mean(times2$C2)
delta <- m1 - m2
```

and with equal variances we calculate a pooled estimate of \(\sigma^2\).

```
n1 <- length(times2$C1)
n2 <- length(times2$C2)
v1 <- var(times2$C1)
v2 <- var(times2$C2)
dof <- n1 + n2 - 2
v <- ((n1-1)*v1 + (n2-1)*v2)/dof
```

Then, the standard error follows the usual formula for two Gaussian populations.

```
se <- sqrt(v/n1 + v/n2)
data.frame(delta=delta, se=se)
```

```
##       delta       se
## 1 -12.31429 14.95207
```

Finally, we are ready to calculate the CI.

```
alpha <- 0.1
tq <- qt(1-alpha/2, dof)
delta + c(-1,1)*se*tq
```

`## [1] -39.41433 14.78576`

*Find a 90% confidence interval for the difference between the average running times of films produced by the two companies under the assumption of unknown and unequal variances.*

When the variances are unequal we need to calculate a new standard error and degrees of freedom, via the Welch–Satterthwaite approximation.

```
se <- sqrt(v1/n1 + v2/n2)
dof <- round((v1/n1 + v2/n2)^2 / ((v1/n1)^2/(n1-1) + (v2/n2)^2/(n2-1)))
dof
```

`## [1] 7`

Now we can present the CI, using the same formula as above.

```
alpha <- 0.1
tq <- qt(1-alpha/2, dof)
delta + c(-1,1)*se*tq
```

`## [1] -36.52092 11.89235`

*In a random sample of \(n=500\) families owning television sets in the city of Hamilton, Canada, it is found that \(x=345\) subscribed to HBO. Find a 95% confidence interval for the actual proportion of families in this city who subscribe to HBO.*

Here is the R version of the relevant info.

```
alpha <- 0.05
n <- 500
t <- 345
```

- First a CI via optimization.

```
alpha <- 0.05
## lower bound solves P(X >= t) = alpha/2; upper bound solves P(X <= t) = alpha/2
f <- function(x, alpha, lower.tail=TRUE) {
  q <- if (lower.tail) t else t - 1
  alpha - pbinom(q, n, x, lower.tail=lower.tail)
}
pU <- uniroot(f, c(0,1), alpha=alpha/2, tol=1e-9)$root
pL <- uniroot(f, c(0,1), alpha=alpha/2, lower.tail=FALSE, tol=1e-9)$root
c(pL, pU)
```

`## [1] 0.6474243 0.7303116`

- Alternatively via the Beta quantiles (the Clopper–Pearson interval).

`c(qbeta(alpha/2, t, n-t+1), qbeta(1-alpha/2, t+1, n-t))`

`## [1] 0.6474243 0.7303116`

Finally, we can do the whole thing with one call to a library routine.

`binom.test(t, n, p=3/4)`

```
##
## Exact binomial test
##
## data: t and n
## number of successes = 345, number of trials = 500, p-value =
## 0.002693
## alternative hypothesis: true probability of success is not equal to 0.75
## 95 percent confidence interval:
## 0.6474243 0.7303116
## sample estimates:
## probability of success
## 0.69
```

*A certain change in a process for manufacture of component parts is being considered. Samples are taken using both the existing and the new procedure so as to determine if the new process results in an improvement. If 75 of 1500 items from the existing procedure were found to be defective, and 80 of 2000 items from the new procedure were found to be defective, find a 90% confidence interval for the true difference in the fraction of defectives between the existing and the new process.*

Although these are Bernoulli processes, and I usually prefer to do binomial calculations directly rather than through Gaussian approximations, we have no exact test or CI for the difference of two binomial parameters. In this case the Gaussian approximation is much easier than any exact alternative, particularly since it is not at all obvious what that alternative would be.

The relevant formula is:

\[ \hat{p}_1-\hat{p}_2\pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]

In R:

```
n1 <- 1500; x1 <- 75
n2 <- 2000; x2 <- 80
p1 <- x1/n1; v1 <- p1*(1-p1)
p2 <- x2/n2; v2 <- p2*(1-p2)
se <- sqrt(v1/n1 + v2/n2)
alpha <- 0.1
zq <- qnorm(1-alpha/2)
CI <- p1-p2 + c(-1,1) * zq * se
CI
```

`## [1] -0.001731239 0.021731239`

- Therefore the CI is [-0.0017, 0.0217].

*The Edison Electric Institute has published figures on the annual number of kilowatt hours expended by various home appliances. It is claimed that a vacuum cleaner expends an average of 46 kilowatt hours per year. If a random sample of 12 homes included in a planned study indicates that vacuum cleaners expend an average of 42 kilowatt hours per year with a standard deviation of 11.9 kilowatt hours, does this suggest at the 0.05 level of significance that vacuum cleaners expend, on the average, less than 46 kilowatt hours annually? Assume the population of kilowatt hours to be normal.*

To settle the matter we consider the following lower-tailed test pitting \(\mathcal{H}_0: \mu \geq 46\) against \(\mathcal{H}_1: \mu<46\).

In R we calculate the observed \(t\)-statistic as follows.

```
mu0 <- 46
xbar <- 42
se <- 11.9/sqrt(12)
tstat <- (xbar - mu0)/se
tstat
```

`## [1] -1.164404`

That statistic should follow a Student-\(t\) distribution with \(n-1=11\) degrees of freedom. Therefore the \(p\)-value may be calculated as follows.

```
phi <- pt(tstat, 11)
phi
```

`## [1] 0.1344464`

- There is not enough evidence to reject \(\mathcal{H}_0\) at the 5% level.
- The number of kilowatt hours expended annually by home vacuum cleaners is not significantly less than 46.

*An experiment was performed to compare the abrasive wear of two different laminated materials. Twelve pieces of material 1 were tested by exposing each piece to a machine measuring wear. Ten pieces of material 2 were similarly tested. In each case, the depth of wear was observed. The samples of material 1 gave an average (coded) wear of 85 units with a sample standard deviation of 4, while the samples of material 2 gave an average of 81 and a sample standard deviation of 5. Can we conclude at the 0.05 level of significance that the abrasive wear of material 1 exceeds that of material 2 by more than 2 units? Assume the populations to be approximately normal with equal variances.*

Let \(\mu_1\) and \(\mu_2\) represent the population means of the abrasive wear for material 1 and material 2, respectively, and consider the following hypotheses: \(\mathcal{H}_0: \mu_1-\mu_2\leq 2\), versus \(\mathcal{H}_1: \mu_1-\mu_2>2\), an upper-tailed test.

The observed value of the \(t\)-statistic may be calculated as follows.

```
delta <- 85 - 81
dof <- 12 + 10 - 2
v <- (11*4^2 + 10*5^2)/dof
se <- sqrt(v/12 + v/10)
tstat <- (delta - 2)/se
tstat
```

`## [1] 1.012091`

The \(p\)-value is \(\mathbb{P}(T_{20} > 1.0121)\).

`1 - pt(tstat, dof)`

`## [1] 0.1617916`

- This is not low enough to reject \(\mathcal{H}_0\) at the 5% level.
- We are unable to conclude that the abrasive wear of material 1 exceeds that of material 2 by more than 2 units.

*A builder claims that heat pumps are installed in 70% of all homes being constructed today in the city of Richmond, VA. Would you agree with this claim if a random survey of new homes in this city shows that 8 out of 15 had heat pumps installed? Use a 0.01 level of significance.*

This is clearly a binomial experiment, and a two-tailed test involving \(\mathcal{H}_0: p=0.7\), versus \(\mathcal{H}_1: p\neq 0.7\). The observed value of the test statistic is \(t = 8\), where \(X \sim \mathrm{Bin}(15, 0.7)\). We may calculate the \(p\)-value in R as follows, using the fact that the observed value \(t=8\) is below the expected value \(np\).

```
n <- 15
pstar <- 0.7
t <- 8
n*pstar
```

`## [1] 10.5`

```
## count upper-tail outcomes whose probability is no larger than at t = 8
rhs <- seq(ceiling(n*pstar), n)
tU <- sum(dbinom(rhs, n, pstar) <= dbinom(t, n, pstar))
## lower tail at t plus the tU most extreme upper-tail outcomes
phi <- pbinom(t, n, pstar) + pbinom(n - tU, n, pstar, lower.tail=FALSE)
phi
```

`## [1] 0.1664102`

This matches the output of the library function, `binom.test`.

`binom.test(t, n, p=0.7)`

```
##
## Exact binomial test
##
## data: t and n
## number of successes = 8, number of trials = 15, p-value = 0.1664
## alternative hypothesis: true probability of success is not equal to 0.7
## 95 percent confidence interval:
## 0.2658613 0.7873333
## sample estimates:
## probability of success
## 0.5333333
```

Following what the text says, we could instead simply double the lower-tail probability:

`2*pbinom(t, n, pstar)`

`## [1] 0.2622851`

- No matter how you skin this cat, there is not enough evidence to reject \(\mathcal{H}_0\).
- We conclude that there is insufficient reason to doubt the builder's claim.

*A vote is to be taken among the residents of a town and the surrounding county to determine whether a proposed chemical plant should be constructed. The construction site is within the town limits, and for this reason many voters in the county feel that the proposal will pass because of the large proportion of town voters who favor the construction. To determine if there is a significant difference in the proportion of town voters and county voters favoring the proposal, a poll is taken. If 120 of 200 town voters favor the proposal and 240 of 500 county voters favor it, would you agree that the proportion of town voters favoring the proposal is higher than the proportion of county voters at the 0.05 significance level?*

We wish to test the following hypotheses: \(\mathcal{H}_0: p_1-p_2 \leq 0\), versus \(\mathcal{H}_1: p_1 - p_2 > 0\).

As in the CI problem above, although this is a question about Bernoulli processes, we are forced into a Gaussian approximation by the lack of a known distribution for the difference in binomial test statistics.

We start by calculating the estimates of \(p\) for the two populations.

```
p1 <- 120/200
p2 <- 240/500
```

The standard error and test statistic are

```
se <- sqrt(p1*(1-p1)/200 + p2*(1-p2)/500)
z <- (p1-p2)/se
z
```

`## [1] 2.911113`

The \(p\)-value for the right-tailed test is calculated as \(\mathbb{P}(Z > 2.9111)\). In R:

`pnorm(z, lower.tail=FALSE)`

`## [1] 0.001800721`

- This is an easy rejection of \(\mathcal{H}_0\) at the \(\alpha = 0.05\) level.
- The proportion of the town voters favoring the proposal is greater than the proportion of the county voters.