---
title: "Homework 2"
subtitle: "Int. Data Analytics and Machine Learning (CMDA/CS/STAT 4654)"
author: "Robert B. Gramacy, Department of Statistics, Virginia Tech"
output: html_document
---

## Instructions

This homework is due on **Wednesday, February 21st at 4pm** (the start of class). It covers the linear models and diagnostics & transforms lectures.

All work must be submitted electronically. For full credit you must show all of your steps. Use of computational tools (e.g., R) is encouraged; when you use them, code inputs and outputs must be shown *in-line* (not as an appendix) and be accompanied by plain English that briefly explains what the code is doing. Extra credit, augmenting your score by at most 10%, is available for (neatly formatted) solutions authored in Rmarkdown and submitted as a working .Rmd file.

### Problem 1: Estimation (20 pts)

Let $Y_1, \dots, Y_n$ be independent Poisson random variables with mean $\theta$.

a. (5 pts) Derive the method of moments estimator for $\theta$.
b. (5 pts) Derive the maximum likelihood estimator $\hat{\theta}_n$ for $\theta$. How does this compare to what you found in part a.?
c. (5 pts) Provide the asymptotic sampling distribution for $\hat{\theta}_n$.
d. (5 pts) With the following data values for $y$ given in R below, what are the chances that the data could have been generated from a Poisson with parameter $\theta = 5$? Or $\theta = 6$? Or $\theta = 7$?

```{r}
y <- c(3, 5, 6, 5, 2, 6, 6, 7, 8, 8, 7, 8, 0, 5, 7, 6, 6, 10, 6, 5,
       6, 7, 6, 9, 8, 4, 8, 7, 11, 9, 4, 4, 7, 9, 8, 6, 5, 6, 12, 10,
       7, 13, 8, 12, 9, 4, 10, 8, 4, 5)
```

### Problem 2: Tractors revisited (20 pts)

Revisit the tractor data in `tractor.csv` from [Homework 1](hw1.html). If necessary, first re-calculate your estimated coefficients from the least squares fit to the data.

a. (10 pts) Estimate $\sigma^2$, and use it to give a 95% prediction interval for a three year old tractor, taking uncertainty in your estimates of $\beta_0$ and $\beta_1$ into account.
b. 
(10 pts) On a plot including the data, provide a summary of the predictive distribution (i.e., predictive mean and 95\% interval with both estimated from the data) for tractors ranging from brand new to ten years old.

### Problem 3: Leverage (20 pts)

In this problem we will add mathematical heft to the concept of the leverage of a data point.

a. (15 pts) Show that the least squares fitted values $\hat{y}_i$, for $i=1,\dots, n$, can be written as a linear combination of all observed $y_j$ values, for $j=1,\dots,n$. That is, show that we can write
   $$\hat{y}_i = \sum_{j=1}^n h_{ij} y_j \quad \mbox{ for } \quad i=1, \dots, n.$$
   *Hint: $h_{ii} = h_i$, the leverage of the $i^{\mathrm{th}}$ data point from lecture.*
b. (5 pts) Now, show that the derivative of $\hat{y}_i$ with respect to $y_i$ is the leverage $h_i$. I.e., $\frac{d \hat{y}_i}{d y_i} = h_i$.

### Problem 4: Transforms (20 pts)

The file [transforms.csv](transforms.csv) linked from the course website contains 4 pairs of $x$s and $y$s. For each pair:

i. Fit the linear regression model $Y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0,\sigma^2)$. Plot the data and add the fitted line.
ii. Provide a scatterplot, normal Q-Q plot, and histogram for the studentized regression residuals.
iii. Using the residual scatterplots, state how the SLR model assumptions are violated.
iv. Determine the data transformation to correct the problems in (iii), fit the corresponding regression model, and plot the transformed data with new fitted line.
v. Provide plots to show that your transformations have (mostly) fixed the model violations.

*Each of i.--v. for each of four pairs will be assigned 1 point.*

### Problem 5: Cheese (20 pts)

This question considers sales data on volume as well as price and display activity for packages of [Borden Sliced Cheese](cheese.csv). 
For each of 88 stores in different US cities, the data contain repeated observations of the sales volume (`vol`, in terms of packages sold), unit price, and whether the product was advertised with an in-store display (`disp = 1` for display).

a. (9 pts) Ignoring price, do the in-store displays have an effect on log sales? Is there reason to suspect that your result is confounded by pricing strategies? *I trust you can figure out what confounded means.*
b. (9 pts) A better question: is price elasticity for Borden cheese affected by the presence of in-store advertisement? *Two hints: Testing if one value is equal to another is the same as testing if the difference is equal to zero. If $b$ and $b^\star$ are least squares coefficients from independent regression fits, then $\mathrm{sd}(b-b^\star) = \sqrt{s_{b}^2 + s_{b^\star}^2}$.*
c. (2 pts) Based on your experience bargain shopping in grocery stores, how would you explain the result you found in (b.) to someone who knows little about economics?
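
As a sketch of how the two hints in part b. of Problem 5 fit together: suppose $b$ and $b^\star$ come from two independent fits, with standard errors $s_b$ and $s_{b^\star}$. The idea of fitting one regression per display condition is an assumption made here for illustration, not a prescribed approach. The difference can then be standardized as

$$
z = \frac{b - b^\star}{\sqrt{s_b^2 + s_{b^\star}^2}},
$$

and, under a large-sample normal approximation, values of $|z|$ larger than about 2 suggest that the two coefficients differ at roughly the 5% level.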