--- title: "Homework 1 Solutions" subtitle: "Int. Data Analytics and Machine Learning (CMDA/CS/STAT 4654)" author: "Robert B. Gramacy ( : )
Department of Statistics, Virginia Tech" output: html_document --- ## Instructions This homework is due on **Wednesday, February 7th at 4pm** (the start of class). It covers the least squares and linear model lectures. All work must be submitted electronically. For full credit you must show all of your steps. Use of computational tools (e.g., R) is encouraged; and when you do, code inputs and outputs must be shown *in-line* (not as an appendix) and be accompanied by plain English that briefly explains what the code is doing. Extra credit, augmenting your score by at most 10%, is available for (neatly formatted) solutions authored in Rmarkdown, and submitted as a working .Rmd file. ### Problem 1: Alternate least squares (20 pts) *Suppose we assume the usual linear relationship $y_i = b_0 + b_1 x_i + e_i$ for $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. Now, show what we would calculate for the coefficients $b_0$ and $b_1$ under the following criteria.* a. (8 pts) *Suppose we insist that the average error is zero: $\frac{1}{n}\sum_{i=1}^n e_i = 0$. What condition must $b_0$ satisfy in order for that relationship to hold.?* We have: \begin{aligned} \frac{1}{n}\sum_{i=1}^n e_i = 0 \Rightarrow \frac{1}{n}\sum_{i=1}^n (y_i - b_0 - b_1 x_i) &= 0\\ \mbox{giving } \quad\quad 0 &= \bar{y} - b_0 - b_1 \bar{x} \\ b_0 &= \bar{y} -b_1 \bar{x}. \end{aligned} - In other words, we get the same intercept as we had with the "mean on the line" and least squares approaches from lecture. b. (12 pts) *Now further suppose that we insist upon the correlation between the inputs and the errors being zero: $r_{xe} = 0$. Using the $b_0$ you found in part a., what condition must $b_1$ satisfy in order for that relationship to hold?* We have: \begin{aligned} 0 &= \sum_{i=1}^n e_i (x_i - \bar{x}) = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)(x_i - \bar{x})\\ & = \sum_{i=1}^n (y_i - \bar{y} - b_1 (x_i-\bar{x}))(x_i-\bar{x})\\ \mbox{giving }\quad b_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}. \end{aligned} - In other words, we get the same slope as we had with "correlation times rise over run" and least squares approaches from lecture. ### Problem 2: Variance of the slope (20 pts) *In this problem we are going to calculate $\mathbb{V}\mathrm{ar}\{b_1\}$, the variance of the least squares estimate of the slope. Below is a re-writing of the expression for $b_1$ we had from lecture, with capital letters used to denote what is random. I.e., note that we are only treating the observed $Y_i$ values as random for the purposes of this calculation.* $$b_1 = \frac{\sum_{i=1}^n(x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n(x_i - \bar{x})^2}$$ a. (12 pts) *First show that $b_1$ can be written as a weighted sum of the $Y_i$ values. I.e., $b_1 = \sum_{i=1}^n w_i Y_i$, and provide an expression for the weights $w_i$.* Using a result from Homework 0, $$b_1 = \frac{\sum(x_i - \bar{x})(Y_i - \bar{Y})}{\sum(x_i - \bar{x})^2} = \frac{\sum(x_i - \bar{x})Y_i}{\sum(x_i - \bar{x})^2 } = \sum \frac{x_i-\bar{x}}{d}Y_i = \sum w_i Y_i$$ where $d = (n-1)s_x^2 = \sum (x_i - \bar{x})^2$ is the sum of squares for $x$. - So the weights are $$w_i = \frac{x_i - \bar{x}}{(n-1) \sum (x_i - \bar{x})^2}.$$ - Note that $b_1$ is more heavily weighted by $x_i$ that are far from $\bar{x}$. b. (8 pts) *Now, calculate $\mathbb{V}\mathrm{ar}\{b_1\} \equiv \mathbb{V}\mathrm{ar}\{b_1 \mid x_1, \dots, x_n\}$ by calculating the variance of the weighted sum you found in part a, treating $Y_i$ as $Y_i \mid x_i$, i.e., conditional on $x_i$.* Using the calculations above, we have: $$\sigma_{b_1}^2 = \mathbb{V}\mathrm{ar}\{b_1\} = \frac{\sigma^2}{\sum (x_i - \bar{x})^2} = \frac{\sigma^2}{(n-1)s_x^2}.$$ ### Problem 3: McDonalds (20 pts) *You are the proud owner of eight McDonald's franchises around Roanoke. You decide to do a little experiment by setting the price of a Happy Meal across the restaurants. (Assume that demographics and clientele are similar for each franchise.) You gather data on the number of Happy Meals sold at each franchise during a week of the pricing experiment.* {r} happymeal <- data.frame(price=c(1.5, 1.5, 1.75, 2.0, 2.0, 2.25, 2.5, 2.5), sales=c(420, 450, 420, 380, 440, 380, 360, 360))  Before we get started, lets "attach'" the data frame so we can access the columns directly. {r} attach(happymeal)  a. (4 pts) *Ignore price. If the sales are iid with mean $390$ and variance $\sigma^2$, what would you estimate for $\sigma^2$? Note, 390 is not equal to the average of the sales numbers above, but that doesn't mean it is a bad estimate.* Since we are using a specified rather than estimated mean, the denominator in the average sum-of-squares is $n$ rather than $n-1$. {r} s2.marg <- mean( (sales-390)^2 ) s2.marg  b. (4 pts) *Now, assume that you model sales as independently distributed with variance $\sigma^2$ and mean $\mathbb{E}\{\mathrm{sales}_i\}=500 - 60\cdot \mathrm{price}_i$. What would you estimate for $\sigma^2$? By comparison to your estimate in (a.), what does this say about this model?* Same goes here; we are using a specified mean as a linear function of price, so we use $n$ rather than $n-1$ in the denominator. {r} s2.guess <- mean((sales - 500 + 60*price)^2) s2.guess  - Since this estimate is lower we can conclude that the model is a better fit than the marginal (mean) estimate obtained without linearly accounting for price. c. (4 pts) *Find the correlation between price and sales, and use this to fit a regression line to the data. Plot the data and your line together and describe the fit.* Following the "correlation times rise over run" and "mean on the line" approach from lecture we obtain the following coefficients. {r} r <- cor(sales, price) r b1 <- r*sd(sales)/sd(price) b1 b0 <- mean(sales) - b1*mean(price) b0  The plot below shows the data and overlays the fitted line. {r, dev.args=list(bg='transparent'), fig.width=5, fig.height=4.5, fig.align="center"} plot(price, sales, pch=20, main = "sales vs price regression") abline(a=b0, b=b1, col=2) legend("topright", "fitted line", col=2, lty=1)  d. (4 pts) *What is the average of the residuals from your fit in (iii)? Is this good?* First we calculate the residuals, then average. {r} e <- sales - (b0 + b1*price) mean(e)  As we saw in Question 1, above, we know these residuals should exactly average zero. It is good to get in practice what theory suggests we should obtain. e. (4 pts) *How would you use the residuals to estimate $\sigma^2$ in this fitted model? How does this estimate suggest that the fitted model compares to those in (a.) and (b.)?* Since we estimated two coefficients we should divide by $n-2$ when calculating the average residual sum of squares. (We'll talk more in detail about this later.) {r} s2.model <- sum(e^2)/(length(sales)-2) s2.model  - This is the lowest estimate yet, suggesting an even better fit than any of the guessed models above. ### Problem 4: Synthetic data (20 pts) *Use the rnorm function in R to generate 100 samples of $X \sim \mathcal{N}(0,2)$ (for help use ?rnorm ) and for each draw, simulate $Y_i$ from the simple linear regression model $Y_i = 2.5-1.0X_i + \epsilon_i$, where $\epsilon_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0,3)$.* Before we get started, here is our data-generation code. It is instructive to try running this a few times with different data to see how things the output of parts a.--d. below change. {r} x <- rnorm(100,0,sqrt(2)) y <- 2.5 - x + rnorm(100,0,sqrt(3)) reg <- lm(y~x)  a. (4 pts) *Show the scatter plot of $Y$ versus $X$ along with the true regression line.* The code generating the plot and the plot are below. {r, dev.args=list(bg='transparent'), fig.width=5, fig.height=4.5, fig.align="center"} plot(x,y, pch=20, col=grey(.5), ylim=c(-5,10), main="Simulated Data") abline(a=2.5, b=-1) legend("topright", "true", lty=1)  b. (7 pts) *Split the sample into 2 subsets of size 25 and 75. For each subset, run the regression of $Y$ on $X$. Add each fitted regression line (use color) to your plot from (a.). Why are they not the same?* The code below divides the data into two subsets (there are lots of equivalent ways to do this), and then performs the regression on those two subsets. The use of data frames simplifies the lm calls. {r} sub1 <- data.frame(x=x[1:25], y=y[1:25]) sub2 <- data.frame(x=x[26:100], y=y[26:100]) reg1 <- lm(y~x, data=sub1) reg2 <- lm(y~x, data=sub2)  Then we re-recreate the plot above and add the two fitted lines. {r, dev.args=list(bg='transparent'), fig.width=5, fig.height=4.5, fig.align="center"} plot(x,y, pch=20, col=grey(.5), ylim=c(-5,10), main="Simulated Data with Fits") abline(a=2.5, b=-1) abline(a=reg1$coef[1], b=reg1$coef[2], col=2) abline(a=reg2$coef[1], b=reg2$coef[2], col=4) legend("topright", col=c(1,2,4), lty=1, legend=c("true", "n=25","n=75"))  - Of course, the fitted lines are not the same as the true line. - And the $n=75$ line should in general be closer to the true line, although this may not be the case for all random subsets of the data. c. (2 pts) *Considering your two samples, what are your marginal sample means (i.e., estimates) for $Y$? What is the true marginal mean?* The true marginal mean is may be calculated from the data generating formulas above as follows. $$\mathbb{E}\{2.5 - X - e\} = 2.5 - \mathbb{E}\{X\} + \mathbb{E}\{e\} = 2.5 - 0 + 0 = 2.5.$$ The sample mean is calculated from the data as follows. {r} samplemean <- c(mean(sub1$y), mean(sub2$y)) samplemean  - On average the first one will be farther from the truth than the second one, because it was based on a smaller data set. d. (7 pts) *Calculate the bounds of the true 90% prediction interval and add them to your plot (use ?qnorm for help with quantiles and and lty=2 in lines() to get a dashed line). What percentage of your observations are outside of this interval?* There are many ways to specify a line in R; the easiest way in this case is to provide two pairs of points as follows. {r, dev.args=list(bg='transparent'), fig.width=5, fig.height=4.5, fig.align="center"} plot(x,y, pch=20, col=grey(.5), ylim=c(-5,10), main="Simulated Data with True Interval(s)") abline(a=2.5, b=-1) xx <- c(min(x), max(x)) lines(xx, 2.5 - xx + qnorm(.95, sd=sqrt(3)), lty=2) lines(xx, 2.5 - xx + qnorm(.05, sd=sqrt(3)), lty=2)  We could count the number of points by inspecting the plot above, but that's tedious. The code below breaks down the calculation of the number of points inside the interval(s) into a count of those which are over the lower line and under the upper line. {r} over <- y > 2.5 - x + qnorm(.95, sd=sqrt(3)) under <- y < 2.5 - x + qnorm(.05, sd=sqrt(3))  The output below shows the $x$-coordinates of those points calculated above, mostly as a matter of interest. {r} x[over] x[under]  Then, the proportion outside may be calculated as follows. {r} outside <- (sum(over) + sum(under))/100 outside  - On average, this should be 0.1, but for a particular random data set it could vary in either direction (higher or lower). ### Problem 5: Maintenance costs (20 pts) *The cost of maintenance of a certain type of tractor seems to increase with age. The file tractor.csv contains ages (years) and 6-monthly maintenance costs for $n=17$ such tractors.* First, lets read in the data and attach its columns for easy reference. {r} tractor <- read.csv("tractor.csv") attach(tractor)  a. (4 pts) *Create a plot of tractor maintenance cost versus age.* The data is visualized by the scatterplot below. {r, dev.args=list(bg='transparent'), fig.width=5, fig.height=4.5, fig.align="center"} plot(age, cost, main="tractor maintenance")  b. (6 pts) *Find the least squares fit to the model* $$\mathrm{cost}_i = b_0 + b_i \mathrm{age}_i + e_i$$ *in two ways: (i) using the lm command and (ii) by calculating a correlation and standard deviations [verify that they give the same answer]. Overlay the fitted line on the plot from part (a.).* First using the lm command: {r} fit <- lm(cost ~ age) fitcoef  And then with the "correlation times rise over run/mean on the line" approach: {r} b1 <- cor(cost,age)*sd(cost)/sd(age) b0 <- mean(cost) - mean(age)*b1 c(b0, b1)  The code below duplicates the plot above and adds the fitted line to the plot. {r, dev.args=list(bg='transparent'), fig.width=5, fig.height=4.5, fig.align="center"} plot(age, cost, main="tractor maintenance") abline(b0, b1) legend("topleft", "LS fit", lty=1)  c. (5 pts) *What is the variance decomposition for the regression model fit in (b.) (i.e., what are SST, SSE, and SSR?). What isR^2$for the regression?* The necessary coefficients can be extracted from the output of the anova command. {r} anova(fit)  They are as follows. {r} SSE <- 1444158 SSR <- 860884 SST <- 1444158 + 860884  Using those, the$R^2$coefficient of variation can be calculated as follows. {r} R2 <- SSR/SST R2  - The interpretation is that 37% of the variability in the data is explained by the least squares regression line. Alternative ways of calculating the same$R^2$value include the following commands in R. {r} mean((fit$fitted - mean(cost))^2)/mean((cost - mean(cost))^2) summary(fit)$r.squared  d. (5 pts) *Suppose you were considering buying a tractor that is three years old. Using the values obtained in (b.), what would you expect your six-monthly maintenance costs to be?* Using the coefficients estimated above, we just plug them in with$x=3\$ to obtain {r} m <- b0 + b1 * 3 m  so the six-monthly maintenance costs for this 3-year-old tractor is about r round(m)` dollars.