---
title: "Homework 0 Solutions"
subtitle: "Int. Data Analytics and Machine Learning (CMDA/CS/STAT 4654)"
author: "Robert B. Gramacy ( : ) Department of Statistics, Virginia Tech"
output: html_document
---

## Instructions

This homework is due on **Wednesday, January 24th at 4pm** (the start of class). Its purpose is to refresh prerequisite concepts. All work must be submitted electronically. For full credit you must show all of your steps. Use of computational tools (e.g., R) is encouraged; when you do, code inputs and outputs must be shown *in-line* (not as an appendix) and be accompanied by plain English that briefly explains what the code is doing.

### Problem 1: Gaussian probabilities (10 pts)

*Suppose $X \sim \mathcal{N}(-10, 5^2)$, i.e., $X$ has a Gaussian (normal) distribution with a mean of $-10$ and a variance of 25.*

a. (6 pts) *Compute the following:*

\begin{aligned}
\mathbb{P}(X > -10) && \mathbb{P}(X < -20) && \mbox{and} && \mathbb{P}(X = 0)
\end{aligned}

The first probability is 0.5 because $-10$ is the mean, about which the Gaussian density is symmetric. The last one is zero because the Gaussian is a density for real-valued quantities, and so the probability of any singleton is zero. The middle one is calculated by the following R code.

```{r}
pnorm(-20, -10, 5)
```

b. (4 pts) *Express $\mathbb{P}(-22 \leq X \leq -12)$ in terms of $Z$, the standard normal random variable: $Z \sim \mathcal{N}(0,1)$, and then use that expression to calculate the value of that probability statement.*

Subtracting the mean ($-10$) from all three parts of the inequality and then dividing through by the standard deviation (5) gives

\begin{aligned}
\mathbb{P}(-22 \leq X \leq -12) &= \mathbb{P}\left(\frac{-22 + 10}{5} \leq Z \leq \frac{-12 + 10}{5} \right) \\
&= \mathbb{P}(-2.4 \leq Z \leq -0.4).
\end{aligned}

The final expression may be evaluated in R as follows.

```{r}
pnorm(-0.4) - pnorm(-2.4)
```

### Problem 2: Functions of random variables (10 pts)

*Suppose that $\mathbb{E}\{X\} = \mathbb{E}\{Y\} = 0$, $\mathbb{V}\mathrm{ar}\{X\} = \mathbb{V}\mathrm{ar}\{Y\} = 1$ and $\mathbb{C}\mathrm{or}(X,Y) = 0.5$.
Compute:*

\begin{aligned}
\mathbb{E}\{3X - 2Y\} && \mathbb{V}\mathrm{ar}\{3X - 2Y\} && \mbox{and} && \mathbb{E}\{X^2\}.
\end{aligned}

By linearity of expectation, we have
$$\mathbb{E}\{3X - 2Y\} = 3\mathbb{E}\{X\} - 2 \mathbb{E}\{Y\} = 0.$$
Since the variance of a linear combination picks up squared (non-random) coefficients plus a covariance cross-term, we have
\begin{aligned}
\mathbb{V}\mathrm{ar}\{3X - 2Y\} &= 3^2\mathbb{V}\mathrm{ar}\{X\} + 2^2\mathbb{V}\mathrm{ar}\{Y\} - 2\times 3\times 2\, \mathbb{C}\mathrm{ov}(X,Y) \\
&= 13 - 12\, \mathbb{C}\mathrm{or}(X,Y)\sqrt{\mathbb{V}\mathrm{ar}\{X\}\mathbb{V}\mathrm{ar}\{Y\}} \\
&= 13 - 12/2 \\
&= 7.
\end{aligned}
Finally, using the definition $\mathbb{V}\mathrm{ar}\{X\} = \mathbb{E}\{X^2\} - \mathbb{E}\{X\}^2$, we have
$$\mathbb{E}\{X^2\} = \mathbb{V}\mathrm{ar}\{X\} + \mathbb{E}\{X\}^2 = 1 + 0 = 1.$$

### Problem 3: Summation notation -- computation (10 pts)

*Let $z$ be a vector of length $n = 4$ defined as `z` in R as follows.*

```{r}
z <- c(2, -2, 3, -3)
n <- length(z)
n
```

a. (3 pts) *Compute $\sum_{i=1}^n z_i$ for `z` defined above.*

In R:

```{r}
sum(z)
```

b. (3 pts) *Let $\bar{z} = \frac{1}{n} \sum_{i=1}^n z_i$ and calculate $\sum_{i=1}^n (z_i - \bar{z})^2$ for `z` above.*

This can be done (at least) two different ways in R.

```{r}
zbar <- mean(z)
sum((z - zbar)^2)
```

Or:

```{r}
var(z) * (n - 1)
```

c. (4 pts) *Provide an expression for the sample variance using summation notation, generically for an independent and identically distributed (iid) sample of observations $z_1, \dots, z_n$.
Then calculate the sample variance of `z` as defined above.*

The sample variance is usually defined as
$$s_z^2 = \frac{1}{n-1} \sum_{i=1}^n (z_i - \bar{z})^2.$$
In R this is just `var`:

```{r}
var(z)
```

### Problem 4: Summation notation -- algebra (10 pts)

*For two general collections of $n$ numbers $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$ show that*
$$\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^n (X_i - \bar{X}) Y_i.$$

Expanding through the second factor gives
\begin{aligned}
\sum_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y}) &= \sum_{i=1}^n (X_i-\bar{X})Y_i - \bar{Y}\sum_{i=1}^n (X_i-\bar{X}) \\
&= \sum_{i=1}^n (X_i-\bar{X})Y_i - \bar{Y}\times 0,
\end{aligned}
establishing the desired result, since deviations from the mean always sum to zero.

### Problem 5: The sampling distribution (15 pts)

*Suppose that we have a random sample $Y_1, \dots, Y_n$ where $Y_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, 4)$ for $i=1,\dots, n$, for some value $\mu$ denoting the mean of the Gaussian distribution.*

a. (4 pts) *What is the expectation of the sample mean: $\mathbb{E}\{ \bar{Y} \}$?*

$$\mathbb{E}\{ \bar{Y} \} = \mathbb{E}\left\{ \frac{1}{n} \sum_{i=1}^n Y_i \right\} = \frac{1}{n} \sum_{i=1}^n \mathbb{E} \{Y_i\} = \frac{1}{n} n \mu = \mu$$

b. (4 pts) *What is the variance of the sample mean: $\mathbb{V}\mathrm{ar}\{ \bar{Y} \}$?*

Independent and identically distributed (iid) means, in particular, that the covariance between the $n$ samples is zero. Therefore, by linearity of variances for uncorrelated variables,
$$\mathbb{V}\mathrm{ar}\{ \bar{Y} \} = \mathbb{V}\mathrm{ar}\left\{ \frac{1}{n} \sum_{i=1}^n Y_i \right\} = \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}\mathrm{ar} \{Y_i\} = \frac{1}{n^2} \, n \cdot 4 = \frac{4}{n}.$$

c. (4 pts) *What is the variance for another iid realization, $Y_{n+1}$?*

Since the $Y_i$ are iid, they all share the same variance. Therefore $\mathbb{V}\mathrm{ar} \{Y_{n+1}\} = 4$.

d.
(3 pts) *What is the standard error of $\bar{Y}$?*

The standard error is defined as the square root of the variance of the estimator, in this case $\sqrt{\mathbb{V}\mathrm{ar}\{ \bar{Y} \}}$. Using part b. above, the standard error of $\bar{Y}$ is thus $2/\sqrt{n}$.

### Problem 6: Calculus (25 pts)

a. (6 pts) *Let $f(x) = \lambda e^{-\lambda x}$ for some fixed parameter $\lambda$ and calculate the following.*

\begin{aligned}
\frac{d}{dx} f(x) && \int_0^1 f(x) \; dx && \mbox{and} && \int_0^\infty f(x) \; dx
\end{aligned}

Using the chain rule, the derivative of $f(x)$ is
$$\frac{d}{dx} \lambda e^{-\lambda x} = - \lambda^2 e^{-\lambda x}.$$
For the first integral, we could calculate the anti-derivative by hand, or we could recognize that $f(x)$ is the density of an exponential random variable. The cumulative distribution function of an exponential is
$$F(x) = \int_0^x f(t) \; dt = \int_0^x \lambda e^{-\lambda t} \; dt = 1 - e^{-\lambda x}.$$
Plugging $x=1$ into this formula gives $1 - e^{-\lambda}$. The second integral, with bounds $0$ and $\infty$, evaluates to 1 since every probability density must integrate to unity over its entire range.

b. (7 pts) *Let $g(x) = \exp\left\{-\frac{(x - \mu)^2}{2 \sigma^2}\right\}$ for some fixed parameters $\mu$ and $\sigma^2$ and calculate the following.*

\begin{aligned}
\frac{d}{dx} g(x) && \int_{-\infty}^\infty g(x) \; dx && \mbox{and} && \int_{\mu}^\infty g(x) \; dx
\end{aligned}

Similarly, using the chain rule, the derivative of $g(x)$ is
$$\frac{d}{dx} \exp\left\{-\frac{(x - \mu)^2}{2 \sigma^2}\right\} = - \exp\left\{-\frac{(x - \mu)^2}{2 \sigma^2}\right\} \frac{2(x-\mu)}{2 \sigma^2} = - \frac{(x-\mu)}{\sigma^2} \exp\left\{-\frac{(x - \mu)^2}{2 \sigma^2}\right\}.$$
The first integral above is most easily calculated by recognizing that $g(x)$ is the "kernel" of a Gaussian density. All that is missing is the normalizing constant, which is not a function of $x$.
That is, if $h(x)$ is the density of a Gaussian, then $h(x) = \frac{1}{\sqrt{2\pi \sigma^2}} g(x)$.
$$\mbox{Since } \int_{-\infty}^\infty h(x) \;dx = 1, \; \mbox{ we have that } \; \int_{-\infty}^\infty g(x) \; dx = \sqrt{2\pi \sigma^2}.$$
By similar considerations, the final integral evaluates to $\sqrt{2\pi \sigma^2}/2$, since the bounds of integration span the right half of the symmetric Gaussian density.

c. (4 pts) *Again with $g(x)$ defined as above, setting $\mu=2$ and $\sigma^2=4$, evaluate the following.*
$$\int_3^4 g(x) \; dx$$
*Hint: you may find software/numerical procedures helpful here.*

Using the Gaussian cumulative distribution function in R, this is simply

```{r}
sqrt(2 * pi * 4) * (pnorm(4, 2, 2) - pnorm(3, 2, 2))
```

d. (8 pts) *With $g(x)$ defined above, but now viewing it as a function of $x$ and $\mu$, i.e., $g(x, \mu)$, find an expression for the value of $\mu$ for fixed $x_1,\dots, x_n$ which maximizes*
$$\prod_{i=1}^n g(x_i, \mu).$$
*Hint: start by taking the log.*

The logarithm is a monotonic transformation, so the value of $\mu$ maximizing the expression above also maximizes its log. Working with the log instead simplifies things as follows.
$$\log \prod_{i=1}^n g(x_i, \mu) = \sum_{i=1}^n \log g(x_i, \mu) = - \sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2}$$
To maximize, set the derivative to zero and solve.
\begin{aligned}
0 &\stackrel{\mathrm{set}}{=} \frac{d}{d\mu} \left(- \sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} \right) = \frac{1}{2\sigma^2} \sum_{i=1}^n 2(x_i - \mu) \\
\mbox{Therefore } \; \hat{\mu} &= \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}.
\end{aligned}
The second derivative, $-n/\sigma^2$, is negative, confirming that $\hat{\mu} = \bar{x}$ is indeed a maximum.

### Problem 7: Linear algebra (20 pts)

a. (4 pts) *Suppose that $X$ is an $n \times p$ matrix, and $Y$ is an $n \times 1$ matrix, i.e., an $n$-vector. Write $X^\top Y$ as a vector of sums using the $\Sigma$ notation.*

Let $j \in \{1,\dots,p\}$ index the rows of $X^\top$, so that $X_{:j} = (X_{1j}, \dots, X_{nj})^\top$ is an $n$-vector.
We have
$$X_{:j}^\top Y = \sum_{i = 1}^n X_{ij} Y_i.$$
Since the resulting matrix--vector product is $(X_{:1}^\top Y, X_{:2}^\top Y, \dots, X_{:p}^\top Y)$, we have
$$\left(\sum_{i = 1}^n X_{i1} Y_i, \sum_{i = 1}^n X_{i2} Y_i, \dots, \sum_{i = 1}^n X_{ip} Y_i\right)^\top.$$
The transpose arises because the result is a $p \times 1$ vector, which is stacked tall and skinny (a column vector), whereas the presentation above is as a row vector.

b. (3 pts) *Using $X$ and $Y$ as above, what is the dimension of the following compound matrix--vector product?*
$$(X^\top X)^{-1} X^\top Y$$

$X$ is $n \times p$, so $X^\top X$ is $p \times p$, and so is its inverse. Therefore above we have $(p \times p)(p \times n)(n \times 1) = (p \times 1)$, so that the result is a $p$-vector.

c. (8 pts) *Now, let $\beta$ be a $p \times 1$ vector and $X$ and $Y$ defined as above. Find an expression for the value of $\beta$ that gives the smallest value of*
$$|| Y - X \beta ||^2 = (Y - X\beta)^\top (Y - X\beta).$$
*Hint: start with $p=1$ and see if that helps guide you toward the general-$p$ solution.*

Expanding out the right-hand side gives
$$(Y - X\beta)^\top (Y - X\beta) = Y^\top Y - 2 \beta^\top X^\top Y + \beta^\top X^\top X \beta.$$
To minimize, take the derivative with respect to $\beta$, set it equal to zero, and solve.
\begin{aligned}
0 &\stackrel{\mathrm{set}}{=} \frac{d}{d\beta} \left( Y^\top Y - 2 \beta^\top X^\top Y + \beta^\top X^\top X \beta \right) \\
&= -2 X^\top Y + 2 X^\top X \beta \\
\mbox{giving } \; \hat{\beta} &= (X^\top X)^{-1} X^\top Y. \\
\end{aligned}
I.e., the same as the compound matrix--vector product considered above.

d. (2 pts) *What criteria must $X$ satisfy in order for such a solution (your expression for the optimal $\beta$ above) to exist?*

$X$ must have full column rank so that $X^\top X$ is invertible.

e. (6 pts) *Suppose $X$ and $y$ were defined by the `X` and `y` variables in R below. Calculate the value of $\beta$ minimizing $|| Y - X \beta ||^2$.*
*Hint: using R's built-in matrix--vector operations is easier than writing your own with double sums.*

```{r}
X <- cbind(1, 1:10)
y <- c(1.391, 0.036, 1.625, 2.427, 3.162, 3.181, 4.715, 1.678, 7.074, 5.981)
```

In R we have the following.

```{r}
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta.hat
```
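As an optional sanity check (an aside, not part of the required solution), R's built-in `lm` function minimizes the same least-squares criterion, so its fitted coefficients should agree with the closed-form normal-equations solution.

```{r}
## optional cross-check: lm() fits ordinary least squares, so its
## coefficients should match the closed-form solve() calculation
X <- cbind(1, 1:10)
y <- c(1.391, 0.036, 1.625, 2.427, 3.162, 3.181, 4.715, 1.678, 7.074, 5.981)
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y

## lm adds an intercept automatically, so only the second column is supplied
fit <- lm(y ~ X[, 2])
coef(fit)
```

Agreement between `coef(fit)` and `beta.hat` confirms the matrix calculation above.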