Department of Statistics, Virginia Tech"
output: html_document
---

## Instructions

This homework is **due on Thursday, November 16th at 12:30pm** (the start of class). Please turn in all your work. The primary purpose of this homework is to explore `C` and `OpenMP` in R. This description may change at any time; however, notices about substantial changes (requiring more/less work) will additionally be noted on the class web page. Note that there are two prongs to submission, via Canvas and Bitbucket (in `asc-repo/hwk/hw5`). You don't need to use `Rmarkdown`, but your work should be just as pretty if you want full marks.

## Problem 1: Multivariate Normal Likelihood (35 pts)

*Consider a multivariate normal generating mechanism defined in terms of a so-called lengthscale parameter $\theta$*

$$
Y \sim \mathcal{N}_n (0, \Sigma) \quad \mbox{where} \quad \Sigma_{ij} = \exp \left\{- D_{ij}/\theta \right\}
$$

*and the $n \times n$ matrix $D$ contains pairwise (Euclidean) distances between $Y_i$ and $Y_j$ values in a predictor, or $X$-space.*

- *Treat $\theta$ as a positive scalar quantity.*
- *Such a mechanism comprises the essence of Gaussian process regression.*

*Our goal is to infer the unknown $\theta$ parameter from data by evaluating the likelihood for a grid of $\theta$ values*

```{r}
thetas <- seq(0.1, 3, length=100)
```

*using data generated as*

```{r}
X <- runif(300, -1, 1)
D <- as.matrix(dist(matrix(X, ncol=1), diag=TRUE, upper=TRUE))
Sigma <- exp(-D) + diag(sqrt(.Machine$double.eps), nrow(D))
library(mvtnorm)
Y <- rmvnorm(10000, sigma=Sigma)
```

*and report on the MLE.*

- *The functions in [`mvnllik.R`](mvnllik.R) provide two versions (coded in R with and without the help of external libraries) of evaluating the likelihood via the MVN density function, and it also contains a wrapper function iterating over a vector of potential $\theta$ values.*
- *The functions in [`mvnllik.c`](mvnllik.c) provide analogues in C.*

First, let's source the R file.
```{r}
source("mvnllik.R")
```

*Your task is to*

a. *Develop a C/R interface in order to utilize the C version from R, defining the function `logliks` in R, which takes the same arguments as `logliks.R` excepting `ll`.*

The C code implementing the new functionality resides in [`mvnllik_sol.c`](mvnllik_sol.c). That file includes the new function `logliks_R`, which is called by the new R/C interface function below (after loading the new shared object).

```{r}
logliks <- function(Y, D, thetas, call=c("logliks_R", "logliks_omp_R"), verb=0)
{
    m <- nrow(D)
    if(ncol(Y) != m) stop("dimension mismatch")
    call <- match.arg(call)
    ret <- .C(call,
              n = as.integer(nrow(Y)),
              m = as.integer(m),
              Y = as.double(t(Y)),
              D = as.double(D),   ## symmetric, no transpose
              thetas = as.double(thetas),
              tlen = as.integer(length(thetas)),
              verb = as.integer(verb),
              llik = double(length(thetas)),
              DUP=FALSE)$llik
    return(ret)
}
```

- Note that this function supports both serial and OpenMP versions, described in more detail in part b., below. Running this code requires loading a shared object built as `R CMD SHLIB -o mvnllik_sol.so mvnllik_sol.c`.

```{r}
dyn.load("mvnllik_sol.so")
```

- Demonstration is delayed until part c.

b. *Extend the C version to utilize `OpenMP` parallelization. There are at least two opportunities to do this, but you only need to do one.*

The code supporting this is also in the new `mvnllik_sol.c` file mentioned above, including new C functions `logliks_omp_R` and `logliks_omp`. The implementation parallelizes over each of the `theta` values with `#pragma omp parallel`.

c. *Show that all four versions give the same likelihood curve, and maximizing value, over the $\theta$ grid. I.e., show that they give exactly the same output, demonstrating correctness.*

First, let's perform calculations and record timings (for part d.) for the two non-C options provided in the problem description.
```{r cache=TRUE}
verb <- 0
time.R <- system.time(ll.R <- logliks.R(Y, D, thetas, verb=verb))
time.R2 <- system.time(ll.R2 <- logliks.R(Y, D, thetas, verb=verb, ll=loglik2))
```

Then the two C versions ...

```{r cache=TRUE}
time.C <- system.time(ll.C <- logliks(Y, D, thetas, verb=verb))
time.Comp <- system.time(ll.Comp <- logliks(Y, D, thetas, verb=verb, call="logliks_omp_R"))
```

The plots below show that all four versions yield the same log likelihood surface and maximizing value.

```{r, dev.args = list(bg = 'transparent'), fig.width=10, fig.height=4, fig.align="center"}
par(mfrow=c(1,4))
plot(thetas, ll.R, type="l", main="R/lib")
abline(v=thetas[which.max(ll.R)], col=2)
legend("bottomright", "mle", col=2, lty=1)
plot(thetas, ll.R2, type="l", main="R/by-hand")
abline(v=thetas[which.max(ll.R2)], col=2)
plot(thetas, ll.C, type="l", main="C")
abline(v=thetas[which.max(ll.C)], col=2)
plot(thetas, ll.Comp, type="l", main="C/OpenMP")
abline(v=thetas[which.max(ll.Comp)], col=2)
```

d. *Provide a timing comparison between the four versions. Be sure to note how many cores your parallel `OpenMP` version is using.*

The (elapsed) timings are summarized as follows.

```{r}
c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3])
```

- The R library (`dmvnorm`) version and our C version are about the same; the pure R version is quite slow.
- However, our OpenMP version is `r round(time.R[3]/time.Comp[3], 2)` times faster.
- My desktop has 8 hyperthreaded cores, so the OpenMP version uses 16 threads.

e. *Now, using an R with accelerated linear algebra (Intel MKL, OSX Accelerate, or Microsoft R/MRAN), re-perform your comparison from d. and report the results. (If your timings are not *all* much faster then you're doing it wrong.)*

This entire Rmd file is re-compiled with an MKL-linked R and provided as [hw5_sol_mkl.html](hw5_sol_mkl.html).

- The result is a more than $3.5\times$ speed-up.
## Problem 2: Parallel bootstrapped linear regression (25 pts)

*Parallelize `bootreg` with `OpenMP` and report timings comparing your new parallel version to the R and `C` versions provided in class.*

- *Your comparison should include both ordinary R and a version with accelerated linear algebra.*

First we load in the R file from lecture.

```{r}
source("bootreg.R")
```

The C code supporting this solution can be found in `bootreg_sol.c` and `randomkit.c` with `randomkit.h`.

- These latter files provide a thread-safe RNG. I found these while Googling "thread-safe Mersenne twister in C". If you download the R source and search for its RNG you will see it is also based on the "Mersenne twister" algorithm, so we're replacing apples with threaded apples.
- Besides the threaded RNG, the OpenMP code is pretty straightforward, having a `#pragma omp parallel` at the top. An exception is a `#pragma omp critical` when seeding the multiple `randomkit` RNGs.

The shared object may be compiled as `R CMD SHLIB -o bootreg_sol.so bootreg_sol.c randomkit.c`, and loaded as follows.

```{r}
dyn.load("bootreg_sol.so")
```

The following R interface function can be used to call the parallel version on the C side.

```{r}
bootols <- function(X, y, B=199, icept=TRUE, inv=FALSE,
                    call=c("bootols_R", "bootols_omp_R"))
{
    if(icept) X <- cbind(1, X)
    m <- ncol(X)
    n <- nrow(X)
    if(n != length(y)) stop("dimension mismatch")
    call <- match.arg(call)
    ret <- .C(call,
              X = as.double(t(X)),
              y = as.double(y),
              n = as.integer(n),
              m = as.integer(m),
              B = as.integer(B),
              inv = as.integer(inv),
              beta.hat = double(m*B),
              DUP = FALSE)
    ## must be careful to return a transposed beta.hat matrix
    return(matrix(ret$beta.hat, nrow=B, byrow=TRUE))
}
```

- This overwrites the version in [`bootreg.R`](bootreg.R).
- Notice that it supports both serial and OpenMP-parallelized versions via a `call` argument.

Now for some timings.

- Note that we use `X2` and `Y2` below so as not to clash with `X` and `Y` above.
```{r, cache=TRUE}
## generate testing data
n <- 5000
d <- 200
X2 <- 1/matrix(rt(n*d, df=1), ncol=d)
beta <- c(1:d, 0)
Y2 <- beta[1] + X2 %*% beta[-1] + rnorm(n, sd=3)

## testing bootstrapped OLS
time2.Rlm <- system.time(boot.Rlm <- bootols.R(X2, Y2, B=1000, method="lm"))
time2.Rinv <- system.time(boot.Rinv <- bootols.R(X2, Y2, B=1000, method="inv"))
time2.Rsolve <- system.time(beta.Rsolve <- bootols.R(X2, Y2, B=1000, method="solve"))
time2.Rcp <- system.time(beta.Rcp <- bootols.R(X2, Y2, B=1000, method="cp"))
time2.Cinv <- system.time(boot.Cinv <- bootols(X2, Y2, B=1000, inv=TRUE))
time2.Csolve <- system.time(beta.Csolve <- bootols(X2, Y2, B=1000))
time2.Comp <- system.time(beta.Comp <- bootols(X2, Y2, B=1000, call="bootols_omp_R"))
```

The (elapsed) timings are summarized as follows.

```{r}
c(Rlm=time2.Rlm[3], Rinv=time2.Rinv[3], Rsolve=time2.Rsolve[3],
  Rcp=time2.Rcp[3], Cinv=time2.Cinv[3], Csolve=time2.Csolve[3],
  Comp=time2.Comp[3])
```

- Here we see about a `r round(time2.Csolve[3]/time2.Comp[3], 2)` factor speedup with the OpenMP version, which is not as impressive as in Question 1, above.
- For timings under Rmkl, please see [hw5_sol_mkl.html](hw5_sol_mkl.html). The resulting speedup is about $16\times$, which is impressive.

## Problem 3: `Rcpp` for `swap` (15 points)

*Revisit the swap function from problem 2 of [homework 2](hw2.html) and write a version in `Rcpp` which, when called as `swap.Rcpp(v, i, j)`, swaps entry `i` and `j` without copying the vector `v`. Demonstrate correctness, and compare timings with our fast `swap.eval`.*

Rcpp uses pointers, which means no copying, and leads to a trivial C++ implementation.

```{r}
library(Rcpp)
cppFunction("
void swapCpp(IntegerVector v, int i, int j) {
    int tmp = v[i-1];
    v[i-1] = v[j-1];
    v[j-1] = tmp;
}
")
```

First, let's check for correctness.

```{r}
v <- 1:1000000000
swapCpp(v, 1, 2)
v[1:5]
```

- Bingo.

To check that it is fast relative to our naive and `eval`-based solutions, let's paste those in too.
```{r}
swap <- function(v, i, j)
{
    tmp <- v[i]
    v[i] <- v[j]
    v[j] <- tmp
}

swap.eval <- function()
{
    e <- quote({tmp <- v[i]; v[i] <- v[j]; v[j] <- tmp})
    eval(e, envir=parent.frame())
}
```

The code below collects timings in duplicate.

```{r}
v <- 1:1000000000
i <- 1; j <- 2
library(microbenchmark)
time.swap <- c(system.time(swap(v, i, j))[3], system.time(swap(v, i, j))[3])
time.cpp <- c(system.time(swapCpp(v, i, j))[3], system.time(swapCpp(v, i, j))[3])
time.eval <- c(system.time(swap.eval())[3], system.time(swap.eval())[3])
```

The timings are summarized below.

```{r}
cbind(swap=time.swap, cpp=time.cpp, eval=time.eval)
```

- Observe that the Rcpp version is faster than the `eval` version, by just a little bit.
- But it does not require a "warm up" run.

## Problem 4: `Rcpp` for `mvnllik` (25 pts)

*Write an `Rcpp` version of `logliks`, called `logliks.Rcpp`, which takes the same arguments. You may do this one of two ways, both for extra credit:*

- *as a wrapper around the `C` function `logliks`;*
- *as a `C++` version of the R function `logliks2`.*

*Demonstrate correctness (i.e., show that your new version gives the same answers as the old ones) and compare timings with the other methods.*

The C++ file [`mvnllik_sol.cpp`](mvnllik_sol.cpp) contains an Rcpp wrapper which supports both serial and OpenMP versions via an argument `omp`, which is `FALSE` by default.

```{r}
sourceCpp("mvnllik_sol.cpp")
```

- Note that our [`Makevars`](Makevars), which provides the correct linking for the underlying linear algebra libraries, must reside in `~/.R/Makevars`. `sourceCpp` will not look in the current working directory for this file the way that `R CMD SHLIB` would.

The code below collects the output and timings under these two new variations.
```{r, cache=TRUE}
time.Rcpp <- system.time(ll.Rcpp <- logliksCpp(t(Y), D, thetas, verb=verb))
time.RcppOmp <- system.time(ll.RcppOmp <- logliksCpp(t(Y), D, thetas, omp=TRUE, verb=verb))
```

The table below augments the one above with the new timings.

```{r}
c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3],
  Rcpp=time.Rcpp[3], RcppOmp=time.RcppOmp[3])
```

- Cool! The new Rcpp interface is a little bit faster, perhaps because it does less copying of the arguments.

As a partial solution for the second option, we can consider using `RcppArmadillo` to [replace `dmvnorm`](http://gallery.rcpp.org/articles/dmvnorm_arma/). The code linked from that URL has been pasted into [`mvnllik_arma.cpp`](mvnllik_arma.cpp), and the following wrapper has been written to enable looping over `theta`, etc.

```{r}
library(RcppArmadillo)
sourceCpp("mvnllik_arma.cpp")

loglik.arma.omp <- function(Y, D, theta, nth=8)
{
    if(ncol(Y) != nrow(D)) stop("dimension mismatch")
    Sigma <- exp(-D/theta) + diag(sqrt(.Machine$double.eps), nrow(D))
    return(sum(dmvnrm_arma_mc(Y, rep(0, ncol(Y)), Sigma, TRUE, nth)))
}
```

- Note that this interfaces to `OpenMP`; however, here the parallelization is over the rows of `Y` rather than over `theta`, which might not do as good a job of utilizing all of the available cores (i.e., not all the time).
- The [`mvnllik_arma.cpp`](mvnllik_arma.cpp) file also contains a serial version which is not being entertained here.

Now we can pass this new `loglik.arma.omp` function to our `logliks.R` function, which was written to utilize a generic log density function.

- First we have to set the argument specifying the number of threads.

```{r}
formals(loglik.arma.omp)$nth <- 16
time.RarmaOmp <- system.time(ll.RarmaOmp <- logliks.R(Y, D, thetas, verb=verb, ll=loglik.arma.omp))
```

The table below augments the ones above with this new timing.
```{r}
c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3],
  Rcpp=time.Rcpp[3], RcppOmp=time.RcppOmp[3], RarmaOmp=time.RarmaOmp[3])
```

- Not quite as fast as either of the previous OpenMP-based versions, possibly because there is still some R in play to loop over the `theta`s.
- A fancier Rcpp version would include that loop on the C++ side. But since we already have a fast all-C++/C version, we'll leave that for another time.

The plots below verify that the new calculations are producing results identical to the old ones.

```{r, dev.args = list(bg = 'transparent'), fig.width=7, fig.height=4, fig.align="center"}
par(mfrow=c(1,3))
plot(thetas, ll.Rcpp, type="l", main="Rcpp serial")
abline(v=thetas[which.max(ll.Rcpp)], col=2)
legend("bottomright", "mle", col=2, lty=1)
plot(thetas, ll.RcppOmp, type="l", main="Rcpp/OpenMP")
abline(v=thetas[which.max(ll.RcppOmp)], col=2)
plot(thetas, ll.RarmaOmp, type="l", main="Rarma/OpenMP")
abline(v=thetas[which.max(ll.RarmaOmp)], col=2)
```

Results using MKL are provided in the companion document [hw5_sol_mkl.html](hw5_sol_mkl.html).

- As before, the speedup here is about $3.5\times$.