---
title: "Homework 5 Solutions"
subtitle: "Advanced Statistical Computing (STAT 6984)"
author: "Robert B. Gramacy, Department of Statistics, Virginia Tech"
output: html_document
---

## Instructions

This homework is **due on Thursday, November 16th at 12:30pm** (the start of class). Please turn in all your work. The primary purpose of this homework is to explore C and OpenMP in R. This description may change at any time; however, notices about substantial changes (requiring more/less work) will additionally be noted on the class web page.

Note that there are two prongs to submission, via Canvas and Bitbucket (in `asc-repo/hwk/hw5`). You don't need to use Rmarkdown, but your work should be just as pretty if you want full marks.

## Problem 1: Multivariate Normal Likelihood (35 pts)

*Consider a multivariate normal generating mechanism defined in terms of a so-called lengthscale parameter $\theta$*

$$Y \sim \mathcal{N}_n (0, \Sigma) \quad \mbox{where} \quad \Sigma_{ij} = \exp \left\{- D_{ij}/\theta \right\}$$

*and the $n \times n$ matrix $D$ contains pairwise (Euclidean) distances between $Y_i$ and $Y_j$ values in a predictor, or $X$-space.*

- *Treat $\theta$ as a positive scalar quantity.*
- *Such a mechanism comprises the essence of Gaussian process regression.*

*Our goal is to infer the unknown $\theta$ parameter from data by evaluating the likelihood for a grid of $\theta$ values*

```{r}
thetas <- seq(0.1, 3, length=100)
```

*using data generated as*

```{r}
X <- runif(300, -1, 1)
D <- as.matrix(dist(matrix(X, ncol=1), diag=TRUE, upper=TRUE))
Sigma <- exp(-D) + diag(sqrt(.Machine$double.eps), nrow(D))
library(mvtnorm)
Y <- rmvnorm(10000, sigma=Sigma)
```

*and report on the MLE.*

- *The functions in [mvnllik.R](mvnllik.R) provide two versions (coded in R with and without the help of external libraries) of evaluating the likelihood via the MVN density function; the file also contains a wrapper function iterating over a vector of potential $\theta$ values.*
- *The functions in [mvnllik.c](mvnllik.c) provide analogues in C.*

First, let's source the R file.

```{r}
source("mvnllik.R")
```

*Your task is to*

a.
*Develop a C/R interface in order to utilize the C version from R, defining the function `logliks` in R, which takes the same arguments as `logliks.R` excepting `ll`.*

The C code implementing the new functionality resides in [mvnllik_sol.c](mvnllik_sol.c). That file includes the new function `logliks_R`, which is called by the new R/C interface function below (after loading the new shared object).

```{r}
logliks <- function(Y, D, thetas, call=c("logliks_R", "logliks_omp_R"), verb=0)
 {
  m <- nrow(D)
  if(ncol(Y) != m) stop("dimension mismatch")
  call <- match.arg(call)
  ret <- .C(call,
            n = as.integer(nrow(Y)),
            m = as.integer(m),
            Y = as.double(t(Y)),
            D = as.double(D),  ## symmetric, so no transpose needed
            thetas = as.double(thetas),
            tlen = as.integer(length(thetas)),
            verb = as.integer(verb),
            llik = double(length(thetas)),
            DUP=FALSE)$llik
  return(ret)
 }
```

- Note that this function supports both serial and OpenMP versions, described in more detail in part b., below. Running this code requires loading a shared object built as `R CMD SHLIB -o mvnllik_sol.so mvnllik_sol.c`.

```{r}
dyn.load("mvnllik_sol.so")
```

- Demonstration is delayed until part c.

b. *Extend the C version to utilize OpenMP parallelization. There are at least two opportunities to do this, but you only need to do one.*

The code supporting this is also in the new [mvnllik_sol.c](mvnllik_sol.c) file mentioned above, including the new C functions `logliks_omp_R` and `logliks_omp`. The implementation parallelizes over the $\theta$ values with `#pragma omp parallel`.

c. *Show that all four versions give the same likelihood curve, and maximizing value, over the $\theta$ grid. I.e., show that they give exactly the same output, demonstrating correctness.*

First let's perform the calculations and record timings (for part d.) for the two non-C options provided in the problem description.
```{r cache=TRUE}
verb <- 0
time.R <- system.time(ll.R <- logliks.R(Y, D, thetas, verb=verb))
time.R2 <- system.time(ll.R2 <- logliks.R(Y, D, thetas, verb=verb, ll=loglik2))
```

Then the two C versions ...

```{r cache=TRUE}
time.C <- system.time(ll.C <- logliks(Y, D, thetas, verb=verb))
time.Comp <- system.time(ll.Comp <- logliks(Y, D, thetas, verb=verb, call="logliks_omp_R"))
```

The plots below show that all four versions yield the same log-likelihood surface and maximizing value.

```{r, dev.args = list(bg = 'transparent'), fig.width=10, fig.height=4, fig.align="center"}
par(mfrow=c(1,4))
plot(thetas, ll.R, type="l", main="R/lib")
abline(v=thetas[which.max(ll.R)], col=2)
legend("bottomright", "mle", col=2, lty=1)
plot(thetas, ll.R2, type="l", main="R/by-hand")
abline(v=thetas[which.max(ll.R2)], col=2)
plot(thetas, ll.C, type="l", main="C")
abline(v=thetas[which.max(ll.C)], col=2)
plot(thetas, ll.Comp, type="l", main="C/OpenMP")
abline(v=thetas[which.max(ll.Comp)], col=2)
```

d. *Provide a timing comparison between the four versions. Be sure to note how many cores your parallel OpenMP version is using.*

The (elapsed) timings are summarized as follows.

```{r}
c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3])
```

- The R library (`dmvnorm`) version and our C version are about the same; the pure R version is quite slow.
- However, our OpenMP version is `r round(time.R[3]/time.Comp[3], 2)` times faster than the R library version.
- My desktop has 8 hyperthreaded cores, so the OpenMP version uses 16 threads.

e. *Now, using an R with accelerated linear algebra (Intel MKL, OSX Accelerate, or Microsoft R/MRAN), re-perform your comparison from d. and report the results. (If your timings are not *all* much faster, then you're doing it wrong.)*

This entire Rmd file has been re-compiled with an MKL-linked R and provided as [hw5_sol_mkl.html](hw5_sol_mkl.html).

- The result is a more than $3.5\times$ speed-up.
## Problem 2: Parallel bootstrapped linear regression (25 pts)

*Parallelize `bootreg` with OpenMP and report timings comparing your new parallel version to the R and C versions provided in class.*

- *Your comparison should include both ordinary R and a version with accelerated linear algebra.*

First we load in the R file from lecture.

```{r}
source("bootreg.R")
```

The C code supporting this solution can be found in [bootreg_sol.c](bootreg_sol.c) and [randomkit.c](randomkit.c), with [randomkit.h](randomkit.h).

- These latter files provide a thread-safe RNG. I found them by Googling "thread-safe Mersenne twister in C". If you download the R source and search for its RNG, you will see that it is also based on the Mersenne twister algorithm, so we're replacing apples with threaded apples.
- Besides the threaded RNG, the OpenMP code is pretty straightforward, with a `#pragma omp parallel` at the top. An exception is a `#pragma omp critical` when seeding the multiple randomkit RNGs.

The shared object may be compiled as `R CMD SHLIB -o bootreg_sol.so bootreg_sol.c randomkit.c`, and loaded as follows.

```{r}
dyn.load("bootreg_sol.so")
```

The following R interface function can be used to call the parallel version on the C side.

```{r}
bootols <- function(X, y, B=199, icept=TRUE, inv=FALSE,
                    call=c("bootols_R", "bootols_omp_R"))
 {
  if(icept) X <- cbind(1, X)
  m <- ncol(X)
  n <- nrow(X)
  if(n != length(y)) stop("dimension mismatch")
  call <- match.arg(call)
  ret <- .C(call,
            X = as.double(t(X)),
            y = as.double(y),
            n = as.integer(n),
            m = as.integer(m),
            B = as.integer(B),
            inv = as.integer(inv),
            beta.hat = double(m*B),
            DUP = FALSE)
  ## must be careful to return a transposed beta.hat matrix
  return(matrix(ret$beta.hat, nrow=B, byrow=TRUE))
 }
```

- This overwrites the version in [bootreg.R](bootreg.R).
- Notice that it supports both serial and OpenMP parallelized versions via a `call` argument.

Now for some timings.

- Note that we use `X2` and `Y2` below so as not to clash with `X` and `Y` above.
```{r, cache=TRUE}
## generate testing data
n <- 5000
d <- 200
X2 <- 1/matrix(rt(n*d, df=1), ncol=d)
beta <- c(1:d, 0)
Y2 <- beta[1] + X2 %*% beta[-1] + rnorm(n, sd=3)

## testing bootstrapped OLS
time2.Rlm <- system.time(boot.Rlm <- bootols.R(X2, Y2, B=1000, method="lm"))
time2.Rinv <- system.time(boot.Rinv <- bootols.R(X2, Y2, B=1000, method="inv"))
time2.Rsolve <- system.time(beta.Rsolve <- bootols.R(X2, Y2, B=1000, method="solve"))
time2.Rcp <- system.time(beta.Rcp <- bootols.R(X2, Y2, B=1000, method="cp"))
time2.Cinv <- system.time(boot.Cinv <- bootols(X2, Y2, B=1000, inv=TRUE))
time2.Csolve <- system.time(beta.Csolve <- bootols(X2, Y2, B=1000))
time2.Comp <- system.time(beta.Comp <- bootols(X2, Y2, B=1000, call="bootols_omp_R"))
```

The (elapsed) timings are summarized as follows.

```{r}
c(Rlm=time2.Rlm[3], Rinv=time2.Rinv[3], Rsolve=time2.Rsolve[3], Rcp=time2.Rcp[3],
  Cinv=time2.Cinv[3], Csolve=time2.Csolve[3], Comp=time2.Comp[3])
```

- Here we see about a `r round(time2.Csolve[3]/time2.Comp[3], 2)`-factor speedup with the OpenMP version, which is not as impressive as in Problem 1, above.
- For timings under MKL-linked R, please see [hw5_sol_mkl.html](hw5_sol_mkl.html). The resulting speedup is about $16\times$, which is impressive.

## Problem 3: Rcpp for swap (15 points)

*Revisit the swap function from problem 2 of [homework 2](hw2.html) and write a version in Rcpp which, when called as `swap.Rcpp(v, i, j)`, swaps entries `i` and `j` without copying the vector `v`. Demonstrate correctness, and compare timings with our fast `swap.eval`.*

Rcpp uses pointers, which means no copying, and that leads to a trivial C++ implementation.

```{r}
library(Rcpp)
cppFunction("
void swapCpp(IntegerVector v, int i, int j) {
  int tmp = v[i-1];
  v[i-1] = v[j-1];
  v[j-1] = tmp;
}")
```

First let's check for correctness.

```{r}
v <- 1:1000000000
swapCpp(v, 1, 2)
v[1:5]
```

- Bingo. To check that it is fast relative to our naive and eval-based solutions, let's paste those in too.
```{r}
swap <- function(v, i, j)
 {
  tmp <- v[i]
  v[i] <- v[j]
  v[j] <- tmp
 }

swap.eval <- function()
 {
  e <- quote({tmp <- v[i]; v[i] <- v[j]; v[j] <- tmp})
  eval(e, envir=parent.frame())
 }
```

The code below collects timings in duplicate.

```{r}
v <- 1:1000000000
i <- 1; j <- 2
time.swap <- c(system.time(swap(v, i, j))[3], system.time(swap(v, i, j))[3])
time.cpp <- c(system.time(swapCpp(v, i, j))[3], system.time(swapCpp(v, i, j))[3])
time.eval <- c(system.time(swap.eval())[3], system.time(swap.eval())[3])
```

The timings are summarized below.

```{r}
cbind(swap=time.swap, cpp=time.cpp, eval=time.eval)
```

- Observe that the Rcpp version is faster than the eval version, by just a little bit.
- But it does not require a "warm-up" run.

## Problem 4: Rcpp for mvnllik (25 pts)

*Write an Rcpp version of `logliks`, called `logliks.Rcpp`, which takes the same arguments. You may do this one of two ways, or both for extra credit:*

- *as a wrapper around the C function `logliks`;*
- *as a C++ version of the R function `logliks2`.*

*Demonstrate correctness (i.e., show that your new version gives the same answers as the old ones) and compare timings with the other methods.*

The C++ file [mvnllik_sol.cpp](mvnllik_sol.cpp) contains an Rcpp wrapper which supports both serial and OpenMP versions via an argument `omp`, which is `FALSE` by default.

```{r}
sourceCpp("mvnllik_sol.cpp")
```

- Note that our [Makevars](Makevars), which provides the correct linking for the underlying linear algebra libraries, must reside in `~/.R/Makevars`. `sourceCpp` will not look in the current working directory for this file the way that `R CMD SHLIB` would.

The code below collects the output and timings under these two new variations.

```{r, cache=TRUE}
time.Rcpp <- system.time(ll.Rcpp <- logliksCpp(t(Y), D, thetas, verb=verb))
time.RcppOmp <- system.time(ll.RcppOmp <- logliksCpp(t(Y), D, thetas, omp=TRUE, verb=verb))
```

The table below augments the one above with the new timings.
```{r}
c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3],
  Rcpp=time.Rcpp[3], RcppOmp=time.RcppOmp[3])
```

- Cool! The new Rcpp interface is a little bit faster, perhaps because it does less copying of the arguments.

As a partial solution for the second option, we can consider using RcppArmadillo to [replace dmvnorm](http://gallery.rcpp.org/articles/dmvnorm_arma/). The code linked from that URL has been pasted into [mvnllik_arma.cpp](mvnllik_arma.cpp), and the following wrapper has been written to enable looping over $\theta$, etc.

```{r}
library(RcppArmadillo)
sourceCpp("mvnllik_arma.cpp")

loglik.arma.omp <- function(Y, D, theta, nth=8)
 {
  if(ncol(Y) != nrow(D)) stop("dimension mismatch")
  Sigma <- exp(-D/theta) + diag(sqrt(.Machine$double.eps), nrow(D))
  return(sum(dmvnrm_arma_mc(Y, rep(0, ncol(Y)), Sigma, TRUE, nth)))
 }
```

- Note that this interfaces to OpenMP; however, here the parallelization is over the rows of `Y` rather than over $\theta$, which might not do as good a job of utilizing all of the available cores (i.e., not all of the time).
- The [mvnllik_arma.cpp](mvnllik_arma.cpp) file also contains a serial version which is not being entertained here.

Now we can pass this new `loglik.arma.omp` function to our `logliks.R` function, which was written to utilize a generic log density function.

- First we have to set the argument specifying the number of threads.

```{r}
formals(loglik.arma.omp)$nth <- 16
time.RarmaOmp <- system.time(ll.RarmaOmp <- logliks.R(Y, D, thetas, verb=verb, ll=loglik.arma.omp))
```

The table below augments the ones above with this new timing.

```{r}
c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3],
  Rcpp=time.Rcpp[3], RcppOmp=time.RcppOmp[3], RarmaOmp=time.RarmaOmp[3])
```

- Not quite as fast as either of the previous OpenMP-based versions, possibly because there is still some R in play to loop over the `thetas`.
- A fancier Rcpp version would include that loop on the C++ side.
But since we already have a fast all-C++/C version, we'll leave that for another time.

The plots below verify that the new calculations produce results identical to the old ones.

```{r, dev.args = list(bg = 'transparent'), fig.width=7, fig.height=4, fig.align="center"}
par(mfrow=c(1,3))
plot(thetas, ll.Rcpp, type="l", main="Rcpp serial")
abline(v=thetas[which.max(ll.R)], col=2)
legend("bottomright", "mle", col=2, lty=1)
plot(thetas, ll.RcppOmp, type="l", main="Rcpp/OpenMP")
abline(v=thetas[which.max(ll.R2)], col=2)
plot(thetas, ll.RarmaOmp, type="l", main="Rarma/OpenMP")
abline(v=thetas[which.max(ll.R2)], col=2)
```

Results using MKL are provided in the companion document [hw5_sol_mkl.html](hw5_sol_mkl.html).

- As before, the speedup here is about $3.5\times$.