## Instructions

This homework is due on Thursday, November 16th at 12:30pm (the start of class). Please turn in all your work. The primary purpose of this homework is explore C and OpenMP in R. This description may change at any time, however notices about substantial changes (requiring more/less work) will be additionally noted on the class web page. Note that there are two prongs to submission, via Canvas and Bitbucket (in asc-repo/hwk/hw5). You don’t need to use Rmarkdown but your work should be just as pretty if you want full marks.

## Problem 1: Multivariate Normal Likelihood (35 pts)

Consider a multivariate normal generating mechanism defined in terms of a so-called lengthscale parameter $$\theta$$

$Y \sim \mathcal{N}_n (0, \Sigma) \quad \mbox{where} \quad \Sigma_{ij} = \exp \left\{- D_{ij}/\theta \right\}$

and the $$n \times n$$ matrix $$D$$ contains pairwise (Euclidean) distances between $$Y_i$$ and $$Y_j$$ values in a predictor, or $$X$$-space.

• Treat $$\theta$$ as a positive scalar quantity.
• Such a mechanism comprises the essence of Gaussian process regression.

Our goal is to infer the unknown $$\theta$$ parameter from data by evaluating the likelihood for a grid of $$\theta$$ values

thetas <- seq(0.1, 3, length=100)

using data generated as

X <- runif(300, -1, 1)
D <- as.matrix(dist(matrix(X, ncol=1), diag=TRUE, upper=TRUE))
Sigma <- exp(-D) + diag(sqrt(.Machine$double.eps), nrow(D)) library(mvtnorm) Y <- rmvnorm(10000, sigma=Sigma) and report on the MLE. • The functions in mvnllik.R provide two versions (coded in R with and without the help of external libraries) of evaluating the likelihood via the MVN density function, and it also contains a wrapper function iterating over a vector of potential $$\theta$$ values. • The functions is mvnllik.c provide analogues in C. First, lets source the R file. source("mvnllik.R") Your task is to 1. Develop a C/R interface in order to utilize the C version from R, defining the function logliks in R, which takes the same arguments as logliks.R excepting ll. The C code implmenting the new functionality resides in mvnllik_sol.c. That file includes the new function logliks_R which is called by the new R/C interface function below (after loading the new shared object). logliks <- function(Y, D, thetas, call=c("logliks_R", "logliks_omp_R"), verb=0) { m <- nrow(D) if(ncol(Y) != m) stop(" dimension mismatch") call <- match.arg(call) ret <- .C(call, n = as.integer(nrow(Y)), m = as.integer(m), Y = as.double(t(Y)), D = as.double(D), ## symmetric, no transpose thetas = as.double(thetas), tlen = as.integer(length(thetas)), verb = as.integer(verb), llik = double(length(thetas)), DUP=FALSE)$llik

return(ret)
}
• Note that this function supports both serial and OpenMP versions, described in more detail in part b., below.

Running this code requires loading a shared object built as R CMD SHLIB -o mvnllik_sol.so mvnllik_sol.c.

dyn.load("mvnllik_sol.so")
• Demonstration is delayed until part c.
1. Extend the C version to utilize OpenMP parallelization. There are at least two opportunities to do this, but you only need to do one.

The code supporting this is also in the new mvnllik_sol.c file mentioned above, including new C functions logliks_omp_R and logliks_omp. The implementation parallelizes over each of the theta values with #pragma omp parallel.

1. Show that all four versions give the same likelihood curve, and maximizing value, over the $$\theta$$ grid. I.e., show that they give exactly the same output, demonstrating correctness.

First lets perform calculations and records timings (for part d.) for the two non-C options provided in the problem description.

verb <- 0
time.R <- system.time(ll.R <- logliks.R(Y, D, thetas, verb=verb))
time.R2 <- system.time(ll.R2 <- logliks.R(Y, D, thetas, verb=verb, ll=loglik2))

Then the two C versions …

time.C <- system.time(ll.C <- logliks(Y, D, thetas, verb=verb))
time.Comp <- system.time(ll.Comp <- logliks(Y, D, thetas, verb=verb, call="logliks_omp_R"))
1. Provide a timing comparison between the four versions. Be sure to note how many cores your parallel OpenMP version is using.

The (elapsed) timings are summarized as follows.

c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3])
##    R.elapsed   R2.elapsed    C.elapsed Comp.elapsed
##        4.067       54.344        4.551        1.606
• The R library (dmvnorm) version and our C version are about the same; the pure R version is quite slow.
• However our OpenMP version is 2.53 times faster.
• My desktop has 8 hyperthreaded cores, so the OpenMP version uses 16 threads.
1. Now, using an R with accelerated linear algebra (Intel MKL, OSX Accelerate, or Microsoft R/MRAN), re-perform your comparison from d. and report the results. (If your timings are not all* much faster then you’re doing it wrong.)*

This entire Rmd file is re-compiled with an MKL-linked R and provided as hw5_sol_mkl.html.

• The result is a more than $$3.5\times$$ speed-up.

## Problem 2: Parallel bootstrapped linear regression (25 pts)

Parallelize bootreg with OpenMP and report timings comparing your new parallel version to the R and C versions provided in class.

• Your comparison should include both ordinary R and a version with accelerated linear algebra.

First we load in the R file from lecture.

source("bootreg.R")

The C code supporting this solution can be found in bootreg_sol.c and randomkit.c with randomkit.h.

• These latter files provide a thread-safe RNG. I found these while Googling “thread-safe Mersene twister in C”. If you download the R source and search for its RNG you will see it is also based on the “Mersene twister” algorithm, so we’re replacing apples with threaded apples.
• Besides the threaded RNG, the OpenMP is pretty straightforward, having a #pragma omp parallel at the top. An exception is a #pragma omp critical when seeding the multiple randomkit RNGs.

The shared object may be compiled as R CMD SHLIB -o bootreg_sol.so bootreg_sol.c randomkit.c, and loaded as follows.

dyn.load("bootreg_sol.so")

The following R interface function can be used to call the parallel version on the C side.

bootols <- function(X, y, B=199, icept=TRUE, inv=FALSE, call=c("bootols_R", "bootols_omp_R"))
{
if(icept) X <- cbind(1, X)
m <- ncol(X)
n <- nrow(X)
if(n != length(y)) stop("dimension mismatch")

call <- match.arg(call)

ret <- .C(call,
X = as.double(t(X)),
y = as.double(y),
n = as.integer(n),
m = as.integer(m),
B = as.integer(B),
inv = as.integer(inv),
beta.hat = double(m*B),
DUP = FALSE)

## must be careful to return a transposed beta.hat matrix
return(matrix(ret$beta.hat, nrow=B, byrow=TRUE)) } • This overwrites the version in bootreg.R. • Notice that it supports both serial and OpenMP parallelized versions via a call argument. Now for some timings. • Note that we use X2 and Y2 below so as not to clash with X and Y above. ## generate testing data n <- 5000 d <- 200 X2 <- 1/matrix(rt(n*d, df=1), ncol=d) beta <- c(1:d, 0) Y2 <- beta[1] + X2 %*% beta[-1] + rnorm(100, sd=3) ## testing bootstrapped OLS time2.Rlm <- system.time(boot.Rlm <- bootols.R(X2, Y2, B=1000, method="lm")) time2.Rinv <- system.time(boot.Rinv <- bootols.R(X2, Y2, B=1000, method="inv")) time2.Rsolve <- system.time(beta.Rsolve <- bootols.R(X2, Y2, B=1000, method="solve")) time2.Rcp <- system.time(beta.Rcp <- bootols.R(X2, Y2, B=1000, method="cp")) time2.Cinv <- system.time(boot.Cinv <- bootols(X2, Y2, B=1000, inv=TRUE)) time2.Csolve <- system.time(beta.Csolve <- bootols(X2, Y2, B=1000)) time2.Comp <- system.time(beta.Csolve <- bootols(X2, Y2, B=1000, call="bootols_omp_R")) The (elapsed) timings are summarized as follows. c(Rlm=time2.Rlm[3], Rinv=time2.Rinv[3], Rsolve=time2.Rsolve[3], Rcp=time2.Rcp[3], Cinv=time2.Cinv[3], Csolve=time2.Csolve[3], Comp=time2.Comp[3]) ## Rlm.elapsed Rinv.elapsed Rsolve.elapsed Rcp.elapsed Cinv.elapsed ## 121.756 26.113 16.552 9.110 3.363 ## Csolve.elapsed Comp.elapsed ## 3.686 1.918 • Here we see about a 1.92 factor speedup with the OpenMP version, which is not as impressive as in Question 1, above. • For timings under Rmkl, please see hw5_sol_mkl.html. The resulting speedup is is about $$16\times$$, which is impressive. ## Problem 3: Rcpp for swap (15 points) Revisit the swap function from problem 2 of homework 2 and write a version in Rcpp which, when called as swap.Rcpp(v, i, j) swaps entry i and j without copying the vector v. Demonstrate correctness, and compare timings with our fast swap.eval. Rcpp uses pointers, which means no copying, and leads to a trivial Cpp implementation. library(Rcpp) cppFunction(" void swapCpp(IntegerVector v, int i, int j){ int tmp = v[i-1]; v[i-1] = v[j-1]; v[j-1] = tmp; } ") First lets check for correctness. v <- 1:1000000000 swapCpp(v, 1, 2) v[1:5] ## [1] 2 1 3 4 5 • Bingo. To check that it is fast relative to our niave and eval-based solutions, lets paste those in too. swap <- function(v, i, j) { tmp <- v[i] v[i] <- v[j] v[j] <- tmp } swap.eval <- function() { e <- quote({tmp <- v[i]; v[i] <- v[j]; v[j] <- tmp}) eval(e, envir=parent.frame()) } The code below collects timings in duplicate. v <- 1:1000000000 i <- 1; j <- 2 library(microbenchmark) time.swap <- c(system.time(swap(v, i, j))[3], system.time(swap(v, i, j))[3]) time.cpp <- c(system.time(swapCpp(v, i, j))[3], system.time(swapCpp(v, i, j))[3]) time.eval <- c(system.time(swap.eval())[3], system.time(swap.eval())[3]) The timings are summarized below. cbind(swap=time.swap, cpp=time.cpp, eval=time.eval) ## swap cpp eval ## elapsed 0.800 0 0.796 ## elapsed 0.802 0 0.000 • Observe that the Rcpp version is faster than the eval version, by just a little bit. • But it does not require a “warm up” run. ## Problem 4: Rcpp for mvnllik (25 pts) Write an Rcpp version of logliks, called logliks.Rcpp which takes the same arguments. You may do this one of two ways, both for extra credit: • as a wrapper around the C function logliks; • as a C++ version of the R function logliks2. Demonstrate correctness (i.e., show that your new version gives the same answers as the old ones) and compare timings with the other methods. The C++ file mvnllik_sol.cpp contains an Rcpp wrapper which supports both serial and OpenMP versions via an argument omp which is FALSE by default. sourceCpp("mvnllik_sol.cpp") • Note that our Makevars, which provides the correct linking for the underlying linear algebra libraries, must reside in ~/.R/Makevars. sourceCcpp will not look in the current working directory for this file the way that R CMD SHLIB would. The code below collects the output and timings under these two new variations. time.Rcpp <- system.time(ll.Rcpp <- logliksCpp(t(Y), D, thetas, verb=verb)) time.RcppOmp <- system.time(ll.RcppOmp <- logliksCpp(t(Y), D, thetas, omp=TRUE, verb=verb)) The table below augments the one above with the new timings. c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3], Rcpp=time.Rcpp[3], RcppOmp=time.RcppOmp[3]) ## R.elapsed R2.elapsed C.elapsed Comp.elapsed ## 4.067 54.344 4.551 1.606 ## Rcpp.elapsed RcppOmp.elapsed ## 4.583 1.613 • Cool! The new Rcpp interface is a little bit faster, perhaps because it does less copying of the arguments. As a partial solution for the second option, we can consider using RArmadillo to replace dmvnorm. The code linked from that URL has been pasted into mvnllik_arma.cpp, and the following wrapper has been written to enable looping over theta, etc. library(RcppArmadillo) sourceCpp("mvnllik_arma.cpp") loglik.arma.omp <- function(Y, D, theta, nth=8) { if(ncol(Y) != nrow(D)) stop(" dimension mismatch") Sigma <- exp(-D/theta) + diag(sqrt(.Machine$double.eps), nrow(D))
return(sum(dmvnrm_arma_mc(Y, rep(0, ncol(Y)), Sigma, TRUE, nth)))
}
• Note that this interfaces to OpenMP, however here the parallelization is over the rows of Y rather than over theta, which might not do as good of a job at utilizing all of the available cores (i.e., not all the time).
• The mvnllik_arma.cpp file also contains a serial version which is not being entertained here.

Now we can pass this new loglik.arma.omp function to our logliks.R function, which was written to utilize a generic log density function.

• First we have to set the argument specifying the number of threads.
formals(loglik.arma.omp)\$nth <- 16
time.RarmaOmp <- system.time(ll.RarmaOmp <- logliks.R(Y, D, thetas, verb=verb, ll=loglik.arma.omp))

The table below augments the ones above with the new time from above.

c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3], Rcpp=time.Rcpp[3],
RcppOmp=time.RcppOmp[3], RarmaOmp=time.RarmaOmp[3])
##        R.elapsed       R2.elapsed        C.elapsed     Comp.elapsed
##            4.067           54.344            4.551            1.606
##     Rcpp.elapsed  RcppOmp.elapsed RarmaOmp.elapsed
##            4.583            1.613            2.471
• Not quite as fast as either previous OpenMP-based versions, possibly because there is still some R in play to loop over the thetas.
• A fancier Rcpp version would include that loop on the C++ side. But since we already have a fast all-C++/C version, we’ll leave that for another time.

Results using MKL are provided in the companion document hw5_sol_mkl.html.

• As before, the speedup here is about $$3.5\times$$.