Department of Statistics, Virginia Tech

This homework is **due on Thursday, November 16th at 12:30pm** (the start of class). Please turn in all your work. The primary purpose of this homework is to explore `C` and `OpenMP` in R. This description may change at any time; however, notices about substantial changes (requiring more/less work) will additionally be posted on the class web page. Note that there are two prongs to submission, via Canvas and Bitbucket (in `asc-repo/hwk/hw5`). You don't need to use `Rmarkdown`, but your work should be just as pretty if you want full marks.

*Consider a multivariate normal generating mechanism defined in terms of a so-called lengthscale parameter \(\theta\)*

\[ Y \sim \mathcal{N}_n (0, \Sigma) \quad \mbox{where} \quad \Sigma_{ij} = \exp \left\{- D_{ij}/\theta \right\} \]

*and the \(n \times n\) matrix \(D\) contains pairwise (Euclidean) distances between \(Y_i\) and \(Y_j\) values in a predictor, or \(X\)-space.*

*Treat \(\theta\) as a positive scalar quantity. Such a mechanism comprises the essence of Gaussian process regression.*

*Our goal is to infer the unknown \(\theta\) parameter from data by evaluating the likelihood for a grid of \(\theta\) values*

`thetas <- seq(0.1, 3, length=100)`

*using data generated as*

```
X <- runif(300, -1, 1)
D <- as.matrix(dist(matrix(X, ncol=1), diag=TRUE, upper=TRUE))
Sigma <- exp(-D) + diag(sqrt(.Machine$double.eps), nrow(D))
library(mvtnorm)
Y <- rmvnorm(10000, sigma=Sigma)
```

*and report on the MLE.*
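Concretely, following the wrapper code's convention that `Y` holds \(n\) independent replicate rows, each of dimension \(m\), every version below evaluates (up to the jitter term) the same log-likelihood over the grid:

\[ \ell(\theta) = \sum_{i=1}^{n} \log \phi_m(y_i; 0, \Sigma_\theta) = -\frac{nm}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma_\theta| - \frac{1}{2}\sum_{i=1}^{n} y_i^\top \Sigma_\theta^{-1} y_i, \]

where \(\phi_m\) is the \(m\)-dimensional MVN density and \((\Sigma_\theta)_{ij} = \exp\{-D_{ij}/\theta\}\); the reported MLE is the grid value of \(\theta\) maximizing \(\ell(\theta)\).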

*The functions in `mvnllik.R` provide two versions (coded in R, with and without the help of external libraries) of evaluating the likelihood via the MVN density function; the file also contains a wrapper function iterating over a vector of potential \(\theta\) values. The functions in `mvnllik.c` provide analogues in C.*

First, let's source the R file.

`source("mvnllik.R")`

*Your task is to develop a C/R interface in order to utilize the C version from R, defining the function `logliks` in R, which takes the same arguments as `logliks.R` excepting `ll`.*

The C code implementing the new functionality resides in `mvnllik_sol.c`. That file includes the new function `logliks_R`, which is called by the new R/C interface function below (after loading the new shared object).

```
logliks <- function(Y, D, thetas, call=c("logliks_R", "logliks_omp_R"), verb=0)
{
  m <- nrow(D)
  if(ncol(Y) != m) stop("dimension mismatch")
  call <- match.arg(call)
  ret <- .C(call,
            n = as.integer(nrow(Y)),
            m = as.integer(m),
            Y = as.double(t(Y)),
            D = as.double(D),  ## symmetric, no transpose needed
            thetas = as.double(thetas),
            tlen = as.integer(length(thetas)),
            verb = as.integer(verb),
            llik = double(length(thetas)),
            DUP = FALSE)$llik
  return(ret)
}
```

- Note that this function supports both serial and OpenMP versions, described in more detail in part b., below.

Running this code requires loading a shared object, built as `R CMD SHLIB -o mvnllik_sol.so mvnllik_sol.c`.

`dyn.load("mvnllik_sol.so")`

- Demonstration is delayed until part c.

*Extend the C version to utilize `OpenMP` parallelization. There are at least two opportunities to do this, but you only need to do one.*

The code supporting this is also in the new `mvnllik_sol.c` file mentioned above, including new C functions `logliks_omp_R` and `logliks_omp`. The implementation parallelizes over each of the `theta` values with `#pragma omp parallel`.
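The structure of that parallelization can be sketched as follows. Since each grid point's likelihood is independent of the others, one pragma over the \(\theta\) loop suffices; `loglik_one` here is a hypothetical, trivial stand-in for the real per-\(\theta\) evaluation.

```c
#include <assert.h>
#include <math.h>

/* hypothetical stand-in for the (expensive) per-theta likelihood */
static double loglik_one(double theta) { return -0.5 * theta * theta; }

/* Each grid point is independent, so the loop parallelizes with a single
 * pragma; dynamic scheduling helps when iterations vary in cost.  Compiled
 * without -fopenmp the pragma is ignored and the loop runs serially. */
void logliks_omp_sketch(double *thetas, int tlen, double *llik)
{
  #pragma omp parallel for schedule(dynamic)
  for (int t = 0; t < tlen; t++)
    llik[t] = loglik_one(thetas[t]); /* each iteration writes its own slot */
}
```

Because every iteration writes only its own `llik[t]`, no synchronization beyond the implicit barrier is needed.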

*Show that all four versions give the same likelihood curve, and maximizing value, over the \(\theta\) grid. I.e., show that they give exactly the same output, demonstrating correctness.*

First, let's perform calculations and record timings (for part d.) for the two non-C options provided in the problem description.

```
verb <- 0
time.R <- system.time(ll.R <- logliks.R(Y, D, thetas, verb=verb))
time.R2 <- system.time(ll.R2 <- logliks.R(Y, D, thetas, verb=verb, ll=loglik2))
```

Then the two C versions …

```
time.C <- system.time(ll.C <- logliks(Y, D, thetas, verb=verb))
time.Comp <- system.time(ll.Comp <- logliks(Y, D, thetas, verb=verb, call="logliks_omp_R"))
```

*Provide a timing comparison between the four versions. Be sure to note how many cores your parallel `OpenMP` version is using.*

The (elapsed) timings are summarized as follows.

`c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3])`

```
## R.elapsed R2.elapsed C.elapsed Comp.elapsed
## 4.067 54.344 4.551 1.606
```

- The R library (`dmvnorm`) version and our C version are about the same; the pure R version is quite slow.
- However, our OpenMP version is 2.53 times faster.
- My desktop has 8 hyperthreaded cores, so the OpenMP version uses 16 threads.

*Now, using an R with accelerated linear algebra (Intel MKL, OSX Accelerate, or Microsoft R/MRAN), re-perform your comparison from d. and report the results. (If your timings are not **all** much faster, then you're doing it wrong.)*

This entire Rmd file is re-compiled with an MKL-linked R and provided as hw5_sol_mkl.html.

- The result is a more than \(3.5\times\) speed-up.

*Parallelize `bootreg` with OpenMP and report timings comparing your new parallel version to the R and C versions provided in class.*

*Your comparison should include both ordinary R and a version with accelerated linear algebra.*

First we load in the R file from lecture.

`source("bootreg.R")`

The C code supporting this solution can be found in `bootreg_sol.c` and `randomkit.c`, with `randomkit.h`.

- These latter files provide a thread-safe RNG. I found these while Googling "thread-safe Mersenne twister in C". If you download the R source and search for its RNG, you will see it is also based on the Mersenne twister algorithm, so we're replacing apples with threaded apples.
- Besides the threaded RNG, the OpenMP is pretty straightforward, having a `#pragma omp parallel` at the top. An exception is a `#pragma omp critical` when seeding the multiple `randomkit` RNGs.
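The seeding pattern described above can be sketched as follows. The generator here is a toy LCG standing in for `randomkit`'s Mersenne Twister (the per-thread state, not the generator, is the point), and the `#ifdef _OPENMP` guard lets the same code compile serially.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void) { return 0; }   /* serial fallbacks */
static int omp_get_max_threads(void) { return 1; }
#endif

/* Tiny LCG as a hypothetical stand-in for randomkit's Mersenne Twister. */
typedef struct { unsigned long long s; } rng_t;
static void rng_seed(rng_t *r, unsigned long long seed) { r->s = seed; }
static double rng_unif(rng_t *r)
{
  r->s = r->s * 6364136223846793005ULL + 1442695040888963407ULL;
  return (double)(r->s >> 11) / 9007199254740992.0; /* 53 high bits, in [0,1) */
}

/* Draw B uniforms in parallel; each thread seeds and touches only its own
 * state.  Seeding sits in a critical section, mirroring the pattern used
 * in bootreg_sol.c. */
void draws(int B, double *out)
{
  rng_t *rngs = malloc(sizeof(rng_t) * omp_get_max_threads());
  #pragma omp parallel
  {
    int me = omp_get_thread_num();
    #pragma omp critical
    rng_seed(&rngs[me], 42ULL + me);  /* distinct seed per thread */
    #pragma omp for
    for (int b = 0; b < B; b++)
      out[b] = rng_unif(&rngs[me]);   /* thread-private state, no races */
  }
  free(rngs);
}
```

With thread-private state, the only shared writes are to distinct slots of `out`, so the draws are race-free (though their values depend on the thread count).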

The shared object may be compiled as `R CMD SHLIB -o bootreg_sol.so bootreg_sol.c randomkit.c`, and loaded as follows.

`dyn.load("bootreg_sol.so")`

The following R interface function can be used to call the parallel version on the C side.

```
bootols <- function(X, y, B=199, icept=TRUE, inv=FALSE, call=c("bootols_R", "bootols_omp_R"))
{
  if(icept) X <- cbind(1, X)
  m <- ncol(X)
  n <- nrow(X)
  if(n != length(y)) stop("dimension mismatch")
  call <- match.arg(call)
  ret <- .C(call,
            X = as.double(t(X)),
            y = as.double(y),
            n = as.integer(n),
            m = as.integer(m),
            B = as.integer(B),
            inv = as.integer(inv),
            beta.hat = double(m*B),
            DUP = FALSE)
  ## must be careful to return a transposed beta.hat matrix
  return(matrix(ret$beta.hat, nrow=B, byrow=TRUE))
}
```

- This overwrites the version in `bootreg.R`.
- Notice that it supports both serial and OpenMP-parallelized versions via a `call` argument.
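A structural sketch of the parallelized bootstrap (hypothetical, and much simpler than `bootreg_sol.c`: one no-intercept regressor, a closed-form slope, and a toy inline LCG) shows why the `#pragma omp parallel for` is safe, since every replicate owns its own RNG state and writes to a distinct `beta[b]`.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical sketch, not the class bootols_omp_R: B bootstrap resamples
 * of a no-intercept simple regression y = beta * x.  Each replicate carries
 * its own RNG state, so iterations are fully independent. */
void boot_slope(const double *x, const double *y, int n, int B, double *beta)
{
  #pragma omp parallel for
  for (int b = 0; b < B; b++) {
    unsigned long long s = 42ULL + (unsigned long long)b; /* per-replicate seed */
    double sxy = 0.0, sxx = 0.0;
    for (int i = 0; i < n; i++) {
      s = s * 6364136223846793005ULL + 1442695040888963407ULL; /* toy LCG step */
      int k = (int)((s >> 33) % (unsigned long long)n); /* row, with replacement */
      sxy += x[k] * y[k];
      sxx += x[k] * x[k];
    }
    beta[b] = sxy / sxx; /* closed-form OLS slope on the resample */
  }
}
```

The real version replaces the closed-form slope with a full least-squares solve and the toy LCG with `randomkit`, but the independence structure is the same.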

Now for some timings.

- Note that we use `X2` and `Y2` below so as not to clash with `X` and `Y` above.

```
## generate testing data
n <- 5000
d <- 200
X2 <- 1/matrix(rt(n*d, df=1), ncol=d)
beta <- c(1:d, 0)
Y2 <- beta[1] + X2 %*% beta[-1] + rnorm(n, sd=3)  ## noise for all n rows
## testing bootstrapped OLS
time2.Rlm <- system.time(boot.Rlm <- bootols.R(X2, Y2, B=1000, method="lm"))
time2.Rinv <- system.time(boot.Rinv <- bootols.R(X2, Y2, B=1000, method="inv"))
time2.Rsolve <- system.time(boot.Rsolve <- bootols.R(X2, Y2, B=1000, method="solve"))
time2.Rcp <- system.time(boot.Rcp <- bootols.R(X2, Y2, B=1000, method="cp"))
time2.Cinv <- system.time(boot.Cinv <- bootols(X2, Y2, B=1000, inv=TRUE))
time2.Csolve <- system.time(boot.Csolve <- bootols(X2, Y2, B=1000))
time2.Comp <- system.time(boot.Comp <- bootols(X2, Y2, B=1000, call="bootols_omp_R"))
```

The (elapsed) timings are summarized as follows.

```
c(Rlm=time2.Rlm[3], Rinv=time2.Rinv[3], Rsolve=time2.Rsolve[3], Rcp=time2.Rcp[3],
Cinv=time2.Cinv[3], Csolve=time2.Csolve[3], Comp=time2.Comp[3])
```

```
## Rlm.elapsed Rinv.elapsed Rsolve.elapsed Rcp.elapsed Cinv.elapsed
## 121.756 26.113 16.552 9.110 3.363
## Csolve.elapsed Comp.elapsed
## 3.686 1.918
```

- Here we see about a \(1.92\times\) speedup with the OpenMP version, which is not as impressive as in Question 1, above.
- For timings under an MKL-linked R, please see hw5_sol_mkl.html. The resulting speedup is about \(16\times\), which is impressive.

**`Rcpp` for `swap` (15 points)**

*Revisit the `swap` function from problem 2 of homework 2 and write a version in Rcpp which, when called as `swap.Rcpp(v, i, j)`, swaps entries `i` and `j` without copying the vector `v`. Demonstrate correctness, and compare timings with our fast `swap.eval`.*

Rcpp uses pointers, which means no copying, and leads to a trivial C++ implementation.

```
library(Rcpp)
cppFunction("
void swapCpp(IntegerVector v, int i, int j){
int tmp = v[i-1];
v[i-1] = v[j-1];
v[j-1] = tmp;
}
")
```
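Under the hood, an Rcpp `IntegerVector` is a thin proxy for R's own memory, so element assignment writes in place. A plain C++ analogue of the same pass-by-reference semantics (illustrative only, no Rcpp involved):

```cpp
#include <cassert>
#include <vector>

// Taking the vector by reference means no copy is made, so the swap is
// visible to the caller, just as swapCpp modifies the R vector in place.
void swap_ref(std::vector<int> &v, int i, int j)
{
    int tmp = v[i - 1];   // 1-based indices, matching the R convention
    v[i - 1] = v[j - 1];
    v[j - 1] = tmp;
}
```

Had the argument been taken by value, the caller's vector would be untouched, which is exactly what happens with a plain R function.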

First, let's check for correctness.

```
v <- 1:1000000000
swapCpp(v, 1, 2)
v[1:5]
```

`## [1] 2 1 3 4 5`

- Bingo.

To check that it is fast relative to our naive and `eval`-based solutions, let's paste those in too.

```
swap <- function(v, i, j)
{
tmp <- v[i]
v[i] <- v[j]
v[j] <- tmp
}
swap.eval <- function()
{
e <- quote({tmp <- v[i]; v[i] <- v[j]; v[j] <- tmp})
eval(e, envir=parent.frame())
}
```

The code below collects timings in duplicate.

```
v <- 1:1000000000
i <- 1; j <- 2
library(microbenchmark)
time.swap <- c(system.time(swap(v, i, j))[3], system.time(swap(v, i, j))[3])
time.cpp <- c(system.time(swapCpp(v, i, j))[3], system.time(swapCpp(v, i, j))[3])
time.eval <- c(system.time(swap.eval())[3], system.time(swap.eval())[3])
```

The timings are summarized below.

`cbind(swap=time.swap, cpp=time.cpp, eval=time.eval)`

```
## swap cpp eval
## elapsed 0.800 0 0.796
## elapsed 0.802 0 0.000
```

- Observe that the Rcpp version is faster than the `eval` version, by just a little bit.
- But it does not require a "warm-up" run.

**`Rcpp` for `mvnllik` (25 points)**

*Write an Rcpp version of `logliks`, called `logliks.Rcpp`, which takes the same arguments. You may do this one of two ways, both for extra credit:*

- *as a wrapper around the C function `logliks`;*
- *as a C++ version of the R function `logliks2`.*

*Demonstrate correctness (i.e., show that your new version gives the same answers as the old ones) and compare timings with the other methods.*

The C++ file `mvnllik_sol.cpp` contains an Rcpp wrapper which supports both serial and OpenMP versions via an argument `omp`, which is `FALSE` by default.

`sourceCpp("mvnllik_sol.cpp")`

- Note that our `Makevars`, which provides the correct linking for the underlying linear algebra libraries, must reside in `~/.R/Makevars`. `sourceCpp` will not look in the current working directory for this file the way that `R CMD SHLIB` would.

The code below collects the output and timings under these two new variations.

```
time.Rcpp <- system.time(ll.Rcpp <- logliksCpp(t(Y), D, thetas, verb=verb))
time.RcppOmp <- system.time(ll.RcppOmp <- logliksCpp(t(Y), D, thetas, omp=TRUE, verb=verb))
```

The table below augments the one above with the new timings.

`c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3], Rcpp=time.Rcpp[3], RcppOmp=time.RcppOmp[3])`

```
## R.elapsed R2.elapsed C.elapsed Comp.elapsed
## 4.067 54.344 4.551 1.606
## Rcpp.elapsed RcppOmp.elapsed
## 4.583 1.613
```

- Cool! The new Rcpp interface is essentially as fast as the `.C` versions, both serially and with OpenMP; any differences in how arguments are copied are in the noise.

As a partial solution for the second option, we can consider using `RcppArmadillo` to replace `dmvnorm`. The code linked from that URL has been pasted into `mvnllik_arma.cpp`, and the following wrapper has been written to enable looping over `theta`, etc.

```
library(RcppArmadillo)
sourceCpp("mvnllik_arma.cpp")
loglik.arma.omp <- function(Y, D, theta, nth=8)
{
  if(ncol(Y) != nrow(D)) stop("dimension mismatch")
  Sigma <- exp(-D/theta) + diag(sqrt(.Machine$double.eps), nrow(D))
  return(sum(dmvnrm_arma_mc(Y, rep(0, ncol(Y)), Sigma, TRUE, nth)))
}
```

- Note that this interfaces to `OpenMP`; however, here the parallelization is over the rows of `Y` rather than over `theta`, which might not do as good a job of utilizing all of the available cores (i.e., not all of the time).
- The `mvnllik_arma.cpp` file also contains a serial version which is not being entertained here.

Now we can pass this new `loglik.arma.omp` function to our `logliks.R` function, which was written to utilize a generic log-density function.

- First we have to set the argument specifying the number of threads.

```
formals(loglik.arma.omp)$nth <- 16
time.RarmaOmp <- system.time(ll.RarmaOmp <- logliks.R(Y, D, thetas, verb=verb, ll=loglik.arma.omp))
```

The table below augments the ones above with this new timing.

```
c(R=time.R[3], R2=time.R2[3], C=time.C[3], Comp=time.Comp[3], Rcpp=time.Rcpp[3],
RcppOmp=time.RcppOmp[3], RarmaOmp=time.RarmaOmp[3])
```

```
## R.elapsed R2.elapsed C.elapsed Comp.elapsed
## 4.067 54.344 4.551 1.606
## Rcpp.elapsed RcppOmp.elapsed RarmaOmp.elapsed
## 4.583 1.613 2.471
```

- Not quite as fast as either of the previous OpenMP-based versions, possibly because there is still some R in play to loop over the `theta`s.
- A fancier Rcpp version would include that loop on the C++ side. But since we already have a fast all-C++/C version, we'll leave that for another time.

Results using MKL are provided in the companion document hw5_sol_mkl.html.

- As before, the speedup here is about \(3.5\times\).