Programming GPUs in R

In these exercises, you will explore the use of GPUs as compute devices to accelerate computations in R. Our platform will be the NVIDIA P100 GPUs available on Virginia Tech's HPC cluster, NewRiver. As discussed in class, you can ssh to the cluster after getting an account with ARC (making sure to specify that you would like access to NewRiver).

Problem 1: Interactive work with a GPU

In this problem, you will use top and nvidia-smi to profile the CPU and GPU use during two runs through a Metropolis-Hastings (MH) parallel tempering simulation.

Consider the double-well potential shown in the top of Figure 1, which can be converted to a density through exponentiation (bottom plots). In a standard MH simulation, you choose a starting location, propose a move from that location, and flip a coin to determine whether you accept or reject the new location, where the coin's probability compares the density at the proposed location to the density at the current one (adjusted by how hard it was to get from the old location to the new one). Given some initial value and a tunable step size, MH simulation will work, but it will get "stuck" in one of the two hills. You could increase the step size; however, this can result in a "flattening" of the resulting distribution draws or suboptimal exploration of small features, and in situations where the target has large areas of zero probability between features, it still often fails to fully explore the space. Figure 2 shows MC runs exploring the bottom-right distribution in Figure 1 using tuning values of 0.1 and 1. Note that we are still stuck in a single hill after widening the tuning parameter.

Figure 1: Double well potential (top) and unnormalized densities (gamma = 1, 16; bottom).

Figure 2: MC with and without exchange. A: tune = 0.1, B: tune = 1.

Another take on this is to create a "temperature ladder": separate MC chains that run in parallel, each targeting a "smoothed" or "flattened" version of the actual target. How do you flatten? You raise the target to a fractional power determined by the temperature. Periodically, you allow the chains to exchange states. This lets the higher-temperature chains explore the target space more freely and periodically inject their states into the chain that is confined to the actual target. The tunables now include the number of ladder rungs, the spacing between rungs, the period between exchanges, and more.

Computationally, this adds many more MC runs, so some sort of parallelization is highly desirable. Because these MC chains need to be independent, we will need to be clever about how we parallelize and use memory if the chains are to exchange states. We have coded up a couple of examples and are asking you to observe their performance on our VT clusters.

Here are the steps you should take:

  1. Get an interactive job on a GPU node: interact -Aascclass -lnodes=1:ppn=28:gpus=1 -q p100_dev_q -lwalltime=2:00:00
  2. Make note of the host you are on (hostname). You will want three separate terminals (the current one for top, one to profile GPU memory, and another to profile GPU utilization); i.e., ssh to the hostname from a NewRiver login node to create each new terminal.
  3. Load the necessary modules for interacting with the GPU in R: module purge; module load gcc/5.2.0 openblas R/3.4.1 cuda/8.0.61 R-gpu/3.4.1
  4. For each of the Rscript commands below, run top, nvidia-smi -q -d MEMORY -l 1, and nvidia-smi -q -d UTILIZATION -l 1, and note (a) peak %CPU, (b) peak %MEM, (c) GPU Memory-Usage, and (d) GPU-Util while the script is running.

Rscript MetroH_PT.R CPU &
Rscript MetroH_PT.R CPU-parallel &
Rscript MetroH_PT.R GPU &

Provide the figures generated by the above MH runs along with your observations on the profiling and timings.

Problem 2: Bootstrap on a GPU

Bootstrap using CUDA libraries

For this problem, you should use CUDA libraries such as cuBLAS, cuRAND, and cuSOLVER to code up the bootstrap function. Please refer to slides 20 through 34 of our presentation, as well as the cuSOLVER example in the tutorial. In addition, Dr. Gramacy's C implementation is a good reference, since C and C++ are siblings.

Please compare your implementation with the R version, the ordinary C version, and the C version with OpenMP (please specify the number of CPU cores you used).