Introduction

A GPU (Graphics Processing Unit) was originally developed for video signal processing. It has thousands of relatively slow cores, which is only fast if you are performing the same operation on thousands of data elements at once. Since GPUs need to move around much more data than CPUs, they have significantly faster and more advanced memory interfaces.

There are three ways to access a GPU through R, and we will look at each of them in this section: R GPU packages, CUDA libraries called through C++, and OpenCL.

R gpu packages

A number of packages have been built to use GPU functionality from R. You can find most of them listed in the CRAN High Performance Computing task view:

https://cran.r-project.org/web/views/HighPerformanceComputing.html

The disadvantage of some of these packages is that they can be used only with NVIDIA GPUs; if you have another GPU, such as one from Intel or AMD, you will not be able to use them. In this section, we introduce one of these packages, gpuR. Its advantage is that it works with any GPU, so for this section you do not need to use the ARC servers.

gpuR package

This package was created by Charles E. Determan Jr. and published in 2015. Like the other packages listed above, gpuR provides a way of using GPU functionality through R. You can find more information about this package and all the operations it offers at https://cran.r-project.org/web/packages/gpuR/index.html.

gpuR provides two classes that wrap R objects such as matrices and vectors: the gpu classes and the vcl classes. The first works as a pointer to data stored in CPU RAM, while the second pushes the data to GPU memory, where the GPU runs the operations before the results are pulled back to CPU memory. To better understand this CPU-GPU (or host-device) relationship, see the following diagram.

The first thing to notice is that the CPU (or host) memory and the GPU (or device) memory are not shared, so data has to travel back and forth between host and device. This transfer is handled by the objects that gpuR creates through the gpu and vcl classes.
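To make the round trip concrete, here is a minimal sketch (assuming a working OpenCL device, and assuming as.matrix() is the way results are pulled back into host memory, which is how gpuR exposes the copy back):

library(gpuR)

x <- matrix(rnorm(16), 4, 4)          # ordinary R matrix in host (CPU) memory
vclX <- vclMatrix(x, type = "float")  # push the data to device (GPU) memory
vclY <- vclX %*% vclX                 # the multiplication runs on the device
y <- as.matrix(vclY)                  # pull the result back into host memory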

You can find a nice description of what gpuR is and how it works at https://cran.r-project.org/web/packages/gpuR/vignettes/gpuR.pdf

Getting started with gpuR

To install gpuR, you use:

install.packages("gpuR")

After loading the library, you can check, for example, how many valid GPUs you have:

library(gpuR)
detectGPUs()
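Depending on your gpuR version, a few other helpers are useful for inspecting your hardware before you start. The function names below are our recollection of the gpuR API and may differ between versions, so treat this as a sketch:

detectPlatforms()   # how many OpenCL platforms are visible
gpuInfo()           # name, memory, and compute units of the current GPU
deviceHasDouble()   # TRUE if the device supports double precision (needed for type="double")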

To create gpu or vcl objects, you use the gpuMatrix and vclMatrix commands, specifying the type of the data.

x <- matrix(rnorm(16), 4, 4)
# creating a gpu object
gpuX <- gpuR::gpuMatrix(x, type="float")
# creating a vcl object
vclX <- gpuR::vclMatrix(x, type="float")

gpuR does not copy memory

The advantage of gpuR is that it does not copy the data; a gpu object works as a pointer. To understand this, let's try the following code.

If you do not have the pryr package, you first need to install it and load the library.

# install.packages("pryr")
library(pryr)

x <- matrix(rnorm(16), 4, 4)
address(x)
x[1,1] <- 0
address(x)

First, we obtain the address of the matrix x in host memory. Then, when we modify the R object (in this case the matrix x), R copies the object, which is why we get a different location in host memory.

gpuX <- gpuR::gpuMatrix(x, type="float")
gpuX@address
gpuX[1,1] <- 0
gpuX[1,1]
gpuX@address

Now, when we create a gpu object through gpuMatrix and modify it, the modification happens without copying, because gpuR uses pointers to host memory. The code is therefore more efficient in terms of memory.

gpuR offers most of the common linear algebra operations (such as %*%, +, -, *, /, t, crossprod, tcrossprod), math and trig functions, and some extra functions such as dist and eigen.
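As a quick sketch, here are a few of these methods applied to a vcl object (everything stays on the device until you convert the result back with as.matrix()):

x <- matrix(rnorm(16), 4, 4)
vclX <- gpuR::vclMatrix(x, type = "float")

t(vclX)            # transpose on the device
crossprod(vclX)    # t(x) %*% x
vclX + vclX        # element-wise arithmetic
dist(vclX)         # pairwise distances between rows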

Matrix multiply

Let's try matrix multiplication and analyze the performance of the two different types of gpu objects.

First, generate a matrix x in R, calculate x %*% x, and record the time it takes.

nr <- 1000
x <- matrix(rnorm(nr*nr, 0, 1), nrow=nr, ncol=nr)

time1 <- system.time({
  mm1 <- x %*% x
})

Let’s compare this time to the one when using gpu and vcl objects.

gpuX <- gpuR::gpuMatrix(x, type="float")  # point GPU to matrix

time2 <- system.time({
  mm2 <- gpuX %*% gpuX
})

vclX <- gpuR::vclMatrix(x, type="float")  # push matrix to GPU

time3 <- system.time({
  mm3 <- vclX %*% vclX
})

Recall that when we use gpuMatrix, we use a GPU pointer to CPU memory; when we use vclMatrix, we push the matrix to the device (GPU).

We compare the results:

matrix multiply timing:

         user.self  sys.self  elapsed
CPU          0.111     0.003    0.018
CPU-GPU      0.019     0.015    0.774
GPU          0.006     0.000    0.005

We see here that the GPU timing is much lower than the CPU-GPU timing, as expected. Depending on the computer, the CPU timing may be higher or lower than the GPU (vcl object) timing for this matrix size. However, as the matrix grows, the advantage of the GPU over the CPU becomes clearer, and the data transfer penalty makes vclMatrix increasingly preferable to gpuMatrix.
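A rough way to see this is to repeat the timing for increasing matrix sizes. The sketch below assumes the sizes fit in both CPU and GPU memory; the GPU numbers are approximate, since the one-time transfer and possible asynchronous execution can blur a single system.time() measurement:

for (nr in c(1000, 2000, 4000)) {
  x    <- matrix(rnorm(nr*nr), nr, nr)
  vclX <- gpuR::vclMatrix(x, type = "float")
  cpu  <- system.time(x %*% x)["elapsed"]
  gpu  <- system.time(vclX %*% vclX)["elapsed"]
  cat(nr, "x", nr, ": CPU", cpu, "s, GPU", gpu, "s\n")
}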

Solve function

We know that solving a system of equations is MUCH faster than explicitly inverting a matrix, and it can often avoid numerical issues arising from values close to zero.
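To see this concretely, here is a small sketch in plain R (the size is arbitrary) comparing the invert-then-multiply approach with solving the system directly:

set.seed(1)
n <- 2000
A <- crossprod(matrix(rnorm(n*n), n, n))  # symmetric positive definite matrix
b <- rnorm(n)

system.time(x1 <- solve(A) %*% b)   # invert, then multiply
system.time(x2 <- solve(A, b))      # solve the linear system directly
all.equal(drop(x1), x2)             # same answer up to rounding, with less work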

gpuR offers the solve function, and it appears to outperform the base R function in terms of runtime when the matrix is big. To observe this, we can use the following code to generate the data, create the gpu objects, and then compare the results.

np <- 500    # number of predictors
nr <- 10000  # number of observations

X <- cbind(5, 1:nr, matrix(rnorm((np-1)*nr, 0, .01), nrow=nr, ncol=(np-1)))
beta <- matrix(c(1, 3, runif(np-1, 0, 0.2)), ncol=1)
y <- X %*% beta + matrix(rnorm(nr, 0, 1), nrow=nr, ncol=1)

time1 <- system.time({
  ms2 <- solve(crossprod(X), crossprod(X, y))
})

# GPU version, GPU pointer to CPU memory!!
gpuX <- gpuR::gpuMatrix(X, type="float")  # point GPU to matrix
gpuy <- gpuR::gpuMatrix(y, type="float")

time2 <- system.time({
  ms4 <- gpuR::solve(gpuR::crossprod(gpuX), gpuR::crossprod(gpuX, gpuy))
})

# GPU version, in GPU memory!! (vclMatrix formation is a memory transfer)
vclX <- gpuR::vclMatrix(X, type="float")
vcly <- gpuR::vclMatrix(y, type="float")

time3 <- system.time({
  ms5 <- gpuR::solve(gpuR::crossprod(vclX), gpuR::crossprod(vclX, vcly))
})

solve timing:

         user.self  sys.self  elapsed
CPU          0.149     0.004    0.032
CPU-GPU      0.047     0.050    0.911
GPU          0.013     0.010    0.091

We observe the same thing as with matrix multiplication: the GPU (vcl object) outperforms the CPU-GPU approach (gpu object), and if we make the matrix bigger it also outperforms the base R function. We observe this in the following figure.

Cov and eigen functions

gpuR also offers functions such as cov and eigen. However, as can be seen below, they do not seem to be faster than the base R functions.

nr <- 10000
nc <- 2000
x <- matrix(rnorm(nr*nc, 0, 1), nrow=nr, ncol=nc)

system.time(m1 <- eigen(cov(x), symmetric = TRUE))
##    user  system elapsed 
##  23.805   0.289  23.143

gpuX <- gpuR::vclMatrix(x, type="float")
system.time(m2 <- gpuR::eigen(gpuR::cov(gpuX), symmetric = TRUE))
##    user  system elapsed 
##   0.826   2.118  53.279

Bootstrap function

We have used the bootstrap in class a number of times now. In this section we evaluate the performance of the GPU through the bootstrap.

First, we generate the following data:

## generate data for bootstrap
n <- 10000
d <- 800

X <- 1/matrix(rt(n*d, df=1), ncol=d)
beta <- c(1:d, 0)
Y <- beta[1] + X %*% beta[-1] + rnorm(n, sd=3)
B <- 1000

Then, we run the bootstrap using gpu and vcl objects. We can use the following code:

time3 <- system.time({
  beta.gpu <- matrix(NA, nrow=B, ncol=ncol(X))
  for(b in 1:B) {
    i <- sample(1:n, n, replace=TRUE)
    Xb <- gpuR::gpuMatrix(X[i,], type="float")
    gpuy <- as.matrix(Y[i])
    yb <- gpuR::gpuMatrix(gpuy, type="float")
    beta.gpu[b,] <- drop(gpuR::solve(gpuR::crossprod(Xb), gpuR::crossprod(Xb, yb)))[1:ncol(X)]
  }
})[3]

time4 <- system.time({
  beta.vcl <- matrix(NA, nrow=B, ncol=ncol(X))
  for(b in 1:B) {
    i <- sample(1:n, n, replace=TRUE)
    Xb <- gpuR::vclMatrix(X[i,], type="float")
    gpuy <- as.matrix(Y[i])
    yb <- gpuR::vclMatrix(gpuy, type="float")
    beta.vcl[b,] <- drop(gpuR::solve(gpuR::crossprod(Xb), gpuR::crossprod(Xb, yb)))[1:ncol(X)]
  }
})[3]

Then, we compare the R function using the crossprod method with the GPU using both types of objects. Here are the elapsed times (in seconds) for different data settings.

|                                      | n,d = 5000, 200 | n,d = 10000, 500 | n,d = 10000, 800 |
|:-------------------------------------|----------------:|-----------------:|-----------------:|
| bootols.R(X, Y, B=1000, method="cp") |          29.267 |          244.572 |          452.036 |
| using gpu object                     |          47.415 |          172.152 |          320.708 |
| using vcl object                     |          41.742 |          138.485 |          255.296 |

We see that the GPU's timing advantage grows as the dataset gets bigger. We also observe that, as before, using vcl objects is more efficient than using gpu objects.
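For reference, the first row of the table corresponds to a CPU-only, crossprod-based OLS bootstrap. The bootols.R function used in class may differ in its details, but a minimal sketch looks like this:

bootols_cp <- function(X, Y, B = 1000) {
  n <- nrow(X)
  beta.boot <- matrix(NA, nrow = B, ncol = ncol(X))
  for (b in 1:B) {
    i <- sample(1:n, n, replace = TRUE)
    Xb <- X[i, ]
    Yb <- Y[i]
    # same normal-equations solve as the GPU versions, but on the CPU
    beta.boot[b, ] <- drop(solve(crossprod(Xb), crossprod(Xb, Yb)))
  }
  beta.boot
}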

Log into ARC

NewRiver comprises 173 nodes. It contains six compute engines designed for distinct workloads, one of which is a GPU engine (NVIDIA P100 GPUs) for code acceleration.

Log into the NewRiver login node with your PID and PID password using 2 factor authentication:

(a) Mac/Linux: Open a terminal and type:
ssh YOURPID@newriver1.arc.vt.edu

(b) Windows: Download PuTTY and in the Host Name field type: 
YOURPID@newriver1.arc.vt.edu

When you are prompted for a password, enter your password followed by a comma and then one of the following keywords for your 2FA second factor:

auto
phone
phone2
Yubikey code (if you have one)
SMS code (if you have one)

Here are some examples of what to enter for your password:

yourpassword,auto     (this will use your default second factor such as Duo push)
yourpassword,phone     (this will use the phone number you have chosen for 2FA)

Once your credentials are verified, you will be presented with your command prompt.

Here are the key commands for getting a job on a GPU node and loading the required modules.

Get a job on a GPU node:

interact -ApredictHPC -lnodes=1:ppn=14:gpus=1 -q p100_normal_q -ladvres=GPU_TRAINING.102

Load the required modules:

module purge    
module load cuda/8.0.61 gcc/5.2.0 openblas    
module load R/3.4.1 R-gpu/3.4.1    

Now you are ready to take advantage of GPU computing to speed up your work.

Accessing CUDA libraries via C++

In this section, we guide you through the use of CUDA libraries. In our presentation, we went through cuRAND and cuBLAS. Now, let us take a look at another cuBLAS example. In addition, we will illustrate a working example of cuSOLVER.

In the class presentation, we showed how to create two matrices on the GPU and perform matrix multiplication there. Here, we show another example: passing two existing matrices to the GPU and finishing the job there.

extern "C"
void cuMM(int *nr_A, int *nc_A, int *nc_B, double *A, double *B, double *C, double *a, double *b)
{
    /* set up variables */
    const double alpha = (double) *a;
    const double beta = (double) *b;

    /* create a handle for cuBLAS */
    cublasHandle_t handle;
    cublasCreate(&handle);

    /* allocate 3 arrays on GPU */
    double *d_A, *d_B, *d_C;
    cudaMalloc(&d_A,*nr_A * *nc_A * sizeof(double));
    cudaMalloc(&d_B,*nc_A * *nc_B * sizeof(double));
    cudaMalloc(&d_C,*nr_A * *nc_B * sizeof(double));

    /* Copy CPU data to GPU */
    cudaMemcpy(d_A, A, *nr_A * *nc_A * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, *nc_A * *nc_B * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, C, *nr_A * *nc_B * sizeof(double), cudaMemcpyHostToDevice);

    /* Multiply A and B on GPU */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, *nr_A, *nc_B, *nc_A, &alpha, d_A, *nr_A, d_B, *nc_A, &beta, d_C, *nr_A);

    /* Copy the data back to CPU */
    cudaMemcpy(C, d_C, *nr_A * *nc_B * sizeof(double), cudaMemcpyDeviceToHost);

    /* Free GPU memory and destroy the handle */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cublasDestroy(handle);
}

This example may be of more practical use than the one in our class presentation. Next, let us focus on cuSOLVER.

The cuSOLVER library is built upon cuBLAS and cuSPARSE libraries, and it comprises three libraries:

  • cuSolverDN: solves dense linear systems of the form \(Ax=b\). It provides QR factorization and LU with partial pivoting to handle a general matrix \(A\), which may be non-symmetric; Cholesky factorization for symmetric/Hermitian matrices; and the Bunch-Kaufman (LDL) factorization for symmetric indefinite matrices. Singular Value Decomposition (SVD) and other bidiagonalization routines are also provided.

  • cuSolverSP: solves sparse linear systems of the form \(Ax=b\) and the least-squares problem \(x=\mathrm{argmin}\,||Az-b||\). The core algorithm is based on sparse QR factorization.

  • cuSolverRF: accelerates the solution of sets of linear systems by fast re-factorization when new coefficients are given in the same sparsity pattern \(A_i x_i=f_i\), where a sequence of coefficient matrices \(A_i\in R^{n\times n}\), right-hand sides \(f_i\in R^n\), and solutions \(x_i\in R^n\) are given for \(i=1,\dots,k\).

cuSolverDN is the most popular one and is therefore our focus. The float, double, cuComplex, and cuDoubleComplex data types are supported. The first two are standard C data types, while the last two come from cuComplex.h. In addition, cuSolverDN uses some types shared with cuBLAS, such as cusolverDnHandle_t, cublasFillMode_t, cublasOperation_t, cusolverEigType_t, cusolverEigMode_t, and cusolverStatus_t.

Below is a working example that does \((X^\top X)^{-1}\).

First of all, we need those magic header files.

#include <R.h>
#include <stdio.h>
#include <cstdlib>
#include <curand.h>
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <cuda_runtime.h>

The skeleton is as follows:

extern "C"
void cuMM(int *nr_A, int *nc_A, int *nc_B, double *A, double *B, double *C, 
          double *X, double *a, double *b)
{
  /* part 1 */
     ...
  /* part 2 */
     ...
  /* part 3 */
     ...
  /* part 4 */
     ...
}

Part 1 is straightforward but critical:

/* set up variables*/
const double alpha = (double) *a;
const double beta = (double) *b;
int *d_Ipiv = NULL; // pivoting sequence
int lwork = 0; // size of workspace
double *d_work = NULL; // device workspace for getrf
int *d_info;

/* create a handle for cuBLAS */
cublasHandle_t cublasH;
cublasCreate(&cublasH);
cusolverDnHandle_t cusolverH;
cusolverDnCreate(&cusolverH);

/* Allocate 4 arrays on GPU */
double *d_A, *d_B, *d_C;
cudaMalloc(&d_A,*nr_A * *nc_A * sizeof(double));
cudaMalloc(&d_B,*nc_A * *nc_B * sizeof(double));
cudaMalloc(&d_C,*nr_A * *nc_A * sizeof(double));
cudaMalloc((void**)&d_Ipiv, sizeof(int) * *nr_A);
cudaMalloc((void**)&d_info, sizeof(int));

Part 2 is illustrated as follows:

/* for solver, need to set aside some memory space and put it on device */
cusolverDnDgetrf_bufferSize(cusolverH, *nr_A, *nc_A, d_A, *nr_A, &lwork);
cudaMalloc((void**)&d_work, sizeof(double)*lwork);

/* Copy CPU data to GPU */
cudaMemcpy(d_A, A, *nr_A * *nc_A * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, *nr_A * *nc_B * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_C, C, *nr_A * *nc_A * sizeof(double), cudaMemcpyHostToDevice);

It is Part 3’s turn:

/* calculate (A'A)=C */
cublasDgemm(cublasH, CUBLAS_OP_T, CUBLAS_OP_N, *nr_A, *nc_A, *nc_A, &alpha, d_A, *nr_A, d_A, *nc_A, &beta, d_C, *nr_A);

/* get C back, it gets over written when solving */
cudaMemcpy(C,d_C,*nr_A * *nc_A * sizeof(double),cudaMemcpyDeviceToHost);

/* solve for X in CX=B where B is an identity matrix */
cusolverDnDgetrf(cusolverH, *nr_A, *nc_A, d_C, *nr_A, d_work, d_Ipiv, d_info);
cusolverDnDgetrs(cusolverH, CUBLAS_OP_N, *nr_A, *nc_B, d_C, *nr_A, d_Ipiv, d_B, *nc_B, d_info);

Finally, let us go through the last part:

/* Copy X back to the CPU (note that it is in d_B because the solve overwrites it) */
cudaMemcpy(X,d_B,*nr_A * *nc_B * sizeof(double),cudaMemcpyDeviceToHost);

/* Free GPU memory */
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

if(cublasH) cublasDestroy(cublasH);
if(cusolverH) cusolverDnDestroy(cusolverH);
cudaDeviceReset();

The last step is to implement a wrapper function, and to compile it in the same way as illustrated in our presentation.
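A minimal sketch of such a wrapper is shown below. It assumes the source above has been compiled into a shared library we call cuSolver.so (the file name is ours) following the compile steps from the presentation, that the matrix passed in is square (as the buffer allocations in Part 1 imply), and that B is the identity so the solve returns \((A^\top A)^{-1}\):

# compile the .cu file into a shared library (e.g. with nvcc) as shown in
# the presentation, producing cuSolver.so, then:

xtx_inverse_gpu <- function(A) {
  stopifnot(nrow(A) == ncol(A))   # the buffer sizes above assume a square A
  n <- nrow(A)
  dyn.load("cuSolver.so")
  out <- .C("cuMM",
            nr_A = as.integer(n), nc_A = as.integer(n), nc_B = as.integer(n),
            A = as.double(A), B = as.double(diag(n)), C = double(n * n),
            X = double(n * n), a = as.double(1), b = as.double(0))
  list(AtA = matrix(out$C, n, n),   # A'A, copied back in Part 3
       inv = matrix(out$X, n, n))   # (A'A)^{-1}, returned through X
}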

For a complete example of how this works, look in the testR_BLAS directory in the tar file given by Bobby. Inside this directory, you will see both cuBLAS_MM_d.cu and wrap.R. If, on a NewRiver GPU node, you type bash gpu_cuBLAS.sh, you will see this example run. Note that the shell script loads the appropriate modules.

OpenCL

Before starting OpenCL on the ARC server

  • You need to get access to an interactive session on NewRiver
## Get job on gpu node:  
interact -Aascclass -lnodes=1:ppn=14:gpus=1 -q p100_normal_q   

## load required modules
module purge  
module load cuda/8.0.61 gcc/5.2.0 openblas  
module load R/3.4.1 R-gpu/3.4.1  

OpenCL in R

OpenCL in R interface

  • The OpenCL package in R wraps OpenCL code and allows us to use GPU computing.

  • oclPlatforms obtains information about the computing platforms available on the machine (NVIDIA, AMD, Intel, etc.).

  • oclDevices selects the GPU device we want to use.

########## OpenCL Code #############
library(OpenCL)

p = oclPlatforms()  ## gets computing platform on server
d = oclDevices(p[[1]])  ## sets GPU device we want to use

print(p)
print(d)

Example of OpenCL in R | Vector Addition

Source: https://gist.github.com/mprymek/8ca298f0ff2b139b0c63

library(OpenCL)
p = oclPlatforms() #gets computing platform
d = oclDevices(p[[1]]) #sets GPU device we want to use

vector_add_code <- c("

/* type 'double' is enabled by this statement */

#pragma OPENCL EXTENSION cl_khr_fp64 : enable 
__kernel void gpu_sum(
// first two args are mandatory
__global float* output,
const unsigned int n,
// user args
__global const float* input1,
__global const float* input2)
{
int i = get_global_id(0);
if (i<n) output[i] = input1[i] + input2[i];
};")

Example of OpenCL in R | Vector Addition

  • The OpenCL code is wrapped by oclSimpleKernel, which creates an R function, "k.gpu.sum".

  • You can run the OpenCL function "k.gpu.sum" by using oclRun.

## Make R function "k.gpu.sum" by wrapping OpenCL code with oclSimpleKernel.
k.gpu.sum <- oclSimpleKernel(d[[1]], "gpu_sum", vector_add_code, "best")

# example
vector1 <- as.double(1:10)
vector2 <- as.double(2:11)

# run the GPU code and get result
result <- oclRun(k.gpu.sum, length(vector1), vector1, vector2)


print(result)
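As a quick sanity check, the device result should match the ordinary CPU sum, allowing for single-precision rounding since the kernel works in float (the tolerance below is our choice):

# compare against the CPU result; expect agreement up to float precision
all.equal(as.numeric(result), vector1 + vector2, tolerance = 1e-6)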