A GPU (**G**raphics **P**rocessing **U**nit) was originally developed for video signal processing. It has thousands of relatively slow cores, which are only fast if you perform the same operation simultaneously across thousands of data elements. Since GPUs need to move much more data than CPUs, they have significantly faster and more advanced memory interfaces.

There are three ways to access a GPU through R:

- use an R package with the function you are interested in
  - expanding but limited number of functions available
  - sometimes difficult to install; you must download the package, then install it on a GPU node
  - sometimes need to do `export PKG_CFLAGS=-I$CUDA_INC`
- use a CUDA/OpenCL library with the function you are interested in
  - limited number of libraries
- write CUDA/OpenCL code and an associated R wrapper
  - more power = more effort = more reward

There are a number of packages that have been built to use GPU functionality. The following list shows most of them:

- gputools (CUDA and cuBLAS)
- cudaBayesreg (CUDA)
- HiPLARM (PLASMA/MAGMA from UTK, CUDA and BLAS)
- HiPLARb
- gmatrix
- gpuR (OpenCL, ViennaCL)
- gpuRcuda (CUDA)
- rpud (rpudplus add-on)
- RCUDA, OpenCL, and ROpenCL

You can check out each of these libraries at https://cran.r-project.org/web/views/HighPerformanceComputing.html

The disadvantage of some of these packages is that they can be used only with NVIDIA GPUs. If you have another GPU, such as Intel or AMD, you will not be able to use them. In this section, we are going to introduce the use of one of these packages: `gpuR`. Its advantage is that it can be used with any GPU, so for this section you do not need to use the ARC servers.

This package was created by Charles E. Determan Jr. and published in 2015. Like the other packages listed above, `gpuR` provides a way of using GPU functionality through R. You can find more information about this package and all the operations it offers at https://cran.r-project.org/web/packages/gpuR/index.html.

`gpuR` provides two classes to wrap R objects such as matrices or vectors: the `gpu` and `vcl` classes. The first works as a pointer to data stored in CPU RAM, while the second pushes the data to GPU memory; after the GPU runs all the operations, the data is pushed back to CPU memory. To better understand this CPU-GPU (or host-device) relationship, see the following graph.

The first thing we notice is that the CPU (or host) memory and the GPU (or device) memory are not shared. Data must therefore travel back and forth between host and device. This is handled by the objects `gpuR` creates through the `gpu` and `vcl` classes.

You can find a nice description of what `gpuR` is and how it works at https://cran.r-project.org/web/packages/gpuR/vignettes/gpuR.pdf

To install `gpuR`, you use:

`install.packages("gpuR")`

After loading the library, you can check, for example, how many valid GPUs you have:

```
library(gpuR)
detectGPUs()
```

To create `gpu` or `vcl` objects, you use the `gpuMatrix` and `vclMatrix` commands, specifying the data type.

```
x <- matrix(rnorm(16), 4, 4)
# creating a gpu object
gpuX <- gpuR::gpuMatrix(x, type = "float")
# creating a vcl object
vclX <- gpuR::vclMatrix(x, type = "float")
```

The advantage of the `gpu` class is that it does not copy the data; the object works as a pointer. To understand this, let's try the following code. If you do not have the `pryr` package, you first need to install it and load the library.

```
# install.packages("pryr")  # run once if not installed
library(pryr)
x <- matrix(rnorm(16), 4, 4)
address(x)
x[1, 1] <- 0
address(x)
```

First, we obtain the address of the matrix `x` in host memory. Then, when we modify the R object (in this case the matrix `x`), R copies the object, which is why we get a different location in host memory.

```
gpuX <- gpuR::gpuMatrix(x, type="float")
gpuX@address
gpuX[1,1] <- 0
gpuX[1,1]
gpuX@address
```

Now, when we create a gpu object through `gpuMatrix` and modify it, no copy is made, because gpuR uses pointers to host memory. The code is therefore more memory-efficient.

`gpuR` offers most of the linear algebra operations (such as `%*%`, `+`, `-`, `*`, `/`, `t`, `crossprod`, `tcrossprod`), math and trig functions, and some extra functions like `dist` and `eigen`.

Let's try matrix multiplication and analyze the performance of the two different types of gpu objects. First, generate a matrix `x` in R, compute the product `x %*% x`, and time it.

```
nr<-1000
x<-matrix(rnorm(nr*nr,0,1),nrow=nr,ncol=nr)
time1<-system.time({
mm1 <- x %*% x
})
```

Let's compare this time to the times obtained using `gpu` and `vcl` objects.

```
gpuX = gpuR::gpuMatrix(x, type="float") #point GPU to matrix
time2<-system.time({
mm2<-gpuX %*% gpuX
})
vclX = gpuR::vclMatrix(x, type="float") #push matrix to GPU
time3<-system.time({
mm3<-vclX %*% vclX
})
```

Recall that when we use `gpuMatrix`, the GPU works through a pointer to CPU memory. When we use `vclMatrix`, we push the matrix to the device (GPU) memory.

We compare the results:

| | user.self | sys.self | elapsed |
|---|---|---|---|
| CPU | 0.111 | 0.003 | 0.018 |
| CPU-GPU | 0.019 | 0.015 | 0.774 |
| GPU | 0.006 | 0.000 | 0.005 |

We see here that the GPU timing is much less than the CPU-GPU timing, as expected. Depending on the computer, the CPU timing may be higher or lower than the GPU (vcl object) timing for this matrix size. However, as the matrix grows, the advantage of GPU over CPU becomes clearer, and the advantage of `vclMatrix` over `gpuMatrix` becomes more pronounced as the data-transfer penalty grows.

We know that solving a system of equations is MUCH faster than explicitly inverting a matrix, and it can often resolve numerical issues involving values close to zero.
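As a quick base-R illustration of this point (no GPU involved), solving the normal equations directly gives the same coefficients as inverting and then multiplying; a minimal sketch:

```r
# Solve the normal equations directly vs. invert-then-multiply.
set.seed(1)
n <- 2000; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- X %*% runif(p) + rnorm(n)

XtX <- crossprod(X)      # X'X
Xty <- crossprod(X, y)   # X'y

b1 <- solve(XtX, Xty)    # preferred: solve the linear system directly
b2 <- solve(XtX) %*% Xty # discouraged: explicit inverse, then multiply

all.equal(b1, b2)        # TRUE (same coefficients up to rounding)
```

Wrapping each line in `system.time()` on your own machine will typically show the direct solve winning, with the gap growing in `p`.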

`gpuR` offers the `solve` function, which appears to outperform the base R function in runtime when the matrix is big. To observe this, we can use the following code to generate the data, create the gpu objects, and then compare the results.

```
np<-500 #number of predictors
nr<-10000 #number of observations
X<-cbind(5,1:nr,matrix(rnorm((np-1)*nr,0,.01),nrow=nr,ncol=(np-1)))
beta<-matrix(c(1,3,runif(np-1,0,0.2)),ncol=1)
y<-X%*%beta+matrix(rnorm(nr,0,1),nrow=nr,ncol=1)
time1<-system.time({
ms2<-solve(crossprod(X), crossprod(X, y))
})
#GPU version, GPU pointer to CPU memory!!
gpuX = gpuR::gpuMatrix(X, type="float") #point GPU to matrix
gpuy = gpuR::gpuMatrix(y, type="float")
time2<-system.time({
ms4<-gpuR::solve(gpuR::crossprod(gpuX), gpuR::crossprod(gpuX, gpuy))
})
#GPU version, in GPU memory!! (vclMatrix formation is a memory transfer)
vclX = gpuR::vclMatrix(X, type="float")
vcly = gpuR::vclMatrix(y, type="float")
time3<-system.time({
ms5<-gpuR::solve(gpuR::crossprod(vclX), gpuR::crossprod(vclX, vcly))
})
```

| | user.self | sys.self | elapsed |
|---|---|---|---|
| CPU | 0.149 | 0.004 | 0.032 |
| CPU-GPU | 0.047 | 0.050 | 0.911 |
| GPU | 0.013 | 0.010 | 0.091 |

We observe the same pattern as with matrix multiplication: the GPU (vcl object) outperforms CPU-GPU (gpu object), and if we make the matrix bigger it also outperforms the base R function. We observe this in the following figure.

`gpuR` also offers functions such as `cov` and `eigen`. However, as can be seen below, they do not seem to provide faster performance than the base R functions.

```
nr <- 10000
nc <- 2000
x<-matrix(rnorm(nr*nc,0,1),nrow=nr,ncol=nc)
system.time(m1 <- eigen(cov(x), symmetric = TRUE))
```

```
## user system elapsed
## 23.805 0.289 23.143
```

```
gpuX <- gpuR::vclMatrix(x, type="float")
system.time(m2 <- gpuR::eigen(gpuR::cov(gpuX), symmetric = TRUE))
```

```
## user system elapsed
## 0.826 2.118 53.279
```

We have used the bootstrap function in class a number of times now. In this section we evaluate the performance of the GPU through the bootstrap function.

First, we generate the following data:

```
## generate data for bootstrap
n <- 10000
d <- 800
X <- 1/matrix(rt(n*d, df=1), ncol=d)
beta <- c(1:d, 0)
Y <- beta[1] + X %*% beta[-1] + rnorm(n, sd=3)
B <- 1000
```

Then, we run the bootstrap using gpu and vcl objects. We can use the following code:

```
time3 <- system.time({
beta.gpu <- matrix(NA, nrow=B, ncol=ncol(X))
for(b in 1:B) {
i <- sample(1:n, n, replace=TRUE)
Xb <- gpuR::gpuMatrix(X[i,], type="float")
gpuy <- as.matrix(Y[i])
yb <- gpuR::gpuMatrix(gpuy, type="float")
beta.gpu[b,] <- drop(gpuR::solve(gpuR::crossprod(Xb),gpuR::crossprod(Xb,yb)))[1:ncol(X)]
}
})[3]
time4 <- system.time({
beta.vcl <- matrix(NA, nrow=B, ncol=ncol(X))
for(b in 1:B) {
i <- sample(1:n, n, replace=TRUE)
Xb <- gpuR::vclMatrix(X[i,], type="float")
gpuy <- as.matrix(Y[i])
yb <- gpuR::vclMatrix(gpuy, type="float")
beta.vcl[b,] <- drop(gpuR::solve(gpuR::crossprod(Xb),gpuR::crossprod(Xb,yb)))[1:ncol(X)]
}
})[3]
```

Then, we compare the R function using the crossprod method with the GPU using both object types. Here are the results (elapsed seconds) for different data settings.

| | n,d = 5000, 200 | n,d = 10000, 500 | n,d = 10000, 800 |
|:-----|------:|-------:|-------:|
| bootols.R(X, Y, B=1000, method="cp") | 29.267 | 244.572 | 452.036 |
| using gpu object | 47.415 | 172.152 | 320.708 |
| using vcl object | 41.742 | 138.485 | 255.296 |

We see that the timing performance of the GPU is better when we have a bigger dataset. We also observe that using vcl objects is more efficient than using gpu objects, as seen before.
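For reference, here is a minimal plain-R sketch of the cross-product bootstrap the table compares against. The `bootols.R` helper itself is not shown in this section, so the function name and shape below are assumptions, not the course code:

```r
# Minimal CPU bootstrap of OLS via the cross-product method (a sketch;
# boot_ols_cp is a hypothetical name, not the course's bootols.R).
boot_ols_cp <- function(X, Y, B = 100) {
  n <- nrow(X)
  betas <- matrix(NA_real_, nrow = B, ncol = ncol(X))
  for (b in 1:B) {
    i  <- sample(n, n, replace = TRUE)      # resample rows with replacement
    Xb <- X[i, , drop = FALSE]
    Yb <- Y[i]
    betas[b, ] <- drop(solve(crossprod(Xb), crossprod(Xb, Yb)))
  }
  betas
}

# small example
set.seed(2)
X  <- cbind(1, matrix(rnorm(200 * 3), 200, 3))
Y  <- X %*% c(1, 2, -1, 0.5) + rnorm(200)
bs <- boot_ols_cp(X, Y, B = 50)
dim(bs)  # 50 x 4: one row of coefficient estimates per bootstrap replicate
```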

**NewRiver** comprises 173 nodes. It contains six compute engines designed for distinct workloads; one of the engines provides GPUs (NVIDIA P100) for code acceleration.

Log into the NewRiver login node with your PID and PID password using 2 factor authentication:

```
(a) Mac/Linux: Open a terminal and type:
ssh YOURPID@newriver1.arc.vt.edu
(b) Windows: Download PuTTY and in the Host Name field type:
YOURPID@newriver1.arc.vt.edu
```

When you are prompted for a password, enter your password followed by a comma, then one of the following keywords for your 2FA second factor:

```
auto
phone
phone2
Yubikey code (if you have one)
Sms code (if you have one)
```

Here are some examples of what to enter for your password:

```
yourpassword,auto (this will use your default second factor such as Duo push)
yourpassword,phone (this will use the phone number you have chosen for 2FA)
```

Once your credentials are verified, you will be presented with your command prompt.

There are some key commands to get a job on a GPU node and load the required modules:

```
interact -ApredictHPC -lnodes=1:ppn=14:gpus=1 -q p100_normal_q -ladvres=GPU_TRAINING.102
module purge
module load cuda/8.0.61 gcc/5.2.0 openblas
module load R/3.4.1 R-gpu/3.4.1
```

Now you are ready to take advantage of GPU computing to speed up your work.

In this section, we are guiding you through the use of CUDA libraries. In our presentation, we have gone through **cuRAND** and **cuBLAS**. Now, let us take a look at another example of **cuBLAS**. In addition, we will illustrate a working example of the use of **cuSOLVER**.

In the class presentation, we showed how to create two matrices on the GPU and perform matrix multiplication there. Here, we show another example: passing two existing matrices to the GPU and finishing the job there.

```
extern "C"
void cuMM(int *nr_A, int *nc_A, int *nc_B, double *A, double *B, double *C, double *a, double *b)
{
/* set up variables */
const double alpha = (double) *a;
const double beta = (double) *b;
/* create a handle for cuBLAS */
cublasHandle_t handle;
cublasCreate(&handle);
/* allocate 3 arrays on GPU */
double *d_A, *d_B, *d_C;
cudaMalloc(&d_A,*nr_A * *nc_A * sizeof(double));
cudaMalloc(&d_B,*nc_A * *nc_B * sizeof(double));
cudaMalloc(&d_C,*nr_A * *nc_B * sizeof(double));
/* Copy CPU data to GPU */
cudaMemcpy(d_A, A, *nr_A * *nc_A * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, *nc_A * *nc_B * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_C, C, *nr_A * *nc_B * sizeof(double), cudaMemcpyHostToDevice);
/* Multiply A and B on GPU */
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, *nr_A, *nc_B, *nc_A, &alpha, d_A, *nr_A, d_B, *nc_A, &beta, d_C, *nr_A);
/* Copy the data back to CPU */
cudaMemcpy(C,d_C,*nr_A * *nc_B * sizeof(double),cudaMemcpyDeviceToHost);
/* Free GPU memory */
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
/* destroy the cuBLAS handle */
cublasDestroy(handle);
}
```

This example may have more practical application than the one in our class presentation. Next, let us focus on the **cuSOLVER**.

The **cuSOLVER** library is built upon **cuBLAS** and **cuSPARSE** libraries, and it comprises three libraries:

- **cuSolverDN**: solves dense linear systems of the form \(Ax=b\). It provides QR factorization and LU with partial pivoting to handle a general matrix \(A\), which may be non-symmetric; Cholesky factorization for symmetric/Hermitian matrices; and the Bunch-Kaufman (LDL) factorization for symmetric indefinite matrices. Singular Value Decomposition (SVD) and other bidiagonalization routines are also provided.
- **cuSolverSP**: solves sparse linear systems of the form \(Ax=b\) and the least-squares problem \(x=\mathrm{argmin}||A*z-b||\). The core algorithm is based on sparse QR factorization.
- **cuSolverRF**: accelerates the solution of sets of linear systems by fast re-factorization when given new coefficients in the same sparsity pattern \(A_ix_i=f_i\), where a sequence of coefficient matrices \(A_i\in R^{n\times n}\), right-hand sides \(f_i\in R^n\), and solutions \(x_i\in R^n\) are given for \(i=1,...,k\).

**cuSolverDN** is the most popular one, and therefore, it is our focus. The *float*, *double*, *cuComplex*, and *cuDoubleComplex* data types are supported. The first two are standard C data types, while the last two are called from *cuComplex.h*. In addition, **cuSolverDN** uses some similar types from **cuBLAS**, such as **cusolverDnHandle_t**, **cublasFillMode_t**, **cublasOperation_t**, **cusolverEigType_t**, **cusolverEigMode_t**, and **cusolverStatus_t**.

Below is a working example that does \((X^\top X)^{-1}\).

First of all, we need those magic header files.

```
#include <R.h>
#include <stdio.h>
#include <cstdlib>
#include <curand.h>
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <cuda_runtime.h>
```

The skeleton is as follows:

```
extern "C"
void cuMM(int *nr_A, int *nc_A, int *nc_B, double *A, double *B, double *C,
double *X, double *a, double *b)
{
/* part 1 */
...
/* part 2 */
...
/* part 3 */
...
/* part 4 */
...
}
```

Part 1 is straightforward but critical:

```
/* set up variables*/
const double alpha = (double) *a;
const double beta = (double) *b;
int *d_Ipiv = NULL; // pivoting sequence
int lwork = 0; // size of workspace
double *d_work = NULL; // device workspace for getrf
int *d_info;
/* create a handle for cuBLAS */
cublasHandle_t cublasH;
cublasCreate(&cublasH);
cusolverDnHandle_t cusolverH;
cusolverDnCreate(&cusolverH);
/* Allocate 4 arrays on GPU */
double *d_A, *d_B, *d_C;
cudaMalloc(&d_A,*nr_A * *nc_A * sizeof(double));
cudaMalloc(&d_B,*nc_A * *nc_B * sizeof(double));
cudaMalloc(&d_C,*nr_A * *nc_A * sizeof(double));
cudaMalloc((void**)&d_Ipiv, sizeof(int) * *nr_A);
cudaMalloc((void**)&d_info, sizeof(int));
```

Part 2 is illustrated as follows:

```
/* for solver, need to set aside some memory space and put it on device */
cusolverDnDgetrf_bufferSize(cusolverH, *nr_A, *nc_A, d_A, *nr_A, &lwork);
cudaMalloc((void**)&d_work, sizeof(double)*lwork);
/* Copy CPU data to GPU */
cudaMemcpy(d_A, A, *nr_A * *nc_A * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, *nc_A * *nc_B * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_C, C, *nr_A * *nc_A * sizeof(double), cudaMemcpyHostToDevice);
```

It is Part 3’s turn:

```
/* calculate C = A'A (C is nc_A x nc_A) */
cublasDgemm(cublasH, CUBLAS_OP_T, CUBLAS_OP_N, *nc_A, *nc_A, *nr_A, &alpha, d_A, *nr_A, d_A, *nr_A, &beta, d_C, *nc_A);
/* get C back; it gets overwritten when solving */
cudaMemcpy(C,d_C,*nc_A * *nc_A * sizeof(double),cudaMemcpyDeviceToHost);
/* solve for X in CX=B where B is an identity matrix */
cusolverDnDgetrf(cusolverH, *nc_A, *nc_A, d_C, *nc_A, d_work, d_Ipiv, d_info);
cusolverDnDgetrs(cusolverH, CUBLAS_OP_N, *nc_A, *nc_B, d_C, *nc_A, d_Ipiv, d_B, *nc_A, d_info);
```

Finally, let us go through the last part:

```
/* Copy the X back to CPU (note that it is in d_B because solve overwrites it) */
cudaMemcpy(X,d_B,*nc_A * *nc_B * sizeof(double),cudaMemcpyDeviceToHost);
/* Free GPU memory */
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
if(cublasH) cublasDestroy(cublasH);
if(cusolverH) cusolverDnDestroy(cusolverH);
cudaDeviceReset();
```

The last step is to implement a wrapper function, and to compile it in the same way as illustrated in our presentation.
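As a hedged sketch only, the R side of such a wrapper might look like the following. The guard, the function name `xtx_inverse_gpu`, and the shared-object comment are assumptions for illustration; the course's actual wrapper lives in `wrap.R`:

```r
# Hypothetical R wrapper for the cuMM/cuSOLVER routine sketched above.
# Compile the .cu file into a shared object and dyn.load() it before calling.
xtx_inverse_gpu <- function(A) {
  nr <- nrow(A); nc <- ncol(A)
  B  <- diag(nc)  # identity right-hand side, so solving CX = B yields (A'A)^{-1}
  if (!is.loaded("cuMM"))
    stop("compile and dyn.load() the CUDA shared object first")
  out <- .C("cuMM",
            nr_A = as.integer(nr), nc_A = as.integer(nc), nc_B = as.integer(nc),
            A = as.double(A), B = as.double(B),
            C = double(nc * nc),          # filled with X'X on return
            X = double(nc * nc),          # filled with (X'X)^{-1} on return
            a = as.double(1), b = as.double(0))
  list(XtX = matrix(out$C, nc, nc),
       inv = matrix(out$X, nc, nc))
}
```

The `.C` interface passes every argument as a pointer, which matches the `int *` / `double *` signature of the `extern "C"` function above.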

For a complete example of how this works, look in the testR_BLAS directory in the tar file given by Bobby. Inside this directory, you will see both `cuBLAS_MM_d.cu` and `wrap.R`. If, on a NewRiver GPU node, you type `bash gpu_cuBLAS.sh`, you will see this example run. Note that the shell script loads the appropriate modules.

- You need to get access to an interactive session on NewRiver:

```
## Get job on gpu node:
interact -Aascclass -lnodes=1:ppn=14:gpus=1 -q p100_normal_q
## load required modules
module purge
module load cuda/8.0.61 gcc/5.2.0 openblas
module load R/3.4.1 R-gpu/3.4.1
```

Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), etc.

- **OpenCL package in R**
- **OpenCL R library manual**

Unfortunately, for Windows users, it is not available in the latest R version!

On the ARC servers, you can use R with the OpenCL library.

The OpenCL library in R can wrap the OpenCL code and allow us to use GPU computing.

`oclPlatforms` obtains information about the computing platforms on your machine (NVIDIA, AMD, Intel, etc.), and `oclDevices` sets the GPU device we want to use.

```
########## OpenCL Code #############
library(OpenCL)
p = oclPlatforms() ## gets computing platform on server
d = oclDevices(p[[1]]) ## sets GPU device we want to use
print(p)
print(d)
```

Source: https://gist.github.com/mprymek/8ca298f0ff2b139b0c63

```
library(OpenCL)
p = oclPlatforms() #gets computing platform
d = oclDevices(p[[1]]) #sets GPU device we want to use
vector_add_code <- c("
/* the 'double' type is enabled by this statement */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void gpu_sum(
// first two args are mandatory
__global float* output,
const unsigned int n,
// user args
__global const float* input1,
__global const float* input2)
{
int i = get_global_id(0);
if (i<n) output[i] = input1[i] + input2[i];
};")
```

The OpenCL code is wrapped by `oclSimpleKernel`, which creates the R function `k.gpu.sum`. You can then run this function using `oclRun`.

```
## Make R function "k.gpu.sum" by wrapping the OpenCL code with oclSimpleKernel.
k.gpu.sum <- oclSimpleKernel(d[[1]], "gpu_sum", vector_add_code, "best")
# example
vector1 <- as.double(1:10)
vector2 <- as.double(2:11)
# run the GPU code and get result
result <- oclRun(k.gpu.sum, length(vector1), vector1, vector2)
print(result)
```