Today's Agenda

  • Introduction
    • What/Why GPUs
  • Basic R-GPU computing examples
    • matrix multiply
    • Solve
  • GPU with external libraries
  • Coding GPUs and memory types
  • Writing/wrapping your own CUDA/OpenCL code
  • Profiling GPU use

Introduction

If you need your code to go faster, you have a few options:

  • Be a more efficient programmer
    • use vectorized operations
    • remove redundant operations
    • manage memory (copy/preallocate)
  • Port your code to C/C++/Fortran
    • full Monte (.C or .Call)
    • Rcpp
  • Parallelize, ie use more cores
    • parallel packages
    • MPI
    • GPU
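To make the first option concrete, here is a small plain-R sketch contrasting an explicit loop with the equivalent vectorized operation (function names are illustrative):

```r
# Squaring a vector: loop vs. vectorized.
# Vectorized code runs in compiled C inside R, avoiding per-element
# interpreter overhead; preallocating `out` avoids repeated copying.
loop_sq <- function(v) {
  out <- numeric(length(v))      # preallocate the result
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}
vec_sq <- function(v) v^2        # one vectorized call

x <- rnorm(1e5)
stopifnot(all.equal(loop_sq(x), vec_sq(x)))
```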

What is a GPU

  • CPUs have 2-20 relatively fast cores that can execute completely independently of each other.
    • process data sequentially, in the order in which the commands are entered
    • e.g. powering a word processor
  • GPU == graphics processing unit, originally developed for video signal processing

  • GPGPU == general-purpose computing on graphics processing units, i.e. using a GPU for tasks other than video

  • GPUs have thousands of relatively slow cores
    • handle large amounts of compute at the same time, e.g. matrix multiplication
    • have significantly faster and more advanced memory interfaces

Why GPUs

Trend for CPU development.

Four ways to access GPUs through R

There are four ways to access a GPU through R:

  • R GPU package (prebuilt library)
    • expanding but limited number of functions available
    • sometimes difficult to install
    • sometimes need to do "export PKG_CFLAGS=-I$CUDA_INC"
  • use a CUDA/OpenCL library with the function you are interested in
    • limited number of libraries
  • use directives (not in today's talk)
    • PRAGMA directives allowing compiler to make decision
    • OpenACC
  • write CUDA/OpenCL code and associated R wrapper
    • more power = more effort = more reward

R-packages

Four ways to access GPUs through R. Source: Nvidia

R GPU packages

Basic GPU (Host-Device relationship)

Basic GPU computing

matrix multiply (1)

  • GPU version: a GPU pointer to CPU memory!
    • a gpuMatrix is simply a pointer; the data stays in host memory and is transferred for each operation.
suppressMessages(library(gpuR))
nr <- 5000
x <- matrix(rnorm(nr * nr, 0, 1), nrow = nr, ncol = nr)
time1 <- system.time({
    mm1 <- x %*% x
})
gpuX <- gpuR::gpuMatrix(x, type = "float")  # point the GPU at the matrix
time2 <- system.time({
    mm2 <- gpuX %*% gpuX
})

Basic GPU computing

matrix multiply (2)

  • GPU version: in GPU memory!
    • vclMatrix formation is a memory transfer (host to device).
vclX <- gpuR::vclMatrix(x, type = "float")  # push matrix to GPU
time3 <- system.time({
    mm3 <- vclX %*% vclX
})

Basic GPU computing

matrix multiply results

matrix multiply timing

          user.self  sys.self  elapsed
CPU          15.894     0.109    2.236
CPU-GPU       0.958     0.250    4.051
GPU           0.004     0.002    0.006
  • As the matrix grows, the advantage of vclMatrix over gpuMatrix becomes more pronounced, as does the host-to-device data-transfer penalty.

Basic GPU computing

matrix multiply results

Timing for matrix multiply.

Basic GPU computing

Solve

Basic GPU computing

solve and crossprod

np <- 500  #number of predictors
nr <- 1e+05  #number of observations
X <- cbind(5, 1:nr, matrix(rnorm((np - 1) * nr, 0, 0.01), nrow = nr, ncol = np - 1))
beta <- matrix(c(1, 3, runif(np - 1, 0, 0.2)), ncol = 1)
y <- X %*% beta + matrix(rnorm(nr, 0, 1), nrow = nr, ncol = 1)
time1 <- system.time({
    ms2 <- solve(crossprod(X), crossprod(X, y))
})
# GPU version, GPU pointer to CPU memory!!
gpuX <- gpuR::gpuMatrix(X, type = "float")  # point the GPU at the matrix
gpuy <- gpuR::gpuMatrix(y, type = "float")
time2 <- system.time({
    ms4 <- gpuR::solve(gpuR::crossprod(gpuX), gpuR::crossprod(gpuX, gpuy))
})
# GPU version, in GPU memory!! (vclMatrix formation is a memory transfer)
vclX <- gpuR::vclMatrix(X, type = "float")
vcly <- gpuR::vclMatrix(y, type = "float")
time3 <- system.time({
    ms5 <- gpuR::solve(gpuR::crossprod(vclX), gpuR::crossprod(vclX, vcly))
})
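For reference, the solve(crossprod(X), crossprod(X, y)) pattern above is the normal-equations form of least squares; a small CPU-only check on toy data (not the talk's benchmark) that it agrees with a QR-based fit:

```r
# Normal equations (X'X) b = X'y vs. QR-based least squares on toy data
set.seed(1)
Xs <- cbind(1, matrix(rnorm(50 * 3), nrow = 50))  # 50 obs: intercept + 3 predictors
ys <- Xs %*% c(2, -1, 0.5, 3) + rnorm(50)
b_ne <- solve(crossprod(Xs), crossprod(Xs, ys))   # normal equations
b_qr <- qr.solve(Xs, ys)                          # QR decomposition
stopifnot(all.equal(as.vector(b_ne), as.vector(b_qr)))
```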

Basic GPU computing

solve and crossprod results

solve timing
          user.self  sys.self  elapsed
CPU           2.583     0.018    0.402
CPU-GPU       0.484     0.609    2.514
GPU           0.017     0.020    0.737
  • The most commonly used methods are available: %*%, +, -, *, /, crossprod, tcrossprod, and the trig functions, among others.

Basic GPU computing

solve and crossprod results

Timing for crossprod and friends.

Bootstrap function

## generate data for bootstrap
n <- 5000  # number of observations (first benchmark configuration)
d <- 200   # number of predictors
X <- 1/matrix(rt(n * d, df = 1), ncol = d)
beta <- c(1:d, 0)
Y <- beta[1] + X %*% beta[-1] + rnorm(n, sd = 3)

We compare:

  • bootols.R(X, Y, B=1000, method="cp")

  • bootols(X, Y, B=1000)

  • R functions using the GPU (as pointer and in memory)

                                      n,d = 5000,200   n,d = 10000,500   n,d = 10000,800
bootols.R(X, Y, B=1000, method="cp")          29.267           244.572           452.036
bootols(X, Y, B=1000)                          7.912            38.662            86.476
gpu as pointer                                47.415           172.152           320.708
gpu in memory                                 41.742           138.485           255.296
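bootols.R and bootols are the speaker's functions and are not shown here; a minimal, hypothetical plain-R sketch of the crossprod-based bootstrap loop they benchmark might look like this (all names illustrative):

```r
# Hypothetical sketch of a crossprod-based OLS bootstrap (plain R, CPU only).
boot_ols_cp <- function(X, Y, B = 100) {
  Xi <- cbind(1, X)                            # add intercept column
  n <- nrow(Xi); p <- ncol(Xi)
  coefs <- matrix(NA_real_, nrow = B, ncol = p)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)    # resample rows with replacement
    Xb <- Xi[idx, , drop = FALSE]
    Yb <- Y[idx]
    coefs[b, ] <- solve(crossprod(Xb), crossprod(Xb, Yb))  # normal equations
  }
  coefs                                        # B x p matrix of bootstrap estimates
}

set.seed(2)
n <- 200; d <- 3
X <- matrix(rnorm(n * d), ncol = d)
Y <- as.vector(1 + X %*% seq_len(d) + rnorm(n))
est <- boot_ols_cp(X, Y, B = 50)
```

Each of the benchmarked variants follows this same structure; the GPU versions replace the solve/crossprod calls with their gpuR equivalents on gpuMatrix or vclMatrix objects.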

Accessing CUDA libraries via C