- Introduction
- What/Why GPUs
- Basic R-GPU computing examples
  - matrix multiply
  - Solve
- GPU with external libraries
- Coding GPUs and memory types
- Writing/wrapping your own CUDA/OpenCL code
- Profiling GPU use


If you need your code to go faster, you have a few options:

- Be a more efficient programmer
  - use vectorized operations (see the sketch after this list)
  - remove redundant operations
  - manage memory (copy/preallocate)
- Port your code to C/C++/Fortran
  - the full monty (.C or .Call)
  - Rcpp
- Parallelize, i.e., use more cores
  - parallel packages
  - MPI
  - …and *GPUs*
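
The vectorization point is easy to see in a few lines (a minimal illustration, not from the talk):

```r
# Vectorized vs. loop: summing squares of a million numbers.
x <- rnorm(1e6)

system.time({
    s1 <- 0
    for (i in seq_along(x)) s1 <- s1 + x[i]^2  # explicit loop in R: slow
})

system.time(s2 <- sum(x^2))  # single vectorized call into compiled code
```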

- CPUs have 2-20 relatively fast cores that can execute completely independently of each other.
  - They process data sequentially, in the order in which the commands are entered.
  - Think of the CPU powering a word processor.

GPU == graphics processing unit, originally developed for video signal processing.

GPGPU == general-purpose graphics processing unit, the use of a GPU for computation other than video.

- GPUs have thousands of relatively slow cores.
- They handle large amounts of computation at the same time
  - e.g., matrix multiplication
- They have significantly faster and more advanced memory interfaces.

There are four ways to access a GPU through R:

- an R GPU package (prebuilt library)
  - expanding but limited number of functions available
  - sometimes difficult to install
  - sometimes you need something like `export PKG_CFLAGS=-I$CUDA_INC` (see the sketch after this list)
- use a CUDA/OpenCL library with the function you are interested in
  - limited number of libraries
- use directives (not in today's talk)
  - PRAGMA directives that let the compiler make the decisions
  - OpenACC
- write CUDA/OpenCL code and an associated R wrapper
  - more power = more effort = more reward
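
For the `PKG_CFLAGS` wrinkle above, the same effect can be had from within R before installing (a sketch; the CUDA path is an assumption, adjust for your system):

```r
# Hypothetical install sketch: point the compiler at the CUDA headers.
# "/usr/local/cuda/include" is an assumed path; substitute your own.
Sys.setenv(PKG_CFLAGS = "-I/usr/local/cuda/include")
install.packages("gpuR")
```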

- Disclaimer: partial list!

- gputools (CUDA and cuBLAS)
- cudaBayesreg (CUDA)
- HiPLARM (PLASMA/MAGMA from UTK, CUDA and BLAS)
- HiPLARb
- gmatrix
- gpuR (OpenCL, ViennaCL)
- gpuRcuda (CUDA)
- rpud (rpudplus add-on)
- RCUDA, OpenCL, and ROpenCL

https://cran.r-project.org/web/views/HighPerformanceComputing.html

- GPU version, GPU pointer to CPU memory! A `gpuMatrix` is simply a pointer.

```r
suppressMessages(library(gpuR))

nr <- 5000
x <- matrix(rnorm(nr * nr, 0, 1), nrow = nr, ncol = nr)

# CPU version
time1 <- system.time({
    mm1 <- x %*% x
})

# GPU version: the data stays in CPU memory
gpuX = gpuR::gpuMatrix(x, type = "float")  # point GPU to matrix
time2 <- system.time({
    mm2 <- gpuX %*% gpuX
})
```

- GPU version, in GPU memory! A `vclMatrix` formation is a memory transfer.

```r
# GPU version: the data is pushed into GPU memory
vclX = gpuR::vclMatrix(x, type = "float")  # push matrix to GPU
time3 <- system.time({
    mm3 <- vclX %*% vclX
})
```
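
To compare the GPU results against the CPU result, they first have to come back to host memory; a small sketch (assuming gpuR's `as.matrix()` methods), remembering that `type = "float"` means single precision:

```r
# Pull GPU results back to an ordinary R matrix and compare.
# Single precision ("float") means agreement to roughly 1e-5, not 1e-15.
all.equal(mm1, as.matrix(mm2), tolerance = 1e-05)
all.equal(mm1, as.matrix(mm3), tolerance = 1e-05)
```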

Timings (seconds):

|         | user.self | sys.self | elapsed |
|---------|-----------|----------|---------|
| CPU     | 15.894    | 0.109    | 2.236   |
| CPU-GPU | 0.958     | 0.250    | 4.051   |
| GPU     | 0.004     | 0.002    | 0.006   |

- As the matrix grows, the need for `vclMatrix` over `gpuMatrix` becomes more pronounced, as does the data-transfer penalty.

- As seen in class before:

\[
\begin{eqnarray*}
\hat{\beta} & = & (X'X)^{-1}X'y \quad \text{can be rewritten as} \\
(X'X)\hat{\beta} & = & X'y \quad \text{where we solve for } \hat{\beta}
\end{eqnarray*}
\]

Solving the system of equations is MUCH faster than explicitly inverting the matrix, and it often avoids numerical issues arising from values close to zero.

https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

https://www.r-bloggers.com/dont-invert-that-matrix-why-and-how/
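
A quick CPU-only check of that claim (a minimal sketch, not from the talk):

```r
# solve(A, b) vs. solve(A) %*% b on a symmetric positive definite system.
set.seed(1)
n <- 2000
A <- crossprod(matrix(rnorm(n * n), n, n))  # X'X-style SPD matrix
b <- rnorm(n)

system.time(x1 <- solve(A) %*% b)  # invert, then multiply: slower
system.time(x2 <- solve(A, b))     # solve the system directly: faster
all.equal(as.numeric(x1), x2, tolerance = 1e-06)
```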

```r
np <- 500    # number of predictors
nr <- 1e+05  # number of observations
X <- cbind(5, 1:nr, matrix(rnorm((np - 1) * nr, 0, 0.01), nrow = nr, ncol = (np - 1)))
beta <- matrix(c(1, 3, runif(np - 1, 0, 0.2)), ncol = 1)
y <- X %*% beta + matrix(rnorm(nr, 0, 1), nrow = nr, ncol = 1)

# CPU version
time1 <- system.time({
    ms2 <- solve(crossprod(X), crossprod(X, y))
})

# GPU version, GPU pointer to CPU memory!!
gpuX = gpuR::gpuMatrix(X, type = "float")  # point GPU to matrix
gpuy = gpuR::gpuMatrix(y, type = "float")
time2 <- system.time({
    ms4 <- gpuR::solve(gpuR::crossprod(gpuX), gpuR::crossprod(gpuX, gpuy))
})

# GPU version, in GPU memory!! (vclMatrix formation is a memory transfer)
vclX = gpuR::vclMatrix(X, type = "float")
vcly = gpuR::vclMatrix(y, type = "float")
time3 <- system.time({
    ms5 <- gpuR::solve(gpuR::crossprod(vclX), gpuR::crossprod(vclX, vcly))
})
```

Timings (seconds):

|         | user.self | sys.self | elapsed |
|---------|-----------|----------|---------|
| CPU     | 2.583     | 0.018    | 0.402   |
| CPU-GPU | 0.484     | 0.609    | 2.514   |
| GPU     | 0.017     | 0.020    | 0.737   |

- The most commonly used methods are `%*%`, `+`, `-`, `*`, `/`, `crossprod`, `tcrossprod`, and the trig functions, among others.
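
For instance, these methods dispatch directly on the GPU objects created above (a small sketch reusing `vclX` from the previous chunk; assumes gpuR's arithmetic and math methods):

```r
# Element-wise arithmetic and math on a GPU-resident matrix; each call
# dispatches to a gpuR method and runs on the device.
vclSum  <- vclX + vclX       # element-wise addition
vclProd <- vclX * vclX       # element-wise product (unlike %*%)
vclSin  <- sin(vclX)         # trig, element-wise
vclCp   <- crossprod(vclX)   # t(X) %*% X on the GPU
```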

```r
## generate data for bootstrap
n <- 5000  # number of observations (first configuration in the table below)
d <- 200   # number of predictors
X <- 1/matrix(rt(n * d, df = 1), ncol = d)
beta <- c(1:d, 0)
Y <- beta[1] + X %*% beta[-1] + rnorm(n, sd = 3)  # noise length must match n
```
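
`bootols.R` itself isn't reproduced here; as a rough idea of what the cross-product (`method="cp"`) variant does, a hypothetical sketch (the function body below is an assumption, not the class code):

```r
# Hypothetical sketch of a cross-product bootstrap OLS (NOT the actual
# bootols.R): resample rows B times, solve the normal equations each time.
bootols.sketch <- function(X, Y, B = 1000) {
    n <- nrow(X)
    Xb <- cbind(1, X)                      # intercept column
    betas <- matrix(NA, nrow = B, ncol = ncol(Xb))
    for (b in 1:B) {
        i <- sample(n, n, replace = TRUE)  # bootstrap resample
        Xi <- Xb[i, , drop = FALSE]
        betas[b, ] <- solve(crossprod(Xi), crossprod(Xi, Y[i]))
    }
    betas
}
```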

We compare:

- `bootols.R(X, Y, B=1000, method="cp")`
- `bootols(X, Y, B=1000)`
- an R function using the GPU

Timings (seconds):

|                                         | n,d = 5000, 200 | n,d = 10000, 500 | n,d = 10000, 800 |
|-----------------------------------------|-----------------|------------------|------------------|
| `bootols.R(X, Y, B=1000, method="cp")`  | 29.267          | 244.572          | 452.036          |
| `bootols(X, Y, B=1000)`                 | 7.912           | 38.662           | 86.476           |
| gpu as pointer                          | 47.415          | 172.152          | 320.708          |
| gpu in memory                           | 41.742          | 138.485          | 255.296          |