A brief introduction to using R for high-performance computing

George G. Vega Yon
ggvy.cl
USC Integrative Methods of Analysis for Genetic Epidemiology (IMAGE)
Department of Preventive Medicine

November 12, 2018

Agenda

  1. High-Performance Computing: An overview

  2. Parallel computing in R

  3. Extended examples

High-Performance Computing: An overview

Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:

  1. Big data: How to work with data that doesn’t fit in your computer

  2. Parallel computing: How to take advantage of multi-core systems

  3. Compiled code: Write your own low-level code (if R doesn’t have it yet…)

(Check out the CRAN Task View on HPC)

Big Data

  • Buy a bigger computer/more RAM (not the best solution!)

  • Use out-of-memory storage, i.e., don’t load all your data into RAM, e.g. the bigmemory, data.table, and HadoopStreaming R packages

  • Use efficient algorithms designed for big data, e.g.: biglm, biglasso

  • Store it more efficiently, e.g.: sparse matrices (take a look at the dgCMatrix objects from the Matrix R package; see the sketch below)
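
For instance, here is a minimal sketch (the matrix and its dimensions are made up for illustration) of how much memory a sparse representation can save when most entries are zero:

library(Matrix)

set.seed(123)

# A mostly-zero matrix: only ~1% of the cells are non-zero
x <- matrix(0, nrow = 1e4, ncol = 100)
x[sample(length(x), 1e4)] <- rnorm(1e4)

# The same data stored as a sparse dgCMatrix
x_sparse <- Matrix(x, sparse = TRUE)
class(x_sparse)

# Only the non-zero entries (and their positions) are stored
object.size(x)        # dense representation
object.size(x_sparse) # a fraction of the memory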

Parallel computing

We will be focusing on the Single Instruction stream, Multiple Data stream (SIMD) model of parallel computing.

Parallel computing

Serial vs Parallel

source: Blaise Barney, Introduction to Parallel Computing, Lawrence Livermore National Laboratory

Parallel computing

source: Blaise Barney, Introduction to Parallel Computing, Lawrence Livermore National Laboratory

Some vocabulary for HPC

In raw terms

  • Supercomputer: A single big machine with thousands of cores/GPUs.

  • High Performance Computing (HPC): Multiple machines within a single network.

  • High Throughput Computing (HTC): Multiple machines across multiple networks.

You may not have access to a supercomputer, but HPC/HTC clusters are certainly more accessible these days; e.g., AWS provides a service to create HPC clusters at a low cost (allegedly, since nobody understands how pricing works).

GPU vs CPU

  • Why use OpenMP if GPUs are suited to compute-intensive operations? Well, mostly because OpenMP is VERY easy to implement (much easier than CUDA, which is itself the easiest way to use a GPU).

Let’s think before we start…

When is it a good idea to go HPC?

When is it a good idea?

Ask yourself these questions before jumping into HPC!

Parallel computing in R

While there are several alternatives (just take a look at the High-Performance Computing Task View), we’ll focus on the following R-packages for explicit parallelism:

  • parallel: R package that provides ‘[s]upport for parallel computation, including random-number generation’.

  • future: ‘[A] lightweight and unified Future API for sequential and parallel processing of R expression via futures.’

  • Rcpp + OpenMP: Rcpp is an R package for integrating R with C++, and OpenMP is a library for high-level parallelism for C/C++ and FORTRAN.

Others but not used here

  • foreach for iterating through lists in parallel.

  • Rmpi for creating MPI clusters.

There are also tools for implicit parallelism (out-of-the-box tools that allow the programmer not to worry about parallelization), and a ton of other resources, notably tools for working with batch schedulers such as Slurm and HTCondor.


U ready for speed?!?!?!

Parallel workflow

(Usually) We do the following:

  1. Create a PSOCK/FORK (or other) cluster using makePSOCKcluster/makeForkCluster (or makeCluster)

  2. Copy/prepare each R session (if you are using a PSOCK cluster):

    1. Copy objects with clusterExport

    2. Pass expressions with clusterEvalQ

    3. Set a seed

  3. Do your call: parApply, parLapply, etc.

  4. Stop the cluster with stopCluster (a minimal sketch of these four steps follows)
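
Putting those four steps together, a minimal sketch (the objects and the function applied are just placeholders):

library(parallel)

# 1. Create a cluster with 4 R sessions
cl <- makePSOCKcluster(4)

# 2. Prepare each session (needed for PSOCK clusters)
x <- runif(100)                  # an object living in the master session
clusterExport(cl, "x")           # copy objects to the workers
clusterEvalQ(cl, library(stats)) # run expressions on each worker
clusterSetRNGStream(cl, 123)     # set a parallel-safe seed

# 3. Do your call
ans <- parSapply(cl, 1:4, function(i) mean(sample(x, 10)))

# 4. Stop the cluster
stopCluster(cl)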

Types of clusters: PSOCK

  • Can be created with makePSOCKcluster

  • Creates brand new R Sessions (so nothing is inherited from the master), e.g.

    # This creates a cluster with 4 R sessions
    cl <- makePSOCKcluster(4)
  • Child sessions are connected to the master session via Socket connections

  • Can be created outside of the current computer, i.e. across multiple computers! For example:
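
A sketch: the host names below are hypothetical, and each machine must be reachable via (passwordless) SSH and have R installed:

    # A PSOCK cluster spanning multiple machines
    cl <- makePSOCKcluster(c("localhost", "node01", "node02"))
    # ... use it with parSapply/parLapply as usual ...
    stopCluster(cl)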

Types of clusters: Fork

  • Created with makeForkCluster:

  • Uses OS forking,

  • Copies the current R session locally (so everything is inherited from the master up to that point).

  • Data is only duplicated if it is altered (copy-on-write; worth double-checking when this happens!)

  • Not available on Windows.

Other types: makeCluster can also create cluster types provided by the snow package (Simple Network of Workstations).
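
Not sure how many cores your machine has? The parallel package ships with detectCores() (by default it counts logical cores, which may be twice the number of physical ones):

library(parallel)
detectCores() # number of (logical) cores available on this machine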

Ex 1: Parallel RNG with makePSOCKCluster

# 1. CREATING A CLUSTER
library(parallel)
nnodes <- 4L
cl     <- makePSOCKcluster(nnodes)    

# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`

# 3. DO YOUR CALL
ans <- parSapply(cl, 1:nnodes, function(x) runif(1e3))
(ans0 <- var(ans))
##               [,1]          [,2]          [,3]          [,4]
## [1,]  0.0861888293 -0.0001633431  5.939143e-04 -3.672845e-04
## [2,] -0.0001633431  0.0853841838  2.390790e-03 -1.462154e-04
## [3,]  0.0005939143  0.0023907904  8.114219e-02 -4.714618e-06
## [4,] -0.0003672845 -0.0001462154 -4.714618e-06  8.467722e-02

Making sure it is reproducible

# I want to get the same!
clusterSetRNGStream(cl, 123)
ans1 <- var(parSapply(cl, 1:nnodes, function(x) runif(1e3)))

# 4. STOP THE CLUSTER
stopCluster(cl)

all.equal(ans0, ans1) # All equal!
## [1] TRUE

Ex 2: Parallel RNG with makeForkCluster

In the case of makeForkCluster:

# 1. CREATING A CLUSTER
library(parallel)

# The fork cluster will copy the -nsims- object
nsims  <- 1e3
nnodes <- 4L
cl     <- makeForkCluster(nnodes)    

# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123)

# 3. DO YOUR CALL
ans <- do.call(cbind, parLapply(cl, 1:nnodes, function(x) {
  runif(nsims) # Look! we use the nsims object!
               # This would have failed with makePSOCKcluster
               # if we hadn't copied -nsims- first.
  }))

(ans0 <- var(ans))
##               [,1]          [,2]          [,3]          [,4]
## [1,]  0.0861888293 -0.0001633431  5.939143e-04 -3.672845e-04
## [2,] -0.0001633431  0.0853841838  2.390790e-03 -1.462154e-04
## [3,]  0.0005939143  0.0023907904  8.114219e-02 -4.714618e-06
## [4,] -0.0003672845 -0.0001462154 -4.714618e-06  8.467722e-02

Again, we want to make sure this is reproducible

# Same sequence with same seed
clusterSetRNGStream(cl, 123)
ans1 <- var(do.call(cbind, parLapply(cl, 1:nnodes, function(x) runif(nsims))))

ans0 - ans1 # A matrix of zeros
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    0
## [4,]    0    0    0    0
# 4. STOP THE CLUSTER
stopCluster(cl)

Well, if you are a macOS/Linux user, there’s a simpler way of doing this…

Ex 3: Parallel RNG with mclapply (Forking on the fly)

In the case of mclapply, the forking (cluster creation) is done on the fly!

# 1. CREATING A CLUSTER
library(parallel)

# The fork cluster will copy the -nsims- object
nsims  <- 1e3
nnodes <- 4L
# cl     <- makeForkCluster(nnodes) # mclapply does it on the fly

# 2. PREPARING THE CLUSTER
set.seed(123) 

# 3. DO YOUR CALL
ans <- do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims)))

(ans0 <- var(ans))
##              [,1]        [,2]         [,3]         [,4]
## [1,]  0.085384184 0.002390790  0.006576204 -0.003998278
## [2,]  0.002390790 0.081142190  0.001846963  0.001476244
## [3,]  0.006576204 0.001846963  0.085175347 -0.002807348
## [4,] -0.003998278 0.001476244 -0.002807348  0.082425477

Once more, we want to make sure this is reproducible

# Same sequence with same seed
set.seed(123) 
ans1 <- var(do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims))))

ans0 - ans1 # A matrix of zeros
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    0
## [4,]    0    0    0    0
# 4. STOP THE CLUSTER
# stopCluster(cl) # no need to do this anymore
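
One caveat worth keeping in mind: by default mclapply spawns getOption("mc.cores", 2L) forked processes, so to actually use the four cores here we could pass mc.cores explicitly, e.g.:

ans <- do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims), mc.cores = nnodes))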

RcppArmadillo + OpenMP

RcppArmadillo and OpenMP

  • Friendlier than RcppParallel… at least for ‘I-use-Rcpp-but-don’t-actually-know-much-about-C++’ users (like myself!).

  • Must run only thread-safe calls, so calling R within parallel blocks can cause problems (almost all the time).

  • Use arma objects, e.g. arma::mat, arma::vec, etc., or, if you are used to them, std::vector objects, as these are thread-safe.

  • Pseudo-random number generation is not very straightforward… but C++11 has a nice set of functions that can be used together with OpenMP (see the sketch after this list).

  • You need to think about how processors work, cache memory, etc. Otherwise you could get into trouble… if your code is slower when run in parallel, you are probably facing false sharing.

  • If R crashes… try running R with a debugger (see Section 4.3 in Writing R extensions):

    ~$ R --debugger=valgrind
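
For instance, a common pattern for drawing random numbers inside an OpenMP block is to give each thread its own C++11 generator. This is only a sketch: the seed-plus-thread-id scheme below is simple but not a statistically rigorous way of building independent streams.

#include <omp.h>
#include <random>
#include <vector>

// Sketch: one RNG engine per thread, so R's (non-thread-safe) RNG is never touched
std::vector<double> draw_uniforms(int n, int cores, int seed) {
  std::vector<double> ans(n);
  omp_set_num_threads(cores);

#pragma omp parallel
  {
    // Each thread builds its own engine and distribution
    std::mt19937 engine(seed + omp_get_thread_num());
    std::uniform_real_distribution<double> unif(0.0, 1.0);

#pragma omp for
    for (int i = 0; i < n; ++i)
      ans[i] = unif(engine);
  }

  return ans;
}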

RcppArmadillo and OpenMP workflow

  1. Tell Rcpp that you need OpenMP by including the header and enabling the compiler plugin:

    #include <omp.h>
    // [[Rcpp::plugins(openmp)]]
  2. Within your function, set the number of threads (cores), e.g.:

    // Setting the cores
    omp_set_num_threads(cores);

RcppArmadillo and OpenMP workflow

  3. Tell the compiler that you’ll be running a block in parallel with OpenMP:

    #pragma omp [directives] [options]
    {
      ...your neat parallel code...
    }

    You’ll need to specify how OMP should handle the data:

    • shared: Default; all threads access the same copy.
    • private: Each thread has its own copy, uninitialized.
    • firstprivate: Each thread has its own copy, initialized with the master’s value.
    • lastprivate: Each thread has its own copy; the value from the last iteration is copied back to the original variable.

    Setting default(none) is a good practice.

  4. Compile!

Ex 5: RcppArmadillo + OpenMP

Our own version of the dist function… but in parallel!

#include <omp.h>
#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]

using namespace Rcpp;

// [[Rcpp::export]]
arma::mat dist_par(const arma::mat & X, int cores = 1) {
  
  // Some constants
  int N = (int) X.n_rows;
  int K = (int) X.n_cols;
  
  // Output
  arma::mat D(N,N);
  D.zeros(); // Filling with zeros
  
  // Setting the cores
  omp_set_num_threads(cores);
  
#pragma omp parallel for shared(D, N, K, X) default(none)
  for (int i=0; i<N; ++i)
    for (int j=0; j<i; ++j) {
      for (int k=0; k<K; k++) 
        D.at(i,j) += pow(X.at(i,k) - X.at(j,k), 2.0);
      
      // Computing square root
      D.at(i,j) = sqrt(D.at(i,j));
      D.at(j,i) = D.at(i,j);
    }
      
  
  // My nice distance matrix
  return D;
}
# Simulating data
set.seed(1231)
K <- 5000
n <- 500
x <- matrix(rnorm(n*K), ncol=K)

# Are we getting the same?
table(as.matrix(dist(x)) - dist_par(x, 4)) # Only zeros
## 
##      0 
## 250000
# Benchmarking!
rbenchmark::benchmark(
  dist(x),                 # stats::dist
  dist_par(x, cores = 1),  # 1 core
  dist_par(x, cores = 4),  # 4 cores
  dist_par(x, cores = 8),  # 8 cores
  replications = 1, order="elapsed"
)[,1:4]
##                     test replications elapsed relative
## 4 dist_par(x, cores = 8)            1   0.636    1.000
## 3 dist_par(x, cores = 4)            1   1.204    1.893
## 2 dist_par(x, cores = 1)            1   2.530    3.978
## 1                dist(x)            1   6.477   10.184

Ex 6: The future

  • future is an R package that was designed “to provide a very simple and uniform way of evaluating R expressions asynchronously using various resources available to the user.”

  • future class objects are either resolved or unresolved.

  • If queried, resolved values are returned immediately, while unresolved values will block the process (i.e. wait) until they are resolved.

  • Futures can be evaluated in parallel or serially, on a single (local or remote) computer, or on a cluster of them.

Let’s see a brief example

Ex 6: The future (cont’d)

library(future)
plan(multicore)

# We are creating a global variable
a <- 2

# Creating the futures has only the overhead (setup) time
system.time({
  x1 %<-% {Sys.sleep(3);a^2}
  x2 %<-% {Sys.sleep(3);a^3}
})
##    user  system elapsed 
##   0.024   0.008   0.033

# Let's just wait 3 seconds to make sure all the cores have returned
Sys.sleep(3)
system.time({
  print(x1)
  print(x2)
})
## [1] 4
## [1] 8
##    user  system elapsed 
##   0.004   0.000   0.004
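
The %<-% operator is a shorthand; the same idea can be written with the explicit API (a sketch using future(), resolved(), and value()):

f1 <- future({Sys.sleep(3); a^2}) # returns immediately with an unresolved future
f2 <- future({Sys.sleep(3); a^3})

resolved(f1) # FALSE while the expression is still being evaluated
value(f1)    # blocks until f1 is resolved, then returns 4
value(f2)    # 8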

Thanks!

gvegayon
@gvegayon
ggvy.cl

Presentation created with revealjs

See also

For more, check out the CRAN Task View on HPC

Session info

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17    revealjs_0.9    digest_0.6.15   rprojroot_1.3-2
##  [5] backports_1.1.2 magrittr_1.5    evaluate_0.10.1 highr_0.6      
##  [9] icon_0.1.0      stringi_1.2.3   rmarkdown_1.10  tools_3.5.0    
## [13] stringr_1.3.0   yaml_2.1.19     compiler_3.5.0  htmltools_0.3.6
## [17] knitr_1.20

Bonus track 1: Simulating \(\pi\)

  • We know that \(\pi = \frac{A}{r^2}\), where \(A\) is the area of a circle of radius \(r\). We approximate it by randomly drawing points \(x\) in a square of side 2 centered at the origin.

  • Since \(\Pr\{\|x\| \leq 1\}\) estimates the ratio of the unit circle’s area to the square’s area \(2^2\), we approximate \(\pi\) as \(\Pr\{\|x\| \leq 1\}\times 2^2\)

The R code to do this

pisim <- function(i, nsim) {  # Notice we don't use the -i-
  # Random points in the unit square (the first quadrant; by symmetry,
  # equivalent to sampling the full square of side 2)
  ans  <- matrix(runif(nsim*2), ncol=2)
  
  # Distance to the origin
  ans  <- sqrt(rowSums(ans^2))
  
  # Estimated pi
  (sum(ans <= 1)*4)/nsim
}

Bonus track 1: Simulating \(\pi\) (cont’d)

# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)

# Number of simulations we want each time to run
nsim <- 1e5

# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))

# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
rbenchmark::benchmark(
  parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
  serial   = sapply(1:100, pisim, nsim=nsim), replications = 1
)[,1:4]
##       test replications elapsed relative
## 1 parallel            1   0.441    1.000
## 2   serial            1   1.154    2.617

The estimates themselves (parallel, serial, and R’s built-in value of \(\pi\)):

##      par      ser        R 
## 3.141126 3.141247 3.141593

Bonus track 2: HPC with Slurm

  • Suppose that we would like to maximize/minimize a function using a stochastic optimization algorithm, namely the Artificial Bee Colony algorithm

  • The following R script (01-slurm-abcoptim.R) was designed to work with Slurm (it requires the ABCoptim R package)

# Include this to tell where everything will be living at
.libPaths("~/R/x86_64-pc-linux-gnu-library/3.4/")

# Default CRAN mirror from where to download R packages
options(repos =c(CRAN="https://cloud.r-project.org/"))

# You need to have the ABCoptim R package
library(ABCoptim)

fun <- function(x) {
  -cos(x[1])*cos(x[2])*exp(-((x[1] - pi)^2 + (x[2] - pi)^2))
}

ans <- abc_optim(rep(0,2), fun, lb=-10, ub=10, criter=50)

saveRDS(
   ans,
   file = paste0(
      "~/hpc-with-r/examples/01-slurm-abcoptim-",
      Sys.getenv("SLURM_JOB_ID"),                 # SLURM ENV VAR
      "-",
      Sys.getenv("SLURM_ARRAY_TASK_ID"),          # SLURM ENV VAR
      ".rds"
))
  • Notice that we are using SLURM_JOB_ID and SLURM_ARRAY_TASK_ID to save our results (both are environment variables created by Slurm)
  • To run the previous R script, we can use the following bash file (01-slurm-abcoptim.sh)

    #!/bin/bash 
    #SBATCH --tasks=1
    #SBATCH --array=1-3
    #SBATCH --job-name=01-slurm-abcoptim
    #SBATCH --output=01-slurm-abcoptim-%A_%a.out
    
    source /usr/usc/R/3.4.0/setup.sh
    Rscript --vanilla ~/hpc-with-r/examples/01-slurm-abcoptim.R 
  • Here we are taking advantage of Slurm job arrays, so we are running the same R script in 3 instances (--array=1-3)

  • To run the job we just need to type

    $ sbatch 01-slurm-abcoptim.sh
  • Make sure you modify the file paths so that they match your files!
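
  • A hypothetical extension (not part of the original script): each array task could use its task id to set a different seed, so the three instances run different replicates:

    # Added at the top of 01-slurm-abcoptim.R (illustrative only)
    task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", "1"))
    set.seed(1000 + task_id)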

Now you try it!