George G. Vega Yon
vegayon@usc.edu
University of Southern California
Department of Preventive Medicine
August 27th, 2019
Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:
Big data: How to work with data that doesn't fit in your computer's memory
Parallel computing: How to take advantage of multi-core systems
Compiled code: Write your own low-level code (if R doesn't have it yet…)
(Check out the CRAN Task View on HPC)
In raw terms
Supercomputer: A single big machine with thousands of cores/GPUs.
High Performance Computing (HPC): Multiple machines within a single network.
High Throughput Computing (HTC): Multiple machines across multiple networks.
You may not have access to a supercomputer, but HPC/HTC clusters are certainly more accessible these days; e.g., AWS provides a service to create HPC clusters at a low cost (allegedly, since nobody understands how its pricing works).
Now, how many cores does your computer have? The parallel package can tell you:
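For example (note that detectCores counts logical rather than physical cores on most systems):

# Number of (logical) cores available to R
parallel::detectCores()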
## [1] 4
While there are several alternatives (just take a look at the High-Performance Computing Task View), we'll focus on the following R packages for explicit parallelism
Some examples:
Implicit parallelism, on the other hand, refers to out-of-the-box tools that allow the programmer not to worry about parallelization, e.g. gpuR for matrix manipulation using the GPU, or tensorflow
And there’s also a more advanced set of options
Based on the snow and multicore R packages.
Explicit parallelism.
Simple yet powerful idea: Parallel computing as multiple R sessions.
Clusters can be made of both local and remote sessions.
Multiple types of clusters: PSOCK, FORK, MPI, etc.
(Usually) we do the following:
1. Create a PSOCK/FORK (or other) cluster using makePSOCKcluster/makeForkCluster (or makeCluster).
2. Copy/prepare each R session (if you are using a PSOCK cluster):
   - Copy objects with clusterExport.
   - Pass expressions with clusterEvalQ.
   - Set a seed with clusterSetRNGStream.
3. Do your call: parApply, parLapply, etc.
4. Stop the cluster with stopCluster.
# 1. CREATING A CLUSTER
library(parallel)
cl <- makePSOCKcluster(4)
x <- 20
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Sets parallel-safe RNG streams (the cluster analogue of set.seed(123))
clusterExport(cl, "x")
# 3. DO YOUR CALL
clusterEvalQ(cl, {
paste0("Hello from process #", Sys.getpid(), ". I see x and it is equal to ", x)
})
## [[1]]
## [1] "Hello from process #4035. I see x and it is equal to 20"
##
## [[2]]
## [1] "Hello from process #4050. I see x and it is equal to 20"
##
## [[3]]
## [1] "Hello from process #4064. I see x and it is equal to 20"
##
## [[4]]
## [1] "Hello from process #4093. I see x and it is equal to 20"
Problem: Run multiple regressions on a very wide dataset. We need to fit the following model:
\[ y = X_i\beta_i + \varepsilon,\quad \varepsilon\sim N(0, \sigma^2_i),\quad\forall i \]
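A sketch of how such data could be simulated (the original generating chunk and seed are not shown; only the names and dimensions are taken from the output below):

# 500 observations of 999 predictors, plus a response
# (assumed data-generating process)
set.seed(132)
X <- matrix(rnorm(500 * 999), nrow = 500,
            dimnames = list(NULL, sprintf("x%03d", 1:999)))
y <- rnorm(500)
dim(X)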
## [1] 500 999
## x001 x002 x003 x004 x005
## 1 0.61827227 1.72847041 -1.4810695 -0.2471871 1.4776281
## 2 0.96777456 -0.19358426 -0.8176465 0.6356714 0.7292221
## 3 -0.04303734 -0.06692844 0.9048826 -1.9277964 2.2947675
## 4 0.84237608 -1.13685605 -1.8559158 0.4687967 0.9881953
## 5 -1.91921443 1.83865873 0.5937039 -0.1410556 0.6507415
## 6 0.59146153 0.81743419 0.3348553 -1.8771819 0.8181764
## num [1:500] -0.8188 -0.5438 1.0209 0.0467 -0.4501 ...
Serial solution: use apply (i.e., a loop over the columns) to solve it:
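A sketch of the serial call (mirroring the parApply version shown next):

ans <- apply(
  X      = X,
  MARGIN = 2, # Loop over the columns of X
  FUN    = function(x, y) coef(lm(y ~ x)),
  y      = y
)
ans[, 1:5]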
## x001 x002 x003 x004 x005
## (Intercept) -0.03449819 -0.03339681 -0.03728140 -0.03644192 -0.03717344
## x -0.06082548 0.02748265 -0.01327855 -0.08012361 -0.04067826
Parallel solution: Use parApply
library(parallel)
cl <- makePSOCKcluster(4L)
ans <- parApply(
cl = cl,
X = X,
MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
)
ans[,1:5]
## x001 x002 x003 x004 x005
## (Intercept) -0.03449819 -0.03339681 -0.03728140 -0.03644192 -0.03717344
## x -0.06082548 0.02748265 -0.01327855 -0.08012361 -0.04067826
Are we going any faster?
library(bench)
mark(
parallel = parApply(
cl = cl,
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
),
serial = apply(
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
)
)
## # A tibble: 2 x 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 parallel 374ms 410ms 2.44 11.8MB 0
## 2 serial 677ms 677ms 1.48 94.9MB 13.3
\[ Fib(n) = \left\{\begin{array}{ll} n & \mbox{if }n \leq 1 \\ Fib(n-1) + Fib(n - 2) & \mbox{otherwise} \end{array}\right. \]
The following C++ file, called fib.cpp
#include <Rcpp.h>
// [[Rcpp::export]]
int fibCpp(int n) {
if (n < 2) {
return n;
}
return fibCpp(n - 1) + fibCpp(n - 2);
}
It can be compiled within R using Rcpp::sourceCpp("fib.cpp"). This exports the function back into R:
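For example (a sketch: the scalar function takes a single integer, so we loop with sapply):

# First five Fibonacci numbers
sapply(1:5, fibCpp)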
## [1] 1 1 2 3 5
Rcpp data types map directly to R data types; e.g., an integer vector in R can be used as an IntegerVector in Rcpp.
#include <Rcpp.h>
using namespace Rcpp;
// Scalar version (same recursion as before, written inline)
int fibCpp(int n) {return (n < 2)? n : fibCpp(n - 1) + fibCpp(n - 2);}
// [[Rcpp::export]]
IntegerVector fibCpp(IntegerVector n) {
IntegerVector res(n.size());
for (int i = 0; i < n.size(); ++i)
res[i] = fibCpp(n[i]);
return res;
}
Back in R:
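Since the exported function is now vectorized, we can pass the whole vector at once (again, a sketch of the call):

fibCpp(1:5)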
## [1] 1 1 2 3 5
Friendlier than RcppParallel… at least for ‘I-use-Rcpp-but-don’t-actually-know-much-about-C++’ users (like myself!).
Must run only 'thread-safe' calls, so calling R from within parallel blocks can cause problems (almost all the time).
Use arma objects, e.g. arma::mat, arma::vec, etc., or, if you are more used to them, std::vector objects, as these are thread-safe.
Pseudo-random number generation is not very straightforward… but C++11 has a nice set of functions that can be used together with OpenMP.
Need to think about how processors work, cache memory, etc.; otherwise you can get into trouble… if your code runs slower in parallel, you are probably facing false sharing.
If R crashes… try running R with a debugger (see Section 4.3 in Writing R Extensions):
~$ R --debugger=valgrind
To use OpenMP, add #include <omp.h> to your C++ source code and tell Rcpp to set the right compiler flags with the // [[Rcpp::plugins(openmp)]] attribute.
Tell the compiler that you'll be running a block in parallel with an OpenMP directive, #pragma omp parallel [for], as in the example below.
You'll need to specify how OMP should handle the data:
- shared: Default; all threads access the same copy.
- private: Each thread has its own copy (although not initialized).
- firstprivate: Each thread has its own copy, initialized.
- lastprivate: Each thread has its own copy; the last value is the one stored in the main program.
Setting default(none) is a good practice.
Compile!
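With Rcpp attributes, this is a single call from R, assuming the source below is saved as dist.cpp (a filename chosen here for illustration):

# Compile and load the OpenMP-enabled function
Rcpp::sourceCpp("dist.cpp")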
Computing the distance matrix (see ?dist):
#include <omp.h>
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
// [[Rcpp::export]]
arma::mat dist_par(const arma::mat & X, int cores = 1) {
// Some constants
int N = (int) X.n_rows;
int K = (int) X.n_cols;
// Output
arma::mat D(N,N);
D.zeros(); // Filling with zeros
// Setting the cores
omp_set_num_threads(cores);
#pragma omp parallel for shared(D, N, K, X) default(none)
for (int i=0; i<N; ++i)
for (int j=0; j<i; ++j) {
for (int k=0; k<K; k++)
D.at(i,j) += pow(X.at(i,k) - X.at(j,k), 2.0);
// Computing square root
D.at(i,j) = sqrt(D.at(i,j));
D.at(j,i) = D.at(i,j);
}
// My nice distance matrix
return D;
}
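Before benchmarking we need some data (a sketch: the dimensions are assumed, as the original simulation chunk is not shown):

# Simulated data matrix to compute distances over
set.seed(1231)
x <- matrix(rnorm(1000 * 100), nrow = 1000)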
# Benchmarking!
rbenchmark::benchmark(
dist(x), # stats::dist
dist_par(x, cores = 1), # 1 core
dist_par(x, cores = 2), # 2 cores
dist_par(x, cores = 4), # 4 cores
replications = 10, order="elapsed"
)[,1:4]
## test replications elapsed relative
## 3 dist_par(x, cores = 2) 10 2.558 1.000
## 4 dist_par(x, cores = 4) 10 2.962 1.158
## 2 dist_par(x, cores = 1) 10 3.428 1.340
## 1 dist(x) 10 4.986 1.949
gvegayon · @gvegayon · ggvy.cl
For more, check out the CRAN Task View on HPC.
We know that \(\pi = \frac{A}{r^2}\), where \(A\) is the area of a circle of radius \(r\). We can approximate it by randomly drawing points \(x\) over a square of side 2 centered at the origin.
The unit circle covers a fraction \(\pi/2^2\) of that square, so we approximate \(\pi\) as \(\Pr\{\|x\| \leq 1\}\times 2^2\).
The R code to do this:
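First, the simulation function. pisim itself is not shown above, so this reconstruction only matches the name and signature used in the benchmark below:

pisim <- function(i, nsim) {
  # Draw nsim points uniformly over the [-1, 1] x [-1, 1] square
  ans <- matrix(runif(nsim * 2, -1, 1), ncol = 2)
  # Proportion falling inside the unit circle, times the square's area (2^2)
  mean(sqrt(rowSums(ans^2)) <= 1) * 4
}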
library(parallel)
# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)
# Number of simulations we want each time to run
nsim <- 1e5
# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))
# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
rbenchmark::benchmark(
parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
serial = sapply(1:100, pisim, nsim=nsim), replications = 1
)[,1:4]
## test replications elapsed relative
## 1 parallel 1 0.428 1.00
## 2 serial 1 0.796 1.86
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.1 magrittr_1.5 parallel_3.6.1 tools_3.6.1
## [5] htmltools_0.3.6 icon_0.1.0 yaml_2.2.0 Rcpp_1.0.2
## [9] stringi_1.4.3 rmarkdown_1.14 highr_0.8 knitr_1.24
## [13] stringr_1.4.0 xfun_0.9 digest_0.6.20 evaluate_0.14