High-Performance Computing: An overview
Parallel computing in R
Extended examples
Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:
Big data: How to work with data that doesn’t fit in your computer’s memory
Parallel computing: How to take advantage of multi-core systems
Compiled code: Write your own low-level code (if R doesn’t have it yet…)
(Check out the CRAN Task View on High-Performance Computing)
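As a taste of the parallel-computing side, here is a minimal sketch using base R’s `parallel` package (no extra installs needed). The toy function `slow_square` is a hypothetical stand-in for an expensive computation; the point is only the serial-vs-parallel pattern.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-element computation
slow_square <- function(x) x^2

# Serial version
ans_serial <- lapply(1:8, slow_square)

# Parallel version: a PSOCK cluster with 2 workers (portable across OSes)
cl <- makeCluster(2)
ans_parallel <- parLapply(cl, 1:8, slow_square)
stopCluster(cl)

identical(ans_serial, ans_parallel)  # same answers, possibly less wall time
```

On Unix-alikes, `mclapply()` offers a fork-based shortcut with the same flavor; `makeCluster()`/`parLapply()` is the version that also works on Windows.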
Buy a bigger computer/more RAM (not the best solution!)
Use out-of-memory storage, i.e., don’t load all your data into RAM; e.g., the bigmemory, data.table, and HadoopStreaming R packages
Store it more efficiently, e.g., sparse matrices (take a look at the dgCMatrix class from the Matrix R package)
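To illustrate the last point, a small sketch with the Matrix package: a mostly-zero matrix stored densely versus as a dgCMatrix, which keeps only the non-zero entries. The 1% fill rate below is an arbitrary choice for the demo.

```r
library(Matrix)

set.seed(1)
n <- 1000
# A 1000 x 1000 matrix where ~99% of entries are zero
dense <- matrix(0, n, n)
idx   <- sample(n * n, n * n * 0.01)
dense[idx] <- rnorm(length(idx))

# Convert to sparse storage (a dgCMatrix under the hood)
sparse <- Matrix(dense, sparse = TRUE)
class(sparse)

# The dense copy stores all n^2 doubles (~8 MB here);
# the sparse one stores only the non-zero values and their positions
object.size(dense)
object.size(sparse)
```

The savings grow with sparsity, and many Matrix operations (`%*%`, `solve()`, etc.) have sparse-aware methods, so you also save compute, not just memory.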
Flynn’s Classical Taxonomy (Blaise Barney, Introduction to Parallel Computing, Lawrence Livermore National Laboratory)
We will be focusing on the Single Instruction stream, Multiple Data stream (SIMD) model
source: Blaise Barney, Introduction to Parallel Computing, Lawrence Livermore National Laboratory
In raw terms
Supercomputer: A single big machine with thousands of cores/GPUs.
High Performance Computing (HPC): Multiple machines within a single network.
High Throughput Computing (HTC): Multiple machines across multiple networks.
You may not have access to a supercomputer, but HPC/HTC clusters are certainly more accessible these days; e.g., AWS provides a service for creating HPC clusters at a low cost (allegedly, since nobody understands how its pricing works)