High Performance Computing with R




High Performance Computing with R
Drew Schmidt
April 6, 2014
http://r-pbd.org/nimbios

Contents
1 Introduction
2 Profiling and Benchmarking
3 Writing Better R Code
4 All About Compilers and R
5 An Overview of Parallelism
6 Shared Memory Parallelism in R
7 Distributed Memory Parallelism with R
8 Exercises
9 Wrapup

1 Introduction
Compute Resources
HPC Myths

Compute Resources
Your laptop. Your own server. The cloud. NSF resources.

Compute Resources: XSEDE
(+) Free!
(+) Access to massive compute resources.
(+) OS image and software managed by others.
(-) 1-3 month turnaround for new applications.
(-) The application consists of more than "here's my credit card."
(-) Some restrictions apply.
https://www.xsede.org/allocations

HPC Myths: What is High Performance Computing (HPC)?
High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation, in order to solve large problems in science, engineering, or business.

HPC Myths
1 We don't need to worry about it.
2 HPC requires a supercomputer.
3 HPC is only for academics.
4 HPC is too expensive.
5 You can't do HPC with R.
6 There's no need for HPC in biology.
7 Every question requires an HPC solution.

HPC Myths: Companies Using HPC
[Slide shows logos of companies using HPC.]

HPC Myths: Programming with Big Data in R (pbdR)
[Benchmark plot: run time in seconds for fitting y~x with a fixed local size of ~43.4 MiB per core, for 500, 1000, and 2000 predictors, on 504 to 24,000 cores.]

HPC Myths: HPC in the Biological Sciences
According to Manuel Peitsch, co-founder of the Swiss Institute of Bioinformatics, HPC is essential and plays a critical role in the life sciences in 4 ways:
1 Handling the massive amounts of data generated by modern omics and genome sequencing technologies.
2 Modeling increasingly large biomolecular systems using quantum mechanics/molecular mechanics and molecular dynamics.
3 Modeling biological networks and simulating how network perturbations lead to adverse outcomes and disease.
4 Simulating organ function.

2 Profiling and Benchmarking
Why Profile?
Profiling R Code
Other Ways to Profile
Summary

Why Profile?
Because performance matters.
Your bottlenecks may surprise you.
Because R is dumb.
R users claim to be data people... so act like it!

Why Profile? Compilers often correct bad behavior.

A really dumb loop:

int main(){
  int x, i;
  for (i=0; i<10; i++)
    x = 1;
  return 0;
}

With optimization (clang -O3 example.c), the loop is eliminated entirely:

main:
  xorl %eax, %eax
  ret

Without optimization (clang example.c), the loop runs in full:

main:
  movl $0, -4(%rsp)
  movl $0, -12(%rsp)
.LBB0_1:
  cmpl $10, -12(%rsp)
  jge .LBB0_4
  movl $1, -8(%rsp)
  movl -12(%rsp), %eax
  addl $1, %eax
  movl %eax, -12(%rsp)
  jmp .LBB0_1
.LBB0_4:
  movl $0, %eax
  ret

Why Profile? R will not!

Dumb loop (the transpose is recomputed every iteration):

for (i in 1:n){
  tA <- t(A)
  Y <- tA %*% Q
  Q <- qr.Q(qr(Y))
  Y <- A %*% Q
  Q <- qr.Q(qr(Y))
}
Q

Better loop (the transpose is hoisted out of the loop):

tA <- t(A)
for (i in 1:n){
  Y <- tA %*% Q
  Q <- qr.Q(qr(Y))
  Y <- A %*% Q
  Q <- qr.Q(qr(Y))
}
Q

Why Profile? Example from the clustergenomics Package

Excerpt from the original findw function:

n <- nrow(as.matrix(dx))
while (k <= K){
  for (i in 1:k){
    # Sum of within-cluster dispersion:
    d.k <- as.matrix(dx)[labx==i, labx==i]
    D.k <- sum(d.k)
    ...

Excerpt from the modified findw function (the as.matrix() conversion is hoisted out of the loop):

dx.mat <- as.matrix(dx)
n <- nrow(dx.mat)
while (k <= K){
  for (i in 1:k){
    # Sum of within-cluster dispersion:
    d.k <- dx.mat[labx==i, labx==i]
    D.k <- sum(d.k)
    ...

By changing just 2 lines of code, I was able to improve the speed of the method by over 350%!

Profiling R Code: Runtime Tools
Getting simple timings as a basic measure of performance is easy, and valuable.
system.time()
rbenchmark
Rprof()

Performance Profiling Tools: system.time()

system.time() is a basic R utility for giving run times of expressions:

> x <- matrix(rnorm(10000 * 500), nrow=10000, ncol=500)
> system.time(t(x) %*% x)
   user  system elapsed
  0.459   0.028   0.488
> system.time(crossprod(x))
   user  system elapsed
  0.234   0.000   0.234
> system.time(cov(x))[3]
elapsed
  1.428

Performance Profiling Tools: system.time()

Improving the rexpokit package:

library(rexpokitold)
system.time(expokit_dgpadm_Qmat(x))[3]  # 5.496

library(rexpokit)
system.time(expokit_dgpadm_Qmat(x))[3]  # 4.164

5.496 / 4.164  # 1.319885

Performance Profiling Tools: rbenchmark

rbenchmark is a simple package that easily benchmarks different functions:

x <- matrix(rnorm(10000 * 500), nrow=10000, ncol=500)
f <- function(x) t(x) %*% x
g <- function(x) crossprod(x)

library(rbenchmark)
benchmark(f(x), g(x))
#   test replications elapsed relative
# 1 f(x)          100  64.153    2.063
# 2 g(x)          100  31.098    1.000

Profiling R Code: Rprof()

A very useful tool for profiling R code:

Rprof(filename="Rprof.out", append=FALSE, interval=0.02,
      memory.profiling=FALSE, gc.profiling=FALSE,
      line.profiling=FALSE, numfiles=100L, bufsize=10000L)

Profiling R Code: Rprof()

data(iris)
myris <- iris[1:100, ]
fam <- binomial(logit)

Rprof()
mymdl <- replicate(1, myglm <- glm(Species ~ ., family=fam, data=myris))
Rprof(NULL)
summaryRprof()

Rprof()
mymdl <- replicate(10, myglm <- glm(Species ~ ., family=fam, data=myris))
Rprof(NULL)
summaryRprof()

Rprof()
mymdl <- replicate(100, myglm <- glm(Species ~ ., family=fam, data=myris))
Rprof(NULL)
summaryRprof()

Other Ways to Profile: Other Profiling Tools
Rprofmem()
tracemem()
perf (Linux)
PAPI, TAU, ...
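As a brief sketch of the memory-side tools above: tracemem() marks an object and reports each time R duplicates it, which is handy for spotting unintended copies (it requires R built with memory profiling, the default for standard builds):

```r
# Sketch: watching for copies with tracemem().
x <- matrix(0, nrow = 1000, ncol = 1000)

tracemem(x)     # start reporting duplications of x
x[1, 1] <- 1    # modifying a traced object prints a message if it is copied
untracemem(x)   # stop tracing
```

Rprofmem() similarly logs large allocations to a file; both are complements to, not replacements for, Rprof().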

Other Ways to Profile: Profiling with pbdPROF

1. Rebuild pbdR packages:

R CMD INSTALL pbdMPI_0.2-1.tar.gz \
  --configure-args="--enable-pbdPROF"

2. Run code:

mpirun -np 64 Rscript my_script.r

3. Analyze results (produces publication-quality graphs):

library(pbdPROF)
prof <- read.prof("profiler_output.mpip")
plot(prof)

Summary
Profile, profile, profile.
Use system.time() to get a general sense of a method.
Use rbenchmark's benchmark() to compare 2 methods.
Use Rprof() for more detailed profiling.
Other tools exist for more hardcore applications.

3 Writing Better R Code
Functions
Loops, Ply Functions, and Vectorization
Summary

Serial R Improvements
Slow serial code = slow parallel code.

Functions: Function Evaluation
Function calls are comparatively expensive (~10x slower than C).
In absolute terms, the abstraction is worth the price.
Recursion sucks. Avoid it at all costs.

Functions: Recursion

fib1 <- function(n)
{
  if (n == 0 || n == 1)
    return(1L)
  else
    return(fib1(n-1) + fib1(n-2))
}


fib2 <- function(n)
{
  if (n == 0 || n == 1)
    return(1L)

  f0 <- 1L
  f1 <- 1L

  i <- 1L
  fib <- 0L
  while (i < n)
  {
    fib <- f0 + f1
    f0 <- f1
    f1 <- fib
    i <- i + 1
  }

  return(fib)
}


library(rbenchmark)
n <- 20
benchmark(fib1(n), fib2(n))
#      test replications elapsed relative
# 1 fib1(n)          100   2.047   1023.5
# 2 fib2(n)          100   0.002      1.0

system.time(fib1(45))[3]
# 2934.146
system.time(fib2(45))[3]
# 3.0616e-5

# Relative performance
2934.146 / 3.0616e-5
# 95837013

Loops, Plys, and Vectorization
Loops are slow.
apply() and Reduce() are just for loops.
Map(), lapply(), sapply(), mapply() (and most other core ones) are not for loops.
Vectorization is the fastest of these options, but tends to be much more memory wasteful.
Ply functions are not vectorized.

Loops: Best Practices
Profile, profile, profile.
Evaluate how practical it is to rewrite as an lapply(), vectorize, or push to compiled code.
Preallocate, preallocate, preallocate.

Loops

f1 <- function(n){
  x <- c()
  for (i in 1:n){
    x <- c(x, i^2)
  }
  x
}


f2 <- function(n){
  x <- integer(n)
  for (i in 1:n){
    x[i] <- i^2
  }
  x
}

library(rbenchmark)
n <- 1000
benchmark(f1(n), f2(n))
#    test replications elapsed relative
# 1 f1(n)          100   0.234    2.412
# 2 f2(n)          100   0.097    1.000

Plys: Best Practices
Most plys are just shorthand/higher expressions of loops.
Generally not much faster (if at all), especially with the compiler.
Thinking in terms of lapply() can be useful, however...

Vectorization
x + y
x[, 1] <- 0
rnorm(1000)

Plys and Vectorization

f3 <- function(n){
  sapply(1:n, function(i) i^2)
}

f4 <- function(n){
  (1:n) * (1:n)
}

library(rbenchmark)
n <- 1000
benchmark(f1(n), f2(n), f3(n), f4(n))
#    test replications elapsed relative
# 1 f1(n)          100   0.210      210
# 2 f2(n)          100   0.096       96
# 3 f3(n)          100   0.084       84
# 4 f4(n)          100   0.001        1

Loops, Plys, and Vectorization

f1 <- function(ind){
  sum <- 0
  for (i in ind){
    if (i %% 2 == 0)
      sum <- sum - log(i)
    else
      sum <- sum + log(i)
  }
  sum
}

f2 <- function(ind){
  sum(sapply(X=ind, FUN=function(i) if (i %% 2 == 0) -log(i) else log(i)))
}

f3 <- function(ind){
  sign <- replicate(length(ind), 1L)
  sign[seq.int(from=2, to=length(ind), by=2)] <- -1L

  sum(sign * log(ind))
}


library(rbenchmark)
ind <- 1:50000
benchmark(f1(ind), f2(ind), f3(ind), replications=10)
#      test replications elapsed relative
# 1 f1(ind)           10   0.428    1.309
# 2 f2(ind)           10   1.018    3.113
# 3 f3(ind)           10   0.327    1.000

Loops, Plys, and Vectorization
[Plot comparing the run times of the three implementations.]

Summary
Avoid recursion at all costs.
Vectorize when you can.
Pre-allocate your data in loops.

4 All About Compilers and R
Building R with a Different Compiler
The Bytecode Compiler
Bringing Compiled C/C++/Fortran to R
Summary

Building R with a Different Compiler: Better Compiler
GNU (gcc/gfortran) and clang/gfortran are free and will compile anything, but don't produce the fastest binaries.
Don't even bother with anything from Microsoft.
Intel icc is very fast on Intel hardware (~20% over GNU).
Better compiler = faster R.

Compiling R with icc and ifort
Faster, but not painless.
Requires an Intel Composer suite license ($$$).
Improvements are most visible on Intel hardware.
See Intel's help pages for details.

The Bytecode Compiler: The compiler Package
Released in 2011 (Tierney).
Bytecode: sort of like machine code for interpreters...
Improves R code speed 2-5% generally.
Does best on loops.

Bytecode Compilation
By default, packages are not (bytecode) compiled.
Exceptions: base (base, stats, ...) and recommended (MASS, Matrix, ...) packages.
Downsides to package compilation: (1) bigger install size, (2) longer install process.
Upside: faster.

Compiling a Function

test <- function(x) x+1
test
# function(x) x+1

library(compiler)

test <- cmpfun(test)
test
# function(x) x+1
# <bytecode: 0x38c86c8>

disassemble(test)
# list(.Code, list(7L, GETFUN.OP, 1L, MAKEPROM.OP, 2L, PUSHCONSTARG.OP,
#   3L, CALL.OP, 0L, RETURN.OP), list(x + 1, +, list(.Code,
#   list(7L, GETVAR.OP, 0L, RETURN.OP), list(x)), 1))

Compiling Packages

From R:

install.packages("my_package", type="source",
                 INSTALL_opts="--byte-compile")

From the shell:

export R_COMPILE_PKGS=1
R CMD INSTALL my_package.tar.gz

Compiling YOUR Package
In the DESCRIPTION file, you can set ByteCompile: yes to require bytecode compilation (overridden by --no-byte-compile).
Not recommended during development.
CRAN may yell at you.

Bringing Compiled C/C++/Fortran to R: (Machine Code) Compiled Code
Moving to compiled code can be difficult.
But the performance is very compelling.
We will talk about one popular way on Tuesday: Rcpp.

Extra Credit
Compare the bytecode of these two functions:

Wasteful:

f <- function(A, Q){
  n <- ncol(A)
  for (i in 1:n){
    tA <- t(A)
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }

  Q
}

Less wasteful:

g <- function(A, Q){
  n <- ncol(A)
  tA <- t(A)
  for (i in 1:n){
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }

  Q
}

Summary
Compiling R itself with a different compiler can improve performance, but is non-trivial.
The compiler package offers a small, but free, speedup.
The (bytecode) compiler works best on loops.

5 An Overview of Parallelism
Terminology: Parallelism
Choice of BLAS Library
Guidelines
Summary

Terminology: Parallelism
[Diagrams contrasting serial programming with parallel programming.]


Parallel Programming Vocabulary: Difficulty in Parallelism
1 Implicit parallelism: parallel details hidden from the user. Example: using a multi-threaded BLAS.
2 Explicit parallelism: some assembly required... Example: using mclapply() from the parallel package.
3 Embarrassingly parallel or loosely coupled: obvious how to make parallel; lots of independence in computations. Example: fit two independent models in parallel.
4 Tightly coupled: the opposite of embarrassingly parallel; lots of dependence in computations. Example: speed up model fitting for one model.
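The embarrassingly parallel case can be sketched with mclapply() from the parallel package, covered in detail later in the talk; fitting two independent models requires no communication between the tasks. The mc.cores value here is an arbitrary choice for illustration, and on Windows mclapply() falls back to running serially:

```r
library(parallel)

# Two independent model fits on a built-in dataset:
# an embarrassingly parallel workload.
formulas <- list(mpg ~ wt, mpg ~ hp)
fits <- mclapply(formulas, function(f) lm(f, data = mtcars), mc.cores = 2)
```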

Speedup
Wall-clock time: time of the clock on the wall from start to finish.
Speedup: a unitless measure of improvement; more is better.

S(n1, n2) = (time with n1 cores) / (time with n2 cores)

n1 is often taken to be 1; in this case, we are comparing a parallel algorithm to a serial algorithm.
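Plugging hypothetical wall-clock timings into the definition (with n1 = 1, so the comparison is against the serial run):

```r
# Hypothetical timings, for illustration only.
t1 <- 84.0       # seconds on 1 core
t8 <- 12.0       # seconds on 8 cores
s  <- t1 / t8    # speedup of 7: close to, but below, the optimal 8
```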

Speedup
[Plots of speedup vs. cores (1 to 8) for an application against the optimal linear speedup: one showing good speedup (tracking the optimal line), one showing bad speedup (flattening out).]

Shared and Distributed Memory Machines
Shared memory: direct access to read/change memory (one node). Examples: laptop, GPU, MIC.
Distributed memory: no direct access to read/change memory (many nodes); requires communication. Examples: cluster, server, supercomputer.

Shared and Distributed Memory Machines
Shared memory machines: thousands of cores. Example: Nautilus, University of Tennessee (1,024 cores, 4 TB RAM).
Distributed memory machines: hundreds of thousands of cores. Example: Kraken, University of Tennessee (112,896 cores, 147 TB RAM).

Shared and Distributed Programming from R
Shared memory examples: parallel, snow, foreach, gputools, HiPLARM.
Distributed examples: pbdR, Rmpi, RHadoop, RHIPE.
For more examples, see the CRAN HPC Task View:
http://cran.r-project.org/web/views/HighPerformanceComputing.html

Choice of BLAS Library: The BLAS
Basic Linear Algebra Subprograms.
Simple vector-vector (level 1), matrix-vector (level 2), and matrix-matrix (level 3) operations.
R uses the BLAS (and LAPACK) for most linear algebra operations.
There are different implementations available, with massively different performance.
Several multithreaded BLAS libraries exist.

Choice of BLAS Library: Benchmark

set.seed(1234)
m <- 2000
n <- 2000
x <- matrix(rnorm(m * n), m, n)

object.size(x)

library(rbenchmark)

benchmark(x %*% x)
benchmark(svd(x))

[Plot: average wall-clock run time (10 runs) of x %*% x (2000x2000, ~31 MiB, and 4000x4000, ~122 MiB, matrices) and svd(x) (1000x1000, ~8 MiB, and 2000x2000, ~31 MiB, matrices) under different BLAS implementations: reference, ATLAS, and OpenBLAS with 1 and 2 threads.]

Using OpenBLAS
On Debian and derivatives:

sudo apt-get install libopenblas-dev
sudo update-alternatives --config libblas.so.3

Warning: doesn't play nice with the parallel package!

Guidelines: Independence
Parallelism requires independence.
Separate evaluations of R functions are embarrassingly parallel.
For bio applications, this may mean splitting calculations by gene.

Guidelines: Portability
Not all packages (or methods within a package) support all OSes.
In the HPC world, that usually means "doesn't work on Windows."

Guidelines: RNGs in Parallel
Be careful!
Aided by the rlecuyer, rsprng, and doRNG packages.
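Base R's parallel package also addresses this through the L'Ecuyer-CMRG generator, which gives each worker its own reproducible random-number stream; a minimal sketch (the seed and core count are arbitrary):

```r
library(parallel)

# Give forked workers independent, reproducible RNG streams.
RNGkind("L'Ecuyer-CMRG")
set.seed(1234)

sums <- mclapply(1:4, function(i) sum(rnorm(100)),
                 mc.cores = 2, mc.set.seed = TRUE)
```

Without this, forked children can inherit the same RNG state and produce identical "random" draws.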

Summary
Many kinds of parallelism are available to R.
A better/parallel BLAS is free speedup for linear algebra, but takes some work.

6 Shared Memory Parallelism in R
The parallel Package
The foreach Package

The parallel Package
Comes with R as of 2.14.0.
Includes multicore + most of snow.
As such, it has 2 disjoint interfaces.

The parallel Package: multicore
Operates on the fork/join paradigm.

The parallel Package: multicore
(+) Data is only copied to the child on write (handled by the OS).
(+) Very efficient.
(-) No Windows support.
(-) Not as efficient as threads.

The parallel Package: multicore

mclapply(X, FUN, ...,
         mc.preschedule=TRUE, mc.set.seed=TRUE,
         mc.silent=FALSE, mc.cores=getOption("mc.cores", 2L),
         mc.cleanup=TRUE, mc.allow.recursive=TRUE)

x <- lapply(1:10, sqrt)

library(parallel)
x.mc <- mclapply(1:10, sqrt)

all.equal(x.mc, x)
# [1] TRUE

The parallel Package: multicore

simplify2array(mclapply(1:10, function(i) Sys.getpid(), mc.cores=4))
# [1] 27452 27453 27454 27455 27452 27453 27454 27455 27452 27453

simplify2array(mclapply(1:2, function(i) Sys.getpid(), mc.cores=4))
# [1] 27457 2745

The parallel Package: snow
Uses sockets.
(+) Works on all platforms.
(-) More fiddly than mclapply().
(-) Not as efficient as forks.

The parallel Package: snow

### Set up the worker processes
my.cl <- makeCluster(detectCores())
my.cl
# socket cluster with 4 nodes on host 'localhost'

parSapply(my.cl, 1:5, sqrt)

stopCluster(my.cl)

The parallel Package: Summary
All: detectCores(), splitIndices()
multicore: mclapply(), mcmapply(), mcparallel(), mccollect(), and others...
snow: makeCluster(), stopCluster(), parLapply(), parSapply(), and others...

The foreach Package
On CRAN (Revolution Analytics).
The main package is foreach, which is a single interface for a number of backend packages.
Backends: doMC, doMPI, doParallel, doRedis, doRNG, doSNOW.

The foreach Package
(+) Works on all platforms (with the correct backend).
(+) Can even work serially with a minor notational change.
(+) Write the code once, use whichever backend you prefer.
(-) Really bizarre, non-R-ish syntax.
(-) Efficiency issues???

The foreach Package: Efficiency Issues???

[Plot: log run time in seconds vs. length of the iterating set (10 to 1e+06) for lapply, mclapply, and foreach; coin flipping with 24 cores.]

### Bad performance
foreach(i=1:len) %dopar% tinyfun(i)

### Expected performance
foreach(i=1:ncores) %dopar% {
  out <- numeric(len/ncores)
  for (j in 1:(len/ncores))
    out[j] <- tinyfun(j)
  out
}

The foreach Package: General Procedure
Load foreach and your backend package.
Register your backend.
Call foreach().

Using foreach: Serial

library(foreach)

### Example 1
foreach(i=1:3) %do% sqrt(i)

### Example 2
n <- 50
reps <- 100

x <- foreach(i=1:reps) %do% {
  sum(rnorm(n, mean=i)) / (n*reps)
}

Using foreach: Parallel

library(foreach)
library(<myBackend>)

register<MyBackend>()

### Example 1
foreach(i=1:3) %dopar% sqrt(i)

### Example 2
n <- 50
reps <- 100

x <- foreach(i=1:reps) %dopar% {
  sum(rnorm(n, mean=i)) / (n*reps)
}

foreach Backends

### multicore
library(doParallel)
registerDoParallel(cores=ncores)
foreach(i=1:2) %dopar% Sys.getpid()

### snow
library(doParallel)
cl <- makeCluster(ncores)
registerDoParallel(cl=cl)
foreach(i=1:2) %dopar% Sys.getpid()
stopCluster(cl)

foreach Summary
Make sure to register your backend.
Different backends may have different performance.
Use %dopar% for a parallel foreach.
%do% and %dopar% must appear on the same line as the foreach() call.

7 Distributed Memory Parallelism with R
Distributed Memory Parallelism
Rmpi
pbdMPI vs Rmpi
Summary

Distributed Memory Parallelism: Why Distribute?
Nodes only hold so much RAM; commodity hardware has 32-64 GiB.
With a few exceptions (ff, bigmemory), R does computations in memory.
If your problem doesn't fit in the memory of one node...

Packages for Distributed Memory Parallelism in R
Rmpi, and snow via Rmpi
The RHIPE and RHadoop ecosystems
The pbdR ecosystem (Monday)

A Hasty Explanation of MPI
We will return to this on Monday.
MPI = Message Passing Interface.
Recall: distributed machines can't directly manipulate the memory of other nodes.
They can indirectly manipulate it, however...
Distinct nodes collaborate by passing messages over the network.

Rmpi
Rmpi Hello World

mpi.spawn.Rslaves(nslaves=2)
# 2 slaves are spawned successfully. 0 failed.
# master (rank 0, comm 1) of size 3 is running on: wootabega
# slave1 (rank 1, comm 1) of size 3 is running on: wootabega
# slave2 (rank 2, comm 1) of size 3 is running on: wootabega

mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))
# $slave1
# [1] "I am 1 of 3"
#
# $slave2
# [1] "I am 2 of 3"

mpi.exit()

Rmpi
Using Rmpi from snow

library(snow)
library(Rmpi)

cl <- makeCluster(2, type = "MPI")

clusterCall(cl, function() Sys.getpid())
clusterCall(cl, runif, 2)

stopCluster(cl)
mpi.quit()

Rmpi
Rmpi Resources

- Rmpi tutorial: http://math.acadiau.ca/acmmac/rmpi/
- Rmpi manual: http://cran.r-project.org/web/packages/Rmpi/Rmpi.pdf

pbdMPI vs Rmpi

- Rmpi is interactive; pbdMPI is exclusively batch.
- pbdMPI is easier to install.
- pbdMPI has a simpler interface.
- pbdMPI integrates with the other pbdR packages.

pbdMPI vs Rmpi
Example Syntax

Rmpi:

# int
mpi.allreduce(x, type=1)
# double
mpi.allreduce(x, type=2)

pbdMPI:

allreduce(x)

Types in R:

> is.integer(1)
[1] FALSE
> is.integer(2)
[1] FALSE
> is.integer(1:2)
[1] TRUE
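To make the comparison concrete, here is a hedged sketch of a complete SPMD-style pbdMPI script; the filename and launch command are assumptions, and unlike Rmpi, allreduce() dispatches on the R type of x, so no type argument is needed:

```r
## Hedged sketch: save as, say, allreduce.r and launch with
##   mpiexec -np 2 Rscript allreduce.r
## (assumes pbdMPI and an MPI implementation are installed).
library(pbdMPI)

init()
x <- comm.rank() + 1           # each rank contributes its own value
s <- allreduce(x, op = "sum")  # no type argument: dispatch on x's type
comm.print(s)                  # with 2 ranks: 1 + 2 = 3
finalize()
```

Because every rank runs the same script (the SPMD model), there is no master/slave spawning step as in Rmpi.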

Summary

- Distributed parallelism is necessary when computations no longer fit in RAM.
- Several options are available; most go beyond the scope of this talk.
- More on pbdR Monday!

Exercises 1

1. Suppose we wish to store the square root of all integers from 1 to 10000 in a vector. Do this in each of the following ways, and compare them with rbenchmark:
   - a for loop without initialization
   - a for loop with initialization
   - a ply function
   - vectorization
2. Revisit the previous example, evaluating the different implementations with Rprof().
3. Count the number of integer multiples of 5 or 17 which are less than 10,000,000.
   - Solve this with lapply().
   - Solve this with vectorization.
   - Solve this with mclapply() and 2 cores.
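A harness for the first exercise might look like the following sketch (a sketch only, assuming the rbenchmark package is installed; the expression labels are hypothetical):

```r
## Hedged sketch of comparing the four approaches with rbenchmark.
library(rbenchmark)

n <- 10000
benchmark(
  loop_no_init = { x <- NULL;       for (i in 1:n) x[i] <- sqrt(i) },
  loop_init    = { x <- numeric(n); for (i in 1:n) x[i] <- sqrt(i) },
  ply          = sapply(1:n, sqrt),
  vectorized   = sqrt(1:n),
  replications = 10
)
```

The returned data frame reports elapsed and relative times per labeled expression; filling in the implementations and interpreting the ranking is the exercise.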

Exercises 2

4. The Monty Hall game is a well-known paradox from elementary probability. From Wikipedia:

   "Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, 'Do you want to pick door No. 2?' Is it to your advantage to switch your choice?"

   Simulate one million trials of the Monty Hall game on 2 cores, switching doors every time, to computationally verify the elementary probability result. Compare the run time against the 1-core run time.

Exercises 3

5. The following is a more substantive example that utilizes multiple cores to perform a real analysis task. Run and evaluate this example. It is modified from one of Wei-Chen's examples in the pbdDEMO package.

   There are 148 EIAV sequences in the data set, sequenced from blood serum collected longitudinally from one EIAV-infected horse over periodic fever cycles. The virus population evolved within the sick horse over time, and some subtypes can break the horse's immune system. The goals were to identify how many subtypes were evolving within the horse, which subtype is associated with early disease onset, which sample/time point to isolate that subtype from, and which mutated region of the sequence was critical for that subtype to break the immune system.

Exercises 4-6

library(phyclust, quietly = TRUE)
library(parallel)

### Load data
data.path <- paste(.libPaths()[1], "/phyclust/data/pony524.phy", sep = "")
pony.524 <- read.phylip(data.path)
X <- pony.524$org
K0 <- 1
Ka <- 2

### Find MLEs
ret.K0 <- find.best(X, K0)
ret.Ka <- find.best(X, Ka)
LRT <- -2 * (ret.Ka$logL - ret.K0$logL)

### The user-defined function
FUN <- function(jid){
  X.b <- bootstrap.seq.data(ret.K0)$org

  ret.K0 <- phyclust(X.b, K0)
  repeat {
    ret.Ka <- phyclust(X.b, Ka)
    if (ret.Ka$logL > ret.K0$logL){
      break
    }
  }

  LRT.b <- -2 * (ret.Ka$logL - ret.K0$logL)
  LRT.b
}

### Task pull and summary
ret <- mclapply(1:100, FUN)
LRT.B <- unlist(ret)
cat("K0: ", K0, "\n",
    "Ka: ", Ka, "\n",
    "logL K0: ", ret.K0$logL, "\n",
    "logL Ka: ", ret.Ka$logL, "\n",
    "LRT: ", LRT, "\n",
    "p-value: ", mean(LRT > LRT.B), "\n", sep = "")

9. Wrapup

Important Topics Not Discussed Here

- Distributed computing (for real): pbdR on Monday.
- Utilizing compiled code: Rcpp on Tuesday.
- Multithreading.
- GPUs and MICs.
- R + Hadoop.

Thanks for coming! Questions?