S3IT: Service and Support for Science IT Scaling R on cloud infrastructure Sergio Maffioletti IS/Cloud S3IT: Service and Support for Science IT Zurich, 19.03.2015
Who am I? Sergio Maffioletti: Cloud and Application specialist; Head of the IS/Cloud Services unit; Head of the S3IT User support
What is S3IT? Connecting IT and Science. Zentrale Informatik Science IT support unit. Dedicated support for computation and data analysis. SPEED: faster time to solution. ACCESS: to competitive infrastructure. ENABLE: remove barriers, open up new possibilities
Disclaimer What is presented here is *not* an authoritative analysis of how R can be scaled. It is just the result of a few years of experience in supporting and helping research groups improve R performance.
What are we going to talk about today? How to scale your R code on a cloud infrastructure 1. What is a cloud infrastructure 2. What problem are we trying to address 3. Possible scenarios for scaling R
What is a cloud infrastructure? Infrastructure Cloud Service Virtualisation infrastructure for Compute, Storage, and Network: Virtual Machines (VMs), virtual storage block devices, virtual private networks. Self-provisioning and elasticity of resources: end-users can allocate and release resources when needed. Customization and control of the environment: end-users can tailor the research infrastructure to their specific needs. Multi-tenancy.
What problem are we trying to address? You normally run your R scripts on your local workstation At best: 8 cores, 16GB RAM, 100-200GB SSD drive This configuration is quite sufficient for most small/medium-scale analysis problems
How does your R script leverage the infrastructure? R is single-threaded by default, but there are several packages for parallel computation. mclapply: a parallelised version of lapply. Single node, multi-core; relatively easy to use within your script; scaling limited to what a single node provides. parallel: built on multicore and snow. Can use the CPUs/cores on a single machine (multicore), or several machines, using MPI (snow); the cluster of resources must be prepared before initialization; load balancing must be applied to distribute tasks; not widely used in the community.
mclapply example (since R 2.14, mclapply ships with the parallel package; the separate multicore package is deprecated):

library(parallel)
workerfunc <- function(n) return(n^2)
values <- 1:100
numworkers <- 8
res <- mclapply(values, workerfunc, mc.cores = numworkers)
print(unlist(res))
Using the parallel package with an MPI cluster (requires the snow and Rmpi packages):

library(parallel)
workerfunc <- function(n) return(n^2)
values <- 1:100
numworkers <- 8
cl <- makeCluster(numworkers, type = "MPI")
res <- parLapply(cl, values, workerfunc)
stopCluster(cl)
mpi.exit()
print(unlist(res))

Running: mpirun -n 1 R --slave -f simple_mpi.r
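As noted above, tasks may need load balancing. The parallel package provides parLapplyLB, which hands tasks to workers one at a time, so a few slow tasks do not stall a whole pre-assigned block. A minimal sketch using a local PSOCK cluster (no MPI needed, so it runs on any workstation):

```r
library(parallel)

workerfunc <- function(n) return(n^2)
values <- 1:100

# A PSOCK cluster starts worker R processes on the local machine.
numworkers <- 2
cl <- makeCluster(numworkers)

# parLapplyLB = load-balanced parLapply: tasks are dispatched
# dynamically instead of being split into equal chunks up front.
res <- parLapplyLB(cl, values, workerfunc)
stopCluster(cl)

print(unlist(res))
```

Load balancing pays off when individual task times vary a lot (as with the per-entry processing discussed later); for uniform tasks, plain parLapply has less dispatch overhead.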
Common limitation All of these approaches assume the infrastructure is already available On a cloud infrastructure, provisioning (and releasing) resources is a necessary extra step
How do you scale your R code on a cloud infrastructure? Scale-up: Bigger workstation, RStudio. Scale-out: GC3Pie
Scale-up: Bigger workstation Drawbacks: Cannot scale indefinitely (the limit is the maximum node size) Higher specs for the node = higher costs
Scale-up: RStudio Web-based interface to run R scripts Same single point of access (group, individual user) Drawbacks: Still runs on a single node
Scale-out: GC3Pie Scaling outside of your R script Divide a large execution into smaller chunks Run those chunks independently of each other Provision the infrastructure based on the script's requirements
Let's see an example: gweight Use case from the Business department Run the GetWeight function over 3.5M forum entries Each GetWeight call takes about 2 minutes That's still ~75 days on a 64-core node Each item in the forum can be processed independently of the others
gweight workflow Take the initial 3.5M entries as input (in .csv format) Create smaller .csv files of size chunk (default 1000) For each chunk file, create a dedicated VM On the VM, run the GetWeight function on that specific .csv chunk file Terminate the VM when no more .csv files remain to be processed Aggregate the results at the end into a single large .csv result file
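The split and aggregate steps of the workflow above can be sketched in plain R (function and file names here are illustrative, not the actual gweight/GC3Pie code; the per-chunk VM provisioning is handled by GC3Pie and is not shown):

```r
# Split a large data frame into .csv chunk files of `chunk_size` rows,
# so each chunk can be shipped to and processed on its own VM.
split_into_chunks <- function(df, chunk_size = 1000, dir = "chunks") {
  dir.create(dir, showWarnings = FALSE)
  # Group row indices 1..chunk_size, chunk_size+1..2*chunk_size, ...
  groups <- split(df, ceiling(seq_len(nrow(df)) / chunk_size))
  paths <- file.path(dir, sprintf("chunk_%04d.csv", seq_along(groups)))
  mapply(write.csv, groups, paths, MoreArgs = list(row.names = FALSE))
  paths  # return the chunk file paths
}

# Mirror of the last workflow step: read every per-chunk result file
# back and bind them into one large result data frame.
aggregate_chunks <- function(paths) {
  do.call(rbind, lapply(paths, read.csv))
}
```

With a chunk size of 1000, the 3.5M entries become 3500 independent .csv files, each small enough to finish quickly on a modest VM.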
Let's see how it works... https://youtu.be/8hn-4qxtahe
Conclusions Scaling R effectively on a cloud infrastructure requires knowing how to split your computation S3IT provides support and tools (e.g. GC3Pie) Get in touch with us: help@s3it.uzh.ch Visit our website: www.s3it.uzh.ch