Introduction to parallel computing in R




Clint Leach
April 10, 2014

1 Motivation

When working with R, you will often encounter situations in which you need to repeat a computation, or a series of computations, many times. This can be accomplished through the use of a for loop. However, if there are a large number of computations that need to be carried out (i.e. many thousands), or if those individual computations are time-consuming (i.e. many minutes), a for loop can be very slow. That said, almost all computers now have multicore processors, and as long as these computations do not need to communicate (i.e. they are "embarrassingly parallel"), they can be spread across multiple cores and executed in parallel, reducing computation time. Examples of these types of problems include:

- running a simulation model using multiple different parameter sets,
- running multiple MCMC chains simultaneously,
- bootstrapping, cross-validation, etc.

2 Parallel backends

By default, R will not take advantage of all the cores available on a computer. In order to execute code in parallel, you have to first make the desired number of cores available to R by registering a parallel backend, which effectively creates a cluster to which computations can be sent. Fortunately, there are a number of packages that will handle the nitty-gritty details of this process for you:

- doMC (built on multicore; works for Unix-alikes)
- doSNOW (built on snow; works for Windows)
- doParallel (built on parallel; works for both)

The parallel package is essentially a merger of multicore and snow, and automatically uses the appropriate tool for your system, so I would recommend sticking with doParallel. Creating a parallel backend (i.e. cluster) is accomplished through just a few lines of code:

library(doParallel)

# Find out how many cores are available (if you don't already know)
detectCores()

[1] 4

# Create cluster with desired number of cores
cl <- makeCluster(3)

# Register cluster
registerDoParallel(cl)

# Find out how many cores are being used
getDoParWorkers()

[1] 3

The syntax for the other packages is essentially the same, just with the corresponding register<Package>() function, as in the sketch below.
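For example, on a Unix-alike you could register a fork-based backend with doMC instead; a minimal sketch (assuming doMC is installed):

library(doMC)

# Register three worker processes (no cluster object needed;
# doMC forks the current R session)
registerDoMC(cores = 3)

# Confirm how many workers foreach will use
getDoParWorkers()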

3 Executing computations in parallel

Regardless of the application, parallel computing boils down to three basic steps: split the problem into pieces, execute in parallel, and collect the results.

3.1 Using foreach

These steps can all be handled through the use of the foreach package, which provides a parallel analogue to a standard for loop.

library(foreach)

x <- foreach(i = 1:3) %dopar% sqrt(i)
x

[[1]]
[1] 1

[[2]]
[1] 1.414

[[3]]
[1] 1.732

As in a for loop, i defines an iterator (though note the use of = instead of in), which splits the problem into pieces (step 1). For each value of the iterator, the %dopar% operator then passes the defined computations, here sqrt(i), to the available cores (step 2). Also note that, in contrast to a for loop, foreach collects the results (step 3) and returns an object, a list by default. This can be changed through the .combine option:

# Use the concatenate function to combine results
x <- foreach(i = 1:3, .combine = c) %dopar% sqrt(i)

# Now x is a vector
x

[1] 1.000 1.414 1.732

# Can also use + or * to combine results
x <- foreach(i = 1:3, .combine = "+") %dopar% sqrt(i)

# Now x is a scalar, the sum of all the results
x

[1] 4.146

Other options include rbind and cbind. It is also important to note that foreach will automatically export any necessary variables (i.e. all variables defined in your current environment will be available to the cores), but any packages needed by the computations have to be passed using the .packages option (see the sketch following the parallel apply examples below).

3.2 Parallel apply functions

The parallel package also provides parallel analogues for the apply family of functions.

parLapply(cl, list(1, 2, 3), sqrt)

[[1]]
[1] 1

[[2]]
[1] 1.414

[[3]]
[1] 1.732
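To make the exporting rules concrete, here is a minimal sketch (assuming the MASS package is installed) contrasting foreach's .packages option with the explicit clusterExport() that the parallel apply functions require; the backend setup from Section 2 is repeated so the sketch is self-contained:

library(foreach)
library(doParallel)

cl <- makeCluster(3)
registerDoParallel(cl)

# foreach: variables are exported automatically, but packages are not;
# .packages loads MASS on each worker so rlm() is available there
fits <- foreach(i = 1:3, .packages = "MASS") %dopar% {
  rlm(dist ~ speed, data = cars, subset = -i)  # drop the i-th row
}

# The parallel apply functions do NOT export your workspace automatically;
# ship variables to the workers explicitly with clusterExport()
scale.factor <- 10
clusterExport(cl, "scale.factor")
parSapply(cl, 1:3, function(i) i * scale.factor)

stopCluster(cl)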

3.3 Random number generation

If the calculations that you are parallelizing involve random number generation (as they often will), you will need to explicitly set up your random number generators so that the random numbers used by the different cores are independent. This can be handled fairly easily through the doRNG package, which generates independent, reproducible random number chains for each core:

library(doRNG)

# Set the random number seed manually before calling foreach
set.seed(123)

# Replace %dopar% with %dorng%
rand1 <- foreach(i = 1:5) %dorng% runif(3)

# Or set the seed using the .options.RNG option in foreach
rand2 <- foreach(i = 1:5, .options.RNG = 123) %dorng% runif(3)

# The two sets of random numbers are identical (i.e. reproducible)
identical(rand1, rand2)

[1] TRUE

3.4 Task-specific packages

These packages can take advantage of a registered parallel backend without needing foreach:

- caret -- classification and regression training; cross-validation, etc.; will automatically parallelize if a backend is registered (see the sketch after this list).
- bugsparallel -- provides tools for running parallel MCMC chains in WinBUGS using R2WinBUGS (still active?).
- dclone -- MCMC methods for maximum likelihood estimation; runs BUGS chains in parallel.
- pls -- partial least squares and principal component regression; built-in cross-validation tools can take advantage of multicore by setting options.
- plyr -- data manipulation and apply-like functions; can set options to run in parallel.
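For instance, with caret (assuming it is installed), the cross-validation folds run on the registered workers without any foreach code on your part; a minimal sketch:

library(caret)
library(doParallel)

cl <- makeCluster(3)
registerDoParallel(cl)

# allowParallel = TRUE is the default; with a backend registered,
# caret farms the resampling iterations out to the workers
ctrl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
fit <- train(dist ~ speed, data = cars, method = "lm", trControl = ctrl)

stopCluster(cl)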

3.5 Caveats and Warnings

- Communication overhead: there is overhead involved in setting up the cluster, so parallel execution is not worth it for simple problems.
- Error handling: the default is to stop if one of the tasks produces an error, but you then lose the output from any tasks that completed successfully; use the .errorhandling option in foreach to control how errors should be treated.
- You can use the Performance tab of the Windows Task Manager to double-check that things are working correctly (you should see CPU usage on the desired number of cores).
- Shutting down the cluster: when you're done, be sure to close the parallel backend using stopCluster(cl); otherwise you can run into problems later.

3.6 A (slightly) more substantial example

Returning to the tree data you've been working with: say we have girth and volume measurements for 100 species of trees, instead of just three, and we want to fit a linear regression for each species. While this can still be done fairly quickly using a for loop on a single core, we can save a few seconds by running it in parallel.

# Generate fake tree data set with 100 observations for each of 100 species
tree.df <- data.frame(species = rep(c(1:100), each = 100),
                      girth = runif(10000, 7, 40))
tree.df$volume <- tree.df$species/10 + 5 * tree.df$girth + rnorm(10000, 0, 3)

# Extract species IDs to iterate over
species <- unique(tree.df$species)

# Run foreach loop and store results in fits object
fits <- foreach(i = species, .combine = rbind) %dopar% {
  sp <- subset(tree.df, subset = species == i)
  fit <- lm(volume ~ girth, data = sp)
  return(c(i, fit$coefficients))
}

head(fits)

           (Intercept) girth
result.1 1     0.09479 4.998
result.2 2     0.79043 4.980
result.3 3     0.06980 5.009
result.4 4     0.25487 4.996
result.5 5    -0.59531 5.055
result.6 6     1.00723 4.984
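To see how much time the parallel loop actually saves, you can time the sequential version (foreach's %do% operator runs the same loop in the current R session) against %dopar%; a minimal sketch:

# Sequential: %do% evaluates the iterations on a single core
system.time(
  foreach(i = species, .combine = rbind) %do% {
    sp <- subset(tree.df, subset = species == i)
    coef(lm(volume ~ girth, data = sp))
  }
)

# Parallel: %dopar% spreads the iterations across the registered workers
system.time(
  foreach(i = species, .combine = rbind) %dopar% {
    sp <- subset(tree.df, subset = species == i)
    coef(lm(volume ~ girth, data = sp))
  }
)

For a problem this size the difference will be a few seconds at best; as noted above, the communication overhead can swamp the gains for small problems.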

# What if we want all of the info from the lm object? Change .combine
fullfits <- foreach(i = species) %dopar% {
  sp <- subset(tree.df, subset = species == i)
  fit <- lm(volume ~ girth, data = sp)
  return(fit)
}

attributes(fullfits[[1]])

$names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

$class
[1] "lm"

4 Less embarrassing parallel problems

Multicore computing is also useful for carrying out single, large computations (e.g. inverting very large matrices, fitting linear models with "Big Data"). In these cases, the cores are working together to carry out a single computation and thus need to communicate (i.e. the problem is not embarrassingly parallel anymore). This type of parallel computation is considerably more difficult, but there are some packages that do most of the heavy lifting for you:

- HiPLAR (High Performance Linear Algebra in R) -- automatically replaces the default matrix commands with multicore computations; Linux only, and installation is not necessarily straightforward.
- pbdR (programming with big data in R) -- multicore matrix algebra and statistics; available for all operating systems, with a potentially tractable install. Also has an extensive introduction manual, "Speaking R with a parallel accent".

5 Additional Resources

This overview has covered a very thin slice of the tools available, both within the above packages and in R more broadly. The help pages and vignettes for the above packages are very useful and provide additional details and examples. The CRAN task view on parallel computing is also a good resource for digging into the breadth of tools available:

http://cran.r-project.org/web/views/highperformancecomputing.html