Parallelization Strategies for Multicore Data Analysis
Wei-Chen Chen (University of Tennessee, Dept. of EEB)
Russell Zaretzki (University of Tennessee, Dept. of Statistics, Operations, and Management Science)
Computing in the Cloud, April 6-8, 2014
Outline
1. Introduction: Basic Strategy; Data Analysis Algorithms
2. Data and Analysis Techniques
3. Examples: Example 1: Multicore M-H Samplers; Example 2: Multicore Bootstrap Sampling; Example 3: Multicore Methods for Fitting GLMM
Basic Strategy: Multicore
A core is an individual processor. CPUs used to have a single core, and the terms were interchangeable. Modern CPUs have several cores on a single chip. Cores on the same chip share memory, allowing much easier implementation of parallel algorithms.
Basic Strategy: Multi-Node
Multi-node machines typically have many interconnected CPUs. Each CPU may have a number of cores which can share memory. Utilizing a multi-node machine usually involves explicitly moving data between nodes. This is a significant complication, and a high level of expertise is required to use these machines successfully and efficiently.
Basic Strategy: Points of Parallelism
To make efficient use of multicore resources we need to understand our data and modelling procedure. The basic question: where is the independence?
Data: statistical independence. If statistical independence exists between observations or groups of observations, we may be able to exploit this special structure to divide and conquer.
Parallelism of the algorithm. Do parts of the algorithm allow for parallelism? Can the problem be divided into independent working pieces that can be computed separately and then recombined?
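The divide-and-conquer idea above can be sketched in a few lines of R: split the data into independent groups, compute a partial result per group on separate cores, and recombine. This is a minimal toy sketch, not code from the workshop files; the data and summary function are made up for illustration.

```r
library(parallel)

# Four statistically independent groups of simulated data.
set.seed(1)
x <- split(rnorm(1000), rep(1:4, each = 250))

# Divide: compute each group's summary on a separate core
# (mclapply forks processes; on Windows use mc.cores = 1 or parLapply).
partial <- mclapply(x, mean, mc.cores = 2)

# Conquer: recombine the partial results.
overall <- mean(unlist(partial))
```

Because the groups here are the same size, the mean of the group means equals the overall mean; in general the recombination step must weight the pieces appropriately.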
Data Analysis Algorithms: Inputs and Outputs
Most data-based (statistical) analyses in the life sciences follow a basic functional structure.
Inputs: D, data in the form of a vector, list, or other structure; Λ, parameters of interest, usually summarized as a vector, matrix, or list.
Outputs: φ, a scalar, vector, matrix, or combination of these.
Data Analysis Algorithms: Key Algorithms
Simulation: Parallel chains running at the same time may improve efficiency. The cost of any burn-in or discarded samples needs to be considered. Useful if we want to run many chains with different data or parameter values. Cluster computers are highly effective; use job schedulers.
Optimization: Inherently serial operation controlled by a master process. Parallel implementation is most likely to occur within the function call.
Sample Data
Vicente et al. (2006) looked at the distribution and faecal shedding patterns of the first-stage larvae (L1) of Elaphostrongylus cervi (Nematoda: Protostrongylidae) in red deer across Spain. n = 826 deer were sampled, grouped among 351 farms. Sex of deer and length are explanatory variables. For the response variable, define $Y_{is}$ as 1 if the parasite E. cervi L1 is found in animal $s$ at farm $i$, and 0 otherwise.
Logistic Regression
Our goal is to relate presence/absence of the parasite to the size of the host animal and its gender, which are known. We assume a binomial distribution for $Y_{is}$ and use the logistic link function to relate the mean $p_{is}$ to the explanatory variables. That is,
$$Y_{is} \sim \mathrm{Bin}\bigl(1, p_{is}(x_{is}, \beta)\bigr), \qquad p_{is}(x_{is}, \beta) = \frac{\exp(\beta_0 + \beta_1 x_s + \beta_2 x_{len} + \beta_3 x_{len} x_s)}{1 + \exp(\beta_0 + \beta_1 x_s + \beta_2 x_{len} + \beta_3 x_{len} x_s)}.$$
We are allowing each gender to have its own intercept and slope.
Likelihood Function
Whether we take a Bayesian or MLE approach, we will need the likelihood (and its log):
$$L(\beta \mid y, X) = \prod_{i=1}^{I} \prod_{s=1}^{S_i} \left( \frac{\exp(x_{is}^T \beta)}{1 + \exp(x_{is}^T \beta)} \right)^{y_{is}} \left( \frac{1}{1 + \exp(x_{is}^T \beta)} \right)^{1 - y_{is}} = \frac{\exp\left\{ \sum_i \sum_s y_{is}\, x_{is}^T \beta \right\}}{\prod_i \prod_s \left[ 1 + \exp(x_{is}^T \beta) \right]}.$$
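The log of this likelihood is what the samplers below evaluate repeatedly. A minimal sketch in R (the function name `loglik` and argument layout are ours, not from the workshop files):

```r
# Log likelihood of logistic regression:
#   sum_i sum_s [ y_is * x_is' beta - log(1 + exp(x_is' beta)) ]
# X is the n x p design matrix, y the 0/1 response vector.
loglik <- function(beta, y, X) {
  eta <- X %*% beta                    # linear predictors x_is' beta
  sum(y * eta) - sum(log1p(exp(eta)))  # log1p avoids loss of precision
}
```

At the MLE this agrees with `logLik()` of a `glm(..., family = binomial)` fit, which is a convenient sanity check.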
Example 1: Multicore M-H Samplers
Bayesian Inference in Logistic Regression
We'll keep things simple here and assume an improper uniform prior for $\beta$ because of the lack of available conjugate priors. As a proposal distribution we will use a normal random-walk sampler.
The prior: $\pi(\beta) \propto c$.
The posterior: $\pi(\beta \mid y, X) \propto L(\beta \mid y, X)\,\pi(\beta) \propto L(\beta \mid y, X)$.
The proposal distribution: $q(\beta^{(i)} \mid \beta^{(i-1)}) = N\bigl(\beta^{(i-1)}, I^{-1}(\beta^{(i-1)})\bigr)$.
Example 1: Multicore M-H Samplers
Simulation with the Metropolis-Hastings Step
Lack of conjugate priors and the form of the posterior require that we simulate from the posterior using the M-H algorithm.
Random-Walk M-H Algorithm for Logistic Regression
Initialization: choose an arbitrary starting value $\beta^{(0)}$.
Iteration $t$ ($t \geq 1$):
1. Given $\beta^{(t-1)}$, generate $\tilde{\beta} \sim q(\beta^{(t-1)}, \cdot)$.
2. Compute
$$\rho(\beta^{(t-1)}, \tilde{\beta}) = \min\left(1, \frac{\pi(\tilde{\beta})\, q(\tilde{\beta}, \beta^{(t-1)})}{\pi(\beta^{(t-1)})\, q(\beta^{(t-1)}, \tilde{\beta})}\right) = \min\left(1, \frac{\pi(\tilde{\beta})}{\pi(\beta^{(t-1)})}\right),$$
where the second equality holds because the random-walk proposal is symmetric.
3. With probability $\rho(\beta^{(t-1)}, \tilde{\beta})$, accept $\tilde{\beta}$ and set $\beta^{(t)} = \tilde{\beta}$; otherwise reject $\tilde{\beta}$ and set $\beta^{(t)} = \beta^{(t-1)}$.
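The three steps above can be sketched as a generic random-walk M-H loop in R. This is our own minimal sketch, not the contents of 21-mcmc-glm.R: `lpost` is any unnormalized log posterior, and `Sigma` plays the role of $I^{-1}(\beta)$ from the proposal on the previous slide.

```r
# Random-walk Metropolis-Hastings: since the proposal is symmetric,
# accept with probability min(1, pi(prop)/pi(current)), computed on
# the log scale for numerical stability.
rw_mh <- function(lpost, beta0, Sigma, n_iter = 1000) {
  p <- length(beta0)
  L <- chol(Sigma)                       # Sigma = t(L) %*% L
  draws <- matrix(NA_real_, n_iter, p)
  beta <- beta0
  lp <- lpost(beta)
  for (t in seq_len(n_iter)) {
    prop <- beta + drop(rnorm(p) %*% L)  # beta~ ~ N(beta, Sigma)
    lp_prop <- lpost(prop)
    if (log(runif(1)) < lp_prop - lp) {  # M-H accept/reject step
      beta <- prop
      lp <- lp_prop
    }
    draws[t, ] <- beta                   # rejected proposals repeat beta
  }
  draws
}
```

Plugging in the logistic log likelihood (flat prior) as `lpost` gives the sampler used in the serial example.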
Example 1: Multicore M-H Samplers
Opportunities for Parallelism
In MCMC, the accuracy of estimates and inferences improves with greater sampling. We would like to use parallelism to increase the speed at which we sample. Where are the opportunities for parallelism in this example? Two possibilities:
1. Multichain: run multiple independent chains, each starting from a different initial value $\beta^{(0)}$. Very easy to do, but we need to allow each chain to burn in.
2. Faster function: use parallelism to speed up the calculation of the likelihood function, particularly for very large samples. For example, if we had thousands of observations per farm, we could break up the data, compute the likelihood separately for each farm, and finally bring the results together to get a final value. This is slightly more work than the first idea and will probably only help if the data set is very large.
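The multichain idea is the easiest win: each chain is an independent task, so `mclapply` can farm one chain out per core. A toy sketch (our own; `run_chain` stands in for the actual sampler in the workshop files, and the burn-in length is arbitrary):

```r
library(parallel)

# Stand-in for a real MCMC chain: a random walk from a given start.
run_chain <- function(start) cumsum(rnorm(100)) + start

# Four independent chains from different initial values, one per task.
set.seed(2)
starts <- list(-2, 0, 2, 4)
chains <- mclapply(starts, run_chain, mc.cores = 2)

# Each chain pays its own burn-in cost: discard early draws, then pool.
kept <- lapply(chains, function(ch) ch[-(1:20)])
```

Note the trade-off stated above: with k chains you pay k burn-in periods, so the speedup applies only to the post-burn-in draws.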
Example 1: Multicore M-H Samplers
Example: Random-Walk M-H in R
Let's try simulating the posterior for our deer parasite example.
1. Method 1 is a serial implementation. Run file 21-mcmc-glm.R.
2. Method 2 accesses multiple cores through the mclapply function. Run file 22-mcmc-glm-mclapply.R.
3. Method 3 uses the pbdR package, which allows you to work in a multinode environment and will be discussed more tomorrow. Run file 23-mcmc-glm-pbdR.R.
Example 1: Multicore M-H Samplers
Ex 1: Questions
1. Can you modify the code to change the number of cores/resources you are using?
2. How can you create 95% credible intervals from the output?
3. Can you time your results to see if there are any improvements?
Example 2: Multicore Bootstrap Sampling
Variance Components
The previous example ignored the variation in the data due to the farms, which may be an important source of variation. Introduce a "random intercept" into our model to take this into account:
$$Y_{is} \sim \mathrm{Bin}\bigl(1, p_{is}(x_{is}^T \beta)\bigr), \qquad p_{is}(x_{is}^T \beta) = \frac{\exp(\beta_0 + \alpha_i + \beta_1 x_s + \beta_2 x_{len})}{1 + \exp(\beta_0 + \alpha_i + \beta_1 x_s + \beta_2 x_{len})},$$
where $\alpha_i \sim N(0, \sigma^2_\alpha)$.
1. This is a GLMM: a generalized linear mixed model.
2. It can be fit by PQL, the penalized quasi-likelihood method.
3. This method is known to produce biased estimates of both $\beta$ and $\sigma^2_\alpha$.
4. Confidence intervals for $\sigma^2_\alpha$ are also biased.
Example 2: Multicore Bootstrap Sampling
Bootstrap to the Rescue
Use the bootstrap percentile method to simulate the distribution of the estimate and create a confidence interval. Both parametric and nonparametric approaches exist.
Non-Parametric Bootstrap Percentile Method
Initialization: fit the PQL model to the original data.
1. Sample with replacement the subset of observations from each farm and combine to create a new data set.
2. Compute the PQL estimate from the resampled data set.
3. Collect the estimates of $\sigma^2_\alpha$ and produce a confidence interval.
4. Create prediction intervals for the individual $\alpha_i$.
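Step 1, resampling within each farm and recombining, is the part worth seeing in code. A minimal sketch (our own names, not the workshop files; `fit_pql` is a placeholder for the actual PQL fit, e.g. via MASS::glmmPQL):

```r
# Nonparametric bootstrap resampling that respects the grouping:
# observations are resampled within each farm, then farms recombined,
# so every bootstrap data set keeps all farms and their sizes.
resample_by_farm <- function(dat) {
  pieces <- lapply(split(dat, dat$farm), function(d) {
    d[sample(nrow(d), replace = TRUE), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# One bootstrap replicate: refit the model on the resampled data.
boot_once <- function(dat, fit_pql) fit_pql(resample_by_farm(dat))
```

Repeating `boot_once` B times and taking empirical quantiles of the collected $\hat{\sigma}^2_\alpha$ gives the percentile interval; note that resampling observations within farms is one of several defensible schemes (resampling whole farms is another), which foreshadows the "more appropriate way to bootstrap?" question below.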
Example 2: Multicore Bootstrap Sampling
Opportunities for Parallelism
As before, the accuracy of estimates and inferences improves with greater sampling, and we would like to use parallelism to increase the speed at which we sample. Where are the opportunities for parallelism in this example? Three possibilities:
1. Multistream: again, run multiple streams in parallel, since the bootstrap replicates are totally independent.
2. Faster function: the resampling step is a very simple task and can be computed in one step. Most of the work is in refitting the PQL model on the resampled data. A multicore PQL may make sense; however, the data set may again be too small for this to be of much benefit.
3. Take advantage of gains by using vectorization and avoiding loops.
Example 2: Multicore Bootstrap Sampling
Example: Nonparametric Bootstrap
Let's try bootstrapping the farm effect for our deer parasite example.
1. First run 01-max_pql.R to fit the initial model.
2. Method 1 is a serial implementation with a for loop. Run file 11-npbs_for.R.
3. Method 2 uses lapply to eliminate the for loop. Run file 12-npbs_lapply.R.
4. Method 3 uses the mclapply function. Run file 13-npbs_mclapply.R.
5. Method 4 again uses the pbdR package. Run file 14-npbs_pbdR.R.
Example 2: Multicore Bootstrap Sampling
Ex 2: Questions
1. Can you modify the code to change the number of cores/resources you are using?
2. Can you time your results to see if there are any improvements?
3. Estimate the mean and the median of the variance component from the bootstrapped samples.
4. Find a confidence interval for $\beta$.
5. Is there a more appropriate way to bootstrap?
Example 3: Multicore Methods for Fitting GLMM
The GLMM Likelihood
The generalized linear mixed model likelihood requires us to integrate over the $\alpha_i$ with respect to their densities:
$$L(\beta, \sigma^2_\alpha \mid y, X) = \prod_i \int \frac{\exp\left\{ \sum_s y_{is} (x_{is}^T \beta + \alpha_i) \right\}}{\prod_s \left[ 1 + \exp(x_{is}^T \beta + \alpha_i) \right]}\, p(\alpha_i \mid \sigma^2_\alpha)\, d\alpha_i,$$
where $p(\alpha_i \mid \sigma^2_\alpha)$ is the $N(0, \sigma^2_\alpha)$ density. PQL approximates this integral using a quadratic approximation. What can we do to improve the quality of the estimates?
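One alternative to PQL's quadratic approximation is to evaluate each farm's integral by Monte Carlo: draw many $\alpha_i \sim N(0, \sigma^2_\alpha)$ and average the conditional likelihood over the draws. A sketch for a single farm (our own function and argument names; `Xi` and `yi` are that farm's design matrix and 0/1 responses):

```r
# Monte Carlo estimate of one farm's integrated likelihood:
#   (1/n_mc) * sum over draws of  prod_s p^y * (1-p)^(1-y)
# with p = logistic(x_is' beta + alpha_i) and alpha_i ~ N(0, sigma2).
farm_lik_mc <- function(beta, sigma2, yi, Xi, n_mc = 2000) {
  alpha <- rnorm(n_mc, 0, sqrt(sigma2))
  eta <- outer(drop(Xi %*% beta), alpha, "+")  # S_i x n_mc linear predictors
  p <- plogis(eta)
  mean(apply(p^yi * (1 - p)^(1 - yi), 2, prod))
}
```

Summing the log of these per-farm values over farms gives a Monte Carlo log likelihood that an optimizer can maximize (at the cost of simulation noise); as $\sigma^2_\alpha \to 0$ it collapses to the plain logistic likelihood, which makes a useful check.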
Example 3: Multicore Methods for Fitting GLMM
Approach 1: Maximizing the Likelihood
Outer layer (optimization level): inherently serial. The master process chooses new parameter values $(\beta, \sigma^2_\alpha)$ to pass to the function; the function returns a value to the optimization algorithm.
Function evaluation: numerical integration or Monte Carlo integration; compute the product/sum of the integrals.
Example 3: Multicore Methods for Fitting GLMM
Opportunities for Parallelism
Where are the opportunities for parallelism in this example? Two possibilities:
1. Multichain: not viable at the outer level. Could try multiple optimizations to check convergence.
2. Faster function: break the function up by doing the integrations for each group (farm) separately.
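The second idea is a natural fit for `mclapply`: the per-farm integrals are independent, so each core evaluates some farms' contributions and the master sums the logs. A toy sketch (our own names; `farm_loglik` here is a trivial placeholder, standing in for a numerical or Monte Carlo integration routine):

```r
library(parallel)

# Placeholder per-farm log-likelihood contribution; in the real problem
# this would be the log of that farm's integral over alpha_i.
farm_loglik <- function(d) sum(dnorm(d, log = TRUE))

# One data chunk per farm; each core handles some farms independently.
set.seed(3)
farm_data <- split(rnorm(40), rep(1:4, each = 10))

# The total log likelihood is the sum of the per-farm pieces.
total <- sum(unlist(mclapply(farm_data, farm_loglik, mc.cores = 2)))
```

Since the optimizer calls this function once per candidate $(\beta, \sigma^2_\alpha)$, the fork/join overhead is paid at every iteration, so this pays off only when the per-farm work is substantial.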
Example 3: Multicore Methods for Fitting GLMM
Bayesian Approach to GLMM
$$p(\beta, \alpha, \sigma^2_\alpha \mid y) \propto p(y \mid \beta, \alpha, \sigma^2_\alpha)\, p(\beta)\, p(\alpha \mid \sigma^2_\alpha)\, p(\sigma^2_\alpha)$$
Full conditionals:
$$p(\beta \mid \cdot) \propto \prod_{i=1}^{I} \prod_{s=1}^{S_i} p(y_{is} \mid \beta, \alpha_i)\, p(\beta)$$
$$p(\alpha_i \mid \cdot) \propto \prod_{s=1}^{S_i} p(y_{is} \mid \beta, \alpha_i)\, p(\alpha_i \mid \sigma^2_\alpha)$$
$$p(\sigma^2_\alpha \mid \cdot) \propto \prod_{i=1}^{I} p(\alpha_i \mid \sigma^2_\alpha)\, p(\sigma^2_\alpha)$$
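The key structural point for parallelism: the $\alpha_i$ full conditionals are mutually independent given $\beta$ and $\sigma^2_\alpha$, so all farms' random-effect updates can run at once. A sketch of one parallel Metropolis sweep over the $\alpha_i$ (our own names; `lcond_alpha(a, i)` stands in for the log of farm $i$'s conditional above):

```r
library(parallel)

# One Metropolis-within-Gibbs sweep over all random intercepts alpha_i.
# Each farm's update is an independent task, so mclapply can spread
# them across cores.
update_alphas <- function(alpha, lcond_alpha, sd_prop = 0.5, cores = 2) {
  unlist(mclapply(seq_along(alpha), function(i) {
    prop <- alpha[i] + rnorm(1, 0, sd_prop)      # random-walk proposal
    if (log(runif(1)) < lcond_alpha(prop, i) - lcond_alpha(alpha[i], i))
      prop
    else
      alpha[i]                                   # reject: keep old value
  }, mc.cores = cores))
}
```

With 351 farms and a cheap per-farm conditional, the forking overhead may dominate; in practice (as in the mclapply example that follows) this wins only when the per-farm evaluation is expensive enough. Reproducible parallel RNG streams (e.g. `mc.set.seed` with the "L'Ecuyer-CMRG" kind) are also needed for serious use.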
Example 3: Multicore Methods for Fitting GLMM
Example: Sampling the Posterior Distribution of the GLMM
1. Method 1 is a serial implementation with a for loop. Run file 31-mcmc_glmm.R.
2. Method 2 uses mclapply to eliminate the need to loop through all of the random effects. Run file 41-mcmc_glmm_mclapply.R.
3. Method 3 is like Method 2 but uses the pbdR package. Run file 42-mcmc_glmm_pbdR.R.
Example 3: Multicore Methods for Fitting GLMM
Ex 3: Questions
1. Exercise: find 95% credible intervals for sd.random.
2. Other ideas?