Parallelization Strategies for Multicore Data Analysis



Parallelization Strategies for Multicore Data Analysis
Wei-Chen Chen (1) and Russell Zaretzki (2)
(1) University of Tennessee, Dept. of EEB
(2) University of Tennessee, Dept. of Statistics, Operations, and Management Science
Computing in the Cloud, April 6-8, 2014

Outline
1. Introduction: Basic Strategy; Data Analysis Algorithms
2. Data and Analysis Techniques
3. Examples:
   Example 1: Multicore M-H Samplers
   Example 2: Multicore Bootstrap Sampling
   Example 3: Multicore Methods for fitting GLMM

Basic Strategy
Multicore
A core is an individual processing unit. CPUs used to have a single core, so the two terms were interchangeable; modern CPUs have several cores on a single chip. Cores on the same chip share memory, which makes parallel algorithms much easier to implement.

Basic Strategy
Multi-Node
Multi-node machines typically have many interconnected CPUs, and each CPU may have a number of cores that share memory. Utilizing a multi-node machine usually involves explicitly moving data between nodes. This is a significant complication, and a high level of expertise is required to use these machines successfully and efficiently.

Basic Strategy
Points of Parallelism
To make efficient use of multicore resources, we need to understand our data and modelling procedure. The basic question is: where is the independence?
Data: statistical independence. If statistical independence exists between observations or groups of observations, we may be able to exploit this special structure to divide and conquer.
Parallelism of the algorithm. Do parts of the algorithm allow for parallelism? Can the problem be divided into independent pieces that are computed separately and then recombined?

Data Analysis Algorithms
Inputs and Outputs
Most data-based (statistical) analyses in the life sciences follow a basic functional structure.
Inputs: D, the data, in the form of a vector, list, or other structure; Λ, the parameters of interest, usually summarized as a vector, matrix, or list.
Outputs: φ, a scalar, vector, matrix, or combination of these.

Data Analysis Algorithms
Key Algorithms
Simulation: Parallel chains running at the same time may improve efficiency, but the cost of any burn-in or discarded samples needs to be considered. This is useful if we want to run many chains with different data or parameter values; cluster computers are highly effective here, used with job schedulers.
Optimization: An inherently serial operation controlled by a master process. Parallel implementation is most likely to occur within the function call.

Sample Data
Vicente et al. (2006) looked at the distribution and faecal shedding patterns of first-stage larvae (L1) of Elaphostrongylus cervi (Nematoda: Protostrongylidae) in red deer across Spain. n = 826 deer were sampled, grouped among 351 farms. Sex and length of the deer are explanatory variables. For the response, define Y_is = 1 if the parasite E. cervi L1 is found in animal s at farm i, and 0 otherwise.

Logistic Regression
Our goal is to relate presence/absence of the parasite to the size and gender of the host animal, which are known. We assume a binomial distribution for Y_is and use the logistic link function to relate the mean p_is to the explanatory variables. That is,

Y_is ~ Bin(1, p_is(x_is, β)),
p_is(x_is, β) = exp(β_0 + β_1 x_s + β_2 x_len + β_3 x_len x_s) / [1 + exp(β_0 + β_1 x_s + β_2 x_len + β_3 x_len x_s)].

Through the interaction term, we are allowing each gender to have its own intercept and slope.
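This model can be fit by maximum likelihood with R's glm. The sketch below uses simulated data in place of the Vicente et al. data set (which is not reproduced here); the coefficient values and sample sizes are assumptions for illustration only.

```r
# Sketch: fit the logistic model with an interaction on simulated data
# that mimics the deer example (826 animals; real data not included).
set.seed(42)
n    <- 826
sex  <- rbinom(n, 1, 0.5)             # x_s: 0/1 gender indicator
len  <- rnorm(n)                      # x_len: centered body length
beta <- c(-0.5, 0.3, 0.8, -0.2)       # assumed "true" coefficients
eta  <- beta[1] + beta[2] * sex + beta[3] * len + beta[4] * len * sex
y    <- rbinom(n, 1, plogis(eta))     # Y_is ~ Bin(1, p_is)

# sex * len expands to sex + len + sex:len, giving each gender
# its own intercept and slope on length
fit <- glm(y ~ sex * len, family = binomial)
coef(fit)
```

The interaction formula `sex * len` is how the "own intercept and slope per gender" structure is expressed in R's model syntax.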

Likelihood Function
Whether we take a Bayesian or MLE approach, we will need the likelihood. With I farms and S_i animals at farm i,

L(β | y, X) = Π_{i=1}^{I} Π_{s=1}^{S_i} [exp(x_is^T β) / (1 + exp(x_is^T β))]^{y_is} [1 / (1 + exp(x_is^T β))]^{1 - y_is}
            = exp(Σ_i Σ_s y_is x_is^T β) / Π_i Π_s [1 + exp(x_is^T β)].


Example 1: Multicore M-H Samplers
Bayesian Inference in Logistic Regression
We'll keep things simple here and, because of the lack of available conjugate priors, assume an improper flat prior for β. As a proposal distribution we will use a normal random-walk sampler.

The prior: π(β) ∝ c.
The posterior: π(β | y, X) ∝ L(β | y, X) π(β) ∝ L(β | y, X).
The proposal distribution: q(β^(t) | β^(t-1)) = N(β^(t-1), I^(-1)(β^(t-1))).

Example 1: Multicore M-H Samplers
Simulation with the Metropolis-Hastings Step
The lack of conjugate priors and the form of the posterior require that we simulate from the posterior using the M-H algorithm.

Random-Walk M-H Algorithm for Logistic Regression
Initialization: choose an arbitrary starting value β^(0).
Iteration t (t ≥ 1):
1. Given β^(t-1), generate β̃ ~ q(β^(t-1), ·).
2. Compute
   ρ(β^(t-1), β̃) = min{1, [π(β̃) q(β̃, β^(t-1))] / [π(β^(t-1)) q(β^(t-1), β̃)]} = min{1, π(β̃)/π(β^(t-1))},
   where the q terms cancel because the random-walk proposal is symmetric.
3. With probability ρ(β^(t-1), β̃), accept β̃ and set β^(t) = β̃; otherwise reject and set β^(t) = β^(t-1).
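A minimal serial sketch of this sampler follows. It is not the course's own code: the flat prior makes the log-posterior equal the log-likelihood, and the fixed step size, iteration counts, and simulated data are illustrative assumptions.

```r
# Log-posterior under a flat prior = log-likelihood of the logistic model
log_post <- function(beta, y, X) {
  eta <- drop(X %*% beta)
  sum(y * eta - log1p(exp(eta)))
}

# Random-walk M-H: symmetric normal proposal, so the q terms cancel
rw_mh <- function(y, X, n_iter = 2000, step = 0.1) {
  p    <- ncol(X)
  out  <- matrix(NA_real_, n_iter, p)
  beta <- rep(0, p)                        # arbitrary starting value beta^(0)
  lp   <- log_post(beta, y, X)
  for (t in seq_len(n_iter)) {
    prop    <- beta + rnorm(p, sd = step)  # generate beta-tilde
    lp_prop <- log_post(prop, y, X)
    if (log(runif(1)) < lp_prop - lp) {    # accept w.p. min(1, ratio)
      beta <- prop
      lp   <- lp_prop
    }
    out[t, ] <- beta                       # rejected moves repeat beta^(t-1)
  }
  out
}

# Tiny demo on simulated data (two coefficients)
set.seed(1)
X <- cbind(1, rnorm(200))
y <- rbinom(200, 1, plogis(X %*% c(0.2, 0.5)))
draws <- rw_mh(y, X)
colMeans(draws[-(1:500), ])                # posterior means after burn-in
```

Working on the log scale (comparing `log(runif(1))` with the log-posterior difference) avoids numerical underflow in the likelihood ratio.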

Example 1: Multicore M-H Samplers
Opportunities for Parallelism
As with any MCMC, the accuracy of estimates and inferences improves with greater sampling, so we would like to use parallelism to increase the speed at which we sample. Where are the opportunities for parallelism in this example? Two possibilities:
1. Multichain - Run multiple independent chains, each starting from a different initial value β^(0). Very easy to do, but we need to allow each chain to burn in.
2. Faster function - Use parallelism to speed up the calculation of the likelihood function, particularly for very large samples. For example, with thousands of observations per farm we could break up the data, compute the likelihood separately for each farm, and finally bring the results together to get a final value. This is slightly more work than the first idea and will probably only help if the data are very large.

Example 1: Multicore M-H Samplers
Example: Random-Walk M-H in R
Let's try simulating the posterior for our deer parasite example.
1. Method 1 is simply a serial implementation. Run file 21-mcmc-glm.R.
2. Method 2 accesses multiple cores through the mclapply function. Run file 22-mcmc-glm-mclapply.R.
3. Method 3 uses the pbdR package, which allows you to work in a multi-node environment and will be discussed more tomorrow. Run file 23-mcmc-glm-pbdR.R.
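The multichain strategy behind Method 2 can be sketched with mclapply from the parallel package: each chain is an independent task, so one chain can run per core. The target here is a toy standard normal rather than the logistic posterior, and the chain and core counts are arbitrary; note that mclapply forks processes, so on Windows it must fall back to one core.

```r
library(parallel)

# One independent random-walk M-H chain targeting N(0, 1)
run_chain <- function(seed, n_iter = 2000, step = 1) {
  set.seed(seed)                       # distinct seed per chain
  x   <- 0
  out <- numeric(n_iter)
  for (t in seq_len(n_iter)) {
    prop <- x + rnorm(1, sd = step)
    if (log(runif(1)) < dnorm(prop, log = TRUE) - dnorm(x, log = TRUE))
      x <- prop
    out[t] <- x
  }
  out
}

# Forking is unavailable on Windows, so degrade gracefully to 1 core
n_cores <- if (.Platform$OS.type == "windows") 1 else 2
chains  <- mclapply(1:4, run_chain, mc.cores = n_cores)

# Discard burn-in from each chain separately before pooling the draws
draws <- unlist(lapply(chains, function(ch) ch[-(1:500)]))
```

Because the chains never communicate, the only coordination cost is collecting the results, which is why this is the "very easy" option on the previous slide.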

Example 1: Multicore M-H Samplers
Ex 1: Questions
1. Can you modify the code to change the number of cores/resources that you are using?
2. How can you create 95% credible intervals from the output?
3. Can you time your results to see if there are any improvements?

Example 2: Multicore Bootstrap Sampling
Variance Components
The previous example ignored the variation in the data due to the farms, which may be an important source of variation. Introduce a "random intercept" α_i into our model to take this into account:

Y_is ~ Bin(1, p_is(x_is, β)),
p_is(x_is, β) = exp(β_0 + α_i + β_1 x_s + β_2 x_len) / [1 + exp(β_0 + α_i + β_1 x_s + β_2 x_len)],

where α_i ~ N(0, σ²_α).
1. This is a GLMM - generalized linear mixed model.
2. It can be fit by PQL - the penalized quasi-likelihood method.
3. This method is known to produce biased estimates of both β and σ²_α.
4. Confidence intervals for σ²_α are also biased.

Example 2: Multicore Bootstrap Sampling
Bootstrap to the Rescue
Use the bootstrap percentile method to simulate the distribution of the estimate and create a confidence interval. Both parametric and nonparametric approaches exist.

Nonparametric Bootstrap Percentile Method
Initialization: fit the PQL model to the original data.
1. Sample with replacement the subset of observations from each farm and combine to create a new data set.
2. Compute the PQL estimate on the resampled data set.
3. Collect the estimates of σ²_α and produce a confidence interval.
4. Create prediction intervals for the individual α_i.
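The within-farm resampling in step 1 can be sketched as below. To keep the sketch self-contained, a cheap stand-in statistic (the between-farm variance of farm means) replaces the PQL refit, which in practice would use something like MASS::glmmPQL; the simulated data and replicate count are illustrative assumptions.

```r
# Toy grouped data: 20 farms with 10 binary observations each
set.seed(1)
dat <- data.frame(farm = rep(1:20, each = 10),
                  y    = rbinom(200, 1, 0.4))

# Resample rows with replacement *within* each farm, then recombine,
# so every bootstrap data set keeps the farm structure intact
resample_by_farm <- function(d) {
  idx <- unlist(lapply(split(seq_len(nrow(d)), d$farm),
                       function(i) i[sample.int(length(i), replace = TRUE)]))
  d[idx, ]
}

# Bootstrap distribution of a stand-in for the sigma^2_alpha estimate
boot_stat <- replicate(200, {
  b <- resample_by_farm(dat)
  var(tapply(b$y, b$farm, mean))
})

quantile(boot_stat, c(0.025, 0.975))   # percentile-method interval
```

Indexing via `i[sample.int(length(i), ...)]` rather than `sample(i, ...)` avoids R's surprise behavior when a farm contributes a single row.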

Example 2: Multicore Bootstrap Sampling
Opportunities for Parallelism
As before, the accuracy of estimates and inferences improves with greater sampling, and we would like to use parallelism to increase the speed at which we sample. Where are the opportunities for parallelism in this example? Three possibilities:
1. Multichain - Again, run multiple chains, since the bootstrap replicates are totally independent.
2. Faster function - The resampling step is a very simple task and can be computed in one step; most of the work is in refitting the PQL model on the resampled data. A multicore PQL may make sense; however, the data set may again be too small for this to be of much benefit.
3. Take advantage of gains from vectorization and avoiding loops.

Example 2: Multicore Bootstrap Sampling
Example: Nonparametric Bootstrap
Let's try bootstrapping the farm effect for our deer parasite example.
1. First run 01-max_pql.R to fit the initial model.
2. Method 1 is simply a serial implementation with a for loop. Run file 11-npbs_for.R.
3. Method 2 uses lapply to eliminate the for loop. Run file 12-npbs_lapply.R.
4. Method 3 uses the mclapply function. Run file 13-npbs_mclapply.R.
5. Method 4 again uses the pbdR package. Run file 14-npbs_pbdR.R.

Example 2: Multicore Bootstrap Sampling
Ex 2: Questions
1. Can you modify the code to change the number of cores/resources that you are using?
2. Can you time your results to see if there are any improvements?
3. Can you estimate the mean and median of the variance estimates from the bootstrapped samples?
4. Can you find a confidence interval for β?
5. Is there a more appropriate way to bootstrap?

Example 3: Multicore Methods for fitting GLMM
The GLMM Likelihood
The generalized linear mixed model likelihood requires us to integrate over the α_i with respect to their densities:

L(β, σ²_α | y, X) = Π_i ∫ [ exp(Σ_s y_is (x_is^T β + α_i)) / Π_s (1 + exp(x_is^T β + α_i)) ] p(α_i | σ²_α) dα_i,

where p(α_i | σ²_α) is the N(0, σ²_α) density. PQL approximates this integral using a quadratic approximation. What can we do to improve the quality of the estimates?
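One simple alternative to the quadratic approximation is plain Monte Carlo: average the conditional likelihood of one farm's data over draws of α_i ~ N(0, σ²_α). In the sketch below, farm_loglik and its arguments are hypothetical names, and the fixed-effect part x_is^T β is passed in as a precomputed linear predictor.

```r
# Monte Carlo estimate of one farm's integral in the GLMM likelihood:
# E_alpha[ prod_s p_is^y_is (1 - p_is)^(1 - y_is) ], alpha ~ N(0, sigma2)
farm_loglik <- function(y, eta_fixed, sigma2_alpha, n_draws = 2000) {
  alpha <- rnorm(n_draws, 0, sqrt(sigma2_alpha))
  lik <- vapply(alpha, function(a) {
    p <- plogis(eta_fixed + a)           # per-animal success probabilities
    prod(p^y * (1 - p)^(1 - y))          # conditional likelihood given alpha
  }, numeric(1))
  log(mean(lik))                         # log of the MC estimate of the integral
}

# Demo on one toy farm with 10 animals and a zero fixed-effect predictor
set.seed(3)
y_farm <- rbinom(10, 1, 0.5)
ll <- farm_loglik(y_farm, eta_fixed = 0, sigma2_alpha = 0.5)
```

The full log-likelihood is then the sum of these per-farm terms, which is exactly the structure the next slides exploit.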

Example 3: Multicore Methods for fitting GLMM
Approach 1: Maximizing the Likelihood
Outer optimization level: inherently serial. The master process chooses new parameter values (β, σ²_α) to pass to the function, and the function returns a value to the optimization algorithm.
Function evaluation: numerical integration or Monte Carlo integration, then compute the product/sum of the per-farm integrals.

Example 3: Multicore Methods for fitting GLMM
Opportunities for Parallelism
Where are the opportunities for parallelism in this example? Two possibilities:
1. Multichain - Not viable at the outer level, though we could try multiple optimizations from different starting points to check convergence.
2. Faster function - Break the function up by doing the integrations for each group (farm) separately.
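Because the likelihood factors over farms, the per-farm integrals are the natural unit of parallelism. The sketch below evaluates them with mclapply under a deliberately simplified conditional model with the fixed-effect predictor set to zero; the data, names, and core count are illustrative assumptions, not the course code.

```r
library(parallel)

# Toy data: 50 farms, 10 binary observations each
set.seed(11)
farm_list <- replicate(50, rbinom(10, 1, 0.4), simplify = FALSE)

# MC estimate of one farm's integral, fixed effects set to zero for brevity
farm_loglik <- function(y, sigma2 = 1, n_draws = 1000) {
  alpha <- rnorm(n_draws, 0, sqrt(sigma2))
  lik <- vapply(alpha,
                function(a) prod(plogis(a)^y * (1 - plogis(a))^(1 - y)),
                numeric(1))
  log(mean(lik))
}

# Farms are independent, so their integrals evaluate in parallel,
# and the per-farm log-likelihoods simply sum at the end
cores    <- if (.Platform$OS.type == "windows") 1 else 2
total_ll <- sum(unlist(mclapply(farm_list, farm_loglik, mc.cores = cores)))
```

The serial outer optimizer would call this evaluation once per candidate (β, σ²_α), so the speedup lands exactly where the slide's "faster function" option says it should.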

Example 3: Multicore Methods for fitting GLMM
Bayesian Approach to GLMM

p(β, α, σ²_α | y) ∝ p(y | β, α) p(β) p(α | σ²_α) p(σ²_α)

Full Conditionals
p(β | ·) ∝ Π_{i=1}^{I} Π_{s=1}^{S_i} p(y_is | β, α_i) p(β)
p(α_i | ·) ∝ Π_{s=1}^{S_i} p(y_is | β, α_i) p(α_i | σ²_α)
p(σ²_α | ·) ∝ Π_{i=1}^{I} p(α_i | σ²_α) p(σ²_α)
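Of these conditionals, p(σ²_α | ·) can be sampled directly if the prior is chosen conveniently. The slides leave p(σ²_α) unspecified, so the inverse-gamma IG(a, b) prior below is an assumption for illustration; it is conjugate to the normal random effects, giving σ²_α | α ~ IG(a + I/2, b + Σ_i α_i²/2).

```r
# Direct draw from the sigma^2_alpha full conditional under an
# assumed (illustrative) inverse-gamma IG(a, b) prior
draw_sigma2 <- function(alpha, a = 0.01, b = 0.01) {
  shape <- a + length(alpha) / 2
  rate  <- b + sum(alpha^2) / 2
  1 / rgamma(1, shape = shape, rate = rate)   # inverse-gamma draw
}

# Toy random effects for 351 farms, as in the deer data dimensions
set.seed(7)
alpha  <- rnorm(351, 0, 0.5)
sigma2 <- replicate(1000, draw_sigma2(alpha))
mean(sigma2)    # concentrates near the empirical variance of alpha
```

The β and α_i conditionals have no such closed form, so within a Gibbs scan they would each get an M-H step as in Example 1.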

Example 3: Multicore Methods for fitting GLMM
Example: Sampling the Posterior Distribution of the GLMM
1. Method 1 is simply a serial implementation with a for loop. Run file 31-mcmc_glmm.R.
2. Method 2 uses mclapply to eliminate the need to loop through all of the random effects. Run file 41-mcmc_glmm_mclapply.R.
3. Method 3 is like Method 2 but uses the pbdR package. Run file 42-mcmc_glmm_pbdR.R.

Example 3: Multicore Methods for fitting GLMM
Ex 3: Questions
1. Exercise: find 95% credible intervals for sd.random.
2. Other ideas?