Parallelization Strategies for Multicore Data Analysis


Wei-Chen Chen (1), Russell Zaretzki (2)
(1) University of Tennessee, Dept. of EEB
(2) University of Tennessee, Dept. of Statistics, Operations, and Management Science
Computing in the Cloud, April 6-8, 2014

Outline

1. Introduction
   - Basic Strategy
   - Data Analysis Algorithms
2. Data and Analysis Techniques
3. Examples
   - Example 1: Multicore M-H Samplers
   - Example 2: Multicore Bootstrap Sampling
   - Example 3: Multicore Methods for Fitting GLMM

Basic Strategy - Multicore

- A core is an individual processor. CPUs used to have a single core, so the terms "core" and "CPU" were interchangeable.
- Modern CPUs have several cores on a single chip.
- Cores on the same chip share memory, which makes parallel algorithms much easier to implement.

Basic Strategy - Multi-Node

- Multi-node machines typically have many interconnected CPUs, each of which may have several cores that share memory.
- Using a multi-node machine usually involves explicitly moving data between nodes. This is a significant complication, and using these machines successfully and efficiently requires a high level of expertise.

Basic Strategy - Points of Parallelism

To make efficient use of multicore resources, we need to understand our data and modelling procedure. The basic question is: where is the independence?

- Data (statistical independence): If observations, or groups of observations, are statistically independent, we may be able to exploit this structure to divide and conquer.
- Algorithm: Do parts of the algorithm allow for parallelism? Can the problem be divided into independent pieces that are computed separately and then recombined?

Data Analysis Algorithms - Inputs and Outputs

Most data-based (statistical) analyses in the life sciences follow a basic functional structure.

Inputs:
- D: data in the form of a vector, list, or other structure.
- Λ: parameters of interest, usually summarized as a vector, matrix, or list.

Outputs:
- φ: a scalar, vector, matrix, or a combination of these.

Data Analysis Algorithms - Key Algorithms

Simulation:
- Parallel chains running at the same time may improve efficiency.
- The cost of any burn-in or discarded samples needs to be considered.
- Useful if we want to run many chains with different data or parameter values.
- Cluster computers are highly effective here; use job schedulers.

Optimization:
- An inherently serial operation controlled by a master process.
- Parallel implementation is most likely to occur within the function call.

Sample Data

Vicente et al. (2006) looked at the distribution and faecal shedding patterns of the first-stage larvae (L1) of Elaphostrongylus cervi (Nematoda: Protostrongylidae) in red deer across Spain.

- n = 826 deer were sampled.
- Deer were grouped among 351 farms.
- Sex and length of the deer are the explanatory variables.
- For the response variable, define $Y_{is}$ as 1 if the parasite E. cervi L1 is found in animal s at farm i, and 0 otherwise.

Logistic Regression

Our goal is to relate presence/absence of the parasite to the size and sex of the host animal, both of which are known. We assume a binomial distribution for $Y_{is}$ and use the logistic link function to relate the mean $p_{is}$ to the explanatory variables. That is,

$$Y_{is} \sim \mathrm{Bin}\bigl(1,\; p_{is}(x_{is}^T\beta)\bigr)$$

$$p_{is}(x_{is}^T\beta) = \frac{\exp(\beta_0 + \beta_1 x_s + \beta_2 x_{len} + \beta_3 x_{len} x_s)}{1 + \exp(\beta_0 + \beta_1 x_s + \beta_2 x_{len} + \beta_3 x_{len} x_s)}$$

We are allowing each sex to have its own intercept and slope.

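Likelihood Function

Whether we take a Bayesian or MLE approach, we will need the likelihood and its logarithm.

$$l(\beta \mid y, X) = \prod_{i=1}^{I} \prod_{s=1}^{S_i} \left( \frac{\exp(x_{is}^T\beta)}{1 + \exp(x_{is}^T\beta)} \right)^{y_{is}} \left( \frac{1}{1 + \exp(x_{is}^T\beta)} \right)^{1 - y_{is}} = \frac{\exp\bigl\{ \sum_{i} \sum_{s} y_{is}\, x_{is}^T\beta \bigr\}}{\prod_{i} \prod_{s} \bigl[ 1 + \exp(x_{is}^T\beta) \bigr]}$$

Taking logarithms turns the product into a sum:

$$\log l(\beta \mid y, X) = \sum_{i=1}^{I} \sum_{s=1}^{S_i} \Bigl[ y_{is}\, x_{is}^T\beta - \log\bigl(1 + \exp(x_{is}^T\beta)\bigr) \Bigr]$$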
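As an illustration of the likelihood slide, the following is a minimal sketch of the logistic log-likelihood, written in standard-library Python rather than the R used by the workshop scripts. The function names (`log1pexp`, `log_likelihood`) and the toy design matrix are assumptions made for this example; they are not taken from the original files. Each row of `X_toy` is assumed to hold (1, sex, length, length x sex).

```python
import math

def log1pexp(eta):
    """Numerically stable log(1 + exp(eta))."""
    if eta > 0:
        return eta + math.log1p(math.exp(-eta))
    return math.log1p(math.exp(eta))

def log_likelihood(beta, X, y):
    """Sum over observations of y * x'beta - log(1 + exp(x'beta))."""
    total = 0.0
    for x_row, y_obs in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_row))  # linear predictor x'beta
        total += y_obs * eta - log1pexp(eta)
    return total

# Hypothetical toy data: columns are intercept, sex, length, length * sex.
X_toy = [[1, 0, 1.5, 0.0], [1, 1, 2.0, 2.0], [1, 0, 0.5, 0.0]]
y_toy = [1, 0, 1]
print(log_likelihood([0.0, 0.0, 0.0, 0.0], X_toy, y_toy))
```

Because the log-likelihood is a plain sum over observations, it can be split into per-farm partial sums that are computed separately and added at the end.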

Example 1: Multicore M-H Samplers - Bayesian Inference in Logistic Regression

We'll keep things simple here and assume an improper flat prior for β because of the lack of available conjugate priors. As a proposal distribution we will use a normal random walk sampler.

- The prior: $\pi(\beta) \propto c$
- The posterior: $\pi(\beta \mid y, X) \propto l(\beta \mid y, X)\,\pi(\beta) \propto l(\beta \mid y, X)$
- The proposal distribution: $q(\beta_i \mid \beta_{i-1}) \sim N\bigl(\beta_{i-1},\; I^{-1}(\beta_{i-1})\bigr)$

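Example 1: Multicore M-H Samplers - Simulation with the Metropolis-Hastings Step

The lack of conjugate priors and the form of the posterior require that we simulate the posterior using the M-H algorithm.

Random Walk M-H Algorithm for Logistic Regression

Initialization: Choose an arbitrary starting value $\beta^{(0)}$.

Iteration t (t ≥ 1):

1. Given $\beta^{(t-1)}$, generate $\tilde\beta \sim q(\beta^{(t-1)}, \tilde\beta)$.
2. Compute
   $$\rho(\beta^{(t-1)}, \tilde\beta) = \min\left(1,\; \frac{\pi(\tilde\beta)\, q(\tilde\beta, \beta^{(t-1)})}{\pi(\beta^{(t-1)})\, q(\beta^{(t-1)}, \tilde\beta)}\right) = \min\left(1,\; \frac{\pi(\tilde\beta)}{\pi(\beta^{(t-1)})}\right),$$
   where the second equality holds when the proposal is symmetric.
3. With probability $\rho(\beta^{(t-1)}, \tilde\beta)$, accept $\tilde\beta$ and set $\beta^{(t)} = \tilde\beta$; otherwise reject $\tilde\beta$ and set $\beta^{(t)} = \beta^{(t-1)}$.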
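The three steps above can be sketched as follows. This is a minimal standard-library Python illustration, not the workshop's R script (21-mcmc-glm.R). The function name `rw_metropolis`, the fixed proposal standard deviation `step` (used here in place of the information-based covariance on the slides), and the toy standard-normal target are all assumptions made for the example.

```python
import math
import random

def rw_metropolis(log_post, beta0, n_iter, step=0.5, seed=1):
    """Random-walk M-H with a symmetric normal proposal of fixed scale."""
    rng = random.Random(seed)
    beta = list(beta0)
    current = log_post(beta)
    draws, accepted = [], 0
    for _ in range(n_iter):
        # Step 1: propose a move around the current value.
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_post(proposal)
        # Steps 2-3: accept with probability min(1, pi(proposal) / pi(current)),
        # evaluated on the log scale; 1 - random() lies in (0, 1], so log is safe.
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
            accepted += 1
        draws.append(list(beta))
    return draws, accepted / n_iter

# Toy target (an assumption for illustration): standard normal log-density.
def toy_log_post(beta):
    return -0.5 * beta[0] ** 2

draws, rate = rw_metropolis(toy_log_post, [3.0], n_iter=20000, step=1.0)
print("acceptance rate:", rate)
```

For the deer example, `log_post` would be the logistic log-likelihood, since the flat prior contributes only a constant.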

Example 1: Multicore M-H Samplers - Opportunities for Parallelism

In MCMC, the accuracy of estimates and inferences improves with greater sampling. We would like to use parallelism to increase the speed at which we sample. Where are the opportunities for parallelism in this example? Two possibilities:

1. Multichain: We can run multiple independent chains, each starting from a different initial value $\beta^{(0)}$. This is very easy to do, but we need to allow each chain to burn in.
2. Faster function: We could use parallelism to speed up the calculation of the likelihood function, particularly if we had very large samples. For example, if we had thousands of observations per farm, we could break up the data, compute the likelihood separately for each farm, and finally bring the results together to get a final value. This is slightly more work than the first idea and will probably help only if the data set is very large.
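The multichain idea can be sketched as below, again in standard-library Python rather than the workshop's R (where `lapply` and `mclapply` play the role of the `map_fn` argument). The names `run_chain` and `run_chains`, the toy standard-normal target, and the burn-in length are assumptions chosen for illustration.

```python
import math
import random

def run_chain(seed, n_iter=5000, burn_in=1000, step=1.0):
    """One random-walk M-H chain on a toy target (standard normal)."""
    rng = random.Random(seed)
    beta = rng.uniform(-5.0, 5.0)      # a different starting value per chain
    kept = []
    for t in range(n_iter):
        proposal = beta + rng.gauss(0.0, step)
        log_ratio = -0.5 * proposal ** 2 + 0.5 * beta ** 2
        if math.log(1.0 - rng.random()) < log_ratio:
            beta = proposal
        if t >= burn_in:               # every chain pays its own burn-in cost
            kept.append(beta)
    return kept

def run_chains(seeds, map_fn=map):
    """Run independent chains; map_fn decides serial or parallel execution."""
    chains = list(map_fn(run_chain, seeds))
    pooled = [draw for chain in chains for draw in chain]
    return chains, pooled

chains, pooled = run_chains([1, 2, 3, 4])   # serial by default
print(len(chains), len(pooled))

# Multicore use (not run here): pass a process pool's map instead, e.g.
#   from multiprocessing import Pool
#   if __name__ == "__main__":
#       with Pool(4) as pool:
#           chains, pooled = run_chains([1, 2, 3, 4], map_fn=pool.map)
```

Seeding each chain separately keeps the chains independent and the run reproducible, whichever `map_fn` is used.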

Example 1: Multicore M-H Samplers - Example: Random Walk M-H in R

Let's try simulating the posterior for our deer parasite example.

1. Method 1 is simply a serial implementation. Run file 21-mcmc-glm.R.
2. Method 2 accesses multiple cores through the mclapply function. Run file 22-mcmc-glm-mclapply.R.
3. Method 3 uses the pbdR package. This allows you to work in a multi-node environment and will be discussed more tomorrow. Run file 23-mcmc-glm-pbdR.R.

Example 1: Multicore M-H Samplers - Ex 1: Questions

1. Can you modify the code to change the number of cores/resources that you are using?
2. How can you create 95% credible intervals from the output?
3. Can you time your results to see if there are any improvements?

Example 2: Multicore Bootstrap Sampling - Variance Components

The previous example ignored the variation in the data due to the farms. Farms may be an important source of variation, so we introduce a "random intercept" into our model to take this into account:

$$Y_{is} \sim \mathrm{Bin}(1, p_{is}), \qquad p_{is} = \frac{\exp(\beta_0 + \alpha_i + \beta_1 x_s + \beta_2 x_{len})}{1 + \exp(\beta_0 + \alpha_i + \beta_1 x_s + \beta_2 x_{len})},$$

where $\alpha_i \sim N(0, \sigma_\alpha^2)$.

1. This is a GLMM, a generalized linear mixed model.
2. It can be fit by PQL, the penalized quasi-likelihood method.
3. This method is known to produce biased estimates of both $\beta$ and $\sigma_\alpha^2$.
4. Confidence intervals for $\sigma_\alpha^2$ are also biased.

Example 2: Multicore Bootstrap Sampling - Bootstrap to the Rescue

Use the bootstrap percentile method to simulate the distribution of the estimate and create a confidence interval. Both parametric and nonparametric approaches exist.

Non-Parametric Bootstrap Percentile Method

Initialization: Fit the PQL model to the original data.

1. Sample with replacement from the observations within each farm, and combine the results to create a new data set.
2. Compute the PQL estimate for the resampled data set.
3. Collect the estimates of $\sigma_\alpha^2$ and produce a confidence interval.
4. Create prediction intervals for the individual $\alpha_i$.
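The resampling and percentile steps can be sketched as follows in standard-library Python (the workshop scripts themselves are in R). Note the substitution: fitting a PQL model is outside the scope of a short example, so `farm_variance` below is a deliberately simple placeholder statistic, the variance of farm-level prevalences, and is not a PQL estimate of $\sigma_\alpha^2$. The toy data and all function names are hypothetical.

```python
import random
import statistics

def resample_within_farms(data, rng):
    """Step 1: resample observations with replacement inside each farm."""
    return {farm: rng.choices(obs, k=len(obs)) for farm, obs in data.items()}

def farm_variance(data):
    """Placeholder statistic (NOT a PQL fit): variance of farm prevalences."""
    return statistics.pvariance([sum(obs) / len(obs) for obs in data.values()])

def bootstrap_replicate(task):
    """Step 2: compute the statistic on one resampled data set."""
    data, seed = task
    return farm_variance(resample_within_farms(data, random.Random(seed)))

def bootstrap(data, n_boot=1000, map_fn=map):
    """Replicates are independent, so map_fn may be serial or parallel."""
    tasks = [(data, seed) for seed in range(n_boot)]
    return list(map_fn(bootstrap_replicate, tasks))

def percentile_interval(values, level=0.95):
    """Step 3: percentile interval from the sorted bootstrap replicates."""
    ordered = sorted(values)
    lo = int((1 - level) / 2 * (len(ordered) - 1))
    hi = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo], ordered[hi]

# Hypothetical toy data: 0/1 parasite indicators grouped by farm.
toy_data = {"farm_a": [0, 1, 1, 0, 1], "farm_b": [0, 0, 0, 1],
            "farm_c": [1, 1, 1, 0, 1, 1]}
replicates = bootstrap(toy_data, n_boot=500)
print(percentile_interval(replicates))
```

Passing a process pool's `map` as `map_fn` would distribute the replicates over cores, which mirrors the move from `lapply` to `mclapply` in the R scripts.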

Example 2: Multicore Bootstrap Sampling - Opportunities for Parallelism

As before, the accuracy of estimates and inferences improves with greater sampling. We would like to use parallelism to increase the speed at which we sample. Where are the opportunities for parallelism in this example? Possibilities:

1. Multichain: Again, run multiple chains, since the bootstrap replicates are completely independent.
2. Faster function: The resampling step is a very simple task and can be computed in one step. Most of the work is in refitting the PQL model on the resampled data. A multicore PQL may make sense; however, the data set may again be too small for this to be of much benefit.
3. Take advantage of gains from vectorization and avoiding loops.

Example 2: Multicore Bootstrap Sampling - Example: Nonparametric Bootstrap

Let's try bootstrapping the farm effect for our deer parasite example.

1. First run 01-max_pql.R to fit the initial model.
2. Method 1 is simply a serial implementation with a for loop. Run file 11-npbs_for.R.
3. Method 2 uses lapply to eliminate the for loop. Run file 12-npbs_lapply.R.
4. Method 3 uses the mclapply function. Run file 13-npbs_mclapply.R.
5. Method 4 again uses the pbdR package. Run file 14-npbs_pbdR.R.

Example 2: Multicore Bootstrap Sampling - Ex 2: Questions

1. Can you modify the code to change the number of cores/resources that you are using?
2. Can you time your results to see if there are any improvements?
3. Can you estimate the mean and the median of the variation for the bootstrapped samples?
4. Find a C.I. for beta.
5. Is there a more appropriate way to bootstrap?

Example 3: Multicore Methods for Fitting GLMM - The GLMM Likelihood

The generalized linear mixed model likelihood requires us to integrate over the $\alpha_i$ with respect to their densities:

$$l(\beta \mid y, X) = \prod_{i} \int \left( \frac{\exp\bigl\{ \sum_{s} y_{is}\,(x_{is}^T\beta + \alpha_i) \bigr\}}{\prod_{s} \bigl[ 1 + \exp(x_{is}^T\beta + \alpha_i) \bigr]} \right) p(\alpha_i \mid \sigma_\alpha^2)\, d\alpha_i,$$

where $p(\alpha_i \mid \sigma_\alpha^2) = N(0, \sigma_\alpha^2)$.

PQL approximates this integral using a quadratic approximation. What can we do to improve the quality of the estimates?

Example 3: Multicore Methods for Fitting GLMM - Approach 1: Maximizing the Likelihood

Outer layer (optimization level):
- Inherently serial.
- The master process chooses new parameter values $(\beta, \sigma_\alpha^2)$ to pass to the function.
- The function returns a value to the optimization algorithm.

Function evaluation:
- Numerical integration or Monte Carlo integration.
- Compute the product/sum of the integrals.

Example 3: Multicore Methods for Fitting GLMM - Opportunities for Parallelism

Where are the opportunities for parallelism in this example? Two possibilities:

1. Multichain: Not viable at the outer level. We could try multiple optimizations to check convergence.
2. Faster function: Break the function up by doing the integrations for each group (farm) separately.
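The "faster function" idea, one integral per farm, can be sketched as below in standard-library Python (the workshop scripts are in R). The trapezoid rule on a grid spanning plus or minus six standard deviations is an assumption chosen for simplicity, not the integration method of the original scripts. The function names and the data layout (each farm as a list of `(x_row, y)` pairs) are likewise hypothetical.

```python
import math
from functools import partial

def log1pexp(eta):
    """Numerically stable log(1 + exp(eta))."""
    if eta > 0:
        return eta + math.log1p(math.exp(-eta))
    return math.log1p(math.exp(eta))

def farm_log_marginal(farm, beta, sigma, n_grid=201):
    """Log of the integral over alpha_i for one farm (trapezoid rule)."""
    width = 12.0 * sigma / (n_grid - 1)            # grid spans +/- 6 sigma
    total = 0.0
    for k in range(n_grid):
        alpha = -6.0 * sigma + k * width
        # log N(alpha; 0, sigma^2)
        log_f = -0.5 * (alpha / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        for x_row, y in farm:
            eta = alpha + sum(b * x for b, x in zip(beta, x_row))
            log_f += y * eta - log1pexp(eta)       # Bernoulli log-probability
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * math.exp(log_f)
    return math.log(total * width)

def glmm_log_likelihood(farms, beta, sigma, map_fn=map):
    """Sum of per-farm log integrals; the farm terms are independent."""
    one_farm = partial(farm_log_marginal, beta=beta, sigma=sigma)
    return sum(map_fn(one_farm, farms))

# Hypothetical toy data: two farms, intercept-only design.
toy_farms = [[([1.0], 1), ([1.0], 0)], [([1.0], 1)]]
print(glmm_log_likelihood(toy_farms, beta=[0.0], sigma=1.0))
```

The outer optimizer would still call `glmm_log_likelihood` serially; only the per-farm integrals inside each call are candidates for a parallel `map_fn`.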

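Example 3: Multicore Methods for Fitting GLMM - Bayesian Approach to GLMM

$$p(\beta, \alpha, \sigma_\alpha^2 \mid y) \propto p(y \mid \beta, \alpha, \sigma_\alpha^2)\, p(\beta)\, p(\alpha \mid \sigma_\alpha^2)\, p(\sigma_\alpha^2)$$

Full Conditionals

$$p(\beta \mid \cdot) \propto \prod_{i=1}^{I} \prod_{s=1}^{S_i} p(y_{is} \mid \beta, \alpha_i)\, p(\beta)$$

$$p(\alpha_i \mid \cdot) \propto \prod_{s=1}^{S_i} p(y_{is} \mid \beta, \alpha_i)\, p(\alpha_i \mid \sigma_\alpha^2)$$

$$p(\sigma_\alpha^2 \mid \cdot) \propto \prod_{i=1}^{I} p(\alpha_i \mid \sigma_\alpha^2)\, p(\sigma_\alpha^2)$$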

Example 3: Multicore Methods for Fitting GLMM - Example: Sampling the Posterior Distribution of the GLMM

1. Method 1 is simply a serial implementation with a for loop. Run file 31-mcmc_glmm.R.
2. Method 2 uses mclapply to eliminate the need to loop through all of the random effects. Run file 41-mcmc_glmm_mclapply.R.
3. Method 3 is like Method 2 but uses the pbdR package. Run file 42-mcmc_glmm_pbdR.R.

Example 3: Multicore Methods for Fitting GLMM - Ex 3: Questions

1. Exercise: Find 95% credible intervals for sd.random.
2. Other ideas.


More information

Combining information from different survey samples - a case study with data collected by world wide web and telephone

Combining information from different survey samples - a case study with data collected by world wide web and telephone Combining information from different survey samples - a case study with data collected by world wide web and telephone Magne Aldrin Norwegian Computing Center P.O. Box 114 Blindern N-0314 Oslo Norway E-mail:

More information

Bayes and Big Data: The Consensus Monte Carlo Algorithm

Bayes and Big Data: The Consensus Monte Carlo Algorithm Bayes and Big Data: The Consensus Monte Carlo Algorithm Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George 3, and Robert E. McCulloch 4 Google, Inc. Acadia University

More information

Introduction to latent variable models

Introduction to latent variable models Introduction to latent variable models lecture 1 Francesco Bartolucci Department of Economics, Finance and Statistics University of Perugia, IT bart@stat.unipg.it Outline [2/24] Latent variables and their

More information

More details on the inputs, functionality, and output can be found below.

More details on the inputs, functionality, and output can be found below. Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing

More information

The Exponential Family

The Exponential Family The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

More information

Experimental data analysis Lecture 3: Confidence intervals. Dodo Das

Experimental data analysis Lecture 3: Confidence intervals. Dodo Das Experimental data analysis Lecture 3: Confidence intervals Dodo Das Review of lecture 2 Nonlinear regression - Iterative likelihood maximization Levenberg-Marquardt algorithm (Hybrid of steepest descent

More information

Imposing Curvature Restrictions on a Translog Cost Function using a Markov Chain Monte Carlo Simulation Approach

Imposing Curvature Restrictions on a Translog Cost Function using a Markov Chain Monte Carlo Simulation Approach Imposing Curvature Restrictions on a Translog Cost Function using a Markov Chain Monte Carlo Simulation Approach Kranti Mulik Graduate Student 333 A Waters Hall Department of Agricultural Economics Kansas

More information

SAS Certificate Applied Statistics and SAS Programming

SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and Advanced SAS Programming Brigham Young University Department of Statistics offers an Applied Statistics and

More information

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Note on the EM Algorithm in Linear Regression Model

Note on the EM Algorithm in Linear Regression Model International Mathematical Forum 4 2009 no. 38 1883-1889 Note on the M Algorithm in Linear Regression Model Ji-Xia Wang and Yu Miao College of Mathematics and Information Science Henan Normal University

More information

A crash course in probability and Naïve Bayes classification

A crash course in probability and Naïve Bayes classification Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

More information

Big data in R EPIC 2015

Big data in R EPIC 2015 Big data in R EPIC 2015 Big Data: the new 'The Future' In which Forbes magazine finds common ground with Nancy Krieger (for the first time ever?), by arguing the need for theory-driven analysis This future

More information

Handling attrition and non-response in longitudinal data

Handling attrition and non-response in longitudinal data Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Statistics in Geophysics: Linear Regression II

Statistics in Geophysics: Linear Regression II Statistics in Geophysics: Linear Regression II Steffen Unkel Department of Statistics Ludwig-Maximilians-University Munich, Germany Winter Term 2013/14 1/28 Model definition Suppose we have the following

More information

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni 1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed

More information

Bayesian Methods. 1 The Joint Posterior Distribution

Bayesian Methods. 1 The Joint Posterior Distribution Bayesian Methods Every variable in a linear model is a random variable derived from a distribution function. A fixed factor becomes a random variable with possibly a uniform distribution going from a lower

More information

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

More information

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)}

C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)} C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)} 1. EES 800: Econometrics I Simple linear regression and correlation analysis. Specification and estimation of a regression model. Interpretation of regression

More information

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:

More information

A Scalable Bootstrap for Massive Data

A Scalable Bootstrap for Massive Data A Scalable Bootstrap for Massive Data arxiv:2.56v2 [stat.me] 28 Jun 22 Ariel Kleiner Department of Electrical Engineering and Computer Science University of California, Bereley aleiner@eecs.bereley.edu

More information

Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

More information

Logistic Regression (1/24/13)

Logistic Regression (1/24/13) STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

More information

A Latent Variable Approach to Validate Credit Rating Systems using R

A Latent Variable Approach to Validate Credit Rating Systems using R A Latent Variable Approach to Validate Credit Rating Systems using R Chicago, April 24, 2009 Bettina Grün a, Paul Hofmarcher a, Kurt Hornik a, Christoph Leitner a, Stefan Pichler a a WU Wien Grün/Hofmarcher/Hornik/Leitner/Pichler

More information

Bayesian modeling of inseparable space-time variation in disease risk

Bayesian modeling of inseparable space-time variation in disease risk Bayesian modeling of inseparable space-time variation in disease risk Leonhard Knorr-Held Laina Mercer Department of Statistics UW May 23, 2013 Motivation Area and time-specific disease rates Area and

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

Chenfeng Xiong (corresponding), University of Maryland, College Park (cxiong@umd.edu)

Chenfeng Xiong (corresponding), University of Maryland, College Park (cxiong@umd.edu) Paper Author (s) Chenfeng Xiong (corresponding), University of Maryland, College Park (cxiong@umd.edu) Lei Zhang, University of Maryland, College Park (lei@umd.edu) Paper Title & Number Dynamic Travel

More information

Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

More information

Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem. Lecture 12 04/08/2008. Sven Zenker

Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem. Lecture 12 04/08/2008. Sven Zenker Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem Lecture 12 04/08/2008 Sven Zenker Assignment no. 8 Correct setup of likelihood function One fixed set of observation

More information

The Variability of P-Values. Summary

The Variability of P-Values. Summary The Variability of P-Values Dennis D. Boos Department of Statistics North Carolina State University Raleigh, NC 27695-8203 boos@stat.ncsu.edu August 15, 2009 NC State Statistics Departement Tech Report

More information