Tutorial on Markov Chain Monte Carlo



Tutorial on Markov Chain Monte Carlo
Kenneth M. Hanson, Los Alamos National Laboratory
Presented at the 20th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology, Gif-sur-Yvette, France, July 8-13, 2000
This presentation is available at http://public.lanl.gov/kmh/talks/
July 2000, revised 14/05/08
LA-UR-05-5680

Acknowledgements
MCMC experts: Julian Besag, Jim Gubernatis, John Skilling, Malvin Kalos
General discussions: Greg Cunningham, Richard Silver

Problem statement
A parameter space of n dimensions is represented by the vector x.
Given an arbitrary target probability density function (pdf) q(x), draw a set of samples {x_k} from it.
Typically the only requirement is that, given x, one be able to evaluate C q(x), where C is an unknown constant; MCMC algorithms do not require knowledge of the normalization constant of the target pdf, so from now on the multiplicative constant C will not be made explicit.
Although the focus here is on continuous variables, MCMC can be applied to discrete variables as well.
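
A minimal sketch of such an unnormalized target, for use in the later code sketches. The specific density (a correlated 2D Gaussian) and the helper name `log_q` are illustrative choices of mine, not taken from the slides; working with log[C q(x)] avoids numerical under/overflow.

```python
import numpy as np

# Illustrative unnormalized target density: a correlated 2D Gaussian known only
# up to a multiplicative constant C (the normalization is never used).
COV = np.array([[1.0, 0.8],
                [0.8, 1.0]])
COV_INV = np.linalg.inv(COV)

def log_q(x):
    """Return log q(x) up to an additive constant, i.e. log[C q(x)]."""
    x = np.asarray(x, dtype=float)
    return -0.5 * x @ COV_INV @ x
```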

Uses of MCMC
Permits evaluation of expectation values of functions of x, e.g., ⟨f(x)⟩ = ∫ f(x) q(x) dx ≈ (1/K) Σ_k f(x_k); a typical use is to calculate the mean ⟨x⟩ and the variance ⟨(x − ⟨x⟩)²⟩.
Also useful for evaluating integrals, such as the partition function for properly normalizing the pdf.
Dynamic display of the sequence provides visualization of uncertainties in the model and of the range of model variations.
Marginalization is automatic: when considering any subset of the parameters of an MCMC sequence, the remaining parameters are marginalized over.
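
As a small illustration of the two points above (helper name and array layout are assumptions of mine), expectations reduce to sample averages over the chain, and marginalization amounts to simply ignoring the columns you are not interested in:

```python
import numpy as np

def posterior_summaries(samples):
    """samples: array of shape (K, n) holding the MCMC draws {x_k}."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)        # <x>  ~  (1/K) sum_k x_k
    var = samples.var(axis=0)          # <(x - <x>)^2>, per component
    marginal_first = samples[:, 0]     # marginal samples of the first parameter
    return mean, var, marginal_first
```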

Markov Chain Monte Carlo
Generates a sequence of random samples from an arbitrary probability density function.
Metropolis algorithm:
draw a trial step from a symmetric pdf, i.e., t(Δx) = t(−Δx)
accept or reject the trial step
simple and generally applicable
relies only on calculation of the target pdf for any x
[Figure: random walk in the (x1, x2) plane over contours of Probability(x1, x2), with accepted and rejected steps marked.]

Metropolis algorithm
Select an initial parameter vector x_0.
Iterate as follows: at iteration number k,
(1) create a new trial position x* = x_k + Δx, where Δx is randomly chosen from t(Δx)
(2) calculate the ratio r = q(x*)/q(x_k)
(3) accept the trial position, i.e. set x_{k+1} = x*, if r ≥ 1, or with probability r if r < 1; otherwise stay put, x_{k+1} = x_k
Requires only computation of q(x).
Creates a Markov chain, since x_{k+1} depends only on x_k.
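
A compact sketch of these steps (the function names, the use of log q rather than q, and the Gaussian trial pdf are my choices, not prescribed by the slide):

```python
import numpy as np

def metropolis(log_q, x0, n_steps, step_width, rng=None):
    """Random-walk Metropolis following steps (1)-(3) above.

    log_q      : function returning log q(x) up to an additive constant
    x0         : initial parameter vector
    step_width : std. dev. of the symmetric Gaussian trial distribution t(dx)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    log_qx = log_q(x)
    chain, n_accept = [x.copy()], 0
    for _ in range(n_steps):
        x_trial = x + step_width * rng.standard_normal(x.shape)   # step (1)
        log_q_trial = log_q(x_trial)
        # steps (2)-(3): accept with probability min(1, r), r = q(x*)/q(x_k)
        if np.log(rng.random()) < log_q_trial - log_qx:
            x, log_qx = x_trial, log_q_trial
            n_accept += 1
        chain.append(x.copy())                                     # otherwise stay put
    return np.array(chain), n_accept / n_steps
```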

Choice of trial distribution
Loose requirements on the trial distribution t(): stationary, i.e., independent of position.
Often-used functions include the n-D Gaussian (isotropic and uncorrelated) and the n-D Cauchy (isotropic and uncorrelated).
Choose the width to optimize MCMC efficiency; rule of thumb: aim for an acceptance fraction of about 25%.
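
One simple way to apply the rule of thumb (an assumption of mine, not a procedure from the slides) is to run short pilot chains and nudge the trial width toward the target acceptance fraction; this reuses the `metropolis` sketch from the previous block:

```python
import numpy as np

def tune_step_width(log_q, x0, width=1.0, target=0.25, rounds=10, pilot_steps=500):
    for _ in range(rounds):
        _, accept_frac = metropolis(log_q, x0, pilot_steps, width)
        width *= np.exp(accept_frac - target)   # widen if accepting too often, shrink otherwise
    return width
```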

Experiments with the Metropolis algorithm
Target distribution q(x): an n-dimensional Gaussian, uncorrelated, isotropic with unit variance; the most generic case.
Trial distribution t(Δx): an n-dimensional Gaussian, uncorrelated, equivariate; various widths.
[Figure: contours of the target pdf with the trial pdf centered on the current sample x_k.]

MCMC sequences for a 2D Gaussian
Results of running Metropolis with ratios of trial width to target width of 0.25, 1, and 4.
When the trial pdf is much narrower than the target pdf, movement across the target pdf is slow.
When the trial width is the same as the target width, the samples seem to sample the target pdf better.
When the trial pdf is much wider than the target, trials stay put for long periods, but the jumps are large.
This example is from Hanson and Cunningham (SPIE, 1998).
[Figure: sample sequences for width ratios 0.25, 1, and 4.]

MCMC sequences for a 2D Gaussian
Results of running Metropolis with ratios of trial width to target width of 0.25, 1, and 4, displayed as the accumulated 2D distribution for 1000 trials.
Viewed this way, it is difficult to see the difference between the top two cases.
When the trial pdf is much wider than the target, there are fewer splats, but they are farther apart.
[Figure: accumulated 2D distributions for width ratios 0.25, 1, and 4.]

MCMC - autocorrelation and efficiency
In an MCMC sequence, subsequent parameter values are usually correlated.
The degree of correlation is quantified by the autocorrelation function ρ(l) = (1/N) Σ_{i=1}^{N} y(i) y(i − l), where y(i) is the sequence and l is the lag.
For a Markov chain, one expects exponential decay, ρ(l) = exp[−l/λ].
The sampling efficiency is η = [1 + 2 Σ_{l=1}^{∞} ρ(l)]⁻¹ ≈ (1 + 2λ)⁻¹.
In other words, η⁻¹ iterates are required to achieve one statistically independent sample.
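
A numerical version of these quantities (helper names are mine; the sum over lags is truncated at max_lag rather than carried to infinity), applied to one parameter's chain:

```python
import numpy as np

def autocorrelation(y, max_lag):
    """Normalized autocorrelation rho(l) of a scalar sequence y, l = 0..max_lag."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    var = np.mean(y * y)
    return np.array([np.mean(y[l:] * y[:len(y) - l]) / var
                     for l in range(max_lag + 1)])

def sampling_efficiency(y, max_lag=200):
    """eta = 1 / (1 + 2 * sum_l rho(l)), with the sum truncated at max_lag."""
    rho = autocorrelation(y, max_lag)
    return 1.0 / (1.0 + 2.0 * rho[1:].sum())
```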

Autocorrelation for the 2D Gaussian
The plot confirms that the autocorrelation drops slowly when the trial width is much smaller than the target width; the MCMC efficiency is then poor.
The best efficiency occurs when the trial is about the same size as the target (for 2D).
[Figure: normalized autocovariance for trial widths of 0.25, 1, and 4 relative to the target.]

Efficiency as a function of the width of the trial pdf
For isotropic, unit-variance Gaussian targets in 1 to 64 dimensions: efficiency as a function of the width of the trial distribution.
Boxes are predictions of the optimal efficiency from diffusion theory [A. Gelman et al., 1996].
Efficiency drops reciprocally with the number of dimensions.
[Figure: efficiency versus trial width for n = 1, 2, 4, 8, 16, 32, and 64 dimensions.]

Efficiency as a function of acceptance fraction
For isotropic, unit-variance Gaussian targets in 1 to 64 dimensions: efficiency as a function of the acceptance fraction.
The best efficiency is achieved when about 25% of trials are accepted, for a moderate number of dimensions.
[Figure: efficiency versus acceptance fraction for n = 1 to 64 dimensions.]

Efficiency of the Metropolis algorithm
The results of this experimental study agree with the predictions from diffusion theory (A. Gelman et al., 1996).
The optimum width of a Gaussian trial distribution occurs for an acceptance fraction of about 25% (and is a weak function of the number of dimensions).
Optimal statistical efficiency: η ≈ 0.3/n for the simplest case of an uncorrelated, equivariate Gaussian; correlations and unequal variances generally decrease the efficiency.

Further considerations
When the target distribution q(x) is not isotropic, it is difficult to accommodate with an isotropic t(Δx), and each parameter can have a different efficiency; it is desirable to vary the width of t() in each direction to approximately match q(x), which recovers the efficiency of the univariate case.
When q(x) has correlations, t(Δx) should match the shape of q(x).
[Figure: contours of a correlated q(x) with a shape-matched trial pdf t(Δx).]
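
A sketch of one common way to do this (the slide gives no recipe, so this is an assumption): estimate the target covariance from a pilot chain and propose x* = x + L z with z ~ N(0, I) and L Lᵀ = (2.4/√n)² Cov_pilot, the scaling suggested in the Gelman et al. (1996) reference cited here. The returned function can replace the isotropic proposal in the earlier `metropolis` sketch.

```python
import numpy as np

def shaped_proposal(pilot_chain, rng=None):
    """pilot_chain: array of shape (K, n) of pilot samples, n >= 2."""
    rng = np.random.default_rng() if rng is None else rng
    cov = np.cov(np.asarray(pilot_chain, dtype=float), rowvar=False)
    n = cov.shape[0]
    L = np.linalg.cholesky(cov) * (2.4 / np.sqrt(n))   # shape-matched trial step
    return lambda x: x + L @ rng.standard_normal(n)
```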

MCMC - Issues
Identification of convergence to the target pdf: is the sequence in thermodynamic equilibrium with the target pdf? Are the estimated properties of the parameters (e.g., covariance) valid?
Burn-in: at the beginning of a sequence, one may need to run the MCMC for a while to achieve convergence to the target pdf.
Use of multiple sequences: different starting values can help confirm convergence; a natural choice when using computers with multiple CPUs.
Accuracy of the estimated properties of the parameters: related to the efficiency, described above.
Optimization of the efficiency of the MCMC.
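
The slides do not name a specific convergence diagnostic; one standard option for comparing multiple sequences is the Gelman-Rubin potential scale reduction factor (values near 1 suggest the chains agree). A minimal sketch for one scalar parameter:

```python
import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (m_chains, n_samples) for one scalar parameter."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)               # R-hat
```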

MCMC - Multiple runs
Multiple runs, started with different random number seeds, confirm that the MCMC sequences have converged to the target pdf.
[Figure: a first MCMC sequence and a second, independent MCMC sequence.]
Examples of multiple MCMC runs appear in my talk on the analysis of Rossi data (http://public.lanl.gov/kmh/talks/maxent00a.pdf).

Annealing
Introduction of a fictitious temperature: define the functional ϕ(x) as the minus-logarithm of the target probability, ϕ(x) = −log q(x), and scale ϕ by an inverse temperature to form a new pdf, q'(x, T) = exp[−ϕ(x)/T].
q'(x, T) is flatter than q(x) for T > 1 (called annealing).
Uses of annealing (also called tempering): allows MCMC to move between multiple peaks in q(x); the simulated-annealing optimization algorithm (takes the limit T → 0).
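
In log form the tempered pdf is simply log q'(x, T) = log q(x) / T, so the same Metropolis sampler can be run on a flattened target. A minimal sketch, assuming the `log_q` helper from the earlier blocks:

```python
def log_q_tempered(x, T, log_q):
    """log of q'(x, T) = exp[-phi(x)/T] with phi(x) = -log q(x)."""
    return log_q(x) / T

# Running the sampler with T >> 1 lets the chain hop between well-separated
# peaks; the samples of interest remain those drawn at T = 1.
```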

Annealing to handle multiple peaks
Example: the target distribution is three narrow, well-separated peaks.
For the original distribution (T = 1), an MCMC run of 10000 steps rarely moves between the peaks.
At temperature T = 100, MCMC moves easily between the peaks and through the surrounding regions.
From M-D Wu and W. J. Fitzgerald, Maximum Entropy and Bayesian Methods (1996).
[Figure: sample sequences at T = 1 (left) and T = 100 (right).]

Other MCMC algorithms
Gibbs: vary only one component of x at a time; draw the new value of x_j from the conditional q(x_j | x_1, x_2, ..., x_{j−1}, x_{j+1}, ...).
Metropolis-Hastings: allows use of a nonsymmetric trial function t(Δx; x_k), suitably chosen to improve efficiency; use r = [q(x*) t(−Δx; x*)] / [q(x_k) t(Δx; x_k)].
Langevin technique: uses the gradient* of the minus-log-probability to shift the trial function towards regions of higher probability; uses Metropolis-Hastings.
* Adjoint differentiation affords efficient gradient calculation.
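
A short sketch of the Metropolis-Hastings acceptance test for a nonsymmetric trial pdf (the helper names are mine; `log_t(dx, x)` is assumed to return the log density of proposing step dx when the chain is at x):

```python
import numpy as np

def mh_accept(log_q, log_t, x, x_trial, rng):
    dx = x_trial - x
    log_r = (log_q(x_trial) + log_t(-dx, x_trial)) - (log_q(x) + log_t(dx, x))
    return np.log(rng.random()) < log_r    # accept with probability min(1, r)
```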

Hamiltonian hybrid algorithm
Called "hybrid" because it alternates Gibbs and Metropolis steps.
Associate with each parameter x_i a momentum p_i.
Define a Hamiltonian H = ϕ(x) + Σ_i p_i²/(2 m_i), where ϕ = −log q(x); the new pdf is q'(x, p) = exp[−H(x, p)] = q(x) exp[−Σ_i p_i²/(2 m_i)].
One can easily move long distances in (x, p) space at constant H using Hamiltonian dynamics, so the Metropolis step is very efficient; this uses the gradient* of ϕ (minus-log-prob).
The Gibbs step in p for constant x is easy.
Efficiency may be better than Metropolis for large dimensions.
* Adjoint differentiation affords efficient gradient calculation.
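
A compact sketch of one such update, assuming unit masses and a user-supplied gradient of ϕ(x) = −log q(x); the leapfrog step size and path length are tuning choices of mine, not values from the slide:

```python
import numpy as np

def hybrid_step(x, log_q, grad_phi, rng, eps=0.1, n_leapfrog=20):
    p = rng.standard_normal(x.shape)                  # Gibbs step: fresh momenta
    x_new, p_new = x.copy(), p.copy()
    for _ in range(n_leapfrog):                       # follow H approximately constant
        p_new -= 0.5 * eps * grad_phi(x_new)
        x_new += eps * p_new
        p_new -= 0.5 * eps * grad_phi(x_new)
    H_old = -log_q(x) + 0.5 * p @ p
    H_new = -log_q(x_new) + 0.5 * p_new @ p_new
    if np.log(rng.random()) < H_old - H_new:          # Metropolis step on H
        return x_new
    return x
```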

Hamiltonian hybrid algorithm
[Figure: typical trajectories in the (x_i, p_i) plane for successive samples k, k+1, k+2. Red path: Gibbs sample from the momentum distribution; green path: trajectory at constant H, followed by a Metropolis step.]

Conclusions
MCMC provides a good tool for exploring the posterior and hence for drawing inferences about models and parameters.
For valid results, care must be taken to verify convergence of the sequence: exclude the early part of the sequence, before convergence is reached, and be wary of multiple peaks that need to be sampled.
For good efficiency, care must be taken to adjust the size and shape of the trial distribution; the rule of thumb is to aim for 25% trial acceptance for 5 < n < 100.
A lot of research is happening; don't worry, be patient.

Short Bibliography
"Posterior sampling with improved efficiency," K. M. Hanson and G. S. Cunningham, Proc. SPIE 3338, 371-382 (1998); includes an introduction to MCMC.
Markov Chain Monte Carlo in Practice, W. R. Gilks et al., eds. (Chapman and Hall, 1996); excellent general-purpose book.
"Efficient Metropolis jumping rules," A. Gelman et al., in Bayesian Statistics 5, J. M. Bernardo et al., eds. (Oxford Univ. Press, 1996); diffusion theory.
"Bayesian computation and stochastic systems," J. Besag et al., Stat. Sci. 10, 3-66 (1995); MCMC applied to image analysis.
Bayesian Learning for Neural Networks, R. M. Neal (Springer, 1996); Hamiltonian hybrid MCMC.
"Bayesian multimodal evidence computation by adaptive tempered MCMC," M-D Wu and W. J. Fitzgerald, in Maximum Entropy and Bayesian Methods, K. M. Hanson and R. N. Silver, eds. (Kluwer Academic, 1996); annealing.
More articles and slides at http://www.lanl.gov/home/kmh/

Short Bibliography (continued)
"Inversion based on complex simulations," K. M. Hanson, in Maximum Entropy and Bayesian Methods, G. J. Erickson et al., eds. (Kluwer Academic, 1998); describes adjoint differentiation and its usefulness.
"Bayesian analysis in nuclear physics," four tutorials presented at LANSCE, July 25 - August 1, 2005, including more detail about MCMC and its applications [http://www.lanl.gov/home/kmh/talks].
More articles and slides at http://www.lanl.gov/home/kmh/