MCMC Diagnostics. Patrick Breheny. March 5. Outline: Convergence; Efficiency and accuracy; Summary.


MCMC Diagnostics. Patrick Breheny. BST 701: Bayesian Modeling in Biostatistics. March 5.

Introduction

As we have seen, there are two chief concerns with the use of MCMC methods for posterior inference: Has our chain converged in distribution to the posterior? How much information about the posterior does our chain contain? Today's lecture is about methods, both informal (graphical/visual checks) and formal, for assessing each of the above concerns. The R2OpenBUGS and R2jags packages provide some of these tools, but we will also be using the package coda, which provides many additional diagnostics for assessing convergence and accuracy. We will illustrate these methods on a basic linear regression model for a classic data set on Swiss fertility (full details in the supplementary code).
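For concreteness, here is a minimal sketch of how such a fit might be set up with R2jags. The priors, variable names, and file name are illustrative assumptions, not the course's actual supplementary code; the swiss data set ships with base R, with Fertility as the response:

    library(R2jags)  # assumes JAGS is installed

    data(swiss)
    X <- scale(swiss[, -1])   # five standardized predictors
    y <- swiss$Fertility

    model_string <- "model {
      for (i in 1:n) {
        y[i] ~ dnorm(mu[i], tau)
        mu[i] <- beta[1] + inprod(X[i, ], beta[2:6])
      }
      for (j in 1:6) { beta[j] ~ dnorm(0, 0.0001) }  # vague normal priors (assumed)
      tau ~ dgamma(0.001, 0.001)                     # vague prior on the precision (assumed)
    }"
    writeLines(model_string, "swiss.jags")

    fit <- jags(data = list(y = y, X = X, n = nrow(X)),
                parameters.to.save = c("beta", "tau"),
                model.file = "swiss.jags",
                n.chains = 3, n.iter = 10000, n.burnin = 5000)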

Multiple chains

Most approaches for detecting convergence, both formal and informal, rest on the idea of starting multiple Markov chains and observing whether they come together and start to behave similarly (if they do, we can pool the results from each chain). In bugs/jags, the number of chains is set by the n.chains argument (the default is 3 chains). The results are contained in a T × M × p array called sims.array, where T is the number of draws/iterations, M is the number of chains, and p is the number of parameters we are monitoring, although most diagnostic functions operate at a higher level and do not require interacting with the array itself.
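As a small illustration, assuming the R2jags fit above (with R2OpenBUGS, sims.array sits directly on the fit object rather than under $BUGSoutput):

    A <- fit$BUGSoutput$sims.array
    dim(A)                  # T x M x p (p includes the automatically monitored deviance)
    A[1:5, 2, "beta[6]"]    # first five retained draws of beta[6] from chain 2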

Multiple chains: Trace plots

Trace plots, as in the last lecture, can be obtained via traceplot(fit).

[Figure: trace plot of beta[6] for three chains over iterations 0 to 500]

Initial values

The previous plot indicates that the three chains converge to the posterior after around T = 300 iterations; certainly, however, the number of iterations required to reach convergence depends on the initial values. It is typically recommended (e.g., Gelman and Rubin, 1992) to use overdispersed initial values, meaning values more variable than the target distribution (i.e., the posterior). In bugs/jags, initial values may be specified in one of two ways, as sketched below: as a list of M lists, each specifying the initial values for each parameter for that chain, or as a function generating a list of (random) initial values.
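A minimal sketch of both forms, with illustrative (assumed) parameter names matching the regression model above:

    # Way 1: a list of M lists, one per chain; here, deliberately overdispersed
    inits_list <- list(list(beta = rep(-10, 6), tau = 0.01),
                       list(beta = rep(0, 6),   tau = 1),
                       list(beta = rep(10, 6),  tau = 100))

    # Way 2: a function returning one (random) set of initial values,
    # called once per chain
    inits_fun <- function() list(beta = rnorm(6, 0, 10), tau = rexp(1))

    # Either one is passed via the inits argument:
    fit <- jags(data = list(y = y, X = X, n = nrow(X)),
                inits = inits_fun,           # or inits = inits_list
                parameters.to.save = c("beta", "tau"),
                model.file = "swiss.jags", n.chains = 3)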

Quantifying convergence

Although looking at trace plots is certainly useful, it is also desirable to obtain an objective, quantifiable measure of convergence. Numerous methods exist, although we will focus on the measure originally proposed in Gelman and Rubin (1992), which is the method used in R2OpenBUGS and R2jags. The basic idea is to quantify the between-chain and the within-chain variability of a quantity of interest: if the chains have converged, these measures will be similar; otherwise, the between-chain variability will be larger.

R̂

The basic idea of the estimator is as follows (the actual estimator makes a number of modifications to account for degrees of freedom). Let B denote the standard deviation of the pooled sample of all M·T iterations (the between-chain variability), and let W denote the average of the within-chain standard deviations. Convergence is quantified with R̂ = B/W. If R̂ ≫ 1, this is clear evidence that the chains have not converged. As T → ∞, R̂ → 1; R̂ < 1.05 is widely accepted as implying convergence for practical purposes.
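A minimal sketch of this simplified computation (the actual gelman.diag estimator, again, includes degrees-of-freedom corrections):

    # draws: a T x M matrix (iterations by chains) for one quantity
    rhat <- function(draws) {
      B <- sd(as.vector(draws))         # sd of the pooled sample
      W <- mean(apply(draws, 2, sd))    # average within-chain sd
      B / W
    }
    rhat(fit$BUGSoutput$sims.array[, , "beta[6]"])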

Obtaining R̂ in R

R̂ is displayed for bugs and jags objects by both the print and plot methods, which we will discuss later. More details (along with other convergence measures) are given in the coda package, whose gelman.diag function provides, in addition to R̂ itself, an upper confidence interval for R̂ and a multivariate extension of R̂ for quantifying convergence of the entire posterior.
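A typical call, assuming an R2jags fit that can be coerced with as.mcmc:

    library(coda)
    fit_mcmc <- as.mcmc(fit)   # coerce to an mcmc.list (one element per chain)
    gelman.diag(fit_mcmc)      # R-hat and upper CI per parameter, plus multivariate R-hat
    gelman.plot(fit_mcmc)      # R-hat as a function of iteration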

gelman.diag output

Multivariate R̂: 1.04

         R̂     Upper CI
  β1    1.18    1.25
  β2    1.02    1.06
  β3    1.11    1.27
  β4    1.06    1.07
  β5    1.04    1.05
  β6    1.03    1.10
  τ     1.00    1.01

Updating a chain

It is worth pointing out that if a chain has not converged, one does not have to start over again from the initial values, but may simply resume the chain from where it left off. In BUGS, unfortunately, this feature is only available through the OpenBUGS GUI (i.e., it cannot be accessed from within R, at least as far as I know). In JAGS, however, updating a chain in R is straightforward: fit <- update(fit, 1000) replaces fit with 1,000 new draws (per chain), treating the values in the original fit as burn-in.

Multiple chains: Pros and cons

The obvious downside to running multiple chains is that it is inefficient: we intentionally force our sampler to spend extra time in a non-converged state, which in turn requires much more burn-in. The obvious upside, however, is that it provides us with some measure of confidence that we are actually drawing samples from the posterior. It should be stressed, however, that (without additional assumptions about the posterior) no method can truly prove convergence; diagnostics can only detect failure to converge.

How many iterations?

We may use R̂, then, as a guide to how long we must run our chains until convergence. The obvious next question is: how long must we run our chains to obtain reasonably accurate estimates of the posterior? This issue is complicated by the fact that we cannot obtain independent draws from the posterior distribution and must settle instead for correlated samples.

Accuracy of the Monte Carlo approach

If we could obtain iid draws from the posterior, estimating the Monte Carlo standard error (at least, of the posterior mean) would be straightforward: letting σ_p denote the posterior standard deviation, the MCSE is σ_p/√T. A reasonable rule of thumb, then, would be to use T = 400; this is the point at which the Monte Carlo error drops below 5% of the overall uncertainty about the mean (since 1/√400 = 0.05). Another rule of thumb, studied in Raftery and Lewis (1992), is that T = 4,000 iterations are required for reasonable accuracy concerning the 2.5th (and 97.5th) percentiles.

Efficiency

In the presence of autocorrelation, however, we may obtain T = 4,000 samples from the posterior, but those samples contain less (possibly much less) information about the 2.5th and 97.5th percentiles than 4,000 independent draws would. The lower the autocorrelation, the greater the amount of information contained in a given number of draws from the posterior; this is referred to as the efficiency, or mixing, of the chain.

Autocorrelation

The autocorrelation between two states s and t of a Markov chain is defined, simply, as the correlation between X^(s) and X^(t). If the chain is stationary, in the sense that its mean and variance are not changing with time, then the correlation between X^(t) and X^(t+k) does not depend on t; this is known as the lag-k autocorrelation. To calculate and plot the autocorrelation function, one may use the acfplot function in coda, or apply the acf function to elements of sims.array directly.
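For example, again assuming the fit above:

    library(coda)
    acfplot(as.mcmc(fit), lag.max = 35)   # lattice of ACFs, one panel per parameter

    # or for a single parameter and chain via the raw array:
    acf(fit$BUGSoutput$sims.array[, 1, "beta[4]"], lag.max = 35)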

Autocorrelation

[Figure: sample autocorrelation functions out to lag 35 for three monitored parameters (two regression coefficients and τ); ACF on the vertical axis, lag on the horizontal]

Thinning

A somewhat crude, yet reasonably effective, method of dealing with autocorrelation is to keep only every kth draw from the posterior and discard the rest; this is known as thinning the chain. Both R2OpenBUGS and R2jags allow you to control thinning through the n.thin option, although it is important to note a difference between the two with regard to their defaults: R2OpenBUGS sets n.thin=1 by default (i.e., no thinning), whereas R2jags sets n.thin by default so that the chain is thinned down to 1,000 retained samples per chain.
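For instance, to keep every 5th draw in R2jags (the value 5 is purely illustrative):

    fit_thin <- jags(data = list(y = y, X = X, n = nrow(X)),
                     parameters.to.save = c("beta", "tau"),
                     model.file = "swiss.jags",
                     n.chains = 3, n.iter = 10000, n.burnin = 5000,
                     n.thin = 5)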

Thinning: Pros and cons

The advantages of thinning are (a) simplicity and (b) a reduction in memory usage: saving and working with large chains can be burdensome, especially when p and T are large. The disadvantage is that we are clearly throwing away information; thinning can never be as efficient as using all the iterations. If we decide not to thin, we must estimate the MCSE for dependent samples; this is not trivial.

Markov chain CLTs

Although the proof of such theorems is beyond the scope of this course, there exist central limit theorems for dependent samples that can be applied to Markov chains. In particular, suppose we are interested in the posterior mean of a quantity ω; it is still true that our MCMC estimate ω̄ tends in distribution to N(E(ω|y), ρ/T) for some positive constant ρ. Note, however, that ρ is not the posterior variance, as it would be if we had iid samples; in the presence of autocorrelation, ρ > σ_p².

Batch means

How can we estimate the MCSE, √(ρ/T)? One simple method is to use non-overlapping batch means. Suppose we divide our sample into Q batches, each with a iterations, and let ω̄_q denote the mean in batch q. If a is sufficiently large that the batch means are approximately uncorrelated, Var(ω̄_q) ≈ ρ/a. Thus, a reasonable estimator is ρ̂ = a·Var̂(ω̄), where Var̂(ω̄) is the sample variance of the {ω̄_q}.
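A minimal sketch of the batch-means MCSE for one chain; Q = 50 is an arbitrary illustrative choice, and dedicated implementations (e.g., in the mcmcse package) handle the details more carefully:

    mcse_batch <- function(x, Q = 50) {
      a <- floor(length(x) / Q)        # iterations per batch
      m <- sapply(1:Q, function(q) mean(x[((q - 1) * a + 1):(q * a)]))
      rho_hat <- a * var(m)            # rho-hat = a * sample variance of batch means
      sqrt(rho_hat / (Q * a))          # MCSE = sqrt(rho-hat / T)
    }
    mcse_batch(fit$BUGSoutput$sims.array[, 1, "beta[6]"])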

Effective sample size

We may define, then, an effective sample size of the Markov chain as T* = T·σ̂_p²/ρ̂. One may then apply the iid rules of thumb analogously, using T* in place of T: 400 (effective) iterations are enough for a reasonable estimate of the posterior mean, and 4,000 are required for a reasonable 95% posterior interval.

Time series methods

More sophisticated methods for estimating the MCSE have been proposed based on applying ideas from the time series literature to MCMC chains. This is the approach used by coda, which fits an autoregressive (AR) model to the chain and then estimates its spectral density; the results are available via the functions summary (which provides the MCSE estimate) and effectiveSize (which estimates T*). BUGS itself (if you run it through its stand-alone GUI) estimates the MCSE and T* using batch means. R2OpenBUGS and R2jags use yet another, even cruder method based on within- and between-chain variances.
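Typical usage, assuming the fit above:

    library(coda)
    fit_mcmc <- as.mcmc(fit)
    summary(fit_mcmc)         # the "Time-series SE" column is the spectral-density MCSE
    effectiveSize(fit_mcmc)   # estimated effective sample size T* per quantity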

Effective sample size measures

For the Swiss fertility data, with 10,000 iterations, a burn-in of 5,000, and no thinning, we have the following effective sample size estimates:

        Time series   Batch means   Between/Within
  β1           91           214               35
  β2          475           416               63
  β3          440           457              190
  β4         1112           788              980
  β5         1059          1030            15000
  β6          167           246               60
  τ          7419          4905              810

Summary

In summary, then, and at the risk of oversimplifying things, any responsible Bayesian analysis involving MCMC should check to ensure that, for all quantities of interest, R̂ < 1.05 and T* > 4,000. Obviously, neither of these diagnostics provides proof that your MCMC approach has adequately estimated the posterior, and further plots and diagnostics are useful, but in practice these two checks do tend to detect a lot of errors. Fortunately, these quantities are easily accessible in bugs/jags with the print and plot methods, and one can automate the checks as sketched below.
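One way to automate the two checks for an R2jags fit (column names as they appear in the BUGSoutput summary; the thresholds follow the rules of thumb above):

    s <- fit$BUGSoutput$summary
    flagged <- s[s[, "Rhat"] >= 1.05 | s[, "n.eff"] <= 4000,
                 c("Rhat", "n.eff"), drop = FALSE]
    flagged   # any quantities needing more iterations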

print(fit)

> fit
Current: 3 chains, each with 10000 iterations (first 5000 discarded)
Cumulative: n.sims = 15000 iterations saved
          mean    sd  2.5%   25%   50%   75% 97.5% Rhat n.eff
beta[1]   62.5  10.3  42.1  55.4  62.8  69.6  82.3  1.1    35
beta[2]   -0.2   0.1  -0.3  -0.2  -0.2  -0.1   0.0  1.0    63
beta[3]   -0.2   0.3  -0.7  -0.4  -0.2  -0.1   0.3  1.0   190
beta[4]   -0.9   0.2  -1.2  -1.0  -0.9  -0.7  -0.5  1.0   980
beta[5]    0.1   0.0   0.0   0.1   0.1   0.1   0.2  1.0 15000
beta[6]    1.2   0.4   0.5   1.0   1.2   1.5   2.0  1.0    60
tau        0.0   0.0   0.0   0.0   0.0   0.0   0.0  1.0   810

(Obviously, we need to increase T here.)

plot(fit)

[Figure: the plot method displays medians and 80% intervals for each parameter (one interval per chain), alongside R̂ for each quantity on a scale from 1 to 2+]