CSC 412/2506: Probabilistic Learning and Reasoning


Week 5-2/2: Sampling II
Murat A. Erdogdu
University of Toronto

Overview
Gibbs sampling
Hamiltonian Monte Carlo
MCMC diagnostics

Gibbs Sampling
Suppose the parameter vector x has been divided into d components, x = (x_1, ..., x_d)^T. At each iteration, the Gibbs sampler cycles through the components of x, drawing each one conditional on the values of all the others. This means we perform d steps at each sampling iteration t to obtain x^{(t+1)}. There is no accept/reject step: every draw is accepted.

Gibbs Sampling Procedure
At iteration t, choose an ordering of the d sub-vectors of x. For j = 1 to j = d, sample x_j^{(t)} from its conditional distribution given all the other components:

x_j^{(t)} ~ p(x_j | x_{-j}^{(t-1)})

where x_{-j}^{(t-1)} represents all the components of x except x_j, at their current values:

x_{-j}^{(t-1)} = (x_1^{(t)}, ..., x_{j-1}^{(t)}, x_{j+1}^{(t-1)}, ..., x_d^{(t-1)})

Gibbs Sampling Example
Consider a single observation (y_1, y_2) from a bivariate normal distribution with unknown mean µ = (µ_1, µ_2) and known covariance matrix

Σ = [ 1  ρ ]
    [ ρ  1 ]

With a uniform prior distribution on µ, the posterior takes the form

(µ_1, µ_2) | y ~ N((y_1, y_2), Σ)

Although it is simple to draw from this posterior directly, we can alternatively use the Gibbs sampler. To do that, we must first determine the conditional posterior distributions for µ_1 and µ_2.

Gibbs Sampling Example
Using the properties of the multivariate normal distribution, we have:

µ_1 | µ_2, y ~ N(y_1 + ρ(µ_2 − y_2), 1 − ρ²)
µ_2 | µ_1, y ~ N(y_2 + ρ(µ_1 − y_1), 1 − ρ²)

Then, given some previous (possibly initial) value µ^{(t−1)}, the sampling would be:

µ_1^{(t)} ~ N(y_1 + ρ(µ_2^{(t−1)} − y_2), 1 − ρ²)
µ_2^{(t)} ~ N(y_2 + ρ(µ_1^{(t)} − y_1), 1 − ρ²)
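As a concrete illustration, the two conditional updates above can be sketched in a few lines of NumPy (the function name, starting point, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def gibbs_bivariate_normal(y1, y2, rho, n_iter=20000, seed=0):
    """Alternate the two conditional draws derived above:
    µ1 | µ2, y ~ N(y1 + rho*(µ2 - y2), 1 - rho^2), and symmetrically for µ2."""
    rng = np.random.default_rng(seed)
    mu1 = mu2 = 0.0                      # arbitrary starting point
    sd = np.sqrt(1.0 - rho**2)           # conditional standard deviation
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        mu1 = rng.normal(y1 + rho * (mu2 - y2), sd)  # update µ1 given current µ2
        mu2 = rng.normal(y2 + rho * (mu1 - y1), sd)  # update µ2 given the new µ1
        draws[t] = (mu1, mu2)
    return draws

draws = gibbs_bivariate_normal(y1=0.0, y2=0.0, rho=0.8)
post = draws[5000:]                      # discard warm-up draws
print(post.mean(axis=0))                 # close to the posterior mean (y1, y2)
print(np.corrcoef(post.T)[0, 1])         # close to the posterior correlation rho
```

Note that µ_2 is updated using the freshly drawn µ_1, not the old one: each conditional draw always uses the current values of all other components.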

Gibbs Sampling Example
(Figure: from Bayesian Data Analysis, third edition, by Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin.)

Hamiltonian Monte Carlo
This is essentially a Metropolis-Hastings algorithm with a specialized proposal mechanism. The algorithm uses a physical analogy to make proposals. Given the position x, the potential energy is E(x). Construct a distribution p(x) ∝ e^{−E(x)}, with E(x) = −log p̃(x), where p̃(x) is the unnormalized density we can evaluate.

Hamiltonian Monte Carlo
Construct a distribution p(x) ∝ e^{−E(x)}, with E(x) = −log p̃(x), where p̃(x) is the unnormalized density we can evaluate. Introduce a momentum v carrying the kinetic energy K(v) = ‖v‖²/2. The total energy, or Hamiltonian, is H = E(x) + K(v). Energy is preserved, like a frictionless ball rolling: if (x, v) → (x′, v′), then H(x, v) = H(x′, v′). Ideal Hamiltonian dynamics are reversible: reverse v and the ball will return to its start point!

Hamiltonian Monte Carlo
The joint distribution:

p(x, v) ∝ e^{−E(x)} e^{−K(v)} = e^{−E(x) − K(v)} = e^{−H(x, v)}

The momentum is Gaussian, and independent of the position.

MCMC procedure: sample the momentum, simulate Hamiltonian dynamics, then flip the sign of the velocity. Hamiltonian dynamics is reversible, and energy is constant: p(x, v) = p(x′, v′).

How do we simulate Hamiltonian dynamics?

dx/dt = ∂H/∂v,  dv/dt = −∂H/∂x

Leap-frog integrator
A numerical approximation to the dynamics, so H is no longer exactly conserved. The dynamics are still deterministic (and reversible). Acceptance probability: min{1, exp(H(x, v) − H(x′, v′))}.

HMC algorithm
The HMC algorithm (run until it mixes):
Current position: x. Sample momentum v ~ N(0, I). Run the leapfrog integrator for L steps to reach (x′, v′). Accept the new position x′ with probability min{1, exp(H(x, v) − H(x′, v′))}. Low-energy points are favored.
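A minimal NumPy sketch of the algorithm above, assuming we can evaluate a differentiable energy E and its gradient; the step size, number of leapfrog steps, and the 2-D Gaussian test target are illustrative choices:

```python
import numpy as np

def hmc_sample(E, grad_E, x0, n_iter=2000, eps=0.1, L=20, seed=0):
    """HMC for p(x) ∝ exp(-E(x)): sample a momentum, run L leapfrog steps,
    flip the momentum, and accept/reject on the change in H = E(x) + |v|^2/2."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    draws, accepted = [], 0
    for _ in range(n_iter):
        v = rng.standard_normal(x.shape)            # momentum v ~ N(0, I)
        H_old = E(x) + 0.5 * v @ v
        x_new, v_new = x.copy(), v.copy()
        v_new -= 0.5 * eps * grad_E(x_new)          # leapfrog: half momentum step
        for step in range(L):
            x_new += eps * v_new                    # full position step
            g = grad_E(x_new)
            v_new -= eps * g if step < L - 1 else 0.5 * eps * g
        v_new = -v_new                              # flip sign (reversibility)
        H_new = E(x_new) + 0.5 * v_new @ v_new
        if rng.uniform() < np.exp(min(0.0, H_old - H_new)):
            x, accepted = x_new, accepted + 1       # low-energy points favored
        draws.append(x.copy())
    return np.array(draws), accepted / n_iter

# Standard 2-D Gaussian target: E(x) = |x|^2 / 2, so grad E(x) = x.
draws, acc_rate = hmc_sample(E=lambda x: 0.5 * x @ x,
                             grad_E=lambda x: x,
                             x0=np.zeros(2))
```

Because the leapfrog integrator nearly conserves H, the energy difference in the acceptance probability is small and most proposals are accepted, even though each proposal moves far from the current point.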

MCMC Inference
Sample from the unnormalized posterior, then estimate statistics (mean, median, quantiles) from the simulated values of x. The posterior predictive density of unobserved outcomes can be obtained by further simulation, conditional on the drawn values of x. All of this, however, requires some care, as MCMC is not without problems.

MCMC diagnostics
How do we know we have run the algorithm long enough? What if we started very far from where the mass of our distribution is? Since there is correlation between successive items of the chain (autocorrelation), what is the effective number of samples?

Good Ideas for MCMC
Parallel computation is cheap, so we can run multiple chains in parallel, starting at different points. We should discard some initial number of samples (the warm-up, or burn-in). We should examine how well the chains have mixed. (No need to memorize any of the formulas below.)

R hat
Start with m/2 chains of 2n samples each (the length of each chain), where the first n samples of each chain are the warm-up. Split each chain in half, giving m chains of length n in total (half of which are the warm-up halves). Label each scalar estimand x_{i,j}, with i = 1, ..., n and j = 1, ..., m.

The between-sequence variance B is:

B = (n / (m − 1)) Σ_{j=1}^{m} (x̄.j − x̄..)²

where:

x̄.j = (1/n) Σ_{i=1}^{n} x_{ij}

and:

x̄.. = (1/m) Σ_{j=1}^{m} x̄.j

R hat
The within-sequence variance W is:

W = (1/m) Σ_{j=1}^{m} s_j²

where:

s_j² = (1/(n − 1)) Σ_{i=1}^{n} (x_{ij} − x̄.j)²

For any finite n, W will underestimate the true variance, since the chains have not had time to explore the entire possible range of values.

R hat
We can estimate the marginal posterior variance of x by a weighted average of W and B:

var̂⁺(x) = ((n − 1)/n) W + (1/n) B

This quantity overestimates the marginal posterior variance assuming the starting distribution is overdispersed, but it is unbiased under stationarity or in the limit n → ∞.

We estimate the factor by which the scale of the current distribution for x might be reduced if we were to continue sampling to infinity by:

R̂ = sqrt( var̂⁺(x) / W )

If the chains have not mixed well, R-hat is larger than 1.
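The quantities B, W, var̂⁺(x), and R̂ above can be computed in a few lines; this sketch assumes the chains are already split and warm-up-free, stored as an (m, n) array (the synthetic "mixed" and "stuck" chains are illustrative):

```python
import numpy as np

def split_rhat(chains):
    """chains: shape (m, n) -- m chains of n draws each, with warm-up
    already discarded and chains already split as described above."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)                    # the x-bar_.j
    B = n / (m - 1) * ((chain_means - chain_means.mean()) ** 2).sum()
    W = chains.var(axis=1, ddof=1).mean()                # average of the s_j^2
    var_plus = (n - 1) / n * W + B / n                   # weighted average
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
mixed = rng.standard_normal((4, 1000))                   # four well-mixed chains
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])   # two chains offset
print(split_rhat(mixed))                                 # near 1
print(split_rhat(stuck))                                 # well above 1
```

When chains disagree about the mean, B inflates var̂⁺ relative to W, which is exactly what pushes R̂ above 1.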

Effective Sample Size
Since our observations are not independent of each other, we de facto gain less information per draw. One way to quantify the effective sample size is to consider the statistical efficiency of x̄.. as an estimate of E[x]:

lim_{n→∞} mn · var(x̄..) = (1 + 2 Σ_{t=1}^{∞} ρ_t) var(x)

where ρ_t is the autocorrelation of the sequence x at lag t. If the draws were completely independent, we would have var(x̄..) = (1/(mn)) var(x), and the effective sample size would be mn.

Autocorrelations
We define the effective sample size to be:

n_eff = mn / (1 + 2 Σ_{t=1}^{∞} ρ_t)

The ρ_t are unknown, so we estimate them by:

ρ̂_t = 1 − V_t / (2 var̂⁺)

where:

V_t = (1/(m(n − t))) Σ_{j=1}^{m} Σ_{i=t+1}^{n} (x_{i,j} − x_{i−t,j})²
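A sketch of this estimator, computing var̂⁺ as on the R-hat slides and truncating the infinite sum once the estimated autocorrelations become negligible (the 0.05 cutoff is a simplified truncation rule of our own, not the one from the course; the AR(1) test chains are illustrative):

```python
import numpy as np

def effective_sample_size(chains):
    """n_eff = mn / (1 + 2 * sum_t rho_hat_t), with rho_hat_t = 1 - V_t / (2 var_plus),
    where V_t is the lag-t variogram of the (m, n) array of chains."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()
    B = n * chains.mean(axis=1).var(ddof=1)
    var_plus = (n - 1) / n * W + B / n
    rho_sum = 0.0
    for t in range(1, n):
        V_t = ((chains[:, t:] - chains[:, :-t]) ** 2).sum() / (m * (n - t))
        rho_t = 1.0 - V_t / (2.0 * var_plus)
        if rho_t < 0.05:                 # truncate the noisy tail (simplified rule)
            break
        rho_sum += rho_t
    return m * n / (1.0 + 2.0 * rho_sum)

# Four strongly autocorrelated AR(1) chains: n_eff comes out far below mn.
rng = np.random.default_rng(1)
m, n, phi = 4, 2000, 0.9
chains = np.zeros((m, n))
for i in range(1, n):
    chains[:, i] = phi * chains[:, i - 1] + np.sqrt(1 - phi**2) * rng.standard_normal(m)
print(effective_sample_size(chains))     # far below mn = 8000
```

For an AR(1) chain with coefficient φ, the true factor 1 + 2 Σ ρ_t equals (1 + φ)/(1 − φ), so with φ = 0.9 the 8000 raw draws are worth only a few hundred independent ones.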

Diagnostics Summary
Once R̂ is near 1 and n̂_eff is more than 10 per chain for all scalar estimands, we collect the mn simulations (excluding the burn-in) and can then draw inference based on our samples. However: even if the iterative simulations appear to have converged and passed all the tests, they may still be far from convergence! When we declare convergence, we mean that all chains appear stationary and well mixed. None of the checks we learned today is a hypothesis test: there are no p-values and no statistical significance.