Multi-parameter MCMC notes by Mark Holder


Review

In the last lecture we justified the Metropolis-Hastings algorithm as a means of constructing a Markov chain whose stationary distribution is identical to the posterior probability distribution. We found that if you propose a new state from a proposal distribution with probability of proposal denoted $q(j,k)$, then you can use the following rule to calculate an acceptance probability:

$$\alpha(j,k) = \min\left[1,\ \left(\frac{P(D\mid\theta_k)}{P(D\mid\theta_j)}\right)\left(\frac{P(\theta_k)}{P(\theta_j)}\right)\left(\frac{q(k,j)}{q(j,k)}\right)\right]$$

To get the probability of moving, we have to multiply the proposal probability by the acceptance probability. Writing $x^*_{i+1}$ for the proposed state:

$$q(j,k) = P(x^*_{i+1} = k \mid x_i = j) \qquad \alpha(j,k) = P(x_{i+1} = k \mid x_i = j,\ x^*_{i+1} = k) \qquad m_{j,k} = P(x_{i+1} = k \mid x_i = j) = q(j,k)\,\alpha(j,k)$$

If $\alpha(j,k) < 1$ then $\alpha(k,j) = 1$. In this case:

$$\frac{\alpha(j,k)}{\alpha(k,j)} = \left(\frac{P(D\mid\theta_k)\,P(\theta_k)\,q(k,j)}{P(D\mid\theta_j)\,P(\theta_j)\,q(j,k)}\right)\Big/\,1 = \left(\frac{P(D\mid\theta_k)}{P(D\mid\theta_j)}\right)\left(\frac{P(\theta_k)}{P(\theta_j)}\right)\left(\frac{q(k,j)}{q(j,k)}\right)$$

Thus, the ratio of the two transition probabilities for the Markov chain is:

$$\frac{m_{j,k}}{m_{k,j}} = \frac{q(j,k)\,\alpha(j,k)}{q(k,j)\,\alpha(k,j)} = \left(\frac{q(j,k)}{q(k,j)}\right)\left(\frac{P(D\mid\theta_k)}{P(D\mid\theta_j)}\right)\left(\frac{P(\theta_k)}{P(\theta_j)}\right)\left(\frac{q(k,j)}{q(j,k)}\right) = \left(\frac{P(D\mid\theta_k)}{P(D\mid\theta_j)}\right)\left(\frac{P(\theta_k)}{P(\theta_j)}\right)$$

If we recall that, under detailed balance, we have

$$\pi_j\, m_{j,k} = \pi_k\, m_{k,j},$$

we see that we have constructed a chain in which the stationary distribution is proportional to the posterior probability distribution.
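The following is a minimal sketch, in Python, of a single Metropolis-Hastings step built around this acceptance rule (it is not part of the original notes). The `log_likelihood`, `log_prior`, and `propose` functions are assumed to be supplied by the user for a particular model, and everything is done on the log scale for numerical stability.

```python
import math
import random

def mh_step(theta, log_likelihood, log_prior, propose):
    """One Metropolis-Hastings update.

    propose(theta) must return (theta_new, log_q_forward, log_q_reverse),
    i.e. the proposed state and the log proposal densities q(j,k) and q(k,j).
    """
    theta_new, log_q_fwd, log_q_rev = propose(theta)

    # log of the acceptance ratio: likelihood ratio * prior ratio * Hastings ratio
    log_ratio = (log_likelihood(theta_new) - log_likelihood(theta)
                 + log_prior(theta_new) - log_prior(theta)
                 + log_q_rev - log_q_fwd)

    # alpha = min(1, ratio); accept the proposal with probability alpha
    if log_ratio >= 0.0 or random.random() < math.exp(log_ratio):
        return theta_new   # move to the proposed state
    return theta           # stay at the current state
```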

Convergence

We can (sometimes) diagnose failure to converge by comparing the results of separate MCMC simulations. If all seems to be working, then we would like to treat our sampled points from the MCMC simulation as if they were draws from the posterior probability distribution over parameters. Unfortunately, the samples from our MCMC approximation to the posterior will display autocorrelation. We can calculate an effective sample size by dividing the number of sampled points by the autocorrelation time (a sketch of this calculation is given below). The CODA package in R provides several useful tools for diagnosing problems with MCMC convergence.

Multi-parameter inference

In the simple example discussed in the last lecture, θ could only take one of 5 values. In general, our models have multiple, continuous parameters. We can adapt the acceptance rules to continuous parameters by using a Hastings ratio that is a ratio of proposal densities (and we would also have a ratio of prior probability densities). Adapting MCMC to multi-parameter problems is also simple. Because of co-linearity between parameters, it may be most effective to design proposals that change multiple parameters in one step, but this is not necessary. If we update each parameter in turn, we can still construct a valid chain. In doing this, we are effectively sampling the m-th parameter from $P(\theta_m \mid \text{Data}, \theta_{-m})$, where $\theta_{-m}$ denotes the vector of parameters without parameter $m$ included. Having a marginal distribution is not enough to reconstruct a joint distribution, but here we have the distribution of $\theta_m$ for every possible value of the other parameters. So we are effectively sampling from the joint distribution; we are just updating one parameter at a time.
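Below is a minimal sketch (not from the notes) of the effective-sample-size calculation mentioned in the Convergence section: estimate the autocorrelation time from the sample autocorrelations and divide the chain length by it. The truncation rule (stop summing when an autocorrelation drops to zero or below) is one common heuristic, assumed here for illustration; packages such as CODA use more careful estimators.

```python
import numpy as np

def effective_sample_size(chain):
    """ESS = N / tau, where tau = 1 + 2 * (sum of positive-lag autocorrelations)."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    # autocovariance at every lag (direct products; fine for modest chain lengths)
    acov = np.array([np.dot(x[:n - lag], x[lag:]) / n for lag in range(n)])
    acorr = acov / acov[0]
    tau = 1.0
    for lag in range(1, n):
        if acorr[lag] <= 0.0:      # simple truncation heuristic
            break
        tau += 2.0 * acorr[lag]
    return n / tau
```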

GLM in Bayesian world

Consider a data set in which many full sibs are reared under two diets (normal = 0, and unlimited = 1). We measure snout-vent length (SVL) for a bunch of geckos. Our model is:

- There is an unknown mean SVL under the normal diet, $\alpha_0$.
- $\alpha_1$ is the mean SVL of an infinitely-large independent sample under the unlimited diet.
- Each family, $j$, will have a mean effect, $B_j$. This effect gets added to the mean based on the diet, regardless of diet.
- Each family, $j$, will have a mean response to the unlimited diet, $C_{j1}$. This effect is only added to individuals on the unlimited diet. For notational convenience, we can simply define $C_{j0} = 0$ for all families.

The SVL for each individual is expected to be normally distributed around its expected value; the difference between a response and the expected value is $\epsilon_{ijk}$. To complete the likelihood model, we have to say something about the probability distributions that govern the random effects:

$$B_j \sim \mathcal{N}(0, \sigma_B^2) \qquad C_{j1} \sim \mathcal{N}(0, \sigma_C^2) \qquad \epsilon_{ijk} \sim \mathcal{N}(0, \sigma_e^2)$$

In our previous approach, we could do a hypothesis test such as $H_0: \alpha_0 = \alpha_1$, or we could generate point estimates. That is OK, but what if we want to answer questions such as "What is the probability that $\alpha_1 - \alpha_0 > 0.5$ mm?" Could we: reparameterize to $\delta_1 = \alpha_1 - \alpha_0$, construct an $x\%$ confidence interval, and search for the largest value of $x$ such that 0 is not included in the confidence interval? Then do something like $P(\alpha_1 - \alpha_0 > 0.5) = (1 - x)/2$, or something like that? No! That is not a correct interpretation of a $P$-value or a confidence interval.

If we conduct Bayesian inference on this model, we can estimate the joint posterior probability distribution over all parameters. From this distribution we can calculate $P(\alpha_1 - \alpha_0 > 0.5 \mid X)$ by integrating:

$$P(\alpha_1 - \alpha_0 > 0.5 \mid X) = \int_{-\infty}^{\infty}\int_{\alpha_0 + 0.5}^{\infty} p(\alpha_0, \alpha_1 \mid X)\, d\alpha_1\, d\alpha_0$$

From MCMC output we can count the fraction of the sampled points for which $\alpha_1 - \alpha_0 > 0.5$, and use this as an estimate of the probability. Note that it is hard to estimate very small probabilities accurately using a simulation-based approach, but you can often get a very reasonable estimate of the probability of some complicated, joint statement about the parameters by simply counting the fraction of MCMC samples that satisfy the statement.

Definitions of probability

In the frequency (or frequentist) definition, the probability of an event is the fraction of times the event would occur if you were able to repeat the trial an infinitely large number of times. Probability is defined in terms of long-run behavior and repeated experiments. Bayesians accept that this is correct (if we say that the probability of heads is 0.6 for a coin, then a Bayesian would expect the fraction of heads in a very large experiment to approach 60%), but Bayesians also use probability to quantify uncertainty even in circumstances in which it does not seem reasonable to think of repeated trials.

The classic example is the prior probability assigned to a parameter value. A parameter in a model can be a fixed, but unknown, constant in nature. But even if there can only be one true value, Bayesians will use probability statements to represent degree of belief: low probabilities are assigned to values that seem unlikely. Probability statements can be based on vague beliefs (as is often the case with priors), or they can be informed by previous data.

Fixed vs random effects in the GLM model

In order to perform Bayesian inference on the GLM model sketched out above, we would need to specify a prior distribution on $\alpha_0$ and $\alpha_1$. If we knew that the geckos tended to be around 5 cm long (SVL), then we might use priors like this:

$$\alpha_0 \sim \operatorname{Gamma}(\alpha = 10,\ \beta = 2) \qquad \delta_1 \sim \operatorname{Normal}(\mu = 1,\ \sigma^2 = 10)$$

where $\operatorname{Gamma}(\alpha, \beta)$ is a Gamma distribution with mean $\alpha/\beta$ and variance $\alpha/\beta^2$.

Now that we have a prior for all of the parameters, we may notice that the distinction between random effects and a vector of parameters becomes blurry. Often we can implement a sampler more easily in a Bayesian context if we simply model the outcomes of random processes that we do not observe. A variable that is the unobserved result of the action of our model is referred to as a latent variable. When we conduct inference without imputing values for the latent variables, we are effectively integrating them out of the calculations. If we choose, we may prefer to use MCMC to integrate over some of the latent variables. When we do this, the latent variables look like parameters in the model: we calculate a likelihood ratio for them and a prior ratio for them whenever they need to be updated.[1]

In the ML approach to the GLM, we did not try to estimate the random effects. We simply recognized that the presence of a family effect, for example, would cause a non-zero covariance. In essence we were integrating over possible values of each family's effect, and just picking up on the degree to which within-family comparisons were non-independent because they shared some unknown parameters. In MCMC, it would also be easy to simply treat the possible values of each $B_j$ and $C_{j1}$ element as latent variables. The priors would be the Normal distributions that we outlined above, and

$$y_{ijk} = \alpha_0 + k\,\delta_1 + B_j + C_{jk} + \epsilon_{ijk} \qquad\text{or}\qquad y_{ijk} \sim \mathcal{N}\!\left(\alpha_0 + k\,\delta_1 + B_j + C_{jk},\ \sigma_e^2\right)$$

[1] The distinction between a parameter and a latent variable can be subtle: when you reduce the model to the minimal statement about unknown quantities, you are dealing with the parameters and their priors. You can add latent variables to a model specification, but when you do this the prior for the latent variables comes from the parameters, existing priors, and other latent variables. So you don't have to specify a new prior for a latent variable; it falls out of the model.
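To make the model concrete, here is a minimal simulation sketch (not from the notes) that generates data under this random-effects model; all numeric values (sample sizes, means, standard deviations) are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# placeholder values for the fixed effects and variance components
alpha0, delta1 = 5.0, 1.0             # mean SVL on diet 0, and the diet effect
sigma_B, sigma_C, sigma_e = 0.5, 0.3, 0.4
n_families, n_per_cell = 20, 5        # individuals per family per diet

B = rng.normal(0.0, sigma_B, n_families)    # family effects B_j
C1 = rng.normal(0.0, sigma_C, n_families)   # family responses to diet 1, C_j1

records = []
for j in range(n_families):
    for k in (0, 1):                                   # diet 0 or 1
        mean = alpha0 + k * delta1 + B[j] + (C1[j] if k == 1 else 0.0)
        y = rng.normal(mean, sigma_e, n_per_cell)      # y_ijk ~ N(mean, sigma_e^2)
        records.extend((j, k, yi) for yi in y)
```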

Given this model, $f(y_{ijk} \mid \alpha_0, \delta_1, B_j, C_{jk}) = f(\epsilon_{ijk})$, and we can calculate this density by:

$$f(y_{ijk} \mid \alpha_0, \delta_1, B_j, C_{jk}) = \frac{1}{\sqrt{2\pi\sigma_e^2}} \exp\left[-\frac{(y_{ijk} - \alpha_0 - k\,\delta_1 - B_j - C_{jk})^2}{2\sigma_e^2}\right]$$

Note that $\delta_1$ only appears in the likelihood for geckos on diet 1, so updating $\delta_1$ will not change many of the terms in the log-likelihood. We only need to calculate the log of the likelihood ratio, so we can just ignore terms in which $\delta_1$ does not occur when we want to update this parameter. This type of optimization can speed up the likelihood calculations dramatically. Specifically, we have terms like:

$$\ln f(y_{ij0} \mid \alpha_0, B_j) = -\frac{1}{2}\ln\left[2\pi\sigma_e^2\right] - \frac{(y_{ij0} - \alpha_0 - B_j)^2}{2\sigma_e^2}$$

$$\ln f(y_{ij1} \mid \alpha_0, \delta_1, B_j, C_{j1}) = -\frac{1}{2}\ln\left[2\pi\sigma_e^2\right] - \frac{(y_{ij1} - \alpha_0 - \delta_1 - B_j - C_{j1})^2}{2\sigma_e^2}$$

in the log-likelihoods. If we are updating $\alpha_0 \rightarrow \alpha_0^*$:

$$\begin{aligned}
\ln LR(y_{ij0}) &= \ln f(y_{ij0} \mid \alpha_0^*, \delta_1, B_j, C_{j1}, \sigma_e) - \ln f(y_{ij0} \mid \alpha_0, \delta_1, B_j, C_{j1}, \sigma_e)\\
&= \frac{1}{2\sigma_e^2}\left[(y_{ij0} - \alpha_0 - B_j)^2 - (y_{ij0} - \alpha_0^* - B_j)^2\right]\\
&= \frac{1}{2\sigma_e^2}\left[(y_{ij0} - B_j)^2 - 2(y_{ij0} - B_j)\alpha_0 + \alpha_0^2 - (y_{ij0} - B_j)^2 + 2(y_{ij0} - B_j)\alpha_0^* - (\alpha_0^*)^2\right]\\
&= \frac{1}{2\sigma_e^2}\left[2(y_{ij0} - B_j)(\alpha_0^* - \alpha_0) + \alpha_0^2 - (\alpha_0^*)^2\right]
\end{aligned}$$

Similar algebra leads to:

$$\ln LR(y_{ij1}) = \ln f(y_{ij1} \mid \alpha_0^*, \delta_1, B_j, C_{j1}, \sigma_e) - \ln f(y_{ij1} \mid \alpha_0, \delta_1, B_j, C_{j1}, \sigma_e) = \frac{1}{2\sigma_e^2}\left[2(y_{ij1} - \delta_1 - B_j - C_{j1})(\alpha_0^* - \alpha_0) + \alpha_0^2 - (\alpha_0^*)^2\right]$$

Each of the data points is independent, conditional on all of the latent variables, so:

$$\ln LR = \sum \ln LR(y_{ij0}) + \sum \ln LR(y_{ij1})$$

It is helpful to introduce some named variables that are simply convenient functions of the data or parameters:

$n$ = total number of individuals
$n_0$ = number of individuals in treatment 0
$n_1$ = number of individuals in treatment 1
$f$ = number of families
$B^* = \sum B_j$ (summed over all individuals)
$D_B = \sum_j B_j^2$ (summed over families)
$B_1^* = \sum B_j$ (summing only over individuals in treatment 1)
$C_1^* = \sum C_{j1}$ (summing only over individuals in treatment 1)
$y_0^* = \sum y_{ij0}$
$y_1^* = \sum y_{ij1}$
$y^* = y_0^* + y_1^*$
$R = \sum^{n_0} \left[y_{ij0} - \alpha_0 - B_j\right]^2 + \sum^{n_1} \left[y_{ij1} - \alpha_0 - \delta_1 - B_j - C_{j1}\right]^2$

Updating $\alpha_0$

Thus, the log-likelihood ratio for $\alpha_0 \rightarrow \alpha_0^*$ is:

$$\ln LR = \frac{n\left[\alpha_0^2 - (\alpha_0^*)^2\right] + 2\left[y^* - n_1\delta_1 - B^* - C_1^*\right](\alpha_0^* - \alpha_0)}{2\sigma_e^2}$$

Note that calculating this log-likelihood ratio is very fast. It is easy to keep the various starred sums up to date when the $B_j$ and $C_{j1}$ elements change, so evaluating the likelihood ratio for proposals to $\alpha_0$ does not even involve iterating through every data point.

Updating $\delta_1$

If we are just updating $\delta_1$, then the log-likelihood ratio has fewer terms, because constants drop out in the subtraction:

$$\ln LR(y_{ij1}) = \ln f(y_{ij1} \mid \alpha_0, \delta_1^*, B_j, C_{j1}, \sigma_e) - \ln f(y_{ij1} \mid \alpha_0, \delta_1, B_j, C_{j1}, \sigma_e) = \frac{1}{2\sigma_e^2}\left[2(y_{ij1} - \alpha_0 - B_j - C_{j1})(\delta_1^* - \delta_1) + \delta_1^2 - (\delta_1^*)^2\right]$$

$$\ln LR(Y) = \frac{n_1\delta_1^2 - n_1(\delta_1^*)^2 + 2\left[y_1^* - n_1\alpha_0 - B_1^* - C_1^*\right](\delta_1^* - \delta_1)}{2\sigma_e^2}$$
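As an illustration of why this update is fast, here is a minimal sketch (not from the notes) of the $\alpha_0$ log-likelihood ratio computed from the pre-tallied starred sums; the argument names are placeholders, and the sums are assumed to be kept up to date elsewhere in the sampler.

```python
def log_lr_alpha0(alpha0, alpha0_star, n, n1, delta1,
                  y_star, B_star, C1_star, sigma_e):
    """Log-likelihood ratio for proposing alpha0 -> alpha0_star.

    Uses only the pre-tallied sums (y*, B*, C1*), so there is no loop over data.
    """
    resid_sum = y_star - n1 * delta1 - B_star - C1_star
    return (n * (alpha0**2 - alpha0_star**2)
            + 2.0 * resid_sum * (alpha0_star - alpha0)) / (2.0 * sigma_e**2)
```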

Updating $\sigma_B$ or $\sigma_C$

If we want to update $\sigma_B$, and we have a set of latent variables, then we only have to consider the portion of the likelihood that comes from the latent variables (we don't have to consider the data):

$$\begin{aligned}
\ln LR &= \sum_j\left(-\frac{1}{2}\ln\left[2\pi(\sigma_B^*)^2\right] - \frac{B_j^2}{2(\sigma_B^*)^2}\right) - \sum_j\left(-\frac{1}{2}\ln\left[2\pi\sigma_B^2\right] - \frac{B_j^2}{2\sigma_B^2}\right)\\
&= f\ln\left[\frac{\sigma_B}{\sigma_B^*}\right] + \sum_j\left(\frac{B_j^2}{2\sigma_B^2} - \frac{B_j^2}{2(\sigma_B^*)^2}\right)\\
&= f\ln\left[\frac{\sigma_B}{\sigma_B^*}\right] + \frac{D_B}{2}\left[\frac{1}{\sigma_B^2} - \frac{1}{(\sigma_B^*)^2}\right]
\end{aligned}$$

The corresponding formula for updating $\sigma_C$ would use the $C_{j1}$ as the variates that depend on the variance parameter.

Updating $\sigma_e$

Updating $\sigma_e$ entails summing the effects of all of the residuals; this is the only parameter whose update requires iterating over all of the data points:

$$\ln LR = n\ln\left[\frac{\sigma_e}{\sigma_e^*}\right] + \frac{R}{2}\left[\frac{1}{\sigma_e^2} - \frac{1}{(\sigma_e^*)^2}\right]$$

Updating $B_j$

Updating a family effect is much like updating another mean effect, except (of course) it only affects the likelihood of one family (denoted family $j$), but it affects both treatments (here $n_{j0}$ and $n_{j1}$ count the family-$j$ individuals in each treatment, and $y_j^*$ sums the observations for family $j$):

$$\ln LR = \frac{(n_{j0} + n_{j1})\left[B_j^2 - (B_j^*)^2\right] + 2\left[y_j^* - (n_{j0} + n_{j1})\alpha_0 - n_{j1}(\delta_1 + C_{j1})\right](B_j^* - B_j)}{2\sigma_e^2}$$

Updating $C_{j1}$

Updating a $C_{j1}$ variable only affects the treatment-1 individuals in one family (denoted family $j$), whose observations sum to $y_{j1}^*$:

$$\ln LR = \frac{n_{j1}\left[C_{j1}^2 - (C_{j1}^*)^2\right] + 2\left[y_{j1}^* - n_{j1}(\alpha_0 + \delta_1 + B_j)\right](C_{j1}^* - C_{j1})}{2\sigma_e^2}$$
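For comparison, a minimal sketch (not from the notes) of the $\sigma_B$ update, which touches only the latent family effects through $f$ (the number of families) and $D_B$ (the sum of squared $B_j$ values):

```python
import math

def log_lr_sigma_B(sigma_B, sigma_B_star, n_families, D_B):
    """Log-likelihood ratio for proposing sigma_B -> sigma_B_star.

    Only the latent family effects matter: n_families is f in the text,
    and D_B is the sum of squared B_j values.
    """
    return (n_families * math.log(sigma_B / sigma_B_star)
            + 0.5 * D_B * (1.0 / sigma_B**2 - 1.0 / sigma_B_star**2))
```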

Latent variable mixing

The downside of introducing latent variables is that our chain has to sample over them (update them). The equation for updating $B_j$ looks a lot like the update of $\delta_1$, but we only have to perform the summation over family $j$. In essence, introducing latent variables lets us do calculations of this form:

$$P(\alpha_0, \delta_1, B, C_1, \sigma_B, \sigma_C, \sigma_e \mid Y) = \frac{P(Y \mid \alpha_0, \delta_1, B, C_1, \sigma_e)\, P(B \mid \sigma_B)\, P(C_1 \mid \sigma_C)\, P(\alpha_0, \delta_1, \sigma_B, \sigma_C, \sigma_e)}{P(Y)}$$

and use MCMC to integrate over $B$ and $C_1$ to obtain:

$$P(\alpha_0, \delta_1, \sigma_B, \sigma_C, \sigma_e \mid Y) = \int\!\!\int P(\alpha_0, \delta_1, B, C_1, \sigma_B, \sigma_C, \sigma_e \mid Y)\, dB\, dC_1$$

Marginalizing over a parameter is easy: you simply ignore it when summarizing the MCMC output.
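As a closing illustration (not from the notes), assuming the post-burn-in MCMC output is held as one array of draws per parameter, both marginalization and probability statements like $P(\alpha_1 - \alpha_0 > 0.5 \mid X)$ reduce to ignoring the other columns and counting samples. The draws below are stand-in random numbers purely so the example runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical post-burn-in MCMC output: one array of draws per parameter
# (stand-in numbers only; a real sampler would fill these in)
samples = {
    "alpha0": rng.normal(5.0, 0.1, 10_000),
    "delta1": rng.normal(0.8, 0.2, 10_000),
    "sigma_B": np.abs(rng.normal(0.5, 0.1, 10_000)),
}

# marginal posterior summary: simply ignore the other columns
print("posterior mean of sigma_B:", samples["sigma_B"].mean())

# P(delta_1 > 0.5 | Y): the fraction of sampled points satisfying the statement
print("P(delta1 > 0.5 | Y) ~", np.mean(samples["delta1"] > 0.5))
```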
