CSC 412/2506: Probabilistic Learning and Reasoning
Week 5-2/2: Sampling II
Murat A. Erdogdu, University of Toronto
Prob Learning (UofT), CSC412-Week 5-2/2
Overview

- Gibbs sampling
- Hamiltonian Monte Carlo
- MCMC diagnostics
Gibbs Sampling

- Suppose the parameter vector x has been divided into d components, x = (x_1, ..., x_d)^T.
- At each iteration, the Gibbs sampler cycles through the components of x, drawing each component conditional on the values of all the others.
- This means we perform d steps at each sampling iteration t to obtain x^{(t+1)}.
- No accept/reject step: every draw is accepted.
Gibbs Sampling Procedure

At iteration t: choose an ordering of the d components of x. For j = 1 to j = d:

- Sample x_j^t from the conditional distribution given all the other components:

  x_j^t ~ p(x_j | x_{-j}^{t-1})

- where x_{-j}^{t-1} represents all the components of x except for x_j, at their current values:

  x_{-j}^{t-1} = (x_1^t, x_2^t, ..., x_{j-1}^t, x_{j+1}^{t-1}, ..., x_d^{t-1})
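As a sketch (not part of the slides), one full Gibbs sweep can be written generically; `sample_conditional` is a hypothetical callback that draws x_j from its full conditional given the most recent values of the other components:

```python
import numpy as np

def gibbs_sweep(x, sample_conditional, rng):
    """One Gibbs iteration: resample each component x_j in turn from its
    full conditional p(x_j | x_{-j}), always conditioning on the most
    recently updated values of the other components."""
    x = x.copy()
    for j in range(len(x)):
        # sample_conditional(j, x, rng) draws x_j given the current x_{-j}
        x[j] = sample_conditional(j, x, rng)
    return x
```

Repeatedly applying `gibbs_sweep` to the output of the previous call produces the chain x^{(1)}, x^{(2)}, ....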
Gibbs Sampling Example

- Consider a single observation (y_1, y_2) from a bivariate normal with unknown mean µ = (µ_1, µ_2) and known covariance matrix

  Σ = [ 1  ρ ]
      [ ρ  1 ]

  with a uniform prior distribution on µ.
- The posterior then takes the form:

  (µ_1, µ_2) | y ~ N((y_1, y_2), Σ)

- Although it is simple to draw from this posterior directly, we can alternatively use the Gibbs sampler. To do that we must first determine the conditional posterior distributions for µ_1 and µ_2.
Gibbs Sampling Example

Using the properties of the multivariate normal distribution, we have:

  µ_1 | µ_2, y ~ N(y_1 + ρ(µ_2 − y_2), 1 − ρ²)
  µ_2 | µ_1, y ~ N(y_2 + ρ(µ_1 − y_1), 1 − ρ²)

Then, given some previous (possibly initial) value µ^{(t−1)}, the sampling is:

  µ_1^{(t)} ~ N(y_1 + ρ(µ_2^{(t−1)} − y_2), 1 − ρ²)
  µ_2^{(t)} ~ N(y_2 + ρ(µ_1^{(t)} − y_1), 1 − ρ²)
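The two conditional draws above can be sketched as follows (a minimal illustration assuming NumPy; the function name and defaults are ours, not from the slides):

```python
import numpy as np

def gibbs_bivariate_normal(y1, y2, rho, n_iter=1000, mu_init=(0.0, 0.0), seed=0):
    """Gibbs sampler for the posterior of (mu1, mu2) given one observation
    (y1, y2) from a bivariate normal with known correlation rho."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = mu_init
    sd = np.sqrt(1.0 - rho**2)  # conditional standard deviation, sqrt(1 - rho^2)
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Draw mu1 given the current mu2, then mu2 given the *new* mu1
        mu1 = rng.normal(y1 + rho * (mu2 - y2), sd)
        mu2 = rng.normal(y2 + rho * (mu1 - y1), sd)
        samples[t] = (mu1, mu2)
    return samples
```

After discarding an initial warm-up portion, the sample means should be close to (y_1, y_2), the posterior mean.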
Gibbs Sampling Example

[Figure: illustration of the Gibbs sampler on the bivariate normal posterior, from Bayesian Data Analysis, 3rd edition, by Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin]
Hamiltonian Monte Carlo

- This is essentially a Metropolis–Hastings algorithm with a specialized proposal mechanism.
- The algorithm uses a physical analogy to make proposals.
- Given the position x, the potential energy is E(x).
- Construct a distribution p(x) ∝ e^{−E(x)}, with E(x) = −log p̃(x), where p̃(x) is the unnormalized density we can evaluate.
Hamiltonian Monte Carlo

- Construct a distribution p(x) ∝ e^{−E(x)}, with E(x) = −log p̃(x), where p̃(x) is the unnormalized density we can evaluate.
- Introduce a momentum v carrying the kinetic energy K(v) = ‖v‖²/2.
- Total energy, or Hamiltonian: H(x, v) = E(x) + K(v).
- Energy is preserved, like a frictionless ball rolling: (x, v) → (x′, v′) with H(x, v) = H(x′, v′).
- Ideal Hamiltonian dynamics are reversible: reverse v and the ball will return to its starting point!
Hamiltonian Monte Carlo

The joint distribution:

  p(x, v) ∝ e^{−E(x)} e^{−K(v)} = e^{−E(x) − K(v)} = e^{−H(x, v)}

Momentum is Gaussian, and independent of the position.

MCMC procedure:
- Sample the momentum.
- Simulate Hamiltonian dynamics, then flip the sign of the velocity.
- Hamiltonian dynamics is reversible, and energy is constant: p(x, v) = p(x′, v′).

How to simulate Hamiltonian dynamics?

  dx/dt = ∂H/∂v,   dv/dt = −∂H/∂x
Leap-frog integrator

- A numerical approximation: H is not conserved exactly.
- Dynamics are still deterministic (and reversible).
- Acceptance probability: min{1, exp(H(x, v) − H(x′, v′))}
HMC algorithm

The HMC algorithm (run until it mixes):

- Current position: x
- Sample momentum: v ~ N(0, I)
- Run the leapfrog integrator for L steps to reach (x′, v′)
- Accept the new position x′ with probability:

  min{1, exp(H(x, v) − H(x′, v′))}

Low-energy points are favored.
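A minimal sketch of the leapfrog integrator and the HMC loop above, assuming NumPy; function names, the step size `eps`, and other defaults are illustrative choices, not from the slides:

```python
import numpy as np

def leapfrog(x, v, grad_E, eps, L):
    """Leapfrog integration of Hamiltonian dynamics: L position steps of
    size eps, with half momentum kicks at the start and end."""
    v = v - 0.5 * eps * grad_E(x)           # initial half step for momentum
    for _ in range(L - 1):
        x = x + eps * v                     # full step for position
        v = v - eps * grad_E(x)             # full step for momentum
    x = x + eps * v
    v = v - 0.5 * eps * grad_E(x)           # final half step for momentum
    return x, v

def hmc(E, grad_E, x0, n_samples=1000, eps=0.1, L=20, seed=0):
    """HMC sampler targeting p(x) proportional to exp(-E(x)).
    The explicit momentum flip is omitted since K(v) = K(-v)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        v = rng.standard_normal(x.shape)    # resample Gaussian momentum
        x_new, v_new = leapfrog(x, v, grad_E, eps, L)
        # Accept with probability min{1, exp(H(x, v) - H(x', v'))}
        H_old = E(x) + 0.5 * v @ v
        H_new = E(x_new) + 0.5 * v_new @ v_new
        if rng.uniform() < np.exp(H_old - H_new):
            x = x_new
        samples.append(x.copy())
    return np.array(samples)
```

On a standard Gaussian target, E(x) = ‖x‖²/2 and grad_E(x) = x, and the sampled mean and standard deviation should be close to 0 and 1.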
MCMC Inference

- Sample from the unnormalized posterior.
- Estimate statistics from the simulated values of x: mean, median, quantiles.
- The posterior predictive density of unobserved outcomes can be obtained by further simulation, conditional on the drawn values of x.
- All of this, however, requires some care, as MCMC is not without problems.
MCMC diagnostics

- How do we know we have run the algorithm long enough?
- What if we started very far from where most of our distribution's mass is?
- Since there is correlation between successive items of the chain (autocorrelation), what is the effective number of samples?
Good Ideas for MCMC

- Parallel computation is cheap: we can run multiple chains in parallel, starting at different points.
- We should discard some initial number of samples (warm-up or burn-in).
- We should examine how well the chains have mixed. (No need to memorize any of the formulas below.)
R hat

- Start with m/2 chains of 2n samples (the length of each chain), with a warm-up period of n. Splitting them in half gives m sequences of length n each (half of which are warm-up).
- Label each scalar estimand x_{i,j}, with i = 1, ..., n and j = 1, ..., m.
- The between-sequence variance B is:

  B = (n / (m − 1)) Σ_{j=1}^m (x̄_{.j} − x̄_{..})²

  where:

  x̄_{.j} = (1/n) Σ_{i=1}^n x_{ij}   and   x̄_{..} = (1/m) Σ_{j=1}^m x̄_{.j}
R hat

The within-sequence variance W is:

  W = (1/m) Σ_{j=1}^m s_j²

where:

  s_j² = (1/(n − 1)) Σ_{i=1}^n (x_{ij} − x̄_{.j})²

For any finite n, W will underestimate the true variance, since the chains have not had time to explore the entire range of possible values.
R hat

We can estimate the marginal posterior variance of x by a weighted average of W and B:

  var̂⁺(x) = ((n − 1)/n) W + (1/n) B

This quantity overestimates the marginal posterior variance assuming the starting distribution is overdispersed, but is unbiased under stationarity or in the limit n → ∞.

We estimate the factor by which the scale of the current distribution for x might be reduced if we were to continue sampling to infinity by:

  R̂ = sqrt( var̂⁺(x) / W )

If the chains have not mixed well, R̂ is larger than 1.
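The R̂ computation above can be sketched as follows (a minimal illustration; it assumes the warm-up has already been discarded and the chains already split, and the function name is ours):

```python
import numpy as np

def r_hat(chains):
    """R-hat for an (m, n) array of m chains with n draws each, following
    the B / W / var-plus construction: R-hat = sqrt(var_plus / W)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)                  # per-chain means
    grand_mean = chain_means.mean()                    # overall mean
    # Between-sequence variance B and within-sequence variance W
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)
    W = chains.var(axis=1, ddof=1).mean()              # mean of the s_j^2
    var_plus = (n - 1) / n * W + B / n                 # weighted average
    return np.sqrt(var_plus / W)
```

Well-mixed chains from the same distribution give R̂ near 1; chains stuck in different regions give R̂ well above 1.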
Effective Sample Size

- Since our observations are not independent of each other, we de facto gain less information.
- One way to quantify the effective sample size is to consider the statistical efficiency of x̄_{..} as an estimate of E[x]:

  lim_{n→∞} m n · var(x̄_{..}) = (1 + 2 Σ_{t=1}^∞ ρ_t) var(x)

  where ρ_t is the autocorrelation of the sequence x at lag t.
- If the draws were completely independent, we would have var(x̄_{..}) = (1/(mn)) var(x), and the effective sample size would be mn.
Autocorrelations

We define the effective sample size to be:

  n_eff = m n / (1 + 2 Σ_{t=1}^∞ ρ_t)

The ρ_t are unknown, so we estimate them by:

  ρ̂_t = 1 − V_t / (2 var̂⁺)

where:

  V_t = (1/(m(n − t))) Σ_{j=1}^m Σ_{i=t+1}^n (x_{i,j} − x_{i−t,j})²
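A sketch of the n_eff estimate above, using the variogram estimator ρ̂_t; the sum is truncated at the first negative estimate, which is one simple truncation rule chosen for illustration, not prescribed by the slides:

```python
import numpy as np

def effective_sample_size(chains):
    """Effective sample size n_eff = m n / (1 + 2 sum_t rho_hat_t) for an
    (m, n) array of chains, with rho_hat_t = 1 - V_t / (2 var_plus)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)
    W = chains.var(axis=1, ddof=1).mean()
    var_plus = (n - 1) / n * W + B / n
    rho_sum = 0.0
    for t in range(1, n):
        # Variogram V_t = 1/(m(n-t)) sum_j sum_i (x_{i,j} - x_{i-t,j})^2
        V_t = np.mean((chains[:, t:] - chains[:, :-t]) ** 2)
        rho_t = 1.0 - V_t / (2.0 * var_plus)
        if rho_t < 0:          # stop at the first negative estimate
            break
        rho_sum += rho_t
    return m * n / (1.0 + 2.0 * rho_sum)
```

Independent draws give n_eff close to mn, while a strongly autocorrelated chain (e.g. an AR(1) process with coefficient 0.9) gives a much smaller value.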
Diagnostics Summary

- Once R̂ is near 1 and n̂_eff is more than 10 per chain for all scalar estimands, we collect the mn simulations (excluding the burn-in).
- We can then draw inference based on our samples. However:
- Even if the iterative simulations appear to have converged and passed all the tests, they may still be far from convergence!
- When we declare convergence, we mean that all chains appear stationary and well mixed.
- None of the checks we learned today is a hypothesis test: there are no p-values and no statistical significance.