Estimating the evidence for statistical models


1 Estimating the evidence for statistical models Nial Friel University College Dublin March, 2011

2 Introduction Bayesian model choice Given data $y$ and competing models $m_1, \dots, m_l$, each with parameters $\theta_1, \dots, \theta_l$, respectively. Bayesian inference: $\pi(\theta_k, m_k \mid y) \propto \pi(y \mid \theta_k, m_k)\,\pi(\theta_k \mid m_k)\,\pi(m_k)$.

3 Introduction Model evidence Within model $m_k$: $\pi(\theta_k \mid y, m_k) \propto \pi(y \mid \theta_k, m_k)\,\pi(\theta_k \mid m_k)$. The constant of proportionality is $\pi(y \mid m_k) = \int_{\theta_k} \pi(y \mid \theta_k, m_k)\,\pi(\theta_k \mid m_k)\, d\theta_k$. This is often called the marginal likelihood, integrated likelihood or evidence, and is difficult to compute in general.

4 Introduction Posterior model probabilities Suppose we could compute $\pi(y \mid m_k)$. Then, using Bayes' theorem, we get $\pi(m_k \mid y) = \dfrac{\pi(y \mid m_k)\,\pi(m_k)}{\sum_{j=1}^{l} \pi(y \mid m_j)\,\pi(m_j)}$.

5 Introduction Bayes factors If we have two competing models: $\dfrac{\pi(m_1 \mid y)}{\pi(m_2 \mid y)} = \dfrac{\pi(y \mid m_1)}{\pi(y \mid m_2)} \times \dfrac{\pi(m_1)}{\pi(m_2)}$, i.e. posterior odds = Bayes factor $\times$ prior odds. The Bayes factor is $B_{12} = \pi(y \mid m_1)/\pi(y \mid m_2)$. The larger $B_{12}$ is, the greater the evidence in favour of $m_1$ compared to $m_2$.

6 Introduction Bayesian model averaging Predictions can be made by averaging over all models, weighted by the posterior model probabilities, thereby incorporating model uncertainty: $\pi(\tilde{y} \mid y) = \sum_{k=1}^{l} \pi(\tilde{y} \mid m_k, y)\,\pi(m_k \mid y)$. This is the average of the predictive distribution for $\tilde{y}$ under each model, weighted by the corresponding posterior model probabilities.

7 Introduction Why estimating the model evidence is a challenge $\pi(y \mid m_k)$ is an integral of a (usually) highly variable function over a high-dimensional parameter space. Analytic tractability is sometimes possible, typically when conjugate priors are used, but this is quite rare. Consequently, sophisticated Monte Carlo methods are needed.

8 Introduction Within-model search or across-model search? Within-model search: inference for $\pi(\theta_k \mid y)$ is carried out separately for every $m_k$, and used to estimate $\pi(y \mid m_k)$ for all $k$. There are many approaches under this heading. Across-model search: here inference is carried out over the joint model and parameter space, $\pi(\theta_k, m_k \mid y)$. In an MCMC setting, only one chain is needed! Reversible jump Markov chain Monte Carlo, developed by Green (1995), is the dominant approach (> 1,400 citations to date).

9 Review of evidence estimation Laplace's method Laplace's method (e.g. Tierney and Kadane 1986). Assume that $\pi(\theta_k \mid y)$ is highly peaked around the posterior mode $\hat{\theta}_k$, e.g. if the sample size is large enough. Define $l(\theta_k) = \log\{\pi(y \mid \theta_k)\pi(\theta_k)\}$. Expand $l(\theta_k)$ as a quadratic about $\hat{\theta}_k$ and then exponentiate. The result is an approximation to $\pi(y \mid \theta_k)\pi(\theta_k)$ by a Gaussian with mean $\hat{\theta}_k$ and covariance $\Sigma = (-D^2 l(\hat{\theta}_k))^{-1}$, where $D^2 l(\hat{\theta}_k)$ is the Hessian matrix of second derivatives. Integrating this approximation yields $\pi(y) \approx (2\pi)^{d/2}\, |\Sigma|^{1/2}\, \pi(y \mid \hat{\theta}_k)\,\pi(\hat{\theta}_k)$.
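
To make the recipe concrete, here is a minimal numerical sketch of the Laplace approximation for a toy model (Normal data with a Normal prior on the mean), chosen because the exact evidence is available for comparison; the data, prior settings and variable names are illustrative and not taken from the slides. Since this toy posterior is exactly Gaussian, the approximation should essentially reproduce the exact answer.

```python
import numpy as np
from scipy import optimize, stats

# Toy model: y_i ~ N(theta, 1), prior theta ~ N(0, 25).
rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=50)

def log_joint(theta):
    # l(theta) = log pi(y | theta) + log pi(theta)
    return stats.norm.logpdf(y, theta, 1.0).sum() + stats.norm.logpdf(theta, 0.0, 5.0)

# Posterior mode and (one-dimensional) Hessian by finite differences.
mode = optimize.minimize_scalar(lambda t: -log_joint(t)).x
eps = 1e-4
hess = (log_joint(mode + eps) - 2 * log_joint(mode) + log_joint(mode - eps)) / eps**2
sigma2 = -1.0 / hess                          # Sigma = (-D^2 l(mode))^{-1}

# Laplace estimate: log pi(y) ~ (d/2) log(2 pi) + (1/2) log|Sigma| + l(mode), with d = 1.
log_ev_laplace = 0.5 * np.log(2 * np.pi) + 0.5 * np.log(sigma2) + log_joint(mode)

# Exact evidence for this conjugate model: y ~ N(0, I + 25 * 11').
n = len(y)
cov = np.eye(n) + 25.0 * np.ones((n, n))
log_ev_exact = stats.multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)
print(f"Laplace: {log_ev_laplace:.4f}   exact: {log_ev_exact:.4f}")
```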

10 Review of evidence estimation Harmonic mean estimator Harmonic mean estimator (Newton and Raftery (1994)) $\hat{\pi}(y) = 1 \Big/ \left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\pi(y \mid \theta_i)} \right)$, with $\theta_i \sim \pi(\theta \mid y)$. Why does this hold? $E_{\pi(\theta \mid y)}\left\{ \frac{1}{\pi(y \mid \theta)} \right\} = \int \frac{\pi(\theta \mid y)}{\pi(y \mid \theta)}\, d\theta = \int \frac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(y \mid \theta)\,\pi(y)}\, d\theta = \frac{1}{\pi(y)} \int \pi(\theta)\, d\theta = \frac{1}{\pi(y)}$. The bad news?

13 Review of evidence estimation Harmonic mean estimator Harmonic mean estimator (Newton and Raftery (1994)) $\hat{\pi}(y) = 1 \Big/ \left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\pi(y \mid \theta_i)} \right)$, with $\theta_i \sim \pi(\theta \mid y)$. This estimator is based solely on draws from the posterior. But the posterior is typically much more peaked than the prior, e.g. when the posterior is insensitive to the prior. Hence in such situations the harmonic mean estimator will not change much as the prior changes, yet $\pi(y)$ is very sensitive to changes in the prior. This drawback is very well documented; see Radford Neal's blog, for example.
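
The estimator and its drawback can both be seen on a toy conjugate model (Normal data, Normal prior on the mean) for which the exact evidence is known; everything below is an illustrative sketch rather than anything from the slides, and the harmonic mean is computed on the log scale with logsumexp for numerical stability.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

# Toy conjugate model: y_i ~ N(theta, 1), theta ~ N(0, v).
rng = np.random.default_rng(2)
n, v = 50, 25.0
y = rng.normal(0.5, 1.0, size=n)

# Exact posterior: theta | y ~ N(m_post, v_post); exact draws, so no MCMC error here.
v_post = 1.0 / (n + 1.0 / v)
m_post = v_post * n * y.mean()
theta = rng.normal(m_post, np.sqrt(v_post), size=100_000)

# Harmonic mean: pi_hat(y) = 1 / mean(1 / pi(y | theta_i)), done on the log scale.
log_lik = np.array([stats.norm.logpdf(y, t, 1.0).sum() for t in theta])
log_hm = -(logsumexp(-log_lik) - np.log(log_lik.size))

# Exact log evidence: y ~ N(0, I + v * 11').
cov = np.eye(n) + v * np.ones((n, n))
log_exact = stats.multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)
print(f"harmonic mean: {log_hm:.2f}   exact: {log_exact:.2f}")
# Re-running with other seeds, or increasing the prior variance v, shows how erratic
# the harmonic mean estimate is: it barely responds to v, whereas pi(y) certainly does.
```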

14 Review of evidence estimation Chib's method Chib's method (Chib 1995) Chib (1995) presented a generic method which can be applied to output from the Gibbs sampler. Since $\pi(\theta \mid y) = \dfrac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(y)}$, re-writing this gives $\pi(y) = \dfrac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(\theta \mid y)}$. So we could estimate $\log \pi(y)$ as $\log \pi(y) = \log \pi(y \mid \theta^*) + \log \pi(\theta^*) - \log \hat{\pi}(\theta^* \mid y)$, where $\hat{\pi}(\theta^* \mid y)$ is an estimate of the posterior density at a point $\theta^*$ of high posterior probability.

15 Review of evidence estimation Chib's method Chib's method (Chib 1995) Chib's method relies on estimating $\pi(\theta^* \mid y)$. Suppose the vector $\theta$ can be partitioned as $(\theta_1, \theta_2, \theta_3)$, where the full conditional distribution of each $\theta_i$ is standard. Then $\pi(\theta^* \mid y) = \pi(\theta_1^* \mid \theta_2^*, \theta_3^*, y)\,\pi(\theta_2^* \mid \theta_3^*, y)\,\pi(\theta_3^* \mid y)$. The first factor is a full conditional and can be evaluated directly; Gibbs sampling output can be used to estimate the remaining factors on the right-hand side: $\hat{\pi}(\theta_3^* \mid y) = \frac{1}{N} \sum_j \pi(\theta_3^* \mid \theta_1^{(j)}, \theta_2^{(j)}, y)$, and $\hat{\pi}(\theta_2^* \mid \theta_3^*, y) = \frac{1}{N} \sum_j \pi(\theta_2^* \mid \theta_1^{(j)}, \theta_3^*, y)$, the latter from a reduced run with $\theta_3$ fixed at $\theta_3^*$.

16 Review of evidence estimation Chib's method Chib's method (Chib 1995) In general, Chib's method can be applied when $\theta$ is partitioned into an arbitrary number of blocks. The only requirement is that full conditional sampling of each block is possible.
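
A compact sketch of Chib's estimator for a two-block example: Normal data with unknown mean and precision under semi-conjugate priors, so both full conditionals are standard. The posterior ordinate for the mean is Rao-Blackwellised over the Gibbs output, and the precision ordinate is a closed-form full conditional. Prior settings, run lengths and names are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from scipy import stats

# Model: y_i ~ N(mu, 1/tau); priors mu ~ N(m0, v0), tau ~ Gamma(a0, b0) (shape/rate).
rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=100)
n, ybar = len(y), y.mean()
m0, v0, a0, b0 = 0.0, 100.0, 2.0, 2.0

def mu_cond(tau):
    # Parameters of the full conditional mu | tau, y ~ N(m, v).
    v = 1.0 / (1.0 / v0 + n * tau)
    return v * (m0 / v0 + tau * n * ybar), v

def tau_cond(mu):
    # Parameters of the full conditional tau | mu, y ~ Gamma(a, b).
    return a0 + n / 2.0, b0 + 0.5 * np.sum((y - mu) ** 2)

# Main Gibbs run.
mu, tau, draws = ybar, 1.0, []
for _ in range(6000):
    m, v = mu_cond(tau)
    mu = rng.normal(m, np.sqrt(v))
    a, b = tau_cond(mu)
    tau = rng.gamma(a, 1.0 / b)
    draws.append((mu, tau))
draws = draws[1000:]                               # discard burn-in

# A high posterior probability point (mu*, tau*): posterior means are good enough here.
mu_star = np.mean([d[0] for d in draws])
tau_star = np.mean([d[1] for d in draws])

# pi_hat(mu* | y): Rao-Blackwellised average of the mu full conditional over the tau draws.
dens = []
for _, tau_j in draws:
    m, v = mu_cond(tau_j)
    dens.append(stats.norm.pdf(mu_star, m, np.sqrt(v)))
log_post_mu = np.log(np.mean(dens))

# pi(tau* | mu*, y) is a closed-form Gamma full conditional.
a, b = tau_cond(mu_star)
log_post_tau = stats.gamma.logpdf(tau_star, a, scale=1.0 / b)

# Chib: log pi(y) = log lik + log prior - log posterior ordinate, all at (mu*, tau*).
log_lik = stats.norm.logpdf(y, mu_star, 1.0 / np.sqrt(tau_star)).sum()
log_prior = (stats.norm.logpdf(mu_star, m0, np.sqrt(v0))
             + stats.gamma.logpdf(tau_star, a0, scale=1.0 / b0))
print("Chib log evidence:", log_lik + log_prior - (log_post_mu + log_post_tau))
```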

17 Review of evidence estimation Annealed importance sampling Annealed Importance Sampling (Neal 2001) AIS is a very clever algorithm which shows how tempering can be used to define an importance sampling function to sample from complex distributions. Aside: importance sampling from a target $f(x)$ using an importance function $g(x)$: draw $x^{(1)}, \dots, x^{(N)} \sim g(x)$. Then $E_f\, a(x) \approx \dfrac{\sum_i w^{(i)} a(x^{(i)})}{\sum_i w^{(i)}}$, where $w^{(i)} = \dfrac{f(x^{(i)})}{g(x^{(i)})}$. Further, $\frac{1}{N} \sum_i w^{(i)} \to \frac{z_f}{z_g}$ as $N \to \infty$, where $z_f = \int_x f(x)\, dx$ and $z_g = \int_x g(x)\, dx$.

18 Review of evidence estimation Annealed importance sampling Annealed Importance Sampling (Neal 2001) Define $\pi_i(\theta \mid y) = \pi(\theta)^{1-t_i}\,\pi(\theta \mid y)^{t_i}$, where $1 = t_0 > \dots > t_n = 0$. Thus $\pi_0$ and $\pi_n$ correspond to the posterior and the prior, respectively. Let $T_i$ denote a Markov transition kernel with invariant distribution $\pi_i$. For $j = 1, \dots, N$: sample $\theta_{n-1}$ from $\pi_n$; sample $\theta_{n-2}$ from $\theta_{n-1}$ using $T_{n-1}$; ...; sample $\theta_0$ from $\theta_1$ using $T_1$. Set $\theta^{(j)} = \theta_0$ and $w^{(j)} = \dfrac{\pi_{n-1}(\theta_{n-1})}{\pi_n(\theta_{n-1})} \, \dfrac{\pi_{n-2}(\theta_{n-2})}{\pi_{n-1}(\theta_{n-2})} \cdots \dfrac{\pi_0(\theta_0)}{\pi_1(\theta_0)}$.

19 Review of evidence estimation Annealed importance sampling Annealed Importance Sampling AIS yields: 1. An independent sample $\{\theta^{(j)}\}$ from $\pi(\theta \mid y)$. 2. An estimator of the evidence: $\pi(y) \approx \frac{1}{N} \sum_{j=1}^{N} w^{(j)}$.
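
A sketch of AIS on a one-dimensional toy model (Normal data, Normal prior), with a geometric bridge between prior and posterior and a single random-walk Metropolis step per temperature; the ladder, proposal scale and number of particles are illustrative choices, not prescriptions from the slides.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

# Toy model: y_i ~ N(theta, 1), theta ~ N(0, 25).
rng = np.random.default_rng(4)
y = rng.normal(0.5, 1.0, size=30)

def log_prior(t):
    return stats.norm.logpdf(t, 0.0, 5.0)

def log_lik(t):
    return stats.norm.logpdf(y, t, 1.0).sum()

def log_tempered(t, beta):
    # Unnormalised tempered density: log prior + beta * log likelihood.
    return log_prior(t) + beta * log_lik(t)

betas = np.linspace(0.0, 1.0, 51)        # temperature ladder from prior to posterior
N = 500                                  # number of AIS particles
log_w = np.zeros(N)

for j in range(N):
    theta = rng.normal(0.0, 5.0)         # draw from the prior (beta = 0)
    for i in range(1, len(betas)):
        # Accumulate the incremental importance weight at the current state.
        log_w[j] += (betas[i] - betas[i - 1]) * log_lik(theta)
        # One Metropolis step leaving the current tempered density invariant.
        prop = theta + rng.normal(0.0, 0.5)
        if np.log(rng.uniform()) < log_tempered(prop, betas[i]) - log_tempered(theta, betas[i]):
            theta = prop

# The average weight estimates pi(y); work on the log scale for stability.
print("AIS log evidence:", logsumexp(log_w) - np.log(N))
```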

20 Review of evidence estimation Power posteriors Evidence estimation via power posteriors (NF and Pettitt (2008)) Consider the power posterior: $\pi(\theta \mid y, t) \propto \{\pi(y \mid \theta)\}^{t}\,\pi(\theta)$, with temperature $t \in [0, 1]$ (more generally, a schedule $T: [0,1] \to [0,1]$ with $T(0) = 0$ and $T(1) = 1$ may be used for the exponent). Its normalising constant is $z(y \mid t) = \int_\theta \{\pi(y \mid \theta)\}^{t}\,\pi(\theta)\, d\theta$. $z(y \mid t = 1)$: the posterior model evidence. $z(y \mid t = 0)$: the integral of the prior for $\theta$, which equals 1.

21 Review of evidence estimation Power posteriors Evidence via power posteriors The evidence follows from the identity: $\log \pi(y) = \log\left\{ \dfrac{z(y \mid t=1)}{z(y \mid t=0)} \right\} = \int_0^1 E_{\theta \mid y, t}\, \log \pi(y \mid \theta)\, dt$. Proof: $\dfrac{d}{dt} \log z(y \mid t) = \dfrac{z'(y \mid t)}{z(y \mid t)} = \dfrac{1}{z(y \mid t)} \int \dfrac{d}{dt} \{\pi(y \mid \theta)\}^{t}\,\pi(\theta)\, d\theta = \int \log \pi(y \mid \theta)\, \dfrac{\{\pi(y \mid \theta)\}^{t}\,\pi(\theta)}{z(y \mid t)}\, d\theta = E_{\theta \mid y, t}\, \log \pi(y \mid \theta)$.

22 Review of evidence estimation Power posteriors Evidence via power posteriors $\dfrac{d}{dt} \log z(y \mid t) = E_{\theta \mid y, t}\, \log \pi(y \mid \theta)$. This is the mean deviance with respect to $\pi(\theta \mid y, t)$, the power posterior. Integrating with respect to $t$ yields $\log \pi(y) = \log\left\{ \dfrac{z(y \mid t=1)}{z(y \mid t=0)} \right\} = \int_0^1 E_{\theta \mid y, t}\, \log \pi(y \mid \theta)\, dt$. This is essentially an application of thermodynamic integration, which was first developed in the statistical physics community and is outlined in Gelman and Meng (1998).

23 Review of evidence estimation Power posteriors In practice: Discretise $t \in [0, 1]$ as $0 = t_0 < t_1 < \dots < t_n = 1$. For each $t_i$: sample $\theta \sim \pi(\theta \mid y, t_i)$ and estimate $E_i = E_{\theta \mid y, t_i}\, \log \pi(y \mid \theta)$. Then apply the trapezoidal rule: $\log \widehat{\pi}(y) = \sum_{i=1}^{n} (t_i - t_{i-1})\, \dfrac{E_{i-1} + E_i}{2}$.
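
The whole recipe in a few lines, for the toy conjugate model analysed on the next slide, where the power posterior is itself Gaussian and so can be sampled exactly at each temperature; the ladder below concentrates rungs near $t = 0$ (a schedule of the form $t_i = (i/n)^c$ is one common choice), and all settings here are illustrative.

```python
import numpy as np
from scipy import stats

# Toy model: y_i ~ N(theta, 1), theta ~ N(0, v). The power posterior at temperature t
# is N(m_t, v_t) with m_t = n*t*ybar / (n*t + 1/v) and v_t = 1 / (n*t + 1/v).
rng = np.random.default_rng(5)
n, v = 50, 25.0
y = rng.normal(0.5, 1.0, size=n)
ybar = y.mean()

ts = np.linspace(0.0, 1.0, 21) ** 4      # temperature ladder, dense near t = 0

E = []
for t in ts:
    v_t = 1.0 / (n * t + 1.0 / v)
    m_t = v_t * n * t * ybar
    theta = rng.normal(m_t, np.sqrt(v_t), size=2000)   # exact draws from the power posterior
    loglik = np.array([stats.norm.logpdf(y, th, 1.0).sum() for th in theta])
    E.append(loglik.mean())                            # E_i = E_{theta | y, t} log pi(y | theta)
E = np.array(E)

# Trapezoidal rule over the ladder gives the log evidence.
log_evidence_pp = np.sum(np.diff(ts) * (E[:-1] + E[1:]) / 2.0)

# Exact log evidence for comparison: y ~ N(0, I + v * 11').
cov = np.eye(n) + v * np.ones((n, n))
log_evidence_exact = stats.multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)
print(f"power posterior: {log_evidence_pp:.3f}   exact: {log_evidence_exact:.3f}")
```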

24 Review of evidence estimation Power posteriors Sensitivity of $\pi(y)$ to the prior - toy example How does sensitivity to the prior impact on this method? Suppose $y = \{y_i\}$ iid $N(\theta, 1)$. A priori, $\theta \sim N(m, v)$. Then the power posterior is $\theta \mid y, t \sim N(m_t, v_t)$, where $m_t = \dfrac{nt\bar{y} + m/v}{nt + 1/v}$ and $v_t = \dfrac{1}{nt + 1/v}$, and $E_{\theta \mid y, t}\, \log \pi(y \mid \theta) = -\dfrac{n}{2}\log 2\pi - \dfrac{1}{2}\sum_{i=1}^{n}(y_i - \bar{y})^2 - \dfrac{n}{2}\,\dfrac{(m - \bar{y})^2}{(vnt + 1)^2} - \dfrac{n}{2}\,\dfrac{1}{nt + 1/v}$. When $t = 0$ the final term is $-nv/2$; as $v \to \infty$, the magnitude of $E_{\theta \mid y, t=0}$ grows without bound.
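
A few lines to evaluate this expression numerically and see the behaviour near $t = 0$ for increasing prior variance; the simulated data below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
y = rng.normal(0.0, 1.0, size=n)
ybar, ss = y.mean(), np.sum((y - y.mean()) ** 2)

def expected_log_lik(t, m=0.0, v=1.0):
    # E_{theta | y, t} log pi(y | theta) for the N(theta, 1) model with theta ~ N(m, v).
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * ss
            - 0.5 * n * (m - ybar) ** 2 / (v * n * t + 1) ** 2
            - 0.5 * n / (n * t + 1.0 / v))

for v in (1.0, 5.0, 10.0):
    print(f"v = {v:5.1f}:", [round(expected_log_lik(t, v=v), 1) for t in (0.0, 0.01, 0.1, 1.0)])
# The larger the prior variance v, the more steeply the mean deviance climbs near t = 0,
# which is exactly where the trapezoidal rule needs fine temperature spacing.
```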

25 Review of evidence estimation Power posteriors Expected deviance under the distribution $\theta \mid y, t$, plotted against $t$, for prior variance equal to 10, 5 and 1. As $v$ increases, so too does the rate at which the mean deviance changes with $t$.

26 Review of evidence estimation Power posteriors Connection to the fractional Bayes estimator The fraction $z(y \mid t=1)/z(y \mid t=a)$, where $a$ is close to 0, is precisely the estimate of the marginal likelihood used in the fractional Bayes estimate of the Bayes factor (O'Hagan 1995): $\pi(y) \approx \dfrac{z(y \mid t=1)}{z(y \mid t=a)} = \dfrac{\int_\theta \pi(y \mid \theta)\,\pi(\theta)\, d\theta}{\int_\theta \{\pi(y \mid \theta)\}^{a}\,\pi(\theta)\, d\theta}$, and $\log \dfrac{z(y \mid t=1)}{z(y \mid t=a)} = \int_a^1 E_{\theta \mid y, t}\, \log \pi(y \mid \theta)\, dt$. This method was proposed to compute Bayes factors with uninformative priors. Impropriety in $\pi(\theta)$ cancels above and below; essentially a fraction $a$ of the data is borrowed for the prior.

27 Review of evidence estimation Power posteriors Power posterior approach It is relatively straightforward to code and implement. It is a generic method. In some cases it can be implemented in WinBUGS. Choosing the temperature schedule is vital; this is the weakness of this approach. Behrens, NF and Hurn (2011) offer some possibilities in this direction.

28 Review of evidence estimation Nested sampling Nested sampling (Skilling, 2006) (For the moment, for ease of notation, let $L(\theta) = \pi(y \mid \theta)$.) $\pi(y) = \int L(\theta)\,\pi(\theta)\, d\theta = \int L(\theta)\, dX$, where $dX = \pi(\theta)\, d\theta$ is an element of prior mass. Define $X(\lambda) = \int_{L(\theta) > \lambda} \pi(\theta)\, d\theta$ as a cumulative prior mass. Write the inverse function as $L(X)$, i.e. $L(X(\lambda)) = \lambda$. This then allows us to express the evidence as a one-dimensional integral: $\pi(y) = \int_0^1 L(X)\, dX$.

31 Review of evidence estimation Nested sampling Nested sampling The main computational burden is the requirement to sample $\theta$ from the prior subject to the constraint that $L(\theta) > l$. This is roughly similar to the computational effort of slice sampling (Neal, 2003). The evidence is estimated by sorting draws from the prior according to their likelihood: $\widehat{\pi}(y) = \widehat{Z} = \sum_{i=1}^{I-1} (X_i - X_{i+1})\, L_i$.

32 Review of evidence estimation Nested sampling Sketch of algorithm Sample $\theta_1, \dots, \theta_N$ from the prior. Repeat for $i = 1, \dots, I$: Find the point $\theta_k$ with the smallest likelihood, $l_i$, among the $N$ current points. Set $X_i = \exp(-i/N)$ and $w_i = X_{i-1} - X_i$. Increment $Z$ by $l_i w_i$. Replace $\theta_k$ with a point sampled from the prior subject to $L(\theta) > l_i$.
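
A sketch of this algorithm on the same one-dimensional toy model used earlier (Normal data, Normal prior), where the constrained prior draws are done by naive rejection; that is only workable in such a low-dimensional example, and real implementations replace it with constrained MCMC or slice-type moves. All settings below are illustrative.

```python
import numpy as np
from scipy import stats

# Toy model: y_i ~ N(theta, 1), theta ~ N(0, 25).
rng = np.random.default_rng(7)
y = rng.normal(0.5, 1.0, size=30)

def log_lik(t):
    return stats.norm.logpdf(y, t, 1.0).sum()

N, I = 200, 1500                         # live points and number of iterations
live = rng.normal(0.0, 5.0, size=N)      # draws from the prior
live_ll = np.array([log_lik(t) for t in live])

Z, X_prev = 0.0, 1.0
for i in range(1, I + 1):
    k = np.argmin(live_ll)               # worst live point
    l_i = live_ll[k]
    X_i = np.exp(-i / N)                 # deterministic prior-mass schedule
    Z += np.exp(l_i) * (X_prev - X_i)    # increment the evidence
    X_prev = X_i
    # Replace the worst point with a prior draw satisfying L(theta) > l_i (rejection,
    # which gets slower and slower as the constraint tightens).
    while True:
        cand = rng.normal(0.0, 5.0)
        cand_ll = log_lik(cand)
        if cand_ll > l_i:
            live[k], live_ll[k] = cand, cand_ll
            break

Z += np.exp(live_ll).mean() * X_prev     # crude termination: add the remaining live points
print("nested sampling log evidence:", np.log(Z))
```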

33 Evidence estimation: doubly intractable distributions Doubly intractable distributions $\pi(\theta \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)$. Here we assume that the likelihood, $\pi(y \mid \theta)$, is impossible to evaluate.

34 Evidence estimation: doubly intractable distributions Ising model Doubly intractable distributions Gibbs random fields, which find use in spatial statistics and statistical network analysis, involve intractable likelihood models. Ising model Defined on a lattice $y = \{y_1, \dots, y_n\}$. Lattice points $y_i$ take values $\{-1, 1\}$. Full conditional: $\pi(y_i \mid y_{-i}, \theta) = \pi(y_i \mid \text{neighbours of } i, \theta)$. $\pi(y \mid \theta) \propto q(y \mid \theta) = \exp\left\{ \frac{\theta_1}{2} \sum_{i \sim j} y_i y_j \right\}$. Here $i \sim j$ means $i$ is a neighbour of $j$.

35 Evidence estimation: doubly intractable distributions Ising model 1st order and 2nd order Ising models: $\pi(y \mid \theta) = \dfrac{\exp(\theta^{t} s(y))}{z(\theta)}$, where $s(y)$ is a sufficient statistic which counts the number of like neighbours, and $z(\theta) = \sum_{y_1} \cdots \sum_{y_n} q(y \mid \theta)$ sums over all possible lattice configurations.
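
To see the intractability concretely: on a tiny lattice one can enumerate every configuration and compute $z(\theta)$ by brute force, which rapidly becomes impossible as the lattice grows. The sketch below uses an illustrative $3 \times 4$ lattice and takes $s(y)$ to be the sum of $y_i y_j$ over neighbouring pairs, one common convention (the slides' exact parametrisation may differ by a constant factor).

```python
import numpy as np
from itertools import product

NR, NC = 3, 4                                     # 12 sites -> 2^12 = 4096 configurations

def suff_stat(y):
    # s(y) = sum over horizontal and vertical neighbour pairs of y_i * y_j.
    return np.sum(y[:, :-1] * y[:, 1:]) + np.sum(y[:-1, :] * y[1:, :])

# Enumerate every configuration once and store its sufficient statistic.
configs = np.array(list(product([-1, 1], repeat=NR * NC))).reshape(-1, NR, NC)
stats_all = np.array([suff_stat(c) for c in configs])

def log_z(theta):
    # log z(theta) = log sum_y exp(theta * s(y)), computed stably.
    a = theta * stats_all
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

for theta in (0.0, 0.2, 0.4):
    print(f"theta = {theta:.1f}: log z(theta) = {log_z(theta):.3f}")
# For a 16 x 16 lattice the sum would already have 2^256 terms, hence "doubly intractable".
```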

36 Evidence estimation: doubly intractable distributions Ising model Model evidence for MRFs: our approach $\pi(y) = \dfrac{q(y \mid \theta)\,\pi(\theta)}{z(\theta)\,\pi(\theta \mid y)}$, for any $\theta$. Draw from the posterior, and estimate $\pi(\theta \mid y)$ at a high-probability $\theta$. Estimate $z(\theta)$ using thermodynamic integration.

37 Evidence estimation: doubly intractable distributions Simulating from the posterior Auxiliary variable method (Møller et al., 2006) Introduce an auxiliary variable $y'$ on the same space as the data $y$ and extend the target distribution: $\pi(\theta, y' \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)\,\pi(y' \mid \theta_0)$, for some fixed $\theta_0$. Jointly update $(\theta, y')$ with proposal $h(\theta', y'' \mid \theta, y') = h_1(y'' \mid \theta')\, h_2(\theta' \mid \theta, y')$, where $h_1(y'' \mid \theta') = \pi(y'' \mid \theta') = \dfrac{q(y'' \mid \theta')}{z(\theta')}$.

38 Evidence estimation: doubly intractable distributions Simulating from the posterior $\alpha(\theta', y'' \mid \theta, y') = \dfrac{\pi(y \mid \theta')\,\pi(\theta')\,\pi(y'' \mid \theta_0)\,\pi(y' \mid \theta)\,h_2(\theta \mid \theta')}{\pi(y \mid \theta)\,\pi(\theta)\,\pi(y' \mid \theta_0)\,\pi(y'' \mid \theta')\,h_2(\theta' \mid \theta)}$. $z(\theta')$ appears in $\pi(y \mid \theta')$ above and in $\pi(y'' \mid \theta')$ below, and therefore cancels. Similarly, $z(\theta)$ cancels above and below. The choice of $\theta_0$ is important, e.g. the maximum pseudolikelihood estimate based on $y$.

39 Evidence estimation: doubly intractable distributions Exchange algorithm Exchange algorithm (Murray, Ghahramani & MacKay 2006) Sample from an augmented distribution $\pi(\theta, y', \theta' \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)\,h(\theta' \mid \theta)\,\pi(y' \mid \theta')$, whose marginal distribution for $\theta$ is the posterior of interest. $\pi(y' \mid \theta')$ is the same likelihood model on which $y$ is defined. $h(\theta' \mid \theta)$ is an arbitrary distribution for the augmented variable $\theta'$ which might depend on $\theta$ (e.g. a random walk distribution centred at $\theta$).

40 Evidence estimation: doubly intractable distributions Exchange algorithm Exchange algorithm How it works 1. Gibbs update of $(\theta', y')$: (i) draw $\theta' \sim h(\cdot \mid \theta)$; (ii) draw $y' \sim \pi(\cdot \mid \theta')$. 2. Exchange move from $(\theta, y), (\theta', y')$ to $(\theta', y), (\theta, y')$ with probability $\alpha = \min\left( 1,\ \underbrace{\dfrac{q(y \mid \theta')}{q(y \mid \theta)}}_{(*)} \, \dfrac{\pi(\theta')\,h(\theta \mid \theta')}{\pi(\theta)\,h(\theta' \mid \theta)} \, \underbrace{\dfrac{q(y' \mid \theta)}{q(y' \mid \theta')}}_{(**)} \, \underbrace{\dfrac{z(\theta)\,z(\theta')}{z(\theta)\,z(\theta')}}_{=1} \right)$. The exchange move proposes to offer the data $y$ the auxiliary parameter $\theta'$ and, similarly, to offer the auxiliary data $y'$ the parameter $\theta$. The affinity between $\theta'$ and $y$ is measured by $(*)$ and the affinity between $\theta$ and $y'$ by $(**)$.

42 Evidence estimation: doubly intractable distributions Exchange algorithm Exchange algorithm for the Ising model The term $\alpha = \min\left( 1,\ \dfrac{\pi(\theta')}{\pi(\theta)} \exp\left\{ (\theta - \theta')^{t}\, (s(y') - s(y)) \right\} \right)$ can be viewed as a measure of distance between the observed data $y$ and the auxiliary data $y'$. It is somewhat similar to the accept/reject step in ABC (approximate Bayesian computation). Note: if $\theta' \approx \theta$, then $\alpha \approx 1$. This does not necessarily happen with ABC.

43 Evidence estimation: doubly intractable distributions Exchange algorithm Exchange algorithm for the Ising model The main difficulty is the need to draw an exact sample $y' \sim \pi(\cdot \mid \theta')$. Perfect sampling is an obvious approach. A pragmatic alternative is to take a realisation from a long MCMC run with stationary distribution $\pi(y' \mid \theta')$ as an approximate draw.
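
A self-contained sketch of the exchange algorithm for a small first-order Ising model, using the pragmatic approximation just described, i.e. a long Gibbs run in place of a perfect sampler for the auxiliary draw. The lattice size, flat prior, proposal scale and run lengths are illustrative, and the pure-Python sweeps are slow (tens of seconds) but keep the structure visible.

```python
import numpy as np

rng = np.random.default_rng(9)

def suff_stat(y):
    # s(y): sum of y_i * y_j over horizontal and vertical neighbour pairs.
    return np.sum(y[:, :-1] * y[:, 1:]) + np.sum(y[:-1, :] * y[1:, :])

def gibbs_sweep(y, theta):
    # One pass of single-site full-conditional updates (free boundary).
    nr, nc = y.shape
    for i in range(nr):
        for j in range(nc):
            s = ((y[i - 1, j] if i > 0 else 0) + (y[i + 1, j] if i < nr - 1 else 0) +
                 (y[i, j - 1] if j > 0 else 0) + (y[i, j + 1] if j < nc - 1 else 0))
            y[i, j] = 1 if rng.uniform() < 1.0 / (1.0 + np.exp(-2.0 * theta * s)) else -1
    return y

def draw_auxiliary(theta, shape, sweeps=100):
    # Approximate draw y' ~ pi(. | theta) via a long Gibbs run from a random start.
    x = rng.choice([-1, 1], size=shape)
    for _ in range(sweeps):
        x = gibbs_sweep(x, theta)
    return x

# "Observed" data simulated at a known theta, so the posterior can be eyeballed.
true_theta = 0.3
y_obs = draw_auxiliary(true_theta, (10, 10), sweeps=300)
s_obs = suff_stat(y_obs)

# Exchange algorithm with a flat prior on (0, 1) and a random-walk proposal.
theta, chain = 0.5, []
for _ in range(500):
    theta_prop = theta + rng.normal(0.0, 0.05)
    if 0.0 < theta_prop < 1.0:
        y_aux = draw_auxiliary(theta_prop, y_obs.shape)
        # log alpha = (theta - theta') * (s(y') - s(y)); the flat prior cancels.
        if np.log(rng.uniform()) < (theta - theta_prop) * (suff_stat(y_aux) - s_obs):
            theta = theta_prop
    chain.append(theta)

print("posterior mean of theta:", np.mean(chain[100:]))
```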

44 Evidence estimation: doubly intractable distributions Ising model Simulation study: Ising model Data $y$ simulated from an Ising model defined on a lattice, with a single interaction parameter $\theta$. Two competing models: 4 and 8 nearest neighbours. Here the lattices are sufficiently small to allow a very accurate estimate of the Bayes factor: the normalising constant $z(\theta)$ can be calculated exactly for a grid of $\{\theta_i\}$ values, which can then be plugged into the right-hand side of $\pi(\theta_i \mid y) \propto \dfrac{q(y \mid \theta_i)}{z(\theta_i)}\, \pi(\theta_i)$, $i = 1, \dots, n$. Summing up the right-hand side over the grid yields an estimate of $\pi(y)$. This serves as a ground truth to compare with the corresponding MCMC-based estimate of the model evidence.
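
On a lattice small enough to enumerate, this check can be reproduced end to end. The sketch below does so for an illustrative $3 \times 4$ lattice (repeating the brute-force helpers from the earlier sketch so that it runs on its own), with a uniform prior on $\theta$ and a simple trapezoidal quadrature over a grid of $\theta$ values; the slides' study uses lattices that are small enough for exact calculation of $z(\theta)$, which this miniature only imitates.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(10)
NR, NC = 3, 4                                      # tiny lattice: 2^12 configurations

def suff_stat(y):
    return np.sum(y[:, :-1] * y[:, 1:]) + np.sum(y[:-1, :] * y[1:, :])

configs = np.array(list(product([-1, 1], repeat=NR * NC))).reshape(-1, NR, NC)
stats_all = np.array([suff_stat(c) for c in configs])

def log_z(theta):
    a = theta * stats_all
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# "Observed" data: any fixed configuration serves for the illustration.
y_obs = rng.choice([-1, 1], size=(NR, NC))
s_obs = suff_stat(y_obs)

# Uniform prior theta ~ U(0, 1); pi(y) = int q(y | theta) / z(theta) * pi(theta) d theta,
# approximated on a grid of theta values with the trapezoidal rule.
grid = np.linspace(0.0, 1.0, 101)
integrand = np.exp(np.array([theta * s_obs - log_z(theta) for theta in grid]))
log_evidence = np.log(np.sum(np.diff(grid) * (integrand[:-1] + integrand[1:]) / 2.0))
print("grid-based log evidence:", log_evidence)
```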

45 Evidence estimation: doubly intractable distributions Ising model Results: Ising model [Table comparing the exact Bayes factor (BF) with its MCMC-based estimate ($\widehat{\mathrm{BF}}$) across a range of $\theta$ values; the numerical entries did not survive the transcription.]

46 Evidence estimation: doubly intractable distributions Exponential random graph models Friendships in a karate club in a US university.

47 Evidence estimation: doubly intractable distributions Exponential random graph models High school dating

48 Evidence estimation: doubly intractable distributions Exponential random graph models The exponential random graph (or $p^*$) model First proposed by Frank and Strauss (JASA, 1986). Let $y_{ij} = 1$ denote an edge connecting nodes $i$ and $j$, and 0 otherwise. The data $y$ form an adjacency matrix indicating which nodes are connected by an edge. 1. Edges $y_{ij}$ and $y_{kl}$ are neighbours of one another if they share a common node. 2. If $y_{ij}$ and $y_{kl}$ are not neighbours, then $y_{ij}$ and $y_{kl}$ are conditionally independent, given the rest of the graph.

50 Evidence estimation: doubly intractable distributions Exponential random graph models The $p^*$ model $\pi(y \mid \theta) = \dfrac{\exp\{\theta^{t} s(y)\}}{z(\theta)} = \dfrac{q(y \mid \theta)}{z(\theta)}$, where $y$ is the observed graph, $s(y)$ is a known vector of sufficient statistics, $\theta$ is a vector of parameters, and $z(\theta)$ is the normalising constant $z(\theta) = \sum_{\text{all possible graphs}} \exp\{\theta^{t} s(y)\}$. There are $2^{\binom{n}{2}}$ possible undirected graphs on $n$ nodes, so calculation of $z(\theta)$ is infeasible for all but trivially small graphs.

52 Evidence estimation: doubly intractable distributions Exponential random graph models Model Specification: Network Statistics. (a) edge, mutual edge, 2-in-star, 2-out-star, 2-mixed-star, transitive triad, cyclic triad; (b) edge, 2-star, 3-star, triangle.

53 Evidence estimation: doubly intractable distributions Exponential random graph models ERGM: Florentine network Model 1: y ~ edges + 3-star. Model 2: y ~ edges + 2-star. Model 3: y ~ edges + 2-star + 3-star.

54 Evidence estimation: doubly intractable distributions Exponential random graph models ERGM: Florentine network Here it is difficult to establish a ground truth. For this purpose, we ran an independence RJMCMC sampler: 1. Sample from each model, separately, using the exchange algorithm. (Here we used the Bergm package of Caimo and NF (2011).) 2. RJMCMC: Use the posterior mean and variance for model $k$ as proposal parameters when proposing to jump to model $k$. This works well, since the model space is small, but also because each posterior model is unimodal. Acceptance rates for the jump proposals were around 40%, suggesting that the proposal distributions were a good fit to each posterior model. This is essentially the AutoRJ approach outlined in Chapter 6 of Green (2003).

56 Evidence estimation: doubly intractable distributions Exponential random graph models ERGM: Florentine network Here estimates of posterior model probabilities based on AutoRJ are compared to those based on estimates of the model evidence for each model. [Table with columns $\pi(m_1 \mid y)$, $\pi(m_2 \mid y)$, $\pi(m_3 \mid y)$ and rows AutoRJ and model-evidence-based; the numerical entries did not survive the transcription.]

57 Evidence estimation: doubly intractable distributions Summary Concluding remarks Model evidence is difficult to compute! Often complex Monte Carlo methods are needed. There are plenty of methods in the Bayesian toolbox. A quick solution is not necessarily the best one!

58 Evidence estimation: doubly intractable distributions Summary References
Chib, S. (1995) Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90.
Friel, N. and Pettitt, A. N. (2008) Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society, Series B, 70.
Newton, M. A. and Raftery, A. E. (1994) Approximate Bayesian inference with the weighted likelihood bootstrap (with discussion). Journal of the Royal Statistical Society, Series B, 56.
Neal, R. (2001) Annealed importance sampling. Statistics and Computing, 11.
Murray, I., Ghahramani, Z. and MacKay, D. (2006) MCMC for doubly-intractable distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence.
Caimo, A. and Friel, N. (2011) Bayesian inference for the exponential random graph model. Social Networks, 33.
Skilling, J. (2006) Nested sampling for general Bayesian computation. Bayesian Analysis, 1.
