Evaluating probabilities under high-dimensional latent variable models

Iain Murray and Ruslan Salakhutdinov
Department of Computer Science, University of Toronto, Toronto, ON. M5S 3G4. Canada.
{murray,rsalakhu}@cs.toronto.edu

Abstract

We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost.

1 Introduction

Latent variable models capture underlying structure in data by explaining observations as part of a more complex, partially observed system. A large number of probabilistic latent variable models have been developed, most of which express a joint distribution $P(v, h)$ over observed quantities $v$ and their unobserved counterparts $h$. Although it is by no means the only way to evaluate a model, a natural question to ask is: what probability $P(v)$ is assigned to a test observation? In some models the latent variables associated with a test input can be easily summed out: $P(v) = \sum_h P(v, h)$. As an example, standard mixture models have a single discrete mixture-component indicator for each data point; the joint probability $P(v, h)$ can be explicitly evaluated for each setting of the latent variable.

More complex graphical models explain data through the combination of many latent variables. This provides richer representations, but poses greater computational challenges. In particular, marginalizing out many latent variables can require complex integrals or exponentially large sums. One popular latent variable model, the Restricted Boltzmann Machine (RBM), is unusual in that the posterior over hiddens $P(h \mid v)$ is fully factored, which allows efficient evaluation of $P(v)$ up to a constant. Almost all other latent variable models have posterior dependencies amongst latent variables, even if they are independent a priori.

Our current work is motivated by recent work on evaluating RBMs and their generalization to Deep Belief Networks (DBNs) [1]. For both types of models, a single constant was accurately approximated so that $P(v, h)$ could be evaluated point-wise. For RBMs, the remaining sum over hidden variables was performed analytically. For DBNs, test probabilities were lower-bounded through a variational technique. Perhaps surprisingly, the bound was unable to reveal any significant improvement over RBMs in an experiment on MNIST digits. It was unclear whether this was due to looseness of the bound, or to there being no difference in performance. A more accurate method for summing over latent variables would enable better and broader evaluation of DBNs. In section 2 we consider existing Monte Carlo methods. Some of them are certainly
more accurate, but prohibitively expensive for evaluating large test sets. We then develop a new cheap Monte Carlo procedure for evaluating latent variable models in section 3. Like the variational method used previously, our method is unlikely to spuriously overstate test-set performance. Our presentation is for general latent variable models; however, for a running example we use DBNs (see section 4 and [2]). The benefits of our new approach are demonstrated in section 5.

2 Probability of observations as a normalizing constant

The probability of a data vector, $P(v)$, is the normalizing constant relating the posterior over hidden variables to the joint distribution in Bayes rule, $P(h \mid v) = P(h, v)/P(v)$. A large literature on computing normalizing constants exists in physics, statistics and computer science. In principle, there are many methods that could be applied to evaluating the probability assigned to data by a latent variable model. We review a subset of these methods, with notation and intuitions that will help motivate and explain our new algorithm.

In what follows, all auxiliary distributions $Q$ and transition operators $T$ are conditioned on the current test case $v$; this is not shown in the notation to reduce clutter. Further, all of these methods assume that we can evaluate $P(h, v)$. Graphical models with undirected connections will require the separate estimation of a single constant, as in [1].

2.1 Importance sampling

Importance sampling can in principle find the normalizing constant of any distribution. The algorithm involves averaging a simple ratio under samples from some convenient tractable distribution over the hidden variables, $Q(h)$. Provided $Q(h) > 0$ whenever $P(h, v) > 0$, we obtain:

$$P(v) = \sum_h \frac{P(h, v)}{Q(h)}\, Q(h) \approx \frac{1}{S} \sum_{s=1}^{S} \frac{P\big(h^{(s)}, v\big)}{Q\big(h^{(s)}\big)}, \qquad h^{(s)} \sim Q(h). \tag{1}$$

Importance sampling relies on the sampling distribution $Q(h)$ being similar to the target distribution $P(h \mid v)$. Specifically, the variance of the estimator is an α-divergence between the distributions [3]. Finding a tractable $Q(h)$ with small divergence is difficult in high-dimensional problems.

2.2 The harmonic mean method

Using $Q(h) = P(h \mid v)$ in (1) gives an estimator that requires knowing $P(v)$. As an alternative, the harmonic mean method, also called the reciprocal method, gives an unbiased estimate of $1/P(v)$:

$$\frac{1}{P(v)} = \sum_h \frac{P(h)}{P(v)} = \sum_h P(h \mid v)\, \frac{1}{P(v \mid h)} \approx \frac{1}{S} \sum_{s=1}^{S} \frac{1}{P\big(v \mid h^{(s)}\big)}, \qquad h^{(s)} \sim P(h \mid v). \tag{2}$$

In practice correlated samples from MCMC are used; then the estimator is asymptotically unbiased. It was clear from the original paper and its discussion that the harmonic mean estimator can behave very poorly [4]. Samples in the tails of the posterior have large weights, which makes it easy to construct distributions where the estimator has infinite variance. A finite set of samples will rarely include any extremely large weights, so the estimator's empirical variance can be misleadingly low. In many problems, the estimate of $1/P(v)$ will be an underestimate with high probability. That is, the method will overestimate $P(v)$ and often give no indication that it has done so.

Sometimes the estimator will have manageable variance. Also, more expensive versions of the estimator exist with lower variance. However, it is still prone to overestimate test probabilities: if $1/\hat{P}_{\mathrm{HME}}(v)$ is the harmonic mean estimator in (2), Jensen's inequality gives $P(v) = 1 \big/\, \mathbb{E}\big[1/\hat{P}_{\mathrm{HME}}(v)\big] \le \mathbb{E}\big[\hat{P}_{\mathrm{HME}}(v)\big]$. Similarly, $\log P(v)$ will be overestimated in expectation. Hence the average of a large number of test log probabilities is highly likely to be an overestimate.
Despite these problems the estimator has received significant attention in statistics, and has been used for evaluating latent variable models in recent machine learning literature [5, 6]. This is understandable: all of the existing, more accurate methods are harder to implement and take considerably longer to run. In this paper we propose a method that is nearly as easy to use as the harmonic mean method, but with better properties.
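To make these failure modes concrete, the following toy comparison (ours, not from the paper; the model size and distributions are arbitrary illustrative choices) runs the importance sampler of (1) and the harmonic mean estimator of (2) on a small discrete model where the exact $P(v)$ is available by enumeration. Jensen's inequality guarantees the direction of each bias: the averaged log importance sampling estimate falls below the truth, while the averaged log harmonic mean estimate falls above it, even though the harmonic mean is given exact posterior samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny latent variable model: K discrete hidden states, one fixed observation v.
K = 50
prior = rng.dirichlet(np.ones(K))            # P(h)
lik = rng.gamma(0.1, 1.0, size=K)            # P(v|h), varying over orders of magnitude
true_pv = np.dot(prior, lik)                 # exact P(v) = sum_h P(h) P(v|h)
posterior = prior * lik / true_pv            # P(h|v)

S, runs = 100, 2000
log_is, log_hme = [], []
for _ in range(runs):
    # Importance sampling, eq. (1), with a uniform Q(h) = 1/K:
    hs = rng.integers(K, size=S)
    w = prior[hs] * lik[hs] * K              # P(h, v) / Q(h)
    log_is.append(np.log(np.mean(w)))
    # Harmonic mean, eq. (2), given exact posterior samples (its best case):
    hs = rng.choice(K, size=S, p=posterior)
    log_hme.append(-np.log(np.mean(1.0 / lik[hs])))

print("true log P(v):       %.3f" % np.log(true_pv))
print("importance sampling: %.3f (biased down in log)" % np.mean(log_is))
print("harmonic mean:       %.3f (biased up in log)" % np.mean(log_hme))
```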
2.3 Importance sampling based on Markov chains

Paradoxically, introducing auxiliary variables and making a distribution much higher-dimensional than it was before can help find an approximating Q distribution that closely matches the target distribution. As an example we give a partial review of Annealed Importance Sampling (AIS) [7], a special case of a larger family of Sequential Monte Carlo (SMC) methods (see, e.g., [8]). Some of this theory will be needed in the new method we present in section 3.

Annealing algorithms start with a sample from some tractable distribution $P_1$. Steps are taken with a series of operators $T_2, T_3, \ldots, T_S$, whose stationary distributions, $P_s$, are cooled towards the distribution of interest. The probability over the resulting sequence $\mathcal{H} = \{h^{(1)}, h^{(2)}, \ldots, h^{(S)}\}$ is:

$$Q_{\mathrm{AIS}}(\mathcal{H}) = P_1\big(h^{(1)}\big) \prod_{s=2}^{S} T_s\big(h^{(s)} \leftarrow h^{(s-1)}\big). \tag{3}$$

To compute importance weights, we need to define a target distribution on the same state-space:

$$P_{\mathrm{AIS}}(\mathcal{H}) = P\big(h^{(S)} \mid v\big) \prod_{s=2}^{S} \widetilde{T}_s\big(h^{(s-1)} \leftarrow h^{(s)}\big). \tag{4}$$

Because $h^{(S)}$ has marginal $P(h \mid v) = P(h, v)/P(v)$, $P_{\mathrm{AIS}}(\mathcal{H})$ has our target, $P(v)$, as its normalizing constant. The operators $\widetilde{T}$ are the reverse operators of those used to define $Q_{\mathrm{AIS}}$. For any transition operator $T$ that leaves a distribution $P(h \mid v)$ stationary, there is a unique corresponding reverse operator, which is defined for any point $h'$ in the support of $P$:

$$\widetilde{T}(h \leftarrow h') = \frac{T(h' \leftarrow h)\, P(h \mid v)}{\sum_h T(h' \leftarrow h)\, P(h \mid v)} = \frac{T(h' \leftarrow h)\, P(h \mid v)}{P(h' \mid v)}. \tag{5}$$

The sum in the denominator is known because $T$ leaves the posterior stationary. Operators that are their own reverse operator are said to satisfy detailed balance and are also known as reversible. Many transition operators used in practice, such as Metropolis-Hastings, are reversible. Non-reversible operators are usually composed from a sequence of reversible operations, such as the component updates in a Gibbs sampler. The reverse of these (so-called) non-reversible operators is constructed from the same reversible base operations, but applied in reverse order.

The definitions above allow us to write:

$$Q_{\mathrm{AIS}}(\mathcal{H}) = P_{\mathrm{AIS}}(\mathcal{H})\, \frac{Q_{\mathrm{AIS}}(\mathcal{H})}{P_{\mathrm{AIS}}(\mathcal{H})} = P_{\mathrm{AIS}}(\mathcal{H})\, P(v)\, \frac{P_1\big(h^{(1)}\big)}{P\big(h^{(S)}, v\big)} \prod_{s=2}^{S} \frac{T_s\big(h^{(s)} \leftarrow h^{(s-1)}\big)}{\widetilde{T}_s\big(h^{(s-1)} \leftarrow h^{(s)}\big)} = P_{\mathrm{AIS}}(\mathcal{H})\, P(v)\, \frac{P_1\big(h^{(1)}\big)}{P\big(h^{(S)}, v\big)} \prod_{s=2}^{S} \frac{P^*_s\big(h^{(s)}\big)}{P^*_s\big(h^{(s-1)}\big)} \equiv P_{\mathrm{AIS}}(\mathcal{H})\, \frac{P(v)}{w(\mathcal{H})}. \tag{6}$$

We can usually evaluate the $P^*_s$, which are unnormalized versions of the stationary distributions of the Markov chain operators. Therefore the AIS importance weight $w(\mathcal{H}) = 1/[\,\cdots]$ is tractable as long as we can evaluate $P(h, v)$. The AIS importance weight provides an unbiased estimate:

$$\mathbb{E}_{Q_{\mathrm{AIS}}(\mathcal{H})}\big[w(\mathcal{H})\big] = P(v) \sum_{\mathcal{H}} P_{\mathrm{AIS}}(\mathcal{H}) = P(v). \tag{7}$$

As with standard importance sampling, the variance of the estimator depends on a divergence between $P_{\mathrm{AIS}}$ and $Q_{\mathrm{AIS}}$. This can be made small, at large computational expense, by using hundreds or thousands of steps, allowing the neighboring intermediate distributions $P_s(h)$ to be close.
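The following sketch (ours; the geometric path $P^*_s(h) \propto P(h)\,P(v \mid h)^{\beta_s}$ and the uniform-proposal Metropolis operator are illustrative choices, not details taken from the paper) runs AIS on the same style of toy model to estimate $P(v)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete model as before: K hidden states, one fixed observation v.
K = 50
prior = rng.dirichlet(np.ones(K))          # P(h); also the tractable P_1
log_lik = np.log(rng.gamma(0.1, 1.0, K))   # log P(v|h)
true_log_pv = np.log(np.dot(prior, np.exp(log_lik)))

def ais_run(betas, rng):
    """One AIS run with P_s*(h) = P(h) P(v|h)^beta_s and a uniform-proposal
    Metropolis operator that leaves each P_s stationary."""
    h = rng.choice(K, p=prior)             # sample from P_1 (beta = 0)
    log_w = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += (b - b_prev) * log_lik[h]     # log P_s*(h) - log P_{s-1}*(h)
        h_prop = int(rng.integers(K))          # Metropolis step targeting P_s
        log_a = (np.log(prior[h_prop]) + b * log_lik[h_prop]
                 - np.log(prior[h]) - b * log_lik[h])
        if np.log(rng.random()) < log_a:
            h = h_prop
    return log_w

betas = np.linspace(0.0, 1.0, 200)
log_ws = np.array([ais_run(betas, rng) for _ in range(100)])
log_pv = np.logaddexp.reduce(log_ws) - np.log(len(log_ws))  # log of mean weight
print("AIS estimate: %.3f   truth: %.3f" % (log_pv, true_log_pv))
```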
2.4 Chib-style estimators

Bayes rule implies that, for any special hidden state $h^\star$,

$$P(v) = P(h^\star, v)\, \big/\, P(h^\star \mid v). \tag{8}$$

This trivial identity suggests a family of estimators introduced by Chib [9]. First, we choose a particular hidden state $h^\star$, usually one with high posterior probability, and then estimate $P(h^\star \mid v)$. We would like to obtain an estimator that is based on a sequence of states $\mathcal{H} = \{h^{(1)}, h^{(2)}, \ldots, h^{(S)}\}$ generated by a Markov chain that explores the posterior distribution $P(h \mid v)$. The most naive estimate of $P(h^\star \mid v)$ is the fraction of states in $\mathcal{H}$ that are equal to the special state, $\frac{1}{S}\sum_s \mathbb{I}\big(h^{(s)} = h^\star\big)$.

Obviously this estimator is impractical: it equals zero with high probability when applied to high-dimensional problems. A Rao-Blackwellized version of this estimator, $\hat{p}(\mathcal{H})$, replaces the indicator function with the probability of transitioning from $h^{(s)}$ to the special state under a Markov chain transition operator that leaves the posterior stationary. This can be derived directly from the operator's stationarity condition:

$$P(h^\star \mid v) = \sum_h T(h^\star \leftarrow h)\, P(h \mid v) \approx \hat{p}(\mathcal{H}) = \frac{1}{S} \sum_{s=1}^{S} T\big(h^\star \leftarrow h^{(s)}\big), \qquad \{h^{(s)}\} \sim \mathcal{P}(\mathcal{H}), \tag{9}$$

where $\mathcal{P}(\mathcal{H})$ is the joint distribution arising from $S$ steps of a Markov chain. If the chain has stationary distribution $P(h \mid v)$ and could be initialized at equilibrium, so that

$$\mathcal{P}(\mathcal{H}) = P\big(h^{(1)} \mid v\big) \prod_{s=2}^{S} T\big(h^{(s)} \leftarrow h^{(s-1)}\big), \tag{10}$$

then $\hat{p}(\mathcal{H})$ would be an unbiased estimate of $P(h^\star \mid v)$. For ergodic chains the stationary distribution is achieved asymptotically, and the estimator is consistent regardless of how it is initialized.

If $T$ is a Gibbs sampling transition operator, the only way of moving from $h$ to $h^\star$ is to draw each element of $h^\star$ in turn. If updates are made in index order from 1 to M, the move has probability:

$$T(h^\star \leftarrow h) = \prod_{j=1}^{M} P\big(h^\star_j \,\big|\, h^\star_{1:(j-1)},\, h_{(j+1):M},\, v\big). \tag{11}$$

Equations (9, 11) have been used in schemes for monitoring the convergence of Gibbs samplers [10]. It is worth emphasizing that we have only outlined the simplest possible scheme inspired by Chib's general approach. For some Markov chains there are technical problems with the above construction, which require an extension explained in the appendix. Moreover, the approach above is not what Chib recommended. In fact, [11] explicitly favors a more elaborate procedure involving sampling from a sequence of distributions. This opens up the possibility of many sophisticated developments, e.g. [12, 13]. However, our focus in this work is on obtaining more useful results from simple cheap methods. There are also well-known problems with the Chib approach [14], to which we will return.

3 A new estimator for evaluating latent-variable models

We start with the simplest Chib-inspired estimator based on equations (8, 9, 11). Like many Markov chain Monte Carlo algorithms, (9) provides only (asymptotic) unbiasedness. For our purposes this is not sufficient. Jensen's inequality tells us:

$$P(v) = \frac{P(h^\star, v)}{P(h^\star \mid v)} = \frac{P(h^\star, v)}{\mathbb{E}[\hat{p}(\mathcal{H})]} \le \mathbb{E}\bigg[\frac{P(h^\star, v)}{\hat{p}(\mathcal{H})}\bigg]. \tag{12}$$

That is, we will overestimate the probability of a visible vector in expectation. Jensen's inequality also says that we will overestimate $\log P(v)$ in expectation. Ideally we would like an accurate estimate of $\log P(v)$. However, if we must suffer some bias, then a lower bound that does not overstate performance will usually be preferred.

An underestimate of $P(v)$ would result from overestimating $P(h^\star \mid v)$. The probability of the special state will often be overestimated in practice if we initialize our Markov chain at $h^\star$. There are, however, simple counter-examples where this does not happen. Instead we describe a construction based on a sequence of Markov steps starting at $h^\star$ that does have the desired effect. We draw a state sequence from the following carefully designed distribution, using the algorithm in figure 1:

$$Q(\mathcal{H}) = \frac{1}{S} \sum_{s=1}^{S} \widetilde{T}\big(h^{(s)} \leftarrow h^\star\big) \prod_{s'=s+1}^{S} T\big(h^{(s')} \leftarrow h^{(s'-1)}\big) \prod_{s'=1}^{s-1} \widetilde{T}\big(h^{(s')} \leftarrow h^{(s'+1)}\big). \tag{13}$$

If the initial state $h^{(s)}$ were drawn from $P(h \mid v)$ instead of $\widetilde{T}\big(h^{(s)} \leftarrow h^\star\big)$, then the algorithm would give a sample from an equilibrium sequence with distribution $\mathcal{P}(\mathcal{H})$ defined in (10). This can be checked by repeated substitution of (5). This allows us to express $Q$ in terms of $\mathcal{P}$, as we did for AIS:

$$Q(\mathcal{H}) = \frac{1}{S} \sum_{s=1}^{S} \frac{\widetilde{T}\big(h^{(s)} \leftarrow h^\star\big)}{P\big(h^{(s)} \mid v\big)}\, \mathcal{P}(\mathcal{H}) = \frac{1}{P(h^\star \mid v)} \bigg[\frac{1}{S} \sum_{s=1}^{S} T\big(h^\star \leftarrow h^{(s)}\big)\bigg] \mathcal{P}(\mathcal{H}). \tag{14}$$
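Step 7 of the algorithm in figure 1 below needs $T(h^\star \leftarrow h)$ for the chosen transition operator. For a Gibbs sweep over binary latents this is equation (11); a minimal sketch, where `cond_prob(j, h, v)` is an assumed model-specific callback returning $P(h_j = 1 \mid h_{-j}, v)$, not something defined in the paper:

```python
import numpy as np

def log_gibbs_transition_prob(h_star, h, cond_prob, v):
    """log T(h* <- h) for a Gibbs sweep in index order, eq. (11).

    cond_prob(j, h, v) must return P(h_j = 1 | h_{-j}, v); it is an
    assumed model-specific callback."""
    h_cur = h.copy()
    total = 0.0
    for j in range(len(h_star)):
        p1 = cond_prob(j, h_cur, v)
        p = p1 if h_star[j] == 1 else 1.0 - p1
        total += np.log(p)
        h_cur[j] = h_star[j]   # condition on the components already moved to h*
    return total
```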
Inputs: $v$, observed test vector; $h^\star$, a (preferably high posterior probability) hidden state; $S$, number of Markov chain steps; $T$, a Markov chain operator that leaves $P(h \mid v)$ stationary.

1. Draw $s \sim \mathrm{Uniform}(\{1, \ldots, S\})$
2. Draw $h^{(s)} \sim \widetilde{T}\big(h^{(s)} \leftarrow h^\star\big)$
3. for $s' = (s+1) : S$
4. &nbsp;&nbsp;&nbsp;&nbsp;Draw $h^{(s')} \sim T\big(h^{(s')} \leftarrow h^{(s'-1)}\big)$
5. for $s' = (s-1) : -1 : 1$
6. &nbsp;&nbsp;&nbsp;&nbsp;Draw $h^{(s')} \sim \widetilde{T}\big(h^{(s')} \leftarrow h^{(s'+1)}\big)$
7. $P(v) \approx P(v, h^\star) \Big/ \Big[\frac{1}{S} \sum_{s'=1}^{S} T\big(h^\star \leftarrow h^{(s')}\big)\Big]$

Figure 1: Algorithm for the proposed method. The graphical model shows $Q(\mathcal{H} \mid s = 3)$ for $S = 4$. At each generated state, $T\big(h^\star \leftarrow h^{(s')}\big)$ is evaluated (step 7), roughly doubling the cost of sampling.

The reverse operator, $\widetilde{T}$, was defined in section 2.3. The quantity in square brackets is the estimator for $P(h^\star \mid v)$ given in (9). The expectation of the reciprocal of this quantity under draws from $Q(\mathcal{H})$ is exactly the quantity needed to compute $P(v)$:

$$\mathbb{E}_{Q(\mathcal{H})}\Bigg[ 1 \bigg/ \bigg[\frac{1}{S} \sum_{s=1}^{S} T\big(h^\star \leftarrow h^{(s)}\big)\bigg] \Bigg] = \sum_{\mathcal{H}} \frac{\mathcal{P}(\mathcal{H})}{P(h^\star \mid v)} = \frac{1}{P(h^\star \mid v)}. \tag{15}$$

Although we are using the simple estimator from (9), by drawing $\mathcal{H}$ from a carefully constructed Markov chain procedure, the estimator is now unbiased in $P(v)$. This is not an asymptotic result. As long as no division by zero has occurred in the above equations, the estimator is unbiased in $P(v)$ for finite runs of the Markov chain. Jensen's inequality implies that $\log P(v)$ is underestimated in expectation.

Neal noted that Chib's method will return incorrect answers in cases where the Markov chain does not mix well amongst modes [14]. Our new proposed method will suffer from the same problem. Even if no transition probabilities are exactly zero, unbiasedness does not exclude being on a particular side of the correct answer with very high probability. Poor mixing may cause $P(h^\star \mid v)$ to be overestimated with high probability, which would result in an underestimate of $P(v)$, i.e., an overly conservative estimate of test performance.

The variance of the estimator is generally unknown, as it depends on the (generally unavailable) auto-covariance structure of the Markov chain. We can note one positive property: for the ideal Markov chain operator that mixes in one step, the estimator has zero variance and gives the correct answer immediately. Although this extreme will not actually occur, it does indicate that on easy problems, good answers can be returned more quickly than by AIS.
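The algorithm of figure 1 is short to implement. Below is a sketch of one run (ours, under an assumed callback interface); for a Gibbs sweep, the reverse operator $\widetilde{T}$ is the same sweep with the component updates applied in reverse index order, and `log_T` can wrap the `log_gibbs_transition_prob` sketch above. The estimate is unbiased in $P(v)$; its logarithm, returned here for numerical stability, underestimates $\log P(v)$ in expectation.

```python
import numpy as np

def proposed_estimate(v, h_star, S, forward_step, reverse_step, log_T,
                      log_joint, rng):
    """One run of the figure 1 algorithm; returns the log of an estimate
    that is unbiased in P(v).

    forward_step(h, v)     -- one Gibbs sweep in index order (operator T)
    reverse_step(h, v)     -- the same sweep in reverse index order (T~)
    log_T(h_to, h_from, v) -- log T(h_to <- h_from), e.g. equation (11)
    log_joint(v, h)        -- log P(v, h)
    All four callbacks are assumptions about the model's interface."""
    s = int(rng.integers(1, S + 1))          # step 1: s ~ Uniform{1,...,S}
    states = {s: reverse_step(h_star, v)}    # step 2: h^(s) ~ T~(. <- h*)
    for t in range(s + 1, S + 1):            # steps 3-4: forwards with T
        states[t] = forward_step(states[t - 1], v)
    for t in range(s - 1, 0, -1):            # steps 5-6: backwards with T~
        states[t] = reverse_step(states[t + 1], v)
    # step 7: P(v) ~= P(v, h*) / [(1/S) sum_s' T(h* <- h^(s'))]
    log_ps = [log_T(h_star, states[t], v) for t in range(1, S + 1)]
    log_p_hat = np.logaddexp.reduce(log_ps) - np.log(S)
    return log_joint(v, h_star) - log_p_hat
```

In the DBN experiments of section 5, `log_joint` would correspond to $\log P^*(v, h^1) - \log Z$, with $Z$ estimated once by AIS.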
4 Deep Belief Networks

In this section we provide a brief overview of Deep Belief Networks (DBNs), recently introduced by [2]. DBNs are probabilistic generative models that can contain many layers of hidden variables. Each layer captures strong high-order correlations between the activities of hidden features in the layer below. The top two layers of the DBN form a Restricted Boltzmann Machine (RBM), which is an undirected graphical model, but the lower layers form a directed generative model. The original paper introduced a greedy, layer-by-layer unsupervised learning algorithm that consists of learning a stack of RBMs one layer at a time.

Consider a DBN model with two layers of hidden features. The model's joint distribution is:

$$P\big(v, h^1, h^2\big) = P\big(v \mid h^1\big)\, P\big(h^1, h^2\big), \tag{16}$$

where $P(v \mid h^1)$ represents a sigmoid belief network, and $P(h^1, h^2)$ is the joint distribution defined by the second-layer RBM. By explicitly summing out $h^2$, we can easily evaluate an unnormalized probability $P^*(v, h^1) = Z\,P(v, h^1)$. Using an approximating factorial posterior distribution $Q(h^1 \mid v)$, obtained as a byproduct of the greedy learning procedure, and an AIS estimate of the model's partition function $Z$, [1] proposed obtaining an estimate of a variational lower bound:

$$\log P(v) \ge \sum_{h^1} Q\big(h^1 \mid v\big) \log P^*\big(v, h^1\big) - \log Z + \mathcal{H}\big(Q(h^1 \mid v)\big). \tag{17}$$

The entropy term $\mathcal{H}(\cdot)$ can be computed analytically, since $Q$ is factorial, and the expectation term was estimated by a simple Monte Carlo approximation:

$$\sum_{h^1} Q\big(h^1 \mid v\big) \log P^*\big(v, h^1\big) \approx \frac{1}{S} \sum_{s=1}^{S} \log P^*\big(v, h^{1(s)}\big), \qquad h^{1(s)} \sim Q\big(h^1 \mid v\big). \tag{18}$$

Instead of the variational approach, we could also adopt AIS to estimate $\sum_{h^1} P^*(v, h^1)$. This would be computationally very expensive, since we would need to run AIS for each test case. In the next section we show that variational lower bounds can be quite loose, yet running AIS on an entire test set, containing many thousands of test cases, is computationally too demanding. Our proposed estimator requires the same single AIS estimate of $Z$ as the variational method, so that we can evaluate $P(v, h^1)$. It then provides better estimates of $\log P(v)$ by approximately summing over $h^1$ for each test case in a reasonable amount of computer time.

5 Experimental Results

We present experimental results on two datasets: the MNIST digits and a dataset of image patches extracted from images of natural scenes taken from the collection of Van Hateren (http://hlab.phys.rug.nl/imlib/). The MNIST dataset contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28×28 pixels. The image dataset consisted of 30,000 training and 20,000 test 20×20 patches. The raw image intensities were preprocessed and whitened as described in [15]. Gibbs sampling was used as a Markov chain transition operator throughout. All log probabilities quoted use natural logarithms, giving values in nats.

[Figure 2: AIS, our proposed estimator and a variational method were used to sum over the hidden states $h^1$ for each of 50 randomly sampled test cases to estimate their average log probability. Left panel: MNIST digits (estimated test log probability, roughly −87 to −85); right panel: image patches (roughly −585 to −565). Both panels plot estimates against the number of Markov chain steps (5 to 40), with curves for the AIS estimator, our proposed estimator, and the estimate of the variational lower bound. The three methods shared the same AIS estimate of a single global normalization constant $Z$.]

5.1 MNIST digits

In our first experiment we used a deep belief network (DBN) taken from [1]. The network had two hidden layers with 500 and 2000 hidden units, and was greedily trained by learning a stack of two RBMs one layer at a time. Each RBM was trained using the Contrastive Divergence (CD) learning rule. The estimate of the lower bound on the average test log probability, using (17), was −86.22.

To estimate how loose the variational bound is, we randomly sampled 50 test cases, 5 of each class, and ran AIS for each test case to estimate the true test log probability. Computationally, this is equivalent to estimating 50 additional partition functions. Figure 2, left panel, shows the results. The estimate of the variational bound was −87.05 per test case, whereas the estimate of the true test log probability using AIS was −85.20. Our proposed estimator, averaged over 10 runs, provided an answer of −85.22. The special state $h^\star$ for each test example $v$ was obtained by first sampling from the approximating distribution $Q(h^1 \mid v)$, and then performing deterministic hill-climbing in $\log P(v, h^1)$ to get to a local mode.
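The hill-climbing step is a simple greedy search. A minimal sketch (ours, assuming a binary vector $h^1$ and the same `log_joint` callback as before; the paper does not spell out this subroutine):

```python
def hill_climb(h, v, log_joint):
    """Deterministic hill-climbing in log P(v, h) over binary h: sweep the
    components, keeping any single-bit flip that increases the log joint,
    until a full sweep makes no improvement. A sketch of how h* could be
    chosen, not the authors' exact procedure."""
    h = h.copy()
    best = log_joint(v, h)
    improved = True
    while improved:
        improved = False
        for j in range(len(h)):
            h[j] = 1 - h[j]                # tentatively flip bit j
            cur = log_joint(v, h)
            if cur > best:
                best, improved = cur, True # keep the flip
            else:
                h[j] = 1 - h[j]            # revert
    return h
```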
AIS used a hand-tuned temperature schedule designed to equalize the variance of the intermediate log weights [7]. We needed 10,000 intermediate distributions to get stable results, which took about 3.6 days on a Pentium Xeon 3.00GHz machine, whereas for our proposed estimator we only used S = 40, which took about 50 minutes. For a more direct comparison we tried giving AIS 50 minutes, which allows 100 temperatures. This run gave an estimate of −89.59, which is lower than the lower bound and tells us nothing. Giving AIS ten times more time, 1000 temperatures, gave −86.05. This is higher than the lower bound, but still worse than our estimator at S = 40, or even S = 5. Finally, using our proposed estimator, the average test log probability on the entire MNIST test data was −84.55. The difference of about 2 nats shows that the variational bound in [1] was rather tight, although a very small improvement of the DBN over the RBM is now revealed.

5.2 Image Patches

In our second experiment we trained a two-layer DBN model on the image patches of natural scenes. The first-layer RBM had 2000 hidden units and 400 Gaussian visible units. The second layer represented a semi-restricted Boltzmann machine (SRBM) with 500 hidden and 2000 visible units. The SRBM contained visible-to-visible connections, and was trained using Contrastive Divergence together with mean-field. Details of training can be found in [15]. The overall DBN model can be viewed as a directed hierarchy of Markov random fields with hidden-to-hidden connections.

To estimate the model's partition function, we used AIS with 15,000 intermediate distributions and 100 annealing runs. The estimated lower bound on the average test log probability (see Eq. 17), using a factorial approximate posterior distribution $Q(h^1 \mid v)$, which we also get as a byproduct of the greedy learning algorithm, was −583.73. The estimate of the true test log probability, using our proposed estimator, was −563.39. In contrast to the model trained on MNIST, the difference of over 20 nats shows that, for model comparison purposes, the variational lower bound is quite loose. For comparison, we also trained square ICA and a mixture of factor analyzers (MFA) using code from [16, 17]. Square ICA achieves a test log probability of −551.4, and MFA with 50 mixture components and a 30-dimensional latent space achieves −502.30, clearly outperforming DBNs.

6 Discussion

Our new Monte Carlo procedure is formally unbiased in estimating $P(v)$. In practice it is likely to underestimate the (log-)probability of a test set. Although the algorithm involves Markov chains, importance sampling underlies the estimator. Therefore the methods discussed in [18] could be used to bound the probability of accidentally over-estimating a test set probability.

In principle our procedure is a general technique for estimating normalizing constants. It would not always be appropriate, however, as it would suffer the problems outlined in [14]. As an example, our method will not succeed in estimating the global normalizing constant of an RBM. For our method to work well, a state drawn from $\widetilde{T}\big(h^{(s)} \leftarrow h^\star\big)$ should look like it could be part of an equilibrium sequence $\mathcal{H} \sim \mathcal{P}(\mathcal{H})$.

The details of the algorithm arose by developing existing Monte Carlo estimators, but the starting state $h^{(s)}$ could be drawn from any arbitrary distribution $q$:

$$Q_{\mathrm{var}}(\mathcal{H}) = \frac{1}{S} \sum_{s=1}^{S} \frac{q\big(h^{(s)}\big)}{P\big(h^{(s)} \mid v\big)}\, \mathcal{P}(\mathcal{H}) = P(v) \bigg[\frac{1}{S} \sum_{s=1}^{S} \frac{q\big(h^{(s)}\big)}{P\big(h^{(s)}, v\big)}\bigg] \mathcal{P}(\mathcal{H}). \tag{19}$$

As before, the reciprocal of the quantity in square brackets would give an estimate of $P(v)$. If an approximation $q(h)$ is available that captures more mass than $\widetilde{T}(h \leftarrow h^\star)$, this generalized estimator could perform better.
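Our reading of (19) is that only the seeding step of figure 1 changes: draw $h^{(s)}$ from $q$, and average $q\big(h^{(s')}\big)/P\big(h^{(s')}, v\big)$ over the chain instead of the transition probabilities back to $h^\star$. A hedged sketch under the same assumed callback interface as before:

```python
import numpy as np

def generalized_estimate(v, S, q_sample, log_q, forward_step, reverse_step,
                         log_joint, rng):
    """Variant of figure 1 suggested by eq. (19): the chain is seeded from
    an arbitrary tractable q(h) rather than from T~(. <- h*). Returns the
    log of an estimate of P(v); callbacks are assumed interfaces."""
    s = int(rng.integers(1, S + 1))
    states = {s: q_sample(rng)}                  # h^(s) ~ q(h)
    for t in range(s + 1, S + 1):                # forwards with T
        states[t] = forward_step(states[t - 1], v)
    for t in range(s - 1, 0, -1):                # backwards with T~
        states[t] = reverse_step(states[t + 1], v)
    # log of the bracket in (19): (1/S) sum_s' q(h^(s')) / P(h^(s'), v)
    terms = [log_q(states[t]) - log_joint(v, states[t])
             for t in range(1, S + 1)]
    log_bracket = np.logaddexp.reduce(terms) - np.log(S)
    return -log_bracket
```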
We are hopeful that our method will be a natural next step in a variety of situations where improvements are sought over a deterministic approximation.

Acknowledgments

This research was supported by NSERC and CFI. Iain Murray was supported by the government of Canada. We thank Geoffrey Hinton and Radford Neal for useful discussions, Simon Osindero for providing preprocessed image patches of natural scenes, and the reviewers for useful comments.
References

[1] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of Deep Belief Networks. In Proceedings of the International Conference on Machine Learning, volume 25, pages 872–879, 2008.
[2] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[3] Tom Minka. Divergence measures and message passing. TR-2005-173, Microsoft Research, 2005.
[4] Michael A. Newton and Adrian E. Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B (Methodological), 56(1):3–48, 1994.
[5] Thomas L. Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems (NIPS*17). MIT Press, 2005.
[6] Hanna M. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984. ACM Press, New York, NY, USA, 2006.
[7] Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
[8] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B, 68(3):411–436, 2006.
[9] Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, December 1995.
[10] Christian Ritter and Martin A. Tanner. Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs sampler. Journal of the American Statistical Association, 87(419):861–868, 1992.
[11] Siddhartha Chib and Ivan Jeliazkov. Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96(453):270–281, 2001.
[12] Antonietta Mira and Geoff Nicholls. Bridge estimation of the probability density at a point. Statistica Sinica, 14:603–612, 2004.
[13] Francesco Bartolucci, Luisa Scaccia, and Antonietta Mira. Efficient Bayes factor estimation from the reversible jump output. Biometrika, 93(1):41–52, 2006.
[14] Radford M. Neal. Erroneous results in "Marginal likelihood from the Gibbs output", 1999. Available from http://www.cs.toronto.edu/~radford/chib-letter.html.
[15] Simon Osindero and Geoffrey Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In Advances in Neural Information Processing Systems (NIPS*20). MIT Press, 2008.
[16] Aapo Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
[17] Zoubin Ghahramani and Geoffrey E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1997.
[18] Vibhav Gogate, Bozhena Bidyuk, and Rina Dechter. Studies in lower bounding probability of evidence using the Markov inequality. In 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 2007.

A Real-valued latents and Metropolis-Hastings

There are technical difficulties with the original Chib-style approach applied to Metropolis-Hastings and continuous latent variables. The continuous version of equation (9),

$$P(h^\star \mid v) = \int T(h^\star \leftarrow h)\, P(h \mid v)\, \mathrm{d}h \approx \frac{1}{S} \sum_{s=1}^{S} T\big(h^\star \leftarrow h^{(s)}\big), \qquad h^{(s)} \sim \mathcal{P}(\mathcal{H}), \tag{20}$$

doesn't work if $T$ is the Metropolis-Hastings operator. The Dirac delta function at $h = h^\star$ contains a significant part of the integral, which is ignored by samples from $P(h \mid v)$ with probability one. Following [11], the fix is to instead integrate over the generalized detailed balance relationship (5). Chib and Jeliazkov implicitly took out the $h = h^\star$ point from all of their integrals. We do the same:

$$P(h^\star \mid v) = \int T(h^\star \leftarrow h)\, P(h \mid v)\, \mathrm{d}h \,\Big/ \int T(h \leftarrow h^\star)\, \mathrm{d}h. \tag{21}$$

The numerator can be estimated as before.
As both integrals omit $h = h^\star$, the denominator is less than one when $T$ contains a delta function. For Metropolis-Hastings, $T(h \leftarrow h^\star) = q(h; h^\star)\, \min\big(1, a(h; h^\star)\big)$, where $a(h; h^\star)$ is an easy-to-compute acceptance ratio. Sampling from $q(h; h^\star)$ and averaging $\min\big(1, a(h; h^\star)\big)$ provides an estimate of the denominator.

In our importance sampling approach there is no need to separately approximate an additional quantity. The algorithm in figure 1 still applies if the $T$'s are interpreted as probability density functions. If, due to a rejection, $h^{(s)} = h^\star$ is drawn in step 2, then the sum in step 7 will contain an infinite term, giving a trivial underestimate $P(v) = 0$. (Steps 3–6 need not be performed in this case.) On repeated runs, the average estimate is still unbiased, or an underestimate for chains that can't mix. Alternatively, the variational approach (19) could be applied together with Metropolis-Hastings sampling.
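A sketch of the denominator estimate just described; `propose` and `accept_ratio` are assumed stand-ins for the model's $q(h; h^\star)$ and $a(h; h^\star)$:

```python
def mh_denominator(h_star, propose, accept_ratio, n_samples, rng):
    """Monte Carlo estimate of the denominator of eq. (21) for a
    Metropolis-Hastings operator: draw h ~ q(.; h*) and average
    min(1, a(h; h*)). Both callbacks are assumed model-specific."""
    total = 0.0
    for _ in range(n_samples):
        h = propose(h_star, rng)
        total += min(1.0, accept_ratio(h, h_star))
    return total / n_samples
```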