Evaluating probabilities under high-dimensional latent variable models


Iain Murray and Ruslan Salakhutdinov
Department of Computer Science, University of Toronto
Toronto, ON. M5S 3G4. Canada.

Abstract

We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost.

1 Introduction

Latent variable models capture underlying structure in data by explaining observations as part of a more complex, partially observed system. A large number of probabilistic latent variable models have been developed, most of which express a joint distribution $P(v, h)$ over observed quantities $v$ and their unobserved counterparts $h$. Although it is by no means the only way to evaluate a model, a natural question to ask is "what probability $P(v)$ is assigned to a test observation?".

In some models the latent variables associated with a test input can be easily summed out: $P(v) = \sum_h P(v, h)$. As an example, standard mixture models have a single discrete mixture component indicator for each data point; the joint probability $P(v, h)$ can be explicitly evaluated for each setting of the latent variable.

More complex graphical models explain data through the combination of many latent variables. This provides richer representations, but poses greater computational challenges. In particular, marginalizing out many latent variables can require complex integrals or exponentially large sums.
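The explicit sum available in a mixture model can be sketched in a few lines. The mixture of independent Bernoullis below, with its weights and means, is a made-up toy used only for illustration; the helper `num_joint_settings` is likewise hypothetical and simply counts the terms a brute-force sum would need with many binary latent variables.

```python
import itertools

# Hypothetical toy mixture of independent Bernoullis (all numbers are made up).
WEIGHTS = [0.3, 0.7]                          # P(h) for the two components
MEANS = [[0.9, 0.8, 0.1], [0.2, 0.3, 0.6]]    # P(v_i = 1 | h)


def joint(v, h):
    """P(v, h) = P(h) * prod_i P(v_i | h)."""
    prob = WEIGHTS[h]
    for v_i, mu in zip(v, MEANS[h]):
        prob *= mu if v_i == 1 else 1.0 - mu
    return prob


def marginal(v):
    """P(v) = sum_h P(v, h): one term per mixture component."""
    return sum(joint(v, h) for h in range(len(WEIGHTS)))


def num_joint_settings(num_binary_latents):
    """Number of terms in a brute-force sum over many binary latents."""
    return 2 ** num_binary_latents


# Sanity check: the marginal sums to one over all 2**3 visible vectors.
total = sum(marginal(v) for v in itertools.product([0, 1], repeat=3))
```

With a single discrete indicator the sum has as many terms as mixture components; with, say, 500 binary latent variables `num_joint_settings(500)` terms would be needed, which is why the rest of the paper turns to Monte Carlo.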
One popular latent variable model, the Restricted Boltzmann Machine (RBM), is unusual in that the posterior over hiddens $P(h|v)$ is fully factored, which allows efficient evaluation of $P(v)$ up to a constant. Almost all other latent variable models have posterior dependencies amongst latent variables, even if they are independent a priori.

Our current work is motivated by recent work on evaluating RBMs and their generalization to Deep Belief Networks (DBNs) [1]. For both types of models, a single constant was accurately approximated so that $P(v, h)$ could be evaluated point-wise. For RBMs, the remaining sum over hidden variables was performed analytically. For DBNs, test probabilities were lower-bounded through a variational technique. Perhaps surprisingly, the bound was unable to reveal any significant improvement over RBMs in an experiment on MNIST digits. It was unclear whether this was due to looseness of the bound, or to there being no difference in performance.

A more accurate method for summing over latent variables would enable better and broader evaluation of DBNs. In section 2 we consider existing Monte Carlo methods. Some of them are certainly more accurate, but prohibitively expensive for evaluating large test sets. We then develop a new cheap Monte Carlo procedure for evaluating latent variable models in section 3. Like the variational method used previously, our method is unlikely to spuriously overstate test-set performance. Our presentation is for general latent variable models; however, for a running example, we use DBNs (see section 4 and [2]). The benefits of our new approach are demonstrated in section 5.

2 Probability of observations as a normalizing constant

The probability of a data vector, $P(v)$, is the normalizing constant relating the posterior over hidden variables to the joint distribution in Bayes' rule, $P(h|v) = P(h, v)/P(v)$. A large literature on computing normalizing constants exists in physics, statistics and computer science. In principle, there are many methods that could be applied to evaluating the probability assigned to data by a latent variable model. We review a subset of these methods, with notation and intuitions that will help motivate and explain our new algorithm.

In what follows, all auxiliary distributions $Q$ and transition operators $T$ are conditioned on the current test case $v$; this is not shown in the notation to reduce clutter. Further, all of these methods assume that we can evaluate $P(h, v)$. Graphical models with undirected connections will require the separate estimation of a single constant as in [1].

2.1 Importance sampling

Importance sampling can in principle find the normalizing constant of any distribution. The algorithm involves averaging a simple ratio under samples from some convenient tractable distribution over the hidden variables, $Q(h)$. Provided $Q(h) \neq 0$ whenever $P(h, v) \neq 0$, we obtain:

$$P(v) = \sum_h \frac{P(h, v)}{Q(h)}\, Q(h) \approx \frac{1}{S} \sum_{s=1}^{S} \frac{P(h^{(s)}, v)}{Q(h^{(s)})}, \qquad h^{(s)} \sim Q(h^{(s)}). \tag{1}$$

Importance sampling relies on the sampling distribution $Q(h)$ being similar to the target distribution $P(h|v)$. Specifically, the variance of the estimator is an $\alpha$-divergence between the distributions [3].
Finding a tractable $Q(h)$ with small divergence is difficult in high-dimensional problems.

2.2 The harmonic mean method

Using $Q(h) = P(h|v)$ in (1) gives an estimator that requires knowing $P(v)$. As an alternative, the harmonic mean method, also called the reciprocal method, gives an unbiased estimate of $1/P(v)$:

$$\frac{1}{P(v)} = \sum_h \frac{P(h)}{P(v)} = \sum_h \frac{P(h|v)}{P(v|h)} \approx \frac{1}{S} \sum_{s=1}^{S} \frac{1}{P(v|h^{(s)})}, \qquad h^{(s)} \sim P(h^{(s)}|v). \tag{2}$$

In practice correlated samples from MCMC are used; then the estimator is asymptotically unbiased.

It was clear from the original paper and its discussion that the harmonic mean estimator can behave very poorly [4]. Samples in the tails of the posterior have large weights, which makes it easy to construct distributions where the estimator has infinite variance. A finite set of samples will rarely include any extremely large weights, so the estimator's empirical variance can be misleadingly low. In many problems, the estimate of $1/P(v)$ will be an underestimate with high probability. That is, the method will overestimate $P(v)$ and often give no indication that it has done so.

Sometimes the estimator will have manageable variance. Also, more expensive versions of the estimator exist with lower variance. However, it is still prone to overestimate test probabilities: if $1/\hat{P}_{\mathrm{HME}}(v)$ is the Harmonic Mean Estimator in (2), Jensen's inequality gives $P(v) = 1/\mathbb{E}\big[1/\hat{P}_{\mathrm{HME}}(v)\big] \le \mathbb{E}\big[\hat{P}_{\mathrm{HME}}(v)\big]$. Similarly $\log P(v)$ will be overestimated in expectation. Hence the average of a large number of test log probabilities is highly likely to be an overestimate.

Despite these problems the estimator has received significant attention in statistics, and has been used for evaluating latent variable models in recent machine learning literature [5, 6]. This is understandable: all of the existing, more accurate methods are harder to implement and take considerably longer to run. In this paper we propose a method that is nearly as easy to use as the harmonic mean method, but with better properties.
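The two estimators of sections 2.1 and 2.2 can be compared on a model small enough to sum exactly. The sketch below is only an illustration under loud assumptions: the three-unit model, its uniform prior and the `likelihood` table are made up, the proposal is taken to be the prior, and posterior samples are drawn exactly rather than by MCMC, which is only possible because the toy state space has eight states.

```python
import itertools
import random

# Hypothetical toy model: 3 binary hidden units with a uniform prior and a
# made-up likelihood table P(v | h) for one fixed test vector v.
STATES = list(itertools.product([0, 1], repeat=3))
PRIOR = 1.0 / len(STATES)


def likelihood(h):
    """Made-up P(v | h); larger when more hidden units are on."""
    return 0.1 + 0.2 * sum(h)


EXACT = sum(PRIOR * likelihood(h) for h in STATES)        # P(v) by brute force
POSTERIOR = [PRIOR * likelihood(h) / EXACT for h in STATES]


def importance_sampling(num_samples, rng):
    """Equation (1) with Q(h) = P(h), so each ratio P(h, v) / Q(h) is P(v | h)."""
    draws = rng.choices(STATES, k=num_samples)
    return sum(likelihood(h) for h in draws) / num_samples


def harmonic_mean(num_samples, rng):
    """Equation (2): average 1 / P(v | h) over posterior samples, then invert."""
    draws = rng.choices(STATES, weights=POSTERIOR, k=num_samples)
    inverse = sum(1.0 / likelihood(h) for h in draws) / num_samples
    return 1.0 / inverse


rng = random.Random(0)
is_estimate = importance_sampling(20000, rng)
hme_estimate = harmonic_mean(20000, rng)
```

In this benign example both estimates land close to `EXACT`. The failure modes described above arise when some posterior states have very small $P(v|h)$, so that their reciprocal weights dominate the harmonic mean but are rarely sampled.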
2.3 Importance sampling based on Markov chains

Paradoxically, introducing auxiliary variables and making a distribution much higher-dimensional than it was before can help find an approximating $Q$ distribution that closely matches the target distribution. As an example we give a partial review of Annealed Importance Sampling (AIS) [7], a special case of a larger family of Sequential Monte Carlo (SMC) methods (see, e.g., [8]). Some of this theory will be needed in the new method we present in section 3.

Annealing algorithms start with a sample from some tractable distribution $P_1$. Steps are taken with a series of operators $T_2, T_3, \ldots, T_S$, whose stationary distributions, $P_s$, are cooled towards the distribution of interest. The probability over the resulting sequence $H = \{h^{(1)}, h^{(2)}, \ldots, h^{(S)}\}$ is:

$$Q_{\mathrm{AIS}}(H) = P_1\big(h^{(1)}\big) \prod_{s=2}^{S} T_s\big(h^{(s)} \leftarrow h^{(s-1)}\big). \tag{3}$$

To compute importance weights, we need to define a target distribution on the same state-space:

$$P_{\mathrm{AIS}}(H) = P\big(h^{(S)} \,|\, v\big) \prod_{s=2}^{S} \tilde{T}_s\big(h^{(s-1)} \leftarrow h^{(s)}\big). \tag{4}$$

Because $h^{(S)}$ has marginal $P(h|v) = P(h, v)/P(v)$, $P_{\mathrm{AIS}}(H)$ has our target, $P(v)$, as its normalizing constant. The $\tilde{T}$ operators are the reverse operators of those used to define $Q_{\mathrm{AIS}}$. For any transition operator $T$ that leaves a distribution $P(h|v)$ stationary, there is a unique corresponding reverse operator $\tilde{T}$, which is defined for any point $h$ in the support of $P$:

$$\tilde{T}(h' \leftarrow h) = \frac{T(h \leftarrow h')\, P(h'|v)}{\sum_{h''} T(h \leftarrow h'')\, P(h''|v)} = \frac{T(h \leftarrow h')\, P(h'|v)}{P(h|v)}. \tag{5}$$

The sum in the denominator is known because $T$ leaves the posterior stationary. Operators that are their own reverse operator are said to satisfy detailed balance and are also known as reversible. Many transition operators used in practice, such as Metropolis-Hastings, are reversible. Non-reversible operators are usually composed from a sequence of reversible operations, such as the component updates in a Gibbs sampler. The reverse of these (so-called) non-reversible operators is constructed from the same reversible base operations, but applied in reverse order.
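The statement about reverse operators can be checked numerically on a tiny example. The sketch below assumes a made-up target over two binary variables and represents each operator as a small matrix `T[i][j]`, the probability of moving to state `i` from state `j`; the helper names are hypothetical. It applies equation (5) to a two-step Gibbs sweep and compares the result with the sweep run in the opposite order.

```python
import itertools

# Hypothetical target over two binary variables (a, b); numbers are made up.
STATES = list(itertools.product([0, 1], repeat=2))
P = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
N = len(STATES)


def gibbs_component(k):
    """Matrix T[i][j] = T(STATES[i] <- STATES[j]) for resampling coordinate k."""
    T = [[0.0] * N for _ in range(N)]
    for j, h in enumerate(STATES):
        # Reachable states keep the other coordinate fixed.
        reachable = [i for i, h2 in enumerate(STATES) if h2[1 - k] == h[1 - k]]
        norm = sum(P[STATES[i]] for i in reachable)
        for i in reachable:
            T[i][j] = P[STATES[i]] / norm
    return T


def compose(A, B):
    """Operator that applies B first and then A."""
    return [[sum(A[i][m] * B[m][j] for m in range(N)) for j in range(N)]
            for i in range(N)]


def reverse(T):
    """Equation (5): T~(h' <- h) = T(h <- h') P(h') / P(h)."""
    return [[T[j][i] * P[STATES[i]] / P[STATES[j]] for j in range(N)]
            for i in range(N)]


def max_abs_diff(A, B):
    """Largest entry-wise difference between two operators."""
    return max(abs(A[i][j] - B[i][j]) for i in range(N) for j in range(N))


T_a, T_b = gibbs_component(0), gibbs_component(1)
sweep_ab = compose(T_b, T_a)        # update a, then b
sweep_ba = compose(T_a, T_b)        # update b, then a
reversed_sweep = reverse(sweep_ab)  # reverse operator from equation (5)
```

Each single-coordinate update is its own reverse, while `reversed_sweep` agrees with `sweep_ba` and differs from `sweep_ab`, matching the description of non-reversible operators built from reversible pieces.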
The definitions above allow us to write:

$$Q_{\mathrm{AIS}}(H) = P_{\mathrm{AIS}}(H)\, \frac{Q_{\mathrm{AIS}}(H)}{P_{\mathrm{AIS}}(H)} = P_{\mathrm{AIS}}(H)\, \frac{P_1\big(h^{(1)}\big)}{P\big(h^{(S)}|v\big)} \prod_{s=2}^{S} \frac{T_s\big(h^{(s)} \leftarrow h^{(s-1)}\big)}{\tilde{T}_s\big(h^{(s-1)} \leftarrow h^{(s)}\big)}$$
$$\phantom{Q_{\mathrm{AIS}}(H)} = P_{\mathrm{AIS}}(H)\, P(v) \left[ \frac{P_1\big(h^{(1)}\big)}{P\big(h^{(S)}, v\big)} \prod_{s=2}^{S} \frac{P^*_s\big(h^{(s)}\big)}{P^*_s\big(h^{(s-1)}\big)} \right] \equiv \frac{P_{\mathrm{AIS}}(H)\, P(v)}{w(H)}. \tag{6}$$

We can usually evaluate the $P^*_s$, which are unnormalized versions of the stationary distributions of the Markov chain operators. Therefore the AIS importance weight $w(H) = 1/[\,\cdots]$ is tractable as long as we can evaluate $P(h, v)$. The AIS importance weight provides an unbiased estimate:

$$\mathbb{E}_{Q_{\mathrm{AIS}}(H)}\big[w(H)\big] = P(v) \sum_H P_{\mathrm{AIS}}(H) = P(v). \tag{7}$$

As with standard importance sampling, the variance of the estimator depends on a divergence between $P_{\mathrm{AIS}}$ and $Q_{\mathrm{AIS}}$. This can be made small, at large computational expense, by using hundreds or thousands of steps $S$, allowing the neighboring intermediate distributions $P_s(h)$ to be close.

2.4 Chib-style estimators

Bayes' rule implies that for any special hidden state $h^*$,

$$P(v) = P(h^*, v)/P(h^*|v). \tag{8}$$

This trivial identity suggests a family of estimators introduced by Chib [9]. First, we choose a particular hidden state $h^*$, usually one with high posterior probability, and then estimate $P(h^*|v)$. We would like to obtain an estimator that is based on a sequence of states $H = \{h^{(1)}, h^{(2)}, \ldots, h^{(S)}\}$ generated by a Markov chain that explores the posterior distribution $P(h|v)$. The most naive estimate of $P(h^*|v)$ is the fraction of states in $H$ that are equal to the special state, $\sum_s I\big(h^{(s)} = h^*\big)/S$.
Obviously this estimator is impractical as it equals zero with high probability when applied to high-dimensional problems. A Rao-Blackwellized version of this estimator, $\hat{p}(H)$, replaces the indicator function with the probability of transitioning from $h^{(s)}$ to the special state under a Markov chain transition operator that leaves the posterior stationary. This can be derived directly from the operator's stationary condition:

$$P(h^*|v) = \sum_h T(h^* \leftarrow h)\, P(h|v) \approx \hat{p}(H) \equiv \frac{1}{S} \sum_{s=1}^{S} T\big(h^* \leftarrow h^{(s)}\big), \qquad \{h^{(s)}\} \sim \mathcal{P}(H), \tag{9}$$

where $\mathcal{P}(H)$ is the joint distribution arising from $S$ steps of a Markov chain. If the chain has stationary distribution $P(h|v)$ and could be initialized at equilibrium so that

$$\mathcal{P}(H) = P\big(h^{(1)}|v\big) \prod_{s=2}^{S} T\big(h^{(s)} \leftarrow h^{(s-1)}\big), \tag{10}$$

then $\hat{p}(H)$ would be an unbiased estimate of $P(h^*|v)$. For ergodic chains the stationary distribution is achieved asymptotically and the estimator is consistent regardless of how it is initialized.

If $T$ is a Gibbs sampling transition operator, the only way of moving from $h$ to $h'$ is to draw each element of $h'$ in turn. If updates are made in index order from 1 to $M$, the move has probability:

$$T(h' \leftarrow h) = \prod_{j=1}^{M} P\big(h'_j \,\big|\, h'_{1:(j-1)},\, h_{(j+1):M}\big). \tag{11}$$

Equations (9, 11) have been used in schemes for monitoring the convergence of Gibbs samplers [10].

It is worth emphasizing that we have only outlined the simplest possible scheme inspired by Chib's general approach. For some Markov chains, there are technical problems with the above construction, which require an extension explained in the appendix. Moreover the approach above is not what Chib recommended. In fact, [11] explicitly favors a more elaborate procedure involving sampling from a sequence of distributions. This opens up the possibility of many sophisticated developments, e.g. [12, 13]. However, our focus in this work is on obtaining more useful results from simple cheap methods. There are also well-known problems with the Chib approach [14], to which we will return.

3 A new estimator for evaluating latent-variable models

We start with the simplest Chib-inspired estimator based on equations (8, 9, 11).
Like many Markov chain Monte Carlo algorithms, (9) provides only (asymptotic) unbiasedness. For our purposes this is not sufficient. Jensen's inequality tells us

$$P(v) = \frac{P(h^*, v)}{P(h^*|v)} = \frac{P(h^*, v)}{\mathbb{E}[\hat{p}(H)]} \le \mathbb{E}\left[\frac{P(h^*, v)}{\hat{p}(H)}\right]. \tag{12}$$

That is, we will overestimate the probability of a visible vector in expectation. Jensen's inequality also says that we will overestimate $\log P(v)$ in expectation.

Ideally we would like an accurate estimate of $\log P(v)$. However, if we must suffer some bias, then a lower bound that does not overstate performance will usually be preferred. An underestimate of $P(v)$ would result from overestimating $P(h^*|v)$. The probability of the special state $h^*$ will often be overestimated in practice if we initialize our Markov chain at $h^*$. There are, however, simple counter-examples where this does not happen. Instead we describe a construction based on a sequence of Markov steps starting at $h^*$ that does have the desired effect. We draw a state sequence from the following carefully designed distribution, using the algorithm in figure 1:

$$Q(H) = \frac{1}{S} \sum_{s=1}^{S} \tilde{T}\big(h^{(s)} \leftarrow h^*\big) \prod_{s'=s+1}^{S} T\big(h^{(s')} \leftarrow h^{(s'-1)}\big) \prod_{s'=1}^{s-1} \tilde{T}\big(h^{(s')} \leftarrow h^{(s'+1)}\big). \tag{13}$$

If the initial state were drawn from $P(h|v)$ instead of $\tilde{T}\big(h^{(s)} \leftarrow h^*\big)$, then the algorithm would give a sample from an equilibrium sequence with distribution $\mathcal{P}(H)$ defined in (10). This can be checked by repeated substitution of (5). This allows us to express $Q$ in terms of $\mathcal{P}$, as we did for AIS:

$$Q(H) = \frac{1}{S} \sum_{s=1}^{S} \frac{\tilde{T}\big(h^{(s)} \leftarrow h^*\big)}{P\big(h^{(s)}|v\big)}\, \mathcal{P}(H) = \frac{1}{P(h^*|v)} \left[ \frac{1}{S} \sum_{s=1}^{S} T\big(h^* \leftarrow h^{(s)}\big) \right] \mathcal{P}(H). \tag{14}$$
Inputs: $v$, observed test vector; $h^*$, a (preferably high posterior probability) hidden state; $S$, number of Markov chain steps; $T$, Markov chain operator that leaves $P(h|v)$ stationary.

1. Draw $s \sim \mathrm{Uniform}(\{1, \ldots, S\})$
2. Draw $h^{(s)} \sim \tilde{T}\big(h^{(s)} \leftarrow h^*\big)$
3. for $s' = (s+1) : S$
4.     Draw $h^{(s')} \sim T\big(h^{(s')} \leftarrow h^{(s'-1)}\big)$
5. for $s' = (s-1) : -1 : 1$
6.     Draw $h^{(s')} \sim \tilde{T}\big(h^{(s')} \leftarrow h^{(s'+1)}\big)$
7. $P(v) \approx P(v, h^*) \Big/ \frac{1}{S} \sum_{s'=1}^{S} T\big(h^* \leftarrow h^{(s')}\big)$

Figure 1: Algorithm for the proposed method. The graphical model shows $Q(H \,|\, s = 3)$ for $S = 4$. At each generated state $T\big(h^* \leftarrow h^{(s')}\big)$ is evaluated (step 7), roughly doubling the cost of sampling. The reverse operator, $\tilde{T}$, was defined in section 2.3.

The quantity in square brackets is the estimator for $P(h^*|v)$ given in (9). The expectation of the reciprocal of this quantity under draws from $Q(H)$ is exactly the quantity needed to compute $P(v)$:

$$\mathbb{E}_{Q(H)}\left[ 1 \Big/ \frac{1}{S} \sum_{s=1}^{S} T\big(h^* \leftarrow h^{(s)}\big) \right] = \frac{1}{P(h^*|v)} \sum_H \mathcal{P}(H) = \frac{1}{P(h^*|v)}. \tag{15}$$

Although we are using the simple estimator from (9), by drawing $H$ from a carefully constructed Markov chain procedure, the estimator is now unbiased in $P(v)$. This is not an asymptotic result. As long as no division by zero has occurred in the above equations, the estimator is unbiased in $P(v)$ for finite runs of the Markov chain. Jensen's inequality implies that $\log P(v)$ is underestimated in expectation.

Neal noted that Chib's method will return incorrect answers in cases where the Markov chain does not mix well amongst modes [14]. Our new proposed method will suffer from the same problem. Even if no transition probabilities are exactly zero, unbiasedness does not exclude being on a particular side of the correct answer with very high probability. Poor mixing may cause $P(h^*|v)$ to be overestimated with high probability, which would result in an underestimate of $P(v)$, i.e., an overly conservative estimate of test performance.

The variance of the estimator is generally unknown, as it depends on the (generally unavailable) autocovariance structure of the Markov chain.
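The steps of the algorithm in figure 1 can be sketched for a Gibbs sampler on a model small enough to check against brute force. Everything model-specific below is an assumption made for illustration: the four-unit chain-structured `joint` table, its `BIAS` and `COUPLING` values, and the choice of the most probable state as the special state. The forward operator is a Gibbs sweep in index order, evaluated with equation (11), and its reverse is taken to be the same sweep in the opposite order.

```python
import itertools
import math
import random

M = 4
BIAS = [0.5, -1.0, 0.3, 0.8]   # made-up parameters
COUPLING = 0.8                 # made-up interaction between neighbouring units


def joint(h):
    """Toy P(h, v) for one fixed v: a small positive table over 2**M states."""
    energy = sum(b * x for b, x in zip(BIAS, h))
    energy += COUPLING * sum(h[j] * h[j + 1] for j in range(M - 1))
    return 0.001 * math.exp(energy)


def cond_prob_one(h, j):
    """P(h_j = 1 | all other units, v), from two evaluations of the joint."""
    p1 = joint(h[:j] + (1,) + h[j + 1:])
    p0 = joint(h[:j] + (0,) + h[j + 1:])
    return p1 / (p1 + p0)


def gibbs_sweep(h, order, rng):
    """Resample every coordinate once, in the given order."""
    for j in order:
        bit = 1 if rng.random() < cond_prob_one(h, j) else 0
        h = h[:j] + (bit,) + h[j + 1:]
    return h


def forward_prob(h_to, h_from):
    """T(h_to <- h_from) for a sweep in index order, as in equation (11)."""
    prob, cur = 1.0, h_from
    for j in range(M):
        p1 = cond_prob_one(cur, j)
        prob *= p1 if h_to[j] == 1 else 1.0 - p1
        cur = cur[:j] + (h_to[j],) + cur[j + 1:]
    return prob


def estimate(h_star, S, rng):
    """One run of the figure 1 procedure; returns an estimate of P(v)."""
    fwd = range(M)                 # T: update units 0..M-1
    rev = range(M - 1, -1, -1)     # reverse operator: same updates, reversed
    s = rng.randrange(S)                                  # step 1 (0-based)
    chain = [None] * S
    chain[s] = gibbs_sweep(h_star, rev, rng)              # step 2
    for t in range(s + 1, S):                             # steps 3-4
        chain[t] = gibbs_sweep(chain[t - 1], fwd, rng)
    for t in range(s - 1, -1, -1):                        # steps 5-6
        chain[t] = gibbs_sweep(chain[t + 1], rev, rng)
    p_hat = sum(forward_prob(h_star, h) for h in chain) / S
    return joint(h_star) / p_hat                          # step 7


ALL_STATES = list(itertools.product([0, 1], repeat=M))
EXACT = sum(joint(h) for h in ALL_STATES)   # brute-force P(v)
h_star = max(ALL_STATES, key=joint)         # a high-probability special state

rng = random.Random(1)
estimates = [estimate(h_star, 5, rng) for _ in range(5000)]
mean_estimate = sum(estimates) / len(estimates)
mean_log_estimate = sum(math.log(e) for e in estimates) / len(estimates)
```

On this easy problem the average of the estimates should sit close to `EXACT`, while the average of their logarithms can be no larger than the log of their average, in line with the Jensen argument above. A real application would replace `joint` with the model's $P(h, v)$ and could not compute `EXACT`.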
We can note one positive property: for the ideal Markov chain operator that mixes in one step, the estimator has zero variance and gives the correct answer immediately. Although this extreme will not actually occur, it does indicate that on easy problems, good answers can be returned more quickly than by AIS.

4 Deep Belief Networks

In this section we provide a brief overview of Deep Belief Networks (DBNs), recently introduced by [2]. DBNs are probabilistic generative models that can contain many layers of hidden variables. Each layer captures strong high-order correlations between the activities of hidden features in the layer below. The top two layers of the DBN model form a Restricted Boltzmann Machine (RBM), which is an undirected graphical model, but the lower layers form a directed generative model. The original paper introduced a greedy, layer-by-layer unsupervised learning algorithm that consists of learning a stack of RBMs one layer at a time.

Consider a DBN model with two layers of hidden features. The model's joint distribution is:

$$P(v, h^1, h^2) = P(v|h^1)\, P(h^2, h^1), \tag{16}$$

where $P(v|h^1)$ represents a sigmoid belief network, and $P(h^1, h^2)$ is the joint distribution defined by the second layer RBM. By explicitly summing out $h^2$, we can easily evaluate an unnormalized probability $P^*(v, h^1) = Z P(v, h^1)$. Using an approximating factorial posterior distribution $Q(h^1|v)$, obtained as a byproduct of the greedy learning procedure, and an AIS estimate of the model's partition function $Z$, [1] proposed obtaining an estimate of a variational lower bound:

$$\log P(v) \ge \sum_{h^1} Q(h^1|v) \log P^*(v, h^1) - \log Z + \mathcal{H}\big(Q(h^1|v)\big). \tag{17}$$

The entropy term $\mathcal{H}(\cdot)$ can be computed analytically, since $Q$ is factorial, and the expectation term was estimated by a simple Monte Carlo approximation:

$$\sum_{h^1} Q(h^1|v) \log P^*(v, h^1) \approx \frac{1}{S} \sum_{s=1}^{S} \log P^*\big(v, h^{1(s)}\big), \qquad \text{where } h^{1(s)} \sim Q(h^1|v). \tag{18}$$

Figure 2: [Two panels, "MNIST digits" (left) and "Image Patches" (right), each plotting estimated test log probability against the number of Markov chain steps for the AIS estimator, our proposed estimator, and the estimate of the variational lower bound.] AIS, our proposed estimator and a variational method were used to sum over the hidden states for each of 50 randomly sampled test cases to estimate their average log probability. The three methods shared the same AIS estimate of a single global normalization constant $Z$.

Instead of the variational approach, we could also adopt AIS to estimate $\sum_{h^1} P^*(v, h^1)$. This would be computationally very expensive, since we would need to run AIS for each test case.

In the next section we show that variational lower bounds can be quite loose. Running AIS on the entire test set, containing many thousands of test cases, is computationally too demanding. Our proposed estimator requires the same single AIS estimate of $Z$ as the variational method, so that we can evaluate $P(v, h^1)$. It then provides better estimates of $\log P(v)$ by approximately summing over $h^1$ for each test case in a reasonable amount of computer time.

5 Experimental Results

We present experimental results on two datasets: the MNIST digits and a dataset of image patches, extracted from images of natural scenes taken from the collection of Van Hateren (http://hlab.phys.rug.nl/imlib/). The MNIST dataset contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28×28 pixels.
The image dataset consisted of 30,000 training and 20,000 test patches. The raw image intensities were preprocessed and whitened as described in [15]. Gibbs sampling was used as a Markov chain transition operator throughout. All log probabilities quoted use natural logarithms, giving values in nats.

5.1 MNIST digits

In our first experiment we used a deep belief network (DBN) taken from [1]. The network had two hidden layers with 500 and 2000 hidden units, and was greedily trained by learning a stack of two RBMs one layer at a time. Each RBM was trained using the Contrastive Divergence (CD) learning rule. The estimate of the lower bound on the average test log probability, using (17), was […].

To estimate how loose the variational bound is, we randomly sampled 50 test cases, 5 of each class, and ran AIS for each test case to estimate the true test log probability. Computationally, this is equivalent to estimating 50 additional partition functions. Figure 2, left panel, shows the results. The estimate of the variational bound was […] per test case, whereas the estimate of the true test log probability using AIS was […]. Our proposed estimator, averaged over 10 runs, provided an answer of […]. The special state $h^*$ for each test example $v$ was obtained by first sampling from the approximating distribution $Q(h|v)$, and then performing deterministic hill-climbing in $\log p(v, h)$ to get to a local mode.
AIS used a hand-tuned temperature schedule designed to equalize the variance of the intermediate log weights [7]. We needed 10,000 intermediate distributions to get stable results, which took about 3.6 days on a Pentium Xeon 3.00GHz machine, whereas for our proposed estimator we only used $S = 40$, which took about 50 minutes. For a more direct comparison we tried giving AIS 50 minutes, which allows 100 temperatures. This run gave an estimate of −89.59, which is lower than the lower bound and tells us nothing. Giving AIS ten times more time, 1000 temperatures, gave […]. This is higher than the lower bound, but still worse than our estimator at $S = 40$, or even $S = 5$.

Finally, using our proposed estimator, the average test log probability on the entire MNIST test data was […]. The difference of about 2 nats shows that the variational bound in [1] was rather tight, although a very small improvement of the DBN over the RBM is now revealed.

5.2 Image Patches

In our second experiment we trained a two-layer DBN model on the image patches of natural scenes. The first layer RBM had 2000 hidden units and 400 Gaussian visible units. The second layer represented a semi-restricted Boltzmann machine (SRBM) with 500 hidden and 2000 visible units. The SRBM contained visible-to-visible connections, and was trained using Contrastive Divergence together with mean-field. Details of training can be found in [15]. The overall DBN model can be viewed as a directed hierarchy of Markov random fields with hidden-to-hidden connections.

To estimate the model's partition function, we used AIS with 5,000 intermediate distributions and 100 annealing runs. The estimated lower bound on the average test log probability (see Eq. 17), using a factorial approximate posterior distribution $Q(h^1|v)$, which we also get as a byproduct of the greedy learning algorithm, was […]. The estimate of the true test log probability, using our proposed estimator, was […]. In contrast to the model trained on MNIST, the difference of over 20 nats shows that, for model comparison purposes, the variational lower bound is quite loose.
For comparison, we also trained square ICA and a mixture of factor analyzers (MFA) using code from [16, 17]. Square ICA achieves a test log probability of 55.4, and MFA with 50 mixture components and a 30-dimensional latent space achieves […], clearly outperforming DBNs.

6 Discussion

Our new Monte Carlo procedure is formally unbiased in estimating $P(v)$. In practice it is likely to underestimate the (log-)probability of a test set. Although the algorithm involves Markov chains, importance sampling underlies the estimator. Therefore the methods discussed in [18] could be used to bound the probability of accidentally overestimating a test set probability.

In principle our procedure is a general technique for estimating normalizing constants. It would not always be appropriate however, as it would suffer the problems outlined in [14]. As an example our method will not succeed in estimating the global normalizing constant of an RBM.

For our method to work well, a state drawn from $\tilde{T}\big(h^{(s)} \leftarrow h^*\big)$ should look like it could be part of an equilibrium sequence $H \sim \mathcal{P}(H)$. The details of the algorithm arose by developing existing Monte Carlo estimators, but the starting state $h^{(s)}$ could be drawn from any arbitrary distribution:

$$Q_{\mathrm{var}}(H) = \frac{1}{S} \sum_{s=1}^{S} \frac{q\big(h^{(s)}\big)}{P\big(h^{(s)}|v\big)}\, \mathcal{P}(H) = P(v) \left[ \frac{1}{S} \sum_{s=1}^{S} \frac{q\big(h^{(s)}\big)}{P\big(h^{(s)}, v\big)} \right] \mathcal{P}(H). \tag{19}$$

As before the reciprocal of the quantity in square brackets would give an estimate of $P(v)$. If an approximation $q(h)$ is available that captures more mass than $\tilde{T}(h \leftarrow h^*)$, this generalized estimator could perform better. We are hopeful that our method will be a natural next step in a variety of situations where improvements are sought over a deterministic approximation.

Acknowledgments

This research was supported by NSERC and CFI. Iain Murray was supported by the government of Canada. We thank Geoffrey Hinton and Radford Neal for useful discussions, Simon Osindero for providing preprocessed image patches of natural scenes, and the reviewers for useful comments.
References

[1] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of Deep Belief Networks. In Proceedings of the International Conference on Machine Learning, volume 25.
[2] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7).
[3] Tom Minka. Divergence measures and message passing. Technical report, Microsoft Research.
[4] Michael A. Newton and Adrian E. Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B (Methodological), 56(1):3-48, 1994.
[5] Thomas L. Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems (NIPS*17). MIT Press.
[6] Hanna M. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning. ACM Press, New York, NY, USA.
[7] Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125-139, 2001.
[8] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B, 68(3):1-26.
[9] Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313-1321, December 1995.
[10] Christian Ritter and Martin A. Tanner. Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs sampler. Journal of the American Statistical Association, 87(419):861-868, 1992.
[11] Siddhartha Chib and Ivan Jeliazkov. Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96(453), 2001.
[12] Antonietta Mira and Geoff Nicholls. Bridge estimation of the probability density at a point. Statistica Sinica, 14:603-612.
[13] Francesco Bartolucci, Luisa Scaccia, and Antonietta Mira. Efficient Bayes factor estimation from the reversible jump output. Biometrika, 93(1):41-52.
[14] Radford M. Neal. Erroneous results in "Marginal likelihood from the Gibbs output", 1999. Available from http://www.cs.toronto.edu/~radford/chib-letter.html.
[15] Simon Osindero and Geoffrey Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In Advances in Neural Information Processing Systems (NIPS*20). MIT Press.
[16] Aapo Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 1999.
[17] Zoubin Ghahramani and Geoffrey E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1997.
[18] Vibhav Gogate, Bozhena Bidyuk, and Rina Dechter. Studies in lower bounding probability of evidence using the Markov inequality. In 23rd Conference on Uncertainty in Artificial Intelligence (UAI).

A Real-valued latents and Metropolis-Hastings

There are technical difficulties with the original Chib-style approach applied to Metropolis-Hastings and continuous latent variables. The continuous version of equation (9),

$$P(h^*|v) = \int T(h^* \leftarrow h)\, P(h|v)\, \mathrm{d}h \approx \frac{1}{S} \sum_{s=1}^{S} T\big(h^* \leftarrow h^{(s)}\big), \qquad \{h^{(s)}\} \sim \mathcal{P}(H), \tag{20}$$

doesn't work if $T$ is the Metropolis-Hastings operator. The Dirac-delta function at $h = h^*$ contains a significant part of the integral, which is ignored by samples from $P(h|v)$ with probability one. Following [11], the fix is to instead integrate over the generalized detailed balance relationship (5). Chib and Jeliazkov implicitly took out the $h = h^*$ point from all of their integrals. We do the same:

$$P(h^*|v) = \int \mathrm{d}h\; T(h^* \leftarrow h)\, P(h|v) \Big/ \int \mathrm{d}h\; \tilde{T}(h \leftarrow h^*). \tag{21}$$

The numerator can be estimated as before. As both integrals omit $h = h^*$, the denominator is less than one when $T$ contains a delta function. For Metropolis-Hastings: $T(h' \leftarrow h) = q(h'; h)\, \min\big(1, a(h'; h)\big)$, where $a(h'; h)$ is an easy-to-compute acceptance ratio. Sampling from $q(h; h^*)$ and averaging $\min\big(1, a(h; h^*)\big)$ provides an estimate of the denominator.

In our importance sampling approach there is no need to separately approximate an additional quantity. The algorithm in figure 1 still applies if the $T$'s are interpreted as probability density functions. If, due to a rejection, $h^*$ is drawn in step 2, then the sum in step 7 will contain an infinite term giving a trivial underestimate $P(v) = 0$. (Steps 3-6 need not be performed in this case.) On repeated runs, the average estimate is still unbiased, or an underestimate for chains that can't mix. Alternatively, the variational approach (19) could be applied together with Metropolis-Hastings sampling.
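The Monte Carlo estimate of the denominator described in the appendix, averaging $\min(1, a(h; h^*))$ over proposals drawn from $q(h; h^*)$, can be sketched as follows. The one-dimensional standard normal target, the symmetric Gaussian random-walk proposal and the function names are all assumptions chosen so that the answer has a simple closed form to compare against; they are not taken from the paper's experiments.

```python
import math
import random


def log_target(h):
    """Unnormalized log density of a toy 1-D posterior (standard normal)."""
    return -0.5 * h * h


def estimate_denominator(h_star, step, num_samples, rng):
    """Average min(1, a(h; h*)) over proposals h ~ q(h; h*).

    q is a symmetric Gaussian random walk, so the acceptance ratio reduces
    to a ratio of target densities.
    """
    total = 0.0
    for _ in range(num_samples):
        h = rng.gauss(h_star, step)
        log_a = log_target(h) - log_target(h_star)
        total += math.exp(min(0.0, log_a))   # equals min(1, exp(log_a))
    return total / num_samples


rng = random.Random(0)
denominator = estimate_denominator(h_star=0.0, step=1.0, num_samples=50000, rng=rng)

# Closed form for this toy case only (h* = 0, unit step): the expectation of
# exp(-h^2 / 2) under a standard normal proposal is 1 / sqrt(2).
closed_form = 1.0 / math.sqrt(2.0)
```

The estimate is the probability that a Metropolis-Hastings move proposed from $h^*$ is accepted, which is why it is below one whenever rejections are possible.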
More informationTHE PROBLEM OF finding localized energy solutions
600 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1997 Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Reweighted Minimum Norm Algorithm Irina F. Gorodnitsky, Member, IEEE,
More informationGenerating more realistic images using gated MRF s
Generating more realistic images using gated MRF s Marc Aurelio Ranzato Volodymyr Mnih Geoffrey E. Hinton Department of Computer Science University of Toronto {ranzato,vmnih,hinton}@cs.toronto.edu Abstract
More informationThe Capital Asset Pricing Model: Some Empirical Tests
The Capital Asset Pricing Model: Some Empirical Tests Fischer Black* Deceased Michael C. Jensen Harvard Business School MJensen@hbs.edu and Myron Scholes Stanford University  Graduate School of Business
More informationGaussian Process Kernels for Pattern Discovery and Extrapolation
Andrew Gordon Wilson Department of Engineering, University of Cambridge, Cambridge, UK Ryan Prescott Adams School of Engineering and Applied Sciences, Harvard University, Cambridge, USA agw38@cam.ac.uk
More informationDirichlet Process Gaussian Mixture Models: Choice of the Base Distribution
Görür D, Rasmussen CE. Dirichlet process Gaussian mixture models: Choice of the base distribution. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 5(4): 615 66 July 010/DOI 10.1007/s1139001010511 Dirichlet
More informationFourier Theoretic Probabilistic Inference over Permutations
Journal of Machine Learning Research 10 (2009) 9971070 Submitted 5/08; Revised 3/09; Published 5/09 Fourier Theoretic Probabilistic Inference over Permutations Jonathan Huang Robotics Institute Carnegie
More informationRegression and Classification Using Gaussian Process Priors
BAYESIAN STATISTICS 6, pp. 000000 J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (Eds.) Oxford University Press, 1998 Regression and Classification Using Gaussian Process Priors RADFORD
More informationThe Backpropagation Algorithm
7 The Backpropagation Algorithm 7. Learning as gradient descent We saw in the last chapter that multilayered networks are capable of computing a wider range of Boolean functions than networks with a single
More information