On Tracking The Partition Function

Guillaume Desjardins, Aaron Courville, Yoshua Bengio
{desjagui,courvila,bengioy}@iro.umontreal.ca
Département d'informatique et de recherche opérationnelle
Université de Montréal

Abstract

Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change in Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone.

1 Introduction

In many areas of application, problems are naturally expressed as a Gibbs measure, where the distribution over the domain X is given by, for x ∈ X:

$$p(x) = \frac{q(x)}{Z(\beta)} = \frac{\exp\{-\beta E(x)\}}{Z(\beta)}, \quad \text{with } Z(\beta) = \sum_{x \in \mathcal{X}} q(x). \tag{1}$$

E(x) is referred to as the energy of configuration x, β is a free parameter known as the inverse temperature, and Z(β) is the normalization factor commonly referred to as the partition function. Under certain general conditions on the form of E, these models are known as Markov Random Fields (MRFs), and have been very popular within the vision and natural language processing communities. MRFs with latent variables, in particular restricted Boltzmann machines (RBMs) [9], are among the most popular building blocks for deep architectures [1], being used in the unsupervised initialization of both Deep Belief Networks [9] and Deep Boltzmann Machines [22].

As illustrated in Eq. 1, the partition function is computed by summing over all variable configurations. Since the number of configurations scales exponentially with the number of variables, exact calculation of the partition function is generally computationally intractable. Without the partition function, probabilities under the model can only be determined up to a multiplicative constant, which seriously limits the model's utility. One method recently proposed for estimating Z(β) is annealed importance sampling (AIS) [18, 23]. In AIS, Z(β) is approximated by the sum of a set of importance-weighted samples drawn from the model distribution. With a large number of variables, drawing a set of importance-weighted samples is generally subject to extreme variance in the importance weights. AIS alleviates this issue by annealing the model distribution through a series of slowly changing distributions that link the target model distribution to one where the log partition function is tractable. While AIS is quite successful, it generally requires the use of tens of thousands of annealing distributions in order to achieve accurate results. This computationally intensive requirement renders AIS inappropriate as a means of maintaining a running estimate of the log partition function throughout training.
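To make the cost concrete, the sum in Eq. 1 can be evaluated exactly only for very small models. The following NumPy sketch (our own illustration, not from the paper) enumerates all 2^d binary configurations; the same loop is hopeless at realistic scales, e.g. d = 784 for MNIST.

```python
import itertools
import numpy as np

def log_partition_brute_force(energy_fn, d, beta=1.0):
    """Exact log Z(beta) = log sum_x exp(-beta * E(x)) over all 2^d binary x."""
    log_terms = np.array([-beta * energy_fn(np.array(x, dtype=float))
                          for x in itertools.product([0, 1], repeat=d)])
    m = log_terms.max()                        # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_terms - m).sum())

# Toy quadratic energy E(x) = -x^T J x - h^T x on d = 10 variables (illustrative).
rng = np.random.default_rng(0)
d = 10
J = rng.normal(scale=0.1, size=(d, d))
h = rng.normal(scale=0.1, size=d)
print(log_partition_brute_force(lambda x: -x @ J @ x - h @ x, d))
```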

Yet, having ready access to this quantity throughout learning opens the door to a range of possibilities. Likelihood could be used as a basis for model comparison throughout training; early-stopping could be accomplished by monitoring an estimate of the likelihood of a validation set. Another important application is in Bayesian inference in MRFs [17], where we require the partition function for each value of the parameters in the region of support. Tracking the log partition function would also enable simultaneous estimation of all the parameters of a heterogeneous model, for example an extended directed graphical model with Gibbs distributions forming some of the model components.

In this work, we consider a method of tracking the log partition function during training, which builds upon the parallel tempering (PT) framework [7, 10, 15]. Our method relies on two basic observations. First, when using stochastic gradient descent¹, parameters tend to change slowly during training; consequently, the partition function Z(β) also tends to evolve slowly. We exploit this property of the learning process by using importance sampling to estimate changes in the log partition function from one learning iteration to the next. If the changes in the distribution from time-step t to t + 1 are small, the importance sampling estimate can be very accurate, even with relatively few samples. This is the same basic strategy employed in AIS, but while with AIS one constructs a path of close distributions through an annealing schedule, in our procedure we simply rely on the path of distributions that emerges from the learning process. Second, parallel tempering (PT) relies on simulating an extended system, consisting of multiple models each running at their own temperature. These temperatures are chosen such that neighboring models overlap sufficiently as to allow for frequent cross-temperature state swaps. This is an ideal operating regime for bridge sampling [2, 19], which can thus serve to estimate the difference in log partition functions between neighboring models. While with relatively few samples each method on its own tends not to provide reliable estimates, we propose to combine these measurements using a variation of the well-known Kalman filter (KF), allowing us to accurately track the evolution of the log partition function throughout learning. The efficiency of our method stems from the fact that our estimator makes use of the samples generated in the course of training, thus incurring relatively little additional computational cost.

This paper is structured as follows. In Section 2, we provide a brief overview of RBMs and the SML-PT training algorithm, which serves as the basis of our tracking algorithm. Sections 3.1-3.3 cover the details of the importance and bridge sampling estimates, while Section 3.4 provides a comprehensive look at our filtering procedure and the tracking algorithm as a whole. Experimental results are presented in Section 4.

2 Stochastic Maximum Likelihood with Parallel Tempering

Our proposed log partition function tracking strategy is applicable to any Gibbs distribution model that is undergoing relatively smooth changes in the partition function. However, we concentrate on its application to the RBM, since it has become a model of choice for learning unsupervised features for use in deep feed-forward architectures [9, 1] as well as for modeling complex, high-dimensional distributions [27, 24, 12]. RBMs are bipartite graphical models where visible units v ∈ {0, 1}^{n_v} interact with hidden units h ∈ {0, 1}^{n_h} through the energy function E(v, h) = −h^T W v − c^T h − b^T v.
The model parameters θ = [W, c, b] consist of the weight matrix W ∈ R^{n_h × n_v}, whose entries W_ij connect units (v_i, h_j), and the offset vectors b and c. RBMs can be trained through a stochastic approximation to the negative log-likelihood gradient

$$\frac{\partial F(v)}{\partial \theta} - \mathbb{E}_p\!\left[\frac{\partial F(v)}{\partial \theta}\right],$$

where F(v) is the free-energy function defined as F(v) = −log Σ_h exp(−E(v, h)). In Stochastic Maximum Likelihood (SML) [25], we replace the expectation by a sample average, where approximate samples are drawn from a persistent Markov chain, updated through k steps of Gibbs sampling between parameter updates. Other algorithms improve upon this default formulation by replacing Gibbs sampling with more powerful sampling algorithms [26, 7, 21, 20]. By increasing the mixing rate of the underlying Markov chain, these methods can lead to lower variance estimates of the maximum likelihood gradient and faster convergence.

¹ Stochastic gradient descent is one of the most popular methods for training MRFs, precisely because second order optimization methods typically require a deterministic gradient, whereas sampling-based estimators are the only practical option for models with an intractable partition function.
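Summing out the binary hidden units gives the standard closed form F(v) = −b^T v − Σ_j log(1 + exp(c_j + (W v)_j)). A minimal NumPy sketch of this quantity (ours; names and shapes are illustrative):

```python
import numpy as np

def free_energy(v, W, b, c):
    """F(v) = -log sum_h exp(-E(v,h)) for E(v,h) = -h^T W v - c^T h - b^T v.
    v: (N, n_v) batch of binary visible vectors."""
    pre = v @ W.T + c                                     # (N, n_h): c_j + (W v)_j
    return -v @ b - np.logaddexp(0.0, pre).sum(axis=1)    # softplus, overflow-safe

def log_q(v, W, b, c, beta=1.0):
    """Tempered unnormalized log-probability log q_i(v) with h summed out.
    Following the intermediate distributions used in the paper's experiments
    (Sec. 4), only the W and c terms are tempered; b^T v is not."""
    pre = beta * (v @ W.T + c)
    return v @ b + np.logaddexp(0.0, pre).sum(axis=1)
```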

However, from the perspective of tracking the log partition function, we will see in Section 3 that the SML-PT scheme [7] presents a rather unique advantage.

Throughout training, parallel tempering draws samples from an extended system M_t = {q_{i,t}; i ∈ [1, M]}, where q_{i,t} denotes the model with inverse temperature β_i ∈ [0, 1] obtained after t steps of gradient descent. Each model q_{i,t} (associated with a unique partition function Z_{i,t}) represents a smoothed version of the target distribution q_{1,t} (with β_1 = 1). The inverse temperature β_i = 1/T_i ∈ [0, 1] controls the degree of smoothing, with smaller values of β_i leading to distributions which are easier to sample from. To leverage these fast-mixing chains, PT alternates k steps of Gibbs sampling (performed independently at each temperature) with cross-temperature state swaps. These are proposed between neighboring chains using a Metropolis-Hastings-based acceptance criterion. If we denote the particle obtained by each model q_{i,t} after k steps of Gibbs sampling as x_{i,t}, then the swap acceptance ratio r_{i,t} for chains (i, i + 1) is given by:

$$r_{i,t} = \min\left(1, \frac{q_{i,t}(x_{i+1,t})\, q_{i+1,t}(x_{i,t})}{q_{i,t}(x_{i,t})\, q_{i+1,t}(x_{i+1,t})}\right) \tag{2}$$

These swaps ensure that samples from highly ergodic chains are gradually swapped into lower temperature chains. Our swapping schedule is the deterministic even-odd algorithm [14], which proposes swaps between all pairs (q_{i,t}, q_{i+1,t}) with even i's, followed by those with odd i's. The gradient is then estimated by using the sample which was last swapped into temperature β_1. To reduce the variance of our estimate, we run multiple Markov chains per temperature, yielding a mini-batch of model samples X_{i,t} = {x^{(n)}_{i,t} ∼ q_{i,t}(x); 1 ≤ n ≤ N} at each time-step and temperature. SML with Adaptive Parallel Tempering (SML-APT) [6] further improves upon SML-PT by automating the choice of temperatures. It does so by maximizing the flow of particles between extremal temperatures, yielding better ergodicity and more robust sampling in the negative phase of training.
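A sketch of one deterministic even-odd swap sweep (Eq. 2), carried out in the log domain to avoid overflow. Here `log_q(v, beta)` is assumed to return the unnormalized log-probability of a single configuration, e.g. a closure over the free-energy code above; all names are illustrative, not the authors' code.

```python
import numpy as np

def even_odd_swap_sweep(particles, betas, log_q, rng, parity=0):
    """Propose Metropolis swaps between chains (i, i+1) for i = parity, parity+2, ...
    particles: list of M configurations, one per inverse temperature in betas."""
    for i in range(parity, len(betas) - 1, 2):
        # log r_i = [log q_i(x_{i+1}) + log q_{i+1}(x_i)]
        #         - [log q_i(x_i)     + log q_{i+1}(x_{i+1})]      (Eq. 2)
        log_r = (log_q(particles[i + 1], betas[i]) + log_q(particles[i], betas[i + 1])
                 - log_q(particles[i], betas[i]) - log_q(particles[i + 1], betas[i + 1]))
        if np.log(rng.random()) < min(0.0, log_r):
            particles[i], particles[i + 1] = particles[i + 1], particles[i]
    return particles
```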
3 Tracking the Partition Function

Unrolling in time (learning iterations) the M models being simulated by PT, we can envision a two-dimensional lattice of RBMs indexed by (i, t). As previously mentioned, gradient descent learning causes q_{i,t}, the model with inverse temperature β_i obtained at time-step t, to be close to q_{i,t−1}. We can thus apply importance sampling between adjacent temporal models² to obtain an estimate of ζ_{i,t} − ζ_{i,t−1}, where ζ_{i,t} denotes the log partition function log Z_{i,t}; we denote this estimate O^{(Δt)}_{i,t}. Inspired by the annealing distributions used in AIS, one could think to iterate this process from a known quantity ζ_{i,1} in order to estimate ζ_{i,t}. Unfortunately, the variance of such an estimate would grow quickly with t. PT provides an interesting solution to this problem, by simulating an extended system M_t where the β_i's are selected such that q_{i,t} and q_{i+1,t} have enough overlap to allow for frequent cross-temperature state swaps. This motivates using bridge sampling [2] to provide an estimate of ζ_{i+1,t} − ζ_{i,t}, the difference in log partitions between temperatures β_{i+1} and β_i. We denote this estimate O^{(Δβ)}_{i,t}. Additionally, we can treat ζ_{M,t} as a known quantity during training, by setting β_M = 0.³ Beginning with ζ_{M,t} (see its definition in Fig. 1), repeated application of bridge sampling alone could in principle arrive at an accurate estimate of {ζ_{i,t}; i ∈ [1, M], t ∈ [1, T]}. However, reducing the variance sufficiently to provide useful estimates of the log partition function would require a relatively large number of samples at each temperature. Within the context of RBM training, the required number of samples at each of the parallel chains would have an excessive computational cost.

Nonetheless, even with relatively few samples, the bridge sampling estimate provides an additional source of information regarding the log partition function. Our strategy is to combine these two high variance estimates O^{(Δt)}_{i,t} and O^{(Δβ)}_{i,t} by treating the unknown log partition functions as a latent state to be tracked by a Kalman filter. In this framework, we consider O^{(Δt)}_{i,t} and O^{(Δβ)}_{i,t} as observed quantities, used to iteratively refine the joint distribution over the latent state at each learning iteration. Formally, we define this latent state to be ζ_t = [ζ_{1,t}, ..., ζ_{M,t}, b_t], where b_t is an extra term accounting for a systematic bias in O^{(Δt)}_{1,t} (see Sec. 3.2 for details). The corresponding graphical model is shown in Figure 1.

² This same technique was recently used in [5], in the context of learning rate adaptation.
³ The visible units of an RBM with zero weights are marginally independent. Its log partition function is thus given by Σ_i log(1 + exp(b_i)) + n_h log(2).
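The anchor ζ_{M,t} of footnote 3 is trivial to evaluate in closed form; a one-function sketch (ours):

```python
import numpy as np

def log_z_beta_zero(b, n_h):
    """log Z for the beta = 0 model: all couplings vanish, so
    Z = 2^{n_h} * prod_i (1 + exp(b_i)), exactly as in footnote 3."""
    return np.logaddexp(0.0, b).sum() + n_h * np.log(2.0)
```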

Figure 1: A directed graphical model for log partition function tracking (figure not reproduced here). The shaded nodes represent observed variables, and the double-walled nodes represent the tractable ζ_{M,t} with β_M = 0. For clarity of presentation, the bias term is shown as distinct from the other ζ_{i,t} (recall b_t = ζ_{M+1,t}). The system equations are:

$$p(\zeta_0) = \mathcal{N}(\mu_0, \Sigma_0)$$
$$p(\zeta_t \mid \zeta_{t-1}) = \mathcal{N}(\zeta_{t-1}, \Sigma_\zeta)$$
$$p(O_t^{(\Delta t)} \mid \zeta_t, \zeta_{t-1}) = \mathcal{N}(C[\zeta_t, \zeta_{t-1}]^T, \Sigma_{\Delta t})$$
$$p(O_t^{(\Delta \beta)} \mid \zeta_t) = \mathcal{N}(H\zeta_t, \Sigma_{\Delta \beta})$$

where C is the M × 2(M + 1) matrix whose i-th row reads off ζ_{i,t} − ζ_{i,t−1} and additionally adds the bias component b_t to its first row, and H is the (M − 1) × (M + 1) first-difference matrix whose i-th row has −1 in column i and +1 in column i + 1, with zeros in the bias column.

3.1 Model Dynamics

The first step is to specify how we expect the log partition function to change over training iterations, i.e. our prior over the model dynamics. SML training of the RBM model parameters is a stochastic gradient descent algorithm (typically over a mini-batch of N examples) where the parameters change by small increments specified by an approximation to the likelihood gradient. This implies that both the model distribution and the partition function change relatively slowly over learning increments, with the rate of change being a function of the SML learning rate; i.e. we expect q_{i,t} and ζ_{i,t} to be close to q_{i,t−1} and ζ_{i,t−1} respectively. Our model dynamics are thus simple and capture the fact that the log partition function is slowly changing. Characterizing the evolution of the log partition functions as independent Gaussian processes, we model the probability of ζ_t conditioned on ζ_{t−1} as p(ζ_t | ζ_{t−1}) = N(ζ_{t−1}, Σ_ζ), a normal distribution with mean ζ_{t−1} and fixed diagonal covariance Σ_ζ = Diag[σ²_Z, ..., σ²_Z, σ²_b]. Here σ²_Z and σ²_b are hyper-parameters controlling how quickly the latent states ζ_{i,t} and b_t are expected to change between learning iterations.

3.2 Importance Sampling Between Learning Iterations

The observation distribution p(O_t^{(Δt)} | ζ_t, ζ_{t−1}) = N(C[ζ_t, ζ_{t−1}]^T, Σ_{Δt}) models the relationship between the evolution of the latent log partitions and the statistical measurements O_t^{(Δt)} = [O^{(Δt)}_{1,t}, ..., O^{(Δt)}_{M,t}] given by importance sampling, with O^{(Δt)}_{i,t} defined as:

$$O^{(\Delta t)}_{i,t} = \log \frac{1}{N}\sum_{n=1}^{N} w^{(n)}_{i,t}, \quad \text{with } w^{(n)}_{i,t} = \frac{q_{i,t}\big(x^{(n)}_{i,t-1}\big)}{q_{i,t-1}\big(x^{(n)}_{i,t-1}\big)}. \tag{3}$$

In the above distribution, the matrix C encodes the fact that the average importance weights estimate ζ_{i,t} − ζ_{i,t−1} + b_t · I_{i=1}, where I is the indicator function; C is formally defined in Fig. 1. Σ_{Δt} is a diagonal covariance matrix whose elements are updated online from the estimated variances of the log-importance weights. At time-step t, the i-th entry of its diagonal is thus given by Var[w_{i,t}] / [Σ_n w^{(n)}_{i,t}]².
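In code, the Eq. 3 measurement and its online variance term might look as follows (a log-domain sketch under our naming conventions; `log_q_t` and `log_q_tm1` are the current and previous tempered models, evaluated on the previous time-step's samples):

```python
import numpy as np

def importance_obs(x_prev, log_q_t, log_q_tm1):
    """O^{(dt)}_{i,t} = log (1/N) sum_n w_n, with w_n = q_t(x_n) / q_{t-1}(x_n)
    and x_n drawn (approximately) from q_{t-1}.  Also returns the diagonal
    entry of Sigma_dt, i.e. Var[w] / (sum_n w_n)^2."""
    log_w = log_q_t(x_prev) - log_q_tm1(x_prev)    # (N,) log importance weights
    m = log_w.max()
    w = np.exp(log_w - m)                          # rescaled weights; the e^m factor
    obs = m + np.log(w.mean())                     # cancels in the variance ratio below
    var = w.var(ddof=1) / (w.sum() ** 2)
    return obs, var
```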

The term b_t accounts for a systematic bias in O^{(Δt)}_{1,t}. It stems from the reuse of samples X_{1,t−1}: first, for estimating the negative phase gradient at time-step t − 1 (i.e. the gradient applied between q_{i,t−1} and q_{i,t}) and second, to compute the importance weights of Eq. 3. Since the SML gradient acts to lower the probability of negative particles, w^{(n)}_{i,t} is biased.

3.3 Bridging the Parallel Tempering Temperature Gaps

Consider now the other dimension of our parallel tempered lattice of RBMs: temperature. As previously mentioned, neighboring distributions in PT are designed to have significant overlap in their densities in order to permit particle swaps. However, the intermediate distributions q_{i,t}(v, h) are not so close to one another that we can use them as the intermediate distributions of AIS. AIS typically requires thousands of intermediate chains, and maintaining that number of parallel chains would carry a prohibitive computational burden. On the other hand, the parallel tempering strategy of spacing the temperatures to ensure moderately frequent swapping nicely matches the ideal operating regime of bridge sampling [2]. We thus consider a second observation model, p(O_t^{(Δβ)} | ζ_t) = N(Hζ_t, Σ_{Δβ}), with H defined in Fig. 1. The quantities O_t^{(Δβ)} = [O^{(Δβ)}_{1,t}, ..., O^{(Δβ)}_{M−1,t}] are obtained via bridge sampling as estimates of ζ_{i+1,t} − ζ_{i,t}. Entries O^{(Δβ)}_{i,t} are given by:

$$O^{(\Delta\beta)}_{i,t} = \log \sum_{n=1}^{N} u^{(n)}_{i,t} - \log \sum_{n=1}^{N} v^{(n)}_{i,t}, \quad \text{where } u^{(n)}_{i,t} = \frac{q^*_{i,t}\big(x^{(n)}_{i,t}\big)}{q_{i,t}\big(x^{(n)}_{i,t}\big)}, \; v^{(n)}_{i,t} = \frac{q^*_{i,t}\big(x^{(n)}_{i+1,t}\big)}{q_{i+1,t}\big(x^{(n)}_{i+1,t}\big)}. \tag{4}$$

The bridging distribution q*_{i,t} [2, 19] is chosen such that it has large support with both q_{i,t} and q_{i+1,t}. For all i ∈ [1, M − 1], we choose the approximately optimal distribution

$$q^{*(opt)}_{i,t}(x) = \frac{q_{i,t}(x)\, q_{i+1,t}(x)}{s_{i,t}\, q_{i,t}(x) + q_{i+1,t}(x)}, \quad \text{where } s_{i,t} \approx Z_{i+1,t} / Z_{i,t}.$$

Since the Z_{i,t}'s are the very quantities we are trying to estimate, this definition may seem problematic. However, it is possible to start with a coarse estimate of s_{i,1} and refine it in subsequent iterations by using the output of our tracking algorithm. Σ_{Δβ} is once again a diagonal covariance matrix, updated online from the variance of the log-importance weights u and v [19]. The i-th entry is given by Var[u_{i,t}] / [Σ_n u^{(n)}_{i,t}]² + Var[v_{i,t}] / [Σ_n v^{(n)}_{i,t}]².
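A sketch of the Eq. 4 estimate with the near-optimal bridge q* above (ours; `log_s` is the running log of s_{i,t}, seeded coarsely and then refined from the tracker's previous output, as described):

```python
import numpy as np

def _log_sum_exp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def bridge_obs(x_i, x_ip1, log_qi, log_qip1, log_s):
    """O^{(dbeta)}_{i,t} ~ log Z_{i+1,t} - log Z_{i,t}  (Eq. 4), using the bridge
    q*(x) = q_i(x) q_{i+1}(x) / (s q_i(x) + q_{i+1}(x)), with s ~ Z_{i+1}/Z_i."""
    def log_bridge(x):
        a, b = log_qi(x), log_qip1(x)
        return a + b - np.logaddexp(log_s + a, b)
    log_u = log_bridge(x_i) - log_qi(x_i)          # u_n, from samples of chain i
    log_v = log_bridge(x_ip1) - log_qip1(x_ip1)    # v_n, from samples of chain i+1
    return _log_sum_exp(log_u) - _log_sum_exp(log_v)
```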
3.4 Kalman Filtering of the Log-Partition Function

In the above, we have described two sources of information regarding the log partition function for each of the RBMs in the lattice. In this section, we describe a method to fuse all available information to improve the overall accuracy of the estimate of every log partition function. We now consider the steps involved in the inference process in moving from an estimate of the posterior over the latent state at time t − 1 to an estimate of the posterior at time t. We begin by assuming we know the posterior p(ζ_{t−1} | O^{(Δt)}_{1:t−1}, O^{(Δβ)}_{1:t−1}), where O^{(Δt)}_{1:t−1} = [O^{(Δt)}_1, ..., O^{(Δt)}_{t−1}]. We follow the treatment of Neal [18] in characterizing our uncertainty regarding ζ_{i,t} as a Gaussian distribution, and define p(ζ_{t−1} | O^{(Δt)}_{1:t−1}, O^{(Δβ)}_{1:t−1}) ≈ N(µ_{t−1,t−1}, P_{t−1,t−1}), a multivariate Gaussian with mean µ_{t−1,t−1} and covariance P_{t−1,t−1}. The double index notation is used to indicate which is the latest observation being conditioned on for each of the two types of observations: e.g. µ_{t,t−1} represents the posterior mean given O^{(Δt)}_{1:t} and O^{(Δβ)}_{1:t−1}.

Departing from the typical Kalman filter setting, O^{(Δt)}_t depends on both ζ_t and ζ_{t−1}. In order to incorporate this observation into our estimate of the latent state, we first need to specify the prior joint distribution p(ζ_{t−1}, ζ_t | O^{(Δt)}_{1:t−1}, O^{(Δβ)}_{1:t−1}) = p(ζ_t | ζ_{t−1}) p(ζ_{t−1} | O^{(Δt)}_{1:t−1}, O^{(Δβ)}_{1:t−1}), with p(ζ_t | ζ_{t−1}) as defined in Sec. 3.1. Observation O^{(Δt)}_t is then incorporated through Bayes' rule, yielding p(ζ_{t−1}, ζ_t | O^{(Δt)}_{1:t}, O^{(Δβ)}_{1:t−1}). Having incorporated the importance sampling estimate into the model, we can then marginalize over ζ_{t−1} (which is no longer required), to yield p(ζ_t | O^{(Δt)}_{1:t}, O^{(Δβ)}_{1:t−1}). Finally, it remains only to incorporate the bridge sampler estimate O^{(Δβ)}_t by a second application of Bayes' rule, which gives us p(ζ_t | O^{(Δt)}_{1:t}, O^{(Δβ)}_{1:t}), the updated posterior over the latent state at time-step t. The detailed inference equations are provided in Fig. 2 and can be derived easily from standard textbook equations on products and marginals of normal distributions [4].

Figure 2: Inference equations for our log partition tracking algorithm, a variant on the Kalman filter. For any vector v and matrix V, we use the notation [v]_2 to denote the vector obtained by preserving the bottom half elements of v, and [V]_{2,2} to indicate the lower right-hand quadrant of V. The inference equations are:

(i) $p(\zeta_{t-1}, \zeta_t \mid O^{(\Delta t)}_{1:t-1}, O^{(\Delta\beta)}_{1:t-1}) = \mathcal{N}(\eta_{t-1,t-1}, V_{t-1,t-1})$, with
$\eta_{t-1,t-1} = \begin{bmatrix} \mu_{t-1,t-1} \\ \mu_{t-1,t-1} \end{bmatrix}$ and $V_{t-1,t-1} = \begin{bmatrix} P_{t-1,t-1} & P_{t-1,t-1} \\ P_{t-1,t-1} & \Sigma_\zeta + P_{t-1,t-1} \end{bmatrix}$

(ii) $p(\zeta_{t-1}, \zeta_t \mid O^{(\Delta t)}_{1:t}, O^{(\Delta\beta)}_{1:t-1}) = \mathcal{N}(\eta_{t,t-1}, V_{t,t-1})$, with
$V_{t,t-1} = (V_{t-1,t-1}^{-1} + C^T \Sigma_{\Delta t}^{-1} C)^{-1}$ and $\eta_{t,t-1} = V_{t,t-1}\,(C^T \Sigma_{\Delta t}^{-1} O^{(\Delta t)}_t + V_{t-1,t-1}^{-1}\eta_{t-1,t-1})$

(iii) $p(\zeta_t \mid O^{(\Delta t)}_{1:t}, O^{(\Delta\beta)}_{1:t-1}) = \mathcal{N}(\mu_{t,t-1}, P_{t,t-1})$, with $\mu_{t,t-1} = [\eta_{t,t-1}]_2$ and $P_{t,t-1} = [V_{t,t-1}]_{2,2}$

(iv) $p(\zeta_t \mid O^{(\Delta t)}_{1:t}, O^{(\Delta\beta)}_{1:t}) = \mathcal{N}(\mu_{t,t}, P_{t,t})$, with
$P_{t,t} = (P_{t,t-1}^{-1} + H^T \Sigma_{\Delta\beta}^{-1} H)^{-1}$ and $\mu_{t,t} = P_{t,t}\,(H^T \Sigma_{\Delta\beta}^{-1} O^{(\Delta\beta)}_t + P_{t,t-1}^{-1}\mu_{t,t-1})$
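The four steps translate almost line for line into NumPy; a sketch (ours, not the authors' code) with state dimension D = M + 1 for [ζ_1, ..., ζ_M, b]:

```python
import numpy as np

def kalman_step(mu, P, obs_dt, Sigma_dt, obs_db, Sigma_db, Sigma_zeta, C, H):
    """One tracking update, steps (i)-(iv) of Fig. 2.
    mu: (D,) posterior mean and P: (D, D) posterior covariance at t-1.
    C: (M, 2D) importance-sampling matrix, H: (M-1, D) bridge matrix."""
    D = mu.shape[0]
    # (i) joint Gaussian prior over (zeta_{t-1}, zeta_t)
    eta = np.concatenate([mu, mu])
    V = np.block([[P, P], [P, P + Sigma_zeta]])
    # (ii) condition on the importance-sampling observation O^{(dt)}
    V_inv = np.linalg.inv(V)
    V_post = np.linalg.inv(V_inv + C.T @ np.linalg.solve(Sigma_dt, C))
    eta_post = V_post @ (C.T @ np.linalg.solve(Sigma_dt, obs_dt) + V_inv @ eta)
    # (iii) marginalize out zeta_{t-1}: keep the bottom half / lower-right block
    mu_pred, P_pred = eta_post[D:], V_post[D:, D:]
    # (iv) condition on the bridge-sampling observation O^{(dbeta)}
    P_inv = np.linalg.inv(P_pred)
    P_new = np.linalg.inv(P_inv + H.T @ np.linalg.solve(Sigma_db, H))
    mu_new = P_new @ (H.T @ np.linalg.solve(Sigma_db, obs_db) + P_inv @ mu_pred)
    return mu_new, P_new
```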

4 Experimental Results

For the following experiments, SML was performed using either constant or decreasing learning rates. We used the decreasing schedule ε_t = min(ε_ini · α / (t + 1), ε_ini), where ε_t is the learning rate at time-step t, ε_ini is the initial or base learning rate and α is the decrease constant. Entries of Σ_ζ (see Section 3.1) were set as follows. We set σ²_Z = +∞, which is to say that we did not exploit the smoothness prior when estimating the prior distribution over the joint p(ζ_{t−1}, ζ_t | O^{(Δt)}_{1:t−1}, O^{(Δβ)}_{1:t−1}). σ²_b was set to 10⁻³ · ε_t, allowing the estimated bias on O^{(Δt)}_{1,t} to change faster for large learning rates. When initializing the RBM visible offsets⁴ as proposed in [8], the intermediate distributions of Eq. 1 lead to sub-optimal swap rates between adjacent chains early in training, with a direct impact on the quality of tracking. In our experiments, we avoid this issue by using the intermediate distributions q_{i,t}(x) ∝ exp[β_i (h^T W v + c^T h) + b^T v]. We tested mini-batch sizes N ∈ [10, 20].

⁴ Each b_k is initialized to log[x̄_k / (1 − x̄_k)], where x̄_k is the mean of the k-th dimension on the training set.

Comparing to Exact Likelihood. We start by comparing the performance of our tracking algorithm to the exact likelihood, obtained by marginalizing over both visible and hidden units. We chose 25 hidden units and trained on the ubiquitous MNIST [13] dataset for 30k updates, using both fixed and adaptive learning rates. The main results are shown in Figure 3. In Figure 3(a), we can see that our tracker provides a very good fit to the likelihood with ε_ini = 0.001 and decrease constants α in {10³, 10⁴, 10⁵}. Increasing the base learning rate to ε_ini = 0.01 in Figure 3(b), we maintain a good fit up to α = 10⁴, with a small dip in performance at 5k updates. Our tracker fails however to capture the oscillatory behavior engendered by too high of a learning rate (ε_ini = 0.01, α = 10⁵). It is interesting to note that the failure mode of our algorithm seems to coincide with an unstable optimization process.

Figure 3: Comparison of exact test-set likelihood (in nats) and estimated likelihood as given by AIS and our tracking algorithm (plot data not reproduced). We trained a 25-hidden unit RBM for 30k updates using SML, with a learning rate schedule ε_t = min(α ε_ini / (t + 1), ε_ini), with (left) ε_ini = 0.001 and (right) ε_ini = 0.01, varying α ∈ {10³, 10⁴, 10⁵}.

Comparing to AIS for Large-Scale Models. In evaluating the performance of our tracking algorithm on larger models, exact computation of the likelihood is no longer possible, so we use AIS as our baseline.⁵ Our models consisted of RBMs with 500 hidden units, trained using SML-APT [6] on the MNIST and Caltech Silhouettes [16] datasets. We performed 20k updates, with learning rate parameters ε_ini ∈ {0.001, 0.01} and α ∈ {10³, 10⁴, 10⁵}. On MNIST, AIS estimated the test-likelihood of our best model at −94.34 ± 3.8 (where ± indicates the 3σ confidence interval), while our tracking algorithm reported a value of −89.96. On Caltech Silhouettes, our model reached −134.23 ± 21.14 according to AIS, while our tracker reported −114.31. To put these numbers in perspective, Salakhutdinov and Murray [23] report values of −125.53, −105.50 and −86.34 for 500 hidden unit RBMs trained with CD-{1,3,25} respectively. Marlin et al. [16] report around −120 for Caltech Silhouettes, again using 500 hidden units.

⁵ Our base AIS configuration was 10³ intermediate distributions spaced linearly between β = [0, 0.5], 10⁴ distributions for the interval [0.5, 0.9] and 10⁴ for [0.9, 1.0]. Estimates of log Z are averaged over 100 annealed importance weights.

Figure 4: (left) Plotted on the left y-axis are the Kalman filter measurements O^{(Δβ)}, our log partition estimate of ζ_{1,t} and point estimates of ζ_{1,t} obtained by AIS; on the right y-axis, measurement O^{(Δt)} is plotted, along with the estimated bias b_t (plot data not reproduced). Note how b_t becomes progressively less pronounced as ε_t decreases and the model converges. Also of interest, the variance on O^{(Δβ)} increases with t but is compensated by a decreasing variance on O^{(Δt)}, yielding a relatively smooth estimate of ζ_{1,t}. (Not shown) The ±3σ confidence interval of the AIS estimate at 20k updates was measured to be 3.8. (right) Example of early-stopping on the dna dataset.

Figure 4 (left) shows a detailed view of the Kalman filter measurements and its output, for the best performing MNIST model. We can see that the variance on O^{(Δβ)} (plotted on the left y-axis) grows slowly over time, which is mitigated by a decreasing variance on O^{(Δt)} (plotted on the right y-axis). As the model converges and the learning rate decreases, q_{i,t−1} and q_{i,t} become progressively closer and the importance sampling estimates become more robust. The estimated bias term b_t also converges to zero. An important point to note is that a naive linear spacing of temperatures yielded low exchange rates between neighboring temperatures, with adverse effects on the quality of our bridge sampling estimates. As a result, we observed a drop in performance, both in likelihood as well as tracking performance. Adaptive tempering [6] (with a fixed number of chains M) proved crucial in getting good tracking for these experiments.
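As an aside, the decreasing learning-rate schedule used throughout this section is a plateau followed by 1/t decay; a trivial sketch:

```python
def learning_rate(t, eps_ini, alpha):
    """eps_t = min(eps_ini * alpha / (t + 1), eps_ini): constant at eps_ini for
    roughly the first alpha updates, then decaying as 1/t."""
    return min(eps_ini * alpha / (t + 1), eps_ini)
```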

Early-Stopping Experiments. Our final set of experiments highlights the performance of our method on a wide variety of datasets [11]. In these experiments, we use our estimate of the log partition to monitor model performance on a held-out validation set. When the onset of over-fitting is detected, we store the model parameters and report the associated test-set likelihood, as estimated by both AIS and our tracking algorithm. The advantage of such an early-stopping procedure is shown in Figure 4(b), where training log-likelihood increases throughout training while validation performance starts to decrease around 25 epochs. Detecting over-fitting without tracking the log partition would require a dense grid of AIS runs, which would prove computationally prohibitive.

We tested parameters in the following range: number of hidden units in {100, 200, 500, 1000} (depending on dataset size), learning rates in {10⁻², 10⁻³, 10⁻⁴}, either held constant during training or annealed with constants α ∈ {10³, 10⁴, 10⁵}. For tempering, we used 10 fixed temperatures, spaced linearly between β = [0, 1]. SGD was performed using mini-batches of size {10, 100} when estimating the gradient, and mini-batches of size {10, 20} for our set of tempered chains (we thus simulate 10 × {10, 20} tempered chains in total). As can be seen in Table 1, our tracker performs very well compared to the AIS estimates and across all datasets. Efforts to lower the variance of the AIS estimate proved unsuccessful, even going as far as 10⁵ intermediate distributions.

Dataset       RBM (Kalman)   RBM (AIS)         RBM-25    NADE
adult         -15.24         -15.70 (± 0.50)   -16.29    -13.19
connect4      -15.77         -16.81 (± 0.67)   -22.66    -11.99
dna           -87.97         -88.51 (± 0.97)   -96.90    -84.81
mushrooms     -10.49         -14.68 (± 3.75)   -15.15    -9.81
nips          -270.10        -271.23 (± 0.58)  -277.37   -273.08
ocr_letters   -33.87         -31.45 (± 2.7)    -43.05    -27.22
rcv1          -46.89         -48.61 (± 0.69)   -48.88    -46.66
web           -28.95         -29.91 (± 0.74)   -29.38    -28.39

Table 1: Test set likelihood on various datasets. Models were trained using SML-PT. Early-stopping was performed by monitoring likelihood on a hold-out validation set, using our KF estimate of the log partition function. Best models (i.e. the choice of hyper-parameters) were then chosen according to the AIS likelihood estimate. Results for 25-hidden unit RBMs and NADE are taken from [11]. ± indicates a confidence interval of three standard deviations.

5 Discussion

In this paper, we have shown that while exact calculation of the partition function of RBMs may be intractable, one can exploit the smoothness of gradient descent learning in order to approximately track the evolution of the log partition function during learning. Treating the ζ_{i,t}'s as latent variables, the graphical model of Figure 1 allowed us to combine multiple sources of information to achieve good tracking of the log partition function throughout training, on a variety of datasets. We note however that good tracking performance is contingent on the ergodicity of the negative phase sampler. Unsurprisingly, this is the same condition required by SML for accurate estimation of the negative phase gradient.

The method presented in this paper is also computationally attractive, with only a small computational overhead relative to SML-PT training. The added computational cost lies in the computation of the importance weights for importance sampling and bridge sampling. However, this boils down to computing free-energies which are mostly pre-computed in the course of gradient updates, the sole exception being the computation of q_{i,t}(x_{i,t−1}) in the importance sampling step. In comparison to AIS, our method allows us to fairly accurately track the log partition function, at a per-point estimate cost well below that of AIS. Having a reliable and accurate online estimate of the log partition function opens the door to a wide range of new research directions.

Acknowledgments

The authors acknowledge the financial support of NSERC and CIFAR, and Calcul Québec for computational resources.
We also thank Hugo Larochelle for access to the datasets of Sec. 4; Hannes Schulz, Andreas Mueller, Olivier Delalleau and David Warde-Farley for feedback on the paper and algorithm; along with the developers of Theano [3].

References

[1] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127. Also published as a book, Now Publishers, 2009.
[2] Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2), 245-268.
[3] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral presentation.
[4] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
[5] Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 105-112, New York, NY, USA. ACM.
[6] Desjardins, G., Courville, A., and Bengio, Y. (2010a). Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop.
[7] Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau, O. (2010b). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 145-152.
[8] Hinton, G. (2010). A practical guide to training restricted Boltzmann machines. Technical Report 2010-003, University of Toronto. Version 1.
[9] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.
[10] Iba, Y. (2001). Extended ensemble Monte Carlo. International Journal of Modern Physics, C12, 623-656.
[11] Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), volume 15 of JMLR: W&CP.
[12] Larochelle, H., Bengio, Y., and Turian, J. (2010). Tractable multivariate binary density estimation and the restricted Boltzmann forest. Neural Computation, 22(9), 2285-2307.
[13] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[14] Lingenheil, M., Denschlag, R., Mathias, G., and Tavan, P. (2009). Efficiency of exchange schemes in replica exchange. Chemical Physics Letters, 478(1-3), 80-84.
[15] Marinari, E. and Parisi, G. (1992). Simulated tempering: A new Monte Carlo scheme. EPL (Europhysics Letters), 19(6), 451.
[16] Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2009). Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 509-516.
[17] Murray, I. and Ghahramani, Z. (2004). Bayesian learning in undirected graphical models: Approximate MCMC algorithms.
[18] Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125-139.
[19] Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance sampling.
[20] Salakhutdinov, R. (2010a). Learning deep Boltzmann machines using adaptive MCMC. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10), volume 1, pages 943-950. ACM.
[21] Salakhutdinov, R. (2010b). Learning in Markov random fields using tempered transitions. In NIPS'09.
[22] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines.
In AISTATS 2009, volume 5, pages 448-455.
[23] Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, volume 25, pages 872-879. ACM.
[24] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In L. Bottou and M. Littman, editors, ICML 2009, pages 1025-1032. ACM.
[25] Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 1064-1071. ACM.
[26] Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In L. Bottou and M. Littman, editors, ICML 2009, pages 1033-1040. ACM.
[27] Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In NIPS'04, volume 17, Cambridge, MA. MIT Press.