On Autoencoders and Score Matching for Energy Based Models

Kevin Swersky*, Marc'Aurelio Ranzato†, David Buchman*, Benjamin M. Marlin*, Nando de Freitas*

*Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
†Department of Computer Science, University of Toronto, Toronto, ON M5S 3G4, Canada

Abstract

We consider estimation methods for the class of continuous-data energy based models (EBMs). Our main result shows that estimating the parameters of an EBM using score matching when the conditional distribution over the visible units is Gaussian corresponds to training a particular form of regularized autoencoder. We show how different Gaussian EBMs lead to different autoencoder architectures, providing deep links between these two families of models. We compare the score matching estimator for the mPoT model, a particular Gaussian EBM, to several other training methods on a variety of tasks including image denoising and unsupervised feature extraction. We show that the regularization function induced by score matching leads to superior classification performance relative to a standard autoencoder. We also show that score matching yields classification results that are indistinguishable from better-known stochastic approximation maximum likelihood estimators.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

1. Introduction

In this work, we consider a rich class of probabilistic models called energy based models (EBMs) (LeCun et al., 2006; Teh et al., 2003; Hinton, 2002). These models define a probability distribution through an exponentiated energy function. Markov Random Fields (MRFs) and Restricted Boltzmann Machines (RBMs) are the most common instances of such models and have a long history in particular application areas including modeling natural images. Recently, more sophisticated latent variable EBMs for continuous data, including the PoT (Welling et al., 2003), mPoT (Ranzato et al., 2010b), mcRBM (Ranzato & Hinton, 2010), FoE (Schmidt et al., 2010) and others, have become popular models for learning representations of natural images as well as other sources of real-valued data. Such models, also called gated MRFs, leverage latent variables to represent higher order interactions between the input variables. In the very active research area of deep learning (Hinton et al., 2006), these models have been employed as elementary building blocks to construct hierarchical models that achieve very promising performance on several perceptual tasks (Ranzato & Hinton, 2010; Bengio, 2009).

Maximum likelihood estimation is the default parameter estimation approach for probabilistic models due to its optimal theoretical properties. Unfortunately, maximum likelihood estimation is computationally infeasible in many EBMs due to the presence of an intractable normalization term (the partition function) in the model probability. This term arises in EBMs because the exponentiated energies do not automatically integrate to unity, unlike directed models parameterized by products of locally normalized conditional distributions (Bayesian networks).

Several alternative methods have been proposed to estimate the parameters of an EBM without the need for computing the partition function. One particularly interesting method is called score matching (SM) (Hyvärinen, 2005). The score matching objective function is constructed from an L2 loss on the difference between the derivatives of the log of the model and empirical distribution functions with respect to the inputs.

Hyvärinen (2005) showed that this results in a cancellation of the partition function. Further manipulation yields an estimator that can be computed analytically and is provably consistent.

Autoencoder neural networks are another class of models that are often used to model high-dimensional real-valued data (Hinton & Zemel, 1994; Vincent et al., 2008; Vincent, 2011; Kingma & LeCun, 2010). Both EBMs and autoencoders are unsupervised models that can be thought of as learning to re-represent input data in a latent space. In contrast to probabilistic EBMs, autoencoders are deterministic and feed-forward. As a result, autoencoders can be trained to reconstruct their input through one or more hidden layers, they have fast feed-forward inference for hidden layer states, and all common training losses lead to computationally tractable model estimation methods. In order to learn better representations, autoencoders are often modified by tying the weights between the input and output layers to reduce the number of parameters, including additional terms in the objective to bias learning toward sparse hidden unit activations, and adding noise to input data to increase robustness (Vincent et al., 2008; Vincent, 2011). Interestingly, Vincent (2011) showed that a particular kind of denoising autoencoder trained to minimize an L2 reconstruction error can be interpreted as a Gaussian RBM trained using Hyvärinen's score matching estimator.

In this paper, we apply score matching to a number of latent variable EBMs where the conditional distribution of the visible units given the hidden units is Gaussian. We show that the resulting estimation algorithms can be interpreted as minimizing a regularized L2 reconstruction error on the visible units. For Gaussian-binary RBMs, the reconstruction term corresponds to a standard autoencoder with tied weights. For the mPoT and mcRBM models, the reconstruction terms correspond to new autoencoder architectures that take into account the covariance structure of the inputs. This suggests a new way to derive novel autoencoder training criteria by applying score matching to the free energy of an EBM. We further generalize score matching to arbitrary EBMs with real-valued input units and show that this view leads to an intuitive interpretation for the regularization terms that appear in the score matching objective function.

2. Score Matching for Latent Energy Based Models

A latent variable energy based model defines a probability distribution over real-valued data vectors v ∈ V ⊆ R^{n_v} as follows:

    P(v, h; θ) = exp(−E_θ(v, h)) / Z(θ),    (1)

where h ∈ H ⊆ R^{n_h} are the latent variables, E_θ(v, h) is an energy function parameterized by θ ∈ Θ, and Z(θ) is the partition function. We refer to these models as latent energy based models. This general latent energy based model subsumes many specific models for real-valued data such as Boltzmann machines, exponential-family harmoniums (Welling et al., 2005), factored RBMs and Product of Student's T (PoT) models (Memisevic & Hinton, 2009; Ranzato & Hinton, 2010; Ranzato et al., 2010a;b). The marginal distribution in terms of the free energy F_θ(v) is obtained by integrating out the hidden variables as seen below. Typically, but not always, this marginalization can be carried out analytically.

    P(v; θ) = exp(−F_θ(v)) / Z(θ).    (2)

Maximum likelihood parameter estimation is difficult when Z(θ) is intractable. In EBMs the intractability of Z(θ) arises due to the fact that it is a very high-dimensional integral that often lacks a closed form solution.
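For reference, the two quantities that make Equations (1) and (2) consistent can be written out explicitly; the display below is only a restatement, in LaTeX form, of the definitions above:

    Z(\theta) = \int_{\mathcal{V}} \int_{\mathcal{H}} \exp\left(-E_\theta(v, h)\right)\, dh\, dv,
    \qquad
    F_\theta(v) = -\log \int_{\mathcal{H}} \exp\left(-E_\theta(v, h)\right)\, dh.

Substituting the definition of F_θ(v) into Equation (2) and comparing with Equation (1) shows that the marginal depends on the hidden variables only through this log-integral, and that the same Z(θ) normalizes both distributions.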
In such cases, stochastic algorithms can be applied to approximately maximize the likelihood, and a variety of algorithms have been described and evaluated in the literature (Swersky et al., 2010; Marlin et al., 2010), including contrastive divergence (CD) (Hinton, 2002), persistent contrastive divergence (PCD) (Younes, 1989; Tieleman, 2008), and fast persistent contrastive divergence (FPCD) (Tieleman & Hinton, 2009). However, these methods often require very careful hand-tuning of optimization-related parameters like step size, momentum, batch size and weight decay, which is complicated by the fact that the objective function cannot be computed.

The score matching estimator was proposed by Hyvärinen (2005) to overcome the intractability of Z(θ) when dealing with continuous data. The score matching objective function is defined through a score function applied to the empirical p(v) and model p_θ(v) distributions. The score function for a generic distribution p(v) is given by ψ_i(p(v)) = ∂ log p(v) / ∂v_i. For the model distribution this becomes

    ψ_i(p_θ(v)) = −∂F_θ(v)/∂v_i = −∫_H (∂E_θ(v, h)/∂v_i) p_θ(h|v) dh.

The full objective function is given below.

    J(θ) = E_{p(v)} [ ½ Σ_i ( ψ_i(p(v)) − ψ_i(p_θ(v)) )² ].    (3)
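Equation (3) compares two score functions, and the model side ψ_i(p_θ(v)) = −∂F_θ(v)/∂v_i can be evaluated without ever computing Z(θ). The following toy check (not from the paper; a hypothetical one-dimensional free energy, plain NumPy) verifies this numerically by comparing −dF/dv against the derivative of the fully normalized log-density obtained by brute-force integration:

    import numpy as np

    # Hypothetical 1-D free energy: a double well. Any smooth F works here.
    def free_energy(v):
        return 0.25 * v**4 - v**2

    # Model score psi(v) = -dF/dv, evaluated without ever touching Z(theta).
    def score(v, eps=1e-5):
        return -(free_energy(v + eps) - free_energy(v - eps)) / (2 * eps)

    # Brute force: normalize p(v) = exp(-F(v)) / Z on a grid and differentiate log p.
    grid = np.linspace(-4.0, 4.0, 20001)
    unnorm = np.exp(-free_energy(grid))
    Z = np.trapz(unnorm, grid)                 # explicit partition function
    log_p = -free_energy(grid) - np.log(Z)
    dlogp_dv = np.gradient(log_p, grid)        # d/dv log p(v)

    idx = np.searchsorted(grid, [-1.5, -0.3, 0.7, 2.0])
    print(np.allclose(score(grid[idx]), dlogp_dv[idx], atol=1e-3))   # True: Z cancels

Since log p_θ(v) = −F_θ(v) − log Z(θ) and Z(θ) is constant in v, the two quantities agree up to numerical error.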

The benefit of optimizing J(θ) is that Z(θ) cancels off in the derivative of log p_θ(v), since it is constant with respect to each v_i. However, in the above form, J(θ) is still intractable due to the dependence on p(v). Hyvärinen (2005) shows that under weak regularity conditions J(θ) can be expressed in the following form, which can be tractably approximated by replacing the expectation over the empirical distribution by an empirical average over the training set:

    J(θ) = E_{p(v)} [ Σ_i ( ½ ψ_i(p_θ(v))² + ∂ψ_i(p_θ(v))/∂v_i ) ].    (4)

In theoretical situations where the regularity conditions on the derivatives of the empirical distribution are not satisfied, or in practical situations where a finite sample approximation to the expectation over the empirical distribution is used, a smoothed version of the score matching estimator may be of interest. Consider smoothing p(v) using a probabilistic kernel q_β(v|v′) with bandwidth parameter β > 0. We obtain a new distribution q_β(v) = ∫ q_β(v|v′) p(v′) dv′. Vincent (2011) showed that applying score matching to q_β(v) is equivalent to the following objective function, where q_β(v, v′) = q_β(v|v′) p(v′):

    Q(θ) = E_{q_β(v,v′)} [ ½ Σ_i ( ψ_i(q_β(v|v′)) − ψ_i(p_θ(v)) )² ].    (5)

For the case where q_β(v|v′) = N(v|v′, β²), i.e. a Gaussian smoothing kernel with variance β², this is equivalent to the regularized score matching objective proposed in (Kingma & LeCun, 2010). We refer to the objective given by Equation 5 as denoising score matching (SMD). Although SMD is intractable to evaluate analytically, we can again replace the integral over v′ by an empirical average over a finite sample of training data. We can then replace the integral over v by an empirical average over samples v, which can be easily drawn from q_β(v|v′) for each training sample v′. Compared to PCD and CD, SM and SMD give tractable objective functions that can be used to monitor training progress. While SMD is not consistent, it does have significant computational advantages relative to SM (Vincent, 2011).

3. Applying and Interpreting Score Matching for Latent EBMs

We now derive score matching objectives for several commonly used EBMs. In order to apply score matching to a particular EBM, one simply needs an expression for the corresponding free energy.

Example 1: Score matching for Gaussian-binary RBMs. Here, the energy E_θ(v, h) is given by:

    E_θ(v, h) = −Σ_{i=1}^{n_v} Σ_{j=1}^{n_h} (v_i / σ²) W_{ij} h_j − Σ_{j=1}^{n_h} b_j h_j + Σ_{i=1}^{n_v} (v_i − c_i)² / (2σ²),    (6)

where the parameters are θ = (W, σ, b, c) and h_j ∈ {0, 1}. This leads to the free energy F_θ(v):

    F_θ(v) = Σ_{i=1}^{n_v} (v_i − c_i)² / (2σ²) − Σ_{j=1}^{n_h} log( 1 + exp( Σ_{i=1}^{n_v} (v_i / σ²) W_{ij} + b_j ) ).    (7)

The corresponding score matching objective is:

    J(θ) = (1/N) Σ_{n=1}^{N} [ (1/(2σ⁴)) Σ_{i=1}^{n_v} ( v_{in} − c_i − Σ_{j=1}^{n_h} W_{ij} ĥ_{jn} )² − n_v/σ² + (1/σ⁴) Σ_{i=1}^{n_v} Σ_{j=1}^{n_h} W_{ij}² ĥ_{jn} (1 − ĥ_{jn}) ],    (8)

where ĥ_{jn} := sigm( Σ_{i=1}^{n_v} (v_{in} / σ²) W_{ij} + b_j ) and sigm(x) := 1 / (1 + exp(−x)).

For a standardized Normal model, with c = 0 and σ = 1, this objective reduces (up to an additive constant) to:

    J(θ) = (1/N) Σ_{n=1}^{N} [ ½ Σ_{i=1}^{n_v} ( v_{in} − Σ_{j=1}^{n_h} W_{ij} ĥ_{jn} )² + Σ_{i=1}^{n_v} Σ_{j=1}^{n_h} W_{ij}² ĥ_{jn} (1 − ĥ_{jn}) ].    (9)

The first term corresponds to the quadratic reconstruction error of an autoencoder with tied weights. From this we can see that this type of autoencoder, which researchers have previously treated as a different model, can in fact be explained by the application of the score matching estimation principle to Gaussian RBMs.
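As a concrete illustration of Equation (9), here is a minimal NumPy sketch of the standardized objective (c = 0, σ = 1, constants dropped); the function and variable names are ours, not taken from the paper's code:

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def grbm_score_matching_objective(V, W, b):
        """Standardized Gaussian-binary RBM score matching objective, as in Equation (9).

        V : (N, n_v) data matrix, W : (n_v, n_h) weights, b : (n_h,) hidden biases.
        Returns the empirical objective with constant terms dropped.
        """
        H = sigm(V @ W + b)                      # \hat{h}_{jn} for every example
        recon = V - H @ W.T                      # tied-weight autoencoder residual v_n - W \hat{h}_n
        recon_term = 0.5 * np.sum(recon ** 2, axis=1)
        reg_term = np.sum((W ** 2)[None, :, :] * (H * (1 - H))[:, None, :], axis=(1, 2))
        return np.mean(recon_term + reg_term)

    # Tiny usage example with random data and parameters.
    rng = np.random.default_rng(0)
    V = rng.normal(size=(100, 16))
    W = 0.1 * rng.normal(size=(16, 8))
    b = np.zeros(8)
    print(grbm_score_matching_objective(V, W, b))

The first term is exactly the squared reconstruction error of a tied-weight autoencoder with encoder sigm(Wᵀv + b) and decoder W, and the second term is the weight- and activation-dependent regularizer discussed above.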

Example 2: Score matching for mcRBM. The energy E_θ(v, h^m, h^c) of the mcRBM model for each data point includes mean Bernoulli hidden units h^m_j ∈ {0, 1} and covariance Bernoulli hidden units h^c_k ∈ {0, 1}. The latter allow one to model correlations in the data v (Ranzato & Hinton, 2010; Ranzato et al., 2010a). To ease the notation, we will ignore the index over the data points. The energy for this model is:

    E_θ(v, h^m, h^c) = −½ Σ_{f=1}^{n_f} Σ_{k=1}^{n_{hc}} P_{fk} h^c_k ( Σ_{i=1}^{n_v} C_{if} v_i )² − Σ_{i=1}^{n_v} Σ_{j=1}^{n_{hm}} W_{ij} h^m_j v_i − Σ_{j=1}^{n_{hm}} b^m_j h^m_j − Σ_{k=1}^{n_{hc}} b^c_k h^c_k − Σ_{i=1}^{n_v} b^v_i v_i + ½ Σ_{i=1}^{n_v} v_i²,    (10)

where θ = (b^v, b^m, b^c, P, W, C). This leads to the free energy F_θ(v):

    F_θ(v) = −Σ_{k=1}^{n_{hc}} log( 1 + e^{φ^c_k} ) − Σ_{j=1}^{n_{hm}} log( 1 + e^{φ^m_j} ) − Σ_{i=1}^{n_v} b^v_i v_i + ½ Σ_{i=1}^{n_v} v_i²,    (11)

where φ^c_k = ½ Σ_{f=1}^{n_f} P_{fk} ( Σ_{i=1}^{n_v} C_{if} v_i )² + b^c_k and φ^m_j = Σ_{i=1}^{n_v} W_{ij} v_i + b^m_j. The corresponding score matching objective is:

    J(θ) = Σ_{i=1}^{n_v} [ ½ ψ_i(p_θ(v))² + Σ_{k=1}^{n_{hc}} ( ρ(ĥ^c_k) D_{ki}² + ĥ^c_k K_{ki} ) + Σ_{j=1}^{n_{hm}} ĥ^m_j (1 − ĥ^m_j) W_{ij}² − 1 ],    (12)

where

    ψ_i(p_θ(v)) = Σ_{k=1}^{n_{hc}} ĥ^c_k D_{ki} + Σ_{j=1}^{n_{hm}} ĥ^m_j W_{ij} + b^v_i − v_i,
    K_{ki} = Σ_{f=1}^{n_f} P_{fk} C_{if}²,
    D_{ki} = Σ_{f=1}^{n_f} P_{fk} C_{if} ( Σ_{i′=1}^{n_v} C_{i′f} v_{i′} ),
    ĥ^c_k = sigm(φ^c_k),  ĥ^m_j = sigm(φ^m_j),  ρ(x) := x(1 − x).

Example 3: Score matching for mPoT. The energy E_θ(v, h^m, h^c) of the mPoT model is:

    E_θ(v, h^m, h^c) = Σ_{k=1}^{n_{hc}} [ h^c_k ( 1 + ½ ( Σ_{i=1}^{n_v} C_{ik} v_i )² ) + (1 − γ) log h^c_k ] + ½ Σ_{i=1}^{n_v} v_i² − Σ_{i=1}^{n_v} b^v_i v_i − Σ_{j=1}^{n_{hm}} Σ_{i=1}^{n_v} h^m_j W_{ij} v_i − Σ_{j=1}^{n_{hm}} b^m_j h^m_j,    (13)

where θ = (γ, W, C, b^v, b^m), h^c is a vector of Gamma covariance latent variables, C is a filter bank and γ is a scalar parameter. This leads to the free energy F_θ(v):

    F_θ(v) = Σ_{k=1}^{n_{hc}} γ log( 1 + φ^c_k ) − Σ_{j=1}^{n_{hm}} log( 1 + e^{φ^m_j} ) − Σ_{i=1}^{n_v} b^v_i v_i + ½ Σ_{i=1}^{n_v} v_i²,    (14)

where φ^c_k = ½ ( Σ_{i=1}^{n_v} C_{ik} v_i )² and φ^m_j = Σ_{i=1}^{n_v} W_{ij} v_i + b^m_j. The corresponding score matching objective J(θ) is equivalent to the objective given in Equation (12) with the following redefinition of terms:

    P = I_{n_{hc}},
    ĥ^c_k = γ ϕ(φ^c_k),    (15)
    ĥ^m_j = sigm(φ^m_j),    (16)
    ϕ(x) := 1 / (1 + x),  ρ(x) := x²,

where I_{n_{hc}} is the n_{hc} × n_{hc} identity matrix.
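To make the quantities in Equation (12) concrete, the sketch below evaluates the model score ψ(p_θ(v)) for the mcRBM from the definitions of φ^c, φ^m, ĥ^c, ĥ^m and D given above (a hypothetical NumPy rendering under the conventions written here; array shapes and names are ours, not the paper's):

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def mcrbm_score(v, W, C, P, bv, bm, bc):
        """Model score psi_i(p_theta(v)) for the mcRBM free energy above.

        v : (n_v,), W : (n_v, n_hm), C : (n_v, n_f), P : (n_f, n_hc),
        bv : (n_v,), bm : (n_hm,), bc : (n_hc,).
        """
        f = C.T @ v                                   # filter responses, one per factor
        phi_c = 0.5 * (P.T @ f**2) + bc               # covariance-unit inputs phi^c_k
        phi_m = W.T @ v + bm                          # mean-unit inputs phi^m_j
        hc = sigm(phi_c)                              # \hat{h}^c_k
        hm = sigm(phi_m)                              # \hat{h}^m_j
        D = (C * f[None, :]) @ P                      # D[i, k] = sum_f P_fk C_if (C^T v)_f
        return D @ hc + W @ hm + bv - v               # psi_i = sum_k hc_k D_ki + sum_j hm_j W_ij + bv_i - v_i

    # Tiny usage example with random parameters, to show the expected shapes.
    rng = np.random.default_rng(0)
    n_v, n_hm, n_f, n_hc = 16, 8, 32, 8
    v = rng.normal(size=n_v)
    W = 0.1 * rng.normal(size=(n_v, n_hm))
    C = 0.1 * rng.normal(size=(n_v, n_f))
    P = rng.uniform(size=(n_f, n_hc))
    bv, bm, bc = np.zeros(n_v), np.zeros(n_hm), np.zeros(n_hc)
    print(mcrbm_score(v, W, C, P, bv, bm, bc).shape)   # (16,)

As far as the redefinitions in Equations (15) and (16) indicate, the same routine covers the mPoT case by replacing P with the identity matrix, setting bc to zero, and replacing sigm(φ^c_k) with γ/(1 + φ^c_k).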

In each of these examples, we see that an objective emerges which seeks to minimize a form of regularized reconstruction error, and that the forms of these regularizers can end up being quite different. Rather than trying to interpret score matching on a case by case basis, we provide a general theorem for all latent EBMs on which score matching can be applied:

Theorem 1. The score matching objective, Equation (4), for a latent energy based model can be expressed succinctly in terms of either the free energy or expectations of the energy with respect to the conditional distribution p(h|v). Specifically,

    J(θ) = E_{p(v)} [ Σ_i ( ½ ψ_i(p_θ(v))² + ∂ψ_i(p_θ(v))/∂v_i ) ]
         = E_{p(v)} [ Σ_i ( ½ E_{p_θ(h|v)}[ ∂E_θ(v, h)/∂v_i ]² + var_{p_θ(h|v)}[ ∂E_θ(v, h)/∂v_i ] − E_{p_θ(h|v)}[ ∂²E_θ(v, h)/∂v_i² ] ) ].

Corollary 1. If the energy function of a latent EBM E_θ(v, h) takes the following form:

    E_θ(v, h) = ½ (v − µ(h))ᵀ Ω(h) (v − µ(h)) + g(h),

where µ(h) is an arbitrary vector-valued function of length n_v, g(h) is an arbitrary scalar function, and Ω(h) is an n_v × n_v positive-definite matrix-valued function, then the vector-valued score function ψ(p_θ(v)) will be:

    ψ(p_θ(v)) = −E_{p_θ(h|v)}[ Ω(h)(v − µ(h)) ].

As a result, the score matching objective can be expressed as:

    J(θ) = E_{p(v)} [ Σ_i ( ½ E_{p_θ(h|v)}[ (Ω(h)(v − µ(h)))_i ]² + var_{p_θ(h|v)}[ (Ω(h)(v − µ(h)))_i ] − E_{p_θ(h|v)}[ Ω_{ii}(h) ] ) ].

The proofs of Theorem 1 and Corollary 1 are straightforward, and can be found in an online appendix to this paper (smpaper-appendix.pdf). Corollary 1 states that score matching applied to a Gaussian latent EBM will always result in a quadratic reconstruction term, with penalties to minimize the variance of the reconstruction and to maximize the expected curvature of the energy with respect to v. This shows that we can develop new autoencoder architectures in a principled way by simply starting with an EBM and applying score matching.

One further connection between the two models is that one step of gradient descent on the free energy F_θ(v) of an EBM corresponds to one feed-forward step of an autoencoder. To see this, consider the mPoT model. If we start at some visible configuration v and update a single dimension i:

    v_i^{(t+1)} = v_i^{(t)} − η ∂F_θ(v)/∂v_i = v_i^{(t)} + η ( Σ_{k=1}^{n_{hc}} ĥ^c_k D_{ki} + Σ_{j=1}^{n_{hm}} ĥ^m_j W_{ij} + b^v_i − v_i^{(t)} ).

Then setting η = 1, the v_i^{(t)} terms cancel and we get:

    v_i^{(t+1)} = Σ_{k=1}^{n_{hc}} ĥ^c_k D_{ki} + Σ_{j=1}^{n_{hm}} ĥ^m_j W_{ij} + b^v_i.    (17)

This corresponds to the reconstruction produced by mPoT in its score matching objective. In general, an autoencoder reconstruction can be produced by taking a single step of gradient descent along the free energy of its corresponding EBM.
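This single-gradient-step correspondence is easy to check numerically. The sketch below does so for the Gaussian-binary RBM of Example 1 rather than for mPoT, since its free energy gradient is shorter to write down; it is a hypothetical NumPy illustration, not code from the paper:

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def grbm_free_energy_grad(v, W, b, c):
        """Gradient of the Gaussian-binary RBM free energy (sigma = 1) with respect to v."""
        h_hat = sigm(W.T @ v + b)
        return (v - c) - W @ h_hat

    rng = np.random.default_rng(0)
    n_v, n_h = 16, 8
    W = 0.1 * rng.normal(size=(n_v, n_h))
    b, c = np.zeros(n_h), np.zeros(n_v)
    v = rng.normal(size=n_v)

    # One gradient-descent step on F with step size eta = 1 ...
    v_next = v - 1.0 * grbm_free_energy_grad(v, W, b, c)
    # ... equals the tied-weight autoencoder reconstruction c + W sigm(W^T v + b).
    recon = c + W @ sigm(W.T @ v + b)
    print(np.allclose(v_next, recon))   # True

With η = 1 the v terms cancel and the update lands exactly on the tied-weight autoencoder reconstruction, mirroring Equation (17) for mPoT.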
4. Experiments

In this section, we study several estimation methods applied to the mPoT model, including SM, SMD, CD, PCD, and FPCD, with the goal of uncovering differences in the characteristics of trained models due to variations in training methods. For our experiments, we used two datasets of images. The first dataset consists of 8,000 color image patches of size 16x16 pixels randomly extracted from the Berkeley segmentation dataset. We subtracted the per-patch means and applied PCA whitening. We retained 99% of the variance, corresponding to 105 eigenvectors. All estimation methods were applied to the mPoT model by training on mini-batches of size 128 for 100 epochs of stochastic gradient descent.

The second dataset, named CIFAR 10 (Krizhevsky, 2009), consists of color images of size 32x32 pixels belonging to one of 10 categories. The task is to classify a set of 10,000 test images. CIFAR 10 is a subset of a larger dataset of tiny images (Torralba et al., 2008). Using a protocol established in previous work (Krizhevsky, 2009; Ranzato & Hinton, 2010), we built a training dataset of 8x8 color image patches from this larger dataset, ensuring there was no overlap with CIFAR 10. The preprocessing of the data is exactly the same as for the Berkeley dataset, but here we use approximately 800,000 image patches and perform only 10 epochs of training. For our experiments, we used the Theano package and the mPoT code from Ranzato et al. (2010b).

4.1. Objective Function Analysis

From Corollary 1, we know that we can interpret score matching for mPoT as trading off reconstruction error, reconstruction variance and the expected curvature of the energy function with respect to the visible units. This experiment, using the Berkeley dataset, is designed to determine how these terms evolve over the course of training and to what degree their changes impact the final model. Figures 1(a) and 1(b) show the values of the three terms using non-noisy inputs on each training epoch, as well as the overall objective function (the sum of the three terms). Surprisingly, these results show that most of the training is involved with maximizing the expected curvature (corresponding to a lower negative curvature).

Figure 1. (a), (b), (c): Expected reconstruction error, reconstruction variance, and energy curvature for SM, SMD, and AE; "Total" represents the sum of these terms. (d) Difference of free energy between noisy and test images. (e) MSE of denoised test images using mean-field. (f) MSE of denoised test images using Bayesian MAP.

In SM, each point is relatively isolated in v-space, meaning that the objective will try to make the distribution very peaked. In SMD, each point exists near a cloud of points and so the distribution must be broader. From this perspective, SMD can be seen as a regularized version of SM that puts less emphasis on changing the expected curvature. This also seems to give SMD some room to reduce the reconstruction error. To examine the impact of regularization, we trained an autoencoder (AE) based on the mPoT model using the reconstruction given by Equation (17), which corresponds to SM without the variance and curvature terms. Figure 1(c) shows that simply optimizing the reconstruction leaves the curvature almost invariant, which agrees with the findings of Ranzato et al. (2007).

4.2. Denoising

In our next set of experiments, we compare models learned by each of the score matching estimators with models learned by the more commonly used stochastic estimators. For these experiments, we trained mPoT models corresponding to SM, SMD, FPCD, PCD, and CD. We compare the models in terms of the average free energy difference between natural image patches and patches corrupted by Gaussian noise. We also consider denoising natural image patches. (Note that for convenience, both tasks were performed in the PCA domain. We use the same standard deviation for the Gaussian noise in all cases.)

During training, we hope that the probability of natural images will increase while that of other images decreases. The free energy difference between natural and other images is equivalent to the log of their probability ratio, so we expect the free energy difference to increase during training as well. Figure 1(d) shows the difference in free energy between a test set of 10,000 image patches from the Berkeley dataset and the energy of the same images corrupted by noise. For most estimators, the free energy difference improves as training proceeds, as expected. Interestingly, SM and SMD exhibit completely opposite behaviors. SM seems to significantly increase the free energy difference relative to nearby noisy images, corresponding to a distribution that is peaked around natural images. SMD, on the other hand, actually decreases the free energy difference relative to nearby noisy images.
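The free energy difference used in Figure 1(d) is cheap to monitor because the unknown partition function cancels in the difference. A hypothetical sketch of this evaluation, using the Gaussian-binary RBM free energy as a stand-in for the mPoT free energy (all names and data below are placeholders):

    import numpy as np

    def grbm_free_energy(V, W, b, c):
        """Free energy of a Gaussian-binary RBM (sigma = 1), evaluated row-wise over V."""
        quad = 0.5 * np.sum((V - c) ** 2, axis=1)
        soft = np.sum(np.logaddexp(0.0, V @ W + b), axis=1)   # sum_j log(1 + exp(.))
        return quad - soft

    rng = np.random.default_rng(0)
    V_clean = rng.normal(size=(1000, 16))                     # stand-in for whitened test patches
    V_noisy = V_clean + 1.0 * rng.normal(size=V_clean.shape)  # additive Gaussian noise

    W = 0.1 * rng.normal(size=(16, 8))
    b, c = np.zeros(8), np.zeros(16)

    # Free energy difference = log probability ratio; Z(theta) cancels.
    diff = np.mean(grbm_free_energy(V_noisy, W, b, c) - grbm_free_energy(V_clean, W, b, c))
    print(diff)

Larger values of diff mean the model assigns relatively lower free energy, and hence higher probability, to the clean patches than to their noisy versions.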

In the next experiment, we consider an image denoising task. We take an image patch v and add Gaussian white noise, obtaining a noisy patch v′. We then apply each model to denoise each patch v′, obtaining a reconstruction v̂. The first denoising method, shown in Figure 1(e), computes a reconstruction v̂ by simulating one step of a Markov chain using a mean-field approximation. That is, we first compute ĥ^c_k and ĥ^m_j by Equations (15) and (16) using v′ as the input. The reconstruction is the expectation of the conditional distribution P_θ(v | ĥ^c_k, ĥ^m_j). The second method, shown in Figure 1(f), is the Bayesian MAP estimator:

    v̂ = argmin_v F_θ(v) + λ ‖v′ − v‖²,    (18)

where λ is a scalar representing how close the reconstruction should remain to the noisy input. We select λ by cross-validation. (A small sketch of this procedure is given at the end of this section.) The results show that score matching achieves the minimum error using both denoising approaches; however, it quickly overfits as training proceeds. FPCD and PCD do not match the minimum error of SM and also overfit, albeit to a lesser extent. CD and SMD do not appear to overfit. However, we note that the minimum error obtained by SMD is significantly higher than the minimum error obtained by SM using both denoising methods. This is quite intuitive since SMD is equivalent to estimating the model using a smoothed training distribution that shifts mass onto nearby noisy images.

4.3. Feature Extraction and Classification

One of the primary uses for latent EBMs is to generate discriminative features. Table 1 shows the result of using each method to extract features on the benchmark CIFAR 10 dataset. We follow the protocol of Ranzato & Hinton (2010) with early stopping. We use a validation set to select regularization parameters.

Table 1. Recognition accuracy on CIFAR 10.
CD: 64.6%   PCD: 64.7%   FPCD: 65.5%   SM: 65.0%   SMD: 64.7%   AE: 57.6%

With the exception of AE, all methods appear to do well and the differences between them are not statistically significant. AE, on the other hand, does significantly worse.

Figure 2. mPoT filters learned using different estimation methods: (a) mean filters, (b) covariance filters.

Finally, we show examples of filters learned by each method. Figure 2(a) shows a random subset of mean filters corresponding to the columns of W, while Figure 2(b) shows a random subset of covariance filters corresponding to the columns of C. Interestingly, only FPCD and PCD show structure in the learned mean filters. In the covariance units, all methods except AE learn localized Gabor-like filters. It is well known that obtaining nice looking filters will usually correlate with good performance, but it is not always clear what leads to these filters. We have shown here that one way to obtain good qualitative and quantitative performance is to focus on appropriately modeling the curvature of the energy with respect to v. In this context, the SM reconstruction and variance terms serve to ensure that the peaks of the distribution occur around the training cases.
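As noted above, a small sketch of the Bayesian MAP denoiser in Equation (18): it runs plain gradient descent on F_θ(v) + λ‖v′ − v‖², here with the Gaussian-binary RBM free energy standing in for the mPoT free energy of Equation (14) (hypothetical NumPy code, not the implementation used in the experiments):

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def grbm_free_energy_grad(v, W, b, c):
        """Gradient of the Gaussian-binary RBM free energy (sigma = 1) with respect to v."""
        return (v - c) - W @ sigm(W.T @ v + b)

    def map_denoise(v_noisy, W, b, c, lam=1.0, step=0.1, n_steps=200):
        """Approximate argmin_v F(v) + lam * ||v_noisy - v||^2 by gradient descent."""
        v = v_noisy.copy()
        for _ in range(n_steps):
            grad = grbm_free_energy_grad(v, W, b, c) + 2.0 * lam * (v - v_noisy)
            v -= step * grad
        return v

    rng = np.random.default_rng(0)
    W = 0.1 * rng.normal(size=(16, 8))
    b, c = np.zeros(8), np.zeros(16)
    v_clean = rng.normal(size=16)
    v_noisy = v_clean + 1.0 * rng.normal(size=16)
    print(np.mean((map_denoise(v_noisy, W, b, c) - v_clean) ** 2))   # MSE of the denoised patch

In the experiments, λ would be chosen by cross-validation and the gradient would come from the mPoT free energy instead of this stand-in.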

5. Conclusion

By applying score matching to the energy space of a latent EBM, as opposed to the free energy space, we gain an intuitive interpretation of the score matching objective. We can always break the objective down into three terms corresponding to expectations under the conditional distribution of the hidden units: reconstruction, reconstruction variance, and curvature. We have determined that for the Gaussian-binary RBM, the reconstruction term will always correspond to an autoencoder with tied weights. While autoencoders and RBMs were previously considered to be related, but separate, models, this analysis shows that they can be interpreted as different estimators applied to the same underlying model. We also showed that one can derive novel autoencoders by applying score matching to more complex EBMs. This allows us to think about models in terms of EBMs before creating a corresponding autoencoder to leverage fast inference. Furthermore, this framework provides guidance on selecting principled regularization functions for autoencoder training, leading to improved representations. Our experiments show that not only does score matching yield similar performance to existing estimation methods when applied to classification, but that shaping the curvature of the energy appropriately may be important for generating good features. While this seems obvious for probabilistic EBMs, it has previously been difficult to apply to autoencoders because they were not thought of as having a corresponding energy function. Now that we know which statistics may be important to monitor during training, it would be interesting to see what happens when other heuristics, such as sparsity, are applied to help generate interpretable features.

References

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.

Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771-1800, 2002.

Hinton, G.E. and Zemel, R.S. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, pp. 3-10, 1994.

Hinton, G.E., Osindero, S., and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Hyvärinen, A. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695-709, 2005.

Kingma, D. and LeCun, Y. Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, 2010.

Krizhevsky, A. Learning multiple layers of features from tiny images, 2009. MSc Thesis, Dept. of Comp. Science, U. of Toronto.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F.J. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

Marlin, B.M., Swersky, K., Chen, B., and de Freitas, N. Inductive principles for restricted Boltzmann machine learning. In Artificial Intelligence and Statistics, pp. 509-516, 2010.

Memisevic, R. and Hinton, G.E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22:1473-1492, 2009.

Ranzato, M. and Hinton, G.E. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In IEEE Computer Vision and Pattern Recognition, pp. 2551-2558, 2010.

Ranzato, M., Boureau, Y.L., Chopra, S., and LeCun, Y. A unified energy-based framework for unsupervised learning. In Artificial Intelligence and Statistics, 2007.

Ranzato, M., Krizhevsky, A., and Hinton, G.E. Factored 3-way restricted Boltzmann machines for modeling natural images. In Artificial Intelligence and Statistics, pp. 621-628, 2010a.

Ranzato, M., Mnih, V., and Hinton, G.E. How to generate realistic images using gated MRF's. In Advances in Neural Information Processing Systems, pp. 2002-2010, 2010b.

Schmidt, U., Gao, Q., and Roth, S. A generative perspective on MRFs in low-level vision. In IEEE Computer Vision and Pattern Recognition, 2010.

Swersky, K., Chen, B., Marlin, B.M., and de Freitas, N. A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In Information Theory and Applications Workshop, pp. 1-10, 2010.

Teh, Y.W., Welling, M., Osindero, S., and Hinton, G.E. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235-1260, 2003.

Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In International Conference on Machine Learning, pp. 1064-1071, 2008.

Tieleman, T. and Hinton, G.E. Using fast weights to improve persistent contrastive divergence. In International Conference on Machine Learning, 2009.

Torralba, A., Fergus, R., and Freeman, W.T. 80 million tiny images: A large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:1958-1970, 2008.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, to appear, 2011.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, pp. 1096-1103, 2008.

Welling, M., Hinton, G.E., and Osindero, S. Learning sparse topographic representations with products of Student-t distributions. In Advances in Neural Information Processing Systems, 2003.

Welling, M., Rosen-Zvi, M., and Hinton, G.E. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, 2005.

Younes, L. Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 82(4):625-645, 1989.
