A Continuous Restricted Boltzmann Machine with a Hardware-Amenable Learning Algorithm

Hsin Chen and Alan Murray

Dept. of Electronics and Electrical Engineering, University of Edinburgh, Mayfield Rd., Edinburgh EH9 3JL, UK, {Hsin.Chen,A.F.Murray}@ee.ed.ac.uk

Abstract. This paper proposes a continuous stochastic generative model that offers an improved ability to model analogue data, with a simple and reliable learning algorithm. The architecture forms a continuous restricted Boltzmann Machine, with a novel learning algorithm. The capabilities of the model are demonstrated with both artificial and real data.

1 Introduction

Probabilistic generative models offer a flexible route to improved data modelling, wherein the stochasticity represents the natural variability of real data. Our primary interest is in processing and modelling analogue data close to a sensor interface. It is therefore important that such models are amenable to analogue or mixed-mode VLSI implementation. The Product of Experts (PoE) has been shown to be a flexible architecture, and Minimising Contrastive Divergence (MCD) can underpin a simple learning rule [1]. The Restricted Boltzmann Machine (RBM) [2] with an MCD rule has been shown to be amenable to further simplification and use in real applications [3].

The RBM has one hidden and one visible layer, with only inter-layer connections. Let s_i and s_j represent the states of the stochastic units i and j, and let w_ij be the interconnecting weights. The MCD rule for the RBM replaces the computationally expensive relaxation search of the Boltzmann Machine with:

\[
\Delta w_{ij} = \eta \left( \langle s_i s_j \rangle_0 - \langle \hat{s}_i \hat{s}_j \rangle_1 \right) \tag{1}
\]

where ŝ_i and ŝ_j correspond to one-step Gibbs-sampled reconstruction states, and ⟨·⟩ denotes the expectation value over the training data. By approximating the probabilities of visible units as analogue-valued states, the RBM can model analogue data [1][3]. However, the binary nature of the hidden units causes the RBM to tend to reconstruct only symmetric analogue data, as will be shown in Sect. 3. The rate-coded RBM (RBMrate) [4] removes this limitation by sampling each stochastic unit m times. The RBMrate unit thus has discrete-valued states, while retaining the simple learning algorithm of (1). RBMrate offers an improved ability to model analogue images [4], but the repetitive sampling will cause more spiking noise in the power supplies of a VLSI implementation, placing the circuits in danger of synchronisation [5].
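To make the update in (1) concrete, the following is a minimal NumPy sketch of a one-step MCD (CD-1) weight update for a binary RBM. It is illustrative only: bias terms are omitted, and the function and variable names (mcd_update, v0, and so on) are ours, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mcd_update(W, v0, eta=0.1):
    """One MCD (CD-1) weight update for a binary RBM, following (1).

    W  : (n_visible, n_hidden) weight matrix
    v0 : (batch, n_visible) training data clamped on the visible layer
    """
    # Positive phase: sample binary hidden states from the data.
    h0 = (sigmoid(v0 @ W) > rng.random((v0.shape[0], W.shape[1]))).astype(float)
    # Negative phase: one-step Gibbs reconstruction. The visible
    # probabilities are kept as analogue-valued states, as described above.
    v1 = sigmoid(h0 @ W.T)
    h1 = sigmoid(v1 @ W)
    # Batch estimate of eta * (<s_i s_j>_0 - <s^_i s^_j>_1).
    return W + eta * (v0.T @ h0 - v1.T @ h1) / v0.shape[0]
```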
2 The Continuous Restricted Boltzmann Machine

2.1 A Continuous Stochastic Unit

Adding a zero-mean Gaussian with variance σ² to the input of a sampled sigmoidal unit produces a continuous stochastic unit as follows:

\[
s_j = \varphi_j \left( \sum_i w_{ij} s_i + \sigma \cdot N_j(0,1) \right) \tag{2}
\]

with

\[
\varphi_j(x_j) = \theta_L + (\theta_H - \theta_L) \cdot \frac{1}{1 + \exp(-a_j x_j)} \tag{3}
\]

where N_j(0,1) represents a unit Gaussian, and ϕ_j(x) is a sigmoid function with lower and upper asymptotes at θ_L and θ_H, respectively. Parameter a_j controls the steepness of the sigmoid function, and thus the nature of the unit's stochastic behaviour. A small value of a_j renders the input noise negligible and leads to a near-deterministic unit, while a large value of a_j leads to a binary stochastic unit. If the value of a_j renders the sigmoid linear over the range of the added noise, the probability distribution of s_j remains Gaussian, with mean Σ_i w_ij s_i and variance σ². Replacing the binary stochastic unit in the RBM by this continuous form of stochastic unit leads to a continuous RBM (CRBM).
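The sampling step of (2) and (3) is simple enough to sketch directly. In the NumPy fragment below, the asymptote values theta_L = -1 and theta_H = 1 and the noise level sigma = 0.2 are illustrative assumptions, not values fixed by the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, a, theta_L=-1.0, theta_H=1.0):
    """Sigmoid with asymptotes theta_L, theta_H and gain a, as in (3)."""
    return theta_L + (theta_H - theta_L) / (1.0 + np.exp(-a * x))

def sample_unit(s, w_j, a_j, sigma=0.2):
    """Sample the continuous stochastic unit s_j of (2).

    s   : states of the units feeding unit j
    w_j : the corresponding weights w_ij
    """
    noise = sigma * rng.standard_normal()  # sigma * N_j(0, 1)
    return phi(np.dot(s, w_j) + noise, a_j)
```

With a small gain a_j the near-linear sigmoid barely amplifies the added noise, while a large gain drives the output towards the asymptotes, reproducing the two limiting behaviours described above.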
2.2 CRBM and Diffusion Network

The model and learning algorithms of the Diffusion Network (DN) [6][7] arise from its continuous stochastic behaviour, as described by a stochastic differential equation. A DN consists of n fully-connected units and an n × n real-valued matrix W defining the connection weights. Let x_j(t) be the state of neuron j in a DN. The dynamical diffusion process is described by the Langevin equation:

\[
dx_j(t) = \kappa_j \left( \sum_i w_{ij}\, \varphi_i(x_i(t)) - \rho_j x_j(t) \right) dt + \sigma\, dB_j(t) \tag{4}
\]

where 1/κ_j > 0 and 1/ρ_j > 0 represent the input capacitance and resistance of neuron j, and dB_j(t) is the Brownian-motion differential [7]. The increment B_j(t + dt) − B_j(t) is thus a zero-mean Gaussian random variable with variance dt. The discrete-time diffusion process for a finite time increment Δt is:

\[
x_j(t+\Delta t) = x_j(t) + \kappa_j \sum_i w_{ij}\, \varphi_i(x_i(t))\, \Delta t - \kappa_j \rho_j x_j(t)\, \Delta t + \sigma z_j(t) \sqrt{\Delta t} \tag{5}
\]

where z_j(t) is a Gaussian random variable with zero mean and unit variance. If κ_j ρ_j Δt = 1, the terms in x_j(t) cancel and, writing σ' = σ√Δt, this becomes:

\[
x_j(t+\Delta t) = \kappa_j \sum_i w_{ij}\, \varphi_i(x_i(t))\, \Delta t + \sigma' z_j(t) \tag{6}
\]

If w_ij = w_ji and κ_j is constant over the network, the RHS of (6) is equivalent to the total input of a CRBM unit as given by (2). As s_j = ϕ_j(x_j), the CRBM is simply a symmetric restricted DN (RDN), and the learning algorithm of the DN is thus a useful candidate for the CRBM.

2.3 MCD Learning Algorithms for the CRBM

The learning rule for a parameter λ_ij of the DN is [6]:

\[
\Delta \lambda_{ij} = \langle S_{\lambda_{ij}} \rangle_0 - \langle S_{\lambda_{ij}} \rangle_\infty \tag{7}
\]

where ⟨·⟩_0 refers to the expectation value over the training data with the visible states clamped, and ⟨·⟩_∞ to that in free-running equilibrium. S_λij is the system covariate [6], the negative derivative of the DN's energy function with respect to the parameter λ_ij. The restricted DN can be shown to be a PoE [8], and we choose to simplify (7) by once again minimising contrastive divergence [1]:

\[
\Delta \hat{\lambda}_{ij} = \langle S_{\lambda_{ij}} \rangle_0 - \langle \hat{S}_{\lambda_{ij}} \rangle_1 \tag{8}
\]

where ⟨·⟩_1 indicates the expectation value over the one-step-sampled data. Let ϕ(s) represent ϕ_j(s) with a_j = 1. The energy function of the CRBM can be shown to be similar to that of the continuous Hopfield model [9][6]:

\[
U = -\frac{1}{2} \sum_{ij} w_{ij} s_i s_j + \sum_j \frac{\rho_j}{a_j} \int_0^{s_j} \varphi^{-1}(s)\, ds \tag{9}
\]

Equations (8) and (9) then lead to the MCD learning rules for the CRBM's parameters:

\[
\Delta \hat{w}_{ij} = \eta_w \left( \langle s_i s_j \rangle_0 - \langle \hat{s}_i \hat{s}_j \rangle_1 \right) \tag{10}
\]

\[
\Delta \hat{a}_j = \frac{\eta_a \rho_j}{a_j^2} \left\langle \int_{\hat{s}_j}^{s_j} \varphi^{-1}(s)\, ds \right\rangle \tag{11}
\]

where ŝ_j denotes the one-step-sampled state of unit j, and ⟨·⟩ in (11) refers to the expectation value over the training data. To simplify the hardware design, we approximate the integral term in (11) as:

\[
\int_{\hat{s}_j}^{s_j} \varphi^{-1}(s)\, ds \approx (s_j + \hat{s}_j)(s_j - \hat{s}_j) \tag{12}
\]

The training rules for w_ij and a_j thus require only addition and multiplication of local unit states.
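To illustrate how (10), (11) and (12) combine into a training step, here is a hedged NumPy sketch of one CD-1 update for a CRBM. Beyond what is stated above it assumes, for illustration, that both visible and hidden gains are adapted, that ρ_j is a shared constant, and that bias units are omitted; phi is the sigmoid of (3).

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, a, theta_L=-1.0, theta_H=1.0):
    """Sigmoid of (3), broadcast over a vector of gains a."""
    return theta_L + (theta_H - theta_L) / (1.0 + np.exp(-a * x))

def crbm_cd1_step(W, a_v, a_h, v0, eta_w=1.5, eta_a=1.0,
                  sig_v=0.2, sig_h=0.2, rho=1.0):
    """One MCD training step for a CRBM, following (2) and (10)-(12).

    W        : (n_v, n_h) symmetric inter-layer weights
    a_v, a_h : gain factors of the visible / hidden units
    v0       : (batch, n_v) training data
    """
    n, (n_v, n_h) = v0.shape[0], W.shape
    # Sample hidden states from the data, using (2).
    h0 = phi(v0 @ W + sig_h * rng.standard_normal((n, n_h)), a_h)
    # One-step Gibbs reconstruction of visible, then hidden, states.
    v1 = phi(h0 @ W.T + sig_v * rng.standard_normal((n, n_v)), a_v)
    h1 = phi(v1 @ W + sig_h * rng.standard_normal((n, n_h)), a_h)
    # Weight rule (10), averaged over the batch.
    W = W + eta_w * (v0.T @ h0 - v1.T @ h1) / n
    # Gain rule (11), with the integral replaced by (12):
    # (s_j + s^_j)(s_j - s^_j) = s_j^2 - s^_j^2.
    a_v = a_v + (eta_a * rho / a_v**2) * np.mean(v0**2 - v1**2, axis=0)
    a_h = a_h + (eta_a * rho / a_h**2) * np.mean(h0**2 - h1**2, axis=0)
    return W, a_v, a_h
```

Note that only sums and products of locally available unit states appear in both updates, which is what makes the rule hardware-amenable.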
3 Demonstration: Artificial Data

Two-dimensional data were generated to probe and to compare the performance of the RBM and the CRBM on analogue data (Fig. 1(a)). The data comprise two clusters of 200 data points.

Fig. 1. (a) Artificially generated analogue training data. (b) Reconstruction by the trained RBM. (c) Reconstruction by the trained CRBM. (d) Learning trace of a_j.

Figure 1(b) shows the points reconstructed from 400 random input data after 20 steps of Gibbs sampling by an RBM with six hidden units, after 4000 training epochs. The RBM's tendency to generate data in symmetric patterns is clear. Figure 1(c) shows the same result for a CRBM with four hidden units, η_w = 1.5, η_a = 1 and σ = 0.2 for all units. The evolution of the gain factor a_j of one visible unit is shown in Fig. 1(d) and displays a form of autonomous annealing, driven by (11), indicating that the approximation in (12) leads to sensible training behaviour in this stylised, but non-trivial, example.
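The reconstruction procedure behind Fig. 1(b) and 1(c) is plain Gibbs sampling from random starting points. The fragment below (reusing phi and rng from the training sketch in Sect. 2.3) shows one way to reproduce it; the uniform initialisation range is our assumption, not stated in the text.

```python
def reconstruct(W, a_v, a_h, v, steps=20, sig_v=0.2, sig_h=0.2):
    """Run `steps` full Gibbs updates from initial visible states v."""
    for _ in range(steps):
        h = phi(v @ W + sig_h * rng.standard_normal((v.shape[0], W.shape[1])), a_h)
        v = phi(h @ W.T + sig_v * rng.standard_normal(v.shape), a_v)
    return v

# For example, 400 random two-dimensional inputs, as used for Fig. 1(b)-(c):
# v0 = rng.uniform(-1.0, 1.0, size=(400, 2))
# points = reconstruct(W, a_v, a_h, v0)
```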
4 Demonstration: Real Heart-Beat (ECG) Data

To highlight the improved modelling richness of the CRBM, and to give these results credence, a CRBM with four hidden units was trained to model the ECG data used in [3] and [10]. The ECG trace was divided into one training dataset of 500 heartbeats and one test dataset of 1700 heartbeats, each heartbeat comprising 65 samples. The 500 training data contain six Ventricular Ectopic Beats (VEBs), while the 1700 test data contain 27 VEBs. The CRBM was trained for 4000 epochs with η_w = 1.5, η_a = 1, σ = 0.2 for visible units and σ = 0.5 for hidden units. Figure 2 shows the reconstruction by the trained CRBM, after 20 steps of unclamped Gibbs sampling, from initial visible states set to (a) an observed normal QRS complex and (b) an observed typical VEB. The CRBM models both forms of heartbeat successfully, although VEBs represent only 1% of the training data.

Fig. 2. Reconstruction by the trained CRBM with input of (a) a normal QRS and (b) a typical VEB. (c) The receptive fields of the hidden bias and the four hidden units.

Following [3], Fig. 2(c) shows the receptive fields of the hidden bias unit and the four hidden units. The bias unit codes an average normal QRS complex, and H3 adds to the P- and T-waves. H1 and H2 drive a small horizontal shift and a magnitude variation of the QRS complex. Finally, H4 encodes the significant dip found in a VEB.

The most principled detector of VEBs in test data is the log-likelihood under the trained CRBM. However, computing the log-likelihood requires complicated hardware. Figure 2(c) suggests that the activities of the hidden units may be usable as the basis of a simple novelty detector. For example, the activities of H4 corresponding to the 1700 test data are shown in Fig. 3, with the noise source in equation (2) removed. The peaks indicate the VEBs clearly. VL in Fig. 3 indicates the minimum H4 activity for a VEB, and QH marks the datum with maximum H4 activity for a normal heartbeat.

Fig. 3. The activities of H4 corresponding to the 1700 test data.

Therefore, even a simple linear classifier with a threshold set between the two dashed lines will detect the VEBs with an accuracy of 100%. The margin for the threshold is more than 0.5, equivalent to 25% of the total value range. A single hidden-unit activity in a CRBM is therefore potentially a reliable novelty detector, and it is expected that layering a supervised classifier on the CRBM, to fuse the hidden-unit activities, will lead to improved results.
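As a sketch of this detector, the fragment below computes the deterministic (noise-free) H4 activity and thresholds it. It continues the Python sketches above (reusing phi), and it assumes, purely for illustration, that H4 is stored at hidden index 3 and that a threshold between QH and VL has been read off Fig. 3; neither detail is given in the text.

```python
def h4_activity(W, a_h, v, unit=3):
    """Deterministic activity of one hidden unit: (2) with the noise removed."""
    return phi(v @ W[:, unit], a_h[unit])

def detect_veb(W, a_h, v_test, threshold):
    """Flag heartbeats whose H4 activity rises above the threshold."""
    return h4_activity(W, a_h, v_test) > threshold
```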
5 Conclusion

The CRBM can model analogue data successfully with a simplified MCD rule. Experiments with real ECG data further show that the activities of the CRBM's hidden units may function as a simple but reliable novelty detector. Component circuits of the RBM with the MCD rule have been implemented successfully [5][11]. The CRBM is therefore a promising continuous stochastic model for VLSI implementation and for embedded intelligent systems.

References

1. Hinton, G.E.: Training Products of Experts by minimising contrastive divergence. Technical Report TR2000-004, Gatsby Computational Neuroscience Unit (2000)
2. Smolensky, P.: Information processing in dynamical systems: Foundations of harmony theory. In: Parallel Distributed Processing, Vol. 1 (1986) 195-281
3. Murray, A.F.: Novelty detection using products of simple experts - a potential architecture for embedded systems. Neural Networks 14(9) (2001) 1257-1264
4. Teh, Y.W., Hinton, G.E.: Rate-coded Restricted Boltzmann Machines for face recognition. In: Advances in Neural Information Processing Systems, Vol. 13 (2001)
5. Woodburn, R.J., Astaras, A.A., Dalzell, R.W., Murray, A.F., McNeill, D.K.: Computing with uncertainty in probabilistic neural networks on silicon. In: Proc. 2nd Int. ICSC Symp. on Neural Computation (1999) 470-476
6. Movellan, J.R.: A learning theorem for networks at detailed stochastic equilibrium. Neural Computation 10(5) (1998) 1157-1178
7. Movellan, J.R., Mineiro, P., Williams, R.J.: A Monte Carlo EM approach for partially observable diffusion processes: Theory and applications to neural networks. Neural Computation (in press)
8. Marks, T.K., Movellan, J.R.: Diffusion networks, products of experts, and factor analysis. UCSD MPLab Technical Report 2001.02 (2001)
9. Hopfield, J.J.: Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81(10) (1984) 3088-3092
10. Tarassenko, L., Clifford, G.: Detection of ectopic beats in the electrocardiogram using an auto-associative neural network. Neural Processing Letters 14(1) (2001) 15-25
11. Fleury, P., Woodburn, R.J., Murray, A.F.: Matching analogue hardware with applications using the Products of Experts algorithm. In: Proc. IEEE European Symp. on Artificial Neural Networks (2001) 63-67