A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach


Herbert Jaeger
Fraunhofer Institute for Autonomous Intelligent Systems (AIS)
since 2003: International University Bremen

First published: Oct. 2002. First revision: Feb. 2004. Second revision: March 2005.

Abstract: This tutorial is a worked-out version of a 5-hour course originally held at AIS in September/October 2002. It has two distinct components. First, it contains a mathematically-oriented crash course on traditional training methods for recurrent neural networks, covering back-propagation through time (BPTT), real-time recurrent learning (RTRL), and extended Kalman filtering approaches (EKF). This material is covered in Sections 2-5. The remaining Sections 1 and 6-9 are much more gentle, more detailed, and illustrated with simple examples. They are intended to be useful as a stand-alone tutorial for the echo state network (ESN) approach to recurrent neural network training.

The author apologizes for the poor layout of this document: it was transformed from an html file into a Word file...

This manuscript was first printed in October 2002 as H. Jaeger (2002): Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (GMD Report 159, German National Research Center for Information Technology).

Revision history:
01/04/2004: several serious typos/errors in Sections 3 and 5
03/05/2004: numerous typos
21/03/2005: errors in Section 8.1, updated some URLs

Index

1. Recurrent neural networks
1.1 First impression
1.2 Supervised training: basic scheme
1.3 Formal description of RNNs
1.4 Example: a little timer network
2. Standard training techniques for RNNs
2.1 Backpropagation revisited
2.2 Backpropagation through time
3. Real-time recurrent learning
4. Higher-order gradient descent techniques
5. Extended Kalman-filtering approaches
5.1 The extended Kalman filter
5.2 Applying EKF to RNN weight estimation
6. Echo state networks
6.1 Training echo state networks
6.1.1 First example: a sinewave generator
6.1.2 Second example: a tuneable sinewave generator
6.2 Training echo state networks: mathematics of echo states
6.3 Training echo state networks: algorithm
6.4 Why echo states?
6.5 Liquid state machines
7. Short term memory in ESNs
7.1 First example: training an ESN as a delay line
7.2 Theoretical insights
8. ESNs with leaky integrator neurons
8.1 The neuron model
8.2 Example: slow sinewave generator
9. Tricks of the trade
References

1. Recurrent neural networks

1.1 First impression

There are two major types of neural networks, feedforward and recurrent. In feedforward networks, activation is "piped" through the network from input units to output units (from left to right in the left drawing in Fig. 1.1).

Figure 1.1: Typical structure of a feedforward network (left) and a recurrent network (right).

Short characterization of feedforward networks:

- typically, activation is fed forward from input to output through "hidden layers" ("Multi-Layer Perceptrons", MLP), though many other architectures exist
- mathematically, they implement static input-output mappings (functions)
- basic theoretical result: MLPs can approximate arbitrary (term needs some qualification) nonlinear maps with arbitrary precision ("universal approximation property")
- most popular supervised training algorithm: backpropagation algorithm
- huge literature; 95 % of neural network publications concern feedforward nets (my estimate)
- have proven useful in many practical applications as approximators of nonlinear functions and as pattern classifiers
- are not the topic considered in this tutorial

By contrast, a recurrent neural network (RNN) has (at least one) cyclic path of synaptic connections. Basic characteristics:

- all biological neural networks are recurrent
- mathematically, RNNs implement dynamical systems
- basic theoretical result: RNNs can approximate arbitrary (term needs some qualification) dynamical systems with arbitrary precision ("universal approximation property")
- several types of training algorithms are known, no clear winner
- theoretical and practical difficulties by and large have prevented practical applications so far

- not covered in most neuroinformatics textbooks, absent from engineering textbooks
- this tutorial is all about them.

Because biological neuronal systems are recurrent, RNN models abound in the biological and biocybernetical literature. Standard types of research papers include...

- bottom-up, detailed neurosimulation:
  o compartment models of small (even single-unit) systems
  o complex biological network models (e.g. Freeman's olfactory bulb models)
- top-down, investigation of principles:
  o complete mathematical study of few-unit networks (in AIS: Pasemann, Giannakopoulos)
  o universal properties of dynamical systems as "explanations" for cognitive neurodynamics, e.g. "concept ~ attractor state"; "learning ~ parameter change"; "jumps in learning and development ~ bifurcations"
  o demonstration of dynamical working principles
  o synaptic learning dynamics and conditioning
  o synfire chains

This tutorial does not enter this vast area. The tutorial is about algorithmical RNNs, intended as blackbox models for engineering and signal processing. The general picture is given in Fig. 1.2.

Figure 1.2: Principal moves in the blackbox modeling game. (A physical system is observed, yielding empirical time series data; an RNN model is "learned" / "estimated" / "identified" from these data; the model-generated data should fit the empirical data in a similar distribution.)

Types of tasks for which RNNs can, in principle, be used:

- system identification and inverse system identification
- filtering and prediction
- pattern classification
- stochastic sequence modeling
- associative memory
- data compression

Some relevant application areas:

- telecommunication
- control of chemical plants
- control of engines and generators
- fault monitoring, biomedical diagnostics and monitoring
- speech recognition
- robotics, toys and edutainment
- video data analysis
- man-machine interfaces

State of usage in applications: RNNs are not often proposed in technical articles as "in principle promising" solutions for difficult tasks; there are demo prototypes in simulated or clean laboratory tasks; they are not economically relevant yet. Why? Supervised training of RNNs is (was) extremely difficult. This is the topic of this tutorial.

1.2 Supervised training: basic scheme

There are two basic classes of "learning": supervised and unsupervised (and unclear cases, e.g. reinforcement learning). This tutorial considers only supervised training. In supervised training of RNNs, one starts with teacher data (or training data): empirically observed or artificially constructed input-output time series, which represent examples of the desired model behavior.

Figure 1.3: Supervised training scheme. (A. Training: teacher input and output are used to fit the model. B. Exploitation: an input with correct but unknown output is fed to the trained model, which computes its own output.)

The teacher data is used to train a RNN such that it more or less precisely reproduces (fits) the teacher data, hoping that the RNN then generalizes to novel inputs. That is, when the trained RNN receives an input sequence which is somehow similar to the training input sequence, it should generate an output which resembles the output of the original system.

A fundamental issue in supervised training is overfitting: if the model fits the training data too well (extreme case: model duplicates teacher data exactly), it has only "learnt the training data by heart" and will not generalize well. This is particularly important with small training samples. Statistical learning theory addresses this problem. For RNN training, however, this tended to be a non-issue, because known training methods have a hard time fitting training data well in the first place.

1.3 Formal description of RNNs

The elementary building blocks of a RNN are neurons (we will use the term units) connected by synaptic links (connections) whose synaptic strength is coded by a weight. One typically distinguishes input units, internal (or hidden) units, and output units. At a given time, a unit has an activation. We denote the activations of input units by u(n), of internal units by x(n), of output units by y(n). Sometimes we ignore the input/internal/output distinction and then use x(n) in a metonymical fashion.

Figure 1.4: A typology of RNN models (incomplete). (Main dichotomy: discrete time, x(n+1) = f(W x(n)), possibly with input, bias, noise, output feedback, ..., vs. continuous time, τ dx/dt = -x + W f(x); further dimensions: spiking dynamics, spatial organization.)

There are many types of formal RNN models (see Fig. 1.4). Discrete-time models are mathematically cast as maps iterated over discrete time steps n = 1, 2, 3, .... Continuous-time models are defined through differential equations whose solutions are defined over a continuous time t. Especially for purposes of biological modeling, continuous dynamical models can be quite involved and describe activation signals on the level of individual action potentials (spikes). Often the model incorporates a specification of a spatial topology, most often of a 2D surface where units are locally connected in retina-like structures. In this tutorial we will only consider a particular kind of discrete-time models without spatial organization.

Our model consists of K input units with an activation (column) vector

(1.1) u(n) = (u_1(n), ..., u_K(n))^t,

of N internal units with an activation vector

(1.2) x(n) = (x_1(n), ..., x_N(n))^t,

and of L output units with an activation vector

(1.3) y(n) = (y_1(n), ..., y_L(n))^t,

where t denotes transpose. The input / internal / output connection weights are collected in N x K / N x N / L x (K+N+L) weight matrices

(1.4) W^in = (w^in_ij), W = (w_ij), W^out = (w^out_ij).

The output units may optionally project back to internal units with connections whose weights are collected in an N x L backprojection weight matrix

(1.5) W^back = (w^back_ij).

Figure 1.5: The basic network architecture used in this tutorial: K input units, N internal units, L output units. Shaded arrows indicate optional connections. Dotted arrows mark connections which are trained in the "echo state network" approach (in other approaches, all connections can be trained).

A zero weight value can be interpreted as "no connection". Note that output units may have connections not only from internal units but also (often) from input units and (rarely) from output units. The activation of internal units is updated according to

(1.6) x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n)),

where u(n+1) is the externally given input, and f denotes the component-wise application of the individual unit's transfer function f (also known as activation function, unit output function, or squashing function). We will mostly use the sigmoid function f = tanh but sometimes also consider linear networks with f = 1. The output is computed according to

(1.7) y(n+1) = f^out(W^out (u(n+1), x(n+1), y(n))),

where (u(n+1), x(n+1), y(n)) denotes the concatenated vector made from input, internal, and output activation vectors. We will use output transfer functions f^out = tanh or f^out = 1; in the latter case we have linear output units.

1.4 Example: a little timer network

Consider the input-output task of timing. The input signal has two components. The first component u_1(n) is 0 most of the time, but sometimes jumps to 1. The second input u_2(n) can take values between 0.1 and 1.0 in increments of 0.1, and assumes a new (random) one of these values each time u_1(n) jumps to 1. The desired output is 0.5 for 10 x u_2(n) time steps after u_1(n) was 1, else is 0. This amounts to implementing a timer: u_1 gives the "go" signal for the timer, u_2 gives the desired duration.

Figure 1.6: Schema of the timer network. (Input 1: start signals; input 2: duration setting; output: rectangular signals of desired duration.)

The following figure shows traces of input and output generated by a RNN trained on this task according to the ESN approach:

Figure 1.7: Performance of a RNN trained on the timer task. Solid line in last graph: desired (teacher) output. Dotted line: network output.
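To make the update equations (1.6) and (1.7) concrete, here is a minimal NumPy sketch of one network update step. This is my own illustration, not code from the original course material; the sizes and the random weight ranges are arbitrary choices.

import numpy as np

rng = np.random.default_rng(42)
K, N, L = 2, 20, 1                           # numbers of input, internal, output units

W_in   = rng.uniform(-1, 1, (N, K))          # N x K input weights, Eq. (1.4)
W      = rng.uniform(-1, 1, (N, N))          # N x N internal weights
W_out  = rng.uniform(-1, 1, (L, K + N + L))  # L x (K+N+L) output weights
W_back = rng.uniform(-1, 1, (N, L))          # N x L backprojection weights, Eq. (1.5)

def step(u, x, y):
    # One update: Eq. (1.6) for the internal state, Eq. (1.7) for the output.
    x_new = np.tanh(W_in @ u + W @ x + W_back @ y)          # (1.6), f = tanh
    y_new = np.tanh(W_out @ np.concatenate([u, x_new, y]))  # (1.7), f_out = tanh
    return x_new, y_new

x, y = np.zeros(N), np.zeros(L)              # start from the zero state
for n in range(5):
    x, y = step(rng.uniform(-0.5, 0.5, K), x, y)
print(y)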

Clearly this task requires that the RNN must act as a memory: it has to retain information about the "go" signal for several time steps. This is possible because the internal recurrent connections let the "go" signal "reverberate" in the internal units' activations. Generally, tasks requiring some form of memory are candidates for RNN modeling.

2. Standard training techniques for RNNs

During the last decade, several methods for supervised training of RNNs have been explored. In this tutorial we present the currently most important ones: backpropagation through time (BPTT), real-time recurrent learning (RTRL), and extended Kalman filtering based techniques (EKF). BPTT is probably the most widely used, RTRL is the mathematically most straightforward, and EKF is (arguably) the technique that gives best results.

2.1 Backpropagation revisited

BPTT is an adaptation of the well-known backpropagation training method known from feedforward networks. The backpropagation algorithm is the most commonly used training method for feedforward networks. We start with a recap of this method.

We consider a multi-layer perceptron (MLP) with k hidden layers. Together with the layer of input units and the layer of output units this gives k+2 layers of units altogether (Fig. 1.1, left, shows an MLP with two hidden layers), which we number by 0, ..., k+1. The number of input units is K, of output units L, and of units in hidden layer m is N_m. The weight of the connection from the i-th unit in layer m to the j-th unit in layer m+1 is denoted by w^m_ij. The activation of the i-th unit in layer m is x^m_i (for m = 0 this is an input value, for m = k+1 an output value).

The training data for a feedforward network training task consist of T input-output (vector-valued) data pairs

(2.1) u(n) = (x^0_1(n), ..., x^0_K(n))^t, d(n) = (d^{k+1}_1(n), ..., d^{k+1}_L(n))^t,

where n denotes training instance, not time. The activation of non-input units is computed according to

(2.2) x^{m+1}_j(n) = f( Σ_{i=1,...,N_m} w^m_ij x^m_i(n) ).

(Standardly one also has bias terms, which we omit here.) Presented with teacher input u(n), the previous update equation is used to compute activations of units in subsequent hidden layers, until a network response

(2.3) y(n) = (x^{k+1}_1(n), ..., x^{k+1}_L(n))^t

is obtained in the output layer. The objective of training is to find a set of network weights such that the summed squared error

(2.4) E = Σ_{n=1,...,T} E(n) = Σ_{n=1,...,T} Σ_{j=1,...,L} (d_j(n) - y_j(n))^2

is minimized. This is done by incrementally changing the weights along the direction of the error gradient w.r.t. weights,

(2.5) ∂E/∂w^m_ij = Σ_{n=1,...,T} ∂E(n)/∂w^m_ij,

using a (small) learning rate γ:

(2.6) new w^m_ij = w^m_ij - γ ∂E/∂w^m_ij.

This is the formula used in batch learning mode, where new weights are computed after presenting all training samples. One such pass through all samples is called an epoch. Before the first epoch, weights are initialized, typically to small random numbers. A variant is incremental learning, where weights are changed after presentation of individual training samples:

(2.7) new w^m_ij = w^m_ij - γ ∂E(n)/∂w^m_ij.

The central subtask in this method is the computation of the error gradients ∂E(n)/∂w^m_ij. The backpropagation algorithm is a scheme to perform these computations. We give the recipe for one epoch of batch processing.

Input: current weights w^m_ij, training samples.
Output: new weights.
Computation steps:

1. For each sample n, compute activations of internal and output units using (2.2) ("forward pass").
2. Compute, by proceeding backward through m = k+1, k, ..., 1, for each unit x^m_j the error propagation term δ^m_j(n):

(2.8) δ^{k+1}_j(n) = (d_j(n) - y_j(n)) ∂f(u)/∂u |_{u = z^{k+1}_j}

for the output layer and

(2.9) δ^m_j(n) = ( Σ_{i=1,...,N_{m+1}} δ^{m+1}_i(n) w^m_ji ) ∂f(u)/∂u |_{u = z^m_j}

for the hidden layers, where

(2.10) z^{m+1}_j = Σ_{i=1,...,N_m} w^m_ij x^m_i

is the internal state (or "potential") of unit x^{m+1}_j. This is the error backpropagation pass. Mathematically, the error propagation term δ^m_j(n) represents the (negative) error gradient w.r.t. the potential of the unit x^m_j:

(2.10a) δ^m_j(n) = - ∂E(n)/∂u |_{u = z^m_j}.

3. Adjust the connection weights according to

(2.11) new w^m_ij = w^m_ij + γ Σ_{n=1,...,T} δ^{m+1}_j(n) x^m_i(n).

After every such epoch, compute the error according to (2.4). Stop when the error falls below a predetermined threshold, or when the change in error falls below another predetermined threshold, or when the number of epochs exceeds a predetermined maximal number of epochs. Many (order of thousands in nontrivial tasks) such epochs may be required until a sufficiently small error is achieved.

One epoch requires O(T M) multiplications and additions, where M is the total number of network connections. The basic gradient descent approach (and its backpropagation algorithm implementation) is notorious for slow convergence, because the learning rate γ must typically be chosen small to avoid instability. Many speed-up techniques are described in the literature, e.g. dynamic learning rate adaptation schemes. Another approach to achieve faster convergence is to use second-order gradient descent techniques, which exploit curvature of the gradient but have epoch complexity O(T M^2).

Like all gradient-descent techniques on error surfaces, backpropagation finds only a local error minimum. This problem can be addressed by various measures, e.g. adding noise during training (simulated annealing approaches) to avoid getting stuck in poor minima, or by repeating the entire learning from different initial weight settings, or by using task-specific prior information to start from an already plausible set of weights. Some authors claim that the local minimum problem is overrated.
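As an illustration of the recipe (2.2), (2.8)-(2.11), the following sketch implements one-hidden-layer batch backpropagation in NumPy. It is my own toy example, not from the tutorial; note that for f = tanh the derivative is f'(z) = 1 - tanh(z)^2.

import numpy as np

rng = np.random.default_rng(0)
K, N1, L, T = 3, 5, 2, 50                 # input, hidden, output sizes; sample count
gamma = 0.005                             # learning rate

W0 = rng.normal(0, 0.5, (N1, K))          # weights layer 0 -> 1
W1 = rng.normal(0, 0.5, (L, N1))          # weights layer 1 -> 2
U = rng.uniform(-1, 1, (T, K))            # training inputs, Eq. (2.1)
D = np.tanh(U @ rng.normal(size=(K, L)))  # toy teacher outputs (a learnable map)

for epoch in range(2000):
    Z1 = U @ W0.T; X1 = np.tanh(Z1)       # forward pass, Eq. (2.2): hidden layer
    Z2 = X1 @ W1.T; Y = np.tanh(Z2)       # output layer response, Eq. (2.3)
    delta2 = (D - Y) * (1 - np.tanh(Z2) ** 2)        # Eq. (2.8), output layer
    delta1 = (delta2 @ W1) * (1 - np.tanh(Z1) ** 2)  # Eq. (2.9), hidden layer
    W1 += gamma * delta2.T @ X1           # batch updates, Eq. (2.11)
    W0 += gamma * delta1.T @ U

Y = np.tanh(np.tanh(U @ W0.T) @ W1.T)
print(np.sum((D - Y) ** 2))               # summed squared error, Eq. (2.4)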

Another problem is the selection of a suitable network topology (number and size of hidden layers). Again, one can use prior information, perform systematic search, or use intuition. All in all, while basic backpropagation is transparent and easily implemented, considerable expertise and experience (or patience) is a prerequisite for good results in non-trivial tasks.

2.2 Backpropagation through time

The feedforward backpropagation algorithm cannot be directly transferred to RNNs because the error backpropagation pass presupposes that the connections between units induce a cycle-free ordering. The solution of the BPTT approach is to "unfold" the recurrent network in time, by stacking identical copies of the RNN, and redirecting connections within the network to obtain connections between subsequent copies. This gives a feedforward network, which is amenable to the backpropagation algorithm.

Figure 2.1: Schema of the basic idea of BPTT. A: the original RNN. B: the feedforward network obtained from it by stacking copies for times ..., n-1, n, n+1, .... The case of single-channel input and output is shown. The weights w^in, w, w^out, w^back are identical in/between all copies.

The teacher data consists now of a single input-output time series

(2.12) u(n) = (u_1(n), ..., u_K(n))^t, d(n) = (d_1(n), ..., d_L(n))^t, n = 1, ..., T.

The forward pass of one training epoch consists in updating the stacked network, starting from the first copy and working upwards through the stack. At each copy / time n the input u(n) is read in, then the internal state x(n) is computed from u(n), x(n-1) [and from y(n-1) if nonzero W^back exist], and finally the current copy's output y(n) is computed. The error to be minimized is again (like in (2.4))

(2.13) E = Σ_{n=1,...,T} E(n) = Σ_{n=1,...,T} Σ_{j=1,...,L} (d_j(n) - y_j(n))^2,

but the meaning of n has changed from "training instance" to "time". The algorithm is a straightforward, albeit notationally complicated, adaptation of the feedforward algorithm:

Input: current weights, training time series.
Output: new weights.
Computation steps:

1. Forward pass: as described above.
2. Compute, by proceeding backward through n = T, ..., 1, for each time n and unit activation x_j(n), y_j(n) the error propagation term δ_j(n):

(2.14) δ_j(T) = (d_j(T) - y_j(T)) ∂f(u)/∂u |_{u = z_j(T)}

for the output units of time layer T,

(2.15) δ_j(T) = ( Σ_{i=1,...,L} δ_i(T) w^out_ij ) ∂f(u)/∂u |_{u = z_j(T)}

for internal units x_j(T) at time layer T,

(2.16) δ_j(n) = ( (d_j(n) - y_j(n)) + Σ_{i=1,...,N} δ_i(n+1) w^back_ij ) ∂f(u)/∂u |_{u = z_j(n)}

for the output units of earlier layers n < T, and

(2.17) δ_j(n) = ( Σ_{i=1,...,N} δ_i(n+1) w_ij + Σ_{i=1,...,L} δ_i(n) w^out_ij ) ∂f(u)/∂u |_{u = z_j(n)}

for internal units x_j(n) at earlier times, where z_j(n) again is the potential of the corresponding unit.

3. Adjust the connection weights according to

(2.18)
new w^in_ij = w^in_ij + γ Σ_{n=1,...,T} δ_i(n) u_j(n),
new w_ij = w_ij + γ Σ_{n=1,...,T} δ_i(n) x_j(n-1)  [use x_j(n-1) = 0 for n = 1],
new w^out_ij = w^out_ij + γ Σ_{n=1,...,T} δ_i(n) u_j(n) if j refers to an input unit, resp. + γ Σ_{n=1,...,T} δ_i(n) x_j(n) if j refers to a hidden unit,
new w^back_ij = w^back_ij + γ Σ_{n=1,...,T} δ_i(n) y_j(n-1)  [use y_j(n-1) = 0 for n = 1].

Warning: programming errors are easily made but not easily perceived when they degrade performance only slightly.

The remarks concerning slow convergence made for standard backpropagation carry over to BPTT. The computational complexity of one epoch is O(T N^2), where N is the number of internal units. Several thousands of epochs are often required.

A variant of this algorithm is to use the teacher output d(n) in the computation of activations in layer n+1 in the forward pass. This is known as teacher forcing. Teacher forcing typically speeds up convergence, or even may be necessary to achieve convergence at all, but when the trained network is exploited, it may exhibit instability. A general rule for when to use teacher forcing cannot be given.

A drawback of this "batch" BPTT is that the entire teacher time series must be used. This precludes applications where online adaptation is required. The solution is to truncate the past history and use, at time n, only a finite history

(2.19) u(n-p), u(n-p+1), ..., u(n), d(n-p), d(n-p+1), ..., d(n)

as training data. Since the error backpropagation terms need to be computed only once for each new time slice, the complexity is O(N^2) per time step. A potential drawback of such truncated BPTT (or p-BPTT) is that memory effects exceeding a duration p cannot be captured by the model. Anyway, BPTT generally has difficulties capturing long-lived memory effects, because backpropagated error gradient information tends to "dilute" exponentially over time. A frequently stated opinion is that memory spans longer than 10 to 20 time steps are hard to achieve.

Repeated execution of training epochs shifts a complex nonlinear dynamical system (the network) slowly through parameter (weight) space. Therefore, bifurcations are necessarily encountered when the starting weights induce a qualitatively different dynamical behavior than the task requires. Near such bifurcations, the gradient information may become essentially useless, dramatically slowing down convergence. The error may even suddenly grow in the vicinity of such critical points, due to crossing bifurcation boundaries. Unlike feedforward backpropagation, BPTT is not guaranteed to converge to a local error minimum. This difficulty cannot arise with feedforward networks, because they realize functions, not dynamical systems.
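The following sketch shows the unfold-and-backpropagate scheme (2.14)-(2.18) on a toy single-input, single-output task. To keep it short, it omits the backprojection weights W^back and direct input-to-output connections, so only the terms for W^in, W and W^out from (2.18) appear; it is my own minimal illustration, not code from the tutorial.

import numpy as np

rng = np.random.default_rng(1)
K, N, L, T = 1, 10, 1, 100
gamma = 0.01

W_in  = rng.normal(0, 0.5, (N, K))
W     = rng.normal(0, 0.3, (N, N))
W_out = rng.normal(0, 0.5, (L, N))

U = rng.uniform(-0.5, 0.5, (T, K))          # teacher input series, Eq. (2.12)
D = np.vstack([np.zeros((1, L)), U[:-1]])   # teacher output: previous input (K = L = 1)

for epoch in range(500):
    # forward pass through the unfolded stack: X[n] = x(n), X[0] = x(0) = 0
    X = np.zeros((T + 1, N)); ZX = np.zeros((T + 1, N))
    ZY = np.zeros((T, L)); Y = np.zeros((T, L))
    for n in range(T):
        ZX[n + 1] = W_in @ U[n] + W @ X[n]
        X[n + 1] = np.tanh(ZX[n + 1])
        ZY[n] = W_out @ X[n + 1]
        Y[n] = np.tanh(ZY[n])
    # backward pass, Eqs. (2.14)-(2.17); dX[m] is the delta of state x(m)
    dX = np.zeros((T + 2, N))
    gW_in = np.zeros_like(W_in); gW = np.zeros_like(W); gW_out = np.zeros_like(W_out)
    for n in reversed(range(T)):
        d_out = (D[n] - Y[n]) * (1 - np.tanh(ZY[n]) ** 2)      # (2.14)/(2.16)
        dX[n + 1] = (W_out.T @ d_out + W.T @ dX[n + 2]) * (1 - np.tanh(ZX[n + 1]) ** 2)  # (2.15)/(2.17)
        gW_out += np.outer(d_out, X[n + 1])                    # accumulate (2.18)
        gW     += np.outer(dX[n + 1], X[n])
        gW_in  += np.outer(dX[n + 1], U[n])
    W_in += gamma * gW_in; W += gamma * gW; W_out += gamma * gW_out

print(np.sum((D - Y) ** 2))   # error (2.13) after training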

All in all, it is far from trivial to achieve good results with BPTT, and much experimentation (and processor time) may be required before a satisfactory result is achieved. Because of limited processing time, BPTT is typically used with small networks (sizes in the order of 3 to 20 units). Larger networks may require many hours of computation on current hardware.

3. Real-time recurrent learning

Real-time recurrent learning (RTRL) is a gradient-descent method which computes the exact error gradient at every time step. It is therefore suitable for online learning tasks. I basically quote the description of RTRL given in Doya (1995). The most often cited early description of RTRL is Williams & Zipser (1989).

The effect of a weight change on the network dynamics can be seen by simply differentiating the network dynamics equations (1.6) and (1.7) by its weights. For convenience, activations of all units (whether input, internal, or output) are enumerated and denoted by v_i, and all weights are denoted by w_kl, with i = 1, ..., N denoting internal units, i = N+1, ..., N+L denoting output units, and i = N+L+1, ..., N+L+K denoting input units. The derivative of an internal or output unit w.r.t. a weight w_kl is given by

(3.1) ∂v_j(n+1)/∂w_kl = f'(z_j(n+1)) ( δ_jk v_l(n) + Σ_{i=1,...,N+L} w_ji ∂v_i(n)/∂w_kl ), j = 1, ..., N+L,

where k ≤ N+L, l ≤ N+L+K, z_j again is the unit's potential, and δ_jk here denotes Kronecker's delta (δ_jk = 1 if j = k and 0 otherwise). The term δ_jk v_l(n) represents an explicit effect of the weight w_kl onto the unit k, and the sum term represents an implicit effect onto all the units due to network dynamics.

Equation (3.1) for each internal or output unit constitutes an (N+L)-dimensional discrete-time linear dynamical system with time-varying coefficients, where

(3.2) (∂v_1(n)/∂w_kl, ..., ∂v_{N+L}(n)/∂w_kl)

is taken as a dynamical variable. Since the initial state of the network is independent of the connection weights, we can initialize (3.1) by

(3.3) ∂v_j(0)/∂w_kl = 0.

Thus we can compute (3.2) forward in time by iterating Equation (3.1) simultaneously with the network dynamics (1.6) and (1.7).
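In code, iterating (3.1) alongside the network update might look as follows. This is a sketch under my own toy setup (input units enumerated last, as above); the three-index array P makes the high per-step cost of RTRL directly visible.

import numpy as np

rng = np.random.default_rng(2)
N, L, K = 5, 1, 1
M = N + L                                # units with adaptable incoming weights
gamma = 0.01

w = rng.normal(0, 0.3, (M, M + K))       # w[j, i]: weight from unit i into unit j
v = np.zeros(M)                          # activations of internal + output units
P = np.zeros((M, M, M + K))              # P[j, k, l] = dv_j/dw_kl, zero init per (3.3)

def rtrl_step(u, d, v, P, w):
    va = np.concatenate([v, u])          # all activations, inputs appended
    v_new = np.tanh(w @ va)              # network dynamics
    # Eq. (3.1): implicit effect via the dynamics plus explicit Kronecker term
    P_new = np.einsum('ji,ikl->jkl', w[:, :M], P)   # sum_i w_ji dv_i/dw_kl
    for j in range(M):
        P_new[j, j, :] += va             # delta_jk v_l(n)
    P_new *= (1 - v_new ** 2)[:, None, None]        # f'(z_j(n+1)) for f = tanh
    # Eq. (3.6): online gradient step on the current output error
    err = v_new[N:] - d                  # output units are v_{N+1}, ..., v_{N+L}
    w_new = w - gamma * np.einsum('j,jkl->kl', err, P_new[N:])
    return v_new, P_new, w_new

for n in range(2000):                    # toy task: track a scaled sinewave
    u = np.array([np.sin(n / 10)])
    d = np.array([0.5 * np.sin((n + 1) / 10)])
    v, P, w = rtrl_step(u, d, v, P, w)
print(abs(v[N:] - d))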

From this solution, we can calculate the error gradient (for the error given in (2.13)) as follows:

(3.4) ∂E/∂w_kl = 2 Σ_{n=1,...,T} Σ_{j=N+1,...,N+L} (v_j(n) - d_j(n)) ∂v_j(n)/∂w_kl.

A standard batch gradient descent algorithm is to accumulate the error gradient by Equation (3.4) and update each weight after a complete epoch of presenting all training data by

(3.5) new w_kl = w_kl - γ ∂E/∂w_kl,

where γ is a learning rate. An alternative update scheme is the gradient descent of the current output error at each time step,

(3.6) w_kl(n+1) = w_kl(n) - γ Σ_{j=N+1,...,N+L} (v_j(n) - d_j(n)) ∂v_j(n)/∂w_kl.

Note that we assumed w_kl is a constant, not a dynamical variable, in deriving (3.1), so we have to keep the learning rate small enough. (3.6) is referred to as real-time recurrent learning.

RTRL is mathematically transparent and in principle suitable for online training. However, the computational cost is O((N+L)^4) for each update step, because we have to solve the (N+L)-dimensional system (3.1) for each of the weights. This high computational cost makes RTRL useful for online adaptation only when very small networks suffice.

4. Higher-order gradient descent techniques

Just a little note: pure gradient-descent techniques for optimization generally suffer from slow convergence when the curvature of the error surface is different in different directions. In that situation, on the one hand the learning rate must be chosen small to avoid instability in the directions of high curvature, but on the other hand, this small learning rate might lead to unacceptably slow convergence in the directions of low curvature. A general remedy is to incorporate curvature information into the gradient descent process. This requires the calculation of second-order derivatives, for which several approximative techniques have been proposed in the context of recurrent neural networks. These calculations are expensive, but can accelerate convergence especially near an optimum where the error surface can be reasonably approximated by a quadratic function. Dos Santos & von Zuben (2000) and Schraudolph (2002) provide references, discussion, and propose approximation techniques which are faster than naive calculations.

5. Extended Kalman-filtering approaches

5.1 The extended Kalman filter

The extended Kalman filter (EKF) is a state estimation technique for nonlinear systems derived by linearizing the well-known linear-systems Kalman filter around the current state estimate. We consider a simple special case, a time-discrete system with additive input and no observation noise:

(5.1) x(n+1) = f(x(n)) + q(n), d(n) = h_n(x(n)),

where x(n) is the system's internal state vector, f is the system's state update function (linear in original Kalman filters), q(n) is external input to the system (an uncorrelated Gaussian white noise process, can also be considered as process noise), d(n) is the system's output, and h_n is a time-dependent observation function (also linear in the original Kalman filter). At time n = 0, the system state x(0) is guessed by a multidimensional normal distribution with mean x̂(0) and covariance matrix P(0). The system is observed until time n through d(0), ..., d(n). The task addressed by the extended Kalman filter is to give an estimate x̂(n+1) of the true state x(n+1), given the initial state guess and all previous output observations. This task is solved by the following two time update and three measurement update computations:

(5.2) x̂*(n) = f(x̂(n-1)),
      P*(n) = F P(n-1) F^t + Q,

(5.3) K(n) = P*(n) H^t [H P*(n) H^t]^{-1},
      x̂(n+1) = x̂*(n) + K(n) ξ(n),
      P(n+1) = P*(n) - K(n) H P*(n),

where we roughly follow the notation in Singhal and Wu (1989), who first applied extended Kalman filtering to (feedforward) network weight estimation. Here F and H are the Jacobians

(5.4) F = ∂f(x)/∂x |_{x = x̂(n)}, H = ∂h_n(x)/∂x |_{x = x̂(n)}

of the components of f, h_n with respect to the state variables, evaluated at the previous state estimate; ξ(n) = d(n) - h_n(x̂(n)) is the error (difference between observed output and output calculated from the state estimate x̂(n)); P(n) is an estimate of the conditional error covariance matrix E[ξ ξ^t | d(0), ..., d(n)]; Q is the (diagonal) covariance matrix of the process noise; and the time updates x̂*, P* of state estimate and state error covariance estimate are obtained from extrapolating the previous estimates with the known dynamics f.
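As a compact restatement of (5.2)-(5.3), one EKF cycle can be sketched as follows. This is generic code of mine, not from the tutorial; f, h and their Jacobians are supplied by the caller, and H is taken as the Jacobian matrix of the observation function so that K = P* H^t [H P* H^t]^{-1}.

import numpy as np

def ekf_cycle(x_hat, P, d, f, h, F_jac, H_jac, Q):
    # time update (5.2): extrapolate estimate and error covariance with f
    F = F_jac(x_hat)
    x_star = f(x_hat)
    P_star = F @ P @ F.T + Q
    # measurement update (5.3): correct with the observed output d
    H = H_jac(x_star)
    K = P_star @ H.T @ np.linalg.inv(H @ P_star @ H.T)   # Kalman gain
    xi = d - h(x_star)                                   # output error
    return x_star + K @ xi, P_star - K @ H @ P_star      # new x_hat, new P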

The basic idea of Kalman filtering is to first update x̂(n), P(n) to preliminary guesses x̂*, P* by extrapolating from their previous values, applying the known dynamics in the time update steps (5.2), and then adjusting these preliminary guesses by incorporating the information contained in d(n); this information enters the measurement update in the form of ξ(n), and is accumulated in the Kalman gain K(n). In the case of classical (linear, stationary) Kalman filtering, F and H are constant, and the state estimates converge to the true conditional mean state E[x(n) | d(0), ..., d(n)]. For nonlinear f, h_n, this is not generally true, and the use of extended Kalman filters leads only to locally optimal state estimates.

5.2 Applying EKF to RNN weight estimation

Assume that there exists a RNN which perfectly reproduces the input-output time series of the training data

(5.5) u(n) = (u_1(n), ..., u_K(n))^t, d(n) = (d_1(n), ..., d_L(n))^t, n = 1, ..., T,

where the input / internal / output / backprojection connection weights are as usual collected in N x K / N x N / L x (K+N+L) / N x L weight matrices

(5.6) W^in = (w^in_ij), W = (w_ij), W^out = (w^out_ij), W^back = (w^back_ij).

In this subsection, we will not distinguish between all these different types of weights and refer to all of them by a weight vector w.

In order to apply EKF to the task of estimating optimal weights of a RNN, we interpret the weights w of the perfect RNN as the state of a dynamical system. From a bird's eye perspective, the output d(n) of the RNN is a function h of the weights and the input up to n:

(5.7) d(n) = h(w, u(0), ..., u(n)),

where we assume that the transient effects of the initial network state have died out. The inputs can be integrated into the output function h, rendering it a time-dependent function h_n. We further assume that the network update contains some process noise, which we add to the weights (!) in the form of a Gaussian uncorrelated noise q(n). This gives the following version of (5.1) for the dynamics of the perfect RNN:

(5.8) w(n+1) = w(n) + q(n), d(n) = h_n(w(n)).

Except for noisy shifts induced by process noise, the state "dynamics" of this system is static, and the input u(n) to the network is not entered in the state update equation,

but is hidden in the time-dependence of the observation function. This takes some mental effort to swallow!

The network training task now takes the form of estimating the (static, perfect) state w from an initial guess ŵ(0) and the sequence of outputs d(0), ..., d(n). The error covariance matrix P(0) is initialized as a diagonal matrix with large diagonal components.

The simpler form of (5.8) over (5.1) leads to some simplifications of the EKF recursions (5.2) and (5.3): because the system state (= weight!) dynamics is now trivial, the time update steps become unnecessary. The measurement updates become

(5.9) K(n) = P(n) H^t [H P(n) H^t]^{-1},
      ŵ(n+1) = ŵ(n) + K(n) ξ(n),
      P(n+1) = P(n) - K(n) H P(n) + Q.

A learning rate η (small at the beginning [!] of learning, to compensate for initially bad estimates of P(n)) can be introduced into the Kalman gain update equation:

(5.10) K(n) = P(n) H^t [(1/η) I + H P(n) H^t]^{-1},

which is essentially the formulation given in Puskorius and Feldkamp (1994). Inserting process noise q(n) into EKF has been claimed to improve the algorithm's numerical stability, and to avoid getting stuck in poor local minima (Puskorius and Feldkamp 1994).

EKF is a second-order gradient descent algorithm, in that it uses curvature information of the (squared) error surface. As a consequence of exploiting curvature, for linear noise-free systems the Kalman filter can converge in a single step. We demonstrate this by a super-simple feedforward network example. Consider the single-input, single-output network which connects the input unit with the output unit by a connection with weight w, without any internal unit:

(5.11) w(n+1) = w(n), d(n) = w u(n).

We inspect the running EKF (in the version of (5.9)) at some time n, where it has reached an estimate ŵ(n). Observing that the Jacobian H is simply dwu/dw = u(n), the next estimated state is

(5.12) ŵ(n+1) = ŵ(n) + K(n) (w u(n) - ŵ(n) u(n)) = ŵ(n) + 1/u(n) (w u(n) - ŵ(n) u(n)) = w.

The EKF is claimed in the literature to exhibit fast convergence, which should have become plausible from this example, at least for cases where the current estimate ŵ(n) is already close to the correct value, such that the linearisation yields a good approximation to the true system.
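Here is a sketch of the weight-estimation update (5.9)/(5.10), together with the super-simple network (5.11), so that the one-step convergence of (5.12) can be checked numerically. This is my own illustration; H is taken as the L x M Jacobian of the network outputs w.r.t. the M weights.

import numpy as np

def ekf_weight_update(w_hat, P, xi, H, Q, eta=None):
    # measurement update, Eq. (5.9); with a learning rate eta it becomes Eq. (5.10)
    A = H @ P @ H.T
    if eta is not None:
        A = A + np.eye(H.shape[0]) / eta
    K = P @ H.T @ np.linalg.inv(A)                 # Kalman gain
    return w_hat + K @ xi, P - K @ H @ P + Q

# the super-simple network (5.11): d = w*u, one weight, so H = d(wu)/dw = u
w_true, u = 2.0, 0.7
w_hat, P = np.zeros(1), 100.0 * np.eye(1)          # large initial P(0)
H = np.array([[u]])
xi = np.array([w_true * u - w_hat[0] * u])         # observed minus predicted output
w_hat, P = ekf_weight_update(w_hat, P, xi, H, Q=np.zeros((1, 1)))
print(w_hat)    # = [2.0]: convergence in a single step, cf. Eq. (5.12)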

EKF requires the derivatives H of the network outputs w.r.t. the weights, evaluated at the current weight estimate. These derivatives can be exactly computed as in the RTRL algorithm, at cost O(N^4). This is too expensive for all but small networks. Alternatively, one can resort to truncated BPTT: use a "stacked" version of (5.8) which describes a finite sequence of outputs instead of a single output, and obtain approximations to H by a procedure analogous to (2.14) - (2.17). Two variants of this approach are detailed out in Feldkamp et al. (1998). The cost here is O(p N^2), where p is the truncation depth.

Apart from the calculation of H, the most expensive operation in EKF is the update of P, which requires O(L N^2) computations. By setting up the network architecture with suitably decoupled subnetworks, one can achieve a block-diagonal P, with considerable reduction in computations (Feldkamp et al. 1998).

As far as I have an overview, it seems to me that currently the best results in RNN training are achieved with EKF, using truncated BPTT for estimating H, demonstrated especially in many remarkable achievements from Lee Feldkamp's research group (see references). As with BPTT and RTRL, the eventual success and quality of EKF training depends very much on professional experience, which guides the appropriate selection of network architecture, learning rates, the subtleties of gradient calculations, presentation of input (e.g., windowing techniques), etc.

6. Echo state networks

6.1 Training echo state networks

6.1.1 First example: a sinewave generator

In this subsection I informally demonstrate the principles of echo state networks (ESN) by showing how to train a RNN to generate a sinewave.

The desired sinewave is given by d(n) = 1/2 sin(n/4). The task of generating such a signal involves no input, so we want a RNN without any input units and a single output unit which after training produces d(n). The teacher signal is a 300-step sequence of d(n).

We start by constructing a recurrent network with 20 units, whose internal connection weights W are set to random values. We will refer to this network as the "dynamical reservoir" (DR). The internal weights W will not be changed in the training described later in this subsection. The network's units are standard sigmoid units, as in Eq. (1.6), with a transfer function f = tanh.

A randomly constructed RNN, such as our DR, might develop oscillatory or even chaotic activity even in the absence of external excitation. We do not want this to occur: the ESN approach needs a DR which is damped, in the sense that if the

network is started from an arbitrary state x(0), the subsequent network states converge to the zero state. This can be achieved by a proper global scaling of W: the smaller the weights of W, the stronger the damping. We assume that we have scaled W such that we have a DR with modest damping. Fig. 6.1 shows traces of the 20 units of our DR when it was started from a random initial state x(0). The desired damping is clearly visible.

Figure 6.1: The damped dynamics of our dynamical reservoir.

We add a single output unit to this DR. This output unit features connections that project back into the DR. These backprojections are given random weights W^back, which are also fixed and do not change during subsequent training. We use a linear output unit in this example, i.e. f^out = id.

The only connections which are changed during learning are the weights W^out from the DR to the output unit. These weights are not defined (nor are they used) during training. Figure 6.2 shows the network prepared for training.

Figure 6.2: Schematic setup of ESN for training a sinewave generator. The dynamical reservoir (DR) is fixed before training; the connections to the linear output unit are trained.
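The damping is easy to check numerically. Here is my own quick sketch (the scaling of W by its spectral radius anticipates the procedure discussed in Section 6.2):

import numpy as np

rng = np.random.default_rng(3)
N = 20
W = rng.uniform(-1, 1, (N, N))
W *= 0.8 / max(abs(np.linalg.eigvals(W)))   # globally scale W down (spectral radius 0.8)

x = rng.uniform(-1, 1, N)                   # arbitrary starting state x(0)
for n in range(100):
    x = np.tanh(W @ x)                      # free run: no input, no output feedback
print(np.max(np.abs(x)))                    # ~ 0: the activity has died out, cf. Fig. 6.1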

The training is done in two stages, sampling and weight computation.

Sampling. During the sampling stage, the teacher signal is written into the output unit for times n = 1, ..., 300. (Writing the desired output into the output units during training is often called teacher forcing.) The network is started at time n = 1 with an arbitrary starting state; we use the zero state for starting, but that is just an arbitrary decision. The teacher signal d(n) is pumped into the DR through the backprojection connections W^back and thereby excites an activation dynamics within the DR. Figure 6.3 shows what happens inside the DR for sampling time steps from n = 101 onward.

Figure 6.3: The dynamics within the DR induced by teacher-forcing the sinewave d(n) in the output unit. Traces of the 20 internal DR units and of the teacher signal (last plot) are shown.

We can make two important observations:

1. The activation patterns within the DR are all periodic signals of the same period length as the driving teacher d(n).
2. The activation patterns within the DR are different from each other.

During the sampling period, the internal states x(n) = (x_1(n), ..., x_20(n)) for n = 101, ..., 300 are collected into the rows of a state-collecting matrix M of size 200x20. At the same time, the teacher outputs d(n) are collected into the rows of a matrix T of size 200x1.

We do not collect information from times n = 1, ..., 100, because the network's dynamics is initially partly determined by the network's arbitrary starting state. By time n = 100, we can safely assume that the effects of the arbitrary starting state have died out and that the network states are a pure reflection of the teacher-forced d(n), as is manifest in Fig. 6.3.
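The sampling stage can be sketched in a few lines. This is my own code along the lines of the text; sizes and the spectral-radius scaling are as in this example.

import numpy as np

rng = np.random.default_rng(4)
N = 20
W = rng.uniform(-1, 1, (N, N))
W *= 0.8 / max(abs(np.linalg.eigvals(W)))       # damped DR
W_back = rng.uniform(-1, 1, N)                  # fixed backprojection weights

def d(n):                                       # teacher signal d(n) = 1/2 sin(n/4)
    return 0.5 * np.sin(n / 4)

x = np.zeros(N)                                 # zero starting state
rows_M, rows_T = [], []
for n in range(1, 301):
    x = np.tanh(W @ x + W_back * d(n - 1))      # teacher forcing through W_back
    if n > 100:                                 # discard the initial transient
        rows_M.append(x.copy()); rows_T.append(d(n))
M = np.array(rows_M)                            # 200 x 20 state-collecting matrix
T = np.array(rows_T).reshape(-1, 1)             # 200 x 1 teacher matrix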

Weight computation. We now compute 20 output weights w^out_i for our linear output unit y(n) such that the teacher time series d(n) is approximated as a linear combination of the internal activation time series x_i(n) by

(6.1) d(n) ≈ y(n) = Σ_{i=1,...,20} w^out_i x_i(n).

More specifically, we compute the weights w^out_i such that the mean squared training error

(6.2) MSE_train = 1/200 Σ_{n=101,...,300} (d(n) - y(n))^2 = 1/200 Σ_{n=101,...,300} (d(n) - Σ_{i=1,...,20} w^out_i x_i(n))^2

is minimized.

From a mathematical point of view, this is a linear regression task: compute regression weights w^out_i for a regression of d(n) on the network states x_i(n) [n = 101, ..., 300].

From an intuitive-geometrical point of view, this means combining the 20 internal signals seen in Fig. 6.3 such that the resulting combination best approximates the last (teacher) signal seen in the same figure.

From an algorithmical point of view, this offline computation of regression weights boils down to the computation of a pseudoinverse: the desired weights which minimize MSE_train are obtained by multiplying the pseudoinverse of M with T:

(6.3) W^out = M^+ T.

Computing the pseudoinverse of a matrix is a standard operation of numerical linear algebra. Ready-made functions are included in Mathematica and Matlab, for example.

In our example, the training error computed by (6.2) with optimal output weights obtained by (6.3) was found to be MSE_train ≈ 1.2e-13. The computed output weights are implemented in the network, which is then ready for use.

Exploitation. After the learnt output weights were written into the output connections, the network was run for another 50 steps, continuing from the last training network state x(300), but now with teacher forcing switched off. The output y(n) was now generated by the trained network all on its own [n = 301, ..., 350]. The test error

(6.4) MSE_test = 1/50 Σ_{n=301,...,350} (d(n) - y(n))^2

was found to be MSE_test ≈ 5.6e-12. This is greater than the training error, but still very small.
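Continuing my sampling sketch above (it relies on the variables M, T, W, W_back, x and the function d defined there), the weight computation and a test run might look like this:

import numpy as np
# continues the previous listing: M, T, W, W_back, x, d are as defined there

W_out = np.linalg.pinv(M) @ T                   # Eq. (6.3): pseudoinverse regression
mse_train = np.mean((M @ W_out - T) ** 2)       # Eq. (6.2)

y = d(300)                                      # continue from the last training state
sq_errs = []
for n in range(301, 351):                       # teacher forcing switched off
    x = np.tanh(W @ x + W_back * y)
    y = float(W_out[:, 0] @ x)                  # Eq. (6.1), linear output unit
    sq_errs.append((d(n) - y) ** 2)
mse_test = np.mean(sq_errs)                     # Eq. (6.4)
print(mse_train, mse_test)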

The network has learnt to generate the desired sinewave very precisely. An intuitive explanation of this precision would go as follows: The sinewave y(n) at the output unit evokes periodic signals x_i(n) inside the DR whose period length is identical to that of the output sine. These periodic signals make a kind of "basis" of signals from which the target y(n) is combined. This "basis" is optimally "pre-adapted" to the target in the sense of identical period length. This pre-adaptation is a natural consequence of the fact that the "basis" signals x_i(n) have been induced by the target itself, via the feedback projections. So, in a sense, the task [to combine y(n) from the x_i(n)] is solved by means [the x_i(n)] which have been formed by the very task [by the backprojection of y(n) into the DR]. Or said in intuitive terms, the target signal y(n) is re-constituted from its own echos x_i(n)!

An immediate question concerns the stability of the solution. One may rightfully wonder whether the error in the testing phase, small as it was in the first steps, will not grow over time and finally render the network's global oscillation unstable. That is, we might suspect that the precise continuation of the sine output after the training is due to the fact that we start testing from state x(300), which was produced by teacher forcing. However, this is not usually the case. Most networks trained according to the prescription given here can be started from almost any arbitrary nonzero starting state and will lock into the desired sinewave. Figure 6.4 shows an example. In mathematical terms, the trained network is a dynamical system with a single attractor, and this attractor is the desired oscillation. However, the strong stability observed in this example is a pleasant side-effect of the simplicity of the sinewave generating task. When the tasks become more difficult, the stability of the trained dynamics is indeed a critical issue for ESN training.

Figure 6.4: Starting the trained network from a random starting state. The plot shows the first outputs: the network quickly settles into the desired sinewave oscillation.

6.1.2 Second example: a tuneable sinewave generator

We now make the sinewave generation task more difficult by demanding that the sinewave be adjustable in frequency. The training data now consists of an input signal u(n), which sets the desired frequency, and an output d(n), which is a sinewave whose frequency follows the input u(n). Figure 6.5 shows the resulting network architecture and a short sequence of teacher input and output.

Figure 6.5: Setup of the tuneable sinewave generator task. Trainable connections appear as dotted red arrows, fixed connections as solid black arrows.

Because the task is now more difficult, we use a larger DR with 100 units. In the sampling period, the network is driven by the teacher data. This time, this involves both inputting the slow signal u(n), and teacher-forcing the desired output d(n). We inspect the resulting activation patterns of internal units and find that they reflect, combine, and modify both u(n) and d(n) (Figure 6.6).

Figure 6.6: Traces of some internal DR units during the sampling period in the tuneable frequency generation task. (Panels: u(n); d(n); four internal states.)

In this example we use a sigmoid output unit. In order to make that work, during sampling we collect into T not the raw desired output d(n) but the transfer-inverted version tanh^{-1}(d(n)). We also use a longer training sequence of 1200 steps, of which

we discard the first 200 steps as initial transient. The training error which we minimize concerns the tanh-inverted quantities:

(6.5) MSE_train = 1/1000 Σ_{n=201,...,1200} (tanh^{-1} d(n) - tanh^{-1} y(n))^2 = 1/1000 Σ_{n=201,...,1200} (tanh^{-1} d(n) - Σ_i w^out_i x_i(n))^2.

This is achieved, as previously, by computing W^out = M^+ T. The training error was found to be 8.1e-6; the test error on the first steps after inserting the computed output weights was again greater but still small. Again, the trained network stably locks into the desired type of dynamics even from a random starting state, as displayed in Figure 6.7.

Figure 6.7: Starting the trained generator from a random starting state. (Panels: u(n); d(n) [solid] and y(n) [dotted].)

Stability of the trained network was not as easy to achieve here as in the previous example. In fact, a trick was used which was found empirically to foster stable solutions. The trick is to insert some noise into the network during sampling. That is, during sampling, the network was updated according to the following variant of (1.6):

(6.6) x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n) + ν(n)),

where ν(n) is a small white noise term.

6.2 Training echo state networks: mathematics of echo states

In the two introductory examples, we rather vaguely said that the DR should exhibit a "damped" dynamics. We now describe in a rigorous way what kind of "damping" is required to make the ESN approach work, namely, that the DR must have echo states.

The key to understanding ESN training is the concept of echo states. Having echo states (or not having them) is a property of the network prior to training, that is, a property of the weight matrices W^in, W, and (optionally, if they exist) W^back. The property is also relative to the type of training data: the same untrained network may have echo states for certain training data but not for others. We therefore require that the training input vectors u(n) come from a compact interval U and the training output vectors d(n) from a compact interval D.

We first give the mathematical definition of echo states and then provide an intuitive interpretation.

Definition 6.1 (echo states). Assume an untrained network with weights W^in, W, and W^back is driven by teacher input u(n) and teacher-forced by teacher output d(n) from compact intervals U and D. The network (W^in, W, W^back) has echo states w.r.t. U and D, if for every left-infinite input/output sequence (u(n), d(n-1)), where n = ..., -2, -1, 0, and for all state sequences x(n), x'(n) compatible with the teacher sequence, i.e. with

(6.7) x(n+1) = f(W^in u(n+1) + W x(n) + W^back d(n)),
(6.8) x'(n+1) = f(W^in u(n+1) + W x'(n) + W^back d(n)),

it holds that x(n) = x'(n) for all n ≤ 0.

Intuitively, the echo state property says: "if the network has been run for a very long time [from minus infinity time, in the definition], the current network state is uniquely determined by the history of the input and the (teacher-forced) output". An equivalent way of stating this is to say that for every internal signal x_i(n) there exists an echo function e_i which maps input/output histories to the current state:

(6.9) e_i : (U x D)^{-N} → R,
      (..., (u(-1), d(-2)), (u(0), d(-1))) ↦ x_i(0).

We often say, somewhat loosely, that a (trained) network (W^in, W, W^out, W^back) is an echo state network if its untrained "core" (W^in, W, W^back) has the echo state property w.r.t. input/output from any compact interval U x D.

Several conditions which have been shown to be equivalent to echo states are collected in Jaeger (2001a). We provide one for illustration.

Definition 6.2 (state contracting). With the same assumptions as in Def. 6.1, the network (W^in, W, W^back) is state contracting w.r.t. U and D, if for all right-infinite input/output sequences (u(n), d(n-1)) ∈ U x D, where n = 0, 1, 2, ..., there exists a null sequence (δ_n)_{n≥1}, such that for all starting states x(0), x'(0) and for all n > 0 it holds that |x(n) - x'(n)| < δ_n, where x(n) [resp. x'(n)] is the network state at time n obtained when the network is driven by (u(n), d(n-1)) up to time n after having been started in x(0) [resp. in x'(0)].

Intuitively, the state contracting property says that the effects of the initial network state wash out over time. Note that there is some subtlety involved here, in that the null sequence used in the definition depends on the input/output sequence.
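The state contracting property of Definition 6.2 is easy to observe numerically: drive two copies of the same (scaled) network, started from different states, with the same teacher-forced output sequence and watch their distance shrink. A sketch of mine:

import numpy as np

rng = np.random.default_rng(5)
N = 20
W = rng.uniform(-1, 1, (N, N))
W *= 0.8 / max(abs(np.linalg.eigvals(W)))              # spectral radius < 1
W_back = rng.uniform(-1, 1, N)

x, x2 = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)   # two different starting states
for n in range(81):
    d_n = 0.5 * np.sin(n / 4)                          # common teacher-forced output
    x  = np.tanh(W @ x  + W_back * d_n)
    x2 = np.tanh(W @ x2 + W_back * d_n)
    if n % 20 == 0:
        print(n, np.linalg.norm(x - x2))               # behaves like a null sequence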

The echo state property is connected to algebraic properties of the weight matrix W. Unfortunately, there is no known necessary and sufficient algebraic condition which allows one to decide, given (W^in, W, W^back), whether the network has the echo state property. For readers familiar with linear algebra, we quote here from Jaeger (2001) a sufficient condition for the non-existence of echo states.

Proposition 6.1. Assume an untrained network (W^in, W, W^back) with state update according to (1.6) and with transfer functions tanh. Let W have a spectral radius |λ_max| > 1, where |λ_max| is the largest absolute value of an eigenvalue of W. Then the network has no echo states with respect to any input/output interval U x D containing the zero input/output (0, 0).

At face value, this proposition is not helpful for finding echo state networks. However, in practice it was consistently found that when the condition noted in Proposition 6.1 is not satisfied, i.e. when the spectral radius of the weight matrix is smaller than unity, we do have an echo state network. For the mathematically adventurous, here is a conjecture which remains to be shown:

Conjecture 6.1. Let δ, ε be two small positive numbers. Then there exists a network size N, such that when an N-sized dynamical reservoir is randomly constructed by (1) randomly generating a weight matrix W_0 by sampling the weights from a uniform distribution over [-1, 1], (2) normalizing W_0 to a matrix W_1 with unit spectral radius by putting W_1 = 1/|λ_max| W_0, where |λ_max| is the spectral radius of W_0, (3) scaling W_1 to W_2 = (1 - ε) W_1, whereby W_2 obtains a spectral radius of (1 - ε), then the network (W^in, W_2, W^back) is an echo state network with probability 1 - δ.

Note that both in Proposition 6.1 and Conjecture 6.1 the input and backprojection weights are not used for the claims. It seems that these weights are irrelevant for the echo state property. In practice, it is found that they can be freely chosen without affecting the echo state property. Again, a mathematical analysis of these observations remains to be done.

For practical purposes, the following procedure (also used in the conjecture) seems to guarantee echo state networks (a code sketch follows at the end of this subsection):

1. Randomly generate an internal weight matrix W_0.
2. Normalize W_0 to a matrix W_1 with unit spectral radius by putting W_1 = 1/|λ_max| W_0, where |λ_max| is the spectral radius of W_0.
3. Scale W_1 to W = α W_1, where α < 1, whereby W has a spectral radius of α.
4. Then, the untrained network (W^in, W, W^back) is (or more precisely, has always been found to be) an echo state network, regardless of how W^in, W^back are chosen.

The diligent choice of the spectral radius α of the DR weight matrix is of crucial importance for the eventual success of ESN training. This is because α is intimately connected to the intrinsic timescale of the dynamics of the DR state. Small α means that one has a fast DR, large α (i.e., close to unity) means that one has a slow DR. The intrinsic timescale of the task should match the DR timescale. For example, if one wishes to train a sine generator as in the example of Subsection 6.1.1, one should use a small α for fast sinewaves and a large α for slow sinewaves.

Note that the DR timescale seems to depend exponentially on 1 - α, so e.g. settings of α = 0.99, 0.98, 0.97 will yield an exponential speedup of DR timescale, not a linear one. However, these remarks rest only on empirical observations; a rigorous mathematical investigation remains to be carried out. An illustrative example for a fast task is given in Jaeger (2001, Section 4.2), where a very fast "switching"-type of dynamics was trained with a DR whose spectral radius was set to 0.4, which is quite small considering the exponential nature of the time scale dependence on α. Standard settings of α lie in a range between 0.7 and values close to unity. The sinewave generator presented in Section 6.1.1 and the tuneable sinewave generator from Section 6.1.2 both used a DR with α = 0.8.

Figure 6.8 gives a plot of the training log error log(MSE_train) of the sinewave generator training task considered in Section 6.1.1, obtained with different settings of α. It is evident that a proper setting of this parameter is crucial for the quality of the resulting generator network.

Figure 6.8: log10(MSE_train) vs. spectral radius α of the DR weight matrix for the sinewave generator experiment from Section 6.1.1.
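In code, the four-step procedure reads as follows (a sketch; step 4 then consists of attaching arbitrarily chosen W^in and W^back):

import numpy as np

def make_dr(N, alpha, seed=0):
    # Steps 1-3: random W_0, normalize to unit spectral radius, scale by alpha.
    rng = np.random.default_rng(seed)
    W0 = rng.uniform(-1, 1, (N, N))                 # step 1
    W1 = W0 / max(abs(np.linalg.eigvals(W0)))       # step 2: spectral radius 1
    return alpha * W1                               # step 3: spectral radius alpha

W = make_dr(100, alpha=0.8)
print(max(abs(np.linalg.eigvals(W))))               # ~ 0.8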

29 Note that the DR tmescale seems to depend exponentally on 1, so e.g. settngs of 0.99, 0.98, 0.97 wll yeld an exponental speedup of DR tmescale, not a lnear one. However, these remarks rest only on emprcal observatons; a rgorous mathematcal nvestgaton remans to be carred out. An llustratve example for a fast task s gven n Jaeger 2001, Secton 4.2), where very fast "swtchng"-type of dynamcs was traned wth a DR whose spectral radus was set to 4, whch s qute small consderng the exponental nature of tme scale dependence on. Standard settngs of le n a range between 0.7 and The snewave generator presented n Secton and the tuneable snewave generator from Secton both used a DR wth 0.8. Fgure 6.8 gves a plot of the tranng log error logmse tran ) of the snewave generator tranng task consdered n Secton obtaned wth dfferent settngs of. It s evdent that a proper settng of ths parameter s crucal for the qualty of the resultng generator network Fgure 6.8: log 10 MSE tran ) vs. spectral radus of the DR weght matrx for the snewave generator experment from Secton Tranng echo state networks: algorthm Wth the sold grasp on the echo state concept procured by Secton 6.2, we can now gve a complete descrpton of tranng ESNs for a gven task. In ths descrpton we assume that the output unts) are sgmod unts; we further assume that there are output-to-dr feedback connectons. Ths s the most comprehensve verson of the algorthm. Often one wll use smpler versons, e.g. lnear output unts; no output-to- DR feedback connectons; or even systems wthout nput such as the pure snewave generator). In such cases, the algorthm presented below has to be adapted n obvous ways. Gven: A tranng nput/output sequence u1), d1)),..., ut), dt)). Wanted: A traned ESN W n, W, W back, W out ) whose output y approxmates the teacher output d, when the ESN s drven by the tranng nput u. Notes: 29


More information

An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services

An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services An Evaluaton of the Extended Logstc, Smple Logstc, and Gompertz Models for Forecastng Short Lfecycle Products and Servces Charles V. Trappey a,1, Hsn-yng Wu b a Professor (Management Scence), Natonal Chao

More information

Single and multiple stage classifiers implementing logistic discrimination

Single and multiple stage classifiers implementing logistic discrimination Sngle and multple stage classfers mplementng logstc dscrmnaton Hélo Radke Bttencourt 1 Dens Alter de Olvera Moraes 2 Vctor Haertel 2 1 Pontfíca Unversdade Católca do Ro Grande do Sul - PUCRS Av. Ipranga,

More information

RELIABILITY, RISK AND AVAILABILITY ANLYSIS OF A CONTAINER GANTRY CRANE ABSTRACT

RELIABILITY, RISK AND AVAILABILITY ANLYSIS OF A CONTAINER GANTRY CRANE ABSTRACT Kolowrock Krzysztof Joanna oszynska MODELLING ENVIRONMENT AND INFRATRUCTURE INFLUENCE ON RELIABILITY AND OPERATION RT&A # () (Vol.) March RELIABILITY RIK AND AVAILABILITY ANLYI OF A CONTAINER GANTRY CRANE

More information

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12 14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed

More information

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting Causal, Explanatory Forecastng Assumes cause-and-effect relatonshp between system nputs and ts output Forecastng wth Regresson Analyss Rchard S. Barr Inputs System Cause + Effect Relatonshp The job of

More information

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña Proceedngs of the 2008 Wnter Smulaton Conference S. J. Mason, R. R. Hll, L. Mönch, O. Rose, T. Jefferson, J. W. Fowler eds. A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION

More information

The Application of Fractional Brownian Motion in Option Pricing

The Application of Fractional Brownian Motion in Option Pricing Vol. 0, No. (05), pp. 73-8 http://dx.do.org/0.457/jmue.05.0..6 The Applcaton of Fractonal Brownan Moton n Opton Prcng Qng-xn Zhou School of Basc Scence,arbn Unversty of Commerce,arbn zhouqngxn98@6.com

More information

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo

More information

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. Sequential Optimizing Investing Strategy with Neural Networks

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. Sequential Optimizing Investing Strategy with Neural Networks MATHEMATICAL ENGINEERING TECHNICAL REPORTS Sequental Optmzng Investng Strategy wth Neural Networks Ryo ADACHI and Akmch TAKEMURA METR 2010 03 February 2010 DEPARTMENT OF MATHEMATICAL INFORMATICS GRADUATE

More information

Formulating & Solving Integer Problems Chapter 11 289

Formulating & Solving Integer Problems Chapter 11 289 Formulatng & Solvng Integer Problems Chapter 11 289 The Optonal Stop TSP If we drop the requrement that every stop must be vsted, we then get the optonal stop TSP. Ths mght correspond to a ob sequencng

More information

Joe Pimbley, unpublished, 2005. Yield Curve Calculations

Joe Pimbley, unpublished, 2005. Yield Curve Calculations Joe Pmbley, unpublshed, 005. Yeld Curve Calculatons Background: Everythng s dscount factors Yeld curve calculatons nclude valuaton of forward rate agreements (FRAs), swaps, nterest rate optons, and forward

More information

A frequency decomposition time domain model of broadband frequency-dependent absorption: Model II

A frequency decomposition time domain model of broadband frequency-dependent absorption: Model II A frequenc decomposton tme doman model of broadband frequenc-dependent absorpton: Model II W. Chen Smula Research Laborator, P. O. Box. 134, 135 Lsaker, Norwa (1 Aprl ) (Proect collaborators: A. Bounam,

More information

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence 1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh

More information

L10: Linear discriminants analysis

L10: Linear discriminants analysis L0: Lnear dscrmnants analyss Lnear dscrmnant analyss, two classes Lnear dscrmnant analyss, C classes LDA vs. PCA Lmtatons of LDA Varants of LDA Other dmensonalty reducton methods CSCE 666 Pattern Analyss

More information

Inter-Ing 2007. INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007.

Inter-Ing 2007. INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007. Inter-Ing 2007 INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007. UNCERTAINTY REGION SIMULATION FOR A SERIAL ROBOT STRUCTURE MARIUS SEBASTIAN

More information

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

More information

Damage detection in composite laminates using coin-tap method

Damage detection in composite laminates using coin-tap method Damage detecton n composte lamnates usng con-tap method S.J. Km Korea Aerospace Research Insttute, 45 Eoeun-Dong, Youseong-Gu, 35-333 Daejeon, Republc of Korea yaeln@kar.re.kr 45 The con-tap test has the

More information

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK Sample Stablty Protocol Background The Cholesterol Reference Method Laboratory Network (CRMLN) developed certfcaton protocols for total cholesterol, HDL

More information

Chapter 4 ECONOMIC DISPATCH AND UNIT COMMITMENT

Chapter 4 ECONOMIC DISPATCH AND UNIT COMMITMENT Chapter 4 ECOOMIC DISATCH AD UIT COMMITMET ITRODUCTIO A power system has several power plants. Each power plant has several generatng unts. At any pont of tme, the total load n the system s met by the

More information

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending Proceedngs of 2012 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 25 (2012) (2012) IACSIT Press, Sngapore Bayesan Network Based Causal Relatonshp Identfcaton and Fundng Success

More information

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements Lecture 3 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Next lecture: Matlab tutoral Announcements Rules for attendng the class: Regstered for credt Regstered for audt (only f there

More information

Calculating the high frequency transmission line parameters of power cables

Calculating the high frequency transmission line parameters of power cables < ' Calculatng the hgh frequency transmsson lne parameters of power cables Authors: Dr. John Dcknson, Laboratory Servces Manager, N 0 RW E B Communcatons Mr. Peter J. Ncholson, Project Assgnment Manager,

More information

An Enhanced Super-Resolution System with Improved Image Registration, Automatic Image Selection, and Image Enhancement

An Enhanced Super-Resolution System with Improved Image Registration, Automatic Image Selection, and Image Enhancement An Enhanced Super-Resoluton System wth Improved Image Regstraton, Automatc Image Selecton, and Image Enhancement Yu-Chuan Kuo ( ), Chen-Yu Chen ( ), and Chou-Shann Fuh ( ) Department of Computer Scence

More information

Calculation of Sampling Weights

Calculation of Sampling Weights Perre Foy Statstcs Canada 4 Calculaton of Samplng Weghts 4.1 OVERVIEW The basc sample desgn used n TIMSS Populatons 1 and 2 was a two-stage stratfed cluster desgn. 1 The frst stage conssted of a sample

More information

Logistic Regression. Steve Kroon

Logistic Regression. Steve Kroon Logstc Regresson Steve Kroon Course notes sectons: 24.3-24.4 Dsclamer: these notes do not explctly ndcate whether values are vectors or scalars, but expects the reader to dscern ths from the context. Scenaro

More information

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation Exhaustve Regresson An Exploraton of Regresson-Based Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The

More information

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008 Rsk-based Fatgue Estmate of Deep Water Rsers -- Course Project for EM388F: Fracture Mechancs, Sprng 2008 Chen Sh Department of Cvl, Archtectural, and Envronmental Engneerng The Unversty of Texas at Austn

More information

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance Calbraton Method Instances of the Cell class (one nstance for each FMS cell) contan ADC raw data and methods assocated wth each partcular FMS cell. The calbraton method ncludes event selecton (Class Cell

More information

Multiple-Period Attribution: Residuals and Compounding

Multiple-Period Attribution: Residuals and Compounding Multple-Perod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens

More information

Time Value of Money Module

Time Value of Money Module Tme Value of Money Module O BJECTIVES After readng ths Module, you wll be able to: Understand smple nterest and compound nterest. 2 Compute and use the future value of a sngle sum. 3 Compute and use the

More information

IMPACT ANALYSIS OF A CELLULAR PHONE

IMPACT ANALYSIS OF A CELLULAR PHONE 4 th ASA & μeta Internatonal Conference IMPACT AALYSIS OF A CELLULAR PHOE We Lu, 2 Hongy L Bejng FEAonlne Engneerng Co.,Ltd. Bejng, Chna ABSTRACT Drop test smulaton plays an mportant role n nvestgatng

More information

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching)

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching) Face Recognton Problem Face Verfcaton Problem Face Verfcaton (1:1 matchng) Querymage face query Face Recognton (1:N matchng) database Applcaton: Access Control www.vsage.com www.vsoncs.com Bometrc Authentcaton

More information

Traffic State Estimation in the Traffic Management Center of Berlin

Traffic State Estimation in the Traffic Management Center of Berlin Traffc State Estmaton n the Traffc Management Center of Berln Authors: Peter Vortsch, PTV AG, Stumpfstrasse, D-763 Karlsruhe, Germany phone ++49/72/965/35, emal peter.vortsch@ptv.de Peter Möhl, PTV AG,

More information

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

v a 1 b 1 i, a 2 b 2 i,..., a n b n i. SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 455 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces we have studed thus far n the text are real vector spaces snce the scalars are

More information

1 Example 1: Axis-aligned rectangles

1 Example 1: Axis-aligned rectangles COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 6 Scrbe: Aaron Schld February 21, 2013 Last class, we dscussed an analogue for Occam s Razor for nfnte hypothess spaces that, n conjuncton

More information

SCHEDULING OF CONSTRUCTION PROJECTS BY MEANS OF EVOLUTIONARY ALGORITHMS

SCHEDULING OF CONSTRUCTION PROJECTS BY MEANS OF EVOLUTIONARY ALGORITHMS SCHEDULING OF CONSTRUCTION PROJECTS BY MEANS OF EVOLUTIONARY ALGORITHMS Magdalena Rogalska 1, Wocech Bożeko 2,Zdzsław Heduck 3, 1 Lubln Unversty of Technology, 2- Lubln, Nadbystrzycka 4., Poland. E-mal:rogalska@akropols.pol.lubln.pl

More information

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy Fnancal Tme Seres Analyss Patrck McSharry patrck@mcsharry.net www.mcsharry.net Trnty Term 2014 Mathematcal Insttute Unversty of Oxford Course outlne 1. Data analyss, probablty, correlatons, vsualsaton

More information

Frequency Selective IQ Phase and IQ Amplitude Imbalance Adjustments for OFDM Direct Conversion Transmitters

Frequency Selective IQ Phase and IQ Amplitude Imbalance Adjustments for OFDM Direct Conversion Transmitters Frequency Selectve IQ Phase and IQ Ampltude Imbalance Adjustments for OFDM Drect Converson ransmtters Edmund Coersmeer, Ernst Zelnsk Noka, Meesmannstrasse 103, 44807 Bochum, Germany edmund.coersmeer@noka.com,

More information

Statistical Methods to Develop Rating Models

Statistical Methods to Develop Rating Models Statstcal Methods to Develop Ratng Models [Evelyn Hayden and Danel Porath, Österrechsche Natonalbank and Unversty of Appled Scences at Manz] Source: The Basel II Rsk Parameters Estmaton, Valdaton, and

More information

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 819-840 (2008) Data Broadcast on a Mult-System Heterogeneous Overlayed Wreless Network * Department of Computer Scence Natonal Chao Tung Unversty Hsnchu,

More information

Dynamic Pricing for Smart Grid with Reinforcement Learning

Dynamic Pricing for Smart Grid with Reinforcement Learning Dynamc Prcng for Smart Grd wth Renforcement Learnng Byung-Gook Km, Yu Zhang, Mhaela van der Schaar, and Jang-Won Lee Samsung Electroncs, Suwon, Korea Department of Electrcal Engneerng, UCLA, Los Angeles,

More information

Implied (risk neutral) probabilities, betting odds and prediction markets

Implied (risk neutral) probabilities, betting odds and prediction markets Impled (rsk neutral) probabltes, bettng odds and predcton markets Fabrzo Caccafesta (Unversty of Rome "Tor Vergata") ABSTRACT - We show that the well known euvalence between the "fundamental theorem of

More information

An artificial Neural Network approach to monitor and diagnose multi-attribute quality control processes. S. T. A. Niaki*

An artificial Neural Network approach to monitor and diagnose multi-attribute quality control processes. S. T. A. Niaki* Journal of Industral Engneerng Internatonal July 008, Vol. 4, No. 7, 04 Islamc Azad Unversty, South Tehran Branch An artfcal Neural Network approach to montor and dagnose multattrbute qualty control processes

More information

We assume your students are learning about self-regulation (how to change how alert they feel) through the Alert Program with its three stages:

We assume your students are learning about self-regulation (how to change how alert they feel) through the Alert Program with its three stages: Welcome to ALERT BINGO, a fun-flled and educatonal way to learn the fve ways to change engnes levels (Put somethng n your Mouth, Move, Touch, Look, and Lsten) as descrbed n the How Does Your Engne Run?

More information

Fast Fuzzy Clustering of Web Page Collections

Fast Fuzzy Clustering of Web Page Collections Fast Fuzzy Clusterng of Web Page Collectons Chrstan Borgelt and Andreas Nürnberger Dept. of Knowledge Processng and Language Engneerng Otto-von-Guercke-Unversty of Magdeburg Unverstätsplatz, D-396 Magdeburg,

More information

HowHow to Find the Best Online Stock Broker

HowHow to Find the Best Online Stock Broker A GENERAL APPROACH FOR SECURITY MONITORING AND PREVENTIVE CONTROL OF NETWORKS WITH LARGE WIND POWER PRODUCTION Helena Vasconcelos INESC Porto hvasconcelos@nescportopt J N Fdalgo INESC Porto and FEUP jfdalgo@nescportopt

More information

Lecture 3: Force of Interest, Real Interest Rate, Annuity

Lecture 3: Force of Interest, Real Interest Rate, Annuity Lecture 3: Force of Interest, Real Interest Rate, Annuty Goals: Study contnuous compoundng and force of nterest Dscuss real nterest rate Learn annuty-mmedate, and ts present value Study annuty-due, and

More information

Traffic-light a stress test for life insurance provisions

Traffic-light a stress test for life insurance provisions MEMORANDUM Date 006-09-7 Authors Bengt von Bahr, Göran Ronge Traffc-lght a stress test for lfe nsurance provsons Fnansnspetonen P.O. Box 6750 SE-113 85 Stocholm [Sveavägen 167] Tel +46 8 787 80 00 Fax

More information

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits Lnear Crcuts Analyss. Superposton, Theenn /Norton Equalent crcuts So far we hae explored tmendependent (resste) elements that are also lnear. A tmendependent elements s one for whch we can plot an / cure.

More information

Financial Mathemetics

Financial Mathemetics Fnancal Mathemetcs 15 Mathematcs Grade 12 Teacher Gude Fnancal Maths Seres Overvew In ths seres we am to show how Mathematcs can be used to support personal fnancal decsons. In ths seres we jon Tebogo,

More information

Rate Monotonic (RM) Disadvantages of cyclic. TDDB47 Real Time Systems. Lecture 2: RM & EDF. Priority-based scheduling. States of a process

Rate Monotonic (RM) Disadvantages of cyclic. TDDB47 Real Time Systems. Lecture 2: RM & EDF. Priority-based scheduling. States of a process Dsadvantages of cyclc TDDB47 Real Tme Systems Manual scheduler constructon Cannot deal wth any runtme changes What happens f we add a task to the set? Real-Tme Systems Laboratory Department of Computer

More information

FORMAL ANALYSIS FOR REAL-TIME SCHEDULING

FORMAL ANALYSIS FOR REAL-TIME SCHEDULING FORMAL ANALYSIS FOR REAL-TIME SCHEDULING Bruno Dutertre and Vctora Stavrdou, SRI Internatonal, Menlo Park, CA Introducton In modern avoncs archtectures, applcaton software ncreasngly reles on servces provded

More information

Quantization Effects in Digital Filters

Quantization Effects in Digital Filters Quantzaton Effects n Dgtal Flters Dstrbuton of Truncaton Errors In two's complement representaton an exact number would have nfntely many bts (n general). When we lmt the number of bts to some fnte value

More information

Development of an intelligent system for tool wear monitoring applying neural networks

Development of an intelligent system for tool wear monitoring applying neural networks of Achevements n Materals and Manufacturng Engneerng VOLUME 14 ISSUE 1-2 January-February 2006 Development of an ntellgent system for tool wear montorng applyng neural networks A. Antć a, J. Hodolč a,

More information

A study on the ability of Support Vector Regression and Neural Networks to Forecast Basic Time Series Patterns

A study on the ability of Support Vector Regression and Neural Networks to Forecast Basic Time Series Patterns A study on the ablty of Support Vector Regresson and Neural Networks to Forecast Basc Tme Seres Patterns Sven F. Crone, Jose Guajardo 2, and Rchard Weber 2 Lancaster Unversty, Department of Management

More information

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background: SPEE Recommended Evaluaton Practce #6 efnton of eclne Curve Parameters Background: The producton hstores of ol and gas wells can be analyzed to estmate reserves and future ol and gas producton rates and

More information

Section 5.4 Annuities, Present Value, and Amortization

Section 5.4 Annuities, Present Value, and Amortization Secton 5.4 Annutes, Present Value, and Amortzaton Present Value In Secton 5.2, we saw that the present value of A dollars at nterest rate per perod for n perods s the amount that must be deposted today

More information

PERRON FROBENIUS THEOREM

PERRON FROBENIUS THEOREM PERRON FROBENIUS THEOREM R. CLARK ROBINSON Defnton. A n n matrx M wth real entres m, s called a stochastc matrx provded () all the entres m satsfy 0 m, () each of the columns sum to one, m = for all, ()

More information

Stochastic epidemic models revisited: Analysis of some continuous performance measures

Stochastic epidemic models revisited: Analysis of some continuous performance measures Stochastc epdemc models revsted: Analyss of some contnuous performance measures J.R. Artalejo Faculty of Mathematcs, Complutense Unversty of Madrd, 28040 Madrd, Span A. Economou Department of Mathematcs,

More information

Efficient Project Portfolio as a tool for Enterprise Risk Management

Efficient Project Portfolio as a tool for Enterprise Risk Management Effcent Proect Portfolo as a tool for Enterprse Rsk Management Valentn O. Nkonov Ural State Techncal Unversty Growth Traectory Consultng Company January 5, 27 Effcent Proect Portfolo as a tool for Enterprse

More information

The OC Curve of Attribute Acceptance Plans

The OC Curve of Attribute Acceptance Plans The OC Curve of Attrbute Acceptance Plans The Operatng Characterstc (OC) curve descrbes the probablty of acceptng a lot as a functon of the lot s qualty. Fgure 1 shows a typcal OC Curve. 10 8 6 4 1 3 4

More information

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION Vson Mouse Saurabh Sarkar a* a Unversty of Cncnnat, Cncnnat, USA ABSTRACT The report dscusses a vson based approach towards trackng of eyes and fngers. The report descrbes the process of locatng the possble

More information

where the coordinates are related to those in the old frame as follows.

where the coordinates are related to those in the old frame as follows. Chapter 2 - Cartesan Vectors and Tensors: Ther Algebra Defnton of a vector Examples of vectors Scalar multplcaton Addton of vectors coplanar vectors Unt vectors A bass of non-coplanar vectors Scalar product

More information

On the Interaction between Load Balancing and Speed Scaling

On the Interaction between Load Balancing and Speed Scaling On the Interacton between Load Balancng and Speed Scalng Ljun Chen and Na L Abstract Speed scalng has been wdely adopted n computer and communcaton systems, n partcular, to reduce energy consumpton. An

More information

An Empirical Study of Search Engine Advertising Effectiveness

An Empirical Study of Search Engine Advertising Effectiveness An Emprcal Study of Search Engne Advertsng Effectveness Sanjog Msra, Smon School of Busness Unversty of Rochester Edeal Pnker, Smon School of Busness Unversty of Rochester Alan Rmm-Kaufman, Rmm-Kaufman

More information

On-Line Fault Detection in Wind Turbine Transmission System using Adaptive Filter and Robust Statistical Features

On-Line Fault Detection in Wind Turbine Transmission System using Adaptive Filter and Robust Statistical Features On-Lne Fault Detecton n Wnd Turbne Transmsson System usng Adaptve Flter and Robust Statstcal Features Ruoyu L Remote Dagnostcs Center SKF USA Inc. 3443 N. Sam Houston Pkwy., Houston TX 77086 Emal: ruoyu.l@skf.com

More information

Time Value of Money. Types of Interest. Compounding and Discounting Single Sums. Page 1. Ch. 6 - The Time Value of Money. The Time Value of Money

Time Value of Money. Types of Interest. Compounding and Discounting Single Sums. Page 1. Ch. 6 - The Time Value of Money. The Time Value of Money Ch. 6 - The Tme Value of Money Tme Value of Money The Interest Rate Smple Interest Compound Interest Amortzng a Loan FIN21- Ahmed Y, Dasht TIME VALUE OF MONEY OR DISCOUNTED CASH FLOW ANALYSIS Very Important

More information

Stochastic Protocol Modeling for Anomaly Based Network Intrusion Detection

Stochastic Protocol Modeling for Anomaly Based Network Intrusion Detection Stochastc Protocol Modelng for Anomaly Based Network Intruson Detecton Juan M. Estevez-Tapador, Pedro Garca-Teodoro, and Jesus E. Daz-Verdejo Department of Electroncs and Computer Technology Unversty of

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Dstrb. Comput. 71 (2011) 62 76 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. journal homepage: www.elsever.com/locate/jpdc Optmzng server placement n dstrbuted systems n

More information

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and POLYSA: A Polynomal Algorthm for Non-bnary Constrant Satsfacton Problems wth and Mguel A. Saldo, Federco Barber Dpto. Sstemas Informátcos y Computacón Unversdad Poltécnca de Valenca, Camno de Vera s/n

More information

Multiple stage amplifiers

Multiple stage amplifiers Multple stage amplfers Ams: Examne a few common 2-transstor amplfers: -- Dfferental amplfers -- Cascode amplfers -- Darlngton pars -- current mrrors Introduce formal methods for exactly analysng multple

More information