A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach

Herbert Jaeger
Fraunhofer Institute for Autonomous Intelligent Systems (AIS); since 2003: International University Bremen

First published: Oct. 2002
First revision: Feb. 2004
Second revision: March 2005

Abstract: This tutorial is a worked-out version of a 5-hour course originally held at AIS in September/October 2002. It has two distinct components. First, it contains a mathematically-oriented crash course on traditional training methods for recurrent neural networks, covering back-propagation through time (BPTT), real-time recurrent learning (RTRL), and extended Kalman filtering approaches (EKF). This material is covered in Sections 2-5. The remaining Sections 1 and 6-9 are much more gentle, more detailed, and illustrated with simple examples. They are intended to be useful as a stand-alone tutorial for the echo state network (ESN) approach to recurrent neural network training.

The author apologizes for the poor layout of this document: it was transformed from an html file into a Word file...

This manuscript was first printed in October 2002 as

H. Jaeger (2002): Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. GMD Report 159, German National Research Center for Information Technology.

Revision history:
01/04/2004: several serious typos/errors in Sections 3 and 5
03/05/2004: numerous typos
21/03/2005: errors in Section 8.1, updated some URLs

Index

1. Recurrent neural networks
1.1 First impression
1.2 Supervised training: basic scheme
1.3 Formal description of RNNs
1.4 Example: a little timer network
2. Standard training techniques for RNNs
2.1 Backpropagation revisited
2.2 Backpropagation through time
3. Real-time recurrent learning
4. Higher-order gradient descent techniques
5. Extended Kalman-filtering approaches
5.1 The extended Kalman filter
5.2 Applying EKF to RNN weight estimation
6. Echo state networks
6.1 Training echo state networks
6.1.1 First example: a sinewave generator
6.1.2 Second example: a tuneable sinewave generator
6.2 Training echo state networks: mathematics of echo states
6.3 Training echo state networks: algorithm
6.4 Why echo states?
6.5 Liquid state machines
7. Short term memory in ESNs
7.1 First example: training an ESN as a delay line
7.2 Theoretical insights
8. ESNs with leaky integrator neurons
8.1 The neuron model
8.2 Example: slow sinewave generator
9. Tricks of the trade
References

1. Recurrent neural networks

1.1 First impression

There are two major types of neural networks, feedforward and recurrent. In feedforward networks, activation is "piped" through the network from input units to output units (from left to right in the left drawing in Fig. 1.1).

Figure 1.1: Typical structure of a feedforward network (left) and a recurrent network (right).

Short characterization of feedforward networks:
- typically, activation is fed forward from input to output through "hidden layers" ("Multi-Layer Perceptrons", MLP), though many other architectures exist
- mathematically, they implement static input-output mappings (functions)
- basic theoretical result: MLPs can approximate arbitrary (term needs some qualification) nonlinear maps with arbitrary precision ("universal approximation property")
- most popular supervised training algorithm: backpropagation algorithm
- huge literature, 95 % of neural network publications concern feedforward nets (my estimate)
- have proven useful in many practical applications as approximators of nonlinear functions and as pattern classifiers
- are not the topic considered in this tutorial

By contrast, a recurrent neural network (RNN) has (at least one) cyclic path of synaptic connections. Basic characteristics:
- all biological neural networks are recurrent
- mathematically, RNNs implement dynamical systems
- basic theoretical result: RNNs can approximate arbitrary (term needs some qualification) dynamical systems with arbitrary precision ("universal approximation property")
- several types of training algorithms are known, no clear winner
- theoretical and practical difficulties have by and large prevented practical applications so far

- not covered in most neuroinformatics textbooks, absent from engineering textbooks
- this tutorial is all about them.

Because biological neuronal systems are recurrent, RNN models abound in the biological and biocybernetical literature. Standard types of research papers include...
- bottom-up, detailed neurosimulation:
  o compartment models of small (even single-unit) systems
  o complex biological network models (e.g. Freeman's olfactory bulb models)
- top-down, investigation of principles:
  o complete mathematical study of few-unit networks (in AIS: Pasemann, Giannakopoulos)
  o universal properties of dynamical systems as "explanations" for cognitive neurodynamics, e.g. "concept ~ attractor state"; "learning ~ parameter change"; "jumps in learning and development ~ bifurcations"
  o demonstration of dynamical working principles
  o synaptic learning dynamics and conditioning
  o synfire chains

This tutorial does not enter this vast area. The tutorial is about algorithmical RNNs, intended as blackbox models for engineering and signal processing. The general picture is given in Fig. 1.2.

Figure 1.2: Principal moves in the blackbox modeling game. (A physical system is observed, yielding empirical time series data; from these an RNN model is "learned" / "estimated" / "identified"; the model generates data which should fit the empirical data and have a similar distribution.)

Types of tasks for which RNNs can, in principle, be used:
- system identification and inverse system identification
- filtering and prediction
- pattern classification
- stochastic sequence modeling
- associative memory
- data compression

Some relevant application areas:
- telecommunication
- control of chemical plants
- control of engines and generators
- fault monitoring, biomedical diagnostics and monitoring
- speech recognition
- robotics, toys and edutainment
- video data analysis
- man-machine interfaces

State of usage in applications: RNNs are not often proposed in technical articles as "in principle promising" solutions for difficult tasks. Demo prototypes exist in simulated or clean laboratory tasks. Not economically relevant yet. Why? Supervised training of RNNs is (was) extremely difficult. This is the topic of this tutorial.

1.2 Supervised training: basic scheme

There are two basic classes of "learning": supervised and unsupervised (and unclear cases, e.g. reinforcement learning). This tutorial considers only supervised training. In supervised training of RNNs, one starts with teacher data (or training data): empirically observed or artificially constructed input-output time series, which represent examples of the desired model behavior.

Figure 1.3: Supervised training scheme. A. Training: teacher and model both map an input sequence to an output sequence; the model is adapted so that its output reproduces the teacher output. B. Exploitation: the trained model computes an output for a new input; the correct output is unknown.

The teacher data is used to train a RNN such that it more or less precisely reproduces (fits) the teacher data, hoping that the RNN then generalizes to novel inputs. That is, when the trained RNN receives an input sequence which is somehow similar to the training input sequence, it should generate an output which resembles the output of the original system.

A fundamental issue in supervised training is overfitting: if the model fits the training data too well (extreme case: model duplicates teacher data exactly), it has only "learnt the training data by heart" and will not generalize well. This is particularly important with small training samples. Statistical learning theory addresses this problem. For RNN training, however, this tended to be a non-issue, because known training methods have a hard time fitting training data well in the first place.

1.3 Formal description of RNNs

The elementary building blocks of a RNN are neurons (we will use the term units) connected by synaptic links (connections) whose synaptic strength is coded by a weight. One typically distinguishes input units, internal (or hidden) units, and output units. At a given time, a unit has an activation. We denote the activations of input units by u(n), of internal units by x(n), of output units by y(n). Sometimes we ignore the input/internal/output distinction and then use x(n) in a metonymical fashion.

Figure 1.4: A typology of RNN models (incomplete). (Discrete-time models, x(n+1) = f(Wx(n)), vs. continuous-time models of the form dx/dt = -x + f(Wx); both can be augmented by input, bias, noise, output feedback, ...; further distinctions: spiking models, spatially organized models.)

There are many types of formal RNN models (see Fig. 1.4). Discrete-time models are mathematically cast as maps iterated over discrete time steps n = 1, 2, 3, .... Continuous-time models are defined through differential equations whose solutions are defined over a continuous time t. Especially for purposes of biological modeling, continuous dynamical models can be quite involved and describe activation signals on the level of individual action potentials (spikes). Often the model incorporates a specification of a spatial topology, most often of a 2D surface where units are locally connected in retina-like structures. In this tutorial we will only consider a particular kind of discrete-time model without spatial organization.

Our model consists of K input units with an activation (column) vector

(1.1)    u(n) = (u_1(n), ..., u_K(n))^t,

of N internal units with an activation vector

(1.2)    x(n) = (x_1(n), ..., x_N(n))^t,

and of L output units with an activation vector

(1.3)    y(n) = (y_1(n), ..., y_L(n))^t,

where ^t denotes transpose. The input / internal / output connection weights are collected in N x K / N x N / L x (K+N+L) weight matrices

(1.4)    W^in = (w^in_ij),   W = (w_ij),   W^out = (w^out_ij).

The output units may optionally project back to internal units with connections whose weights are collected in an N x L backprojection weight matrix

(1.5)    W^back = (w^back_ij).

Figure 1.5: The basic network architecture used in this tutorial: K input units, N internal units, L output units. Shaded arrows indicate optional connections. Dotted arrows mark connections which are trained in the "echo state network" approach (in other approaches, all connections can be trained).

A zero weight value can be interpreted as "no connection". Note that output units may have connections not only from internal units but also (often) from input units and (rarely) from output units.

The activation of internal units is updated according to

(1.6)    x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n)),

where u(n+1) is the externally given input, and f denotes the component-wise application of the individual unit's transfer function f (also known as activation function, unit output function, or squashing function). We will mostly use the sigmoid function f = tanh, but sometimes also consider linear networks with f = 1. The output is computed according to

(1.7)    y(n+1) = f^out(W^out (u(n+1), x(n+1), y(n))),

where (u(n+1), x(n+1), y(n)) denotes the concatenated vector made from input, internal, and output activation vectors. We will use output transfer functions f^out = tanh or f^out = 1; in the latter case we have linear output units.

1.4 Example: a little timer network

Consider the input-output task of timing. The input signal has two components. The first component u_1(n) is 0 most of the time, but sometimes jumps to 1. The second input u_2(n) can take values between 0.1 and 1.0 in increments of 0.1, and assumes a new (random) one of these values each time u_1 jumps to 1. The desired output is 0.5 for 10 x u_2 time steps after u_1 was 1, else it is 0. This amounts to implementing a timer: u_1 gives the "go" signal for the timer, u_2 gives the desired duration.

Figure 1.6: Schema of the timer network. Input 1: start signals; input 2: duration setting; output: rectangular signals of desired duration.

The following figure shows traces of input and output generated by a RNN trained on this task according to the ESN approach.

Figure 1.7: Performance of a RNN trained on the timer task. Solid line in last graph: desired (teacher) output. Dotted line: network output.
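As a concrete illustration of Eqs. (1.6) and (1.7), here is a minimal Python/NumPy sketch of one network update. The function name and all shapes are illustrative assumptions of this transcription, not part of the tutorial itself (the tutorial later refers to Matlab and Mathematica).

```python
import numpy as np

def rnn_step(x, y, u_next, W_in, W, W_back, W_out, f=np.tanh, f_out=np.tanh):
    """One update of the discrete-time RNN of Section 1.3.
    Assumed shapes: x (N,), y (L,), u_next (K,); W_in (N, K), W (N, N),
    W_back (N, L), W_out (L, K+N+L)."""
    x_next = f(W_in @ u_next + W @ x + W_back @ y)                    # Eq. (1.6)
    y_next = f_out(W_out @ np.concatenate([u_next, x_next, y]))       # Eq. (1.7)
    return x_next, y_next
```

Iterating this function over an input sequence runs the network; passing an identity function for f_out gives linear output units.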

Clearly this task requires that the RNN must act as a memory: it has to retain information about the "go" signal for several time steps. This is possible because the internal recurrent connections let the "go" signal "reverberate" in the internal units' activations. Generally, tasks requiring some form of memory are candidates for RNN modeling.

2. Standard training techniques for RNNs

During the last decade, several methods for supervised training of RNNs have been explored. In this tutorial we present the currently most important ones: backpropagation through time (BPTT), real-time recurrent learning (RTRL), and extended Kalman filtering based techniques (EKF). BPTT is probably the most widely used, RTRL is the mathematically most straightforward, and EKF is (arguably) the technique that gives best results.

2.1 Backpropagation revisited

BPTT is an adaptation of the well-known backpropagation training method known from feedforward networks. The backpropagation algorithm is the most commonly used training method for feedforward networks. We start with a recap of this method.

We consider a multi-layer perceptron (MLP) with k hidden layers. Together with the layer of input units and the layer of output units this gives k+2 layers of units altogether (Fig. 1.1 left shows a MLP with two hidden layers), which we number by 0, ..., k+1. The number of input units is K, of output units L, and of units in hidden layer m is N_m. The weight between the i-th unit in layer m and the j-th unit in layer m+1 is denoted by w_ij^m. The activation of the i-th unit in layer m is x_i^m (for m = 0 this is an input value, for m = k+1 an output value).

The training data for a feedforward network training task consist of T input-output (vector-valued) data pairs

(2.1)    u(n) = (x_1^0(n), ..., x_K^0(n))^t,    d(n) = (d_1(n), ..., d_L(n))^t,

where n denotes training instance, not time. The activation of non-input units is computed according to

(2.2)    x_j^m(n) = f( Σ_{i=1,...,N_{m-1}} w_ij^{m-1} x_i^{m-1}(n) ).

(Standardly one also has bias terms, which we omit here). Presented with teacher input u(n), the previous update equation is used to compute activations of units in subsequent hidden layers, until a network response

(2.3)    y(n) = (x_1^{k+1}(n), ..., x_L^{k+1}(n))^t

is obtained in the output layer. The objective of training is to find a set of network weights such that the summed squared error

(2.4)    E = Σ_{n=1,...,T} ||d(n) - y(n)||^2 = Σ_{n=1,...,T} E(n)

is minimized. This is done by incrementally changing the weights along the direction of the error gradient w.r.t. the weights,

(2.5)    ∂E/∂w_ij^m = Σ_{n=1,...,T} ∂E(n)/∂w_ij^m,

using a (small) learning rate γ:

(2.6)    new w_ij^m = w_ij^m - γ ∂E/∂w_ij^m.

(Superscript m omitted for readability). This is the formula used in batch learning mode, where new weights are computed after presenting all training samples. One such pass through all samples is called an epoch. Before the first epoch, weights are initialized, typically to small random numbers. A variant is incremental learning, where weights are changed after presentation of individual training samples:

(2.7)    new w_ij^m = w_ij^m - γ ∂E(n)/∂w_ij^m.

The central subtask in this method is the computation of the error gradients ∂E(n)/∂w_ij^m. The backpropagation algorithm is a scheme to perform these computations. We give the recipe for one epoch of batch processing.

Input: current weights w_ij^m, training samples.
Output: new weights.
Computation steps:

1. For each sample n, compute activations of internal and output units using (2.2) ("forward pass").
2. Compute, by proceeding backward through m = k+1, k, ..., 1, for each unit x_j^m the error propagation term δ_j^m(n):

(2.8)    δ_j^{k+1}(n) = (d_j(n) - y_j(n)) (∂f(u)/∂u)|_{u=z_j^{k+1}(n)}

for the output layer, and

(2.9)    δ_j^m(n) = ( Σ_{i=1,...,N_{m+1}} δ_i^{m+1}(n) w_ji^m ) (∂f(u)/∂u)|_{u=z_j^m(n)}

for the hidden layers, where

(2.10)    z_j^m(n) = Σ_{i=1,...,N_{m-1}} w_ij^{m-1} x_i^{m-1}(n)

is the internal state (or "potential") of unit x_j^m. This is the error backpropagation pass. Mathematically, the error propagation term δ_j^m represents the error gradient w.r.t. the potential of the unit x_j^m:

(2.10a)    δ_j^m(n) = - (∂E(n)/∂u)|_{u=z_j^m(n)}.

3. Adjust the connection weights according to

(2.11)    new w_ij^m = w_ij^m + γ Σ_{n=1,...,T} δ_j^{m+1}(n) x_i^m(n).

After every such epoch, compute the error according to (2.4). Stop when the error falls below a predetermined threshold, or when the change in error falls below another predetermined threshold, or when the number of epochs exceeds a predetermined maximal number of epochs. Many (order of thousands in nontrivial tasks) such epochs may be required until a sufficiently small error is achieved. One epoch requires O(T M) multiplications and additions, where M is the total number of network connections.

The basic gradient descent approach (and its backpropagation algorithm implementation) is notorious for slow convergence, because the learning rate γ must typically be chosen small to avoid instability. Many speed-up techniques are described in the literature, e.g. dynamic learning rate adaptation schemes. Another approach to achieve faster convergence is to use second-order gradient descent techniques, which exploit curvature of the gradient but have epoch complexity O(T M^2).

Like all gradient-descent techniques on error surfaces, backpropagation finds only a local error minimum. This problem can be addressed by various measures, e.g. adding noise during training (simulated annealing approaches) to avoid getting stuck in poor minima, or by repeating the entire learning from different initial weight settings, or by using task-specific prior information to start from an already plausible set of weights.
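The recipe above translates almost line by line into code. The following Python/NumPy sketch performs one batch epoch according to (2.2) and (2.8)-(2.11); it assumes tanh hidden layers, a linear output layer, no bias terms (as in the text), and illustrative names and shapes.

```python
import numpy as np

def backprop_epoch(weights, U, D, gamma=0.1):
    """One batch epoch of backpropagation for an MLP.
    weights: list [W^0, ..., W^k], W^m of shape (N_{m+1}, N_m); U: (T, K) inputs,
    D: (T, L) targets; tanh hidden layers, linear output layer assumed."""
    grads = [np.zeros_like(W) for W in weights]
    E = 0.0
    for u, d in zip(U, D):                       # loop over training samples
        # forward pass (2.2): store activations layer by layer
        xs = [u]
        for m, W in enumerate(weights):
            z = W @ xs[-1]
            xs.append(z if m == len(weights) - 1 else np.tanh(z))
        y = xs[-1]
        E += np.sum((d - y) ** 2)                # summed squared error (2.4)
        # backward pass (2.8)-(2.9): delta = -dE/dz, here f'(z) = 1 for the linear output
        delta = d - y
        for m in reversed(range(len(weights))):
            grads[m] += np.outer(delta, xs[m])   # accumulate the terms of (2.11)
            if m > 0:                            # propagate to the previous hidden layer
                delta = (weights[m].T @ delta) * (1.0 - xs[m] ** 2)
    for W, g in zip(weights, grads):             # batch weight update (2.6)/(2.11)
        W += gamma * g
    return E
```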

Some authors claim that the local minimum problem is overrated. Another problem is the selection of a suitable network topology (number and size of hidden layers). Again, one can use prior information, perform systematic search, or use intuition. All in all, while basic backpropagation is transparent and easily implemented, considerable expertise and experience (or patience) is a prerequisite for good results in non-trivial tasks.

2.2 Backpropagation through time

The feedforward backpropagation algorithm cannot be directly transferred to RNNs because the error backpropagation pass presupposes that the connections between units induce a cycle-free ordering. The solution of the BPTT approach is to "unfold" the recurrent network in time, by stacking identical copies of the RNN, and redirecting connections within the network to obtain connections between subsequent copies. This gives a feedforward network, which is amenable to the backpropagation algorithm.

Figure 2.1: Schema of the basic idea of BPTT. A: the original RNN. B: the feedforward network obtained from it by stacking copies for times ..., n-1, n, n+1, ... The case of single-channel input and output is shown. The weights w^in, w, w^out, w^back are identical in/between all copies.

The teacher data consists now of a single input-output time series

(2.12)    u(n) = (u_1(n), ..., u_K(n))^t,    d(n) = (d_1(n), ..., d_L(n))^t,    n = 1, ..., T.

The forward pass of one training epoch consists in updating the stacked network, starting from the first copy and working upwards through the stack. At each copy / time n, input u(n) is read in, then the internal state x(n) is computed from u(n), x(n-1) [and from y(n-1) if nonzero w^back exist], and finally the current copy's output y(n) is computed. The error to be minimized is again (like in (2.4))

(2.13)    E = Σ_{n=1,...,T} ||d(n) - y(n)||^2 = Σ_{n=1,...,T} E(n),

but the meaning of n has changed from "training instance" to "time". The algorithm is a straightforward, albeit notationally complicated, adaptation of the feedforward algorithm:

Input: current weights w_ij, training time series.
Output: new weights.
Computation steps:

1. Forward pass: as described above.
2. Compute, by proceeding backward through n = T, ..., 1, for each time n and unit activation x_j(n), y_j(n) the error propagation term δ_j(n):

(2.14)    δ_j(T) = (d_j(T) - y_j(T)) (∂f(u)/∂u)|_{u=z_j(T)}

for the output units of time layer T, and

(2.15)    δ_j(T) = ( Σ_{i=1,...,L} δ_i(T) w_ji^out ) (∂f(u)/∂u)|_{u=z_j(T)}

for internal units x_j(T) at time layer T, and

(2.16)    δ_j(n) = ( (d_j(n) - y_j(n)) + Σ_{i=1,...,N} δ_i(n+1) w_ji^back ) (∂f(u)/∂u)|_{u=z_j(n)}

for the output units of earlier layers, and

(2.17)    δ_j(n) = ( Σ_{i=1,...,N} δ_i(n+1) w_ji + Σ_{i=1,...,L} δ_i(n) w_ji^out ) (∂f(u)/∂u)|_{u=z_j(n)}

for internal units x_j(n) at earlier times, where z_j(n) again is the potential of the corresponding unit.

3. Adjust the connection weights according to

(2.18)    new w_ij^in   = w_ij^in   + γ Σ_{n=1,...,T} δ_j(n) u_i(n)
          new w_ij      = w_ij      + γ Σ_{n=1,...,T} δ_j(n) x_i(n-1)    [use x_i(n-1) = 0 for n = 1]
          new w_ij^out  = w_ij^out  + γ Σ_{n=1,...,T} δ_j(n) u_i(n), if i refers to an input unit,
                          w_ij^out  + γ Σ_{n=1,...,T} δ_j(n) x_i(n), if i refers to a hidden unit
          new w_ij^back = w_ij^back + γ Σ_{n=1,...,T} δ_j(n) y_i(n-1)    [use y_i(n-1) = 0 for n = 1]

Warning: programming errors are easily made but not easily perceived when they degrade performance only slightly.

The remarks concerning slow convergence made for standard backpropagation carry over to BPTT. The computational complexity of one epoch is O(T N^2), where N is the number of internal units. Several thousands of epochs are often required.

A variant of this algorithm is to use the teacher output d(n) in the computation of activations in layer n+1 in the forward pass. This is known as teacher forcing. Teacher forcing typically speeds up convergence, or even may be necessary to achieve convergence at all, but when the trained network is exploited, it may exhibit instability. A general rule on when to use teacher forcing cannot be given.

A drawback of this "batch" BPTT is that the entire teacher time series must be used. This precludes applications where online adaptation is required. The solution is to truncate the past history and use, at time n, only a finite history

(2.19)    u(n-p), u(n-p+1), ..., u(n),    d(n-p), d(n-p+1), ..., d(n)

as training data. Since the error backpropagation terms need to be computed only once for each new time slice, the complexity is O(N^2) per time step. A potential drawback of such truncated BPTT (or p-BPTT) is that memory effects exceeding a duration p cannot be captured by the model. Anyway, BPTT generally has difficulties capturing long-lived memory effects, because backpropagated error gradient information tends to "dilute" exponentially over time. A frequently stated opinion is that memory spans longer than 10 to 20 time steps are hard to achieve.

Repeated execution of training epochs shifts a complex nonlinear dynamical system (the network) slowly through parameter (weight) space. Therefore, bifurcations are necessarily encountered when the starting weights induce a qualitatively different dynamical behavior than the task requires. Near such bifurcations, the gradient information may become essentially useless, dramatically slowing down convergence. The error may even suddenly grow in the vicinity of such critical points, due to crossing bifurcation boundaries. Unlike feedforward backpropagation, BPTT is not guaranteed to converge to a local error minimum. This difficulty cannot arise with feedforward networks, because they realize functions, not dynamical systems.
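For concreteness, the following Python/NumPy sketch computes BPTT gradients for a simplified version of the network of Section 1.3 (tanh internal units, linear output units, no output feedback); it is an illustrative reconstruction under those assumptions, not code from the tutorial. The backward loop corresponds to (2.14)-(2.17) collapsed for this simpler architecture; weights would then be updated as in (2.18), here in the form w ← w - γ·gradient because the returned quantities are gradients of E.

```python
import numpy as np

def bptt_gradients(W_in, W, W_out, u, d):
    """Gradients of the summed squared error (2.13) for
    x(n+1) = tanh(W_in u(n+1) + W x(n)),  y(n) = W_out [u(n); x(n)].
    u: (T, K), d: (T, L); W_in (N, K), W (N, N), W_out (L, K+N)."""
    T, K = u.shape
    N = W.shape[0]
    # forward pass: store all states (the "unfolded" stack of copies)
    x = np.zeros((T + 1, N))                      # x[0] is the zero initial state
    for n in range(T):
        x[n + 1] = np.tanh(W_in @ u[n] + W @ x[n])
    z = np.hstack([u, x[1:]])                     # concatenated (u(n), x(n))
    err = z @ W_out.T - d                         # y(n) - d(n)
    # backward pass: error terms flow from later to earlier copies
    gW_in = np.zeros_like(W_in)
    gW = np.zeros_like(W)
    gW_out = 2 * err.T @ z
    delta_next = np.zeros(N)                      # propagated term from time n+1
    for n in reversed(range(T)):
        dx = 2 * W_out[:, K:].T @ err[n] + W.T @ delta_next
        delta = dx * (1.0 - x[n + 1] ** 2)        # through the tanh derivative
        gW_in += np.outer(delta, u[n])
        gW += np.outer(delta, x[n])
        delta_next = delta
    return gW_in, gW, gW_out
```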

All in all, it is far from trivial to achieve good results with BPTT, and much experimentation (and processor time) may be required before a satisfactory result is achieved. Because of limited processing time, BPTT is typically used with small networks (sizes in the order of 3 to 20 units). Larger networks may require many hours of computation on current hardware.

3. Real-time recurrent learning

Real-time recurrent learning (RTRL) is a gradient-descent method which computes the exact error gradient at every time step. It is therefore suitable for online learning tasks. I basically quote the description of RTRL given in Doya (1995). The most often cited early description of RTRL is Williams & Zipser (1989).

The effect of a weight change on the network dynamics can be seen by simply differentiating the network dynamics equations (1.6) and (1.7) by its weights. For convenience, activations of all units (whether input, internal, or output) are enumerated and denoted by v_i(n), and all weights are denoted by w_kl, with i = 1, ..., N denoting internal units, i = N+1, ..., N+L denoting output units, and i = N+L+1, ..., N+L+K denoting input units. The derivative of an internal or output unit w.r.t. a weight w_kl is given by

(3.1)    ∂v_i(n+1)/∂w_kl = f'(z_i(n+1)) ( δ_ik v_l(n) + Σ_{j=1,...,N+L} w_ij ∂v_j(n)/∂w_kl ),    i = 1, ..., N+L,

where k, l ≤ N+L+K, z_i is again the unit's potential, and δ_ik here denotes Kronecker's delta (δ_ik = 1 if i = k and 0 otherwise). The term δ_ik v_l(n) represents an explicit effect of the weight w_kl onto the unit k, and the sum term represents an implicit effect onto all the units due to network dynamics.

Equation (3.1) for each internal or output unit constitutes an (N+L)-dimensional discrete-time linear dynamical system with time-varying coefficients, where

(3.2)    ( ∂v_1(n)/∂w_kl, ..., ∂v_{N+L}(n)/∂w_kl )

is taken as a dynamical variable. Since the initial state of the network is independent of the connection weights, we can initialize (3.1) by

(3.3)    ∂v_i(0)/∂w_kl = 0.

Thus we can compute (3.2) forward in time by iterating Equation (3.1) simultaneously with the network dynamics (1.6) and (1.7). From this solution, we can calculate the error gradient (for the error given in (2.13)) as follows:

(3.4)    ∂E/∂w_kl = -2 Σ_{n=1,...,T} Σ_{i=N+1,...,N+L} (d_i(n) - v_i(n)) ∂v_i(n)/∂w_kl.

A standard batch gradient descent algorithm is to accumulate the error gradient by Equation (3.4) and update each weight after a complete epoch of presenting all training data by

(3.5)    new w_kl = w_kl - γ ∂E/∂w_kl,

where γ is a learning rate. An alternative update scheme is the gradient descent of the current output error at each time step,

(3.6)    w_kl(n+1) = w_kl(n) + γ Σ_{i=N+1,...,N+L} (d_i(n) - v_i(n)) ∂v_i(n)/∂w_kl.

Note that we assumed w_kl is a constant, not a dynamical variable, in deriving (3.1), so we have to keep the learning rate small enough. (3.6) is referred to as real-time recurrent learning.
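A direct, unoptimized sketch of one online RTRL step, iterating (3.1) together with the network dynamics and applying the update (3.6), might look as follows in Python/NumPy. The fully connected weight layout, the tanh choice for all non-input units, and the storage of the sensitivities (3.2) as a dense tensor are assumptions made only for illustration; the sensitivity tensor is initialized to zeros as in (3.3).

```python
import numpy as np

def rtrl_step(W, v, dv, u_next, d_next, gamma=0.01):
    """One RTRL step.  Unit enumeration as in the text: indices 0..N-1 internal,
    N..N+L-1 output, N+L..N+L+K-1 input (inputs are clamped).
    W: (N+L, N+L+K) all trainable weights w_kl; v: activations of all units;
    dv[i, k, l] = dv_i/dw_kl, shape (N+L, N+L, N+L+K), start from zeros (3.3)."""
    NL, total = W.shape
    L = len(d_next)
    N = NL - L
    z = W @ v                                     # unit potentials
    v_new = np.tanh(z)                            # internal and output units
    fprime = 1.0 - v_new ** 2
    # sensitivity update (3.1): explicit effect + implicit effect via the dynamics
    explicit = np.zeros_like(dv)
    explicit[np.arange(NL), np.arange(NL), :] = v     # delta_ik * v_l(n)
    dv_new = fprime[:, None, None] * (explicit +
                                      np.einsum('ij,jkl->ikl', W[:, :NL], dv))
    # online weight update (3.6), using the current output error
    err = d_next - v_new[N:]
    W += gamma * np.einsum('i,ikl->kl', err, dv_new[N:])
    return np.concatenate([v_new, u_next]), dv_new
```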

RTRL is mathematically transparent and in principle suitable for online training. However, the computational cost is O((N+L)^4) for each update step, because we have to solve the (N+L)-dimensional system (3.1) for each of the weights. This high computational cost makes RTRL useful for online adaptation only when very small networks suffice.

4. Higher-order gradient descent techniques

Just a little note: pure gradient-descent techniques for optimization generally suffer from slow convergence when the curvature of the error surface is different in different directions. In that situation, on the one hand the learning rate must be chosen small to avoid instability in the directions of high curvature, but on the other hand, this small learning rate might lead to unacceptably slow convergence in the directions of low curvature. A general remedy is to incorporate curvature information into the gradient descent process. This requires the calculation of the second-order derivatives, for which several approximative techniques have been proposed in the context of recurrent neural networks. These calculations are expensive, but can accelerate convergence especially near an optimum where the error surface can be reasonably approximated by a quadratic function. Dos Santos & von Zuben (2000) and Schraudolph (2002) provide references, discussion, and propose approximation techniques which are faster than naive calculations.

5. Extended Kalman-filtering approaches

5.1 The extended Kalman filter

The extended Kalman filter (EKF) is a state estimation technique for nonlinear systems derived by linearizing the well-known linear-systems Kalman filter around the current state estimate. We consider a simple special case, a time-discrete system with additive input and no observation noise:

(5.1)    x(n+1) = f(x(n)) + q(n)
         d(n) = h_n(x(n)),

where x(n) is the system's internal state vector, f is the system's state update function (linear in original Kalman filters), q(n) is external input to the system (an uncorrelated Gaussian white noise process, can also be considered as process noise), d(n) is the system's output, and h_n is a time-dependent observation function (also linear in the original Kalman filter). At time n = 0, the system state x(0) is guessed by a multidimensional normal distribution with mean x̂(0) and covariance matrix P(0). The system is observed until time n through d(0), ..., d(n). The task addressed by the extended Kalman filter is to give an estimate x̂(n+1) of the true state x(n+1), given the initial state guess and all previous output observations. This task is solved by the following two time update and three measurement update computations:

(5.2)    x̂*(n) = f(x̂(n))
         P*(n) = F(n) P(n) F(n)^t + Q(n)

(5.3)    K(n) = P*(n) H(n)^t [ H(n) P*(n) H(n)^t ]^{-1}
         x̂(n+1) = x̂*(n) + K(n) ξ(n)
         P(n+1) = P*(n) - K(n) H(n) P*(n),

where we roughly follow the notation in Singhal and Wu (1989), who first applied extended Kalman filtering to (feedforward) network weight estimation. Here F(n) and H(n) are the Jacobians

(5.4)    F(n) = (∂f(x)/∂x)|_{x=x̂(n)},    H(n) = (∂h_n(x)/∂x)|_{x=x̂(n)}

of the components of f, h_n with respect to the state variables, evaluated at the previous state estimate; ξ(n) = d(n) - h_n(x̂(n)) is the error (difference between observed output and output calculated from the state estimate x̂(n)); P(n) is an estimate of the conditional error covariance matrix given d(0), ..., d(n); Q(n) is the (diagonal) covariance matrix of the process noise; and the time updates x̂*(n), P*(n) of state estimate and state error covariance estimate are obtained from extrapolating the previous estimates with the known dynamics f.
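In code, one EKF cycle of (5.2)-(5.3) reads roughly as follows (a generic Python/NumPy sketch; f, h and the Jacobians F, H of (5.4) are assumed to be supplied by the caller, and the bracket contains no measurement-noise term because the system (5.1) has no observation noise).

```python
import numpy as np

def ekf_step(x_hat, P, d, f, h, F, H, Q):
    """One EKF cycle: time update (5.2), then measurement update (5.3).
    x_hat: current state estimate, P: error covariance, d: observed output,
    F (N, N) and H (L, N): Jacobians evaluated at x_hat, Q: process-noise covariance."""
    # time update (5.2): extrapolate state and error covariance with the known dynamics
    x_star = f(x_hat)
    P_star = F @ P @ F.T + Q
    # measurement update (5.3)
    xi = d - h(x_hat)                             # error between observed and predicted output
    K = P_star @ H.T @ np.linalg.inv(H @ P_star @ H.T)
    x_new = x_star + K @ xi
    P_new = P_star - K @ H @ P_star
    return x_new, P_new
```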

The basic idea of Kalman filtering is to first update x̂(n), P(n) to preliminary guesses x̂*(n), P*(n) by extrapolating from their previous values, applying the known dynamics in the time update steps (5.2), and then adjusting these preliminary guesses by incorporating the information contained in d(n); this information enters the measurement update in the form of ξ(n), and is accumulated in the Kalman gain K(n).

In the case of classical (linear, stationary) Kalman filtering, F and H are constant, and the state estimates converge to the true conditional mean state E[x(n) | d(0), ..., d(n)]. For nonlinear f, h_n, this is not generally true, and the use of extended Kalman filters leads only to locally optimal state estimates.

5.2 Applying EKF to RNN weight estimation

Assume that there exists a RNN which perfectly reproduces the input-output time series of the training data

(5.5)    u(n) = (u_1(n), ..., u_K(n))^t,    d(n) = (d_1(n), ..., d_L(n))^t,    n = 1, ..., T,

where the input / internal / output / backprojection connection weights are as usual collected in N x K / N x N / L x (K+N+L) / N x L weight matrices

(5.6)    W^in = (w^in_ij),   W = (w_ij),   W^out = (w^out_ij),   W^back = (w^back_ij).

In this subsection, we will not distinguish between all these different types of weights and refer to all of them by a weight vector w.

In order to apply EKF to the task of estimating optimal weights of a RNN, we interpret the weights w of the perfect RNN as the state of a dynamical system. From a bird's eye perspective, the output d(n) of the RNN is a function h of the weights and the input up to time n:

(5.7)    d(n) = h(w, u(0), ..., u(n)),

where we assume that the transient effects of the initial network state have died out. The inputs can be integrated into the output function h, rendering it a time-dependent function h_n. We further assume that the network update contains some process noise, which we add to the weights (!) in the form of a Gaussian uncorrelated noise q(n). This gives the following version of (5.1) for the dynamics of the perfect RNN:

(5.8)    w(n+1) = w(n) + q(n)
         d(n) = h_n(w(n)).

Except for noisy shifts induced by process noise, the state "dynamics" of this system is static, and the input u(n) to the network is not entered in the state update equation,

but is hidden in the time-dependence of the observation function. This takes some mental effort to swallow!

The network training task now takes the form of estimating the (static, perfect) state w from an initial guess ŵ(0) and the sequence of outputs d(0), ..., d(n). The error covariance matrix P(0) is initialized as a diagonal matrix with large diagonal components.

The simpler form of (5.8) over (5.1) leads to some simplifications of the EKF recursions (5.2) and (5.3): because the system state (= weight!) dynamics is now trivial, the time update steps become unnecessary. The measurement updates become

(5.9)    K(n) = P(n) H(n) [ H(n)^t P(n) H(n) ]^{-1}
         ŵ(n+1) = ŵ(n) + K(n) ξ(n)
         P(n+1) = P(n) - K(n) H(n)^t P(n) + Q(n).

A learning rate η (small at the beginning [!] of learning to compensate for initially bad estimates of P) can be introduced into the Kalman gain update equation:

(5.10)    K(n) = P(n) H(n) [ (1/η(n)) I + H(n)^t P(n) H(n) ]^{-1},

which is essentially the formulation given in Puskorius and Feldkamp (1994). Inserting process noise q(n) into EKF has been claimed to improve the algorithm's numerical stability, and to avoid getting stuck in poor local minima (Puskorius and Feldkamp 1994).

EKF is a second-order gradient descent algorithm, in that it uses curvature information of the (squared) error surface. As a consequence of exploiting curvature, for linear noise-free systems the Kalman filter can converge in a single step. We demonstrate this by a super-simple feedforward network example. Consider the single-input, single-output network which connects the input unit with the output unit by a connection with weight w, without any internal unit:

(5.11)    w(n+1) = w(n)
          d(n) = w u(n).

We inspect the running EKF (in the version of (5.9)) at some time n, where it has reached an estimate ŵ(n). Observing that the Jacobian H(n) is simply d(wu(n))/dw = u(n), the next estimated state is

(5.12)    ŵ(n+1) = ŵ(n) + K(n) (w u(n) - ŵ(n) u(n)) = ŵ(n) + (1/u(n)) (w u(n) - ŵ(n) u(n)) = w.

The EKF is claimed in the literature to exhibit fast convergence, which should have become plausible from this example, at least for cases where the current estimate ŵ(n) is already close to the correct value, such that the linearisation yields a good approximation to the true system.
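A minimal Python/NumPy sketch of the measurement update (5.9), with the learning-rate variant (5.10) as an option, is given below. The names, shapes, and the convention that H(n) is passed as an (n_weights x L) Jacobian of the network outputs w.r.t. the weights are illustrative assumptions; how H is actually obtained is discussed next.

```python
import numpy as np

def ekf_weight_update(w_hat, P, d, y, H, Q, eta=None):
    """One EKF measurement update (5.9) for RNN weight estimation.
    w_hat: (n_w,) weight estimate, P: (n_w, n_w) error covariance, d, y: (L,)
    teacher and network output, H: (n_w, L) Jacobian dy/dw, Q: process-noise
    covariance.  Setting eta uses the learning-rate variant (5.10)."""
    xi = d - y                                   # output error xi(n)
    S = H.T @ P @ H                              # bracket of (5.9)
    if eta is not None:
        S = S + (1.0 / eta) * np.eye(len(d))     # bracket of (5.10)
    K = P @ H @ np.linalg.inv(S)                 # Kalman gain
    w_new = w_hat + K @ xi
    P_new = P - K @ H.T @ P + Q
    return w_new, P_new
```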

EKF requires the derivatives H(n) of the network outputs w.r.t. the weights, evaluated at the current weight estimate. These derivatives can be exactly computed as in the RTRL algorithm, at cost O(N^4). This is too expensive but for small networks. Alternatively, one can resort to truncated BPTT: use a "stacked" version of (5.8) which describes a finite sequence of outputs instead of a single output, and obtain approximations to H by a procedure analogous to (2.14)-(2.17). Two variants of this approach are detailed out in Feldkamp et al. (1998). The cost here is O(p N^2), where p is the truncation depth.

Apart from the calculation of H, the most expensive operation in EKF is the update of P, which requires O(L N^2) computations. By setting up the network architecture with suitably decoupled subnetworks, one can achieve a block-diagonal P, with considerable reduction in computations (Feldkamp et al. 1998).

As far as I have an overview, it seems to me that currently the best results in RNN training are achieved with EKF, using truncated BPTT for estimating H, demonstrated especially in many remarkable achievements from Lee Feldkamp's research group (see references). As with BPTT and RTRL, the eventual success and quality of EKF training depends very much on professional experience, which guides the appropriate selection of network architecture, learning rates, the subtleties of gradient calculations, presentation of input (e.g., windowing techniques), etc.

6. Echo state networks

6.1 Training echo state networks

6.1.1 First example: a sinewave generator

In this subsection I informally demonstrate the principles of echo state networks (ESN) by showing how to train a RNN to generate a sinewave. The desired sinewave is given by d(n) = 1/2 sin(n/4). The task of generating such a signal involves no input, so we want a RNN without any input units and with a single output unit which after training produces d(n). The teacher signal is a 300-step sequence of d(n).

We start by constructing a recurrent network with 20 units, whose internal connection weights W are set to random values. We will refer to this network as the "dynamical reservoir" (DR). The internal weights W will not be changed in the training described later in this subsection. The network's units are standard sigmoid units, as in Eq. (1.6), with a transfer function f = tanh.

A randomly constructed RNN, such as our DR, might develop oscillatory or even chaotic activity even in the absence of external excitation. We do not want this to occur: the ESN approach needs a DR which is damped, in the sense that if the

network is started from an arbitrary state x(0), the subsequent network states converge to the zero state. This can be achieved by a proper global scaling of W: the smaller the weights of W, the stronger the damping. We assume that we have scaled W such that we have a DR with modest damping. Fig. 6.1 shows traces of the 20 units of our DR when it was started from a random initial state x(0). The desired damping is clearly visible.

Figure 6.1: The damped dynamics of our dynamical reservoir.

We add a single output unit to this DR. This output unit features connections that project back into the DR. These backprojections are given random weights W^back, which are also fixed and do not change during subsequent training. We use a linear output unit in this example, i.e. f^out = id.

The only connections which are changed during learning are the weights W^out from the DR to the output unit. These weights are not defined (nor are they used) during training. Figure 6.2 shows the network prepared for training.

Figure 6.2: Schematic setup of the ESN for training a sinewave generator: a dynamical reservoir (DR), fixed before training, and a linear output unit whose incoming weights are trained.

The training is done in two stages, sampling and weight computation.

Sampling. During the sampling stage, the teacher signal is written into the output unit for times n = 1, ..., 300. (Writing the desired output into the output units during training is often called teacher forcing). The network is started at time n = 1 with an arbitrary starting state; we use the zero state for starting, but that is just an arbitrary decision. The teacher signal d(n) is pumped into the DR through the backprojection connections W^back and thereby excites an activation dynamics within the DR. Figure 6.3 shows what happens inside the DR for sampling time steps n = 101, ..., 300.

Figure 6.3: The dynamics within the DR induced by teacher-forcing the sinewave d(n) in the output unit. 200-step traces of the 20 internal DR units and of the teacher signal (last plot) are shown.

We can make two important observations:

1. The activation patterns within the DR are all periodic signals of the same period length as the driving teacher d(n).
2. The activation patterns within the DR are different from each other.

During the sampling period, the internal states x(n) = (x_1(n), ..., x_20(n)) for n = 101, ..., 300 are collected into the rows of a state-collecting matrix M of size 200x20. At the same time, the teacher outputs d(n) are collected into the rows of a matrix T of size 200x1.

We do not collect information from times n = 1, ..., 100, because the network's dynamics is initially partly determined by the network's arbitrary starting state. By time n = 100, we can safely assume that the effects of the arbitrary starting state have died out and that the network states are a pure reflection of the teacher-forced d(n), as is manifest in Fig. 6.3.

Weight computation. We now compute 20 output weights w_i^out for our linear output unit y(n) such that the teacher time series d(n) is approximated as a linear combination of the internal activation time series x_i(n) by

(6.1)    d(n) ≈ y(n) = Σ_{i=1,...,20} w_i^out x_i(n).

More specifically, we compute the weights w_i^out such that the mean squared training error

(6.2)    MSE_train = 1/200 Σ_{n=101,...,300} (d(n) - y(n))^2 = 1/200 Σ_{n=101,...,300} (d(n) - Σ_{i=1,...,20} w_i^out x_i(n))^2

is minimized.

From a mathematical point of view, this is a linear regression task: compute regression weights w_i^out for a regression of d(n) on the network states x_i(n) [n = 101, ..., 300].

From an intuitive-geometrical point of view, this means combining the 20 internal signals seen in Fig. 6.3 such that the resulting combination best approximates the last (teacher) signal seen in the same figure.

From an algorithmical point of view, this offline computation of regression weights boils down to the computation of a pseudoinverse: the desired weights which minimize MSE_train are obtained by multiplying the pseudoinverse of M with T:

(6.3)    W^out = M^{-1} T.

Computing the pseudoinverse of a matrix is a standard operation of numerical linear algebra. Ready-made functions are included in Mathematica and Matlab, for example.

In our example, the training error computed by (6.2) with optimal output weights obtained by (6.3) was found to be MSE_train ≈ 1.2e-13. The computed output weights are implemented in the network, which is then ready for use.

Exploitation. After the learnt output weights were written into the output connections, the network was run for another 50 steps, continuing from the last training network state x(300), but now with teacher forcing switched off. The output y(n) was now generated by the trained network all on its own [n = 301, ..., 350]. The test error

(6.4)    MSE_test = 1/50 Σ_{n=301,...,350} (d(n) - y(n))^2

was found to be MSE_test ≈ 5.6e-12. This is greater than the training error, but still very small. The network has learnt to generate the desired sinewave very precisely.
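The whole experiment fits into a few lines of code. The following Python/NumPy sketch reproduces the procedure just described; random seeds, the exact spectral-radius scaling of W and similar details are choices of this illustration, so the error values will not exactly match those reported above.

```python
import numpy as np

rng = np.random.default_rng(42)
N, T, washout = 20, 300, 100

# dynamical reservoir: random weights, globally scaled for modest damping (see Section 6.2)
W = rng.uniform(-1, 1, (N, N))
W *= 0.8 / max(abs(np.linalg.eigvals(W)))        # spectral radius 0.8 (illustrative choice)
W_back = rng.uniform(-1, 1, N)                   # fixed output-to-DR feedback weights

d = 0.5 * np.sin(np.arange(1, T + 1) / 4)        # teacher signal d(n) = 1/2 sin(n/4)

# sampling stage: teacher-force d(n) through W_back and collect states after the washout
x = np.zeros(N)
M, Tcol = [], []
for n in range(T):
    d_prev = d[n - 1] if n > 0 else 0.0
    x = np.tanh(W @ x + W_back * d_prev)
    if n >= washout:
        M.append(x.copy())
        Tcol.append(d[n])
M, Tcol = np.array(M), np.array(Tcol)

# weight computation (6.3): linear regression via the pseudoinverse
W_out = np.linalg.pinv(M) @ Tcol
print("MSE_train:", np.mean((M @ W_out - Tcol) ** 2))

# exploitation: run freely, the trained output now feeding back into the DR
y = Tcol[-1]
for n in range(50):
    x = np.tanh(W @ x + W_back * y)
    y = W_out @ x                                # linear output unit continues the sinewave
```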

An intuitive explanation of this precision would go as follows: The sinewave y(n) at the output unit evokes periodic signals x_i(n) inside the DR whose period length is identical to that of the output sine. These periodic signals make a kind of "basis" of signals from which the target y(n) is combined. This "basis" is optimally "pre-adapted" to the target in the sense of identical period length. This pre-adaptation is a natural consequence of the fact that the "basis" signals x_i(n) have been induced by the target itself, via the feedback projections. So, in a sense, the task [to combine y(n) from x_i(n)] is solved by means [the x_i(n)] which have been formed by the very task [by the backprojection of y(n) into the DR]. Or, said in intuitive terms, the target signal y(n) is re-constituted from its own echos x_i(n)!

An immediate question concerns the stability of the solution. One may rightfully wonder whether the error in the testing phase, small as it was in the first 50 steps, will not grow over time and finally render the network's global oscillation unstable. That is, we might suspect that the precise continuation of the sine output after the training is due to the fact that we start testing from state x(300), which was produced by teacher forcing. However, this is not usually the case. Most networks trained according to the prescription given here can be started from almost any arbitrary nonzero starting state and will lock into the desired sinewave. Figure 6.4 shows an example. In mathematical terms, the trained network is a dynamical system with a single attractor, and this attractor is the desired oscillation. However, the strong stability observed in this example is a pleasant side-effect of the simplicity of the sinewave generating task. When the tasks become more difficult, the stability of the trained dynamics is indeed a critical issue for ESN training.

Figure 6.4: Starting the trained network from a random starting state. The plot shows the first 50 outputs. The network quickly settles into the desired sinewave oscillation.

6.1.2 Second example: a tuneable sinewave generator

We now make the sinewave generation task more difficult by demanding that the sinewave be adjustable in frequency. The training data now consists of an input signal u(n), which sets the desired frequency, and an output d(n), which is a sinewave whose frequency follows the input u(n). Figure 6.5 shows the resulting network architecture and a short sequence of teacher input and output.

Figure 6.5: Setup of the tuneable sinewave generator task. Trainable connections appear as dotted red arrows, fixed connections as solid black arrows.

Because the task is now more difficult, we use a larger DR with 100 units. In the sampling period, the network is driven by the teacher data. This time, this involves both inputting the slow signal u(n) and teacher-forcing the desired output d(n). We inspect the resulting activation patterns of internal units and find that they reflect, combine, and modify both u(n) and d(n) (Figure 6.6).

Figure 6.6: Traces of some internal DR units during the sampling period in the tuneable frequency generation task. Shown are u(n), d(n), and four internal states.

In this example we use a sigmoid output unit. In order to make that work, during sampling we collect into T not the raw desired output d(n) but the transfer-inverted version tanh^{-1}(d(n)). We also use a longer training sequence of 1200 steps, of which

we discard the first 200 steps as initial transient. The training error which we minimize concerns the tanh-inverted quantities:

(6.5)    MSE_train = 1/1000 Σ_{n=201,...,1200} (tanh^{-1} d(n) - tanh^{-1} y(n))^2 = 1/1000 Σ_{n=201,...,1200} (tanh^{-1} d(n) - Σ_i w_i^out x_i(n))^2.

This is achieved, as previously, by computing W^out = M^{-1} T. The training error was found to be 8.1e-6; the test error measured on the first steps after inserting the computed output weights was again somewhat larger. Again, the trained network stably locks into the desired type of dynamics even from a random starting state, as displayed in Figure 6.7.

Figure 6.7: Starting the trained generator from a random starting state. Shown are u(n), and d(n) [solid] together with y(n) [dotted].

Stability of the trained network was not as easy to achieve here as in the previous example. In fact, a trick was used which was found empirically to foster stable solutions. The trick is to insert some noise into the network during sampling. That is, during sampling, the network was updated according to the following variant of (1.6):

(6.6)    x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n) + ν(n)),

where ν(n) is a small white noise term.

6.2 Training echo state networks: mathematics of echo states

In the two introductory examples, we rather vaguely said that the DR should exhibit a "damped" dynamics. We now describe in a rigorous way what kind of "damping" is required to make the ESN approach work, namely, that the DR must have echo states.

The key to understanding ESN training is the concept of echo states. Having echo states (or not having them) is a property of the network prior to training, that is, a property of the weight matrices W^in, W, and (optionally, if they exist) W^back. The property is also relative to the type of training data: the same untrained network may have echo states for certain training data but not for others. We therefore require that the training input vectors u(n) come from a compact interval U and the training output

vectors d(n) from a compact interval D. We first give the mathematical definition of echo states and then provide an intuitive interpretation.

Definition 6.1 (echo states). Assume an untrained network with weights W^in, W, and W^back is driven by teacher input u(n) and teacher-forced by teacher output d(n) from compact intervals U and D. The network (W^in, W, W^back) has echo states w.r.t. U and D, if for every left-infinite input/output sequence (u(n), d(n-1)), where n = ..., -2, -1, 0, and for all state sequences x(n), x'(n) compatible with the teacher sequence, i.e. with

(6.7)    x(n+1) = f(W^in u(n+1) + W x(n) + W^back d(n))
         x'(n+1) = f(W^in u(n+1) + W x'(n) + W^back d(n)),

it holds that x(n) = x'(n) for all n ≤ 0.

Intuitively, the echo state property says: "if the network has been run for a very long time [from minus infinity time in the definition], the current network state is uniquely determined by the history of the input and the (teacher-forced) output". An equivalent way of stating this is to say that for every internal signal x_i(n) there exists an echo function e_i which maps input/output histories to the current state:

(6.9)    e_i : (U x D)^{-ℕ} → ℝ
         (..., (u(-1), d(-2)), (u(0), d(-1))) ↦ x_i(0).

We often say, somewhat loosely, that a (trained) network (W^in, W, W^out, W^back) is an echo state network if its untrained "core" (W^in, W, W^back) has the echo state property w.r.t. input/output from any compact interval U x D.

Several conditions which have been shown to be equivalent to echo states are collected in Jaeger (2001a). We provide one for illustration.

Definition 6.2 (state contracting). With the same assumptions as in Def. 6.1, the network (W^in, W, W^back) is state contracting w.r.t. U and D, if for all right-infinite input/output sequences (u(n), d(n-1)) ∈ U x D, where n = 0, 1, 2, ..., there exists a null sequence (δ_n)_{n≥1}, such that for all starting states x(0), x'(0) and for all n > 0 it holds that ||x(n) - x'(n)|| < δ_n, where x(n) [resp. x'(n)] is the network state at time n obtained when the network is driven by (u(n), d(n-1)) up to time n after having been started in x(0) [resp. in x'(0)].

Intuitively, the state contracting property says that the effects of the initial network state wash out over time. Note that there is some subtlety involved here in that the null sequence used in the definition depends on the input/output sequence.

The echo state property is connected to algebraic properties of the weight matrix W. Unfortunately, there is no known necessary and sufficient algebraic condition which allows one to decide, given (W^in, W, W^back), whether the network has the echo state property. For readers familiar with linear algebra, we quote here from Jaeger (2001) a sufficient condition for the non-existence of echo states.

Proposition 6.1. Assume an untrained network (W^in, W, W^back) with state update according to (1.6) and with transfer functions tanh. Let W have a spectral radius |λ_max| > 1, where |λ_max| is the largest absolute value of an eigenvalue of W. Then the network has no echo states with respect to any input/output interval U x D containing the zero input/output (0, 0).

At face value, this proposition is not helpful for finding echo state networks. However, in practice it was consistently found that when the condition noted in Proposition 6.1 is not satisfied, i.e. when the spectral radius of the weight matrix is smaller than unity, we do have an echo state network. For the mathematically adventurous, here is a conjecture which remains to be shown:

Conjecture 6.1. Let δ, ε be two small positive numbers. Then there exists a network size N, such that when an N-sized dynamical reservoir is randomly constructed by

1) randomly generating a weight matrix W_0 by sampling the weights from a uniform distribution over [-1, 1],
2) normalizing W_0 to a matrix W_1 with unit spectral radius by putting W_1 = 1/|λ_max| W_0, where |λ_max| is the spectral radius of W_0,
3) scaling W_1 to W_2 = (1-δ) W_1, whereby W_2 obtains a spectral radius of (1-δ),

then the network (W^in, W_2, W^back) is an echo state network with probability 1-ε.

Note that both in Proposition 6.1 and Conjecture 6.1 the input and backprojection weights are not used for the claims. It seems that these weights are irrelevant for the echo state property. In practice, it is found that they can be freely chosen without affecting the echo state property. Again, a mathematical analysis of these observations remains to be done.

For practical purposes, the following procedure (also used in the conjecture) seems to guarantee echo state networks:

1. Randomly generate an internal weight matrix W_0.
2. Normalize W_0 to a matrix W_1 with unit spectral radius by putting W_1 = 1/|λ_max| W_0, where |λ_max| is the spectral radius of W_0.
3. Scale W_1 to W = α W_1, where α < 1, whereby W has a spectral radius of α.
4. Then, the untrained network (W^in, W, W^back) is (or more precisely, has always been found to be) an echo state network, regardless of how W^in, W^back are chosen.

The diligent choice of the spectral radius α of the DR weight matrix is of crucial importance for the eventual success of ESN training. This is because α is intimately connected to the intrinsic timescale of the dynamics of the DR state. Small α means that one has a fast DR, large α (i.e., close to unity) means that one has a slow DR. The intrinsic timescale of the task should match the DR timescale. For example, if one wishes to train a sine generator as in the example of Subsection 6.1.1, one should use a small α for fast sinewaves and a large α for slow sinewaves.
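In code, the four-step procedure above reads roughly as follows (a Python/NumPy sketch; the sparsity argument anticipates the recommendation, given in Section 6.3, that W_0 be sparse, and is an illustrative assumption here).

```python
import numpy as np

def make_reservoir(N, alpha=0.8, connectivity=0.15, seed=None):
    """Steps 1-3: random sparse W_0, normalize to unit spectral radius,
    then rescale to spectral radius alpha < 1 (an empirical recipe,
    not a proven sufficient condition for echo states)."""
    rng = np.random.default_rng(seed)
    W0 = rng.uniform(-1, 1, (N, N)) * (rng.random((N, N)) < connectivity)
    lambda_max = max(abs(np.linalg.eigvals(W0)))   # spectral radius of W_0
    W1 = W0 / lambda_max                           # unit spectral radius
    return alpha * W1                              # spectral radius alpha
```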

Note that the DR timescale seems to depend exponentially on 1-α, so e.g. settings of α = 0.99, 0.98, 0.97 will yield an exponential speedup of the DR timescale, not a linear one. However, these remarks rest only on empirical observations; a rigorous mathematical investigation remains to be carried out. An illustrative example for a fast task is given in Jaeger (2001, Section 4.2), where a very fast "switching"-type of dynamics was trained with a DR whose spectral radius was set to 0.4, which is quite small considering the exponential nature of the timescale dependence on α. Standard settings of α lie in a range between 0.7 and 0.98. The sinewave generator presented in Section 6.1.1 and the tuneable sinewave generator from Section 6.1.2 both used a DR with α = 0.8.

Figure 6.8 gives a plot of the training log error log(MSE_train) of the sinewave generator training task considered in Section 6.1.1, obtained with different settings of α. It is evident that a proper setting of this parameter is crucial for the quality of the resulting generator network.

Figure 6.8: log_10(MSE_train) vs. spectral radius α of the DR weight matrix for the sinewave generator experiment from Section 6.1.1.

6.3 Training echo state networks: algorithm

With the solid grasp on the echo state concept procured by Section 6.2, we can now give a complete description of training ESNs for a given task. In this description we assume that the output unit(s) are sigmoid units; we further assume that there are output-to-DR feedback connections. This is the most comprehensive version of the algorithm. Often one will use simpler versions, e.g. linear output units; no output-to-DR feedback connections; or even systems without input (such as the pure sinewave generator). In such cases, the algorithm presented below has to be adapted in obvious ways.

Given: A training input/output sequence (u(1), d(1)), ..., (u(T), d(T)).

Wanted: A trained ESN (W^in, W, W^back, W^out) whose output y(n) approximates the teacher output d(n) when the ESN is driven by the training input u(n).

Notes:

1. We can merely expect that the trained ESN approximates the teacher output well after initial transient dynamics have washed out, which are invoked by the (untrained, arbitrary) network starting state. Therefore, more precisely, what we want is that the trained ESN approximates the teacher output for times n = T_0, ..., T, where T_0 > 1. Depending on network size and intrinsic timescale, typical ranges for T_0 are 10 (for small, fast nets) to 500 (for large, slow nets).
2. What we actually want is not primarily a good approximation of the teacher output, but more importantly, a good approximation of testing output data from independent test data sets generated by the same (unknown) system which also generated the teacher data. This leads to the question of how good generalization performance of a trained model can be ascertained, a far from trivial topic. This question is central to statistical learning theory, and we simply ignore it in this tutorial. However, if one wishes to achieve real-world problem solutions with blackbox models, it becomes absolutely mandatory to deal with this issue.

Step 1. Procure an untrained DR network (W^in, W, W^back) which has the echo state property, and whose internal units exhibit mutually interestingly different dynamics when excited.

This step involves many heuristics. The way I proceed most often involves the following substeps:

1. Randomly generate an internal weight matrix W_0.
2. Normalize W_0 to a matrix W_1 with unit spectral radius by putting W_1 = 1/|λ_max| W_0, where |λ_max| is the spectral radius of W_0. Standard mathematical packages for matrix operations all include routines to determine the eigenvalues of a matrix, so this is a straightforward thing.
3. Scale W_1 to W = α W_1, where α < 1, whereby W obtains a spectral radius of α.
4. Randomly generate input weights W^in and output backprojection weights W^back. Then, the untrained network (W^in, W, W^back) is (or more honestly, has always been found to be) an echo state network, regardless of how W^in, W^back are chosen.

Notes:

1. The matrix W_0 should be sparse, a simple method to encourage a rich variety of dynamics of different internal units. Furthermore, the weights should be roughly equilibrated, i.e. the mean value of weights should be about zero. I usually draw nonzero weights from a uniform distribution over [-1, 1], or I set nonzero weights randomly to -1 or +1.
2. The size N of W_0 should reflect both the length T of the training data, and the difficulty of the task. As a rule of thumb, N should not exceed an order of magnitude of T/10 to T/2 (the more regular-periodic the training data, the closer to T/2 can N be chosen). This is a simple precaution against overfitting. Furthermore, more difficult tasks require larger N.
3. The setting of α is crucial for subsequent model performance. It should be small for fast teacher dynamics and large for slow teacher dynamics,

according to the observations made above in Section 6.2. Typically, α needs to be hand-tuned by trying out several settings.
4. The absolute size of the input weights W^in is also of some importance. Large absolute W^in imply that the network is strongly driven by input, small absolute values mean that the network state is only slightly excited around the DR's resting (zero) state. In the latter case, the network units operate around the linear central part of the sigmoid, i.e. one obtains a network with an almost linear dynamics. Larger W^in drive the internal units closer to the saturation of the sigmoid, which results in a more nonlinear behavior of the resulting model. In the extreme, when W^in becomes very large, the internal units will be driven into an almost pure -1 / +1 valued, binary dynamics. Again, manual adjustment and repeated learning trials will often be required to find the task-appropriate scaling.
5. Similar remarks hold for the absolute size of the weights in W^back.

Step 2. Sample network training dynamics. This is a mechanical step, which involves no heuristics. It involves the following operations:

1. Initialize the network state arbitrarily, e.g. to the zero state x(0) = 0.
2. Drive the network by the training data, for times n = 0, ..., T, by presenting the teacher input u(n), and by teacher-forcing the teacher output d(n-1), by computing

(6.10)    x(n+1) = f(W^in u(n+1) + W x(n) + W^back d(n)).

3. At time n = 0, where d(n-1) is not defined, use d(-1) = 0.
4. For each time larger than or equal to an initial washout time T_0, collect the network state x(n) (understood in the metonymical sense of Section 1.3, i.e. including the input and the teacher-forced output unit activations) as a new row into a state collecting matrix M. In the end, one has obtained a state collecting matrix of size (T - T_0 + 1) x (K + N + L).
5. Similarly, for each time larger than or equal to T_0, collect the sigmoid-inverted teacher output tanh^{-1} d(n) row-wise into a teacher collection matrix T, to end up with a teacher collecting matrix T of size (T - T_0 + 1) x L.

Note: Be careful to collect into M and T the vectors x(n) and tanh^{-1} d(n), not x(n) and tanh^{-1} d(n-1)!

Step 3: Compute output weights.

1. Concretely, multiply the pseudoinverse of M with T, to obtain a (K + N + L) x L sized matrix (W^out)^t whose j-th column contains the output weights from all network units to the j-th output unit:

    (W^out)^t = M^{-1} T.

Every programming package of numerical linear algebra has optimized procedures for computing pseudoinverses.

2. Transpose (W^out)^t to W^out in order to obtain the desired output weight matrix.

Step 4: Exploitation. The network (W^in, W, W^back, W^out) is now ready for use. It can be driven by novel input sequences u(n), using the update equations (1.6) and (1.7), which we repeat here for convenience:

(1.6)    x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n)),
(1.7)    y(n+1) = f^out(W^out (u(n+1), x(n+1), y(n))).

Variants.

1. If stability problems are encountered when using the trained network, it very often helps to add some small noise during sampling, i.e. to use an update equation

(6.12)    x(n+1) = f(W^in u(n+1) + W x(n) + W^back d(n) + ν(n)),

where ν(n) is a small uniform white noise term (typical sizes up to 0.01). The rationale behind this is explained in Jaeger (2002).

2. Often one does not desire to have trained output-to-output connections. In that case, collect during sampling only network states stripped of the output unit values, which will yield a (K + N) x L sized matrix (W^out)^t. In exploitation, instead of (1.7) use

(1.7')    y(n+1) = f^out(W^out (u(n+1), x(n+1))).

3. When the system to be modeled is highly nonlinear, better models are sometimes obtained by using "augmented" network states for training and exploitation. This means that in addition to the original network states x(n), nonlinear transformations thereof are also included into the game. A simple case is to include the squares of the original state. During sampling, instead of collecting pure states x(n) = (x_1(n), ..., x_{K+N+L}(n)), collect the double-sized "augmented" states (x(n), x^2(n)) = (x_1(n), ..., x_{K+N+L}(n), x_1^2(n), ..., x_{K+N+L}^2(n)) into the state collecting matrix. This will yield an output weight matrix W^out of size L x 2(K + N + L). During exploitation, instead of (1.7) use

(1.7'')    y(n+1) = f^out(W^out (u(n+1), x(n+1), y(n), u^2(n+1), x^2(n+1), y^2(n))).

The paper Jaeger (2002) uses augmented states.
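Steps 2-4 can be condensed into the following Python/NumPy sketch for the comprehensive case (sigmoid output unit, output-to-DR feedback, optional state noise as in variant 1). Function names, the uniform noise model and the state layout are illustrative assumptions of this transcription; teacher outputs are assumed to lie in (-1, 1) so that tanh^{-1} is defined.

```python
import numpy as np

def train_esn(W_in, W, W_back, u, d, T0, nu=0.0, seed=0):
    """Sampling (6.10)/(6.12) with washout T0 and noise amplitude nu, then
    output weights by pseudoinverse on tanh-inverted teacher outputs.
    u: (T, K), d: (T, L); returns W_out of shape (L, K+N+L)."""
    rng = np.random.default_rng(seed)
    T, K = u.shape
    N, L = W.shape[0], d.shape[1]
    x = np.zeros(N)
    M, Tcol = [], []
    for n in range(T):
        d_prev = d[n - 1] if n > 0 else np.zeros(L)       # d(-1) = 0
        noise = nu * (rng.random(N) - 0.5)
        x = np.tanh(W_in @ u[n] + W @ x + W_back @ d_prev + noise)
        if n >= T0:
            M.append(np.concatenate([u[n], x, d_prev]))   # full (K+N+L)-dim state row
            Tcol.append(np.arctanh(d[n]))                 # tanh-inverted teacher output
    return (np.linalg.pinv(np.array(M)) @ np.array(Tcol)).T

def esn_step(W_in, W, W_back, W_out, x, y, u_next):
    """Exploitation update, Eqs. (1.6)/(1.7), with a tanh output unit."""
    x = np.tanh(W_in @ u_next + W @ x + W_back @ y)
    y = np.tanh(W_out @ np.concatenate([u_next, x, y]))
    return x, y
```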

6.4 Why echo states?

Why must the DR have the echo state property to make the approach work? From the perspective of systems engineering, the unknown system's dynamics is governed by an update equation of the form

(6.13)    d(n) = e(u(n), u(n-1), ..., d(n-1), d(n-2), ...),

where e is a (possibly highly complex, nonlinear) function of the previous inputs and system outputs. (6.13) is the most general possible way of describing a deterministic, stationary system. In engineering problems, one typically considers simpler versions, for example, where e is linear and has only finitely many arguments (i.e., the system has finite memory). Here we will however consider the fully general version (6.13).

The task of finding a black-box model for an unknown system (6.13) amounts to finding a good approximation to the system function e. We will assume an ESN with linear output units to facilitate notation. Then, the network output of the trained network is a linear combination of the network states, which in turn are governed by the echo functions, see (6.9). We observe the following connection between (6.13) and (6.9):

(6.14)    e(u(n), u(n-1), ..., d(n-1), d(n-2), ...) = d(n)
          ≈ y(n) = Σ_i w_i^out x_i(n) = Σ_i w_i^out e_i(u(n), u(n-1), ..., y(n-1), y(n-2), ...).

It becomes clear from (6.14) how the desired approximation of the system function e can be interpreted as a linear combination of echo functions e_i. This transparent interpretation of the system approximation task directly relies on the interpretation of network states as echo states. The arguments of e and e_i are identical in nature: both are collections of previous inputs and system (or network, respectively) outputs. Without echo states, one could neither mathematically understand the relationship between network output and original system output, nor would the training algorithm work.

6.5 Liquid state machines

An approach very similar to the ESN approach has been independently explored by Wolfgang Maass et al. at Graz Technical University. It is called the "liquid state machine" (LSM) approach. Like in ESNs, large recurrent neural networks are conceived as a reservoir (called "liquid" there) of interesting excitable dynamics, which can be tapped by trainable readout mechanisms. LSMs compare with ESNs as follows:

- LSM research focuses on modeling dynamical and representational phenomena in biological neural networks, whereas ESN research is aimed more at engineering applications.
- The "liquid" network in LSMs is typically made from biologically more adequate, spiking neuron models, whereas ESN "reservoirs" are typically made up from simple sigmoid units.
- LSM research considers a variety of readout mechanisms, including trained feedforward networks, whereas ESNs typically make do with a single layer of readout units.

An introduction to LSMs and links to publications can be found at ….

7. Short term memory in ESNs

Many time-series processing tasks involve some form of short term memory (STM). By short-term memory we understand the property of some input-output systems, where the current output y(n) depends on earlier values u(n-k) of the input and/or earlier values y(n-k) of the output itself. This is obvious, for instance, in speech processing. Engineering tasks like suppressing echos in telephone channels or the control of chemical plants with attenuated chemical reactions require system models with short-term memory capabilities.

We saw in Section 6 that the DR units' activations x_i(n) can be understood in terms of echo functions e_i which map input/output histories to the current state. We repeat the corresponding Equation (6.9) here for convenience:

(6.9)   e_i : (..., (u(-1), d(-2)), (u(0), d(-1))) ↦ x_i(0),

i.e. each echo function maps a left-infinite input/output history to the current state x_i(0). The question which we will now investigate more closely is: how many of the previous input/output arguments (u(n-k), y(n-k-1)) are actually relevant for the echo function? Or, asked in other words, how long is the effective short-term memory of an ESN? A good intuitive grasp on this issue is important for successful practical work with ESNs because, as we will see, with a suitable setup of the DR one can control to some extent the short-term memory characteristics of the resulting ESN model. We will provide here only an intuitive introduction; for a more detailed treatment consult the technical report devoted to short-term memory (Jaeger 2001a) and the mathematically-oriented Master thesis (Bertschinger 2002).
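The echo-function view can be probed numerically: run the same DR twice on input sequences that differ in a single early value and watch how the state difference (the "echo" of that input) fades with time. The sketch below does this for a small randomly created DR without output feedback; all parameter values and names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small input-driven DR with no output feedback; parameter choices are illustrative.
N = 20
W = rng.uniform(-1, 1, (N, N)) * (rng.random((N, N)) < 0.15)
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.choice([-0.1, 0.1], size=N)

def run(u):
    """Drive the reservoir with the input sequence u and return all states."""
    x, xs = np.zeros(N), []
    for un in u:
        x = np.tanh(w_in * un + W @ x)
        xs.append(x.copy())
    return np.array(xs)

T = 60
u1 = rng.uniform(-0.5, 0.5, T)
u2 = u1.copy()
u2[5] += 0.5                       # perturb a single early input value

diff = np.linalg.norm(run(u1) - run(u2), axis=1)
print(np.round(diff[5:25], 4))     # the "echo" of the perturbation fades with time
```

For a DR scaled to a spectral radius below 1, the printed differences typically shrink towards zero; this fading influence of past inputs is what the following subsections quantify.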

7.1 First example: training an ESN as a delay line

Much insight into the STM of ESNs can be gained when we train ESNs on a pure STM task. We consider an ESN with a single input channel and many output channels. The input u(n) is a white noise signal, generated by sampling at each time step independently from a uniform distribution over [-0.5, 0.5]. We consider delays k = 1, 2, .... For each delay k, we train a separate output unit with the training signal d_k(n) = u(n-k). We do not equip our network with feedback connections from the output units to the DR, so all output units can be trained simultaneously and independently from each other. Figure 7.1 depicts the setup of the network.

Figure 7.1: Setup of the delay learning task. The input u(n) drives the DR; each output unit is trained on a delayed version u(n-k) of the input as its teacher d_k(n).

Concretely, we use a 20-unit DR with a connectivity of 15%, that is, 15% of the entries of the weight matrix W are non-null. The non-null weights were sampled randomly from a uniform distribution over [-1, 1], and the resulting weight matrix was rescaled to a spectral radius of 0.8, as described in Section 6.2. The input weights were put to values of -0.1 or +0.1 with equal probability. We trained 4 output units with delays of k = 4, 8, 16, 20. The training was done over 300 time steps, of which the first 100 were discarded to wash out initial transients. On test data, the trained network showed testing mean square errors of MSE_test = …, …, 0.040, 0.12 for the four trained delays. Figure 7.2 (upper diagrams) shows an overlay of the correct delayed signals (solid line) with the trained network output.

Figure 7.2: Results of training delays k = 4, 8, 16, 20 with a 20-unit DR. Top row: input weights of size -0.1 or +0.1; bottom row: much smaller input weights.
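The delay-line experiment just described is easy to reproduce in a few lines. The following numpy sketch sets up a 20-unit DR with 15% connectivity, spectral radius 0.8 and input weights ±0.1 (no output feedback), trains the four delay readouts by pseudoinverse, and also repeats the run with much smaller input weights, as in the bottom row of Figure 7.2. Helper names, the test length and the concrete "much smaller" weight value (0.001) are my own illustrative choices; the exact error values will differ from those reported in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Reservoir set up as described in the text; helper names are mine.
N, delays = 20, [4, 8, 16, 20]
W = rng.uniform(-1, 1, (N, N)) * (rng.random((N, N)) < 0.15)
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))

def collect_states(u, w_in):
    x, xs = np.zeros(N), []
    for un in u:
        x = np.tanh(w_in * un + W @ x)        # no output feedback in this task
        xs.append(x.copy())
    return np.array(xs)

def delay_experiment(w_in_size, T_train=300, T_test=300, washout=100):
    w_in = rng.choice([-w_in_size, w_in_size], size=N)
    u_tr = rng.uniform(-0.5, 0.5, T_train)
    u_te = rng.uniform(-0.5, 0.5, T_test)
    X_tr, X_te = collect_states(u_tr, w_in), collect_states(u_te, w_in)
    mses = []
    for k in delays:
        n0 = max(washout, k)
        w_out = np.linalg.pinv(X_tr[n0:]) @ u_tr[n0 - k:T_train - k]   # teacher u(n-k)
        y = X_te[n0:] @ w_out
        mses.append(np.mean((y - u_te[n0 - k:T_test - k]) ** 2))
    return np.round(mses, 5)

print("input weights +/-0.1 :", delay_experiment(0.1))
print("much smaller weights :", delay_experiment(0.001))   # illustrative value only
```

The two printed error lists typically reproduce the qualitative pattern discussed next: longer delays are recalled worse, and smaller input weights help.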

When the same experiment is redone with the same DR, but with much smaller input weights, the performance greatly improves: testing errors of MSE_test = …, …, …, … are now obtained. Three fundamental observations can be gleaned from this simple example:

1. The network can master the delay learning task, which implies that the current network state x(n) retains extractable information about previous inputs u(n-k).
2. The longer the delay, the poorer the delay learning performance.
3. The smaller the input weights, the better the performance.

7.2 Theoretical insights

I now report some theoretical findings from Jaeger (2001a), which explain the observations made in the previous subsection.

First, we need a precise version of the intuitive notion of "network performance for learning the k-delay task". We consider the correlation coefficient r(u(n-k), y_k(n)) between the correct delayed signal u(n-k) and the network output y_k(n) of the unit trained on the delay k. It ranges between -1 and 1. By squaring it, we obtain a quantity called in statistics the determination coefficient r^2(u(n-k), y_k(n)). It ranges between 0 and 1. A value of 1 indicates perfect correlation between correct signal and network output, a value of 0 indicates complete loss of correlation. (In statistical terms, the determination coefficient gives the proportion of variance in one signal explained by the other.) Perfect recall of the k-delayed signal thus would be indicated by r^2(u(n-k), y_k(n)) = 1, complete failure by r^2(u(n-k), y_k(n)) = 0.

Next, we define the overall delay recalling performance of a network as the sum of this coefficient over all delays. That is, we define the memory capacity MC of a network by

(7.1)   MC = Σ_{k=1}^{∞} r^2(u(n-k), y_k(n)).

Without proof, we cite from Jaeger (2001a) some fundamental results concerning the memory capacity of ESNs:

Theorem 7.1. In a network whose DR has N nodes, MC ≤ N. That is, the maximal possible memory capacity is bounded by the DR size.

Theorem 7.2. In a linear network with N nodes, generically MC = N. That is, a linear network will generically reach maximal memory capacity.

Notes: (i) a linear network is a network whose internal units have a linear transfer function, i.e. f = id. (ii) "Generically" means: if we randomly construct such a network, it will have the desired property with probability one.
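To connect these definitions with something executable, here is a rough empirical check of Theorems 7.1 and 7.2 using a small linear DR (f = id): a delay-k readout is trained by pseudoinverse, its determination coefficient r^2(u(n-k), y_k(n)) is computed, and the coefficients are summed over delays as in (7.1). All sizes and names are my own choices, the sum is truncated at k = 40, and the r^2 values are computed on the training sequence itself, so the resulting number is only an estimate; finite precision typically keeps it below the theoretical value N, as the forgetting curves discussed next illustrate on a larger scale.

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear DR (f = id) to check Theorem 7.2 empirically; all choices are illustrative.
N, T, washout = 20, 2000, 100
W = rng.uniform(-1, 1, (N, N)) * (rng.random((N, N)) < 0.5)
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.choice([-0.001, 0.001], size=N)

u = rng.uniform(-0.5, 0.5, T)
x, X = np.zeros(N), []
for un in u:
    x = w_in * un + W @ x               # linear update
    X.append(x.copy())
X = np.array(X)

def det_coeff(k):
    """Rough r^2(u(n-k), y_k(n)) for a pseudoinverse-trained delay-k readout."""
    n0 = max(washout, k)
    w_out = np.linalg.pinv(X[n0:]) @ u[n0 - k:T - k]
    y = X[n0:] @ w_out
    return np.corrcoef(y, u[n0 - k:T - k])[0, 1] ** 2

curve = [det_coeff(k) for k in range(1, 41)]
print("forgetting curve:", np.round(curve, 2))
print("empirical MC (sum up to k = 40):", round(sum(curve), 1))   # Theorem 7.1: <= N
```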

Theorem 7.3. In a linear network, long delays can never be learnt better than short delays ("monotonic forgetting").

When we plot the determination coefficient against the delay, we obtain the forgetting curves of an ESN. Figure 7.3 shows some forgetting curves obtained from various 400-unit ESNs.

Figure 7.3: Forgetting curves of various 400-unit networks. A: randomly created linear DR. B: randomly created sigmoid DR. C: like A, but with noisy state update of the DR. D: almost unitary weight matrix, linear update. E: same as D, but with noisy state update. F: same as D, but with a spectral radius of …. Mind the different scalings of the x-axis!

The forgetting curves in Figure 7.3 exhibit some interesting phenomena:

According to Theorem 7.2, the forgetting curve A should reflect a memory capacity of 400 (= network size). That is, the area under the curve should be 400. However, we find an area (= memory capacity) of about 120 only. This is due to rounding errors in the network update. The longer the delay, the more severe the effect of accumulated rounding errors, which reduce the effectively achievable memory capacity.

The curve B was generated with a DR made from the same weight matrix as in A, but this time, the standard sigmoid transfer function tanh was used for the network update. Compared to A, we observe a drop in memory capacity. It is a general empirical observation that the more nonlinear a network, the lower its memory capacity. This also explains the finding from Section 7.1, namely, that the STM of a network is improved when the input weights are made very

small. Very small input weights make the input drive the network only minimally around its zero resting state. Around the zero state, the sigmoids are almost linear. Thus, small input weights yield an almost linear network dynamics, which is good for its memory capacity.

Curve C was generated like A, but the (linear) network was updated with a small noise term added to the states. As can be expected, this decreases the memory capacity. What is worth mentioning is that the effect is quite strong. An introductory discussion can be found in Jaeger (2001a), and a detailed analytical treatment is given in Bertschinger (2002).

The forgetting curve D comes close to the theoretical optimum of MC = 400. The trick was to use an almost unitary weight matrix, again with linear DR units. Intuitively, this means that a network state x(n), which carries the information about the current input, "revolves" around the state space R^N without interaction with succeeding states, for N update steps. There is more about this in Jaeger (2001a) and Bertschinger (2002).

The forgetting curve E was obtained from the same linear unitary network as D, but noise was added to the state update. The corruption of memory capacity is less dramatic than in curve C.

Finally, the forgetting curve F was obtained by scaling the (linear, unitary) network from D to a spectral radius of …. This leads to long-term "reverberations" of input. On the one hand, this yields a forgetting curve with a long tail; in fact, it extends far beyond the value of the network size, N = 400. On the other hand, "reverberations" of long-time past inputs still occupying the present network state lead to poor recall of even the immediately past input: the forgetting curve is only about 0.5 right at the beginning. The area under the forgetting curve, however, comes close to the theoretical optimum of 400.

For practical purposes, when one needs ESNs with long STM effects, one can resort to a combination of the following approaches:

- Use large DRs. This is the most efficient and generally applicable method, but it requires sufficiently large training data sets.
- Use small input weights, to work in the almost linear working range of the network. This might conflict with nonlinear task characteristics.
- Use linear update for the DR. Again, this might conflict with nonlinear task characteristics.
- Use specially prepared DRs with almost unitary weight matrices.
- Use a spectral radius close to 1. This would work only with "slow" tasks (for instance, it would not work if one wants to have fast oscillating dynamics with long STM effects).

8. ESNs with leaky integrator neurons

The ESN approach is not confined to standard sigmoid networks. The basic idea of a dynamical reservoir works with any kind of dynamical system which has the echo state property. Other kinds of systems which one might consider are, for instance,

- neural networks with biologically more adequate, spiking neuron models (as in most of the research on the related "liquid state machines" of Maass et al.),
- continuous-time neural networks described by differential equations,
- coupled map lattices, excitable media models and reaction-diffusion systems.

More or less for historical (random) reasons, the ESN approach has so far been worked out almost exclusively using standard sigmoid networks. However, one disadvantage of these networks is that they do not have a time constant; their dynamics cannot be "slowed down" like the dynamics of a differential equation, if one would wish to do so. It is, for instance, almost impossible with standard sigmoid networks to obtain ESN generators of very slow sinewaves. In this section, we will consider ESNs made from a continuous-time neuron model, leaky integrator neurons, which incorporate a time constant and which can be slowed down. This model was first used in an ESN context in Jaeger (…).

8.1 The neuron model

The continuous-time dynamics x(t) of a leaky integrator neuron is described by the differential equation

(8.1)   ẋ = (1/τ) (-a x + f(w x)),

where w is the weight vector of connections from all units x_j which feed into neuron x, and f is the neuron's output nonlinearity (typically a sigmoid; we will use tanh again). The positive quantity τ is the time constant of this equation: the larger τ, the slower the (otherwise identical) resulting dynamics. The term -a x models a decaying potential of the neuron; by virtue of this term, the neuron retains part of its previous state. The nonnegative constant a is the neuron's decay constant. The larger a, the faster the decay of the previous state, and the larger the relative influence of input from other neurons.

If an entire DR of an ESN is made from leaky integrator neurons x_i with decay constants a_i, we obtain the following DR dynamics:

(8.2)   ẋ = (1/τ) (-A x + f(W^{in} u + W x + W^{back} y)),

where A is a diagonal matrix with the decay constants a_i on its diagonal. In order to simulate (8.2) on digital machines, we must discretize it with some stepsize δ. An approximate discretized version, which gives δ-time increments of the DR state for each discrete simulation step n, reads like this:

(8.3)   x(n+1) = (I - (δ/τ) A) x(n) + (δ/τ) f(W^{in} u(n+1) + W x(n) + W^{back} y(n)),

where I is the identity matrix. For numerical stability reasons, δ, τ and A must be chosen such that the matrix I - (δ/τ) A has no entry of absolute value greater or equal to 1. In the following we consider only the case δ/τ = 1, whereby (8.3) simplifies to

(8.4)   x(n+1) = (I - A) x(n) + f(W^{in} u(n+1) + W x(n) + W^{back} y(n)).

By substituting 1 - a_i with the "retainment rate" r_i = 1 - a_i, we arrive at the more convenient version

(8.5)   x(n+1) = R x(n) + f(W^{in} u(n+1) + W x(n) + W^{back} y(n)),

where R is a diagonal matrix with the retainment rates r_i on its diagonal. The retainment rates must range between 0 ≤ r_i < 1. A value of zero means that the unit does not retain any information about its previous state; the unit's dynamics then degenerates to the kind of standard sigmoid unit which we used in previous sections. A value close to 1 means that the unit's dynamics is dominated by retaining (most of) its previous state.

Choosing appropriate values for R and W becomes a more subtle task than choosing W in standard sigmoid DRs. Two effects have to be considered: the echo state property must be ensured, and the absolute values of the resulting network states should be kept small enough. We consider both effects in turn, in reverse order.

Small enough network states. Because of the (leaky) integration nature of (8.4), activation states of arbitrary size can build up. These are fed into the sigmoid f by virtue of the second term in (8.5). f can easily be driven into saturation due to the integration buildup of state values. Then, the second term in (8.5) will yield an essentially binary, [-1, 1]-valued contribution, and the network dynamics will tend to oscillate. A way to avoid this condition is to use a weight matrix W with small values. Since the tendency to grow large activation states is connected to the retainment rates (larger rates lead to stronger growth), an ad-hoc counter-measure is to "renormalize" states by multiplication with I - R before feeding them into f. That is, one uses

(8.6)   x(n+1) = R x(n) + f(W^{in} u(n+1) + (I - R) W x(n) + W^{back} y(n))

instead of (8.5).

Echo state property. It can be shown (by a simple linearization argument, see Jaeger 2001) that the echo state property does not hold if the matrix (I - R) W + R has a spectral radius greater or equal to 1 (this refers to an update according to (8.6); if (8.5) is used, consider the spectral radius of W + R instead). Conversely, and similarly as discussed in Section 6 (discussion of Proposition 6.1), empirically it has always been found that the echo state property does hold when (I - R) W + R has a spectral radius smaller than 1.
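As a small illustration of these two considerations, the sketch below builds a leaky-integrator DR, checks the empirical echo state condition (spectral radius of (I - R) W + R below 1), and runs the update (8.6) for a few hundred teacher-forced steps with the slow sinewave of the next subsection as output feedback. The reservoir size, the retainment rate 0.9, the scaling of W and all helper names are illustrative choices of mine, not the settings used in the text.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical leaky-integrator DR; sizes, rates and names are my own choices.
N = 20
W = rng.uniform(-1, 1, (N, N)) * (rng.random((N, N)) < 0.2)
W *= 0.3 / np.max(np.abs(np.linalg.eigvals(W)))     # keep W small, see the text
W_back = rng.uniform(-1, 1, (N, 1))
R = np.diag(np.full(N, 0.9))                        # retainment rates r_i = 0.9

# Empirical echo state check from the text: spectral radius of (I - R) W + R < 1.
rho = np.max(np.abs(np.linalg.eigvals((np.eye(N) - R) @ W + R)))
print("spectral radius of (I - R) W + R:", round(float(rho), 3))

def step(x, y):
    """One update of Eq. (8.6) for a DR without input units, with output feedback y."""
    return R @ x + np.tanh((np.eye(N) - R) @ (W @ x) + W_back @ y)

# Teacher-force a slow sine wave for a few hundred steps and watch the state size.
x = np.zeros(N)
for n in range(500):
    x = step(x, np.array([np.sin(n / 100) / 5]))
print("largest state magnitude after teacher forcing:", round(float(np.max(np.abs(x))), 3))
```

Replacing (I - R) W x(n) by W x(n) in step() turns this into the plain update (8.5), in which case the relevant matrix for the spectral radius check is W + R instead.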

8.2 Example: slow sinewave generator

Like in Section 6.1, we consider the task of generating a sinewave, using an ESN with a single output unit and no input. This time, however, the desired sinewave is very slow: we want d(n) = 1/5 sin(n/100). This is hard to impossible to achieve with standard sigmoid ESNs, but easy with a leaky integrator network.

We select again a 20-unit DR, whose weight matrix has a sparse connectivity of 20% and which is scaled to a spectral radius of …. We choose a retainment rate of 0.98 for all units. Output feedback connection weights are sampled from a uniform distribution over [-1, 1]. Training is run over 4000 steps, of which the first 2000 are discarded. A long initial washout is required here because the retainment rates close to 1 imply a long initial transient; after all, we are dealing with slow dynamics here (2000 steps of sampling boil down to roughly 3 cycles of the sinewave).

This task is found to easily lead to unstable solutions. Some hand-tuning is necessary to find appropriate scaling ranges for R and W. Additionally, it is essential that during sampling a small noise term (of size …) is added to the network states. With the settings mentioned above and this noise term, the training produced 10 stable solutions in 10 runs with different, randomly generated weight matrices. The test MSEs obtained on a 2000-step test sequence ranged between 1.8e-6 and 3.0e….

9. Tricks of the trade

The basic idea of ESNs for black-box modeling can be condensed into the following statement: "Use an excitable system [the DR] to give a high-dimensional dynamical representation of the task input dynamics and/or output dynamics, and extract from this reservoir of task-related dynamics a suitable combination to make up the desired target signal."

Obviously, the success of the modeling task depends crucially on the nature of the excited dynamics; it should be adapted to the task at hand. For instance, if the target signal is slow, the excited dynamics should be slow, too. If the target signal is very nonlinear, the excited dynamics should be very nonlinear, too. If the target signal involves long short-term memory, so should the excited dynamics. And so forth. Successful application of the ESN approach, then, involves a good judgement of the important characteristics of the dynamics excited inside the DR. Such judgement can only grow with the experimenter's personal experience. However, a number of general practical hints can be given which will facilitate this learning process. All hints refer to standard sigmoid networks.

Plot internal states

Since the dynamics within the DR is so essential for the task, you should always visually inspect it. Plot some of the internal units x_i(n) during sampling and/or testing. These plots can be very revealing about the cause of failure. If things don't go well, you will frequently observe in these plots one or two of the following misbehaviors:

Fast oscillations. In tasks where you don't want fast oscillations, this observation indicates a too large spectral radius of the weight matrix W, and/or too large values of the output feedback weights (if they exist). Remedy: scale them down.

Almost saturated network states. Sometimes you will observe that the DR units almost always take extreme values near -1 or 1. This is caused by a large impact of incoming signals (input and/or output feedback). It is only desirable when you want to achieve some almost binary, "switching" type of target dynamics. Otherwise it's harmful. Remedy: scale down the input and/or output feedback weights.

Plot output weights

You should always inspect the output weights obtained from the learning procedure. The easiest way to do this is to plot them. They should not become too large. Reasonable absolute values are not greater than, say, …. If the learnt output weights are in the order of 1000 and larger, one should attempt to bring them down to smaller ranges. Very small values, by contrast, do not indicate anything bad.

When judging the size of output weights, however, you should put them into relation with the range of DR states. If the DR is only minimally excited (let's say, DR unit activations in the range of …; this would for instance make sense in almost linear tasks with long-term memory characteristics), and if the desired output has a range up to 0.5, then output weights have to be around 100 just in order to scale up from the internal state range to the output range.

If, after factoring out the range-adaptation effect just mentioned, the output weights still seem unreasonably large, you have an indication that the DR's dynamics is somehow badly adapted to your learning task. This is because large output weights imply that the generated output signal exploits subtle differences between the DR units' dynamics, but does not exploit the "first order" phenomena inside the DR (a more mathematical treatment of large output weights can be found in Jaeger (2001a)). It is not easy to suggest remedies against too large output weights, because they are an indication that the DR is generally poorly matched with the task. You should consider large output weights as a symptom, not as the cause of a problem. Good doctors do not cure the symptoms, but try to address the cause.

Large output weights will occur only in high-precision tasks, where the training material is mathematical in origin and intrinsically very accurate. Empirical training data will mostly contain some random noise component, which will lead to

reasonably scaled output weights anyway. Adding noise during training is a safe method to scale output weights down, but it is likely to impair the desired accuracy as well.

Find a good spectral radius

The single most important knob to tune an ESN is the spectral radius of the DR weight matrix. The general rule: for fast, short-memory tasks use a small spectral radius, for slow, long-memory tasks use a large one. Manual experimentation will be necessary in most cases. One does not have to care about finding the precise best value, however. The range of optimal settings is relatively broad, so if an ESN works well with a spectral radius of 0.8, it can also be expected to work well with 0.7 and with 0.85. The closer you get to 1, the smaller the region of optimality. A minimal code sketch for rescaling a weight matrix to a given spectral radius follows after the next two hints.

Find an appropriate model size

Generally, with a larger DR one can learn more complex dynamics, or learn a given dynamics with greater accuracy. However, beware of overfitting: if the model is too powerful (i.e. the DR too large), irrelevant statistical fluctuations in the training data will be learnt by the model. That leads to poor generalization on test data. Try increasing network sizes until the performance on test data deteriorates.

The problem of overfitting is particularly important when you train on empirical, noisy data. It is not theoretically quite clear (at least not to me) whether the concept of overfitting also carries over to non-statistical, deterministic, 100%-precisely defined training tasks, for example training a chaotic attractor described by a differential equation as in Jaeger (2001). The best results I obtained in that task were achieved with a 1000-unit network trained on 2000 data points, which means that 1000 parameters were estimated from 2000 data points. For empirical, statistical tasks, this would normally lead to overfitting (a rule of thumb in statistical learning is to have at least 10 data points per estimated parameter).

Add noise during sampling

When you are training an ESN with output feedback from accurate (mathematical, noise-free) training data, stability of the trained network is often difficult to achieve. A method that works wonders is to inject noise into the DR update during sampling, as described in Section 6, Eqn. (6.12). It is not clearly understood why this works. Attempts at an explanation are made in Jaeger (2001).

In tasks with empirical, noisy training data, noise insertion does not a priori make sense. Nor is it required when there are no output feedback connections. There is one situation, however, where noise insertion might make sense even with empirical data and without output feedback connections. That is when the learnt model overfits the data, which is revealed by a small training and a large test error. In this condition, injection of extra noise works as a regularizer in the sense of statistical learning theory. The training error will increase, but the test error will go down. However, a more appropriate way to avoid overfitting is to use smaller networks.
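Picking up the spectral radius hint from above: in practice one simply generates a random sparse matrix and rescales it by the ratio of the desired to the actual spectral radius. The helper below is a minimal sketch of that operation; the function name, the sizes and the candidate values are my own choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def random_reservoir(n, connectivity, spectral_radius):
    """Random sparse weight matrix rescaled to a prescribed spectral radius.
    Helper name and defaults are illustrative, not from the tutorial."""
    W = rng.uniform(-1, 1, (n, n)) * (rng.random((n, n)) < connectivity)
    return W * (spectral_radius / np.max(np.abs(np.linalg.eigvals(W))))

# Try a few nearby values rather than fine-tuning, as the hint above suggests.
for alpha in (0.7, 0.8, 0.85):
    W = random_reservoir(100, 0.1, alpha)
    print(alpha, "->", round(float(np.max(np.abs(np.linalg.eigvals(W)))), 3))
```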

Use an extra bias input

When the desired output has a mean value that departs from zero, it is a good idea to invest an extra input unit and feed it with a constant value ("bias") during training and testing. This bias input will immediately enable the training to set the trained output to the correct mean value. A relatively large bias input will shift many internal units towards one of the more extreme outer ranges of their sigmoids; this might be advisable when you want to achieve a strongly nonlinear behavior. Sometimes you do not want to affect the DR strongly by the bias input. In such cases, use a small value for the bias input (for instance, a value of 0.01), or connect the bias only to the output (i.e., put all bias-to-DR connections to zero).

Beware of symmetric input

Standard sigmoid networks are "symmetric" devices in the sense that when an input sequence u(n) gives an output sequence y(n), then the input sequence -u(n) will yield the output sequence -y(n). For instance, you can never train an ESN to produce an output y(n) = u^2(n) from an input u(n) which takes negative and positive values. There are two simple methods to succeed in such "asymmetric" tasks:

- Feed in an extra constant bias input. This will effectively render the DR an unsymmetric device.
- Shift the input. Instead of using the original input signal, use a shifted version which only takes positive sign.

The symmetric-input fallacy comes in many disguises and is often not easy to recognize. Generally be cautious when the input signal has a range that is roughly symmetrical around zero. It almost never harms to shift it into an asymmetrical range. Nor does a small bias input usually harm.

Shift and scale input

You are free to transform the input into any value range [a, b] by scaling and/or shifting it. A rule I work with: the more nonlinear the task, the more extravagantly I shift the input range. For instance, in a difficult nonlinear system identification task (30th-order NARMA system) I once got the best models with an input range of [a, b] = [3, 3.5]. The apparent reason is that shifting the input far away from zero made the DR work in a highly nonlinear range of its sigmoid units.

Blackbox modeling generals

Be aware of the fact that ESN is a blackbox modeling technique. This means that you cannot expect good results on test data which operate in a working range that was never visited during training. Or, to put it the other way round, make sure that your

training data visit all the working conditions that you will later meet when using the trained model. This is sometimes not easy to satisfy with nonlinear systems. For instance, a typical approach to obtain training data is to drive the empirical system with white noise input. This approach is well-motivated with linear systems, where white noise input in the training data generally reveals the most about the system to be identified. However, this may not be the case with nonlinear systems! By contrast, what typically happens is that white noise input keeps the nonlinear system in a small subregion of its working range. When the system (the original system or the trained model) is later driven with more orderly input (for instance, slowly changing input), it will be driven into a quite different working range. A black-box model trained from data with white noise input will be unable to work well in another working range.

On the other hand, one should also not cover in the training data more portions of the working range than will be met in testing / exploitation. Much of the modeling capacity of your model will then be used up to model those working regions which are later irrelevant. As a consequence, the model accuracy in the relevant working regions will be poorer. The golden rule is: use basically the same kind of input during training as you will later encounter in testing / exploitation, but make the training input a bit more varied than you expect the input in the exploitation phase to be.

Acknowledgment. The author is indebted to Stéphane Beauregard for proofreading, to Danil Prokhorov for many insights into the art of RNN training, and to Wolfgang Maass for a highly valued professional collaboration.

References

An up-to-date repository of ESN publications and computational resources can be found at …. A repository of publications concerning the related "liquid state machine" approach can be found at ….

Bengio, Y., Simard, P. and Frasconi, P. (1994), Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 1994.

Bertschinger, N. (2002), Kurzzeitspeicher ohne stabile Zustände in rückgekoppelten neuronalen Netzen [Short-term memory without stable states in recurrent neural networks]. Diplomarbeit, Informatik VII, RWTH Aachen, 2002.

Dos Santos, E.P. and von Zuben, F.J. (2000), Efficient second-order learning algorithms for discrete-time recurrent neural networks. In: Medsker, L.R. and Jain, L.C. (eds.), Recurrent Neural Networks: Design and Applications, CRC Press: Boca Raton, Florida, 2000.

Doya, K. (1992), Bifurcations in the learning of recurrent neural networks. Proceedings of the 1992 IEEE Int. Symp. on Circuits and Systems, vol. 6, 1992.

Doya, K. (1995), Recurrent networks: supervised learning. In: Arbib, M.A. (ed.), The Handbook of Brain Theory and Neural Networks, MIT Press / Bradford Books, 1995.

Feldkamp, L.A., Prokhorov, D., Eagen, C.F. and Yuan, F. (1998), Enhanced multi-stream Kalman filter training for recurrent networks. In: XXX (ed.), Nonlinear Modeling: Advanced Black-Box Techniques, Kluwer: Boston, 1998.

Jaeger, H. (2001), The "echo state" approach to analysing and training recurrent neural networks. GMD Report 148, GMD - German National Research Institute for Computer Science, 2001.

Jaeger, H. (2001a), Short term memory in echo state networks. GMD Report 152, GMD - German National Research Institute for Computer Science, 2002.

Jaeger, H. (2002), Adaptive nonlinear system identification with echo state networks. To appear in the Proceedings of NIPS 02.

Puskorius, G.V. and Feldkamp, L.A. (1994), Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks 5(2), 1994.

Schraudolph, N. (2002), Fast curvature matrix-vector products for second-order gradient descent. To appear in Neural Computation. Manuscript online at ….

Singhal, S. and Wu, L. (1989), Training multilayer perceptrons with the extended Kalman algorithm. In: Touretzky, D.S. (ed.), Advances in Neural Information Processing Systems 1, Morgan Kaufmann: San Mateo, CA, 1989.

Werbos, P.J. (1990), Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78(10), 1990. (A much-cited article explaining BPTT in a detailed fashion, and introducing a notation that is often used elsewhere.)

Williams, R.J. and Zipser, D. (1989), A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1, 1989.


More information

Section 5.4 Annuities, Present Value, and Amortization

Section 5.4 Annuities, Present Value, and Amortization Secton 5.4 Annutes, Present Value, and Amortzaton Present Value In Secton 5.2, we saw that the present value of A dollars at nterest rate per perod for n perods s the amount that must be deposted today

More information

PERRON FROBENIUS THEOREM

PERRON FROBENIUS THEOREM PERRON FROBENIUS THEOREM R. CLARK ROBINSON Defnton. A n n matrx M wth real entres m, s called a stochastc matrx provded () all the entres m satsfy 0 m, () each of the columns sum to one, m = for all, ()

More information

Stochastic epidemic models revisited: Analysis of some continuous performance measures

Stochastic epidemic models revisited: Analysis of some continuous performance measures Stochastc epdemc models revsted: Analyss of some contnuous performance measures J.R. Artalejo Faculty of Mathematcs, Complutense Unversty of Madrd, 28040 Madrd, Span A. Economou Department of Mathematcs,

More information

Efficient Project Portfolio as a tool for Enterprise Risk Management

Efficient Project Portfolio as a tool for Enterprise Risk Management Effcent Proect Portfolo as a tool for Enterprse Rsk Management Valentn O. Nkonov Ural State Techncal Unversty Growth Traectory Consultng Company January 5, 27 Effcent Proect Portfolo as a tool for Enterprse

More information

The OC Curve of Attribute Acceptance Plans

The OC Curve of Attribute Acceptance Plans The OC Curve of Attrbute Acceptance Plans The Operatng Characterstc (OC) curve descrbes the probablty of acceptng a lot as a functon of the lot s qualty. Fgure 1 shows a typcal OC Curve. 10 8 6 4 1 3 4

More information

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION Vson Mouse Saurabh Sarkar a* a Unversty of Cncnnat, Cncnnat, USA ABSTRACT The report dscusses a vson based approach towards trackng of eyes and fngers. The report descrbes the process of locatng the possble

More information

where the coordinates are related to those in the old frame as follows.

where the coordinates are related to those in the old frame as follows. Chapter 2 - Cartesan Vectors and Tensors: Ther Algebra Defnton of a vector Examples of vectors Scalar multplcaton Addton of vectors coplanar vectors Unt vectors A bass of non-coplanar vectors Scalar product

More information

On the Interaction between Load Balancing and Speed Scaling

On the Interaction between Load Balancing and Speed Scaling On the Interacton between Load Balancng and Speed Scalng Ljun Chen and Na L Abstract Speed scalng has been wdely adopted n computer and communcaton systems, n partcular, to reduce energy consumpton. An

More information

An Empirical Study of Search Engine Advertising Effectiveness

An Empirical Study of Search Engine Advertising Effectiveness An Emprcal Study of Search Engne Advertsng Effectveness Sanjog Msra, Smon School of Busness Unversty of Rochester Edeal Pnker, Smon School of Busness Unversty of Rochester Alan Rmm-Kaufman, Rmm-Kaufman

More information

On-Line Fault Detection in Wind Turbine Transmission System using Adaptive Filter and Robust Statistical Features

On-Line Fault Detection in Wind Turbine Transmission System using Adaptive Filter and Robust Statistical Features On-Lne Fault Detecton n Wnd Turbne Transmsson System usng Adaptve Flter and Robust Statstcal Features Ruoyu L Remote Dagnostcs Center SKF USA Inc. 3443 N. Sam Houston Pkwy., Houston TX 77086 Emal: [email protected]

More information

Time Value of Money. Types of Interest. Compounding and Discounting Single Sums. Page 1. Ch. 6 - The Time Value of Money. The Time Value of Money

Time Value of Money. Types of Interest. Compounding and Discounting Single Sums. Page 1. Ch. 6 - The Time Value of Money. The Time Value of Money Ch. 6 - The Tme Value of Money Tme Value of Money The Interest Rate Smple Interest Compound Interest Amortzng a Loan FIN21- Ahmed Y, Dasht TIME VALUE OF MONEY OR DISCOUNTED CASH FLOW ANALYSIS Very Important

More information

Stochastic Protocol Modeling for Anomaly Based Network Intrusion Detection

Stochastic Protocol Modeling for Anomaly Based Network Intrusion Detection Stochastc Protocol Modelng for Anomaly Based Network Intruson Detecton Juan M. Estevez-Tapador, Pedro Garca-Teodoro, and Jesus E. Daz-Verdejo Department of Electroncs and Computer Technology Unversty of

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Dstrb. Comput. 71 (2011) 62 76 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. journal homepage: www.elsever.com/locate/jpdc Optmzng server placement n dstrbuted systems n

More information

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and POLYSA: A Polynomal Algorthm for Non-bnary Constrant Satsfacton Problems wth and Mguel A. Saldo, Federco Barber Dpto. Sstemas Informátcos y Computacón Unversdad Poltécnca de Valenca, Camno de Vera s/n

More information

Multiple stage amplifiers

Multiple stage amplifiers Multple stage amplfers Ams: Examne a few common 2-transstor amplfers: -- Dfferental amplfers -- Cascode amplfers -- Darlngton pars -- current mrrors Introduce formal methods for exactly analysng multple

More information