Generalization Dynamics in LMS Trained Linear Networks

Yves Chauvin*
Psychology Department
Stanford University
Stanford, CA 94305

*Also with Thomson-CSF, Inc., 630 Hansen Way, Suite 250, Palo Alto, CA 94304.

Abstract

For a simple linear case, a mathematical analysis of the training and generalization (validation) performance of networks trained by gradient descent on a Least Mean Square cost function is provided as a function of the learning parameters and of the statistics of the training data base. The analysis predicts that generalization error dynamics are very dependent on a priori initial weights. In particular, the generalization error might sometimes weave within a computable range during extended training. In some cases, the analysis provides bounds on the optimal number of training cycles for minimal validation error. For a speech labeling task, predicted weaving effects were qualitatively tested and observed by computer simulations in networks trained by the linear and non-linear back-propagation algorithm.

1 INTRODUCTION

Recent progress in network design demonstrates that non-linear feedforward neural networks can perform impressive pattern classification for a variety of real-world applications (e.g., Le Cun et al., 1990; Waibel et al., 1989). Various simulations and relationships between the neural network and machine learning theoretical literatures also suggest that too large a number of free parameters ("weight overfitting") could substantially reduce generalization performance (e.g., Baum & Haussler, 1989).

A number of solutions have recently been proposed to decrease or eliminate the overfitting problem in specific situations. They range from ad hoc heuristics to theoretical considerations (e.g., Le Cun et al., 1990; Chauvin, 1990a; Weigend et al., In Press). For a phoneme labeling application, Chauvin showed that the overfitting phenomenon was actually observed only when networks were overtrained far beyond their "optimal" performance point (Chauvin, 1990b). Furthermore, generalization performance of networks seemed to be independent of the size of the network during early training, but the rate of decrease in performance with overtraining was indeed related to the number of weights.

The goal of this paper is to better understand training and generalization error dynamics in Least-Mean-Square trained linear networks. As we will see, gradient descent training on linear networks can actually generate surprisingly rich and insightful validation dynamics. Furthermore, in numerous applications, even non-linear networks tend to function in their linear range, as if the networks were making use of non-linearities only when necessary (Weigend et al., In Press; Chauvin, 1990a). In Section 2, I present a theoretical illustration yielding a better understanding of training and validation error dynamics. In Section 3, numerical solutions to the obtained analytical results make interesting predictions for validation dynamics under overtraining. These predictions are tested for a phonemic labeling task. The obtained simulations suggest that the results of the analysis obtained with the simple theoretical framework of Section 2 might remain qualitatively valid for non-linear complex architectures.

2 THEORETICAL ILLUSTRATION

2.1 ASSUMPTIONS

Let us consider a linear network composed of $n$ input units and $n$ output units fully connected by a weight matrix $W$. Let us suppose the network is trained to reproduce a noiseless output "signal" from a noisy input "signal" (the network can be seen as a linear filter). We write $F$ as the "signal", $N$ the noise, $X$ the input, $Y$ the output, and $D$ the desired output. For the considered case, we have $X = F + N$, $Y = WX$ and $D = F$.

The statistical properties of the data base are the following. The signal is zero-mean with covariance matrix $C_F$.
We write $\lambda_i$ and $e_i$ as the eigenvalues and eigenvectors of $C_F$ (the $e_i$ are the so-called principal components; we will call the $\lambda_i$ the "signal power spectrum"). The noise is assumed to be zero-mean, with covariance matrix $C_N = \nu I$ where $I$ is the identity matrix. We assume the noise is uncorrelated with the signal: $C_{FN} = 0$.

We suppose two sets of patterns have been sampled for training and for validation. We write $C_F$, $C_N$ and $C_{FN}$ for the resulting covariance matrices of the training set and $C'_F$, $C'_N$ and $C'_{FN}$ for the corresponding matrices of the validation set. We assume that both samples reproduce the signal statistics of the population, $C_F \approx C'_F$ and $C_{FN} \approx C'_{FN} \approx 0$, and that they differ only in noise power: $C_N = \nu I$ and $C'_N = \nu' I$ with $\nu' > \nu$. (Many of these assumptions are made for the sake of clarity of explanation: they can be relaxed without changing the resulting implications.)

The problem considered is much simpler than typical realistic applications. However, we will see below that (i) a formal analysis becomes complex very quickly, (ii) the validation dynamics are rich, insightful and can be mapped to a number of results observed in simulations of realistic applications, and (iii) an interesting number of predictions can be obtained.
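
The data model of this section can be illustrated with a small sampling sketch. It assumes, for illustration only, Gaussian signal and noise and a coordinate system already aligned with the principal components (so that $C_F$ is diagonal); the helper names `sample_patterns` and `second_moment` are hypothetical, and the numerical values are borrowed from the case study of Section 3.1. It checks empirically that the input variance along each component is close to $\lambda_i + \nu$ and that signal and noise are nearly uncorrelated.

```python
import random

def sample_patterns(lambdas, nu, count, seed=0):
    """Draw (F, X) pairs with X = F + N, working in the eigenbasis of C_F."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(count):
        # Signal component i has variance lambda_i (C_F is diagonal in this basis).
        f = [rng.gauss(0.0, lam ** 0.5) for lam in lambdas]
        # Noise is isotropic with variance nu, i.e. C_N = nu * I.
        n = [rng.gauss(0.0, nu ** 0.5) for _ in lambdas]
        x = [fi + ni for fi, ni in zip(f, n)]
        patterns.append((f, x))
    return patterns

def second_moment(us, vs):
    """Empirical E[u * v] for two zero-mean scalar sequences."""
    return sum(u * v for u, v in zip(us, vs)) / len(us)

lambdas, nu = (17.0, 1.7), 2.0   # assumed values, taken from Section 3.1
train = sample_patterns(lambdas, nu, count=20000)
for i, lam in enumerate(lambdas):
    f_i = [f[i] for f, _ in train]
    x_i = [x[i] for _, x in train]
    n_i = [xv - fv for fv, xv in zip(f_i, x_i)]
    # Signal variance, input variance and signal-noise cross moment.
    print(i, second_moment(f_i, f_i), second_moment(x_i, x_i), second_moment(f_i, n_i))
```

A validation sample under the same assumptions would simply call `sample_patterns` with the larger noise power $\nu'$ and a different seed.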

2.2 LEARNING

The network is trained by gradient descent on the Least Mean Square (LMS) error: $\Delta W = -\eta \nabla_W E$ where $\eta$ is the usual learning rate and, in the case considered, $E = \sum_p (F_p - Y_p)^T (F_p - Y_p)$. Up to a constant factor absorbed into $\eta$, the descent direction can be written as a function of the various covariance matrices: $-\nabla_W E = (I - W) C_F + (I - 2W) C_{FN} - W C_N$. From the general assumptions, we get:

$$-\nabla_W E \approx C_F - W C_F - W C_N \tag{1}$$

We assume now that the principal components $e_i$ are also eigenvectors of the weight matrix $W$ at iteration $k$ with corresponding eigenvalue $\alpha_{ik}$: $W_k e_i = \alpha_{ik} e_i$. We can then compute the image of each eigenvector $e_i$ at iteration $k + 1$:

$$W_{k+1} e_i = \eta \lambda_i e_i + \alpha_{ik} [1 - \eta(\lambda_i + \nu)] e_i \tag{2}$$

Therefore, $e_i$ is also an eigenvector of $W_{k+1}$ and $\alpha_{i,k+1}$ satisfies the induction:

$$\alpha_{i,k+1} = \eta \lambda_i + \alpha_{ik} [1 - \eta(\lambda_i + \nu)] \tag{3}$$

Assuming $W_0 = 0$, we can compute the alpha-dynamics of the weight matrix $W$:

$$\alpha_{ik} = \frac{\lambda_i}{\lambda_i + \nu} \left[ 1 - (1 - \eta(\lambda_i + \nu))^k \right] \tag{4}$$

As $k$ goes to infinity, provided $\eta < 1/(\lambda_M + \nu)$, $\alpha_i$ approaches $\lambda_i/(\lambda_i + \nu)$, which corresponds to the optimal (Wiener) value of the linear filter implemented by the network. We will write the convergence rates $a_i = 1 - \eta\lambda_i - \eta\nu$. These rates depend on the signal "power spectrum", on the noise power and on the learning rate $\eta$.

If we now assume $W_0 e_i = \alpha_{i0} e_i$ with $\alpha_{i0} \neq 0$ (this assumption can be made more general), we get:

$$\alpha_{ik} = \frac{\lambda_i}{\lambda_i + \nu} \left( 1 - b_i a_i^k \right) \tag{5}$$

where $b_i = 1 - \alpha_{i0} - \alpha_{i0}\nu/\lambda_i$. Figure 1 represents possible alpha dynamics for arbitrary values of $\lambda_i$ with $\alpha_{i0} = \alpha_0 \neq 0$.

We can now compute the learning error dynamics by expanding the LMS error term $E$ at time $k$. Using the general assumptions on the covariance matrices, we find:

$$E_k = \sum_i E_{ik} = \sum_i \left[ \lambda_i (1 - \alpha_{ik})^2 + \nu \alpha_{ik}^2 \right] \tag{6}$$

Therefore, training error is a sum of error components, each of them being a quadratic function of $\alpha_i$. Figure 2 represents a training error component $E_i$ as a function of $\alpha_i$. Knowing the alpha-dynamics, we can write these error components as a function of $k$:

$$E_{ik} = \frac{\lambda_i}{\lambda_i + \nu} \left( \nu + \lambda_i b_i^2 a_i^{2k} \right) \tag{7}$$

It is easy to see that $E$ is a monotonic decreasing function (generated by gradient descent) which converges to the bottom of the quadratic error surface, yielding the residual asymptotic error:

$$E_\infty = \sum_i \frac{\nu \lambda_i}{\lambda_i + \nu} \tag{8}$$
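
As a sanity check on the alpha-dynamics, the following minimal sketch iterates the induction $\alpha_{i,k+1} = \eta\lambda_i + \alpha_{ik}[1 - \eta(\lambda_i + \nu)]$ for a single component and compares it with the closed form $\alpha_{ik} = \frac{\lambda_i}{\lambda_i+\nu}(1 - b_i a_i^k)$, and likewise compares the quadratic error component $\lambda_i(1-\alpha_{ik})^2 + \nu\alpha_{ik}^2$ with its closed form in $k$. The function names and the parameter values ($\lambda = 1.7$, $\nu = 2$, $\eta = 0.01$, $\alpha_0 = 0.8$) are illustrative assumptions, not taken from the author's simulations.

```python
def alpha_closed_form(lam, nu, eta, alpha0, k):
    """Closed-form alpha after k cycles, starting from alpha0."""
    a = 1.0 - eta * (lam + nu)              # convergence rate a_i
    b = 1.0 - alpha0 - alpha0 * nu / lam    # coefficient b_i
    return lam / (lam + nu) * (1.0 - b * a ** k)

def alpha_recursion(lam, nu, eta, alpha0, k):
    """Same quantity obtained by iterating the induction k times."""
    alpha = alpha0
    for _ in range(k):
        alpha = eta * lam + alpha * (1.0 - eta * (lam + nu))
    return alpha

def training_error_component(lam, nu, alpha):
    """Quadratic training error component as a function of alpha."""
    return lam * (1.0 - alpha) ** 2 + nu * alpha ** 2

def training_error_closed_form(lam, nu, eta, alpha0, k):
    """Training error component written directly as a function of k."""
    a = 1.0 - eta * (lam + nu)
    b = 1.0 - alpha0 - alpha0 * nu / lam
    return lam / (lam + nu) * (nu + lam * b ** 2 * a ** (2 * k))

lam, nu, eta, alpha0 = 1.7, 2.0, 0.01, 0.8   # illustrative values only
for k in (0, 10, 100, 1000):
    alpha = alpha_recursion(lam, nu, eta, alpha0, k)
    print(k, alpha, alpha_closed_form(lam, nu, eta, alpha0, k),
          training_error_component(lam, nu, alpha),
          training_error_closed_form(lam, nu, eta, alpha0, k))
```

For large `k` both error expressions settle at $\nu\lambda/(\lambda+\nu)$, the contribution of this component to the residual asymptotic error.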

[Figure 1: plot of $\alpha_{ik}$ (vertical axis, 0 to 1) against Number of Cycles (horizontal axis, 0 to 100).]

Figure 1: Alpha dynamics for different values of $\lambda_i$ with $\eta = .01$ and $\alpha_{i0} = \alpha_0 \neq 0$. The solid lines represent the optimal values of $\alpha_i$ for the training data set. The dashed lines represent corresponding optimal values for the validation data set.

[Figure 2: plot of the LMS error against $\alpha_{ik}$ (horizontal axis, 0 to 1), with $\lambda_i/(\lambda_i + \nu')$ and $\lambda_i/(\lambda_i + \nu)$ marked on the horizontal axis.]

Figure 2: Training and validation error dynamics as a function of $\alpha_i$. The dashed curved lines represent the error dynamics for the initial conditions $\alpha_{i0}$. Each training error component follows the gradient of a quadratic learning curve (bottom). Note the overtraining phenomenon (top curve) between $\alpha_i^*$ (optimal for validation) and $\alpha_{i\infty}$ (optimal for training).

2.3 GENERALIZATION

Considering the general assumptions on the statistics of the data base, we can compute the validation error $E'$. (Note that "validation error" strictly applies to the validation data set; "generalization error" can qualify the validation data set or the whole population, depending on context.)

$$E'_k = \sum_i E'_{ik} = \sum_i \left[ \lambda_i (1 - \alpha_{ik})^2 + \nu' \alpha_{ik}^2 \right] \tag{9}$$

where the alpha-dynamics are imposed by gradient descent learning on the training data set. Again, the validation error is a sum of error components $E'_i$, quadratic functions of $\alpha_i$. However, because the alpha-dynamics are adapted to the training sample, they might generate complex dynamics which will strongly depend on the initial values $\alpha_{i0}$ (Figure 1). Consequently, the resulting error components $E'_i$ are not monotonic decreasing functions anymore. As seen in Figure 2, each of the validation error components might (i) decrease, (ii) decrease then increase (overtraining) or (iii) increase, depending on $\alpha_{i0}$. For each of these components, in the case of overtraining, it is possible to compute the cycle $k_i^*$ at which training should be stopped to get minimal validation error, that is, the cycle at which $\alpha_{ik}$ reaches $\lambda_i/(\lambda_i + \nu')$:

$$k_i^* = \frac{\mathrm{Log}\,\dfrac{\nu' - \nu}{\lambda_i + \nu'} + \mathrm{Log}\,\dfrac{\lambda_i}{\lambda_i - \alpha_{i0}(\lambda_i + \nu)}}{\mathrm{Log}(1 - \eta\lambda_i - \eta\nu)} \tag{10}$$

However, the validation error dynamics become much more complex when we consider sums of these components. If we assume $\alpha_{i0} = 0$, the minimum (or minima) of $E'$ can be found to correspond to possible intersections of hyper-ellipsoids and power curves. In general, it is possible to show that there exists at least one such minimum. It is also possible to find simple bounds on the optimal training time for minimal validation error:

(11) [equation not recoverable from the source text]

These bounds are tight when the noise power is small compared to the signal "power spectrum". For $\alpha_{i0} \neq 0$, a formal analysis of the validation error dynamics becomes intractable. Because some error components might increase while others decrease, it is possible to imagine multiple minima and maxima for the total validation error (see simulations below).

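
The per-component stopping time can be sketched as follows. The derivation assumed here is that a validation component $\lambda_i(1-\alpha)^2 + \nu'\alpha^2$ is minimal at $\alpha = \lambda_i/(\lambda_i+\nu')$, so the optimal cycle is the (real-valued) $k$ at which the training-driven $\alpha_{ik}$ crosses that value. The function names are hypothetical, the learning rate is assumed to satisfy $\eta < 1/(\lambda_i + \nu)$, and the numerical values are used for illustration only.

```python
import math

def alpha_at(lam, nu, eta, alpha0, k):
    """Alpha after k cycles of training (k may be fractional)."""
    target = lam / (lam + nu)               # asymptotic (training-optimal) value
    a = 1.0 - eta * (lam + nu)              # convergence rate, assumed in (0, 1)
    return target + (alpha0 - target) * a ** k

def validation_component(lam, nu_prime, alpha):
    """Validation error component as a function of alpha."""
    return lam * (1.0 - alpha) ** 2 + nu_prime * alpha ** 2

def optimal_cycle(lam, nu, nu_prime, eta, alpha0):
    """Cycle at which alpha equals lam / (lam + nu_prime), or None.

    None is returned when alpha never crosses the validation optimum,
    i.e. when the component does not show overtraining.
    """
    target_train = lam / (lam + nu)
    target_val = lam / (lam + nu_prime)
    if alpha0 == target_train:
        return None                          # alpha does not move at all
    ratio = (target_train - target_val) / (target_train - alpha0)
    if ratio <= 0.0 or ratio > 1.0:
        return None                          # validation optimum is never reached
    return math.log(ratio) / math.log(1.0 - eta * (lam + nu))

# Illustrative values: a fast component starting from a zero weight.
k_star = optimal_cycle(lam=17.0, nu=2.0, nu_prime=10.0, eta=0.01, alpha0=0.0)
print("stop near cycle", k_star)
```

In practice the returned value would be rounded to a neighbouring integer cycle; a component that starts above its training optimum, such as $\lambda = 1.7$ with $\alpha_0 = 0.8$ under the same noise powers, yields `None`.
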
Considering each component's dynamics, it is nonetheless possible to compute bounds within which $E'$ might vary during training:

$$\sum_i \frac{\lambda_i \nu'}{\lambda_i + \nu'} \le E'_k \le \sum_i \frac{\lambda_i (\nu^2 + \nu' \lambda_i)}{(\lambda_i + \nu)^2} \tag{12}$$

Because of the "exponential" nature of training (Figure 1), it is possible to imagine that this "weaving" effect might still be observed after a long training period, when the training error itself has become stable. Furthermore, whereas the training error will qualitatively show the same dynamics, validation error will very much depend on $\alpha_{i0}$: for sufficiently large initial weights, validation dynamics might be very dependent on particular simulation "runs".
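
The weaving behaviour can be explored with a short two-component sketch that sums the validation error components along the training-driven alpha-dynamics and lists the interior extrema of the resulting curve. The signal and noise values are those of the case study in Section 3.1; the learning rate $\eta = 0.01$ is an assumption borrowed from the caption of Figure 1, since the text does not state the value used in the simulations, and all function names are hypothetical.

```python
def alpha_at(lam, nu, eta, alpha0, k):
    """Alpha after k training cycles (closed form of the alpha-dynamics)."""
    target = lam / (lam + nu)
    return target + (alpha0 - target) * (1.0 - eta * (lam + nu)) ** k

def validation_error(lams, nu, nu_prime, eta, alpha0s, k):
    """Total validation error at cycle k, summed over components."""
    total = 0.0
    for lam, alpha0 in zip(lams, alpha0s):
        al = alpha_at(lam, nu, eta, alpha0, k)
        total += lam * (1.0 - al) ** 2 + nu_prime * al ** 2
    return total

def validation_bounds(lams, nu, nu_prime):
    """Lower value (each component at its validation optimum) and
    upper value (each component at its training optimum)."""
    lower = sum(lam * nu_prime / (lam + nu_prime) for lam in lams)
    upper = sum(lam * (nu ** 2 + nu_prime * lam) / (lam + nu) ** 2 for lam in lams)
    return lower, upper

def interior_extrema(curve):
    """Sequence of 'min' / 'max' labels for strict interior extrema."""
    kinds = []
    for prev, cur, nxt in zip(curve, curve[1:], curve[2:]):
        if cur < prev and cur < nxt:
            kinds.append("min")
        elif cur > prev and cur > nxt:
            kinds.append("max")
    return kinds

lams, nu, nu_prime, eta = (17.0, 1.7), 2.0, 10.0, 0.01   # eta is assumed
print("bounds:", validation_bounds(lams, nu, nu_prime))
for alpha20 in (0.0, 0.8, 1.6):
    curve = [validation_error(lams, nu, nu_prime, eta, (0.0, alpha20), k)
             for k in range(300)]
    print(alpha20, interior_extrema(curve), curve[-1])
```

The number of cycles is kept moderate on purpose: very late in training successive values differ by less than floating-point resolution, which would produce spurious extrema in `interior_extrema`.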

[Figure 3: plot of training and validation error against the number of training cycles.]

Figure 3: Training (bottom curves) and validation (top curves) error dynamics in a two-dimensional case for $\lambda_1 = 17$, $\lambda_2 = 1.7$, $\nu = 2$, $\nu' = 10$, $\alpha_{10} = 0$ as $\alpha_{20}$ varies from 0 to 1.6 (bottom-up) in .2 increments.

3 SIMULATIONS

3.1 CASE STUDY

Equations 7 and 9 were simulated for a two-dimensional case ($n = 2$) with $\lambda_1 = 17$, $\lambda_2 = 1.7$, $\nu = 2$, $\nu' = 10$ and $\alpha_{10} = 0$. The values of $\alpha_{20}$ determined the relative dominance of the two error components during training. Figure 3 represents training and validation dynamics as a function of $k$ for a range of values of $\alpha_{20}$. As shown analytically, training dynamics are basically unaffected by the initial conditions of the weight matrix $W_0$. However, a variety of validation dynamics can be observed as $\alpha_{20}$ varies from 0 to 1.6. For $1.6 \ge \alpha_{20} \ge 1.4$, the validation error is monotonically decreasing and looks like a typical "gradient descent" training error. For $1.2 \ge \alpha_{20} \ge 1.0$, each error component in turn imposes a descent rate: the validation error looks like two "connected descents". For $.8 \ge \alpha_{20} \ge .6$, $E'_2$ is monotonically decreasing with a slow convergence rate, forcing the validation error to decrease long after $E'_1$ has become stable. This creates a minimum, followed by a maximum, followed by a minimum for $E'$. Finally, for $.4 \ge \alpha_{20} \ge 0$, both error components have a single minimum during training and generate a single minimum for the total validation error $E'$.

3.2 PHONEMIC LABELING

One of the main predictions obtained from the analytical results and from the previous case study is that validation dynamics can demonstrate multiple local minima and maxima. To my knowledge, this phenomenon has not been described in the literature. However, the theory also predicts that the phenomenon will probably appear very late in training, well after the training error has become stable, which might explain the absence of such observations. The predictions were tested for a phonemic labeling task with spectrograms as input patterns and phonemes as output patterns. Various architectures were tested (direct connections or back-propagation networks with linear or non-linear hidden layers). Due to the limited length of this article, the complete simulations will be reported elsewhere. In all cases, as predicted, multiple minima/maxima were observed for the validation dynamics, provided the networks were trained well beyond usual training times. Furthermore, these generalization dynamics were very dependent on the initial weights (provided the initial weight distribution had sufficient variance).

4 DISCUSSION

It is sometimes assumed that optimal learning is obtained when validation error starts to increase during the course of training. Although for the theoretical study presented, the first minimum of $E'$ is probably always a global minimum, independently of the initial values $\alpha_{i0}$, simulations of the speech labeling task show that this is not always the case with more complex architectures: late validation minima can sometimes (albeit rarely) be deeper than the first "local" minimum. These observations and a lack of theoretical understanding of statistical inference under limited data sets raise the question of the significance of a validation data set. As a final comment, we are not really interested in minimal validation error ($E'$) but in minimal generalization error over the whole population. Understanding the dynamics of the "population" error as a function of training and validation errors necessitates, at least, an evaluation of the sample statistics as a function of the number of training and validation patterns. This is beyond the scope of this paper.

Acknowledgements

Thanks to Pierre Baldi and Julie Holmes for their helpful comments.

References

Baum, E. B. & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151-160.

Chauvin, Y. (1990a). Dynamic behavior of constrained back-propagation networks. In D. S. Touretzky (Ed.), Neural Information Processing Systems (Vol. 2) (pp. 642-649). San Mateo, CA: Morgan Kaufmann.

Chauvin, Y. (1990b). Generalization performance of overtrained back-propagation networks. In L. B. Almeida & C. J.
Wellekens (Eds.), Lecture Notes in Computer Science (Vol. 412) (pp. 46-55). Berlin, Germany: Springer-Verlag.

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In D. S. Touretzky (Ed.), Neural Information Processing Systems (Vol. 2) (pp. 396-404). San Mateo, CA: Morgan Kaufmann.

Waibel, A., Sawai, H., & Shikano, K. (1989). Modularity and scaling in large phonemic neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-37, 1888-1898.

Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (In Press). Predicting the future: a connectionist approach. International Journal of Neural Systems.