Generalization Dynamics in LMS Trained Linear Networks




Yves Chauvin*
Psychology Department
Stanford University
Stanford, CA 94305

* Also with Thomson-CSF, Inc., 630 Hansen Way, Suite 250, Palo Alto, CA 94304.

Abstract

For a simple linear case, a mathematical analysis of the training and generalization (validation) performance of networks trained by gradient descent on a Least Mean Square cost function is provided as a function of the learning parameters and of the statistics of the training data base. The analysis predicts that generalization error dynamics are very dependent on a priori initial weights. In particular, the generalization error might sometimes weave within a computable range during extended training. In some cases, the analysis provides bounds on the optimal number of training cycles for minimal validation error. For a speech labeling task, predicted weaving effects were qualitatively tested and observed by computer simulations in networks trained by the linear and non-linear back-propagation algorithm.

1 INTRODUCTION

Recent progress in network design demonstrates that non-linear feedforward neural networks can perform impressive pattern classification for a variety of real-world applications (e.g., Le Cun et al., 1990; Waibel et al., 1989). Various simulations and relationships between the neural network and machine learning theoretical literatures also suggest that too large a number of free parameters ("weight overfitting") could substantially reduce generalization performance (e.g., Baum & Haussler, 1989).

A number of solutions have recently been proposed to decrease or eliminate the overfitting problem in specific situations. They range from ad hoc heuristics to theoretical considerations (e.g., Le Cun et al., 1990; Chauvin, 1990a; Weigend et al., In Press). For a phoneme labeling application, Chauvin showed that the overfitting phenomenon was actually observed only when networks were overtrained far beyond their "optimal" performance point (Chauvin, 1990b). Furthermore, generalization performance of networks seemed to be independent of the size of the network during early training, but the rate of decrease in performance with overtraining was indeed related to the number of weights.

The goal of this paper is to better understand training and generalization error dynamics in Least-Mean-Square trained linear networks. As we will see, gradient descent training on linear networks can actually generate surprisingly rich and insightful validation dynamics. Furthermore, in numerous applications, even non-linear networks tend to function in their linear range, as if the networks were making use of non-linearities only when necessary (Weigend et al., In Press; Chauvin, 1990a). In Section 2, I present a theoretical illustration yielding a better understanding of training and validation error dynamics. In Section 3, numerical solutions to the analytical results make interesting predictions for validation dynamics under overtraining. These predictions are tested for a phonemic labeling task. The simulations suggest that the results of the analysis obtained with the simple theoretical framework of Section 2 might remain qualitatively valid for complex non-linear architectures.

2 THEORETICAL ILLUSTRATION

2.1 ASSUMPTIONS

Let us consider a linear network composed of $n$ input units and $n$ output units fully connected by an $n \times n$ weight matrix $W$. Let us suppose the network is trained to reproduce a noiseless output "signal" from a noisy input "signal" (the network can be seen as a linear filter). We write $F$ for the "signal", $N$ for the noise, $X$ for the input, $Y$ for the output, and $D$ for the desired output. For the considered case, we have $X = F + N$, $Y = WX$ and $D = F$.

The statistical properties of the data base are the following. The signal is zero-mean with covariance matrix $C_F$. We write $\lambda_i$ and $e_i$ for the eigenvalues and eigenvectors of $C_F$ (the $e_i$ are the so-called principal components; we will call the $\lambda_i$ the signal "power spectrum"). The noise is assumed to be zero-mean, with covariance matrix $C_N = vI$, where $I$ is the identity matrix. We assume the noise is uncorrelated with the signal: $C_{FN} = 0$.

We suppose two sets of patterns have been sampled, one for training and one for validation. We write $C_F$, $C_N$ and $C_{FN}$ for the resulting covariance matrices of the training set and $C'_F$, $C'_N$ and $C'_{FN}$ for the corresponding matrices of the validation set; from here on, unprimed matrices refer to the training set and primed matrices to the validation set. We assume that both sample signal covariances are close to the population one, $C_F \approx C'_F$, that the sample cross-covariances vanish, $C_{FN} \approx C'_{FN} \approx 0$, and that $C_N = vI$ and $C'_N = v'I$ with $v' > v$. (Several of these assumptions are made for the sake of clarity of explanation: they can be relaxed without changing the resulting implications.)

The problem considered is much simpler than typical realistic applications. However, we will see below that (i) a formal analysis becomes complex very quickly, (ii) the validation dynamics are rich, insightful and can be mapped to a number of results observed in simulations of realistic applications, and (iii) an interesting number of predictions can be obtained.
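The data model above can be made concrete with a small sampling sketch. It is only an illustration under stated assumptions: the patterns are drawn as independent Gaussians directly in the eigenbasis of $C_F$ (the text only requires zero means and the stated covariances), the validation set is simply drawn with a larger noise power, and the function names `sample_patterns` and `diagonal_covariances` are hypothetical. The numerical values are borrowed from the case study of Section 3.1.

```python
import random


def sample_patterns(lams, noise_power, count, rng):
    """Draw (F, N, X) triples with X = F + N.

    Assumption: components are independent Gaussians expressed in the
    eigenbasis of C_F, so that C_F = diag(lams) and C_N = noise_power * I.
    """
    patterns = []
    for _ in range(count):
        signal = [rng.gauss(0.0, lam ** 0.5) for lam in lams]
        noise = [rng.gauss(0.0, noise_power ** 0.5) for _ in lams]
        inputs = [f + n for f, n in zip(signal, noise)]
        patterns.append((signal, noise, inputs))
    return patterns


def diagonal_covariances(patterns):
    """Return the diagonals of the sample C_F, C_N and C_FN (zero-mean data)."""
    dim = len(patterns[0][0])
    count = len(patterns)
    c_f = [sum(f[i] * f[i] for f, _, _ in patterns) / count for i in range(dim)]
    c_n = [sum(n[i] * n[i] for _, n, _ in patterns) / count for i in range(dim)]
    c_fn = [sum(f[i] * n[i] for f, n, _ in patterns) / count for i in range(dim)]
    return c_f, c_n, c_fn


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for repeatability only
    train = sample_patterns([17.0, 1.7], 2.0, 20000, rng)   # noise power v
    valid = sample_patterns([17.0, 1.7], 10.0, 20000, rng)  # noise power v' > v
    print("training  :", diagonal_covariances(train))
    print("validation:", diagonal_covariances(valid))
```

With enough patterns, the sample signal variances should land near the $\lambda_i$, the sample cross terms near zero, and the two noise powers near $v$ and $v'$, which is the regime the analysis assumes.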

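2.2 LEARNING

The network is trained by gradient descent on the Least Mean Square (LMS) error: $\Delta W = -\eta \nabla_W E$, where $\eta$ is the usual learning rate and, in the case considered, $E = \sum_p (F_p - Y_p)^T (F_p - Y_p)$. Up to a constant factor that can be absorbed into the learning rate, the descent direction can be written as a function of the various covariance matrices:

$$-\nabla_W E \;\propto\; (I - W) C_F + (I - 2W) C_{FN} - W C_N$$

From the general assumptions, this reduces approximately to:

$$-\nabla_W E \;\propto\; C_F - W C_F - W C_N \qquad (1)$$

so that one training step reads $W_{k+1} = W_k + \eta (C_F - W_k C_F - W_k C_N)$.

We assume now that the principal components $e_i$ are also eigenvectors of the weight matrix $W$ at iteration $k$, with corresponding eigenvalues $\alpha_{ik}$: $W_k e_i = \alpha_{ik} e_i$. We can then compute the image of each eigenvector $e_i$ at iteration $k + 1$:

$$W_{k+1} e_i = \eta \lambda_i e_i + \alpha_{ik} [1 - \eta(\lambda_i + v)] e_i \qquad (2)$$

Therefore, $e_i$ is also an eigenvector of $W_{k+1}$ and $\alpha_{i,k+1}$ satisfies the induction:

$$\alpha_{i,k+1} = \eta \lambda_i + \alpha_{ik} [1 - \eta(\lambda_i + v)] \qquad (3)$$

Assuming $W_0 = 0$, we can compute the alpha-dynamics of the weight matrix $W$:

$$\alpha_{ik} = \frac{\lambda_i}{\lambda_i + v} \left[ 1 - (1 - \eta(\lambda_i + v))^k \right] \qquad (4)$$

As $k$ goes to infinity, provided $\eta < 1/(\lambda_M + v)$ (with $\lambda_M$ the largest eigenvalue), $\alpha_i$ approaches $\lambda_i/(\lambda_i + v)$, which corresponds to the optimal (Wiener) value of the linear filter implemented by the network. We will write the convergence rates $a_i = 1 - \eta \lambda_i - \eta v$. These rates depend on the signal "power spectrum", on the noise power and on the learning rate $\eta$.

If we now assume $W_0 e_i = \alpha_{i0} e_i$ with $\alpha_{i0} \neq 0$ (this assumption can be made more general), we get:

$$\alpha_{ik} = \frac{\lambda_i}{\lambda_i + v} \left[ 1 - b_i a_i^k \right] \qquad (5)$$

where $b_i = 1 - \alpha_{i0} - \alpha_{i0} v / \lambda_i$. Figure 1 represents possible alpha dynamics for arbitrary values of $\lambda_i$ with $\alpha_{i0} = \alpha_0 \neq 0$.

We can now compute the learning error dynamics by expanding the LMS error term $E$ at time $k$. Using the general assumptions on the covariance matrices, we find:

$$E_k = \sum_i E_{ik} = \sum_i \left[ \lambda_i (1 - \alpha_{ik})^2 + v \alpha_{ik}^2 \right] \qquad (6)$$

Therefore, training error is a sum of error components, each of them being a quadratic function of $\alpha_i$. Figure 2 represents a training error component $E_i$ as a function of $\alpha$. Knowing the alpha-dynamics, we can write these error components as a function of $k$:

$$E_{ik} = \frac{\lambda_i}{\lambda_i + v} \left( v + \lambda_i b_i^2 a_i^{2k} \right) \qquad (7)$$

It is easy to see that $E$ is a monotonic decreasing function (generated by gradient descent) which converges to the bottom of the quadratic error surface, yielding the residual asymptotic error:

$$E_\infty = \sum_i \frac{v \lambda_i}{\lambda_i + v} \qquad (8)$$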
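The induction (3), its closed forms (4) and (5), and the error components (6) and (7) can be cross-checked numerically for a single eigen-direction. The following is a minimal sketch, not the author's simulation code: the function names are hypothetical, and the initial value $\alpha_{i0} = 0.3$ is an arbitrary illustrative choice, while $\lambda_i = 17$, $v = 2$ and $\eta = 0.01$ are values that appear in the paper's figures and case study.

```python
def convergence_rate(lam, v, eta):
    """a_i = 1 - eta*lam_i - eta*v."""
    return 1.0 - eta * (lam + v)


def alpha_recursive(lam, v, eta, alpha0, k):
    """Iterate the induction (3) k times, starting from alpha0."""
    alpha = alpha0
    for _ in range(k):
        alpha = eta * lam + alpha * (1.0 - eta * (lam + v))
    return alpha


def alpha_closed(lam, v, eta, alpha0, k):
    """Closed form (5); it reduces to (4) when alpha0 == 0."""
    a = convergence_rate(lam, v, eta)
    b = 1.0 - alpha0 - alpha0 * v / lam
    return lam / (lam + v) * (1.0 - b * a ** k)


def training_error_quadratic(lam, v, alpha):
    """One component of (6), as a quadratic function of alpha."""
    return lam * (1.0 - alpha) ** 2 + v * alpha ** 2


def training_error_closed(lam, v, eta, alpha0, k):
    """The same component written as a function of k, equation (7)."""
    a = convergence_rate(lam, v, eta)
    b = 1.0 - alpha0 - alpha0 * v / lam
    return lam / (lam + v) * (v + lam * b ** 2 * a ** (2 * k))


if __name__ == "__main__":
    lam, v, eta, alpha0 = 17.0, 2.0, 0.01, 0.3  # alpha0 is illustrative
    for k in (0, 5, 20, 100):
        print(k,
              alpha_recursive(lam, v, eta, alpha0, k),
              alpha_closed(lam, v, eta, alpha0, k),
              training_error_closed(lam, v, eta, alpha0, k))
```

The recursive and closed-form values of $\alpha_{ik}$ should agree to rounding error, and the error component should decrease monotonically toward $v\lambda_i/(\lambda_i + v)$, the single-component version of (8).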

[Figure 1: plot of $\alpha_{ik}$ (vertical axis) against the number of cycles (horizontal axis, 0 to 100).]

Figure 1: Alpha dynamics for different values of $\lambda_i$ with $\eta = .01$ and $\alpha_{i0} = \alpha_0 \neq 0$. The solid lines represent the optimal values of $\alpha_i$ for the training data set. The dashed lines represent the corresponding optimal values for the validation data set.

[Figure 2: plot of the LMS error components (vertical axis) against $\alpha_{ik}$ (horizontal axis).]

Figure 2: Training and validation error dynamics as a function of $\alpha_i$. The dashed curved lines represent the error dynamics for the initial conditions $\alpha_{i0}$. Each training error component follows the gradient of a quadratic learning curve (bottom). Note the overtraining phenomenon (top curve) between $\alpha_i^*$ (optimal for validation) and $\alpha_{i\infty}$ (optimal for training).

2.3 GENERALIZATION

Considering the general assumptions on the statistics of the data base, we can compute the validation error $E'$. (Note that "validation error" strictly applies to the validation data set. "Generalization error" can qualify the validation data set or the whole population, depending on context.)

$$E'_k = \sum_i E'_{ik} = \sum_i \left[ \lambda_i (1 - \alpha_{ik})^2 + v' \alpha_{ik}^2 \right] \qquad (9)$$

where the alpha-dynamics are imposed by gradient descent learning on the training data set. Again, the validation error is a sum of error components $E'_i$, quadratic functions of $\alpha_i$. However, because the alpha-dynamics are adapted to the training sample, they might generate complex dynamics which will strongly depend on the initial values $\alpha_{i0}$ (Figure 1). Consequently, the resulting error components $E'_i$ are not monotonic decreasing functions anymore. As seen in Figure 2, each of the validation error components might (i) decrease, (ii) decrease then increase (overtraining), or (iii) increase, depending on $\alpha_{i0}$. For each of these components, in the case of overtraining, it is possible to compute the number of cycles $k_i^*$ at which training should be stopped to get minimal validation error, by setting $\alpha_{ik} = \lambda_i/(\lambda_i + v')$ in equation (5):

$$k_i^* = \frac{\mathrm{Log}\,\dfrac{v' - v}{\lambda_i + v'} + \mathrm{Log}\,\dfrac{\lambda_i}{\lambda_i - \alpha_{i0}(\lambda_i + v)}}{\mathrm{Log}(1 - \eta \lambda_i - \eta v)} \qquad (10)$$

However, the validation error dynamics become much more complex when we consider sums of these components. If we assume $\alpha_{i0} = 0$, the minimum (or minima) of $E'$ can be found to correspond to possible intersections of hyper-ellipsoids and power curves. In general, it is possible to show that there exists at least one such minimum. It is also possible to find simple bounds on the optimal training time for minimal validation error:

[Equation (11) is missing from the source transcription.]

These bounds are tight when the noise power is small compared to the signal "power spectrum".

For $\alpha_{i0} \neq 0$, a formal analysis of the validation error dynamics becomes intractable. Because some error components might increase while others decrease, it is possible to imagine multiple minima and maxima for the total validation error (see simulations below). Considering each component's dynamics, it is nonetheless possible to compute bounds within which $E'$ might vary during training:

$$\sum_i \frac{\lambda_i v'}{\lambda_i + v'} \;\leq\; E'_k \;\leq\; \sum_i \frac{\lambda_i (v^2 + v' \lambda_i)}{(\lambda_i + v)^2} \qquad (12)$$

Because of the "exponential" nature of training (Figure 1), it is possible to imagine that this "weaving" effect might still be observed after a long training period, when the training error itself has become stable. Furthermore, whereas the training error will qualitatively show the same dynamics, validation error will very much depend on $\alpha_{i0}$: for sufficiently large initial weights, validation dynamics might be very dependent on particular simulation "runs".
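The component-wise stopping time and the range (12) can be explored numerically. The sketch below is an illustration under assumptions rather than the original simulation: the function names are hypothetical, the learning rate $\eta = 0.01$ is borrowed from Figure 1 (the case study does not state one), and the stopping time is computed by solving $\alpha_{ik} = \lambda_i/(\lambda_i + v')$ in the closed form (5), which returns a real-valued $k$ or `None` when the component never reaches its validation optimum. The other values ($\lambda_1 = 17$, $\lambda_2 = 1.7$, $v = 2$, $v' = 10$) are those of Section 3.1.

```python
import math


def alpha_k(lam, v, eta, alpha0, k):
    """Closed-form alpha dynamics, equation (5)."""
    a = 1.0 - eta * (lam + v)
    b = 1.0 - alpha0 - alpha0 * v / lam
    return lam / (lam + v) * (1.0 - b * a ** k)


def validation_component(lam, v_val, alpha):
    """One component of the validation error, equation (9)."""
    return lam * (1.0 - alpha) ** 2 + v_val * alpha ** 2


def stopping_time(lam, v, v_val, eta, alpha0):
    """Real-valued k at which alpha_ik equals lam / (lam + v'), or None."""
    a = 1.0 - eta * (lam + v)
    if not 0.0 < a < 1.0:
        raise ValueError("eta must satisfy 0 < 1 - eta*(lam + v) < 1")
    b = 1.0 - alpha0 - alpha0 * v / lam
    if b == 0.0:
        return None  # alpha starts (and stays) at its training asymptote
    target = (v_val - v) / ((lam + v_val) * b)  # required value of a**k
    if not 0.0 < target < 1.0:
        return None  # no crossing for k > 0: no overtraining minimum
    return math.log(target) / math.log(a)


def total_validation_error(lams, v, v_val, eta, alpha0s, k):
    """Sum of the components of equation (9) at iteration k."""
    return sum(
        validation_component(lam, v_val, alpha_k(lam, v, eta, a0, k))
        for lam, a0 in zip(lams, alpha0s)
    )


def validation_bounds(lams, v, v_val):
    """Lower and upper expressions of equation (12)."""
    lower = sum(lam * v_val / (lam + v_val) for lam in lams)
    upper = sum(lam * (v ** 2 + v_val * lam) / (lam + v) ** 2 for lam in lams)
    return lower, upper


if __name__ == "__main__":
    lams, v, v_val, eta = [17.0, 1.7], 2.0, 10.0, 0.01
    print([stopping_time(lam, v, v_val, eta, 0.0) for lam in lams])
    print(validation_bounds(lams, v, v_val))
    print([total_validation_error(lams, v, v_val, eta, [0.0, 0.8], k)
           for k in (0, 10, 50, 200)])
```

In this sketch the lower expression of (12) holds at every iteration, since it is the sum of the component minima, whereas the upper expression is the asymptotic validation error and is only guaranteed once each component lies between its validation optimum and its training asymptote. Scanning the second initial value and plotting `total_validation_error` against `k` is one way to look for the weaving behaviour described above.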

[Figure 3: plot of the training and validation errors against the number of training cycles $k$.]

Figure 3: Training (bottom curves) and validation (top curves) error dynamics in a two-dimensional case for $\lambda_1 = 17$, $\lambda_2 = 1.7$, $v = 2$, $v' = 10$, $\alpha_{10} = 0$, as $\alpha_{20}$ varies from 0 to 1.6 (bottom-up) in .2 increments.

3 SIMULATIONS

3.1 CASE STUDY

Equations 7 and 9 were simulated for a two-dimensional case ($n = 2$) with $\lambda_1 = 17$, $\lambda_2 = 1.7$, $v = 2$, $v' = 10$ and $\alpha_{10} = 0$. The values of $\alpha_{20}$ determined the relative dominance of the two error components during training. Figure 3 represents training and validation dynamics as a function of $k$ for a range of values of $\alpha_{20}$. As shown analytically, training dynamics are basically unaffected by the initial conditions of the weight matrix $W_0$. However, a variety of validation dynamics can be observed as $\alpha_{20}$ varies from 0 to 1.6:

For $1.6 \geq \alpha_{20} \geq 1.4$, the validation error is monotonically decreasing and looks like a typical "gradient descent" training error.

For $1.2 \geq \alpha_{20} \geq 1.0$, each error component in turn imposes a descent rate: the validation error looks like two "connected descents".

For $.8 \geq \alpha_{20} \geq .6$, $E'_2$ is monotonically decreasing with a slow convergence rate, forcing the validation error to decrease long after $E'_1$ has become stable. This creates a minimum, followed by a maximum, followed by a minimum for $E'$.

Finally, for $.4 \geq \alpha_{20} \geq 0$, both error components have a single minimum during training and generate a single minimum for the total validation error $E'$.

3.2 PHONEMIC LABELING

One of the main predictions obtained from the analytical results and from the previous case study is that validation dynamics can demonstrate multiple local minima and maxima. To my knowledge, this phenomenon has not been described in the literature. However, the theory also predicts that the phenomenon will probably appear very late in training, well after the training error has become stable, which might explain the absence of such observations. The predictions were tested for a phonemic labeling task with spectrograms as input patterns and phonemes as output patterns. Various architectures were tested (direct connections or back-propagation networks with linear or non-linear hidden layers). Due to the limited length of this article, the complete simulations will be reported elsewhere. In all cases, as predicted, multiple minima/maxima were observed for the validation dynamics, provided the networks were trained way beyond usual training times. Furthermore, these generalization dynamics were very dependent on the initial weights (provided sufficient variance in the initial weight distribution).

4 DISCUSSION

It is sometimes assumed that optimal learning is obtained when validation error starts to increase during the course of training. Although for the theoretical study presented, the first minimum of $E'$ is probably always a global minimum, independently of $\alpha_{i0}$, simulations of the speech labeling task show that this is not always the case with more complex architectures: late validation minima can sometimes (albeit rarely) be deeper than the first "local" minimum. These observations, and a lack of theoretical understanding of statistical inference under limited data sets, raise the question of the significance of a validation data set. As a final comment, we are not really interested in minimal validation error ($E'$) but in minimal generalization error. Understanding the dynamics of the "population" error as a function of training and validation errors necessitates, at least, an evaluation of the sample statistics as a function of the number of training and validation patterns. This is beyond the scope of this paper.

Acknowledgements

Thanks to Pierre Baldi and Julie Holmes for their helpful comments.

References

Baum, E. B. & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151-160.

Chauvin, Y. (1990a). Dynamic behavior of constrained back-propagation networks. In D. S. Touretzky (Ed.), Neural Information Processing Systems (Vol. 2) (pp. 642-649). San Mateo, CA: Morgan Kaufmann.

Chauvin, Y. (1990b). Generalization performance of overtrained back-propagation networks. In L. B. Almeida & C. J. Wellekens (Eds.), Lecture Notes in Computer Science (Vol. 412) (pp. 46-55). Berlin, Germany: Springer-Verlag.

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In D. S. Touretzky (Ed.), Neural Information Processing Systems (Vol. 2) (pp. 396-404). San Mateo, CA: Morgan Kaufmann.

Waibel, A., Sawai, H., & Shikano, K. (1989). Modularity and scaling in large phonemic neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-37, 1888-1898.

Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (In Press). Predicting the future: a connectionist approach. International Journal of Neural Systems.