Large Scale Online Learning.




Léon Bottou, NEC Labs America, Princeton NJ 08540, leon@bottou.org
Yann Le Cun, NEC Labs America, Princeton NJ 08540, yann@lecun.com

Abstract

We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidence is presented.

1 Introduction

The last decade brought us tremendous improvements in the performance and price of mass storage devices and network systems. Storing and shipping audio or video data is now inexpensive. Network traffic itself provides new and abundant sources of data in the form of server log files. The availability of such large data sources provides clear opportunities for the machine learning community.

These technological improvements have outpaced the exponential evolution of the computing power of integrated circuits (Moore's law). This remark suggests that learning algorithms must process increasing amounts of data using comparatively smaller computing resources.

This work assumes that datasets have grown to practically infinite sizes and discusses which learning algorithms asymptotically provide the best generalization performance using limited computing resources.

Online algorithms operate by repetitively drawing a fresh random example and adjusting the parameters on the basis of this single example only. Online algorithms can quickly process a large number of examples. On the other hand, they usually are not able to fully optimize the cost function defined on these examples. Batch algorithms avoid this issue by completely optimizing the cost function defined on a set of training examples. On the other hand, such algorithms cannot process as many examples because they must iterate several times over the training set to achieve the optimum. As datasets grow to practically infinite sizes, we argue that online algorithms outperform learning algorithms that operate by repetitively sweeping over a training set.

2 Gradient Based Learning

Many learning algorithms optimize an empirical cost function $C_n(\theta)$ that can be expressed as the average of a large number of terms $L(z, \theta)$. Each term measures the cost associated with running a model with parameter vector $\theta$ on independent examples $z_i$ (typically input/output pairs $z_i = (x_i, y_i)$):

\[ C_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) \tag{1} \]

Two kinds of optimization procedures are often mentioned in connection with this problem:

Batch gradient: Parameter updates are performed on the basis of the gradient and Hessian information accumulated over a predefined training set:

\[ \theta(k) \;=\; \theta(k-1) - \Phi_k\, \frac{\partial C_n}{\partial\theta}\bigl(\theta(k-1)\bigr) \;=\; \theta(k-1) - \frac{1}{n}\, \Phi_k \sum_{i=1}^{n} \frac{\partial L}{\partial\theta}\bigl(z_i, \theta(k-1)\bigr) \tag{2} \]

where $\Phi_k$ is an appropriately chosen positive definite symmetric matrix.

Online gradient: Parameter updates are performed on the basis of a single sample $z_t$ picked randomly at each iteration:

\[ \theta(t) \;=\; \theta(t-1) - \frac{1}{t}\, \Phi_t\, \frac{\partial L}{\partial\theta}\bigl(z_t, \theta(t-1)\bigr) \tag{3} \]

where $\Phi_t$ is again an appropriately chosen positive definite symmetric matrix. Very often the examples $z_t$ are chosen by cycling over a randomly permuted training set. Each cycle is called an epoch. This paper however considers situations where the supply of training samples is practically unlimited. Each iteration of the online algorithm utilizes a fresh sample, unlikely to have been presented to the system before.

Simple batch algorithms converge linearly¹ to the optimum $\theta_n^*$ of the empirical cost. Careful choices of $\Phi_k$ make the convergence super-linear or even quadratic² in favorable cases (Dennis and Schnabel, 1983). Whereas online algorithms may converge to the general area of the optimum at least as fast as batch algorithms (Le Cun et al., 1998), the optimization proceeds rather slowly during the final convergence phase (Bottou and Murata, 2002). The noisy gradient estimate causes the parameter vector to fluctuate around the optimum in a bowl whose size decreases like $1/t$ at best. Online algorithms therefore seem hopelessly slow.

However, the above discussion compares the speed of convergence toward the minimum of the empirical cost $C_n$, whereas one should be much more interested in the convergence toward the minimum $\theta^*$ of the expected cost $C_\infty$, which measures the generalization performance:

\[ C_\infty(\theta) = \int L(z, \theta)\, p(z)\, dz \tag{4} \]

Density $p(z)$ represents the unknown distribution from which the examples are drawn (Vapnik, 1974). This is the fundamental difference between optimization speed and learning speed.

¹ Linear convergence speed: $\log 1/\|\theta(k) - \theta_n^*\|^2$ grows linearly with $k$.
² Quadratic convergence speed: $\log\log 1/\|\theta(k) - \theta_n^*\|^2$ grows linearly with $k$.
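To make updates (2) and (3) concrete, here is a minimal sketch in Python/NumPy. The per-example gradient `grad_L` and the scaling matrix `Phi` are hypothetical placeholders supplied by the caller; nothing here is prescribed by the paper beyond the update formulas themselves.

```python
import numpy as np

def batch_step(theta, examples, Phi, grad_L):
    """One batch update, cf. (2): average the gradient over the whole training set."""
    g = np.mean([grad_L(z, theta) for z in examples], axis=0)
    return theta - Phi @ g

def online_step(theta, z, Phi, grad_L, t):
    """One online update, cf. (3): a single fresh example and a 1/t gain."""
    return theta - (1.0 / t) * (Phi @ grad_L(z, theta))
```

The batch step revisits every stored example at each iteration, while the online step consumes one example and can then discard it; this difference drives the cost accounting of Section 4.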

3 Learning Speed

Running an efficient batch algorithm on a training set of size $n$ quickly yields the empirical optimum $\theta_n^*$. The sequence of empirical optima $\theta_n^*$ usually converges to the solution $\theta^*$ when the training set size $n$ increases.

In contrast, online algorithms randomly draw one example $z_t$ at each iteration. When these examples are drawn from a set of $n$ examples, the online algorithm minimizes the empirical error $C_n$. When these examples are drawn from the asymptotic distribution $p(z)$, it minimizes the expected cost $C_\infty$. Because the supply of training samples is practically unlimited, each iteration of the online algorithm utilizes a fresh example. These fresh examples follow the asymptotic distribution. The parameter vectors $\theta(t)$ thus directly converge to the optimum $\theta^*$ of the expected cost $C_\infty$.

The convergence speeds of the batch sequence $\theta_n^*$ and the online sequence $\theta(t)$ were first compared by Murata and Amari (1999). This section reports a similar result whose derivation uncovers a deeper relationship between these two sequences. This approach also provides a mathematically rigorous treatment (Bottou and Le Cun, 2003).

Let us first define the Hessian matrix $H$ and Fisher information matrix $G$:

\[ H = \mathbb{E}\left( \frac{\partial^2 L}{\partial\theta\,\partial\theta}(z, \theta^*) \right) \qquad G = \mathbb{E}\left( \left[\frac{\partial L}{\partial\theta}(z, \theta^*)\right] \left[\frac{\partial L}{\partial\theta}(z, \theta^*)\right]^{\mathsf T} \right) \]

Manipulating a Taylor expansion of the gradient of $C_n(\theta)$ in the vicinity of $\theta_{n-1}^*$ immediately provides the following recursive relation between $\theta_n^*$ and $\theta_{n-1}^*$:

\[ \theta_n^* = \theta_{n-1}^* - \frac{1}{n}\, \Psi_n\, \frac{\partial L}{\partial\theta}\bigl(z_n, \theta_{n-1}^*\bigr) + \mathcal{O}\!\left(\frac{1}{n^2}\right) \tag{5} \]

with

\[ \Psi_n = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 L}{\partial\theta\,\partial\theta}\bigl(z_i, \theta_{n-1}^*\bigr) \right)^{-1} \approx H^{-1} \]

Relation (5) describes the $\theta_n^*$ sequence as a recursive stochastic process that is essentially similar to the online learning algorithm (3). Each iteration of this algorithm consists in picking a fresh example $z_n$ and updating the parameters according to (5). This is not a practical algorithm because we have no analytical expression for the second order term. We can however apply the mathematics of online learning algorithms to this stochastic process.

The similarity between (5) and (3) suggests that both the batch and online sequences converge at the same speed for adequate choices of the scaling matrix $\Phi_t$. Under customary regularity conditions, the following asymptotic speed result holds when the scaling matrix $\Phi_t$ converges to the inverse $H^{-1}$ of the Hessian matrix:

\[ \mathbb{E}\left( \|\theta(t) - \theta^*\|^2 \right) + o\!\left(\frac{1}{t}\right) \;=\; \mathbb{E}\left( \|\theta_t^* - \theta^*\|^2 \right) + o\!\left(\frac{1}{t}\right) \;=\; \frac{\operatorname{tr}\!\left( H^{-1} G\, H^{-1} \right)}{t} \tag{6} \]

This convergence speed expression has been discovered many times. Tsypkin (1973) establishes (6) for linear systems. Murata and Amari (1999) address generic stochastic gradient algorithms with a constant scaling matrix. Our result (Bottou and Le Cun, 2003) holds when the scaling matrix $\Phi_t$ depends on the previously seen examples, and also holds when the stochastic update is perturbed by unspecified second order terms, as in equation (5). See the appendix for a proof sketch (Bottou and LeCun, 2003).

Result (6) applies to both the online $\theta(t)$ and batch $\theta_t^*$ sequences. Not only does it establish that both sequences have $\mathcal{O}(1/t)$ convergence, but it also provides the value of the constant. This constant is neither affected by the second order terms of (5) nor by the convergence speed of the scaling matrix $\Phi_t$ toward $H^{-1}$.
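To make the constant in (6) tangible, the sketch below estimates $H$ and $G$ from samples and evaluates $\operatorname{tr}(H^{-1} G H^{-1})$. The quadratic loss $L(z,\theta) = \frac{1}{2}(\theta - z)^{\mathsf T} A (\theta - z)$ and all numerical values are my own toy example, not the paper's model; for this loss the constant equals $\operatorname{tr}(\Sigma)$, which the script checks.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = np.diag(np.linspace(1.0, 5.0, d))       # curvature of the toy loss (its Hessian)
Sigma = np.diag(np.linspace(0.5, 2.5, d))   # covariance of the examples z ~ N(0, Sigma)

# Per-example gradients at the optimum theta* = 0 are -A z.
z = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)
grads = -(z @ A)

H = A                                       # exact Hessian of the toy loss
G = grads.T @ grads / len(grads)            # empirical E[g g^T] at theta*

H_inv = np.linalg.inv(H)
constant = np.trace(H_inv @ G @ H_inv)
print(constant, np.trace(Sigma))            # agree up to sampling noise (both ~ 7.5)
```

Dividing this constant by $t$ gives the asymptotic value of $\mathbb{E}\|\theta(t) - \theta^*\|^2$ predicted by (6) for both the batch and online sequences.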

In the Maximum Likelihood case, it is well known that $H$ and $G$ are equal at the optimum. Equation (6) then indicates that the convergence speed saturates the Cramer-Rao bound. This fact was known in the case of the natural gradient algorithm (Amari, 1998). It remains true for a large class of online learning algorithms.

Result (6) suggests that the scaling matrix $\Phi_t$ should be a full rank approximation of the Hessian $H$. Maintaining such an approximation becomes expensive when the dimension of the parameter vector increases. The computational cost of each iteration can be drastically reduced by maintaining only a coarse approximation of the Hessian (e.g. diagonal, block-diagonal, multiplicative, etc.). A proper setup ensures that the convergence speed remains $\mathcal{O}(1/t)$ despite a less favorable constant factor.

The similar nature of the convergence of the batch and online sequences can be summarized as follows. Consider two optimally designed batch and online learning algorithms. The best generalization error is asymptotically achieved by the learning algorithm that uses the most examples within the allowed time.

4 Computational Cost

The discussion so far has established that a properly designed online learning algorithm performs as well as any batch learning algorithm for the same number of examples. We now establish that, given the same computing resources, an online learning algorithm can asymptotically process more examples than a batch learning algorithm.

Each iteration of a batch learning algorithm running on $N$ training examples requires a time $K_1 N + K_2$. Constants $K_1$ and $K_2$ respectively represent the time required to process each example, and the time required to update the parameters. Result (6) provides the following asymptotic equivalence:

\[ \left( \theta_N^* - \theta^* \right)^2 \sim \frac{1}{N} \]

The batch algorithm must perform enough iterations to approximate $\theta_N^*$ with at least the same accuracy ($\sim 1/N$). An efficient algorithm with quadratic convergence achieves this after a number of iterations asymptotically proportional to $\log\log N$.

Running an online learning algorithm requires a constant time $K_3$ per processed example. Let us call $T$ the number of examples processed by the online learning algorithm using the same computing resources as the batch algorithm. We then have:

\[ K_3\, T \sim \left( K_1 N + K_2 \right) \log\log N \;\Longrightarrow\; T \sim N \log\log N \]

The parameter $\theta(T)$ of the online algorithm also converges according to (6). Comparing the accuracies of both algorithms shows that the online algorithm asymptotically provides a better solution by a factor $\mathcal{O}(\log\log N)$:

\[ \left( \theta(T) - \theta^* \right)^2 \sim \frac{1}{N \log\log N} \;\ll\; \frac{1}{N} \sim \left( \theta_N^* - \theta^* \right)^2 \]

This $\log\log N$ factor corresponds to the number of iterations required by the batch algorithm. This number increases slowly with the desired accuracy of the solution. In practice, this factor is much less significant than the actual value of the constants $K_1$, $K_2$ and $K_3$. Experience shows however that online algorithms are considerably easier to implement. Each iteration of the batch algorithm involves a large summation over all the available examples. Memory must be allocated to hold these examples. On the other hand, each iteration of the online algorithm only involves one random example which can then be discarded.
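The budget comparison can be spelled out with a quick back-of-the-envelope computation. The constants $K_1$, $K_2$, $K_3$ below are purely illustrative placeholders; the paper gives no numerical values.

```python
import math

# Hypothetical costs in seconds: per-example batch cost, per-update batch
# overhead, and per-example online cost. Chosen only for illustration.
K1, K2, K3 = 1e-6, 1e-3, 2e-6
N = 10_000_000                        # training set size handed to the batch algorithm

iters = math.log(math.log(N))         # batch iterations needed for ~1/N accuracy (up to a constant)
batch_time = (K1 * N + K2) * iters    # total batch training time
T = batch_time / K3                   # examples the online algorithm processes in the same time

print(f"log log N = {iters:.2f}, batch time = {batch_time:.1f} s, online examples T = {T:.2e}")
```

Because $\log\log N$ grows extremely slowly (about 2.8 even for $N = 10^7$), the asymptotic advantage matters less in practice than the constants and the fact that the online pass touches each example exactly once.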

5 Experiments

A simple validation experiment was carried out using synthetic data. The examples are input/output pairs $(x, y)$ with $x \in \mathbb{R}^{20}$ and $y = \pm 1$. The model is a single sigmoid unit trained using the least square criterion

\[ L(x, y, \theta) = \bigl( 1.5\,y - f(\theta x) \bigr)^2 \qquad \text{where } f(x) = 1.71 \tanh(0.66\,x) \]

is the standard sigmoid discussed in LeCun et al. (1998). The sigmoid generates various curvature conditions in the parameter space, including negative curvature and plateaus. This simple model represents well the final convergence phase of the learning process. Yet it is also very similar to the widely used generalized linear models (GLIM) (Chambers and Hastie, 1992).

The first component of the input $x$ is always 1 in order to compensate for the absence of a bias parameter in the model. The remaining 19 components are drawn from two Gaussian distributions, centered on $(-1, -1, \dots, -1)$ for the first class and $(+1, +1, \dots, +1)$ for the second class. The eigenvalues of the covariance matrix of each class range from 1 to 20.

Two separate sets for training and testing were drawn with 1,000,000 examples each. One hundred permutations of the first set are generated. Each learning algorithm is trained using various numbers of examples taken sequentially from the beginning of the permuted sets. The resulting performance is then measured on the testing set and averaged over the one hundred permutations.

Batch-Newton algorithm

The reference batch algorithm uses the Newton-Raphson algorithm with Gauss-Newton approximation (Le Cun et al., 1998). Each iteration visits all the training examples and computes both the gradient $g$ and the Gauss-Newton approximation $H$ of the Hessian matrix:

\[ g = \sum_i \frac{\partial L}{\partial\theta}(x_i, y_i, \theta_{k-1}) \qquad H = \sum_i f'\!\left(\theta_{k-1} x_i\right)^2 x_i x_i^{\mathsf T} \]

The parameters are then updated using Newton's formula:

\[ \theta_k = \theta_{k-1} - H^{-1} g \]

Iterations are repeated until the parameter vector moves by less than $0.01/N$ where $N$ is the number of training examples. This algorithm yields quadratic convergence speed.

Online-Kalman algorithm

The online algorithm performs a single sequential sweep over the training examples. The parameter vector is updated after processing each example $(x_t, y_t)$ as follows:

\[ \theta_t = \theta_{t-1} - \frac{1}{\tau_t}\, \Phi_t\, \frac{\partial L}{\partial\theta}(x_t, y_t, \theta_{t-1}) \]

The scalar $\tau_t = \max(20,\, t - 40)$ makes sure that the first few examples do not cause impractically large parameter updates. The scaling matrix $\Phi_t$ is equal to the inverse of a leaky average of the per-example Gauss-Newton approximation of the Hessian:

\[ \Phi_t = \left( \left(1 - \frac{2}{\tau_t}\right) \Phi_{t-1}^{-1} + \frac{2}{\tau_t}\, f'\!\left(\theta_{t-1} x_t\right)^2 x_t x_t^{\mathsf T} \right)^{-1} \]

The implementation avoids the matrix inversions by directly computing $\Phi_t$ from $\Phi_{t-1}$ using the matrix inversion lemma (see Bottou, 1998, for instance):

\[ \left( \alpha A^{-1} + \beta\, u u^{\mathsf T} \right)^{-1} = \frac{1}{\alpha} \left( A - \frac{(A u)(A u)^{\mathsf T}}{\alpha/\beta + u^{\mathsf T} A u} \right) \]
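For concreteness, here is a short sketch of the update loop just described. This is my own reading of the Online-Kalman procedure (the variable names, the update ordering, and the explicit gradient expression are mine); it follows the formulas above, with $\Phi_t$ maintained directly via the matrix inversion lemma.

```python
import numpy as np

def f(a):                       # standard sigmoid from LeCun et al. (1998)
    return 1.71 * np.tanh(0.66 * a)

def df(a):                      # its derivative
    return 1.71 * 0.66 / np.cosh(0.66 * a) ** 2

def online_kalman(X, Y, d=20):
    theta = np.zeros(d)
    Phi = np.eye(d)             # scaling matrix, stored directly as the inverse it represents
    for t, (x, y) in enumerate(zip(X, Y), start=1):
        tau = max(20, t - 40)
        a = theta @ x
        grad = -2.0 * (1.5 * y - f(a)) * df(a) * x   # dL/dtheta for L = (1.5 y - f(theta x))^2
        # Leaky average of the per-example Gauss-Newton term, folded into Phi
        # with the matrix inversion lemma so no explicit inversion is needed.
        alpha = 1.0 - 2.0 / tau
        beta = (2.0 / tau) * df(a) ** 2
        Pu = Phi @ x
        Phi = (Phi - np.outer(Pu, Pu) / (alpha / beta + x @ Pu)) / alpha
        theta = theta - (1.0 / tau) * (Phi @ grad)
    return theta
```

A single call performs exactly one sweep over the data, so its cost is the constant $K_3$ per example assumed in Section 4.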

The resulting algorithm slightly differs from the Adaptive Natural Gradient algorithm (Amari, Park, and Fukumizu, 1998). In particular, there is little need to adjust a learning rate parameter in the Gauss-Newton approach. The $1/t$ (or $1/\tau_t$) schedule is asymptotically optimal.

Figure 1: Average $(\theta - \theta^*)^2$ as a function of the number of examples. The gray line represents the theoretical prediction (6). Filled circles: batch. Hollow circles: online. The error bars indicate a 95% confidence interval.

Figure 2: Average $(\theta - \theta^*)^2$ as a function of the training time (milliseconds). Hollow circles: online. Filled circles: batch. The error bars indicate a 95% confidence interval.

Results

The optimal parameter vector $\theta^*$ was first computed on the testing set using the Batch-Newton approach. The matrices $H$ and $G$ were computed on the testing set as well in order to determine the constant in relation (6).

Figure 1 plots the average squared distance between the optimal parameter vector $\theta^*$ and the parameter vector $\theta$ achieved on training sets of various sizes. The gray line represents the theoretical prediction. Both the batch points and the online points join the theoretical prediction when the training set size increases. Figure 2 shows the same data points as a function of the CPU time required to run the algorithm on a standard PC. The online algorithm gradually becomes more efficient when the training set size increases. This happens because the batch algorithm needs to perform additional iterations in order to maintain the same level of accuracy.

In practice, the test set mean squared error (MSE) is usually more relevant than the accuracy of the parameter vector. Figure 3 displays a logarithmic plot of the difference between the MSE and the best achievable MSE, that is to say the MSE achieved by parameter vector $\theta^*$. This difference can be approximated as $(\theta - \theta^*)^{\mathsf T} H\, (\theta - \theta^*)$. Both algorithms yield virtually identical errors for the same training set size. This suggests that the small differences shown in figure 1 occur along the low curvature directions of the cost function. Figure 4 shows the MSE as a function of the CPU time. The online algorithm always provides higher accuracy in significantly less time.

As expected from the theoretical argument, the online algorithm asymptotically outperforms the super-linear Newton-Raphson algorithm³. More importantly, the online algorithm achieves this result by performing a single sweep over the training data. This is a very significant advantage when the data does not fit in central memory and must be sequentially accessed from a disk based database.

³ Generalized linear models are usually trained using the IRLS method (Chambers and Hastie, 1992), which is closely related to the Newton-Raphson algorithm and requires similar computational resources.

Figure 3: Average test MSE as a function of the number of examples (left). The vertical axis shows the logarithm of the difference between the error and the best error achievable on the testing set. Both curves are essentially superposed.

Figure 4: Average test MSE as a function of the training time (milliseconds). Hollow circles: online. Filled circles: batch. The gray line indicates the best mean squared error achievable on the test set.

6 Conclusion

Many popular algorithms do not scale well to large numbers of examples because they were designed with small data sets in mind. For instance, the training time for Support Vector Machines scales somewhere between $N^2$ and $N^3$, where $N$ is the number of examples. Our baseline super-linear batch algorithm learns in $N \log\log N$ time. We demonstrate that adequate online algorithms asymptotically achieve the same generalization performance in $N$ time after a single sweep on the training set.

The convergence of learning algorithms is usually described in terms of a search phase followed by a final convergence phase (Bottou and Murata, 2002). Solid empirical evidence (Le Cun et al., 1998) suggests that online algorithms outperform batch algorithms during the search phase. The present work provides both theoretical and experimental evidence that an adequate online algorithm outperforms any batch algorithm during the final convergence phase as well.

Appendix: Sketch of the convergence speed proof (this section has been added for the final version)

Lemma. Let $(u_t)$ be a sequence of positive reals verifying the following recurrence:

\[ u_t = \left( 1 - \frac{\alpha}{t} + o\!\left(\frac{1}{t}\right) \right) u_{t-1} + \frac{\beta}{t^2} + o\!\left(\frac{1}{t^2}\right) \tag{7} \]

The lemma states that $t\, u_t \to \frac{\beta}{\alpha - 1}$ when $\alpha > 1$ and $\beta > 0$. The proof is delicate because the result holds regardless of the unspecified low order terms of the recurrence. However, it is easy to illustrate this convergence with simple numerical simulations.
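As noted above, the lemma is easy to check numerically. A minimal simulation, with the low order terms set to zero and arbitrarily chosen $\alpha$ and $\beta$:

```python
# Illustrate the lemma: t * u_t should approach beta / (alpha - 1).
alpha, beta = 2.0, 5.0          # arbitrary values with alpha > 1 and beta > 0
u = 1.0                         # arbitrary positive starting value
T = 1_000_000
for t in range(1, T + 1):
    u = (1.0 - alpha / t) * u + beta / t**2
print(T * u, beta / (alpha - 1))   # the two numbers should be close
```

The case $\alpha = 2$ and $\beta = \operatorname{tr}(H^{-1} G H^{-1})$ is exactly the recurrence obtained for $\mathbb{E}\|\theta(t)\|^2$ in the derivation below, which is how the lemma yields result (6).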

Convergence speed. Consider the following recursive stochastic process:

\[ \theta(t) = \theta(t-1) - \frac{1}{t}\, \Phi_t\, \frac{\partial L}{\partial\theta}\bigl(z_t, \theta(t-1)\bigr) + \mathcal{O}\!\left(\frac{1}{t^2}\right) \tag{8} \]

Our discussion addresses the final convergence phase of this process. Therefore we assume that the parameters $\theta$ remain confined in a bounded domain $D$ where the cost function $C_\infty(\theta)$ is convex and has a single non degenerate minimum $\theta^* \in D$. We can assume $\theta^* = 0$ without loss of generality.

We write $\mathbb{E}_t(X)$ for the conditional expectation of $X$ given all that is known before time $t$, including the initial conditions $\theta(0)$ and the selected examples $z_1, \dots, z_{t-1}$. We initially assume also that $\Phi_t$ is a function of $z_1, \dots, z_{t-1}$ only.

Using (8), we write $\mathbb{E}_t\bigl(\theta(t)\,\theta(t)^{\mathsf T}\bigr)$ as a function of $\theta(t-1)$. Then we simplify⁵ and take the trace:

\[ \mathbb{E}_t\left( \|\theta(t)\|^2 \right) = \|\theta(t-1)\|^2 - \frac{2}{t}\, \|\theta(t-1)\|^2 + o\!\left(\frac{\|\theta(t-1)\|^2}{t}\right) + \frac{1}{t^2}\, \operatorname{tr}\!\left( H^{-1} G H^{-1} \right) + o\!\left(\frac{1}{t^2}\right) \]

Taking the unconditional expectation yields a recurrence similar to (7). We then apply the lemma and conclude that $t\, \mathbb{E}\left( \|\theta(t)\|^2 \right) \to \operatorname{tr}\!\left( H^{-1} G H^{-1} \right)$.

⁵ Recall $\mathbb{E}_t\!\left( \Phi_t \frac{\partial L}{\partial\theta}(z_t, \theta) \right) = \Phi_t \frac{\partial C_\infty}{\partial\theta}(\theta) = \Phi_t H \theta + o(\theta) = \theta + o(\theta)$.

Remark 1: The notation $o(X_t)$ is quite ambiguous when dealing with stochastic processes. There are many possible flavors of convergence, including uniform convergence, almost sure convergence, convergence in probability, etc. Furthermore, it is not true in general that $\mathbb{E}(o(X_t)) = o(\mathbb{E}(X_t))$. The complete proof precisely defines the meaning of these notations and carefully checks their properties.

Remark 2: The proof sketch assumes that $\Phi_t$ is a function of $z_1, \dots, z_{t-1}$ only. In (5), $\Psi_n$ also depends on $z_n$. The result still holds because the contribution of $z_n$ vanishes quickly when $n$ grows large.

Remark 3: The same $1/t$ behavior holds when $\Phi_t \to \Phi$ and when $\Phi$ is greater than $\frac{1}{2} H^{-1}$ in the semi definite sense. The constant however is worse by a factor roughly equal to $H\Phi$.

Acknowledgments

The authors acknowledge extensive discussions with Yoshua Bengio, Sami Bengio, Ronan Collobert, Noboru Murata, Kenji Fukumizu, Susanna Still, and Barak Pearlmutter.

References

Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251–276.

Amari, S., Park, H., and Fukumizu, K. (1998). Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons. Neural Computation, 12(6):1399–1409.

Bottou, L. (1998). Online Algorithms and Stochastic Approximations, 9–42. In Saad, D., editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK.

Bottou, L. and Murata, N. (2002). Stochastic Approximations and Efficient Learning. In Arbib, M. A., editor, The Handbook of Brain Theory and Neural Networks, Second edition. The MIT Press, Cambridge, MA.

Bottou, L. and Le Cun, Y. (2003). Online Learning for Very Large Datasets. NEC Labs TR-2003-L039. To appear: Applied Stochastic Models in Business and Industry. Wiley.

Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Chapman & Hall, London.

Dennis, J. and Schnabel, R. B. (1983). Numerical Methods For Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

Le Cun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. (1998). Efficient Backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science 1524. Springer Verlag.

Murata, N. and Amari, S. (1999). Statistical analysis of learning dynamics. Signal Processing, 74(1):3–28.

Tsypkin, Ya. (1973). Foundations of the theory of learning systems. Academic Press.

Vapnik, V. N. and Chervonenkis, A. (1974). Theory of Pattern Recognition (in Russian). Nauka.