Learning using Large Datasets

Léon Bottou (a) and Olivier Bousquet (b)

(a) NEC Laboratories America, Princeton, NJ 08540, USA
(b) Google Zürich, 8002 Zürich, Switzerland

Abstract. This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways. For instance, a mediocre optimization algorithm, stochastic gradient descent, is shown to perform very well on large-scale learning problems.

Keywords. Large-scale learning. Optimization. Statistics.

Introduction

The computational complexity of learning algorithms has seldom been taken into account by learning theory. Valiant [1] states that a problem is learnable when there exists a probably approximately correct learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., [2,3,4]), very little has been said about the complexity side of this proposal (e.g., [5]).

Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two important examples come to mind:

- Data mining exists because competitive advantages can be achieved by analyzing the masses of data that describe the life of our computerized society. Since virtually every computer generates data, the data volume is proportional to the available computing power. Therefore one needs learning algorithms that scale roughly linearly with the total volume of data.

- Artificial intelligence attempts to emulate the cognitive capabilities of human beings. Our biological brains can learn quite efficiently from the continuous streams of perceptual data generated by our six senses, using limited amounts of sugar as a source of power. This observation suggests that there are learning algorithms whose computing time requirements scale roughly linearly with the total volume of data.

This contribution finds its source in the idea that approximate optimization algorithms might be sufficient for learning purposes. The first part proposes a new decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well known tradeoff between approximation error and estimation error.

In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. The second part explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate for the statistical estimation error. The final part presents some experimental results.

1. Approximate Optimization

Following [6,2], we consider a space of input-output pairs (x, y) ∈ X × Y endowed with a probability distribution P(x, y). The conditional distribution P(y|x) represents the unknown relationship between inputs and outputs. The discrepancy between the predicted output ŷ and the real output y is measured with a loss function ℓ(ŷ, y). Our benchmark is the function f* that minimizes the expected risk

    E(f) = \int \ell(f(x), y) \, dP(x, y) = E[\ell(f(x), y)],

that is,

    f^*(x) = \arg\min_{\hat y} E[\ell(\hat y, y) \mid x].

Although the distribution P(x, y) is unknown, we are given a sample S of n independently drawn training examples (x_i, y_i), i = 1 ... n. We define the empirical risk

    E_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i) = E_n[\ell(f(x), y)].

Our first learning principle consists in choosing a family F of candidate prediction functions and finding the function f_n = arg min_{f ∈ F} E_n(f) that minimizes the empirical risk. Well known combinatorial results (e.g., [2]) support this approach provided that the chosen family F is sufficiently restrictive. Since the optimal function f* is unlikely to belong to the family F, we also define f_F = arg min_{f ∈ F} E(f). For simplicity, we assume that f*, f_F and f_n are well defined and unique.

We can then decompose the excess error as

    E[E(f_n) - E(f^*)] = E[E(f_F) - E(f^*)] + E[E(f_n) - E(f_F)] = E_app + E_est,        (1)

where the expectation is taken with respect to the random choice of the training set. The approximation error E_app measures how closely functions in F can approximate the optimal solution f*. The estimation error E_est measures the effect of minimizing the empirical risk E_n(f) instead of the expected risk E(f). The estimation error is determined by the number of training examples and by the capacity of the family of functions [2]. Large families of functions have smaller approximation errors but lead to higher estimation errors. (We often consider nested families of functions of the form F_c = {f ∈ H, Ω(f) ≤ c}. Then, for each value of c, the function f_n is obtained by minimizing the regularized empirical risk E_n(f) + λΩ(f) for a suitable choice of the Lagrange coefficient λ. We can then control the estimation-approximation tradeoff by choosing λ instead of c.)
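To make these definitions concrete, here is a minimal sketch (not from the paper; the data distribution, the linear family F = {f_w(x) = w·x}, and the squared loss are illustrative assumptions) that evaluates the empirical risk E_n(f_w) and picks f_n by empirical risk minimization:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sample S of n i.i.d. pairs (x_i, y_i); the underlying relationship is made up.
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def empirical_risk(w):
        # E_n(f_w) with squared loss l(yhat, y) = (yhat - y)^2 and f_w(x) = w.x
        return np.mean((X @ w - y) ** 2)

    # f_n = argmin over the linear family of the empirical risk:
    # with squared loss this is ordinary least squares.
    w_n, *_ = np.linalg.lstsq(X, y, rcond=None)

    print("E_n(f_n) =", empirical_risk(w_n))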

This tradeoff has been extensively discussed in the literature [2,3] and leads to excess errors that scale between the inverse and the inverse square root of the number of examples [7,8].

1.1. Optimization Error

Finding f_n by minimizing the empirical risk E_n(f) is often a computationally expensive operation. Since the empirical risk E_n(f) is already an approximation of the expected risk E(f), it should not be necessary to carry out this minimization with great accuracy. For instance, we could stop an iterative optimization algorithm long before its convergence. Let us assume that our minimization algorithm returns an approximate solution f̃_n that minimizes the objective function up to a predefined tolerance ρ ≥ 0:

    E_n(\tilde{f}_n) \le E_n(f_n) + \rho.

We can then decompose the excess error E = E[E(f̃_n) - E(f*)] as

    E = E[E(f_F) - E(f^*)] + E[E(f_n) - E(f_F)] + E[E(\tilde{f}_n) - E(f_n)] = E_app + E_est + E_opt.        (2)

We call the additional term E_opt the optimization error. It reflects the impact of the approximate optimization on the generalization performance. Its magnitude is comparable to ρ (see section 2.1).

1.2. The Approximation-Estimation-Optimization Tradeoff

This decomposition leads to a more complicated compromise. It involves three variables and two constraints. The constraints are the maximal number of available training examples and the maximal computation time. The variables are the size of the family of functions F, the optimization accuracy ρ, and the number of examples n. This is formalized by the following optimization problem:

    \min_{F, \rho, n} \; E = E_app + E_est + E_opt
    \quad \text{subject to} \quad n \le n_max \;\; \text{and} \;\; T(F, \rho, n) \le T_max.        (3)

The number n of training examples is a variable because we could choose to use only a subset of the available training examples in order to complete the optimization within the allotted time. This happens often in practice. Table 1 summarizes the typical evolution of the quantities of interest when the three variables F, n, and ρ increase.

The solution of the optimization program (3) depends critically on which budget constraint is active: the constraint n ≤ n_max on the number of examples, or the constraint T ≤ T_max on the training time.
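As an illustration of the stopping rule E_n(f̃_n) ≤ E_n(f_n) + ρ, the sketch below (again with made-up data; in practice the exact minimizer is unknown and one would monitor a gradient norm or a duality gap instead) runs plain gradient descent on a squared-loss empirical risk and stops as soon as the objective is within a chosen tolerance ρ of the exact minimum:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    C = lambda w: np.mean((X @ w - y) ** 2)          # empirical risk E_n(f_w)
    grad = lambda w: 2.0 * X.T @ (X @ w - y) / n     # its gradient

    w_n, *_ = np.linalg.lstsq(X, y, rcond=None)      # exact minimizer f_n, for reference only
    rho = 1e-3                                       # predefined optimization tolerance

    w, eta, iters = np.zeros(d), 0.1, 0
    while C(w) > C(w_n) + rho:                       # stop as soon as E_n(w) <= E_n(f_n) + rho
        w -= eta * grad(w)
        iters += 1

    print("reached tolerance rho =", rho, "after", iters, "iterations")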

Table 1. Typical variations when F, n, and ρ increase.

                                       F increases    n increases    ρ increases
    E_app (approximation error)        decreases
    E_est (estimation error)           increases      decreases
    E_opt (optimization error)                                       increases
    T     (computation time)           increases      increases      decreases

We speak of a small-scale learning problem when (3) is constrained by the maximal number of examples n_max. Since the computing time is not limited, we can reduce the optimization error E_opt to insignificant levels by choosing ρ arbitrarily small. The excess error is then dominated by the approximation and estimation errors, E_app and E_est. Taking n = n_max, we recover the approximation-estimation tradeoff that is the object of abundant literature.

We speak of a large-scale learning problem when (3) is constrained by the maximal computing time T_max. Approximate optimization, that is choosing ρ > 0, can possibly achieve better generalization because more training examples can be processed during the allowed time. The specifics depend on the computational properties of the chosen optimization algorithm through the expression of the computing time T(F, ρ, n).

2. The Asymptotics of Large-scale Learning

In the previous section, we have extended the classical approximation-estimation tradeoff by taking into account the optimization error. We have given an objective criterion to distinguish small-scale and large-scale learning problems. In the small-scale case, we recover the classical tradeoff between approximation and estimation. The large-scale case is substantially different because it involves the computational complexity of the learning algorithm. In order to clarify the large-scale learning tradeoff with sufficient generality, this section makes several simplifications:

- We are studying upper bounds of the approximation, estimation, and optimization errors (2). It is often accepted that these upper bounds give a realistic idea of the actual convergence rates [9,10,11,12]. Another way to find comfort in this approach is to say that we study guaranteed convergence rates instead of the possibly pathological special cases.

- We are studying the asymptotic properties of the tradeoff when the problem size increases. Instead of carefully balancing the three terms, we write E = O(E_app) + O(E_est) + O(E_opt) and only need to ensure that the three terms decrease at the same asymptotic rate.

- We are considering a fixed family of functions F and therefore avoid taking into account the approximation error E_app. This part of the tradeoff covers a wide spectrum of practical realities, such as choosing models and choosing features. In the context of this work, we do not believe we can meaningfully address this without discussing, for instance, the thorny issue of feature selection. Instead we focus on the choice of optimization algorithm.

- Finally, in order to keep this paper short, we consider that the family of functions F is linearly parameterized by a vector w ∈ R^d.

We also assume that x, y and w are bounded, ensuring that there is a constant B such that 0 ≤ ℓ(f_w(x), y) ≤ B and ℓ(·, y) is Lipschitz.

We first explain how the uniform convergence bounds provide convergence rates that take the optimization error into account. Then we discuss and compare the asymptotic learning properties of several optimization algorithms.

2.1. Convergence of the Estimation and Optimization Errors

The optimization error E_opt depends directly on the optimization accuracy ρ. However, the accuracy ρ involves the empirical quantity E_n(f̃_n) - E_n(f_n), whereas the optimization error E_opt involves its expected counterpart E(f̃_n) - E(f_n). This section discusses how the optimization accuracy ρ enters generalization bounds that leverage the uniform convergence concepts pioneered by Vapnik and Chervonenkis (e.g., [2]). In this discussion, we use the letter c to refer to any positive constant. Multiple occurrences of the letter c do not necessarily imply that the constants have identical values.

2.1.1. Simple Uniform Convergence Bounds

Recall that we assume that F is linearly parameterized by w ∈ R^d. Elementary uniform convergence results then state that

    E\Big[ \sup_{f \in F} |E(f) - E_n(f)| \Big] \le c \sqrt{\frac{d}{n}},

where the expectation is taken with respect to the random choice of the training set. (Although the original Vapnik-Chervonenkis bounds have the form c \sqrt{\frac{d}{n} \log\frac{n}{d}}, the logarithmic term can be eliminated using the chaining technique; e.g., [10].) This result immediately provides a bound on the estimation error:

    E_est = E\big[ (E(f_n) - E_n(f_n)) + (E_n(f_n) - E_n(f_F)) + (E_n(f_F) - E(f_F)) \big]
          \le 2 \, E\Big[ \sup_{f \in F} |E(f) - E_n(f)| \Big] \le c \sqrt{\frac{d}{n}}.

This same result also provides a combined bound for the estimation and optimization errors:

    E_est + E_opt = E[E(\tilde{f}_n) - E_n(\tilde{f}_n)] + E[E_n(\tilde{f}_n) - E_n(f_n)]
                    + E[E_n(f_n) - E_n(f_F)] + E[E_n(f_F) - E(f_F)]
                  \le c \sqrt{\frac{d}{n}} + \rho + 0 + c \sqrt{\frac{d}{n}}
                  = c \left( \rho + \sqrt{\frac{d}{n}} \right).

Unfortunately, this convergence rate is known to be pessimistic in many important cases. More sophisticated bounds are required.

2.1.2. Faster Rates in the Realizable Case

When the loss function ℓ(ŷ, y) is positive, with probability 1 - e^{-τ} for any τ > 0, relative uniform convergence bounds state that

    \sup_{f \in F} \frac{E(f) - E_n(f)}{\sqrt{E(f)}} \le c \sqrt{\frac{d}{n} \log\frac{n}{d} + \frac{\tau}{n}}.

This result is very useful because it provides faster convergence rates, O(log n/n), in the realizable case, that is, when ℓ(f_n(x_i), y_i) = 0 for all training examples (x_i, y_i). We then have E_n(f_n) = 0 and E_n(f̃_n) ≤ ρ, and we can write

    E(\tilde{f}_n) - \rho \le c \sqrt{E(\tilde{f}_n)} \sqrt{\frac{d}{n} \log\frac{n}{d} + \frac{\tau}{n}}.

Viewing this as a second degree polynomial inequality in the variable \sqrt{E(\tilde{f}_n)}, and bounding that variable by the larger root of the polynomial, we obtain

    E(\tilde{f}_n) \le c \left( \rho + \frac{d}{n} \log\frac{n}{d} + \frac{\tau}{n} \right).

Integrating this inequality over τ using a standard technique (see, e.g., [13]), we obtain a better convergence rate for the combined estimation and optimization error:

    E_est + E_opt = E[E(\tilde{f}_n) - E(f_F)] \le E[E(\tilde{f}_n)] = c \left( \rho + \frac{d}{n} \log\frac{n}{d} \right).

2.1.3. Fast Rate Bounds

Many authors (e.g., [10,4,12]) obtain fast statistical estimation rates in more general conditions. These bounds have the general form

    E_app + E_est \le c \left( E_app + \left( \frac{d}{n} \log\frac{n}{d} \right)^{\alpha} \right) \quad \text{for} \quad \frac{1}{2} \le \alpha \le 1.        (4)

This result holds when one can establish the following variance condition:

    \forall f \in F, \quad E\Big[ \big( \ell(f(X), Y) - \ell(f_F(X), Y) \big)^2 \Big] \le c \left( E(f) - E(f_F) \right)^{2 - \frac{1}{\alpha}}.        (5)

The convergence rate of (4) is described by the exponent α, which is determined by the quality of the variance bound (5). Works on fast statistical estimation identify two main ways to establish such a variance condition:

- Exploiting the strict convexity of certain loss functions [12, theorem 12]. For instance, Lee et al. [14] establish a O(log n/n) rate using the square loss ℓ(ŷ, y) = (ŷ - y)².

- Making assumptions on the data distribution. In the case of pattern recognition problems, for instance, the Tsybakov condition indicates how cleanly the posterior distributions P(y|x) cross near the optimal decision boundary [11,12]. The realizable case discussed in section 2.1.2 can be viewed as an extreme case of this.

Despite their much greater complexity, fast rate estimation results can accommodate the optimization accuracy ρ using essentially the methods illustrated in sections 2.1.1 and 2.1.2. We then obtain a bound of the form

    E = E_app + E_est + E_opt = E[E(\tilde{f}_n) - E(f^*)] \le c \left( E_app + \left( \frac{d}{n} \log\frac{n}{d} \right)^{\alpha} + \rho \right).        (6)

For instance, a general result with α = 1 is provided by Massart [13, theorem 4.2]. Combining this result with standard bounds on the complexity of classes of linear functions (e.g., [10]) yields the following result:

    E = E_app + E_est + E_opt = E[E(\tilde{f}_n) - E(f^*)] \le c \left( E_app + \frac{d}{n} \log\frac{n}{d} + \rho \right).        (7)

See also [15,4] for more bounds taking into account the optimization accuracy.

2.2. Gradient Optimization Algorithms

We now discuss and compare the asymptotic learning properties of several gradient optimization algorithms. Recall that the family of functions F is linearly parameterized by w ∈ R^d. Let w_F and w_n correspond to the functions f_F and f_n defined in section 1. In this section, we assume that the functions w ↦ ℓ(f_w(x), y) are convex and twice differentiable with continuous second derivatives. Convexity ensures that the empirical cost function C(w) = E_n(f_w) has a single minimum.

Two matrices play an important role in the analysis: the Hessian matrix H and the gradient covariance matrix G, both measured at the empirical optimum w_n:

    H = \frac{\partial^2 C}{\partial w^2}(w_n) = E_n\left[ \frac{\partial^2 \ell(f_{w_n}(x), y)}{\partial w^2} \right],        (8)

    G = E_n\left[ \left( \frac{\partial \ell(f_{w_n}(x), y)}{\partial w} \right) \left( \frac{\partial \ell(f_{w_n}(x), y)}{\partial w} \right)^{\top} \right].        (9)

The relation between these two matrices depends on the chosen loss function. In order to summarize them, we assume that there are constants λ_max ≥ λ_min > 0 and ν > 0 such that, for any η > 0, we can choose the number of examples n large enough to ensure that the following assertion is true with probability greater than 1 - η:

    tr(G H^{-1}) \le \nu \quad \text{and} \quad \text{EigenSpectrum}(H) \subset [\lambda_min, \lambda_max].        (10)

The condition number κ = λ_max / λ_min is a good indicator of the difficulty of the optimization [16]. The condition λ_min > 0 avoids complications with stochastic gradient algorithms. Note that this condition only implies strict convexity around the optimum. For instance, consider the loss function ℓ obtained by smoothing the well known hinge loss ℓ(z, y) = max{0, 1 - yz} in a small neighborhood of its non-differentiable points. The function C(w) is then piecewise linear with smoothed edges and vertices. It is not strictly convex. However its minimum is likely to be on a smoothed vertex with a non-singular Hessian. When we have strict convexity, the argument of [12, theorem 12] yields fast estimation rates α ≈ 1 in (4) and (6). This is not necessarily the case here.

The algorithms considered in this paper use information about the gradient of the cost function to iteratively update their current estimate w(t) of the parameter vector.
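Given a concrete loss and model, both matrices are straightforward to estimate. The sketch below uses an illustrative setup (logistic loss, synthetic data, a small ridge term so that H is non-singular, and scipy for the inner minimization, none of which come from the paper) to compute H and G at the empirical optimum and report tr(G H^{-1}) and the eigenvalue range of H, rough stand-ins for the quantities ν, λ_min and λ_max appearing in (10):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit

    rng = np.random.default_rng(0)
    n, d = 2000, 5
    X = rng.normal(size=(n, d))
    y = np.sign(X @ rng.normal(size=d) + 0.5 * rng.normal(size=n))

    def C(w):
        # empirical cost: average logistic loss plus a small ridge (an added assumption
        # that keeps the Hessian non-singular in this toy setting)
        z = y * (X @ w)
        return np.mean(np.logaddexp(0.0, -z)) + 1e-4 * w @ w

    w_n = minimize(C, np.zeros(d), method="L-BFGS-B").x     # empirical optimum w_n

    # Per-example loss gradients at w_n (logistic loss: dl/dw = -sigma(-z) * y * x).
    z = y * (X @ w_n)
    s = expit(-z)
    grads = -(s * y)[:, None] * X

    G = grads.T @ grads / n                                  # gradient covariance matrix (9)
    H = np.einsum('n,ni,nj->ij', s * (1 - s), X, X) / n + 2e-4 * np.eye(d)   # Hessian (8)

    eigs = np.linalg.eigvalsh(H)
    print("tr(G H^-1) =", np.trace(G @ np.linalg.inv(H)))
    print("eigenvalues of H in [%.4f, %.4f]" % (eigs.min(), eigs.max()))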

Gradient Descent (GD) iterates

    w(t+1) = w(t) - \eta \frac{\partial C}{\partial w}(w(t)) = w(t) - \eta \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell\big( f_{w(t)}(x_i), y_i \big),

where η > 0 is a small enough gain. GD is an algorithm with linear convergence [16]. When η = 1/λ_max, this algorithm requires O(κ log(1/ρ)) iterations to reach accuracy ρ. The exact number of iterations depends on the choice of the initial parameter vector.

Second Order Gradient Descent (2GD) iterates

    w(t+1) = w(t) - H^{-1} \frac{\partial C}{\partial w}(w(t)) = w(t) - \frac{1}{n} H^{-1} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell\big( f_{w(t)}(x_i), y_i \big),

where the matrix H^{-1} is the inverse of the Hessian matrix (8). This is more favorable than Newton's algorithm because we do not evaluate the local Hessian at each iteration but simply assume that we know in advance the Hessian at the optimum. 2GD is a superlinear optimization algorithm with quadratic convergence [16]. When the cost is quadratic, a single iteration is sufficient. In the general case, O(log log(1/ρ)) iterations are required to reach accuracy ρ.

Stochastic Gradient Descent (SGD) picks a random training example (x_t, y_t) at each iteration and updates the parameter w on the basis of this example only:

    w(t+1) = w(t) - \frac{\eta}{t} \frac{\partial}{\partial w} \ell\big( f_{w(t)}(x_t), y_t \big).

Murata [17, section 2.2] characterizes the mean E_S[w(t)] and variance Var_S[w(t)] with respect to the distribution implied by the random examples drawn from the training set S at each iteration. Applying this result to the discrete training set distribution for η = 1/λ_min, we have ||δw(t)||² = O(1/t), where δw(t) is a shorthand notation for w(t) - w_n. We can then write

    E_S[C(w(t)) - \inf C] = E_S[ tr( H \, \delta w(t) \, \delta w(t)^{\top} ) ] + o(1/t)
                          = tr( H \, E_S[\delta w(t)] \, E_S[\delta w(t)]^{\top} + H \, Var_S[w(t)] ) + o(1/t)
                          \le \frac{tr(G H)}{t} + o(1/t) \le \frac{\nu \kappa^2}{t} + o(1/t).        (11)

Therefore the SGD algorithm reaches accuracy ρ after less than νκ²/ρ + o(1/ρ) iterations on average. The SGD convergence is essentially limited by the stochastic noise induced by the random choice of one example at each iteration. Neither the initial value of the parameter vector w nor the total number of examples n appear in the dominant term of this bound! When the training set is large, one could reach the desired accuracy ρ measured on the whole training set without even visiting all the training examples. This is in fact a kind of generalization bound.
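The shape of these update rules is easy to see in code. The sketch below contrasts the GD and SGD updates on the same synthetic least-squares objective; the data, gains and iteration counts are illustrative choices, not a reproduction of the paper's analysis (in particular, the offset t0 in the SGD gain is borrowed from the experiments of section 3 to keep the early steps stable):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 5000, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    C = lambda w: 0.5 * np.mean((X @ w - y) ** 2)        # empirical cost C(w) = E_n(f_w)
    lam_max = np.linalg.eigvalsh(X.T @ X / n).max()      # largest Hessian eigenvalue

    # Gradient Descent: w(t+1) = w(t) - eta * dC/dw(w(t)), one full pass per update.
    w_gd, eta = np.zeros(d), 1.0 / lam_max
    for t in range(100):
        w_gd -= eta * X.T @ (X @ w_gd - y) / n

    # Stochastic Gradient Descent: one randomly picked example per update.
    w_sgd, t0 = np.zeros(d), 10.0                        # offset t0 keeps early gains moderate
    for t in range(1, n + 1):
        i = rng.integers(n)
        eta_t = 1.0 / (t + t0)                           # decreasing gain of order 1/t
        w_sgd -= eta_t * (X[i] @ w_sgd - y[i]) * X[i]

    print("C(w) after 100 GD passes (100n gradients):", C(w_gd))
    print("C(w) after n SGD updates (n gradients)   :", C(w_sgd))

On this kind of problem a single SGD sweep already brings C(w) close to its minimum at a cost of O(d) per update, which is the behavior the asymptotic analysis above formalizes.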

Table 2. Asymptotic results for gradient algorithms (with probability 1). Compare the second last column (time to optimize) with the last column (time to reach the excess test error ε). Legend: n, number of examples; d, parameter dimension; κ, ν, see equation (10).

    Algorithm   Cost of one    Iterations         Time to reach               Time to reach
                iteration      to reach ρ         accuracy ρ                  E <= c(E_app + ε)
    GD          O(nd)          O(κ log(1/ρ))      O(ndκ log(1/ρ))             O(d²κ / ε^{1/α} log²(1/ε))
    2GD         O(d² + nd)     O(log log(1/ρ))    O((d² + nd) log log(1/ρ))   O(d² / ε^{1/α} log(1/ε) log log(1/ε))
    SGD         O(d)           νκ²/ρ + o(1/ρ)     O(dνκ²/ρ)                   O(dνκ²/ε)

The first three columns of Table 2 report, for each algorithm, the time for a single iteration, the number of iterations needed to reach a predefined accuracy ρ, and their product, the time needed to reach accuracy ρ. These asymptotic results are valid with probability 1, since the probability of their complement is smaller than η for any η > 0. The fourth column bounds the time necessary to reduce the excess error E below c(E_app + ε), where c is the constant from (6). This is computed by observing that choosing ρ ~ ((d/n) log(n/d))^α in (6) achieves the fastest rate for ε, with minimal computation time. We can then use the asymptotic equivalences ρ ~ ε and n ~ (d/ε^{1/α}) log(1/ε). Setting the fourth column expressions to T_max and solving for ε yields the best excess error achieved by each algorithm within the limited time T_max. This provides the asymptotic solution of the estimation-optimization tradeoff (3) for large-scale problems satisfying our assumptions.

These results clearly show that the generalization performance of large-scale learning systems depends on both the statistical properties of the estimation procedure and the computational properties of the chosen optimization algorithm. Their combination leads to surprising consequences:

- The SGD result does not depend on the estimation rate α. When the estimation rate is poor, there is less need to optimize accurately. That leaves time to process more examples. A potentially more useful interpretation leverages the fact that (11) is already a kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation error.

- Superlinear optimization brings little asymptotic improvement in ε. Although the superlinear 2GD algorithm improves the logarithmic term, the learning performance of all these algorithms is dominated by the polynomial term in (1/ε). This explains why improving the constants d, κ and ν using preconditioning methods and sensible software engineering often proves more effective than switching to more sophisticated optimization techniques [18].

- The SGD algorithm yields the best generalization performance despite being the worst optimization algorithm. This had been described before [19] in the case of a second order stochastic gradient descent, and observed in experiments.

In contrast, since the optimization error E_opt of small-scale learning systems can be reduced to insignificant levels, their generalization performance is solely determined by the statistical properties of their estimation procedure.
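To see how the last column of Table 2 behaves, one can plug arbitrary constants into the three time-to-reach-ε expressions and ask which algorithm attains the smallest ε within a fixed budget T_max. All numbers below are made up for illustration; only the functional forms are taken from the table:

    import numpy as np
    from scipy.optimize import brentq

    d, kappa, nu, alpha = 100.0, 10.0, 1.0, 0.5      # illustrative constants only
    T_max = 1e9                                       # computing-time budget, arbitrary units

    # Time needed to reach excess error eps, per the last column of Table 2 (constants dropped).
    def t_gd(eps):  return d**2 * kappa / eps**(1.0 / alpha) * np.log(1.0 / eps)**2
    def t_2gd(eps): return d**2 / eps**(1.0 / alpha) * np.log(1.0 / eps) * np.log(np.log(1.0 / eps))
    def t_sgd(eps): return d * nu * kappa**2 / eps

    # Smallest eps reachable within the budget: solve T(eps) = T_max for each algorithm.
    for name, T in [("GD", t_gd), ("2GD", t_2gd), ("SGD", t_sgd)]:
        eps = brentq(lambda e: T(e) - T_max, 1e-12, 0.3)
        print(f"{name:>4}: smallest eps within budget ~ {eps:.2e}")

With these particular constants SGD reaches a much smaller ε than GD or 2GD, in line with the discussion above, but the ranking naturally shifts as d, κ, ν, α and the budget change.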

Table 3. Results with a linear SVM on the RCV1 dataset.

    Model                        Algorithm               Training Time    Objective    Test Error
    Hinge loss, λ = 10^-4        SVMLight                23,642 secs
    (see [21,22])                SVMPerf                 66 secs
                                 SGD                     1.4 secs
    Logistic loss, λ = 10^-5     LibLinear (ρ = 10^-2)   30 secs
    (see [23])                   LibLinear (ρ = 10^-3)   44 secs
                                 SGD                     2.3 secs

Figure 1. Training time and testing loss as a function of the optimization accuracy ρ for SGD and LibLinear [23].

Figure 2. Testing loss versus training time for SGD, and for Conjugate Gradients running on subsets of the training set.

3. Experiments

This section empirically compares the SGD algorithm with other optimization algorithms on a well-known text categorization task, the classification of documents belonging to the CCAT category in the RCV1-v2 dataset [20]. Refer to the projects/sgd page for source code and for additional experiments that could not fit in this paper because of space constraints.

In order to collect a large training set, we swap the RCV1-v2 official training and test sets. The resulting training set and testing set contain 781,265 and 23,149 examples respectively. The 47,152 TF/IDF features were recomputed on the basis of this new split. We use a simple linear model with the usual hinge loss SVM objective function

    \min_{w,b} C(w, b) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell\big( y_i (w x_i + b) \big), \qquad \ell(z) = \max\{0, 1 - z\}.

The first two rows of Table 3 replicate earlier results [21] reported for the same data and the same value of the hyper-parameter λ. The third row of Table 3 reports results obtained with the SGD algorithm

    w_{t+1} = w_t - \eta_t \left( \lambda w_t + \frac{\partial \ell( y_t (w x_t + b) )}{\partial w} \right), \qquad \eta_t = \frac{1}{\lambda (t + t_0)}.

The bias b is updated similarly. Since λ is a lower bound of the smallest eigenvalue of the Hessian, our choice of gains η_t approximates the optimal schedule (see section 2.2).
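The experimental update rule is easy to restate in code. The following sketch applies the update w_{t+1} = w_t - η_t(λ w_t + ∂ℓ/∂w) with the hinge loss and η_t = 1/(λ(t + t_0)) to synthetic unit-normalized data standing in for RCV1; the sample size, dimension and the simple choice t_0 = 1/λ are assumptions for illustration, not the paper's calibration:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10000, 50                                   # synthetic stand-in for the RCV1 task
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-normalized rows, like TF/IDF vectors
    y = np.sign(X @ rng.normal(size=d))

    lam = 1e-4                                         # hyper-parameter lambda
    t0 = 1.0 / lam                                     # illustrative offset; the paper calibrates t0 to the size of w
    w, b = np.zeros(d), 0.0

    for t, i in enumerate(rng.permutation(n), start=1):
        eta = 1.0 / (lam * (t + t0))                   # eta_t = 1 / (lambda (t + t0))
        if y[i] * (w @ X[i] + b) < 1.0:                # hinge loss subgradient is non-zero
            w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            b += eta * y[i]                            # bias updated similarly, without regularization
        else:
            w = (1.0 - eta * lam) * w                  # only the regularization term contributes

    err = np.mean(np.sign(X @ w + b) != y)
    print(f"training error after one SGD pass: {err:.3f}")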

The offset t_0 was chosen to ensure that the initial gain is comparable with the expected size of the parameter w. The results clearly indicate that SGD offers a good alternative to the usual SVM solvers. Comparable results were obtained in [22] using an algorithm that essentially amounts to a stochastic gradient corrected by a projection step. Our results indicate that the projection step is not an essential component of this performance.

Table 3 also reports results obtained with the logistic loss ℓ(z) = log(1 + e^{-z}) in order to avoid the issues related to the non-differentiability of the hinge loss. Note that this experiment uses a much better value for λ. Our comparison points were obtained with a state-of-the-art superlinear optimizer [23], for two values of the optimization accuracy ρ. Yet the very simple SGD algorithm learns faster.

Figure 1 shows how much time each algorithm takes to reach a given optimization accuracy. The superlinear algorithm reaches the optimum with 10 digits of accuracy in less than one minute. The stochastic gradient starts more quickly but is unable to deliver such a high accuracy. However, the upper part of the figure clearly shows that the testing set loss stops decreasing long before the moment when the superlinear algorithm overcomes the stochastic gradient.

Figure 2 shows how the testing loss evolves with the training time. The stochastic gradient descent curve can be compared with the curves obtained using conjugate gradients on subsets of the training examples with increasing sizes. (This experimental setup was suggested by Olivier Chapelle, personal communication. His specialized variant of the conjugate gradients algorithm works nicely in this context because it converges superlinearly with very limited overhead.) Assume for instance that our computing time budget is 1 second. Running the conjugate gradient algorithm on a random subset of training examples achieves a much better performance than running it on the whole training set. How to guess the right subset size a priori remains unclear. Meanwhile, running the SGD algorithm on the full training set reaches the same testing set performance much faster.

4. Conclusion

Taking into account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems. The generalization properties of large-scale learning systems depend on both the statistical properties of the estimation procedure and the computational properties of the optimization algorithm. We illustrate this fact by deriving asymptotic results on gradient algorithms supported by an experimental validation.

Considerable refinements of this framework can be expected. Extending the analysis to regularized risk formulations would make results on the complexity of primal and dual optimization algorithms [21,24] directly exploitable. The choice of surrogate loss function [7,12] could also have a non-trivial impact in the large-scale case.

Acknowledgments

Part of this work was funded by an NSF CCR grant.

References

[1] Leslie G. Valiant. A theory of the learnable. Proc. of the 1984 STOC, 1984.
[2] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin.
[3] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9.
[4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3).
[5] J. Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4(3).
[6] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley and Sons.
[7] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56-85.
[8] Clint Scovel and Ingo Steinwart. Fast rates for support vector machines. In Peter Auer and Ron Meir, editors, Proceedings of the 18th Conference on Learning Theory (COLT 2005), volume 3559 of Lecture Notes in Computer Science, Bertinoro, Italy, 2005. Springer-Verlag.
[9] Vladimir N. Vapnik, Esther Levin, and Yann LeCun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5).
[10] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique.
[11] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1).
[12] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473), March 2006.
[13] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, series 6, 9(2).
[14] Wee S. Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5).
[15] Shahar Mendelson. A few notes on statistical learning theory. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures in Machine Learning, volume 2600 of Lecture Notes in Computer Science. Springer-Verlag, Berlin.
[16] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey.
[17] Noboru Murata. A statistical study of on-line learning. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK.
[18] Yann Le Cun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science. Springer Verlag.
[19] Léon Bottou and Yann Le Cun. Large scale online learning. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.
[20] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5.
[21] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference, Philadelphia, PA, August 2006. ACM Press.
[22] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Zoubin Ghahramani, editor, Proceedings of the 24th International Machine Learning Conference, Corvallis, OR, June 2007. ACM.
[23] Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region Newton methods for large-scale logistic regression. In Zoubin Ghahramani, editor, Proceedings of the 24th International Machine Learning Conference, Corvallis, OR, June 2007. ACM.
[24] Don Hush, Patrick Kelly, Clint Scovel, and Ingo Steinwart. QP algorithms with guaranteed accuracy and run time for support vector machines. Journal of Machine Learning Research, 7, 2006.


More information

Inverse Trig Functions

Inverse Trig Functions Inverse Trig Functions c A Math Support Center Capsule February, 009 Introuction Just as trig functions arise in many applications, so o the inverse trig functions. What may be most surprising is that

More information

Chapter 9 AIRPORT SYSTEM PLANNING

Chapter 9 AIRPORT SYSTEM PLANNING Chapter 9 AIRPORT SYSTEM PLANNING. Photo creit Dorn McGrath, Jr Contents Page The Planning Process................................................... 189 Airport Master Planning..............................................

More information

Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

More information

A Data Placement Strategy in Scientific Cloud Workflows

A Data Placement Strategy in Scientific Cloud Workflows A Data Placement Strategy in Scientific Clou Workflows Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen Faculty of Information an Communication Technologies, Swinburne University of Technology Hawthorn, Melbourne,

More information

Net Neutrality, Network Capacity, and Innovation at the Edges

Net Neutrality, Network Capacity, and Innovation at the Edges Net Neutrality, Network Capacity, an Innovation at the Eges Jay Pil Choi Doh-Shin Jeon Byung-Cheol Kim May 22, 2015 Abstract We stuy how net neutrality regulations affect a high-banwith content provier(cp)

More information

Lecture 6: Logistic Regression

Lecture 6: Logistic Regression Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

More information

Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

More information

Answers to the Practice Problems for Test 2

Answers to the Practice Problems for Test 2 Answers to the Practice Problems for Test 2 Davi Murphy. Fin f (x) if it is known that x [f(2x)] = x2. By the chain rule, x [f(2x)] = f (2x) 2, so 2f (2x) = x 2. Hence f (2x) = x 2 /2, but the lefthan

More information

Mandate-Based Health Reform and the Labor Market: Evidence from the Massachusetts Reform

Mandate-Based Health Reform and the Labor Market: Evidence from the Massachusetts Reform Manate-Base Health Reform an the Labor Market: Evience from the Massachusetts Reform Jonathan T. Kolsta Wharton School, University of Pennsylvania an NBER Amana E. Kowalski Department of Economics, Yale

More information

Optimal Control Of Production Inventory Systems With Deteriorating Items And Dynamic Costs

Optimal Control Of Production Inventory Systems With Deteriorating Items And Dynamic Costs Applie Mathematics E-Notes, 8(2008), 194-202 c ISSN 1607-2510 Available free at mirror sites of http://www.math.nthu.eu.tw/ amen/ Optimal Control Of Prouction Inventory Systems With Deteriorating Items

More information

Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Support Vector Machine. Tutorial. (and Statistical Learning Theory) Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced

More information

View Synthesis by Image Mapping and Interpolation

View Synthesis by Image Mapping and Interpolation View Synthesis by Image Mapping an Interpolation Farris J. Halim Jesse S. Jin, School of Computer Science & Engineering, University of New South Wales Syney, NSW 05, Australia Basser epartment of Computer

More information

How To Understand The Structure Of A Can (Can)

How To Understand The Structure Of A Can (Can) Thi t t ith F M k 4 0 4 BOSCH CAN Specification Version 2.0 1991, Robert Bosch GmbH, Postfach 50, D-7000 Stuttgart 1 The ocument as a whole may be copie an istribute without restrictions. However, the

More information

Forecasting and Staffing Call Centers with Multiple Interdependent Uncertain Arrival Streams

Forecasting and Staffing Call Centers with Multiple Interdependent Uncertain Arrival Streams Forecasting an Staffing Call Centers with Multiple Interepenent Uncertain Arrival Streams Han Ye Department of Statistics an Operations Research, University of North Carolina, Chapel Hill, NC 27599, hanye@email.unc.eu

More information

SEC Issues Proposed Guidance to Fund Boards Relating to Best Execution and Soft Dollars

SEC Issues Proposed Guidance to Fund Boards Relating to Best Execution and Soft Dollars September 2008 / Issue 21 A legal upate from Dechert s Financial Services Group SEC Issues Propose Guiance to Fun Boars Relating to Best Execution an Soft Dollars The Securities an Exchange Commission

More information

Innovation Union means: More jobs, improved lives, better society

Innovation Union means: More jobs, improved lives, better society The project follows the Lisbon an Gothenburg Agenas, an supports the EU 2020 Strategy, in particular SMART Growth an the Innovation Union: Innovation Union means: More jobs, improve lives, better society

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information