Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3
ESAIM: Probability and Statistics. URL: will be set by the publisher.

THEORY OF CLASSIFICATION: A SURVEY OF SOME RECENT ADVANCES

Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3

Abstract. The last few years have witnessed important new developments in the theory and practice of pattern classification. We intend to survey some of the main new ideas that have led to these recent results.

Résumé. The practice and theory of pattern recognition have seen important developments in recent years. This survey aims to present some of the new ideas that have led to these developments.

1991 Mathematics Subject Classification. 62G08, 60E15, 68Q32.

September 23, 2005.

Contents

1. Introduction 2
2. Basic model 2
3. Empirical risk minimization and Rademacher averages 3
4. Minimizing cost functions: some basic ideas behind boosting and support vector machines 8
4.1. Margin-based performance bounds 9
4.2. Convex cost functionals 13
5. Tighter bounds for empirical risk minimization
5.1. Relative deviations
5.2. Noise and fast rates
5.3. Localization
5.4. Cost functions
5.5. Minimax lower bounds 26
6. PAC-Bayesian bounds 29
7. Stability 31
8. Model selection
8.1. Oracle inequalities 32

Keywords and phrases: Pattern Recognition, Statistical Learning Theory, Concentration Inequalities, Empirical Processes, Model Selection.

The authors acknowledge support by the PASCAL Network of Excellence under EC grant no. … The work of the third author was supported by the Spanish Ministry of Science and Technology and FEDER, grant BMF…

1 Laboratoire Probabilités et Modèles Aléatoires, CNRS & Université Paris VII, Paris, France, www.proba.jussieu.fr/~boucheron
2 Pertinence SA, 32 rue des Jeûneurs, Paris, France
3 Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, Barcelona, Spain, lugosi@upf.es

© EDP Sciences, SMAI 1999
8.2. A glimpse at model selection methods
8.3. Naive penalization
8.4. Ideal penalties
8.5. Localized Rademacher complexities
8.6. Pre-testing
8.7. Revisiting hold-out estimates 45
References 47

1. Introduction

The last few years have witnessed important new developments in the theory and practice of pattern classification. The introduction of new and effective techniques for handling high-dimensional problems, such as boosting and support vector machines, has revolutionized the practice of pattern recognition. At the same time, a better understanding of the application of empirical process theory and concentration inequalities has led to effective new ways of studying these methods and has provided a statistical explanation for their success. These new tools have also helped develop new model selection methods that are at the heart of many classification algorithms.

The purpose of this survey is to offer an overview of some of these theoretical tools and to give the main ideas of the analysis of some of the important algorithms. This survey does not attempt to be exhaustive. The selection of topics is largely biased by the personal taste of the authors. We also limit ourselves to describing the key ideas in a simple way, often sacrificing generality. In these cases the reader is pointed to the references for the sharpest and more general results available. References and bibliographical remarks are given at the end of each section, in an attempt to avoid interrupting the arguments.

2. Basic model

The problem of pattern classification is about guessing or predicting the unknown class of an observation. An observation is often a collection of numerical and/or categorical measurements represented by a d-dimensional vector x, but in some cases it may even be a curve or an image. In our model we simply assume that x ∈ X, where X is some abstract measurable space equipped with a σ-algebra. The unknown nature of the observation is called a class. It is denoted by y and in the simplest case takes values in the binary set {−1, 1}. In these notes we restrict our attention to binary classification. The reason is simplicity, and that the binary problem already captures many of the main features of more general problems. Even though there is much to say about multiclass classification, this survey does not cover this growing field of research.

In classification, one creates a function g : X → {−1, 1} which represents one's guess of y given x. The mapping g is called a classifier. The classifier errs on x if g(x) ≠ y.

To formalize the learning problem, we introduce a probabilistic setting, and let (X, Y) be an X × {−1, 1}-valued random pair, modeling an observation and its corresponding class. The distribution of the random pair (X, Y) may be described by the probability distribution of X (given by the probabilities P{X ∈ A} for all measurable subsets A of X) and η(x) = P{Y = 1 | X = x}. The function η is called the a posteriori probability. We measure the performance of classifier g by its probability of error L(g) = P{g(X) ≠ Y}. Given η, one may easily construct a classifier with minimal probability of error. In particular, it is easy to see that if we define

g*(x) = 1 if η(x) > 1/2, and g*(x) = −1 otherwise,

then L(g*) ≤ L(g) for any classifier g. The minimal risk L* := L(g*) is called the Bayes risk (or Bayes error). More precisely, it is immediate to see that

L(g) − L* = E[ 1_{g(X) ≠ g*(X)} |2η(X) − 1| ] ≥ 0   (1)

(see, e.g., [72]). The optimal classifier g* is often called the Bayes classifier. In the statistical model we focus on, one has access to a collection of data (X_i, Y_i), 1 ≤ i ≤ n. We assume that the data D_n consists of a sequence of independent, identically distributed (i.i.d.) random pairs (X_1, Y_1), …, (X_n, Y_n) with the same distribution as that of (X, Y). A classifier is constructed on the basis of D_n = (X_1, Y_1, …, X_n, Y_n) and is denoted by g_n. Thus, the value of Y is guessed by g_n(X) = g_n(X; X_1, Y_1, …, X_n, Y_n). The performance of g_n is measured by its (conditional) probability of error

L(g_n) = P{g_n(X) ≠ Y | D_n}.

The focus of the theory (and practice) of classification is to construct classifiers g_n whose probability of error is as close to L* as possible. Obviously, the whole arsenal of traditional parametric and nonparametric statistics may be used to attack this problem. However, the high-dimensional nature of many of the new applications (such as image recognition, text classification, micro-biological applications, etc.) leads to territories beyond the reach of traditional methods. Most new advances of statistical learning theory aim to face these new challenges.

Bibliographical remarks. Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Fukunaga [97], Duda and Hart [77], Vapnik and Chervonenkis [233], Devijver and Kittler [70], Vapnik [229, 230], Breiman, Friedman, Olshen, and Stone [53], Natarajan [175], McLachlan [169], Anthony and Biggs [10], Kearns and Vazirani [117], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Kulkarni, Lugosi, and Venkatesh [128], Anthony and Bartlett [9], Duda, Hart, and Stork [78], Lugosi [144], and Mendelson [171].

3. Empirical risk minimization and Rademacher averages

A simple and natural approach to the classification problem is to consider a class C of classifiers g : X → {−1, 1} and use data-based estimates of the probabilities of error L(g) to select a classifier from the class. The most natural choice to estimate the probability of error L(g) = P{g(X) ≠ Y} is the error count

L_n(g) = (1/n) Σ_{i=1}^n 1_{g(X_i) ≠ Y_i}.

L_n(g) is called the empirical error of the classifier g.

First we outline the basics of the theory of empirical risk minimization (i.e., the classification analog of M-estimation). Denote by g_n* the classifier that minimizes the estimated probability of error over the class: L_n(g_n*) ≤ L_n(g) for all g ∈ C. Then the probability of error L(g_n*) = P{g_n*(X) ≠ Y | D_n} of the selected rule is easily seen to satisfy the elementary inequalities

L(g_n*) − inf_{g∈C} L(g) ≤ 2 sup_{g∈C} |L_n(g) − L(g)|,   (2)
L(g_n*) ≤ L_n(g_n*) + sup_{g∈C} |L_n(g) − L(g)|.
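As a toy illustration of empirical risk minimization and inequality (2), the sketch below uses an invented setup (not from the survey): X uniform on [0, 1], η(x) equal to 0.9 for x > 1/2 and 0.1 otherwise, and a grid of threshold classifiers g_t(x) = 1 if x ≥ t and −1 otherwise. For this distribution the true risks L(g_t) are available in closed form, so (2) can be checked exactly.

```python
import random

def true_risk(t):
    # L(g_t) in closed form for this toy distribution; minimized at t = 0.5
    # with Bayes risk 0.1 over the grid.
    return 0.5 - 0.8 * t if t <= 0.5 else 0.8 * t - 0.3

random.seed(0)
n = 500
sample = []
for _ in range(n):
    x = random.random()
    eta = 0.9 if x > 0.5 else 0.1
    y = 1 if random.random() < eta else -1
    sample.append((x, y))

grid = [i / 100 for i in range(101)]

def empirical_risk(t):
    # L_n(g_t): fraction of training errors of the threshold classifier.
    return sum(1 for x, y in sample if (1 if x >= t else -1) != y) / n

# Empirical risk minimizer over the class.
t_star = min(grid, key=empirical_risk)

# The two sides of inequality (2).
excess = true_risk(t_star) - min(true_risk(t) for t in grid)
sup_dev = max(abs(empirical_risk(t) - true_risk(t)) for t in grid)
assert excess <= 2 * sup_dev  # inequality (2), checked with exact true risks
```

Inequality (2) is deterministic once the supremum is taken over the same class, so the final assertion holds for every realization of the sample, not just on average.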
We see that by guaranteeing that the uniform deviation sup_{g∈C} |L_n(g) − L(g)| of estimated probabilities from their true values is small, we make sure that the probability of error of the selected classifier g_n* is not much larger than the best probability of error in the class C, and at the same time the empirical estimate L_n(g_n*) is also good. It is important to note at this point that bounding the excess risk by the maximal deviation as in (2) is quite loose in many situations. In Section 5 we survey some ways of obtaining improved bounds. On the other hand, the simple inequality above offers a convenient way of understanding some of the basic principles, and it is even sharp in a certain minimax sense, see Section 5.5.

Clearly, the random variable nL_n(g) is binomially distributed with parameters n and L(g). Thus, to obtain bounds for the success of empirical error minimization, we need to study uniform deviations of binomial random variables from their means. We formulate the problem in a somewhat more general way as follows. Let X_1, …, X_n be independent, identically distributed random variables taking values in some set X and let F be a class of bounded functions X → [−1, 1]. Denoting expectation and empirical averages by Pf = E f(X_1) and P_n f = (1/n) Σ_{i=1}^n f(X_i), we are interested in upper bounds for the maximal deviation

sup_{f∈F} (Pf − P_n f).

Concentration inequalities are among the basic tools in studying such deviations. The simplest, yet quite powerful, exponential concentration inequality is the bounded differences inequality.

Theorem 3.1 (bounded differences inequality). Let g : X^n → R be a function of n variables such that for some nonnegative constants c_1, …, c_n,

sup_{x_1,…,x_n, x_i' ∈ X} |g(x_1, …, x_n) − g(x_1, …, x_{i−1}, x_i', x_{i+1}, …, x_n)| ≤ c_i,  1 ≤ i ≤ n.

Let X_1, …, X_n be independent random variables. Then the random variable Z = g(X_1, …, X_n) satisfies

P{|Z − EZ| > t} ≤ 2 e^{−2t²/C},

where C = Σ_{i=1}^n c_i².

The bounded differences assumption means that if the i-th variable of g is changed while keeping all the others fixed, the value of the function cannot change by more than c_i. Our main example of such a function is Z = sup_{f∈F} (Pf − P_n f). Obviously, Z satisfies the bounded differences assumption with c_i = 2/n and therefore, for any δ ∈ (0, 1), with probability at least 1 − δ,

sup_{f∈F} (Pf − P_n f) ≤ E sup_{f∈F} (Pf − P_n f) + √(2 log(1/δ)/n).   (3)

This concentration result allows us to focus on the expected value, which can be bounded conveniently by a simple symmetrization device. Introduce a ghost sample X_1', …, X_n', independent of the X_i and identically distributed. If P_n' f = (1/n) Σ_{i=1}^n f(X_i') denotes the empirical averages measured on the ghost sample, then by Jensen's inequality,

E sup_{f∈F} (Pf − P_n f) = E sup_{f∈F} E[ P_n' f − P_n f | X_1, …, X_n ] ≤ E sup_{f∈F} (P_n' f − P_n f).
Let now σ_1, …, σ_n be independent (Rademacher) random variables with P{σ_i = 1} = P{σ_i = −1} = 1/2, independent of the X_i and X_i'. Then

E sup_{f∈F} (P_n' f − P_n f) = E sup_{f∈F} (1/n) Σ_{i=1}^n (f(X_i') − f(X_i))
= E sup_{f∈F} (1/n) Σ_{i=1}^n σ_i (f(X_i') − f(X_i))
≤ 2 E sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(X_i).

Let A ⊂ R^n be a bounded set of vectors a = (a_1, …, a_n), and introduce the quantity

R_n(A) = E sup_{a∈A} (1/n) Σ_{i=1}^n σ_i a_i.

R_n(A) is called the Rademacher average associated with A. For a given sequence x_1, …, x_n ∈ X, we write F(x_1^n) for the class of n-vectors (f(x_1), …, f(x_n)) with f ∈ F. Thus, using this notation, we have deduced the following.

Theorem 3.2. With probability at least 1 − δ,

sup_{f∈F} (Pf − P_n f) ≤ 2 E R_n(F(X_1^n)) + √(2 log(1/δ)/n).

We also have

sup_{f∈F} (Pf − P_n f) ≤ 2 R_n(F(X_1^n)) + √(2 log(2/δ)/n).

The second statement follows simply by noticing that the random variable R_n(F(X_1^n)) satisfies the conditions of the bounded differences inequality. The second inequality is our first data-dependent performance bound. It involves the Rademacher average of the coordinate projection of F given by the data X_1, …, X_n. Given the data, one may compute the Rademacher average, for example, by Monte Carlo integration. Note that for a given choice of the random signs σ_1, …, σ_n, the computation of sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(X_i) is equivalent to minimizing −Σ_{i=1}^n σ_i f(X_i) over f ∈ F, and therefore it is computationally equivalent to empirical risk minimization. R_n(F(X_1^n)) measures the richness of the class F and provides a sharp estimate for the maximal deviations. In fact, one may prove that

(1/2) E R_n(F(X_1^n)) − 1/(2√n) ≤ E sup_{f∈F} |Pf − P_n f| ≤ 2 E R_n(F(X_1^n))

(see, e.g., van der Vaart and Wellner [227]).

Next we recall some of the simple structural properties of Rademacher averages.

Theorem 3.3 (properties of Rademacher averages). Let A, B be bounded subsets of R^n and let c ∈ R be a constant. Then

R_n(A ∪ B) ≤ R_n(A) + R_n(B),  R_n(c · A) = |c| R_n(A),  R_n(A ⊕ B) ≤ R_n(A) + R_n(B),

where c · A = {ca : a ∈ A} and A ⊕ B = {a + b : a ∈ A, b ∈ B}. Moreover, if A = {a^(1), …, a^(N)} ⊂ R^n is a finite set, then

R_n(A) ≤ max_{j=1,…,N} ∥a^(j)∥ √(2 log N) / n,   (4)

where ∥·∥ denotes the Euclidean norm. If absconv(A) = { Σ_{j=1}^N c_j a^(j) : N ∈ N, Σ_{j=1}^N |c_j| ≤ 1, a^(j) ∈ A } is the absolute convex hull of A, then

R_n(A) = R_n(absconv(A)).   (5)

Finally, the contraction principle states that if φ : R → R is a function with φ(0) = 0 and Lipschitz constant L_φ, and φ ◦ A is the set of vectors of the form (φ(a_1), …, φ(a_n)) ∈ R^n with a ∈ A, then R_n(φ ◦ A) ≤ L_φ R_n(A).

Proof. The first three properties are immediate from the definition. Inequality (4) follows from Hoeffding's inequality, which states that if X is a bounded zero-mean random variable taking values in an interval [α, β], then for any s > 0, E exp(sX) ≤ exp(s²(β − α)²/8). In particular, by independence,

E exp( s (1/n) Σ_{i=1}^n σ_i a_i ) = Π_{i=1}^n E exp( (s/n) σ_i a_i ) ≤ Π_{i=1}^n exp( s² a_i² / (2n²) ) = exp( s² ∥a∥² / (2n²) ).

This implies that

e^{s R_n(A)} = exp( s E max_{j=1,…,N} (1/n) Σ_{i=1}^n σ_i a_i^(j) ) ≤ E max_{j=1,…,N} exp( (s/n) Σ_{i=1}^n σ_i a_i^(j) ) ≤ N max_{j=1,…,N} exp( s² ∥a^(j)∥² / (2n²) ).

Taking the logarithm of both sides, dividing by s, and choosing s to minimize the obtained upper bound for R_n(A), we arrive at (4). The identity (5) is easily seen from the definition. For a proof of the contraction principle, see Ledoux and Talagrand [133].

Often it is useful to derive further upper bounds on Rademacher averages. As an illustration, we consider the case when F is a class of indicator functions. Recall that this is the case in our motivating example in the classification problem described above, when each f ∈ F is the indicator function of a set of the form {(x, y) : g(x) ≠ y}. In such a case, for any collection of points x_1^n = (x_1, …, x_n), F(x_1^n) is a finite subset of R^n whose cardinality, denoted by S_F(x_1^n), is called the vc shatter coefficient (where vc stands for Vapnik-Chervonenkis). Obviously, S_F(x_1^n) ≤ 2^n. By inequality (4), we have, for all x_1^n,

R_n(F(x_1^n)) ≤ √( 2 log S_F(x_1^n) / n ),   (6)

where we used the fact that for each f ∈ F, Σ_i f(x_i)² ≤ n. In particular,

E sup_{f∈F} (Pf − P_n f) ≤ 2 E √( 2 log S_F(X_1^n) / n ).

The logarithm of the vc shatter coefficient may be upper bounded in terms of a combinatorial quantity, called the vc dimension. If A ⊂ {−1, 1}^n, then the vc dimension of A is the size V of the largest set of indices {i_1, …, i_V} ⊂ {1, …, n} such that for each binary V-vector b = (b_1, …, b_V) ∈ {−1, 1}^V there exists an a = (a_1, …, a_n) ∈ A such that (a_{i_1}, …, a_{i_V}) = b. The key inequality establishing a relationship between shatter coefficients and vc dimension is known as Sauer's lemma, which states that the cardinality of any set A ⊂ {−1, 1}^n may be upper bounded as

|A| ≤ Σ_{i=0}^V (n choose i) ≤ (n + 1)^V,

where V is the vc dimension of A. In particular,

log S_F(x_1^n) ≤ V(x_1^n) log(n + 1),

where we denote by V(x_1^n) the vc dimension of F(x_1^n). Thus, the expected maximal deviation E sup_{f∈F} (Pf − P_n f) may be upper bounded by 2 E √( 2 V(X_1^n) log(n + 1) / n ). To obtain distribution-free upper bounds, introduce the vc dimension of a class of binary functions F, defined by

V = sup_{n, x_1^n} V(x_1^n).

Then we obtain the following version of what has been known as the Vapnik-Chervonenkis inequality:

Theorem 3.4 (Vapnik-Chervonenkis inequality). For all distributions one has

E sup_{f∈F} (Pf − P_n f) ≤ 2 √( 2 V log(n + 1) / n ).

Also, for a universal constant C,

E sup_{f∈F} (Pf − P_n f) ≤ C √( V / n ).

The second inequality, which allows one to remove the logarithmic factor, follows from a somewhat refined analysis (called chaining). The vc dimension is an important combinatorial parameter of the class and many of its properties are well known. Here we just recall one useful result and refer the reader to the references for further study: let G be an m-dimensional vector space of real-valued functions defined on X. Then the class of indicator functions

F = { f(x) = 1_{g(x) ≥ 0} : g ∈ G }

has vc dimension V ≤ m.

Bibliographical remarks. Uniform deviations of averages from their expectations is one of the central problems of empirical process theory. Here we merely refer to some of the comprehensive coverages, such as Shorack and Wellner [199], Giné [98], van der Vaart and Wellner [227], Vapnik [231], Dudley [83]. The use of empirical processes in classification was pioneered by Vapnik and Chervonenkis [232, 233] and re-discovered 20 years later by Blumer, Ehrenfeucht, Haussler, and Warmuth [41], Ehrenfeucht, Haussler, Kearns, and Valiant [88]. For surveys see Natarajan [175], Devroye [71], Anthony and Biggs [10], Kearns and Vazirani [117], Vapnik [230, 231], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Anthony and Bartlett [9].

The bounded differences inequality was first formulated explicitly by McDiarmid [166] (see also the survey [167]). The martingale methods used by McDiarmid had appeared in early work of Hoeffding [109], Azuma [18], Yurinskii [242, 243], Milman and Schechtman [174]. Closely related concentration results have been obtained in various ways, including information-theoretic methods (see Ahlswede, Gács, and Körner [1], Marton [154], [155], [156], Dembo [69], Massart [158] and Rio [183]), Talagrand's induction method [217], [213], [216] (see also McDiarmid [168], Luczak and McDiarmid [143], Panchenko [ ]) and the so-called entropy method, based on logarithmic Sobolev inequalities, developed by Ledoux [132], [131], see also Bobkov and Ledoux [42], Massart [159], Rio [183], Boucheron, Lugosi, and Massart [45, 46], Bousquet [47], and Boucheron, Bousquet, Lugosi, and Massart [44]. Symmetrization was at the basis of the original arguments of Vapnik and Chervonenkis [232, 233]. We learnt the simple symmetrization trick shown above from Giné and Zinn [99], but different forms of symmetrization have been at the core of obtaining related results of similar flavor, see also Anthony and Shawe-Taylor [11], Cannon, Ettinger, Hush, Scovel [55], Herbrich and Williamson [108], Mendelson and Philips [172]. The use of Rademacher averages in classification was first promoted by Koltchinskii [124] and Bartlett, Boucheron, and Lugosi [24], see also Koltchinskii and Panchenko [126, 127], Bartlett and Mendelson [29], Bartlett, Bousquet, and Mendelson [25], Bousquet, Koltchinskii, and Panchenko [50], Kégl, Linder, and Lugosi [13], Mendelson [170]. Hoeffding's inequality appears in [109]. For a proof of the contraction principle we refer to Ledoux and Talagrand [133]. Sauer's lemma was proved independently by Sauer [189], Shelah [198], and Vapnik and Chervonenkis [232]. For related combinatorial results we refer to Frankl [90], Haussler [106], Alesker [7], Alon, Ben-David, Cesa-Bianchi, and Haussler [8], Szarek and Talagrand [210], Cesa-Bianchi and Haussler [60], Mendelson and Vershynin [173], [188]. The second inequality of Theorem 3.4 is based on the method of chaining, and was first proved by Dudley [81]. The question of how sup_{f∈F} |Pf − P_n f| behaves has been known as the Glivenko-Cantelli problem and much has been said about it. A few key references include Vapnik and Chervonenkis [232, 234], Dudley [79, 81, 82], Talagrand [211, 212, 214, 218], Dudley, Giné, and Zinn [84], Alon, Ben-David, Cesa-Bianchi, and Haussler [8], Li, Long, and Srinivasan [138], Mendelson and Vershynin [173]. The vc dimension has been widely studied and many of its properties are known. We refer to Cover [63], Dudley [80, 83], Steele [204], Wenocur and Dudley [238], Assouad [15], Khovanskii [118], Macintyre and Sontag [149], Goldberg and Jerrum [101], Karpinski and A. Macintyre [114], Koiran and Sontag [121], Anthony and Bartlett [9], and Bartlett and Maass [28].

4. Minimizing cost functions: some basic ideas behind boosting and support vector machines

The results summarized in the previous section reveal that minimizing the empirical risk L_n(g) over a class C of classifiers with a vc dimension much smaller than the sample size n is guaranteed to work well. This result has two fundamental problems. First, by requiring that the vc dimension be small, one imposes serious limitations on the approximation properties of the class. In particular, even though the probability of error L(g_n*) of the empirical risk minimizer is close to the smallest probability of error inf_{g∈C} L(g) in the class, inf_{g∈C} L(g) − L* may be very large. The other problem is algorithmic: minimizing the empirical probability of misclassification L_n(g) is very often a computationally difficult problem. Even in seemingly simple cases, for example when X = R^d and C is the class of classifiers that split the space of observations by a hyperplane, the minimization problem is NP-hard.

The computational difficulty of learning problems deserves some more attention. Let us consider in more detail the problem in the case of half-spaces. Formally, we are given a sample, that is, a sequence of n vectors (x_1, …, x_n) from R^d and a sequence of n labels (y_1, …, y_n) from {−1, 1}, and in order to minimize the empirical misclassification risk we are asked to find w ∈ R^d and b ∈ R so as to minimize

#{ k : y_k(⟨w, x_k⟩ − b) ≤ 0 }.

Without loss of generality, the vectors constituting the sample are assumed to have rational coefficients, and the size of the data is the sum of the bit lengths of the vectors making up the sample. Not only has minimizing the number of misclassification errors been proved to be at least as hard as solving any NP-complete problem, but even approximately minimizing the number of misclassification errors within a constant factor of the optimum has been shown to be NP-hard. This means that, unless P = NP, we will not be able to build a computationally efficient empirical risk minimizer for half-spaces that will work for all input space dimensions. If the input space dimension d is fixed, an algorithm running in O(n^{d−1} log n) steps enumerates the trace of half-spaces on a sample of length n. This allows an exhaustive search for the empirical risk minimizer. Such a possibility should be considered with circumspection, since its range of applications would not extend much beyond problems where the input dimension is less than 5.

4.1. Margin-based performance bounds

An attempt to solve both of these problems is to modify the empirical functional to be minimized by introducing a cost function. Next we describe the main ideas of empirical minimization of cost functionals and its analysis. We consider classifiers of the form

g_f(x) = 1 if f(x) ≥ 0, and g_f(x) = −1 otherwise,

where f : X → R is a real-valued function. In such a case the probability of error of g_f may be written as

L(g_f) = P{sgn(f(X)) ≠ Y} ≤ E 1_{f(X)Y ≤ 0}.

To lighten notation we will simply write L(f) = L(g_f). Let φ : R → R_+ be a nonnegative cost function such that φ(x) ≥ 1_{x>0}. (Typical choices of φ include φ(x) = e^x, φ(x) = log_2(1 + e^x), and φ(x) = (1 + x)_+.) Introduce the cost functional and its empirical version by

A(f) = E φ(−f(X)Y) and A_n(f) = (1/n) Σ_{i=1}^n φ(−f(X_i)Y_i).

Obviously, L(f) ≤ A(f) and L_n(f) ≤ A_n(f).

Theorem 4.1. Assume that the function f_n is chosen from a class F based on the data (Z_1, …, Z_n) := ((X_1, Y_1), …, (X_n, Y_n)). Let B denote a uniform upper bound on φ(−f(x)y) and let L_φ be the Lipschitz constant of φ. Then the probability of error of the corresponding classifier may be bounded, with probability at least 1 − δ, by

L(f_n) ≤ A_n(f_n) + 2 L_φ E R_n(F(X_1^n)) + B √(2 log(1/δ)/n).

Thus, the Rademacher average of the class of real-valued functions f bounds the performance of the classifier.
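Before turning to the proof, the quantities appearing in the bound can be computed numerically. The sketch below uses an invented toy setup (not the survey's): a small finite class F of clipped score functions, the hinge-type cost φ(x) = (1 + x)_+ (so L_φ = 1 and B = 2 when |f| ≤ 1), and a Monte Carlo estimate of the empirical Rademacher average, as suggested in Section 3.

```python
import math
import random

random.seed(1)
n = 200

# Toy data: X ~ Uniform[0,1], label sign(x - 0.5) flipped with probability 0.1.
sample = []
for _ in range(n):
    x = random.random()
    y = 1 if x > 0.5 else -1
    if random.random() < 0.1:
        y = -y
    sample.append((x, y))

# A small finite class F of clipped linear score functions with range [-1, 1].
def make_f(t):
    return lambda x: max(-1.0, min(1.0, 4.0 * (x - t)))

F = [make_f(t) for t in (0.3, 0.4, 0.5, 0.6, 0.7)]

def phi(x):            # hinge-type cost, phi(x) >= 1_{x > 0}
    return max(0.0, 1.0 + x)

L_phi, B = 1.0, 2.0    # Lipschitz constant of phi; sup of phi(-f(x)y) when |f| <= 1

def A_n(f):            # empirical cost functional
    return sum(phi(-f(x) * y) for x, y in sample) / n

f_n = min(F, key=A_n)  # empirical cost minimizer

# Monte Carlo estimate of the empirical Rademacher average R_n(F(X_1^n)).
reps = 2000
acc = 0.0
for _ in range(reps):
    sigma = [random.choice((-1, 1)) for _ in range(n)]
    acc += max(sum(s * f(x) for s, (x, _) in zip(sigma, sample)) / n for f in F)
R_hat = acc / reps

delta = 0.05
bound = A_n(f_n) + 2 * L_phi * R_hat + B * math.sqrt(2 * math.log(1 / delta) / n)
train_err = sum(1 for x, y in sample if (1 if f_n(x) >= 0 else -1) != y) / n
assert train_err <= A_n(f_n) <= bound
```

Note that each inner maximization over F is exactly the computation the text describes: for fixed signs, evaluating sup_{f∈F} (1/n) Σ σ_i f(X_i) is of the same nature as empirical risk minimization over the class.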
Proof. The proof is similar to the argument of the previous section:

L(f_n) ≤ A(f_n) ≤ A_n(f_n) + sup_{f∈F} (A(f) − A_n(f))
≤ A_n(f_n) + 2 E R_n(φ ◦ H(Z_1^n)) + B √(2 log(1/δ)/n)
(where H is the class of functions X × {−1, 1} → R of the form −f(x)y, f ∈ F)
≤ A_n(f_n) + 2 L_φ E R_n(H(Z_1^n)) + B √(2 log(1/δ)/n)
(by the contraction principle of Theorem 3.3)
= A_n(f_n) + 2 L_φ E R_n(F(X_1^n)) + B √(2 log(1/δ)/n).

4.1.1. Weighted voting schemes

In many applications, such as boosting and bagging, classifiers are combined by weighted voting schemes, which means that the classification rule is obtained by means of functions f from a class

F_λ = { f(x) = Σ_{j=1}^N c_j g_j(x) : N ∈ N, Σ_{j=1}^N |c_j| ≤ λ, g_1, …, g_N ∈ C }   (7)

where C is a class of base classifiers, that is, functions defined on X, taking values in {−1, 1}. A classifier of this form may be thought of as one that, upon observing x, takes a weighted vote of the classifiers g_1, …, g_N (using the weights c_1, …, c_N) and decides according to the weighted majority. In this case, by (5) and (6) we have

R_n(F_λ(X_1^n)) ≤ λ R_n(C(X_1^n)) ≤ λ √( 2 V_C log(n + 1) / n ),

where V_C is the vc dimension of the base class.

To understand the richness of classes formed by weighted averages of classifiers from a base class, just consider the simple one-dimensional example in which the base class C contains all classifiers of the form g(x) = 2 · 1_{x≥a} − 1, a ∈ R. Then V_C = 1 and the closure of F_λ (under the L_∞ norm) is the set of all functions of total variation bounded by 2λ. Thus, F_λ is rich in the sense that any classifier may be approximated by classifiers associated with the functions in F_λ. In particular, the vc dimension of the class of all classifiers induced by functions in F_λ is infinite. For such large classes of classifiers it is impossible to guarantee that L(f_n) exceeds the minimal risk in the class by something of the order of n^{−1/2} (see Section 5.5). However, L(f_n) may be made as small as the minimum of the cost functional A(f) over the class plus O(n^{−1/2}). Summarizing, we have obtained that if F_λ is of the form indicated above, then for any function f_n chosen from F_λ in a data-based manner, the probability of error of the associated classifier satisfies, with probability at least 1 − δ,

L(f_n) ≤ A_n(f_n) + 2 L_φ λ √( 2 V_C log(n + 1) / n ) + B √( 2 log(1/δ) / n ).   (8)

The remarkable fact about this inequality is that the upper bound only involves the vc dimension of the class C of base classifiers, which is typically small. The price we pay is that the first term on the right-hand side is the empirical cost functional instead of the empirical probability of error.

As a first illustration, consider the example when γ is a fixed positive parameter and

φ(x) = 0 if x ≤ −γ,  φ(x) = 1 if x > 0,  and φ(x) = 1 + x/γ otherwise.

In this case B = 1 and L_φ = 1/γ. Notice also that 1_{x>0} ≤ φ(x) ≤ 1_{x>−γ}, and therefore A_n(f) ≤ L_n^γ(f), where L_n^γ(f) is the so-called margin error defined by

L_n^γ(f) = (1/n) Σ_{i=1}^n 1_{f(X_i)Y_i < γ}.

Notice that for all γ > 0, L_n^γ(f) ≥ L_n(f), and that L_n^γ(f) is increasing in γ. An interpretation of the margin error L_n^γ(f) is that it counts, apart from the number of misclassified pairs (X_i, Y_i), also those which are well classified but only with a small confidence (or "margin") by f. Thus, (8) implies the following margin-based bound for the risk:

Corollary 4.2. For any γ > 0, with probability at least 1 − δ,

L(f_n) ≤ L_n^γ(f_n) + (2λ/γ) √( 2 V_C log(n + 1) / n ) + √( 2 log(1/δ) / n ).   (9)

Notice that, as γ grows, the first term of the sum increases, while the second decreases. The bound can be very useful whenever a classifier has a small margin error for a relatively large γ (i.e., if the classifier classifies the training data well with high "confidence"), since the second term only depends on the vc dimension of the small base class C. This result has been used to explain the good behavior of some voting methods such as AdaBoost, since these methods have a tendency to find classifiers that classify the data points well with a large margin.

4.1.2. Kernel methods

Another popular way to obtain classification rules from a class of real-valued functions, used in kernel methods such as Support Vector Machines (SVM) or Kernel Fisher Discriminant (KFD), is to consider balls of a reproducing kernel Hilbert space. The basic idea is to use a positive definite kernel function k : X × X → R, that is, a symmetric function satisfying

Σ_{i,j=1}^n α_i α_j k(x_i, x_j) ≥ 0

for all choices of n, α_1, …, α_n ∈ R and x_1, …, x_n ∈ X. Such a function naturally generates a space of functions of the form

F = { f(·) = Σ_{i=1}^n α_i k(x_i, ·) : n ∈ N, α_i ∈ R, x_i ∈ X },

which, with the inner product ⟨Σ_i α_i k(x_i, ·), Σ_j β_j k(x_j, ·)⟩ := Σ_{i,j} α_i β_j k(x_i, x_j), can be completed into a Hilbert space. The key property is that for all x_1, x_2 ∈ X there exist elements f_{x_1}, f_{x_2} ∈ F such that k(x_1, x_2) = ⟨f_{x_1}, f_{x_2}⟩. This means that any linear algorithm based on computing inner products can be extended into a non-linear version by replacing the inner products by a kernel function. The advantage is that even though the algorithm remains of low complexity, it works in a class of functions that can potentially represent any continuous function arbitrarily well (provided k is chosen appropriately).
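The defining property of a positive definite kernel is easy to test numerically. The sketch below (an illustration with an arbitrarily chosen Gaussian kernel on the real line, not taken from the survey) checks symmetry and the nonnegativity of the quadratic form Σ_{i,j} α_i α_j k(x_i, x_j) on random inputs.

```python
import math
import random

def k(x1, x2):
    # Gaussian (RBF) kernel on the real line; positive definite for any bandwidth.
    return math.exp(-(x1 - x2) ** 2 / 2.0)

random.seed(2)
n = 15
xs = [random.uniform(-3, 3) for _ in range(n)]
gram = [[k(xs[i], xs[j]) for j in range(n)] for i in range(n)]

# Symmetry: k(x, x') = k(x', x).
assert all(abs(gram[i][j] - gram[j][i]) < 1e-12 for i in range(n) for j in range(n))

# Quadratic form sum_{i,j} alpha_i alpha_j k(x_i, x_j) >= 0 for random alphas
# (tiny negative tolerance for floating-point round-off).
for _ in range(100):
    alpha = [random.uniform(-1, 1) for _ in range(n)]
    quad = sum(alpha[i] * alpha[j] * gram[i][j] for i in range(n) for j in range(n))
    assert quad >= -1e-9
```

The quadratic form equals ∥Σ_i α_i k(x_i, ·)∥² in the generated Hilbert space, which is exactly why it can never be negative.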
Algorithms working with kernels usually perform minimization of a cost functional on a ball of the associated reproducing kernel Hilbert space, of the form

F_λ = { f(x) = Σ_{j=1}^N c_j k(x_j, x) : N ∈ N, Σ_{i,j=1}^N c_i c_j k(x_i, x_j) ≤ λ², x_1, …, x_N ∈ X }.   (10)

Notice that, in contrast with (7), where the constraint is of ℓ_1 type, the constraint here is of ℓ_2 type. Also, the basis functions, instead of being chosen from a fixed class, are determined by elements of X themselves.

An important property of functions in the reproducing kernel Hilbert space associated with k is that for all x ∈ X,

f(x) = ⟨f, k(x, ·)⟩.

This is called the reproducing property. The reproducing property may be used to estimate precisely the Rademacher average of F_λ. Indeed, denoting by E_σ expectation with respect to the Rademacher variables σ_1, …, σ_n, we have

R_n(F_λ(X_1^n)) = (1/n) E_σ sup_{∥f∥≤λ} Σ_{i=1}^n σ_i f(X_i) = (1/n) E_σ sup_{∥f∥≤λ} ⟨f, Σ_{i=1}^n σ_i k(X_i, ·)⟩ = (λ/n) E_σ ∥ Σ_{i=1}^n σ_i k(X_i, ·) ∥

by the Cauchy-Schwarz inequality, where ∥·∥ denotes the norm in the reproducing kernel Hilbert space. The Kahane-Khinchine inequality states that for any vectors a_1, …, a_n in a Hilbert space,

(1/2) E ∥ Σ_{i=1}^n σ_i a_i ∥² ≤ ( E ∥ Σ_{i=1}^n σ_i a_i ∥ )² ≤ E ∥ Σ_{i=1}^n σ_i a_i ∥².

It is also easy to see that

E ∥ Σ_{i=1}^n σ_i a_i ∥² = E Σ_{i,j=1}^n σ_i σ_j ⟨a_i, a_j⟩ = Σ_{i=1}^n ∥a_i∥²,

so we obtain

(λ/(√2 n)) √( Σ_{i=1}^n k(X_i, X_i) ) ≤ R_n(F_λ(X_1^n)) ≤ (λ/n) √( Σ_{i=1}^n k(X_i, X_i) ).

This is very nice as it gives a bound that can be computed very easily from the data. A reasoning similar to the one leading to (9), using the bounded differences inequality to replace the Rademacher average by its empirical version, gives the following.

Corollary 4.3. Let f_n be any function chosen from the ball F_λ. Then, with probability at least 1 − δ,

L(f_n) ≤ L_n^γ(f_n) + (2λ/(γn)) √( Σ_{i=1}^n k(X_i, X_i) ) + √( 2 log(2/δ) / n ).
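The sandwich bound above can be checked against a direct Monte Carlo computation. By the reproducing property and Cauchy-Schwarz, for fixed signs sup_{∥f∥≤λ} (1/n) Σ_i σ_i f(X_i) = (λ/n) (Σ_{i,j} σ_i σ_j k(X_i, X_j))^{1/2}, so only the average over σ needs to be simulated. The sketch below uses toy data and a Gaussian kernel chosen purely for illustration.

```python
import math
import random

def k(x1, x2):
    # Gaussian kernel; k(x, x) = 1, so trace of the Gram matrix equals n.
    return math.exp(-(x1 - x2) ** 2 / 2.0)

random.seed(3)
n, lam = 25, 2.0
xs = [random.uniform(-2, 2) for _ in range(n)]
gram = [[k(xs[i], xs[j]) for j in range(n)] for i in range(n)]
trace = sum(gram[i][i] for i in range(n))   # sum_i k(X_i, X_i)

# Monte Carlo average over Rademacher signs of (lam/n) * sqrt(sigma^T K sigma).
reps = 3000
acc = 0.0
for _ in range(reps):
    s = [random.choice((-1, 1)) for _ in range(n)]
    quad = sum(s[i] * s[j] * gram[i][j] for i in range(n) for j in range(n))
    acc += lam / n * math.sqrt(max(quad, 0.0))  # quad >= 0 up to round-off
R_n = acc / reps

# The deterministic two-sided bound from the Kahane-Khinchine argument.
lower = lam / (math.sqrt(2) * n) * math.sqrt(trace)
upper = lam / n * math.sqrt(trace)
assert lower <= R_n <= upper
```

The upper end follows from Jensen's inequality, the lower end from Kahane-Khinchine, and with a few thousand sign draws the Monte Carlo estimate sits comfortably between the two.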
4.2. Convex cost functionals

Next we show that a proper choice of the cost function φ has further advantages. To this end, we consider nonnegative convex nondecreasing cost functions with lim_{x→−∞} φ(x) = 0 and φ(0) = 1. Main examples of φ include the exponential cost function φ(x) = e^x used in AdaBoost and related boosting algorithms, the logit cost function φ(x) = log_2(1 + e^x), and the hinge loss (or soft margin loss) φ(x) = (1 + x)_+ used in support vector machines. One of the main advantages of using convex cost functions is that minimizing the empirical cost A_n(f) often becomes a convex optimization problem and is therefore computationally feasible. In fact, most boosting and support vector machine classifiers may be viewed as empirical minimizers of a convex cost functional.

However, minimizing convex cost functionals has other theoretical advantages. To understand this, assume, in addition to the above, that φ is strictly convex and differentiable. Then it is easy to determine the function f* minimizing the cost functional A(f) = E φ(−Y f(X)). Just note that for each x ∈ X,

E[ φ(−Y f(X)) | X = x ] = η(x) φ(−f(x)) + (1 − η(x)) φ(f(x)),

and therefore the function f* is given by

f*(x) = argmin_α h_{η(x)}(α),

where for each η ∈ [0, 1], h_η(α) = η φ(−α) + (1 − η) φ(α). Note that h_η is strictly convex and therefore f* is well defined (though it may take values ±∞ if η equals 0 or 1). Assuming that h_η is differentiable, the minimum is achieved for the value of α for which h_η'(α) = 0, that is, when

η/(1 − η) = φ'(α)/φ'(−α).

Since φ' is strictly increasing, we see that the solution is positive if and only if η > 1/2. This reveals the important fact that the minimizer f* of the functional A(f) is such that the corresponding classifier g*(x) = 2 · 1_{f*(x)≥0} − 1 is just the Bayes classifier. Thus, minimizing a convex cost functional leads to an optimal classifier. For example, if φ(x) = e^x is the exponential cost function, then f*(x) = (1/2) log(η(x)/(1 − η(x))). In the case of the logit cost φ(x) = log_2(1 + e^x), we have f*(x) = log(η(x)/(1 − η(x))). We note here that, even though the hinge loss φ(x) = (1 + x)_+ does not satisfy the conditions for φ used above (e.g., it is not strictly convex), it is easy to see that the function f* minimizing the cost functional equals

f*(x) = 1 if η(x) > 1/2, and f*(x) = −1 if η(x) < 1/2.

Thus, in this case f* not only induces the Bayes classifier but is equal to it.

To obtain inequalities for the probability of error of classifiers based on minimization of empirical cost functionals, we need to establish a relationship between the excess probability of error L(f) − L* and the corresponding excess cost functional A(f) − A*, where A* = A(f*) = inf_f A(f). Here we recall a simple inequality of Zhang [244], which states that if the function H : [0, 1] → R is defined by H(η) = inf_α h_η(α), and the cost function φ is such that for some positive constants s ≥ 1 and c,

|η − 1/2|^s ≤ c^s (1 − H(η)),  η ∈ [0, 1],

then for any function f : X → R,

L(f) − L* ≤ 2c (A(f) − A*)^{1/s}.   (11)
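For the exponential cost φ(x) = e^x one has h_η(α) = η e^{−α} + (1 − η) e^{α}, minimized at α = (1/2) log(η/(1 − η)) with value H(η) = 2√(η(1 − η)). The sketch below numerically confirms, on a grid of values of η, that Zhang's condition holds with s = 2 and c = 1/√2, and checks the closed-form minimizer against a brute-force minimum (the grids and tolerances are arbitrary choices for the check).

```python
import math

def H(eta):
    # inf_alpha h_eta(alpha) for the exponential cost phi(x) = e^x.
    return 2.0 * math.sqrt(eta * (1.0 - eta))

s, c = 2, 1.0 / math.sqrt(2.0)

for i in range(1001):
    eta = i / 1000.0
    # Zhang's condition: |eta - 1/2|^s <= c^s (1 - H(eta)).
    assert abs(eta - 0.5) ** s <= c ** s * (1.0 - H(eta)) + 1e-12

# Check the closed-form minimizer of h_eta against a brute-force grid minimum.
eta = 0.7
h = lambda a: eta * math.exp(-a) + (1 - eta) * math.exp(a)
a_star = 0.5 * math.log(eta / (1 - eta))
brute = min(h(a / 100.0) for a in range(-500, 501))
assert h(a_star) <= brute + 1e-6
assert abs(h(a_star) - H(eta)) < 1e-12
```

Writing a = √η and b = √(1 − η), the condition reduces to (a − b)²(a + b)²/4 ≤ (a − b)²/2, which holds since (a + b)² ≤ 2; this is why the grid check passes with room to spare away from η = 1/2.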
(The simple proof of this inequality is based on the expression (1) and elementary convexity properties of $h_\eta$.) In the special cases of the exponential and logit cost functions, $H(\eta)=2\sqrt{\eta(1-\eta)}$ and $H(\eta)=-\eta\log_2\eta-(1-\eta)\log_2(1-\eta)$, respectively. In both cases it is easy to see that the condition above is satisfied with $s=2$ and $c=1/\sqrt2$.

Theorem 4.4 (excess risk of convex risk minimizers). Assume that $f_n$ is chosen from a class $\mathcal F_\lambda$ defined in (7) by minimizing the empirical cost functional $A_n(f)$, using either the exponential or the logit cost function. Then, with probability at least $1-\delta$,
$$L(f_n)-L^*\le 2\left(2L_\phi\lambda\sqrt{\frac{2V_{\mathcal C}\log(n+1)}{n}}+B\sqrt{\frac{2\log\frac2\delta}{n}}\right)^{1/2}+\left(2\left(\inf_{f\in\mathcal F_\lambda}A(f)-A^*\right)\right)^{1/2}.$$

Proof.
$$L(f_n)-L^*\le 2\left(A(f_n)-A^*\right)^{1/2}\le\left(2\left(A(f_n)-\inf_{f\in\mathcal F_\lambda}A(f)\right)\right)^{1/2}+\left(2\left(\inf_{f\in\mathcal F_\lambda}A(f)-A^*\right)\right)^{1/2}$$
$$\le\left(4\sup_{f\in\mathcal F_\lambda}|A(f)-A_n(f)|\right)^{1/2}+\left(2\left(\inf_{f\in\mathcal F_\lambda}A(f)-A^*\right)\right)^{1/2}\qquad\text{(just like in (2))}$$
$$\le 2\left(2L_\phi\lambda\sqrt{\frac{2V_{\mathcal C}\log(n+1)}{n}}+B\sqrt{\frac{2\log\frac2\delta}{n}}\right)^{1/2}+\left(2\left(\inf_{f\in\mathcal F_\lambda}A(f)-A^*\right)\right)^{1/2}$$
with probability at least $1-\delta$, where at the last step we used the same bound for $\sup_{f\in\mathcal F_\lambda}|A(f)-A_n(f)|$ as in (8).

Note that for the exponential cost function $L_\phi=B=e^\lambda$, while for the logit cost function $L_\phi\le1$ and $B\le\lambda$. In both cases, if there exists a λ sufficiently large that $\inf_{f\in\mathcal F_\lambda}A(f)=A^*$, then the approximation error disappears and we obtain $L(f_n)-L^*=O(n^{-1/4})$. The fact that the exponent in the rate of convergence is dimension-free is remarkable. (We note here that these rates may be further improved by applying the refined techniques summarized in Section 5.3; see also [40].)

It is an interesting approximation-theoretic challenge to understand what kind of functions $f$ may be obtained as a convex combination of base classifiers and, more generally, to describe approximation properties of classes of functions of the form (7). Next we describe a simple example in which the above-mentioned approximation properties are well understood. Consider the case when $\mathcal X=[0,1]^d$ and the base class $\mathcal C$ contains all decision stumps, that is, all classifiers of the form $s^+_{i,t}(x)=1_{x^{(i)}\ge t}-1_{x^{(i)}<t}$ and $s^-_{i,t}(x)=1_{x^{(i)}<t}-1_{x^{(i)}\ge t}$, $t\in[0,1]$, $i$
$=1,\dots,d$, where $x^{(i)}$ denotes the $i$-th coordinate of $x$. In this case the VC dimension of the base class is easily seen to be bounded by $V_{\mathcal C}\le 2\log_2(2d)$. Also, it is easy to see that the closure of $\mathcal F_\lambda$ with respect to the supremum norm contains all functions $f$ of the form
$$f(x)=f_1(x^{(1)})+\dots+f_d(x^{(d)}),$$
where the functions $f_i:[0,1]\to\mathbb R$ are such that $\|f_1\|_{TV}+\dots+\|f_d\|_{TV}\le\lambda$, and $\|f_i\|_{TV}$ denotes the total variation of the function $f_i$. Therefore, if $f^*$ has the above form, we have $\inf_{f\in\mathcal F_\lambda}A(f)=A(f^*)$. Recalling that the function $f^*$ optimizing the cost $A(f)$ has the form $f^*(x)=\frac12\log\frac{\eta(x)}{1-\eta(x)}$
in the case of the exponential cost function, and $f^*(x)=\log\frac{\eta(x)}{1-\eta(x)}$ in the case of the logit cost function, we see that boosting using decision stumps is especially well fitted to the so-called additive logistic model, in which η is assumed to be such that $\log(\eta/(1-\eta))$ is an additive function (i.e., it can be written as a sum of univariate functions of the components of $x$). Thus, when η admits an additive logistic representation, the rate of convergence of the classifier is fast and has a very mild dependence on the dimension.

Consider next the case of the hinge loss $\phi(x)=(1+x)_+$, often used in support vector machines and related kernel methods. In this case $H(\eta)=2\min(\eta,1-\eta)$, and therefore inequality (11) holds with $c=1/2$ and $s=1$. Thus, $L(f_n)-L^*\le A(f_n)-A^*$, and the analysis above leads to even better rates of convergence. However, in this case $f^*(x)=2\cdot1_{\eta(x)\ge1/2}-1$, and approximating this function by weighted sums of base functions may be more difficult than in the case of the exponential and logit costs. Once again, the approximation-theoretic part of the problem is far from being well understood, and it is difficult to give recommendations about which cost function is more advantageous and which base classes should be used.

Bibliographical remarks. For results on the algorithmic difficulty of empirical risk minimization, see Johnson and Preparata [112], Vu [236], Bartlett and Ben-David [26], Ben-David, Eiron, and Simon [32]. Boosting algorithms were originally introduced by Freund and Schapire (see [91], [94], and [190]) as adaptive aggregation of simple classifiers contained in a small base class. The analysis based on the observation that AdaBoost and related methods tend to produce large-margin classifiers appears in Schapire, Freund, Bartlett, and Lee [191], and Koltchinskii and Panchenko [127]. It was Breiman [51] who observed that boosting performs gradient descent optimization of an empirical cost function different from the number of misclassified samples; see also Mason, Baxter, Bartlett, and Frean [157], Collins, Schapire, and Singer [61],
Friedman, Hastie, and Tibshirani [95]. Based on this view, various versions of boosting algorithms have been shown to be consistent in different settings; see Breiman [52], Bühlmann and Yu [54], Blanchard, Lugosi, and Vayatis [40], Jiang [111], Lugosi and Vayatis [146], Mannor and Meir [152], Mannor, Meir, and Zhang [153], Zhang [244]. Inequality (8) was first obtained by Schapire, Freund, Bartlett, and Lee [191]. The analysis presented here is due to Koltchinskii and Panchenko [127]. Other classifiers based on weighted voting schemes have been considered by Catoni [57-59], Yang [241], Freund, Mansour, and Schapire [93]. Kernel methods were pioneered by Aizerman, Braverman, and Rozonoer [2-5], Vapnik and Lerner [228], Bashkirov, Braverman, and Muchnik [31], Vapnik and Chervonenkis [233], and Specht [203]. Support vector machines originate in the pioneering work of Boser, Guyon, and Vapnik [43], Cortes and Vapnik [62]. For surveys we refer to Cristianini and Shawe-Taylor [65], Smola, Bartlett, Schölkopf, and Schuurmans [201], Hastie, Tibshirani, and Friedman [104], Schölkopf and Smola [192]. The study of universal approximation properties of kernels and statistical consistency of support vector machines is due to Steinwart [ ], Lin [140, 141], Zhou [245], and Blanchard, Bousquet, and Massart [39]. We have considered the case of minimization of a loss function on a ball of the reproducing kernel Hilbert space. However, it is computationally more convenient to formulate the problem as the minimization of a regularized functional of the form
$$\min_{f\in\mathcal F}\ \frac1n\sum_{i=1}^n\phi(-Y_if(X_i))+\lambda\|f\|^2.$$
The standard support vector machine algorithm then corresponds to the choice $\phi(x)=(1+x)_+$. Kernel-based regularization algorithms were studied by Kimeldorf and Wahba [120] and Craven and Wahba [64] in the context of regression. Relationships between support vector machines and regularization were described
by Smola, Schölkopf, and Müller [202] and Evgeniou, Pontil, and Poggio [89]. General properties of regularized algorithms in reproducing kernel Hilbert spaces are investigated by Cucker and Smale [68], Steinwart [206], Zhang [244]. Various properties of the support vector machine algorithm are investigated by Vapnik [230, 231], Schölkopf and Smola [192], Scovel and Steinwart [195], and Steinwart [208, 209]. The fact that minimizing an exponential cost functional leads to the Bayes classifier was pointed out by Breiman [52]; see also Lugosi and Vayatis [146], Zhang [244]. For a comprehensive theory of the connection between cost functions and probability of misclassification, see Bartlett, Jordan, and McAuliffe [27]. Zhang's lemma (11) appears in [244]. For various generalizations and refinements we refer to Bartlett, Jordan, and McAuliffe [27] and Blanchard, Lugosi, and Vayatis [40].

5 Tighter bounds for empirical risk minimization

This section is dedicated to the description of some refinements of the ideas described in the earlier sections. What we have seen so far only used first-order properties of the functions we considered, namely their boundedness. It turns out that, by using second-order properties such as the variance of the functions, many of the above results can be made sharper.

5.1 Relative deviations

In order to understand the basic phenomenon, let us go back to the simplest case, in which one has a fixed function $f$ with values in $\{0,1\}$. In this case, $P_nf$ is an average of independent Bernoulli random variables with parameter $p=Pf$. Recall that, as a simple consequence of (3), with probability at least $1-\delta$,
$$Pf-P_nf\le\sqrt{\frac{2\log\frac1\delta}{n}}.\qquad(12)$$
This is basically tight when $Pf=1/2$, but can be significantly improved when $Pf$ is small. Indeed, Bernstein's inequality gives, with probability at least $1-\delta$,
$$Pf-P_nf\le\sqrt{\frac{2\,\mathrm{Var}(f)\log\frac1\delta}{n}}+\frac{2\log\frac1\delta}{3n}.\qquad(13)$$
Since $f$ takes its values in $\{0,1\}$, $\mathrm{Var}(f)=Pf(1-Pf)\le Pf$, which shows that when $Pf$ is small, (13) is much better than (12).

5.1.1 General inequalities

Next we exploit the phenomenon described above to obtain sharper
performance bounds for empirical risk minimization. Note that if we consider the difference $Pf-P_nf$ uniformly over the class $\mathcal F$, the largest deviations are obtained by functions with a large variance (i.e., with $Pf$ close to 1/2). The idea is to scale each function by dividing it by $\sqrt{Pf}$, so that they all behave in a similar way. Thus, we bound the quantity
$$\sup_{f\in\mathcal F}\frac{Pf-P_nf}{\sqrt{Pf}}.$$
The first step consists in symmetrization of the tail probabilities: if $nt^2\ge2$,
$$\mathbb P\left\{\sup_{f\in\mathcal F}\frac{Pf-P_nf}{\sqrt{Pf}}\ge t\right\}\le 2\,\mathbb P\left\{\sup_{f\in\mathcal F}\frac{P'_nf-P_nf}{\sqrt{(P_nf+P'_nf)/2}}\ge t\right\}$$
(where $P'_n$ denotes the empirical measure of an independent ghost sample $X'_1,\dots,X'_n$).
Next we introduce Rademacher random variables, obtaining, by simple symmetrization,
$$2\,\mathbb P\left\{\sup_{f\in\mathcal F}\frac{P'_nf-P_nf}{\sqrt{(P_nf+P'_nf)/2}}\ge t\right\}=2\,\mathbb E\left[\mathbb P_\sigma\left\{\sup_{f\in\mathcal F}\frac{\frac1n\sum_{i=1}^n\sigma_i(f(X'_i)-f(X_i))}{\sqrt{(P_nf+P'_nf)/2}}\ge t\right\}\right]$$
(where $\mathbb P_\sigma$ is the conditional probability given the $X_i$ and $X'_i$). The last step uses tail bounds for individual functions and a union bound over $\mathcal F(X_1^{2n})$, where $X_1^{2n}$ denotes the union of the initial sample $X_1^n$ and of the extra symmetrization sample $X'_1,\dots,X'_n$. Summarizing, we obtain the following inequalities:

Theorem 5.1. Let $\mathcal F$ be a class of functions taking binary values in $\{0,1\}$. For any $\delta\in(0,1)$, with probability at least $1-\delta$, all $f\in\mathcal F$ satisfy
$$\frac{Pf-P_nf}{\sqrt{Pf}}\le2\sqrt{\frac{\log S_{\mathcal F}(X_1^{2n})+\log\frac4\delta}{n}}.$$
Also, with probability at least $1-\delta$, for all $f\in\mathcal F$,
$$\frac{P_nf-Pf}{\sqrt{P_nf}}\le2\sqrt{\frac{\log S_{\mathcal F}(X_1^{2n})+\log\frac4\delta}{n}}.$$
As a consequence, for all $s>0$, with probability at least $1-\delta$,
$$\sup_{f\in\mathcal F}\frac{Pf-P_nf}{Pf+P_nf+s/2}\le2\sqrt{\frac{\log S_{\mathcal F}(X_1^{2n})+\log\frac4\delta}{sn}},\qquad(14)$$
and the same is true if $P$ and $P_n$ are permuted. Another consequence of Theorem 5.1 with interesting applications is the following: for all $t\in(0,1]$, with probability at least $1-\delta$,
$$\forall f\in\mathcal F,\quad P_nf\le(1-t)Pf\ \text{ implies }\ Pf\le\frac{4\left(\log S_{\mathcal F}(X_1^{2n})+\log\frac4\delta\right)}{t^2n}.\qquad(15)$$
In particular, setting $t=1$,
$$\forall f\in\mathcal F,\quad P_nf=0\ \text{ implies }\ Pf\le\frac{4\left(\log S_{\mathcal F}(X_1^{2n})+\log\frac4\delta\right)}{n}.$$

5.1.2 Applications to empirical risk minimization

It is easy to see that, for non-negative numbers $A,B,C\ge0$, the fact that $A\le B\sqrt A+C$ entails $A\le B^2+B\sqrt C+C$, so we obtain from the second inequality of Theorem 5.1 that, with probability at least $1-\delta$, for all $f\in\mathcal F$,
$$Pf\le P_nf+2\sqrt{P_nf\,\frac{\log S_{\mathcal F}(X_1^{2n})+\log\frac4\delta}{n}}+4\,\frac{\log S_{\mathcal F}(X_1^{2n})+\log\frac4\delta}{n}.$$

Corollary 5.2. Let $g_n$ be the empirical risk minimizer in a class $\mathcal C$ of VC dimension $V$. Then, with probability at least $1-\delta$,
$$L(g_n)\le L_n(g_n)+2\sqrt{L_n(g_n)\,\frac{2V\log(n+1)+\log\frac4\delta}{n}}+4\,\frac{2V\log(n+1)+\log\frac4\delta}{n}.$$
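To see numerically how the relative-deviation bound of Corollary 5.2 interpolates between the slow and fast regimes, one can evaluate the gap term for a few values of the empirical risk. The following Python sketch uses the constants of Corollary 5.2 as written above; the sample values of $n$, $V$, and $L_n(g_n)$ are invented for illustration:

```python
import math

def excess_bound(L_emp, n, V, delta=0.05):
    """Gap L(g_n) - L_n(g_n) guaranteed by Corollary 5.2 (illustrative use)."""
    E = (2 * V * math.log(n + 1) + math.log(4 / delta)) / n
    return 2 * math.sqrt(L_emp * E) + 4 * E

n, V = 10000, 10
for L_emp in (0.0, 0.01, 0.5):
    print(L_emp, excess_bound(L_emp, n, V))
```

When $L_n(g_n)=0$ the square-root term vanishes and the gap is of order $V\log n/n$, while for $L_n(g_n)$ bounded away from zero it is of order $\sqrt{V\log n/n}$, in line with the discussion that follows.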
Consider first the extreme situation when there exists a classifier in $\mathcal C$ which classifies without error. This means that $Y=g(X)$ with probability one for some $g\in\mathcal C$. This is clearly a quite restrictive assumption, only satisfied in very special cases. Nevertheless, the assumption that $\inf_{g\in\mathcal C}L(g)=0$ has been commonly used in computational learning theory, perhaps because of its mathematical simplicity. In such a case, clearly $L_n(g_n)=0$, so we get, with probability at least $1-\delta$,
$$L(g_n)-\inf_{g\in\mathcal C}L(g)\le4\,\frac{2V\log(n+1)+\log\frac4\delta}{n}.\qquad(16)$$
The main point here is that the upper bound obtained in this special case is of smaller order of magnitude ($O(V\log n/n)$) than in the general case ($O(\sqrt{V\log n/n})$). One can actually obtain a version which interpolates between these two cases, as follows. For simplicity, assume that there is a classifier $g^*$ in $\mathcal C$ such that $L(g^*)=\inf_{g\in\mathcal C}L(g)$. Then we have
$$L_n(g_n)\le L_n(g^*)=L_n(g^*)-L(g^*)+L(g^*).$$
Using Bernstein's inequality, we get, with probability at least $1-\delta$,
$$L_n(g^*)-L(g^*)\le\sqrt{\frac{2L(g^*)\log\frac1\delta}{n}}+\frac{2\log\frac1\delta}{3n},$$
which, together with Corollary 5.2, yields:

Corollary 5.3. There exists a constant $C$ such that, with probability at least $1-\delta$,
$$L(g_n)-\inf_{g\in\mathcal C}L(g)\le C\left(\sqrt{\inf_{g\in\mathcal C}L(g)\,\frac{V\log n+\log\frac1\delta}{n}}+\frac{V\log n+\log\frac1\delta}{n}\right).$$

5.2 Noise and fast rates

We have seen that, in the case where $f$ takes values in $\{0,1\}$, there is a nice relationship between the variance of $f$ (which controls the size of the deviations between $P_nf$ and $Pf$) and its expectation, namely $\mathrm{Var}(f)\le Pf$. This is the key property that allows one to obtain faster rates of convergence for $L(g_n)-\inf_{g\in\mathcal C}L(g)$. In particular, in the ideal situation mentioned above, when $\inf_{g\in\mathcal C}L(g)=0$, the difference $L(g_n)-\inf_{g\in\mathcal C}L(g)$ may be much smaller than the worst-case difference $\sup_{g\in\mathcal C}(L(g)-L_n(g))$. This actually happens in many cases, whenever the distribution satisfies certain conditions. Next we describe such conditions and show how the finer bounds can be derived. The main idea is that, in order to get precise rates for $L(g_n)-\inf_{g\in\mathcal C}L(g)$, we consider functions of the form
$1_{g(X)\ne Y}-1_{g^*(X)\ne Y}$, where $g^*$ is a classifier minimizing the loss in the class $\mathcal C$, that is, such that $L(g^*)=\inf_{g\in\mathcal C}L(g)$. Note that functions of this form are no longer non-negative. To illustrate the basic ideas in the simplest possible setting, consider the case when the loss class $\mathcal F$ is a finite set of $N$ functions of the form $1_{g(X)\ne Y}-1_{g^*(X)\ne Y}$. In addition, we assume that there is a relationship between the variance and the expectation of the functions in $\mathcal F$, given by the inequality
$$\mathrm{Var}(f)\le\left(\frac{Pf}{h}\right)^\alpha\qquad(17)$$
for some $h>0$ and $\alpha\in(0,1]$. By Bernstein's inequality and a union bound over the elements of $\mathcal C$, we have that, with probability at least $1-\delta$, for all $f\in\mathcal F$,
$$Pf\le P_nf+\sqrt{\frac{2(Pf/h)^\alpha\log\frac N\delta}{n}}+\frac{4\log\frac N\delta}{3n}.$$
As a consequence, using the fact that $P_nf_n=L_n(g_n)-L_n(g^*)\le0$, we have, with probability at least $1-\delta$,
$$L(g_n)-L(g^*)\le\sqrt{\frac{2\left((L(g_n)-L(g^*))/h\right)^\alpha\log\frac N\delta}{n}}+\frac{4\log\frac N\delta}{3n}.$$
Solving this inequality for $L(g_n)-L(g^*)$ finally gives that, with probability at least $1-\delta$,
$$L(g_n)-\inf_{g\in\mathcal C}L(g)\le\left(\frac{2\log\frac N\delta}{h^\alpha n}\right)^{\frac1{2-\alpha}}.\qquad(18)$$
Note that the obtained rate is faster than $n^{-1/2}$ whenever $\alpha>0$. In particular, for $\alpha=1$ we get $n^{-1}$, as in the ideal case.

It now remains to show whether (17) is a reasonable assumption. As the simplest possible example, assume that the Bayes classifier belongs to the class $\mathcal C$ (so that $g^*$ is the Bayes classifier) and that the a posteriori probability function η is bounded away from 1/2; that is, there exists a positive constant $h$ such that for all $x\in\mathcal X$, $|2\eta(x)-1|>h$. Note that the assumption that the Bayes classifier lies in $\mathcal C$ is very restrictive and is unlikely to be satisfied in practice, especially if the class $\mathcal C$ is finite, as is assumed in this discussion. The assumption that η is bounded away from 1/2 may also appear quite specific. However, the situation described here may serve as a first illustration of a nontrivial example in which fast rates can be achieved. Since $|1_{g(X)\ne Y}-1_{g^*(X)\ne Y}|\le1_{g(X)\ne g^*(X)}$, the conditions stated above and (1) imply that
$$\mathrm{Var}(f)\le\mathbb E\left[1_{g(X)\ne g^*(X)}\right]\le\frac1h\,\mathbb E\left[|2\eta(X)-1|\,1_{g(X)\ne g^*(X)}\right]=\frac1h\left(L(g)-L^*\right).$$
Thus (17) holds with $\alpha=1$, which shows that, with probability at least $1-\delta$,
$$L(g_n)-L^*\le\frac{C\log\frac N\delta}{hn}.\qquad(19)$$
Thus, the empirical risk minimizer has a significantly better performance than predicted by the results of the previous section whenever the Bayes classifier is in the class $\mathcal C$ and the a posteriori probability η stays away from 1/2. The behavior of η in the vicinity of 1/2 has long been known to play an important role in the difficulty of the classification problem; see [72, 239, 240]. Roughly speaking, if η has a complex behavior around the critical
threshold 1/2, then one cannot avoid estimating η, which is a typically difficult nonparametric regression problem. However, the classification problem is significantly easier than regression if η is far from 1/2 with large probability. The condition that η be bounded away from 1/2 may be significantly relaxed and generalized. Indeed, in the context of discriminant analysis, Mammen and Tsybakov [151] and Tsybakov [221] formulated a useful condition that has since been adopted by many authors. Let $\alpha\in[0,1)$. Then the Mammen-Tsybakov condition may
be stated in any of the following three equivalent forms:
(1) $\exists\beta>0$ such that for every $g:\mathcal X\to\{0,1\}$, $\mathbb E\left[1_{g(X)\ne g^*(X)}\right]\le\beta\left(L(g)-L^*\right)^\alpha$;
(2) $\exists c>0$ such that for every measurable set $A\subset\mathcal X$, $\int_A dP(x)\le c\left(\int_A|2\eta(x)-1|\,dP(x)\right)^\alpha$;
(3) $\exists B>0$ such that for all $t\ge0$, $\mathbb P\{|2\eta(X)-1|\le t\}\le Bt^{\frac\alpha{1-\alpha}}$.
We refer to this as the Mammen-Tsybakov noise condition. The proof that these statements are equivalent is straightforward and we omit it, but we comment on their meaning. Notice first that α has to be in $[0,1]$ because
$$L(g)-L^*=\mathbb E\left[|2\eta(X)-1|\,1_{g(X)\ne g^*(X)}\right]\le\mathbb E\,1_{g(X)\ne g^*(X)}.$$
Also, when $\alpha=0$ these conditions are void. The case $\alpha=1$ in (1) is realized when there exists an $s>0$ such that $|2\eta(X)-1|>s$ almost surely (which is just the extreme noise condition we considered above). The most important consequence of these conditions is that they imply a relationship between the variance and the expectation of functions of the form $1_{g(X)\ne Y}-1_{g^*(X)\ne Y}$. Indeed, we obtain
$$\mathbb E\left[\left(1_{g(X)\ne Y}-1_{g^*(X)\ne Y}\right)^2\right]\le c\left(L(g)-L^*\right)^\alpha.$$
This is thus enough to get (18) for a finite class of functions.

The sharper bounds established in this section and the next come at the price of the assumption that the Bayes classifier is in the class $\mathcal C$. Because of this, it is difficult to compare the fast rates achieved here with the slower rates proved in Section 3. On the other hand, noise conditions like the Mammen-Tsybakov condition may be used to obtain improvements even when the Bayes classifier is not contained in $\mathcal C$. In these cases the approximation error $L(g^*)-L^*$ also needs to be taken into account, and the situation becomes somewhat more complex. We return to these issues in Sections 5.3.5 and 8.

5.3 Localization

The purpose of this section is to generalize the simple argument of the previous section to more general classes $\mathcal C$ of classifiers. This generalization reveals the importance of the modulus of continuity of the empirical process as a measure of complexity of the learning problem.

5.3.1 Talagrand's inequality

One of the most important recent developments in empirical process theory is a concentration inequality for the supremum of an empirical process, first proved by Talagrand [212] and refined
later by various authors. This inequality is at the heart of many key developments in statistical learning theory. Here we recall the following version:

Theorem 5.4. Let $b>0$ and let $\mathcal F$ be a set of functions from $\mathcal X$ to $\mathbb R$. Assume that all functions in $\mathcal F$ satisfy $Pf-f\le b$. Then, with probability at least $1-\delta$, for any $\theta>0$,
$$\sup_{f\in\mathcal F}(Pf-P_nf)\le(1+\theta)\,\mathbb E\left[\sup_{f\in\mathcal F}(Pf-P_nf)\right]+\sqrt{\frac{2\sup_{f\in\mathcal F}\mathrm{Var}(f)\log\frac1\delta}{n}}+\left(1+\frac3\theta\right)\frac{b\log\frac1\delta}{3n},$$
which, for $\theta=1$, translates to
$$\sup_{f\in\mathcal F}(Pf-P_nf)\le2\,\mathbb E\left[\sup_{f\in\mathcal F}(Pf-P_nf)\right]+\sqrt{\frac{2\sup_{f\in\mathcal F}\mathrm{Var}(f)\log\frac1\delta}{n}}+\frac{4b\log\frac1\delta}{3n}.$$
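To get a feel for the concentration phenomenon behind Theorem 5.4, one can simulate the supremum of an empirical process over a small class and observe that its fluctuations are of the order $\sqrt{\sup_f\mathrm{Var}(f)/n}$. The Python sketch below uses a class of threshold indicators $f_t(x)=1_{x\le t}$ under the uniform distribution (so $Pf_t=t$); all parameters are ours, chosen purely for illustration:

```python
import random
import statistics

random.seed(0)
n, reps = 1000, 200
ts = [k / 20 for k in range(1, 20)]   # finite class f_t(x) = 1{x <= t}, with Pf_t = t

sups = []
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    # Z = sup_t (Pf_t - P_n f_t): supremum of the centered empirical process
    sups.append(max(t - sum(x <= t for x in xs) / n for t in ts))

mean_Z, sd_Z = statistics.mean(sups), statistics.pstdev(sups)
# fluctuations of Z around its mean should be of order
# sqrt(sup_f Var(f)/n) = sqrt(0.25/1000) ~ 0.016
print(mean_Z, sd_Z)
```

The observed standard deviation of $Z$ is indeed of the predicted order, while its mean reflects the complexity of the class (the expectation term in the theorem).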
5.3.2 Localization: informal argument

We first explain informally how Talagrand's inequality can be used in conjunction with noise conditions to yield improved results. Start by rewriting the inequality of Theorem 5.4: with probability at least $1-\delta$, for all $f\in\mathcal F$ with $\mathrm{Var}(f)\le r$,
$$Pf-P_nf\le2\,\mathbb E\left[\sup_{f\in\mathcal F:\mathrm{Var}(f)\le r}(Pf-P_nf)\right]+\sqrt{\frac{2r\log\frac1\delta}{n}}+\frac{C\log\frac1\delta}{n}.\qquad(20)$$
Denote the right-hand side of the above inequality by $\psi(r)$, and note that ψ is an increasing nonnegative function. Consider the class of functions $\mathcal F=\{(x,y)\mapsto1_{g(x)\ne y}-1_{g^*(x)\ne y}:g\in\mathcal C\}$ and assume that the Bayes classifier $g^*$ belongs to $\mathcal C$ and that the Mammen-Tsybakov noise condition is satisfied in the extreme case, that is, $|2\eta(x)-1|>s>0$ for all $x\in\mathcal X$, so that for all $f\in\mathcal F$, $\mathrm{Var}(f)\le\frac1sPf$. Inequality (20) thus implies that, with probability at least $1-\delta$, all $g\in\mathcal C$ satisfy
$$L(g)-L^*\le L_n(g)-L_n(g^*)+\psi\left(\frac1s\sup_{g\in\mathcal C}\left(L(g)-L^*\right)\right).$$
In particular, for the empirical risk minimizer $g_n$ we have, with probability at least $1-\delta$,
$$L(g_n)-L^*\le\psi\left(\frac1s\sup_{g\in\mathcal C}\left(L(g)-L^*\right)\right).$$
For the sake of an informal argument, assume that we somehow knew beforehand what $L(g_n)$ is. Then we could apply the above inequality to a subclass that only contains functions with error at most that of $g_n$, and we would obtain something like
$$L(g_n)-L^*\le\psi\left(\frac{L(g_n)-L^*}{s}\right).$$
This indicates that the quantity that should appear as an upper bound on $L(g_n)-L^*$ is something like $\max\{r:r\le\psi(r/s)\}$. We will see that the smallest allowable value is actually the solution of $r=\psi(r/s)$. The reason why this bound can improve the rates is that, in many situations, $\psi(r)$ is of order $\sqrt{r/n}$. In this case the solution $r^*$ of $r=\psi(r/s)$ satisfies $r^*\approx1/(sn)$, thus giving a bound of order $1/n$ for the quantity $L(g_n)-L^*$.

The argument sketched here, once made rigorous, applies to possibly infinite classes, with a complexity measure that captures the size of the empirical process in a small ball (i.e., restricted to functions with small variance). The next section offers a detailed argument.

5.3.3 Localization: rigorous argument

Let us introduce the loss class $\mathcal F=\{(x,y)\mapsto1_{g(x)\ne y}-1_{g^*(x)\ne y}:g\in\mathcal C\}$ and the star-hull of $\mathcal F$ defined by $\mathcal F^*=\{\alpha f$
$:\alpha\in[0,1],\ f\in\mathcal F\}$. Notice that for $f\in\mathcal F$ or $f\in\mathcal F^*$, $Pf\ge0$. Also, denoting by $f_n$ the function in $\mathcal F$ corresponding to the empirical risk minimizer $g_n$, we have $P_nf_n\le0$. Let $T:\mathcal F^*\to\mathbb R^+$ be a function such that for all $f\in\mathcal F^*$, $\mathrm{Var}(f)\le T^2(f)$, and such that for $\alpha\in[0,1]$, $T(\alpha f)\le\alpha T(f)$. An important example is $T(f)=\sqrt{Pf^2}$. Introduce the following two functions, which characterize the properties of the problem of interest (i.e., of the loss function, the distribution, and the class of functions). The first one is a sort of modulus of continuity of the Rademacher average indexed by the star-hull of $\mathcal F$:
$$\psi(r)=\mathbb E\,R_n\{f\in\mathcal F^*:T(f)\le r\}.$$
The second one is the modulus of continuity of the variance (or rather of its upper bound $T$) with respect to the expectation:
$$w(r)=\sup_{f\in\mathcal F^*:Pf\le r}T(f).$$
Of course, ψ and $w$ are non-negative and non-decreasing. Moreover, the maps $x\mapsto\psi(x)/x$ and $x\mapsto w(x)/x$ are non-increasing. Indeed, for $\alpha\ge1$,
$$\psi(\alpha x)=\mathbb E\,R_n\{f\in\mathcal F^*:T(f)\le\alpha x\}\le\mathbb E\,R_n\{f\in\mathcal F^*:T(f/\alpha)\le x\}\le\mathbb E\,R_n\{\alpha f:f\in\mathcal F^*,\ T(f)\le x\}=\alpha\psi(x).$$
This entails that ψ and $w$ are continuous on $(0,1]$. In the sequel we will also use $w^{-1}(x):=\max\{u:w(u)\le x\}$, so that for $r>0$ we have $w(w^{-1}(r))=r$. Notice also that $\psi(1)\le1$ and $w(1)\le1$. The analysis below uses the additional assumption that $x\mapsto w(x)/\sqrt x$ is also non-increasing. This can be enforced by substituting $w'(r)$ for $w(r)$, where $w'(r)=\sqrt r\,\sup_{r'\ge r}w(r')/\sqrt{r'}$.

The purpose of this section is to prove the following theorem, which provides sharp distribution-dependent learning rates when the Bayes classifier $g^*$ belongs to $\mathcal C$. In Section 5.3.5 an extension is proposed.

Theorem 5.5. Let $r^*(\delta)$ denote the minimum of 1 and of the solution of the fixed-point equation
$$r=4\psi(w(r))+w(r)\sqrt{\frac{2\log\frac1\delta}{n}}+\frac{8\log\frac1\delta}{3n}.$$
Let $\varepsilon^*$ denote the solution of the fixed-point equation $r=\psi(w(r))$. Then, if $g^*\in\mathcal C$, with probability at least $1-\delta$, the empirical risk minimizer $g_n$ satisfies
$$\max\left(L(g_n)-L^*,\ L_n(g_n)-L_n(g^*)\right)\le r^*(\delta),\qquad(21)$$
and
$$\max\left(L(g_n)-L^*,\ L_n(g_n)-L_n(g^*)\right)\le2\left(16\varepsilon^*+\left(\frac{2(w(\varepsilon^*))^2}{\varepsilon^*}+\frac83\right)\frac{\log\frac1\delta}{n}\right).\qquad(22)$$

Remark 5.6. Both ψ and $w$ may be replaced by convenient upper bounds. This will prove useful when deriving data-dependent estimates of these distribution-dependent risk bounds.

Remark 5.7. Inequality (22) follows from Inequality (21) by observing that $\varepsilon^*\le r^*(\delta)$ and using the fact that $x\mapsto w(x)/\sqrt x$ and $x\mapsto\psi(x)/x$ are non-increasing. This shows that $r^*(\delta)$ satisfies
$$r\le4\sqrt{r\varepsilon^*}+w(\varepsilon^*)\sqrt{\frac r{\varepsilon^*}}\sqrt{\frac{2\log\frac1\delta}{n}}+\frac{8\log\frac1\delta}{3n}.$$
Inequality (22) follows by routine algebra.

Proof. The main idea is to reweight the functions in the loss class $\mathcal F$ in order to have a handle on their variance (which is the key to making good use of Talagrand's inequality). To do this, consider
$$\mathcal G_r=\left\{\frac{rf}{\max(r,T(f))}:f\in\mathcal F\right\}.$$
At the end of the proof we will take $r=w(r^*(\delta))$ or $r=w(\varepsilon^*)$, but for a while we work with a generic value of $r$; this will serve to motivate the choice of $r^*(\delta)$. We thus apply Talagrand's inequality (Theorem 5.4) to this class of functions. Noticing that $Pg-g\le2$ and $\mathrm{Var}(g)\le r^2$ for $g\in\mathcal G_r$, we obtain that, on an event $E$ of probability at least $1-\delta$,
$$\sup_{g\in\mathcal G_r}(Pg-P_ng)\le2\,\mathbb E\left[\sup_{g\in\mathcal G_r}(Pg-P_ng)\right]+r\sqrt{\frac{2\log\frac1\delta}{n}}+\frac{8\log\frac1\delta}{3n}.$$
As shown in Section 3, we can upper bound the expectation on the right-hand side by $2\,\mathbb E[R_n(\mathcal G_r)]$. Notice that for $g\in\mathcal G_r$, $T(g)\le r$, and also $\mathcal G_r\subset\mathcal F^*$, which implies $R_n(\mathcal G_r)\le R_n\{f\in\mathcal F^*:T(f)\le r\}$. We thus obtain, on the event $E$, for all $f\in\mathcal F$,
$$\frac{Pf-P_nf}{\max(T(f)/r,\,1)}\le4\psi(r)+r\sqrt{\frac{2\log\frac1\delta}{n}}+\frac{8\log\frac1\delta}{3n}.$$
Using the definition of $w$, this yields
$$\frac{Pf-P_nf}{\max(w(Pf)/r,\,1)}\le4\psi(r)+r\sqrt{\frac{2\log\frac1\delta}{n}}+\frac{8\log\frac1\delta}{3n}.\qquad(23)$$
Then either $w(Pf)\le r$, which implies $Pf\le w^{-1}(r)$, or $w(Pf)>r$. In this latter case,
$$Pf\le P_nf+\frac{w(Pf)}{r}\left(4\psi(r)+r\sqrt{\frac{2\log\frac1\delta}{n}}+\frac{8\log\frac1\delta}{3n}\right).\qquad(24)$$
Moreover, as we have assumed that $x\mapsto w(x)/\sqrt x$ is non-increasing, $w(Pf)>r$ implies $Pf\ge w^{-1}(r)$, hence $w(Pf)/\sqrt{Pf}\le r/\sqrt{w^{-1}(r)}$, so that finally (using the fact that $x\le A\sqrt x+B$ implies $x\le A^2+2B$),
$$Pf\le2P_nf+\frac1{w^{-1}(r)}\left(4\psi(r)+r\sqrt{\frac{2\log\frac1\delta}{n}}+\frac{8\log\frac1\delta}{3n}\right)^2.\qquad(25)$$
Since the function $f_n$ corresponding to the empirical risk minimizer satisfies $P_nf_n\le0$, we obtain that, on the event $E$,
$$Pf_n\le\max\left(w^{-1}(r),\ \frac1{w^{-1}(r)}\left(4\psi(r)+r\sqrt{\frac{2\log\frac1\delta}{n}}+\frac{8\log\frac1\delta}{3n}\right)^2\right).$$
To minimize the right-hand side, we look for the value of $r$ which makes the two quantities in the maximum equal, that is, $r=w(r^*(\delta))$ if $r^*(\delta)$ is smaller than 1 (otherwise the first statement in the theorem is trivial).
Now, taking $r=w(r^*(\delta))$ in (24), since $0\le Pf_n\le r^*(\delta)$ we have $w(Pf_n)\le w(r^*(\delta))$, and therefore
$$-P_nf_n\le4\psi(w(r^*(\delta)))+w(r^*(\delta))\sqrt{\frac{2\log\frac1\delta}{n}}+\frac{8\log\frac1\delta}{3n}=r^*(\delta).$$
This proves the first part of Theorem 5.5.

5.3.4 Consequences

To understand the meaning of Theorem 5.5, consider the case $w(x)=(x/h)^{\alpha/2}$ with $\alpha\le1$. Observe that such a choice of $w$ is possible under the Mammen-Tsybakov noise condition. Moreover, if we assume that $\mathcal C$ is a VC class with VC dimension $V$, then it can be shown (see, e.g., Massart [160], Bartlett, Bousquet, and Mendelson [25], [125]) that
$$\psi(x)\le Cx\sqrt{\frac{V\log n}{n}},$$
so that $\varepsilon^*$ is upper bounded by
$$\left(\frac{C^2V\log n}{nh^\alpha}\right)^{\frac1{2-\alpha}}.$$
We can plug this upper bound into inequality (22). Thus, with probability at least $1-\delta$,
$$L(g_n)-L^*\le4\left(\frac1{nh^\alpha}\right)^{\frac1{2-\alpha}}\left(8\left(C^2V\log n\right)^{\frac1{2-\alpha}}+\left(C^2V\log n\right)^{\frac{\alpha-1}{2-\alpha}}\log\frac1\delta\right)+\frac{16\log\frac1\delta}{3n}.$$

5.3.5 An extended local analysis

In the preceding sections we assumed that the Bayes classifier $g^*$ belongs to the class $\mathcal C$, and, in the description of the consequences, that $\mathcal C$ is a VC class (and is, therefore, relatively "small"). As already pointed out, in realistic settings it is more reasonable to assume that the Bayes classifier is only approximated by $\mathcal C$. Fortunately, the above-described analysis, the so-called peeling device, is robust and extends to the general case. In the sequel we assume that $g^*$ minimizes $L(g)$ over $g\in\mathcal C$, but we do not assume that $g^*$ is the Bayes classifier. The loss class $\mathcal F$, its star-hull $\mathcal F^*$, and the function ψ are defined as in Section 5.3.3, that is, relative to the Bayes classifier. Notice that for $f\in\mathcal F$ or $f\in\mathcal F^*$, we still have $Pf\ge0$. Also, denoting by $f_n$ the function in $\mathcal F$ corresponding to the empirical risk minimizer $g_n$, and by $f^*$ the function in $\mathcal F$ corresponding to $g^*$, we have $P_nf_n-P_nf^*\le0$. Let $w(\cdot)$ be defined as in Section 5.3.3, that is, the smallest function satisfying $w(r)\ge\sup_{f\in\mathcal F^*,Pf\le r}\sqrt{\mathrm{Var}[f]}$ such that $w(r)/\sqrt r$ is non-increasing. Let again $\varepsilon^*$ be defined as the positive solution of $r=\psi(w(r))$.

Theorem 5.8. For any $\delta>0$, let $r^*(\delta)$ denote the solution of
$$r=4\psi(w(r))+2w(r)\sqrt{\frac{2\log\frac2\delta}{n}}+\frac{16\log\frac2\delta}{3n},$$
and $\varepsilon^*$ the positive solution of the equation $r=\psi(w(r))$. Then, for any $\theta>$
$0$, with probability at least $1-\delta$, the empirical risk minimizer $g_n$ satisfies
$$L(g_n)-L(g^*)\le\theta\left(L(g^*)-L^*\right)+\frac{(1+\theta)^2}{4\theta}\,r^*(\delta),$$
and
$$L(g_n)-L(g^*)\le\theta\left(L(g^*)-L^*\right)+\frac{(1+\theta)^2}{4\theta}\left(32\varepsilon^*+\left(\frac{4w^2(\varepsilon^*)}{\varepsilon^*}+\frac{32}3\right)\frac{\log\frac2\delta}{n}\right).$$

Remark 5.9. When $g^*$ is the Bayes classifier, the bound in this theorem has the same form as the upper bound in (22).

Remark 5.10. The second bound in the theorem follows from the first one in the same way as Inequality (22) follows from Inequality (21). In the proof we focus on the first bound. The proof consists mostly of replacing the observation $L_n(g_n)\le L_n(g^*)$ used in the proof of Theorem 5.5 by a Bernstein-type control of $L_n(g^*)-L(g^*)$.

Proof. Let $r$ denote a positive real number. Using the same approach as in the proof of Theorem 5.5, that is, applying Talagrand's inequality to the reweighted star-hull of $\mathcal F$, we get that, with probability at least $1-\delta/2$, for all $f\in\mathcal F$,
$$\frac{Pf-P_nf}{\max(T(f)/r,\,1)}\le4\psi(r)+r\sqrt{\frac{2\log\frac2\delta}{n}}+\frac{8\log\frac2\delta}{3n},$$
while we may also apply Bernstein's inequality to $f^*$ and use the fact that $\mathrm{Var}(f^*)\le w(Pf^*)^2$:
$$P_nf^*-Pf^*\le\sqrt{\frac{2\,\mathrm{Var}(f^*)\log\frac2\delta}{n}}+\frac{8\log\frac2\delta}{3n}\le\max\left(\frac{w(Pf^*)}r,\,1\right)\left(r\sqrt{\frac{2\log\frac2\delta}{n}}+\frac{8\log\frac2\delta}{3n}\right).$$
Adding the two inequalities, we get that, with probability at least $1-\delta$, for all $f\in\mathcal F$,
$$\frac{(Pf-P_nf)+(P_nf^*-Pf^*)}{\max\left(\frac{w(Pf)\vee w(Pf^*)}r,\,1\right)}\le4\psi(r)+2r\sqrt{\frac{2\log\frac2\delta}{n}}+\frac{16\log\frac2\delta}{3n}.$$
If we focus on $f=f_n$, then the two terms $(Pf_n-Pf^*)$ and $(P_nf^*-P_nf_n)$ making up the numerator are nonnegative. Now we substitute $w(r^*(\delta))$ for $r$ in the inequalities. Hence, using arguments that parallel the derivation of (25), we get that, on an event of probability at least $1-\delta$, either $Pf_n\le r^*(\delta)$ or
$$Pf_n-Pf^*\le\sqrt{Pf_n\,r^*(\delta)}.$$
Standard computations then lead to the first bound of the theorem.

Remark 5.11. The bound of Theorem 5.8 helps identify situations where taking noise conditions into account improves on naive risk bounds. This is the case when the approximation bias is of the same order of magnitude as the estimation error. Such a situation occurs when dealing with a plurality of models; see Section 8.

Remark 5.12. The bias term $L(g^*)-L^*$ shows up in Theorem 5.8 because we do not want to assume any special relationship between $\mathrm{Var}[1_{g(X)\ne Y}-1_{g^*(X)\ne Y}]$ and $L(g)-L(g^*)$. Such a relationship may
exist when dealing with convex risks and convex models. In such a case, it is usually wise to take advantage of it.
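The fixed-point quantities $\varepsilon^*$ appearing in the bounds above can be computed numerically by simple iteration. The Python sketch below uses the stylized forms $\psi(x)=Cx\sqrt{V\log n/n}$ and $w(r)=(r/h)^{\alpha/2}$ from Section 5.3.4, with invented constants, and checks the iteration against the closed-form solution $(C^2V\log n/(nh^\alpha))^{1/(2-\alpha)}$:

```python
import math

def eps_star(C, V, n, h, alpha, iters=200):
    """Solve r = psi(w(r)) by fixed-point iteration, with
    psi(x) = C*x*sqrt(V*log(n)/n) and w(r) = (r/h)**(alpha/2).
    The map is a contraction (exponent alpha/2 < 1), so iteration converges."""
    a = C * math.sqrt(V * math.log(n) / n)
    r = 1.0
    for _ in range(iters):
        r = a * (r / h) ** (alpha / 2)
    return r

C, V, n, h, alpha = 1.0, 10, 100000, 0.2, 0.5
closed = (C**2 * V * math.log(n) / (n * h**alpha)) ** (1 / (2 - alpha))
print(eps_star(C, V, n, h, alpha), closed)   # the two agree
```

The same iteration applies to $r^*(\delta)$ once the Talagrand correction terms are added to the right-hand side.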
5.4 Cost functions

The refined bounds described in the previous section may be carried over to the analysis of classification rules based on the empirical minimization of a convex cost functional $A_n(f)=(1/n)\sum_{i=1}^n\phi(-f(X_i)Y_i)$ over a class $\mathcal F$ of real-valued functions, as is the case in many popular algorithms, including certain versions of boosting and SVMs. The refined bounds improve on the ones described in Section 4. Most of the arguments described in the previous section work in this framework as well, provided the loss function is Lipschitz and there is a uniform bound on the functions $(x,y)\mapsto\phi(-f(x)y)$. However, some extra steps are needed to obtain the results. On the one hand, one relates the excess misclassification error $L(f)-L^*$ to the excess loss $A(f)-A^*$. According to [27], Zhang's lemma (11) may be improved under the Mammen-Tsybakov noise conditions to yield
$$L(f)-L^*\le\left(2^sc^s\beta^{1-s}\left(A(f)-A^*\right)\right)^{\frac1{s-s\alpha+\alpha}}.$$
On the other hand, considering the class of functions
$$\mathcal M=\left\{m_f(x,y)=\phi(-yf(x))-\phi(-yf^*(x)):f\in\mathcal F\right\},$$
one has to relate $\mathrm{Var}(m_f)$ to $Pm_f$, and finally compute the modulus of continuity of the Rademacher process indexed by $\mathcal M$. We omit the often somewhat technical details and direct the reader to the references for the detailed arguments. As an illustrative example, recall the case when $\mathcal F=\mathcal F_\lambda$ is defined as in (7). Then the empirical minimizer $f_n$ of the cost functional $A_n(f)$ satisfies, with probability at least $1-\delta$,
$$A(f_n)-A^*\le C\left(n^{-\frac{V+2}{2(V+1)}}+\sqrt{\frac{\log\frac1\delta}{n}}\right),$$
where the constant $C$ depends on the cost functional and on the VC dimension $V$ of the base class $\mathcal C$. Combining this with the above improvement of Zhang's lemma, one obtains significant improvements of the performance bound of Theorem 4.4.

5.5 Minimax lower bounds

The purpose of this section is to investigate the accuracy of the bounds obtained in the previous sections. We seek answers to the following questions: are these upper bounds (at least up to the order of magnitude) tight? Is there a much better way of selecting a classifier than minimizing the empirical error?
Let us formulate exactly what we are interested in. Let $\mathcal C$ be a class of decision functions $g:\mathbb R^d\to\{0,1\}$. The training sequence $D_n=((X_1,Y_1),\dots,(X_n,Y_n))$ is used to select the classifier $g_n(X)=g_n(X,D_n)$ from $\mathcal C$, where the selection is based on the data $D_n$. We emphasize here that $g_n$ can be an arbitrary function of the data; we do not restrict our attention to empirical error minimization. To make the exposition simpler, we only consider classes of functions with finite VC dimension. As before, we measure the performance of the selected classifier by the difference between the error probability $L(g_n)$ of the selected classifier and that of the best in the class, $L_{\mathcal C}=\inf_{g\in\mathcal C}L(g)$. In particular, we seek lower bounds for
$$\sup\ \mathbb EL(g_n)-L_{\mathcal C},$$
where the supremum is taken over all possible distributions of the pair $(X,Y)$. A lower bound for this quantity means that, no matter what our method of picking a rule from $\mathcal C$ is, we may face a distribution such that our method performs worse than the bound. Actually, we investigate a stronger problem, in which the supremum is taken over all distributions with $L_{\mathcal C}$ kept at a fixed value between zero and 1/2. We will see that the bounds depend jointly on $n$, on the VC dimension $V$ of
$\mathcal C$, and on $L_{\mathcal C}$. As it turns out, the situations for $L_{\mathcal C}>0$ and $L_{\mathcal C}=0$ are quite different. Also, the fact that the noise is controlled (via the Mammen-Tsybakov noise conditions) has an important influence. Integrating deviation inequalities such as Corollary 5.3, we have that, for any class $\mathcal C$ of classifiers with VC dimension $V$, a classifier $g_n$ minimizing the empirical risk satisfies
$$\mathbb E L(g_n)-L_{\mathcal C}=O\left(\sqrt{\frac{L_{\mathcal C}V\log n}{n}}+\frac{V\log n}{n}\right)\qquad\text{and also}\qquad\mathbb E L(g_n)-L_{\mathcal C}=O\left(\sqrt{\frac Vn}\right).$$
Let $\mathcal C$ be a class of classifiers with VC dimension $V$. Let $\mathcal P$ be the set of all distributions of the pair $(X,Y)$ for which $L_{\mathcal C}=0$. Then, for every classification rule $g_n$ based upon $X_1,Y_1,\dots,X_n,Y_n$, and every $n\ge V-1$,
$$\sup_{P\in\mathcal P}\mathbb EL(g_n)\ge\frac{V-1}{2en}\left(1-\frac1n\right).\qquad(26)$$
This can be generalized as follows. Let $\mathcal C$ be a class of classification functions with VC dimension $V\ge2$. Let $\mathcal P$ be the set of all probability distributions of the pair $(X,Y)$ for which, for a fixed $L\in(0,1/2)$, $L=\inf_{g\in\mathcal C}L(g)$. Then, for every classification rule $g_n$ based upon $X_1,Y_1,\dots,X_n,Y_n$,
$$\sup_{P\in\mathcal P}\mathbb E\left(L(g_n)-L\right)\ge\sqrt{\frac{L(V-1)}{32n}}\qquad\text{if }n\ge\frac{V-1}8\max\left(\frac2{(1-2L)^2},\,\frac1L\right).\qquad(27)$$
In the extreme case of the Mammen-Tsybakov noise condition, that is, when $|2\eta(x)-1|\ge h$ for all $x$ and some positive $h$, we have seen that the rate can be improved; essentially we have, when $g_n$ is the empirical error minimizer,
$$\mathbb E\left(L(g_n)-L^*\right)\le C\,\frac{V\log\frac nV}{hn},$$
no matter what $L^*$ is, provided $L^*=L_{\mathcal C}$. There also exist lower bounds under these circumstances. Let $\mathcal C$ be a class of classifiers with VC dimension $V$. Let $\mathcal P$ be the set of all probability distributions of the pair $(X,Y)$ for which $\inf_{g\in\mathcal C}L(g)=L^*$, and assume that $|\eta(X)-1/2|\ge h$ almost surely, where $h>0$ is a constant. Then, for every classification rule $g_n$ based upon $X_1,Y_1,\dots,X_n,Y_n$,
$$\sup_{P\in\mathcal P}\mathbb E\left(L(g_n)-L^*\right)\ge C\,\frac V{nh}.\qquad(28)$$
Thus, there is only a small gap between upper and lower bounds (essentially a logarithmic factor). This gap can be reduced when the class of functions is rich enough, where richness means that there exists some $d$ such that all dichotomies of size $d$ can be realized by functions in the class. When $\mathcal C$ is such a class, under the above
conditions, one can improve (28) to get

sup_{P ∈ P} E( L(g_n) − L* ) ≥ K (1 − h) (d/(nh)) ( 1 + log( nh²/d ) ),

provided nh² ≥ d.
Bibliographical remarks. Inequality (12) is known as Hoeffding's inequality [109], while (13) is referred to as Bernstein's inequality [34]. The constants shown here in Bernstein's inequality actually follow from an inequality due to Bennett [33]. Theorem 5.1 and its corollaries, (16) and Corollary 5.3, are due to Vapnik and Chervonenkis [232, 233]. The proof sketched here is due to Anthony and Shawe-Taylor [11]. Regarding the corollaries of this result, (14) is due to Pollard [181] and (15) is due to Haussler [105]. Breiman, Friedman, Olshen, and Stone [53] also derive inequalities similar, in spirit, to (14). The fact that the variance can be related to the expectation and that this can be used to get improved rates has been known for a while in the context of regression function estimation and other statistical problems (see [110], [226], [227] and references therein). For example, asymptotic results based on this were obtained by van de Geer [224]. For regression, Birgé and Massart [36] and Lee, Bartlett and Williamson [134] proved exponential inequalities. The fact that this phenomenon also occurs in the context of discriminant analysis and classification, under conditions on the noise (sometimes called margin conditions), was pointed out by Mammen and Tsybakov [151] (see also Polonik [182] and Tsybakov [220] for similar elaborations on related problems like excess-mass maximization or density level-set estimation). Massart [160] showed how to use optimal noise conditions to improve model selection by penalization. Talagrand's inequality for empirical processes first appeared in [212]. For various improvements, see Ledoux [132], Massart [159], Rio [184]. The version presented in Theorem 5.4 is an application of the refinement given by Bousquet [47]. Variations on the theme and detailed proofs appeared in [48]. Several methods have been developed in order to obtain sharp rates for empirical error minimization (or M-estimation). A classical trick is the so-called peeling technique, where the idea is to cut the class of interest into several pieces (according to the
variance of the functions) and to apply deviation inequalities separately to each sub-class. This technique, which goes back to Huber [110], is used, for example, by van de Geer [ ]. Another approach consists in weighting the class, and was used by Vapnik and Chervonenkis [232] in the special case of binary-valued functions and extended by Pollard [181], for example. Combining this approach with concentration inequalities was proposed by Massart [160], and this is the approach we have taken here. The fixed point of the modulus of continuity of the empirical process has been known to play a role in the asymptotic behavior of M-estimators [227]. More recently, non-asymptotic deviation inequalities involving this quantity were obtained, essentially in the work of Massart [160] and Koltchinskii and Panchenko [126]. Both approaches use a version of the peeling technique, but the one of Massart uses in addition a weighting approach. More recently, Mendelson [171] obtained similar results using a weighting technique but a peeling into two subclasses only. The main ingredient was the introduction of the star-hull of the class (as we do it here). This approach was further extended in [25], where the peeling and star-hull approaches are compared. It is pointed out in recent results of Bartlett, Mendelson, and Philips [30] and Koltchinskii [125] that sharper and simpler bounds may be obtained by taking Rademacher averages over level sets of the excess risk rather than on L_2(P) balls. Empirical estimates of the fixed point ε were studied by Koltchinskii and Panchenko [126] in the zero-error case. In a related work, Lugosi and Wegkamp [147] obtain bounds in terms of empirically estimated localized Rademacher complexities without noise conditions. In their approach, the complexity of a subclass of C containing only classifiers with a small empirical risk is used to obtain sharper bounds. A general result, applicable under general noise conditions, was proven by Bartlett, Bousquet and Mendelson [25]. Replacing the inequality by an equality in the definition of ψ (thus making the quantity smaller) can yield better rates
for certain classes, as shown by Bartlett and Mendelson [30]. Applications of results like Theorem 5.5 to classification with VC classes of functions were investigated by Massart and Nédélec [162]. Properties of convex loss functions were investigated by Lin [139], Steinwart [206], and Zhang [244]. The improvement of Zhang's lemma under the Mammen-Tsybakov noise condition is due to Bartlett, Jordan and McAuliffe [27], who establish more general results. For a further improvement we refer to Blanchard, Lugosi, and Vayatis [40]. The cited improved rates of convergence for A(f_n) − A* are also taken from [27] and [40], which is based on bounds derived by Blanchard, Bousquet, and Massart [39]. The latter reference also investigates
the special cost function φ(x) = (1 + x)_+ under the extreme case α = 1 of the Mammen-Tsybakov noise condition; see also Bartlett, Jordan and McAuliffe [27], Steinwart [195]. Massart [160] gives a version of Theorems 5.5 and 5.8 for the case w(r) = c√r and arbitrary bounded loss functions, which is extended to general w in Bartlett, Jordan and McAuliffe [27] and Massart and Nédélec [162]. Bartlett, Bousquet and Mendelson [25] give an empirical version of Theorem 5.5 in the case w(r) = c√r. The lower bound (26) was proved by Vapnik and Chervonenkis [233]; see also Haussler, Littlestone, and Warmuth [107], Blumer, Ehrenfeucht, Haussler, and Warmuth [41]. Inequality (27) is due to Audibert [17], who improves on a result of Devroye and Lugosi [73]; see also Simon [200] for related results. The lower bounds under conditions on the noise are due to Massart and Nédélec [162]. Related results under the Mammen-Tsybakov noise condition for large classes of functions (i.e., with polynomial growth of entropy) are given in the work of Mammen and Tsybakov [151] and Tsybakov [221]. Other minimax results based on the growth rate of entropy numbers of the class of functions are obtained in the context of classification by Yang [239, 240]. We notice that the distribution which achieves the supremum in the lower bounds typically depends on the sample size. It is thus reasonable to require the lower bounds to be derived in such a way that P does not depend on the sample size. Such results are called strong minimax lower bounds and were investigated by Antos and Lugosi [14] and Schuurmans [193].

6 PAC-Bayesian bounds

We now describe the so-called PAC-Bayesian approach to deriving error bounds (PAC is an acronym for "probably approximately correct"). The distinctive feature of this approach is that one assumes that the class C is endowed with a fixed probability measure π (called the prior) and that the output of the classification algorithm is not a single function but rather a probability distribution ρ over the class C (called the posterior). Throughout this section we assume that the
class C is at most countably infinite. Given this probability distribution ρ, the error is measured under expectation with respect to ρ. In other words, the quantities of interest are

ρL(g) := ∫ L(g) dρ(g)   and   ρL_n(g) := ∫ L_n(g) dρ(g).

This models classifiers whose output is randomized, which means that for x ∈ X, the prediction at x is a random variable taking values in {0, 1} and equal to one with probability ρg(x) := ∫ g(x) dρ(g). It is important to notice that ρ is allowed to depend on the training data. We first show how to get results relating ρL(g) and ρL_n(g) using basic techniques and deviation inequalities. A preliminary remark is that if ρ does not depend on the training sample, then ρL_n(g) is simply a sum of independent random variables whose expectation is ρL(g), so that Hoeffding's inequality applies trivially. So the difficulty comes when ρ depends on the data. By Hoeffding's inequality, for the class F = {1_{g(x)≠y} : g ∈ C}, one easily gets that for each fixed f ∈ F,

P{ Pf − P_n f ≥ √( log(1/δ) / (2n) ) } ≤ δ.

For any positive weights π(f) with Σ_{f ∈ F} π(f) = 1, one may write a weighted union bound as follows:

P{ ∃f ∈ F : Pf − P_n f ≥ √( log(1/(π(f)δ)) / (2n) ) } ≤ Σ_{f ∈ F} P{ Pf − P_n f ≥ √( log(1/(π(f)δ)) / (2n) ) } ≤ Σ_{f ∈ F} π(f)δ = δ,

so that we obtain that, with probability at least 1 − δ,

∀f ∈ F,   Pf − P_n f ≤ √( (log(1/π(f)) + log(1/δ)) / (2n) ).   (29)
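The weighted union bound (29) is easy to compute for a finite class. The Python sketch below (the class size, prior, and confidence level are made-up numbers) shows that a uniform prior recovers the usual finite-class bound involving log |F|, while a non-uniform prior tightens the bound for a priori favored functions.

```python
import math

def occam_bound(pi_f, delta, n):
    """Deviation bound of (29) for one function f with prior weight pi_f:
    with prob >= 1 - delta, Pf - P_n f <= sqrt((log(1/pi_f) + log(1/delta)) / (2n))."""
    return math.sqrt((math.log(1 / pi_f) + math.log(1 / delta)) / (2 * n))

# Hypothetical finite class of m classifiers with a uniform prior pi(f) = 1/m.
m, n, delta = 1000, 5000, 0.05
bounds = [occam_bound(1 / m, delta, n) for _ in range(m)]

# With a uniform prior the bound reduces to the classical finite-class bound
# sqrt((log m + log(1/delta)) / (2n)), identical for every f.
classic = math.sqrt((math.log(m) + math.log(1 / delta)) / (2 * n))
assert all(abs(b - classic) < 1e-12 for b in bounds)

# A non-uniform prior (geometric weights, summing to less than 1, which the
# union bound allows) spends less of its budget on a priori likely functions,
# so their individual bounds are tighter than the uniform one.
geom = [2.0 ** -(k + 1) for k in range(m)]
assert occam_bound(geom[0], delta, n) < classic
```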
It is interesting to notice that now the bound depends on the actual function f being considered, and not just on the set F. Now, observe that for any functional I,

( ∀f ∈ F, I(f) ≥ 0 )  implies  ( ∀ρ, ρI(f) ≥ 0 ),

where ρ denotes an arbitrary probability measure on F, so that we can take the expectation of (29) with respect to ρ and use Jensen's inequality. This gives, with probability at least 1 − δ,

∀ρ,   ρ(Pf − P_n f) ≤ √( (K(ρ, π) + H(ρ) + log(1/δ)) / (2n) ),

where K(ρ, π) denotes the Kullback-Leibler divergence between ρ and π, and H(ρ) is the entropy of ρ. Rewriting this in terms of the class C, we get that, with probability at least 1 − δ,

∀ρ,   ρL(g) − ρL_n(g) ≤ √( (K(ρ, π) + H(ρ) + log(1/δ)) / (2n) ).   (30)

The left-hand side is the difference between the true and empirical errors of a randomized classifier which uses ρ as weights for choosing the decision function (independently of the data). On the right-hand side appear the entropy H of the distribution ρ (which is small when ρ is concentrated on a few functions) and the Kullback-Leibler divergence K between ρ and the prior distribution π. It turns out that the entropy term is not necessary. The PAC-Bayes bound is a refined version of the above, which is proved using convex duality of the relative entropy. The starting point is the following inequality, which follows from convexity properties of the Kullback-Leibler divergence (or relative entropy): for any random variable X_f,

ρX_f ≤ inf_{λ>0} (1/λ) ( log π e^{λX_f} + K(ρ, π) ).

This inequality is applied to the random variable X_f = ((Pf − P_n f)_+)², and this means that we have to upper bound π E e^{λ((Pf − P_n f)_+)²}. We use Markov's inequality and Fubini's theorem to get

P{ π e^{λX_f} ≥ ε } ≤ ε^{-1} π E e^{λX_f}.

Now, for a given f ∈ F,

E e^{λ((Pf − P_n f)_+)²} = 1 + ∫₀^∞ P{ e^{λ((Pf − P_n f)_+)²} ≥ e^t } e^t dt
= 1 + ∫₀^∞ P{ λ((Pf − P_n f)_+)² ≥ t } e^t dt
= 1 + ∫₀^∞ P{ Pf − P_n f ≥ √(t/λ) } e^t dt
≤ 1 + ∫₀^∞ e^{−2nt/λ + t} dt = 2n,

where we have chosen λ = 2n − 1 in the last step. With this choice of λ we obtain

P{ π e^{λX_f} ≥ ε } ≤ 2n/ε.

Choosing ε = 2n/δ, we finally obtain that, with probability at least 1 − δ,

(1/λ) log π e^{λ((Pf − P_n f)_+)²} ≤ log(2n/δ) / (2n − 1).

The resulting bound has the following form.
Theorem 6.1 (PAC-Bayesian bound). With probability at least 1 − δ,

∀ρ,   ρL(g) − ρL_n(g) ≤ √( (K(ρ, π) + log(2n) + log(1/δ)) / (2n − 1) ).

This should be compared to (30). The main difference is that the entropy of ρ has disappeared, and we now have a logarithmic factor instead (which is usually dominated by the other terms). To some extent, one can consider that the PAC-Bayes bound is a refined union bound, where the gain happens when ρ is not concentrated on a single function (or, more precisely, when ρ has entropy larger than log n). A natural question is whether one can take advantage of PAC-Bayesian bounds to obtain bounds for deterministic classifiers (returning a single function and not a distribution), but this is not possible with Theorem 6.1 when the space F is uncountable. Indeed, the main drawback of PAC-Bayesian bounds is that the complexity term blows up when ρ is concentrated on a single function, which corresponds to the deterministic case. Hence, they cannot be used directly to recover bounds of the type discussed in previous sections. One way to avoid this problem is to allow the prior to depend on the data. In that case, one can work conditionally on the data (using a double sample trick) and, in certain circumstances, the coordinate projection of the class of functions is finite, so that the complexity term remains bounded. Another approach to bridge the gap between the deterministic and randomized cases is to consider successive approximating sets (similar to ε-nets) of the class of functions and to apply PAC-Bayesian bounds to each of them. This goes in the direction of chaining, or generic chaining.

Bibliographical remarks. The PAC-Bayesian bound of Theorem 6.1 was derived by McAllester [163] and later extended in [164, 165]. Langford and Seeger [130] and Seeger [196] gave an easier proof and some refinements. The symmetrization and conditioning approach was first suggested by Catoni and studied in [57-59]. The chaining idea appears in the work of Kolmogorov [122, 123] and was further developed by Dudley [81] and Pollard [180]. It was generalized by Talagrand [215], and a detailed account of recent
developments is given in [219]. The chaining approach to PAC-Bayesian bounds appears in Audibert and Bousquet [16]. Audibert [17] offers a thorough study of PAC-Bayesian results.

7 Stability

Given a classifier g_n, one of the fundamental problems is to obtain estimates for the magnitude of the difference L(g_n) − L_n(g_n) between the true risk of the classifier and its estimate L_n(g_n), measured on the same data on which the classifier was trained. L_n(g_n) is often called the resubstitution estimate of L(g_n). It has been pointed out by various authors that the size of the difference L(g_n) − L_n(g_n) is closely related to the stability of the classifier g_n. Several notions of stability have been introduced, aiming at capturing this idea. Roughly speaking, a classifier g_n is stable if small perturbations in the data do not have a big effect on the classifier. Under a proper notion of stability, concentration inequalities may be used to obtain estimates for the quantity of interest. A simple example of such an approach is the following. Consider the case of real-valued classifiers, where the classifier g_n is obtained by thresholding at zero a real-valued function f_n : X → R. Given data (X_1, Y_1), ..., (X_n, Y_n), denote by f_n^i the function that is learned from the data after replacing (X_i, Y_i) by an arbitrary pair (x_i, y_i). Let φ be a cost function as defined in Section 4, and assume that, for any set of data, any replacement pair, and any x, y,

| φ(−y f_n(x)) − φ(−y f_n^i(x)) | ≤ β_n,

for some β_n > 0, and that φ(−y f_n(x)) is bounded by some constant M > 0. This is called the uniform stability condition. Under this condition, it is easy to see that

E [ A(f_n) − A_n(f_n) ] ≤ β_n
(where the functionals A and A_n are defined in Section 4). Moreover, by the bounded differences inequality, one easily obtains that, with probability at least 1 − δ,

A(f_n) − A_n(f_n) ≤ β_n + (2nβ_n + M) √( log(1/δ) / (2n) ).

Of course, to be of interest, this bound has to be such that β_n is a non-increasing function of n with √n β_n → 0 as n → ∞. This turns out to be the case for regularization-based algorithms such as the support vector machine. Hence one can obtain error bounds for such algorithms using the stability approach. We omit the details and refer the interested reader to the bibliographical remarks for further reading.

Bibliographical remarks. The idea of using the stability of a learning algorithm to obtain error bounds was first exploited by Rogers and Wagner [187] and Devroye and Wagner [74, 75]. Kearns and Ron [116] investigated it further and introduced formally several measures of stability. Bousquet and Elisseeff [49] obtained exponential bounds under restrictive conditions on the algorithm, using the notion of uniform stability. These conditions were relaxed by Kutin and Niyogi [129]. The link between stability and consistency of the empirical error minimizer was studied by Poggio, Rifkin, Mukherjee and Niyogi [179].

8 Model selection

8.1 Oracle inequalities

When facing a concrete classification problem, choosing the right set C of possible classifiers is a key to success. If C is so large that it can approximate arbitrarily well any measurable classifier, then C is susceptible to overfitting and is not suitable for empirical risk minimization, or empirical φ-risk minimization. On the other hand, if C is a small class, for example a class with finite VC dimension, C will be unable to approximate in any reasonable sense a large set of measurable classification rules. In order to achieve a good balance between estimation error and approximation error, a variety of techniques have been considered. In the remainder of the paper, we will focus on the analysis of model selection methods, which may be regarded as heirs of the structural risk minimization principle of Vapnik and
Chervonenkis. Model selection aims at getting the best of different worlds simultaneously. Consider a possibly infinite collection of classes of classifiers C_1, C_2, ..., C_k, .... Each class is called a model. Our guess is that some of these models contain reasonably good classifiers for the pattern recognition problem we are facing. Assume that for each of these models we have a learning algorithm that picks a classification rule g_{n,k} from C_k when given the sample D_n. The model selection problem may then be stated as follows: select among (g_{n,k})_k a good classifier. Notice here that the word selection may be too restrictive. Rather than selecting some special g_{n,k}, we may consider combining them using a voting scheme, and use a boosting algorithm where the base class would just be the (data-dependent) collection (g_{n,k})_k. For the sake of brevity, we will just focus on model selection in the narrow sense. In an ideal world, before we see the data D_n, a benevolent oracle with full knowledge of the noise conditions and of the Bayes classifier would tell us which model (say k*) minimizes the expected excess risk E[L(g_{n,k}) − L*], if such a model exists in our collection. Then we could use our learning rule for this most promising model, with the guarantee that

E [ L(g_{n,k*}) − L* ] ≤ inf_k E [ L(g_{n,k}) − L* ].

But as the most promising model k* depends on the learning problem and may even not exist, there is no hope to perfectly mimic the behavior of the benevolent and powerful oracle. What statistical learning theory has tried hard to do is to approximate the benevolent oracle in various ways.
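A toy simulation (all numbers below are synthetic, chosen only for illustration) makes the oracle benchmark concrete: a selector that sees only noisy empirical estimates of the per-model risks cannot do better, on average, than inf_k E[L(g_{n,k}) − L*], but it can track this target up to roughly the noise level.

```python
import random

random.seed(0)

# Hypothetical per-model expected excess risks: a classic bias/variance
# trade-off across models of increasing complexity.
excess = [0.30, 0.12, 0.06, 0.08, 0.15, 0.25]
oracle = min(excess)  # what the benevolent oracle achieves

# Select on noisy empirical estimates of the risks; average the *true*
# excess risk of the selected model over many repetitions.
trials, sigma, total = 10_000, 0.03, 0.0
for _ in range(trials):
    noisy = [e + random.gauss(0, sigma) for e in excess]
    picked = min(range(len(excess)), key=lambda k: noisy[k])
    total += excess[picked]
avg_selected = total / trials

assert avg_selected >= oracle           # cannot beat the oracle
assert avg_selected <= oracle + 3 * sigma  # but tracks it up to the noise level
```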
It is important to think about what could be reasonable upper bounds on the right-hand side of the preceding oracle inequality. It seems reasonable to incorporate a factor C at least as large as 1 and additive terms of the form C' γ(k, n)/n, where γ(·,·) is a slowly growing function of its arguments, and to ask for

E [ L(g_{n,ˆk}) − L* ] ≤ C ( inf_k ( E [ L(g_{n,k}) − L* ] + C' γ(k, n)/n ) ),   (31)

where ˆk is the index of the model selected according to empirical evidence. Let L_k = inf_{g ∈ C_k} L(g) for each model index k. In order to understand the role of γ(·,·), it is useful to split E[L(g_{n,k}) − L*] into a bias term L_k − L* and a "variance" term E[L(g_{n,k}) − L_k]. The last inequality translates into

E [ L(g_{n,ˆk}) − L* ] ≤ C ( inf_k ( L_k − L* + E [ L(g_{n,k}) − L_k ] + C' γ(k, n)/n ) ).

The term C' γ(k, n)/n should ideally be at most of the same order of magnitude as L(g_{n,k}) − L_k. To make the roadmap more detailed, we may invoke the robust analysis of the performance of empirical risk minimization sketched in Section 5.3.5. Recall that w(·) was defined in such a way that √Var(1_{g(X)≠Y} − 1_{g*(X)≠Y}) ≤ w(L(g) − L*) for all classifiers g ∈ ∪_k C_k, and such that w(r)/√r is non-increasing. Explicit constructions of w(·) were possible under the Mammen-Tsybakov noise conditions. To take into account the plurality and the richness of models, for each model C_k, let ψ_k be defined as

ψ_k(r) = E R_n { f ∈ F*_k : √Var(f) ≤ r },

where F*_k is the star-hull of the loss class defined by C_k (see Section 5.3.5). For each k, let ε_k be defined as the positive solution of r = ψ_k(w(r)). Then, in view of Theorem 5.8, we can get sensible upper bounds on the excess risk for each model, and we may look for oracle inequalities of the form

E [ L(g_{n,ˆk}) − L* ] ≤ C inf_k ( L_k − L* + C' ( ε_k + (w²(ε_k)/ε_k) (log k)/n ) ).   (32)

The right-hand side is then of the same order of magnitude as the infimum of the upper bounds on the excess risk described in Section 5.3.5.

8.2 A glimpse at model selection methods

As we now have a clear picture of what we are after, we may look for methods suitable to achieve this goal. The model selection problem looks like a multiple hypothesis testing
problem: we have to test many pairs of hypotheses, where the null hypothesis is L(g_{n,k}) ≤ L(g_{n,k'}) against the alternative L(g_{n,k}) > L(g_{n,k'}). Depending on the scenario, we may or may not have fresh data to test these pairs of hypotheses. Whatever the situation, the tests are not independent. Furthermore, there does not seem to be any obvious way to combine possibly conflicting answers. Most data-intensive model selection methods we are aware of can be described in the following way: for each pair of models C_k and C_{k'}, a threshold τ(k, k', D_n) is built, and model C_k is favored with respect to model C_{k'} if

L_n(g_{n,k}) − L_n(g_{n,k'}) ≤ τ(k, k', D_n).

The threshold τ(·,·,·) may or may not depend on the data. Then the results of the many pairwise tests are combined in order to select a model. Model selection by penalization may be regarded as a simple instance of this scheme. In the penalization setting, the threshold τ(k, k', D_n) is the difference between two terms that depend on the models:

τ(k, k', D_n) = pen(n, k') − pen(n, k).
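With this particular threshold, the pairwise-test scheme and penalized minimization coincide: C_k beats C_{k'} exactly when L_n(g_{n,k}) + pen(n,k) ≤ L_n(g_{n,k'}) + pen(n,k'). A short sketch (the empirical risks and penalties are made-up numbers) checks the equivalence:

```python
# Hypothetical empirical risks and penalties for five models of
# increasing complexity (smaller risk, larger penalty).
emp_risk = [0.31, 0.22, 0.18, 0.17, 0.16]
pen = [0.01, 0.02, 0.04, 0.08, 0.16]

def favored(k, kp):
    """Pairwise test: C_k favored over C_k' iff
    L_n(g_{n,k}) - L_n(g_{n,k'}) <= tau(k, k') = pen(k') - pen(k)."""
    return emp_risk[k] - emp_risk[kp] <= pen[kp] - pen[k]

# A model that wins every pairwise comparison...
tournament_winner = next(
    k for k in range(len(emp_risk))
    if all(favored(k, kp) for kp in range(len(emp_risk)) if kp != k)
)

# ...is exactly the minimizer of the penalized empirical risk.
penalized_minimizer = min(range(len(emp_risk)),
                          key=lambda k: emp_risk[k] + pen[k])
assert tournament_winner == penalized_minimizer
```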
The selected index ˆk minimizes the penalized empirical risk L_n(g_{n,k}) + pen(n, k). Such a scheme is attractive since the combination of the results of the pairwise tests is extremely simple. As a matter of fact, it is not necessary to perform all pairwise tests; it is enough to find the index that minimizes the penalized empirical risk. Nevertheless, performing model selection using penalization suffers from some drawbacks: it will become apparent below that the ideal penalty that should be used in order to mimic the benevolent oracle should, with high probability, be of the order of E[L(g_{n,k}) − L*]. As seen in Section 5.3.5, the sharpest bounds we can get on the last quantity depend on noise conditions, on model complexity and on the model approximation capability L_k − L*. Although noise conditions and model complexity can be estimated from the data (notwithstanding computational problems), estimating the model bias L_k − L* seems to be beyond the reach of our understanding. In fact, estimating L* is known to be a difficult statistical problem; see Devroye, Györfi, and Lugosi [72], Antos, Devroye, and Györfi [12]. As far as classification is concerned, model selection by penalization may not put the burden where it should be. If we allow the combination of the results of pairwise tests to be somewhat more complicated than a simple search for the minimum in a list, we may avoid the penalty calibration bottleneck. In this respect, the so-called pre-testing method has proved to be quite successful when models are nested. The cornerstone of the pre-testing methods consists in the definition of a threshold τ(k, k', D_n) for k ≤ k' that takes into account the complexity of C_{k'}, as well as the noise conditions. Instead of attempting an unbiased estimation of the excess risk in each model, as in the penalization approach, the pre-testing approach attempts to estimate differences between excess risks. But however promising the pre-testing method may look, it will be hard to convince practitioners to abandon cross-validation and other resampling methods. Indeed, a
straightforward analysis of the hold-out approach to model selection suggests that hold-out enjoys almost all the desirable features of any foreseeable model selection method. The rest of this section is organized as follows. In Subsection 8.3 we illustrate how the results collected in Sections 3 and 4 can be used to design simple penalties and derive some easy oracle inequalities that capture classical results concerning structural risk minimization. It will be obvious that these oracle inequalities are far from being satisfactory. In Subsection 8.4, we point out the problems that have to be faced in order to calibrate penalties using the refined and robust analysis of empirical risk minimization given in Section 5.3.5. In Subsection 8.6, we rely on these developments to illustrate the possibilities of pre-testing methods, and we conclude in Subsection 8.7 by showing how hold-out can be analyzed and justified by resorting to a robust version of the elementary argument given at the beginning of Subsection 5.2.

Bibliographical remarks. Early work on model selection in the context of regression or prediction with squared loss can be found in Mallows [150] and Akaike [6]. Mallows introduced the C_p criterion in [150]. Grenander [102] discusses the use of regularization in statistical inference. Vapnik and Chervonenkis [233] proposed the structural risk minimization approach to model selection in classification; see also Vapnik [ ], Lugosi and Zeger [148]. The concept of oracle inequality was advocated by Donoho and Johnstone [76]. A thorough account of the concept of oracle inequality can be found in Johnstone [113]. Barron [21], Barron and Cover [23], [22] investigate model selection using complexity regularization, which is a kind of penalization, in the framework of discrete models for density estimation and regression. A general and influential approach to non-parametric inference through penalty-based model selection is described in Barron, Birgé and Massart [20]; see also Birgé and Massart [37], [38]. These papers provide a profound account of the use of sharp bounds on the excess risk for model selection via penalization. In
particular, these papers
pioneered the use of sharp concentration inequalities in solving model selection problems; see also Baraud [19], Castellan [56] for illustrations in regression and density estimation. A recent account of inference methods in non-parametric settings can be found in Tsybakov [222]. Kernel methods and nearest-neighbor rules have been used to design universal learning rules and, in some sense, bypass the model selection problem. We refer to Devroye, Györfi and Lugosi [72] for exposition and references. Hall [103] and many other authors use resampling techniques to perform model selection.

8.3 Naive penalization

We start by describing a naive approach that uses ideas exposed in the first part of this survey. Penalty-based model selection chooses the model ˆk that minimizes L_n(g_{n,k}) + pen(n, k) among all models (C_k)_{k ∈ N}. In other words, the selected classifier is g_{n,ˆk}. As in the preceding section, pen(n, k) is a positive, possibly data-dependent, quantity. The intuition behind using penalties is that, as large models tend to overfit and are thus prone to producing excessively small empirical risks, they should be penalized. The naive penalties considered in this section are estimates of the expected amount of overfitting E[sup_{g ∈ C_k} (L(g) − L_n(g))]. Taking the expectation as a penalty is unrealistic, as it assumes knowledge of the true underlying distribution. Therefore, it should be replaced by either a distribution-free penalty or a data-dependent quantity. Distribution-free penalties may lead to highly conservative bounds. The reason is that, since a distribution-free upper bound holds for all distributions, it is necessarily loose in special cases where the distribution is such that the expected maximal deviation is small. This may occur, for example, if the distribution of the data is concentrated on a small-dimensional manifold, and in many other cases. In recent years, several data-driven penalization procedures have been proposed. Such procedures are motivated by computational or by statistical considerations. Here we only focus on statistical arguments. Rademacher
averages, as presented in Section 3, are by now regarded as a standard basis for designing data-driven penalties.

Theorem 8.1. For each k, let F_k = {1_{g(x)≠y} : g ∈ C_k} denote the loss class associated with C_k. Let pen(n, k) be defined by

pen(n, k) = 3 R_n(F_k) + √( log k / n ) + 18 log k / n.   (33)

Let ˆk be defined as argmin_k ( L_n(g_{n,k}) + pen(n, k) ). Then

E [ L(g_{n,ˆk}) − L* ] ≤ inf_k ( L(g*_k) − L* + 3 E [R_n(F_k)] + √( log k / n ) + 18 log k / n ) + √( 2π/n ).   (34)

Inequality (34) has the same form as the generic oracle inequality (31). The multiplicative constant in front of the infimum is optimal, since it is equal to 1. At first glance, the additive terms might seem quite satisfactory: if noise conditions are not favorable, E[R_n(F_k)] is of the order of the excess risk in the k-th model. On the other hand, in view of the oracle inequality (32) we are looking for, this inequality is loose when noise conditions are favorable, for example when the Mammen-Tsybakov conditions are enforced with some exponent α > 0. In the sequel, we will sometimes use the following property: Rademacher averages are sharply concentrated. They not only satisfy the bounded differences inequality, but also Bernstein-like inequalities, given in the next lemma.
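The conditional Rademacher average R_n(F_k) in (33) is straightforward to approximate by sampling random sign vectors. The sketch below (a hypothetical one-dimensional sample and a small finite model of threshold classifiers; the model index k is also made up) does so by Monte Carlo and assembles the naive penalty:

```python
import math
import random

random.seed(1)

# Hypothetical data: n points on the line with 0/1 labels.
n = 200
x = [random.uniform(0, 1) for _ in range(n)]
y = [1 if xi > 0.4 else 0 for xi in x]

# Small model C_k: threshold classifiers g_t(x) = 1{x > t}.
thresholds = [i / 10 for i in range(11)]
# Loss class F_k = { (x, y) -> 1{g_t(x) != y} }, tabulated on the sample.
losses = [[1 if (xi > t) != (yi == 1) else 0 for xi, yi in zip(x, y)]
          for t in thresholds]

def rademacher(losses, n, n_mc=500):
    """Monte Carlo estimate of the conditional Rademacher average
    R_n(F_k) = E_sigma sup_f (1/n) sum_i sigma_i f(X_i, Y_i)."""
    total = 0.0
    for _ in range(n_mc):
        sigma = [random.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * f for s, f in zip(sigma, fv)) / n
                     for fv in losses)
    return total / n_mc

k = 3  # hypothetical index of this model in the list of models
pen = 3 * rademacher(losses, n) + math.sqrt(math.log(k) / n) + 18 * math.log(k) / n
assert 0 < pen < 1
```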
Lemma 8.2. Let F_k denote a class of functions with values in [−1, 1], and let R_n(F_k) be the corresponding conditional Rademacher average. Then

Var( R_n(F_k) ) ≤ (1/n) E [R_n(F_k)],

P{ R_n(F_k) ≥ E [R_n(F_k)] + ε } ≤ exp( − nε² / (2(E[R_n(F_k)] + ε/3)) ),

P{ R_n(F_k) ≤ E [R_n(F_k)] − ε } ≤ exp( − nε² / (2 E[R_n(F_k)]) ).

Proof of Theorem 8.1. By the definition of the selection criterion, we have, for all k,

L(g_{n,ˆk}) − L* ≤ L(g_{n,k}) − L* − ( (L(g_{n,k}) − L_n(g_{n,k})) − pen(n, k) ) + ( (L(g_{n,ˆk}) − L_n(g_{n,ˆk})) − pen(n, ˆk) ).

Taking expectations, we get

E [ L(g_{n,ˆk}) − L* ] ≤ E [ L(g_{n,k}) − L* + pen(n, k) ] + E [ sup_k ( L(g_{n,k}) − L_n(g_{n,k}) − pen(n, k) ) ]
≤ E [ L(g_{n,k}) − L* + pen(n, k) ] + E [ sup_k ( sup_{g ∈ C_k} (L(g) − L_n(g)) − pen(n, k) ) ]
≤ E [ L(g_{n,k}) − L* + pen(n, k) ] + Σ_k E [ ( sup_{g ∈ C_k} (L(g) − L_n(g)) − pen(n, k) )_+ ].

The tail bounds for Rademacher averages given in Lemma 8.2 can then be exploited as follows:

P{ sup_{g ∈ C_k} (L(g) − L_n(g)) ≥ pen(n, k) + 2δ }
≤ P{ sup_{g ∈ C_k} (L(g) − L_n(g)) ≥ E [ sup_{g ∈ C_k} (L(g) − L_n(g)) ] + √( log k / n ) + δ }
+ P{ R_n(F_k) ≤ (2/3) E [R_n(F_k)] − (18 log k)/(3n) − δ/3 }

(using the bounded differences inequality for the first term and Lemma 8.2 for the second term)

≤ (1/k²) exp( −2nδ² ) + (1/k²) exp( −nδ/9 ).

Integrating by parts and summing with respect to k leads to the oracle inequality of the theorem.

Bibliographical remarks. Data-dependent penalties were suggested by Lugosi and Nobel [145], and in the closely related luckiness framework introduced by Shawe-Taylor, Bartlett, Williamson, and Anthony [197]; see also Freund [92]. Penalization based on Rademacher averages was suggested by Bartlett, Boucheron, and Lugosi [24] and Koltchinskii [124]. For refinements and further developments, see Koltchinskii and Panchenko [126], Lozano [142], [29], Bartlett, Bousquet and Mendelson [25], Bousquet, Koltchinskii and Panchenko [50], Lugosi and
Wegkamp [147], Herbrich and Williamson [108], Mendelson and Philips [172]. The proof that Rademacher averages, the empirical VC-entropy and the empirical VC-dimension are sharply concentrated around their means can be found in Boucheron, Lugosi, and Massart [45, 46]. Fromont [96] points out that Rademacher averages are actually a special case of weighted bootstrap estimates of the supremum of empirical processes, and shows how a large collection of variants of bootstrap estimates can be used in model selection for classification. We refer to Giné [100] and Efron et al. [85-87] for general results on the bootstrap. Empirical investigations of the performance of model selection based on Rademacher penalties can be found in Lozano [142] and Bartlett, Boucheron, and Lugosi [24]. Both papers build on a framework elaborated in Kearns, Mansour, Ng, and Ron [115]. Indeed, [115] is an early attempt to compare model selection criteria originating in structural risk minimization theory, MDL (the Minimum Description Length principle), and the performance of hold-out estimates of overfitting. This paper introduced the interval problem, where empirical risk minimization and model selection can be performed in a computationally efficient way. Lugosi and Wegkamp [147] propose a refined penalization scheme based on localized Rademacher complexities that reconciles bounds presented in this section with the results described by Koltchinskii and Panchenko [126] when the optimal risk equals zero.

8.4 Ideal penalties

Naive penalties that tend to overestimate the excess risk in each model lead to conservative model selection strategies. For moderate sample sizes, they tend to favor small models. Encouraging results reported in simulation studies should not mislead the reader. Model selection based on naive Rademacher penalization manages to mimic the oracle when the sample size is large enough to make the naive upper bound on the estimation bias small with respect to the approximation bias. As model selection is ideally geared toward situations where the sample size is not too large, one cannot feel
satisfied by naive Rademacher penalties. We can guess quite easily what good penalties should be like. If we could build penalties in such a way that, with probability larger than 1 − 1/(2nk²),

L(g_{n,k}) − L* ≤ C ( L_n(g_{n,k}) − L_n(g*) + pen(n, k) ) + C' log(2nk²)/n,

then, by the definition of model selection by penalization and a simple union bound, with probability larger than 1 − 1/n, for any k, we would have

L(g_{n,ˆk}) − L* ≤ C ( L_n(g_{n,ˆk}) − L_n(g*) + pen(n, ˆk) ) + C' log(2nˆk²)/n
≤ C ( L_n(g_{n,k}) − L_n(g*) + pen(n, k) ) + C' log(2nˆk²)/n
≤ C ( L_n(g*_k) − L_n(g*) + pen(n, k) ) + C' log(2nˆk²)/n.

Assuming we only consider polynomially many models (as a function of n), this would lead to

E L(g_{n,ˆk}) − L* ≤ inf_k { C E [ L_k − L* + pen(n, k) ] } + ( C' log(2en) + 1 ) / n.

Is this sufficient to meet the objectives set in Section 8.1? This is where the robust analysis of empirical risk minimization (Section 5.3.5) comes into play. If we assume that, with high probability, the quantities ε_k defined at the end of Section 8.1 can be tightly estimated by data-dependent quantities and used as penalties, then we are almost done.
The following statement, which we abusively call a theorem, summarizes what could be achieved using such ideal penalties. For the sake of brevity, we provide a theorem with C = 2, but with some more care one can develop oracle inequalities with arbitrary C > 1.

Theorem 8.3. If for every k the penalty satisfies

pen(n, k) ≥ 32 ε_k + ( 4 w(ε_k)²/ε_k + 32/3 ) log(4nk²)/n,   (35)

then

E [ L(g_{n,ˆk}) − L* ] ≤ C inf_k ( L_k − L* + pen(n, k) ) + 1/n.

Most of the proof consists in checking that, if pen(n, k) is chosen according to (35), then we can invoke the robust results on learning rates stated in Theorem 5.8 to conclude.

Proof. Following the second bound in Theorem 5.8 (with θ = 1), with probability at least 1 − Σ_{k≥1} 1/(2nk²) ≥ 1 − 1/n, we have, for every k,

L(g_{n,k}) − L(g*) ≤ 2 ( L_n(g_{n,k}) − L_n(g*) ) + 32 ε_k + ( 4 w(ε_k)²/ε_k + 32/3 ) log(4nk²)/n,

that is,

L(g_{n,k}) − L(g*) ≤ 2 ( L_n(g_{n,k}) − L_n(g*) + pen(n, k) ).

The theorem follows by observing that L(g_{n,ˆk}) − L* ≤ 1 on the complementary event.

If pen(n, k) is about the same as the right-hand side of (35), then the oracle inequality of Theorem 8.3 has the same form as the ideal oracle inequality described at the end of Section 8.1. This should nevertheless not be considered a definitive result, but rather an incentive to look for better penalties. It could also possibly point toward a dead end. Theorem 8.3 actually calls for building estimators of the sequence (ε_k), that is, of the sequence of fixed points of the functions ψ_k ∘ w. Recall that

ψ_k(w(r)) ≥ E [ R_n { f : f ∈ F*_k, Pf ≤ r } ].

If the star-shaped loss class { f : f ∈ F*_k, Pf ≤ r } were known, given the fact that, for a fixed class of functions F, R_n(F) is sharply concentrated around its expectation, estimating ψ_k ∘ w would be statistically feasible. But the loss class of interest depends not only on the k-th model C_k, but also on the unknown Bayes classifier g*. We will not pursue the search for ideal data-dependent penalties, and look for roundabouts. In the next section, we will see that when g* ∈ C_k, even though g* is unknown, sensible estimates of ε_k can be constructed. In Section 8.6, we will see how to use these estimates in model selection.
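The fixed points ε_k of r ↦ ψ_k(w(r)) are easy to compute numerically once ψ_k and w are specified. The sketch below iterates the map for an illustrative pair (not the survey's exact complexity bounds): ψ(r) = √(Vr/n), a typical sub-root complexity, and w(r) = √r, for which the fixed point has the closed form (V/n)^(2/3).

```python
import math

def fixed_point(psi, w, r0=1.0, tol=1e-12, max_iter=10_000):
    """Iterate r <- psi(w(r)) until convergence; the map is monotone and
    r -> psi(w(r))/r is decreasing, so the iteration converges."""
    r = r0
    for _ in range(max_iter):
        r_next = psi(w(r))
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r

# Illustrative (made-up) choices of complexity and variance mappings.
V, n = 10, 10_000
psi = lambda r: math.sqrt(V * r / n)
w = lambda r: math.sqrt(r)

eps = fixed_point(psi, w)
closed_form = (V / n) ** (2 / 3)  # solves r = (V/n)^(1/2) * r^(1/4)
assert abs(eps - closed_form) < 1e-9
```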
Bibliographical remarks. The results described in this section are inspired by Massart [160], where the concept of ideal penalty in classification is clarified. The notion that ideal penalties should be rooted in sharp risk estimates goes back to the pioneering works of Akaike [6] and Mallows [150]. As far as classification is concerned, a detailed account of these ideas can be found in the eighth chapter of Massart [161]. Various approaches to excess risk estimation in classification can be found in Bartlett, Bousquet, and Mendelson [25] and Koltchinskii [125], where a discussion of the limits of penalization can also be found.

8.5. Localized Rademacher complexities

The purpose of this section is to show how the distribution-dependent upper bounds on the excess risk of empirical risk minimization derived in Section 5.3 can be estimated from above and from below when the Bayes classifier belongs to the model. This is not enough to make penalization work, but it will prove convenient when investigating a pre-testing method in the next section.
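The conditional Rademacher averages R_n(·) that this section seeks to localize can themselves be approximated by Monte Carlo over random signs. The following sketch does so for a hypothetical finite class of classifiers; the 0/1 loss matrix is invented for illustration.

```python
import numpy as np

def empirical_rademacher(loss_matrix, rng, n_rounds=200):
    """Monte Carlo estimate of the conditional Rademacher average
    E_sigma sup_{f in F} (1/n) sum_i sigma_i f(X_i),
    where loss_matrix[j, i] = f_j(X_i) for a finite class {f_j}."""
    n = loss_matrix.shape[1]
    total = 0.0
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=n)    # random signs
        total += np.max(loss_matrix @ sigma) / n   # sup over the class
    return total / n_rounds

rng = np.random.default_rng(0)
# Hypothetical class of 5 classifiers evaluated on n = 100 points.
losses = rng.choice([0.0, 1.0], size=(5, 100))
complexity = empirical_rademacher(losses, rng)
```

Localization then amounts to restricting the supremum to the functions satisfying a variance (or empirical distance) constraint, exactly as in the definitions of ψ̂_n below.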
In this section we are concerned with a single model C which contains the Bayes classifier g*. The minimizer of the empirical risk is denoted by g_n. The loss class is F = {1_{g(X)≠Y} − 1_{g*(X)≠Y} : g ∈ C}. The functions ψ(·) and w(·) are defined as in Section 5.3. The quantity ε is defined as the solution of the fixed-point equation r = ψ(w(r)). As, thanks to Theorems 5.5 and 5.8, ε contains relevant information on the excess risk, it may be tempting to try to estimate ε from the data. However, this is not the easiest way to proceed. As we will need bounds with prescribed accuracy and confidence for the excess risk, we will rather try to estimate from above and from below the bound r_n(δ) defined in the statement of Theorem 5.5. Recall that r_n(δ) is defined as the solution of the equation

r = 4ψ(w(r)) + w(r) √(2 log(1/δ)/n) + 8 log(1/δ)/(3n) .

In order to estimate r_n(δ), we estimate ψ(·) and w(·) by some functions ψ̂_n and ŵ_n and solve the corresponding fixed-point equations. The rationale for this is contained in the following proposition.

Proposition 8.4. Assume that the functions ψ̂_n and ŵ_n satisfy the following conditions:
(1) ψ̂_n and ŵ_n are non-negative and non-decreasing on [0, 1].
(2) The function r ↦ ŵ_n(r)/√r is non-increasing.
(3) The function r ↦ ψ̂_n(r)/r is non-increasing.
(4) ψ̂_n(w(r_n(δ))) ≥ ψ(w(r_n(δ))).
(5) There exist constants κ₁, κ₂ ≥ 1 such that ψ̂_n(w(r_n(δ))) ≤ κ₁ ψ(κ₂ w(r_n(δ))).
(6) ŵ_n(r_n(δ)) ≥ w(r_n(δ)).
(7) There exist constants κ₃, κ₄ ≥ 1 such that ŵ_n(r_n(δ)) ≤ κ₃ w(κ₄ r_n(δ)).
Then the following holds:
(1) There exists ˆr_n(δ) > 0 that solves

r = 4ψ̂_n(ŵ_n(r)) + ŵ_n(r) √(2 log(2/δ)/n) + 8 log(2/δ)/(3n) .

(2) If κ = κ₁ κ₂ κ₃ κ₄, then r_n(δ) ≤ ˆr_n(δ) ≤ κ r_n(δ).

The proof of the proposition relies on elementary calculus and is left to the reader. A pleasant consequence of this lemma is that we may focus on the behavior of ψ̂_n and ŵ_n at w(r_n(δ)) and r_n(δ). In order to build estimates for ŵ_n, we will assume that w is defined by

w(r) = v(r) = sup{ √(α P|g − g*|) : α ∈ [0, 1], g ∈ C, α(L(g) − L*) ≤ r } .

This ensures that w(r)/√r is non-increasing. Before describing data-dependent functions ψ̂_n and ŵ_n that satisfy the conditions of the lemma, we check
that, within model C, above a critical threshold related to r_n(δ), the empirical excess risk L_n(g) − L_n(g*) faithfully reflects the excess risk L(g) − L(g*). The following lemma and corollary could have been stated right after the proof of Theorem 5.5 in Section 5. They should be considered as a collection of ratio-type concentration inequalities.

Lemma 8.5. With probability larger than 1 − 2δ, for all g in C,

L_n(g) − L_n(g*) ≤ L(g) − L(g*) + √(r_n(δ) (L(g) − L(g*))) + r_n(δ)

and

L(g) − L(g*) − √(r_n(δ) (L(g) − L(g*))) − r_n(δ) ≤ L_n(g) − L_n(g*) .

The proof consists in revisiting the proof of Theorem 5.5. An interesting consequence of this observation is the following corollary.

Corollary 8.6. There exists K ≥ 1 such that, with probability larger than 1 − δ,

{g ∈ C : L(g) − L(g*) ≥ K r_n(δ)} ⊆ {g ∈ C : L_n(g) − L_n(g*) ≥ K(1 − 1/√K) r_n(δ)}
and, with probability larger than 1 − δ,

{g ∈ C : L(g) − L(g*) ≤ K r_n(δ)} ⊆ {g ∈ C : L_n(g) − L_n(g*) ≤ K(1 + 1/√K) r_n(δ)} .

In order to compute approximations for ψ(·) and w(·), it will also be useful to rely on the fact that the L₂(P_n) metric structure of the loss class F faithfully reflects the L₂(P) metric on F. Note that, for any classifier g ∈ C, (g(X) − g*(X))² = |1_{g(X)≠Y} − 1_{g*(X)≠Y}|. As a matter of fact, this is even easier to establish than the preceding lemma: squares of empirical L₂ distances to g* are sums of i.i.d. random variables, so we are again in a position to invoke tools from empirical process theory. Moreover, the connection between P|g − g*| and the variance of g − g* is obvious: Var[g − g*] ≤ P|g − g*|.

Lemma 8.7. Let s_n(δ) denote the solution of the fixed-point equation

s = 4ψ(√s) + √(2s log(1/δ)/n) + 8 log(1/δ)/(3n) .

Then, with probability larger than 1 − 2δ, for all g ∈ C,

(1 − θ/2) P|g − g*| − s_n(δ)/(2θ) ≤ P_n|g − g*| ≤ (1 + θ/2) P|g − g*| + s_n(δ)/(2θ) .

The proof repeats again the proof of Theorem 5.5. This lemma will be used through the following corollary.

Corollary 8.8. For K ≥ 1, with probability larger than 1 − δ,

{g ∈ C : P|g − g*| ≥ K s_n(δ)} ⊆ {g ∈ C : P_n|g − g*| ≥ K(1 − 1/√K) s_n(δ)} ,

and, with probability larger than 1 − δ,

{g ∈ C : P|g − g*| ≤ K s_n(δ)} ⊆ {g ∈ C : P_n|g − g*| ≤ K(1 + 1/√K) s_n(δ)} .

We are now equipped to build estimators of w(·) and ψ(·). When building an estimator of w(·), the guideline consists of two simple observations:

(1/2) sup{ P|g − g′| : g, g′ ∈ C, L(g) ∨ L(g′) ≤ L(g*) + r }
≤ sup{ P|g − g*| : g ∈ C, L(g) ≤ L(g*) + r }
≤ sup{ P|g − g′| : g, g′ ∈ C, L(g) ∨ L(g′) ≤ L(g*) + r } .

This prompts us to try to estimate sup{ α P|g − g′| : α(L(g) ∨ L(g′) − L*) ≤ r }. This will prove to be feasible thanks to the results described above.

Lemma 8.9. Let K > 2 and let ŵ_n be defined by

ŵ_n²(r) = ((1 + 1/√K)/(1 − 1/√K)) sup{ α P_n|g − g′| : α ∈ [0, 1], g, g′ ∈ C, L_n(g) ∨ L_n(g′) ≤ L_n(g_n) + (1/α) K(1 − 1/√K) r } .

Let κ₃ = √(2(1 + 1/√K)/(1 − 1/√K)) and κ₄ = 2K(K + 1)/(K − 1). Then, with probability larger than 1 − 4δ,

w(r_n(δ)) ≤ ŵ_n(r_n(δ)) ≤ κ₃ w(κ₄ r_n(δ)) .
Proof. Let r be such that r ≥ r_n(δ). Thanks to Lemma 8.5, with probability larger than 1 − δ, L_n(g*) ≤ L_n(g_n) + r_n(δ) ≤ L_n(g_n) + r, and L(g) − L(g*) ≥ (K/α) r implies L_n(g) − L_n(g*) ≥ (K/α)(1 − 1/√K) r. Hence every pair (g, g*) with α(L(g) − L(g*)) ≤ r satisfies the empirical constraint defining ŵ_n²(r); combined with Lemma 8.7, with probability larger than 1 − 2δ this yields the lower bound ŵ_n²(r) ≥ w²(r). On the other hand, applying the elementary observation above, Lemma 8.5, and then Lemma 8.7, with probability larger than 1 − 2δ,

ŵ_n²(r) ≤ (2(1 + 1/√K)/(1 − 1/√K)) sup{ α P|g − g*| : α ∈ [0, 1], g ∈ C, α(L(g) − L(g*)) ≤ (2K(K + 1)/(K − 1)) r } ≤ κ₃² w²(κ₄ r) .

Lemma 8.10. Assume that ψ(w(r_n(δ))) ≥ 8 log(1/δ)/n. Let K ≥ 4 and let

ψ̂_n(r) = 2 R_n{ α(1_{g(X)≠Y} − 1_{g_n(X)≠Y}) : α ∈ [0, 1], g ∈ C, α² P_n|g − g_n| ≤ K r² } .

Then, with probability larger than 1 − 6δ,

ψ(w(r_n(δ))) ≤ ψ̂_n(w(r_n(δ))) ≤ 8 ψ(√(2(K + 2)) w(r_n(δ))) .

Proof. Note that ψ̂_n is positive and non-decreasing and, because it is defined with respect to star-hulls, ψ̂_n(r)/r is non-increasing. First recall that, by Theorem 5.5 and Lemma 8.7 (taking θ = 1 there), with probability larger than 1 − 2δ,

P|g_n − g*| ≤ w²(r_n(δ)) ,  P_n|g_n − g*| ≤ (3/2) w²(r_n(δ)) + s_n(δ)/2 ,

and

P_n|g − g*| ≤ (3/2) (P|g − g*| + w²(r_n(δ))) + s_n(δ)/2 .
Let us first establish that, with probability larger than 1 − 2δ, ψ̂_n(w(r_n(δ))) is larger than the empirical Rademacher complexity of the star-hull of a fixed class of loss functions. For K ≥ 4 we have K w²(r_n(δ)) ≥ 3 w²(r_n(δ)) + s_n(δ). Invoking the observations above, with probability larger than 1 − 2δ,

ψ̂_n(w(r_n(δ))) = 2 R_n{ α(1_{g(X)≠Y} − 1_{g_n(X)≠Y}) : α ∈ [0, 1], α² P_n|g − g_n| ≤ K w²(r_n(δ)) }
≥ 2 R_n{ α(1_{g(X)≠Y} − 1_{g*(X)≠Y}) : α ∈ [0, 1], α² P_n|g − g*| ≤ w²(r_n(δ)) } .

By Lemma 8.2, with probability larger than 1 − δ, the empirical Rademacher complexity is larger than half of its expected value.

Let us now check that ψ̂_n(w(r_n(δ))) can be upper bounded by a multiple of ψ(w(r_n(δ))). Invoking again the observations above, with probability larger than 1 − 2δ,

ψ̂_n(w(r_n(δ))) ≤ 4 R_n{ α(1_{g(X)≠Y} − 1_{g*(X)≠Y}) : α ∈ [0, 1], α² P_n|g − g*| ≤ K w²(r_n(δ)) }
≤ 4 R_n{ α(1_{g(X)≠Y} − 1_{g*(X)≠Y}) : α ∈ [0, 1], α² P|g − g*| ≤ 2(s_n(δ) + (K + 1) w²(r_n(δ))) }
≤ 4 R_n{ α(1_{g(X)≠Y} − 1_{g*(X)≠Y}) : α ∈ [0, 1], α² P|g − g*| ≤ 2(K + 2) w²(r_n(δ)) } .

Now the last quantity is again the conditional Rademacher average of a fixed class of functions. By Lemma 8.2, with probability larger than 1 − δ,

R_n{ α(1_{g(X)≠Y} − 1_{g*(X)≠Y}) : α ∈ [0, 1], α² P|g − g*| ≤ 2(K + 2) w²(r_n(δ)) }
≤ 2 E[ R_n{ α(1_{g(X)≠Y} − 1_{g*(X)≠Y}) : α ∈ [0, 1], α² P|g − g*| ≤ 2(K + 2) w²(r_n(δ)) } ] .

Hence, with probability larger than 1 − 3δ, ψ̂_n(w(r_n(δ))) ≤ 8 ψ(√(2(K + 2)) w(r_n(δ))).

We may now conclude this section with the following result, obtained by combining Proposition 8.4, Lemma 8.9, and Lemma 8.10, and choosing K = 4 in both lemmas.

Proposition 8.11. Let ψ and w be defined as

ψ(r) = E[ R_n{ f : f ∈ F*, P f² ≤ r² } ]  and  w(r) = sup{ √(P f²) : f ∈ F*, P f ≤ r } ,

let ŵ_n be defined by

ŵ_n²(r) = 3 sup{ α P_n|g − g′| : α ∈ [0, 1], g, g′ ∈ C, L_n(g) ∨ L_n(g′) ≤ L_n(g_n) + (2/α) r } ,

and let ψ̂_n be defined by

ψ̂_n(r) = 2 R_n{ α(1_{g(X)≠Y} − 1_{g_n(X)≠Y}) : α ∈ [0, 1], g ∈ C, α² P_n|g − g_n| ≤ 4r² } .
Let ˆr_n(δ) be defined as the solution of the equation

r = 4ψ̂_n(ŵ_n(r)) + ŵ_n(r) √(2 log(1/δ)/n) + 8 log(1/δ)/(3n) .

Then, with probability at least 1 − 10δ,

r_n(δ) ≤ ˆr_n(δ) ≤ 480 r_n(δ) .

Note that although we give explicit constants, no attempt has been made to optimize their values. It is believed that the last constant, 480, can be dramatically improved, at least by being more careful.

Bibliographical remarks. Analogues of Lemma 8.5 can be found in Koltchinskii [125] and Bartlett, Mendelson, and Philips [30]. The presentation given here is inspired by [125]. The idea of estimating the δ-reliable excess risk bounds r_n(δ) is put forward in [125], where several variants are exposed.

8.6. Pre-testing

In classification, the difficulties encountered by model selection through penalization partly stem from the fact that penalty calibration compels us to compare each L(g_{n,k}) with the inaccessible golden standard L(g*), although we actually only need to compare L(g_{n,k}) with L(g_{n,k′}), and to calibrate a threshold τ(k, k′, D_n) so that L_n(g_{n,k}) ≤ L_n(g_{n,k′}) + τ(k, k′, D_n) when L(g_{n,k}) is not significantly larger than L(g_{n,k′}). As estimating the excess risk looks easier when the Bayes classifier belongs to the model, we present in this section a setting where the performance of the model selection method essentially relies on the ability to estimate the excess risk when the Bayes classifier belongs to the model. Throughout this section, we rely on a few non-trivial assumptions.

Assumption 8.1.
(1) The sequence of models (C_k)_k is nested: C_k ⊆ C_{k+1}.
(2) There exists some index k* such that for all k ≥ k*, the Bayes classifier g* belongs to C_k. That is, we assume that the approximation bias vanishes for sufficiently large models. Conforming to a somewhat misguiding tradition, we call model C_{k*} the true model.
(3) There exists a constant Γ such that, for each k ≥ k*, with probability larger than 1 − δ/(12k²), for all j ≥ k,

r_k(δ/(12k²)) ≤ τ(j, k, D_n) ≤ Γ r_k(δ/(12k²)) ,

where r_k(·) is a distribution-dependent upper bound on the excess risk in model C_k with tunable reliability,
defined as in Section 5.3. For each pair of indices j ≤ k, let the threshold τ(j, k, D_n) be defined by τ(j, k, D_n) = ˆr_k(δ/(12k²)), where ˆr_k(·) is defined as in Proposition 8.11 of Section 8.5. Hence we may take Γ = 480. Note that, for k ≥ k*, the threshold looks like the ideal penalty described by (35).

The pre-testing method consists in first determining which models are admissible. Model C_j is said to be admissible if, for all k larger than j, there exists some g ∈ C_j ⊆ C_k such that L_n(g) ≤ L_n(g_{n,k}) + τ(j, k, D_n). The aggregation procedure then selects the smallest admissible index

k̂ = min{ j : ∀k > j, ∃g ∈ C_j, L_n(g) ≤ L_n(g_{n,k}) + τ(j, k, D_n) }

and outputs the minimizer g_{n,k̂} of the empirical risk in C_{k̂}. Note that the pre-testing procedure does not fit exactly in the framework of the comparison method mentioned in Section 8.2. There, model selection was supposed to be based on comparisons between empirical risk minimizers. Here, model selection is based on the (estimated) ability to approximate g_{n,k} by classifiers from C_j.
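For nested models, the empirical risk minimizer of C_j is the natural witness for the existential condition, so the admissibility test reduces to comparing the minimal empirical risks of the models. The admissibility check and the choice of the smallest admissible index can then be sketched as follows; the risk values and the constant threshold are hypothetical stand-ins for L_n(g_{n,j}) and τ(j, k, D_n).

```python
def pretest_select(empirical_risks, threshold):
    """Return the smallest admissible index.  Index j is admissible when,
    for every k > j, the best empirical risk within model j exceeds that
    of model k by at most the threshold tau(j, k)."""
    K = len(empirical_risks)
    for j in range(K):
        if all(empirical_risks[j] <= empirical_risks[k] + threshold(j, k)
               for k in range(j + 1, K)):
            return j
    return K - 1  # the largest model is vacuously admissible

# Hypothetical nested models: empirical risk decreases with model size.
risks = [0.40, 0.26, 0.25, 0.248]       # L_n(g_{n,j}) for j = 0..3
tau = lambda j, k: 0.03                 # a constant stand-in threshold
j_hat = pretest_select(risks, tau)
```

In this toy run, model 0 fails the test against the larger models (0.40 is not within 0.03 of 0.26), while model 1 passes against both models 2 and 3, so it is selected.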
Theorem 8.12. Let δ > 0. Let (C_k)_k denote a collection of nested models that satisfies Assumption 8.1. Let the index k̂ and the classifier g_{n,k̂} be chosen according to the pre-testing procedure. Then, with probability larger than 1 − δ,

L(g_{n,k̂}) − L* ≤ Γ(1 + 4Γ) r_{k*}(δ/(12(k*)²)) .

The theorem implies that, with probability larger than 1 − δ, the excess risk of the selected classifier is of the same order of magnitude as the available upper bound on the excess risk of the true model. Note that this statement does not exactly match the goal we assigned ourselves in Section 8.1: the excess risk of the aggregated classifier is not compared with the excess risk of the oracle. Although the true model may coincide with the oracle for large sample sizes, this may not be the case for small and moderate sample sizes. The proof is organized into three lemmas.

Lemma 8.13. With probability larger than 1 − δ/3, model C_{k*} is admissible.

Proof. From Theorem 5.5, for each k ≥ k*, with probability larger than 1 − δ/(12k²),

L_n(g_{n,k*}) − L_n(g_{n,k}) ≤ r_k(δ/(12k²)) .

The proof of Lemma 8.13 is then completed by using the assumption that, for each index k larger than k*, with probability larger than 1 − δ/(12k²), r_k(δ/(12k²)) ≤ τ(k*, k, D_n) holds, and resorting to the union bound.

The next lemma deals with models which suffer an excessive approximation bias. The proof of this lemma will again rely on Theorem 5.5; but this time, the model under investigation is C_{k*}.

Lemma 8.14. Under Assumption 8.1, let κ be such that κ − √κ ≥ Γ. Then, with probability larger than 1 − δ/3, no index k < k* such that

inf_{g ∈ C_k} L(g) ≥ L* + κ r_{k*}(δ/(12(k*)²))

is admissible.

Proof. As all models C_k satisfying the condition in the lemma are included in

{ g ∈ C_{k*} : L(g) ≥ L* + κ r_{k*}(δ/(12(k*)²)) } ,

it is enough to focus on the empirical process indexed by C_{k*}, and to apply Lemma 8.5 to C_{k*}. Choosing θ = 1/√κ, for all k of interest, with probability larger than 1 − δ/(12(k*)²), we have

L_n(g_{n,k}) − L_n(g_{n,k*}) ≥ L_n(g_{n,k}) − L_n(g*) ≥ (κ − √κ) r_{k*}(δ/(12(k*)²)) ≥ Γ r_{k*}(δ/(12(k*)²)) .

Now, with probability larger than 1 − δ/(12(k*)²), the right
hand side is larger than τ(k, k*, D_n).

The third lemma is a direct consequence of Theorem 5.8. It ensures that, with high probability, the pre-testing procedure provides a trade-off between estimation bias and approximation bias which is not much worse than the one provided by model C_{k*}.
Lemma 8.15. Let κ be such that κ − √κ ≥ Γ. Under Assumption 8.1, for any k ≥ k* such that

inf_{g ∈ C_k} L(g) ≤ L* + κ r_k(δ/(12k²)) ,

with probability larger than 1 − δ/k²,

L(g_{n,k}) − L(g*) ≤ (κ + √κ) r_k(δ/(12k²)) .

Bibliographical remarks. Pre-testing procedures were proposed by Lepskii [136], [137], [135] for performing model selection in a regression context. They are also discussed by Birgé [35]. Their use in model selection for classification was pioneered by Tsybakov [221], which is the main source of inspiration for this section. Koltchinskii [125] also revisits comparison-based methods using concentration inequalities and provides a unified account of penalty-based and comparison-based model selection techniques in classification. In this section we presented model selection from a hybrid perspective, mixing the efficiency viewpoint advocated at the beginning of Section 8 (trying to minimize the classification risk without assuming anything about the optimal classifier g*) and the consistency viewpoint. In the latter perspective, it is assumed that there exists a true model, that is, a minimal model without approximation bias, and the goal is to first identify this true model (see Csiszár and Shields [67] and Csiszár [66] for examples of recent results in the consistency approach for different problems), and then to perform estimation in this hopefully true model. The main tools in the construction of data-dependent thresholds for determining admissibility are ratio-type uniform deviation inequalities. The introduction of Talagrand's inequality for suprema of empirical processes greatly simplified the derivation of such ratio-type inequalities. An early account of ratio-type inequalities, predating [216], can be found in Chapter V of van de Geer [226]. Bartlett, Mendelson, and Philips [30] provide a concise and comprehensive comparison between the random empirical structure and the original structure of the loss class. This analysis is geared toward the analysis of empirical risk minimization. The use and analysis of local Rademacher complexities was promoted by Koltchinskii
and Panchenko [126] (in the special case where L(g*) = 0) and reached a certain level of maturity in Bartlett, Bousquet, and Mendelson [25], where Rademacher complexities of L₂ balls around g* are considered. Koltchinskii [125] went one step further and pointed out that there is no need to estimate complexity and noise conditions separately: what matters is ψ(w(·)). Koltchinskii [125] (as well as Bartlett, Mendelson, and Philips [30]) proposed to compute localized Rademacher complexities on the level sets of the empirical risk. Lugosi and Wegkamp [147] propose penalties based on empirical Rademacher complexities of the class of classifiers reduced to those with small empirical risk, and obtain oracle inequalities that do not need the assumption that the optimal classifier is in one of the models. Van de Geer and Tsybakov [223] recently pointed out that in some special cases penalty-based model selection can achieve adaptivity to the noise conditions.

8.7. Revisiting hold-out estimates

Designing and assessing model selection policies based on either penalization or pre-testing requires a good command of empirical process theory. This partly explains why re-sampling techniques like ten-fold cross-validation tend to be favored by practitioners. Moreover, there is no simple way to reduce the computation of the risk estimates that are at the core of these model selection techniques to empirical risk minimization, while re-sampling methods do not suffer from such a drawback: from the computational complexity perspective, carrying out ten-fold cross-validation is not much harder than empirical risk minimization. Obtaining non-asymptotic oracle inequalities for such cross-validation methods remains a challenge.
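As a minimal illustration of the simplest such re-sampling scheme, the following sketch evaluates candidate classifiers on a held-out test set and keeps the one with the smallest test error. The one-dimensional data and the threshold classifiers are invented for the example; `hold_out_select` is an illustrative helper, not part of any library.

```python
import numpy as np

def hold_out_select(candidates, X_test, y_test):
    """Pick, among classifiers trained on the first part of the sample,
    the one with the smallest average 0-1 loss on the held-out part."""
    test_risks = [float(np.mean(g(X_test) != y_test)) for g in candidates]
    return int(np.argmin(test_risks)), test_risks

# Toy 1-d problem: the Bayes rule is sign(x); the candidates mimic
# threshold classifiers that a training phase might have produced.
rng = np.random.default_rng(1)
X_test = rng.uniform(-1, 1, size=500)
y_test = np.sign(X_test)
candidates = [lambda x, t=t: np.sign(x - t) for t in (-0.5, 0.0, 0.4)]
k_hat, _ = hold_out_select(candidates, X_test, y_test)
```

The middle candidate (threshold 0) coincides with the Bayes rule on this test set, so it is selected; the two misplaced thresholds each misclassify the points falling between their threshold and zero.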
The simplest cross-validation method is hold-out. It consists in splitting the sample of size n + m into two parts: a training set of length n and a test set of length m. Let us denote by L′_m(g) the average loss of g on the test set. Note that once the training set has been used to derive a collection of candidate classifiers (g_{n,k})_k, the model selection problem turns out to look like the problem we considered at the beginning of Section 5.2: picking a classifier from a finite collection. Here the collection is data-dependent, but we may analyze the problem by reasoning conditionally on the training set. A second difficulty is raised by the fact that we may no longer assume that the Bayes classifier belongs to the collection of candidate classifiers. We need to robustify the argument of Section 5.2. Henceforth, let g_{n,k̄} denote the minimizer of the probability of error in the collection (g_{n,k})_k. The following theorem is a strong incentive to theoretically investigate and practically use resampling methods. Moreover, its proof is surprisingly simple.

Theorem 8.16. Let (g_{n,k})_{k ≤ N} denote a collection of classifiers obtained by processing a random training sample of length n. Let k̄ denote the index k that minimizes E[L(g_{n,k}) − L(g*)]. Let k̂ denote the index k that minimizes L′_m(g_{n,k}), where the empirical risk L′_m is evaluated on an independent test sample of length m. Let w(·) be such that, for any classifier g, Var[1_{g(X)≠g*(X)}] ≤ w²(L(g) − L*), and such that w(x)/√x is non-increasing. Let τ denote the smallest positive solution of w(ε)/√m = ε. If θ ∈ (0, 1), then

E[L(g_{n,k̂}) − L(g*)] ≤ (1 + θ) [ inf_k E[L(g_{n,k}) − L(g*)] + (8/(3m) + 4τ/θ) log N ] .

Remark 8.17. Assume the Mammen-Tsybakov noise conditions with exponent α hold, that is, we can choose w(r) = (r/h)^{α/2} for some positive h. Then, as τ = (1/(m h^α))^{1/(2−α)}, the theorem translates into

E[L(g_{n,k̂}) − L(g*)] ≤ (1 + θ) [ inf_k E[L(g_{n,k}) − L(g*)] + (8/(3m) + (4/θ)(1/(m h^α))^{1/(2−α)}) log N ] .

Note that the hold-out based model selection method does not need to estimate the function w(·). Using the notation of (31), the oracle inequality
of Theorem 8.16 is almost optimal as far as the additive terms are concerned. Note, however, that the multiplicative factor on the right-hand side depends on the ratio between the minimal excess risk for samples of length n and samples of length n + m. This ratio depends on the setting of the learning problem, that is, on the approximation capabilities of the model collection and on the noise conditions. As a matter of fact, the choice of a good trade-off between training and test sample sizes is still a matter of debate.

Proof. By Bernstein's inequality and a union bound over the elements of the collection, with probability at least 1 − δ, for all g_{n,k},

L(g_{n,k}) − L(g*) ≤ L′_m(g_{n,k}) − L′_m(g*) + √(2 log(N/δ)/m) w(L(g_{n,k}) − L(g*)) + 4 log(N/δ)/(3m) ,

and

L′_m(g*) − L′_m(g_{n,k̄}) ≤ L(g*) − L(g_{n,k̄}) + √(2 log(N/δ)/m) w(L(g_{n,k̄}) − L(g*)) + 4 log(N/δ)/(3m) .
Summing the two inequalities, we obtain

L(g_{n,k}) − L(g_{n,k̄}) ≤ L′_m(g_{n,k}) − L′_m(g_{n,k̄}) + √(2 log(N/δ)/m) (w(L(g_{n,k}) − L(g*)) + w(L(g_{n,k̄}) − L(g*))) + 8 log(N/δ)/(3m) .    (36)

As L′_m(g_{n,k̂}) − L′_m(g_{n,k̄}) ≤ 0, with probability larger than 1 − δ,

L(g_{n,k̂}) − L(g_{n,k̄}) ≤ 2√(2 log(N/δ)/m) w(L(g_{n,k̂}) − L(g*)) + 8 log(N/δ)/(3m) .    (37)

Let τ be defined as in the statement of the theorem. If L(g_{n,k̂}) − L(g*) ≥ τ, then w(L(g_{n,k̂}) − L(g*))/√m ≤ √((L(g_{n,k̂}) − L(g*)) τ), and we have

L(g_{n,k̂}) − L(g_{n,k̄}) ≤ 2√(2 log(N/δ) τ (L(g_{n,k̂}) − L(g*))) + 8 log(N/δ)/(3m)
≤ (θ/2)(L(g_{n,k̂}) − L(g*)) + (4/θ) τ log(N/δ) + 8 log(N/δ)/(3m) .

Hence, with probability larger than 1 − δ (with respect to the test set),

L(g_{n,k̂}) − L(g*) ≤ (1/(1 − θ/2)) (L(g_{n,k̄}) − L(g*)) + (1/(1 − θ/2)) 4 log(N/δ) (τ/θ + 2/(3m)) .

Finally, taking expectations with respect to the training set and the test set, we obtain the oracle inequality stated in the theorem.

Bibliographical remarks. Hastie, Tibshirani, and Friedman [104] provide an application-oriented discussion of model selection strategies, including an argument in defense of the hold-out methodology. An early account of the use of hold-out estimates in model selection can be found in Lugosi and Nobel [145] and in Bartlett, Boucheron, and Lugosi [24]. A sharp use of hold-out estimates in an adaptive regression framework is described by Wegkamp in [237]. This section essentially comes from the course notes by P. Massart [161], where better constants and exponential inequalities for the excess risk can be found.

Acknowledgments. We thank Anestis Antoniadis for encouraging us to write this survey. We are indebted to the associate editor and the referees for the excellent suggestions that significantly improved the paper.

References

[1] R. Ahlswede, P. Gács, and J. Körner. Bounds on conditional probabilities with applications in multi-user communication. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 34, 1976 (correction in 39, 1977).
[2] M.A. Aizerman, E.M. Braverman, and L.I. Rozonoer. The method of potential functions for the problem of restoring the characteristic of a function converter from randomly observed points. Automation and Remote Control, 25, 1964.
[3] M.A. Aizerman,
E.M. Braverman, and L.I. Rozonoer. The probability problem of pattern recognition learning and the method of potential functions. Automation and Remote Control, 25, 1964.
[4] M.A. Aizerman, E.M. Braverman, and L.I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 1964.
[5] M.A. Aizerman, E.M. Braverman, and L.I. Rozonoer. Method of Potential Functions in the Theory of Learning Machines. Nauka, Moscow, 1970.
[6] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 1974.
[7] S. Alesker. A remark on the Szarek-Talagrand theorem. Combinatorics, Probability, and Computing, 6, 1997.
[8] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44, 1997.
[9] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.
[10] M. Anthony and N. Biggs. Computational Learning Theory. Cambridge Tracts in Theoretical Computer Science (30). Cambridge University Press, Cambridge, 1992.
[11] M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied Mathematics, 47, 1993.
[12] A. Antos, L. Devroye, and L. Györfi. Lower bounds for Bayes error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1999.
[13] A. Antos, B. Kégl, T. Linder, and G. Lugosi. Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research, 3:73-98, 2002.
[14] A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30:31-56, 1998.
[15] P. Assouad. Densité et dimension. Annales de l'Institut Fourier, 33, 1983.
[16] J.-Y. Audibert and O. Bousquet. PAC-Bayesian generic chaining. In L. Saul, S. Thrun, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, Mass., 2004.
[17] J.-Y. Audibert. PAC-Bayesian Statistical Learning Theory. PhD thesis, Université Paris 6, Pierre et Marie Curie, 2004.
[18] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 68, 1967.
[19] Y. Baraud. Model selection for regression on a fixed design. Probability Theory and Related Fields, 117(4), 2000.
[20] A.R. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113, 1999.
[21] A.R. Barron. Logically smooth density estimation. Technical Report TR 56, Department of Statistics, Stanford University, 1985.
[22] A.R. Barron. Complexity regularization with application to artificial neural networks. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics. NATO ASI Series, Kluwer Academic Publishers, Dordrecht, 1991.
[23] A.R. Barron and T.M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37, 1991.
[24] P. Bartlett, S. Boucheron, and G. Lugosi.
Model selection and error estimation. Machine Learning, 48:85-113, 2001.
[25] P. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexities. The Annals of Statistics, 33, 2005.
[26] P.L. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. Theoretical Computer Science, 284:53-66, 2002.
[27] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, to appear, 2005.
[28] P.L. Bartlett and W. Maass. Vapnik-Chervonenkis dimension of neural nets. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 2003. Second edition.
[29] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3, 2002.
[30] P.L. Bartlett, S. Mendelson, and P. Philips. Local complexities for empirical risk minimization. In Proceedings of the 17th Annual Conference on Learning Theory (COLT). Springer, 2004.
[31] O. Bashkirov, E.M. Braverman, and I.E. Muchnik. Potential function algorithms for pattern recognition learning machines. Automation and Remote Control, 25, 1964.
[32] S. Ben-David, N. Eiron, and H.-U. Simon. Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3, 2002.
[33] G. Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57:33-45, 1962.
[34] S.N. Bernstein. The Theory of Probabilities. Gostehizdat Publishing House, Moscow, 1946.
[35] L. Birgé. An alternative point of view on Lepski's method. In State of the Art in Probability and Statistics (Leiden, 1999), volume 36 of IMS Lecture Notes Monogr. Ser. Inst. Math. Statist., Beachwood, OH, 2001.
[36] L. Birgé and P. Massart. Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields, 97, 1993.
[37] L. Birgé and P. Massart. From model selection to adaptive estimation. In E. Torgersen, D. Pollard, and G. Yang, editors, Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics. Springer, New York, 1997.
[38]
L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4, 1998.
[39] G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. The Annals of Statistics, to appear, 2006.
[40] G. Blanchard, G. Lugosi, and N. Vayatis. On the rates of convergence of regularized boosting classifiers. Journal of Machine Learning Research, 4, 2003.
[41] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36, 1989.
[42] S. Bobkov and M. Ledoux. Poincaré's inequalities and Talagrand's concentration phenomenon for the exponential distribution. Probability Theory and Related Fields, 107, 1997.
[43] B. Boser, I. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory (COLT). Association for Computing Machinery, New York, NY, 1992.
[44] S. Boucheron, O. Bousquet, G. Lugosi, and P. Massart. Moment inequalities for functions of independent random variables. The Annals of Probability, 33, 2005.
[45] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16, 2000.
[46] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities using the entropy method. The Annals of Probability, 31, 2003.
[47] O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris, 334, 2002.
[48] O. Bousquet. Concentration inequalities for sub-additive functions using the entropy method. In C. Houdré, E. Giné, and D. Nualart, editors, Stochastic Inequalities and Applications. Birkhäuser, 2003.
[49] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2, 2002.
[50] O. Bousquet, V. Koltchinskii, and D. Panchenko. Some local measures of complexity of convex hulls and generalization bounds. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT). Springer, 2002.
[51] L. Breiman. Arcing classifiers. The Annals of Statistics, 26, 1998.
[52] L. Breiman. Some infinite theory for predictor ensembles. The Annals of Statistics, 2004.
[53] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth International, Belmont, CA, 1984.
[54] P. Bühlmann and B. Yu. Boosting with the L2-loss: regression and classification. Journal of the American Statistical Association, 98, 2004.
[55] A. Cannon, J.M. Ettinger, D. Hush, and C. Scovel. Machine learning with data dependent hypothesis classes. Journal of Machine Learning Research, 2, 2002.
[56] G.
Castellan. Density estimation via exponential model selection. IEEE Transactions on Information Theory, 49(8), 2003.
[57] O. Catoni. Randomized estimators and empirical complexity for pattern recognition and least square regression. Preprint PMA-677.
[58] O. Catoni. Statistical Learning Theory and Stochastic Optimization. École d'été de Probabilités de Saint-Flour XXXI. Springer-Verlag, Lecture Notes in Mathematics, Vol. 1851, 2004.
[59] O. Catoni. Localized empirical complexity bounds and randomized estimators, 2003. Preprint.
[60] N. Cesa-Bianchi and D. Haussler. A graph-theoretic generalization of the Sauer-Shelah lemma. Discrete Applied Mathematics, 86:27-35, 1998.
[61] M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48, 2002.
[62] C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:1-25, 1995.
[63] T.M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14, 1965.
[64] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31, 1979.
[65] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.
[66] I. Csiszár. Large-scale typicality of Markov sample paths and consistency of MDL order estimators. IEEE Transactions on Information Theory, 48, 2002.
[67] I. Csiszár and P. Shields. The consistency of the BIC Markov order estimator. The Annals of Statistics, 28, 2000.
[68] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, pages 1-50, January 2002.
[69] A. Dembo. Information inequalities and concentration of measure. The Annals of Probability, 25, 1997.
[70] P.A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs, NJ, 1982.
[71] L. Devroye. Automatic pattern recognition: a study of the probability of error. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 10, 1988.
[72] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
[73] L. Devroye and G. Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28, 1995.
[74] L. Devroye and T. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2), 1979.
[75] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5), 1979.
[76] D.L. Donoho and I.M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3), 1994.
[77] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley, New York, 1973.
[78] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley and Sons, 2000.
[79] R.M. Dudley. Central limit theorems for empirical measures. The Annals of Probability, 6, 1978.
[80] R.M. Dudley. Balls in R^k do not cut all subsets of k+2 points. Advances in Mathematics, 31(3), 1979.
[81] R.M. Dudley. Empirical processes. In Ecole de Probabilité de St. Flour 1982. Lecture Notes in Mathematics #1097, Springer-Verlag, New York, 1984.
[82] R.M. Dudley. Universal Donsker classes and metric entropy. The Annals of Probability, 15, 1987.
[83] R.M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, Cambridge, 1999.
[84] R.M. Dudley, E. Giné, and J. Zinn. Uniform and universal Glivenko-Cantelli classes. Journal of Theoretical Probability, 4, 1991.
[85] B. Efron. Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7:1-26, 1979.
[86] B. Efron. The jackknife, the bootstrap, and other resampling plans. SIAM, Philadelphia, 1982.
[87] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1994.
[88] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82, 1989.
[89] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.
[90] P. Frankl. On the trace of finite sets. Journal of Combinatorial Theory, Series A, 34:41-45, 1983.
[91] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121, 1995.
[92] Y. Freund. Self bounding learning algorithms. In Proceedings of the 11th Annual Conference on Computational Learning Theory, 1998.
[93] Y. Freund, Y. Mansour, and R.E. Schapire. Generalization bounds for averaged classifiers (how to be a Bayesian without believing). The Annals of Statistics, 2004.
[94] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 1997.
[95] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28, 2000.
[96] M. Fromont. Some problems related to model selection: adaptive tests and bootstrap calibration of penalties. Thèse de doctorat, Université Paris-Sud, December 2003.
[97] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 1972.
[98] E. Giné. Empirical processes and applications: an overview. Bernoulli, 2:1-28, 1996.
[99] E. Giné and J. Zinn. Some limit theorems for empirical processes. The Annals of Probability, 12, 1984.
[100] E. Giné. Lectures on some aspects of the bootstrap. In Lectures on probability theory and statistics (Saint-Flour, 1996), volume 1665 of Lecture Notes in Math. Springer, Berlin, 1997.
[101] P. Goldberg and M. Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real numbers. Machine Learning, 18, 1995.
[102] U. Grenander. Abstract inference. John Wiley & Sons Inc., New York, 1981.
[103] P. Hall. Large sample optimality of least squares cross-validation in density estimation. Annals of Statistics, 11(4), 1983.
[104] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer Verlag, New York, 2001.
[105] D. Haussler. Decision theoretic generalizations of the PAC model for neural nets and other learning applications. Information and Computation, 100:78-150, 1992.
[106] D. Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69, 1995.
[107] D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1} functions from randomly drawn points. In Proceedings of the 29th IEEE Symposium on the Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA, 1988.
[108] R. Herbrich and R.C. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3, 2003.
[109] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.
[110] P. Huber. The behavior of the maximum likelihood estimates under non-standard conditions. In Proc. Fifth Berkeley Symposium on Probability and Mathematical Statistics. Univ. California Press, 1967.
[111] W. Jiang. Process consistency for AdaBoost. The Annals of Statistics, 32:13-29, 2004.
[112] D.S. Johnson and F.P. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6:93-107, 1978.
[113] I. Johnstone. Function estimation and gaussian sequence models. Technical Report, Department of Statistics, Stanford University, 2002.
[114] M. Karpinski and A. Macintyre. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Science, 54, 1997.
[115] M. Kearns, Y. Mansour, A.Y. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth Annual ACM Workshop on Computational Learning Theory. Association for Computing Machinery, New York, 1995.
[116] M.J. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6), 1999.
[117] M.J. Kearns and U.V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, Massachusetts, 1994.
[118] A.G. Khovanskii. Fewnomials. Translations of Mathematical Monographs, vol. 88, American Mathematical Society, 1991.
[119] J.C. Kieffer. Strongly consistent code-based identification and order estimation for constrained finite-state model classes. IEEE Transactions on Information Theory, 39, 1993.
[120] G.S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41, 1970.
[121] P. Koiran and E.D. Sontag. Neural networks with quadratic VC dimension. Journal of Computer and System Science, 54, 1997.
[122] A.N. Kolmogorov. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR, 114, 1957.
[123] A.N. Kolmogorov and V.M. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces. American Mathematical Society Translations, Series 2, 17, 1961.
[124] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47, 2001.
[125] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Manuscript, September 2003.
[126] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. In E. Giné, D.M. Mason, and J.A. Wellner, editors, High Dimensional Probability II, 2000.
[127] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30, 2002.
[128] S. Kulkarni, G. Lugosi, and S. Venkatesh. Learning pattern classification: a survey. IEEE Transactions on Information Theory, 44, 1998. Information Theory: Commemorative special issue.
[129] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In UAI-2002: Uncertainty in Artificial Intelligence, 2002.
[130] J. Langford and M. Seeger. Bounds for averaging classifiers. Technical report CMU-CS, Carnegie Mellon University, 2001.
[131] M. Ledoux. Isoperimetry and gaussian analysis. In P. Bernard, editor, Lectures on Probability Theory and Statistics. Ecole d'Eté de Probabilités de St-Flour XXIV-1994, 1996.
[132] M. Ledoux. On Talagrand's deviation inequalities for product measures. ESAIM: Probability and Statistics, 1:63-87.
[133] M. Ledoux and M. Talagrand. Probability in Banach Space. Springer-Verlag, New York, 1991.
[134] W.S. Lee, P.L. Bartlett, and R.C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5), 1998.
[135] O.V. Lepskiĭ, E. Mammen, and V.G. Spokoiny. Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. Annals of Statistics, 25(3), 1997.
[136] O.V. Lepskiĭ. A problem of adaptive estimation in Gaussian white noise. Teor. Veroyatnost. i Primenen., 35(3), 1990.
[137] O.V. Lepskiĭ. Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates. Teor. Veroyatnost. i Primenen., 36(4), 1991.
[138] Y. Li, P.M. Long, and A. Srinivasan. Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences, 62, 2001.
[139] Y. Lin. A note on margin-based loss functions in classification. Technical Report 1029r, Department of Statistics, University of Wisconsin, Madison, 1999.
[140] Y. Lin. Some asymptotic properties of the support vector machine. Technical Report 1044r, Department of Statistics, University of Wisconsin, Madison, 1999.
[141] Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 2002.
[142] F. Lozano. Model selection using Rademacher penalization. In Proceedings of the Second ICSC Symposia on Neural Computation (NC2000). ICSC Academic Press, 2000.
[143] M.J. Luczak and C. McDiarmid. Concentration for locally acting permutations. Discrete Mathematics, 265, 2003.
[144] G. Lugosi. Pattern classification and learning theory. In L. Györfi, editor, Principles of Nonparametric Learning, pages 5-62. Springer, Wien, 2002.
[145] G. Lugosi and A. Nobel. Adaptive model selection using empirical complexities. The Annals of Statistics, 27, 1999.
[146] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32:30-55, 2004.
[147] G. Lugosi and M. Wegkamp. Complexity regularization via localized random penalties. The Annals of Statistics, 32, 2004.
[148] G. Lugosi and K. Zeger. Concept learning using complexity regularization. IEEE Transactions on Information Theory, 42:48-54, 1996.
[149] A. Macintyre and E.D. Sontag. Finiteness results for sigmoidal neural networks. In Proceedings of the 25th Annual ACM Symposium on the Theory of Computing. Association of Computing Machinery, New York, 1993.
[150] C.L. Mallows. Some comments on Cp. Technometrics, 15, 1973.
[151] E. Mammen and A. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6), 1999.
[152] S. Mannor and R. Meir. Weak learners and improved convergence rate in boosting. In Advances in Neural Information Processing Systems 13: Proc. NIPS 2000, 2001.
[153] S. Mannor, R. Meir, and T. Zhang. The consistency of greedy algorithms for classification. In Proceedings of the 15th Annual Conference on Computational Learning Theory, 2002.
[154] K. Marton. A simple proof of the blowing-up lemma. IEEE Transactions on Information Theory, 32, 1986.
[155] K. Marton. Bounding d-distance by informational divergence: a way to prove measure concentration. The Annals of Probability, 24, 1996.
[156] K. Marton. A measure concentration inequality for contracting Markov chains. Geometric and Functional Analysis, 6, 1996. Erratum: 7, 1997.
[157] L. Mason, J. Baxter, P.L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 1999.
[158] P. Massart. Optimal constants for Hoeffding type inequalities. Technical report, Mathématiques, Université de Paris-Sud, Report 9886, 1998.
[159] P. Massart. About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability, 28, 2000.
[160] P. Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX, 2000.
[161] P. Massart. Ecole d'Eté de Probabilité de Saint-Flour XXXIII, chapter Concentration inequalities and model selection. LNM, Springer-Verlag, 2003.
[162] P. Massart and E. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, to appear.
[163] D.A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the 11th Annual Conference on Computational Learning Theory. ACM Press, 1998.
[164] D.A. McAllester. PAC-Bayesian model averaging. In Proceedings of the 12th Annual Conference on Computational Learning Theory. ACM Press, 1999.
[165] D.A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5-21, 2003.
[166] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989. Cambridge University Press, Cambridge, 1989.
[167] C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics. Springer, New York, 1998.
[168] C. McDiarmid. Concentration for independent permutations. Combinatorics, Probability, and Computing, 2, 2002.
[169] G.J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. John Wiley, New York, 1992.
[170] S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48, 2002.
[171] S. Mendelson. A few notes on statistical learning theory. In S. Mendelson and A. Smola, editors, Advanced Lectures in Machine Learning, LNCS 2600, pages 1-40. Springer, 2003.
[172] S. Mendelson and P. Philips. On the importance of small coordinate projections. Journal of Machine Learning Research, 5, 2004.
[173] S. Mendelson and R. Vershynin. Entropy and the combinatorial dimension. Inventiones Mathematicae, 152:37-55, 2003.
[174] V. Milman and G. Schechtman. Asymptotic theory of finite-dimensional normed spaces. Springer-Verlag, New York, 1986.
[175] B.K. Natarajan. Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA, 1991.
[176] D. Panchenko. A note on Talagrand's concentration inequality. Electronic Communications in Probability, 6, 2001.
[177] D. Panchenko. Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability, 7, 2002.
[178] D. Panchenko. Symmetrization approach to concentration inequalities for empirical processes. The Annals of Probability, 31, 2003.
[179] T. Poggio, S. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428, 2004.
[180] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.
[181] D. Pollard. Uniform ratio limit theorems for empirical processes. Scandinavian Journal of Statistics, 22, 1995.
[182] W. Polonik. Measuring mass concentrations and estimating density contour clusters: an excess mass approach. The Annals of Statistics, 23(3), 1995.
[183] E. Rio. Inégalités de concentration pour les processus empiriques de classes de parties. Probability Theory and Related Fields, 119, 2001.
[184] E. Rio. Une inégalité de Bennett pour les maxima de processus empiriques. In Colloque en l'honneur de J. Bretagnolle, D. Dacunha-Castelle et I. Ibragimov, Annales de l'Institut Henri Poincaré, 2001.
[185] B.D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
[186] J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11, 1983.
[187] W.H. Rogers and T.J. Wagner. A finite sample distribution-free performance bound for local discrimination rules. Annals of Statistics, 6, 1978.
[188] M. Rudelson and R. Vershynin. Combinatorics of random processes and sections of convex bodies. The Annals of Math., to appear, 2004.
[189] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13, 1972.
[190] R.E. Schapire. The strength of weak learnability. Machine Learning, 5, 1990.
[191] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26, 1998.
[192] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[193] D. Schuurmans. Characterizing rational versus exponential learning curves. In Computational Learning Theory: Second European Conference. EuroCOLT'95, Springer Verlag, 1995.
[194] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6, 1978.
[195] C. Scovel and I. Steinwart. Fast rates for support vector machines. Los Alamos National Laboratory Technical Report LA-UR, 2003.
[196] M. Seeger. PAC-Bayesian generalisation error bounds for gaussian process classification. Journal of Machine Learning Research, 3, 2002.
[197] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
[198] S. Shelah. A combinatorial problem: Stability and order for models and theories in infinity languages. Pacific Journal of Mathematics, 41, 1972.
[199] G.R. Shorack and J. Wellner. Empirical Processes with Applications in Statistics. Wiley, New York, 1986.
[200] H.U. Simon. General lower bounds on the number of examples needed for learning probabilistic concepts. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory. Association for Computing Machinery, New York, 1993.
[201] A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
[202] A.J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11, 1998.
[203] D.F. Specht. Probabilistic neural networks and the polynomial Adaline as complementary techniques for classification. IEEE Transactions on Neural Networks, 1, 1990.
[204] J.M. Steele. Existence of submatrices with all possible columns. Journal of Combinatorial Theory, Series A, 28:84-88, 1978.
[205] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, pages 67-93, 2001.
[206] I. Steinwart. Consistency of support vector machines and other regularized kernel machines. IEEE Transactions on Information Theory, 51, 2005.
[207] I. Steinwart. Support vector machines are universally consistent. Journal of Complexity, 18, 2002.
[208] I. Steinwart. On the optimal parameter choice in ν-support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 2003.
[209] I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4, 2003.
[210] S.J. Szarek and M. Talagrand. On the convexified Sauer-Shelah theorem. Journal of Combinatorial Theory, Series B, 69, 1997.
[211] M. Talagrand. The Glivenko-Cantelli problem. The Annals of Probability, 15, 1987.
[212] M. Talagrand. Sharper bounds for Gaussian and empirical processes. The Annals of Probability, 22:28-76, 1994.
[213] M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'IHES, 81:73-205, 1995.
[214] M. Talagrand. The Glivenko-Cantelli problem, ten years later. Journal of Theoretical Probability, 9, 1996.
[215] M. Talagrand. Majorizing measures: the generic chaining. The Annals of Probability, 24, 1996. (Special Invited Paper)
[216] M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126, 1996.
[217] M. Talagrand. A new look at independence. The Annals of Probability, 24:1-34, 1996. (Special Invited Paper)
[218] M. Talagrand. Vapnik-Chervonenkis type conditions and uniform Donsker classes of functions. The Annals of Probability, 31, 2003.
[219] M. Talagrand. The generic chaining: upper and lower bounds for stochastic processes. Springer-Verlag, New York, 2005.
[220] A. Tsybakov. On nonparametric estimation of density level sets. Ann. Stat., 25(3), 1997.
[221] A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32, 2004.
[222] A.B. Tsybakov. Introduction à l'estimation non-paramétrique. Springer, 2004.
[223] A. Tsybakov and S. van de Geer. Square root penalty: adaptation to the margin in classification and in edge estimation. The Annals of Statistics, to appear, 2005.
[224] S. van de Geer. A new approach to least-squares estimation, with applications. The Annals of Statistics, 15, 1987.
[225] S. van de Geer. Estimating a regression function. The Annals of Statistics, 18, 1990.
[226] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, Cambridge, UK, 2000.
[227] A.W. van der Vaart and J.A. Wellner. Weak convergence and empirical processes. Springer-Verlag, New York, 1996.
[228] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.
[229] V.N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.
[230] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[231] V.N. Vapnik. Statistical Learning Theory. John Wiley, New York, 1998.
[232] V.N. Vapnik and A.Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 1971.
[233] V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974 (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.
[234] V.N. Vapnik and A.Ya. Chervonenkis. Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications, 26, 1981.
[235] M. Vidyasagar. A Theory of Learning and Generalization. Springer, New York, 1997.
[236] V. Vu. On the infeasibility of training neural networks with small mean squared error. IEEE Transactions on Information Theory, 44, 1998.
[237] M. Wegkamp. Model selection in nonparametric regression. Annals of Statistics, 31(1), 2003.
[238] R.S. Wenocur and R.M. Dudley. Some special Vapnik-Chervonenkis classes. Discrete Mathematics, 33, 1981.
[239] Y. Yang. Minimax nonparametric classification. I. Rates of convergence. IEEE Transactions on Information Theory, 45(7), 1999.
[240] Y. Yang. Minimax nonparametric classification. II. Model selection for adaptation. IEEE Transactions on Information Theory, 45(7), 1999.
[241] Y. Yang. Adaptive estimation in pattern recognition by combining different procedures. Statistica Sinica, 10, 2000.
[242] V.V. Yurinskii. Exponential bounds for large deviations. Theory of Probability and its Applications, 19, 1974.
[243] V.V. Yurinskii. Exponential inequalities for sums of random vectors. Journal of Multivariate Analysis, 6, 1976.
[244] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56-85, 2004.
[245] D.-X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49, 2003.
Normal Distribution.
Normal Distributio www.icrf.l Normal distributio I probability theory, the ormal or Gaussia distributio, is a cotiuous probability distributio that is ofte used as a first approimatio to describe realvalued
Confidence Intervals for One Mean
Chapter 420 Cofidece Itervals for Oe Mea Itroductio This routie calculates the sample size ecessary to achieve a specified distace from the mea to the cofidece limit(s) at a stated cofidece level for a
Statistical inference: example 1. Inferential Statistics
Statistical iferece: example 1 Iferetial Statistics POPULATION SAMPLE A clothig store chai regularly buys from a supplier large quatities of a certai piece of clothig. Each item ca be classified either
Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.
This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio
INFINITE SERIES KEITH CONRAD
INFINITE SERIES KEITH CONRAD. Itroductio The two basic cocepts of calculus, differetiatio ad itegratio, are defied i terms of limits (Newto quotiets ad Riema sums). I additio to these is a third fudametal
1. C. The formula for the confidence interval for a population mean is: x t, which was
s 1. C. The formula for the cofidece iterval for a populatio mea is: x t, which was based o the sample Mea. So, x is guarateed to be i the iterval you form.. D. Use the rule : p-value
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics
Infinite Sequences and Series
CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...
Measures of Spread and Boxplots Discrete Math, Section 9.4
Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,
THE ABRACADABRA PROBLEM
THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected
The Stable Marriage Problem
The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV [email protected] 1 Itroductio Imagie you are a matchmaker,
Factors of sums of powers of binomial coefficients
ACTA ARITHMETICA LXXXVI.1 (1998) Factors of sums of powers of biomial coefficiets by Neil J. Cali (Clemso, S.C.) Dedicated to the memory of Paul Erdős 1. Itroductio. It is well ow that if ( ) a f,a = the
TIGHT BOUNDS ON EXPECTED ORDER STATISTICS
Probability i the Egieerig ad Iformatioal Scieces, 20, 2006, 667 686+ Prited i the U+S+A+ TIGHT BOUNDS ON EXPECTED ORDER STATISTICS DIMITRIS BERTSIMAS Sloa School of Maagemet ad Operatios Research Ceter
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,
Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.
18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: Courat-Fischer formula ad Rayleigh quotiets The
University of California, Los Angeles Department of Statistics. Distributions related to the normal distribution
Uiversity of Califoria, Los Ageles Departmet of Statistics Statistics 100B Istructor: Nicolas Christou Three importat distributios: Distributios related to the ormal distributio Chi-square (χ ) distributio.
Analyzing Longitudinal Data from Complex Surveys Using SUDAAN
Aalyzig Logitudial Data from Complex Surveys Usig SUDAAN Darryl Creel Statistics ad Epidemiology, RTI Iteratioal, 312 Trotter Farm Drive, Rockville, MD, 20850 Abstract SUDAAN: Software for the Statistical
Case Study. Normal and t Distributions. Density Plot. Normal Distributions
Case Study Normal ad t Distributios Bret Halo ad Bret Larget Departmet of Statistics Uiversity of Wiscosi Madiso October 11 13, 2011 Case Study Body temperature varies withi idividuals over time (it ca
Swaps: Constant maturity swaps (CMS) and constant maturity. Treasury (CMT) swaps
Swaps: Costat maturity swaps (CMS) ad costat maturity reasury (CM) swaps A Costat Maturity Swap (CMS) swap is a swap where oe of the legs pays (respectively receives) a swap rate of a fixed maturity, while
AMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99
VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS Jia Huag 1, Joel L. Horowitz 2 ad Fegrog Wei 3 1 Uiversity of Iowa, 2 Northwester Uiversity ad 3 Uiversity of West Georgia Abstract We cosider a oparametric
The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles
The followig eample will help us uderstad The Samplig Distributio of the Mea Review: The populatio is the etire collectio of all idividuals or objects of iterest The sample is the portio of the populatio
A Faster Clause-Shortening Algorithm for SAT with No Restriction on Clause Length
Joural o Satisfiability, Boolea Modelig ad Computatio 1 2005) 49-60 A Faster Clause-Shorteig Algorithm for SAT with No Restrictio o Clause Legth Evgey Datsi Alexader Wolpert Departmet of Computer Sciece
CHAPTER 3 THE TIME VALUE OF MONEY
CHAPTER 3 THE TIME VALUE OF MONEY OVERVIEW A dollar i the had today is worth more tha a dollar to be received i the future because, if you had it ow, you could ivest that dollar ad ear iterest. Of all
Research Article Sign Data Derivative Recovery
Iteratioal Scholarly Research Network ISRN Applied Mathematics Volume 0, Article ID 63070, 7 pages doi:0.540/0/63070 Research Article Sig Data Derivative Recovery L. M. Housto, G. A. Glass, ad A. D. Dymikov
Entropy of bi-capacities
Etropy of bi-capacities Iva Kojadiovic LINA CNRS FRE 2729 Site école polytechique de l uiv. de Nates Rue Christia Pauc 44306 Nates, Frace [email protected] Jea-Luc Marichal Applied Mathematics
Chapter 7: Confidence Interval and Sample Size
Chapter 7: Cofidece Iterval ad Sample Size Learig Objectives Upo successful completio of Chapter 7, you will be able to: Fid the cofidece iterval for the mea, proportio, ad variace. Determie the miimum
Hypergeometric Distributions
7.4 Hypergeometric Distributios Whe choosig the startig lie-up for a game, a coach obviously has to choose a differet player for each positio. Similarly, whe a uio elects delegates for a covetio or you
Annuities Under Random Rates of Interest II By Abraham Zaks. Technion I.I.T. Haifa ISRAEL and Haifa University Haifa ISRAEL.
Auities Uder Radom Rates of Iterest II By Abraham Zas Techio I.I.T. Haifa ISRAEL ad Haifa Uiversity Haifa ISRAEL Departmet of Mathematics, Techio - Israel Istitute of Techology, 3000, Haifa, Israel I memory
THE TWO-VARIABLE LINEAR REGRESSION MODEL
THE TWO-VARIABLE LINEAR REGRESSION MODEL Herma J. Bieres Pesylvaia State Uiversity April 30, 202. Itroductio Suppose you are a ecoomics or busiess maor i a college close to the beach i the souther part
.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth
Questio 1: What is a ordiary auity? Let s look at a ordiary auity that is certai ad simple. By this, we mea a auity over a fixed term whose paymet period matches the iterest coversio period. Additioally,
Lecture 5: Span, linear independence, bases, and dimension
Lecture 5: Spa, liear idepedece, bases, ad dimesio Travis Schedler Thurs, Sep 23, 2010 (versio: 9/21 9:55 PM) 1 Motivatio Motivatio To uderstad what it meas that R has dimesio oe, R 2 dimesio 2, etc.;
Example 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here).
BEGINNING ALGEBRA Roots ad Radicals (revised summer, 00 Olso) Packet to Supplemet the Curret Textbook - Part Review of Square Roots & Irratioals (This portio ca be ay time before Part ad should mostly
1 The Gaussian channel
ECE 77 Lecture 0 The Gaussia chael Objective: I this lecture we will lear about commuicatio over a chael of practical iterest, i which the trasmitted sigal is subjected to additive white Gaussia oise.
Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals
Overview Estimatig the Value of a Parameter Usig Cofidece Itervals We apply the results about the sample mea the problem of estimatio Estimatio is the process of usig sample data estimate the value of
Concentration of Measure
Copyright c 2008 2010 Joh Lafferty, Ha Liu, ad Larry Wasserma Do Not Distribute Chapter 7 Cocetratio of Measure Ofte we wat to show that some radom quatity is close to its mea with high probability Results
1 Correlation and Regression Analysis
1 Correlatio ad Regressio Aalysis I this sectio we will be ivestigatig the relatioship betwee two cotiuous variable, such as height ad weight, the cocetratio of a ijected drug ad heart rate, or the cosumptio
Chapter 5: Inner Product Spaces
Chapter 5: Ier Product Spaces Chapter 5: Ier Product Spaces SECION A Itroductio to Ier Product Spaces By the ed of this sectio you will be able to uderstad what is meat by a ier product space give examples
Estimating Probability Distributions by Observing Betting Practices
5th Iteratioal Symposium o Imprecise Probability: Theories ad Applicatios, Prague, Czech Republic, 007 Estimatig Probability Distributios by Observig Bettig Practices Dr C Lych Natioal Uiversity of Irelad,
Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find
1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.
UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006
Exam format UC Bereley Departmet of Electrical Egieerig ad Computer Sciece EE 6: Probablity ad Radom Processes Solutios 9 Sprig 006 The secod midterm will be held o Wedesday May 7; CHECK the fial exam
Exploratory Data Analysis
1 Exploratory Data Aalysis Exploratory data aalysis is ofte the rst step i a statistical aalysis, for it helps uderstadig the mai features of the particular sample that a aalyst is usig. Itelliget descriptios
Systems Design Project: Indoor Location of Wireless Devices
Systems Desig Project: Idoor Locatio of Wireless Devices Prepared By: Bria Murphy Seior Systems Sciece ad Egieerig Washigto Uiversity i St. Louis Phoe: (805) 698-5295 Email: [email protected] Supervised
Quadrat Sampling in Population Ecology
Quadrat Samplig i Populatio Ecology Backgroud Estimatig the abudace of orgaisms. Ecology is ofte referred to as the "study of distributio ad abudace". This beig true, we would ofte like to kow how may
3. Greatest Common Divisor - Least Common Multiple
3 Greatest Commo Divisor - Least Commo Multiple Defiitio 31: The greatest commo divisor of two atural umbers a ad b is the largest atural umber c which divides both a ad b We deote the greatest commo gcd
Lesson 17 Pearson s Correlation Coefficient
Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) -types of data -scatter plots -measure of directio -measure of stregth Computatio -covariatio of X ad Y -uique variatio i X ad Y -measurig
BASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1)
BASIC STATISTICS. SAMPLES, RANDOM SAMPLING AND SAMPLE STATISTICS.. Radom Sample. The radom variables X,X 2,..., X are called a radom sample of size from the populatio f(x if X,X 2,..., X are mutually idepedet
arxiv:1506.03481v1 [stat.me] 10 Jun 2015
BEHAVIOUR OF ABC FOR BIG DATA By Wetao Li ad Paul Fearhead Lacaster Uiversity arxiv:1506.03481v1 [stat.me] 10 Ju 2015 May statistical applicatios ivolve models that it is difficult to evaluate the likelihood,
An Efficient Polynomial Approximation of the Normal Distribution Function & Its Inverse Function
A Efficiet Polyomial Approximatio of the Normal Distributio Fuctio & Its Iverse Fuctio Wisto A. Richards, 1 Robi Atoie, * 1 Asho Sahai, ad 3 M. Raghuadh Acharya 1 Departmet of Mathematics & Computer Sciece;
