Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3

ESAIM: Probability and Statistics URL: Will be set by the publisher

THEORY OF CLASSIFICATION: A SURVEY OF SOME RECENT ADVANCES

Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3

Abstract. The last few years have witnessed important new developments in the theory and practice of pattern classification. We intend to survey some of the main new ideas that have led to these recent results.

Résumé. The practice and theory of pattern recognition have seen important developments in recent years. This survey aims to present some of the new ideas that have led to these developments.

1991 Mathematics Subject Classification. 62G08, 60E15, 68Q32.

September 23, 2005.

Contents
1. Introduction
2. Basic model
3. Empirical risk minimization and Rademacher averages
4. Minimizing cost functions: some basic ideas behind boosting and support vector machines
  4.1. Margin-based performance bounds
  4.2. Convex cost functionals
5. Tighter bounds for empirical risk minimization
  5.1. Relative deviations
  5.2. Noise and fast rates
  5.3. Localization
  5.4. Cost functions
  5.5. Minimax lower bounds
6. PAC-Bayesian bounds
7. Stability
8. Model selection
  8.1. Oracle inequalities

Keywords and phrases: Pattern Recognition, Statistical Learning Theory, Concentration Inequalities, Empirical Processes, Model Selection.

The authors acknowledge support by the PASCAL Network of Excellence under EC grant no. The work of the third author was supported by the Spanish Ministry of Science and Technology and FEDER, grant BMF.

1 Laboratoire Probabilités et Modèles Aléatoires, CNRS & Université Paris VII, Paris, France; www.proba.jussieu.fr/~boucheron
2 Pertinence SA, 32 rue des Jeûneurs, Paris, France
3 Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, Barcelona, Spain; lugosi@upf.es

© EDP Sciences, SMAI 1999

  8.2. A glimpse at model selection methods
    Naive penalization
    Ideal penalties
    Localized Rademacher complexities
    Pre-testing
    Revisiting hold-out estimates
References

1. Introduction

The last few years have witnessed important new developments in the theory and practice of pattern classification. The introduction of new and effective techniques of handling high-dimensional problems such as boosting and support vector machines has revolutionized the practice of pattern recognition. At the same time, the better understanding of the application of empirical process theory and concentration inequalities has led to effective new ways of studying these methods and provided a statistical explanation for their success. These new tools have also helped develop new model selection methods that are at the heart of many classification algorithms.

The purpose of this survey is to offer an overview of some of these theoretical tools and give the main ideas of the analysis of some of the important algorithms. This survey does not attempt to be exhaustive. The selection of the topics is largely biased by the personal taste of the authors. We also limit ourselves to describing the key ideas in a simple way, often sacrificing generality. In these cases the reader is pointed to the references for the sharpest and more general results available. References and bibliographical remarks are given at the end of each section, in an attempt to avoid interruptions in the arguments.

2. Basic model

The problem of pattern classification is about guessing or predicting the unknown class of an observation. An observation is often a collection of numerical and/or categorical measurements represented by a $d$-dimensional vector $x$ but in some cases it may even be a curve or an image. In our model we simply assume that $x \in \mathcal{X}$ where $\mathcal{X}$ is some abstract measurable space equipped with a $\sigma$-algebra. The unknown nature of the observation is called a class. It is denoted by $y$ and in the simplest case takes values in the binary set $\{-1, 1\}$. In these notes we restrict our attention to binary classification. The
reason is simplicity and that the binary problem already captures many of the main features of more general problems. Even though there is much to say about multiclass classification, this survey does not cover this growing field of research.

In classification, one creates a function $g : \mathcal{X} \to \{-1,1\}$ which represents one's guess of $y$ given $x$. The mapping $g$ is called a classifier. The classifier errs on $x$ if $g(x) \neq y$. To formalize the learning problem, we introduce a probabilistic setting, and let $(X, Y)$ be an $\mathcal{X} \times \{-1,1\}$-valued random pair, modeling observation and its corresponding class. The distribution of the random pair $(X, Y)$ may be described by the probability distribution of $X$ (given by the probabilities $\mathbb{P}\{X \in A\}$ for all measurable subsets $A$ of $\mathcal{X}$) and $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$. The function $\eta$ is called the a posteriori probability. We measure the performance of classifier $g$ by its probability of error $L(g) = \mathbb{P}\{g(X) \neq Y\}$. Given $\eta$, one may easily construct a classifier with minimal probability of error. In particular, it is easy to see that if we define
\[
g^*(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2 \\ -1 & \text{otherwise} \end{cases}
\]

then $L(g^*) \le L(g)$ for any classifier $g$. The minimal risk $L^* \stackrel{\mathrm{def}}{=} L(g^*)$ is called the Bayes risk (or Bayes error). More precisely, it is immediate to see that
\[
L(g) - L^* = \mathbb{E}\left[ \mathbb{1}_{\{g(X) \neq g^*(X)\}} \, |2\eta(X) - 1| \right] \ge 0 \tag{1}
\]
(see, e.g., [72]). The optimal classifier $g^*$ is often called the Bayes classifier. In the statistical model we focus on, one has access to a collection of data $(X_i, Y_i)$, $1 \le i \le n$. We assume that the data $D_n$ consists of a sequence of independent identically distributed (i.i.d.) random pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ with the same distribution as that of $(X, Y)$. A classifier is constructed on the basis of $D_n = (X_1, Y_1, \ldots, X_n, Y_n)$ and is denoted by $g_n$. Thus, the value of $Y$ is guessed by $g_n(X) = g_n(X; X_1, Y_1, \ldots, X_n, Y_n)$. The performance of $g_n$ is measured by its (conditional) probability of error
\[
L(g_n) = \mathbb{P}\{g_n(X) \neq Y \mid D_n\}.
\]
The focus of the theory (and practice) of classification is to construct classifiers $g_n$ whose probability of error is as close to $L^*$ as possible. Obviously, the whole arsenal of traditional parametric and nonparametric statistics may be used to attack this problem. However, the high-dimensional nature of many of the new applications (such as image recognition, text classification, micro-biological applications, etc.) leads to territories beyond the reach of traditional methods. Most new advances of statistical learning theory aim to face these new challenges.

Bibliographical remarks. Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Fukunaga [97], Duda and Hart [77], Vapnik and Chervonenkis [233], Devijver and Kittler [70], Vapnik [229, 230], Breiman, Friedman, Olshen, and Stone [53], Natarajan [175], McLachlan [169], Anthony and Biggs [10], Kearns and Vazirani [117], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Kulkarni, Lugosi, and Venkatesh [128], Anthony and Bartlett [9], Duda, Hart, and Stork [78], Lugosi [144], and Mendelson [171].

3. Empirical risk minimization and Rademacher averages

A simple and natural approach to the
classification problem is to consider a class $\mathcal{C}$ of classifiers $g : \mathcal{X} \to \{-1,1\}$ and use data-based estimates of the probabilities of error $L(g)$ to select a classifier from the class. The most natural choice to estimate the probability of error $L(g) = \mathbb{P}\{g(X) \neq Y\}$ is the error count
\[
L_n(g) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{g(X_i) \neq Y_i\}}.
\]
$L_n(g)$ is called the empirical error of the classifier $g$.

First we outline the basics of the theory of empirical risk minimization (i.e., the classification analog of M-estimation). Denote by $g_n^*$ the classifier that minimizes the estimated probability of error over the class: $L_n(g_n^*) \le L_n(g)$ for all $g \in \mathcal{C}$. Then the probability of error $L(g_n^*) = \mathbb{P}\{g_n^*(X) \neq Y \mid D_n\}$ of the selected rule is easily seen to satisfy the elementary inequalities
\[
L(g_n^*) - \inf_{g \in \mathcal{C}} L(g) \le 2 \sup_{g \in \mathcal{C}} |L_n(g) - L(g)|, \qquad L(g_n^*) \le L_n(g_n^*) + \sup_{g \in \mathcal{C}} |L_n(g) - L(g)|. \tag{2}
\]
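As a toy illustration of empirical risk minimization over a finite class, the following sketch selects a one-dimensional threshold classifier by minimizing the error count $L_n(g)$. All data and the threshold class below are hypothetical choices for illustration, not taken from the text.

```python
import random

random.seed(0)
# Sketch: empirical risk minimization over a finite class of 1-D threshold
# classifiers g_t(x) = 1 if x > t else -1 (hypothetical example data).
n = 200
X = [random.random() for _ in range(n)]
Y = [1 if x > 0.3 else -1 for x in X]        # noiseless labels, true threshold 0.3

thresholds = [i / 20 for i in range(21)]      # the finite class C

def empirical_error(t):
    # L_n(g_t) = (1/n) * #{i : g_t(X_i) != Y_i}
    return sum(1 for x, y in zip(X, Y) if (1 if x > t else -1) != y) / n

t_hat = min(thresholds, key=empirical_error)  # the empirical risk minimizer
```

Since the true threshold $0.3$ lies on the grid, the minimizer attains zero empirical error here; with label noise $L_n(g_n^*)$ would be positive, and inequality (2) controls how far $L(g_n^*)$ can then be from the best risk in the class.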

We see that by guaranteeing that the uniform deviation $\sup_{g \in \mathcal{C}} |L_n(g) - L(g)|$ of estimated probabilities from their true values is small, we make sure that the probability of error of the selected classifier $g_n^*$ is not much larger than the best probability of error in the class $\mathcal{C}$ and at the same time the empirical estimate $L_n(g_n^*)$ is also good. It is important to note at this point that bounding the excess risk by the maximal deviation as in (2) is quite loose in many situations. In Section 5 we survey some ways of obtaining improved bounds. On the other hand, the simple inequality above offers a convenient way of understanding some of the basic principles and it is even sharp in a certain minimax sense, see Section 5.5.

Clearly, the random variable $n L_n(g)$ is binomially distributed with parameters $n$ and $L(g)$. Thus, to obtain bounds for the success of empirical error minimization, we need to study uniform deviations of binomial random variables from their means. We formulate the problem in a somewhat more general way as follows. Let $X_1, \ldots, X_n$ be independent, identically distributed random variables taking values in some set $\mathcal{X}$ and let $\mathcal{F}$ be a class of bounded functions $\mathcal{X} \to [-1, 1]$. Denoting expectation and empirical averages by $Pf = \mathbb{E} f(X_1)$ and $P_n f = (1/n) \sum_{i=1}^n f(X_i)$, we are interested in upper bounds for the maximal deviation
\[
\sup_{f \in \mathcal{F}} (Pf - P_n f).
\]
Concentration inequalities are among the basic tools in studying such deviations. The simplest, yet quite powerful exponential concentration inequality is the bounded differences inequality.

Theorem 3.1 (bounded differences inequality). Let $g : \mathcal{X}^n \to \mathbb{R}$ be a function of $n$ variables such that for some nonnegative constants $c_1, \ldots, c_n$,
\[
\sup_{x_1, \ldots, x_n, \, x_i' \in \mathcal{X}} \left| g(x_1, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \le c_i, \qquad 1 \le i \le n.
\]
Let $X_1, \ldots, X_n$ be independent random variables. Then the random variable $Z = g(X_1, \ldots, X_n)$ satisfies
\[
\mathbb{P}\{|Z - \mathbb{E} Z| > t\} \le 2 e^{-2t^2 / C}
\]
where $C = \sum_{i=1}^n c_i^2$.

The bounded differences assumption means that if the $i$-th variable of $g$ is changed while keeping all the others fixed, the value of the function cannot change by more than $c_i$. Our main example for such a function is $Z = \sup_{f \in \mathcal{F}} (Pf - P_n f)$.
Obviously, $Z$ satisfies the bounded differences assumption with $c_i = 2/n$ and therefore, for any $\delta \in (0,1)$, with probability at least $1 - \delta$,
\[
\sup_{f \in \mathcal{F}} (Pf - P_n f) \le \mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}. \tag{3}
\]
This concentration result allows us to focus on the expected value, which can be bounded conveniently by a simple symmetrization device. Introduce a ghost sample $X_1', \ldots, X_n'$, independent of the $X_i$ and distributed identically. If $P_n' f = (1/n) \sum_{i=1}^n f(X_i')$ denotes the empirical averages measured on the ghost sample, then by Jensen's inequality,
\[
\mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) = \mathbb{E} \sup_{f \in \mathcal{F}} \mathbb{E}\left[ P_n' f - P_n f \mid X_1, \ldots, X_n \right] \le \mathbb{E} \sup_{f \in \mathcal{F}} (P_n' f - P_n f).
\]
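A quick simulation (with parameters chosen arbitrarily for illustration) can sanity-check the bounded differences tail bound in its simplest instance, $Z = \frac{1}{n}\sum_i X_i$ with $X_i \in [0,1]$, where each $c_i = 1/n$ and hence $C = 1/n$:

```python
import random
from math import exp

random.seed(1)
# For Z = mean of n iid uniform[0,1] variables, each c_i = 1/n, so C = 1/n and
# the bounded differences inequality gives P{|Z - EZ| > t} <= 2*exp(-2*n*t^2).
n, t, reps = 100, 0.1, 5000
tail = sum(
    abs(sum(random.random() for _ in range(n)) / n - 0.5) > t
    for _ in range(reps)
) / reps
bound = 2 * exp(-2 * n * t ** 2)  # here 2*e^{-2}, about 0.27
```

The empirical tail frequency sits far below the bound, as expected: the bounded differences inequality is distribution-free and therefore not tight for this particular choice.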

Let now $\sigma_1, \ldots, \sigma_n$ be independent (Rademacher) random variables with $\mathbb{P}\{\sigma_i = 1\} = \mathbb{P}\{\sigma_i = -1\} = 1/2$, independent of the $X_i$ and $X_i'$. Then
\[
\mathbb{E} \sup_{f \in \mathcal{F}} (P_n' f - P_n f) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \left( f(X_i') - f(X_i) \right) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i \left( f(X_i') - f(X_i) \right) \le 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i).
\]
Let $A \subset \mathbb{R}^n$ be a bounded set of vectors $a = (a_1, \ldots, a_n)$, and introduce the quantity
\[
R_n(A) = \mathbb{E} \sup_{a \in A} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i.
\]
$R_n(A)$ is called the Rademacher average associated with $A$. For a given sequence $x_1, \ldots, x_n \in \mathcal{X}$, we write $\mathcal{F}(x_1^n)$ for the class of $n$-vectors $(f(x_1), \ldots, f(x_n))$ with $f \in \mathcal{F}$. Thus, using this notation, we have deduced the following.

Theorem 3.2. With probability at least $1 - \delta$,
\[
\sup_{f \in \mathcal{F}} (Pf - P_n f) \le 2\, \mathbb{E} R_n(\mathcal{F}(X_1^n)) + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}.
\]
We also have
\[
\sup_{f \in \mathcal{F}} (Pf - P_n f) \le 2 R_n(\mathcal{F}(X_1^n)) + \sqrt{\frac{2 \log \frac{2}{\delta}}{n}}.
\]

The second statement follows simply by noticing that the random variable $R_n(\mathcal{F}(X_1^n))$ satisfies the conditions of the bounded differences inequality. The second inequality is our first data-dependent performance bound. It involves the Rademacher average of the coordinate projection of $\mathcal{F}$ given by the data $X_1, \ldots, X_n$. Given the data, one may compute the Rademacher average, for example, by Monte Carlo integration. Note that for a given choice of the random signs $\sigma_1, \ldots, \sigma_n$, the computation of $\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)$ is equivalent to minimizing $-\sum_{i=1}^n \sigma_i f(X_i)$ over $f \in \mathcal{F}$ and therefore it is computationally equivalent to empirical risk minimization. $R_n(\mathcal{F}(X_1^n))$ measures the richness of the class $\mathcal{F}$ and provides a sharp estimate for the maximal deviations. In fact, one may prove that
\[
\frac{1}{2}\, \mathbb{E} R_n(\mathcal{F}(X_1^n)) - \frac{1}{2\sqrt{n}} \le \mathbb{E} \sup_{f \in \mathcal{F}} |Pf - P_n f| \le 2\, \mathbb{E} R_n(\mathcal{F}(X_1^n))
\]
(see, e.g., van der Vaart and Wellner [227]).

Next we recall some of the simple structural properties of Rademacher averages.

Theorem 3.3 (properties of Rademacher averages). Let $A, B$ be bounded subsets of $\mathbb{R}^n$ and let $c \in \mathbb{R}$ be a constant. Then
\[
R_n(A \cup B) \le R_n(A) + R_n(B), \qquad R_n(c \cdot A) = |c|\, R_n(A), \qquad R_n(A \oplus B) \le R_n(A) + R_n(B)
\]

where $c \cdot A = \{ca : a \in A\}$ and $A \oplus B = \{a + b : a \in A, b \in B\}$. Moreover, if $A = \{a^{(1)}, \ldots, a^{(N)}\} \subset \mathbb{R}^n$ is a finite set, then
\[
R_n(A) \le \max_{j=1,\ldots,N} \|a^{(j)}\| \, \frac{\sqrt{2 \log N}}{n} \tag{4}
\]
where $\|\cdot\|$ denotes the Euclidean norm. If
\[
\mathrm{absconv}(A) = \left\{ \sum_{j=1}^N c_j a^{(j)} : N \in \mathbb{N}, \ \sum_{j=1}^N |c_j| \le 1, \ a^{(j)} \in A \right\}
\]
is the absolute convex hull of $A$, then
\[
R_n(A) = R_n(\mathrm{absconv}(A)). \tag{5}
\]
Finally, the contraction principle states that if $\phi : \mathbb{R} \to \mathbb{R}$ is a function with $\phi(0) = 0$ and Lipschitz constant $L_\phi$, and $\phi \circ A$ is the set of vectors of the form $(\phi(a_1), \ldots, \phi(a_n)) \in \mathbb{R}^n$ with $a \in A$, then $R_n(\phi \circ A) \le L_\phi R_n(A)$.

Proof. The first three properties are immediate from the definition. Inequality (4) follows by Hoeffding's inequality, which states that if $X$ is a bounded zero-mean random variable taking values in an interval $[\alpha, \beta]$, then for any $s > 0$, $\mathbb{E} \exp(sX) \le \exp\left( s^2 (\beta - \alpha)^2 / 8 \right)$. In particular, by independence,
\[
\mathbb{E} \exp\left( \frac{s}{n} \sum_{i=1}^n \sigma_i a_i \right) = \prod_{i=1}^n \mathbb{E} \exp\left( \frac{s}{n} \sigma_i a_i \right) \le \prod_{i=1}^n \exp\left( \frac{s^2 a_i^2}{2n^2} \right) = \exp\left( \frac{s^2 \|a\|^2}{2n^2} \right).
\]
This implies that
\[
e^{s R_n(A)} = e^{s\, \mathbb{E} \max_{j} \frac{1}{n} \sum_i \sigma_i a_i^{(j)}} \le \mathbb{E} \max_{j=1,\ldots,N} e^{\frac{s}{n} \sum_i \sigma_i a_i^{(j)}} \le \sum_{j=1}^N \mathbb{E}\, e^{\frac{s}{n} \sum_i \sigma_i a_i^{(j)}} \le N \max_{j=1,\ldots,N} \exp\left( \frac{s^2 \|a^{(j)}\|^2}{2n^2} \right).
\]
Taking the logarithm of both sides, dividing by $s$, and choosing $s$ to minimize the obtained upper bound for $R_n(A)$, we arrive at (4). The identity (5) is easily seen from the definition. For a proof of the contraction principle, see Ledoux and Talagrand [133].

Often it is useful to derive further upper bounds on Rademacher averages. As an illustration, we consider the case when $\mathcal{F}$ is a class of indicator functions. Recall that this is the case in our motivating example in the classification problem described above when each $f \in \mathcal{F}$ is the indicator function of a set of the form $\{(x, y) : g(x) \neq y\}$. In such a case, for any collection of points $x_1^n = (x_1, \ldots, x_n)$, $\mathcal{F}(x_1^n)$ is a finite subset of $\mathbb{R}^n$ whose cardinality is denoted by $S_{\mathcal{F}}(x_1^n)$ and is called the VC shatter coefficient (where VC stands for Vapnik-Chervonenkis). Obviously, $S_{\mathcal{F}}(x_1^n) \le 2^n$. By inequality (4), we have, for all $x_1^n$,
\[
R_n(\mathcal{F}(x_1^n)) \le \sqrt{\frac{2 \log S_{\mathcal{F}}(x_1^n)}{n}} \tag{6}
\]
where we used the fact that for each $f \in \mathcal{F}$, $\sum_i f(x_i)^2 \le n$. In
particular,
\[
\mathbb{E} \sup_{f \in \mathcal{F}} |Pf - P_n f| \le 2\, \mathbb{E} \sqrt{\frac{2 \log S_{\mathcal{F}}(X_1^n)}{n}}.
\]
The logarithm of the VC shatter coefficient may be upper bounded in terms of a combinatorial quantity, called the VC dimension. If $A \subset \{-1, 1\}^n$, then the VC dimension of $A$ is the size $V$ of the largest set of indices

$\{i_1, \ldots, i_V\} \subset \{1, \ldots, n\}$ such that for each binary $V$-vector $b = (b_1, \ldots, b_V) \in \{-1, 1\}^V$ there exists an $a = (a_1, \ldots, a_n) \in A$ such that $(a_{i_1}, \ldots, a_{i_V}) = b$. The key inequality establishing a relationship between shatter coefficients and VC dimension is known as Sauer's lemma, which states that the cardinality of any set $A \subset \{-1, 1\}^n$ may be upper bounded as
\[
|A| \le \sum_{i=0}^V \binom{n}{i} \le (n+1)^V
\]
where $V$ is the VC dimension of $A$. In particular,
\[
\log S_{\mathcal{F}}(x_1^n) \le V(x_1^n) \log(n+1)
\]
where we denote by $V(x_1^n)$ the VC dimension of $\mathcal{F}(x_1^n)$. Thus, the expected maximal deviation $\mathbb{E} \sup_{f \in \mathcal{F}} |Pf - P_n f|$ may be upper bounded by $2\, \mathbb{E} \sqrt{2 V(X_1^n) \log(n+1)/n}$. To obtain distribution-free upper bounds, introduce the VC dimension of a class of binary functions $\mathcal{F}$, defined by
\[
V = \sup_{n,\, x_1^n} V(x_1^n).
\]
Then we obtain the following version of what has been known as the Vapnik-Chervonenkis inequality.

Theorem 3.4 (Vapnik-Chervonenkis inequality). For all distributions one has
\[
\mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) \le 2 \sqrt{\frac{2 V \log(n+1)}{n}}.
\]
Also, for a universal constant $C$,
\[
\mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) \le C \sqrt{\frac{V}{n}}.
\]

The second inequality, which allows one to remove the logarithmic factor, follows from a somewhat refined analysis (called chaining). The VC dimension is an important combinatorial parameter of the class and many of its properties are well known. Here we just recall one useful result and refer the reader to the references for further study: let $\mathcal{G}$ be an $m$-dimensional vector space of real-valued functions defined on $\mathcal{X}$. Then the class of indicator functions
\[
\mathcal{F} = \left\{ f(x) = \mathbb{1}_{\{g(x) \ge 0\}} : g \in \mathcal{G} \right\}
\]
has VC dimension $V \le m$.

Bibliographical remarks. Uniform deviations of averages from their expectations is one of the central problems of empirical process theory. Here we merely refer to some of the comprehensive coverages, such as Shorack and Wellner [199], Giné [98], van der Vaart and Wellner [227], Vapnik [231], Dudley [83]. The use of empirical processes in classification was pioneered by Vapnik and Chervonenkis [232, 233] and re-discovered 20 years later by Blumer, Ehrenfeucht, Haussler, and Warmuth [41], Ehrenfeucht, Haussler, Kearns, and Valiant [88]. For surveys see Natarajan [175], Devroye
[71], Anthony and Biggs [10], Kearns and Vazirani [117], Vapnik [230, 231], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Anthony and Bartlett [9]. The bounded differences inequality was first formulated explicitly by McDiarmid [166] (see also the survey [167]). The martingale methods used by McDiarmid had appeared in early work of Hoeffding [109], Azuma [18], Yurinskii [242, 243], Milman and Schechtman [174]. Closely related concentration results have been obtained in various ways including information-theoretic methods (see Ahlswede, Gács, and Körner [1], Marton [154],

[155], [156], Dembo [69], Massart [158] and Rio [183]), Talagrand's induction method [217], [213], [216] (see also McDiarmid [168], Luczak and McDiarmid [143], Panchenko [ ]) and the so-called entropy method, based on logarithmic Sobolev inequalities, developed by Ledoux [132], [131], see also Bobkov and Ledoux [42], Massart [159], Rio [183], Boucheron, Lugosi, and Massart [45, 46], Bousquet [47], and Boucheron, Bousquet, Lugosi, and Massart [44]. Symmetrization was at the basis of the original arguments of Vapnik and Chervonenkis [232, 233]. We learnt the simple symmetrization trick shown above from Giné and Zinn [99] but different forms of symmetrization have been at the core of obtaining related results of similar flavor, see also Anthony and Shawe-Taylor [11], Cannon, Ettinger, Hush, and Scovel [55], Herbrich and Williamson [108], Mendelson and Philips [172]. The use of Rademacher averages in classification was first promoted by Koltchinskii [124] and Bartlett, Boucheron, and Lugosi [24], see also Koltchinskii and Panchenko [126, 127], Bartlett and Mendelson [29], Bartlett, Bousquet, and Mendelson [25], Bousquet, Koltchinskii, and Panchenko [50], Kégl, Linder, and Lugosi [13], Mendelson [170]. Hoeffding's inequality appears in [109]. For a proof of the contraction principle we refer to Ledoux and Talagrand [133]. Sauer's lemma was proved independently by Sauer [189], Shelah [198], and Vapnik and Chervonenkis [232]. For related combinatorial results we refer to Frankl [90], Haussler [106], Alesker [7], Alon, Ben-David, Cesa-Bianchi, and Haussler [8], Szarek and Talagrand [210], Cesa-Bianchi and Haussler [60], Mendelson and Vershynin [173], [188]. The second inequality of Theorem 3.4 is based on the method of chaining, and was first proved by Dudley [81]. The question of how $\sup_{f \in \mathcal{F}} (Pf - P_n f)$ behaves has been known as the Glivenko-Cantelli problem and much has been said about it. A few key references include Vapnik and Chervonenkis [232, 234], Dudley [79, 81, 82], Talagrand [211, 212, 214, 218], Dudley, Giné, and Zinn [84], Alon, Ben-David, Cesa-Bianchi, and Haussler
[8], Li, Long, and Srinivasan [138], Mendelson and Vershynin [173]. The VC dimension has been widely studied and many of its properties are known. We refer to Cover [63], Dudley [80, 83], Steele [204], Wenocur and Dudley [238], Assouad [15], Khovanskii [118], Macintyre and Sontag [149], Goldberg and Jerrum [101], Karpinski and Macintyre [114], Koiran and Sontag [121], Anthony and Bartlett [9], and Bartlett and Maass [28].

4. Minimizing cost functions: some basic ideas behind boosting and support vector machines

The results summarized in the previous section reveal that minimizing the empirical risk $L_n(g)$ over a class $\mathcal{C}$ of classifiers with a VC dimension much smaller than the sample size $n$ is guaranteed to work well. This result has two fundamental problems. First, by requiring that the VC dimension be small, one imposes serious limitations on the approximation properties of the class. In particular, even though the probability of error $L(g_n^*)$ of the empirical risk minimizer is close to the smallest probability of error $\inf_{g \in \mathcal{C}} L(g)$ in the class, $\inf_{g \in \mathcal{C}} L(g) - L^*$ may be very large. The other problem is algorithmic: minimizing the empirical probability of misclassification $L_n(g)$ is very often a computationally difficult problem. Even in seemingly simple cases, for example when $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{C}$ is the class of classifiers that split the space of observations by a hyperplane, the minimization problem is NP-hard.

The computational difficulty of learning problems deserves some more attention. Let us consider in more detail the problem in the case of half-spaces. Formally, we are given a sample, that is a sequence of vectors $(x_1, \ldots, x_n)$ from $\mathbb{R}^d$ and a sequence of labels $(y_1, \ldots, y_n)$ from $\{-1, 1\}$, and in order to minimize the empirical misclassification risk we are asked to find $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ so as to minimize
\[
\# \left\{ k : y_k \left( \langle w, x_k \rangle - b \right) \le 0 \right\}.
\]
Without loss of generality, the vectors constituting the sample are assumed to have rational coefficients, and the size of the data is the sum of the bit lengths of the vectors making the sample. Not only minimizing the number

of misclassification errors has been proved to be at least as hard as solving any NP-complete problem, but even approximately minimizing the number of misclassification errors within a constant factor of the optimum has been shown to be NP-hard. This means that, unless P = NP, we will not be able to build a computationally efficient empirical risk minimizer for half-spaces that works for all input space dimensions. If the input space dimension $d$ is fixed, an algorithm running in $O(n^{d-1} \log n)$ steps enumerates the trace of half-spaces on a sample of length $n$. This allows an exhaustive search for the empirical risk minimizer. Such a possibility should be considered with circumspection since its range of applications would not extend much beyond problems where the input dimension is less than 5.

4.1. Margin-based performance bounds

An attempt to solve both of these problems is to modify the empirical functional to be minimized by introducing a cost function. Next we describe the main ideas of empirical minimization of cost functionals and its analysis. We consider classifiers of the form
\[
g_f(x) = \begin{cases} 1 & \text{if } f(x) \ge 0 \\ -1 & \text{otherwise} \end{cases}
\]
where $f : \mathcal{X} \to \mathbb{R}$ is a real-valued function. In such a case the probability of error of $g_f$ may be written as
\[
L(g_f) = \mathbb{P}\{\operatorname{sgn}(f(X)) \neq Y\} = \mathbb{E}\, \mathbb{1}_{\{f(X) Y < 0\}}.
\]
To lighten notation we will simply write $L(f) = L(g_f)$. Let $\phi : \mathbb{R} \to \mathbb{R}_+$ be a nonnegative cost function such that $\phi(x) \ge \mathbb{1}_{\{x > 0\}}$. (Typical choices of $\phi$ include $\phi(x) = e^x$, $\phi(x) = \log_2(1 + e^x)$, and $\phi(x) = (1 + x)_+$.) Introduce the cost functional and its empirical version by
\[
A(f) = \mathbb{E}\, \phi(-f(X) Y) \quad \text{and} \quad A_n(f) = \frac{1}{n} \sum_{i=1}^n \phi(-f(X_i) Y_i).
\]
Obviously, $L(f) \le A(f)$ and $L_n(f) \le A_n(f)$.

Theorem 4.1. Assume that the function $f_n$ is chosen from a class $\mathcal{F}$ based on the data $(Z_1, \ldots, Z_n) \stackrel{\mathrm{def}}{=} (X_1, Y_1), \ldots, (X_n, Y_n)$. Let $B$ denote a uniform upper bound on $\phi(-f(x) y)$ and let $L_\phi$ be the Lipschitz constant of $\phi$. Then the probability of error of the corresponding classifier may be bounded, with probability at least $1 - \delta$, by
\[
L(f_n) \le A_n(f_n) + 2 L_\phi\, \mathbb{E} R_n(\mathcal{F}(X_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}.
\]
Thus, the Rademacher average of the class of real-valued functions $f$ bounds the
performance of the classifier.
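The three typical cost functions listed above each dominate the 0-1 indicator, which is exactly what makes $L(f) \le A(f)$ work. A quick numeric check over an arbitrarily chosen grid:

```python
from math import exp, log2

# The typical surrogate costs from the text, each satisfying phi(x) >= 1_{x>0}
# (and phi(0) = 1 for all three).
costs = {
    "exponential": lambda x: exp(x),
    "logit": lambda x: log2(1 + exp(x)),
    "hinge": lambda x: max(0.0, 1 + x),
}
grid = [i / 10 for i in range(-50, 51)]
dominates = all(
    phi(x) >= (1.0 if x > 0 else 0.0) for phi in costs.values() for x in grid
)
```

Since $\phi(-f(X_i)Y_i) \ge \mathbb{1}_{\{-f(X_i)Y_i > 0\}}$ pointwise, averaging gives $L_n(f) \le A_n(f)$, and taking expectations gives $L(f) \le A(f)$.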

Proof. The proof is similar to the argument of the previous section:
\[
L(f_n) \le A(f_n) \le A_n(f_n) + \sup_{f \in \mathcal{F}} \left( A(f) - A_n(f) \right) \le A_n(f_n) + 2\, \mathbb{E} R_n(\phi \circ \mathcal{H}(Z_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}
\]
(where $\mathcal{H}$ is the class of functions $\mathcal{X} \times \{-1,1\} \to \mathbb{R}$ of the form $-f(x) y$, $f \in \mathcal{F}$)
\[
\le A_n(f_n) + 2 L_\phi\, \mathbb{E} R_n(\mathcal{H}(Z_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}
\]
(by the contraction principle of Theorem 3.3)
\[
= A_n(f_n) + 2 L_\phi\, \mathbb{E} R_n(\mathcal{F}(X_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}.
\]

4.1.1. Weighted voting schemes

In many applications such as boosting and bagging, classifiers are combined by weighted voting schemes, which means that the classification rule is obtained by means of functions $f$ from a class
\[
\mathcal{F}_\lambda = \left\{ f(x) = \sum_{j=1}^N c_j g_j(x) : N \in \mathbb{N}, \ \sum_{j=1}^N |c_j| \le \lambda, \ g_1, \ldots, g_N \in \mathcal{C} \right\} \tag{7}
\]
where $\mathcal{C}$ is a class of base classifiers, that is, functions defined on $\mathcal{X}$, taking values in $\{-1, 1\}$. A classifier of this form may be thought of as one that, upon observing $x$, takes a weighted vote of the classifiers $g_1, \ldots, g_N$ (using the weights $c_1, \ldots, c_N$) and decides according to the weighted majority. In this case, by (5) and (6) we have
\[
R_n(\mathcal{F}_\lambda(X_1^n)) \le \lambda R_n(\mathcal{C}(X_1^n)) \le \lambda \sqrt{\frac{2 V_{\mathcal{C}} \log(n+1)}{n}}
\]
where $V_{\mathcal{C}}$ is the VC dimension of the base class.

To understand the richness of classes formed by weighted averages of classifiers from a base class, just consider the simple one-dimensional example in which the base class $\mathcal{C}$ contains all classifiers of the form $g(x) = 2\, \mathbb{1}_{\{x \ge a\}} - 1$, $a \in \mathbb{R}$. Then $V_{\mathcal{C}} = 1$ and the closure of $\mathcal{F}_\lambda$ (under the $L_\infty$ norm) is the set of all functions of total variation bounded by $2\lambda$. Thus, $\mathcal{F}_\lambda$ is rich in the sense that any classifier may be approximated by classifiers associated with the functions in $\mathcal{F}_\lambda$. In particular, the VC dimension of the class of all classifiers induced by functions in $\mathcal{F}_\lambda$ is infinite. For such large classes of classifiers it is impossible to guarantee that $L(f_n)$ exceeds the minimal risk in the class by something of the order of $n^{-1/2}$ (see Section 5.5). However, $L(f_n)$ may be made as small as the minimum of the cost functional $A(f)$ over the class plus $O(n^{-1/2})$. Summarizing, we have obtained that if $\mathcal{F}_\lambda$ is of the form indicated above, then for any function $f_n$ chosen from $\mathcal{F}_\lambda$ in a data-based manner,
the probability of error of the associated classifier satisfies, with probability at least $1 - \delta$,
\[
L(f_n) \le A_n(f_n) + 2 L_\phi \lambda \sqrt{\frac{2 V_{\mathcal{C}} \log(n+1)}{n}} + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}. \tag{8}
\]
The remarkable fact about this inequality is that the upper bound only involves the VC dimension of the class $\mathcal{C}$ of base classifiers, which is typically small. The price we pay is that the first term on the right-hand side is

the empirical cost functional instead of the empirical probability of error. As a first illustration, consider the example when $\gamma$ is a fixed positive parameter and
\[
\phi(x) = \begin{cases} 0 & \text{if } x \le -\gamma \\ 1 & \text{if } x > 0 \\ 1 + x/\gamma & \text{otherwise.} \end{cases}
\]
In this case $B = 1$ and $L_\phi = 1/\gamma$. Notice also that $\mathbb{1}_{\{x > 0\}} \le \phi(x) \le \mathbb{1}_{\{x > -\gamma\}}$ and therefore $A_n(f) \le L_n^\gamma(f)$ where $L_n^\gamma(f)$ is the so-called margin error defined by
\[
L_n^\gamma(f) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{f(X_i) Y_i < \gamma\}}.
\]
Notice that for all $\gamma > 0$, $L_n^\gamma(f) \ge L_n(f)$ and that $L_n^\gamma(f)$ is increasing in $\gamma$. An interpretation of the margin error $L_n^\gamma(f)$ is that it counts, apart from the number of misclassified pairs $(X_i, Y_i)$, also those which are well classified but only with a small confidence (or margin) by $f$. Thus, (8) implies the following margin-based bound for the risk.

Corollary 4.2. For any $\gamma > 0$, with probability at least $1 - \delta$,
\[
L(f_n) \le L_n^\gamma(f_n) + \frac{2\lambda}{\gamma} \sqrt{\frac{2 V_{\mathcal{C}} \log(n+1)}{n}} + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}. \tag{9}
\]

Notice that, as $\gamma$ grows, the first term of the sum increases, while the second decreases. The bound can be very useful whenever a classifier has a small margin error for a relatively large $\gamma$ (i.e., if the classifier classifies the training data well with high confidence) since the second term only depends on the VC dimension of the small base class $\mathcal{C}$. This result has been used to explain the good behavior of some voting methods such as AdaBoost, since these methods have a tendency to find classifiers that classify the data points well with a large margin.

4.1.2. Kernel methods

Another popular way to obtain classification rules from a class of real-valued functions, used in kernel methods such as Support Vector Machines (SVM) or Kernel Fisher Discriminant (KFD), is to consider balls of a reproducing kernel Hilbert space. The basic idea is to use a positive definite kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, that is, a symmetric function satisfying
\[
\sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \ge 0
\]
for all choices of $n$, $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ and $x_1, \ldots, x_n \in \mathcal{X}$. Such a function naturally generates a space of functions of the form
\[
\mathcal{F} = \left\{ f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N}, \ \alpha_i \in \mathbb{R}, \ x_i \in \mathcal{X} \right\}
\]
which, with the inner product
\[
\left\langle \sum_i \alpha_i k(x_i, \cdot), \sum_j \beta_j k(x_j, \cdot) \right\rangle \stackrel{\mathrm{def}}{=} \sum_{i,j} \alpha_i \beta_j k(x_i, x_j),
\]
can be completed into a Hilbert space. The key property is that for all $x_1, x_2 \in \mathcal{X}$ there exist elements $f_{x_1}, f_{x_2} \in \mathcal{F}$ such that $k(x_1, x_2) = \langle f_{x_1}, f_{x_2} \rangle$. This means that any linear algorithm based on computing inner products can be extended into a non-linear version by replacing the inner products by a kernel function. The advantage is that even though the algorithm remains of low complexity, it works in a class of functions that can potentially represent any continuous function arbitrarily well (provided $k$ is chosen appropriately).
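The positive definiteness required of $k$ can be spot-checked numerically. Here the Gaussian kernel $k(x, z) = e^{-(x-z)^2}$ serves as a standard example (it is a choice made for this sketch, not one fixed by the text):

```python
import random
from math import exp

random.seed(2)
# Spot-check positive definiteness of the Gaussian kernel:
# sum_{i,j} a_i a_j k(x_i, x_j) >= 0 for random points and coefficients.
k = lambda x, z: exp(-(x - z) ** 2)
worst = float("inf")
for _ in range(200):
    xs = [random.uniform(-3.0, 3.0) for _ in range(8)]
    a = [random.uniform(-1.0, 1.0) for _ in range(8)]
    q = sum(a[i] * a[j] * k(xs[i], xs[j]) for i in range(8) for j in range(8))
    worst = min(worst, q)
```

Such a check can of course only refute positive definiteness, never prove it; for the Gaussian kernel positive definiteness is a classical fact.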

Algorithms working with kernels usually perform minimization of a cost functional on a ball of the associated reproducing kernel Hilbert space of the form
\[
\mathcal{F}_\lambda = \left\{ f(x) = \sum_{j=1}^N c_j k(x_j, x) : N \in \mathbb{N}, \ \sum_{i,j=1}^N c_i c_j k(x_i, x_j) \le \lambda^2, \ x_1, \ldots, x_N \in \mathcal{X} \right\}. \tag{10}
\]
Notice that, in contrast with (7) where the constraint is of $\ell_1$ type, the constraint here is of $\ell_2$ type. Also, the basis functions, instead of being chosen from a fixed class, are determined by elements of $\mathcal{X}$ themselves. An important property of functions in the reproducing kernel Hilbert space associated with $k$ is that for all $x \in \mathcal{X}$,
\[
f(x) = \langle f, k(x, \cdot) \rangle.
\]
This is called the reproducing property. The reproducing property may be used to estimate precisely the Rademacher average of $\mathcal{F}_\lambda$. Indeed, denoting by $\mathbb{E}_\sigma$ expectation with respect to the Rademacher variables $\sigma_1, \ldots, \sigma_n$, we have
\[
R_n(\mathcal{F}_\lambda(X_1^n)) = \frac{1}{n}\, \mathbb{E}_\sigma \sup_{\|f\| \le \lambda} \sum_{i=1}^n \sigma_i f(X_i) = \frac{1}{n}\, \mathbb{E}_\sigma \sup_{\|f\| \le \lambda} \left\langle f, \sum_{i=1}^n \sigma_i k(X_i, \cdot) \right\rangle = \frac{\lambda}{n}\, \mathbb{E}_\sigma \left\| \sum_{i=1}^n \sigma_i k(X_i, \cdot) \right\|
\]
by the Cauchy-Schwarz inequality, where $\|\cdot\|$ denotes the norm in the reproducing kernel Hilbert space. The Kahane-Khinchine inequality states that for any vectors $a_1, \ldots, a_n$ in a Hilbert space,
\[
\frac{1}{2}\, \mathbb{E} \left\| \sum_{i=1}^n \sigma_i a_i \right\|^2 \le \left( \mathbb{E} \left\| \sum_{i=1}^n \sigma_i a_i \right\| \right)^2 \le \mathbb{E} \left\| \sum_{i=1}^n \sigma_i a_i \right\|^2.
\]
It is also easy to see that
\[
\mathbb{E} \left\| \sum_{i=1}^n \sigma_i a_i \right\|^2 = \mathbb{E} \sum_{i,j=1}^n \sigma_i \sigma_j \langle a_i, a_j \rangle = \sum_{i=1}^n \|a_i\|^2,
\]
so we obtain
\[
\frac{\lambda}{\sqrt{2}\, n} \sqrt{\sum_{i=1}^n k(X_i, X_i)} \le R_n(\mathcal{F}_\lambda(X_1^n)) \le \frac{\lambda}{n} \sqrt{\sum_{i=1}^n k(X_i, X_i)}.
\]
This is very nice as it gives a bound that can be computed very easily from the data. A reasoning similar to the one leading to (9), using the bounded differences inequality to replace the Rademacher average by its empirical version, gives the following.

Corollary 4.3. Let $f_n$ be any function chosen from the ball $\mathcal{F}_\lambda$. Then, with probability at least $1 - \delta$,
\[
L(f_n) \le L_n^\gamma(f_n) + \frac{2\lambda}{\gamma n} \sqrt{\sum_{i=1}^n k(X_i, X_i)} + \sqrt{\frac{2 \log \frac{2}{\delta}}{n}}.
\]
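The data-dependent quantity $\frac{\lambda}{n}\sqrt{\sum_i k(X_i, X_i)}$ is indeed trivial to compute. For any kernel with $k(x,x) = 1$, such as the Gaussian kernel used in this sketch (an arbitrary choice), it collapses to $\lambda/\sqrt{n}$:

```python
import random
from math import exp, sqrt

random.seed(3)
# Data-dependent upper bound on R_n(F_lambda(X_1^n)) from the text:
# (lambda/n) * sqrt(sum_i k(X_i, X_i)).
# For a kernel with k(x, x) = 1, e.g. Gaussian, this equals lambda/sqrt(n).
k = lambda x, z: exp(-(x - z) ** 2)
n, lam = 100, 2.0
X = [random.random() for _ in range(n)]
bound = (lam / n) * sqrt(sum(k(x, x) for x in X))
```

Plugged into Corollary 4.3, such a kernel gives a margin bound whose complexity term is $\frac{2\lambda}{\gamma\sqrt{n}}$, independent of the dimension of $\mathcal{X}$.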

4.2. Convex cost functionals

Next we show that a proper choice of the cost function $\phi$ has further advantages. To this end, we consider nonnegative convex nondecreasing cost functions with $\lim_{x \to -\infty} \phi(x) = 0$ and $\phi(0) = 1$. Main examples of $\phi$ include the exponential cost function $\phi(x) = e^x$ used in AdaBoost and related boosting algorithms, the logit cost function $\phi(x) = \log_2(1 + e^x)$, and the hinge loss (or soft margin loss) $\phi(x) = (1 + x)_+$ used in support vector machines. One of the main advantages of using convex cost functions is that minimizing the empirical cost $A_n(f)$ often becomes a convex optimization problem and is therefore computationally feasible. In fact, most boosting and support vector machine classifiers may be viewed as empirical minimizers of a convex cost functional. However, minimizing convex cost functionals has other theoretical advantages. To understand this, assume, in addition to the above, that $\phi$ is strictly convex and differentiable. Then it is easy to determine the function $f^*$ minimizing the cost functional $A(f) = \mathbb{E}\, \phi(-Y f(X))$. Just note that for each $x \in \mathcal{X}$,
\[
\mathbb{E}\left[ \phi(-Y f(X)) \mid X = x \right] = \eta(x) \phi(-f(x)) + (1 - \eta(x)) \phi(f(x)),
\]
and therefore the function $f^*$ is given by
\[
f^*(x) = \operatorname*{arg\,min}_\alpha h_{\eta(x)}(\alpha) \qquad \text{where, for each } \eta \in [0,1], \quad h_\eta(\alpha) = \eta \phi(-\alpha) + (1 - \eta) \phi(\alpha).
\]
Note that $h_\eta$ is strictly convex and therefore $f^*$ is well defined (though it may take values $\pm\infty$ if $\eta$ equals 0 or 1). Assuming that $h_\eta$ is differentiable, the minimum is achieved for the value of $\alpha$ for which $h_\eta'(\alpha) = 0$, that is, when
\[
\frac{\eta}{1 - \eta} = \frac{\phi'(\alpha)}{\phi'(-\alpha)}.
\]
Since $\phi$ is strictly increasing, we see that the solution is positive if and only if $\eta > 1/2$. This reveals the important fact that the minimizer $f^*$ of the functional $A(f)$ is such that the corresponding classifier $g^*(x) = 2\, \mathbb{1}_{\{f^*(x) \ge 0\}} - 1$ is just the Bayes classifier. Thus, minimizing a convex cost functional leads to an optimal classifier. For example, if $\phi(x) = e^x$ is the exponential cost function, then $f^*(x) = (1/2) \log(\eta(x) / (1 - \eta(x)))$. In the case of the logit cost $\phi(x) = \log_2(1 + e^x)$, we have $f^*(x) = \log(\eta(x) / (1 - \eta(x)))$. We note here that, even though the hinge
loss $\phi(x) = (1 + x)_+$ does not satisfy the conditions for $\phi$ used above (e.g., it is not strictly convex), it is easy to see that the function $f^*$ minimizing the cost functional equals
\[
f^*(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2 \\ -1 & \text{if } \eta(x) < 1/2. \end{cases}
\]
Thus, in this case $f^*$ not only induces the Bayes classifier but is equal to it.

To obtain inequalities for the probability of error of classifiers based on minimization of empirical cost functionals, we need to establish a relationship between the excess probability of error $L(f) - L^*$ and the corresponding excess cost functional $A(f) - A^*$, where $A^* = A(f^*) = \inf_f A(f)$. Here we recall a simple inequality of Zhang [244] which states that if the function $H : [0,1] \to \mathbb{R}$ is defined by $H(\eta) = \inf_\alpha h_\eta(\alpha)$ and the cost function $\phi$ is such that for some positive constants $s \ge 1$ and $c$,
\[
\left| \frac{1}{2} - \eta \right|^s \le c^s \left( 1 - H(\eta) \right), \qquad \eta \in [0, 1],
\]
then for any function $f : \mathcal{X} \to \mathbb{R}$,
\[
L(f) - L^* \le 2c \left( A(f) - A^* \right)^{1/s}. \tag{11}
\]
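For the exponential cost $\phi(x) = e^x$, a direct computation gives $H(\eta) = \inf_\alpha h_\eta(\alpha) = 2\sqrt{\eta(1-\eta)}$, and Zhang's condition then holds with $s = 2$ and $c = 1/\sqrt{2}$. A grid check of this claim:

```python
from math import sqrt

# Zhang's condition for the exponential cost: H(eta) = 2*sqrt(eta*(1-eta)),
# and |1/2 - eta|^2 <= (1/2)*(1 - H(eta)) for all eta in [0,1]
# (i.e. s = 2, c = 1/sqrt(2)), checked on a fine grid.
H = lambda e: 2.0 * sqrt(e * (1.0 - e))
etas = [i / 1000 for i in range(1001)]
condition_holds = all(
    (0.5 - e) ** 2 <= 0.5 * (1.0 - H(e)) + 1e-12 for e in etas
)
```

In fact the inequality can be verified exactly: $1 - H(\eta) = (\sqrt{\eta} - \sqrt{1-\eta})^2$ and $(\eta - 1/2)^2 = \tfrac{1}{4}(\sqrt{\eta} - \sqrt{1-\eta})^2 (\sqrt{\eta} + \sqrt{1-\eta})^2 \le \tfrac{1}{2}(1 - H(\eta))$ since $(\sqrt{\eta} + \sqrt{1-\eta})^2 \le 2$.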

(The simple proof of this inequality is based on the expression (1) and elementary convexity properties of h_η.) In the special case of the exponential and logit cost functions, H(η) = 2√(η(1 − η)) and H(η) = −η log_2 η − (1 − η) log_2(1 − η), respectively. In both cases it is easy to see that the condition above is satisfied with s = 2 and c = 1/√2.

Theorem 4.4 (excess risk of convex risk minimizers). Assume that f_n is chosen from a class F_λ defined in (7) by minimizing the empirical cost functional A_n(f) using either the exponential or the logit cost function. Then, with probability at least 1 − δ,

L(f_n) − L* ≤ 2 (2L_φ λ √(2V_C log(n + 1)/n) + B √(2 log(1/δ)/n))^{1/2} + √2 (inf_{f∈F_λ} A(f) − A*)^{1/2}.

Proof.

L(f_n) − L* ≤ √2 (A(f_n) − A*)^{1/2}
≤ √2 (A(f_n) − inf_{f∈F_λ} A(f))^{1/2} + √2 (inf_{f∈F_λ} A(f) − A*)^{1/2}
≤ √2 (2 sup_{f∈F_λ} |A(f) − A_n(f)|)^{1/2} + √2 (inf_{f∈F_λ} A(f) − A*)^{1/2}   (just like in (2))
≤ 2 (2L_φ λ √(2V_C log(n + 1)/n) + B √(2 log(1/δ)/n))^{1/2} + √2 (inf_{f∈F_λ} A(f) − A*)^{1/2}

with probability at least 1 − δ, where at the last step we used the same bound for sup_{f∈F_λ} |A(f) − A_n(f)| as in (8).

Note that for the exponential cost function L_φ = B = e^λ, while for the logit cost function L_φ ≤ 1 and B ≤ λ. In both cases, if there exists a λ sufficiently large so that inf_{f∈F_λ} A(f) = A*, then the approximation error disappears and we obtain L(f_n) − L* = O(n^{−1/4}). The fact that the exponent in the rate of convergence is dimension-free is remarkable. (We note here that these rates may be further improved by applying the refined techniques described in Section 5.3, see also [40].) It is an interesting approximation-theoretic challenge to understand what kind of functions f* may be obtained as a convex combination of base classifiers and, more generally, to describe approximation properties of classes of functions of the form (7). Next we describe a simple example in which the above-mentioned approximation properties are well understood. Consider the case when X = [0, 1]^d and the base class C contains all decision stumps, that is, all classifiers of the form

s_{i,t}^+(x) = 1_{x^{(i)}≥t} − 1_{x^{(i)}<t} and s_{i,t}^−(x) = 1_{x^{(i)}<t} − 1_{x^{(i)}≥t}, t ∈ [0, 1], i = 1, …, d,

where x^{(i)} denotes the i-th coordinate of x. In this case the VC dimension of the base class is easily seen to be bounded by V_C ≤ 2 log_2(2d). Also, it is easy to see that the closure of F_λ with respect to the supremum norm contains all functions f of the form

f(x) = f_1(x^{(1)}) + ··· + f_d(x^{(d)}),

where the functions f_i : [0, 1] → R are such that ||f_1||_TV + ··· + ||f_d||_TV ≤ λ, where ||f_i||_TV denotes the total variation of the function f_i. Therefore, if f* has this form, we have inf_{f∈F_λ} A(f) = A(f*). Recalling that the function f* optimizing the cost A(f) has the form

f*(x) = (1/2) log(η(x)/(1 − η(x)))

in the case of the exponential cost function, and f*(x) = log(η(x)/(1 − η(x))) in the case of the logit cost function, we see that boosting using decision stumps is especially well fitted to the so-called additive logistic model, in which η is assumed to be such that log(η/(1 − η)) is an additive function (i.e., it can be written as a sum of univariate functions of the components of x). Thus, when η permits an additive logistic representation, the rate of convergence of the classifier is fast and has a very mild dependence on the dimension.

Consider next the case of the hinge loss φ(x) = (1 + x)_+ often used in Support Vector Machines and related kernel methods. In this case H(η) = 2 min(η, 1 − η), and therefore inequality (11) holds with c = 1/2 and s = 1. Thus,

L(f_n) − L* ≤ A(f_n) − A*,

and the analysis above leads to even better rates of convergence. However, in this case f*(x) = 2·1_{η(x)≥1/2} − 1, and approximating this function by weighted sums of base functions may be more difficult than in the case of the exponential and logit costs. Once again, the approximation-theoretic part of the problem is far from being well understood, and it is difficult to give recommendations about which cost function is more advantageous and what base classes should be used.

Bibliographical remarks. For results on the algorithmic difficulty of empirical risk minimization, see Johnson and Preparata [112], Vu [236], Bartlett and Ben-David [26], Ben-David, Eiron, and Simon [32]. Boosting algorithms were originally introduced by Freund and Schapire (see [91], [94], and [190]) as adaptive aggregation of simple classifiers contained in a small base class. The analysis based on the observation that AdaBoost and related methods tend to produce large-margin classifiers appears in Schapire, Freund, Bartlett, and Lee [191], and Koltchinskii and Panchenko [127]. It was Breiman [51] who observed that boosting performs gradient descent optimization of an empirical cost function different from the number of misclassified samples; see also Mason, Baxter, Bartlett, and Frean [157], Collins, Schapire, and Singer [61],
Friedman, Hastie, and Tibshirani [95]. Based on this view, various versions of boosting algorithms have been shown to be consistent in different settings, see Breiman [52], Bühlmann and Yu [54], Blanchard, Lugosi, and Vayatis [40], Jiang [111], Lugosi and Vayatis [146], Mannor and Meir [152], Mannor, Meir, and Zhang [153], Zhang [244]. Inequality (8) was first obtained by Schapire, Freund, Bartlett, and Lee [191]. The analysis presented here is due to Koltchinskii and Panchenko [127]. Other classifiers based on weighted voting schemes have been considered by Catoni [57-59], Yang [241], Freund, Mansour, and Schapire [93]. Kernel methods were pioneered by Aizerman, Braverman, and Rozonoer [2-5], Vapnik and Lerner [228], Bashkirov, Braverman, and Muchnik [31], Vapnik and Chervonenkis [233], and Specht [203]. Support vector machines originate in the pioneering work of Boser, Guyon, and Vapnik [43], Cortes and Vapnik [62]. For surveys we refer to Cristianini and Shawe-Taylor [65], Smola, Bartlett, Schölkopf, and Schuurmans [201], Hastie, Tibshirani, and Friedman [104], Schölkopf and Smola [192]. The study of universal approximation properties of kernels and statistical consistency of Support Vector Machines is due to Steinwart [ ], Lin [140, 141], Zhou [245], and Blanchard, Bousquet, and Massart [39]. We have considered the case of minimization of a loss function on a ball of the reproducing kernel Hilbert space. However, it is computationally more convenient to formulate the problem as the minimization of a regularized functional of the form

min_{f∈F} (1/n) Σ_{i=1}^n φ(−Y_i f(X_i)) + λ||f||².

The standard Support Vector Machine algorithm then corresponds to the choice of φ(x) = (1 + x)_+. Kernel-based regularization algorithms were studied by Kimeldorf and Wahba [120] and Craven and Wahba [64] in the context of regression. Relationships between Support Vector Machines and regularization were described

by Smola, Schölkopf, and Müller [202] and Evgeniou, Pontil, and Poggio [89]. General properties of regularized algorithms in reproducing kernel Hilbert spaces are investigated by Cucker and Smale [68], Steinwart [206], Zhang [244]. Various properties of the Support Vector Machine algorithm are investigated by Vapnik [230, 231], Schölkopf and Smola [192], Scovel and Steinwart [195], and Steinwart [208, 209]. The fact that minimizing an exponential cost functional leads to the Bayes classifier was pointed out by Breiman [52]; see also Lugosi and Vayatis [146], Zhang [244]. For a comprehensive theory of the connection between cost functions and probability of misclassification, see Bartlett, Jordan, and McAuliffe [27]. Zhang's lemma (11) appears in [244]. For various generalizations and refinements we refer to Bartlett, Jordan, and McAuliffe [27] and Blanchard, Lugosi, and Vayatis [40].

5. Tighter bounds for empirical risk minimization

This section is dedicated to the description of some refinements of the ideas described in the earlier sections. What we have seen so far only used first-order properties of the functions we considered, namely their boundedness. It turns out that by using second-order properties, such as the variance of the functions, many of the above results can be made sharper.

5.1. Relative deviations

In order to understand the basic phenomenon, let us go back to the simplest case in which one has a fixed function f with values in {0, 1}. In this case, P_n f is an average of independent Bernoulli random variables with parameter p = P f. Recall that, as a simple consequence of (3), with probability at least 1 − δ,

P f − P_n f ≤ √(log(1/δ)/(2n)). (12)

This is basically tight when P f = 1/2, but can be significantly improved when P f is small. Indeed, Bernstein's inequality gives, with probability at least 1 − δ,

P f − P_n f ≤ √(2 Var(f) log(1/δ)/n) + 2 log(1/δ)/(3n). (13)

Since f takes its values in {0, 1}, Var(f) = P f(1 − P f) ≤ P f, which shows that when P f is small, (13) is much better than (12).

5.1.1. General inequalities

Next we exploit the phenomenon described above to obtain sharper
performance bounds for empirical risk minimization. Note that if we consider the difference P f − P_n f uniformly over the class F, the largest deviations are obtained by functions that have a large variance (i.e., P f close to 1/2). An idea is to scale each function by dividing it by √(P f) so that they all behave in a similar way. Thus, we bound the quantity

sup_{f∈F} (P f − P_n f)/√(P f).

The first step consists in symmetrization of the tail probabilities: if nt² ≥ 2,

P{ sup_{f∈F} (P f − P_n f)/√(P f) ≥ t } ≤ 2P{ sup_{f∈F} (P_n' f − P_n f)/√((P_n f + P_n' f)/2) ≥ t },

where P_n' denotes the empirical measure of an independent ghost sample X_1', …, X_n'.
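The advantage of (13) over (12) when P f is small is easy to quantify. The following sketch (ours, not from the survey; the function names are hypothetical) evaluates the two right-hand sides, using Var(f) ≤ P f for a {0, 1}-valued f:

```python
import math

def hoeffding_dev(n, delta):
    # right-hand side of (12)
    return math.sqrt(math.log(1 / delta) / (2 * n))

def bernstein_dev(p, n, delta):
    # right-hand side of (13), with Var(f) <= p for a {0,1}-valued f with P f = p
    return math.sqrt(2 * p * math.log(1 / delta) / n) + 2 * math.log(1 / delta) / (3 * n)

n, delta = 10000, 0.01
assert bernstein_dev(0.001, n, delta) < hoeffding_dev(n, delta)  # small P f: Bernstein wins
assert bernstein_dev(0.5, n, delta) > hoeffding_dev(n, delta)    # P f = 1/2: Hoeffding is tighter
```

At n = 10000 and δ = 0.01, the Bernstein bound is roughly an order of magnitude smaller when P f = 0.001, while Hoeffding's bound is tighter at P f = 1/2, as the text predicts.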

Next we introduce Rademacher random variables, obtaining, by simple symmetrization,

2P{ sup_{f∈F} (P_n' f − P_n f)/√((P_n f + P_n' f)/2) ≥ t } = 2E[ P_σ{ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i(f(X_i') − f(X_i))/√((P_n f + P_n' f)/2) ≥ t } ]

(where P_σ is the conditional probability, given the X_i and X_i'). The last step uses tail bounds for individual functions and a union bound over F(X_1^{2n}), where X_1^{2n} denotes the union of the initial sample X_1, …, X_n and of the extra symmetrization sample X_1', …, X_n'. Summarizing, we obtain the following inequalities:

Theorem 5.1. Let F be a class of functions taking binary values in {0, 1}. For any δ ∈ (0, 1), with probability at least 1 − δ, all f ∈ F satisfy

P f − P_n f ≤ 2 √(P f (log S_F(X_1^{2n}) + log(4/δ))/n).

Also, with probability at least 1 − δ, for all f ∈ F,

P_n f − P f ≤ 2 √(P_n f (log S_F(X_1^{2n}) + log(4/δ))/n).

As a consequence, we have that for all s > 0, with probability at least 1 − δ,

sup_{f∈F} (P f − P_n f)/(P f + P_n f + s/2) ≤ 2 √((log S_F(X_1^{2n}) + log(4/δ))/(sn)), (14)

and the same is true if P and P_n are permuted. Another consequence of Theorem 5.1 with interesting applications is the following: for all t ∈ (0, 1], with probability at least 1 − δ,

for all f ∈ F, P_n f ≤ (1 − t)P f implies P f ≤ (4/(t²n))(log S_F(X_1^{2n}) + log(4/δ)). (15)

In particular, setting t = 1,

for all f ∈ F, P_n f = 0 implies P f ≤ (4/n)(log S_F(X_1^{2n}) + log(4/δ)).

5.1.2. Applications to empirical risk minimization

It is easy to see that, for non-negative numbers A, B, C ≥ 0, the fact that A ≤ B√A + C entails A ≤ B² + B√C + C, so that we obtain from the second inequality of Theorem 5.1 that, with probability at least 1 − δ, for all f ∈ F,

P f ≤ P_n f + 2 √(P_n f (log S_F(X_1^{2n}) + log(4/δ))/n) + 4 (log S_F(X_1^{2n}) + log(4/δ))/n.

Corollary 5.2. Let g_n be the empirical risk minimizer in a class C of VC dimension V. Then, with probability at least 1 − δ,

L(g_n) ≤ L_n(g_n) + 2 √(L_n(g_n)(2V log(n + 1) + log(4/δ))/n) + 4 (2V log(n + 1) + log(4/δ))/n.
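Corollary 5.2 can be evaluated directly; note how the middle term vanishes when the empirical risk is zero, leaving a bound of order V log n / n. (The sketch below is ours, with constants following our reading of the corollary; the function name is hypothetical.)

```python
import math

def corollary_5_2_bound(Ln, n, V, delta):
    # right-hand side of Corollary 5.2, with Ln = L_n(g_n)
    D = (2 * V * math.log(n + 1) + math.log(4 / delta)) / n
    return Ln + 2 * math.sqrt(Ln * D) + 4 * D

n, V, delta = 100000, 10, 0.05
b_zero = corollary_5_2_bound(0.0, n, V, delta)  # zero empirical risk: only the fast 4D term survives
b_half = corollary_5_2_bound(0.5, n, V, delta)  # L_n = 1/2: the sqrt term dominates the excess
assert abs(b_zero - 4 * (2 * V * math.log(n + 1) + math.log(4 / delta)) / n) < 1e-12
assert b_half - 0.5 > math.sqrt((2 * V * math.log(n + 1) + math.log(4 / delta)) / n)
```

This is exactly the contrast developed next: an O(V log n / n) bound in the zero-error case versus an O(√(V log n / n)) bound in general.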

Consider first the extreme situation when there exists a classifier in C which classifies without error. This also means that for some g' ∈ C, Y = g'(X) with probability one. This is clearly a quite restrictive assumption, only satisfied in very special cases. Nevertheless, the assumption that inf_{g∈C} L(g) = 0 has been commonly used in computational learning theory, perhaps because of its mathematical simplicity. In such a case, clearly L_n(g_n) = 0, so that we get, with probability at least 1 − δ,

L(g_n) − inf_{g∈C} L(g) ≤ 4 (2V log(n + 1) + log(4/δ))/n. (16)

The main point here is that the upper bound obtained in this special case is of smaller order of magnitude (O(V log n/n) as opposed to O(√(V log n/n))) than in the general case. One can actually obtain a version which interpolates between these two cases as follows. For simplicity, assume that there is a classifier g* in C such that L(g*) = inf_{g∈C} L(g). Then we have

L_n(g_n) ≤ L_n(g*) = L_n(g*) − L(g*) + L(g*).

Using Bernstein's inequality, we get, with probability at least 1 − δ,

L_n(g*) − L(g*) ≤ √(2L(g*) log(1/δ)/n) + 2 log(1/δ)/(3n),

which, together with Corollary 5.2, yields:

Corollary 5.3. There exists a constant C such that, with probability at least 1 − δ,

L(g_n) − inf_{g∈C} L(g) ≤ C ( √(inf_{g∈C} L(g) (V log n + log(1/δ))/n) + (V log n + log(1/δ))/n ).

5.2. Noise and fast rates

We have seen that in the case where f takes values in {0, 1} there is a nice relationship between the variance of f (which controls the size of the deviations between P_n f and P f) and its expectation, namely, Var(f) ≤ P f. This is the key property that allows one to obtain faster rates of convergence for L(g_n) − inf_{g∈C} L(g). In particular, in the ideal situation mentioned above, when inf_{g∈C} L(g) = 0, the difference L(g_n) − inf_{g∈C} L(g) may be much smaller than the worst-case difference sup_{g∈C}(L(g) − L_n(g)). This actually happens in many cases, whenever the distribution satisfies certain conditions. Next we describe such conditions and show how the finer bounds can be derived. The main idea is that, in order to get precise rates for L(g_n) − inf_{g∈C} L(g), we consider functions of the form

1_{g(X)≠Y} − 1_{g*(X)≠Y},

where g* is a classifier minimizing the loss in the class C, that is, such that L(g*) = inf_{g∈C} L(g). Note that functions of this form are no longer non-negative. To illustrate the basic ideas in the simplest possible setting, consider the case when the loss class F is a finite set of N functions of the form 1_{g(X)≠Y} − 1_{g*(X)≠Y}. In addition, we assume that there is a relationship between the variance and the expectation of the functions in F given by the inequality

Var(f) ≤ (P f / h)^α (17)

for some h > 0 and α ∈ (0, 1]. By Bernstein's inequality and a union bound over the elements of F, we have that, with probability at least 1 − δ, for all f ∈ F,

P f ≤ P_n f + √(2(P f/h)^α log(N/δ)/n) + 4 log(N/δ)/(3n).

As a consequence, using the fact that P_n f_n = L_n(g_n) − L_n(g*) ≤ 0 for the function f_n corresponding to the empirical risk minimizer g_n, we have with probability at least 1 − δ,

L(g_n) − L(g*) ≤ √(2((L(g_n) − L(g*))/h)^α log(N/δ)/n) + 4 log(N/δ)/(3n).

Solving this inequality for L(g_n) − L(g*) finally gives that, with probability at least 1 − δ,

L(g_n) − inf_{g∈C} L(g) ≤ (2 log(N/δ)/(h^α n))^{1/(2−α)}. (18)

Note that the obtained rate is then faster than n^{−1/2} whenever α > 0. In particular, for α = 1 we get n^{−1}, as in the ideal case. It now remains to show whether (17) is a reasonable assumption. As the simplest possible example, assume that the Bayes classifier g* belongs to the class C (i.e., the minimizer of the loss in C is the Bayes classifier) and that the a posteriori probability function η is bounded away from 1/2, that is, there exists a positive constant h such that for all x ∈ X, |2η(x) − 1| > h. Note that the assumption that the Bayes classifier is in C is very restrictive and is unlikely to be satisfied in practice, especially if the class C is finite, as is assumed in this discussion. The assumption that η is bounded away from 1/2 may also appear to be quite specific. However, the situation described here may serve as a first illustration of a nontrivial example when fast rates may be achieved. Since |1_{g(X)≠Y} − 1_{g*(X)≠Y}| ≤ 1_{g(X)≠g*(X)}, the conditions stated above and (1) imply that

Var(f) ≤ E[1_{g(X)≠g*(X)}] ≤ (1/h) E[|2η(X) − 1| 1_{g(X)≠g*(X)}] = (L(g) − L*)/h.

Thus (17) holds with α = 1, which shows that, with probability at least 1 − δ,

L(g_n) − L* ≤ C log(N/δ)/(hn). (19)

Thus, the empirical risk minimizer has a significantly better performance than predicted by the results of the previous section whenever the Bayes classifier is in the class C and the a posteriori probability η stays away from 1/2. The behavior of η in the vicinity of 1/2 has been known to play an important role in the difficulty of the classification problem, see [72, 239, 240]. Roughly speaking, if η has a complex behavior around the critical
threshold 1/2, then one cannot avoid estimating η, which is a typically difficult nonparametric regression problem. However, the classification problem is significantly easier than regression if η is far from 1/2 with large probability. The condition of η being bounded away from 1/2 may be significantly relaxed and generalized. Indeed, in the context of discriminant analysis, Mammen and Tsybakov [151] and Tsybakov [221] formulated a useful condition that has been adopted by many authors. Let α ∈ [0, 1). Then the Mammen-Tsybakov condition may

be stated by any of the following three equivalent statements:

(1) there exists β > 0 such that for every g : X → {0, 1}, E[1_{g(X)≠g*(X)}] ≤ β (L(g) − L*)^α;
(2) there exists c > 0 such that for every A ⊂ X, ∫_A dP(x) ≤ c (∫_A |2η(x) − 1| dP(x))^α;
(3) there exists B > 0 such that for every t ≥ 0, P{|2η(X) − 1| ≤ t} ≤ B t^{α/(1−α)}.

We refer to this as the Mammen-Tsybakov noise condition. The proof that these statements are equivalent is straightforward, and we omit it, but we comment on their meaning. Notice first that α has to be in [0, 1] because

L(g) − L* = E[|2η(X) − 1| 1_{g(X)≠g*(X)}] ≤ E[1_{g(X)≠g*(X)}].

Also, when α = 0 these conditions are void. The case α = 1 in (1) is realized when there exists an s > 0 such that |2η(X) − 1| > s almost surely (which is just the extreme noise condition we considered above). The most important consequence of these conditions is that they imply a relationship between the variance and the expectation of functions of the form 1_{g(X)≠Y} − 1_{g*(X)≠Y}. Indeed, we obtain

E[(1_{g(X)≠Y} − 1_{g*(X)≠Y})²] ≤ c (L(g) − L*)^α.

This is thus enough to get (18) for a finite class of functions. The sharper bounds, established in this section and the next, come at the price of the assumption that the Bayes classifier is in the class C. Because of this, it is difficult to compare the fast rates achieved with the slower rates proved in Section 3. On the other hand, noise conditions like the Mammen-Tsybakov condition may be used to get improvements even when g* is not contained in C. In these cases the approximation error L(g*) − L* also needs to be taken into account, and the situation becomes somewhat more complex. We return to these issues in Sections 5.3.5 and 8.

5.3. Localization

The purpose of this section is to generalize the simple argument of the previous section to more general classes C of classifiers. This generalization reveals the importance of the modulus of continuity of the empirical process as a measure of complexity of the learning problem.

5.3.1. Talagrand's inequality

One of the most important recent developments in empirical process theory is a concentration inequality for the supremum of an empirical process, first proved by Talagrand [212] and refined
later by various authors. This inequality is at the heart of many key developments in statistical learning theory. Here we recall the following version:

Theorem 5.4. Let b > 0 and let F be a set of functions from X to R. Assume that all functions in F satisfy P f − f ≤ b. Then, with probability at least 1 − δ, for any θ > 0,

sup_{f∈F} (P f − P_n f) ≤ (1 + θ) E[sup_{f∈F} (P f − P_n f)] + √(2 sup_{f∈F} Var(f) log(1/δ)/n) + (1 + 3/θ) b log(1/δ)/(3n),

which, for θ = 1, translates to

sup_{f∈F} (P f − P_n f) ≤ 2 E[sup_{f∈F} (P f − P_n f)] + √(2 sup_{f∈F} Var(f) log(1/δ)/n) + 4b log(1/δ)/(3n).
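Before the localization argument is developed further, it is worth seeing how strong the improvement encoded in (18) already is: the exponent 1/(2 − α) interpolates between the slow n^{−1/2} regime (α = 0) and the fast n^{−1} regime (α = 1). Taking the display (18) at face value with h = 1 and log(N/δ) ≈ 10 (a numerical illustration of ours; the function name is hypothetical):

```python
def fast_rate(n, alpha, logNdelta=10.0, h=1.0):
    # bound (18): (2 log(N/delta) / (h**alpha * n)) ** (1/(2 - alpha))
    return (2 * logNdelta / (h ** alpha * n)) ** (1.0 / (2 - alpha))

n = 10**6
r0, r1 = fast_rate(n, 0.0), fast_rate(n, 1.0)
assert abs(r0 - (20 / n) ** 0.5) < 1e-12  # alpha = 0: the usual n^{-1/2} rate
assert abs(r1 - 20 / n) < 1e-12           # alpha = 1: the fast n^{-1} rate
assert r1 < r0
```

At n = 10^6 the two regimes differ by more than two orders of magnitude, which is why the noise conditions above have attracted so much attention.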


More information

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring No-life isurace mathematics Nils F. Haavardsso, Uiversity of Oslo ad DNB Skadeforsikrig Mai issues so far Why does isurace work? How is risk premium defied ad why is it importat? How ca claim frequecy

More information

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas: Chapter 7 - Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries

More information

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed Multi-Evet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria

More information

Overview of some probability distributions.

Overview of some probability distributions. Lecture Overview of some probability distributios. I this lecture we will review several commo distributios that will be used ofte throughtout the class. Each distributio is usually described by its probability

More information

Normal Distribution.

Normal Distribution. Normal Distributio www.icrf.l Normal distributio I probability theory, the ormal or Gaussia distributio, is a cotiuous probability distributio that is ofte used as a first approimatio to describe realvalued

More information

Confidence Intervals for One Mean

Confidence Intervals for One Mean Chapter 420 Cofidece Itervals for Oe Mea Itroductio This routie calculates the sample size ecessary to achieve a specified distace from the mea to the cofidece limit(s) at a stated cofidece level for a

More information

Statistical inference: example 1. Inferential Statistics

Statistical inference: example 1. Inferential Statistics Statistical iferece: example 1 Iferetial Statistics POPULATION SAMPLE A clothig store chai regularly buys from a supplier large quatities of a certai piece of clothig. Each item ca be classified either

More information

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio

More information

INFINITE SERIES KEITH CONRAD

INFINITE SERIES KEITH CONRAD INFINITE SERIES KEITH CONRAD. Itroductio The two basic cocepts of calculus, differetiatio ad itegratio, are defied i terms of limits (Newto quotiets ad Riema sums). I additio to these is a third fudametal

More information

1. C. The formula for the confidence interval for a population mean is: x t, which was

1. C. The formula for the confidence interval for a population mean is: x t, which was s 1. C. The formula for the cofidece iterval for a populatio mea is: x t, which was based o the sample Mea. So, x is guarateed to be i the iterval you form.. D. Use the rule : p-value

More information

PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM

PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics

More information

Infinite Sequences and Series

Infinite Sequences and Series CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...

More information

Measures of Spread and Boxplots Discrete Math, Section 9.4

Measures of Spread and Boxplots Discrete Math, Section 9.4 Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,

More information

THE ABRACADABRA PROBLEM

THE ABRACADABRA PROBLEM THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected

More information

The Stable Marriage Problem

The Stable Marriage Problem The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV William.Hut@mail.wvu.edu 1 Itroductio Imagie you are a matchmaker,

More information

Factors of sums of powers of binomial coefficients

Factors of sums of powers of binomial coefficients ACTA ARITHMETICA LXXXVI.1 (1998) Factors of sums of powers of biomial coefficiets by Neil J. Cali (Clemso, S.C.) Dedicated to the memory of Paul Erdős 1. Itroductio. It is well ow that if ( ) a f,a = the

More information

TIGHT BOUNDS ON EXPECTED ORDER STATISTICS

TIGHT BOUNDS ON EXPECTED ORDER STATISTICS Probability i the Egieerig ad Iformatioal Scieces, 20, 2006, 667 686+ Prited i the U+S+A+ TIGHT BOUNDS ON EXPECTED ORDER STATISTICS DIMITRIS BERTSIMAS Sloa School of Maagemet ad Operatios Research Ceter

More information

Perfect Packing Theorems and the Average-Case Behavior of Optimal and Online Bin Packing

Perfect Packing Theorems and the Average-Case Behavior of Optimal and Online Bin Packing SIAM REVIEW Vol. 44, No. 1, pp. 95 108 c 2002 Society for Idustrial ad Applied Mathematics Perfect Packig Theorems ad the Average-Case Behavior of Optimal ad Olie Bi Packig E. G. Coffma, Jr. C. Courcoubetis

More information

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,

More information

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k. 18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: Courat-Fischer formula ad Rayleigh quotiets The

More information

University of California, Los Angeles Department of Statistics. Distributions related to the normal distribution

University of California, Los Angeles Department of Statistics. Distributions related to the normal distribution Uiversity of Califoria, Los Ageles Departmet of Statistics Statistics 100B Istructor: Nicolas Christou Three importat distributios: Distributios related to the ormal distributio Chi-square (χ ) distributio.

More information

Analyzing Longitudinal Data from Complex Surveys Using SUDAAN

Analyzing Longitudinal Data from Complex Surveys Using SUDAAN Aalyzig Logitudial Data from Complex Surveys Usig SUDAAN Darryl Creel Statistics ad Epidemiology, RTI Iteratioal, 312 Trotter Farm Drive, Rockville, MD, 20850 Abstract SUDAAN: Software for the Statistical

More information

Case Study. Normal and t Distributions. Density Plot. Normal Distributions

Case Study. Normal and t Distributions. Density Plot. Normal Distributions Case Study Normal ad t Distributios Bret Halo ad Bret Larget Departmet of Statistics Uiversity of Wiscosi Madiso October 11 13, 2011 Case Study Body temperature varies withi idividuals over time (it ca

More information

Swaps: Constant maturity swaps (CMS) and constant maturity. Treasury (CMT) swaps

Swaps: Constant maturity swaps (CMS) and constant maturity. Treasury (CMT) swaps Swaps: Costat maturity swaps (CMS) ad costat maturity reasury (CM) swaps A Costat Maturity Swap (CMS) swap is a swap where oe of the legs pays (respectively receives) a swap rate of a fixed maturity, while

More information

AMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99

AMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99 VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS Jia Huag 1, Joel L. Horowitz 2 ad Fegrog Wei 3 1 Uiversity of Iowa, 2 Northwester Uiversity ad 3 Uiversity of West Georgia Abstract We cosider a oparametric

More information

The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles

The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles The followig eample will help us uderstad The Samplig Distributio of the Mea Review: The populatio is the etire collectio of all idividuals or objects of iterest The sample is the portio of the populatio

More information

A Faster Clause-Shortening Algorithm for SAT with No Restriction on Clause Length

A Faster Clause-Shortening Algorithm for SAT with No Restriction on Clause Length Joural o Satisfiability, Boolea Modelig ad Computatio 1 2005) 49-60 A Faster Clause-Shorteig Algorithm for SAT with No Restrictio o Clause Legth Evgey Datsi Alexader Wolpert Departmet of Computer Sciece

More information

CHAPTER 3 THE TIME VALUE OF MONEY

CHAPTER 3 THE TIME VALUE OF MONEY CHAPTER 3 THE TIME VALUE OF MONEY OVERVIEW A dollar i the had today is worth more tha a dollar to be received i the future because, if you had it ow, you could ivest that dollar ad ear iterest. Of all

More information

Research Article Sign Data Derivative Recovery

Research Article Sign Data Derivative Recovery Iteratioal Scholarly Research Network ISRN Applied Mathematics Volume 0, Article ID 63070, 7 pages doi:0.540/0/63070 Research Article Sig Data Derivative Recovery L. M. Housto, G. A. Glass, ad A. D. Dymikov

More information

Entropy of bi-capacities

Entropy of bi-capacities Etropy of bi-capacities Iva Kojadiovic LINA CNRS FRE 2729 Site école polytechique de l uiv. de Nates Rue Christia Pauc 44306 Nates, Frace iva.kojadiovic@uiv-ates.fr Jea-Luc Marichal Applied Mathematics

More information

Chapter 7: Confidence Interval and Sample Size

Chapter 7: Confidence Interval and Sample Size Chapter 7: Cofidece Iterval ad Sample Size Learig Objectives Upo successful completio of Chapter 7, you will be able to: Fid the cofidece iterval for the mea, proportio, ad variace. Determie the miimum

More information

Hypergeometric Distributions

Hypergeometric Distributions 7.4 Hypergeometric Distributios Whe choosig the startig lie-up for a game, a coach obviously has to choose a differet player for each positio. Similarly, whe a uio elects delegates for a covetio or you

More information

Annuities Under Random Rates of Interest II By Abraham Zaks. Technion I.I.T. Haifa ISRAEL and Haifa University Haifa ISRAEL.

Annuities Under Random Rates of Interest II By Abraham Zaks. Technion I.I.T. Haifa ISRAEL and Haifa University Haifa ISRAEL. Auities Uder Radom Rates of Iterest II By Abraham Zas Techio I.I.T. Haifa ISRAEL ad Haifa Uiversity Haifa ISRAEL Departmet of Mathematics, Techio - Israel Istitute of Techology, 3000, Haifa, Israel I memory

More information

THE TWO-VARIABLE LINEAR REGRESSION MODEL

THE TWO-VARIABLE LINEAR REGRESSION MODEL THE TWO-VARIABLE LINEAR REGRESSION MODEL Herma J. Bieres Pesylvaia State Uiversity April 30, 202. Itroductio Suppose you are a ecoomics or busiess maor i a college close to the beach i the souther part

More information

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth Questio 1: What is a ordiary auity? Let s look at a ordiary auity that is certai ad simple. By this, we mea a auity over a fixed term whose paymet period matches the iterest coversio period. Additioally,

More information

Lecture 5: Span, linear independence, bases, and dimension

Lecture 5: Span, linear independence, bases, and dimension Lecture 5: Spa, liear idepedece, bases, ad dimesio Travis Schedler Thurs, Sep 23, 2010 (versio: 9/21 9:55 PM) 1 Motivatio Motivatio To uderstad what it meas that R has dimesio oe, R 2 dimesio 2, etc.;

More information

Example 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here).

Example 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here). BEGINNING ALGEBRA Roots ad Radicals (revised summer, 00 Olso) Packet to Supplemet the Curret Textbook - Part Review of Square Roots & Irratioals (This portio ca be ay time before Part ad should mostly

More information

1 The Gaussian channel

1 The Gaussian channel ECE 77 Lecture 0 The Gaussia chael Objective: I this lecture we will lear about commuicatio over a chael of practical iterest, i which the trasmitted sigal is subjected to additive white Gaussia oise.

More information

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals Overview Estimatig the Value of a Parameter Usig Cofidece Itervals We apply the results about the sample mea the problem of estimatio Estimatio is the process of usig sample data estimate the value of

More information

Concentration of Measure

Concentration of Measure Copyright c 2008 2010 Joh Lafferty, Ha Liu, ad Larry Wasserma Do Not Distribute Chapter 7 Cocetratio of Measure Ofte we wat to show that some radom quatity is close to its mea with high probability Results

More information

1 Correlation and Regression Analysis

1 Correlation and Regression Analysis 1 Correlatio ad Regressio Aalysis I this sectio we will be ivestigatig the relatioship betwee two cotiuous variable, such as height ad weight, the cocetratio of a ijected drug ad heart rate, or the cosumptio

More information

Chapter 5: Inner Product Spaces

Chapter 5: Inner Product Spaces Chapter 5: Ier Product Spaces Chapter 5: Ier Product Spaces SECION A Itroductio to Ier Product Spaces By the ed of this sectio you will be able to uderstad what is meat by a ier product space give examples

More information

Estimating Probability Distributions by Observing Betting Practices

Estimating Probability Distributions by Observing Betting Practices 5th Iteratioal Symposium o Imprecise Probability: Theories ad Applicatios, Prague, Czech Republic, 007 Estimatig Probability Distributios by Observig Bettig Practices Dr C Lych Natioal Uiversity of Irelad,

More information

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find 1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.

More information

UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006

UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006 Exam format UC Bereley Departmet of Electrical Egieerig ad Computer Sciece EE 6: Probablity ad Radom Processes Solutios 9 Sprig 006 The secod midterm will be held o Wedesday May 7; CHECK the fial exam

More information

Exploratory Data Analysis

Exploratory Data Analysis 1 Exploratory Data Aalysis Exploratory data aalysis is ofte the rst step i a statistical aalysis, for it helps uderstadig the mai features of the particular sample that a aalyst is usig. Itelliget descriptios

More information

Systems Design Project: Indoor Location of Wireless Devices

Systems Design Project: Indoor Location of Wireless Devices Systems Desig Project: Idoor Locatio of Wireless Devices Prepared By: Bria Murphy Seior Systems Sciece ad Egieerig Washigto Uiversity i St. Louis Phoe: (805) 698-5295 Email: bcm1@cec.wustl.edu Supervised

More information

Quadrat Sampling in Population Ecology

Quadrat Sampling in Population Ecology Quadrat Samplig i Populatio Ecology Backgroud Estimatig the abudace of orgaisms. Ecology is ofte referred to as the "study of distributio ad abudace". This beig true, we would ofte like to kow how may

More information

3. Greatest Common Divisor - Least Common Multiple

3. Greatest Common Divisor - Least Common Multiple 3 Greatest Commo Divisor - Least Commo Multiple Defiitio 31: The greatest commo divisor of two atural umbers a ad b is the largest atural umber c which divides both a ad b We deote the greatest commo gcd

More information

Lesson 17 Pearson s Correlation Coefficient

Lesson 17 Pearson s Correlation Coefficient Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) -types of data -scatter plots -measure of directio -measure of stregth Computatio -covariatio of X ad Y -uique variatio i X ad Y -measurig

More information

BASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1)

BASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1) BASIC STATISTICS. SAMPLES, RANDOM SAMPLING AND SAMPLE STATISTICS.. Radom Sample. The radom variables X,X 2,..., X are called a radom sample of size from the populatio f(x if X,X 2,..., X are mutually idepedet

More information

arxiv:1506.03481v1 [stat.me] 10 Jun 2015

arxiv:1506.03481v1 [stat.me] 10 Jun 2015 BEHAVIOUR OF ABC FOR BIG DATA By Wetao Li ad Paul Fearhead Lacaster Uiversity arxiv:1506.03481v1 [stat.me] 10 Ju 2015 May statistical applicatios ivolve models that it is difficult to evaluate the likelihood,

More information

An Efficient Polynomial Approximation of the Normal Distribution Function & Its Inverse Function

An Efficient Polynomial Approximation of the Normal Distribution Function & Its Inverse Function A Efficiet Polyomial Approximatio of the Normal Distributio Fuctio & Its Iverse Fuctio Wisto A. Richards, 1 Robi Atoie, * 1 Asho Sahai, ad 3 M. Raghuadh Acharya 1 Departmet of Mathematics & Computer Sciece;

More information