Introduction to Statistical Learning Theory


1 Introduction to Statistical Learning Theory

Olivier Bousquet (1), Stéphane Boucheron (2), and Gábor Lugosi (3)

(1) Max-Planck Institute for Biological Cybernetics, Spemannstr. 38, Tübingen, Germany. olivier.bousquet@m4x.org
(2) Université de Paris-Sud, Laboratoire d'Informatique, Bâtiment 490, Orsay Cedex, France. stephane.boucheron@lri.fr
(3) Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, Barcelona, Spain. lugosi@upf.es

Abstract. The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms. In particular, most results take the form of so-called error bounds. This tutorial introduces the techniques that are used to obtain such results.

1 Introduction

The main goal of statistical learning theory is to provide a framework for studying the problem of inference, that is, of gaining knowledge, making predictions, making decisions or constructing models from a set of data. This is studied in a statistical framework, that is, there are assumptions of a statistical nature about the underlying phenomena (in the way the data is generated). As a motivation for the need for such a theory, let us just quote V. Vapnik [1]: "Nothing is more practical than a good theory." Indeed, a theory of inference should be able to give a formal definition of words like learning, generalization, and overfitting, and also to characterize the performance of learning algorithms so that, ultimately, it may help design better learning algorithms. There are thus two goals: make things more precise and derive new or improved algorithms.

1.1 Learning and Inference

What is under study here is the process of inductive inference, which can roughly be summarized as the following steps:

1. Observe a phenomenon.
2. Construct a model of that phenomenon.
3. Make predictions using this model.

Of course, this definition is very general and could be taken more or less as the goal of the Natural Sciences. The goal of Machine Learning is to actually automate this process, and the goal of Learning Theory is to formalize it.

In this tutorial we consider a special case of the above process: the supervised learning framework for pattern recognition. In this framework, the data consists of instance-label pairs, where the label is either +1 or -1. Given a set of such pairs, a learning algorithm constructs a function mapping instances to labels. This function should be such that it makes few mistakes when predicting the labels of unseen instances.

Of course, given some training data, it is always possible to build a function that fits the data exactly. But, in the presence of noise, this may not be the best thing to do, as it would lead to poor performance on unseen instances (this is usually referred to as overfitting).

[Fig. 1: Trade-off between fit and complexity]

The general idea behind the design of learning algorithms is thus to look for regularities (in a sense to be defined later) in the observed phenomenon (i.e. the training data). These can then be generalized from the observed past to the future. Typically, one would look, in a collection of possible models, for one which fits the data well but at the same time is as simple as possible (see Figure 1). This immediately raises the question of how to measure and quantify the simplicity of a model (i.e. a {-1, +1}-valued function).

It turns out that there are many ways to do so, but no best one. For example, in Physics, people tend to prefer models which have a small number of constants and which correspond to simple mathematical formulas. Often, the length of the description of a model in a coding language can be an indication of its complexity. In classical statistics, the number of free parameters of a model is usually a measure of its complexity. Surprising as it may seem, there is no universal way of measuring simplicity (or its counterpart, complexity), and the choice of a specific measure inherently depends on the problem at hand. It is actually in this choice that the designer of the learning algorithm introduces knowledge about the specific phenomenon under study.

This lack of a universally best choice can actually be formalized in what is called the No Free Lunch theorem, which in essence says that if there is no assumption on how the past (i.e. training data) is related to the future (i.e. test data), prediction is impossible. Even more, if there is no a priori restriction on the possible phenomena that are expected, it is impossible to generalize, and there is thus no algorithm better than another (any algorithm would be beaten by another one on some phenomenon). Hence the need to make assumptions, like the fact that the phenomenon we observe can be explained by a simple model. However, as we said, simplicity is not an absolute notion, and this leads to the statement that data cannot replace knowledge, or in pseudo-mathematical terms:

Generalization = Data + Knowledge

1.2 Assumptions

We now make more precise the assumptions that are made by the Statistical Learning Theory framework. Indeed, as we said before, we need to assume that the future (i.e. test) observations are related to the past (i.e. training) ones, so that the phenomenon is somewhat stationary. At the core of the theory is a probabilistic model of the phenomenon (or data generation process). Within this model, the relationship between past and future observations is that they are both sampled independently from the same distribution (i.i.d.). The independence assumption means that each new observation yields maximum information. The identical distribution means that the observations give information about the underlying phenomenon (here a probability distribution). An immediate consequence of this very general setting is that one can construct algorithms (e.g. k-nearest neighbors with appropriate k) that are consistent, which means that, as one gets more and more data, the predictions of the algorithm get closer and closer to the optimal ones. So this seems to indicate that we can have some sort of universal algorithm. Unfortunately, any (consistent) algorithm can have an arbitrarily bad behavior when given a finite training set. These notions are formalized in Appendix B. Again, this discussion indicates that generalization can only come when one adds specific knowledge to the data. Each learning algorithm encodes specific

knowledge (or a specific assumption about how the optimal classifier looks), and works best when this assumption is satisfied by the problem to which it is applied.

Bibliographical remarks. Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Anthony and Bartlett [2], Breiman, Friedman, Olshen, and Stone [3], Devroye, Györfi, and Lugosi [4], Duda and Hart [5], Fukunaga [6], Kearns and Vazirani [7], Kulkarni, Lugosi, and Venkatesh [8], Lugosi [9], McLachlan [10], Mendelson [11], Natarajan [12], Vapnik [13, 14, 1], and Vapnik and Chervonenkis [15].

2 Formalization

We consider an input space X and output space Y. Since we restrict ourselves to binary classification, we choose Y = {-1, 1}. Formally, we assume that the pairs (X, Y) ∈ X × Y are random variables distributed according to an unknown distribution P. We observe a sequence of n i.i.d. pairs (X_i, Y_i) sampled according to P, and the goal is to construct a function g : X → Y which predicts Y from X.

We need a criterion to choose this function g. This criterion is a low probability of error P(g(X) ≠ Y). We thus define the risk of g as

R(g) = P(g(X) ≠ Y) = E[1_{g(X) ≠ Y}].

Notice that P can be decomposed as P_X × P(Y|X). We introduce the regression function η(x) = E[Y|X = x] = 2 P[Y = 1|X = x] - 1 and the target function (or Bayes classifier) t(x) = sgn η(x). This function achieves the minimum risk over all possible measurable functions: R(t) = inf_g R(g). We will denote the value R(t) by R*, called the Bayes risk. In the deterministic case, one has Y = t(X) almost surely (P[Y = 1|X] ∈ {0, 1}) and R* = 0. In the general case we can define the noise level as s(x) = min(P[Y = 1|X = x], 1 - P[Y = 1|X = x]) = (1 - |η(x)|)/2 (s(X) = 0 almost surely in the deterministic case), and this gives R* = E[s(X)].

Our goal is thus to identify the function t, but since P is unknown we cannot directly measure the risk, and we also cannot know directly the value of t at the data points. We can only measure the agreement of a candidate function with the data. This is called the
empirical risk:

R_n(g) = (1/n) Σ_{i=1}^n 1_{g(X_i) ≠ Y_i}.

It is common to use this quantity as a criterion to select an estimate of t.
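In code, the empirical risk is simply the fraction of sample points on which a candidate classifier disagrees with the labels. The sketch below is illustrative and not from the tutorial: the noisy 1-D data and the candidate classifier g(x) = sgn(x) are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D data: X uniform on [-1, 1], noisy labels Y in {-1, +1}.
n = 1000
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.where(X + 0.3 * rng.standard_normal(n) > 0, 1, -1)

def empirical_risk(g, X, Y):
    """R_n(g) = (1/n) * sum of indicators 1[g(X_i) != Y_i]."""
    return np.mean(g(X) != Y)

# A simple candidate classifier g(x) = sgn(x).
g = lambda x: np.where(x > 0, 1, -1)
print(empirical_risk(g, X, Y))
```

Because the labels are noisy, even this near-optimal candidate has a strictly positive empirical risk.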

2.1 Algorithms

Now that the goal is clearly specified, we review the common strategies to (approximately) achieve it. We denote by g_n the function returned by the algorithm. Because one cannot compute R(g) but only approximate it by R_n(g), it would be unreasonable to look for the function minimizing R_n(g) among all possible functions. Indeed, when the input space is infinite, one can always construct a function g_n which perfectly predicts the labels of the training data (i.e. g_n(X_i) = Y_i and R_n(g_n) = 0) but behaves on the other points as the opposite of the target function t, i.e. g_n(X) = -Y, so that R(g_n) = 1.[4] So one would have minimum empirical risk but maximum risk. It is thus necessary to prevent this overfitting situation. There are essentially two ways to do this (which can be combined): the first one is to restrict the class of functions in which the minimization is performed, and the second is to modify the criterion to be minimized (e.g. adding a penalty for complicated functions).

Empirical Risk Minimization. This algorithm is one of the most straightforward, yet it is usually efficient. The idea is to choose a model G of possible functions and to minimize the empirical risk in that model:

g_n = arg min_{g ∈ G} R_n(g).

Of course, this will work best when the target function belongs to G. However, it is rare to be able to make such an assumption, so one may want to enlarge the model as much as possible while preventing overfitting.

Structural Risk Minimization. The idea here is to choose an infinite sequence {G_d : d = 1, 2, ...} of models of increasing size and to minimize the empirical risk in each model with an added penalty for the size of the model:

g_n = arg min_{g ∈ G_d, d ∈ N} R_n(g) + pen(d, n).

The penalty pen(d, n) gives preference to models where the estimation error is small, and measures the size or capacity of the model.

Regularization. Another approach, usually easier to implement, consists in choosing a large model G (possibly dense in the continuous functions, for example) and defining on G a regularizer, typically a norm ‖g‖. One then has to minimize the regularized empirical risk:
g_n = arg min_{g ∈ G} R_n(g) + λ‖g‖².

[4] Strictly speaking, this is only possible if the probability distribution satisfies some mild conditions (e.g. has no atoms). Otherwise, it may not be possible to achieve R(g_n) = 1, but even in this case, provided the support of P contains infinitely many points, a similar phenomenon occurs.
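Structural Risk Minimization can be sketched on toy data. Everything below is an illustrative assumption, not from the tutorial: the nested models G_d are thresholds on a grid of 2^d points, and the penalty pen(d, n) = sqrt(d/n) is a hypothetical choice of complexity term.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(0.0, 1.0, n)
# Hypothetical noisy labels with a true decision boundary near 0.37.
Y = np.where(X + 0.1 * rng.standard_normal(n) > 0.37, 1, -1)

def risk_n(t, X, Y):
    """Empirical risk of the threshold classifier g_t(x) = sgn(x - t)."""
    return np.mean(np.where(X > t, 1, -1) != Y)

# SRM over nested models G_1 ⊂ G_2 ⊂ ...: thresholds on a grid of 2^d points,
# with the illustrative penalty pen(d, n) = sqrt(d / n).
def srm(X, Y, max_d=8):
    best_score, best_t = np.inf, None
    for d in range(1, max_d + 1):
        for t in np.linspace(0.0, 1.0, 2 ** d):
            score = risk_n(t, X, Y) + np.sqrt(d / len(X))
            if score < best_score:
                best_score, best_t = score, t
    return best_t

print(srm(X, Y))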

Compared to SRM, there is here a free parameter λ, called the regularization parameter, which allows one to choose the right trade-off between fit and complexity. Tuning λ is usually a hard problem, and most often one uses extra validation data for this task. Most existing (and successful) methods can be thought of as regularization methods.

Normalized Regularization. There are other possible approaches when the regularizer can, in some sense, be normalized, i.e. when it corresponds to some probability distribution over G. Given a probability distribution π defined on G (usually called a prior), one can use as a regularizer -log π(g).[5] Reciprocally, from a regularizer of the form ‖g‖², if there exists a measure µ on G such that ∫ e^{-λ‖g‖²} dµ(g) < ∞ for some λ > 0, then one can construct a prior corresponding to this regularizer. For example, if G is the set of hyperplanes in R^d going through the origin, G can be identified with R^d and, taking µ as the Lebesgue measure, it is possible to go from the Euclidean norm regularizer to a spherical Gaussian measure on R^d as a prior.[6]

This type of normalized regularizer, or prior, can be used to construct another probability distribution ρ on G (usually called a posterior), as

ρ(g) = e^{-γ R_n(g)} π(g) / Z(γ),

where γ ≥ 0 is a free parameter and Z(γ) is a normalization factor.

There are several ways in which this ρ can be used. If we take the function maximizing it, we recover regularization, as

arg max_{g ∈ G} ρ(g) = arg min_{g ∈ G} γ R_n(g) - log π(g),

where the regularizer is -γ^{-1} log π(g).[7]

Also, ρ can be used to randomize the predictions. In that case, before computing the predicted label for an input x, one samples a function g according to ρ and outputs g(x). This procedure is usually called Gibbs classification.

Another way in which the distribution ρ constructed above can be used is by taking the expected prediction of the functions in G:

g_n(x) = sgn(E_ρ[g(x)]).

[5] This is fine when G is countable. In the continuous case, one has to consider the density associated to π. We omit these details.

[6] Generalization to infinite dimensional Hilbert
spaces can also be done, but it requires more care. One can, for example, establish a correspondence between the norm of a reproducing kernel Hilbert space and a Gaussian process prior whose covariance function is the kernel of this space.

[7] Note that minimizing γ R_n(g) - log π(g) is equivalent to minimizing R_n(g) - γ^{-1} log π(g).
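Both uses of the posterior ρ defined above (randomized prediction and expected prediction) can be sketched for a finite class. All concrete choices here are hypothetical and for illustration only: a grid of threshold classifiers as G, a uniform prior π, a noiseless target sgn(x - 0.5), and γ scaled by n.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(0.0, 1.0, n)
Y = np.where(X > 0.5, 1, -1)   # hypothetical noiseless target sgn(x - 0.5)

# Finite class G: threshold classifiers g_t on a grid, with a uniform prior pi.
ts = np.linspace(0.0, 1.0, 51)
preds = np.where(X[None, :] > ts[:, None], 1, -1)   # |G| x n predictions
emp_risk = np.mean(preds != Y, axis=1)              # R_n(g) for each g in G

gamma = 5.0                                         # illustrative free parameter
rho = np.exp(-gamma * n * emp_risk)                 # posterior ~ e^{-gamma n R_n(g)} pi(g)
rho /= rho.sum()

# Gibbs classification: sample g ~ rho, then predict g(x).
g_idx = rng.choice(len(ts), p=rho)
x = 0.8
gibbs_pred = 1 if x > ts[g_idx] else -1

# Expected prediction: g_n(x) = sgn(E_rho[g(x)]).
avg_pred = np.sign(np.sum(rho * np.where(x > ts, 1, -1)))
print(gibbs_pred, avg_pred)
```

With this choice of γ, the posterior concentrates on thresholds with zero empirical risk, so both prediction rules agree with the target away from the boundary.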

This is typically called Bayesian averaging.

At this point we have to insist again on the fact that the choice of the class G and of the associated regularizer or prior has to come from a priori knowledge about the task at hand; there is no universally best choice.

2.2 Bounds

We have presented the framework of the theory and the type of algorithms that it studies; we now introduce the kind of results that it aims at. The overall goal is to characterize the risk that some algorithm may have in a given situation. More precisely, a learning algorithm takes as input the data (X_1, Y_1), ..., (X_n, Y_n) and produces a function g_n which depends on this data. We want to estimate the risk of g_n. However, R(g_n) is a random variable (since it depends on the data), and it cannot be computed from the data (since it also depends on the unknown P). Estimates of R(g_n) thus usually take the form of probabilistic bounds.

Notice that when the algorithm chooses its output from a model G, it is possible, by introducing the best function g* in G, with R(g*) = inf_{g ∈ G} R(g), to write

R(g_n) - R* = [R(g*) - R*] + [R(g_n) - R(g*)].

The first term on the right hand side is usually called the approximation error, and measures how well functions in G can approach the target (it would be zero if t ∈ G). The second term, called the estimation error, is a random quantity (it depends on the data) and measures how close g_n is to the best possible choice in G.

Estimating the approximation error is usually hard, since it requires knowledge about the target. Classically, in Statistical Learning Theory it is preferable to avoid making specific assumptions about the target (such as its belonging to some model); the assumptions are rather on the value of R*, or on the noise function s. It is also known that for any (consistent) algorithm, the rate of convergence to zero of the approximation error[8] can be arbitrarily slow if one does not make assumptions about the regularity of the target, while the rate of convergence of the estimation error can be computed without any such assumption. We will thus focus on
the estimation error.

Another possible decomposition of the risk is the following:

R(g_n) = R_n(g_n) + [R(g_n) - R_n(g_n)].

In this case, one estimates the risk by its empirical counterpart, plus some quantity which approximates (or upper bounds) R(g_n) - R_n(g_n).

To summarize, we write the three types of results we may be interested in.

[8] For this convergence to mean anything, one has to consider algorithms which choose functions from a class which grows with the sample size. This is the case, for example, of Structural Risk Minimization or Regularization based algorithms.

Error bound: R(g_n) ≤ R_n(g_n) + B(n, G). This corresponds to the estimation of the risk from an empirical quantity.

Error bound relative to the best in the class: R(g_n) ≤ R(g*) + B(n, G). This tells how close to optimal the algorithm is, given the model it uses.

Error bound relative to the Bayes risk: R(g_n) ≤ R* + B(n, G). This gives theoretical guarantees on the convergence to the Bayes risk.

3 Basic Bounds

In this section we show how to obtain simple error bounds (also called generalization bounds). The elementary material from probability theory that is needed here and in the later sections is summarized in Appendix A.

3.1 Relationship to Empirical Processes

Recall that we want to estimate the risk R(g_n) = E[1_{g_n(X) ≠ Y}] of the function g_n returned by the algorithm after seeing the data (X_1, Y_1), ..., (X_n, Y_n). This quantity cannot be observed (P is unknown) and is a random variable (since it depends on the data). Hence one way to make a statement about this quantity is to say how it relates to an estimate such as the empirical risk R_n(g_n). This relationship can take the form of upper and lower bounds for

P[R(g_n) - R_n(g_n) > ε].

For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G, define the loss class

F = {f : (x, y) → 1_{g(x) ≠ y} : g ∈ G}.   (1)

Notice that G contains functions with range in {-1, 1} while F contains nonnegative functions with range in {0, 1}. In the remainder of the tutorial, we will go back and forth between F and G (as there is a bijection between them), sometimes stating the results in terms of functions in F and sometimes in terms of functions in G. It will be clear from the context which classes G and F we refer to, and F will always be derived from the last mentioned class G in the way of (1).

We use the shorthand notation P f = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i). P_n is usually called the empirical measure associated to the training sample. With this notation, the quantity of interest (the difference between true and empirical risks) can be written as

P f_n - P_n f_n.   (2)

An empirical process is a collection of random variables indexed by a class of functions, and such that each random
variable is distributed as a sum of i.i.d. random variables (values taken by the function at the data points):

{P f - P_n f}_{f ∈ F}.

One of the most studied quantities associated with empirical processes is their supremum:

sup_{f ∈ F} (P f - P_n f).

It is clear that if we know an upper bound on this quantity, it will also be an upper bound on (2). This shows that the theory of empirical processes is a great source of tools and techniques for Statistical Learning Theory.

3.2 Hoeffding's Inequality

Let us rewrite again the quantity we are interested in as follows:

R(g) - R_n(g) = E[f(Z)] - (1/n) Σ_{i=1}^n f(Z_i).

It is easy to recognize here the difference between the expectation and the empirical average of the random variable f(Z). By the law of large numbers, we immediately obtain that

P[ lim_{n→∞} ( (1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)] ) = 0 ] = 1.

This indicates that, with enough samples, the empirical risk of a function is a good approximation to its true risk. It turns out that there exists a quantitative version of the law of large numbers when the variables are bounded.

Theorem 1 (Hoeffding). Let Z_1, ..., Z_n be i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have

P[ |(1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)]| > ε ] ≤ 2 exp(-2nε² / (b - a)²).

Let us rewrite the above formula to better understand its consequences. Denote the right hand side by δ. Then

P[ |P_n f - P f| > (b - a) √(log(2/δ) / (2n)) ] ≤ δ,

or (by inversion, see Appendix A), with probability at least 1 - δ,

|P_n f - P f| ≤ (b - a) √(log(2/δ) / (2n)).
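Hoeffding's inequality can be checked numerically. The sketch below uses hypothetical Bernoulli data (so f is the identity and [a, b] = [0, 1]) and compares the observed frequency of large deviations with the bound of Theorem 1; this is an illustration, not part of the tutorial.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps, trials = 100, 0.1, 20000

# Z_i ~ Bernoulli(0.3); f is the identity, so f(Z) is in [0, 1] (a = 0, b = 1).
Z = rng.random((trials, n)) < 0.3
dev = np.abs(Z.mean(axis=1) - 0.3)
freq = np.mean(dev > eps)

bound = 2 * np.exp(-2 * n * eps ** 2)   # Hoeffding: 2 exp(-2 n eps^2 / (b-a)^2)
print(freq, bound)
```

The observed frequency stays well below the bound, which is expected: Hoeffding only uses boundedness, not the actual variance of the variables.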

Applying this to f(Z) = 1_{g(X) ≠ Y}, we get that for any g, and any δ > 0, with probability at least 1 - δ,

R(g) ≤ R_n(g) + √(log(2/δ) / (2n)).   (3)

Notice that one has to consider a fixed function g, and the probability is with respect to the sampling of the data. If the function depends on the data, this does not apply!

3.3 Limitations

Although the above result seems very nice (since it applies to any class of bounded functions), it is actually severely limited. Indeed, what it essentially says is that for each (fixed) function f ∈ F, there is a set S of samples for which P f - P_n f ≤ √(log(2/δ) / (2n)) (and this set of samples has measure P[S] ≥ 1 - δ). However, these sets S may be different for different functions. In other words, for the observed sample, only some of the functions in F will satisfy this inequality.

Another way to explain the limitation of Hoeffding's inequality is the following. If we take for G the class of all {-1, 1}-valued (measurable) functions, then for any fixed sample, there exists a function f ∈ F such that P f - P_n f = 1. To see this, take the function g which satisfies g(X_i) = Y_i on the data and g(X) = -Y everywhere else. This does not contradict Hoeffding's inequality, but it shows that it does not yield what we need.

Figure 2 illustrates the above argument.

[Fig. 2: Convergence of the empirical risk to the true risk over the class of functions]

The horizontal axis corresponds

to the functions in the class. The two curves represent the true risk and the empirical risk (for some training sample) of these functions. The true risk is fixed, while for each different sample, the empirical risk will be a different curve. If we observe a fixed function g and take several different samples, the point on the empirical curve will fluctuate around the true risk, with fluctuations controlled by Hoeffding's inequality. However, for a fixed sample, if the class G is big enough, one can find, somewhere along the axis, a function for which the difference between the two curves will be very large.

3.4 Uniform Deviations

Before seeing the data, we do not know which function the algorithm will choose. The idea is to consider uniform deviations:

R(f_n) - R_n(f_n) ≤ sup_{f ∈ F} (R(f) - R_n(f)).   (4)

In other words, if we can upper bound the supremum on the right, we are done. For this, we need a bound which holds simultaneously for all functions in a class.

Let us explain how one can construct such uniform bounds. Consider two functions f_1, f_2 and define

C_i = {(x_1, y_1), ..., (x_n, y_n) : P f_i - P_n f_i > ε}.

This set contains all the 'bad' samples, i.e. those for which the bound fails. From Hoeffding's inequality, for each i,

P[C_i] ≤ δ.

We want to measure how many samples are 'bad' for i = 1 or i = 2. For this we use (see Appendix A)

P[C_1 ∪ C_2] ≤ P[C_1] + P[C_2] ≤ 2δ.

More generally, if we have N functions in our class, we can write

P[C_1 ∪ ... ∪ C_N] ≤ Σ_{i=1}^N P[C_i].

As a result we obtain

P[ ∃f ∈ {f_1, ..., f_N} : P f - P_n f > ε ] ≤ Σ_{i=1}^N P[P f_i - P_n f_i > ε] ≤ N exp(-2nε²).

Hence, for G = {g_1, ..., g_N}, for all δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + √((log N + log(1/δ)) / (2n)).

This is an error bound. Indeed, if we know that our algorithm picks functions from G, we can apply this result to g_n itself.

Notice that the main difference with Hoeffding's inequality is the extra log N term on the right hand side. This is the term which accounts for the fact that we want N bounds to hold simultaneously. Another interpretation of this term is as the number of bits one would require to specify one function in G. It turns out that this kind of coding interpretation of generalization bounds is often possible and can be used to obtain error estimates [16].

3.5 Estimation Error

Using the same idea as before, and with no additional effort, we can also get a bound on the estimation error. We start from the inequality

R(g_n) ≤ R_n(g_n) + sup_{g ∈ G} (R(g) - R_n(g)),

which we combine with (4) and with the fact that, since g_n minimizes the empirical risk in G,

R_n(g_n) - R_n(g*) ≤ 0.

Thus we obtain

R(g_n) = R(g_n) - R(g*) + R(g*)
       ≤ R(g_n) - R_n(g_n) + R_n(g*) - R(g*) + R(g*)
       ≤ 2 sup_{g ∈ G} |R(g) - R_n(g)| + R(g*).

We obtain that, with probability at least 1 - δ,

R(g_n) ≤ R(g*) + 2 √((log N + log(2/δ)) / (2n)).

We notice that in the right hand side both terms depend on the size of the class G. If this size increases, the first term will decrease, while the second will increase.

3.6 Summary and Perspective

At this point, we can summarize what we have exposed so far. Inference requires assumptions on the process generating the data (data sampled i.i.d. from an unknown P); generalization requires knowledge (e.g. restriction, structure, or prior).
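The finite-class bound can be checked by simulation over repeated training sets. The setup below is an illustrative assumption, not from the tutorial: a class of N = 50 threshold classifiers on [0, 1] with X uniform and Y = sgn(X - 0.5), for which the true risk of g_t is |t - 0.5| and can be compared exactly with the empirical risk.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, delta, trials = 200, 50, 0.05, 2000

ts = np.linspace(0.0, 1.0, N)      # class G of N threshold classifiers
true_risk = np.abs(ts - 0.5)       # for X ~ U[0,1], Y = sgn(X - 0.5): R(g_t) = |t - 0.5|

bound = np.sqrt((np.log(N) + np.log(1 / delta)) / (2 * n))

failures = 0
for _ in range(trials):
    X = rng.uniform(0.0, 1.0, n)
    Y = np.where(X > 0.5, 1, -1)
    preds = np.where(X[None, :] > ts[:, None], 1, -1)
    emp_risk = np.mean(preds != Y, axis=1)
    # Does the uniform deviation exceed the bound on this training set?
    if np.max(true_risk - emp_risk) > bound:
        failures += 1

print(failures / trials)
```

The fraction of training sets on which the uniform bound fails should stay below δ = 0.05 (in fact far below it, since the union bound treats the highly correlated thresholds as if they were independent).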

The error bounds are valid with respect to the repeated sampling of training sets:

- For a fixed function g, for most of the samples, R(g) - R_n(g) ≈ 1/√n.
- For most of the samples, if |G| = N, sup_{g ∈ G} (R(g) - R_n(g)) ≈ √(log N / n).

The extra variability comes from the fact that the chosen g_n changes with the data. So the result we have obtained so far is that, with high probability, for a finite class of size N,

sup_{g ∈ G} (R(g) - R_n(g)) ≤ √((log N + log(1/δ)) / (2n)).

There are several things that can be improved:

- Hoeffding's inequality only uses the boundedness of the functions, not their variance.
- The union bound is as bad as if all the functions in the class were independent (i.e. as if f_1(Z) and f_2(Z) were independent).
- The supremum over G of R(g) - R_n(g) is not necessarily what the algorithm would choose, so that upper bounding R(g_n) - R_n(g_n) by the supremum might be loose.

4 Infinite Case: Vapnik-Chervonenkis Theory

In this section we show how to extend the previous results to the case where the class G is infinite. This requires, in the non-countable case, the introduction of tools from Vapnik-Chervonenkis Theory.

4.1 Refined Union Bound and Countable Case

We first start with a simple refinement of the union bound that allows us to extend the previous results to the (countably) infinite case. Recall that, by Hoeffding's inequality, for each f ∈ F, for each δ > 0 (possibly depending on f, which we write δ(f)),

P[ P f - P_n f > √(log(1/δ(f)) / (2n)) ] ≤ δ(f).

Hence, if we have a countable set F, the union bound immediately yields

P[ ∃f ∈ F : P f - P_n f > √(log(1/δ(f)) / (2n)) ] ≤ Σ_{f ∈ F} δ(f).

Choosing δ(f) = δ p(f) with Σ_{f ∈ F} p(f) = 1, this makes the right-hand side equal to δ, and we get the following result: with probability at least 1 - δ,

∀f ∈ F, P f ≤ P_n f + √((log(1/p(f)) + log(1/δ)) / (2n)).

We notice that if F is finite (with size N), taking a uniform p gives the log N as before.

Using this approach, it is possible to put knowledge about the algorithm into p(f), but p should be chosen before seeing the data, so it is not possible to 'cheat' by setting all the weight to the function returned by the algorithm after seeing the data (which would give the smallest possible bound). But, in general, if p is well chosen, the bound will have a small value. Hence, the bound can be improved if one knows ahead of time the functions that the algorithm is likely to pick (i.e. knowledge improves the bound).

4.2 General Case

When the set G is uncountable, the previous approach does not directly work. The general idea is to look at the function class 'projected' on the sample. More precisely, given a sample z_1, ..., z_n, we consider

F_{z_1,...,z_n} = {(f(z_1), ..., f(z_n)) : f ∈ F}.

The size of this set is the number of possible ways in which the data (z_1, ..., z_n) can be classified. Since the functions f can only take two values, this set will always be finite, no matter how big F is.

Definition 1 (Growth function). The growth function is the maximum number of ways into which n points can be classified by the function class:

S_F(n) = sup_{(z_1,...,z_n)} |F_{z_1,...,z_n}|.

We have defined the growth function in terms of the loss class F, but we can do the same with the initial class G, and notice that S_F(n) = S_G(n). It turns out that this growth function can be used as a measure of the 'size' of a class of functions, as demonstrated by the following result.

Theorem 2 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √(2 (log S_G(2n) + log(2/δ)) / n).

Notice that, in the finite case where |G| = N, we have S_G(n) ≤ N, so that this bound is always better than the one we had before (except for the constants). But the problem now becomes one of computing S_G(n).

4.3 VC Dimension

Since g ∈ {-1, 1}, it is clear that S_G(n) ≤ 2^n. If S_G(n) = 2^n, there is a set of size n such that the class of functions can generate any classification on these points (we say that G shatters the set).

Definition 2 (VC dimension). The VC dimension of a class G is the largest n such that S_G(n) = 2^n.

In other words, the VC dimension of a class G is the size of the largest set that it can shatter. In order to illustrate this definition, we give some examples. The first one is the set of half-planes in R^d (see Figure 3). In this case, as depicted for the case d = 2, one can shatter a set of d + 1 points but no set of d + 2 points, which means that the VC dimension is d + 1.

[Fig. 3: Computing the VC dimension of hyperplanes in dimension 2: a set of 3 points can be shattered, but no set of four points.]

It is interesting to notice that the number of parameters needed to define half-spaces in R^d is d, so that a natural question is whether the VC dimension is related to the number of parameters of the function class. The next example, depicted in Figure 4, is a family of functions with one parameter only, {sgn(sin(tx)) : t ∈ R}, which actually has infinite VC dimension (this is an exercise left to the reader).
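Shattering and the growth function can be made concrete with an even simpler one-parameter class (an illustrative example, not one worked out in the tutorial text): half-lines g_t(x) = sgn(x - t) on the real line. The sketch below enumerates the labelings this class realizes on three points, showing S_G(3) = 4 < 2³, so no set of three points is shattered; one point is, and the VC dimension is 1.

```python
# Class G of half-lines on R: g_t(x) = sgn(x - t).
def achievable_labelings(points):
    """Enumerate all labelings of `points` realizable by some threshold t."""
    pts = sorted(points)
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    # One threshold below all points, one between each pair, one above all.
    thresholds = [pts[0] - 1.0] + mids + [pts[-1] + 1.0]
    return {tuple(1 if x > t else -1 for x in pts) for t in thresholds}

labelings = achievable_labelings([0.1, 0.4, 0.9])
print(len(labelings), 2 ** 3)   # 4 of the 8 possible labelings: 3 points are not shattered
```

Mixed labelings such as (+1, -1, +1) are never produced, which is exactly why the growth function (here n + 1) stops being 2^n beyond the VC dimension.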

[Fig. 4: VC dimension of sinusoids.]

It remains to show how the notion of VC dimension can bring a solution to the problem of computing the growth function. Indeed, at first glance, if we know that a class has VC dimension h, it entails that for all n ≤ h, S_G(n) = 2^n, and S_G(n) < 2^n otherwise. This seems of little use, but actually an intriguing phenomenon occurs for n ≥ h, as depicted in Figure 5.

[Fig. 5: Typical behavior of the log growth function.]

The growth function, which is exponential (its logarithm is linear) up until the VC dimension, becomes polynomial afterwards. This behavior is captured in the following lemma.

Lemma 1 (Vapnik and Chervonenkis, Sauer, Shelah). Let G be a class of functions with finite VC dimension h. Then for all n,

S_G(n) ≤ Σ_{i=0}^{h} (n choose i),

and for all n ≥ h,

S_G(n) ≤ (en/h)^h.

Using this lemma along with Theorem 2, we immediately obtain that, if G has VC dimension h, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √(2 (h log(2en/h) + log(2/δ)) / n).

What is important to recall from this result is that the difference between the true and empirical risk is at most of order √(h log n / n).

An interpretation of VC dimension and growth functions is that they measure the 'effective' size of the class, that is, the size of the projection of the class onto finite samples. In addition, this measure does not just count the number of functions in the class, but depends on the geometry of the class (or rather of its projections). Finally, the finiteness of the VC dimension ensures that the empirical risk will converge uniformly over the class to the true risk.

4.4 Symmetrization

We now indicate how to prove Theorem 2. The key ingredient of the proof is the so-called symmetrization lemma. The idea is to replace the true risk by an estimate computed on an independent set of data. This is of course a mathematical technique and does not mean one needs to have more data to be able to apply the result. The extra data set is usually called a 'virtual' or 'ghost' sample. We will denote by Z'_1, ..., Z'_n an independent (ghost) sample and by P'_n the corresponding empirical measure.

Lemma 2 (Symmetrization). For any t > 0 such that nt² ≥ 2,

P[ sup_{f ∈ F} (P - P_n)f ≥ t ] ≤ 2 P[ sup_{f ∈ F} (P'_n - P_n)f ≥ t/2 ].

Proof. Let f_n be the function achieving the supremum (note that it depends on Z_1, ..., Z_n). One has (with ∧ denoting the conjunction of two events)

1_{(P - P_n)f_n > t} 1_{(P - P'_n)f_n < t/2} = 1_{(P - P_n)f_n > t ∧ (P - P'_n)f_n < t/2} ≤ 1_{(P'_n - P_n)f_n > t/2}.

Taking expectations with respect to the second sample gives

1_{(P - P_n)f_n > t} P'[ (P - P'_n)f_n < t/2 ] ≤ P'[ (P'_n - P_n)f_n > t/2 ].

By Chebyshev's inequality (see Appendix A),

P'[ (P - P'_n)f ≥ t/2 ] ≤ 4 Var f / (n t²) ≤ 1 / (n t²).

Indeed, a random variable with range in [0, 1] has variance less than 1/4. Hence

1_{(P - P_n)f_n > t} (1 - 1/(nt²)) ≤ P'[ (P'_n - P_n)f_n > t/2 ].

Taking expectation with respect to the first sample gives the result.

This lemma allows us to replace the expectation P f by an empirical average over the ghost sample. As a result, the right hand side only depends on the projection of the class F on the double sample,

F_{Z_1,...,Z_n,Z'_1,...,Z'_n},

which contains finitely many different vectors. One can thus use the simple union bound that was presented before in the finite case. The other ingredient that is needed to obtain Theorem 2 is again Hoeffding's inequality, in the following form:

P[ P_n f - P'_n f > t ] ≤ e^{-nt²/2}.

We now just have to put the pieces together:

P[ sup_{f ∈ F} (P - P_n)f ≥ t ]
  ≤ 2 P[ sup_{f ∈ F} (P'_n - P_n)f ≥ t/2 ]
  ≤ 2 S_F(2n) P[ (P'_n - P_n)f ≥ t/2 ]
  ≤ 4 S_F(2n) e^{-nt²/8}.

Using inversion finishes the proof of Theorem 2.

4.5 VC Entropy

One important aspect of the VC dimension is that it is distribution independent. Hence, it allows one to get bounds that do not depend on the problem at hand: the same bound holds for any distribution. Although this may be seen as an advantage, it can also be a drawback since, as a result, the bound may be loose for most distributions. We now show how to modify the proof above to get a distribution-dependent result. We use the following notation: N(F, z_1, ..., z_n) := |F_{z_1,...,z_n}|.

Definition 3 (VC entropy). The (annealed) VC entropy is defined as

H_F(n) = log E[ N(F, Z_1, ..., Z_n) ].
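For simple classes, the expectation inside the annealed VC entropy can be estimated by Monte Carlo. The sketch below is illustrative and not from the tutorial: it uses threshold classifiers on uniform data (a fine grid of thresholds stands in for the full class) and averages the number of distinct projections over simulated samples; for this class one expects about n + 1 projections per sample.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 50, 200
ts = np.linspace(0.0, 1.0, 5001)   # fine grid standing in for all thresholds

counts = []
for _ in range(trials):
    Z = rng.uniform(0.0, 1.0, n)
    # Projection of the class on the sample: distinct {-1,+1}^n vectors.
    proj = np.unique(np.where(Z[None, :] > ts[:, None], 1, -1), axis=0)
    counts.append(len(proj))

H = np.log(np.mean(counts))        # Monte Carlo estimate of H(n) = log E[N(F, Z_1..Z_n)]
print(H, np.log(n + 1))
```

The estimate is close to log(n + 1), far below the worst-case value n log 2, which is the point of distribution-dependent capacity measures.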

Theorem 3. For any δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √(2 (H_G(2n) + log(2/δ)) / n).

Proof. We again begin with the symmetrization lemma, so that we have to upper bound the quantity

I = P[ sup_{f ∈ F} (P'_n - P_n)f ≥ t/2 ].

Let σ_1, ..., σ_n be independent random variables such that P(σ_i = 1) = P(σ_i = -1) = 1/2 (they are called Rademacher variables). We notice that the quantities (P'_n - P_n)f and (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) have the same distribution, since changing one σ_i corresponds to exchanging Z_i and Z'_i. Hence we have

I ≤ E[ P_σ[ sup_{f ∈ F_{Z_1,...,Z_n,Z'_1,...,Z'_n}} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ] ],

and the union bound leads to

I ≤ E[ N(F, Z_1, ..., Z_n, Z'_1, ..., Z'_n) max_f P_σ[ (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ] ].

Since σ_i (f(Z'_i) - f(Z_i)) ∈ [-1, 1], Hoeffding's inequality finally gives

I ≤ E[ N(F, Z_1, ..., Z_n, Z'_1, ..., Z'_n) ] e^{-nt²/8}.

The rest of the proof is as before.

5 Capacity Measures

We have seen so far three measures of capacity or 'size' of classes of functions: the VC dimension and the growth function, both distribution independent, and the VC entropy, which depends on the distribution. Apart from the VC dimension, they are usually hard or impossible to compute. There are, however, other measures which not only may give sharper estimates, but also have properties that make their computation possible from the data only.

5.1 Covering Numbers

We start by endowing the function class F with the following (random) metric:

d_n(f, f') = (1/n) |{f(Z_i) ≠ f'(Z_i) : i = 1, ..., n}|.

This is the normalized Hamming distance of the projections on the sample. Given such a metric, we say that a set f_1,…,f_N covers F at radius ε if

F ⊂ ∪_{i=1}^N B(f_i, ε).

We then define the covering numbers of F as follows.

Definition 4 (Covering number). The covering number of F at radius ε, with respect to d_n, denoted by N(F, ε, n), is the minimum size of a cover of radius ε.

Notice that it does not matter whether we apply this definition to the original class G or the loss class F, since N(F, ε, n) = N(G, ε, n). The covering numbers characterize the size of a function class as measured by the metric d_n. The rate of growth of the logarithm of N(G, ε, n), usually called the metric entropy, is related to the classical concept of vector dimension. Indeed, if G is a compact set in a d-dimensional Euclidean space, N(G, ε, n) ≈ ε^{−d}.

When the covering numbers are finite, it is possible to approximate the class G by a finite set of functions (which cover G). This again allows to use the finite union bound, provided we can relate the behavior of all functions in G to that of the functions in the cover. A typical result, which we give without proof, is the following.

Theorem 4. For any t > 0,

P[∃g ∈ G : R(g) > R_n(g) + t] ≤ 8 E[N(G, t, n)] e^{−nt²/128}.

Covering numbers can also be defined for classes of real-valued functions.

We now relate the covering numbers to the VC dimension. Notice that, because the functions in G can only take two values, for all ε > 0, N(G, ε, n) ≤ |G_{Z_1^n}| = N(G, Z_1^n). Hence the VC entropy corresponds to log covering numbers at minimal scale, which implies log N(G, ε, n) ≤ h log(en/h), but one can have a considerably better result.

Lemma 3 (Haussler). Let G be a class of VC dimension h. Then, for all ε > 0, all n, and any sample,

N(G, ε, n) ≤ C h (4e)^h ε^{−h}.

The interest of this result is that the upper bound does not depend on the sample size n. The covering number bound is a generalization of the VC entropy bound where the scale is adapted to the error. It turns out that this result can be improved by considering all scales (see Section 5.2).

5.2 Rademacher Averages

Recall that we used in the proof of Theorem 3 Rademacher random variables, i.e. independent {−1, 1}-valued random variables with probability 1/2 of taking either value.
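To make the covering-number definition concrete, here is a small illustrative sketch (not from the text; the class of threshold classifiers, the sample and the radius are toy choices): it enumerates the projections of the class on a sample and builds a cover at radius ε under the empirical Hamming metric d_n with a simple greedy procedure.

```python
# Greedy covering of a function class under the empirical Hamming metric
# d_n(f, f') = (1/n) * #{i : f(Z_i) != f'(Z_i)}.
import random

random.seed(1)
n = 20
sample = sorted(random.random() for _ in range(n))

# Threshold classifiers g_theta(x) = 1 if x >= theta; their projections
# on the sample form at most n + 1 distinct {0,1}-vectors.
projections = {tuple(1 if x >= theta else 0 for x in sample)
               for theta in [0.0] + [x + 1e-9 for x in sample]}

def d_n(u, v):
    return sum(a != b for a, b in zip(u, v)) / len(u)

def greedy_cover(points, eps):
    # Every point not within eps of an existing center becomes a center,
    # so the returned set covers all points at radius eps.
    centers = []
    for p in points:
        if all(d_n(p, c) > eps for c in centers):
            centers.append(p)
    return centers

eps = 0.25
cover = greedy_cover(sorted(projections), eps)
# |cover| is an upper bound on N(F, eps, n) for this sample.
assert all(any(d_n(p, c) <= eps for c in cover) for p in projections)
print(len(projections), len(cover))
```

The greedy cover is not minimal in general, but it already shows the point of Definition 4: at radius ε the class is summarized by far fewer representatives than it has projections.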

For convenience we introduce the following notation (signed empirical measure): R_n f = (1/n) Σ_{i=1}^n σ_i f(Z_i). We will denote by E_σ the expectation taken with respect to the Rademacher variables only (i.e. conditionally on the data), while E will denote the expectation with respect to all the random variables (i.e. the data, the ghost sample and the Rademacher variables).

Definition 5 (Rademacher averages). For a class F of functions, the Rademacher average is defined as

R(F) = E[sup_{f∈F} R_n f],

and the conditional Rademacher average is defined as

R_n(F) = E_σ[sup_{f∈F} R_n f].

We now state the fundamental result involving Rademacher averages.

Theorem 5. For all δ > 0, with probability at least 1 − δ,

∀f ∈ F, Pf ≤ P_n f + 2R(F) + √( log(1/δ) / (2n) ),

and also, with probability at least 1 − δ,

∀f ∈ F, Pf ≤ P_n f + 2R_n(F) + √( 2 log(2/δ) / n ).

It is remarkable that one can obtain a bound (the second part of the theorem) which depends solely on the data.

The proof of the above result requires a powerful tool called a concentration inequality for empirical processes. Actually, Hoeffding's inequality is a (simple) concentration inequality, in the sense that when n increases, the empirical average is concentrated around the expectation. It is possible to generalize this result to functions that depend on i.i.d. random variables, as shown in the theorem below.

Theorem 6 (McDiarmid [17]). Assume that for all i = 1,…,n,

sup_{z_1,…,z_n, z'_i} |F(z_1,…,z_i,…,z_n) − F(z_1,…,z'_i,…,z_n)| ≤ c.

Then for all ε > 0,

P[|F − E[F]| > ε] ≤ 2 exp( −2ε² / (nc²) ).

The meaning of this result is thus that, as soon as one has a function of n independent random variables whose variation is bounded when one variable is modified, the function satisfies a Hoeffding-like inequality.

Proof of Theorem 5. The proof follows three steps:

1. Use concentration to relate sup_{f∈F} (Pf − P_n f) to its expectation,
2. use symmetrization to relate this expectation to the Rademacher average,
3. use concentration again to relate the Rademacher average to the conditional one.

We first show that McDiarmid's inequality can be applied to sup_{f∈F} (Pf − P_n f). We denote temporarily by P_n^i the empirical measure obtained by modifying one element of the sample (e.g. Z_i is replaced by Z'_i). It is easy to check that the following holds:

sup_f (Pf − P_n f) − sup_f (Pf − P_n^i f) ≤ sup_f (P_n^i f − P_n f).

Since f ∈ {0, 1}, we obtain

P_n^i f − P_n f = (f(Z'_i) − f(Z_i))/n ≤ 1/n,

and thus McDiarmid's inequality can be applied with c = 1/n. This concludes the first step of the proof.

We next prove the (first part of the) following symmetrization lemma.

Lemma 4. For any class F,

E[sup_{f∈F} (Pf − P_n f)] ≤ 2 E[sup_{f∈F} R_n f],

and a similar inequality holds in the reverse direction, up to a factor 1/2.

Proof. We only prove the first part. We introduce a ghost sample and its corresponding empirical measure P'_n. We successively use the fact that E'[P'_n f] = Pf, and the fact that the supremum is a convex function (hence we can apply Jensen's inequality, see Appendix A):

E[sup_f (Pf − P_n f)] = E[sup_f (E'[P'_n f] − P_n f)]
  ≤ E[sup_f (P'_n f − P_n f)]
  = E[ E_σ[ sup_f (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) ] ]
  ≤ E[ E_σ[ sup_f (1/n) Σ_{i=1}^n σ_i f(Z'_i) ] ] + E[ E_σ[ sup_f (1/n) Σ_{i=1}^n (−σ_i) f(Z_i) ] ]
  = 2 E[sup_f R_n f],

where the third step uses the fact that f(Z'_i) − f(Z_i) and σ_i(f(Z'_i) − f(Z_i)) have the same distribution, and the last step uses the fact that σ_i f(Z'_i) and −σ_i f(Z_i) both have the same distribution as σ_i f(Z_i). □

The above already establishes the first part of Theorem 5. For the second part, we need to use concentration again. For this we apply McDiarmid's inequality to the functional

F(Z_1,…,Z_n) = R_n(F).

It is easy to check that F satisfies McDiarmid's assumption with c = 1/n. As a result, E[F] = R(F) can be sharply estimated by F = R_n(F). □

Loss Class and Initial Class. In order to make use of Theorem 5, we have to relate the Rademacher average of the loss class to that of the initial class. This can be done with the following derivation, where one uses the fact that σ_i and −σ_i Y_i have the same distribution:

R(F) = E[sup_{g∈G} (1/n) Σ_{i=1}^n σ_i 1_{g(X_i)≠Y_i}]
     = E[sup_{g∈G} (1/n) Σ_{i=1}^n σ_i (1 − Y_i g(X_i))/2]
     = (1/2) E[sup_{g∈G} (1/n) Σ_{i=1}^n (−σ_i Y_i) g(X_i)]
     = (1/2) R(G).

Notice that the same is valid for conditional Rademacher averages, so that we obtain that, with probability at least 1 − δ,

∀g ∈ G, R(g) ≤ R_n(g) + R_n(G) + √( 2 log(2/δ) / n ).

Computing the Rademacher Averages. We now assess the difficulty of actually computing the Rademacher averages. We write the following:

(1/2) E_σ[sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(X_i)]
  = (1/2) E_σ[sup_{g∈G} (1 − (2/n) Σ_{i=1}^n 1_{g(X_i)≠σ_i})]
  = 1/2 − E_σ[inf_{g∈G} (1/n) Σ_{i=1}^n 1_{g(X_i)≠σ_i}]
  = 1/2 − E_σ[inf_{g∈G} R_n(g, σ)].

This indicates that, given a sample and a realization of the random variables σ_1,…,σ_n, computing R_n(G) is not harder than computing the empirical risk minimizer in G. Indeed, the procedure would be to generate the σ_i randomly and minimize the empirical error in G with respect to the labels σ_i.

An advantage of rewriting R_n(G) as above is that it gives an intuition of what it actually measures: it measures how much the class G can fit random noise. If the class G is very large, there will always be a function which can perfectly fit the σ_i, and then R_n(G) = 1/2, so that there is no hope of uniform convergence to zero of the difference between true and empirical risks.

For a finite set with |G| = N, one can show that

R_n(G) ≤ √( 2 log N / n ),

where we again see the logarithmic factor log N. A consequence of this is that, by considering the projection on the sample of a class G with VC dimension h, and using Lemma 1, we have

R(G) ≤ 2 √( h log(en/h) / n ).

This result, along with Theorem 5, allows to recover the Vapnik–Chervonenkis bound with a concentration-based proof.

Although the benefit of using concentration may not be entirely clear at that point, let us just mention that one can actually improve the dependence on n of the above bound. This is based on the so-called chaining technique. The idea is to use covering numbers at all scales in order to capture the geometry of the class in a better way than the VC entropy does. One has the following result, called Dudley's entropy bound:

R_n(F) ≤ (C/√n) ∫_0^∞ √( log N(F, t, n) ) dt.

As a consequence, along with Haussler's upper bound, we can get the following result:

R_n(F) ≤ C √( h/n ).

We can thus, with this approach, remove the unnecessary log n factor of the VC bound.

6 Advanced Topics

In this section, we point out several ways in which the results presented so far can be improved. The main source of improvement actually comes, as mentioned earlier, from the fact that the Hoeffding and McDiarmid inequalities do not make use of the variance of the functions.
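The procedure just described can be sketched numerically (a toy illustration, not from the tutorial; the class of threshold classifiers on the line and all sizes are arbitrary choices): we estimate R_n(G) = 1/2 − E_σ[inf_{g∈G} R_n(g, σ)] by repeatedly drawing random signs and minimizing the empirical error over the class.

```python
import random

random.seed(2)
n, trials = 50, 200
xs = sorted(random.random() for _ in range(n))

def best_fit_error(sigma):
    # ERM step: empirical risk of the best threshold classifier (in either
    # orientation) against the random labels sigma.
    best = 1.0
    for theta in [0.0] + [x + 1e-9 for x in xs]:
        pred = [1 if x >= theta else -1 for x in xs]
        err = sum(p != s for p, s in zip(pred, sigma)) / n
        best = min(best, err, 1 - err)  # 1 - err: flipped orientation
    return best

# Monte Carlo estimate of R_n(G) = 1/2 - E_sigma[ inf_g R_n(g, sigma) ].
avg_inf = sum(best_fit_error([random.choice((-1, 1)) for _ in range(n)])
              for _ in range(trials)) / trials
rademacher_estimate = 0.5 - avg_inf
print(rademacher_estimate)
assert 0 < rademacher_estimate <= 0.5
```

Because a small class of thresholds cannot fit random signs well, the estimate stays far from the degenerate value 1/2; a class rich enough to fit every sign pattern would drive it to exactly 1/2, as the text explains.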

6.1 Binomial Tails

We recall that the functions we consider are binary valued. So, if we consider a fixed function f, the distribution of n P_n f is actually a binomial law of parameters Pf and n (since we are summing n i.i.d. random variables f(Z_i) which can be either 0 or 1 and are equal to 1 with probability E[f(Z_i)] = Pf). Denoting p = Pf, we can have an exact expression for the deviations of P_n f from Pf:

P[Pf − P_n f ≥ t] = Σ_{k=0}^{⌊n(p−t)⌋} C(n, k) p^k (1−p)^{n−k}.

Since this expression is not easy to manipulate, we have used an upper bound provided by Hoeffding's inequality. However, there exist other (sharper) upper bounds. The following quantities are all upper bounds on P[Pf − P_n f ≥ t]:

  ((1−p)/(1−p−t))^{n(1−p−t)} (p/(p+t))^{n(p+t)}   (exponential)
  e^{−np((1−t/p) log(1−t/p) + t/p)}                 (Bennett)
  e^{−nt² / (2p(1−p) + 2t/3)}                       (Bernstein)
  e^{−2nt²}                                         (Hoeffding)

Examining the above bounds (and using inversion), we can say that, roughly speaking, the small deviations of Pf − P_n f have a Gaussian behavior of the form exp(−nt²/(2p(1−p))) (i.e. Gaussian with variance p(1−p)), while the large deviations have a Poisson behavior of the form exp(−3nt/2). So the tails are heavier than Gaussian, and Hoeffding's inequality consists in upper bounding the tails with a Gaussian with maximum variance, hence the term exp(−2nt²).

Each function f ∈ F has a different variance Pf(1 − Pf) ≤ Pf. Moreover, for each f ∈ F, by Bernstein's inequality, with probability at least 1 − δ,

Pf ≤ P_n f + √( 2 Pf log(1/δ) / n ) + 2 log(1/δ) / (3n).

The Gaussian part (second term on the right-hand side) dominates (for Pf not too small, or n large enough), and it depends on Pf. We thus want to combine Bernstein's inequality with the union bound and symmetrization.

6.2 Normalization

The idea is to consider the ratio

(Pf − P_n f) / √(Pf).

Here (f ∈ {0, 1}), Var f ≤ P f² = Pf.
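For a concrete comparison (the numbers n, p, t below are arbitrary choices, not from the text), the following sketch computes the exact binomial tail P[Pf − P_n f ≥ t] and checks it against the Bernstein and Hoeffding bounds quoted above; for small p, the variance-sensitive Bernstein bound is visibly tighter.

```python
import math

n, p, t = 200, 0.1, 0.05

# Exact tail: P[P_n f <= p - t], where n * P_n f ~ Binomial(n, p).
k_max = math.floor(n * (p - t))
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(k_max + 1))

# Two of the upper bounds listed in the text.
bernstein = math.exp(-n * t**2 / (2 * p * (1 - p) + 2 * t / 3))
hoeffding = math.exp(-2 * n * t**2)

print(exact, bernstein, hoeffding)
# Hoeffding uses the worst-case variance 1/4; Bernstein uses p(1-p) = 0.09
# and is therefore much tighter at this p.
assert exact <= bernstein <= hoeffding
```

Changing p toward 1/2 shrinks the gap between the two bounds, which matches the discussion above: Hoeffding is exactly the Gaussian bound with maximal variance.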

The reason for considering this ratio is that, after normalization, the fluctuations are more uniform over the class F. Hence the supremum in

sup_{f∈F} (Pf − P_n f) / √(Pf)

is not necessarily attained at functions with large variance, as was the case previously. Moreover, we know that our goal is to find functions with small error Pf (hence small variance). The normalized supremum takes this into account.

We now state a result similar to Theorem 2 for the normalized supremum.

Theorem 7 (Vapnik–Chervonenkis [18]). For δ > 0, with probability at least 1 − δ,

∀f ∈ F, (Pf − P_n f) / √(Pf) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n ),

and also, with probability at least 1 − δ,

∀f ∈ F, (P_n f − Pf) / √(P_n f) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n ).

Proof. We only give a sketch of the proof. The first step is a variation of the symmetrization lemma:

P[ sup_f (Pf − P_n f)/√(Pf) ≥ t ] ≤ 2 P[ sup_f (P'_n f − P_n f)/√((P_n f + P'_n f)/2) ≥ t ].

The second step consists in randomization (with Rademacher variables):

… = 2 E[ P_σ[ sup_f (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) / √((P_n f + P'_n f)/2) ≥ t ] ].

Finally, one uses a tail bound of Bernstein type. □

Let us explore the consequences of this result. From the fact that, for non-negative numbers A, B, C,

A ≤ B + C √A  ⟹  A ≤ B + C² + C √B,

we easily get, for example,

∀f ∈ F, Pf ≤ P_n f + 2 √( P_n f (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n.
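The elementary implication used above can be checked numerically (a throwaway sketch, not part of the tutorial): for random non-negative B and C we take the largest A satisfying A ≤ B + C√A (the fixed point of A = B + C√A) and verify the claimed consequence.

```python
import math
import random

random.seed(3)
for _ in range(1000):
    B = random.uniform(0, 10)
    C = random.uniform(0, 10)
    # Largest A with A <= B + C*sqrt(A): solve the quadratic in sqrt(A).
    sqrt_a = (C + math.sqrt(C * C + 4 * B)) / 2
    A = sqrt_a * sqrt_a
    # Claimed consequence: A <= B + C^2 + C*sqrt(B).
    assert A <= B + C * C + C * math.sqrt(B) + 1e-9
print("ok")
```

Algebraically, the check reduces to √(C² + 4B) ≤ C + 2√B, which holds for all non-negative B, C after squaring.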

In the ideal situation where there is no noise (i.e. Y = t(X) almost surely) and t ∈ G, denoting by g_n the empirical risk minimizer, we have R* = 0 and also R_n(g_n) = 0. In particular, when G is a class of VC dimension h, we obtain

R(g_n) = O( h log n / n ).

So, in a way, Theorem 7 allows to interpolate between the best case, where the rate of convergence is O(h log n / n), and the worst case, where the rate is O(√(h log n / n)) (it does not allow to remove the log factor in this case).

It is also possible to derive from Theorem 7 relative error bounds for the minimizer of the empirical error. With probability at least 1 − δ,

R(g_n) ≤ R(g*) + 2 √( R(g*) (log S_G(2n) + log(4/δ)) / n ) + 4 (log S_G(2n) + log(4/δ)) / n.

We notice here that when R(g*) = 0 (i.e. t ∈ G and R* = 0), the rate is again of order 1/n, while, as soon as R(g*) > 0, the rate is of order 1/√n. Therefore, it is not possible to obtain a rate with a power of n in between 1/2 and 1. The main reason is that the factor R(g*) of the square-root term is not the right quantity to use here, since it does not vary with n. We will see later that one can instead have R(g_n) − R(g*) as a factor, which usually converges to zero as n increases. Unfortunately, Theorem 7 cannot be applied to functions of the type f − f* (which would be needed to have the mentioned factor), so we will need a refined approach.

6.3 Noise Conditions

The refinement we seek to obtain requires certain specific assumptions about the noise function s(x), the ideal case being when s(x) = 0 everywhere (which corresponds to R* = 0 and Y = t(X)). We now introduce quantities that measure how well-behaved the noise function is.

The situation is favorable when the regression function η(x) is not too close to 0, or at least not too often close to 0. Indeed, η(x) = 0 means that the noise is maximal at x (s(x) = 1/2) and that the label is completely undetermined (any prediction would yield an error with probability 1/2).

Definitions. There are two types of conditions.

Definition 6 (Massart's noise condition). For some c > 0, assume |η(X)| > 1/c almost surely.

This condition implies that there is no region where the decision is completely random; the noise is bounded away from 1/2.

Definition 7 (Tsybakov's noise condition). Let α ∈ [0, 1]. Assume that one of the following equivalent conditions is satisfied:

(i) ∃c > 0, ∀g ∈ {−1, 1}^X, P[g(X)η(X) ≤ 0] ≤ c (R(g) − R*)^α,
(ii) ∃c > 0, ∀A ⊂ X, ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α,
(iii) ∃B > 0, ∀t ≥ 0, P[|η(X)| ≤ t] ≤ B t^{α/(1−α)}.

Condition (iii) is probably the easiest to interpret: it means that η(x) is close to the critical value 0 with low probability.

We indicate how to prove that conditions (i), (ii) and (iii) are indeed equivalent:

(i) ⇔ (ii) It is easy to check that R(g) − R* = E[|η(X)| 1_{g(X)η(X) ≤ 0}]. For each function g, there exists a set A such that 1_A = 1_{gη ≤ 0}, and conversely.

(ii) ⇒ (iii) Let A = {x : |η(x)| ≤ t}. Then

P[|η| ≤ t] = ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α ≤ c t^α ( ∫_A dP(x) )^α,

and solving for ∫_A dP(x) gives P[|η| ≤ t] ≤ c^{1/(1−α)} t^{α/(1−α)}.

(iii) ⇒ (i) We write

R(g) − R* = E[|η| 1_{gη ≤ 0}] ≥ t E[1_{|η| ≥ t} 1_{gη ≤ 0}]
  ≥ t (P[gη ≤ 0] − P[|η| < t])
  ≥ t (P[gη ≤ 0] − B t^{α/(1−α)}).

Optimizing over t (taking t proportional to P[gη ≤ 0]^{(1−α)/α}) finally gives

P[gη ≤ 0] ≤ c (R(g) − R*)^α,

with a constant c depending only on B and α.

We notice that the parameter α has to be in [0, 1]. Indeed, one has the opposite inequality

R(g) − R* = E[|η(X)| 1_{gη ≤ 0}] ≤ P[g(X)η(X) ≤ 0],

which is incompatible with condition (i) if α > 1. We also notice that when α = 0, Tsybakov's condition is void, and when α = 1, it is equivalent to Massart's condition.

Consequences. The conditions we impose on the noise yield a crucial relationship between the variance and the expectation of functions in the so-called relative loss class, defined as

F̃ = { (x, y) ↦ f(x, y) − 1_{t(x)≠y} : f ∈ F }.

This relationship will allow to exploit Bernstein-type inequalities applied to this latter class.

Under Massart's condition, one has (written in terms of the initial class), for g ∈ G,

E[ (1_{g(X)≠Y} − 1_{t(X)≠Y})² ] ≤ c (R(g) − R*),

or, equivalently, for f ∈ F̃, Var f ≤ P f² ≤ c Pf. Under Tsybakov's condition this becomes, for g ∈ G,

E[ (1_{g(X)≠Y} − 1_{t(X)≠Y})² ] ≤ c (R(g) − R*)^α,

and, for f ∈ F̃, Var f ≤ P f² ≤ c (Pf)^α.

In the finite case, with |G| = N, one can easily apply Bernstein's inequality to F̃ and the finite union bound to get that, with probability at least 1 − δ, for all g ∈ G,

R(g) − R* ≤ R_n(g) − R_n(t) + √( 8c (R(g) − R*)^α log(N/δ) / n ) + 4 log(N/δ) / (3n).

As a consequence, when t ∈ G and g_n is the minimizer of the empirical error (hence R_n(g_n) ≤ R_n(t)), one has

R(g_n) − R* ≤ C ( log(N/δ) / n )^{1/(2−α)},

which is always better than n^{−1/2} for α > 0, and is valid even if R* > 0.

6.4 Local Rademacher Averages

In this section we generalize the above result by introducing a localized version of the Rademacher averages. Going from the finite to the general case is more involved than what has been seen before. We first give the appropriate definitions, then state the result and give a proof sketch.

Definitions. Local Rademacher averages refer to Rademacher averages of subsets of the function class determined by a condition on the variance of the functions.

Definition 8 (Local Rademacher average). The local Rademacher average at radius r ≥ 0 for the class F is defined as

R(F, r) = E[ sup_{f∈F : Pf² ≤ r} R_n f ].

The reason for this definition is that, as we have seen before, the crucial ingredient to obtain better rates of convergence is to use the variance of the functions. Localizing the Rademacher average allows to focus on the part of the function class where the fast rate phenomenon occurs, that is, functions with small variance.

Next we introduce the concept of a sub-root function, a real-valued function with certain monotonicity properties.

Definition 9 (Sub-root function). A function ψ : [0, ∞) → [0, ∞) is sub-root if
(i) ψ is non-decreasing,
(ii) ψ is non-negative,
(iii) ψ(r)/√r is non-increasing.

An immediate consequence of this definition is the following result.

Lemma 5. A sub-root function
(i) is continuous,
(ii) has a unique (non-zero) fixed point r* satisfying ψ(r*) = r*.

[Fig. 6: An example of a sub-root function and its fixed point.]

Before seeing the rationale for introducing the sub-root concept, we need yet another definition, that of a star-hull (somewhat similar to a convex hull).

Definition 10 (Star-hull). Let F be a set of functions. Its star-hull is defined as

⋆F = { αf : f ∈ F, α ∈ [0, 1] }.

Now, we state a lemma indicating that, by taking the star-hull of a class of functions, we are guaranteed that the local Rademacher average behaves like a sub-root function, and thus has a unique fixed point. This fixed point will turn out to be the key quantity in the relative error bounds.

Lemma 6. For any class of functions F, r ↦ R(⋆F, r) is sub-root.

One legitimate question is whether taking the star-hull does not enlarge the class too much. One way to see what the effect is on the size of the class is to compare the metric entropy (log covering numbers) of F and of ⋆F. It is possible to see that the entropy increases only by a logarithmic factor, which is essentially negligible.

Result. We now state the main result involving local Rademacher averages and their fixed point.

Theorem 8. Let F be a class of bounded functions (e.g. f ∈ [−1, 1]) and let r* be the fixed point of r ↦ R(⋆F, r). There exists a constant C > 0 such that, with probability at least 1 − δ,

∀f ∈ F, Pf − P_n f ≤ C ( √(r* Var f) + (log(1/δ) + log log n) / n ).

If, in addition, the functions in F satisfy Var f ≤ c (Pf)^β, then one obtains that, with probability at least 1 − δ,

∀f ∈ F, Pf ≤ C ( P_n f + (r*)^{1/(2−β)} + (log(1/δ) + log log n) / n ).

Proof. We only give the main steps of the proof.

1. The starting point is Talagrand's inequality for empirical processes, a generalization of McDiarmid's inequality of Bernstein type (i.e. one which includes the variance). This inequality tells us that, with high probability,

sup_{f∈F} (Pf − P_n f) ≤ E[ sup_{f∈F} (Pf − P_n f) ] + c √( sup_{f∈F} Var f / n ) + c'/n,

for some constants c, c'.

2. The second step consists in peeling the class, that is, splitting it into subclasses according to the variance of the functions:

F_k = { f : Var f ∈ [x^k, x^{k+1}) },
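As an illustration of Lemma 5 (a toy example, not from the text; the parameters a and b are arbitrary), take ψ(r) = a√r + b: it is non-negative, non-decreasing, and ψ(r)/√r is non-increasing, so it is sub-root and has a unique non-zero fixed point, which simple fixed-point iteration finds.

```python
import math

a, b = 2.0, 0.5  # arbitrary toy parameters

def psi(r):
    # A sub-root function: non-negative, non-decreasing, and psi(r)/sqrt(r)
    # = a + b/sqrt(r) is non-increasing, so Lemma 5 applies.
    return a * math.sqrt(r) + b

# Fixed-point iteration r <- psi(r) converges to r* with psi(r*) = r*.
r = 1.0
for _ in range(200):
    r = psi(r)

# Closed form for comparison: s = sqrt(r*) solves s^2 = a*s + b.
s = (a + math.sqrt(a * a + 4 * b)) / 2
print(r, s * s)
assert abs(psi(r) - r) < 1e-9
assert abs(r - s * s) < 1e-6
```

The sub-root property is what makes the iteration contract toward r* from either side; in the theorem, r* plays exactly this role for ψ(r) = R(⋆F, r).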


More information

CHAPTER 3 DIGITAL CODING OF SIGNALS

CHAPTER 3 DIGITAL CODING OF SIGNALS CHAPTER 3 DIGITAL CODING OF SIGNALS Computers are ofte used to automate the recordig of measuremets. The trasducers ad sigal coditioig circuits produce a voltage sigal that is proportioal to a quatity

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies ( 3.1.1) Limitations of Experiments. Pseudocode ( 3.1.2) Theoretical Analysis

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies ( 3.1.1) Limitations of Experiments. Pseudocode ( 3.1.2) Theoretical Analysis Ruig Time ( 3.) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

LECTURE 13: Cross-validation

LECTURE 13: Cross-validation LECTURE 3: Cross-validatio Resampli methods Cross Validatio Bootstrap Bias ad variace estimatio with the Bootstrap Three-way data partitioi Itroductio to Patter Aalysis Ricardo Gutierrez-Osua Texas A&M

More information

CHAPTER 7: Central Limit Theorem: CLT for Averages (Means)

CHAPTER 7: Central Limit Theorem: CLT for Averages (Means) CHAPTER 7: Cetral Limit Theorem: CLT for Averages (Meas) X = the umber obtaied whe rollig oe six sided die oce. If we roll a six sided die oce, the mea of the probability distributio is X P(X = x) Simulatio:

More information

Week 3 Conditional probabilities, Bayes formula, WEEK 3 page 1 Expected value of a random variable

Week 3 Conditional probabilities, Bayes formula, WEEK 3 page 1 Expected value of a random variable Week 3 Coditioal probabilities, Bayes formula, WEEK 3 page 1 Expected value of a radom variable We recall our discussio of 5 card poker hads. Example 13 : a) What is the probability of evet A that a 5

More information

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed Multi-Evet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria

More information

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas: Chapter 7 - Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries

More information

The Stable Marriage Problem

The Stable Marriage Problem The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV William.Hut@mail.wvu.edu 1 Itroductio Imagie you are a matchmaker,

More information

Ekkehart Schlicht: Economic Surplus and Derived Demand

Ekkehart Schlicht: Economic Surplus and Derived Demand Ekkehart Schlicht: Ecoomic Surplus ad Derived Demad Muich Discussio Paper No. 2006-17 Departmet of Ecoomics Uiversity of Muich Volkswirtschaftliche Fakultät Ludwig-Maximilias-Uiversität Müche Olie at http://epub.ub.ui-mueche.de/940/

More information

THE HEIGHT OF q-binary SEARCH TREES

THE HEIGHT OF q-binary SEARCH TREES THE HEIGHT OF q-binary SEARCH TREES MICHAEL DRMOTA AND HELMUT PRODINGER Abstract. q biary search trees are obtaied from words, equipped with the geometric distributio istead of permutatios. The average

More information

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k. 18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: Courat-Fischer formula ad Rayleigh quotiets The

More information

WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER?

WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER? WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER? JÖRG JAHNEL 1. My Motivatio Some Sort of a Itroductio Last term I tought Topological Groups at the Göttige Georg August Uiversity. This

More information

UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006

UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006 Exam format UC Bereley Departmet of Electrical Egieerig ad Computer Sciece EE 6: Probablity ad Radom Processes Solutios 9 Sprig 006 The secod midterm will be held o Wedesday May 7; CHECK the fial exam

More information

Theorems About Power Series

Theorems About Power Series Physics 6A Witer 20 Theorems About Power Series Cosider a power series, f(x) = a x, () where the a are real coefficiets ad x is a real variable. There exists a real o-egative umber R, called the radius

More information

Our aim is to show that under reasonable assumptions a given 2π-periodic function f can be represented as convergent series

Our aim is to show that under reasonable assumptions a given 2π-periodic function f can be represented as convergent series 8 Fourier Series Our aim is to show that uder reasoable assumptios a give -periodic fuctio f ca be represeted as coverget series f(x) = a + (a cos x + b si x). (8.) By defiitio, the covergece of the series

More information

Infinite Sequences and Series

Infinite Sequences and Series CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...

More information

THE ABRACADABRA PROBLEM

THE ABRACADABRA PROBLEM THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected

More information

Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3

Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3 ESAIM: Probability ad Statistics URL: http://wwwemathfr/ps/ Will be set by the publisher THEORY OF CLASSIFICATION: A SURVEY OF SOME RECENT ADVANCES Stéphae Bouchero 1, Olivier Bousquet 2 ad Gábor Lugosi

More information

5 Boolean Decision Trees (February 11)

5 Boolean Decision Trees (February 11) 5 Boolea Decisio Trees (February 11) 5.1 Graph Coectivity Suppose we are give a udirected graph G, represeted as a boolea adjacecy matrix = (a ij ), where a ij = 1 if ad oly if vertices i ad j are coected

More information

1. MATHEMATICAL INDUCTION

1. MATHEMATICAL INDUCTION 1. MATHEMATICAL INDUCTION EXAMPLE 1: Prove that for ay iteger 1. Proof: 1 + 2 + 3 +... + ( + 1 2 (1.1 STEP 1: For 1 (1.1 is true, sice 1 1(1 + 1. 2 STEP 2: Suppose (1.1 is true for some k 1, that is 1

More information

Chapter 5: Inner Product Spaces

Chapter 5: Inner Product Spaces Chapter 5: Ier Product Spaces Chapter 5: Ier Product Spaces SECION A Itroductio to Ier Product Spaces By the ed of this sectio you will be able to uderstad what is meat by a ier product space give examples

More information

Concentration of Measure

Concentration of Measure Copyright c 2008 2010 Joh Lafferty, Ha Liu, ad Larry Wasserma Do Not Distribute Chapter 7 Cocetratio of Measure Ofte we wat to show that some radom quatity is close to its mea with high probability Results

More information

Basic Elements of Arithmetic Sequences and Series

Basic Elements of Arithmetic Sequences and Series MA40S PRE-CALCULUS UNIT G GEOMETRIC SEQUENCES CLASS NOTES (COMPLETED NO NEED TO COPY NOTES FROM OVERHEAD) Basic Elemets of Arithmetic Sequeces ad Series Objective: To establish basic elemets of arithmetic

More information

Case Study. Normal and t Distributions. Density Plot. Normal Distributions

Case Study. Normal and t Distributions. Density Plot. Normal Distributions Case Study Normal ad t Distributios Bret Halo ad Bret Larget Departmet of Statistics Uiversity of Wiscosi Madiso October 11 13, 2011 Case Study Body temperature varies withi idividuals over time (it ca

More information

BASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1)

BASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1) BASIC STATISTICS. SAMPLES, RANDOM SAMPLING AND SAMPLE STATISTICS.. Radom Sample. The radom variables X,X 2,..., X are called a radom sample of size from the populatio f(x if X,X 2,..., X are mutually idepedet

More information

Vladimir N. Burkov, Dmitri A. Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT

Vladimir N. Burkov, Dmitri A. Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT Keywords: project maagemet, resource allocatio, etwork plaig Vladimir N Burkov, Dmitri A Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT The paper deals with the problems of resource allocatio betwee

More information

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring No-life isurace mathematics Nils F. Haavardsso, Uiversity of Oslo ad DNB Skadeforsikrig Mai issues so far Why does isurace work? How is risk premium defied ad why is it importat? How ca claim frequecy

More information

3 Basic Definitions of Probability Theory

3 Basic Definitions of Probability Theory 3 Basic Defiitios of Probability Theory 3defprob.tex: Feb 10, 2003 Classical probability Frequecy probability axiomatic probability Historical developemet: Classical Frequecy Axiomatic The Axiomatic defiitio

More information

Math C067 Sampling Distributions

Math C067 Sampling Distributions Math C067 Samplig Distributios Sample Mea ad Sample Proportio Richard Beigel Some time betwee April 16, 2007 ad April 16, 2007 Examples of Samplig A pollster may try to estimate the proportio of voters

More information

THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction

THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction THE ARITHMETIC OF INTEGERS - multiplicatio, expoetiatio, divisio, additio, ad subtractio What to do ad what ot to do. THE INTEGERS Recall that a iteger is oe of the whole umbers, which may be either positive,

More information

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals Overview Estimatig the Value of a Parameter Usig Cofidece Itervals We apply the results about the sample mea the problem of estimatio Estimatio is the process of usig sample data estimate the value of

More information

Irreducible polynomials with consecutive zero coefficients

Irreducible polynomials with consecutive zero coefficients Irreducible polyomials with cosecutive zero coefficiets Theodoulos Garefalakis Departmet of Mathematics, Uiversity of Crete, 71409 Heraklio, Greece Abstract Let q be a prime power. We cosider the problem

More information

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth Questio 1: What is a ordiary auity? Let s look at a ordiary auity that is certai ad simple. By this, we mea a auity over a fixed term whose paymet period matches the iterest coversio period. Additioally,

More information

How To Solve The Homewor Problem Beautifully

How To Solve The Homewor Problem Beautifully Egieerig 33 eautiful Homewor et 3 of 7 Kuszmar roblem.5.5 large departmet store sells sport shirts i three sizes small, medium, ad large, three patters plaid, prit, ad stripe, ad two sleeve legths log

More information

1 Correlation and Regression Analysis

1 Correlation and Regression Analysis 1 Correlatio ad Regressio Aalysis I this sectio we will be ivestigatig the relatioship betwee two cotiuous variable, such as height ad weight, the cocetratio of a ijected drug ad heart rate, or the cosumptio

More information

Class Meeting # 16: The Fourier Transform on R n

Class Meeting # 16: The Fourier Transform on R n MATH 18.152 COUSE NOTES - CLASS MEETING # 16 18.152 Itroductio to PDEs, Fall 2011 Professor: Jared Speck Class Meetig # 16: The Fourier Trasform o 1. Itroductio to the Fourier Trasform Earlier i the course,

More information

Lecture 4: Cheeger s Inequality

Lecture 4: Cheeger s Inequality Spectral Graph Theory ad Applicatios WS 0/0 Lecture 4: Cheeger s Iequality Lecturer: Thomas Sauerwald & He Su Statemet of Cheeger s Iequality I this lecture we assume for simplicity that G is a d-regular

More information

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,

More information

Measures of Spread and Boxplots Discrete Math, Section 9.4

Measures of Spread and Boxplots Discrete Math, Section 9.4 Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,

More information

SUPPLEMENTARY MATERIAL TO GENERAL NON-EXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE

SUPPLEMENTARY MATERIAL TO GENERAL NON-EXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE SUPPLEMENTARY MATERIAL TO GENERAL NON-EXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE By Guillaume Lecué CNRS, LAMA, Mare-la-vallée, 77454 Frace ad By Shahar Medelso Departmet of Mathematics,

More information

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio

More information

Lesson 17 Pearson s Correlation Coefficient

Lesson 17 Pearson s Correlation Coefficient Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) -types of data -scatter plots -measure of directio -measure of stregth Computatio -covariatio of X ad Y -uique variatio i X ad Y -measurig

More information

Confidence Intervals. CI for a population mean (σ is known and n > 30 or the variable is normally distributed in the.

Confidence Intervals. CI for a population mean (σ is known and n > 30 or the variable is normally distributed in the. Cofidece Itervals A cofidece iterval is a iterval whose purpose is to estimate a parameter (a umber that could, i theory, be calculated from the populatio, if measuremets were available for the whole populatio).

More information

Center, Spread, and Shape in Inference: Claims, Caveats, and Insights

Center, Spread, and Shape in Inference: Claims, Caveats, and Insights Ceter, Spread, ad Shape i Iferece: Claims, Caveats, ad Isights Dr. Nacy Pfeig (Uiversity of Pittsburgh) AMATYC November 2008 Prelimiary Activities 1. I would like to produce a iterval estimate for the

More information

Totally Corrective Boosting Algorithms that Maximize the Margin

Totally Corrective Boosting Algorithms that Maximize the Margin Mafred K. Warmuth mafred@cse.ucsc.edu Ju Liao liaoju@cse.ucsc.edu Uiversity of Califoria at Sata Cruz, Sata Cruz, CA 95064, USA Guar Rätsch Guar.Raetsch@tuebige.mpg.de Friedrich Miescher Laboratory of

More information

CS103X: Discrete Structures Homework 4 Solutions

CS103X: Discrete Structures Homework 4 Solutions CS103X: Discrete Structures Homewor 4 Solutios Due February 22, 2008 Exercise 1 10 poits. Silico Valley questios: a How may possible six-figure salaries i whole dollar amouts are there that cotai at least

More information

GCSE STATISTICS. 4) How to calculate the range: The difference between the biggest number and the smallest number.

GCSE STATISTICS. 4) How to calculate the range: The difference between the biggest number and the smallest number. GCSE STATISTICS You should kow: 1) How to draw a frequecy diagram: e.g. NUMBER TALLY FREQUENCY 1 3 5 ) How to draw a bar chart, a pictogram, ad a pie chart. 3) How to use averages: a) Mea - add up all

More information

A Mathematical Perspective on Gambling

A Mathematical Perspective on Gambling A Mathematical Perspective o Gamblig Molly Maxwell Abstract. This paper presets some basic topics i probability ad statistics, icludig sample spaces, probabilistic evets, expectatios, the biomial ad ormal

More information

The analysis of the Cournot oligopoly model considering the subjective motive in the strategy selection

The analysis of the Cournot oligopoly model considering the subjective motive in the strategy selection The aalysis of the Courot oligopoly model cosiderig the subjective motive i the strategy selectio Shigehito Furuyama Teruhisa Nakai Departmet of Systems Maagemet Egieerig Faculty of Egieerig Kasai Uiversity

More information

4.3. The Integral and Comparison Tests

4.3. The Integral and Comparison Tests 4.3. THE INTEGRAL AND COMPARISON TESTS 9 4.3. The Itegral ad Compariso Tests 4.3.. The Itegral Test. Suppose f is a cotiuous, positive, decreasig fuctio o [, ), ad let a = f(). The the covergece or divergece

More information

Z-TEST / Z-STATISTIC: used to test hypotheses about. µ when the population standard deviation is unknown

Z-TEST / Z-STATISTIC: used to test hypotheses about. µ when the population standard deviation is unknown Z-TEST / Z-STATISTIC: used to test hypotheses about µ whe the populatio stadard deviatio is kow ad populatio distributio is ormal or sample size is large T-TEST / T-STATISTIC: used to test hypotheses about

More information

Research Method (I) --Knowledge on Sampling (Simple Random Sampling)

Research Method (I) --Knowledge on Sampling (Simple Random Sampling) Research Method (I) --Kowledge o Samplig (Simple Radom Samplig) 1. Itroductio to samplig 1.1 Defiitio of samplig Samplig ca be defied as selectig part of the elemets i a populatio. It results i the fact

More information

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find 1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.

More information

COMPARISON OF THE EFFICIENCY OF S-CONTROL CHART AND EWMA-S 2 CONTROL CHART FOR THE CHANGES IN A PROCESS

COMPARISON OF THE EFFICIENCY OF S-CONTROL CHART AND EWMA-S 2 CONTROL CHART FOR THE CHANGES IN A PROCESS COMPARISON OF THE EFFICIENCY OF S-CONTROL CHART AND EWMA-S CONTROL CHART FOR THE CHANGES IN A PROCESS Supraee Lisawadi Departmet of Mathematics ad Statistics, Faculty of Sciece ad Techoology, Thammasat

More information