Introduction to Statistical Learning Theory




Olivier Bousquet (1), Stéphane Boucheron (2), and Gábor Lugosi (3)

(1) Max-Planck Institute for Biological Cybernetics, Spemannstr. 38, D-72076 Tübingen, Germany. Email: olivier.bousquet@m4x.org. WWW home page: http://www.kyb.mpg.de/~bousquet
(2) Université de Paris-Sud, Laboratoire d'Informatique, Bâtiment 490, F-91405 Orsay Cedex, France. Email: stephane.boucheron@lri.fr. WWW home page: http://www.lri.fr/~boucheron
(3) Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, 08005 Barcelona, Spain. Email: lugosi@upf.es. WWW home page: http://www.econ.upf.es/~lugosi

Abstract. The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms. In particular, most results take the form of so-called error bounds. This tutorial introduces the techniques that are used to obtain such results.

1 Introduction

The main goal of statistical learning theory is to provide a framework for studying the problem of inference, that is of gaining knowledge, making predictions, making decisions or constructing models from a set of data. This is studied in a statistical framework, that is there are assumptions of statistical nature about the underlying phenomena (in the way the data is generated).

As a motivation for the need of such a theory, let us just quote V. Vapnik (Vapnik, [1]): Nothing is more practical than a good theory.

Indeed, a theory of inference should be able to give a formal definition of words like learning, generalization, overfitting, and also to characterize the performance of learning algorithms so that, ultimately, it may help design better learning algorithms. There are thus two goals: make things more precise and derive new or improved algorithms.

1.1 Learning and Inference

What is under study here is the process of inductive inference, which can roughly be summarized as the following steps:

1. Observe a phenomenon.
2. Construct a model of that phenomenon.
3. Make predictions using this model.

Of course, this definition is very general and could be taken more or less as the goal of Natural Sciences. The goal of Machine Learning is to actually automate this process, and the goal of Learning Theory is to formalize it.

In this tutorial we consider a special case of the above process, which is the supervised learning framework for pattern recognition. In this framework, the data consists of instance-label pairs, where the label is either +1 or −1. Given a set of such pairs, a learning algorithm constructs a function mapping instances to labels. This function should be such that it makes few mistakes when predicting the label of unseen instances.

Of course, given some training data, it is always possible to build a function that fits exactly the data. But, in the presence of noise, this may not be the best thing to do, as it would lead to a poor performance on unseen instances (this is usually referred to as overfitting).

[Fig. 1: Trade-off between fit and complexity.]

The general idea behind the design of learning algorithms is thus to look for regularities (in a sense to be defined later) in the observed phenomenon (i.e. training data). These can then be generalized from the observed past to the future. Typically, one would look, in a collection of possible models, for one which fits the data well, but at the same time is as simple as possible (see Figure 1). This immediately raises the question of how to measure and quantify the simplicity of a model (i.e. a {−1, +1}-valued function).

It turns out that there are many ways to do so, but no best one. For example, in Physics, people tend to prefer models which have a small number of constants and that correspond to simple mathematical formulas. Often, the length of description of a model in a coding language can be an indication of its complexity. In classical statistics, the number of free parameters of a model is usually a measure of its complexity. Surprising as it may seem, there is no universal way of measuring simplicity (or its counterpart, complexity), and the choice of a specific measure inherently depends on the problem at hand. It is actually in this choice that the designer of the learning algorithm introduces knowledge about the specific phenomenon under study.

This lack of a universally best choice can actually be formalized in what is called the No Free Lunch theorem, which in essence says that, if there is no assumption on how the past (i.e. training data) is related to the future (i.e. test data), prediction is impossible. Even more, if there is no a priori restriction on the possible phenomena that are expected, it is impossible to generalize and there is thus no better algorithm (any algorithm would be beaten by another one on some phenomenon).

Hence the need to make assumptions, like the fact that the phenomenon we observe can be explained by a simple model. However, as we said, simplicity is not an absolute notion, and this leads to the statement that data cannot replace knowledge, or in pseudo-mathematical terms:

Generalization = Data + Knowledge

1.2 Assumptions

We now make more precise the assumptions that are made by the Statistical Learning Theory framework. Indeed, as we said before, we need to assume that the future (i.e. test) observations are related to the past (i.e. training) ones, so that the phenomenon is somewhat stationary. At the core of the theory is a probabilistic model of the phenomenon (or data generation process). Within this model, the relationship between past and future observations is that they both are sampled independently from the same distribution (i.i.d.). The independence assumption means that each new observation yields maximum information. The identical distribution means that the observations give information about the underlying phenomenon (here a probability distribution).

An immediate consequence of this very general setting is that one can construct algorithms (e.g. k-nearest neighbors with appropriate k) that are consistent, which means that, as one gets more and more data, the predictions of the algorithm are closer and closer to the optimal ones. So this seems to indicate that we can have some sort of universal algorithm. Unfortunately, any (consistent) algorithm can have an arbitrarily bad behavior when given a finite training set. These notions are formalized in Appendix B. Again, this discussion indicates that generalization can only come when one adds specific knowledge to the data. Each learning algorithm encodes specific

knowledge (or a specific assumption about what the optimal classifier looks like), and works best when this assumption is satisfied by the problem to which it is applied.

Bibliographical remarks. Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Anthony and Bartlett [2], Breiman, Friedman, Olshen, and Stone [3], Devroye, Györfi, and Lugosi [4], Duda and Hart [5], Fukunaga [6], Kearns and Vazirani [7], Kulkarni, Lugosi, and Venkatesh [8], Lugosi [9], McLachlan [10], Mendelson [11], Natarajan [12], Vapnik [13, 14, 1], and Vapnik and Chervonenkis [15].

2 Formalization

We consider an input space X and output space Y. Since we restrict ourselves to binary classification, we choose Y = {−1, 1}. Formally, we assume that the pairs (X, Y) ∈ X × Y are random variables distributed according to an unknown distribution P. We observe a sequence of n i.i.d. pairs (X_i, Y_i) sampled according to P, and the goal is to construct a function g : X → Y which predicts Y from X.

We need a criterion to choose this function g. This criterion is a low probability of error P(g(X) ≠ Y). We thus define the risk of g as

R(g) = P(g(X) ≠ Y) = E[1_{g(X) ≠ Y}].

Notice that P can be decomposed as P_X × P(Y | X). We introduce the regression function η(x) = E[Y | X = x] = 2 P[Y = 1 | X = x] − 1 and the target function (or Bayes classifier) t(x) = sgn η(x). This function achieves the minimum risk over all possible measurable functions: R(t) = inf_g R(g). We will denote the value R(t) by R*, called the Bayes risk. In the deterministic case, one has Y = t(X) almost surely (P[Y = 1 | X] ∈ {0, 1}) and R* = 0. In the general case we can define the noise level as s(x) = min(P[Y = 1 | X = x], 1 − P[Y = 1 | X = x]) = (1 − |η(x)|)/2 (s(X) = 0 almost surely in the deterministic case), and this gives R* = E[s(X)].

Our goal is thus to identify this function t, but since P is unknown we cannot directly measure the risk, and we also cannot know directly the value of t at the data points. We can only measure the agreement of a candidate function with the data. This is called the empirical risk:

R_n(g) = (1/n) Σ_{i=1}^n 1_{g(X_i) ≠ Y_i}.

It is common to use this quantity as a criterion to select an estimate of t.
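To make the definition concrete, here is a minimal Python sketch of the empirical risk as the fraction of training mistakes; the helper name `empirical_risk`, the toy data, and the choice of candidate classifier g = sgn are illustrative assumptions, not part of the tutorial.

```python
import numpy as np

def empirical_risk(g, X, Y):
    """R_n(g) = (1/n) sum_i 1_{g(X_i) != Y_i}: fraction of training mistakes."""
    return np.mean(g(X) != Y)

# Toy data: labels agree with sign(x), except one noisy (flipped) label.
X = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
Y = np.array([-1, -1, 1, -1, 1])       # the fourth label disagrees with sign(x)

g = np.sign                            # candidate classifier g(x) = sgn(x)
print(empirical_risk(g, X, Y))         # one mistake out of five -> 0.2
```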

2.1 Algorithms

Now that the goal is clearly specified, we review the common strategies to (approximately) achieve it. We denote by g_n the function returned by the algorithm. Because one cannot compute R(g) but only approximate it by R_n(g), it would be unreasonable to look for the function minimizing R_n(g) among all possible functions. Indeed, when the input space is infinite, one can always construct a function g_n which perfectly predicts the labels of the training data (i.e. g_n(X_i) = Y_i and R_n(g_n) = 0), but behaves on the other points as the opposite of the target function t, i.e. g_n(X) = −Y, so that R(g_n) = 1 (see Footnote 4). So one would have minimum empirical risk but maximum risk.

It is thus necessary to prevent this overfitting situation. There are essentially two ways to do this (which can be combined). The first one is to restrict the class of functions in which the minimization is performed, and the second is to modify the criterion to be minimized (e.g. adding a penalty for complicated functions).

Empirical Risk Minimization. This algorithm is one of the most straightforward, yet it is usually efficient. The idea is to choose a model G of possible functions and to minimize the empirical risk in that model:

g_n = arg min_{g ∈ G} R_n(g).

Of course, this will work best when the target function belongs to G. However, it is rare to be able to make such an assumption, so one may want to enlarge the model as much as possible, while preventing overfitting.

Structural Risk Minimization. The idea here is to choose an infinite sequence {G_d : d = 1, 2, ...} of models of increasing size and to minimize the empirical risk in each model with an added penalty for the size of the model:

g_n = arg min_{g ∈ G_d, d ∈ N} R_n(g) + pen(d, n).

The penalty pen(d, n) gives preference to models where the estimation error is small, and measures the size or capacity of the model.

Regularization. Another, usually easier to implement, approach consists in choosing a large model G (possibly dense in the continuous functions, for example) and defining on G a regularizer, typically a norm ||g||. Then one has to minimize the regularized empirical risk:

g_n = arg min_{g ∈ G} R_n(g) + λ ||g||².

Footnote 4: Strictly speaking, this is only possible if the probability distribution satisfies some mild conditions (e.g. has no atoms). Otherwise, it may not be possible to achieve R(g_n) = 1, but even in this case, provided the support of P contains infinitely many points, a similar phenomenon occurs.
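Empirical Risk Minimization over a small finite model can be sketched as follows; the decision-stump class G, the synthetic noisy data, and all names here are our illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pattern-recognition data: Y = sign(X - 1), with 10% label noise.
n = 200
X = rng.uniform(-3, 3, n)
Y = np.where(X > 1, 1, -1) * np.where(rng.random(n) < 0.1, -1, 1)

# Model G: decision stumps g_theta(x) = sgn(x - theta), a small finite class.
thresholds = np.linspace(-3, 3, 61)

def emp_risk(theta):
    return np.mean(np.where(X > theta, 1, -1) != Y)

# ERM: g_n = argmin over G of R_n(g).
theta_n = thresholds[np.argmin([emp_risk(t) for t in thresholds])]
print(theta_n, emp_risk(theta_n))
```

With enough data relative to the noise level, the empirical minimizer lands near the true boundary at 1, and its empirical risk is close to the noise rate.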

Compared to SRM, there is here a free parameter λ, called the regularization parameter, which allows one to choose the right trade-off between fit and complexity. Tuning λ is usually a hard problem, and most often one uses extra validation data for this task. Most existing (and successful) methods can be thought of as regularization methods.

Normalized Regularization. There are other possible approaches when the regularizer can, in some sense, be normalized, i.e. when it corresponds to some probability distribution over G. Given a probability distribution π defined on G (usually called a prior), one can use as a regularizer −log π(g) (see Footnote 5). Reciprocally, from a regularizer of the form ||g||², if there exists a measure µ on G such that ∫ e^{−λ||g||²} dµ(g) < ∞ for some λ > 0, then one can construct a prior corresponding to this regularizer. For example, if G is the set of hyperplanes in R^d going through the origin, G can be identified with R^d and, taking µ as the Lebesgue measure, it is possible to go from the Euclidean norm regularizer to a spherical Gaussian measure on R^d as a prior (see Footnote 6).

This type of normalized regularizer, or prior, can be used to construct another probability distribution ρ on G (usually called the posterior), as

ρ(g) = e^{−γ R_n(g)} π(g) / Z(γ),

where γ ≥ 0 is a free parameter and Z(γ) is a normalization factor.

There are several ways in which this ρ can be used. If we take the function maximizing it, we recover regularization, as

arg max_{g ∈ G} ρ(g) = arg min_{g ∈ G} γ R_n(g) − log π(g),

where the regularizer is −γ^{−1} log π(g) (see Footnote 7).

Also, ρ can be used to randomize the predictions. In that case, before computing the predicted label for an input x, one samples a function g according to ρ and outputs g(x). This procedure is usually called Gibbs classification. Another way in which the distribution ρ constructed above can be used is by taking the expected prediction of the functions in G:

g_n(x) = sgn(E_ρ[g(x)]).

Footnote 5: This is fine when G is countable. In the continuous case, one has to consider the density associated to π. We omit these details.
Footnote 6: Generalization to infinite dimensional Hilbert spaces can also be done, but it requires more care. One can for example establish a correspondence between the norm of a reproducing kernel Hilbert space and a Gaussian process prior whose covariance function is the kernel of this space.
Footnote 7: Note that minimizing γ R_n(g) − log π(g) is equivalent to minimizing R_n(g) − γ^{−1} log π(g).
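The posterior ρ, Gibbs classification, and the averaged prediction can all be sketched for a finite class; the stump class, the uniform prior, the noiseless data, and the value γ = 50 below are all our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data and a finite class G of decision stumps g_theta(x) = sgn(x - theta).
n = 100
X = rng.uniform(-3, 3, n)
Y = np.where(X > 0, 1, -1)

thetas = np.linspace(-3, 3, 31)
emp_risks = np.array([np.mean(np.where(X > t, 1, -1) != Y) for t in thetas])

# Uniform prior pi; posterior rho(g) = exp(-gamma * R_n(g)) * pi(g) / Z(gamma).
gamma = 50.0
pi = np.full(len(thetas), 1 / len(thetas))
w = np.exp(-gamma * emp_risks) * pi
rho = w / w.sum()                       # normalization by Z(gamma)

# Gibbs classification: sample g ~ rho, then predict with the sampled g.
x = 1.7
theta = rng.choice(thetas, p=rho)
print("Gibbs prediction:", 1 if x > theta else -1)

# Averaged prediction: g_n(x) = sgn(E_rho[g(x)]).
avg = np.sum(rho * np.where(x > thetas, 1, -1))
print("Averaged prediction:", int(np.sign(avg)))
```

The posterior concentrates on stumps with low empirical risk (here, thresholds near 0), so both the Gibbs and the averaged predictions agree with the low-risk functions for most inputs.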

This is typically called Bayesian averaging.

At this point we have to insist again on the fact that the choice of the class G and of the associated regularizer or prior has to come from a priori knowledge about the task at hand, and there is no universally best choice.

2.2 Bounds

We have presented the framework of the theory and the type of algorithms that it studies; we now introduce the kind of results that it aims at. The overall goal is to characterize the risk that some algorithm may have in a given situation. More precisely, a learning algorithm takes as input the data (X_1, Y_1), ..., (X_n, Y_n) and produces a function g_n which depends on this data. We want to estimate the risk of g_n. However, R(g_n) is a random variable (since it depends on the data) and it cannot be computed from the data (since it also depends on the unknown P). Estimates of R(g_n) thus usually take the form of probabilistic bounds.

Notice that when the algorithm chooses its output from a model G, it is possible, by introducing the best function g* in G, with R(g*) = inf_{g ∈ G} R(g), to write

R(g_n) − R* = [R(g*) − R*] + [R(g_n) − R(g*)].

The first term on the right hand side is usually called the approximation error, and measures how well functions in G can approach the target (it would be zero if t ∈ G). The second term, called the estimation error, is a random quantity (it depends on the data) and measures how close g_n is to the best possible choice in G.

Estimating the approximation error is usually hard, since it requires knowledge about the target. Classically, in Statistical Learning Theory it is preferable to avoid making specific assumptions about the target (such as its belonging to some model); rather, the assumptions are on the value of R*, or on the noise function s. It is also known that for any (consistent) algorithm, the rate of convergence to zero of the approximation error (see Footnote 8) can be arbitrarily slow if one does not make assumptions about the regularity of the target, while the rate of convergence of the estimation error can be computed without any such assumption. We will thus focus on the estimation error.

Another possible decomposition of the risk is the following:

R(g_n) = R_n(g_n) + [R(g_n) − R_n(g_n)].

In this case, one estimates the risk by its empirical counterpart, plus some quantity which approximates (or upper bounds) R(g_n) − R_n(g_n).

To summarize, we write the three types of results we may be interested in.

Footnote 8: For this convergence to mean anything, one has to consider algorithms which choose functions from a class which grows with the sample size. This is the case for example of Structural Risk Minimization or Regularization based algorithms.

- Error bound: R(g_n) ≤ R_n(g_n) + B(n, G). This corresponds to the estimation of the risk from an empirical quantity.
- Error bound relative to the best in the class: R(g_n) ≤ R(g*) + B(n, G). This tells how close to optimal the algorithm is, given the model it uses.
- Error bound relative to the Bayes risk: R(g_n) ≤ R* + B(n, G). This gives theoretical guarantees on the convergence to the Bayes risk.

3 Basic Bounds

In this section we show how to obtain simple error bounds (also called generalization bounds). The elementary material from probability theory that is needed here and in the later sections is summarized in Appendix A.

3.1 Relationship to Empirical Processes

Recall that we want to estimate the risk R(g_n) = E[1_{g_n(X) ≠ Y}] of the function g_n returned by the algorithm after seeing the data (X_1, Y_1), ..., (X_n, Y_n). This quantity cannot be observed (P is unknown) and is a random variable (since it depends on the data). Hence one way to make a statement about this quantity is to say how it relates to an estimate such as the empirical risk R_n(g_n). This relationship can take the form of upper and lower bounds for

P[R(g_n) − R_n(g_n) > ε].

For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G, define the loss class

F = {f : (x, y) ↦ 1_{g(x) ≠ y} : g ∈ G}.   (1)

Notice that G contains functions with range in {−1, 1} while F contains nonnegative functions with range in {0, 1}. In the remainder of the tutorial, we will go back and forth between F and G (as there is a bijection between them), sometimes stating the results in terms of functions in F and sometimes in terms of functions in G. It will be clear from the context which classes G and F we refer to, and F will always be derived from the last mentioned class G in the way of (1).

We use the shorthand notation Pf = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i). P_n is usually called the empirical measure associated to the training sample. With this notation, the quantity of interest (the difference between true and empirical risks) can be written as

Pf − P_n f.   (2)

An empirical process is a collection of random variables indexed by a class of functions, such that each random variable is distributed as a sum of i.i.d. random variables (values taken by the function at the data points):

{Pf − P_n f}_{f ∈ F}.

One of the most studied quantities associated to empirical processes is their supremum:

sup_{f ∈ F} (Pf − P_n f).

It is clear that if we know an upper bound on this quantity, it will be an upper bound on (2). This shows that the theory of empirical processes is a great source of tools and techniques for Statistical Learning Theory.

3.2 Hoeffding's Inequality

Let us rewrite again the quantity we are interested in as follows:

R(g) − R_n(g) = E[f(Z)] − (1/n) Σ_{i=1}^n f(Z_i).

It is easy to recognize here the difference between the expectation and the empirical average of the random variable f(Z). By the law of large numbers, we immediately obtain that

P[ lim_{n→∞} (1/n) Σ_{i=1}^n f(Z_i) − E[f(Z)] = 0 ] = 1.

This indicates that with enough samples, the empirical risk of a function is a good approximation to its true risk. It turns out that there exists a quantitative version of the law of large numbers when the variables are bounded.

Theorem 1 (Hoeffding). Let Z_1, ..., Z_n be n i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have

P[ |(1/n) Σ_{i=1}^n f(Z_i) − E[f(Z)]| > ε ] ≤ 2 exp(−2nε² / (b − a)²).

Let us rewrite the above formula to better understand its consequences. Denote the right hand side by δ. Then

P[ |P_n f − Pf| > (b − a) √(log(2/δ) / (2n)) ] ≤ δ,

or (by inversion, see Appendix A), with probability at least 1 − δ,

|P_n f − Pf| ≤ (b − a) √(log(2/δ) / (2n)).
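Theorem 1 is easy to check by simulation. In the sketch below (the Bernoulli loss and all parameter values are our choices for illustration), we compare the empirical frequency of large deviations with the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(Z) in [0, 1] (e.g. a 0/1 loss): Bernoulli(p), so E[f(Z)] = p and b - a = 1.
p, n, eps, trials = 0.3, 100, 0.1, 20000

samples = rng.random((trials, n)) < p          # many independent samples of size n
deviations = np.abs(samples.mean(axis=1) - p)  # |P_n f - P f| for each sample

observed = np.mean(deviations > eps)           # empirical P[|P_n f - P f| > eps]
hoeffding = 2 * np.exp(-2 * n * eps**2)        # right hand side of Theorem 1
print(observed, "<=", hoeffding)
```

The observed frequency is well below the bound: Hoeffding's inequality only uses boundedness, not the actual variance, which is the first limitation listed in Section 3.6.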

Applying this to f(Z) = 1_{g(X) ≠ Y}, we get that for any g, and any δ > 0, with probability at least 1 − δ,

R(g) ≤ R_n(g) + √(log(2/δ) / (2n)).   (3)

Notice that one has to consider a fixed function g, and the probability is with respect to the sampling of the data. If the function depends on the data this does not apply!

3.3 Limitations

Although the above result seems very nice (since it applies to any class of bounded functions), it is actually severely limited. Indeed, what it essentially says is that for each (fixed) function f ∈ F, there is a set S of samples for which

Pf − P_n f ≤ √(log(2/δ) / (2n))

(and this set of samples has measure P[S] ≥ 1 − δ). However, these sets S may be different for different functions. In other words, for the observed sample, only some of the functions in F will satisfy this inequality.

Another way to explain the limitation of Hoeffding's inequality is the following. If we take for G the class of all {−1, 1}-valued (measurable) functions, then for any fixed sample, there exists a function f ∈ F such that

Pf − P_n f = 1.

To see this, take the function which is f(X_i) = Y_i on the data and f(X) = −Y everywhere else. This does not contradict Hoeffding's inequality, but shows that it does not yield what we need.

Figure 2 illustrates the above argumentation.

[Fig. 2: Convergence of the empirical risk to the true risk over the class of functions. The vertical axis is the risk; two curves over the function class show the true risk R(g) and the empirical risk R_n(g).]

The horizontal axis corresponds

to the functions in the class. The two curves represent the true risk and the empirical risk (for some training sample) of these functions. The true risk is fixed, while for each different sample, the empirical risk will be a different curve. If we observe a fixed function g and take several different samples, the point on the empirical curve will fluctuate around the true risk, with fluctuations controlled by Hoeffding's inequality. However, for a fixed sample, if the class G is big enough, one can find, somewhere along the axis, a function for which the difference between the two curves will be very large.

3.4 Uniform Deviations

Before seeing the data, we do not know which function the algorithm will choose. The idea is to consider uniform deviations:

R(f_n) − R_n(f_n) ≤ sup_{f ∈ F} (R(f) − R_n(f)).   (4)

In other words, if we can upper bound the supremum on the right, we are done. For this, we need a bound which holds simultaneously for all functions in a class.

Let us explain how one can construct such uniform bounds. Consider two functions f_1, f_2 and define

C_i = {(x_1, y_1), ..., (x_n, y_n) : Pf_i − P_n f_i > ε}.

This set contains all the 'bad' samples, i.e. those for which the bound fails. From Hoeffding's inequality, for each i,

P[C_i] ≤ δ.

We want to measure how many samples are 'bad' for i = 1 or i = 2. For this we use (see Appendix A)

P[C_1 ∪ C_2] ≤ P[C_1] + P[C_2] ≤ 2δ.

More generally, if we have N functions in our class, we can write

P[C_1 ∪ ... ∪ C_N] ≤ Σ_{i=1}^N P[C_i].

As a result we obtain

P[ ∃f ∈ {f_1, ..., f_N} : Pf − P_n f > ε ] ≤ Σ_{i=1}^N P[Pf_i − P_n f_i > ε] ≤ N exp(−2nε²).
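The union bound argument can be checked numerically. In the sketch below (modeling the N functions as having independent Bernoulli(1/2) losses, with empirical means drawn directly from a binomial, is our simplifying assumption), we estimate the probability that some C_i occurs and compare it with N exp(−2nε²).

```python
import numpy as np

rng = np.random.default_rng(0)

# N "functions", each with a Bernoulli(1/2) loss, so P f_i = 1/2 exactly.
# We draw the empirical means P_n f_i directly as Binomial(n, 1/2) / n.
N, n, eps, trials = 20, 200, 0.12, 20000

means = rng.binomial(n, 0.5, (trials, N)) / n    # P_n f_i for each trial
bad = np.any(0.5 - means > eps, axis=1)          # some C_i = {P f_i - P_n f_i > eps}

print(bad.mean(), "<=", N * np.exp(-2 * n * eps**2))
```

The observed frequency sits comfortably under the bound, reflecting the looseness discussed in Section 3.6.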

Hence, for G = {g_1, ..., g_N}, for all δ > 0, with probability at least 1 − δ,

∀g ∈ G, R(g) ≤ R_n(g) + √((log N + log(1/δ)) / (2n)).

This is an error bound. Indeed, if we know that our algorithm picks functions from G, we can apply this result to g_n itself.

Notice that the main difference with Hoeffding's inequality is the extra log N term on the right hand side. This is the term which accounts for the fact that we want N bounds to hold simultaneously. Another interpretation of this term is as the number of bits one would require to specify one function in G. It turns out that this kind of coding interpretation of generalization bounds is often possible and can be used to obtain error estimates [16].

3.5 Estimation Error

Using the same idea as before, and with no additional effort, we can also get a bound on the estimation error. We start from the inequality

R(g_n) ≤ R_n(g_n) + sup_{g ∈ G} (R(g) − R_n(g)),

which we combine with (4) and with the fact that, since g_n minimizes the empirical risk in G,

R_n(g_n) − R_n(g*) ≤ 0.

Thus we obtain

R(g_n) = R(g_n) − R(g*) + R(g*)
≤ [R(g_n) − R_n(g_n)] + [R_n(g*) − R(g*)] + R(g*)
≤ 2 sup_{g ∈ G} |R(g) − R_n(g)| + R(g*).

We obtain that, with probability at least 1 − δ,

R(g_n) ≤ R(g*) + 2 √((log N + log(2/δ)) / (2n)).

We notice that in the right hand side, both terms depend on the size of the class G. If this size increases, the first term will decrease, while the second will increase.

3.6 Summary and Perspective

At this point, we can summarize what we have exposed so far.

- Inference requires putting assumptions on the process generating the data (data sampled i.i.d. from an unknown P); generalization requires knowledge (e.g. restriction, structure, or prior).
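It is instructive to evaluate the finite-class bound numerically; the helper name `finite_class_bound` and the parameter values below are our choices.

```python
import numpy as np

def finite_class_bound(N, n, delta):
    """Uniform deviation bound sqrt((log N + log(1/delta)) / (2n)) for |G| = N."""
    return np.sqrt((np.log(N) + np.log(1 / delta)) / (2 * n))

# With probability at least 0.95, every g in a class of N = 1000 functions
# satisfies R(g) <= R_n(g) + bound:
for n in (100, 1000, 10000):
    print(n, finite_class_bound(1000, n, 0.05))
```

Note how mild the dependence on N and δ is (logarithmic), while the bound shrinks only as 1/√n.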

- The error bounds are valid with respect to the repeated sampling of training sets.
- For a fixed function g, for most of the samples, R(g) − R_n(g) is of order 1/√n.
- For most of the samples, if |G| = N, sup_{g ∈ G} (R(g) − R_n(g)) is of order √(log N / n).

The extra variability comes from the fact that the chosen g_n changes with the data. So the result we have obtained so far is that, with high probability, for a finite class of size N,

sup_{g ∈ G} (R(g) − R_n(g)) ≤ √((log N + log(1/δ)) / (2n)).

There are several things that can be improved:

- Hoeffding's inequality only uses the boundedness of the functions, not their variance.
- The union bound is as bad as if all the functions in the class were independent (i.e. as if f_1(Z) and f_2(Z) were independent).
- The supremum over G of R(g) − R_n(g) is not necessarily what the algorithm would choose, so that upper bounding R(g_n) − R_n(g_n) by the supremum might be loose.

4 Infinite Case: Vapnik-Chervonenkis Theory

In this section we show how to extend the previous results to the case where the class G is infinite. This requires, in the non-countable case, the introduction of tools from Vapnik-Chervonenkis Theory.

4.1 Refined Union Bound and Countable Case

We first start with a simple refinement of the union bound that allows us to extend the previous results to the (countably) infinite case. Recall that by Hoeffding's inequality, for each f ∈ F, for each δ > 0 (possibly depending on f, which we write δ(f)),

P[ Pf − P_n f > √(log(1/δ(f)) / (2n)) ] ≤ δ(f).

Hence, if we have a countable set F, the union bound immediately yields

P[ ∃f ∈ F : Pf − P_n f > √(log(1/δ(f)) / (2n)) ] ≤ Σ_{f ∈ F} δ(f).

Choosing δ(f) = δ p(f) with Σ_{f ∈ F} p(f) = 1, this makes the right-hand side equal to δ, and we get the following result. With probability at least 1 − δ,

∀f ∈ F, Pf ≤ P_n f + √((log(1/p(f)) + log(1/δ)) / (2n)).

We notice that if F is finite (with size N), taking a uniform p gives the log N as before. Using this approach, it is possible to put knowledge about the algorithm into p(f), but p should be chosen before seeing the data, so it is not possible to 'cheat' by setting all the weight on the function returned by the algorithm after seeing the data (which would give the smallest possible bound). But, in general, if p is well-chosen, the bound will have a small value. Hence, the bound can be improved if one knows ahead of time the functions that the algorithm is likely to pick (i.e. knowledge improves the bound).

4.2 General Case

When the set G is uncountable, the previous approach does not directly work. The general idea is to look at the function class 'projected' on the sample. More precisely, given a sample z_1, ..., z_n, we consider

F_{z_1,...,z_n} = {(f(z_1), ..., f(z_n)) : f ∈ F}.

The size of this set is the number of possible ways in which the data (z_1, ..., z_n) can be classified. Since the functions f can only take two values, this set will always be finite, no matter how big F is.

Definition 1 (Growth function). The growth function is the maximum number of ways into which n points can be classified by the function class:

S_F(n) = sup_{(z_1,...,z_n)} |F_{z_1,...,z_n}|.

We have defined the growth function in terms of the loss class F, but we can do the same with the initial class G and notice that S_F(n) = S_G(n). It turns out that this growth function can be used as a measure of the 'size' of a class of functions, as demonstrated by the following result.

Theorem 2 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 − δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √(2 (log S_G(2n) + log(2/δ)) / n).
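The projection F_{z_1,...,z_n} can be enumerated explicitly for a simple class. Below is a sketch; the choice of threshold functions on the real line (whose projection on n distinct points has exactly n + 1 elements) is our illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Class G: thresholds g_theta(x) = sgn(x - theta) on the real line.
# Projection on a sample: the set of distinct label vectors
# (g(z_1), ..., g(z_n)) as theta ranges over R.
def projections(z):
    z = np.sort(z)
    # Enough thresholds to realize every labeling: below all points,
    # between consecutive points, and above all points.
    cands = np.concatenate(([z[0] - 1], (z[:-1] + z[1:]) / 2, [z[-1] + 1]))
    return {tuple(np.where(z > t, 1, -1)) for t in cands}

for n in (3, 5, 8):
    z = rng.uniform(0, 1, n)
    print(n, len(projections(z)), "out of", 2**n)   # n + 1 labelings, not 2^n
```

So for this class S_G(n) = n + 1, and the bound of Theorem 2 stays nontrivial however large n gets, unlike the trivial 2^n count.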

Notice that, in the finite case where $|G| = N$, we have $S_G(n) \le N$, so that this bound is always better than the one we had before (except for the constants). But the problem now becomes one of computing $S_G(n)$.

4.3 VC Dimension

Since $g \in \{-1, 1\}$, it is clear that $S_G(n) \le 2^n$. If $S_G(n) = 2^n$, there is a set of size $n$ such that the class of functions can generate any classification on these points (we say that $G$ shatters the set).

Definition 2 (VC dimension). The VC dimension of a class $G$ is the largest $n$ such that $S_G(n) = 2^n$.

In other words, the VC dimension of a class $G$ is the size of the largest set that it can shatter. In order to illustrate this definition, we give some examples. The first one is the set of half-planes in $\mathbb{R}^d$ (see Figure 3). In this case, as depicted for $d = 2$, one can shatter a set of $d + 1$ points but no set of $d + 2$ points, which means that the VC dimension is $d + 1$.

Fig. 3. Computing the VC dimension of hyperplanes in dimension 2: a set of 3 points can be shattered, but no set of four points.

It is interesting to notice that the number of parameters needed to define half-spaces in $\mathbb{R}^d$ is $d + 1$, so that a natural question is whether the VC dimension is related to the number of parameters of the function class. The next example, depicted in Figure 4, is a family of functions with one parameter only,
$$\{ \operatorname{sgn}(\sin(tx)) : t \in \mathbb{R} \},$$
which actually has infinite VC dimension (this is an exercise left to the reader).

Fig. 4. VC dimension of sinusoids.

It remains to show how the notion of VC dimension can bring a solution to the problem of computing the growth function. Indeed, at first glance, if we know that a class has VC dimension $h$, it entails that $S_G(n) = 2^n$ for all $n \le h$ and $S_G(n) < 2^n$ otherwise. This seems of little use, but actually an intriguing phenomenon occurs for $n \ge h$, as depicted in Figure 5.

Fig. 5. Typical behavior of the log growth function: $\log S(n)$ is linear in $n$ up to $n = h$, and grows only logarithmically afterwards.

The growth function, which is exponential (its logarithm is linear) up until the VC dimension, becomes polynomial afterwards. This behavior is captured in the following lemma.

Lemma 1 (Vapnik and Chervonenkis, Sauer, Shelah). Let $G$ be a class of functions with finite VC dimension $h$. Then for all $n \in \mathbb{N}$,
$$S_G(n) \le \sum_{i=0}^{h} \binom{n}{i},$$

and for all $n \ge h$,
$$S_G(n) \le \left(\frac{en}{h}\right)^h.$$
Using this lemma along with Theorem 2, we immediately obtain that if $G$ has VC dimension $h$, with probability at least $1 - \delta$,
$$\forall g \in G, \quad R(g) \le R_n(g) + 2\sqrt{2\,\frac{h \log\frac{2en}{h} + \log\frac{2}{\delta}}{n}}.$$
What is important to recall from this result is that the difference between the true and empirical risk is at most of order
$$\sqrt{\frac{h \log n}{n}}.$$
An interpretation of VC dimension and growth functions is that they measure the effective size of the class, that is, the size of the projection of the class onto finite samples. In addition, this measure does not just count the number of functions in the class but depends on the geometry of the class (rather, of its projections). Finally, the finiteness of the VC dimension ensures that the empirical risk will converge uniformly over the class to the true risk.

4.4 Symmetrization

We now indicate how to prove Theorem 2. The key ingredient of the proof is the so-called symmetrization lemma. The idea is to replace the true risk by an estimate computed on an independent set of data. This is of course a mathematical technique and does not mean one needs more data to be able to apply the result. The extra data set is usually called a virtual or ghost sample. We will denote by $Z'_1, \ldots, Z'_n$ an independent (ghost) sample and by $P'_n$ the corresponding empirical measure.

Lemma 2 (Symmetrization). For any $t > 0$ such that $nt^2 \ge 2$,
$$\mathbb{P}\left[ \sup_{f \in F} (P - P_n)f \ge t \right] \le 2\,\mathbb{P}\left[ \sup_{f \in F} (P'_n - P_n)f \ge t/2 \right].$$

Proof. Let $f_n$ be the function achieving the supremum (note that it depends on $Z_1, \ldots, Z_n$). One has (with the product of indicators denoting the conjunction of two events)
$$\mathbb{1}_{(P - P_n)f_n > t}\;\mathbb{1}_{(P - P'_n)f_n < t/2} \le \mathbb{1}_{(P'_n - P_n)f_n > t/2}.$$
Taking expectations with respect to the second sample gives
$$\mathbb{1}_{(P - P_n)f_n > t}\;\mathbb{P}'\left[(P - P'_n)f_n < t/2\right] \le \mathbb{P}'\left[(P'_n - P_n)f_n > t/2\right].$$
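Both bounds of Lemma 1 are easy to check numerically. A minimal sketch (the values of $h$ and $n$ are arbitrary choices for illustration):

```python
from math import comb, e

def sauer_bound(n, h):
    """Binomial-sum bound of Lemma 1: sum_{i<=h} C(n, i)."""
    return sum(comb(n, i) for i in range(h + 1))

def polynomial_bound(n, h):
    """(en/h)^h bound, valid for n >= h."""
    return (e * n / h) ** h

h = 3
for n in [3, 10, 100, 1000]:
    # The polynomial bound dominates the binomial sum for all n >= h.
    print(n, sauer_bound(n, h), round(polynomial_bound(n, h)))
```

For $h = 3$ and $n = 10$ the binomial sum is $1 + 10 + 45 + 120 = 176$, already far below $2^{10} = 1024$, which is the source of the $h \log n$ term in the VC bound above.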

By Chebyshev's inequality (see Appendix A),
$$\mathbb{P}'\left[(P - P'_n)f_n \ge t/2\right] \le \frac{4\,\mathrm{Var}\, f_n}{n t^2} \le \frac{1}{n t^2}.$$
Indeed, a random variable with range in $[0, 1]$ has variance less than $1/4$. Hence
$$\mathbb{1}_{(P - P_n)f_n > t}\left(1 - \frac{1}{n t^2}\right) \le \mathbb{P}'\left[(P'_n - P_n)f_n > t/2\right].$$
Taking expectation with respect to the first sample gives the result.

This lemma allows to replace the expectation $Pf$ by an empirical average over the ghost sample. As a result, the right-hand side only depends on the projection of the class $F$ on the double sample,
$$F_{Z_1,\ldots,Z_n,Z'_1,\ldots,Z'_n},$$
which contains finitely many different vectors. One can thus use the simple union bound that was presented before in the finite case. The other ingredient needed to obtain Theorem 2 is again Hoeffding's inequality, in the following form:
$$\mathbb{P}\left[ P_n f - P'_n f > t \right] \le e^{-n t^2 / 2}.$$
We now just have to put the pieces together:
$$\mathbb{P}\left[ \sup_{f \in F} (P - P_n)f \ge t \right] \le 2\,\mathbb{P}\left[ \sup_{f \in F} (P'_n - P_n)f \ge t/2 \right] = 2\,\mathbb{P}\left[ \sup_{f \in F_{Z_1,\ldots,Z'_n}} (P'_n - P_n)f \ge t/2 \right] \le 2\, S_F(2n)\, \mathbb{P}\left[ (P'_n - P_n)f \ge t/2 \right] \le 4\, S_F(2n)\, e^{-n t^2 / 8}.$$
Using inversion finishes the proof of Theorem 2.

4.5 VC Entropy

One important aspect of the VC dimension is that it is distribution independent. Hence, it allows to get bounds that do not depend on the problem at hand: the same bound holds for any distribution. Although this may be seen as an advantage, it can also be a drawback since, as a result, the bound may be loose for most distributions. We now show how to modify the proof above to get a distribution-dependent result. We use the following notation: $N(F, z_1^n) := |F_{z_1,\ldots,z_n}|$.

Definition 3 (VC entropy). The (annealed) VC entropy is defined as
$$H_F(n) = \log \mathbb{E}\left[ N(F, Z_1^n) \right].$$

Theorem 3. For any $\delta > 0$, with probability at least $1 - \delta$,
$$\forall g \in G, \quad R(g) \le R_n(g) + 2\sqrt{2\,\frac{H_G(2n) + \log\frac{2}{\delta}}{n}}.$$

Proof. We again begin with the symmetrization lemma, so that we have to upper bound the quantity
$$I = \mathbb{E}\left[ \mathbb{1}_{\sup_{f \in F_{Z_1^n, Z'^n_1}} (P'_n - P_n)f \ge t/2} \right].$$
Let $\sigma_1, \ldots, \sigma_n$ be independent random variables such that $\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = 1/2$ (they are called Rademacher variables). We notice that the quantities $(P'_n - P_n)f$ and $\frac{1}{n}\sum_{i=1}^n \sigma_i (f(Z'_i) - f(Z_i))$ have the same distribution, since changing one $\sigma_i$ corresponds to exchanging $Z_i$ and $Z'_i$. Hence we have
$$I \le \mathbb{E}\left[ \mathbb{E}_\sigma\left[ \mathbb{1}_{\sup_{f \in F_{Z_1^n, Z'^n_1}} \frac{1}{n}\sum_{i=1}^n \sigma_i (f(Z'_i) - f(Z_i)) \ge t/2} \right] \right],$$
and the union bound leads to
$$I \le \mathbb{E}\left[ N(F, Z_1^n, Z'^n_1)\, \max_f\, \mathbb{P}_\sigma\left[ \frac{1}{n}\sum_{i=1}^n \sigma_i (f(Z'_i) - f(Z_i)) \ge t/2 \right] \right].$$
Since $\sigma_i (f(Z'_i) - f(Z_i)) \in [-1, 1]$, Hoeffding's inequality finally gives
$$I \le \mathbb{E}\left[ N(F, Z_1^n, Z'^n_1) \right] e^{-n t^2 / 8}.$$
The rest of the proof is as before.

5 Capacity Measures

We have seen so far three measures of capacity or size of classes of functions: the VC dimension and the growth function, both distribution independent, and the VC entropy, which depends on the distribution. Apart from the VC dimension, they are usually hard or impossible to compute. There are however other measures which not only may give sharper estimates, but also have properties that make their computation possible from the data only.

5.1 Covering Numbers

We start by endowing the function class $F$ with the following (random) metric:
$$d_n(f, f') = \frac{1}{n}\,\big|\{ i = 1, \ldots, n : f(Z_i) \ne f'(Z_i) \}\big|.$$

This is the normalized Hamming distance of the projections on the sample. Given such a metric, we say that a set $f_1, \ldots, f_N$ covers $F$ at radius $\varepsilon$ if
$$F \subset \bigcup_{i=1}^{N} B(f_i, \varepsilon).$$
We then define the covering numbers of $F$ as follows.

Definition 4 (Covering number). The covering number of $F$ at radius $\varepsilon$, with respect to $d_n$, denoted by $N(F, \varepsilon, n)$, is the minimum size of a cover of radius $\varepsilon$.

Notice that it does not matter whether we apply this definition to the original class $G$ or the loss class $F$, since $N(F, \varepsilon, n) = N(G, \varepsilon, n)$. The covering numbers characterize the size of a function class as measured by the metric $d_n$. The rate of growth of the logarithm of $N(G, \varepsilon, n)$, usually called the metric entropy, is related to the classical concept of vector dimension. Indeed, if $G$ is a compact set in a $d$-dimensional Euclidean space, $N(G, \varepsilon, n) \approx \varepsilon^{-d}$.

When the covering numbers are finite, it is possible to approximate the class $G$ by a finite set of functions (which cover $G$). This again allows to use the finite union bound, provided we can relate the behavior of all functions in $G$ to that of functions in the cover. A typical result, which we provide without proof, is the following.

Theorem 4. For any $t > 0$,
$$\mathbb{P}\left[ \exists g \in G : R(g) > R_n(g) + t \right] \le 8\,\mathbb{E}\left[ N(G, t, n) \right] e^{-n t^2 / 128}.$$

Covering numbers can also be defined for classes of real-valued functions. We now relate the covering numbers to the VC dimension. Notice that, because the functions in $G$ can only take two values, for all $\varepsilon > 0$, $N(G, \varepsilon, n) \le |G_{Z_1^n}| = N(G, Z_1^n)$. Hence the VC entropy corresponds to log covering numbers at minimal scale, which implies $\log N(G, \varepsilon, n) \le h \log\frac{en}{h}$, but one can have a considerably better result.

Lemma 3 (Haussler). Let $G$ be a class of VC dimension $h$. Then, for all $\varepsilon > 0$, all $n$, and any sample,
$$N(G, \varepsilon, n) \le C\, h\, (4e)^h\, \varepsilon^{-h}.$$
The interest of this result is that the upper bound does not depend on the sample size $n$. The covering number bound is a generalization of the VC entropy bound where the scale is adapted to the error. It turns out that this result can be improved by considering all scales (see Section 5.2).

5.2 Rademacher Averages

Recall that we used in the proof of Theorem 3 Rademacher random variables, i.e. independent $\{-1, 1\}$-valued random variables with probability $1/2$ of taking either value.
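Returning for a moment to Section 5.1: the covering number $N(F, \varepsilon, n)$ of Definition 4 can be upper bounded from data by a greedy cover under the empirical Hamming metric $d_n$. A minimal sketch, using the projections of one-sided threshold classifiers as a hypothetical example class:

```python
def d_n(f_vec, g_vec):
    """Normalized Hamming distance between two projections on the sample."""
    return sum(a != b for a, b in zip(f_vec, g_vec)) / len(f_vec)

def greedy_cover(vectors, eps):
    """Greedy eps-cover; its size upper-bounds the covering number N(F, eps, n)."""
    cover = []
    for v in vectors:
        if all(d_n(v, c) > eps for c in cover):
            cover.append(v)
    return cover

n = 20
# Projections of one-sided thresholds on n ordered points: (-1,...,-1, 1,...,1).
projections = [tuple(1 if i >= k else -1 for i in range(n)) for k in range(n + 1)]
for eps in [0.05, 0.1, 0.25]:
    print(eps, len(greedy_cover(projections, eps)))  # cover shrinks as eps grows
```

Adjacent projections differ in a single coordinate ($d_n = 1/n$), so the cover size drops roughly like $1/\varepsilon$, illustrating how the scale $\varepsilon$ trades off against class size.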

For convenience we introduce the following notation (signed empirical measure):
$$R_n f = \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(Z_i).$$
We will denote by $\mathbb{E}_\sigma$ the expectation taken with respect to the Rademacher variables (i.e. conditionally on the data), while $\mathbb{E}$ will denote the expectation with respect to all the random variables (i.e. the data, the ghost sample and the Rademacher variables).

Definition 5 (Rademacher averages). For a class $F$ of functions, the Rademacher average is defined as
$$R(F) = \mathbb{E}\left[ \sup_{f \in F} R_n f \right],$$
and the conditional Rademacher average is defined as
$$R_n(F) = \mathbb{E}_\sigma\left[ \sup_{f \in F} R_n f \right].$$

We now state the fundamental result involving Rademacher averages.

Theorem 5. For all $\delta > 0$, with probability at least $1 - \delta$,
$$\forall f \in F, \quad Pf \le P_n f + 2 R(F) + \sqrt{\frac{\log\frac{1}{\delta}}{2n}},$$
and also, with probability at least $1 - \delta$,
$$\forall f \in F, \quad Pf \le P_n f + 2 R_n(F) + \sqrt{\frac{2 \log\frac{2}{\delta}}{n}}.$$

It is remarkable that one can obtain a bound (the second part of the theorem) which depends solely on the data.

The proof of the above result requires a powerful tool called a concentration inequality for empirical processes. Actually, Hoeffding's inequality is a (simple) concentration inequality, in the sense that when $n$ increases, the empirical average is concentrated around the expectation. It is possible to generalize this result to functions that depend on i.i.d. random variables, as shown in the theorem below.

Theorem 6 (McDiarmid [17]). Assume that for all $i = 1, \ldots, n$,
$$\sup_{z_1,\ldots,z_n,\, z'_i} \left| F(z_1, \ldots, z_i, \ldots, z_n) - F(z_1, \ldots, z'_i, \ldots, z_n) \right| \le c;$$
then for all $\varepsilon > 0$,
$$\mathbb{P}\left[ \left| F - \mathbb{E}[F] \right| > \varepsilon \right] \le 2 \exp\left( -\frac{2\varepsilon^2}{n c^2} \right).$$

The meaning of this result is thus that, as soon as one has a function of $n$ independent random variables whose variation is bounded when one variable is modified, the function will satisfy a Hoeffding-like inequality.
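Theorem 6 can be illustrated by simulation. The empirical mean of $n$ variables with range $[0, 1]$ changes by at most $c = 1/n$ when one variable is modified, so McDiarmid's inequality bounds its deviation probability by $2\exp(-2n\varepsilon^2)$. A minimal sketch (the sample size, trial count and Bernoulli parameter are arbitrary choices for illustration):

```python
import math
import random

random.seed(1)
n, trials, eps, p = 200, 2000, 0.1, 0.3

# F(Z_1,...,Z_n) = empirical mean of n Bernoulli(p) variables: changing one Z_i
# moves F by at most c = 1/n, so Theorem 6 gives P[|F - E F| > eps] <= 2 exp(-2 n eps^2).
deviations = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(n)) / n
    deviations += abs(mean - p) > eps

empirical = deviations / trials
mcdiarmid_bound = 2 * math.exp(-2 * n * eps ** 2)
print(empirical, mcdiarmid_bound)  # the simulated frequency stays below the bound
```

As the text notes, the bound ignores the variance $p(1-p)$, which is why the simulated frequency is far smaller than the bound here.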

Proof of Theorem 5. To prove Theorem 5, we follow three steps:

1. Use concentration to relate $\sup_{f \in F} (Pf - P_n f)$ to its expectation,
2. use symmetrization to relate the expectation to the Rademacher average,
3. use concentration again to relate the Rademacher average to the conditional one.

We first show that McDiarmid's inequality can be applied to $\sup_{f \in F} (Pf - P_n f)$. We denote temporarily by $P_n^i$ the empirical measure obtained by modifying one element of the sample (e.g. $Z_i$ is replaced by $Z'_i$). It is easy to check that the following holds:
$$\left| \sup_{f \in F} (Pf - P_n f) - \sup_{f \in F} (Pf - P_n^i f) \right| \le \sup_{f \in F} \left| P_n^i f - P_n f \right|.$$
Since $f \in \{0, 1\}$, we obtain
$$\left| P_n^i f - P_n f \right| = \frac{1}{n} \left| f(Z'_i) - f(Z_i) \right| \le \frac{1}{n},$$
and thus McDiarmid's inequality can be applied with $c = 1/n$. This concludes the first step of the proof.

We next prove the (first part of the) following symmetrization lemma.

Lemma 4. For any class $F$,
$$\mathbb{E}\left[ \sup_{f \in F} (Pf - P_n f) \right] \le 2\, \mathbb{E}\left[ \sup_{f \in F} R_n f \right],$$
and
$$\mathbb{E}\left[ \sup_{f \in F} R_n f \right] - \frac{1}{2\sqrt{n}} \le \mathbb{E}\left[ \sup_{f \in F} (Pf - P_n f) \right].$$

Proof. We only prove the first part. We introduce a ghost sample and its corresponding measure $P'_n$. We successively use the fact that $\mathbb{E}[P'_n f] = Pf$ and that the supremum is a convex function (hence we can apply Jensen's inequality, see Appendix A):
$$\mathbb{E}\left[ \sup_{f \in F} (Pf - P_n f) \right] = \mathbb{E}\left[ \sup_{f \in F} \left( \mathbb{E}'[P'_n f] - P_n f \right) \right] \le \mathbb{E}\left[ \sup_{f \in F} (P'_n f - P_n f) \right]$$
$$= \mathbb{E}\left[ \mathbb{E}_\sigma\left[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \big( f(Z'_i) - f(Z_i) \big) \right] \right]$$
$$\le \mathbb{E}\left[ \mathbb{E}_\sigma\left[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(Z'_i) \right] \right] + \mathbb{E}\left[ \mathbb{E}_\sigma\left[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} (-\sigma_i) f(Z_i) \right] \right] = 2\, \mathbb{E}\left[ \sup_{f \in F} R_n f \right],$$

where the third step uses the fact that $f(Z'_i) - f(Z_i)$ and $\sigma_i (f(Z'_i) - f(Z_i))$ have the same distribution, and the last step uses the fact that $\sigma_i f(Z'_i)$ and $-\sigma_i f(Z_i)$ have the same distribution.

The above already establishes the first part of Theorem 5. For the second part, we need to use concentration again. For this we apply McDiarmid's inequality to the functional
$$F(Z_1, \ldots, Z_n) = R_n(F).$$
It is easy to check that $F$ satisfies McDiarmid's assumptions with $c = 1/n$. As a result, $\mathbb{E}[F] = R(F)$ can be sharply estimated by $F = R_n(F)$.

Loss Class and Initial Class. In order to make use of Theorem 5, we have to relate the Rademacher average of the loss class to that of the initial class. This can be done with the following derivation, where one uses the fact that $\sigma_i$ and $-\sigma_i Y_i$ have the same distribution:
$$R(F) = \mathbb{E}\left[ \sup_{g \in G} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, \mathbb{1}_{g(X_i) \ne Y_i} \right] = \mathbb{E}\left[ \sup_{g \in G} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, \frac{1}{2} \big( 1 - Y_i g(X_i) \big) \right] = \frac{1}{2}\, \mathbb{E}\left[ \sup_{g \in G} \frac{1}{n} \sum_{i=1}^{n} (-\sigma_i Y_i)\, g(X_i) \right] = \frac{1}{2} R(G).$$
Notice that the same is valid for conditional Rademacher averages, so that we obtain, with probability at least $1 - \delta$,
$$\forall g \in G, \quad R(g) \le R_n(g) + R_n(G) + \sqrt{\frac{2 \log\frac{2}{\delta}}{n}}.$$

Computing the Rademacher Averages. We now assess the difficulty of actually computing the Rademacher averages. Since $\sigma_i g(X_i) = 1 - 2\,\mathbb{1}_{g(X_i) \ne \sigma_i}$, we can write
$$\frac{1}{2}\, \mathbb{E}_\sigma\left[ \sup_{g \in G} \frac{1}{n} \sum_{i=1}^{n} \sigma_i g(X_i) \right] = \mathbb{E}_\sigma\left[ \sup_{g \in G} \left( \frac{1}{2} - \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{g(X_i) \ne \sigma_i} \right) \right] = \frac{1}{2} - \mathbb{E}_\sigma\left[ \inf_{g \in G} R_n(g, \sigma) \right],$$
where $R_n(g, \sigma)$ denotes the empirical risk of $g$ with respect to the labels $\sigma_1, \ldots, \sigma_n$.
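The identity above makes $R_n(G)$ computable by repeated empirical risk minimization on random labels, since $R_n(G) = 1 - 2\,\mathbb{E}_\sigma[\inf_{g \in G} R_n(g, \sigma)]$. A minimal Monte Carlo sketch, using one-sided threshold classifiers as a hypothetical class:

```python
import random

def erm_threshold_risk(xs, labels):
    """Smallest empirical risk over one-sided thresholds g_t(x) = 1 if x >= t else -1."""
    candidates = sorted(xs) + [max(xs) + 1.0]
    return min(
        sum((1 if x >= t else -1) != y for x, y in zip(xs, labels)) / len(xs)
        for t in candidates
    )

def conditional_rademacher(xs, n_draws=500, seed=0):
    """Monte Carlo estimate of R_n(G) = 1 - 2 E_sigma[inf_g R_n(g, sigma)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice([-1, 1]) for _ in xs]
        total += 1.0 - 2.0 * erm_threshold_risk(xs, sigma)
    return total / n_draws

xs = [i / 10 for i in range(50)]
estimate = conditional_rademacher(xs)
print(estimate)  # well below the trivial value 1/2: thresholds cannot fit random noise
```

A small estimate confirms the intuition in the text: this class is too small to fit random labels, which is exactly why its empirical risks converge uniformly.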

This indicates that, given a sample and a choice of the random variables $\sigma_1, \ldots, \sigma_n$, computing $R_n(G)$ is not harder than computing the empirical risk minimizer in $G$. Indeed, the procedure would be to generate the $\sigma_i$ randomly and minimize the empirical error in $G$ with respect to the labels $\sigma_i$. An advantage of rewriting $R_n(G)$ as above is that it gives an intuition of what it actually measures: it measures how much the class $G$ can fit random noise. If the class $G$ is very large, there will always be a function which can perfectly fit the $\sigma_i$, and then $R_n(G) = 1/2$, so that there is no hope of uniform convergence to zero of the difference between true and empirical risks.

For a finite set with $|G| = N$, one can show that
$$R_n(G) \le \sqrt{\frac{2 \log N}{n}},$$
where we again see the logarithmic factor $\log N$. A consequence of this is that, by considering the projection on the sample of a class $G$ with VC dimension $h$, and using Lemma 1, we have
$$R(G) \le 2\sqrt{\frac{h \log\frac{en}{h}}{n}}.$$
This result, along with Theorem 5, allows to recover the Vapnik-Chervonenkis bound with a concentration-based proof.

Although the benefit of using concentration may not be entirely clear at that point, let us just mention that one can actually improve the dependence on $n$ of the above bound. This is based on the so-called chaining technique. The idea is to use covering numbers at all scales in order to capture the geometry of the class in a better way than the VC entropy does. One has the following result, called Dudley's entropy bound:
$$R_n(F) \le \frac{C}{\sqrt{n}} \int_0^{\infty} \sqrt{\log N(F, t, n)}\; dt.$$
As a consequence, along with Haussler's upper bound, we can get the following result:
$$R_n(F) \le C \sqrt{\frac{h}{n}}.$$
We can thus, with this approach, remove the unnecessary $\log n$ factor of the VC bound.

6 Advanced Topics

In this section, we point out several ways in which the results presented so far can be improved. The main source of improvement actually comes, as mentioned earlier, from the fact that the Hoeffding and McDiarmid inequalities do not make use of the variance of the functions.

6.1 Binomial Tails

We recall that the functions we consider are binary valued. So, if we consider a fixed function $f$, the distribution of $P_n f$ is actually a binomial law of parameters $Pf$ and $n$ (since we are summing $n$ i.i.d. random variables $f(Z_i)$ which can either be 0 or 1 and are equal to 1 with probability $\mathbb{E}[f(Z_i)] = Pf$). Denoting $p = Pf$, we can have an exact expression for the deviations of $P_n f$ from $Pf$:
$$\mathbb{P}\left[ Pf - P_n f \ge t \right] = \sum_{k=0}^{\lfloor n(p - t) \rfloor} \binom{n}{k} p^k (1 - p)^{n - k}.$$
Since this expression is not easy to manipulate, we have used an upper bound provided by Hoeffding's inequality. However, there exist other (sharper) upper bounds. The following quantities are all upper bounds on $\mathbb{P}\left[ Pf - P_n f \ge t \right]$:
$$\left( \left( \frac{1 - p}{1 - p - t} \right)^{1 - p - t} \left( \frac{p}{p + t} \right)^{p + t} \right)^n \quad \text{(exponential)},$$
$$e^{-n \frac{p}{1 - p} \left( (1 - t/p) \log(1 - t/p) + t/p \right)} \quad \text{(Bennett)},$$
$$e^{-\frac{n t^2}{2 p (1 - p) + 2t/3}} \quad \text{(Bernstein)},$$
$$e^{-2 n t^2} \quad \text{(Hoeffding)}.$$
Examining the above bounds (and using inversion), we can say that, roughly speaking, the small deviations of $Pf - P_n f$ have a Gaussian behavior of the form $\exp(-n t^2 / 2 p (1 - p))$ (i.e. Gaussian with variance $p(1 - p)$), while the large deviations have a Poisson behavior of the form $\exp(-3 n t / 2)$. So the tails are heavier than Gaussian, and Hoeffding's inequality consists in upper bounding the tails with a Gaussian of maximum variance, hence the term $\exp(-2 n t^2)$.

Each function $f \in F$ has a different variance $Pf(1 - Pf) \le Pf$. Moreover, for each $f \in F$, by Bernstein's inequality, with probability at least $1 - \delta$,
$$Pf \le P_n f + \sqrt{\frac{2 Pf \log\frac{1}{\delta}}{n}} + \frac{2 \log\frac{1}{\delta}}{3n}.$$
The Gaussian part (the second term on the right-hand side) dominates (for $Pf$ not too small, or $n$ large enough), and it depends on $Pf$. We thus want to combine Bernstein's inequality with the union bound and symmetrization.

6.2 Normalization

The idea is to consider the ratio
$$\frac{Pf - P_n f}{\sqrt{Pf}}.$$
Here ($f \in \{0, 1\}$), $\mathrm{Var}\, f \le P f^2 = Pf$.

The reason for considering this ratio is that, after normalization, the fluctuations are more uniform over the class $F$. Hence the supremum in
$$\sup_{f \in F} \frac{Pf - P_n f}{\sqrt{Pf}}$$
is not necessarily attained at functions with large variance, as was the case previously. Moreover, we know that our goal is to find functions with small error $Pf$ (hence small variance). The normalized supremum takes this into account. We now state a result similar to Theorem 2 for the normalized supremum.

Theorem 7 (Vapnik-Chervonenkis [18]). For $\delta > 0$, with probability at least $1 - \delta$,
$$\forall f \in F, \quad \frac{Pf - P_n f}{\sqrt{Pf}} \le 2 \sqrt{\frac{\log S_F(2n) + \log\frac{4}{\delta}}{n}},$$
and also, with probability at least $1 - \delta$,
$$\forall f \in F, \quad \frac{P_n f - Pf}{\sqrt{P_n f}} \le 2 \sqrt{\frac{\log S_F(2n) + \log\frac{4}{\delta}}{n}}.$$

Proof. We only give a sketch of the proof. The first step is a variation of the symmetrization lemma:
$$\mathbb{P}\left[ \sup_{f \in F} \frac{Pf - P_n f}{\sqrt{Pf}} \ge t \right] \le 2\, \mathbb{P}\left[ \sup_{f \in F} \frac{P'_n f - P_n f}{\sqrt{(P_n f + P'_n f)/2}} \ge t \right].$$
The second step consists in randomization (with Rademacher variables):
$$\cdots = 2\, \mathbb{E}\left[ \mathbb{P}_\sigma\left[ \sup_{f \in F} \frac{\frac{1}{n} \sum_{i=1}^{n} \sigma_i (f(Z'_i) - f(Z_i))}{\sqrt{(P_n f + P'_n f)/2}} \ge t \right] \right].$$
Finally, one uses a tail bound of Bernstein type.

Let us explore the consequences of this result. From the fact that for non-negative numbers $A, B, C$,
$$A \le B + C\sqrt{A} \;\Rightarrow\; A \le B + C^2 + C\sqrt{B},$$
we easily get, for example,
$$\forall f \in F, \quad Pf \le P_n f + 2 \sqrt{P_n f\; \frac{\log S_F(2n) + \log\frac{4}{\delta}}{n}} + 4\, \frac{\log S_F(2n) + \log\frac{4}{\delta}}{n}.$$
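The tail bounds of Section 6.1 can be compared numerically against the exact binomial tail. A minimal sketch (the values of $n$, $p$ and $t$ are arbitrary illustrative choices; in this small-variance regime Bernstein is much tighter than Hoeffding):

```python
import math
from math import comb

def exact_tail(n, p, t):
    """Exact P[Pf - P_n f >= t] when n * P_n f ~ Binomial(n, p)."""
    k_max = math.floor(n * (p - t))
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_max + 1))

def bernstein_bound(n, p, t):
    """Bernstein: uses the variance p(1 - p)."""
    return math.exp(-n * t**2 / (2 * p * (1 - p) + 2 * t / 3))

def hoeffding_bound(n, t):
    """Hoeffding: a Gaussian tail with the worst-case variance 1/4."""
    return math.exp(-2 * n * t**2)

n, p, t = 100, 0.1, 0.05
print(exact_tail(n, p, t), bernstein_bound(n, p, t), hoeffding_bound(n, t))
```

For $p = 0.1$ the variance $p(1-p)$ is far below the worst case $1/4$, which is why the Bernstein bound sits between the exact tail and the Hoeffding bound.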

In the ideal situation where there is no noise (i.e. $Y = t(X)$ almost surely), and $t \in G$, denoting by $g_n$ the empirical risk minimizer, we have $R^* = 0$ and also $R_n(g_n) = 0$. In particular, when $G$ is a class of VC dimension $h$, we obtain
$$R(g_n) = O\left( \frac{h \log n}{n} \right).$$
So, in a way, Theorem 7 allows to interpolate between the best case, where the rate of convergence is $O(h \log n / n)$, and the worst case, where the rate is $O(\sqrt{h \log n / n})$ (it does not allow to remove the $\log n$ factor in this case).

It is also possible to derive from Theorem 7 relative error bounds for the minimizer of the empirical error. With probability at least $1 - \delta$,
$$R(g_n) \le R(g^*) + 2 \sqrt{R(g^*)\, \frac{\log S_G(2n) + \log\frac{4}{\delta}}{n}} + 4\, \frac{\log S_G(2n) + \log\frac{4}{\delta}}{n}.$$
We notice here that when $R(g^*) = 0$ (i.e. $t \in G$ and $R^* = 0$), the rate is again of order $1/n$, while, as soon as $R(g^*) > 0$, the rate is of order $1/\sqrt{n}$. Therefore, it is not possible to obtain a rate with a power of $n$ in between $1/2$ and $1$. The main reason is that the factor $\sqrt{R(g^*)}$ of the square root term is not the right quantity to use here, since it does not vary with $n$. We will see later that one can instead have $R(g_n) - R(g^*)$ as a factor, which usually converges to zero as $n$ increases. Unfortunately, Theorem 7 cannot be applied to functions of the type $f - f^*$ (which would be needed to have the mentioned factor), so we will need a refined approach.

6.3 Noise Conditions

The refinement we seek to obtain requires certain specific assumptions about the noise function $s(x)$, the ideal case being when $s(x) = 0$ everywhere (which corresponds to $R^* = 0$ and $Y = t(X)$). We now introduce quantities that measure how well-behaved the noise function is. The situation is favorable when the regression function $\eta(x)$ is not too close to 0, or at least not too often close to 0. Indeed, $\eta(x) = 0$ means that the noise is maximum at $x$ ($s(x) = 1/2$) and that the label is completely undetermined (any prediction would yield an error with probability $1/2$).

Definitions. There are two types of conditions.

Definition 6 (Massart's noise condition). For some $c > 0$, assume $|\eta(X)| > \frac{1}{c}$ almost surely.

This condition implies that there is no region where the decision is completely random; in other words, the noise is bounded away from $1/2$.

Definition 7 (Tsybakov's noise condition). Let $\alpha \in [0, 1]$; assume that one of the following equivalent conditions is satisfied:
(i) $\exists c > 0$, $\forall g \in \{-1, 1\}^X$, $\mathbb{P}\left[ g(X)\eta(X) \le 0 \right] \le c\, (R(g) - R^*)^\alpha$,
(ii) $\exists c > 0$, $\forall A \subset X$, $\int_A dP(x) \le c \left( \int_A |\eta(x)|\, dP(x) \right)^\alpha$,
(iii) $\exists B > 0$, $\forall t \ge 0$, $\mathbb{P}\left[ |\eta(X)| \le t \right] \le B\, t^{\frac{\alpha}{1 - \alpha}}$.

Condition (iii) is probably the easiest to interpret: it means that $\eta(x)$ is close to the critical value 0 with low probability.

We indicate how to prove that conditions (i), (ii) and (iii) are indeed equivalent.

(i) $\Leftrightarrow$ (ii): It is easy to check that $R(g) - R^* = \mathbb{E}\left[ |\eta(X)|\, \mathbb{1}_{g(X)\eta(X) \le 0} \right]$, and for each function $g$ there exists a set $A$ such that $\mathbb{1}_A = \mathbb{1}_{g\eta \le 0}$.

(ii) $\Rightarrow$ (iii): Let $A = \{ x : |\eta(x)| \le t \}$. Then
$$\mathbb{P}\left[ |\eta| \le t \right] = \int_A dP(x) \le c \left( \int_A |\eta(x)|\, dP(x) \right)^\alpha \le c\, t^\alpha \left( \int_A dP(x) \right)^\alpha,$$
so that $\mathbb{P}\left[ |\eta| \le t \right] \le c^{\frac{1}{1 - \alpha}}\, t^{\frac{\alpha}{1 - \alpha}}$.

(iii) $\Rightarrow$ (i): We write
$$R(g) - R^* = \mathbb{E}\left[ |\eta(X)|\, \mathbb{1}_{g\eta \le 0} \right] \ge t\, \mathbb{E}\left[ \mathbb{1}_{|\eta| \ge t}\, \mathbb{1}_{g\eta \le 0} \right] \ge t \left( \mathbb{P}\left[ g\eta \le 0 \right] - \mathbb{P}\left[ |\eta| < t \right] \right) \ge t \left( \mathbb{P}\left[ g\eta \le 0 \right] - B\, t^{\frac{\alpha}{1 - \alpha}} \right).$$
Taking
$$t = \left( \frac{(1 - \alpha)\, \mathbb{P}\left[ g\eta \le 0 \right]}{B} \right)^{\frac{1 - \alpha}{\alpha}}$$
finally gives
$$\mathbb{P}\left[ g\eta \le 0 \right] \le \frac{B^{1 - \alpha}}{(1 - \alpha)^{1 - \alpha}\, \alpha^{\alpha}}\, (R(g) - R^*)^\alpha.$$

We notice that the parameter $\alpha$ has to be in $[0, 1]$. Indeed, one has the opposite inequality
$$R(g) - R^* = \mathbb{E}\left[ |\eta(X)|\, \mathbb{1}_{g\eta \le 0} \right] \le \mathbb{P}\left[ g(X)\eta(X) \le 0 \right],$$
which is incompatible with condition (i) if $\alpha > 1$. We also notice that when $\alpha = 0$, Tsybakov's condition is void, and when $\alpha = 1$, it is equivalent to Massart's condition.

Consequences. The conditions we impose on the noise yield a crucial relationship between the variance and the expectation of functions in the so-called relative loss class, defined as
$$\tilde{F} = \left\{ (x, y) \mapsto f(x, y) - \mathbb{1}_{t(x) \ne y} : f \in F \right\}.$$
This relationship will allow to exploit Bernstein-type inequalities applied to this latter class. Under Massart's condition, one has (written in terms of the initial class), for $g \in G$,
$$\mathbb{E}\left[ \left( \mathbb{1}_{g(X) \ne Y} - \mathbb{1}_{t(X) \ne Y} \right)^2 \right] \le c\, (R(g) - R^*),$$
or, equivalently, for $f \in \tilde{F}$, $\mathrm{Var}\, f \le P f^2 \le c\, Pf$. Under Tsybakov's condition this becomes, for $g \in G$,
$$\mathbb{E}\left[ \left( \mathbb{1}_{g(X) \ne Y} - \mathbb{1}_{t(X) \ne Y} \right)^2 \right] \le c\, (R(g) - R^*)^\alpha,$$
and for $f \in \tilde{F}$, $\mathrm{Var}\, f \le P f^2 \le c\, (Pf)^\alpha$.

In the finite case, with $|G| = N$, one can easily apply Bernstein's inequality to $\tilde{F}$ and the finite union bound to get that, with probability at least $1 - \delta$, for all $g \in G$,
$$R(g) - R^* \le R_n(g) - R_n(t) + \sqrt{\frac{8 c\, (R(g) - R^*)^\alpha \log\frac{N}{\delta}}{n}} + \frac{4 \log\frac{N}{\delta}}{3n}.$$
As a consequence, when $t \in G$ and $g_n$ is the minimizer of the empirical error (hence $R_n(g_n) \le R_n(t)$), one has
$$R(g_n) - R^* \le C \left( \frac{\log\frac{N}{\delta}}{n} \right)^{\frac{1}{2 - \alpha}},$$
which is always better than $n^{-1/2}$ for $\alpha > 0$ and is valid even if $R^* > 0$.

6.4 Local Rademacher Averages

In this section we generalize the above result by introducing a localized version of the Rademacher averages. Going from the finite to the general case is more involved than what has been seen before. We first give the appropriate definitions, then state the result and give a proof sketch.

Definitions. Local Rademacher averages refer to Rademacher averages of subsets of the function class determined by a condition on the variance of the functions.

Definition 8 (Local Rademacher average). The local Rademacher average at radius $r \ge 0$ for the class $F$ is defined as
$$R(F, r) = \mathbb{E}\left[ \sup_{f \in F :\, P f^2 \le r} R_n f \right].$$

The reason for this definition is that, as we have seen before, the crucial ingredient to obtain better rates of convergence is to use the variance of the functions. Localizing the Rademacher average allows to focus on the part of the function class where the fast rate phenomenon occurs, namely the functions with small variance.

Next we introduce the concept of a sub-root function, a real-valued function with certain monotonicity properties.

Definition 9 (Sub-root function). A function $\psi : \mathbb{R} \to \mathbb{R}$ is sub-root if
(i) $\psi$ is non-decreasing,
(ii) $\psi$ is non-negative,
(iii) $\psi(r)/\sqrt{r}$ is non-increasing.

An immediate consequence of this definition is the following result.

Lemma 5. A sub-root function
(i) is continuous,
(ii) has a unique (non-zero) fixed point $r^*$ satisfying $\psi(r^*) = r^*$.

Figure 6 shows a typical sub-root function and its fixed point.

Fig. 6. An example of a sub-root function and its fixed point.

Before seeing the rationale for introducing the sub-root concept, we need yet another definition, that of a star-hull (somewhat similar to a convex hull).

Definition 10 (Star-hull). Let $F$ be a set of functions. Its star-hull is defined as
$$\star F = \{ \alpha f : f \in F,\ \alpha \in [0, 1] \}.$$
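The fixed point of Lemma 5 can be found by simple iteration $r \leftarrow \psi(r)$, which converges for a sub-root $\psi$ from any positive starting value. A sketch using $\psi(r) = \sqrt{r\, h \log n / n}$, the VC-type upper bound that appears later in this section, whose fixed point is $r^* = h \log n / n$ (the values of $h$ and $n$ are arbitrary choices):

```python
import math

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    """Iterate r <- psi(r); for a sub-root psi this converges to its fixed point r*."""
    r = r0
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r

h, n = 10, 1000
psi = lambda r: math.sqrt(r * h * math.log(n) / n)  # sub-root: psi(r)/sqrt(r) is constant
r_star = fixed_point(psi)
print(r_star, h * math.log(n) / n)  # the iteration solves r = psi(r), i.e. r* = h log(n)/n
```

The contraction is geometric in $\log r$, so a handful of iterations suffices; this is the quantity that will control the relative error bounds below.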

Now we state a lemma indicating that, by taking the star-hull of a class of functions, we are guaranteed that the local Rademacher average behaves like a sub-root function, and thus has a unique fixed point. This fixed point will turn out to be the key quantity in the relative error bounds.

Lemma 6. For any class of functions $F$, $R(\star F, r)$ is sub-root.

One legitimate question is whether taking the star-hull does not enlarge the class too much. One way to see what the effect is on the size of the class is to compare the metric entropy (log covering numbers) of $F$ and of $\star F$. It is possible to see that the entropy increases only by a logarithmic factor, which is essentially negligible.

Result. We now state the main result involving local Rademacher averages and their fixed point.

Theorem 8. Let $F$ be a class of bounded functions (e.g. $f \in [-1, 1]$) and let $r^*$ be the fixed point of $R(\star F, r)$. There exists a constant $C > 0$ such that with probability at least $1 - \delta$,
$$\forall f \in F, \quad Pf - P_n f \le C \left( \sqrt{r^*\, \mathrm{Var}\, f} + \frac{\log\frac{1}{\delta} + \log\log n}{n} \right).$$
If in addition the functions in $F$ satisfy $\mathrm{Var}\, f \le c\, (Pf)^\beta$, then one obtains that, with probability at least $1 - \delta$,
$$\forall f \in F, \quad Pf \le C \left( P_n f + (r^*)^{\frac{1}{2 - \beta}} + \frac{\log\frac{1}{\delta} + \log\log n}{n} \right).$$

Proof. We only give the main steps of the proof.

1. The starting point is Talagrand's inequality for empirical processes, a generalization of McDiarmid's inequality of Bernstein type (i.e. which includes the variance). This inequality tells us that, with high probability,
$$\sup_{f \in F} (Pf - P_n f) \le \mathbb{E}\left[ \sup_{f \in F} (Pf - P_n f) \right] + c \sqrt{\sup_{f \in F} \mathrm{Var}\, f / n} + c'/n,$$
for some constants $c, c'$.

2. The second step consists in peeling the class, that is, splitting the class into subclasses according to the variance of the functions:
$$F_k = \{ f : \mathrm{Var}\, f \in [x^k, x^{k+1}) \}.$$

3. We can then apply Talagrand's inequality to each of the sub-classes separately to get, with high probability,
$$\sup_{f \in F_k} (Pf - P_n f) \le \mathbb{E}\left[ \sup_{f \in F_k} (Pf - P_n f) \right] + c \sqrt{x\, \mathrm{Var}\, f / n} + c'/n.$$

4. Then the symmetrization lemma allows to introduce local Rademacher averages. We get that, with high probability,
$$\forall f \in F, \quad Pf - P_n f \le 2 R(F, x\, \mathrm{Var}\, f) + c \sqrt{x\, \mathrm{Var}\, f / n} + c'/n.$$

5. We then have to solve this inequality. Things are simple if $R$ behaves like a square root function, since we can upper bound the local Rademacher average via its fixed point. With high probability,
$$Pf - P_n f \le 2 \sqrt{r^*\, \mathrm{Var}\, f} + c \sqrt{x\, \mathrm{Var}\, f / n} + c'/n.$$

6. Finally, we use the relationship between variance and expectation, $\mathrm{Var}\, f \le c\, (Pf)^\alpha$, and solve the inequality in $Pf$ to get the result.

We will not go into the details of how to apply the above result, but we give some remarks about its use. An important example is the case where the class $F$ is of finite VC dimension $h$. In that case, one has
$$R(F, r) \le C \sqrt{\frac{r\, h \log n}{n}},$$
so that
$$r^* \le C\, \frac{h \log n}{n}.$$
As a consequence, we obtain, under Tsybakov's condition, a rate of convergence of the relative error of order $O(1/n^{1/(2 - \alpha)})$. It is important to note that, in this case, the rate of uniform convergence of $P_n f$ to $Pf$ is $O(1/\sqrt{n})$; we thus obtain a fast rate by looking at the relative error. These fast rates can be obtained provided $t \in G$ (but it is not needed that $R^* = 0$). This requirement can be removed if one uses structural risk minimization or regularization.

Another related result is that, as in the global case, one can obtain a bound with data-dependent (i.e. conditional) local Rademacher averages,
$$R_n(F, r) = \mathbb{E}_\sigma\left[ \sup_{f \in F :\, P f^2 \le r} R_n f \right].$$
The result is the same as before (with different constants), under the same conditions as in Theorem 8. With probability at least $1 - \delta$,
$$Pf \le C \left( P_n f + (r_n^*)^{\frac{1}{2 - \alpha}} + \frac{\log\frac{1}{\delta} + \log\log n}{n} \right),$$

where $r_n^*$ is the fixed point of a sub-root upper bound of $R_n(F, r)$.

Hence, we can get improved rates when the noise is well-behaved, and these rates interpolate between $n^{-1/2}$ and $n^{-1}$. However, it is not in general possible to estimate the parameters ($c$ and $\alpha$) entering the noise conditions, but we will not discuss this issue further here. Another point is that, although the capacity measure that we use seems local, it does depend on all the functions in the class, but each of them is implicitly appropriately rescaled. Indeed, in $R(\star F, r)$, each function $f \in F$ with $P f^2 \ge r$ is considered at scale $r / P f^2$.

Bibliographical remarks. Hoeffding's inequality appears in [19]. For a proof of the contraction principle we refer to Ledoux and Talagrand [20]. The Vapnik-Chervonenkis-Sauer-Shelah lemma was proved independently by Sauer [21], Shelah [22], and Vapnik and Chervonenkis [18]. For related combinatorial results we refer to Alesker [23], Alon, Ben-David, Cesa-Bianchi, and Haussler [24], Cesa-Bianchi and Haussler [25], Frankl [26], Haussler [27], Szarek and Talagrand [28].

Uniform deviations of averages from their expectations is one of the central problems of empirical process theory. Here we merely refer to some of the comprehensive coverages, such as Dudley [29], Giné [30], Vapnik [1], van der Vaart and Wellner [31]. The use of empirical processes in classification was pioneered by Vapnik and Chervonenkis [18, 15] and re-discovered 20 years later by Blumer, Ehrenfeucht, Haussler, and Warmuth [32] and Ehrenfeucht, Haussler, Kearns, and Valiant [33]. For surveys see Anthony and Bartlett [2], Devroye, Györfi, and Lugosi [4], Kearns and Vazirani [7], Natarajan [12], Vapnik [14, 1].

The question of how $\sup_{f} (P(f) - P_n(f))$ behaves has been known as the Glivenko-Cantelli problem and much has been said about it. A few key references include Alon, Ben-David, Cesa-Bianchi, and Haussler [24], Dudley [34, 35, 36], Talagrand [37, 38], Vapnik and Chervonenkis [18, 39]. The VC dimension has been widely studied and many of its properties are known. We refer to Anthony and Bartlett [2], Assouad [40], Cover [41], Dudley [42, 29], Goldberg and Jerrum [43], Karpinski and Macintyre [44], Khovanskii [45], Koiran and Sontag [46], Macintyre and Sontag [47], Steele [48], and Wenocur and Dudley [49].

The bounded differences inequality was first formulated explicitly by McDiarmid [17], who proved it by martingale methods (see the surveys [17], [50]), but closely related concentration results have been obtained in various ways, including information-theoretic methods (see Ahlswede, Gács, and Körner [51], Marton [52], [53], [54], Dembo [55], Massart [56], and Rio [57]), Talagrand's induction method [58], [59], [60] (see also Luczak and McDiarmid [61], McDiarmid [62], Panchenko [63, 64, 65]), and the so-called entropy method, based on logarithmic Sobolev inequalities, developed by Ledoux [66], [67]; see also Bobkov and Ledoux [68], Massart [69], Rio [57], Boucheron, Lugosi, and Massart [70], [71], Boucheron, Bousquet, Lugosi, and Massart [72], and Bousquet [73]. Symmetrization lemmas can be found in Giné and Zinn [74] and Vapnik and Chervonenkis [18, 15].

The use of Rademacher averages in classification was first promoted by Koltchinskii [75] and Bartlett, Boucheron, and Lugosi [76]; see also Koltchinskii and Panchenko [77, 78], Bartlett and Mendelson [79], Bartlett, Bousquet, and Mendelson [80], Bousquet, Koltchinskii, and Panchenko [81], Kégl, Linder, and Lugosi [82].

A Probability Tools

This section recalls some basic facts from probability theory that are used throughout this tutorial (sometimes without explicitly mentioning it). We denote by $A$ and $B$ some events (i.e. elements of a $\sigma$-algebra), and by $X$ some real-valued random variable.

A.1 Basic Facts

- Union: $\mathbb{P}[A \text{ or } B] \le \mathbb{P}[A] + \mathbb{P}[B]$.
- Inclusion: if $A \subset B$, then $\mathbb{P}[A] \le \mathbb{P}[B]$.
- Inversion: if $\mathbb{P}[X > t] \le F(t)$, then with probability at least $1 - \delta$, $X \le F^{-1}(\delta)$.
- Expectation: if $X \ge 0$,
$$\mathbb{E}[X] = \int_0^\infty \mathbb{P}[X \ge t]\, dt.$$

A.2 Basic Inequalities

All the inequalities below are valid as soon as the right-hand side exists.

- Jensen: for $f$ convex, $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$.
- Markov: if $X \ge 0$, then for all $t > 0$, $\mathbb{P}[X \ge t] \le \frac{\mathbb{E}[X]}{t}$.
- Chebyshev: for $t > 0$, $\mathbb{P}\left[ |X - \mathbb{E}[X]| \ge t \right] \le \frac{\mathrm{Var}\, X}{t^2}$.
- Chernoff: for all $t \in \mathbb{R}$, $\mathbb{P}[X \ge t] \le \inf_{\lambda \ge 0} \mathbb{E}\left[ e^{\lambda (X - t)} \right]$.

Statistical Learning Theory 209

B No Free Lunch

We can now give a formal definition of consistency and state the core results about the impossibility of universally good algorithms.

Definition 11 (Consistency). An algorithm is consistent if for any probability measure $P$,
$$\lim_{n \to \infty} R(g_n) = R^* \quad \text{almost surely.}$$

It is important to understand the reasons that make possible the existence of consistent algorithms. In the case where the input space $\mathcal{X}$ is countable, things are somehow easy: even if there is no relationship at all between inputs and outputs, by repeatedly sampling data independently from $P$, one gets to see an increasing number of different inputs, which eventually cover all the inputs. So, in the countable case, an algorithm which simply learns by heart (i.e., takes a majority vote when the instance has been seen before, and produces an arbitrary prediction otherwise) is consistent.

In the case where $\mathcal{X}$ is not countable (e.g., $\mathcal{X} = \mathbb{R}$), things are more subtle. Indeed, in that case there is a seemingly innocent assumption that becomes crucial: to be able to define a probability measure $P$ on $\mathcal{X}$, one needs a $\sigma$-algebra on that space, which is typically the Borel $\sigma$-algebra. So the hidden assumption is that $P$ is a Borel measure. This means that the topology of $\mathbb{R}$ plays a role here, and thus the target function $t$ will be Borel measurable. In a sense this guarantees that it is possible to approximate $t$ from its value (or approximate value) at a finite number of points. The algorithms that achieve consistency are thus those which use the topology, in the sense of generalizing the observed values to neighborhoods (e.g., local classifiers). In a way, the measurability of $t$ is one of the crudest notions of smoothness of functions.

We now cite two important results. The first one tells us that for a fixed sample size, one can construct arbitrarily bad problems for a given algorithm.

Theorem 9 (No Free Lunch, see e.g. [4]). For any algorithm, any $n$ and any $\varepsilon > 0$, there exists a distribution $P$ such that $R^* = 0$ and
$$\mathbb{P}\left[ R(g_n) \ge \tfrac{1}{2} - \varepsilon \right] = 1.$$

The second result is more subtle; it indicates that, given an algorithm, one can construct a problem for which this algorithm will converge as slowly as one wishes.

Theorem 10 (No Free Lunch at All, see e.g. [4]). For any algorithm and any sequence $(a_n)$ that converges to 0, there exists a probability distribution $P$ such that $R^* = 0$ and
$$\mathbb{E}[R(g_n)] \ge a_n.$$

In the above theorem, the bad probability measure is constructed on a countable set (where the outputs are not related at all to the inputs, so that no generalization is possible), and is such that the rate at which one gets to see new inputs is as slow as the convergence of $a_n$.
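To illustrate why the "learn by heart" rule is consistent on a countable input space, here is a small simulation (a sketch, not from the paper: the geometric input distribution, the parity target $t(x) = x \bmod 2$, and the default prediction 0 are all arbitrary illustrative choices). Fresh samples keep revealing new inputs, so the estimated risk of the memorizing rule shrinks as the sample grows:

```python
import random

random.seed(1)

# Countable input space: X drawn from a geometric law on {0, 1, 2, ...},
# noiseless target t(x) = x mod 2.  The learner memorizes observed labels
# and predicts 0 on unseen inputs.  (All specific choices are illustrative.)

def sample_x():
    """Draw X with P[X = k] = 2^-(k+1), k = 0, 1, 2, ..."""
    x = 0
    while random.random() < 0.5:
        x += 1
    return x

def estimate_risk(memory, trials=20_000):
    """Monte Carlo estimate of P[g(X) != t(X)] for the memorizing rule."""
    errors = 0
    for _ in range(trials):
        x = sample_x()
        if memory.get(x, 0) != x % 2:
            errors += 1
    return errors / trials

risks = []
for n in (10, 100, 1000):
    memory = {}
    for _ in range(n):
        x = sample_x()
        memory[x] = x % 2   # labels are noiseless, so storing = majority vote
    risks.append(estimate_risk(memory))

print(risks)  # risk should shrink as the sample grows
```

Theorem 10 is exactly the statement that this convergence, while guaranteed, can be made arbitrarily slow by spreading the input distribution thinly enough over the countable set.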

Finally, we mention other notions of consistency.

Definition 12 (VC consistency of ERM). The ERM algorithm is consistent if for any probability measure $P$,
$$R(g_n) \to R(g^\star) \text{ in probability, and } R_n(g_n) \to R(g^\star) \text{ in probability.}$$

Definition 13 (VC non-trivial consistency of ERM). The ERM algorithm is non-trivially consistent for the set $\mathcal{G}$ and the probability distribution $P$ if for any $c$,
$$\inf_{f:\, Pf > c} P_n(f) \to \inf_{f:\, Pf > c} P(f) \text{ in probability.}$$

References

1. Vapnik, V.: Statistical Learning Theory. John Wiley, New York (1998)
2. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (1999)
3. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International, Belmont, CA (1984)
4. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York (1996)
5. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley, New York (1973)
6. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, New York (1972)
7. Kearns, M., Vazirani, U.: An Introduction to Computational Learning Theory. MIT Press, Cambridge, Massachusetts (1994)
8. Kulkarni, S., Lugosi, G., Venkatesh, S.: Learning pattern classification - a survey. IEEE Transactions on Information Theory 44 (1998) 2178–2206. Information Theory: 1948–1998 Commemorative special issue
9. Lugosi, G.: Pattern classification and learning theory. In Györfi, L., ed.: Principles of Nonparametric Learning, Springer, Vienna (2002) 5–62
10. McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. John Wiley, New York (1992)
11. Mendelson, S.: A few notes on statistical learning theory. In Mendelson, S., Smola, A., eds.: Advanced Lectures in Machine Learning. LNCS 2600, Springer (2003) 1–40
12. Natarajan, B.: Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA (1991)
13. Vapnik, V.: Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York (1982)
14. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
15. Vapnik, V., Chervonenkis, A.: Theory of Pattern Recognition. Nauka, Moscow (1974) (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979

16. von Luxburg, U., Bousquet, O., Schölkopf, B.: A compression approach to support vector model selection. The Journal of Machine Learning Research 5 (2004) 293–323
17. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combinatorics 1989, Cambridge University Press, Cambridge (1989) 148–188
18. Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16 (1971) 264–280
19. Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (1963) 13–30
20. Ledoux, M., Talagrand, M.: Probability in Banach Spaces. Springer-Verlag, New York (1991)
21. Sauer, N.: On the density of families of sets. Journal of Combinatorial Theory, Series A 13 (1972) 145–147
22. Shelah, S.: A combinatorial problem: Stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics 41 (1972) 247–261
23. Alesker, S.: A remark on the Szarek-Talagrand theorem. Combinatorics, Probability, and Computing 6 (1997) 139–144
24. Alon, N., Ben-David, S., Cesa-Bianchi, N., Haussler, D.: Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM 44 (1997) 615–631
25. Cesa-Bianchi, N., Haussler, D.: A graph-theoretic generalization of the Sauer-Shelah lemma. Discrete Applied Mathematics 86 (1998) 27–35
26. Frankl, P.: On the trace of finite sets. Journal of Combinatorial Theory, Series A 34 (1983) 41–45
27. Haussler, D.: Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A 69 (1995) 217–232
28. Szarek, S., Talagrand, M.: On the convexified Sauer-Shelah theorem. Journal of Combinatorial Theory, Series B 69 (1997) 183–192
29. Dudley, R.: Uniform Central Limit Theorems. Cambridge University Press, Cambridge (1999)
30. Giné, E.: Empirical processes and applications: an overview. Bernoulli 2 (1996) 1–28
31. van der Vaart, A., Wellner, J.: Weak Convergence and Empirical Processes. Springer-Verlag, New York (1996)
32. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36 (1989) 929–965
33. Ehrenfeucht, A., Haussler, D., Kearns, M., Valiant, L.: A general lower bound on the number of examples needed for learning. Information and Computation 82 (1989) 247–261
34. Dudley, R.: Central limit theorems for empirical measures. Annals of Probability 6 (1978) 899–929
35. Dudley, R.: Empirical processes. In: École de Probabilité de St. Flour 1982, Lecture Notes in Mathematics #1097, Springer-Verlag, New York (1984)
36. Dudley, R.: Universal Donsker classes and metric entropy. Annals of Probability 15 (1987) 1306–1326
37. Talagrand, M.: The Glivenko-Cantelli problem. Annals of Probability 15 (1987) 837–870
38. Talagrand, M.: Sharper bounds for Gaussian and empirical processes. Annals of Probability 22 (1994) 28–76

39. Vapnik, V., Chervonenkis, A.: Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications 26 (1981) 821–832
40. Assouad, P.: Densité et dimension. Annales de l'Institut Fourier 33 (1983) 233–282
41. Cover, T.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers 14 (1965) 326–334
42. Dudley, R.: Balls in R^k do not cut all subsets of k + 2 points. Advances in Mathematics 31 (3) (1979) 306–308
43. Goldberg, P., Jerrum, M.: Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real numbers. Machine Learning 18 (1995) 131–148
44. Karpinski, M., Macintyre, A.: Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences 54 (1997)
45. Khovanskii, A.G.: Fewnomials. Translations of Mathematical Monographs, vol. 88, American Mathematical Society (1991)
46. Koiran, P., Sontag, E.: Neural networks with quadratic VC dimension. Journal of Computer and System Sciences 54 (1997)
47. Macintyre, A., Sontag, E.: Finiteness results for sigmoidal neural networks. In: Proceedings of the 25th Annual ACM Symposium on the Theory of Computing, Association for Computing Machinery, New York (1993) 325–334
48. Steele, J.: Existence of submatrices with all possible columns. Journal of Combinatorial Theory, Series A 28 (1978) 84–88
49. Wenocur, R., Dudley, R.: Some special Vapnik-Chervonenkis classes. Discrete Mathematics 33 (1981) 313–318
50. McDiarmid, C.: Concentration. In Habib, M., McDiarmid, C., Ramirez-Alfonsin, J., Reed, B., eds.: Probabilistic Methods for Algorithmic Discrete Mathematics, Springer, New York (1998) 195–248
51. Ahlswede, R., Gács, P., Körner, J.: Bounds on conditional probabilities with applications in multi-user communication. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 34 (1976) 157–177 (correction in 39:353–354, 1977)
52. Marton, K.: A simple proof of the blowing-up lemma. IEEE Transactions on Information Theory 32 (1986) 445–446
53. Marton, K.: Bounding d̄-distance by informational divergence: a way to prove measure concentration. Annals of Probability 24 (1996) 857–866
54. Marton, K.: A measure concentration inequality for contracting Markov chains. Geometric and Functional Analysis 6 (1996) 556–571. Erratum: 7:609–613, 1997
55. Dembo, A.: Information inequalities and concentration of measure. Annals of Probability 25 (1997) 927–939
56. Massart, P.: Optimal constants for Hoeffding type inequalities. Technical report, Mathématiques, Université de Paris-Sud, Report 98.86 (1998)
57. Rio, E.: Inégalités de concentration pour les processus empiriques de classes de parties. Probability Theory and Related Fields 119 (2001) 163–175
58. Talagrand, M.: A new look at independence. Annals of Probability 24 (1996) 1–34 (Special Invited Paper)
59. Talagrand, M.: Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'IHES 81 (1995) 73–205
60. Talagrand, M.: New concentration inequalities in product spaces. Inventiones Mathematicae 126 (1996) 505–563
61. Luczak, M.J., McDiarmid, C.: Concentration for locally acting permutations. Discrete Mathematics (2003), to appear

62. McDiarmid, C.: Concentration for independent permutations. Combinatorics, Probability, and Computing 2 (2002) 163–178
63. Panchenko, D.: A note on Talagrand's concentration inequality. Electronic Communications in Probability 6 (2001)
64. Panchenko, D.: Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability 7 (2002)
65. Panchenko, D.: Symmetrization approach to concentration inequalities for empirical processes. Annals of Probability, to appear (2003)
66. Ledoux, M.: On Talagrand's deviation inequalities for product measures. ESAIM: Probability and Statistics 1 (1997) 63–87. http://www.emath.fr/ps/
67. Ledoux, M.: Isoperimetry and Gaussian analysis. In Bernard, P., ed.: Lectures on Probability Theory and Statistics, École d'Été de Probabilités de St-Flour XXIV-1994 (1996) 165–294
68. Bobkov, S., Ledoux, M.: Poincaré's inequalities and Talagrand's concentration phenomenon for the exponential distribution. Probability Theory and Related Fields 107 (1997) 383–400
69. Massart, P.: About the constants in Talagrand's concentration inequalities for empirical processes. Annals of Probability 28 (2000) 863–884
70. Boucheron, S., Lugosi, G., Massart, P.: A sharp concentration inequality with applications. Random Structures and Algorithms 16 (2000) 277–292
71. Boucheron, S., Lugosi, G., Massart, P.: Concentration inequalities using the entropy method. The Annals of Probability 31 (2003) 1583–1614
72. Boucheron, S., Bousquet, O., Lugosi, G., Massart, P.: Moment inequalities for functions of independent random variables. The Annals of Probability (2004), to appear
73. Bousquet, O.: A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris 334 (2002) 495–500
74. Giné, E., Zinn, J.: Some limit theorems for empirical processes. Annals of Probability 12 (1984) 929–989
75. Koltchinskii, V.: Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory 47 (2001) 1902–1914
76. Bartlett, P., Boucheron, S., Lugosi, G.: Model selection and error estimation. Machine Learning 48 (2001) 85–113
77. Koltchinskii, V., Panchenko, D.: Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics 30 (2002)
78. Koltchinskii, V., Panchenko, D.: Rademacher processes and bounding the risk of function learning. In Giné, E., Mason, D., Wellner, J., eds.: High Dimensional Probability II (2000) 443–459
79. Bartlett, P., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (2002) 463–482
80. Bartlett, P., Bousquet, O., Mendelson, S.: Localized Rademacher complexities. In: Proceedings of the 15th Annual Conference on Computational Learning Theory (2002) 44–48
81. Bousquet, O., Koltchinskii, V., Panchenko, D.: Some local measures of complexity of convex hulls and generalization bounds. In: Proceedings of the 15th Annual Conference on Computational Learning Theory, Springer (2002) 59–73
82. Antos, A., Kégl, B., Linder, T., Lugosi, G.: Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research 3 (2002) 73–98