Statistical Learning Theory
1 / 130 Statistical Learning Theory. Machine Learning Summer School, Kyoto, Japan. Alexander (Sasha) Rakhlin, University of Pennsylvania, The Wharton School; Penn Research in Machine Learning (PRiML). August 27-28, 2012.
2 / 130 References. Parts of these lectures are based on: O. Bousquet, S. Boucheron, G. Lugosi, "Introduction to Statistical Learning Theory"; MLSS notes by O. Bousquet; S. Mendelson, "A Few Notes on Statistical Learning Theory"; lecture notes by S. Shalev-Shwartz; lecture notes by S. R. and K. Sridharan. Prerequisites: a basic familiarity with Probability is assumed.
3 / 130 Outline. Introduction. Statistical Learning Theory: The Setting of SLT; Consistency, No Free Lunch Theorems, Bias-Variance Tradeoff; Tools from Probability, Empirical Processes. From Finite to Infinite Classes: Uniform Convergence, Symmetrization, and Rademacher Complexity; Large Margin Theory for Classification; Properties of Rademacher Complexity; Covering Numbers and Scale-Sensitive Dimensions; Faster Rates; Model Selection. Sequential Prediction / Online Learning: Motivation; Supervised Learning; Online Convex and Linear Optimization; Online-to-Batch Conversion, SVM optimization.
4 / 130 Example #1: Handwritten Digit Recognition. Imagine you are asked to write a computer program that recognizes postal codes on envelopes. You observe the huge amount of variation and ambiguity in the data. One can try to hard-code all the possibilities, but this is likely to fail. It would be nice if a program looked at a large corpus of data and learned the distinctions! This picture of the MNIST dataset was yanked from the web.
5 / 130 Example #1: Handwritten Digit Recognition. We need to represent the data in the computer. Pixel intensities is one possibility, but not necessarily the best one. Feature representation: a feature map. We also need to specify the label of this example: 3. The labeled example is the pair (image, 3). After looking at many of these examples, we want the program to predict the label of the next hand-written digit.
6 / 130 Example #2: Predict the Topic of a News Article. You would like to automatically collect news stories from the web and display them to the reader in the best possible way. You would like to group or filter these articles by topic. Hard-coding possible topics for articles is a daunting task! Representation in the computer: a bag-of-words representation. If 1 stands for the category "politics", then this example can be represented as the pair (bag-of-words vector, 1). After looking at many such examples, we would like the program to predict the topic of a new article.
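The bag-of-words encoding described above can be sketched in a few lines. The vocabulary and the example sentence here are made up for illustration; they are not from the lecture.

```python
# Minimal bag-of-words sketch: a document becomes a vector of word counts
# over a fixed vocabulary, paired with a category label.
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["election", "senate", "goal", "match", "budget"]   # illustrative
article = "The senate passed the budget after the election"
x = bag_of_words(article, vocab)   # feature vector
y = 1                              # 1 = "politics" category, as on the slide
example = (x, y)                   # the labeled example (x, 1)
```

Real systems use much larger vocabularies and often weight counts (e.g. tf-idf), but the labeled-pair structure (x, y) is the same.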
7 / 130 Why Machine Learning? It is impossible to hard-code all the knowledge into a computer program, and systems need to adapt to changes in the environment. Examples: computer vision (face detection, face recognition); audio (voice recognition, parsing); text (document topics, translation); ad placement on web pages; movie recommendations; email spam detection.
8 / 130 Machine Learning. (Human) learning is the process of acquiring knowledge or skill. Quite vague. How can we build a mathematical theory for something so imprecise? Machine Learning is concerned with the design and analysis of algorithms that improve performance after observing data. That is, the acquired knowledge comes from data. We need to make mathematically precise the following terms: performance, improve, data.
9 / 130 Learning from Examples. How is it possible to conclude something general from specific examples? Learning is inherently an ill-posed problem, as there are many alternatives that could be consistent with the observed examples. Learning can be seen as the process of induction (as opposed to deduction): extrapolating from examples. Prior knowledge is how we make the problem well-posed. Memorization is not learning, not induction; our theory should make this apparent. It is very important to delineate assumptions. Then we will be able to prove mathematically that certain learning algorithms perform well.
10 / 130 Data. Space of inputs (or, predictors): X. E.g. x ∈ X = {0, 1, ..., 2^16}^64 is a string of pixel intensities in an 8×8 image. E.g. x ∈ X = R^33,000 is a set of gene expression levels. E.g. x = (x_1, x_2, ...) is a vector of patient features with entries such as # cigarettes/day, # drinks/day, BMI.
11 / 130 Data. Sometimes the space X is uniquely defined by the problem. In other cases, such as in vision/text/audio applications, many possibilities exist, and a good feature representation is key to obtaining good performance. This important part of machine learning applications will not be discussed in this lecture, and we will assume that X has been chosen by the practitioner.
12 / 130 Data. Space of outputs (or, responses): Y. E.g. y ∈ Y = {0, 1} is a binary label (1 = "cat"). E.g. y ∈ Y = [0, 200] is life expectancy. A pair (x, y) is a labeled example; e.g. (x, y) is an example of an image x with a label y = 1, which stands for the presence of a face in the image. Dataset (or training data): n examples {(x_1, y_1), ..., (x_n, y_n)}, e.g. a collection of images labeled according to the presence or absence of a face.
13 / 130 The Multitude of Learning Frameworks. Presence/absence of labeled data: Supervised Learning: {(x_1, y_1), ..., (x_n, y_n)}; Unsupervised Learning: {x_1, ..., x_n}; Semi-supervised Learning: a mix of the above. This distinction is important, as labels are often difficult or expensive to obtain (e.g. one can collect a large corpus of emails, but which ones are spam?). Types of labels: Binary Classification / Pattern Recognition: Y = {0, 1}; Multiclass: Y = {0, ..., K}; Regression: Y ⊆ R; Structured prediction: Y is a set of complex objects (graphs, translations).
14 / 130 The Multitude of Learning Frameworks. Problems also differ in the protocol for obtaining data: Passive vs Active; and in the assumptions on the data: Batch (typically i.i.d.) vs Online (i.i.d., or worst-case, or some stochastic process). Even more involved: Reinforcement Learning and other frameworks.
15 / 130 Why Theory? "... theory is the first term in the Taylor series of practice" (Thomas M. Cover, 1990 Shannon Lecture). Theory and Practice should go hand-in-hand. Boosting and Support Vector Machines came from theoretical considerations. Sometimes theory suggests practical methods; sometimes practice comes ahead and theory tries to catch up and explain the performance.
16 / 130 This tutorial. First 2/3 of the tutorial: we will study the problem of supervised learning (with a focus on binary classification) with an i.i.d. assumption on the data. The last 1/3 of the tutorial: we will turn to online learning without the i.i.d. assumption.
19 / 130 Statistical Learning Theory. The variable x is related to y, and we would like to learn this relationship from data. The relationship is encapsulated by a distribution P on X × Y. Example: x = [weight, blood glucose, ...] and y is the risk of diabetes. We assume there is a relationship between x and y: it is less likely to see certain x co-occur with low risk, and unlikely to see some other x co-occur with high risk. This relationship is encapsulated by P(x, y). This is an assumption about the population of all (x, y) pairs. However, what we see is a sample.
20 / 130 Statistical Learning Theory. Data: {(x_1, y_1), ..., (x_n, y_n)}, where n is the sample size. The distribution P is unknown to us (otherwise, there is no learning to be done). The observed data are sampled independently from P (the i.i.d. assumption). It is often helpful to write P = P_x × P_{y|x}. The distribution P_x on the inputs is called the marginal distribution, while P_{y|x} is the conditional distribution.
21 / 130 Statistical Learning Theory. Upon observing the training data {(x_1, y_1), ..., (x_n, y_n)}, the learner is asked to summarize what she has learned about the relationship between x and y. The learner's summary takes the form of a function f̂_n : X → Y. The hat indicates that this function depends on the training data. Learning algorithm: a mapping {(x_1, y_1), ..., (x_n, y_n)} ↦ f̂_n. The quality of the learned relationship is given by comparing the response f̂_n(x) to y for a pair (x, y) independently drawn from the same distribution P: E_{(x,y)} ℓ(f̂_n(x), y), where ℓ : Y × Y → R is a loss function. This is our measure of performance.
22 / 130 Loss Functions. Indicator loss (classification): ℓ(y, y′) = I{y ≠ y′}. Square loss: ℓ(y, y′) = (y − y′)². Absolute loss: ℓ(y, y′) = |y − y′|.
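As a quick sketch, the three losses from the slide can be written directly for scalar predictions:

```python
# The three loss functions l(y, y') from the slide, for scalar arguments.

def indicator_loss(y, y_pred):
    """Indicator (0-1) loss: 1 if the prediction differs from the label."""
    return 1.0 if y != y_pred else 0.0

def square_loss(y, y_pred):
    """Square loss (y - y')^2, standard for regression."""
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    """Absolute loss |y - y'|, less sensitive to outliers than square loss."""
    return abs(y - y_pred)
```

The indicator loss is the natural measure for classification but is non-convex and non-smooth, which is one reason surrogate losses appear later in the lecture.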
23 / 130 Examples. Probably the simplest learning algorithm that you are familiar with is linear least squares. Given (x_1, y_1), ..., (x_n, y_n), let β̂ = argmin_{β∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨β, x_i⟩)², and define f̂_n(x) = ⟨β̂, x⟩. Another basic method is regularized least squares: β̂ = argmin_{β∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨β, x_i⟩)² + λ‖β‖².
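Both estimators have closed forms via the normal equations; a minimal numpy sketch on synthetic data (the data-generating parameters here are invented for illustration):

```python
import numpy as np

# Synthetic regression data: y = <beta_true, x> + noise.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Ordinary least squares: beta = argmin (1/n) sum_i (y_i - <beta, x_i>)^2,
# solved via the normal equations X^T X beta = X^T y.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Regularized (ridge) least squares adds lambda * ||beta||^2 to the objective;
# the stationarity condition becomes (X^T X / n + lambda I) beta = X^T y / n.
lam = 0.1
beta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def f_hat(x, beta):
    """The learned linear predictor f_hat(x) = <beta, x>."""
    return x @ beta
```

Note the regularizer shrinks the estimate toward zero; with λ → 0 the two solutions coincide.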
24 / 130 Methods vs Problems. (Diagram contrasting the space of algorithms, each producing some f̂_n, with the space of distributions P.)
25 / 130 Expected Loss and Empirical Loss. The expected loss of any function f : X → Y is L(f) = E ℓ(f(x), y). Since P is unknown, we cannot calculate L(f). However, we can calculate the empirical loss of f : X → Y: L̂(f) = (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i).
26 / 130 ... again, what is random here? Since the data (x_1, y_1), ..., (x_n, y_n) are a random i.i.d. draw from P: L̂(f) is a random quantity; f̂_n is a random quantity (a random function, the output of our learning procedure after seeing data); hence L(f̂_n) is also a random quantity; but for a fixed f : X → Y, the quantity L(f) is not random! It is important that these are understood before we proceed further.
27 / 130 The Gold Standard. Within the framework we set up, the smallest expected loss is achieved by the Bayes optimal function f* = argmin_f L(f), where the minimization is over all (measurable) prediction rules f : X → Y. The value of the lowest expected loss is called the Bayes error: L(f*) = inf_f L(f). Of course, we cannot calculate any of these quantities since P is unknown.
28 / 130 Bayes Optimal Function. The Bayes optimal function f* takes the following forms in these two particular cases. Binary classification (Y = {0, 1}) with the indicator loss: f*(x) = I{η(x) ≥ 1/2}, where η(x) = E[Y | X = x]. Regression (Y = R) with square loss: f*(x) = η(x), where η(x) = E[Y | X = x].
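When P is known, the Bayes classifier can be written down and its error checked by simulation. A minimal sketch, assuming a toy model (not from the lecture) where x ~ Unif[0, 1] and η(x) = P(y = 1 | x) = x, so that f*(x) = I{x ≥ 1/2} and the Bayes error E[min(η, 1−η)] equals 1/4:

```python
import numpy as np

# Toy setup with known P: x ~ Unif[0,1], eta(x) = P(y=1|x) = x.
def eta(x):
    return x

def f_star(x):
    """Bayes classifier for the indicator loss: threshold eta at 1/2."""
    return (eta(x) >= 0.5).astype(int)

rng = np.random.default_rng(1)
x = rng.uniform(size=100_000)
y = (rng.uniform(size=x.size) < eta(x)).astype(int)   # y | x ~ Bernoulli(eta(x))

# Monte Carlo estimate of L(f*) = E[min(eta, 1-eta)] = integral of min(x,1-x) = 1/4.
empirical_error = np.mean(f_star(x) != y)
```

No classifier, however complex, can beat this error under this P; that is what "Bayes optimal" means.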
29 / 130 The big question: is there a way to construct a learning algorithm with a guarantee that L(f̂_n) − L(f*) is small for a large enough sample size n?
31 / 130 Consistency. An algorithm that ensures lim_{n→∞} L(f̂_n) = L(f*) almost surely is called consistent. Consistency ensures that our algorithm approaches the best possible prediction performance as the sample size increases. The good news: consistency is possible to achieve. It is easy if X is a finite or countable set, and not too hard if X is infinite and the underlying relationship between x and y is continuous.
32 / 130 The bad news... In general, we cannot prove anything interesting about L(f̂_n) − L(f*) unless we make further assumptions (incorporate prior knowledge). What do we mean by "nothing interesting"? This is the subject of the so-called No Free Lunch Theorems. Unless we posit further assumptions: (1) for any algorithm f̂_n, any n, and any ε > 0, there exists a distribution P such that L(f*) = 0 and E L(f̂_n) ≥ 1/2 − ε; (2) for any algorithm f̂_n and any sequence a_n that converges to 0, there exists a probability distribution P such that L(f*) = 0 and, for all n, E L(f̂_n) ≥ a_n. References: (Devroye, Györfi, Lugosi: A Probabilistic Theory of Pattern Recognition), (Bousquet, Boucheron, Lugosi, 2004).
33 / 130 Is this really bad news? Not really: we always have some domain knowledge. Two ways of incorporating prior knowledge: Direct way: assume that the distribution P is not arbitrary (also known as a modeling approach, generative approach, statistical modeling). Indirect way: redefine the goal to perform as well as a reference set F of predictors: L(f̂_n) − inf_{f∈F} L(f). This is known as a discriminative approach; F encapsulates our inductive bias.
34 / 130 Pros/Cons of the two approaches. Pros of the discriminative approach: we never assume that P takes some particular form; rather, we put our prior knowledge into the types of predictors that will do well. Cons: we cannot really interpret f̂_n. Pros of the generative approach: we can estimate the model / parameters of the distribution (inference). Cons: it is not clear what the analysis says if the assumption is actually violated. Both approaches have their advantages. A machine learning researcher or practitioner should ideally know both and understand their strengths and weaknesses. In this tutorial we only focus on the discriminative approach.
35 / 130 Example: Linear Discriminant Analysis. Consider the classification problem with Y = {0, 1}. Suppose the class-conditional densities are multivariate Gaussian with the same covariance Σ = I: p(x | y = 0) = (2π)^{−k/2} exp{−(1/2)‖x − μ_0‖²} and p(x | y = 1) = (2π)^{−k/2} exp{−(1/2)‖x − μ_1‖²}. The best (Bayes) classifier is f* = I{P(y = 1 | x) ≥ 1/2}, which corresponds to the half-space defined by the decision boundary p(x | y = 1) ≥ p(x | y = 0). This boundary is linear.
36 / 130 Example: Linear Discriminant Analysis. The (linear) optimal decision boundary comes from our generative assumption on the form of the underlying distribution. Alternatively, we could have indirectly postulated that we will be looking for a linear discriminant between the two classes, without making distributional assumptions. Such linear discriminant (classification) functions are I{⟨w, x⟩ ≥ b} for a unit-norm w and some bias b ∈ R. Quadratic Discriminant Analysis: if unequal covariance matrices Σ_1 and Σ_2 are assumed, the resulting boundary is quadratic. We can then define the classification function as I{q(x) ≥ 0}, where q(x) is a quadratic function.
37 / 130 Bias-Variance Tradeoff. How do we choose the inductive bias F? Decompose: L(f̂_n) − L(f*) = [L(f̂_n) − inf_{f∈F} L(f)] (Estimation Error) + [inf_{f∈F} L(f) − L(f*)] (Approximation Error). Clearly, the two terms are at odds with each other: making F larger means smaller approximation error but (as we will see) larger estimation error; taking a larger sample n means smaller estimation error and has no effect on the approximation error. Thus, it makes sense to trade off the size of F against n. This is called Structural Risk Minimization, or the Method of Sieves, or Model Selection.
38 / 130 Bias-Variance Tradeoff. We will only focus on the estimation error, yet the ideas we develop will make it possible to read about model selection on your own. Note: if we guessed correctly and f* ∈ F, then L(f̂_n) − L(f*) = L(f̂_n) − inf_{f∈F} L(f). For a particular problem, one hopes that prior knowledge about the problem can ensure that the approximation error inf_{f∈F} L(f) − L(f*) is small.
39 / 130 Occam's Razor. Occam's Razor is often quoted as a principle for choosing the simplest theory or explanation out of the possible ones. However, this is a rather philosophical argument, since "simplicity" is not uniquely defined. We will discuss this issue later. What we will do is try to understand "complexity" as it pertains to the behavior of certain stochastic processes. Such a question is well-defined mathematically.
40 / 130 Looking Ahead. So far: we represented prior knowledge by means of the class F. Looking forward, we can find an algorithm that, after looking at a dataset of size n, produces f̂_n such that L(f̂_n) − inf_{f∈F} L(f) decreases (in a certain sense which we will make precise) at a non-trivial rate which depends on the richness of F. This will give a sample complexity guarantee: how many samples are needed to make the error smaller than a desired accuracy.
42 / 130 Types of Bounds. In expectation vs in probability (control the mean vs control the tails): E{L(f̂_n) − inf_{f∈F} L(f)} < ψ(n) versus P(L(f̂_n) − inf_{f∈F} L(f) ≥ ε) < ψ(n, ε). The in-probability bound can be inverted as P(L(f̂_n) − inf_{f∈F} L(f) ≥ φ(δ, n)) < δ by setting δ = ψ(n, ε) and solving for ε. In this lecture, we are after the function φ(δ, n); we will call it the rate. "With high probability" typically means logarithmic dependence of φ(δ, n) on 1/δ. This is very desirable: the bound then grows only modestly even for high-confidence statements.
43 / 130 Sample Complexity. The sample complexity is the sample size required by the algorithm f̂_n to guarantee L(f̂_n) − inf_{f∈F} L(f) ≤ ε with probability at least 1 − δ. Of course, we just need to invert a bound P(L(f̂_n) − inf_{f∈F} L(f) ≥ φ(δ, n)) < δ by setting ε = φ(δ, n) and solving for n. In other words, n(ε, δ) is the sample complexity of the algorithm f̂_n if P(L(f̂_n) − inf_{f∈F} L(f) ≥ ε) ≤ δ as soon as n ≥ n(ε, δ). Hence, a rate can be translated into a sample complexity and vice versa. Easy to remember: rate O(1/√n) means O(1/ε²) sample complexity, whereas rate O(1/n) means a smaller O(1/ε) sample complexity.
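The rate-to-sample-complexity inversion is simple arithmetic; a sketch with illustrative constants (the specific φ shapes below are chosen for concreteness, with the constant in front set to 1):

```python
import math

# Rate phi(delta, n) = sqrt(log(1/delta) / n): setting phi = eps and solving
# for n gives n = log(1/delta) / eps^2, i.e. O(1/eps^2) sample complexity.
def n_for_slow_rate(eps, delta):
    return math.ceil(math.log(1 / delta) / eps ** 2)

# Rate phi(delta, n) = log(1/delta) / n gives the smaller
# n = log(1/delta) / eps, i.e. O(1/eps) sample complexity.
def n_for_fast_rate(eps, delta):
    return math.ceil(math.log(1 / delta) / eps)
```

For ε = 0.1 and δ = 0.05, the O(1/√n) rate demands roughly 300 samples while the O(1/n) rate demands roughly 30: halving ε quadruples the first and only doubles the second.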
44 / 130 Types of Bounds. Other distinctions to keep in mind. We can ask for bounds (either in expectation or in probability) on the following random variables: (A) L(f̂_n) − L(f*); (B) L(f̂_n) − inf_{f∈F} L(f); (C) L(f̂_n) − L̂(f̂_n); (D) sup_{f∈F} {L(f) − L̂(f)}; (E) sup_{f∈F} {L(f) − L̂(f) − pen_n(f)}. Let's make sure we understand the differences between these random quantities!
45 / 130 Types of Bounds. Upper bounds on (D) and (E) are used as tools for achieving the other bounds. Let's see why. Obviously, for any algorithm that outputs f̂_n ∈ F, L(f̂_n) − L̂(f̂_n) ≤ sup_{f∈F} {L(f) − L̂(f)}, and so a bound on (D) implies a bound on (C). How about a bound on (B)? Is it implied by (C) or (D)? It depends on what the algorithm does! Denote f_F = argmin_{f∈F} L(f). Suppose (D) is small. It then makes sense to ask the learning algorithm to minimize (or approximately minimize) the empirical error (why?).
46 / 130 Canonical Algorithms. Empirical Risk Minimization (ERM): f̂_n = argmin_{f∈F} L̂(f). Regularized Empirical Risk Minimization: f̂_n = argmin_{f∈F} L̂(f) + pen_n(f). We will deal with regularized ERM a bit later; for now, let's focus on ERM. Remark: to actually compute a minimizer of the above objectives, one needs to employ some optimization method. In practice, the objective might be optimized only approximately.
47 / 130 Performance of ERM. If f̂_n is an ERM, then L(f̂_n) − L(f_F) = {L(f̂_n) − L̂(f̂_n)} + {L̂(f̂_n) − L̂(f_F)} + {L̂(f_F) − L(f_F)} ≤ {L(f̂_n) − L̂(f̂_n)} + {L̂(f_F) − L(f_F)} (the middle term is non-positive for ERM) ≤ sup_{f∈F} {L(f) − L̂(f)} + {L̂(f_F) − L(f_F)}. So, (C) implies a bound on (B) when f̂_n is ERM (or close to ERM); likewise, (D) implies a bound on (B). What about the extra term L̂(f_F) − L(f_F)? The Central Limit Theorem says that for i.i.d. random variables with bounded second moment, the average converges to the expectation. Let's quantify this.
48 / 130 Hoeffding's Inequality. Let W, W_1, ..., W_n be i.i.d. such that P(a ≤ W ≤ b) = 1. Then P(EW − (1/n) Σ_{i=1}^n W_i > ε) ≤ exp(−2nε²/(b−a)²) and P((1/n) Σ_{i=1}^n W_i − EW > ε) ≤ exp(−2nε²/(b−a)²). Let W_i = ℓ(f_F(x_i), y_i). Clearly, W_1, ..., W_n are i.i.d. Then P(|L(f_F) − L̂(f_F)| > ε) ≤ 2 exp(−2nε²/(b−a)²), assuming a ≤ ℓ(f_F(x), y) ≤ b for all x ∈ X, y ∈ Y.
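Hoeffding's inequality is easy to check by simulation. A minimal sketch, assuming Bernoulli(1/2) variables (so a = 0, b = 1, EW = 1/2) with n and ε chosen arbitrarily for illustration:

```python
import numpy as np

# Empirical check of Hoeffding's tail bound for Bernoulli(1/2) variables.
rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 20_000
W = rng.integers(0, 2, size=(trials, n))       # i.i.d. {0,1} draws, rows = trials
deviation = W.mean(axis=1) - 0.5               # empirical mean minus EW
freq = np.mean(deviation > eps)                # observed tail probability
hoeffding_bound = np.exp(-2 * n * eps ** 2)    # exp(-2 n eps^2 / (b-a)^2), (b-a)=1
```

The observed tail frequency (around 0.02 here) sits well below the bound exp(−2) ≈ 0.135: Hoeffding is valid but not tight for any particular distribution.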
49 / 130 Wait, Are We Done? Can't we conclude directly that (C) is small? That is, P(E ℓ(f̂_n(x), y) − (1/n) Σ_{i=1}^n ℓ(f̂_n(x_i), y_i) > ε) ≤ 2 exp(−2nε²/(b−a)²)? No! The random variables ℓ(f̂_n(x_i), y_i) are not necessarily independent, and it is possible that E ℓ(f̂_n(x), y) ≠ E ℓ(f̂_n(x_i), y_i): the fresh pair (x, y) plays the role of W, while (x_i, y_i) plays the role of W_i, and EW ≠ EW_i in general because f̂_n depends on (x_i, y_i). The expected loss is out-of-sample performance, while the second term is in-sample. We say that (1/n) Σ_i ℓ(f̂_n(x_i), y_i) is a biased estimate of E ℓ(f̂_n(x), y). How bad can this bias be?
50 / 130 Example. X = [0, 1], Y = {0, 1}, ℓ(f(x_i), y_i) = I{f(x_i) ≠ y_i}; distribution P = P_x × P_{y|x} with P_x = Unif[0, 1] and P_{y|x} = δ_{y=1}; function class F = ∪_{n∈N} {f_S : S ⊂ X, |S| = n, f_S(x) = I{x ∈ S}}. The ERM f̂_n memorizes (perfectly fits) the data, but has no ability to generalize. Observe that 0 = (1/n) Σ_i ℓ(f̂_n(x_i), y_i) while E ℓ(f̂_n(x), y) = 1. This phenomenon is called overfitting.
51 / 130 Example. Not only is (C) large in this example; the uniform deviations (D) also do not converge to zero. For any n ∈ N and any (x_1, y_1), ..., (x_n, y_n): sup_{f∈F} {E_{x,y} ℓ(f(x), y) − (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)} = 1. Where do we go from here? Two approaches: 1. understand how to upper bound the uniform deviations (D); 2. find properties of algorithms that limit in some way the bias of (1/n) Σ_i ℓ(f̂_n(x_i), y_i). Stability and compression are two such approaches.
52 / 130 Uniform Deviations. We first focus on understanding sup_{f∈F} {E_{x,y} ℓ(f(x), y) − (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)}. If F = {f_0} consists of a single function, then clearly sup_{f∈F} {E ℓ(f(x), y) − (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)} = E ℓ(f_0(x), y) − (1/n) Σ_{i=1}^n ℓ(f_0(x_i), y_i). This quantity is O_P(1/√n) by Hoeffding's inequality, assuming a ≤ ℓ(f_0(x), y) ≤ b. Moral: for simple classes F the uniform deviations (D) can be bounded, while for rich classes they cannot. We will see how far we can push the size of F.
53 / 130 A bit of notation to simplify things... Let z_i = (x_i, y_i), so that the training data is {z_1, ..., z_n}; write g(z) = ℓ(f(x), y) for z = (x, y); the loss class is G = {g : g(z) = ℓ(f(x), y), f ∈ F} = ℓ ∘ F; ĝ = ℓ(f̂_n(·), ·); g_F = ℓ(f_F(·), ·); and g* = argmin_g E g(z) = ℓ(f*(·), ·) is the Bayes optimal (loss) function. We can now work with the set G, but keep in mind that each g ∈ G corresponds to an f ∈ F: g ∈ G ↔ f ∈ F. Once again, the quantity of interest is sup_{g∈G} {E g(z) − (1/n) Σ_{i=1}^n g(z_i)}. On the next slide, we visualize the deviations E g(z) − (1/n) Σ_{i=1}^n g(z_i) for all possible functions g and discuss all the concepts introduced so far.
54 / 130 Empirical Process Viewpoint. (Sequence of figures: each function g, ranging over all functions with the subset G marked, is plotted against its expectation E g and its empirical average (1/n) Σ_{i=1}^n g(z_i); the plots locate ĝ, g_F, and g*, and show the deviation between the two curves.)
55 / 130 Empirical Process Viewpoint. A stochastic process is a collection of random variables indexed by some set. An empirical process is a stochastic process {E g(z) − (1/n) Σ_{i=1}^n g(z_i)}_{g∈G} indexed by a function class G. Uniform Law of Large Numbers: sup_{g∈G} |E g − (1/n) Σ_{i=1}^n g(z_i)| → 0 in probability. Key question: how big can G be for the supremum of the empirical process to still be manageable?
56 / 130 Union Bound (Boole's inequality). Boole's inequality: for a finite or countable set of events, P(∪_j A_j) ≤ Σ_j P(A_j). Let G = {g_1, ..., g_N}. Then P(∃g ∈ G : E g − (1/n) Σ_{i=1}^n g(z_i) > ε) ≤ Σ_{j=1}^N P(E g_j − (1/n) Σ_{i=1}^n g_j(z_i) > ε). Assuming P(a ≤ g(z_i) ≤ b) = 1 for every g ∈ G, P(sup_{g∈G} {E g − (1/n) Σ_{i=1}^n g(z_i)} > ε) ≤ N exp(−2nε²/(b−a)²).
57 / 130 Finite Class. Alternatively, we set δ = N exp(−2nε²/(b−a)²) and write P(sup_{g∈G} {E g − (1/n) Σ_{i=1}^n g(z_i)} > (b−a)√((log N + log(1/δ))/(2n))) ≤ δ. Another way to write it: with probability at least 1 − δ, sup_{g∈G} {E g − (1/n) Σ_{i=1}^n g(z_i)} ≤ (b−a)√((log N + log(1/δ))/(2n)). Hence, with probability at least 1 − δ, the ERM algorithm f̂_n for a class F of cardinality N satisfies L(f̂_n) − inf_{f∈F} L(f) ≤ 2(b−a)√((log N + log(1/δ))/(2n)), assuming a ≤ ℓ(f(x), y) ≤ b for all f ∈ F, x ∈ X, y ∈ Y. The constant 2 is due to the L̂(f_F) − L(f_F) term; this is a loose upper bound.
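The finite-class bound is easy to evaluate numerically; a sketch with illustrative values of N, n, and δ (chosen here only to show the orders of magnitude):

```python
import math

# The finite-class deviation bound from the slide: with probability >= 1-delta,
#   sup_g {Eg - empirical average} <= (b-a) * sqrt((log N + log(1/delta)) / (2n)).
def finite_class_bound(N, n, delta, a=0.0, b=1.0):
    return (b - a) * math.sqrt((math.log(N) + math.log(1 / delta)) / (2 * n))

# Illustrative numbers: N = 1000 candidate functions, n = 10000 samples.
eps = finite_class_bound(N=1000, n=10_000, delta=0.05)
```

With these numbers the deviation is about 0.022. Note the logarithmic dependence on both N and 1/δ: squaring N or halving δ barely moves the bound, while the 1/√n decay dominates.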
58 / 130 Once again... A take-away message is that the following two statements are worlds apart: "with probability at least 1 − δ, for any g ∈ G, E g − (1/n) Σ_{i=1}^n g(z_i) ≤ ε" versus "for any g ∈ G, with probability at least 1 − δ, E g − (1/n) Σ_{i=1}^n g(z_i) ≤ ε". The second statement follows from the CLT, while the first statement is often difficult to obtain and only holds for some G.
60 / 130 Countable Class: Weighted Union Bound. Let G be countable, and fix a distribution w on G such that Σ_{g∈G} w(g) ≤ 1. For any δ > 0 and any fixed g ∈ G, P(E g − (1/n) Σ_{i=1}^n g(z_i) ≥ (b−a)√((log(1/w(g)) + log(1/δ))/(2n))) ≤ δ·w(g) by Hoeffding's inequality (easy to verify!). By the union bound, P(∃g ∈ G : E g − (1/n) Σ_{i=1}^n g(z_i) ≥ (b−a)√((log(1/w(g)) + log(1/δ))/(2n))) ≤ Σ_{g∈G} δ·w(g) ≤ δ. Therefore, with probability at least 1 − δ, for all f ∈ F: L(f) − L̂(f) ≤ (b−a)√((log(1/w(f)) + log(1/δ))/(2n)) =: pen_n(f).
61 / 130 Countable Class: Weighted Union Bound. If f̂_n is a regularized ERM, then L(f̂_n) − L(f_F) = {L(f̂_n) − L̂(f̂_n) − pen_n(f̂_n)} + {L̂(f̂_n) + pen_n(f̂_n) − L̂(f_F) − pen_n(f_F)} + {L̂(f_F) − L(f_F)} + pen_n(f_F) ≤ sup_{f∈F} {L(f) − L̂(f) − pen_n(f)} + {L̂(f_F) − L(f_F)} + pen_n(f_F). So, (E) implies a bound on (B) when f̂_n is regularized ERM. From the weighted union bound for a countable class: L(f̂_n) − L(f_F) ≤ {L̂(f_F) − L(f_F)} + pen_n(f_F) ≤ 2(b−a)√((log(1/w(f_F)) + log(1/δ))/(2n)).
62 / 130 Uncountable Class: Compression Bounds. Let us make the dependence of the algorithm f̂_n on the training set S = {(x_1, y_1), ..., (x_n, y_n)} explicit: f̂_n = f̂_n[S]. Suppose F has the property that there exists a compression function C_k which selects from any dataset S of any size a subset of k labeled examples C_k(S) ⊆ S, such that the algorithm can be written as f̂_n[S] = f̂_k[C_k(S)]. Then L(f̂_n) − L̂(f̂_n) = E ℓ(f̂_k[C_k(S)](x), y) − (1/n) Σ_{i=1}^n ℓ(f̂_k[C_k(S)](x_i), y_i) ≤ max_{I⊂{1,...,n}, |I|≤k} {E ℓ(f̂_k[S_I](x), y) − (1/n) Σ_{i=1}^n ℓ(f̂_k[S_I](x_i), y_i)}.
63 / 130 Uncountable Class: Compression Bounds. Since f̂_k[S_I] only depends on k out of the n points, the empirical average is mostly out of sample. Adding and subtracting (1/n) Σ_{(x′,y′)∈W_I} ℓ(f̂_k[S_I](x′), y′) for an additional set of k i.i.d. random variables W = {(x′_1, y′_1), ..., (x′_k, y′_k)} results in the upper bound max_{I⊂{1,...,n}, |I|≤k} {E ℓ(f̂_k[S_I](x), y) − (1/n) Σ_{(x,y)∈(S∖S_I)∪W_I} ℓ(f̂_k[S_I](x), y)} + (b−a)k/n. We appeal to the union bound over the (n choose k) possibilities, with a Hoeffding bound for each. Then, with probability at least 1 − δ, L(f̂_n) − inf_{f∈F} L(f) ≤ 2(b−a)√((k log(en/k) + log(1/δ))/(2n)) + (b−a)k/n, assuming a ≤ ℓ(f(x), y) ≤ b for all f ∈ F, x ∈ X, y ∈ Y.
64 / 130 Example: Classification with Thresholds in 1D. X = [0, 1], Y = {0, 1}, F = {f_θ : f_θ(x) = I{x ≥ θ}, θ ∈ [0, 1]}, ℓ(f_θ(x), y) = I{f_θ(x) ≠ y}. For any set of data (x_1, y_1), ..., (x_n, y_n), an ERM solution f̂_n has the property that the first occurrence x_l to the left of the threshold has label y_l = 0, while the first occurrence x_r to the right has label y_r = 1. It is then enough to take k = 2 and define f̂_n[S] = f̂_2[(x_l, 0), (x_r, 1)].
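ERM over this threshold class is computable exactly, since the empirical risk can only change at the data points. A minimal sketch, with a true threshold at 0.6 invented for the demonstration:

```python
import numpy as np

# ERM over 1-D thresholds f_theta(x) = I{x >= theta}: it suffices to check
# thresholds at the sorted sample points (plus one value above the maximum,
# which yields the all-zeros classifier).
def erm_threshold(x, y):
    candidates = np.concatenate([np.sort(x), [np.max(x) + 1.0]])
    errors = [np.mean((x >= t).astype(int) != y) for t in candidates]
    return candidates[int(np.argmin(errors))]

# Noiseless data generated by a true threshold at 0.6 (illustrative choice).
rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = (x >= 0.6).astype(int)

theta_hat = erm_threshold(x, y)
train_error = np.mean((x >= theta_hat).astype(int) != y)
```

The returned threshold is the smallest sample point labeled 1, matching the slide's compression argument: the fitted classifier is determined by just two labeled points, (x_l, 0) and (x_r, 1).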
65 / 130 Stability. Yet another way to limit the bias of (1/n) Σ_i ℓ(f̂_n(x_i), y_i) as an estimate of L(f̂_n) is through a notion of stability. An algorithm f̂_n is stable if a change (or removal) of a single data point does not change (in a certain mathematical sense) the function f̂_n by much. Of course, a dumb algorithm which outputs f̂_n = f_0 without even looking at the data is very stable, and the ℓ(f̂_n(x_i), y_i) are then independent random variables... but it is not a good algorithm! We would like an algorithm that both approximately minimizes the empirical error and is stable. It turns out that certain types of regularization methods are stable. Example: f̂_n = argmin_{f∈F} (1/n) Σ_{i=1}^n (f(x_i) − y_i)² + λ‖f‖²_K, where ‖·‖_K is the norm induced by the kernel of a reproducing kernel Hilbert space (RKHS) F.
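The regularized RKHS objective above has a closed-form solution by the representer theorem: f̂_n(x) = Σ_i α_i K(x_i, x) with α = (K + λnI)⁻¹ y. A minimal numpy sketch, assuming a Gaussian kernel and a sine target chosen purely for illustration:

```python
import numpy as np

# Kernel ridge regression: the regularized ERM from the slide with the
# Gaussian kernel K(s, t) = exp(-(s-t)^2 / (2 h^2)) on 1-D inputs.
def gaussian_kernel(A, B, bandwidth=0.2):
    d2 = (A[:, None] - B[None, :]) ** 2        # pairwise squared distances
    return np.exp(-d2 / (2 * bandwidth ** 2))

def kernel_ridge_fit(x, y, lam):
    """Solve (K + lam*n*I) alpha = y, the stationarity condition of the objective."""
    K = gaussian_kernel(x, x)
    n = len(x)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(x_train, alpha, x_new):
    return gaussian_kernel(x_new, x_train) @ alpha

rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=100)

alpha = kernel_ridge_fit(x, y, lam=1e-3)
y_hat = kernel_ridge_predict(x, alpha, x)
mse = np.mean((y_hat - y) ** 2)
```

The λnI term is exactly what makes the solution stable: perturbing one (x_i, y_i) perturbs α by an amount controlled by λ, which is the mechanism behind stability-based generalization bounds for such methods.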
66 / 130 Summary so far. We proved upper bounds on L(f̂_n) − L(f_F) for: ERM over a finite class; regularized ERM over a countable class (weighted union bound); ERM over classes F with the compression property; ERM or regularized ERM that are stable (only sketched). What about a more general situation? Is there a way to measure the complexity of F that tells us whether ERM will succeed?
68 / 130 Uniform Convergence and Symmetrization. Let z′_1, ..., z′_n be another i.i.d. sample from P (a "ghost sample"). Let ε_1, ..., ε_n be i.i.d. Rademacher random variables: P(ε_i = −1) = P(ε_i = +1) = 1/2. Let's go through a few manipulations: E sup_{g∈G} {E g(z) − (1/n) Σ_{i=1}^n g(z_i)} = E_{z_{1:n}} sup_{g∈G} {E_{z′_{1:n}} [(1/n) Σ_{i=1}^n g(z′_i)] − (1/n) Σ_{i=1}^n g(z_i)}. By Jensen's inequality, this is upper bounded by E_{z_{1:n}, z′_{1:n}} sup_{g∈G} {(1/n) Σ_{i=1}^n (g(z′_i) − g(z_i))}, which is equal to E_{ε_{1:n}} E_{z_{1:n}, z′_{1:n}} sup_{g∈G} {(1/n) Σ_{i=1}^n ε_i (g(z′_i) − g(z_i))}.
69 / 130 Uniform Convergence and Symmetrization. Continuing, E_{ε_{1:n}} E_{z_{1:n}, z′_{1:n}} sup_{g∈G} {(1/n) Σ_{i=1}^n ε_i (g(z′_i) − g(z_i))} ≤ E sup_{g∈G} {(1/n) Σ_{i=1}^n ε_i g(z′_i)} + E sup_{g∈G} {(1/n) Σ_{i=1}^n (−ε_i) g(z_i)} = 2 E sup_{g∈G} {(1/n) Σ_{i=1}^n ε_i g(z_i)}. The empirical Rademacher averages of G are defined as R̂_n(G) = E[sup_{g∈G} (1/n) Σ_{i=1}^n ε_i g(z_i) | z_1, ..., z_n]. The Rademacher average (or Rademacher complexity) of G is R_n(G) = E_{z_{1:n}} R̂_n(G).
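The empirical Rademacher complexity can be estimated by Monte Carlo whenever the sup over the class is computable. A sketch for the 1-D threshold class from the earlier example (my choice here, since its projection onto n points has at most n + 1 distinct sign patterns, making the sup a finite max):

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of the
# threshold class F = {x -> I{x >= theta}} on a fixed sample x_1, ..., x_n.
rng = np.random.default_rng(0)
n = 100
x = np.sort(rng.uniform(size=n))

# Rows: the n+1 distinct vectors (f_theta(x_1), ..., f_theta(x_n)), obtained
# by placing the threshold at each sample point and above the maximum.
patterns = np.array([(x >= t).astype(float)
                     for t in np.concatenate([x, [x[-1] + 1]])])

def empirical_rademacher(patterns, n_mc=2000, seed=1):
    r = np.random.default_rng(seed)
    eps = r.choice([-1.0, 1.0], size=(n_mc, patterns.shape[1]))
    # For each draw of signs, the sup over F is a max over the finite patterns.
    sups = (eps @ patterns.T).max(axis=1) / patterns.shape[1]
    return sups.mean()

R_hat = empirical_rademacher(patterns)
```

The estimate comes out on the order of √(log n / n) ≈ 0.1, consistent with the finite-projection bound on the next slides; a richer class (more realizable patterns) would push it up toward a constant.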
70 / 130 Classification: the Loss Function Disappears. Let us focus on binary classification with the indicator loss, and let F be a class of {0, 1}-valued functions. We have ℓ(f(x), y) = I{f(x) ≠ y} = (1 − 2y)f(x) + y, and thus R̂_n(G) = E[sup_{f∈F} (1/n) Σ_{i=1}^n ε_i (f(x_i)(1 − 2y_i) + y_i) | (x_1, y_1), ..., (x_n, y_n)] = E[sup_{f∈F} (1/n) Σ_{i=1}^n ε_i f(x_i) | x_1, ..., x_n] = R̂_n(F), because, given y_1, ..., y_n, the distribution of ε_i(1 − 2y_i) is the same as that of ε_i.
71 / 130 Vapnik-Chervonenkis Theory for Classification. We are now left examining E[sup_{f∈F} (1/n) Σ_{i=1}^n ε_i f(x_i) | x_1, ..., x_n]. Given x_1, ..., x_n, define the projection of F onto the sample: F|_{x_1,...,x_n} = {(f(x_1), ..., f(x_n)) : f ∈ F} ⊆ {0, 1}^n. Clearly, this is a finite set, and R̂_n(F) = E_ε max_{v∈F|_{x_1,...,x_n}} (1/n) Σ_{i=1}^n ε_i v_i ≤ √(2 log card(F|_{x_1,...,x_n}) / n). This is because a maximum of N (sub)gaussian random variables grows as √(log N). The bound is nontrivial as long as log card(F|_{x_1,...,x_n}) = o(n).
72 / 130 Vapnik-Chervonenkis Theory for Classification
The growth function is defined as
  Π_F(n) = max { card(F|_{x_1,...,x_n}) : x_1, ..., x_n ∈ X }
The growth function measures the expressiveness of F. In particular, if F can produce all possible signs (that is, Π_F(n) = 2^n), the bound becomes useless.
We say that F shatters some set x_1, ..., x_n if F|_{x_1,...,x_n} = {0,1}^n. The Vapnik-Chervonenkis (VC) dimension of the class F is defined as
  vc(F) = max { d : Π_F(d) = 2^d }
Vapnik-Chervonenkis-Sauer-Shelah Lemma: If d = vc(F) < ∞, then
  Π_F(n) ≤ Σ_{i=0}^d (n choose i) ≤ (en/d)^d
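The growth function and the Sauer-Shelah bound can be verified by brute force for a toy class: one-dimensional thresholds f_θ(x) = I{x ≥ θ}, which have vc = 1 and realize exactly n + 1 labelings on n distinct points (the sample below is an arbitrary illustrative choice).

```python
from math import comb

# Toy class: thresholds on the real line, f_theta(x) = 1{x >= theta}; vc = 1.
points = [0.5, 1.3, 2.7, 4.0, 5.9, 7.2]  # n distinct sample points (illustrative)
n = len(points)

# One threshold below all points (all-ones labeling) plus one just above each point.
thresholds = [min(points) - 1] + [p + 1e-9 for p in points]
labelings = {tuple(1 if x >= t else 0 for x in points) for t in thresholds}

# Sauer-Shelah with d = 1: card(F|_x) <= sum_{i<=1} C(n, i) = n + 1
d = 1
sauer = sum(comb(n, i) for i in range(d + 1))
print(len(labelings), sauer)
assert len(labelings) == n + 1
assert len(labelings) <= sauer
```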
73 / 130 Vapnik-Chervonenkis Theory for Classification
Conclusion: for any F with vc(F) = d < ∞, the ERM algorithm satisfies
  E { L(f̂_n) − inf_{f∈F} L(f) } ≤ 2 √( 2 d log(en/d) / n )
While we proved the result in expectation, the same type of bound holds with high probability. VC dimension is a combinatorial dimension of a binary-valued function class. Its finiteness is necessary and sufficient for learnability if we place no assumptions on the distribution P.
Remark: the bound is similar to that obtained through compression. In fact, the exact relationship between compression and VC dimension is still an open question.
74 / 130 Vapnik-Chervonenkis Theory for Classification
Examples of VC classes:
- Half-spaces F = { I{⟨w,x⟩ + b ≥ 0} : w ∈ R^d, ‖w‖ = 1, b ∈ R } have vc(F) = d + 1.
- For a vector space H of dimension d, the VC dimension of F = { I{h(x) ≥ 0} : h ∈ H } is at most d.
- The set of Euclidean balls F = { I{Σ_{i=1}^d (x_i − a_i)² ≤ b} : a ∈ R^d, b ∈ R } has VC dimension at most d + 2.
- Functions that can be computed using a finite number of arithmetic operations (see (Goldberg and Jerrum, 1995)).
However: F = { f_α(x) = I{sin(αx) ≥ 0} : α ∈ R } has infinite VC dimension, so it is not correct to think of VC dimension as the number of parameters!
Unfortunately, VC theory is unable to explain the good performance of neural networks and Support Vector Machines! This prompted the development of a margin-based theory.
76 / 130 Classification with Real-Valued Functions
Many methods use I(F) = { I{f ≥ 0} : f ∈ F } for classification. The VC dimension can be very large, yet in practice the methods work well.
Example: f(x) = f_w(x) = ⟨w, ψ(x)⟩, where ψ is a mapping to a high-dimensional feature space (see Kernel Methods). The VC dimension of the set is typically huge (equal to the dimensionality of ψ(x)) or infinite, yet the methods perform well! Is there an explanation beyond VC theory?
77 / 130 Margins
Hard margin: ∃ f ∈ F such that for all i, y_i f(x_i) ≥ γ.
More generally, we hope to have some f ∈ F for which card({ i : y_i f(x_i) < γ }) is small.
[Figure: data separated by f with margin γ.]
78 / 130 Surrogate Loss
Define
  φ(s) = 1 if s ≤ 0;  1 − s/γ if 0 < s < γ;  0 if s ≥ γ.
Then
  I{y ≠ sign(f(x))} = I{yf(x) ≤ 0} ≤ φ(yf(x)) ≤ ψ(yf(x)) = I{yf(x) ≤ γ}
The function φ is an example of a surrogate loss function.
[Figure: the ramp φ(yf(x)) sandwiched between I{yf(x) ≤ 0} and I{yf(x) ≤ γ}, plotted against yf(x).]
Let
  L_φ(f) = E φ(yf(x))  and  L̂_φ(f) = (1/n) Σ_{i=1}^n φ(y_i f(x_i)).
Then
  L(f) ≤ L_φ(f),  L̂_φ(f) ≤ L̂_ψ(f).
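The sandwich between the indicator loss, the ramp φ, and the margin indicator ψ can be checked numerically; γ = 0.5 is an arbitrary illustrative choice.

```python
import random

GAMMA = 0.5  # margin parameter (illustrative choice)

def ramp(s, gamma=GAMMA):
    """Ramp surrogate: 1 for s <= 0, linear on (0, gamma), 0 for s >= gamma."""
    if s <= 0:
        return 1.0
    if s >= gamma:
        return 0.0
    return 1.0 - s / gamma

random.seed(0)
for _ in range(10_000):
    s = random.uniform(-2, 2)                 # s plays the role of the margin y*f(x)
    zero_one = 1.0 if s <= 0 else 0.0         # I{yf(x) <= 0}
    margin_loss = 1.0 if s <= GAMMA else 0.0  # I{yf(x) <= gamma}
    assert zero_one <= ramp(s) <= margin_loss # the sandwich from the slide
print("sandwich holds on 10k random margins")
```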
79 / 130 Surrogate Loss
Now consider uniform deviations for the surrogate loss:
  E sup_{f∈F} { L_φ(f) − L̂_φ(f) }
We had shown that this quantity is at most 2 R_n(φ∘F) for
  φ∘F = { g(z) = φ(yf(x)) : f ∈ F },  z = (x, y)
A useful property of Rademacher averages: R_n(φ∘F) ≤ L·R_n(F) if φ is L-Lipschitz. Observe that in our example φ is 1/γ-Lipschitz. Hence,
  E sup_{f∈F} { L_φ(f) − L̂_φ(f) } ≤ (2/γ) R_n(F)
80 / 130 Margin Bound
Same result in high probability: with probability at least 1 − δ,
  sup_{f∈F} { L_φ(f) − L̂_φ(f) } ≤ (2/γ) R_n(F) + √( log(1/δ) / (2n) )
With probability at least 1 − δ, for all f ∈ F,
  L(f) ≤ L̂_ψ(f) + (2/γ) R_n(F) + √( log(1/δ) / (2n) )
If f̂_n is minimizing the margin loss,
  f̂_n = arg min_{f∈F} (1/n) Σ_{i=1}^n φ(y_i f(x_i)),
then with probability at least 1 − δ,
  L(f̂_n) ≤ inf_{f∈F} L_ψ(f) + (4/γ) R_n(F) + 2 √( log(1/δ) / (2n) )
Note: φ assumes knowledge of γ, but this assumption can be removed.
82 / 130 Useful Properties
1. If F ⊆ G, then R̂_n(F) ≤ R̂_n(G)
2. R̂_n(F) = R̂_n(conv(F))
3. For any c ∈ R, R̂_n(cF) = |c| R̂_n(F)
4. If φ: R → R is L-Lipschitz (that is, |φ(a) − φ(b)| ≤ L|a − b| for all a, b ∈ R), then R̂_n(φ∘F) ≤ L R̂_n(F)
83 / 130 Rademacher Complexity of Kernel Classes
Feature map φ: X → ℓ_2 and p.d. kernel K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩. The set F_B = { f(x) = ⟨w, φ(x)⟩ : ‖w‖ ≤ B } is a ball in H. Reproducing property: f(x) = ⟨f, K(x, ·)⟩.
An easy calculation shows that the empirical Rademacher averages are upper bounded as
  R̂_n(F_B) = E sup_{f∈F_B} (1/n) Σ_{i=1}^n ɛ_i f(x_i)
           = E sup_{f∈F_B} (1/n) Σ_{i=1}^n ɛ_i ⟨f, K(x_i, ·)⟩
           = E sup_{f∈F_B} ⟨f, (1/n) Σ_{i=1}^n ɛ_i K(x_i, ·)⟩
           = (B/n) E ‖ Σ_{i=1}^n ɛ_i K(x_i, ·) ‖
           ≤ (B/n) ( E Σ_{i,j=1}^n ɛ_i ɛ_j ⟨K(x_i, ·), K(x_j, ·)⟩ )^{1/2}
           = (B/n) ( Σ_{i=1}^n K(x_i, x_i) )^{1/2}
A data-independent bound of O(Bκ/√n) can be obtained if sup_{x∈X} K(x, x) ≤ κ². The κ and B are the effective dimensions.
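The last inequality above (Jensen's inequality pulling the expectation inside the square root) can be sanity-checked by Monte Carlo for an RBF kernel; the points and the kernel choice are illustrative.

```python
import math, random

# Monte-Carlo check: E sqrt(eps' G eps) <= sqrt(trace(G)), where G is the Gram
# matrix, since E[eps' G eps] = trace(G) for Rademacher signs (illustrative data).
random.seed(1)
n = 30
xs = [random.uniform(-3, 3) for _ in range(n)]

def K(a, b):
    return math.exp(-(a - b) ** 2)  # RBF kernel, so K(x, x) = 1

gram = [[K(a, b) for b in xs] for a in xs]
trace = sum(gram[i][i] for i in range(n))  # equals n here

trials = 500
est = 0.0
for _ in range(trials):
    eps = [random.choice((-1, 1)) for _ in range(n)]
    quad = sum(eps[i] * eps[j] * gram[i][j] for i in range(n) for j in range(n))
    est += math.sqrt(max(quad, 0.0))
est /= trials

bound = math.sqrt(trace)  # Jensen: E sqrt(Q) <= sqrt(E Q) = sqrt(trace)
print(est / n, bound / n)  # both correspond to the B = 1 Rademacher bound
assert est <= bound * 1.02  # small slack for Monte-Carlo error
```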
84 / 130 Other Examples
Using properties of Rademacher averages, we may establish guarantees for learning with neural networks, decision trees, and so on. A powerful technique, typically requiring only a few lines of algebra. Occasionally, covering numbers and scale-sensitive dimensions can be easier to deal with.
86 / 130 Real-Valued Functions: Covering Numbers
Consider a class F of [−1, 1]-valued functions, let Y = [−1, 1] and ℓ(f(x), y) = |f(x) − y|. We have
  E sup_{f∈F} { L(f) − L̂(f) } ≤ 2 E R̂_n(F)
For real-valued functions the cardinality of F|_{x_1,...,x_n} is infinite. However, similar functions f and f′ with
  (f(x_1), ..., f(x_n)) ≈ (f′(x_1), ..., f′(x_n))
should be treated as the same.
87 / 130 Real-Valued Functions: Covering Numbers
Given α > 0, suppose we can find V ⊂ [−1, 1]^n of finite cardinality such that
  ∀f ∃ v^f ∈ V s.t. (1/n) Σ_{i=1}^n |f(x_i) − v^f_i| ≤ α
Then
  R̂_n(F) = E_ɛ sup_{f∈F} (1/n) Σ_{i=1}^n ɛ_i f(x_i)
         = E_ɛ sup_{f∈F} { (1/n) Σ_{i=1}^n ɛ_i (f(x_i) − v^f_i) + (1/n) Σ_{i=1}^n ɛ_i v^f_i }
         ≤ α + E_ɛ max_{v∈V} (1/n) Σ_{i=1}^n ɛ_i v_i
Now we are back to a set of finite cardinality:
  R̂_n(F) ≤ α + √( 2 log card(V) / n )
88 / 130 Real-Valued Functions: Covering Numbers
Such a set V is called an α-cover (or α-net). More precisely, a set V is an α-cover with respect to the ℓ_p norm if
  ∀f ∃ v^f ∈ V s.t. (1/n) Σ_{i=1}^n |f(x_i) − v^f_i|^p ≤ α^p
The size of the smallest α-cover is denoted by N_p(F|_{x_1,...,x_n}, α).
[Figure: two sets of levels provide an α-cover for the four functions; only the values of the functions on x_1, ..., x_n are relevant.]
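An α-cover as defined above can be constructed greedily: scan the class and add a function to V only if it is not already within α of some cover element. A minimal sketch on a made-up class of clipped ramps:

```python
# Greedy construction of an alpha-cover in the empirical (normalized) l_1 norm.
# The class is illustrative: ramps f_theta(x) = clip(x - theta, 0, 1) on a grid.

def l1_dist(u, v):
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

xs = [i / 20 for i in range(21)]                  # sample x_1, ..., x_n (n = 21)
thetas = [i / 100 for i in range(101)]            # 101 functions in the class
vectors = [tuple(min(max(x - t, 0.0), 1.0) for x in xs) for t in thetas]

alpha = 0.1
cover = []
for v in vectors:                                 # add v only if not yet covered
    if all(l1_dist(v, c) > alpha for c in cover):
        cover.append(v)

# By construction, every function is within alpha of some cover element.
assert all(any(l1_dist(v, c) <= alpha for c in cover) for v in vectors)
assert len(cover) < len(vectors)                  # the cover is much smaller
print(len(cover), "cover elements for", len(vectors), "functions")
```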
89 / 130 Real-Valued Functions: Covering Numbers
We have proved that for any x_1, ..., x_n,
  R̂_n(F) ≤ inf_{α≥0} { α + √( 2 log N_1(F|_{x_1,...,x_n}, α) / n ) }
A better bound (called the Dudley entropy integral):
  R̂_n(F) ≤ inf_{α≥0} { 4α + (12/√n) ∫_α^1 √( log N_2(F|_{x_1,...,x_n}, δ) ) dδ }
90 / 130 Example: Nondecreasing Functions
Consider the set F of nondecreasing functions R → [−1, 1]. While F is a very large set, F|_{x_1,...,x_n} is not that large:
  N_1(F|_{x_1,...,x_n}, α) ≤ N_2(F|_{x_1,...,x_n}, α) ≤ n^{2/α}
The first bound on the previous slide yields
  inf_{α≥0} { α + √( (2/α) log(n) / n ) } = Õ(n^{−1/3})
while the second bound (the Dudley entropy integral) gives
  inf_{α≥0} { 4α + (12/√n) ∫_α^1 √( (4/δ) log(n) ) dδ } = Õ(n^{−1/2})
where the Õ notation hides logarithmic factors.
91 / 130 Scale-Sensitive Dimensions
We say that F ⊆ R^X α-shatters a set (x_1, ..., x_T) if there exist (y_1, ..., y_T) ∈ R^T (called a witness to shattering) with the following property:
  ∀ (b_1, ..., b_T) ∈ {0,1}^T ∃ f ∈ F s.t. f(x_t) > y_t + α/2 if b_t = 1 and f(x_t) < y_t − α/2 if b_t = 0
The fat-shattering dimension of F at scale α, denoted by fat(F, α), is the size of the largest α-shattered set.
Wait, another measure of complexity of F? How is it related to covering numbers?
Theorem (Mendelson & Vershynin): For F ⊆ [−1, 1]^X and any 0 < α < 1,
  N_2(F|_{x_1,...,x_n}, α) ≤ (2/α)^{K·fat(F, cα)}
where K, c are positive absolute constants.
92 / 130 Quick Summary
We are after uniform deviations in order to understand the performance of ERM. Rademacher averages are a nice measure with useful properties. They can be further upper bounded by covering numbers through the Dudley entropy integral. In turn, covering numbers can be controlled via the fat-shattering combinatorial dimension. Whew!
94 / 130 Faster Rates
Are there situations when
  E L(f̂_n) − inf_{f∈F} L(f)
approaches 0 faster than O(1/√n)? Yes! We can beat the Central Limit Theorem! How is this possible?? Recall that the CLT tells us about convergence of the average to the expectation for random variables with bounded second moment. What if this variance is small?
95 / 130 Faster Rates: Classification
Consider the problem of binary classification with the indicator loss and a class F of {0,1}-valued functions. For any f ∈ F,
  (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)
is an average of Bernoulli random variables with bias p = E ℓ(f(x), y). Exact expression for the binomial tail:
  P( L(f) − L̂(f) > ɛ ) = Σ_{i=0}^{⌊n(p−ɛ)⌋} (n choose i) p^i (1 − p)^{n−i}
Further upper bounds:
  ≤ exp{ − nɛ² / (2p(1−p) + 2ɛ/3) }   (Bernstein)
  ≤ exp{ − 2nɛ² }                     (Hoeffding)
96 / 130 Faster Rates: Classification
Inverting
  exp{ − nɛ² / (2p(1−p) + 2ɛ/3) } ≤ exp{ − nɛ² / (2p + 2ɛ/3) } = δ
yields that for any f ∈ F, with probability at least 1 − δ,
  L(f) ≤ L̂(f) + √( 2 L(f) log(1/δ) / n ) + 2 log(1/δ) / (3n)
For non-negative numbers A, B, C,
  A ≤ B + C√A implies A ≤ B + C² + C√B
Therefore, for any f ∈ F, with probability at least 1 − δ,
  L(f) ≤ L̂(f) + √( 2 L̂(f) log(1/δ) / n ) + c log(1/δ) / n
for an absolute constant c.
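The elementary implication used above can be stress-tested numerically at the extremal A (the largest A satisfying the premise, obtained by solving the quadratic in √A):

```python
import math, random

# Check: A <= B + C*sqrt(A)  ==>  A <= B + C^2 + C*sqrt(B), for A, B, C >= 0,
# by evaluating the largest A that satisfies the premise.
random.seed(2)
for _ in range(100_000):
    B = random.uniform(0, 100)
    C = random.uniform(0, 100)
    # largest A with A <= B + C*sqrt(A): sqrt(A) = (C + sqrt(C^2 + 4B)) / 2
    root = (C + math.sqrt(C * C + 4 * B)) / 2
    A = root * root
    assert A <= B + C * C + C * math.sqrt(B) + 1e-9
print("implication verified on 100k random (B, C) pairs")
```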
97 / 130 Faster Rates: Classification
By the union bound, for F with finite N = card(F), with probability at least 1 − δ, for all f ∈ F,
  L(f) ≤ L̂(f) + √( 2 L̂(f) log(N/δ) / n ) + c log(N/δ) / n
For an empirical minimizer f̂_n, with probability at least 1 − δ, a zero empirical loss L̂(f̂_n) = 0 implies
  L(f̂_n) ≤ 4 log(N/δ) / n
This happens, for instance, in the so-called noiseless case: L(f_F) = 0. Indeed, then L̂(f_F) = 0 and thus L̂(f̂_n) = 0.
98 / 130 Summary: Minimax Viewpoint
Value of a game where we choose an algorithm, Nature chooses a distribution P ∈ P, and our payoff is the expected loss of our algorithm relative to the best in F:
  V_iid(F, P, n) = inf_{f̂} sup_{P∈P} { E L(f̂_n) − inf_{f∈F} L(f) }
If we make no assumption on the distribution P, then P is the set of all distributions. Many of the results we obtained in this lecture are for this distribution-free case. However, one may view margin-based results and the above fast rates for the noiseless case as studying V_iid(F, P, n) when P is nicer.
100 / 130 Model Selection
For a given class F, we have proved statements of the type
  P( sup_{f∈F} { L(f) − L̂(f) } ≥ φ(δ, n, F) ) < δ
Now, take a countable nested sieve of models F_1 ⊆ F_2 ⊆ ... such that H = ∪_{i=1}^∞ F_i is a very large set that will surely capture the Bayes function. For a function f ∈ H, let k(f) be the smallest index of an F_k that contains f. Let us write φ_n(δ, i) for φ(δ, n, F_i). Let us put a distribution w(i) on the models, with Σ_{i=1}^∞ w(i) = 1. Then for every i,
  P( sup_{f∈F_i} { L(f) − L̂(f) } ≥ φ_n(δ w(i), i) ) < δ w(i)
simply by replacing δ with δ w(i).
101 / 130 Now, taking a union bound:
  P( sup_{f∈H} { L(f) − L̂(f) − φ_n(δ w(k(f)), k(f)) } ≥ 0 ) < Σ_i δ w(i) = δ
Consider the penalized method
  f̂_n = arg min_{f∈H} { L̂(f) + φ_n(δ w(k(f)), k(f)) } = arg min_{i, f∈F_i} { L̂(f) + φ_n(δ w(i), i) }
This balances fit to data and the complexity of the model. Of course, this is exactly a regularized ERM of the form analyzed earlier.
[Figure: nested models F_1 ⊆ ... ⊆ F_k and the optimal function f*.]
Let k* = k(f*) be the (smallest) model F_i that contains the optimal function.
102 / 130 Exactly as on the slide "Countable Class: Weighted Union Bound",
  L(f̂_n) − L(f*) = { L(f̂_n) − L̂(f̂_n) − pen_n(f̂_n) } + { L̂(f̂_n) + pen_n(f̂_n) − L̂(f*) − pen_n(f*) } + { L̂(f*) − L(f*) } + pen_n(f*)
               ≤ L̂(f*) − L(f*) + pen_n(f*) = L̂(f*) − L(f*) + φ_n(δ w(k*), k*)
The first part of this bound is O_P(1/√n) by the CLT, just as before. If the dependence of φ on 1/δ is logarithmic, then taking w(i) = 2^{−i} simply implies an additional additive term in i, a penalty for not knowing the model in advance.
Conclusion: given uniform deviation bounds for a single class F, as developed earlier, we can perform model selection by penalizing model complexity!
105 / 130 Looking Back: Statistical Learning
- the future looks like the past: modeled as i.i.d. data
- evaluated on a random sample from the same distribution
- developed various measures of complexity of F
106 / 130 Example #1: Bit Prediction
Predict a binary sequence y_1, y_2, ... ∈ {0,1}, which is revealed one by one. At step t, make a prediction z_t of the t-th bit, then y_t is revealed. Let c_t = I{z_t = y_t}. Goal: make c̄_n = (1/n) Σ_{t=1}^n c_t large.
Suppose we are told that the sequence presented is Bernoulli with an unknown bias p. How should we choose predictions?
107 / 130 Example #1: Bit Prediction
Of course, we should do a majority vote over the past outcomes:
  z_t = I{ȳ_{t−1} ≥ 1/2}  where  ȳ_{t−1} = (1/(t−1)) Σ_{s=1}^{t−1} y_s
This algorithm guarantees c̄_n → max{p, 1 − p} and
  lim inf_{n→∞} ( c̄_n − max{ȳ_n, 1 − ȳ_n} ) ≥ 0 almost surely   (*)
Claim: there is an algorithm that ensures (*) for an arbitrary sequence. Any idea how to do it?
Another way to formulate (*): the number of mistakes should be not much more than that made by the best of the two experts, one predicting 1 all the time, the other constantly predicting 0.
Note the difference: estimating a hypothesized model vs. competing against a reference set. We had seen this distinction in the previous lecture.
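The majority-vote predictor is easy to simulate on a Bernoulli sequence; p = 0.8 and the sequence length are arbitrary illustrative choices (the first prediction, before any outcomes are seen, is set arbitrarily).

```python
import random

# Simulate z_t = 1{mean(y_1..y_{t-1}) >= 1/2} on a Bernoulli(p) sequence;
# its accuracy should approach max(p, 1 - p).
random.seed(3)
p, n = 0.8, 20_000
ys = [1 if random.random() < p else 0 for _ in range(n)]

correct, ones = 0, 0
for t, y in enumerate(ys, start=1):
    z = 1 if (t == 1 or ones / (t - 1) >= 0.5) else 0  # arbitrary first prediction
    correct += (z == y)
    ones += y

accuracy = correct / n
best_constant = max(ones / n, 1 - ones / n)  # best of the two constant experts
print(accuracy, best_constant)
assert accuracy > 0.77                    # close to max(p, 1-p) = 0.8
assert accuracy >= best_constant - 0.01   # property (*) on this sequence
```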
108 / 130 Example #2: Spam Detection
We are tasked with developing a spam detection program that needs to be adaptive to malicious attacks.
- x_1, ..., x_n are messages, revealed one by one
- upon observing the message x_t, the learner (spam detector) needs to decide whether it is spam or not spam (ŷ_t ∈ {0,1})
- the actual label y_t ∈ {0,1} is revealed (e.g., by the user)
Does it seem plausible that (x_1, y_1), ..., (x_n, y_n) are i.i.d. from some distribution P? Probably not... In fact, the sequence might even be adversarially chosen: spammers adapt and try to improve their strategies.
110 / 130 Online Learning (Supervised)
No assumption that there is a single distribution P. Data are not given all at once, but rather in an online fashion. As before, X is the space of inputs, Y the space of outputs, with loss function ℓ(y_1, y_2).
Online protocol (supervised learning):
  For t = 1, ..., n: observe x_t, predict ŷ_t, observe y_t
Goal: keep regret small:
  Reg_n = (1/n) Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} (1/n) Σ_{t=1}^n ℓ(f(x_t), y_t)
A bound on Reg_n should hold for any sequence (x_1, y_1), ..., (x_n, y_n)!
111 / 130 Pros/Cons of Online Learning
The good:
- An upper bound on regret implies good performance relative to the set F no matter how adversarial the sequence is.
- Online methods are typically computationally attractive, as they process one data point at a time. Used when data sets are huge.
- Interesting research connections to Game Theory, Information Theory, Statistics, Computer Science.
The bad:
- A regret bound implies good performance only if one of the elements of F has good performance (just as in Statistical Learning). However, for non-i.i.d. sequences a single f ∈ F might not be good at all! To alleviate this problem, the comparator set F can be made into a set of more complex strategies.
- There might be some (non-i.i.d.) structure of sequences that we are not exploiting (this is an interesting area of research!)
112 / 130 Setting Up the Minimax Value
First, it turns out that ŷ_t has to be a randomized prediction: we need to decide on a distribution q_t ∈ Δ(Y) and then draw ŷ_t from q_t. The minimax best that both the learner and the adversary (or, Nature) can do is
  V(F, n) = sup_{x_1∈X} inf_{q_1} sup_{y_1∈Y} E_{ŷ_1∼q_1} ··· sup_{x_n∈X} inf_{q_n} sup_{y_n∈Y} E_{ŷ_n∼q_n} { (1/n) Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} (1/n) Σ_{t=1}^n ℓ(f(x_t), y_t) }
This is an awkward and long expression, so no need to be worried. All you need to know right now is:
- An upper bound on V(F, n) guarantees existence of a strategy (learning algorithm) that will suffer at most that much regret.
- A lower bound on V(F, n) means the adversary can inflict at least that much damage, no matter what the learning algorithm does.
It is interesting to study V(F, n)! It turns out, many of the tools we used in Statistical Learning can be extended to study Online Learning!
113 / 130 Sequential Rademacher Complexity
A (complete binary) X-valued tree x of depth n is a collection of functions x_1, ..., x_n such that x_i: {±1}^{i−1} → X (and x_1 is a constant function). A sequence ɛ = (ɛ_1, ..., ɛ_n) defines a path in x:
  x_1, x_2(ɛ_1), x_3(ɛ_1, ɛ_2), ..., x_n(ɛ_1, ..., ɛ_{n−1})
Define the sequential Rademacher complexity as
  R_seq(F, n) = sup_x E_ɛ sup_{f∈F} { (1/n) Σ_{t=1}^n ɛ_t f(x_t(ɛ_{1:t−1})) }
where the supremum is over all X-valued trees of depth n.
Theorem. Let Y = {0,1} and let F be a class of binary-valued functions. Let ℓ be the indicator loss. Then
  V(F, n) ≤ 2 R_seq(F, n)
114 / 130 Finite Class
Suppose F is finite, N = card(F). Then for any tree x,
  E_ɛ sup_{f∈F} { (1/n) Σ_{t=1}^n ɛ_t f(x_t(ɛ_{1:t−1})) } ≤ √( 2 log N / n )
because, again, this is a maximum of N (sub)gaussian random variables! Hence,
  V(F, n) ≤ 2 √( 2 log N / n )
This bound is basically the same as that for Statistical Learning with a finite number of functions! Therefore, there must exist an algorithm for predicting ŷ_t given x_t such that regret scales as O(√(log N / n)). What is it?
115 / 130 Exponential Weights, or the Experts Algorithm
We think of each element of {f_1, ..., f_N} = F as an expert who gives a prediction f_i(x_t) given side information x_t. We keep a distribution w_t over experts, according to their performance. Let w_1 = (1/N, ..., 1/N) and η = √( (8 log N) / n ). To predict at round t, observe x_t, pick i_t ∼ w_t and set ŷ_t = f_{i_t}(x_t). Update:
  w_{t+1}(i) ∝ w_t(i) exp{ −η I{f_i(x_t) ≠ y_t} }
Claim: for any sequence (x_1, y_1), ..., (x_n, y_n), with probability at least 1 − δ,
  (1/n) Σ_{t=1}^n I{ŷ_t ≠ y_t} ≤ inf_{f∈F} (1/n) Σ_{t=1}^n I{f(x_t) ≠ y_t} + √( log N / (2n) ) + √( log(1/δ) / (2n) )
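A deterministic variant of the update above (tracking the expected loss ⟨w_t, ℓ_t⟩ rather than sampling i_t, which is the standard exponentially weighted average forecaster for losses in [0, 1]) can be checked against the Hedge cumulative-regret bound √(n log N / 2); the loss sequence below is random purely for illustration.

```python
import math, random

random.seed(4)
N, n = 10, 5_000
eta = math.sqrt(8 * math.log(N) / n)  # the step size from the slide

w = [1.0 / N] * N
cum_expert = [0.0] * N
cum_algo = 0.0
for _ in range(n):
    losses = [random.random() for _ in range(N)]  # the adversary could pick these arbitrarily in [0, 1]
    cum_algo += sum(wi * li for wi, li in zip(w, losses))  # expected loss <w_t, l_t>
    for i in range(N):
        cum_expert[i] += losses[i]
    # multiplicative update and renormalization
    w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    z = sum(w)
    w = [wi / z for wi in w]

regret = cum_algo - min(cum_expert)
bound = math.sqrt(n * math.log(N) / 2)  # Hedge regret bound for this eta
print(regret, bound)
assert regret <= bound
```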
116 / 130 Useful Properties of Sequential Rademacher Complexity
Sequential Rademacher complexity enjoys the same nice properties as its i.i.d. cousin, except for the Lipschitz contraction (4). At the moment we can only prove
  R_seq(φ∘F, n) ≤ L · R_seq(F, n) · O(log^{3/2} n)
It is an open question whether this logarithmic factor can be removed...
117 / 130 Theory for Online Learning
There is now a theory with combinatorial parameters, covering numbers, and even a recipe for developing online algorithms! Many of the relevant concepts (e.g., sequential Rademacher complexity) are generalizations of the i.i.d. analogues to the case of dependent data. Coupled with the online-to-batch conversion we introduce in a few slides, there is now an interesting possibility of developing new computationally attractive algorithms for statistical learning. One such example will be presented.
118 / 130 Theory for Online Learning
  Statistical Learning                         | Online Learning
  i.i.d. data                                  | arbitrary sequences
  tuples of data                               | binary trees
  Rademacher averages                          | sequential Rademacher complexity
  covering / packing numbers                   | tree cover
  Dudley entropy integral                      | analogous result with tree cover
  VC dimension                                 | Littlestone's dimension
  scale-sensitive dimension                    | analogue for trees
  Vapnik-Chervonenkis-Sauer-Shelah Lemma       | analogous combinatorial result for trees
  ERM and regularized ERM                      | many interesting algorithms
120 / 130 Online Convex and Linear Optimization
For many problems, ℓ(f, (x, y)) is convex in f and F is a convex set. Let us simply write ℓ(f, z), where the move z need not be of the form (x, y).
- e.g., square loss ℓ(f, (x, y)) = (⟨f, x⟩ − y)² for linear regression
- e.g., hinge loss ℓ(f, (x, y)) = max{0, 1 − y⟨f, x⟩}, a surrogate loss for classification
We may then use optimization algorithms for updating our hypothesis after seeing each additional data point.
121 / 130 Online Convex and Linear Optimization
Online protocol (Online Convex Optimization):
  For t = 1, ..., n: predict f_t ∈ F, observe z_t
Goal: keep regret small:
  Reg_n = (1/n) Σ_{t=1}^n ℓ(f_t, z_t) − inf_{f∈F} (1/n) Σ_{t=1}^n ℓ(f, z_t)
Online Linear Optimization is the particular case when ℓ(f, z) = ⟨f, z⟩.
122 / 130 Gradient Descent
At time t = 1, ..., n, predict f_t ∈ F, observe z_t, update
  f′_{t+1} = f_t − η ∇ℓ(f_t, z_t)
and project f′_{t+1} onto the set F, yielding f_{t+1}. Here η is a learning rate (step size), and the gradient is with respect to the first argument. This simple algorithm guarantees that for any f ∈ F,
  (1/n) Σ_{t=1}^n ℓ(f_t, z_t) − (1/n) Σ_{t=1}^n ℓ(f, z_t) ≤ (1/n) Σ_{t=1}^n ⟨f_t − f, ∇ℓ(f_t, z_t)⟩ ≤ O(n^{−1/2})
as long as ‖∇ℓ(f_t, z_t)‖ ≤ c for some constant c, for all t, and F has a bounded diameter.
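Projected online gradient descent can be sketched and checked against the standard D·G·√n cumulative-regret guarantee (D the diameter of F, G a bound on gradient norms); the dimension, sequence, and comparator below are illustrative choices, and the guarantee holds against any comparator in F.

```python
import math, random

# Projected OGD on linear losses l(f, z) = <f, z> over the unit Euclidean ball.
random.seed(5)
d, n = 5, 2_000
zs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def project_ball(v, radius=1.0):
    nv = norm(v)
    return list(v) if nv <= radius else [x * radius / nv for x in v]

G = max(norm(z) for z in zs)   # bound on gradient norms
D = 2.0                        # diameter of the unit ball
eta = D / (G * math.sqrt(n))   # the step size minimizing the regret bound

# comparator: unit vector opposite the mean loss direction (roughly best in hindsight)
mean_z = [sum(z[j] for z in zs) / n for j in range(d)]
comparator = project_ball([-1e6 * m for m in mean_z])

f = [0.0] * d
algo_loss = comp_loss = 0.0
for z in zs:
    algo_loss += sum(fj * zj for fj, zj in zip(f, z))
    comp_loss += sum(cj * zj for cj, zj in zip(comparator, z))
    f = project_ball([fj - eta * zj for fj, zj in zip(f, z)])  # gradient of <f, z> is z

regret = algo_loss - comp_loss
bound = D * G * math.sqrt(n)
print(regret, bound)
assert regret <= bound + 1e-9
```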
123 / 130 Gradient Descent for Strongly Convex Functions
Assume that for any z, ℓ(·, z) is strongly convex in its first argument; that is,
  ℓ(f, z) − (1/2)‖f‖²
is a convex function of f. Then the same gradient descent algorithm with a different step size η guarantees that for any f ∈ F,
  (1/n) Σ_{t=1}^n ℓ(f_t, z_t) − (1/n) Σ_{t=1}^n ℓ(f, z_t) ≤ O( log(n) / n ),
a faster rate.
125 / 130 How to Use Regret Bounds for i.i.d. Data
Suppose we have a regret bound
  (1/n) Σ_{t=1}^n ℓ(f_t, z_t) − inf_{f∈F} (1/n) Σ_{t=1}^n ℓ(f, z_t) ≤ R_n
that holds for all sequences z_1, ..., z_n, for some R_n ≥ 0. Assume z_1, ..., z_n are i.i.d. with distribution P. Run the regret minimization algorithm on these data and let f̄ = (1/n) Σ_{t=1}^n f_t. Then
  E_{z, z_1,...,z_n} ℓ(f̄, z) ≤ E { (1/n) Σ_{t=1}^n ℓ(f_t, z) } = E { (1/n) Σ_{t=1}^n ℓ(f_t, z_t) }
where the first step is Jensen's inequality (using convexity of ℓ in f), and the last step holds because f_t only depends on z_1, ..., z_{t−1}. Also,
  E { inf_{f∈F} (1/n) Σ_{t=1}^n ℓ(f, z_t) } ≤ inf_{f∈F} E { (1/n) Σ_{t=1}^n ℓ(f, z_t) } = inf_{f∈F} E_z ℓ(f, z)
Combining,
  E L(f̄) − inf_{f∈F} L(f) ≤ R_n
126 / 130 How to Use Regret Bounds for i.i.d. Data
This gives an alternative way of proving bounds on E L(f̂_n) − inf_{f∈F} L(f): use f̂_n = f̄, the average of the trajectory of an online learning algorithm. Next, we present an interesting application of this idea.
127 / 130 Pegasos
Support Vector Machine is a fancy name for the following algorithm in the linear case:
  f̂ = arg min_{f∈R^d} (1/m) Σ_{i=1}^m max{0, 1 − y_i ⟨f, x_i⟩} + (λ/2) ‖f‖²
The objective can be kernelized for representing linear separators in a higher-dimensional feature space. The hinge loss is convex in f. Write
  ℓ(f, z) = max{0, 1 − y⟨f, x⟩} + (λ/2) ‖f‖²  for z = (x, y).
Then the objective of the SVM can be written as
  min_f E ℓ(f, z)
where the expectation is with respect to the empirical distribution (1/m) Σ_{i=1}^m δ_{(x_i, y_i)}. Then an i.i.d. sample z_1, ..., z_n from the empirical distribution is simply a draw with replacement from the dataset {(x_1, y_1), ..., (x_m, y_m)}.
128 / 130 Pegasos
Gradient descent f_{t+1} = f_t − η_t ∇ℓ(f_t, z_t) with
  ∇ℓ(f_t, z_t) = −y_t x_t I{y_t ⟨f_t, x_t⟩ < 1} + λ f_t
gives a guarantee
  E ℓ(f̄, z) − inf_{f∈F} E ℓ(f, z) ≤ R_n
Since ℓ(f, z) is λ-strongly convex, the rate is R_n = O(log(n)/n).
Pegasos (Shalev-Shwartz et al., 2010):
  For t = 1, ..., n:
    Choose a random example (x_{i_t}, y_{i_t}) from the dataset. Set η_t = 1/(λt).
    If y_{i_t} ⟨f_t, x_{i_t}⟩ < 1, update f_{t+1} = (1 − η_t λ) f_t + η_t x_{i_t} y_{i_t};
    else, update f_{t+1} = (1 − η_t λ) f_t.
The algorithm and analysis are due to (S. Shalev-Shwartz, Singer, Srebro, Cotter, 2010).
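The Pegasos update can be sketched on a tiny synthetic dataset; the data, λ, and the number of steps are made up for illustration, and the averaged iterate f̄ should drive the SVM objective well below its value at f = 0 (which is exactly 1 here, since every hinge term is 1).

```python
import random

random.seed(6)
lam, steps, d = 0.1, 2_000, 2

# linearly separable data (illustrative): positives near (+2, 0), negatives near (-2, 0)
data = [([2 + random.gauss(0, 0.3), random.gauss(0, 0.3)], +1) for _ in range(50)] \
     + [([-2 + random.gauss(0, 0.3), random.gauss(0, 0.3)], -1) for _ in range(50)]

def objective(f):
    """SVM objective: average hinge loss plus (lam/2) ||f||^2."""
    hinge = sum(max(0.0, 1 - y * sum(fj * xj for fj, xj in zip(f, x)))
                for x, y in data) / len(data)
    return hinge + 0.5 * lam * sum(fj * fj for fj in f)

f = [0.0] * d
f_sum = [0.0] * d
for t in range(1, steps + 1):
    x, y = random.choice(data)               # draw with replacement from the dataset
    eta = 1.0 / (lam * t)                    # the step size from the slide
    margin = y * sum(fj * xj for fj, xj in zip(f, x))
    f = [(1 - eta * lam) * fj + (eta * y * xj if margin < 1 else 0.0)
         for fj, xj in zip(f, x)]
    f_sum = [s + fj for s, fj in zip(f_sum, f)]

f_bar = [s / steps for s in f_sum]           # average of the trajectory
print(objective([0.0] * d), objective(f_bar))
assert objective(f_bar) < 0.5 * objective([0.0] * d)
```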
129 / 130 Pegasos
We conclude that f̄ = (1/n) Σ_{t=1}^n f_t computed using the gradient descent algorithm is an Õ(1/n)-approximate minimizer of the SVM objective after n steps. This gives O(d/(λɛ)) time to converge to an ɛ-minimizer. A very fast SVM solver, attractive for large datasets!
130 / 130 Summary
Key points for both statistical and online learning:
- obtained performance guarantees with minimal assumptions
- prior knowledge is captured by the comparator term
- understanding the inherent complexity of the comparator set
- key techniques: empirical processes for i.i.d. and non-i.i.d. data
- interesting relationships between statistical and online learning
- computation and statistics: a basis of machine learning
Sequences and Series
CHAPTER 9 Sequeces ad Series 9.. Covergece: Defiitio ad Examples Sequeces The purpose of this chapter is to itroduce a particular way of geeratig algorithms for fidig the values of fuctios defied by their
Convexity, Inequalities, and Norms
Covexity, Iequalities, ad Norms Covex Fuctios You are probably familiar with the otio of cocavity of fuctios. Give a twicedifferetiable fuctio ϕ: R R, We say that ϕ is covex (or cocave up) if ϕ (x) 0 for
Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:
Chapter 7 - Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries
5 Boolean Decision Trees (February 11)
5 Boolea Decisio Trees (February 11) 5.1 Graph Coectivity Suppose we are give a udirected graph G, represeted as a boolea adjacecy matrix = (a ij ), where a ij = 1 if ad oly if vertices i ad j are coected
Class Meeting # 16: The Fourier Transform on R n
MATH 18.152 COUSE NOTES - CLASS MEETING # 16 18.152 Itroductio to PDEs, Fall 2011 Professor: Jared Speck Class Meetig # 16: The Fourier Trasform o 1. Itroductio to the Fourier Trasform Earlier i the course,
Case Study. Normal and t Distributions. Density Plot. Normal Distributions
Case Study Normal ad t Distributios Bret Halo ad Bret Larget Departmet of Statistics Uiversity of Wiscosi Madiso October 11 13, 2011 Case Study Body temperature varies withi idividuals over time (it ca
Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring
No-life isurace mathematics Nils F. Haavardsso, Uiversity of Oslo ad DNB Skadeforsikrig Mai issues so far Why does isurace work? How is risk premium defied ad why is it importat? How ca claim frequecy
Center, Spread, and Shape in Inference: Claims, Caveats, and Insights
Ceter, Spread, ad Shape i Iferece: Claims, Caveats, ad Isights Dr. Nacy Pfeig (Uiversity of Pittsburgh) AMATYC November 2008 Prelimiary Activities 1. I would like to produce a iterval estimate for the
Universal coding for classes of sources
Coexios module: m46228 Uiversal codig for classes of sources Dever Greee This work is produced by The Coexios Project ad licesed uder the Creative Commos Attributio Licese We have discussed several parametric
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,
5: Introduction to Estimation
5: Itroductio to Estimatio Cotets Acroyms ad symbols... 1 Statistical iferece... Estimatig µ with cofidece... 3 Samplig distributio of the mea... 3 Cofidece Iterval for μ whe σ is kow before had... 4 Sample
Lecture 4: Cauchy sequences, Bolzano-Weierstrass, and the Squeeze theorem
Lecture 4: Cauchy sequeces, Bolzao-Weierstrass, ad the Squeeze theorem The purpose of this lecture is more modest tha the previous oes. It is to state certai coditios uder which we are guarateed that limits
1. C. The formula for the confidence interval for a population mean is: x t, which was
s 1. C. The formula for the cofidece iterval for a populatio mea is: x t, which was based o the sample Mea. So, x is guarateed to be i the iterval you form.. D. Use the rule : p-value
Department of Computer Science, University of Otago
Departmet of Computer Sciece, Uiversity of Otago Techical Report OUCS-2006-09 Permutatios Cotaiig May Patters Authors: M.H. Albert Departmet of Computer Sciece, Uiversity of Otago Micah Colema, Rya Fly
1 Correlation and Regression Analysis
1 Correlatio ad Regressio Aalysis I this sectio we will be ivestigatig the relatioship betwee two cotiuous variable, such as height ad weight, the cocetratio of a ijected drug ad heart rate, or the cosumptio
MARTINGALES AND A BASIC APPLICATION
MARTINGALES AND A BASIC APPLICATION TURNER SMITH Abstract. This paper will develop the measure-theoretic approach to probability i order to preset the defiitio of martigales. From there we will apply this
SUPPLEMENTARY MATERIAL TO GENERAL NON-EXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE
SUPPLEMENTARY MATERIAL TO GENERAL NON-EXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE By Guillaume Lecué CNRS, LAMA, Mare-la-vallée, 77454 Frace ad By Shahar Medelso Departmet of Mathematics,
Math C067 Sampling Distributions
Math C067 Samplig Distributios Sample Mea ad Sample Proportio Richard Beigel Some time betwee April 16, 2007 ad April 16, 2007 Examples of Samplig A pollster may try to estimate the proportio of voters
CHAPTER 3 DIGITAL CODING OF SIGNALS
CHAPTER 3 DIGITAL CODING OF SIGNALS Computers are ofte used to automate the recordig of measuremets. The trasducers ad sigal coditioig circuits produce a voltage sigal that is proportioal to a quatity
Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals
Overview Estimatig the Value of a Parameter Usig Cofidece Itervals We apply the results about the sample mea the problem of estimatio Estimatio is the process of usig sample data estimate the value of
Confidence Intervals for One Mean
Chapter 420 Cofidece Itervals for Oe Mea Itroductio This routie calculates the sample size ecessary to achieve a specified distace from the mea to the cofidece limit(s) at a stated cofidece level for a
Lesson 17 Pearson s Correlation Coefficient
Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) -types of data -scatter plots -measure of directio -measure of stregth Computatio -covariatio of X ad Y -uique variatio i X ad Y -measurig
Plug-in martingales for testing exchangeability on-line
Plug-i martigales for testig exchageability o-lie Valetia Fedorova, Alex Gammerma, Ilia Nouretdiov, ad Vladimir Vovk Computer Learig Research Cetre Royal Holloway, Uiversity of Lodo, UK {valetia,ilia,alex,vovk}@cs.rhul.ac.uk
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics
University of California, Los Angeles Department of Statistics. Distributions related to the normal distribution
Uiversity of Califoria, Los Ageles Departmet of Statistics Statistics 100B Istructor: Nicolas Christou Three importat distributios: Distributios related to the ormal distributio Chi-square (χ ) distributio.
Hypergeometric Distributions
7.4 Hypergeometric Distributios Whe choosig the startig lie-up for a game, a coach obviously has to choose a differet player for each positio. Similarly, whe a uio elects delegates for a covetio or you
3 Basic Definitions of Probability Theory
3 Basic Defiitios of Probability Theory 3defprob.tex: Feb 10, 2003 Classical probability Frequecy probability axiomatic probability Historical developemet: Classical Frequecy Axiomatic The Axiomatic defiitio
Week 3 Conditional probabilities, Bayes formula, WEEK 3 page 1 Expected value of a random variable
Week 3 Coditioal probabilities, Bayes formula, WEEK 3 page 1 Expected value of a radom variable We recall our discussio of 5 card poker hads. Example 13 : a) What is the probability of evet A that a 5
CHAPTER 7: Central Limit Theorem: CLT for Averages (Means)
CHAPTER 7: Cetral Limit Theorem: CLT for Averages (Meas) X = the umber obtaied whe rollig oe six sided die oce. If we roll a six sided die oce, the mea of the probability distributio is X P(X = x) Simulatio:
Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.
This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio
THE HEIGHT OF q-binary SEARCH TREES
THE HEIGHT OF q-binary SEARCH TREES MICHAEL DRMOTA AND HELMUT PRODINGER Abstract. q biary search trees are obtaied from words, equipped with the geometric distributio istead of permutatios. The average
INFINITE SERIES KEITH CONRAD
INFINITE SERIES KEITH CONRAD. Itroductio The two basic cocepts of calculus, differetiatio ad itegratio, are defied i terms of limits (Newto quotiets ad Riema sums). I additio to these is a third fudametal
Confidence Intervals. CI for a population mean (σ is known and n > 30 or the variable is normally distributed in the.
Cofidece Itervals A cofidece iterval is a iterval whose purpose is to estimate a parameter (a umber that could, i theory, be calculated from the populatio, if measuremets were available for the whole populatio).
CS103A Handout 23 Winter 2002 February 22, 2002 Solving Recurrence Relations
CS3A Hadout 3 Witer 00 February, 00 Solvig Recurrece Relatios Itroductio A wide variety of recurrece problems occur i models. Some of these recurrece relatios ca be solved usig iteratio or some other ad
1 The Gaussian channel
ECE 77 Lecture 0 The Gaussia chael Objective: I this lecture we will lear about commuicatio over a chael of practical iterest, i which the trasmitted sigal is subjected to additive white Gaussia oise.
1 Computing the Standard Deviation of Sample Means
Computig the Stadard Deviatio of Sample Meas Quality cotrol charts are based o sample meas ot o idividual values withi a sample. A sample is a group of items, which are cosidered all together for our aalysis.
Solutions to Selected Problems In: Pattern Classification by Duda, Hart, Stork
Solutios to Selected Problems I: Patter Classificatio by Duda, Hart, Stork Joh L. Weatherwax February 4, 008 Problem Solutios Chapter Bayesia Decisio Theory Problem radomized rules Part a: Let Rx be the
Chapter 14 Nonparametric Statistics
Chapter 14 Noparametric Statistics A.K.A. distributio-free statistics! Does ot deped o the populatio fittig ay particular type of distributio (e.g, ormal). Sice these methods make fewer assumptios, they
The Stable Marriage Problem
The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV [email protected] 1 Itroductio Imagie you are a matchmaker,
Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling
Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed Multi-Evet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria
Subject CT5 Contingencies Core Technical Syllabus
Subject CT5 Cotigecies Core Techical Syllabus for the 2015 exams 1 Jue 2014 Aim The aim of the Cotigecies subject is to provide a groudig i the mathematical techiques which ca be used to model ad value
Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3
ESAIM: Probability ad Statistics URL: http://wwwemathfr/ps/ Will be set by the publisher THEORY OF CLASSIFICATION: A SURVEY OF SOME RECENT ADVANCES Stéphae Bouchero 1, Olivier Bousquet 2 ad Gábor Lugosi
Section 11.3: The Integral Test
Sectio.3: The Itegral Test Most of the series we have looked at have either diverged or have coverged ad we have bee able to fid what they coverge to. I geeral however, the problem is much more difficult
Vladimir N. Burkov, Dmitri A. Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT
Keywords: project maagemet, resource allocatio, etwork plaig Vladimir N Burkov, Dmitri A Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT The paper deals with the problems of resource allocatio betwee
Running Time ( 3.1) Analysis of Algorithms. Experimental Studies ( 3.1.1) Limitations of Experiments. Pseudocode ( 3.1.2) Theoretical Analysis
Ruig Time ( 3.) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.
THE ABRACADABRA PROBLEM
THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected
Example 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here).
BEGINNING ALGEBRA Roots ad Radicals (revised summer, 00 Olso) Packet to Supplemet the Curret Textbook - Part Review of Square Roots & Irratioals (This portio ca be ay time before Part ad should mostly
Notes on exponential generating functions and structures.
Notes o expoetial geeratig fuctios ad structures. 1. The cocept of a structure. Cosider the followig coutig problems: (1) to fid for each the umber of partitios of a -elemet set, (2) to fid for each the
Incremental calculation of weighted mean and variance
Icremetal calculatio of weighted mea ad variace Toy Fich [email protected] [email protected] Uiversity of Cambridge Computig Service February 009 Abstract I these otes I eplai how to derive formulae for umerically
Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.
18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: Courat-Fischer formula ad Rayleigh quotiets The
Entropy of bi-capacities
Etropy of bi-capacities Iva Kojadiovic LINA CNRS FRE 2729 Site école polytechique de l uiv. de Nates Rue Christia Pauc 44306 Nates, Frace [email protected] Jea-Luc Marichal Applied Mathematics
Measures of Spread and Boxplots Discrete Math, Section 9.4
Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,
Research Method (I) --Knowledge on Sampling (Simple Random Sampling)
Research Method (I) --Kowledge o Samplig (Simple Radom Samplig) 1. Itroductio to samplig 1.1 Defiitio of samplig Samplig ca be defied as selectig part of the elemets i a populatio. It results i the fact
Institute of Actuaries of India Subject CT1 Financial Mathematics
Istitute of Actuaries of Idia Subject CT1 Fiacial Mathematics For 2014 Examiatios Subject CT1 Fiacial Mathematics Core Techical Aim The aim of the Fiacial Mathematics subject is to provide a groudig i
Ekkehart Schlicht: Economic Surplus and Derived Demand
Ekkehart Schlicht: Ecoomic Surplus ad Derived Demad Muich Discussio Paper No. 2006-17 Departmet of Ecoomics Uiversity of Muich Volkswirtschaftliche Fakultät Ludwig-Maximilias-Uiversität Müche Olie at http://epub.ub.ui-mueche.de/940/
Determining the sample size
Determiig the sample size Oe of the most commo questios ay statisticia gets asked is How large a sample size do I eed? Researchers are ofte surprised to fid out that the aswer depeds o a umber of factors
*The most important feature of MRP as compared with ordinary inventory control analysis is its time phasing feature.
Itegrated Productio ad Ivetory Cotrol System MRP ad MRP II Framework of Maufacturig System Ivetory cotrol, productio schedulig, capacity plaig ad fiacial ad busiess decisios i a productio system are iterrelated.
CHAPTER 3 THE TIME VALUE OF MONEY
CHAPTER 3 THE TIME VALUE OF MONEY OVERVIEW A dollar i the had today is worth more tha a dollar to be received i the future because, if you had it ow, you could ivest that dollar ad ear iterest. Of all
Totally Corrective Boosting Algorithms that Maximize the Margin
Mafred K. Warmuth [email protected] Ju Liao [email protected] Uiversity of Califoria at Sata Cruz, Sata Cruz, CA 95064, USA Guar Rätsch [email protected] Friedrich Miescher Laboratory of
Estimating Probability Distributions by Observing Betting Practices
5th Iteratioal Symposium o Imprecise Probability: Theories ad Applicatios, Prague, Czech Republic, 007 Estimatig Probability Distributios by Observig Bettig Practices Dr C Lych Natioal Uiversity of Irelad,
PSYCHOLOGICAL STATISTICS
UNIVERSITY OF CALICUT SCHOOL OF DISTANCE EDUCATION B Sc. Cousellig Psychology (0 Adm.) IV SEMESTER COMPLEMENTARY COURSE PSYCHOLOGICAL STATISTICS QUESTION BANK. Iferetial statistics is the brach of statistics
Simple Annuities Present Value.
Simple Auities Preset Value. OBJECTIVES (i) To uderstad the uderlyig priciple of a preset value auity. (ii) To use a CASIO CFX-9850GB PLUS to efficietly compute values associated with preset value auities.
Building Blocks Problem Related to Harmonic Series
TMME, vol3, o, p.76 Buildig Blocks Problem Related to Harmoic Series Yutaka Nishiyama Osaka Uiversity of Ecoomics, Japa Abstract: I this discussio I give a eplaatio of the divergece ad covergece of ifiite
where: T = number of years of cash flow in investment's life n = the year in which the cash flow X n i = IRR = the internal rate of return
EVALUATING ALTERNATIVE CAPITAL INVESTMENT PROGRAMS By Ke D. Duft, Extesio Ecoomist I the March 98 issue of this publicatio we reviewed the procedure by which a capital ivestmet project was assessed. The
Infinite Sequences and Series
CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...
Statistical inference: example 1. Inferential Statistics
Statistical iferece: example 1 Iferetial Statistics POPULATION SAMPLE A clothig store chai regularly buys from a supplier large quatities of a certai piece of clothig. Each item ca be classified either
Lecture 2: Karger s Min Cut Algorithm
priceto uiv. F 3 cos 5: Advaced Algorithm Desig Lecture : Karger s Mi Cut Algorithm Lecturer: Sajeev Arora Scribe:Sajeev Today s topic is simple but gorgeous: Karger s mi cut algorithm ad its extesio.
Chair for Network Architectures and Services Institute of Informatics TU München Prof. Carle. Network Security. Chapter 2 Basics
Chair for Network Architectures ad Services Istitute of Iformatics TU Müche Prof. Carle Network Security Chapter 2 Basics 2.4 Radom Number Geeratio for Cryptographic Protocols Motivatio It is crucial to
Research Article Sign Data Derivative Recovery
Iteratioal Scholarly Research Network ISRN Applied Mathematics Volume 0, Article ID 63070, 7 pages doi:0.540/0/63070 Research Article Sig Data Derivative Recovery L. M. Housto, G. A. Glass, ad A. D. Dymikov
Page 1. Real Options for Engineering Systems. What are we up to? Today s agenda. J1: Real Options for Engineering Systems. Richard de Neufville
Real Optios for Egieerig Systems J: Real Optios for Egieerig Systems By (MIT) Stefa Scholtes (CU) Course website: http://msl.mit.edu/cmi/ardet_2002 Stefa Scholtes Judge Istitute of Maagemet, CU Slide What
The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles
The followig eample will help us uderstad The Samplig Distributio of the Mea Review: The populatio is the etire collectio of all idividuals or objects of iterest The sample is the portio of the populatio
Regularized Distance Metric Learning: Theory and Algorithm
Regularized Distace Metric Learig: Theory ad Algorithm Rog Ji 1 Shiju Wag 2 Yag Zhou 1 1 Dept. of Computer Sciece & Egieerig, Michiga State Uiversity, East Lasig, MI 48824 2 Radiology ad Imagig Scieces,
Systems Design Project: Indoor Location of Wireless Devices
Systems Desig Project: Idoor Locatio of Wireless Devices Prepared By: Bria Murphy Seior Systems Sciece ad Egieerig Washigto Uiversity i St. Louis Phoe: (805) 698-5295 Email: [email protected] Supervised
19 Another Look at Differentiability in Quadratic Mean
19 Aother Look at Differetiability i Quadratic Mea David Pollard 1 ABSTRACT This ote revisits the delightfully subtle itercoectios betwee three ideas: differetiability, i a L 2 sese, of the square-root
Concentration of Measure
Copyright c 2008 2010 Joh Lafferty, Ha Liu, ad Larry Wasserma Do Not Distribute Chapter 7 Cocetratio of Measure Ofte we wat to show that some radom quatity is close to its mea with high probability Results
A Combined Continuous/Binary Genetic Algorithm for Microstrip Antenna Design
A Combied Cotiuous/Biary Geetic Algorithm for Microstrip Atea Desig Rady L. Haupt The Pesylvaia State Uiversity Applied Research Laboratory P. O. Box 30 State College, PA 16804-0030 [email protected] Abstract:
