Concentration of Measure

Copyright © 2008–2010 John Lafferty, Han Liu, and Larry Wasserman. Do Not Distribute.

Chapter 7: Concentration of Measure

Often we want to show that some random quantity is close to its mean with high probability. Results of this kind are known as concentration inequalities. In this chapter we consider some important concentration results such as Hoeffding's inequality, Bernstein's inequality and McDiarmid's inequality. Then we consider uniform bounds that guarantee that a set of random quantities are simultaneously close to their means with high probability.

7.1 Introduction

Often we need to show that a random quantity is close to its mean. For example, later we will prove Hoeffding's inequality, which implies that, if $Z_1,\dots,Z_n$ are Bernoulli random variables with mean $\mu$, then

$$P(|\bar{Z} - \mu| > \epsilon) \le 2e^{-2n\epsilon^2}$$

where $\bar{Z} = n^{-1}\sum_{i=1}^n Z_i$. More generally, we want a result of the form

$$P\Big(\big|f(Z_1,\dots,Z_n) - \mu_n(f)\big| > \epsilon\Big) < \delta_n \qquad (7.1)$$

where $\mu_n(f) = E(f(Z_1,\dots,Z_n))$ and $\delta_n \to 0$ as $n \to \infty$. Such results are known as concentration inequalities, and the phenomenon that many random quantities are close to their mean with high probability is called concentration of measure. These results are fundamental for establishing performance guarantees of many algorithms. For statistical learning theory, we will need uniform bounds of the form

$$P\Big(\sup_{f\in\mathcal{F}} \big|f(Z_1,\dots,Z_n) - \mu_n(f)\big| > \epsilon\Big) < \delta_n \qquad (7.2)$$

over a class of functions $\mathcal{F}$.

7.3 Example. To motivate the need for such results, consider empirical risk minimization in classification. Suppose we have data $(X_1,Y_1),\dots,(X_n,Y_n)$ where $Y_i \in \{0,1\}$ and $X_i \in \mathbb{R}^d$. Let $h:\mathbb{R}^d \to \{0,1\}$ be a classifier. The training error is

$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n I(Y_i \ne h(X_i))$$

and the true classification error is $R(h) = P(Y \ne h(X))$. We would like to know if $\hat{R}_n(h)$ is close to $R(h)$ with high probability. This is precisely of the form (7.1) with $Z_i = (X_i,Y_i)$ and $f(Z_1,\dots,Z_n) = \frac{1}{n}\sum_{i=1}^n I(Y_i \ne h(X_i))$.

Now let $\mathcal{H}$ be a set of classifiers. Let $\hat{h}$ minimize the training error $\hat{R}_n(h)$ over $\mathcal{H}$ and let $h_*$ minimize the true error $R(h)$ over $\mathcal{H}$. Can we guarantee that the risk $R(\hat{h})$ of the selected classifier is close to the risk $R(h_*)$ of the best classifier? Let $E$ denote the event that $\sup_{h\in\mathcal{H}} |\hat{R}_n(h) - R(h)| \le \epsilon$. When the event $E$ holds, we have that

$$R(h_*) \le R(\hat{h}) \le \hat{R}_n(\hat{h}) + \epsilon \le \hat{R}_n(h_*) + \epsilon \le R(h_*) + 2\epsilon$$

where the four inequalities use, in order, the following facts: $h_*$ minimizes $R$, $E$ holds, $\hat{h}$ minimizes $\hat{R}_n$, and $E$ holds. It follows that, when $E$ holds, $|R(\hat{h}) - R(h_*)| \le 2\epsilon$. Concentration of measure is used to prove that $E$ holds with high probability.

Besides classification, concentration inequalities are used for studying many other methods such as clustering, random projections and density estimation.
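The bound $P(|\bar{Z} - \mu| > \epsilon) \le 2e^{-2n\epsilon^2}$ quoted in the introduction can be checked numerically for Bernoulli variables. The following is a minimal sketch, not part of the original notes: it computes the exact two-sided tail from the binomial distribution and compares it with the Hoeffding bound. The function names `exact_bernoulli_tail` and `hoeffding_bound` are hypothetical, chosen for illustration.

```python
import math

def exact_bernoulli_tail(n, p, eps):
    """Exact P(|mean - p| > eps) for the mean of n iid Bernoulli(p) variables."""
    total = 0.0
    for k in range(n + 1):
        # k successes give a sample mean of k / n.
        if abs(k / n - p) > eps:
            total += math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
    return total

def hoeffding_bound(n, eps):
    """Hoeffding upper bound 2 exp(-2 n eps^2) for variables taking values in [0, 1]."""
    return 2.0 * math.exp(-2.0 * n * eps**2)

if __name__ == "__main__":
    for n in (10, 100, 500):
        print(n, exact_bernoulli_tail(n, 0.3, 0.1), hoeffding_bound(n, 0.1))
```

The exact tail is typically much smaller than the bound, which is expected: Hoeffding's inequality uses only boundedness and holds for every $p$.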

Notation. If $P$ is a probability measure and $f$ is a function then we write

$$Pf = P(f) = \int f(z)\,dP(z) = E(f(Z)).$$

Given $Z_1,\dots,Z_n$, let $P_n$ denote the empirical measure that puts mass $1/n$ at each data point:

$$P_n(A) = \frac{1}{n}\sum_{i=1}^n I(Z_i \in A)$$

where $I(Z_i \in A) = 1$ if $Z_i \in A$ and $I(Z_i \in A) = 0$ otherwise. Then we write

$$P_n f = P_n(f) = \int f(z)\,dP_n(z) = \frac{1}{n}\sum_{i=1}^n f(Z_i).$$

7.2 Basic Inequalities

7.2.1 Hoeffding's Inequality

Suppose that $Z$ has a finite mean and that $P(Z \ge 0) = 1$. Then, for any $\epsilon > 0$,

$$E(Z) = \int_0^\infty z\,dP(z) \ge \int_\epsilon^\infty z\,dP(z) \ge \epsilon\int_\epsilon^\infty dP(z) = \epsilon\,P(Z > \epsilon) \qquad (7.4)$$

which yields Markov's inequality:

$$P(Z > \epsilon) \le \frac{E(Z)}{\epsilon}. \qquad (7.5)$$

An immediate consequence of Markov's inequality is Chebyshev's inequality

$$P(|Z - \mu| > \epsilon) = P(|Z - \mu|^2 > \epsilon^2) \le \frac{E(Z-\mu)^2}{\epsilon^2} = \frac{\sigma^2}{\epsilon^2} \qquad (7.6)$$

where $\mu = E(Z)$ and $\sigma^2 = \mathrm{Var}(Z)$. If $Z_1,\dots,Z_n$ are iid with mean $\mu$ and variance $\sigma^2$ then, since $\mathrm{Var}(\bar{Z}_n) = \sigma^2/n$, Chebyshev's inequality yields

$$P(|\bar{Z}_n - \mu| > \epsilon) \le \frac{\sigma^2}{n\epsilon^2}. \qquad (7.7)$$

While this inequality is useful, it does not decay exponentially fast as $n$ increases. To improve the inequality, we use Chernoff's method: for any $t > 0$,

$$P(Z > \epsilon) = P(e^{Z} > e^{\epsilon}) = P(e^{tZ} > e^{t\epsilon}) \le e^{-t\epsilon}E(e^{tZ}). \qquad (7.8)$$

We then minimize over $t$ and conclude that:

$$P(Z > \epsilon) \le \inf_{t \ge 0} e^{-t\epsilon}E(e^{tZ}). \qquad (7.9)$$

To use the above result we need to bound the moment generating function $E(e^{tZ})$.

7.10 Lemma. Let $Z$ be a mean $\mu$ random variable such that $a \le Z \le b$. Then, for any $t$,

$$E(e^{tZ}) \le e^{t\mu + t^2(b-a)^2/8}. \qquad (7.11)$$

Proof. For simplicity, assume that $\mu = 0$. Since $a \le Z \le b$, we can write $Z$ as a convex combination of $a$ and $b$, namely, $Z = \alpha b + (1-\alpha)a$ where $\alpha = (Z-a)/(b-a)$. By the convexity of the function $y \mapsto e^{ty}$ we have

$$e^{tZ} \le \alpha e^{tb} + (1-\alpha)e^{ta} = \frac{Z-a}{b-a}e^{tb} + \frac{b-Z}{b-a}e^{ta}.$$

Take expectations of both sides and use the fact that $E(Z) = 0$ to get

$$E e^{tZ} \le -\frac{a}{b-a}e^{tb} + \frac{b}{b-a}e^{ta} = e^{g(u)} \qquad (7.12)$$

where $u = t(b-a)$, $g(u) = -\gamma u + \log(1 - \gamma + \gamma e^{u})$ and $\gamma = -a/(b-a)$. Note that $g(0) = g'(0) = 0$. Also, $g''(u) \le 1/4$ for all $u > 0$. By Taylor's theorem, there is a $\xi \in (0,u)$ such that

$$g(u) = g(0) + u g'(0) + \frac{u^2}{2}g''(\xi) = \frac{u^2}{2}g''(\xi) \le \frac{u^2}{8} = \frac{t^2(b-a)^2}{8}.$$

Hence, $E e^{tZ} \le e^{g(u)} \le e^{t^2(b-a)^2/8}$.

7.13 Theorem (Hoeffding). If $Z_1, Z_2, \dots, Z_n$ are independent with $P(a \le Z_i \le b) = 1$ and common mean $\mu$ then, for any $\epsilon > 0$,

$$P(|\bar{Z}_n - \mu| > \epsilon) \le 2e^{-2n\epsilon^2/(b-a)^2} \qquad (7.14)$$

where $\bar{Z}_n = \frac{1}{n}\sum_{i=1}^n Z_i$.

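Proof. For simplicity assume that $E(Z_i) = 0$. Now we use the Chernoff method. For any $t > 0$, we have, from Markov's inequality, that

$$P\Big(\frac{1}{n}\sum_{i=1}^n Z_i \ge \epsilon\Big) = P\Big(\frac{t}{n}\sum_{i=1}^n Z_i \ge t\epsilon\Big) = P\Big(e^{(t/n)\sum_i Z_i} \ge e^{t\epsilon}\Big) \le e^{-t\epsilon}E\Big(e^{(t/n)\sum_i Z_i}\Big) = e^{-t\epsilon}\prod_i E\big(e^{(t/n)Z_i}\big) \qquad (7.15)$$

$$\le e^{-t\epsilon}\,e^{(t^2/n^2)\sum_i (b_i-a_i)^2/8} \qquad (7.16)$$

where the last inequality follows from Lemma 7.10. Now we minimize the right hand side over $t$. In particular, we set $t = 4\epsilon n^2/\sum_i (b_i-a_i)^2$ and get $P(\bar{Z}_n \ge \epsilon) \le e^{-2n\epsilon^2/c}$, where $c = \frac{1}{n}\sum_{i=1}^n (b_i-a_i)^2$. By a similar argument, $P(\bar{Z}_n \le -\epsilon) \le e^{-2n\epsilon^2/c}$ and the result follows.

7.17 Corollary. If $Z_1, Z_2, \dots, Z_n$ are independent with $P(a_i \le Z_i \le b_i) = 1$ and common mean $\mu$, then, with probability at least $1-\delta$,

$$|\bar{Z}_n - \mu| \le \sqrt{\frac{c}{2n}\log\Big(\frac{2}{\delta}\Big)} \qquad (7.18)$$

where $c = \frac{1}{n}\sum_{i=1}^n (b_i - a_i)^2$.

7.19 Corollary. If $Z_1, Z_2, \dots, Z_n$ are independent Bernoulli random variables with $P(Z_i = 1) = p$ then, for any $\epsilon > 0$, $P(|\bar{Z}_n - p| > \epsilon) \le 2e^{-2n\epsilon^2}$. Hence, with probability at least $1-\delta$ we have that $|\bar{Z}_n - p| \le \sqrt{\frac{1}{2n}\log\big(\frac{2}{\delta}\big)}$.

7.20 Example (Classification). Returning to the classification problem, let $h$ be a classifier and let $f(z) = I(y \ne h(x))$ where $z = (x,y)$. Then Hoeffding's inequality implies that $|R(h) - \hat{R}_n(h)| \le \sqrt{\frac{1}{2n}\log\big(\frac{2}{\delta}\big)}$ with probability at least $1-\delta$.

Theorem 7.21 (McDiarmid) below extends Hoeffding's inequality to more general functions $f(z_1,\dots,z_n)$.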
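The confidence statements in Corollaries 7.17 and 7.19 come from setting the Hoeffding bound $2e^{-2n\epsilon^2/(b-a)^2}$ equal to $\delta$ and solving for $\epsilon$. A minimal sketch of that inversion follows, assuming all variables share the same range $[a,b]$. The function names are hypothetical and are not part of the original notes.

```python
import math

def hoeffding_radius(n, delta, a=0.0, b=1.0):
    """Half-width eps solving 2 exp(-2 n eps^2 / (b - a)^2) = delta."""
    return (b - a) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def hoeffding_sample_size(eps, delta, a=0.0, b=1.0):
    """Smallest n for which the Hoeffding radius is at most eps."""
    return math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * eps**2))

if __name__ == "__main__":
    # Radius of a 95% interval for a Bernoulli mean from 1000 observations.
    print(hoeffding_radius(1000, 0.05))
    # Observations needed to guarantee a radius of 0.05 at the same level.
    print(hoeffding_sample_size(0.05, 0.05))
```

In the classification example, `hoeffding_radius(n, delta)` bounds $|R(h) - \hat{R}_n(h)|$ for a single fixed classifier $h$; it does not hold uniformly over a class.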

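7.21 Theorem (McDiarmid). Let $Z_1,\dots,Z_n$ be independent random variables. Suppose that

$$\sup_{z_1,\dots,z_n,z_i'} \big|f(z_1,\dots,z_{i-1},z_i,z_{i+1},\dots,z_n) - f(z_1,\dots,z_{i-1},z_i',z_{i+1},\dots,z_n)\big| \le c_i \qquad (7.22)$$

for $i = 1,\dots,n$. Then

$$P\Big(\big|f(Z_1,\dots,Z_n) - E(f(Z_1,\dots,Z_n))\big| \ge \epsilon\Big) \le 2\exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}\Big). \qquad (7.23)$$

Proof. Let $Y = f(Z_1,\dots,Z_n)$ and $\mu = E(f(Z_1,\dots,Z_n))$. Then

$$P(|Y - \mu| \ge \epsilon) = P(Y - \mu \ge \epsilon) + P(Y - \mu \le -\epsilon).$$

We will bound the first quantity. The second follows similarly. Let $V_i = E(Y \mid Z_1,\dots,Z_i) - E(Y \mid Z_1,\dots,Z_{i-1})$. Then

$$f(Z_1,\dots,Z_n) - E(f(Z_1,\dots,Z_n)) = \sum_{i=1}^n V_i$$

and $E(V_i \mid Z_1,\dots,Z_{i-1}) = 0$. Using a similar argument as in Lemma 7.10, we have

$$E(e^{tV_i} \mid Z_1,\dots,Z_{i-1}) \le e^{t^2c_i^2/8}. \qquad (7.24)$$

Now, for any $t > 0$,

$$P(Y - \mu \ge \epsilon) = P\Big(\sum_{i=1}^n V_i \ge \epsilon\Big) = P\Big(e^{t\sum_{i=1}^n V_i} \ge e^{t\epsilon}\Big) \le e^{-t\epsilon}E\Big(e^{t\sum_{i=1}^n V_i}\Big)$$

$$= e^{-t\epsilon}E\Big(e^{t\sum_{i=1}^{n-1} V_i}\,E\big(e^{tV_n} \mid Z_1,\dots,Z_{n-1}\big)\Big) \le e^{-t\epsilon}e^{t^2c_n^2/8}E\Big(e^{t\sum_{i=1}^{n-1} V_i}\Big) \le \cdots \le e^{-t\epsilon}e^{t^2\sum_{i=1}^n c_i^2/8}.$$

The result follows by taking $t = 4\epsilon/\sum_{i=1}^n c_i^2$.

Remark: If $f(z_1,\dots,z_n) = n^{-1}\sum_{i=1}^n z_i$ then we get back Hoeffding's inequality.

7.25 Example. Let $X_1,\dots,X_n \sim P$ and let $P_n(A) = \frac{1}{n}\sum_{i=1}^n I(X_i \in A)$. Define $f(X_1,\dots,X_n) = \sup_A |P_n(A) - P(A)|$. Changing one observation changes $f$ by at most $1/n$. Hence,

$$P\Big(\big|f(X_1,\dots,X_n) - E(f(X_1,\dots,X_n))\big| > \epsilon\Big) \le 2e^{-2n\epsilon^2}.$$

7.2.2 Sharper Inequalities

Hoeffding's inequality does not use any information about the random variables except the fact that they are bounded. If the variance of $X_i$ is small, then we can get a sharper inequality from Bernstein's inequality. We begin with a preliminary result.

7.26 Lemma. Suppose that $|X| \le c$ and $E(X) = 0$. For any $t > 0$,

$$E(e^{tX}) \le \exp\Big\{t^2\sigma^2\Big(\frac{e^{tc} - 1 - tc}{(tc)^2}\Big)\Big\} \qquad (7.27)$$

where $\sigma^2 = \mathrm{Var}(X)$.

Proof. Let $F = \sum_{r=2}^\infty \frac{t^{r-2}E(X^r)}{r!\,\sigma^2}$. Then,

$$E(e^{tX}) = E\Big(1 + tX + \sum_{r=2}^\infty \frac{t^rX^r}{r!}\Big) = 1 + t^2\sigma^2F \le e^{t^2\sigma^2F}. \qquad (7.28)$$

For $r \ge 2$, $E(X^r) = E(X^{r-2}X^2) \le c^{r-2}\sigma^2$ and so

$$F \le \sum_{r=2}^\infty \frac{t^{r-2}c^{r-2}\sigma^2}{r!\,\sigma^2} = \frac{1}{(tc)^2}\sum_{r=2}^\infty \frac{(tc)^r}{r!} = \frac{e^{tc} - 1 - tc}{(tc)^2}.$$

Hence,

$$E(e^{tX}) \le \exp\Big\{t^2\sigma^2\,\frac{e^{tc} - 1 - tc}{(tc)^2}\Big\}. \qquad (7.29)$$

7.30 Theorem (Bernstein). If $P(|X_i| \le c) = 1$ and $E(X_i) = \mu$ then, for any $\epsilon > 0$,

$$P(|\bar{X}_n - \mu| > \epsilon) \le 2\exp\Big\{-\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3}\Big\} \qquad (7.31)$$

where $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \mathrm{Var}(X_i)$.

Proof. For simplicity, assume that $\mu = 0$. From Lemma 7.26,

$$E(e^{tX_i}) \le \exp\Big\{t^2\sigma_i^2\,\frac{e^{tc} - 1 - tc}{(tc)^2}\Big\} \qquad (7.32)$$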

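where $\sigma_i^2 = E(X_i^2)$. Now,

$$P(\bar{X}_n > \epsilon) = P\Big(\sum_{i=1}^n X_i > n\epsilon\Big) = P\Big(e^{t\sum_i X_i} > e^{tn\epsilon}\Big) \qquad (7.33)$$

$$\le e^{-tn\epsilon}E\Big(e^{t\sum_i X_i}\Big) = e^{-tn\epsilon}\prod_{i=1}^n E(e^{tX_i}) \qquad (7.34)$$

$$\le e^{-tn\epsilon}\exp\Big\{nt^2\sigma^2\,\frac{e^{tc} - 1 - tc}{(tc)^2}\Big\}. \qquad (7.35)$$

Take $t = (1/c)\log(1 + \epsilon c/\sigma^2)$ to get

$$P(\bar{X}_n > \epsilon) \le \exp\Big\{-\frac{n\sigma^2}{c^2}\,h\Big(\frac{c\epsilon}{\sigma^2}\Big)\Big\} \qquad (7.36)$$

where $h(u) = (1+u)\log(1+u) - u$. The result follows by noting that $h(u) \ge u^2/(2 + 2u/3)$ for $u \ge 0$.

A useful corollary is the following.

7.37 Lemma. Let $X_1,\dots,X_n$ be iid and suppose that $|X_i| \le c$ and $E(X_i) = \mu$. With probability at least $1-\delta$,

$$|\bar{X}_n - \mu| \le \sqrt{\frac{2\sigma^2\log(1/\delta)}{n}} + \frac{2c\log(1/\delta)}{3n}. \qquad (7.38)$$

In particular, if $\sigma \le \sqrt{2c^2\log(1/\delta)/(9n)}$, then with probability at least $1-\delta$,

$$|\bar{X}_n - \mu| \le \frac{C}{n} \qquad (7.39)$$

where $C = 4c\log(1/\delta)/3$.

We also get a very specific inequality in the special case that $X$ is Gaussian.

7.40 Theorem. Suppose that $X_1,\dots,X_n \sim N(\mu,\sigma^2)$. Then

$$P(|\bar{X}_n - \mu| > \epsilon) \le \exp\Big\{-\frac{n\epsilon^2}{2\sigma^2}\Big\}. \qquad (7.41)$$

Proof. Let $X \sim N(0,1)$ with density $\phi(x) = (2\pi)^{-1/2}e^{-x^2/2}$ and distribution function $\Phi(x) = \int_{-\infty}^x \phi(s)\,ds$. For any $\epsilon > 0$,

$$P(X > \epsilon) = \int_\epsilon^\infty \phi(s)\,ds \le \frac{1}{\epsilon}\int_\epsilon^\infty s\,\phi(s)\,ds = -\frac{1}{\epsilon}\int_\epsilon^\infty \phi'(s)\,ds = \frac{\phi(\epsilon)}{\epsilon} \le \frac{1}{\epsilon}e^{-\epsilon^2/2}. \qquad (7.42)$$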

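By symmetry we have that

$$P(|X| > \epsilon) \le \frac{2}{\epsilon}e^{-\epsilon^2/2}.$$

Now suppose that $X_1,\dots,X_n \sim N(\mu,\sigma^2)$. Then $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i \sim N(\mu,\sigma^2/n)$. Let $Z \sim N(0,1)$. Then, for all large $n$,

$$P(|\bar{X}_n - \mu| > \epsilon) = P\big(\sqrt{n}\,|\bar{X}_n - \mu|/\sigma > \sqrt{n}\,\epsilon/\sigma\big) = P\big(|Z| > \sqrt{n}\,\epsilon/\sigma\big) \qquad (7.43)$$

$$\le \frac{2\sigma}{\epsilon\sqrt{n}}e^{-n\epsilon^2/(2\sigma^2)} \le e^{-n\epsilon^2/(2\sigma^2)}. \qquad (7.44)$$

7.2.3 Bounds on Expected Values

Suppose we have an exponential bound on $P(X_n > \epsilon)$. In that case we can bound $E(X_n)$ as follows.

7.45 Theorem. Suppose that $X_n \ge 0$ and that for every $\epsilon > 0$,

$$P(X_n > \epsilon) \le c_1e^{-c_2n\epsilon^2} \qquad (7.46)$$

for some $c_2 > 0$ and $c_1 > 1/e$. Then, $E(X_n) \le \sqrt{C/n}$ where $C = (1 + \log(c_1))/c_2$.

Proof. Recall that for any nonnegative random variable $Y$, $E(Y) = \int_0^\infty P(Y \ge t)\,dt$. Hence, for any $a > 0$,

$$E(X_n^2) = \int_0^\infty P(X_n^2 \ge t)\,dt = \int_0^a P(X_n^2 \ge t)\,dt + \int_a^\infty P(X_n^2 \ge t)\,dt \le a + \int_a^\infty P(X_n^2 \ge t)\,dt.$$

Equation (7.46) implies that $P(X_n > \sqrt{t}) \le c_1e^{-c_2nt}$. Hence,

$$E(X_n^2) \le a + \int_a^\infty P(X_n \ge \sqrt{t})\,dt \le a + c_1\int_a^\infty e^{-c_2nt}\,dt = a + \frac{c_1e^{-c_2na}}{c_2n}.$$

Set $a = \log(c_1)/(nc_2)$ and conclude that

$$E(X_n^2) \le \frac{\log(c_1)}{nc_2} + \frac{1}{nc_2} = \frac{1 + \log(c_1)}{nc_2}.$$

Finally, we have

$$E(X_n) \le \sqrt{E(X_n^2)} \le \sqrt{\frac{1 + \log(c_1)}{nc_2}}.$$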
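As a sanity check of Theorem 7.45, take $X_n = |\bar{Z}_n - p|$ for Bernoulli($p$) data. Hoeffding's inequality gives the tail bound (7.46) with $c_1 = 2$ and $c_2 = 2$, so the theorem predicts $E|\bar{Z}_n - p| \le \sqrt{(1+\log 2)/(2n)}$. The sketch below, which is illustrative and uses hypothetical function names, compares that bound with the exact expectation computed from the binomial distribution.

```python
import math

def mean_abs_deviation(n, p):
    """Exact E|Zbar_n - p| for the mean of n iid Bernoulli(p) variables."""
    return sum(
        abs(k / n - p) * math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n + 1)
    )

def expectation_bound(n, c1, c2):
    """Bound sqrt(C / n) with C = (1 + log c1) / c2, as in Theorem 7.45."""
    return math.sqrt((1.0 + math.log(c1)) / (c2 * n))

if __name__ == "__main__":
    for n in (10, 100, 400):
        # Hoeffding's inequality supplies c1 = 2 and c2 = 2.
        print(n, mean_abs_deviation(n, 0.5), expectation_bound(n, 2.0, 2.0))
```

Both columns shrink at the rate $n^{-1/2}$; the bound is looser by a constant factor.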

Now we consider bounding the maximum of a set of random variables.

7.47 Theorem. Let $X_1,\dots,X_n$ be random variables. Suppose there exists $\sigma > 0$ such that $E(e^{tX_i}) \le e^{t^2\sigma^2/2}$ for all $t > 0$. Then

$$E\Big(\max_{1\le i\le n} X_i\Big) \le \sigma\sqrt{2\log n}. \qquad (7.48)$$

Proof. By Jensen's inequality,

$$\exp\Big\{tE\Big(\max_{1\le i\le n} X_i\Big)\Big\} \le E\exp\Big\{t\max_{1\le i\le n} X_i\Big\} = E\Big(\max_{1\le i\le n}\exp\{tX_i\}\Big) \le \sum_{i=1}^n E(\exp\{tX_i\}) \le ne^{t^2\sigma^2/2}.$$

Thus, $E(\max_{1\le i\le n} X_i) \le \frac{\log n}{t} + \frac{t\sigma^2}{2}$. The result follows by setting $t = \sqrt{2\log n}/\sigma$.

7.3 Uniform Bounds

7.3.1 Binary Functions

A binary function on a space $\mathcal{Z}$ is a function $f:\mathcal{Z} \to \{0,1\}$. Let $\mathcal{F}$ be a class of binary functions on $\mathcal{Z}$. For any $z_1,\dots,z_n$ define

$$\mathcal{F}_{z_1,\dots,z_n} = \big\{(f(z_1),\dots,f(z_n)) : f \in \mathcal{F}\big\}. \qquad (7.49)$$

$\mathcal{F}_{z_1,\dots,z_n}$ is a finite collection of binary vectors and $|\mathcal{F}_{z_1,\dots,z_n}| \le 2^n$. The set $\mathcal{F}_{z_1,\dots,z_n}$ is called the projection of $\mathcal{F}$ onto $z_1,\dots,z_n$.

7.50 Example. Let $\mathcal{F} = \{f_t : t \in \mathbb{R}\}$ where $f_t(z) = 1$ if $z > t$ and $f_t(z) = 0$ if $z \le t$. Consider three real numbers $z_1 < z_2 < z_3$. Then

$$\mathcal{F}_{z_1,z_2,z_3} = \big\{(0,0,0),\,(0,0,1),\,(0,1,1),\,(1,1,1)\big\}.$$

Define the growth function or shattering number by

$$s(\mathcal{F},n) = \sup_{z_1,\dots,z_n}\big|\mathcal{F}_{z_1,\dots,z_n}\big|. \qquad (7.51)$$
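The projection in Example 7.50 can be enumerated directly. The sketch below is an illustration with a hypothetical function name: it lists the binary vectors that the threshold class $f_t(z) = I(z > t)$ produces on a given set of points. Only finitely many thresholds matter: one below all points, one between each pair of consecutive distinct points, and one at the largest point.

```python
def threshold_projection(points):
    """Binary vectors (f_t(z_1), ..., f_t(z_n)) over all thresholds t, with f_t(z) = 1 if z > t."""
    zs = sorted(set(points))
    # One representative threshold per distinct labelling.
    thresholds = [zs[0] - 1.0]
    thresholds += [(lo + hi) / 2.0 for lo, hi in zip(zs, zs[1:])]
    thresholds.append(zs[-1])
    return {tuple(1 if z > t else 0 for z in points) for t in thresholds}

if __name__ == "__main__":
    print(sorted(threshold_projection([1.0, 2.0, 3.0])))
    print(len(threshold_projection([0.5, 2.0, 3.5, 7.0, 9.0])))
```

For $n$ distinct points the projection has $n+1$ elements, so the growth function of this class is $s(\mathcal{F},n) = n+1$, far below $2^n$.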

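A binary function $f$ can be thought of as an indicator function for a set, namely, $A = \{z : f(z) = 1\}$. Conversely, any set can be thought of as a binary function, namely, its indicator function $I_A(z)$. We can therefore re-express the growth function in terms of sets. If $\mathcal{A}$ is a class of subsets of $\mathbb{R}^d$ then $s(\mathcal{A},n)$ is defined to be $s(\mathcal{F},n)$ where $\mathcal{F} = \{I_A : A \in \mathcal{A}\}$ is the set of indicator functions, and $s(\mathcal{A},n)$ is again called the shattering number. It follows that

$$s(\mathcal{A},n) = \max_F s(\mathcal{A},F)$$

where the maximum is over all finite sets $F$ of size $n$ and $s(\mathcal{A},F) = |\{A \cap F : A \in \mathcal{A}\}|$ denotes the number of subsets of $F$ picked out by $\mathcal{A}$. We say that a finite set $F$ of size $n$ is shattered by $\mathcal{A}$ if $s(\mathcal{A},F) = 2^n$.

7.52 Theorem. Let $\mathcal{A}$ and $\mathcal{B}$ be classes of subsets of $\mathbb{R}^d$.

1. $s(\mathcal{A},n+m) \le s(\mathcal{A},n)\,s(\mathcal{A},m)$.
2. If $\mathcal{C} = \mathcal{A} \cup \mathcal{B}$ then $s(\mathcal{C},n) \le s(\mathcal{A},n) + s(\mathcal{B},n)$.
3. If $\mathcal{C} = \{A \cup B : A \in \mathcal{A}, B \in \mathcal{B}\}$ then $s(\mathcal{C},n) \le s(\mathcal{A},n)\,s(\mathcal{B},n)$.
4. If $\mathcal{C} = \{A \cap B : A \in \mathcal{A}, B \in \mathcal{B}\}$ then $s(\mathcal{C},n) \le s(\mathcal{A},n)\,s(\mathcal{B},n)$.

Proof. See exercise 10.

VC Dimension. Recall that a finite set $F$ of size $n$ is shattered by $\mathcal{A}$ if $s(\mathcal{A},F) = 2^n$. The VC dimension (named after Vapnik and Chervonenkis) of $\mathcal{A}$ is the size of the largest set that can be shattered by $\mathcal{A}$. The VC dimension of a class of sets $\mathcal{A}$ is

$$\mathrm{VC}(\mathcal{A}) = \sup\big\{n : s(\mathcal{A},n) = 2^n\big\}. \qquad (7.53)$$

The VC dimension of a class of binary functions $\mathcal{F}$ is

$$\mathrm{VC}(\mathcal{F}) = \sup\big\{n : s(\mathcal{F},n) = 2^n\big\}. \qquad (7.54)$$

If the VC dimension is finite, then the growth function cannot grow too quickly. In fact, there is a phase transition: $s(\mathcal{F},n) = 2^n$ for $n < d$ and then the growth switches to polynomial.

7.55 Theorem (Sauer's Theorem). Suppose that $\mathcal{F}$ has finite VC dimension $d$. Then,

$$s(\mathcal{F},n) \le \sum_{i=0}^d \binom{n}{i} \qquad (7.56)$$

and for all $n \ge d$,

$$s(\mathcal{F},n) \le \Big(\frac{en}{d}\Big)^d. \qquad (7.57)$$

Proof. When $n = d = 1$, (7.56) clearly holds. We proceed by induction. Suppose that (7.56) holds for $n-1$ and $d-1$ and also that it holds for $n-1$ and $d$. We will show that it holds for $n$ and $d$. Let $h(n,d) = \sum_{i=0}^d \binom{n}{i}$. We need to show that $\mathrm{VC}(\mathcal{F}) \le d$ implies that $s(\mathcal{F},n) \le h(n,d)$. Let $F_1 = \{z_1,\dots,z_n\}$ and $F_2 = \{z_2,\dots,z_n\}$. Let $\mathcal{F}_1 = \{(f(z_1),\dots,f(z_n)) : f \in \mathcal{F}\}$ and $\mathcal{F}_2 = \{(f(z_2),\dots,f(z_n)) : f \in \mathcal{F}\}$. For $f,g \in \mathcal{F}$, write $f \sim g$ if $g(z_1) = 1 - f(z_1)$ and $g(z_j) = f(z_j)$ for $j = 2,\dots,n$. Let

$$\mathcal{G} = \big\{f \in \mathcal{F} : \text{there exists } g \in \mathcal{F} \text{ such that } g \sim f\big\}.$$

Define $\mathcal{F}_3 = \{(f(z_2),\dots,f(z_n)) : f \in \mathcal{G}\}$. Then $|\mathcal{F}_1| = |\mathcal{F}_2| + |\mathcal{F}_3|$. Note that $\mathrm{VC}(\mathcal{F}_2) \le d$ and $\mathrm{VC}(\mathcal{F}_3) \le d-1$. The latter follows since, if $\mathcal{F}_3$ shatters a set, then we can add $z_1$ to create a set that is shattered by $\mathcal{F}_1$. By assumption $|\mathcal{F}_2| \le h(n-1,d)$ and $|\mathcal{F}_3| \le h(n-1,d-1)$. Hence,

$$|\mathcal{F}_1| \le h(n-1,d) + h(n-1,d-1) = h(n,d).$$

Thus, $s(\mathcal{F},n) \le h(n,d)$, which proves (7.56).

To prove (7.57), we use the fact that $n \ge d$ and so:

$$\sum_{i=0}^d \binom{n}{i} \le \Big(\frac{n}{d}\Big)^d\sum_{i=0}^d \binom{n}{i}\Big(\frac{d}{n}\Big)^i \le \Big(\frac{n}{d}\Big)^d\sum_{i=0}^n \binom{n}{i}\Big(\frac{d}{n}\Big)^i = \Big(\frac{n}{d}\Big)^d\Big(1 + \frac{d}{n}\Big)^n \le \Big(\frac{n}{d}\Big)^d e^d.$$

The VC dimensions of some common examples are summarized in Table 7.1.

Now we can extend the concentration inequalities to hold uniformly over sets of binary functions. We start with finite collections.

7.58 Theorem. Suppose that $\mathcal{F} = \{f_1,\dots,f_N\}$ is a finite set of binary functions. Then, with probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{2}{n}\log\Big(\frac{2N}{\delta}\Big)}. \qquad (7.59)$$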
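The two bounds in Sauer's Theorem, (7.56) and (7.57), are easy to tabulate. The sketch below is illustrative and uses hypothetical function names: it evaluates the binomial sum $\sum_{i=0}^d \binom{n}{i}$ and the polynomial bound $(en/d)^d$ so that the switch from $2^n$ growth to polynomial growth can be seen numerically.

```python
import math

def sauer_sum(n, d):
    """Binomial-sum bound sum_{i=0}^{d} C(n, i) on the growth function, as in (7.56)."""
    return sum(math.comb(n, i) for i in range(d + 1))

def sauer_poly(n, d):
    """Polynomial bound (e n / d)^d from (7.57), stated for n >= d >= 1."""
    return (math.e * n / d) ** d

if __name__ == "__main__":
    d = 3
    for n in (2, 3, 5, 10, 50):
        # For n <= d the binomial sum equals 2^n; afterwards it grows like n^d.
        print(n, 2**n, sauer_sum(n, d), sauer_poly(n, d) if n >= d else None)
```

With $d = 1$ the binomial sum equals $n + 1$, which matches the growth function of the threshold class in Example 7.50.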

    Class A                                 VC dimension V_A
    A = {A_1, ..., A_N}                     log_2 N
    Intervals [a, b] on the real line       2
    Discs in R^2                            3
    Closed balls in R^d                     d + 2
    Rectangles in R^d                       2d
    Half-spaces in R^d                      d + 1
    Convex polygons in R^2                  infinity

Table 7.1. The VC dimension of some classes $\mathcal{A}$.

Proof. It follows from Hoeffding's inequality that, for each $f \in \mathcal{F}$, $P(|P_n(f) - P(f)| > \epsilon) \le 2e^{-n\epsilon^2/2}$. Hence,

$$P\Big(\max_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon\Big) = P\big(|P_n(f) - P(f)| > \epsilon \text{ for some } f \in \mathcal{F}\big) \le \sum_{j=1}^N P\big(|P_n(f_j) - P(f_j)| > \epsilon\big) \le 2Ne^{-n\epsilon^2/2}.$$

The conclusion follows.

Now we consider results for the case where $\mathcal{F}$ is infinite. We begin with an important result due to Vapnik and Chervonenkis.

7.60 Theorem (Vapnik and Chervonenkis). Let $\mathcal{F}$ be a class of binary functions. For any $t > \sqrt{2/n}$,

$$P\Big(\sup_{f\in\mathcal{F}} |(P_n - P)f| > t\Big) \le 4\,s(\mathcal{F},2n)\,e^{-nt^2/8} \qquad (7.61)$$

and hence, with probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8}{n}\log\Big(\frac{4\,s(\mathcal{F},2n)}{\delta}\Big)}. \qquad (7.62)$$

Before proving the theorem, we need the symmetrization lemma. Let $Z_1',\dots,Z_n'$ denote a second independent sample from $P$. Let $P_n'$ denote the empirical distribution of this second sample. The variables $Z_1',\dots,Z_n'$ are called a ghost sample.

7.63 Lemma (Symmetrization). For all $t > \sqrt{2/n}$,

$$P\Big(\sup_{f\in\mathcal{F}} |(P_n - P)f| > t\Big) \le 2P\Big(\sup_{f\in\mathcal{F}} |(P_n - P_n')f| > t/2\Big). \qquad (7.64)$$

Proof. Let $f_n \in \mathcal{F}$ maximize $|(P_n - P)f|$. Note that $f_n$ is a random function as it depends on $Z_1,\dots,Z_n$. We claim that if $|(P_n - P)f_n| > t$ and $|(P - P_n')f_n| \le t/2$ then $|(P_n' - P_n)f_n| > t/2$. This follows since

$$t < |(P_n - P)f_n| = |(P_n - P_n' + P_n' - P)f_n| \le |(P_n - P_n')f_n| + |(P_n' - P)f_n| \le |(P_n - P_n')f_n| + \frac{t}{2}$$

and hence $|(P_n' - P_n)f_n| > t/2$. So

$$I\big(|(P_n - P)f_n| > t\big)\,I\big(|(P - P_n')f_n| \le t/2\big) = I\big(|(P_n - P)f_n| > t,\ |(P - P_n')f_n| \le t/2\big) \le I\big(|(P_n' - P_n)f_n| > t/2\big).$$

Now take the expected value over $Z_1',\dots,Z_n'$ and conclude that

$$I\big(|(P_n - P)f_n| > t\big)\,P'\big(|(P - P_n')f_n| \le t/2\big) \le P'\big(|(P_n' - P_n)f_n| > t/2\big). \qquad (7.65)$$

By Chebyshev's inequality,

$$P'\big(|(P - P_n')f_n| \le t/2\big) \ge 1 - \frac{4\,\mathrm{Var}'(f_n)}{nt^2} \ge 1 - \frac{1}{nt^2} \ge \frac{1}{2}.$$

(Here we used the fact that $W \in [0,1]$ implies that $\mathrm{Var}(W) \le 1/4$.) Inserting this into (7.65) we have that

$$I\big(|(P_n - P)f_n| > t\big) \le 2P'\big(|(P_n' - P_n)f_n| > t/2\big).$$

Thus,

$$I\Big(\sup_{f\in\mathcal{F}} |(P_n - P)f| > t\Big) \le 2P'\Big(\sup_{f\in\mathcal{F}} |(P_n' - P_n)f| > t/2\Big).$$

Now take the expectation over $Z_1,\dots,Z_n$ to conclude that

$$P\Big(\sup_{f\in\mathcal{F}} |(P_n - P)f| > t\Big) \le 2P\Big(\sup_{f\in\mathcal{F}} |(P_n' - P_n)f| > t/2\Big).$$

The importance of symmetrization is that we have replaced $(P_n - P)f$, which can take any real value, with $(P_n - P_n')f$, which can take only finitely many values. Now we prove the Vapnik-Chervonenkis theorem.

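Proof. Let $V = \mathcal{F}_{Z_1',\dots,Z_n',Z_1,\dots,Z_n}$. For any $v \in V$ write $(P_n' - P_n)v$ to mean $\frac{1}{n}\big(\sum_{i=1}^n v_i - \sum_{i=n+1}^{2n} v_i\big)$. Using the symmetrization lemma and Hoeffding's inequality,

$$P\Big(\sup_{f\in\mathcal{F}} |(P_n - P)f| > t\Big) \le 2P\Big(\sup_{f\in\mathcal{F}} |(P_n' - P_n)f| > t/2\Big) = 2P\Big(\max_{v\in V} |(P_n' - P_n)v| > t/2\Big)$$

$$\le 2\sum_{v\in V} P\big(|(P_n' - P_n)v| > t/2\big) \le 2\sum_{v\in V} 2e^{-nt^2/8} \le 4\,s(\mathcal{F},2n)\,e^{-nt^2/8}.$$

Recall that, for a class with finite VC dimension $d$, $s(\mathcal{F},n) \le (en/d)^d$, and therefore $s(\mathcal{F},2n) \le (2en/d)^d$. Hence we have:

7.66 Corollary. If $\mathcal{F}$ has finite VC dimension $d$, then, with probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8}{n}\Big(\log\Big(\frac{4}{\delta}\Big) + d\log\Big(\frac{2en}{d}\Big)\Big)}. \qquad (7.67)$$

7.3.2 Rademacher Complexity

A more general way to develop uniform bounds is to use a quantity called Rademacher complexity. In this section we assume that $\mathcal{F}$ is a class of functions $f$ such that $0 \le f(z) \le 1$. Random variables $\sigma_1,\dots,\sigma_n$ are called Rademacher random variables if they are independent, identically distributed and $P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$. Define the Rademacher complexity of $\mathcal{F}$ by

$$\mathrm{Rad}_n(\mathcal{F}) = E\Big(\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)\Big). \qquad (7.68)$$

Define the empirical Rademacher complexity of $\mathcal{F}$ by

$$\mathrm{Rad}_n(\mathcal{F},Z^n) = E_\sigma\Big(\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)\Big) \qquad (7.69)$$

where $Z^n = (Z_1,\dots,Z_n)$ and the expectation is over $\sigma$ only.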
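For a class of binary functions and a small sample, the empirical Rademacher complexity can be computed exactly, because the supremum depends on $f$ only through the vector $(f(Z_1),\dots,f(Z_n))$. The sketch below is illustrative and makes two assumptions that should be checked against the intended definition: it takes the complexity to be $E_\sigma[\max_v \frac{1}{n}\sum_i \sigma_i v_i]$ without an absolute value, and it enumerates all $2^n$ sign patterns, which is only feasible for small $n$. The function names are hypothetical.

```python
import itertools
import math

def empirical_rademacher(vectors):
    """Exact E_sigma[max_v (1/n) sum_i sigma_i v_i], averaging over all 2^n sign patterns."""
    vectors = [tuple(v) for v in vectors]
    n = len(vectors[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        # Supremum over the class reduces to a maximum over projected vectors.
        total += max(sum(s * x for s, x in zip(signs, v)) for v in vectors)
    return total / (n * 2**n)

def finite_class_bound(num_vectors, n):
    """Bound sqrt(2 log(N) / n) for N projected binary vectors on n points."""
    return math.sqrt(2.0 * math.log(num_vectors) / n)

if __name__ == "__main__":
    all_vectors = list(itertools.product((0, 1), repeat=4))
    print(empirical_rademacher(all_vectors), finite_class_bound(len(all_vectors), 4))
```

A class that realizes every binary vector has complexity $1/2$ under this definition, since the best vector picks out exactly the coordinates with $\sigma_i = +1$; a single fixed function has complexity 0.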

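Intuitively, $\mathrm{Rad}_n(\mathcal{F})$ is large if we can find functions $f \in \mathcal{F}$ that look like random noise, that is, they are highly correlated with $\sigma_1,\dots,\sigma_n$. Here are some properties of the Rademacher complexity.

7.70 Lemma.

1. If $\mathcal{F} \subset \mathcal{G}$ then $\mathrm{Rad}_n(\mathcal{F},Z^n) \le \mathrm{Rad}_n(\mathcal{G},Z^n)$.
2. Let $\mathrm{conv}(\mathcal{F})$ denote the convex hull of $\mathcal{F}$. Then $\mathrm{Rad}_n(\mathcal{F},Z^n) = \mathrm{Rad}_n(\mathrm{conv}(\mathcal{F}),Z^n)$.
3. For any $c \in \mathbb{R}$, $\mathrm{Rad}_n(c\mathcal{F},Z^n) = |c|\,\mathrm{Rad}_n(\mathcal{F},Z^n)$.
4. Let $g:\mathbb{R} \to \mathbb{R}$ be such that $g(0) = 0$ and $|g(y) - g(x)| \le L|x - y|$ for all $x,y$. Then $\mathrm{Rad}_n(g\circ\mathcal{F},Z^n) \le 2L\,\mathrm{Rad}_n(\mathcal{F},Z^n)$.

Proof. See Exercise 8.

7.71 Theorem. With probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{1}{2n}\log\Big(\frac{2}{\delta}\Big)} \qquad (7.72)$$

and

$$\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le 2\,\mathrm{Rad}_n(\mathcal{F},Z^n) + \sqrt{\frac{4}{n}\log\Big(\frac{2}{\delta}\Big)}. \qquad (7.73)$$

Proof. The proof has two steps. First we show that $\sup_{f\in\mathcal{F}} |P_n(f) - P(f)|$ is close to its mean. Then we bound the mean.

Step 1: Let $g(Z_1,\dots,Z_n) = \sup_{f\in\mathcal{F}} |P_n(f) - P(f)|$. If we change $Z_i$ to some other value $Z_i'$ then $|g(Z_1,\dots,Z_n) - g(Z_1,\dots,Z_i',\dots,Z_n)| \le \frac{1}{n}$. By McDiarmid's inequality,

$$P\big(|g(Z_1,\dots,Z_n) - E[g(Z_1,\dots,Z_n)]| > \epsilon\big) \le 2e^{-2n\epsilon^2}.$$

Hence, with probability at least $1-\delta$,

$$g(Z_1,\dots,Z_n) \le E[g(Z_1,\dots,Z_n)] + \sqrt{\frac{1}{2n}\log\Big(\frac{2}{\delta}\Big)}. \qquad (7.74)$$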

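Step 2: Now we bound $E[g(Z_1,\dots,Z_n)]$. Once again we introduce a ghost sample $Z_1',\dots,Z_n'$ and Rademacher variables $\sigma_1,\dots,\sigma_n$. Note that $P(f) = E'P_n'(f)$. Also note that

$$\frac{1}{n}\sum_{i=1}^n \big(f(Z_i') - f(Z_i)\big) \stackrel{d}{=} \frac{1}{n}\sum_{i=1}^n \sigma_i\big(f(Z_i') - f(Z_i)\big)$$

where $\stackrel{d}{=}$ means equal in distribution. Hence,

$$E[g(Z_1,\dots,Z_n)] = E\Big[\sup_{f\in\mathcal{F}} |P(f) - P_n(f)|\Big] = E\Big[\sup_{f\in\mathcal{F}} \big|E'(P_n'(f) - P_n(f))\big|\Big]$$

$$\le EE'\Big[\sup_{f\in\mathcal{F}} |P_n'(f) - P_n(f)|\Big] = EE'\Big[\sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \big(f(Z_i') - f(Z_i)\big)\Big|\Big] = EE'\Big[\sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i\big(f(Z_i') - f(Z_i)\big)\Big|\Big]$$

$$\le E'\Big[\sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i')\Big|\Big] + E\Big[\sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)\Big|\Big] = 2\,\mathrm{Rad}_n(\mathcal{F}).$$

Combining this bound with (7.74) proves the first result.

To prove the second result, let $a(Z_1,\dots,Z_n) = \mathrm{Rad}_n(\mathcal{F},Z^n)$ and note that $a(Z_1,\dots,Z_n)$ changes by at most $1/n$ if we change one observation. McDiarmid's inequality implies that $|\mathrm{Rad}_n(\mathcal{F},Z^n) - \mathrm{Rad}_n(\mathcal{F})| \le \sqrt{\frac{1}{2n}\log\big(\frac{2}{\delta}\big)}$ with probability at least $1-\delta$. Combining this with the first result yields the second result.

In the special case where $\mathcal{F}$ is a class of binary functions, we can relate $\mathrm{Rad}_n(\mathcal{F})$ to shattering numbers.

7.75 Theorem. Let $\mathcal{F}$ be a set of binary functions. Then, for all $n$,

$$\mathrm{Rad}_n(\mathcal{F}) \le \sqrt{\frac{2\log s(\mathcal{F},n)}{n}}. \qquad (7.76)$$

Proof. Let $D = \{Z_1,\dots,Z_n\}$. Define $S(f,\sigma) = \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)$ and $S(v,\sigma) = \frac{1}{n}\sum_{i=1}^n \sigma_i v_i$. Now, $-1 \le \sigma_i f(Z_i) \le 1$. Note that

$$\mathrm{Rad}_n(\mathcal{F}) = E\Big(\sup_{f\in\mathcal{F}} S(f,\sigma)\Big) = E\Big(E\Big(\sup_{f\in\mathcal{F}} S(f,\sigma) \,\Big|\, D\Big)\Big) = E\Big(E\Big(\max_{v\in\mathcal{F}_{Z_1,\dots,Z_n}} S(v,\sigma) \,\Big|\, D\Big)\Big).$$

Now, $\sigma_i v_i/n$ has mean 0 and $-1/n \le \sigma_i v_i/n \le 1/n$ so, by Lemma 7.10, $E(e^{t\sigma_i v_i/n}) \le e^{t^2/(2n^2)}$ for any $t > 0$. From Theorem 7.47, with $V = \mathcal{F}_{Z_1,\dots,Z_n}$,

$$E\Big(\max_{v\in V} S(v,\sigma) \,\Big|\, D\Big) \le \sqrt{\frac{2\log |V|}{n}} \le \sqrt{\frac{2\log s(\mathcal{F},n)}{n}}$$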

and the result follows.

In fact, there is a sharper relationship between $\mathrm{Rad}_n(\mathcal{F})$ and VC dimension.

7.77 Theorem. Suppose that $\mathcal{F}$ has finite VC dimension $d$. There exists a universal constant $C > 0$ such that $\mathrm{Rad}_n(\mathcal{F}) \le C\sqrt{d/n}$.

For a proof, see, for example, Devroye and Lugosi (2001).

Combining these results with Theorem 7.75 and Theorem 7.77 we get the following result.

7.78 Corollary. With probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8\log s(\mathcal{F},n)}{n}} + \sqrt{\frac{1}{2n}\log\Big(\frac{2}{\delta}\Big)}. \qquad (7.79)$$

If $\mathcal{F}$ has finite VC dimension $d$ then, with probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le 2C\sqrt{\frac{d}{n}} + \sqrt{\frac{1}{2n}\log\Big(\frac{2}{\delta}\Big)}. \qquad (7.80)$$

7.3.3 Bounds For Classes of Real Valued Functions

Suppose now that $\mathcal{F}$ is a class of real-valued functions. There are various methods to obtain uniform bounds. We consider two such methods: covering numbers and bracketing numbers.

If $Q$ is a measure and $p \ge 1$, define

$$\|f\|_{L_p(Q)} = \Big(\int |f(x)|^p\,dQ(x)\Big)^{1/p}.$$

When $Q$ is Lebesgue measure we simply write $\|f\|_p$. We also define

$$\|f\|_\infty = \sup_x |f(x)|.$$

A set $\mathcal{C} = \{f_1,\dots,f_N\}$ is an $\epsilon$-cover of $\mathcal{F}$ (or an $\epsilon$-net) if, for every $f \in \mathcal{F}$ there exists an $f_j \in \mathcal{C}$ such that $\|f - f_j\|_{L_p(Q)} < \epsilon$.

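7.81 Definition. The size of the smallest $\epsilon$-cover is called the covering number and is denoted by $N_p(\epsilon,\mathcal{F},Q)$. The uniform covering number is defined by

$$N_p(\epsilon,\mathcal{F}) = \sup_Q N_p(\epsilon,\mathcal{F},Q)$$

where the supremum is over all probability measures $Q$.

Now we show how covering numbers can be used to obtain bounds.

7.82 Theorem. Suppose that $\|f\|_\infty \le B$ for all $f \in \mathcal{F}$. Then,

$$P\Big(\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon\Big) \le 2N(\epsilon/3,\mathcal{F},L_\infty)\,e^{-n\epsilon^2/(18B^2)}.$$

Proof. Let $N = N(\epsilon/3,\mathcal{F},L_\infty)$ and let $\mathcal{C} = \{f_1,\dots,f_N\}$ be an $\epsilon/3$ cover. For any $f \in \mathcal{F}$ there is an $f_j \in \mathcal{C}$ such that $\|f - f_j\|_\infty \le \epsilon/3$. So

$$|P_n(f) - P(f)| \le |P_n(f) - P_n(f_j)| + |P_n(f_j) - P(f_j)| + |P(f_j) - P(f)| \le |P_n(f_j) - P(f_j)| + \frac{2\epsilon}{3}.$$

Hence,

$$P\Big(\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon\Big) \le P\Big(\max_{f_j\in\mathcal{C}} |P_n(f_j) - P(f_j)| + \frac{2\epsilon}{3} > \epsilon\Big) = P\Big(\max_{f_j\in\mathcal{C}} |P_n(f_j) - P(f_j)| > \frac{\epsilon}{3}\Big)$$

$$\le \sum_{j=1}^N P\Big(|P_n(f_j) - P(f_j)| > \frac{\epsilon}{3}\Big) \le 2N(\epsilon/3,\mathcal{F},L_\infty)\,e^{-n\epsilon^2/(18B^2)}$$

from the union bound and Hoeffding's inequality.

The VC dimension can be used to bound covering numbers.

7.83 Theorem. Let $\mathcal{F}$ be a class of functions $f:\mathbb{R}^d \to [0,B]$ with VC dimension $d$ such that $2 \le d < \infty$. Let $p \ge 1$ and $0 < \epsilon < B/4$. Then

$$N_p(\epsilon,\mathcal{F}) \le 3\Big(\frac{2eB^p}{\epsilon^p}\log\Big(\frac{3eB^p}{\epsilon^p}\Big)\Big)^d.$$

(For a proof, see Devroye, Györfi and Lugosi (1996).) However, there are cases where the covering numbers are finite and yet the VC dimension is infinite.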
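To make the notion of an $\epsilon$-cover concrete, the sketch below builds one greedily for a finite list of functions evaluated on a common grid, using the sup norm over the grid points. This is an illustration only: the greedy construction is a swapped-in heuristic, not a method from the notes, and its size is an upper bound on the covering number of the listed functions rather than the covering number itself. The function names and the example class $f_\theta(x) = \theta x$ are hypothetical.

```python
def sup_distance(f, g):
    """Sup-norm distance between two functions given as value tuples on a common grid."""
    return max(abs(a - b) for a, b in zip(f, g))

def greedy_cover(functions, eps):
    """Greedy eps-cover: keep f as a new center unless an existing center is within eps."""
    centers = []
    for f in functions:
        if all(sup_distance(f, c) >= eps for c in centers):
            centers.append(f)
    return centers

if __name__ == "__main__":
    grid = [i / 10 for i in range(11)]
    # Hypothetical class: f_theta(x) = theta * x on [0, 1] for theta on a fine grid.
    functions = [tuple((j / 20) * x for x in grid) for j in range(21)]
    for eps in (0.5, 0.25, 0.1):
        print(eps, len(greedy_cover(functions, eps)))
```

Every function that is not kept as a center lies strictly within `eps` of some center, so the returned list satisfies the cover definition with strict inequality.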

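Bracketing Numbers. Another measure of complexity is the bracketing number. Given a pair of functions $\ell$ and $u$ with $\ell \le u$, we define the bracket

$$[\ell,u] = \big\{h : \ell(x) \le h(x) \le u(x) \text{ for all } x\big\}.$$

A collection of pairs of functions $(\ell_1,u_1),\dots,(\ell_N,u_N)$ is a bracketing of $\mathcal{F}$ if

$$\mathcal{F} \subset \bigcup_{j=1}^N [\ell_j,u_j].$$

The collection is an $\epsilon$-$L_q(P)$-bracketing if it is a bracketing and if

$$\Big(\int |u_j(x) - \ell_j(x)|^q\,dP(x)\Big)^{1/q} \le \epsilon$$

for $j = 1,\dots,N$. The bracketing number $N_{[\,]}(\epsilon,\mathcal{F},L_q(P))$ is the size of the smallest $\epsilon$ bracketing. Bracketing numbers are a little larger than covering numbers but provide stronger control of the class $\mathcal{F}$.

7.84 Theorem.

1. $N_p(\epsilon,\mathcal{F},P) \le N_{[\,]}(2\epsilon,\mathcal{F},L_p(P))$.
2. Let $X_1,\dots,X_n \sim P$. Suppose that $N_{[\,]}(\epsilon,\mathcal{F},L_1(P)) < \infty$ for all $\epsilon > 0$. Then, for every $\delta > 0$,

$$P\Big(\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > \delta\Big) \to 0 \qquad (7.85)$$

as $n \to \infty$.

Proof. See exercise 11.

7.86 Theorem. Let $A = \sup_f \int |f|\,dP$ and $B = \sup_f \|f\|_\infty$. Then

$$P\Big(\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon\Big) \le 2N_{[\,]}(\epsilon/8,\mathcal{F},L_1(P))\exp\Big\{-\frac{3n\epsilon^2}{4B[6A + \epsilon]}\Big\} + 2N_{[\,]}(\epsilon/8,\mathcal{F},L_1(P))\exp\Big\{-\frac{3n\epsilon}{40B}\Big\}.$$

Hence, if $\epsilon \le 2A/3$,

$$P\Big(\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon\Big) \le 4N_{[\,]}(\epsilon/8,\mathcal{F},L_1(P))\exp\Big\{-\frac{96n\epsilon^2}{76AB}\Big\}. \qquad (7.87)$$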

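Proof. (This proof follows Yukich (1985).) For notational simplicity in the proof, let us write $N(\epsilon) \equiv N_{[\,]}(\epsilon,\mathcal{F},L_1(P))$. Define $z_n(f) = \int f\,(dP_n - dP)$. Let $[\ell_1,u_1],\dots,[\ell_N,u_N]$ be a minimal $\epsilon/8$ bracketing. We may assume that for each $j$, $\|u_j\|_\infty \le B$ and $\|\ell_j\|_\infty \le B$. (Otherwise, we simply truncate the brackets.) For each $j$, choose some $f_j \in [\ell_j,u_j]$. Consider any $f \in \mathcal{F}$ and let $[\ell_j,u_j]$ denote a bracket containing $f$. Then

$$|z_n(f)| \le |z_n(f_j)| + |z_n(f - f_j)|.$$

Furthermore,

$$|z_n(f - f_j)| = \Big|\int (f - f_j)(dP_n - dP)\Big| \le \int |f - f_j|\,(dP_n + dP) \le \int |u_j - \ell_j|\,(dP_n + dP)$$

$$= \int |u_j - \ell_j|\,(dP_n - dP) + 2\int |u_j - \ell_j|\,dP \le z_n(|u_j - \ell_j|) + 2\cdot\frac{\epsilon}{8} = z_n(|u_j - \ell_j|) + \frac{\epsilon}{4}.$$

Hence,

$$|z_n(f)| \le |z_n(f_j)| + \Big[z_n(|u_j - \ell_j|) + \frac{\epsilon}{4}\Big].$$

Thus,

$$P\Big(\sup_{f\in\mathcal{F}} |z_n(f)| > \epsilon\Big) \le P\Big(\max_j |z_n(f_j)| > \epsilon/2\Big) + P\Big(\max_j |z_n(|u_j - \ell_j|)| + \epsilon/4 > \epsilon/2\Big)$$

$$\le P\Big(\max_j |z_n(f_j)| > \epsilon/2\Big) + P\Big(\max_j |z_n(|u_j - \ell_j|)| > \epsilon/4\Big).$$

Now

$$\mathrm{Var}(f_j) \le \int f_j^2\,dP = \int |f_j|\,|f_j|\,dP \le \|f_j\|_\infty\int |f_j|\,dP \le AB.$$

Hence, by Bernstein's inequality,

$$P\Big(\max_j |z_n(f_j)| > \epsilon/2\Big) \le 2\sum_{j=1}^N \exp\Big\{-\frac{1}{2}\,\frac{n(\epsilon/2)^2}{AB + B\epsilon/6}\Big\} \le 2N(\epsilon/8)\exp\Big\{-\frac{3n\epsilon^2}{4B[6A + \epsilon]}\Big\}.$$

Similarly,

$$\mathrm{Var}(|u_j - \ell_j|) \le \int (u_j - \ell_j)^2\,dP \le \int |u_j - \ell_j|\,|u_j - \ell_j|\,dP \le \|u_j - \ell_j\|_\infty\int |u_j - \ell_j|\,dP \le 2B\cdot\frac{\epsilon}{8} = \frac{\epsilon B}{4}.$$

Also, $\|u_j - \ell_j\|_\infty \le 2B$. Hence, by Bernstein's inequality,

$$P\Big(\max_j |z_n(|u_j - \ell_j|)| > \epsilon/4\Big) \le 2\sum_{j=1}^N \exp\Big\{-\frac{1}{2}\,\frac{n(\epsilon/4)^2}{\epsilon B/4 + 2B(\epsilon/4)/3}\Big\} \le 2N(\epsilon/8)\exp\Big\{-\frac{3n\epsilon}{40B}\Big\}.$$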

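The following result is from Example 19.7 of van der Vaart (1998).

7.88 Lemma. Let $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ where $\Theta$ is a bounded subset of $\mathbb{R}^d$. Suppose there exists a function $m$ such that, for every $\theta_1,\theta_2$,

$$|f_{\theta_1}(x) - f_{\theta_2}(x)| \le m(x)\,\|\theta_1 - \theta_2\|.$$

Then, writing $\|m\|_q = \big(\int |m(x)|^q\,dP(x)\big)^{1/q}$,

$$N_{[\,]}(\epsilon,\mathcal{F},L_q(P)) \le \Big(\frac{4\sqrt{d}\,\mathrm{diam}(\Theta)\,\|m\|_q}{\epsilon}\Big)^d.$$

Proof. Let

$$\delta = \frac{\epsilon}{4\sqrt{d}\,\|m\|_q}.$$

We can cover $\Theta$ with (at most) $N = (\mathrm{diam}(\Theta)/\delta)^d$ cubes $C_1,\dots,C_N$ of size $\delta$. Let $c_1,\dots,c_N$ denote the centers of the cubes. Note that $C_j \subset B(c_j,\sqrt{d}\,\delta)$ where $B(x,r)$ denotes a ball of radius $r$ centered at $x$. Hence, $\bigcup_j B(c_j,\sqrt{d}\,\delta)$ covers $\Theta$. Let $\theta_j$ be the projection of $c_j$ onto $\Theta$. Then $\bigcup_j B(\theta_j,2\sqrt{d}\,\delta)$ covers $\Theta$. In summary, for every $\theta \in \Theta$ there is a $\theta_j \in \{\theta_1,\dots,\theta_N\}$ such that

$$\|\theta - \theta_j\| \le 2\sqrt{d}\,\delta \le \frac{\epsilon}{2\,\|m\|_q}.$$

Define $\ell_j = f_{\theta_j} - \epsilon\,m(x)/(2\|m\|_q)$ and $u_j = f_{\theta_j} + \epsilon\,m(x)/(2\|m\|_q)$. We claim that the brackets $[\ell_1,u_1],\dots,[\ell_N,u_N]$ cover $\mathcal{F}$. To see this, choose any $f_\theta \in \mathcal{F}$. Let $\theta_j$ be the closest element of $\{\theta_1,\dots,\theta_N\}$ to $\theta$. Then

$$f_\theta(x) = f_{\theta_j}(x) + f_\theta(x) - f_{\theta_j}(x) \le f_{\theta_j}(x) + |f_\theta(x) - f_{\theta_j}(x)| \le f_{\theta_j}(x) + m(x)\,\|\theta - \theta_j\| \le f_{\theta_j}(x) + \frac{\epsilon\,m(x)}{2\,\|m\|_q} = u_j(x).$$

By a similar argument, $f_\theta(x) \ge \ell_j(x)$. Also, $\int (u_j - \ell_j)^q\,dP \le \epsilon^q$. Finally, note that the number of brackets is

$$N = \Big(\frac{\mathrm{diam}(\Theta)}{\delta}\Big)^d = \Big(\frac{4\sqrt{d}\,\mathrm{diam}(\Theta)\,\|m\|_q}{\epsilon}\Big)^d.$$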

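7.89 Example (Density Estimation). Let $X_1,\dots,X_n \sim P$ where $P$ has support on a compact set $\mathcal{X} \subset \mathbb{R}^d$. Consider the kernel density estimator

$$\hat{p}_h(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^d}K\Big(\frac{\|x - X_i\|}{h}\Big)$$

where $K$ is a smooth symmetric function and $h > 0$ is a bandwidth. We study $\hat{p}_h$ in detail in the chapter on nonparametric density estimation. Here we bound the sup norm distance between $\hat{p}_h(x)$ and its mean $p_h(x) = E(\hat{p}_h(x))$.

7.90 Theorem. Suppose that $K(x) \le K(0)$ for all $x$ and that

$$|K(y) - K(x)| \le L\|x - y\|$$

for all $x,y$. Then

$$P\Big(\sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon\Big) \le 2\Big(\frac{32L\sqrt{d}\,\mathrm{diam}(\mathcal{X})}{\epsilon h^{d+1}}\Big)^d\Big[\exp\Big\{-\frac{3n\epsilon^2h^d}{4K(0)(6 + \epsilon)}\Big\} + \exp\Big\{-\frac{3n\epsilon h^d}{40K(0)}\Big\}\Big].$$

Hence, if $\epsilon \le 2/3$ then

$$P\Big(\sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon\Big) \le 4\Big(\frac{32L\sqrt{d}\,\mathrm{diam}(\mathcal{X})}{\epsilon h^{d+1}}\Big)^d\exp\Big\{-\frac{3n\epsilon^2h^d}{28K(0)}\Big\}.$$

Proof. Let $\mathcal{F} = \{f_x : x \in \mathcal{X}\}$ where $f_x(u) = h^{-d}K(\|x - u\|/h)$. We apply Theorem 7.86 with $A = 1$ and $B = K(0)/h^d$. We need to bound $N_{[\,]}(\epsilon,\mathcal{F},L_1(P))$. Now

$$|f_x(u) - f_y(u)| = \frac{1}{h^d}\Big|K\Big(\frac{\|x - u\|}{h}\Big) - K\Big(\frac{\|y - u\|}{h}\Big)\Big| \le \frac{L}{h^{d+1}}\big|\|x - u\| - \|y - u\|\big| \le \frac{L}{h^{d+1}}\|x - y\|.$$

Apply Lemma 7.88 with $m(x) = L/h^{d+1}$. This implies that

$$N_{[\,]}(\epsilon,\mathcal{F},L_1(P)) \le \Big(\frac{4L\sqrt{d}\,\mathrm{diam}(\mathcal{X})}{\epsilon h^{d+1}}\Big)^d.$$

Hence, Theorem 7.86 yields

$$P\Big(\sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon\Big) \le 2\Big(\frac{32L\sqrt{d}\,\mathrm{diam}(\mathcal{X})}{\epsilon h^{d+1}}\Big)^d\Big[\exp\Big\{-\frac{3n\epsilon^2h^d}{4K(0)(6 + \epsilon)}\Big\} + \exp\Big\{-\frac{3n\epsilon h^d}{40K(0)}\Big\}\Big].$$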
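The kernel density estimator of Example 7.89, $\hat{p}_h(x) = \frac{1}{n}\sum_i h^{-d}K(\|x - X_i\|/h)$, can be written in a few lines. The sketch below is illustrative: the function names are hypothetical, and the default kernel is the standard Gaussian profile, which is an assumption made here for concreteness. That kernel satisfies $K(x) \le K(0)$ and is Lipschitz, but it is normalized to integrate to one only when $d = 1$.

```python
import math

def gaussian_kernel(r):
    """Standard normal profile K(r); normalized as a density only in one dimension."""
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate (1/n) sum_i h^{-d} K(||x - X_i|| / h) at the point x."""
    d = len(x)
    n = len(data)
    return sum(kernel(math.dist(x, xi) / h) for xi in data) / (n * h**d)

if __name__ == "__main__":
    sample = [(-1.0,), (0.5,), (2.0,)]
    for point in (-1.0, 0.0, 1.0, 2.0):
        print(point, kde((point,), sample, h=0.7))
```

Points are passed as tuples so that the same function works for $d > 1$; in that case a kernel with the appropriate $d$-dimensional normalizing constant should be supplied.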

120 Chapter 7 Cocetratio of Measure 791 Corollary Suppose that h = h =(C /) where 1/d ad C = (log ) a for some a 0 The P sup bp(x) p h (x) > apple 4 x 32L p d d diam(x ) (d+1) exp C 3 2 C d 1 28K(0) d Hece, for sufficietly large, P (sup bp(x) x p h (x) >) apple c 1 exp c 2 2 C d 1 d Note that the proofs of the last two results did ot deped o P Hece, if P is the set of distributio with support o X, we have that 2 sup P 2P P sup bp(x) p h (x) > apple c 1 exp x c 2 2 C d 1 d 792 Example Here are some further examples I exercise 12 you are asked to prove these results 1 Let F be the set of cdf s o R The N [] (, F,L 2 (P )) apple 2/ 2 2 (Sobolev Spaces) Let F be the fuctios f o [0, 1] such that kfk 1 apple 1, the (k 1) derivative is absolutely cotiuous ad R (f (k) (x)) 2 dx apple 1 The, there is a costat C>0 such that N [] (, F,L 1 (P )) apple exp " C 1 1 # k 3 Let F be the set of mootoe fuctios f o R such that kfk 1 apple 1 The, there is a costat C>0 such that apple 1 N [] (, F,L 1 (P )) apple exp C 2 74 Additioal results 741 Talagrad s Iequality Oe of the most importat developmets i cocetratio of measure is Talagrad s iequality (Talagrad 1994, 1996) which ca be thought of as a uiform versio of Berstei s

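inequality. Let $\mathcal{F}$ be a class of functions and define $Z_n = \sup_{f\in\mathcal{F}} |P_n(f)|$.

7.93 Theorem. Let $v \ge E\left(\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n f^2(X_i)\right)$ and $U \ge \sup_{f\in\mathcal{F}} \|f\|_\infty$. There exists a universal constant $K$ such that
\[ P\left(\Bigl|\sup_{f\in\mathcal{F}} |P_n(f)| - E\bigl(\sup_{f\in\mathcal{F}} |P_n(f)|\bigr)\Bigr| > t\right) \le K \exp\left\{-\frac{nt}{KU}\log\left(1 + \frac{tU}{v}\right)\right\}. \qquad (7.94) \]

To make use of Talagrand's inequality, we need to estimate $E(\sup_{f\in\mathcal{F}} |P_n(f)|)$.

7.95 Theorem (Giné and Guillou, 2001). Suppose that there exist $A$ and $d$ such that
\[ \sup_P N(\mathcal{F}, L_2(P), \epsilon a) \le (A/\epsilon)^d \]
where $a = \|F\|_{L_2(P)}$ and $F(x) = \sup_{f\in\mathcal{F}} |f(x)|$. Then for some $C>0$,
\[ E\left(\sup_{f\in\mathcal{F}} |P_n(f)|\right) \le C\left(dU\log\frac{AU}{\sigma} + \sqrt{d}\,\sqrt{n}\,\sigma\sqrt{\log\frac{AU}{\sigma}}\right). \]
Here $\sigma$ is any number with $\sigma^2 \ge \sup_{f\in\mathcal{F}} \mathrm{Var}(f(X_1))$ and $0 < \sigma \le U$, as in Giné and Guillou.

Combining these results gives Giné and Guillou's version of Talagrand's inequality:

7.96 Theorem. Let $v \ge E\left(\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n f^2(X_i)\right)$ and $U \ge \sup_{f\in\mathcal{F}} \|f\|_\infty$. There exists a universal constant $K$ such that
\[ P\left(\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > t\right) \le K \exp\left\{-\frac{t}{KU}\log\left(1 + \frac{tU}{K\bigl(\sqrt{n}\,\sigma + U\sqrt{\log(AU/\sigma)}\bigr)^2}\right)\right\} \qquad (7.97) \]
whenever
\[ t \ge C\left(U\log\frac{AU}{\sigma} + \sqrt{n}\,\sigma\sqrt{\log\frac{AU}{\sigma}}\right). \]

7.98 Example (Density Estimation). Giné and Guillou (2002) apply Talagrand's inequality to get bounds on density estimators. Let $X_1,\ldots,X_n \sim P$ where $X_i \in \mathbb{R}^d$ and suppose that $P$ has density $p$. The kernel density estimator of $p$ with bandwidth $h$ is
\[ \hat p_h(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^d} K\left(\frac{\|x - X_i\|}{h}\right). \]
Applying the results above to $\hat p_h(x)$ we see that (under very weak conditions on $K$) for all small $\epsilon$ and large $n$,
\[ P\left(\sup_{x\in\mathbb{R}^d} |\hat p_h(x) - p_h(x)| > \epsilon\right) \le c_1 e^{-c_2 n h^d \epsilon^2} \qquad (7.99) \]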

where $p_h(x) = E(\hat p_h(x))$ and $c_1, c_2$ are positive constants. This agrees with the earlier result, Theorem 7.90.

7.4.2 A Bound on Expected Values

Now we consider bounding the expected value of the maximum of an infinite set of random variables. Let $\{X_f : f \in \mathcal{F}\}$ be a collection of mean 0 random variables indexed by $f \in \mathcal{F}$ and let $d$ be a metric on $\mathcal{F}$. Let $N(\mathcal{F}, r)$ be the covering number of $\mathcal{F}$, that is, the smallest number of balls of radius $r$ required to cover $\mathcal{F}$. Say that $\{X_f : f \in \mathcal{F}\}$ is sub-Gaussian if, for every $t>0$ and every $f, g \in \mathcal{F}$,
\[ E\left(e^{t(X_f - X_g)}\right) \le e^{t^2 d^2(f,g)/2}. \]
We say that $\{X_f : f \in \mathcal{F}\}$ is sample continuous if, for every sequence $f_1, f_2, \ldots \in \mathcal{F}$ such that $d(f_i, f) \to 0$ for some $f \in \mathcal{F}$, we have that $X_{f_i} \to X_f$ a.s. The following theorem is from Cesa-Bianchi and Lugosi (2006) and is a variation of a theorem due to Dudley (1978).

7.100 Theorem. Suppose that $\{X_f : f \in \mathcal{F}\}$ is sub-Gaussian and sample continuous. Then
\[ E\left(\sup_{f\in\mathcal{F}} X_f\right) \le 12\int_0^{D/2} \sqrt{\log N(\mathcal{F}, \epsilon)}\,d\epsilon \qquad (7.101) \]
where $D = \sup_{f,g\in\mathcal{F}} d(f,g)$.

Proof. The proof uses Dudley's chaining technique. We follow the version in Theorem 8.3 of Cesa-Bianchi and Lugosi (2006). Let $\mathcal{F}_k$ be a minimal cover of $\mathcal{F}$ of radius $D2^{-k}$. Thus $|\mathcal{F}_k| = N(\mathcal{F}, D2^{-k})$. Let $f_0$ denote the unique element in $\mathcal{F}_0$. Each $X_f$ is a random variable and hence is a mapping from some sample space $S$ to the reals. Fix $s \in S$ and let $f^*$ be such that $\sup_{f\in\mathcal{F}} X_f(s) = X_{f^*}(s)$. (If an exact maximizer does not exist, we can choose an approximate maximizer, but we shall assume an exact maximizer.) Let $f_k \in \mathcal{F}_k$ minimize the distance to $f^*$. Hence,
\[ d(f_{k-1}, f_k) \le d(f^*, f_k) + d(f^*, f_{k-1}) \le 3D2^{-k}. \]
Now $\lim_{k\to\infty} f_k = f^*$ and by sample continuity
\[ \sup_{f} X_f(s) = X_{f^*}(s) = X_{f_0}(s) + \sum_{k=1}^{\infty}\bigl(X_{f_k}(s) - X_{f_{k-1}}(s)\bigr). \]
Recall that $E(X_{f_0}) = 0$. Therefore,
\[ E\left(\sup_{f} X_f\right) \le \sum_{k=1}^{\infty} E\left(\max_{f,g}\,(X_f - X_g)\right) \]

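where the max is over all $f \in \mathcal{F}_k$ and $g \in \mathcal{F}_{k-1}$ such that $d(f,g) \le 3D2^{-k}$. There are at most $N(\mathcal{F}, D2^{-k})^2$ such pairs. By Theorem 7.47,
\[ E\left(\max_{f,g}\,(X_f - X_g)\right) \le 3D2^{-k}\sqrt{2\log N(\mathcal{F}, D2^{-k})^2}. \]
By summing over $k$ we have
\[ E\left(\sup_f X_f\right) \le \sum_{k=1}^{\infty} 3D2^{-k}\sqrt{2\log N(\mathcal{F}, D2^{-k})^2} = 12\sum_{k=1}^{\infty} D2^{-(k+1)}\sqrt{\log N(\mathcal{F}, D2^{-k})} \le 12\int_0^{D/2}\sqrt{\log N(\mathcal{F}, \epsilon)}\,d\epsilon. \]
The last inequality holds because $N(\mathcal{F}, \epsilon)$ is nonincreasing in $\epsilon$ and $D2^{-(k+1)}$ is the length of the interval $[D2^{-(k+1)}, D2^{-k}]$.

7.102 Example. Let $Y_1,\ldots,Y_n$ be a sample from a continuous cdf $F$ on $[0,1]$ with bounded density. Let $X_s = \sqrt{n}\,(F_n(s) - F(s))$ where $F_n$ is the empirical distribution function. The collection $\{X_s : s \in [0,1]\}$ can be shown to be sub-Gaussian and sample continuous with respect to the Euclidean metric on $[0,1]$. The covering number is $N([0,1], r) = 1/r$. Hence,
\[ E\left(\sup_{0\le s\le 1} \sqrt{n}\,(F_n(s) - F(s))\right) \le 12\int_0^{1/2}\sqrt{\log(1/\epsilon)}\,d\epsilon \le C \]
for some $C>0$. Hence,
\[ E\left(\sup_{0\le s\le 1} (F_n(s) - F(s))\right) \le \frac{C}{\sqrt{n}}. \]

7.5 Summary

The most important results in this chapter are Hoeffding's inequality,
\[ P(|\bar X_n - \mu| > \epsilon) \le 2e^{-2n\epsilon^2/c}, \]
Bernstein's inequality,
\[ P(|\bar X_n - \mu| > \epsilon) \le 2\exp\left\{-\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3}\right\}, \]
the Vapnik-Chervonenkis bound,
\[ P\left(\sup_{f\in\mathcal{F}} |(P_n - P)f| > t\right) \le 4\,s(\mathcal{F}, 2n)\,e^{-nt^2/8}, \]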

and the Rademacher bound: with probability at least $1-\delta$,
\[ \sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}. \]
These, and similar results, provide the theoretical basis for many statistical machine learning methods. The literature contains many refinements and extensions of these results.

7.6 Bibliographic Remarks

Concentration of measure is a vast and still growing area. Some good references are Devroye, Györfi and Lugosi (1996), van der Vaart and Wellner (1996), Chapter 19 of van der Vaart (1998), Dubhashi and Panconesi (2009), and Ledoux (2005).

Exercises

7.1 Suppose that $X \ge 0$ and $E(X) < \infty$. Show that $E(X) = \int_0^\infty P(X \ge t)\,dt$.

7.2 Show that $h(u) \ge u^2/(2 + 2u/3)$ for $u \ge 0$ where $h(u) = (1+u)\log(1+u) - u$.

7.3 In the proof of McDiarmid's inequality, verify that $E(V_i \mid X_1,\ldots,X_{i-1}) = 0$.

7.4 Prove Lemma 7.37.

7.5 Prove equation (7.24).

7.6 Prove the results in Table 7.1.

7.7 Derive Hoeffding's inequality from McDiarmid's inequality.

7.8 Prove Lemma 7.70.

7.9 Consider Example 7.102. Show that $\{X_s : s \in [0,1]\}$ is sub-Gaussian. Show that $\int_0^{1/2}\sqrt{\log(1/\epsilon)}\,d\epsilon \le C$ for some $C>0$.

7.10 Prove Theorem 7.52.

7.11 Prove Theorem 7.84.

7.12 Prove the results in Example 7.92.
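As a numerical companion to Example 7.102 and Exercise 7.9, the entropy integral $\int_0^{1/2}\sqrt{\log(1/\epsilon)}\,d\epsilon$ can be evaluated directly. The sketch below is an illustration rather than a proof, and its function names are hypothetical. It assumes the substitution $u = \log(1/\epsilon)$ followed by integration by parts, which gives the value $\sqrt{a}\,e^{-a} + \tfrac{\sqrt{\pi}}{2}\,\mathrm{erfc}(\sqrt{a})$ with $a = \log 2$, and cross-checks that value with a midpoint rule.

```python
import math


def entropy_integral_closed_form():
    """Integral of sqrt(log(1/e)) over (0, 1/2], using u = log(1/e) and integration by parts."""
    a = math.log(2.0)
    return math.sqrt(a) * math.exp(-a) + 0.5 * math.sqrt(math.pi) * math.erfc(math.sqrt(a))


def entropy_integral_midpoint(m=200000):
    """Midpoint-rule approximation of the same integral, used only as a cross-check."""
    width = 0.5 / m
    total = sum(math.sqrt(math.log(1.0 / ((j + 0.5) * width))) for j in range(m))
    return total * width


if __name__ == "__main__":
    value = entropy_integral_closed_form()
    print("closed form:", value)
    print("midpoint rule:", entropy_integral_midpoint())
    print("12 * integral:", 12.0 * value)
```

The integrand is unbounded near $0$ but grows only like $\sqrt{\log(1/\epsilon)}$, which is why the integral is finite and the midpoint rule, which never evaluates at $0$, still converges.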