COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire    Lecture #6
Scribe: Aaron Schild    February 21, 2013

Last class, we discussed an analogue of Occam's Razor for infinite hypothesis spaces that, in conjunction with VC-dimension, reduced the problem of finding a good PAC-learning algorithm to the problem of computing the VC-dimension of a given hypothesis space. Recall that VC-dimension is defined using the notion of a shattered set, i.e., a subset $S$ of the domain such that $|\Pi_H(S)| = 2^{|S|}$. In this lecture, we compute the VC-dimension of several hypothesis spaces by computing the maximum size of a shattered set.

1 Example 1: Axis-aligned rectangles

Not all sets of four points are shattered. For example, the following arrangement is impossible:

Figure 1: An impossible assignment of $+$/$-$ to the data, as any rectangle that contains the outer three points (marked $+$) must also contain the $-$ point.

However, this is not sufficient to conclude that the VC-dimension is at most three. Note that the following set does shatter:

Figure 2: A set of four points that shatters, as there is an axis-aligned rectangle that contains any given subset of the points but contains no others.

Therefore, the VC-dimension is at least four. In fact, it is exactly four. Consider any set of five distinct points $\{v_1, v_2, v_3, v_4, v_5\} \subseteq \mathbb{R}^2$. Consider a rectangle that contains the points with maximum x-coordinate, minimum x-coordinate, maximum y-coordinate, and minimum y-coordinate. These points may not be distinct; however, there are at most four such points. Call this set of points $S \subseteq \{v_1, v_2, v_3, v_4, v_5\}$. Any axis-aligned rectangle that
contains $S$ must also contain all of the points $v_1, v_2, v_3, v_4$, and $v_5$. There is at least one $v_i$ that is not in $S$ but must still be in the rectangle. Therefore, the labeling that assigns $+$ to every point in $S$ and $-$ to $v_i$ cannot be consistent with any axis-aligned rectangle. This means that there is no shattered set of size 5, since all possible labelings of a shattered set must be realized by some concept.

By a similar argument, we can show that the VC-dimension of axis-aligned rectangles in $\mathbb{R}^n$ is $2n$. By generalizing the approach for proving that the VC-dimension of the positive half-interval learning problem is 1, one can show that the VC-dimension of $(n-1)$-dimensional hyperplanes in $\mathbb{R}^n$ that pass through the origin is $n$. These concepts are inequalities of the form $w \cdot x > 0$ for a fixed $w \in \mathbb{R}^n$ and variable $x \in \mathbb{R}^n$; a concept labels a point $+$ if it lies on one side of the hyperplane and $-$ otherwise.

2 Other remarks on VC-dimension

In the cases mentioned previously, note that the VC-dimension is similar to the number of parameters needed to specify any particular concept. In the case of axis-aligned rectangles, for example, they are equal, since a rectangle requires a left boundary, a right boundary, a top boundary, and a bottom boundary. Unfortunately, this similarity does not always hold, although it often does: there are hypothesis spaces with infinite VC-dimension whose concepts can be specified with one parameter (a classic example is the class of sine thresholds $x \mapsto \mathrm{sign}(\sin(\omega x))$, parameterized by $\omega$).

Note that if $H$ is finite, the VC-dimension is at most $\log_2 |H|$, as at least $2^r$ distinct hypotheses must exist to shatter a set of size $r$. For a hypothesis space with infinite VC-dimension, there is a shattered set of size $m$ for every $m > 0$. Therefore, $\Pi_H(m) = 2^m$, which we mentioned last class as an indication of a class that is hard to learn. In the next section, we will show that all classes with bounded VC-dimension $d$ have $\Pi_H(m) = O(m^d)$, completing the description of PAC-learnability by VC-dimension.

3 Sauer's Lemma

Recall that $\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$ if $0 \le k \le n$ and $\binom{n}{k} = 0$ if $k < 0$ or $k > n$. Here $k$ and $n$ are integers, and $n$ is nonnegative for our purposes.
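The convention just recalled, that $\binom{n}{k}$ vanishes whenever $k < 0$ or $k > n$, is what makes the index shifts in the proof below go through. As a sanity check, here is a minimal Python sketch of the convention (the helper name `binom` is ours; note that `math.comb` already returns 0 when $k > n$ but raises on negative $k$):

```python
from math import comb

def binom(n, k):
    """C(n, k) with the lecture's convention: 0 whenever k < 0 or k > n."""
    if k < 0 or k > n:
        return 0
    return comb(n, k)

# Pascal's identity C(n, k) = C(n-1, k) + C(n-1, k-1), the key step in the
# proof of Sauer's Lemma, then holds for every integer k, not just 1 <= k <= n-1:
for n in range(1, 8):
    for k in range(-2, n + 3):
        assert binom(n, k) == binom(n - 1, k) + binom(n - 1, k - 1)
```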
Note that $\binom{n}{k} = O(n^k)$ when $k$ is regarded as a positive constant. We will show the following lemma, which immediately implies the desired result:

Lemma 3.1 (Sauer's Lemma). Let $H$ be a hypothesis space with finite VC-dimension $d$. Then
$$\Pi_H(m) \le \sum_{i=0}^{d} \binom{m}{i} =: \Phi_d(m).$$

Proof. We will prove this by induction on $m + d$. There are two base cases:

Case 1 ($m = 0$). There is only one possible assignment of $+$ and $-$ to the empty set, i.e., $\Pi_H(0) = 1$ here. Note that $\Phi_d(0) = \binom{0}{0} + \binom{0}{1} + \cdots + \binom{0}{d} = 1$, as desired.
Case 2 ($d = 0$). Not even a single point can be shattered in this situation. Therefore, on any given point, all hypotheses have the same value, so there is only one possible hypothesis and $\Pi_H(m) = 1$. This agrees with $\Phi$, as $\Phi_0(m) = \binom{m}{0} = 1$.

Now, we will prove the induction step. For this, we will need Pascal's identity, which states that
$$\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$$
for all integers $n$ and $k$ with $n \ge 1$.

Consider a hypothesis space $H$ with VC-dimension $d$ and a set of $m$ examples $S := \{x_1, x_2, \ldots, x_m\}$. Let $T := \{x_1, x_2, \ldots, x_{m-1}\}$. Form two hypothesis spaces $H_1$ and $H_2$ on $T$ as follows (an example is in Figure 3). Let $H_1$ be the set of restrictions of hypotheses from $H$ to $T$, where $h|_T$ denotes the restriction of $h \in H$ to $T$, i.e., the function $h|_T : T \to \{+, -\}$ with $h|_T(x) = h(x)$ for all $x \in T$. A labeling $\rho$ of $T$ is added to $H_2$ if and only if there are two hypotheses $h_1, h_2 \in H$ that disagree on $x_m$ and satisfy $h_1|_T = h_2|_T = \rho$. Note that
$$\Pi_H(S) = \Pi_{H_1}(T) + \Pi_{H_2}(T).$$

What are the VC-dimensions of $H_1$ and $H_2$? First, note that the VC-dimension of $H_1$ is at most $d$, as any shattered set of size $d+1$ in $T$ would also be a subset of $S$ shattered by the elements of $H$, contradicting the fact that the VC-dimension of $H$ is $d$. Next, suppose that there is a set of size $d$ in $T$ that is shattered by $H_2$. Since every hypothesis in $H_2$ is the restriction of two hypotheses in $H$ that disagree on $x_m$, the point $x_m$ can be added to this shattered set to obtain a set of size $d+1$ shattered by $H$. This is a contradiction, so the VC-dimension of $H_2$ is at most $d-1$.

By the inductive hypothesis, $\Pi_{H_1}(m-1) \le \Phi_d(m-1)$, and similarly $\Pi_{H_2}(m-1) \le \Phi_{d-1}(m-1)$. Combining these two inequalities with Pascal's identity shows that
$$\Pi_H(m) \le \Phi_d(m-1) + \Phi_{d-1}(m-1) = \sum_{i=0}^{d} \binom{m-1}{i} + \sum_{i=0}^{d} \binom{m-1}{i-1} = \sum_{i=0}^{d} \binom{m}{i} = \Phi_d(m),$$
completing the inductive step.

Often, the polynomial $\Phi_d(m)$ is hard to work with. Instead, we often use the following result:

Lemma 3.2. $\Phi_d(m) \le (em/d)^d$ when $m \ge d \ge 1$.

Proof. $m \ge d \ge 1$ implies that $d/m \le 1$. Therefore, since $i \le d$ in the summand,
H:
x1  x2  x3  x4  x5
 0   1   1   0   0
 0   1   1   0   1
 0   1   1   1   0
 1   0   0   1   0
 1   0   0   1   1
 1   1   0   0   1

H1:
x1  x2  x3  x4
 0   1   1   0
 0   1   1   1
 1   0   0   1
 1   1   0   0

H2:
x1  x2  x3  x4
 0   1   1   0
 1   0   0   1

Figure 3: The construction of $H_1$ and $H_2$.

$$\left(\frac{d}{m}\right)^{d} \Phi_d(m) = \left(\frac{d}{m}\right)^{d} \sum_{i=0}^{d} \binom{m}{i} \le \sum_{i=0}^{d} \left(\frac{d}{m}\right)^{i} \binom{m}{i} \le \sum_{i=0}^{m} \left(\frac{d}{m}\right)^{i} \binom{m}{i} = \left(1 + \frac{d}{m}\right)^{m} \le e^{d}.$$
Multiplying both sides by $(m/d)^d$ gives the desired result.

Plugging this result into the examples bound proven last class shows that
$$\mathrm{err}(h) \le O\!\left(\frac{1}{m}\left(d \ln \frac{m}{d} + \ln \frac{1}{\delta}\right)\right).$$
We can also write this in terms of the number of examples required to learn:
$$m = O\!\left(\frac{1}{\epsilon}\left(\ln \frac{1}{\delta} + d \ln \frac{1}{\epsilon}\right)\right).$$
Note that the number of examples required to learn scales linearly with the VC-dimension.

4 Lower bounds on learning

The bound proven in the previous section shows that the VC-dimension of a hypothesis space yields an upper bound on the number of examples needed to learn. Lower bounds on the required number of examples also exist. If the VC-dimension of a hypothesis space is $d$, there is a shattered set of size $d$. Intuitively, any hypothesis learned from a subset of size at most $d-1$ cannot predict the value of the last element with probability better than $1/2$. This suggests that at least $\Omega(d)$ examples are required to learn. In future classes, we will prove the following:

Theorem 4.1. For all learning algorithms $A$, there is a concept $c \in C$ and a distribution $D$ such that if $A$ is given $m \le d/2$ examples labeled by $c$ and distributed according to $D$, then $\Pr[\mathrm{err}(h_A) > 1/8] \ge 1/8$.
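Taken together, the upper bound from the previous section and this lower bound pin down the sample complexity up to constants and logarithmic factors. Purely as an illustration of the linear scaling in $d$ (the $O(\cdot)$ hides unspecified constants, so the constant `c = 1` and the helper `sample_bound` below are our own placeholders, not values from the lecture):

```python
from math import ceil, log

def sample_bound(eps, delta, d, c=1.0):
    """Illustrative evaluation of m = O((1/eps) * (ln(1/delta) + d * ln(1/eps))).

    The constant c is a placeholder; the bound only specifies m up to O(.).
    """
    return ceil(c / eps * (log(1 / delta) + d * log(1 / eps)))

# Doubling the VC-dimension roughly doubles the required sample size:
m10 = sample_bound(eps=0.1, delta=0.05, d=10)
m20 = sample_bound(eps=0.1, delta=0.05, d=20)
assert m10 < m20 < 2 * m10  # the additive ln(1/delta) term keeps m20 just under 2*m10
```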
One can try to prove Theorem 4.1 as follows. Choose a uniform distribution $D$ on examples $\{z_1, \ldots, z_d\}$ and run $A$ on $m \le d/2$ examples; call this set of examples $S$. Label the elements of $S$ arbitrarily with $+$ and $-$. Suppose that $c \in C$ is selected to be consistent with all of the labels on $S$ and that $c(x) \ne h_A(x)$ for all $x \notin S$. Then $\mathrm{err}_D(h_A) \ge 1/2$, since $c$ agrees with $h_A$ on at most $(d/2)/d = 1/2$ of the probability mass of the domain, which would mean that there is no PAC-learning algorithm using $d/2$ examples. This proof is flawed, however, as $c$ needs to be chosen before the examples are drawn. We will discuss a correct proof in future classes.
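Though the argument above is flawed, the intuition behind Theorem 4.1 can be illustrated with a small simulation (ours, not the lecture's): fix the target labeling of $d$ points at random before the sample is drawn, so the flaw is avoided, and let a learner that memorizes its $m = d/2$ examples guess on the rest. Roughly half the probability mass is then unseen, and the learner is wrong on about half of it:

```python
import random

random.seed(1)
d, trials = 20, 2000
errs = []
for _ in range(trials):
    # Target concept: a random labeling of the d points, chosen before sampling.
    target = {z: random.choice([+1, -1]) for z in range(d)}
    # For simplicity the learner sees m = d/2 distinct points under the uniform D.
    seen = random.sample(range(d), d // 2)
    memo = {z: target[z] for z in seen}
    # It memorizes the sample and guesses +1 on every unseen point.
    predictions = {z: memo.get(z, +1) for z in range(d)}
    errs.append(sum(predictions[z] != target[z] for z in range(d)) / d)

avg = sum(errs) / trials
print(avg)  # near 1/4: half the mass unseen, about half of that guessed wrong
```

With only $m \le d/2$ examples the error stays bounded away from zero, matching the $\Omega(d)$ intuition; the correct proof, to come in future classes, chooses $c$ by averaging over random labelings in a similar spirit.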