A Generalization of Sauer's Lemma to Classes of Large-Margin Functions

Joel Ratsaby

University College London, Gower Street, London WC1E 6BT, United Kingdom
J.Ratsaby@cs.ucl.ac.uk, WWW home page: http://www.cs.ucl.ac.uk/staff/j.ratsaby/

Abstract. We generalize Sauer's Lemma to classes $H$ of binary-valued functions on $[n] = \{1, \ldots, n\}$ which have a margin of at least $N$ on every element in a sample $S \subseteq [n]$ of cardinality $l$, where the margin $\mu_h(x)$ of $h \in H$ on a point $x \in [n]$ is defined as the largest non-negative integer $a$ such that $h$ is constant on the interval $I_a(x) = [x-a, x+a] \subseteq [n]$.

1 Introduction

Estimation of the complexity of classes of binary-valued functions has been behind much of the recent development in the theory of learning. In a seminal paper Vapnik and Chervonenkis [1971] applied the law of large numbers uniformly over an infinite class $F$ of binary functions, i.e., indicator functions of sets $A$ in a general domain $X$, and showed that the complexity of the problem of learning pattern recognition from samples of $n$ randomly drawn examples can be characterized in terms of a combinatorial complexity of $F$. This complexity, known as the growth function of $F$ and denoted by $\phi_F(n)$, counts the maximal number of dichotomies, i.e., binary vectors corresponding to the restriction of functions $f \in F$ on a finite subset $S \subset X$ of cardinality $n$, where the maximum runs over all such $S$. The Vapnik-Chervonenkis dimension of $F$, denoted as $VC(F)$, plays a crucial role in controlling the rate of growth of $\phi_F(n)$ with respect to $n$. Such binary vectors may be viewed as binary-valued functions on a finite domain $[n] \equiv \{1, \ldots, n\}$ and hence form a finite class $H$ of the same VC-dimension as $F$. In this paper we consider classes of binary-valued functions on $[n]$ which satisfy a constraint of having a large margin on any one set $S \subseteq [n]$ of cardinality $l$. We obtain an estimate on the cardinality of such a class.
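To make the margin constraint concrete, the following is a minimal Python sketch of the margin as defined in the abstract, together with a brute-force enumeration of all binary-valued functions on $[n]$ whose margin is at least $N$ on every element of a sample $S$. It assumes a function on $[n]$ is stored as a tuple of 0/1 values, with position $x$ held at index $x-1$; the names `margin`, `sample_margin` and `large_margin_class` are illustrative and do not come from the paper.

```python
from itertools import product


def margin(h, x):
    """Largest a >= 0 such that h is constant on {x-a, ..., x+a} within [n].

    h is a tuple of 0/1 values for positions 1..n; x is a 1-based position.
    """
    n = len(h)
    a = 0
    # Try to widen the interval by one element on each side while it
    # stays inside [n] and both new endpoints agree with h(x).
    while x - a - 1 >= 1 and x + a + 1 <= n:
        if h[x - a - 2] != h[x - 1] or h[x + a] != h[x - 1]:
            break
        a += 1
    return a


def sample_margin(h, S):
    """Smallest margin of h over the elements of the sample S."""
    return min(margin(h, x) for x in S)


def large_margin_class(n, S, N):
    """All binary functions on [n] with margin at least N on every x in S."""
    return [h for h in product((0, 1), repeat=n) if sample_margin(h, S) >= N]


if __name__ == "__main__":
    h = (0, 0, 1, 1, 1, 1, 0)
    print([margin(h, x) for x in range(1, len(h) + 1)])
    print(len(large_margin_class(5, {3}, 1)))
```

The enumeration is exponential in $n$ and is meant only for checking small cases by hand; the paper's aim is to bound the size of such classes without enumerating them.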
Recently there has been interest in learning classes via maximizing the margin [see for instance Vapnik, 1998, Cristianini and Shawe-Taylor, 2000]. The usual approach analyzes the growth rate (or more precisely the covering number) of classes of real-valued functions with a large margin on some set (sample) $S$. To the best of the author's knowledge, the approach taken in the current paper, of estimating the complexity of classes of large-margin binary-valued functions by a generalization of Sauer's lemma, is novel. Before discussing this further we first introduce some needed notation.

Paper appeared in the Proc. of Third Colloq. on Math. and Comp. Sci. Alg., Trees, Combin. and Probab. (MathInfo 2004), Vienna, Austria, Sept. 2004. University College London, Computer Science Department, Technical Report RN/03/13.

2 Some notations, definitions and existing results

Let $I(E)$ denote the indicator function which equals 1 if the expression $E$ is true and 0 otherwise. Let $F$ be a class of functions $f : [n] \to \{0, 1\}$. For a set $A = \{a_1, \ldots, a_k\} \subseteq [n]$ denote by $f|_A = [f(a_1), \ldots, f(a_k)]$. $F$ is said to shatter $A$ if $|\{f|_A : f \in F\}| = 2^k$. The Vapnik-Chervonenkis dimension of $F$, denoted as $VC(F)$, is defined as the cardinality of the largest set shattered by $F$. Sauer [1972] obtained the following result:

Lemma 1. [Sauer, 1972] If the VC-dimension of $F$ is $d$ then

$$|F| \le \sum_{i=0}^{d} \binom{n}{i}.$$

We note that the bound is tight, as for all $d, n \ge 1$ there exist classes $F \subseteq 2^{[n]}$ of VC-dimension $d$ which achieve the equality.

Consider the following definition of functional margin (see Footnote 1), which naturally suits binary-valued functions.

Definition 1. The margin $\mu_f(x)$ of $f \in F$ on an element $x \in [n]$ is the largest non-negative integer $a$ such that $f$ has a constant value of either 0 or 1 on the interval set $I_a(x) = \{x-a, \ldots, x+a\}$, provided that $I_a(x) \subseteq [n]$. The sample-margin $\mu_S(f)$ of $f$ on a subset $S \subseteq [n]$ is defined as

$$\mu_S(f) \equiv \min_{x \in S} \mu_f(x).$$

More generally, this definition applies also to classes on other domains $X$ if there is a linear ordering on $X$.

Footnote 1. For other definitions of margin see for instance Cristianini and Shawe-Taylor [2000].

3 Motivation and aim of the paper

In recent years, in the search for better learning algorithms, it has been discovered [see for instance Vapnik, 1998, Cristianini and Shawe-Taylor, 2000] that learning classes $F$ of real-valued functions that are restricted to have a large margin on a training sample leads to a faster learning-error convergence (see Footnote 2). The crucial reason for such improved error bounds is the fast decrease of the covering number of $F$, which is analogous to the growth function in the case of classes of binary-valued functions, with respect to an increase in the sample-margin value. In this paper, we consider the latter case, restricting to a finite class $H(S)$ of binary-valued functions $h$ on $[n]$ with the constraint that for a fixed subset $S \subseteq [n]$ of size $l$, for all $h \in H(S)$, $\mu_S(h) \ge N$. By generalizing Sauer's result (Lemma 1), we obtain an estimate on the cardinality of $H(S)$. Being dependent on the margin parameter $N$, our estimate may be viewed as being analogous to existing results that bound the covering number of classes of finite pseudo-dimension (or $\mathrm{fat}_\gamma$ dimension) which consist of real-valued functions under a similar margin constraint [see for instance Anthony and Bartlett, 1999, Ch. 12].

Footnote 2. Hence samples on which the target function (the one to be learnt) has a large margin are of considerable worth. Ratsaby [2003] estimated the complexity of such samples as a function of the margin parameter and sample size.

4 Technical results

We start with an auxiliary lemma:

Lemma 2. For $N \ge 0$, $n \ge 0$, $0 \le m \le n$, let $w_{m,N}(n)$ be the number of standard (one-dimensional) ordered partitions of a nonnegative integer $n$ into $m$ parts each no larger than $N$. Then

$$w_{m,N}(n) = \begin{cases} I(n = 0) & \text{if } m = 0, \\ \displaystyle\sum_{i=0,\,N+1,\,2(N+1),\ldots}^{n} (-1)^{i/(N+1)} \binom{m}{i/(N+1)} \binom{n-i+m-1}{n-i} & \text{if } m \ge 1. \end{cases}$$

Remark 1. While our interest is in $[n] = \{1, \ldots, n\}$, we allow $w_{m,N}(n)$ to be defined on $n = 0$ for use by Lemma 3.

Proof: The generating function (g.f.) for $w_{m,N}(n)$ is

$$W(x) = \sum_{n \ge 0} w_{m,N}(n)\, x^n = \left(\frac{1 - x^{N+1}}{1 - x}\right)^{m}.$$

When $m = 0$ the only non-zero coefficient is that of $x^0$ and it equals 1, so $w_{0,N}(n) = I(n = 0)$. Let $T(x) = (1 - x^{N+1})^m$ and $S(x) = \left(\frac{1}{1-x}\right)^m$. Then

$$T(x) = \sum_{i=0}^{m} (-1)^i \binom{m}{i} x^{i(N+1)}$$

which generates the sequence $t_N(n) = \binom{m}{n/(N+1)} (-1)^{n/(N+1)}\, I(n \bmod (N+1) = 0)$. Similarly, for $m \ge 1$, it is easy to show that $S(x)$ generates $s(n) = \binom{n+m-1}{n}$. The product $W(x) = T(x)S(x)$ generates their convolution $t_N(n) \star s(n)$, namely,

$$w_{m,N}(n) = \sum_{i=0,\,N+1,\,2(N+1),\ldots}^{n} (-1)^{i/(N+1)} \binom{m}{i/(N+1)} \binom{n-i+m-1}{n-i}.$$

Remark 2. By an alternate proof one obtains, over $m \ge 1$, the slightly simpler form

$$w_{m,N}(n) = \sum_{k=0}^{m} (-1)^k \binom{m}{k} \binom{n+m-1-k(N+1)}{m-1}.$$

Before proceeding to the main theorem we have two additional lemmas.

Lemma 3. Let the integer $1 \le N \le n$ and consider the class $F$ consisting of all binary-valued functions $f$ on $[n]$ which take the value 1 on no more than $r \le n$ elements of $[n]$ and whose margin on any element $x \in [n]$ satisfies $\mu_f(x) \le N$. Then

$$|F| = \sum_{k=0}^{r} \sum_{m=1}^{n} c(k, n-k; m, N) \equiv \beta_r^{(N)}(n)$$

where

$$c(k, n-k; m, N) = \sum_{i,j=0}^{1} w_{m-i,2N}\big(k - m + 1 - i(2N+1)\big)\, w_{m-j,2N}\big(n - k - m + 1 - j(2N+1)\big). \tag{1}$$

Proof: Consider the integer pair $[k, n-k]$, where $n \ge 1$ and $0 \le k \le n$. A two-dimensional ordered $m$-partition of $[k, n-k]$ is an ordered partition into $m$ two-dimensional parts $[a_j, b_j]$, where $0 \le a_j, b_j \le n$ but not both are zero, and where $\sum_{j=1}^{m} [a_j, b_j] = [k, n-k]$. For instance, $[2, 1] = [0, 1] + [2, 0] = [1, 1] + [1, 0] = [2, 0] + [0, 1]$ are three partitions of $[2, 1]$ into two parts (for more examples see Andrews [1998]). Suppose we add the constraint that only $a_1$ or $b_m$ may be zero while all remaining $a_j, b_k \ge 1$, $2 \le j \le m$, $1 \le k \le m-1$. Denote any partition that satisfies this as valid. For instance, let $k = 2$, $m = 3$; then the valid $m$-partitions of $[k, n-k]$ are:

{[0, 1][1, 1][1, n-4]}, {[0, 1][1, 2][1, n-5]}, ..., {[0, 1][1, n-3][1, 0]},
{[0, 2][1, 1][1, n-5]}, {[0, 2][1, 2][1, n-6]}, ..., {[0, 2][1, n-4][1, 0]},
...,
{[0, n-3][1, 1][1, 0]}.

For $[k, n-k]$, let $P_{n,k}$ be the collection of all valid partitions of $[k, n-k]$. Let $F_k$ denote all binary functions on $[n]$ which take the value 1 over exactly $k$ elements of $[n]$. Define the mapping $\Pi : F_k \to P_{n,k}$ where for any $f \in F_k$ the partition $\Pi(f)$ is defined by the following procedure: Start from the first element of $[n]$, i.e., 1. If $f$ takes the value 1 on it then let $a_1$ be the length of the constant 1-segment, i.e., the set of all elements starting from 1 on which $f$ takes the constant value 1. Otherwise, if $f$ takes the value 0, let $a_1 = 0$. Then let $b_1$ be the length of the subsequent 0-segment on which $f$ takes the value 0. Let $[a_1, b_1]$ be the first part of $\Pi(f)$. Next, repeat the following: if there is at least one more element of $[n]$ which has not been included in the preceding segments, then let $a_j$ be the length of the next 1-segment and $b_j$ the length of the subsequent 0-segment. Let $[a_j, b_j]$, $j = 1, \ldots, m$, be the resulting sequence of parts, where $m$ is the total number of parts. Only the last part may have a zero value $b_m$, since the function may take the value 1 on the last element $n$ of $[n]$, while all other parts $[a_j, b_j]$, $2 \le j \le m-1$, must have $a_j, b_j \ge 1$. The result is a valid partition of $[k, n-k]$ into $m$ parts. Clearly, every $f \in F_k$ has a unique partition, and every valid partition of $[k, n-k]$ determines a unique $f \in F_k$. Therefore $\Pi$ is a bijection. Moreover, we may divide $P_{n,k}$ into mutually exclusive subsets $V_m$ consisting of all valid partitions of $[k, n-k]$ having exactly $m$ parts, where $1 \le m \le n$. Thus $|F_k| = \sum_{m=1}^{n} |V_m|$. Consider the following constraint on components of parts:

$$a_i, b_i \le 2N + 1, \quad 1 \le i \le m. \tag{2}$$

Denote by $V_{m,N} \subseteq P_{n,k}$ the collection of valid partitions of $[k, n-k]$ into $m$ parts each of which satisfies this constraint. Let $F_{k,N} = F \cap F_k$ consist of all functions satisfying the margin constraint in the statement of the lemma and having exactly $k$ ones. Note that $f$ having a margin no larger than $N$ on any element of $[n]$ implies there does not exist a segment $a_i$ or $b_i$ of length larger than $2N + 1$ on which $f$ takes a constant value. Hence the parts of $\Pi(f)$ satisfy (2). Hence, for any $f \in F_{k,N}$, its unique valid partition $\Pi(f)$ must be in $V_{m,N}$ for some $m$. We therefore have

$$|F_{k,N}| = \sum_{m=1}^{n} |V_{m,N}|. \tag{3}$$

By definition of $F$ it follows that

$$|F| = \sum_{k=0}^{r} |F_{k,N}|. \tag{4}$$

Let us denote by

$$c(k, n-k; m, N) \equiv |V_{m,N}| \tag{5}$$

the number of valid partitions of $[k, n-k]$ into exactly $m$ parts whose components satisfy (2). In order to determine $|F|$ it therefore suffices to determine $c(k, n-k; m, N)$.
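The procedure defining the mapping Π can be sketched in a few lines of Python. This is a minimal illustration, assuming a function on $[n]$ is given as a tuple of 0/1 values and a partition as a list of pairs `(a_j, b_j)`; the helper names (`to_partition`, `from_partition`, `is_valid`, `satisfies_bound`) and the parameter `max_len`, which stands for the bound on segment lengths in constraint (2), are illustrative and not taken from the paper.

```python
def to_partition(f):
    """Map a 0/1 tuple f to its parts [(a_1, b_1), ..., (a_m, b_m)].

    a_j is the length of the j-th run of ones, b_j the length of the
    run of zeros that follows it; a_1 may be 0 and b_m may be 0.
    """
    parts = []
    i, n = 0, len(f)
    while i < n:
        a = 0
        while i < n and f[i] == 1:
            a += 1
            i += 1
        b = 0
        while i < n and f[i] == 0:
            b += 1
            i += 1
        parts.append((a, b))
    return parts


def from_partition(parts):
    """Inverse map: rebuild the 0/1 tuple from its parts."""
    out = []
    for a, b in parts:
        out.extend([1] * a)
        out.extend([0] * b)
    return tuple(out)


def is_valid(parts):
    """Only a_1 or b_m may be zero, and no part may be [0, 0]."""
    m = len(parts)
    for j, (a, b) in enumerate(parts, start=1):
        if a == 0 and b == 0:
            return False
        if a == 0 and j != 1:
            return False
        if b == 0 and j != m:
            return False
    return True


def satisfies_bound(parts, max_len):
    """Check that every component a_j, b_j is at most max_len."""
    return all(a <= max_len and b <= max_len for a, b in parts)


if __name__ == "__main__":
    f = (0, 1, 1, 0, 0, 1)
    parts = to_partition(f)
    print(parts, is_valid(parts), satisfies_bound(parts, 2))
```

For the function `(0, 1, 1, 0, 0, 1)` the sketch yields three parts whose first components sum to $k = 3$ and whose second components sum to $n - k = 3$, and `from_partition` recovers the function, which is the round trip behind the bijection argument.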

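We next construct the generating function

$$G(t_1, t_2) = \sum_{\alpha_1 \ge 0} \sum_{\alpha_2 \ge 0} c(\alpha_1, \alpha_2; m, N)\, t_1^{\alpha_1} t_2^{\alpha_2}. \tag{6}$$

For $m \ge 1$,

$$G(t_1, t_2) = \big(t_1^0 + t_1^1 + \cdots + t_1^{2N+1}\big) \big(t_2^1 + \cdots + t_2^{2N+1}\big)^{I(m \ge 2)} \Big(\big(t_1^1 + \cdots + t_1^{2N+1}\big)\big(t_2^1 + \cdots + t_2^{2N+1}\big)\Big)^{(m-2)_+} \big(t_1^1 + \cdots + t_1^{2N+1}\big)^{I(m \ge 2)} \big(t_2^0 + t_2^1 + \cdots + t_2^{2N+1}\big) \tag{7}$$

where $(m-2)_+$ denotes $\max(m-2, 0)$ and where the values of the exponents of all terms in the first and second factors represent the possible values for $a_1$ and $b_1$, respectively. The values of the exponents in the middle $m-2$ factors are for the values of $a_j, b_j$, $2 \le j \le m-1$, and those in the factor before last and the last factor are for $a_m$ and $b_m$, respectively. Equating this to (6) implies that the coefficient of $t_1^{\alpha_1} t_2^{\alpha_2}$ equals $c(\alpha_1, \alpha_2; m, N)$, which we seek. The right side of (7) equals

$$t_1^{m-1} \left( \left(\frac{1 - t_1^{2N+1}}{1 - t_1}\right)^{m} + t_1^{2N+1} \left(\frac{1 - t_1^{2N+1}}{1 - t_1}\right)^{m-1} \right) t_2^{m-1} \left( \left(\frac{1 - t_2^{2N+1}}{1 - t_2}\right)^{m} + t_2^{2N+1} \left(\frac{1 - t_2^{2N+1}}{1 - t_2}\right)^{m-1} \right). \tag{8}$$

Let $W(x) = \left(\frac{1 - x^{2N+1}}{1 - x}\right)^{m-1}$ generate $w_{m-1,2N}(n)$, which is defined in Lemma 2. So (8) becomes

$$\sum_{\alpha_1, \alpha_2 \ge 0} \Big( w_{m,2N}(\alpha_1)\, w_{m,2N}(\alpha_2)\, t_1^{\alpha_1+m-1} t_2^{\alpha_2+m-1} + w_{m-1,2N}(\alpha_1)\, w_{m,2N}(\alpha_2)\, t_1^{\alpha_1+m+2N} t_2^{\alpha_2+m-1} + w_{m,2N}(\alpha_1)\, w_{m-1,2N}(\alpha_2)\, t_1^{\alpha_1+m-1} t_2^{\alpha_2+m+2N} + w_{m-1,2N}(\alpha_1)\, w_{m-1,2N}(\alpha_2)\, t_1^{\alpha_1+m+2N} t_2^{\alpha_2+m+2N} \Big). \tag{9}$$

Equating the coefficients of $t_1^{\alpha_1} t_2^{\alpha_2}$ in (6) and (9) yields

$$c(\alpha_1, \alpha_2; m, N) = \sum_{i,j=0}^{1} w_{m-i,2N}\big(\alpha_1 - m + 1 - i(2N+1)\big)\, w_{m-j,2N}\big(\alpha_2 - m + 1 - j(2N+1)\big).$$

Substituting $k$ for $\alpha_1$ and $n-k$ for $\alpha_2$, and combining (3), (4) and (5), yields the result.

The next lemma extends the result of Lemma 3 to classes $H$ of finite VC-dimension.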
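The two counting formulas above can be checked numerically on small cases. The following Python sketch is one way to do so, under stated assumptions: `w` evaluates the closed form of Lemma 2, `c` evaluates the coefficient formula with the bound in constraint (2) passed as a generic parameter `max_len` (so the second index of `w` is `max_len - 1`), and `w_brute` and `c_brute` count by direct enumeration. All four names, and the convention that `w` is zero for negative arguments, are illustrative choices rather than notation from the paper.

```python
from itertools import product
from math import comb


def w(m, N, n):
    """Closed form of Lemma 2: ordered partitions of n into m parts in {0..N}."""
    if m < 0 or n < 0:
        return 0  # assumed convention for out-of-range arguments
    if m == 0:
        return int(n == 0)
    return sum(
        (-1) ** j * comb(m, j) * comb(n - j * (N + 1) + m - 1, n - j * (N + 1))
        for j in range(n // (N + 1) + 1)
    )


def w_brute(m, N, n):
    """Same count by enumerating all m-tuples with entries in {0..N}."""
    return sum(1 for p in product(range(N + 1), repeat=m) if sum(p) == n)


def c(k, zeros, m, max_len):
    """Coefficient formula: valid m-part partitions of [k, zeros], parts <= max_len."""
    ones_side = sum(w(m - i, max_len - 1, k - m + 1 - i * max_len) for i in (0, 1))
    zeros_side = sum(w(m - j, max_len - 1, zeros - m + 1 - j * max_len) for j in (0, 1))
    return ones_side * zeros_side


def run_parts(f):
    """Run-length parts (a_j, b_j) of a 0/1 tuple: ones-run then zeros-run."""
    parts, i, n = [], 0, len(f)
    while i < n:
        a = 0
        while i < n and f[i] == 1:
            a, i = a + 1, i + 1
        b = 0
        while i < n and f[i] == 0:
            b, i = b + 1, i + 1
        parts.append((a, b))
    return parts


def c_brute(k, zeros, m, max_len):
    """Count 0/1 tuples with k ones whose m parts all respect max_len."""
    count = 0
    for f in product((0, 1), repeat=k + zeros):
        if sum(f) != k:
            continue
        parts = run_parts(f)
        if len(parts) == m and all(a <= max_len and b <= max_len for a, b in parts):
            count += 1
    return count


if __name__ == "__main__":
    print(w(2, 1, 2), w_brute(2, 1, 2))
    print(c(2, 2, 2, 2), c_brute(2, 2, 2, 2))
```

Passing the segment bound as `max_len` keeps the check independent of how the bound is expressed in terms of $N$; with the constraint as written in (2) one would call `c(k, n - k, m, 2 * N + 1)`.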

Lemma 4. Let $n \ge 1$ and $0 \le d \le n$. Let $H$ be a class of binary-valued functions $h$ on $[n]$ satisfying $\mu_h(x) \le N$ on any $x \in [n]$ and let $VC(H) \le d$. Then

$$|H| \le \beta_d^{(N)}(n)$$

where $\beta_d^{(N)}$ is defined in Lemma 3.

Proof: The proof builds on that of Lemma 7 in Haussler and Long [1995], which considered generalizations of the VC-dimension, and is done by double induction on $n$ and $d$. Start with the case $d = 0$; the bound reduces to $|H| \le 1$ since $\beta_0^{(N)}(n) \le 1$ when $n \ge 1$. The bound is correct since if $|H| > 1$ then there are two distinct functions $h, g$. Let $k \in [n]$ be an element on which they differ. Then the singleton $\{k\}$ is shattered by $H$, hence the VC-dimension of $H$ is at least 1, which contradicts the assumption that $d = 0$; hence $|H| \le 1$ and the lemma holds.

Next, suppose $d = n$. Consider the class $F$ in Lemma 3 with $r = n$. Such $F$ consists of all binary-valued functions $f$ on $[n]$ which satisfy the margin constraint $\mu_f(x) \le N$ on every $x \in [n]$. By Lemma 3, $|F| = \beta_n^{(N)}(n)$. Clearly by definition, $H \subseteq F$. Hence $|H| \le \beta_n^{(N)}(n)$ as claimed.

Next, suppose $0 < d < n$. Define $\pi : H \to \{0, 1\}^{n-1}$ by $\pi(h) = [h(1), \ldots, h(n-1)]$. Define $\alpha : \pi(H) \to \{0, 1\}$ by $\alpha(u_1, \ldots, u_{n-1}) = \min\{v : h \in H,\ h(i) = u_i,\ h(n) = v,\ 1 \le i \le n-1\}$. Define $A = \{h \in H : h(n) = \alpha(h(1), \ldots, h(n-1))\}$ and denote by $A^c = H \setminus A$. Consider any $h \in H$. If $\alpha(h(1), \ldots, h(n-1)) = 1$ then $A^c$ does not contain $h$. Otherwise, the only function agreeing with $h$ on $1, \ldots, n-1$ that $A^c$ can contain is the function $g$ with $g(n) = 1$. Hence for all $h \in A^c$, $h(n) = 1$. Make the inductive assumption that the claimed bound holds, for the VC-dimension bounds $d$ and $d-1$, for all classes on $[n-1]$ satisfying the margin constraint. Then we claim the following:

Claim 1. $|A| \le \beta_d^{(N)}(n-1)$.

This is proved next: the mapping $\pi$ is one-to-one on $A$ and the set $\pi(A)$ has VC-dimension no larger than $d$, since any subset of $[n]$ shattered by $\pi(A)$ is also shattered by $A$, which is in $H$, and $VC(H) \le d$. Hence by the induction hypothesis $|\pi(A)| \le \beta_d^{(N)}(n-1)$, and since $\pi$ is one-to-one, $|A| = |\pi(A)|$.

Next, under the same induction hypothesis, we have:

Claim 2. $|A^c| \le \beta_{d-1}^{(N)}(n-1)$.
We prove this next: First we show that $VC(A^c) \le d-1$. Let $E \subseteq [n]$ be shattered by $A^c$ and let $|E| = l$. Note that $n \notin E$ since, as noted earlier, $h(n) = 1$ for all $h \in A^c$. For any $b \in \{0, 1\}^{l+1}$ let $h \in A^c$ be such that $h|_E = [b_1, \ldots, b_l]$. If $b_{l+1} = 1$ then $h(n) = b_{l+1}$, since all functions in $A^c$ take the value 1 on $n$. If $b_{l+1} = 0$ then there exists a $g \in A$ which satisfies $g(i) = h(i)$, $1 \le i \le n-1$, and $g(n) = \alpha(h(1), \ldots, h(n-1))$, the latter being $g(n) = 0$. It follows that $E \cup \{n\}$ is shattered by $H$. But by assumption $VC(H) \le d$ and $n \notin E$, hence $|E| \le d-1$. Since $E$ was chosen arbitrarily, $VC(A^c) \le d-1$. The same argument as in the proof of Claim 1, applied to $A^c$ and using $d-1$ to bound its VC-dimension, obtains the statement of Claim 2.

From Claims 1 and 2, and recalling the definition of $c(k, n-k; m, N)$ from Lemma 3, it follows that

$$\begin{aligned} |H| &\le \beta_d^{(N)}(n-1) + \beta_{d-1}^{(N)}(n-1) \\ &= \sum_{k=0}^{d} \sum_{m=1}^{n-1} c(k, n-k-1; m, N) + \sum_{k=0}^{d-1} \sum_{m=1}^{n-1} c(k, n-k-1; m, N) \\ &= I(n/2 \le N) + \sum_{k=1}^{d} \sum_{m=1}^{n-1} c(k, n-k-1; m, N) + \sum_{k=1}^{d} \sum_{m=1}^{n-1} c(k-1, n-k; m, N) \\ &= I(n/2 \le N) + \sum_{k=1}^{d} \sum_{m=1}^{n-1} \Big( c(k, n-k-1; m, N) + c(k-1, n-k; m, N) \Big) \end{aligned} \tag{10}$$

where the indicator $I(n/2 \le N)$ enters here since in case $k = 0$ the only valid function $h$ is the constant-0 function on $[n]$, satisfying $\mu_h(x) \le n/2$ for all $x \in [n]$. We now have:

Claim 3.

$$\sum_{m=1}^{n} c(k, n-k; m, N) = \sum_{m=1}^{n-1} \Big( c(k, n-k-1; m, N) + c(k-1, n-k; m, N) \Big). \tag{11}$$

Note that this is a recurrence formula for the number (left hand side of (11)) of valid partitions of $[k, n-k]$ (excluding the case $k = 0$) into parts that satisfy (2). We prove the claim next: given any such partition $\pi_n$ there is exactly one of four possible ways that it can be constructed by adding a part to a valid two-dimensional partition $\pi'$ of $[n-1]$ while still satisfying (2). The first two amount to starting from a partition $\pi'$ of $[k-1, n-k]$ and: (i) adding the part $[1, 0]$ algebraically to any existing part in $\pi'$, e.g., $[x, y] + [1, 0] = [x+1, y]$, to obtain a $\pi_n$ with $[x+1, y]$ as one of the parts (provided that (2) is still satisfied), which yields a total number of parts no larger than $n-1$, or (ii) adding $[1, 0]$ to $\pi'$ as a new last part to obtain a $\pi_n$ (provided it is still valid) with no more than $n$ parts.
The remaining two ways amount to starting from a partition $\pi'$ of $[k, n-k-1]$ and acting as before, except now adding the part $[0, 1]$ instead of $[1, 0]$, either algebraically or as a new first part. There are $\sum_{m} c(k, n-k-1; m, N)$ valid partitions of $[k, n-k-1]$ and there are $\sum_{m} c(k-1, n-k; m, N)$ valid partitions of $[k-1, n-k]$, all satisfying (2). Applying the aforementioned construction to each one of these partitions yields all valid partitions of $[k, n-k]$ that satisfy (2).

Continuing, the right hand side of (10) becomes

$$I(n/2 \le N) + \sum_{k=1}^{d} \sum_{m=1}^{n} c(k, n-k; m, N) = \sum_{k=0}^{d} \sum_{m=1}^{n} c(k, n-k; m, N)$$

which is precisely $\beta_d^{(N)}(n)$. This completes the induction.

Theorem 1. Let $n \ge 1$, $1 \le l \le n$ and $0 < d \le n - l$. Let $H$ be a class of binary-valued functions $h$ on $[n]$ having $VC(H) \le d$. Let $S \subseteq [n]$ be a sample of cardinality $l$ and consider the subclass $H(S) \subseteq H$ which consists of all functions $h \in H$ with a margin $\mu_h(x) > N$ iff $x \in S$. Then

$$|H(S)| \le \beta_d^{(N)}\big(n - l - 2(N+1)\big).$$

Proof: The condition $\mu_h(x) > N$ implies that only two types of functions $h$ are allowed: those which take either a constant-0 value or a constant-1 value over all elements in the interval $I_{N+1}(x)$. The condition $\mu_h(x) \le N$ implies that any function is possible except one that takes a constant-0 or a constant-1 value over $I_{N+1}(x)$ (see Definition 1). Hence clearly the first condition is significantly more restrictive. Since we seek an upper bound on $|H(S)|$, we consider among all subsets of $[n]$ of cardinality $l$ a set $S$ with the least restrictive constraint, namely, one causing as few elements $x \in [n]$ as possible (except those in $S$) to have $\mu_h(x) > N$. This is achieved by a maximally-packed set $S \subset [n]$ of $l$ elements, for instance $S = \{N+2, \ldots, N+l+1\}$. It yields a minimal-size region $R = \{1, \ldots, 2(N+1)+l\}$ on which every candidate $h$ must take either a constant-0 or constant-1 value, i.e., have a margin larger than $N$ for every $x \in S$. This leaves a maximal-size region $[n] \setminus R$ on which the less stringent constraint of having a margin no larger than $N$ must hold. By Lemma 4, there are no more than $\beta_d^{(N)}(n - l - 2(N+1))$ functions in $H$ that satisfy the latter. Hence for any $S \subseteq [n]$ of cardinality $|S| = l$, $|H(S)| \le \beta_d^{(N)}(n - l - 2(N+1))$.
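The packing step in this proof is simple arithmetic that can be illustrated directly. The sketch below is a minimal Python example, assuming the region forced to be constant by a sample $S$ is the union of the intervals $I_{N+1}(x)$ over $x \in S$; the function name `forced_region` and the example values of $N$, $l$ and $n$ are illustrative and not from the paper.

```python
def forced_region(S, N, n):
    """Union of the intervals I_{N+1}(x) = {x-(N+1), ..., x+(N+1)} for x in S.

    A margin larger than N at x requires h to be constant on I_{N+1}(x),
    and the interval must lie inside [n] = {1, ..., n}.
    """
    region = set()
    for x in S:
        lo, hi = x - (N + 1), x + (N + 1)
        if lo < 1 or hi > n:
            raise ValueError(f"I_(N+1)({x}) does not fit inside [1, {n}]")
        region.update(range(lo, hi + 1))
    return region


if __name__ == "__main__":
    N, l, n = 1, 3, 13
    packed = set(range(N + 2, N + l + 2))   # S = {N+2, ..., N+l+1}
    spread = {3, 7, 11}                     # same cardinality, spaced apart
    for S in (packed, spread):
        R = forced_region(S, N, n)
        print(sorted(S), "forces", len(R), "elements; free:", n - len(R))
```

With these example values the packed sample forces $2(N+1) + l = 7$ elements and leaves $n - l - 2(N+1) = 6$ free, whereas the spread-out sample of the same size forces all 13 elements, which is the sense in which the packed sample is the least restrictive.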

5 Conclusions

The main result of the paper is a bound on the cardinality of a class of finite VC-dimension consisting of binary functions on $[n]$ which have a margin greater than $N$ on a set $S$ of cardinality $l$. This result generalizes the well-known Sauer's Lemma and is analogous to existing bounds on the covering number of classes of real-valued functions that have a large margin on a sample $S$. The result may be used for obtaining the sample complexity of PAC-learning a class of boolean hypotheses while maximizing the margin on a given training sample.

Bibliography

G. E. Andrews. The Theory of Partitions. Cambridge University Press, 1998.

M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, 2000.

D. Haussler and P. M. Long. A generalization of Sauer's lemma. Journal of Combinatorial Theory (A), 71(2):219-240, 1995.

J. Ratsaby. On the complexity of good samples for learning. Technical Report RN/03/1, Department of Computer Science, University College London, September 2003.

N. Sauer. On the density of families of sets. J. Combinatorial Theory (A), 13:145-147, 1972.

V. Vapnik. Statistical Learning Theory. Wiley, 1998.

V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264-280, 1971.