A Generalization of Sauer's Lemma to Classes of Large-Margin Functions




Joel Ratsaby
University College London, Gower Street, London WC1E 6BT, United Kingdom
J.Ratsaby@cs.ucl.ac.uk, WWW home page: http://www.cs.ucl.ac.uk/staff/j.ratsaby/

Abstract. We generalize Sauer's Lemma to classes $H$ of binary-valued functions on $[n] = \{1, \ldots, n\}$ which have a margin of at least $N$ on every element in a sample $S \subseteq [n]$ of cardinality $l$, where the margin $\mu_h(x)$ of $h \in H$ on a point $x \in [n]$ is defined as the largest non-negative integer $a$ such that $h$ is constant on the interval $I_a(x) = [x - a, x + a] \subseteq [n]$.

1 Introduction

Estimation of the complexity of classes of binary-valued functions has been behind much of the recent development in the theory of learning. In a seminal paper Vapnik and Chervonenkis [1971] applied the law of large numbers uniformly over an infinite class $F$ of binary functions, i.e., indicator functions of sets $A$ in a general domain $X$, and showed that the complexity of the problem of learning pattern recognition from samples of $n$ randomly drawn examples can be characterized in terms of a combinatorial complexity of $F$. This complexity, known as the growth function of $F$ and denoted by $\phi_F(n)$, counts the maximal number of dichotomies, i.e., binary vectors corresponding to the restriction of functions $f \in F$ on a finite subset $S \subset X$ of cardinality $n$, where the maximum runs over all such $S$. The Vapnik-Chervonenkis dimension of $F$, denoted as $VC(F)$, plays a crucial role in controlling the rate of growth of $\phi_F(n)$ with respect to $n$. Such binary vectors may be viewed as binary-valued functions on a finite domain $[n] \equiv \{1, \ldots, n\}$ and hence form a finite class $H$ of the same VC-dimension as $F$. In this paper we consider classes of binary-valued functions on $[n]$ which satisfy a constraint of having a large margin on any one set $S \subseteq [n]$ of cardinality $l$. We obtain an estimate on the cardinality of such a class.
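To make the margin from the abstract concrete, here is a minimal Python sketch. The representation of a function on $[n]$ as a tuple of 0/1 values and the names `margin` and `sample_margin` are assumptions made for this illustration only; they do not come from the paper.

```python
def margin(h, x):
    """Margin of h at the point x of [n] = {1, ..., n} (x is 1-indexed).

    h is a tuple of 0/1 values, h[x - 1] being the value at x (an assumed
    encoding). Returns the largest a >= 0 such that h is constant on
    {x - a, ..., x + a} and that interval stays inside [n].
    """
    n = len(h)
    a = 0
    # Grow the interval one step at a time; it stays constant exactly when
    # both new endpoints carry the same value as the centre.
    while (x - (a + 1) >= 1 and x + (a + 1) <= n
           and h[x - (a + 1) - 1] == h[x - 1] == h[x + (a + 1) - 1]):
        a += 1
    return a


def sample_margin(h, S):
    """Smallest margin of h over the points of the sample S."""
    return min(margin(h, x) for x in S)


h = (0, 0, 0, 1, 1, 1, 1, 1, 0)
print(margin(h, 2), margin(h, 6), margin(h, 4))  # margins at x = 2, 6, 4
print(sample_margin(h, [2, 6]))
```

At $x = 2$ the interval cannot grow past the left end of $[n]$, which is why the boundary check is part of the loop condition.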
Paper appeared in the Proc. of Third Colloq. on Math. and Comp. Sci.: Alg., Trees, Combin. and Probab. (MathInfo 2004), Vienna, Austria, Sept. 2004. University College London, Computer Science Department, Technical Report RN/03/13.

Recently there has been interest in learning classes via maximizing the margin [see for instance Vapnik, 1998, Cristianini and Shawe-Taylor, 2000]. The usual approach analyzes the growth rate (or more precisely the covering number) of

classes of real-valued functions with a large margin on some set (sample) $S$. To the best of the author's knowledge, the approach taken in the current paper of estimating the complexity of classes of large-margin binary-valued functions by a generalization of Sauer's lemma is novel. Before discussing this further we first introduce some needed notation.

2 Some notations, definitions and existing results

Let $I(E)$ denote the indicator function which equals 1 if the expression $E$ is true and 0 otherwise. Let $F$ be a class of functions $f : [n] \to \{0, 1\}$. For a set $A = \{a_1, \ldots, a_k\} \subseteq [n]$ denote by $f_{|A} = [f(a_1), \ldots, f(a_k)]$. $F$ is said to shatter $A$ if $|\{f_{|A} : f \in F\}| = 2^k$. The Vapnik-Chervonenkis dimension of $F$, denoted as $VC(F)$, is defined as the cardinality of the largest set shattered by $F$. Sauer [1972] obtained the following result:

Lemma 1. [Sauer, 1972] If the VC-dimension of $F$ is $d$ then
$$|F| \le \sum_{i=0}^{d} \binom{n}{i}.$$

We note that the bound is tight as for all $d, n \ge 1$ there exist classes $F \subseteq 2^{[n]}$ of VC-dimension $d$ which achieve the equality. Consider the following definition of functional margin^1 which naturally suits binary-valued functions.

Definition 1. The margin $\mu_f(x)$ of $f \in F$ on an element $x \in [n]$ is the largest non-negative integer $a$ such that $f$ has a constant value of either 0 or 1 on the interval set $I_a(x) = \{x - a, \ldots, x + a\}$ provided that $I_a(x) \subseteq [n]$. The sample-margin $\mu_S(f)$ of $f$ on a subset $S \subseteq [n]$ is defined as
$$\mu_S(f) \equiv \min_{x \in S} \mu_f(x).$$

More generally, this definition applies also to classes on other domains $X$ if there is a linear ordering on $X$.

^1 For other definitions of margin see for instance Cristianini and Shawe-Taylor [2000].

3 Motivation and aim of the paper

In recent years, in search for better learning algorithms, it has been discovered [see for instance Vapnik, 1998, Cristianini and Shawe-Taylor, 2000] that learning

classes $F$ of real-valued functions that are restricted to have a large margin on a training sample leads to a faster learning-error convergence.^2 The crucial reason for such improved error bounds is the fast decrease of the covering number of $F$, which is analogous to the growth number in the case of classes of binary-valued functions, with respect to an increase in the sample-margin value. In this paper, we consider the latter case, restricting to a finite class $H(S)$ of binary-valued functions $h$ on $[n]$ with the constraint that for a fixed subset $S \subseteq [n]$ of size $l$, for all $h \in H(S)$, $\mu_S(h) \ge N$. By generalizing Sauer's result (Lemma 1), we obtain an estimate on the cardinality of $H(S)$. Being dependent on the margin parameter $N$, our estimate may be viewed as being analogous to existing results that bound the covering number of classes of finite pseudo-dimension (or fat$_\gamma$ dimension) which consist of real-valued functions under a similar margin constraint [see for instance Anthony and Bartlett, 1999, Ch. 12].

^2 Hence samples on which the target function (the one to be learnt) has a large margin are of considerable worth. Ratsaby [2003] estimated the complexity of such samples as a function of the margin parameter and sample size.

4 Technical results

We start with an auxiliary lemma:

Lemma 2. For $N \ge 0$, $n \ge 0$, $0 \le m \le n$, let $w_{m,N}(n)$ be the number of standard (one-dimensional) ordered partitions of a nonnegative integer $n$ into $m$ parts each no larger than $N$. Then
$$w_{m,N}(n) = \begin{cases} I(n = 0) & \text{if } m = 0, \\ \displaystyle\sum_{i = 0,\, N+1,\, 2(N+1), \ldots}^{n} (-1)^{i/(N+1)} \binom{m}{i/(N+1)} \binom{n - i + m - 1}{n - i} & \text{if } m \ge 1. \end{cases}$$

Remark 1. While our interest is in $[n] = \{1, \ldots, n\}$, we allow $w_{m,N}(n)$ to be defined on $n = 0$ for use by Lemma 3.

Proof: The generating function (g.f.) for $w_{m,N}(n)$ is
$$W(x) = \sum_{n \ge 0} w_{m,N}(n)\, x^n = \left( \frac{1 - x^{N+1}}{1 - x} \right)^{m}.$$
When $m = 0$ the only non-zero coefficient is that of $x^0$ and it equals 1, so $w_{0,N}(n) = I(n = 0)$. Let $T(x) = (1 - x^{N+1})^m$ and $S(x) = \left( \frac{1}{1 - x} \right)^m$. Then
$$T(x) = \sum_{i=0}^{m} (-1)^i \binom{m}{i} x^{i(N+1)}$$

which generates the sequence $t_N(n) = \binom{m}{n/(N+1)} (-1)^{n/(N+1)} I(n \bmod (N+1) = 0)$. Similarly, for $m \ge 1$, it is easy to show that $S(x)$ generates $s(n) = \binom{n + m - 1}{n}$. The product $W(x) = T(x) S(x)$ generates their convolution $t_N(n) * s(n)$, namely,
$$w_{m,N}(n) = \sum_{i = 0,\, N+1,\, 2(N+1), \ldots}^{n} (-1)^{i/(N+1)} \binom{m}{i/(N+1)} \binom{n - i + m - 1}{n - i}.$$

Remark 2. By an alternate proof one obtains a slightly simpler form of
$$w_{m,N}(n) = \sum_{k=0}^{m} (-1)^k \binom{m}{k} \binom{n + m - 1 - k(N+1)}{m - 1},$$
over $m \ge 1$.

Before proceeding to the main theorem we have two additional lemmas.

Lemma 3. Let the integer $1 \le N \le n$ and consider the class $F$ consisting of all binary-valued functions $f$ on $[n]$ which take the value 1 on no more than $r \le n$ elements of $[n]$ and whose margin on any element $x \in [n]$ satisfies $\mu_f(x) \le N$. Then
$$|F| = \sum_{k=0}^{r} \sum_{m=1}^{n} c(k, n - k; m, N) \equiv \beta_r^{(N)}(n) \qquad (1)$$
where
$$c(k, n - k; m, N) = \sum_{i,j=0}^{1} w_{m-i,N}\big(k - m + 1 - i(N+1)\big)\, w_{m-j,N}\big(n - k - m + 1 - j(N+1)\big).$$

Proof: Consider the integer pair $[k, n - k]$, where $n \ge 1$ and $0 \le k \le n$. A two-dimensional ordered $m$-partition of $[k, n - k]$ is an ordered partition into $m$ two-dimensional parts $[a_j, b_j]$, where $0 \le a_j, b_j \le n$ but not both are zero and where $\sum_{j=1}^{m} [a_j, b_j] = [k, n - k]$. For instance, $[2, 1] = [0, 1] + [2, 0] = [1, 1] + [1, 0] = [2, 0] + [0, 1]$ are three partitions of $[2, 1]$ into two parts (for more examples see Andrews [1998]).

Suppose we add the constraint that only $a_1$ or $b_m$ may be zero while all remaining $a_j, b_k \ge 1$, $2 \le j \le m$, $1 \le k \le m - 1$. Denote any partition that satisfies this as valid. For instance, let $k = 2$, $m = 3$; then the valid $m$-partitions of $[k, n - k]$ are: $\{[0, 1][1, 1][1, n - 4]\}$, $\{[0, 1][1, 2][1, n - 5]\}$, ..., $\{[0, 1][1, n - 3][1, 0]\}$, $\{[0, 2][1, 1][1, n - 5]\}$, $\{[0, 2][1, 2][1, n - 6]\}$, ..., $\{[0, 2][1, n - 4][1, 0]\}$, ..., $\{[0, n - 3][1, 1][1, 0]\}$.

For $[k, n - k]$, let $P_{n,k}$ be the collection of all valid partitions of $[k, n - k]$. Let $F_k$ denote all binary functions on $[n]$ which take the value 1 over exactly $k$ elements of $[n]$. Define the mapping $\Pi : F_k \to P_{n,k}$ where for any $f \in F_k$

the partition $\Pi(f)$ is defined by the following procedure: Start from the first element of $[n]$, i.e., 1. If $f$ takes the value 1 on it then let $a_1$ be the length of the constant 1-segment, i.e., the set of all elements starting from 1 on which $f$ takes the constant value 1. Otherwise, if $f$ takes the value 0, let $a_1 = 0$. Then let $b_1$ be the length of the subsequent 0-segment on which $f$ takes the value 0. Let $[a_1, b_1]$ be the first part of $\Pi(f)$. Next, repeat the following: if there is at least one more element of $[n]$ which has not been included in the preceding segment, then let $a_j$ be the length of the next 1-segment and $b_j$ the length of the subsequent 0-segment. Let $[a_j, b_j]$, $j = 1, \ldots, m$, be the resulting sequence of parts where $m$ is the total number of parts. Only the last part may have a zero value $b_m$, since the function may take the value 1 on the last element $n$ of $[n]$, while all other parts $[a_j, b_j]$, $2 \le j \le m - 1$, must have $a_j, b_j \ge 1$. The result is a valid partition of $[k, n - k]$ into $m$ parts. Clearly, every $f \in F_k$ has a unique partition. Therefore $\Pi$ is a bijection. Moreover, we may divide $P_{n,k}$ into mutually exclusive subsets $V_m$ consisting of all valid partitions of $[k, n - k]$ having exactly $m$ parts, where $1 \le m \le n$. Thus $|F_k| = \sum_{m=1}^{n} |V_m|$.

Consider the following constraint on components of parts:
$$a_i, b_i \le N + 1, \quad 1 \le i \le m. \qquad (2)$$
Denote by $V_{m,N} \subset P_{n,k}$ the collection of valid partitions of $[k, n - k]$ into $m$ parts each of which satisfies this constraint. Let $F_{k,N} = F \cap F_k$ consist of all functions satisfying the margin constraint in the statement of the lemma and having exactly $k$ ones. Note that $f$ having a margin no larger than $N$ on any element of $[n]$ implies there does not exist a segment $a_i$ or $b_i$ of length larger than $N + 1$ on which $f$ takes a constant value. Hence the parts of $\Pi(f)$ satisfy (2). Hence, for any $f \in F_{k,N}$, its unique valid partition $\Pi(f)$ must be in $V_{m,N}$. We therefore have
$$|F_{k,N}| = \sum_{m=1}^{n} |V_{m,N}|. \qquad (3)$$
By definition of $F$ it follows that
$$|F| = \sum_{k=0}^{r} |F_{k,N}|. \qquad (4)$$
Let us denote by
$$c(k, n - k; m, N) \equiv |V_{m,N}| \qquad (5)$$
the number of valid partitions of $[k, n - k]$ into exactly $m$ parts whose components satisfy (2). In order to determine $|F|$ it therefore suffices to determine $c(k, n - k; m, N)$.
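The procedure that defines $\Pi$ amounts to a run-length encoding with a padding convention at both ends. The following is a minimal sketch under the assumption that a function on $[n]$ is stored as a tuple of 0/1 values; the names `partition_of`, `function_of` and `satisfies_constraint` are hypothetical and chosen only for this illustration.

```python
from itertools import groupby, product


def partition_of(f):
    """Return the parts [(a_1, b_1), ..., (a_m, b_m)] of the binary tuple f.

    a_j is the length of the j-th 1-segment and b_j the length of the
    0-segment that follows it; a_1 = 0 if f starts with a 0 and b_m = 0
    if f ends with a 1.
    """
    runs = [(value, len(list(group))) for value, group in groupby(f)]
    if runs and runs[0][0] == 0:
        runs.insert(0, (1, 0))      # empty leading 1-segment: a_1 = 0
    if len(runs) % 2 == 1:
        runs.append((0, 0))         # empty trailing 0-segment: b_m = 0
    return [(runs[i][1], runs[i + 1][1]) for i in range(0, len(runs), 2)]


def function_of(parts):
    """Inverse map: rebuild the binary tuple from its parts."""
    f = []
    for a, b in parts:
        f.extend([1] * a + [0] * b)
    return tuple(f)


def satisfies_constraint(parts, N):
    """Constraint (2): every component is at most N + 1."""
    return all(a <= N + 1 and b <= N + 1 for a, b in parts)


f = (0, 1, 1, 0, 0, 0, 1)
parts = partition_of(f)
print(parts)                           # parts of [k, n - k] = [3, 4]
print(function_of(parts) == f)         # the map can be inverted
print(satisfies_constraint(parts, 2))  # all segments have length <= 3
```

Summing the first components of the parts gives $k$ and summing the second components gives $n - k$, which is the sense in which the parts partition the pair $[k, n - k]$.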

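We next construct the generating function
$$G(t_1, t_2) = \sum_{\alpha_1 \ge 0} \sum_{\alpha_2 \ge 0} c(\alpha_1, \alpha_2; m, N)\, t_1^{\alpha_1} t_2^{\alpha_2}. \qquad (6)$$
For $m \ge 1$,
$$G(t_1, t_2) = (t_1^0 + t_1^1 + \cdots + t_1^{N+1}) \, (t_2^1 + t_2^2 + \cdots + t_2^{N+1})^{I(m \ge 2)} \, \big( (t_1^1 + \cdots + t_1^{N+1})(t_2^1 + \cdots + t_2^{N+1}) \big)^{(m-2)_+} \, (t_1^1 + \cdots + t_1^{N+1})^{I(m \ge 2)} \, (t_2^0 + t_2^1 + \cdots + t_2^{N+1}) \qquad (7)$$
where the values of the exponents of all terms in the first and second factors represent the possible values for $a_1$ and $b_1$, respectively. The values of the exponents in the middle $m - 2$ factors are for the values of $a_j, b_j$, $2 \le j \le m - 1$, and those in the factor before last and in the last factor are for $a_m$ and $b_m$, respectively. Equating this to (6) implies that the coefficient of $t_1^{\alpha_1} t_2^{\alpha_2}$ equals $c(\alpha_1, \alpha_2; m, N)$, which we seek. The right side of (7) equals
$$t_1^{m-1} t_2^{m-1} \left( \left( \frac{1 - t_1^{N+1}}{1 - t_1} \right)^{m} + t_1^{N+1} \left( \frac{1 - t_1^{N+1}}{1 - t_1} \right)^{m-1} \right) \left( \left( \frac{1 - t_2^{N+1}}{1 - t_2} \right)^{m} + t_2^{N+1} \left( \frac{1 - t_2^{N+1}}{1 - t_2} \right)^{m-1} \right). \qquad (8)$$
Let $W(x) = \left( \frac{1 - x^{N+1}}{1 - x} \right)^{m-1}$ generate $w_{m-1,N}(n)$, which is defined in Lemma 2. So (8) becomes
$$\sum_{\alpha_1, \alpha_2 \ge 0} \Big( w_{m,N}(\alpha_1) w_{m,N}(\alpha_2)\, t_1^{\alpha_1 + m - 1} t_2^{\alpha_2 + m - 1} + w_{m-1,N}(\alpha_1) w_{m,N}(\alpha_2)\, t_1^{\alpha_1 + m + N} t_2^{\alpha_2 + m - 1} + w_{m,N}(\alpha_1) w_{m-1,N}(\alpha_2)\, t_1^{\alpha_1 + m - 1} t_2^{\alpha_2 + m + N} + w_{m-1,N}(\alpha_1) w_{m-1,N}(\alpha_2)\, t_1^{\alpha_1 + m + N} t_2^{\alpha_2 + m + N} \Big). \qquad (9)$$
Equating the coefficients of $t_1^{\alpha_1} t_2^{\alpha_2}$ in (6) and (9) yields
$$c(\alpha_1, \alpha_2; m, N) = \sum_{i,j=0}^{1} w_{m-i,N}\big(\alpha_1 - m + 1 - i(N+1)\big)\, w_{m-j,N}\big(\alpha_2 - m + 1 - j(N+1)\big).$$
Substituting $k$ for $\alpha_1$ and $n - k$ for $\alpha_2$, and combining (3), (4) and (5), yields the result.

The next lemma extends the result of Lemma 3 to classes $H$ of finite VC-dimension.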
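As a sanity check on the closed forms of Lemma 2 and Lemma 3, the sketch below evaluates $w_{m,N}(n)$, $c(\alpha_1, \alpha_2; m, N)$ and $\beta_r^{(N)}(n)$ and compares the last against a direct enumeration. Two assumptions are made here and are not stated in the text: $w_{m,N}$ is taken to be zero for a negative argument, and the enumeration counts what the generating function (7) counts, namely binary sequences with at most $r$ ones whose constant segments all have length at most $N + 1$. All function names are hypothetical.

```python
from itertools import groupby, product
from math import comb


def w(m, N, n):
    """Ordered partitions of n into m parts, each in {0, ..., N} (Lemma 2)."""
    if m < 0 or n < 0:
        return 0                      # assumed convention for invalid arguments
    if m == 0:
        return int(n == 0)
    # The index i of Lemma 2 runs over multiples j * (N + 1) not exceeding n.
    return sum((-1) ** j * comb(m, j) * comb(n - j * (N + 1) + m - 1, m - 1)
               for j in range(n // (N + 1) + 1))


def c(a1, a2, m, N):
    """Valid m-part partitions of [a1, a2] whose components are <= N + 1."""
    return sum(w(m - i, N, a1 - m + 1 - i * (N + 1))
               * w(m - j, N, a2 - m + 1 - j * (N + 1))
               for i in (0, 1) for j in (0, 1))


def beta(r, N, n):
    """Double sum over k and m of c(k, n - k; m, N)."""
    return sum(c(k, n - k, m, N)
               for k in range(r + 1) for m in range(1, n + 1))


def brute_force(r, N, n):
    """Count sequences of length n >= 1 with <= r ones and runs <= N + 1."""
    total = 0
    for f in product((0, 1), repeat=n):
        longest = max(len(list(g)) for _, g in groupby(f))
        if sum(f) <= r and longest <= N + 1:
            total += 1
    return total


print(beta(4, 1, 4), brute_force(4, 1, 4))
```

The enumeration is exponential in $n$ and is only meant for small cases; the closed form needs a number of binomial evaluations polynomial in $n$ and $r$.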

Lemma 4. Let $n \ge 1$ and $0 \le d \le n$. Let $H$ be a class of binary-valued functions $h$ on $[n]$ satisfying $\mu_h(x) \le N$ on any $x \in [n]$ and let $VC(H) \le d$. Then
$$|H| \le \beta_d^{(N)}(n)$$
where $\beta_d^{(N)}$ is defined in Lemma 3.

Proof: The proof builds on that of Lemma 7 in Haussler and Long [1995], which considered generalizations of the VC-dimension, and is done by double induction on $n$ and $d$.

Start with the case $d = 0$; the bound reduces to $|H| \le 1$ since $\beta_0^{(N)}(n) \le 1$ when $n \ge 1$. The bound is correct since if $|H| > 1$ then there are two distinct functions $h, g$. Let $k \in [n]$ be an element on which they differ. Then the singleton $\{k\}$ is shattered by $H$, hence the VC-dimension of $H$ is at least 1, which contradicts the assumption that $d = 0$. Hence $|H| \le 1$ and the lemma holds.

Next, suppose $d = n$. Consider the class $F$ in Lemma 3 with $r = n$. Such $F$ consists of all binary-valued functions $f$ on $[n]$ which satisfy the margin constraint $\mu_f(x) \le N$ on every $x \in [n]$. By Lemma 3, $|F| \le \beta_n^{(N)}(n)$. Clearly by definition, $H \subseteq F$. Hence $|H| \le \beta_n^{(N)}(n)$ as claimed.

Next, suppose $0 < d < n$. Define $\pi : H \to \{0, 1\}^{n-1}$ by $\pi(h) = [h(1), \ldots, h(n-1)]$. Define $\alpha : \pi(H) \to \{0, 1\}$ by $\alpha(u_1, \ldots, u_{n-1}) = \min\{v : h \in H,\ h(i) = u_i,\ h(n) = v,\ 1 \le i \le n - 1\}$. Define $A = \{h \in H : h(n) = \alpha(h(1), \ldots, h(n-1))\}$ and denote $A^c = H \setminus A$. Consider any $h \in H$. If $\alpha(h(1), \ldots, h(n-1)) = 1$ then $A^c$ does not contain $h$. Otherwise, the only function that $A^c$ can contain among those agreeing with $h$ on $1, \ldots, n - 1$ is the function $g$ with $g(n) = 1$. Hence for all $h \in A^c$, $h(n) = 1$.

Make the inductive assumption that the claimed bound holds for all classes $H$ on any subset of $[n]$ having cardinality at most $n - 1$ and satisfying the margin constraint. Then we claim the following:

Claim 1. $|A| \le \beta_d^{(N)}(n - 1)$.

This is proved next: the mapping $\pi$ is one-to-one on $A$ and the set $\pi(A)$ has VC-dimension no larger than $d$, since any subset of $[n]$ shattered by $\pi(A)$ is also shattered by $A$, which is contained in $H$, and $VC(H) \le d$. Hence by the induction hypothesis $|\pi(A)| \le \beta_d^{(N)}(n - 1)$, and since $\pi$ is one-to-one, $|A| = |\pi(A)|$.

Next, under the same induction hypothesis, we have:

Claim 2. $|A^c| \le \beta_{d-1}^{(N)}(n - 1)$.
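The split of $H$ into $A$ and $A^c$ can be tried on a small class. The sketch below is only an illustration under assumed conventions: functions are tuples of 0/1 values, $H$ is a Python set, and `split` and `vc_dim` are hypothetical helper names; the VC-dimension is found by exhaustive search, which is feasible only for very small $n$. The margin constraint is left out, since the split itself does not use it.

```python
from itertools import combinations, product


def split(H):
    """Split H into (A, A^c) using the minimum value alpha on the last element."""
    alpha = {}
    for h in H:
        prefix = h[:-1]
        alpha[prefix] = min(alpha.get(prefix, 1), h[-1])
    A = {h for h in H if h[-1] == alpha[h[:-1]]}
    return A, H - A


def vc_dim(H, n):
    """Largest size of a set of coordinates shattered by H (exhaustive search)."""
    best = 0
    for size in range(1, n + 1):
        for E in combinations(range(n), size):
            if len({tuple(h[i] for i in E) for h in H}) == 2 ** size:
                best = size
                break
        else:
            break   # no set of this size is shattered, so no larger one is
    return best


# Example class: all functions on [4] with at most two ones.
H = {h for h in product((0, 1), repeat=4) if sum(h) <= 2}
A, Ac = split(H)
print(len(H), len(A), len(Ac))
print(vc_dim(H, 4), vc_dim(Ac, 4))
print(all(h[-1] == 1 for h in Ac))
```

In this example every function of $A^c$ takes the value 1 on the last element and the VC-dimension drops by one when passing from $H$ to $A^c$, which is the behaviour that Claim 2 relies on.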
We prove this next: First we show that $VC(A^c) \le d - 1$. Let $E \subseteq [n]$ be shattered by $A^c$ and let $|E| = l$. Note that $n \notin E$ since, as noted earlier, $h(n) = 1$ for all

$h \in A^c$. For any $b \in \{0, 1\}^{l+1}$ let $h \in A^c$ be such that $h_{|E} = [b_1, \ldots, b_l]$. If $b_{l+1} = 1$ then $h(n) = b_{l+1}$ since all functions in $A^c$ take the value 1 on $n$. If $b_{l+1} = 0$ then there exists a $g \in A$ which satisfies $g(i) = h(i)$, $1 \le i \le n - 1$, and $g(n) = \alpha(h(1), \ldots, h(n-1))$, the latter being $g(n) = 0$. It follows that $E \cup \{n\}$ is shattered by $H$. But by assumption $VC(H) \le d$ and $n \notin E$, hence $|E| \le d - 1$. Since $E$ was chosen arbitrarily, $VC(A^c) \le d - 1$. The same argument as in the proof of Claim 1, applied to $A^c$ using $d - 1$ to bound its VC-dimension, gives the statement of Claim 2.

From Claims 1 and 2, and recalling the definition of $c(k, n - k; m, N)$ from Lemma 3, it follows that
$$\begin{aligned} |H| &\le \beta_d^{(N)}(n - 1) + \beta_{d-1}^{(N)}(n - 1) \\ &= \sum_{k=0}^{d} \sum_{m} c(k, n - k - 1; m, N) + \sum_{k=0}^{d-1} \sum_{m} c(k, n - k - 1; m, N) \\ &= I(n/2 \le N) + \sum_{k=1}^{d} \sum_{m} c(k, n - k - 1; m, N) + \sum_{k=1}^{d} \sum_{m} c(k - 1, n - k; m, N) \\ &= I(n/2 \le N) + \sum_{k=1}^{d} \sum_{m} \Big( c(k, n - k - 1; m, N) + c(k - 1, n - k; m, N) \Big) \end{aligned} \qquad (10)$$
where the indicator $I(n/2 \le N)$ enters here since in case $k = 0$ the only valid function $h$ is the constant-0 function on $[n]$, satisfying $\mu_h(x) \le n/2$ for all $x \in [n]$. We now have:

Claim 3.
$$\sum_{m} c(k, n - k; m, N) = \sum_{m} \Big( c(k, n - k - 1; m, N) + c(k - 1, n - k; m, N) \Big). \qquad (11)$$

Note that this is a recurrence formula for the number (left hand side of (11)) of valid partitions of $[k, n - k]$ (excluding the case $k = 0$) into parts that satisfy (2). We prove the claim next: given any such partition $\pi_n$, there is exactly one of four possible ways that it can be constructed by adding a part to a valid two-dimensional partition $\pi_{n-1}$ of $[n - 1]$ while still satisfying (2). The first two amount to starting from a partition $\pi_{n-1}$ of $[k - 1, n - k]$ and: (i) adding the part $[1, 0]$ algebraically to any existing part in $\pi_{n-1}$, e.g., $[x, y] + [1, 0] = [x + 1, y]$, to obtain a $\pi_n$ with $[x + 1, y]$ as one of the parts (provided that (2) is still satisfied), which yields a total number of parts no larger than $n - 1$, or (ii) adding $[1, 0]$ to $\pi_{n-1}$ as a new last part to obtain a $\pi_n$ (provided it is still valid) with no more than $n$ parts. The remaining two ways amount to starting from a partition $\pi_{n-1}$ of $[k, n - k - 1]$ and acting as before except now adding the part $[0, 1]$ instead of

$[1, 0]$, either algebraically or as a new first part. There are $\sum_{m} c(k, n - k - 1; m, N)$ valid partitions of $[k, n - k - 1]$ and there are $\sum_{m} c(k - 1, n - k; m, N)$ valid partitions of $[k - 1, n - k]$, all satisfying (2). Doing the aforementioned construction to each one of these partitions yields all valid partitions of $[k, n - k]$ that satisfy (2). Continuing, the right hand side of (10) becomes
$$I(n/2 \le N) + \sum_{k=1}^{d} \sum_{m} c(k, n - k; m, N) = \sum_{k=0}^{d} \sum_{m} c(k, n - k; m, N)$$
which is precisely $\beta_d^{(N)}(n)$. This completes the induction.

Theorem 1. Let $n \ge 1$, $1 \le l \le n$ and $0 < d \le n - l$. Let $H$ be a class of binary-valued functions $h$ on $[n]$ having $VC(H) \le d$. Let $S \subseteq [n]$ be a sample of cardinality $l$ and consider the subclass $H(S) \subseteq H$ which consists of all functions $h \in H$ with a margin $\mu_h(x) > N$ iff $x \in S$. Then
$$|H(S)| \le \beta_d^{(N)}\big(n - l - 2(N + 1)\big).$$

Proof: The condition $\mu_h(x) > N$ implies that only two types of functions $h$ are allowed, those which take either a constant-0 value or a constant-1 value over all elements in the interval $I_{N+1}(x)$. The condition $\mu_h(x) \le N$ implies that any function is possible except one that takes a constant-0 or a constant-1 value over $I_{N+1}(x)$ (see Definition 1). Hence clearly the first condition is significantly more restrictive. Since we seek an upper bound on $|H(S)|$, we consider among all subsets of $[n]$ of cardinality $l$ a set $S$ with the least restrictive constraint, namely, one causing as few elements $x \in [n]$ as possible (except those in $S$) to have $\mu_h(x) > N$. This is achieved by a maximally-packed set $S \subseteq [n]$ of $l$ elements, for instance $S = \{N + 2, \ldots, N + l + 1\}$. It yields a minimal-size region $R = \{1, \ldots, 2(N + 1) + l\}$ on which every candidate $h$ must take either a constant-0 or constant-1 value, i.e., have a margin larger than $N$ for every $x \in S$. This leaves a maximal-size region $[n] \setminus R$ on which the less stringent constraint of having a margin no larger than $N$ must hold. By Lemma 4, there are no more than $\beta_d^{(N)}(n - l - 2(N + 1))$ functions in $H$ that satisfy the latter. Hence for any $S \subseteq [n]$ of cardinality $|S| = l$, $|H(S)| \le \beta_d^{(N)}(n - l - 2(N + 1))$.
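To get a feel for the size of the bound in Theorem 1, the sketch below evaluates it next to Sauer's bound of Lemma 1. It does not use the closed form of Lemma 3; instead it swaps in a small dynamic program that counts what the proof of Lemma 3 counts, namely binary sequences with at most $r$ ones whose constant segments have length at most $N + 1$. That equivalence, the names `beta`, `sauer_bound` and `theorem_bound`, and the choice to reject a negative domain size are assumptions of this illustration.

```python
from functools import lru_cache
from math import comb


def beta(r, N, n):
    """Count length-n binary sequences with <= r ones and all runs <= N + 1."""
    max_run = N + 1

    @lru_cache(maxsize=None)
    def count(pos, ones, last, run):
        if ones > r:
            return 0
        if pos == n:
            return 1
        total = 0
        for v in (0, 1):
            new_run = run + 1 if v == last else 1
            if new_run <= max_run:
                total += count(pos + 1, ones + v, v, new_run)
        return total

    return count(0, 0, -1, 0)


def sauer_bound(n, d):
    """Sauer's bound: sum of binomial(n, i) for i = 0, ..., d."""
    return sum(comb(n, i) for i in range(d + 1))


def theorem_bound(n, l, d, N):
    """beta_d^{(N)} evaluated at n - l - 2(N + 1), as in Theorem 1."""
    free = n - l - 2 * (N + 1)
    if free < 0:
        raise ValueError("n - l - 2(N + 1) must be non-negative")
    return beta(d, N, free)


n, l, d, N = 12, 2, 4, 1
print(theorem_bound(n, l, d, N), sauer_bound(n, d))
```

Because the dynamic program only ever counts sequences with at most $d$ ones, its value can never exceed Sauer's bound on the same domain size.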

5 Conclusions

The main result of the paper is a bound on the cardinality of a class of finite VC-dimension consisting of binary functions on $[n]$ which have a margin greater than $N$ on a set $S$ of cardinality $l$. This result generalizes the well-known Sauer's Lemma and is analogous to existing bounds on the covering number of classes of real-valued functions that have a large margin on a sample $S$. The result may be used for obtaining the sample complexity of PAC-learning a class of boolean hypotheses while maximizing the margin on a given training sample.

Bibliography

G. E. Andrews. The Theory of Partitions. Cambridge University Press, 1998.
M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, 2000.
D. Haussler and P. M. Long. A generalization of Sauer's lemma. Journal of Combinatorial Theory (A), 71(2):219-240, 1995.
J. Ratsaby. On the complexity of good samples for learning. Technical Report RN/03/12, Department of Computer Science, University College London, September 2003.
N. Sauer. On the density of families of sets. J. Combinatorial Theory (A), 13:145-147, 1972.
V. Vapnik. Statistical Learning Theory. Wiley, 1998.
V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264-280, 1971.