Support Vector Machine Soft Margin Classifiers: Error Analysis


 Nigel Gibson
 1 years ago
 Views:
Transcription
1 Journal of Machine Learning Research? (2004)??? Subitted 9/03; Published??/04 Support Vector Machine Soft Margin Classifiers: Error Analysis DiRong Chen Departent of Applied Matheatics Beijing University of Aeronautics and Astronautics Beijing , P R CHINA Qiang Wu Yiing Ying DingXuan Zhou Departent of Matheatics City University of Hong Kong Kowloon, Hong Kong, P R CHINA Editor: Peter Bartlett Abstract The purpose of this paper is to provide a PAC error analysis for the qnor soft argin classifier, a support vector achine classification algorith It consists of two parts: regularization error and saple error While any techniques are available for treating the saple error, uch less is known for the regularization error and the corresponding approxiation error for reproducing kernel Hilbert spaces We are ainly concerned about the regularization error It is estiated for general distributions by a Kfunctional in weighted L q spaces For weakly separable distributions (ie, the argin ay be zero) satisfactory convergence rates are provided by eans of separating functions A projection operator is introduced, which leads to better saple error estiates especially for sall coplexity kernels The isclassification error is bounded by the V risk associated with a general class of loss functions V The difficulty of bounding the offset is overcoe Polynoial kernels and Gaussian kernels are used to deonstrate the ain results The choice of the regularization paraeter plays an iportant role in our analysis Keywords: support vector achine classification, isclassification error, qnor soft argin classifier, regularization error, approxiation error 1 Introduction In this paper we study support vector achine (SVM) classification algoriths and investigate the SVM qnor soft argin classifier with 1 < q < Our purpose is to provide an error analysis for this algorith in the PAC fraework Let (X, d) be a copact etric space and Y = 1, 1 A binary classifier f : X 1, 1 is a function fro X to Y which divides the input space X into two classes Let ρ be a probability distribution on Z := X Y and (X, Y) be the corresponding rando variable The isclassification error for a classifier f : X Y is defined to be the c 2004 DiRong Chen, Qiang Wu, Yiing Ying, and DingXuan Zhou
2 Chen, Wu, Ying, and Zhou probability of the event f(x ) Y: R(f) := Prob f(x ) Y = X P (Y = f(x) x)dρ X (x) (1) Here ρ X is the arginal distribution on X and P ( x) is the conditional probability easure given X = x The SVM qnor soft argin classifier (Cortes & Vapnik 1995; Vapnik 1998) is constructed fro saples and depends on a reproducing kernel Hilbert space associated with a Mercer kernel Let K : X X R be continuous, syetric and positive seidefinite, ie, for any finite set of distinct points x 1,, x l X, the atrix (K(x i, x j )) l i,j=1 is positive seidefinite Such a kernel is called a Mercer kernel The Reproducing Kernel Hilbert Space (RKHS) H K associated with the kernel K is defined (Aronszajn, 1950) to be the closure of the linear span of the set of functions K x := K(x, ) : x X with the inner product, HK =, K satisfying K x, K y K = K(x, y) and K x, g K = g(x), x X, g H K Denote C(X) as the space of continuous functions on X with the nor Let κ := K Then the above reproducing property tells us that g κ g K, g H K (2) Define H K := H K + R For a function f = f 1 + b with f 1 H K and b R, we denote f = f 1 and b f = b R The constant ter b is called the offset For a function f : X R, the sign function is defined as sgn(f)(x) = 1 if f(x) 0 and sgn(f)(x) = 1 if f(x) < 0 Now the SVM qnor soft argin classifier (SVM qclassifier) associated with the Mercer kernel K is defined as sgn (f z ), where f z is a iniizer of the following optiization proble involving a set of rando saples z = (x i, y i ) i=1 Z independently drawn according to ρ: 1 f z := arg in f H K 2 f 2 K + C ξ q i, i=1 subject to y i f(x i ) 1 ξ i, and ξ i 0 for i = 1,, Here C is a constant which depends on : C = C(), and often li C() = Throughout the paper, we assue 1 < q <, N, C > 0, and z = (x i, y i ) i=1 are rando saples independently drawn according to ρ Our target is to understand how sgn(f z ) converges (with respect to the isclassification error) to the best classifier, the Bayes rule, as and hence C() tend to infinity Recall the regression function of ρ: f ρ (x) = ydρ(y x) = P (Y = 1 x) P (Y = 1 x), x X (4) Y Then the Bayes rule is given (eg Devroye, L Györfi & G Lugosi 1997) by the sign of the regression function f c := sgn(f ρ ) Estiating the excess isclassification error (3) R(sgn(f z )) R(f c ) (5) 2
3 SVM Soft Margin Classifiers for the classification algorith (3) is our goal In particular, we try to understand how the choice of the regularization paraeter C affects the error To investigate the error bounds for general kernels, we rewrite (3) as a regularization schee Define the loss function V = V q as V (y, f(x)) := (1 yf(x)) q + = y f(x) q χ yf(x) 1, (6) where (t) + = ax0, t The corresponding V risk is E(f) := E(V (y, f(x))) = V (y, f(x))dρ(x, y) (7) If we set the epirical error as E z (f) := 1 Z V (y i, f(x i )) = 1 i=1 (1 y i f(x i )) q +, (8) then the schee (3) can be written as (see Evgeniou, Pontil & Poggio 2000) f z = arg in E z (f) + 1 f H K 2C f 2 K (9) Notice that when H K is replaced by H K, the schee (9) is exactly the Tikhonov regularization schee (Tikhonov & Arsenin, 1977) associated with the loss function V So one ay hope that the ethod for analyzing regularization schees can be applied The definitions of the V risk (7) and the epirical error (8) tell us that for a function f = f + b H K, the rando variable ξ = V (y, f(x)) on Z has the ean E(f) and 1 i=1 ξ(z i) = E z (f) Thus we ay expect by soe standard epirical risk iniization (ERM) arguent (eg Cucker & Sale 2001, Evgeniou, Pontil & Poggio 2000, Shawe Taylor et al 1998, Vapnik 1998, Wahba 1990) to derive bounds for E(f z ) E(f q ), where f q is a iniizer of the V risk (7) It was shown in Lin (2002) that for q > 1 such a iniizer is given by i=1 f q (x) = (1 + f ρ(x)) 1/(q 1) (1 f ρ (x)) 1/(q 1) (1 + f ρ (x)) 1/(q 1), x X (10) + (1 f ρ (x)) 1/(q 1) For q = 1 a iniizer is f c, see Wahba (1999) Note that sgn(f q ) = f c Recall that for the classification algorith (3), we are interested in the excess isclassification error (5), not the excess V risk E(f z ) E(f q ) But we shall see in Section 3 that R(sgn(f)) R(f c ) 2 (E(f) E(f q )) One ight apply this to the function f z and get estiates for (5) However, special features of the loss function (6) enable us to do better: by restricting f z onto [ 1, 1], we can iprove the saple error estiates The idea of the following projection operator was introduced for this purpose in Bartlett (1998) Definition 1 The projection operator π is defined on the space of easurable functions f : X R as 1, if f(x) 1, π(f)(x) = 1, if f(x) 1, (11) f(x), if 1 < f(x) < 1 3
4 Chen, Wu, Ying, and Zhou It is trivial that sgn(π(f)) = sgn(f) Hence R(sgn(f z )) R(f c ) 2 (E(π(f z )) E(f q )) (12) The definition of the loss function (6) also tells us that V (y, π(f)(x)) V (y, f(x)), so E(π(f)) E(f) and E z (π(f)) E z (f) (13) According to (12), we need to estiate E(π(f z )) E(f q ) in order to bound (5) To this end, we introduce a regularizing function f K,C H K It is arbitrarily chosen and depends on C Proposition 2 Let f K,C H K, and f z be defined by (9) Then E(π(f z )) E(f q ) can be bounded by E(f K,C ) E(f q ) + 1 2C f K,C 2 K + E(π(f z )) E z (π(f z )) + E z (f K,C ) E(f K,C ) (14) Proof Decopose the difference E(π(f z )) E(f q ) as ( E(π(f z )) E z (π(f z )) + E z (π(f z )) E z (f K,C ) E(f K,C ) + ) ( E z (f K,C ) + 1 2C f z 2 K E(f K,C ) E(f q ) + 1 2C f K,C 2 K 2C f K,C 2 K 1 2C f z 2 K By the definition of f z and (13), the second ter is 0 Then the stateent follows ) The first ter in (14) is called the regularization error (Sale & Zhou 2004) It can be expressed as a generalized Kfunctional inf E(f) E(fq ) + 1 2C f 2 K when fk,c takes f H K a special choice f K,C (a standard choice in the literature, eg Steinwart 2001) defined as f K,C := arg in E(f) + 1 f H K 2C f 2 K (15) Definition 3 Let V be a general loss function and f V ρ be a iniizer of the V risk (7) The regularization error for the regularizing function f K,C H K is defined as D(C) := E(f K,C ) E(f V ρ ) + 1 2C f K,C 2 K (16) It is called the regularization error of the schee (9) when f K,C = f K,C : D(C) := inf E(f) E(fρ V ) + 1 f H K 2C f 2 K 4
5 SVM Soft Margin Classifiers The ain concern of this paper is the regularization error We shall investigate its asyptotic behavior This investigation is not only iportant for bounding the first ter in (14), but also crucial for bounding the second ter (saple error): it is well known in structural risk iniization (ShaweTaylor et al 1998) that the size of the hypothesis space is essential This is deterined by D(C) in our setting Once C is fixed, the saple error estiate becoes routine Therefore, we need to understand the choice of the paraeter C fro the bound for D(C) Proposition 4 For any C 1/2 and any f K,C H K, there holds where κ := E 0 /(1 + κq4 q 1 ), D(C) D(C) κ2 2C (17) E 0 := inf b R E(b) E(f q) (18) Moreover, κ = 0 if and only if for soe p 0 [0, 1], P (Y = 1 x) = p 0 in probability This proposition will be proved in Section 5 According to Proposition 4, the decay of D(C) cannot be faster than O(1/C) except for soe very special distributions This special case is caused by the offset in (9), for which D(C) 0 Throughout this paper we shall ignore this trivial case and assue κ > 0 When ρ is strictly separable, D(C) = O(1/C) But this is a very special phenoenon In general, one should not expect E(f) = E(f q ) for soe f H K Even for (weakly) separable distributions with zero argin, D(C) decays as O(C p ) for soe 0 < p < 1 To realize such a decay for these separable distributions, the regularizing function will be ultiples of a separating function For details and the concepts of strictly or weakly separable distributions, see Section 2 For general distributions, we shall choose f K,C = f K,C in Sections 6 and 7 and estiate the regularization error of the schee (9) associated with the loss function (6) by eans of the approxiation in the function space L q ρ X In particular, D(C) 1 K(fq, 2C ), where K(f q, t) is a Kfunctional defined as K(f q, t) := inf f f q q L q + t f 2 ρ K f H K X inf q2 q 1 (2 q 1 + 1) f f q L q ρx + t f 2 K f H K, if 1 < q 2,, if q > 2 In the case q = 1, the regularization error (Wu & Zhou, 2004) depends on the approxiation in L 1 ρ X of the function f c which is not continuous in general For q > 1, the regularization error depends on the approxiation in L q ρ X of the function f q When the regression function has good soothness, f q has uch higher regularity than f c Hence the convergence for q > 1 ay be faster than that for q = 1, which iproves the regularization error The second ter in (14) is called the saple error When C is fixed, the saple error E(f z ) E z (f z ) is well understood (except for the offset ter) In (14), the saple error is for the function π(f z ) instead of f z while the isclassification error (5) is kept: (19) 5
6 Chen, Wu, Ying, and Zhou R(sgn(π(f z ))) = R(sgn(f z )) Since the bound for V (y, π(f)(x)) is uch saller than that for V (y, f(x)), the projection iproves the saple error estiate Based on estiates for the regularization error and saple error above, our error analysis will provide ε(δ,, C, β) > 0 for any 0 < δ < 1 such that with confidence 1 δ, E(π(f z )) E(f q ) (1 + β)d(c) + ε(δ,, C, β) (20) Here 0 < β 1 is an arbitrarily fixed nuber Moreover, li ε(δ,, C, β) = 0 If f q lies in the L q ρ X closure of H K, then li K(f q, t) = 0 by (19) Hence E(π(f z )) t 0 E(f q ) 0 with confidence as (and hence C = C()) becoes large This is the case when K is a universal kernel, ie, H K is dense in C(X), or when a sequence of kernels whose RKHS tends to be dense (eg polynoial kernels with increasing degrees) is used In suary, estiating the excess isclassification error R(sgn(f z )) R(f c ) consists of three parts: the coparison (12), the regularization error D(C) and the saple error ε(δ,, C, β) in (20) As functions of the variable C, D(C) decreases while ε(δ,, C, β) usually increases Choosing suitable values for the regularization paraeter C will give the optial convergence rate To this end, we need to consider the tradeoff between these two errors This can be done by iniizing the right side of (20), as shown in the following for Lea 5 Let p, α, τ > 0 Denote c p,α,τ := (p/τ) τ τ+p + (τ/p) p τ+p Then for any C > 0, ( ) αp C p + Cτ 1 α c τ+p p,α,τ (21) The equality holds if and only if C = (p/τ) 1 τ+p α τ+p This yields the optial power αp τ+p The goal of the regularization error estiates is to have p in D(C) = O(C p ), as large as possible But p 1 according to Proposition 4 Good ethods for saple error estiates provide large α and sall τ such that ε(δ,, C, β) = O( Cτ ) Notice that, as always, both α the approxiation properties (represented by the exponent p) and the estiation properties (represented by the exponents τ and α) are iportant to get good estiates for the learning rates (with the optial rate αp τ+p ) 2 Deonstrating with Weakly Separable Distributions With soe special cases let us deonstrate how our error analysis yields soe guidelines for choosing the regularization paraeter C For weakly separable distributions, we also copare our results with bounds in the literature To this end, we need the covering nuber of the unit ball B := f H K : f K 1 of H K (considered as a subset of C(X)) Definition 6 For a subset F of a etric space and η > 0, the covering nuber N (F, η) is defined to be the inial integer l N such that there exist l disks with radius η covering F 6
7 SVM Soft Margin Classifiers Denote the covering nuber of B in C(X) by N (B, η) Recall the constant κ defined in (18) Since the algorith involves an offset ter, we also need its covering nuber and set N (η) := (κ + 1 κ ) 1 η + 1 N (B, η) (22) Definition 7 We say that the Mercer kernel K has logarithic coplexity exponent s 1 if for soe c > 0, there holds log N (η) c (log(1/η)) s, η > 0 (23) The kernel K has polynoial coplexity exponent s > 0 if log N (η) c (1/η) s, η > 0 (24) The covering nuber N (B, η) has been extensively studied (see eg Bartlett 1998, Williason, Sola & Schölkopf 2001, Zhou 2002, Zhou 2003a) In particular, for convolution type kernels K(x, y) = k(x y) with ˆk 0 decaying exponentially fast, (23) holds (Zhou 2002, Theore 3) As an exaple, consider the Gaussian kernel K(x, y) = exp x y 2 /σ 2 with σ > 0 If X [0, 1] n and 0 < η exp90n 2 /σ 2 11n 3, then (23) is valid (Zhou 2002) with s = n + 1 A lower bound (Zhou 2003a) holds with s = n 2, which shows the upper bound is alost sharp It was also shown in (Zhou 2003a) that K has polynoial coplexity exponent 2n/p if K is C p To deonstrate our ain results concerning the regularization paraeter C, we shall only choose kernels having logarithic coplexity exponents or having polynoial coplexity exponents with s sall This eans, H K has sall coplexity The special case we consider here is the deterinistic case: R(f c ) = 0 Understanding how to find D(C) and ε(δ,, C, β) in (20) and then how to choose the paraeter C is our target For M 0, denote 0, if M = 0, θ M = (25) 1, if M > 0 Proposition 8 Suppose R(f c ) = 0 If f K,C H K satisfies V (y, f K,C (x)) M, then for every 0 < δ < 1, with confidence at least 1 δ, we have R(sgn(f z )) 2 ax ε 4M log(2/δ), + 4D(C), (26) where with M := 2CD(C) + 2Cεθ M, ε > 0 is the unique solution to the equation ( ) ε log N q2 q+3 M (a) If (23) is valid, then with c = 2 q c((q + 4) log 8) s, 3ε = log(δ/2) (27) 2q+9 ( (log + log(cd(c)) + ε log(cθm )) s ) + log(2/δ) c 7
8 Chen, Wu, Ying, and Zhou (b) If (24) is valid and C 1, then with a constant c (given explicitly in the proof), (2CD(C)) s/(2s+2) ε c log(2/δ) 1/(s+1) + (2C)s/(s+2) 1/( s 2 +1) (28) Note that the above constant c depends on c, q, and κ, but not on C, or δ The proof of Proposition 8 will be given in Section 5 We can now derive our error bound for weakly separable distributions Definition 9 We say that ρ is (weakly) separable by H K if there is a function fsp H K, called a separating function, satisfying f sp K = 1 and yfsp(x) > 0 alost everywhere It has separation exponent θ (0, + ] if there are positive constants γ, c such that ρ X x X : fsp(x) < γt c t θ, t > 0 (29) Observe that condition (29) with θ = + is equivalent to ρ X x X : fsp(x) < γt = 0, 0 < t < 1 That is, fsp(x) γ alost everywhere Thus, weakly separable distributions with separation exponent θ = + are exactly strictly separable distributions Recall (eg Vapnik 1998, ShaweTaylor et al 1998) that ρ is said to be strictly separable by H K with argin γ > 0 if ρ is (weakly) separable together with the requireent yfsp(x) γ alost everywhere Fro the first part of Definition 9, we know that for a separable distribution ρ, the set x : fsp(x) = 0 has ρ X easure zero, hence li ρ Xx X : fsp(x) < t = 0 t 0 The separation exponent θ easures the asyptotic behavior of ρ near the boundary of two classes It gives the convergence with a polynoial decay Theore 10 If ρ is separable and has separation exponent θ (0, + ] with (29) valid, then for every 0 < δ < 1, with confidence at least 1 δ, we have R(sgn(f z )) 2ε + 8 log(2/δ) where ε is solved by ( ε log N q2 q+3 2(2c ) 2 θ+2 γ 2θ 2 θ+2 C θ+2 + 2Cε + 4(2c ) 2 θ+2 (Cγ 2 ) θ θ+2, (30) ) 3ε = log(δ/2) (31) 2q+9 (a) If (23) is satisfied, with a constant c depending on γ we have ( log(2/δ) + (log + log C) s ) R(sgn(f z )) c + C θ θ+2 (32) 8
9 SVM Soft Margin Classifiers (b) If (24) is satisfied, there holds R(sgn(f z )) c log 2 ( C δ s (s+1)(θ+2) 1 s+1 + C s s+2 1 s/2+1 ) + C θ θ+2 (33) Proof Choose t = ( γθ 2c C )1/(θ+2) > 0 and the function f K,C = 1 t f sp H K Then we have 1 2C f K,C 2 K = 1 2Ct 2 Since yfsp(x) > 0 alost everywhere, we know that for alost every x X, y = sgn(fsp)(x) This eans f ρ = sgn(fsp) and then f q = sgn(fsp) It follows that R(f c ) = 0 and V (y, f K,C (x)) = (1 f sp(x) t ) q + [0, 1] alost everywhere Therefore, we ay take M = 1 in Proposition 8 Moreover, E(f K,C ) = (1 yf ) sp(x) q (1 t dρ = f ) sp(x) q + t dρ X + This can be bounded by (29) as (1 f sp(x) x: fsp(x) <t Z t ) q X dρ X ρ X x X : fsp(x) < t c ( t ) θ + γ It follows fro the choice of t that D(C) c ( t ) θ 1 + γ 2Ct 2 = (2c ) 2 θ+2 (Cγ 2 ) θ θ+2 (34) Then our conclusion follows fro Proposition 8 In case (a), the bound (32) tells us to choose C such that (log C) s / 0 and C as By (32) a reasonable choice of the regularization paraeter C is C = θ+2 θ, (log )s which yields R(sgn(f z )) = O( ) In case (b), when (24) is valid, C = θ+2 θ+s+θs = R(sgn(fz )) = O( θ θ+s+θs ) (35) The following is a typical exaple of separable distribution which is not strictly separable In this exaple, the separation exponent is θ = 1 Exaple 1 Let X = [ 1/2, 1/2] and ρ be the Borel probability easure on Z such that ρ X is the Lebesgue easure on X and 1, for 1/2 x 1/4 and 1/4 x 1/2, f ρ (x) = 1, for 1/4 x < 1/4 (a) If K is the linear polynoial kernel K(x, y) = x y, then R(sgn(f z )) R(f c ) 1/4 (b) If K is the quadratic polynoial kernel K(x, y) = (x y) 2, then with confidence 1 δ, R(sgn(f z )) q(q + 4)2q+11 (log + log C + 2 log(2/δ)) + 16 C 1/3 Hence one should take C such that log C/ 0 and C as 9
10 Chen, Wu, Ying, and Zhou Proof The first stateent is trivial To see (b), we note that dih K = 1 and κ = 1/4 Also, E 0 = inf E(b) = 1 Hence κ = b R 1/(1 + q4 q 2 ) Since H K = ax 2 : a R and ax 2 K = a, N (B, η) 1/(2η) Then (23) holds with s = 1 and c = 4q By Proposition 8, we see that ε q(q+4)2q+11 (log(c)+log(2/δ)) Take the function fsp(x) = x 2 1/16 H K with fsp K = 1 We see that yfsp(x) > 0 alost everywhere Moreover, [ 1 16t t x X : fsp(x) < t, 4 4 ] [ t, t The easure of this set is bounded by 8 2t Hence (29) holds with θ = 1, γ = 1 and c = 8 2 Then Theore 10 gives the stated bound for R(sgn(f z )) 4 ] In this paper, for the saple error estiates we only use the (unifor) covering nubers in C(X) Within the last a few years, the epirical covering nubers (in l or l 2 ), the leaveoneout error or stability analysis (eg Vapnik 1998, Bousquet & Ellisseeff 2002, Zhang 2004), and soe other advanced epirical process techniques such as the local Radeacher averages (van der Vaart & Wellner 1996, Bartlett, Bousquet & Mendelson 2004, and references therein) and the entropy integrals (van der Vaart & Wellner, 1996) have been developed to get better saple error estiates These techniques can be applied to various learning algoriths They are powerful to handle general hypothesis spaces even with large capacity In Zhang (2004) the leaveoneout technique was applied to iprove the saple error estiates given in Bousquet & Ellisseeff (2002): the saple error has a kernelindependent bound O( C C ), iproving the bound O( ) in Bousquet & Ellisseeff (2002); while the regularization error D(C) depends on K and ρ In particular, for q = 2 the bound (Zhang 2004, Corollary 42) takes the for: ( E(E(f z )) 1 + 4κ2 C ) 2 inf E(f) + 1 ( f H K 2C f 2 K = 1 + 4κ2 C ) 2 D(C) + E(fq ) (36) In Bartlett, Jordan & McAuliffe (2003) the epirical process techniques are used to iprove the saple error estiates for ERM In particular, Theore 12 there states that for a convex set F of functions on X, the iniizer f of the epirical error over F satisfies E( f) inf E(f) + K ax f F ( ε cr L 2 log(1/δ) ) 1/(2 β), BL log(1/δ), (37) Here K, c r, β are constants, and ε is solved by an inequality The constant B is a bound for differences, and L is the Lipschitz constant for the loss φ with respect to a pseudoetric on R For ore details, see Bartlett, Jordan & McAuliffe (2003) The definitions of B and L together tell us that φ(y 1 f(x 1 )) φ(y 2 f(x 2 )) LB, (x 1, y 1 ) Z, (x 2, y 2 ) Z, f F (38) Because of our iproveent for the bound of the rando variables V (y, f(x)) given by the projection operator, for the algorith (3) the error bounds we derive here are better 10
11 SVM Soft Margin Classifiers than existing results when the kernel has sall coplexity Let us confir this for separable distributions with separation exponent 0 < θ < and for kernels with polynoial coplexity exponent s > 0 satisfying (24) Recall that D(C) = O(C θ θ+2 ) Take q = 2 Consider the estiate (36) This together with (34) tells us that the derived bound for E(E(f z ) E(f q )) is at least of the order D(C) + C D(C) = O(C θ θ+2 + C θ+2 2 ) By Lea 5, the optial bound derived fro (36) is O( θ θ+2 ) Thus, our bound (35) is better than (36) when 0 < s < 2 θ+1 Turn to the estiate (37) Take F to be the ball of radius 2C D(C) of H K (the expected sallest ball where fz lies), we find that the constant LB in (38) should be at least κ 2C D(C) This tells us that one ter in the bound (37) for the saple error is ( 2C at least O D(C) ) = O( C θ+2 1 ), while the other two ters are ore involved Applying Lea 5 again, we find that the bound derived fro (37) is at least O( θ θ+1 ) So our bound (35) is better than (37) at least for 0 < s < 1 θ+1 Thus, our analysis with the help of the projection operator iproves existing error bounds when the kernel has sall coplexity Note that in (35), the values of s for which the projection operator gives an iproveent are values corresponding to rapidly diinishing regularization (C = β with β > 1 being large) For kernels with large coplexity, refined epirical process techniques (eg van der Vaart & Wellner 1996, Mendelson 2002, Zhang 2004, Bartlett, Jordan & McAuliffe 2003) should give better bounds: the capacity of the hypothesis space ay have ore influence on the saple error than the bound of the rando variable V (y, f(x)), and the entropy integral is powerful to reduce this influence To find the range of this large coplexity, one needs explicit rate analysis for both the saple error and regularization error This is out of our scope It would be interesting to get better error bounds by cobining the ideas of epirical process techniques and the projection operator Proble 11 How uch iproveent can we get for the total error when we apply the projection operator to epirical process techniques? Proble 12 Given θ > 0, s > 0, q 1, and 0 < δ < 1, what is the largest nuber α > 0 such that R(sgn(f z )) = O( α ) with confidence 1 δ whenever the distribution ρ is separable with separation exponent θ and the kernel K satisfies (24)? What is the corresponding optial choice of the regularization paraeter C? In particular, for Exaple 1 we have the following Conjecture 13 In Exaple 1, with confidence 1 δ there holds R(sgn(f z )) = O( log(2/δ) ) by choosing C = 3 Consider the well known setting (Vapnik 1998, ShaweTaylor et al 1998, Steinwart 2001, Cristianini & ShaweTaylor 2000) of strictly separable distributions with argin γ > 0 In this case, ShaweTaylor et al (1998) shows that R(sgn(f)) 2 (log N (F, 2, γ/2) + log(2/δ)) (39) 11
12 Chen, Wu, Ying, and Zhou with confidence 1 δ for > 2/γ whenever f F satisfies y i f(x i ) γ Here F is a function set, N (F, 2, γ/2) = sup N (F, t, γ/2) and N (F, t, γ/2) denotes the covering t X 2 nuber of the set (f(t i )) 2 i=1 : f F in R2 with the l etric For a coparison of this covering nuber with that in C(X), see Pontil (2003) In our analysis, we can take f K,C = 1 γ f sp H K Then M = V (y, f K,C (x)) = 0, and D(C) = 1 We see fro Proposition 8 (a) that when (23) is satisfied, there holds 2Cγ 2 R(f z ) c( (log(1/γ)+log )s +log(1/δ) ) But the optial bound O( 1 ) is also valid for spaces with larger coplexity, which can be seen fro (39) by the relation of the covering nubers: N (F, 2, γ/2) N (F, γ/2) See also Steinwart (2001) We see the shortcoing of our approach for kernels with large coplexity This convinces the interest of Proble 11 raised above 3 Coparison of Errors In this section we consider how to bound the isclassification error by the V risk A systeatic study of this proble was done for convex loss functions by Zhang (2004), and for ore general loss functions by Bartlett, Jordan, and McAuliffe (2003) Using these results, it can be shown that for the loss function V q, there holds ( ) ψ R(sgn(f)) R(f c ) E(f) E(f q ), where ψ : [0, 1] R + is a function defined in Bartlett, Jordan, and McAuliffe (2003) and can be explicitly coputed In fact, we have R(sgn(f)) R(f c ) c E(f) E(f q ) (40) Such a coparison of errors holds true even for a general convex loss function, see Theore 34 in Appendix The derived bound for the constant c in (40) need not be optial In the following we shall give an optial estiate in a sipler for The constant we derive depends on q and is given by C q = 1, if 1 q 2, 2(q 1)/q, if q > 2 (41) We can see that C q 2 Theore 14 Let f : X R be easurable Then R(sgn(f)) R(f c ) C q (E(f) E(f q )) 2 (E(f) E(f q )) Proof By the definition, only points with sgn(f)(x) f c (x) are involved for the isclassification error Hence R(sgn(f)) R(f c ) = f ρ (x) χ sgn(f)(x) sgn(f q)(x)dρ X (42) X 12
13 SVM Soft Margin Classifiers This in connection with the Schwartz inequality iplies that R(sgn(f)) R(f c ) X f ρ (x) 2 χ sgn(f)(x) sgn(f q)(x)dρ X 1/2 Thus it is sufficient to show for those x X with sgn(f)(x) sgn(f q )(x), where for x X, we have denoted E(f x) := By the definition of the loss function V q, we have It follows that f ρ (x) 2 C q (E(f x) E(f q x)), (43) Y V q (y, f(x))dρ(y x) (44) E(f x) = (1 f(x)) q 1 + f ρ (x) + + (1 + f(x)) q 1 f ρ (x) + (45) 2 2 E(f x) E(f q x) = f(x) fq(x) where F (u) is the function (depending on the paraeter f ρ (x)): F (u) = 1 f ρ(x) q(1 + f q (x) + u) q f ρ(x) 2 F (u)du, (46) q(1 f q (x) u) q 1 +, u R Since F (u) is nondecreasing and F (0) = 0, we see fro (46) that when sgn(f)(x) sgn(f q )(x), there holds But E(f x) E(f q x) E(f q x) = fq(x) 0 F (u)du = E(0 x) E(f q x) = 1 E(f q x) (47) 2 q 1 ( 1 f ρ (x) 2) (1 + fρ (x)) 1/(q 1) + (1 f ρ (x)) 1/(q 1) q 1 (48) Therefore, (43) is valid (with t = f ρ (x)) once the following inequality is verified: t 2 2 q 1 ( 1 t 2) 1 C q (1 + t) 1/(q 1) + (1 t) 1/(q 1) q 1, t [ 1, 1] This is the sae as the inequality ) 1/(q 1) (1 t2 (1 + t) 1/(q 1) + (1 t) 1/(q 1), t [ 1, 1] (49) 2 C q To prove (49), we use the Taylor expansion (1 + u) α = k=0 ( ) α u k = k k=0 k 1 l=0 (α l) k! 13 u k, u ( 1, 1)
14 Chen, Wu, Ying, and Zhou With α = 1/(q 1), we have ) 1/(q 1) (1 t2 = C q k=0 ( ) k 1 1 l=0 q 1 + l k! 1 C k q t 2k and (all the odd power ters vanish) (1 + t) 1/(q 1) + (1 t) 1/(q 1) 2 = k=0 ( ) k 1 1 l=0 q 1 + l k! ( k 1 l=0 1 q 1 + l + k 1 + l + k ) t 2k Note that k 1 l=0 This proves (49) and hence our conclusion 1 q 1 + l + k 1 + l + k 1 Cq k When R(f c ) = 0, the estiate in Theore 14 can be iproved (Zhang 2004, Bartlett, Jordan & McAuliffe 2003) to R(sgn(f)) E(f) (50) In fact, R(f c ) = 0 iplies f ρ (x) = 1 alost everywhere This in connection with (48) and (47) gives E(f q x) = 0 and E(f x) 1 when sgn(f)(x) f c (x) Then (43) can be iproved to the for f ρ (x) E(f x) Hence (50) follows fro (42) Theore 14 tells us that the isclassification error can be bounded by the V risk associated with the loss function V So our next step is to study the convergence of the qclassifier with respect to the V risk E 4 Bounding the Offset If the offset b is fixed in the schee (3), the saple error can be bounded by standard arguent using soe easureents of the capacity of the RKHS However, the offset is part of the optiization proble and even its boundedness can not be seen fro the definition This akes our setting here essentially different fro the standard Tikhonov regularization schee The difficulty of bounding the offset b has been realized in the literature (eg Steinwart 2002, Bousquet & Ellisseeff 2002) In this paper we shall overcoe this difficulty by eans of special features of the loss function V The difficulty raised by the offset can also be seen fro the stability analysis (Bousquet & Ellisseeff 2002) As shown in Bousquet & Ellisseeff (2002), the SVM 1nor classifier without the offset is uniforly stable, eaning that sup z Z,z 0 Z V (y, f z (x)) V (y, f z (x)) L β with β 0 as, and z is the sae as z except that one eleent of z is replaced by z 0 The SVM 1nor classifier with the offset is not uniforly stable To see this, we choose x 0 X and saples z = (x 0, y i ) 2n+1 i=1 with y i = 1 for i = 1,, n + 1, and y i = 1 for i = n+2,, 2n+1 Take z 0 = (x 0, 1) As x i are identical, one can see fro the definition 14
Bayes Point Machines
Journal of Machine Learning Research (2) 245 279 Subitted 2/; Published 8/ Bayes Point Machines Ralf Herbrich Microsoft Research, St George House, Guildhall Street, CB2 3NH Cabridge, United Kingdo Thore
More informationABSTRACT KEYWORDS. Comonotonicity, dependence, correlation, concordance, copula, multivariate. 1. INTRODUCTION
MEASURING COMONOTONICITY IN MDIMENSIONAL VECTORS BY INGE KOCH AND ANN DE SCHEPPER ABSTRACT In this contribution, a new easure of coonotonicity for diensional vectors is introduced, with values between
More informationCapacity of MultipleAntenna Systems With Both Receiver and Transmitter Channel State Information
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO., OCTOBER 23 2697 Capacity of MultipleAntenna Systes With Both Receiver and Transitter Channel State Inforation Sudharan K. Jayaweera, Student Meber,
More informationDynamic rightsizing for powerproportional data centers Extended version
1 Dynaic rightsizing for powerproportional data centers Extended version Minghong Lin, Ada Wieran, Lachlan L. H. Andrew and Eno Thereska Abstract Power consuption iposes a significant cost for data centers
More informationCovering Number Bounds of Certain Regularized Linear Function Classes
Journal of Machine Learning Research 2 (2002) 527550 Submitted 6/200; Published 3/2002 Covering Number Bounds of Certain Regularized Linear Function Classes Tong Zhang T.J. Watson Research Center Route
More informationWhen Is There a Representer Theorem? Vector Versus Matrix Regularizers
Journal of Machine Learning Research 10 (2009) 25072529 Submitted 9/08; Revised 3/09; Published 11/09 When Is There a Representer Theorem? Vector Versus Matrix Regularizers Andreas Argyriou Department
More informationNew Support Vector Algorithms
LETTER Communicated by John Platt New Support Vector Algorithms Bernhard Schölkopf Alex J. Smola GMD FIRST, 12489 Berlin, Germany, and Department of Engineering, Australian National University, Canberra
More informationNonPrice Equilibria in Markets of Discrete Goods
NonPrice Equilibria in Markets of Discrete Goods (working paper) Avinatan Hassidi Hai Kaplan Yishay Mansour Noa Nisan ABSTRACT We study arkets of indivisible ites in which pricebased (Walrasian) equilibria
More informationConsistency and Convergence Rates of OneClass SVMs and Related Algorithms
Journal of Machine Learning Research 7 2006 817 854 Submitted 9/05; Revised 1/06; Published 5/06 Consistency and Convergence Rates of OneClass SVMs and Related Algorithms Régis Vert Laboratoire de Recherche
More informationGaussian Processes for Regression: A Quick Introduction
Gaussian Processes for Regression A Quick Introduction M Ebden, August 28 Coents to arkebden@engoacuk MOTIVATION Figure illustrates a typical eaple of a prediction proble given soe noisy observations of
More informationChoosing Multiple Parameters for Support Vector Machines
Machine Learning, 46, 131 159, 2002 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. Choosing Multiple Parameters for Support Vector Machines OLIVIER CHAPELLE LIP6, Paris, France olivier.chapelle@lip6.fr
More informationA Tutorial on Support Vector Machines for Pattern Recognition
c,, 1 43 () Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. A Tutorial on Support Vector Machines for Pattern Recognition CHRISTOPHER J.C. BURGES Bell Laboratories, Lucent Technologies
More informationA Study on SMOtype Decomposition Methods for Support Vector Machines
1 A Study on SMOtype Decomposition Methods for Support Vector Machines PaiHsuen Chen, RongEn Fan, and ChihJen Lin Department of Computer Science, National Taiwan University, Taipei 106, Taiwan cjlin@csie.ntu.edu.tw
More informationSelection of the Number of Principal Components: The Variance of the Reconstruction Error Criterion with a Comparison to Other Methods
Ind. Eng. Che. Res. 1999, 38, 43894401 4389 Selection of the Nuber of Principal Coponents: The Variance of the Reconstruction Error Criterion with a Coparison to Other Methods Sergio Valle, Weihua Li,
More informationThe Set Covering Machine
Journal of Machine Learning Research 3 (2002) 723746 Submitted 12/01; Published 12/02 The Set Covering Machine Mario Marchand School of Information Technology and Engineering University of Ottawa Ottawa,
More informationHow to Use Expert Advice
NICOLÒ CESABIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California
More informationSampling 50 Years After Shannon
Sampling 50 Years After Shannon MICHAEL UNSER, FELLOW, IEEE This paper presents an account of the current state of sampling, 50 years after Shannon s formulation of the sampling theorem. The emphasis is
More informationProbability Estimates for Multiclass Classification by Pairwise Coupling
Journal of Machine Learning Research 5 (2004) 975005 Submitted /03; Revised 05/04; Published 8/04 Probability Estimates for Multiclass Classification by Pairwise Coupling TingFan Wu ChihJen Lin Department
More informationSupportVector Networks
Machine Learning, 20, 273297 (1995) 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. SupportVector Networks CORINNA CORTES VLADIMIR VAPNIK AT&T Bell Labs., Holmdel, NJ 07733,
More informationQUASICONFORMAL MAPPINGS WITH SOBOLEV BOUNDARY VALUES
QUASICONFORMAL MAPPINGS WITH SOBOLEV BOUNDARY VALUES KARI ASTALA, MARIO BONK, AND JUHA HEINONEN Abstract. We consider quasiconformal mappings in the upper half space R n+1 + of R n+1, n 2, whose almost
More informationTHE PROBLEM OF finding localized energy solutions
600 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1997 Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Reweighted Minimum Norm Algorithm Irina F. Gorodnitsky, Member, IEEE,
More informationConditional mean embeddings as regressors
Steffen Grünewälder 1 STEFFEN@CS.UCL.AC.UK Guy Lever 1 G.LEVER@CS.UCL.AC.UK Luca Baldassarre L.BALDASSARRE@CS.UCL.AC.UK Sam Patterson SAM.X.PATTERSON@GMAIL.COM Arthur Gretton, ARTHUR.GRETTON@GMAIL.COM
More informationThe mean field traveling salesman and related problems
Acta Math., 04 (00), 9 50 DOI: 0.007/s50000467 c 00 by Institut MittagLeffler. All rights reserved The mean field traveling salesman and related problems by Johan Wästlund Chalmers University of Technology
More informationWorking Set Selection Using Second Order Information for Training Support Vector Machines
Journal of Machine Learning Research 6 (25) 889 98 Submitted 4/5; Revised /5; Published /5 Working Set Selection Using Second Order Information for Training Support Vector Machines RongEn Fan PaiHsuen
More informationBasic Concepts of Point Set Topology Notes for OU course Math 4853 Spring 2011
Basic Concepts of Point Set Topology Notes for OU course Math 4853 Spring 2011 A. Miller 1. Introduction. The definitions of metric space and topological space were developed in the early 1900 s, largely
More informationDecoding by Linear Programming
Decoding by Linear Programming Emmanuel Candes and Terence Tao Applied and Computational Mathematics, Caltech, Pasadena, CA 91125 Department of Mathematics, University of California, Los Angeles, CA 90095
More informationControllability and Observability of Partial Differential Equations: Some results and open problems
Controllability and Observability of Partial Differential Equations: Some results and open problems Enrique ZUAZUA Departamento de Matemáticas Universidad Autónoma 2849 Madrid. Spain. enrique.zuazua@uam.es
More informationA Unifying View of Sparse Approximate Gaussian Process Regression
Journal of Machine Learning Research 6 (2005) 1939 1959 Submitted 10/05; Published 12/05 A Unifying View of Sparse Approximate Gaussian Process Regression Joaquin QuiñoneroCandela Carl Edward Rasmussen
More informationAn Introduction to the NavierStokes InitialBoundary Value Problem
An Introduction to the NavierStokes InitialBoundary Value Problem Giovanni P. Galdi Department of Mechanical Engineering University of Pittsburgh, USA Rechts auf zwei hohen Felsen befinden sich Schlösser,
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289. Compressed Sensing. David L. Donoho, Member, IEEE
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289 Compressed Sensing David L. Donoho, Member, IEEE Abstract Suppose is an unknown vector in (a digital image or signal); we plan to
More information