A Study on SMO-type Decomposition Methods for Support Vector Machines
A Study on SMO-type Decomposition Methods for Support Vector Machines

Pai-Hsuen Chen, Rong-En Fan, and Chih-Jen Lin
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan
[email protected]

Abstract — Decomposition methods are currently one of the major methods for training support vector machines. They vary mainly according to different working set selections. Existing implementations and analyses usually consider some specific selection rules. This article studies Sequential Minimal Optimization (SMO)-type decomposition methods under a general and flexible way of choosing the two-element working set. Main results include: 1) a simple asymptotic convergence proof, 2) a general explanation of the shrinking and caching techniques, and 3) the linear convergence of the methods. Extensions to some SVM variants are also discussed.

I. INTRODUCTION

Support vector machines (SVMs) [1], [6] are useful classification tools. Given training vectors x_i ∈ R^n, i = 1,...,l, in two classes, and a vector y ∈ R^l such that y_i ∈ {1, −1}, SVMs require the solution of the following optimization problem:

   min_α  f(α) = ½ αᵀQα − eᵀα
   subject to 0 ≤ α_i ≤ C, i = 1,...,l,     (1)
              yᵀα = 0,

where e is the vector of all ones, C < ∞ is the upper bound of all variables, and Q is an l by l symmetric matrix. Training vectors x_i are mapped into a higher (maybe infinite) dimensional space by a function φ, Q_ij ≡ y_i y_j K(x_i, x_j) = y_i y_j φ(x_i)ᵀφ(x_j), and K(x_i, x_j) is the kernel function. Then Q is a positive semi-definite (PSD) matrix. Occasionally some researchers use kernel functions not of the form φ(x_i)ᵀφ(x_j), so Q may be indefinite. Here we also consider such cases and, hence, require only that Q be symmetric.

Because the matrix Q is generally dense and may be too large to store, traditional optimization methods cannot be directly applied to solve (1). Currently the decomposition method is one of the major methods to train SVMs (e.g. [23], [10], [3], [25]). This method, an iterative procedure, considers only a small subset of variables per iteration. Denoted as B, this subset is called the working set. Since each iteration involves only |B| columns of the matrix Q, the memory problem is solved.
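To make the notation above concrete, the following short sketch (ours, not part of the paper) builds Q from a kernel and evaluates the dual objective f(α) = ½αᵀQα − eᵀα of problem (1); the RBF kernel and all variable names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def build_Q(X, y, gamma=1.0):
    # Q[i, j] = y_i * y_j * K(x_i, x_j)
    return (y[:, None] * y[None, :]) * rbf_kernel(X, gamma)

def dual_objective(alpha, Q):
    # f(alpha) = 0.5 * alpha^T Q alpha - e^T alpha
    return 0.5 * alpha @ Q @ alpha - np.sum(alpha)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 3))
    y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    Q = build_Q(X, y)
    print(dual_objective(np.zeros(6), Q))   # 0.0 at the initial feasible point alpha = 0
```

The point of the sketch is only that forming the full l-by-l matrix Q is exactly what becomes infeasible for large l, which motivates working on a few columns of Q at a time.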
A special decomposition method is Sequential Minimal Optimization (SMO) [25], which restricts B to only two elements. Then at each iteration one solves a simple two-variable problem without needing optimization software. The method is sketched as follows:

Algorithm 1 (SMO-type decomposition methods)
1) Find α¹ as the initial feasible solution. Set k = 1.
2) If α^k is a stationary point of (1), stop. Otherwise, find a two-element working set B = {i, j} ⊂ {1,...,l}. Define N ≡ {1,...,l}\B, and let α_B^k and α_N^k be the sub-vectors of α^k corresponding to B and N, respectively.
3) Solve the following sub-problem with the variable α_B:

   min_{α_B}  ½ [α_Bᵀ (α_N^k)ᵀ] [Q_BB Q_BN; Q_NB Q_NN] [α_B; α_N^k] − [e_Bᵀ e_Nᵀ] [α_B; α_N^k]
            = ½ [α_i α_j] [Q_ii Q_ij; Q_ij Q_jj] [α_i; α_j] + (−e_B + Q_BN α_N^k)ᵀ [α_i; α_j] + constant
   subject to 0 ≤ α_i, α_j ≤ C,     (2)
              y_i α_i + y_j α_j = −y_Nᵀ α_N^k,

where [Q_BB Q_BN; Q_NB Q_NN] is a permutation of the matrix Q. Set α_B^{k+1} to be the optimal point of (2).
4) Set α_N^{k+1} ≡ α_N^k. Set k ← k + 1 and go to Step 2).

Note that the set B changes from one iteration to another. To simplify the notation, we use B instead of B^k.
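To illustrate the structure of Algorithm 1, here is a compact, self-contained sketch (our own, not the authors' implementation). It uses the maximal violating pair discussed in Section II as the working set and the analytic sub-problem solution derived in Section III-A; the small constant `tau` safeguarding a non-positive quadratic coefficient anticipates Algorithm 2 there. All names are illustrative.

```python
import numpy as np

def smo(Q, y, C, eps=1e-3, max_iter=100000, tau=1e-12):
    l = len(y)
    alpha = np.zeros(l)                        # alpha^1 = 0 is feasible for (1)
    grad = -np.ones(l)                         # grad f(alpha) = Q @ alpha - e
    for _ in range(max_iter):
        viol = -y * grad                       # -y_t * grad f(alpha)_t for every t
        in_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        in_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        i = int(np.where(in_up, viol, -np.inf).argmax())
        j = int(np.where(in_low, viol, np.inf).argmin())
        if viol[i] - viol[j] <= eps:           # approximately stationary: stop
            break
        # two-variable sub-problem reduced to one variable d_j (see Section III-A)
        a = Q[i, i] + Q[j, j] - 2.0 * y[i] * y[j] * Q[i, j]
        a = a if a > 0 else tau                # safeguard for a <= 0 (Algorithm 2)
        L = max(alpha[i] - C if y[i] == 1 else -alpha[i],
                -alpha[j] if y[j] == 1 else alpha[j] - C)
        d_j = max(-(viol[i] - viol[j]) / a, L)
        old_i, old_j = alpha[i], alpha[j]
        alpha[i] -= y[i] * d_j
        alpha[j] += y[j] * d_j
        grad += Q[:, i] * (alpha[i] - old_i) + Q[:, j] * (alpha[j] - old_j)
    return alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2)); X[10:] += 3.0
    y = np.array([1.0] * 10 + [-1.0] * 10)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)      # linear kernel
    alpha = smo(Q, y, C=1.0)
    print("number of nonzero alpha_i:", int((alpha > 1e-8).sum()))
```

Note that the gradient is maintained incrementally using only the two updated columns of Q, which is precisely why the caching technique discussed in Section IV is effective.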
Algorithm 1 does not specify how to choose the working set B, as there are many possible ways. Regarding the sub-problem (2), if Q is positive semi-definite, then it is convex and can be easily solved. For indefinite Q, the situation is more complicated, and this issue is addressed in Section III.

If a proper working set is used at each iteration, the function value f(α^k) strictly decreases. However, this property does not imply that the sequence {α^k} converges to a stationary point of (1). Hence, proving the convergence of decomposition methods is usually a challenging task. Under certain rules for selecting the working set, the asymptotic convergence has been established (e.g. [16], [2], [9], [20]). These selection rules may allow an arbitrary working set size, so they reduce to SMO-type methods when the size is limited to two.

This article provides a comprehensive study on SMO-type decomposition methods. The selection rule of the two-element working set B is under a very general setting. Thus, all results here apply to any SMO-type implementation whose selection rule meets the criteria of this paper. In Section II, we discuss existing working set selections for SMO-type methods and propose a general scheme. Section III proves the asymptotic convergence and gives a comparison to former convergence studies. Shrinking and caching are two effective techniques to speed up decomposition methods. Earlier research [18] gives the theoretical foundation of these two techniques, but requires some assumptions. In Section IV, we provide a better and more general explanation without these assumptions. Convergence rates are another important issue; they indicate how fast the method approaches an optimal solution. We establish the linear convergence of the proposed method in Section V. All the above results hold not only for support vector classification, but also for regression and other variants of SVM. Such extensions are presented in Section VI. Finally, Section VII concludes the paper.

II. EXISTING AND NEW WORKING SET SELECTIONS FOR SMO-TYPE METHODS

In this section, we discuss existing working set selections and then propose a general scheme for SMO-type methods.

A. Existing Selections

Currently a popular way to select the working set B is via the "maximal violating pair:"

WSS 1 (Working set selection via the maximal violating pair)
1) Select

   i ∈ arg max_{t ∈ I_up(α^k)} −y_t ∇f(α^k)_t,
   j ∈ arg min_{t ∈ I_low(α^k)} −y_t ∇f(α^k)_t,

where

   I_up(α) ≡ {t | α_t < C, y_t = 1, or α_t > 0, y_t = −1}, and
   I_low(α) ≡ {t | α_t < C, y_t = −1, or α_t > 0, y_t = 1}.     (3)

2) Return B = {i, j}.

This working set was first proposed in [14] and has been used in many software packages, for example, LIBSVM [3].
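The sketch below (ours) isolates the selection step of WSS 1 from the loop sketched after Algorithm 1, assuming the gradient grad = Qα − e is available; it returns None when no violating pair remains.

```python
import numpy as np

def wss1_maximal_violating_pair(alpha, y, grad, C):
    viol = -y * grad                     # -y_t * grad f(alpha)_t for every t
    in_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))    # I_up(alpha) in (3)
    in_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))   # I_low(alpha) in (3)
    i = int(np.where(in_up, viol, -np.inf).argmax())
    j = int(np.where(in_low, viol, np.inf).argmin())
    if viol[i] <= viol[j]:               # no violating pair: alpha is stationary
        return None
    return i, j
```

The two masks implement exactly the sets in (3), and the returned indices attain the maxima and minima used in the optimality condition derived next.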
WSS 1 can be derived through the Karush-Kuhn-Tucker (KKT) optimality condition of (1): A vector α is a stationary point of (1) if and only if there is a number b and two nonnegative vectors λ and μ such that

   ∇f(α) + by = λ − μ,
   λ_i α_i = 0, μ_i (C − α_i) = 0, λ_i ≥ 0, μ_i ≥ 0, i = 1,...,l,

where ∇f(α) ≡ Qα − e is the gradient of f(α). This condition can be rewritten as

   ∇f(α)_i + b y_i ≥ 0 if α_i < C,     (4)
   ∇f(α)_i + b y_i ≤ 0 if α_i > 0.     (5)

Since y_i = ±1, by defining I_up(α) and I_low(α) as in (3) and rewriting (4)-(5) to obtain the range of b, a feasible α is a stationary point of (1) if and only if

   m(α) ≤ M(α),     (6)

where

   m(α) ≡ max_{i ∈ I_up(α)} −y_i ∇f(α)_i, and M(α) ≡ min_{i ∈ I_low(α)} −y_i ∇f(α)_i.

Note that m(α) and M(α) are well defined except in the rare situation where all y_i = 1 (or −1). In this case the zero vector is the only feasible solution of (1), so the decomposition method stops at the first iteration.

Following [14], we define a violating pair of the condition (6) as follows:

Definition 1 (Violating pair) If i ∈ I_up(α), j ∈ I_low(α), and −y_i ∇f(α)_i > −y_j ∇f(α)_j, then {i, j} is a violating pair.

From (6), the indices {i, j} which most violate the condition are a natural choice of the working set. They are called a "maximal violating pair" in WSS 1. Violating pairs play an important role in making decomposition methods work, as demonstrated by:

Theorem 1 ([9]) Assume Q is positive semi-definite. The decomposition method has the strict decrease of the objective function value (i.e. f(α^{k+1}) < f(α^k), ∀k) if and only if at each iteration B includes at least one violating pair.

For SMO-type methods, if Q is PSD, this theorem implies that B must be a violating pair. Unfortunately, having a violating pair in B, and hence the strict decrease of f(α^k), does not guarantee convergence to a stationary point. An interesting example from [13] is as follows: Given five data points x_1 = [1,0,0]ᵀ, x_2 = [0,1,0]ᵀ, x_3 = [0,0,1]ᵀ, x_4 = [0.1,0.1,0.1]ᵀ, and x_5 = [0,0,0]ᵀ. If y_1 = ⋯ = y_4 = 1, y_5 = −1, C = 100, and the linear kernel K(x_i, x_j) = x_iᵀx_j is used, then the optimal solution of (1) is [0,0,0,200/3,200/3]ᵀ and the optimal objective value is −200/3. Starting with α^0 = [0,0,0,0,0]ᵀ, if in the first iteration the working set is {1,5}, we have α^1 = [2,0,0,0,2]ᵀ with objective value −2. Then if we choose the following working sets at the next three iterations:

   {1,2} → {2,3} → {3,1},     (7)

the next three α are:

   [1,1,0,0,2]ᵀ → [1,0.5,0.5,0,2]ᵀ → [0.75,0.5,0.75,0,2]ᵀ.
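The cycling behaviour can be checked numerically. The short script below (ours, not from the paper) replays the example: after the first update on {1,5} it applies the cyclic working sets (7) over and over, using the analytic two-variable update derived later in Section III-A, and the iterates approach the non-optimal point discussed next.

```python
import numpy as np

X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0.1, 0.1, 0.1], [0, 0, 0.0]])
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0])
C = 100.0
Q = (y[:, None] * y[None, :]) * (X @ X.T)            # linear kernel

def objective(a):
    return 0.5 * a @ Q @ a - a.sum()

def update(alpha, grad, i, j):
    # analytic solution of the two-variable sub-problem; i in I_up, j in I_low
    a1 = Q[i, i] + Q[j, j] - 2.0 * y[i] * y[j] * Q[i, j]
    a2 = -y[i] * grad[i] + y[j] * grad[j]
    L = max(alpha[i] - C if y[i] == 1 else -alpha[i],
            -alpha[j] if y[j] == 1 else alpha[j] - C)
    d_j = max(-a2 / a1, L)
    old_i, old_j = alpha[i], alpha[j]
    alpha[i] -= y[i] * d_j
    alpha[j] += y[j] * d_j
    grad += Q[:, i] * (alpha[i] - old_i) + Q[:, j] * (alpha[j] - old_j)

alpha = np.zeros(5)
grad = Q @ alpha - np.ones(5)
update(alpha, grad, 0, 4)                            # first working set {1, 5}
for _ in range(200):                                 # then cycle (7) forever
    for pair in [(0, 1), (1, 2), (2, 0)]:
        viol = -y * grad
        i, j = pair if viol[pair[0]] > viol[pair[1]] else pair[::-1]
        update(alpha, grad, i, j)

print(np.round(alpha, 4), objective(alpha))          # ~[2/3, 2/3, 2/3, 0, 2], f ~ -10/3
```

Every pair used in the cycle stays a violating pair, yet the printed objective value stays far from the optimum −200/3.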
If we continue as in (7) to choose working sets, the algorithm requires an infinite number of iterations. In addition, the sequence converges to the non-optimal point [2/3,2/3,2/3,0,2]ᵀ with objective value −10/3. All working sets used are violating pairs.

A careful look at the above procedure reveals why it fails to converge to the optimal solution. In (7), we have the following for three consecutive iterations:

   −y_t ∇f(α^k)_t, t = 1,...,5        −y_i ∇f(α^k)_i + y_j ∇f(α^k)_j      m(α^k) − M(α^k)
   [0, 0, 1, 0.8, −1]ᵀ                 1                                    2
   [0, 0.5, 0.5, 0.8, −1]ᵀ             0.5                                  1.8
   [0.25, 0.5, 0.25, 0.8, −1]ᵀ         0.25                                 1.8

Clearly, the selected B = {i, j}, though a violating pair, does not lead to a reduction of the maximal violation m(α^k) − M(α^k). This discussion shows the importance of the maximal violating pair in the decomposition method. When such a pair is used as the working set (i.e. WSS 1), the convergence has been fully established in [16] and [17]. However, if B is not the maximal violating pair, it is unclear whether SMO-type methods still converge to a stationary point. Motivated by the above analysis, we conjecture that a sufficiently violated pair is enough and propose a general working set selection in the following subsection.

B. A General Working Set Selection for SMO-type Decomposition Methods

We propose choosing a "constant-factor" violating pair as the working set. That is, the violation of the selected pair must be at least a constant fraction of the violation of the maximal violating pair.

WSS 2 (Working set selection: constant-factor violating pair)
1) Consider a fixed 0 < σ ≤ 1 for all iterations.
2) Select any i ∈ I_up(α^k), j ∈ I_low(α^k) satisfying

   −y_i ∇f(α^k)_i + y_j ∇f(α^k)_j ≥ σ(m(α^k) − M(α^k)) > 0.     (8)

3) Return B = {i, j}.

Clearly (8) ensures the quality of the selected pair by linking it to the maximal violating pair.
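To make (8) concrete, the sketch below (ours) returns an arbitrary pair whose violation reaches the fraction σ of the maximal violation; the random choice is only to emphasize the freedom WSS 2 leaves to an implementation, and the brute-force enumeration is for clarity, not efficiency.

```python
import numpy as np

def wss2_constant_factor_pair(alpha, y, grad, C, sigma=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    viol = -y * grad
    in_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    in_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    m = np.where(in_up, viol, -np.inf).max()
    M = np.where(in_low, viol, np.inf).min()
    if m - M <= 0:
        return None                               # alpha already satisfies (6)
    # every pair whose violation is at least sigma * (m - M); O(l^2), clarity only
    pairs = [(i, j)
             for i in np.flatnonzero(in_up)
             for j in np.flatnonzero(in_low)
             if viol[i] - viol[j] >= sigma * (m - M)]
    i, j = pairs[rng.integers(len(pairs))]        # the maximal pair guarantees pairs != []
    return int(i), int(j)
```

Since the maximal violating pair always satisfies (8), the candidate list is never empty, and σ = 1 recovers WSS 1.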
We can consider an even more general relationship between the two pairs:

WSS 3 (Working set selection: a generalization of WSS 2)
1) Let h: R¹ → R¹ be any function satisfying
   a) h is strictly increasing on x ≥ 0, and
   b) h(x) ≤ x, ∀x ≥ 0, and h(0) = 0.
2) Select any i ∈ I_up(α^k), j ∈ I_low(α^k) satisfying

   −y_i ∇f(α^k)_i + y_j ∇f(α^k)_j ≥ h(m(α^k) − M(α^k)) > 0.     (9)

3) Return B = {i, j}.

The condition h(x) ≤ x ensures that there is at least one pair {i, j} satisfying (9). Clearly, h(x) = σx with 0 < σ ≤ 1 fulfills all required conditions, and WSS 3 then reduces to WSS 2. The function h can take a more complicated form. For example,

   h(m(α^k) − M(α^k)) ≡ min( m(α^k) − M(α^k), (m(α^k) − M(α^k))² )     (10)

also satisfies all requirements. Subsequently, we mainly analyze the SMO-type method using WSS 3 for the working set selection. The only exception is the linear convergence analysis, which considers WSS 2.

III. ASYMPTOTIC CONVERGENCE

The decomposition method generates a sequence {α^k}. If this sequence is finite, then a stationary point has been obtained. Hence we consider only the case of an infinite sequence. This section establishes the asymptotic convergence of the method using WSS 3 for the working set selection. First we discuss the sub-problem in Step 3) of Algorithm 1 with indefinite Q, since solving it relates to the convergence proof.

A. Solving the Sub-problem

Past convergence analyses usually require an important property, that the function value is sufficiently decreased: there is λ > 0 such that

   f(α^{k+1}) ≤ f(α^k) − λ‖α^{k+1} − α^k‖², for all k.     (11)

If Q is positive semi-definite and the working set {i,j} is a violating pair, [17] has proved (11). However, it is difficult to obtain the same result if Q is indefinite. Some such kernels are, for example, the sigmoid kernel K(x_i, x_j) = tanh(γ x_iᵀx_j + r) [19] and the edit-distance kernel K(x_i, x_j) = e^{−d(x_i, x_j)}, where d(x_i, x_j) is the edit distance of two strings x_i and x_j [5]. To obtain (11), [24] modified the sub-problem (2) to the following form:

   min_{α_B}  f([α_B; α_N^k]) + τ‖α_B − α_B^k‖²
   subject to y_Bᵀ α_B = −y_Nᵀ α_N^k,     (12)
              0 ≤ α_i ≤ C, i ∈ B,

where τ > 0 is a constant. Clearly, an optimal α_B^{k+1} of (12) leads to

   f(α^{k+1}) + τ‖α_B^{k+1} − α_B^k‖² ≤ f(α^k)
and then (11). However, this change also causes differences from most SVM implementations, which consider positive semi-definite Q and use the sub-problem (2). Moreover, (12) may still be non-convex and possess more than one local minimum. Then, for implementations, convex and non-convex cases may have to be handled separately. Therefore, we propose another way to solve the sub-problem, in which 1) the same sub-problem (2) is used when Q is positive definite (PD), and 2) the sub-problem is always convex. An SMO-type decomposition method with special handling of indefinite Q is given in the following algorithm.

Algorithm 2 (Algorithm 1 with specific handling of indefinite Q)
Steps 1), 2), and 4) are the same as those in Algorithm 1.
Step 3′) Let τ > 0 be a constant throughout all iterations, and define

   a ≡ Q_ii + Q_jj − 2 y_i y_j Q_ij.     (13)

a) If a > 0, then solve the sub-problem (2). Set α_B^{k+1} to be the optimal point of (2).
b) If a ≤ 0, then solve a modified sub-problem:

   min_{α_i, α_j}  ½ [α_i α_j] [Q_ii Q_ij; Q_ij Q_jj] [α_i; α_j] + (−e_B + Q_BN α_N^k)ᵀ [α_i; α_j]
                   + ((τ − a)/4) ((α_i − α_i^k)² + (α_j − α_j^k)²)
   subject to 0 ≤ α_i, α_j ≤ C,     (14)
              y_i α_i + y_j α_j = −y_Nᵀ α_N^k.

Set α_B^{k+1} to be the optimal point of (14).

The additional term in (14) has a coefficient (τ − a)/4 related to the matrix Q, but that in (12) does not. We will show later that our modification makes (14) a convex optimization problem.

Next we discuss how to easily solve the two sub-problems (2) and (14). Consider α_i ≡ α_i^k + y_i d_i and α_j ≡ α_j^k + y_j d_j. The sub-problem (2) is equivalent to

   min_{d_i, d_j}  ½ [d_i d_j] [Q_ii  y_i y_j Q_ij; y_i y_j Q_ij  Q_jj] [d_i; d_j] + (−e_B + (Qα^k)_B)ᵀ [y_i d_i; y_j d_j]
   subject to d_i + d_j = 0,     (15)
              0 ≤ α_i^k + y_i d_i, α_j^k + y_j d_j ≤ C.

Substituting d_i = −d_j into the objective function leads to a one-variable optimization problem:

   min_{d_j}  g(d_j) ≡ f(α) − f(α^k) = ½ (Q_ii + Q_jj − 2 y_i y_j Q_ij) d_j² + (−y_i ∇f(α^k)_i + y_j ∇f(α^k)_j) d_j     (16)
   subject to L ≤ d_j ≤ U,
where L and U are lower and upper bounds of d_j after including information on d_i: d_i = −d_j and 0 ≤ α_i^k + y_i d_i ≤ C. As a defined in (13) is the leading coefficient of the quadratic function (16), a > 0 if and only if the sub-problem (2) is strictly convex.

We then claim that

   0 ≤ α_i^k + y_i d_i, α_j^k + y_j d_j ≤ C implies L < 0 and U ≥ 0.     (17)

If y_i = y_j = 1,

   L = max(−α_j^k, α_i^k − C) ≤ d_j ≤ min(C − α_j^k, α_i^k) = U.

Clearly, U ≥ 0. Since i ∈ I_up(α^k) and j ∈ I_low(α^k), we have α_j^k > 0 and α_i^k < C. Thus, L < 0. For other values of y_i and y_j, the situations are similar.

When the sub-problem (2) is not strictly convex, we consider (14) by adding an additional term. With d_i = −d_j,

   ((τ − a)/4) ‖α_B − α_B^k‖² = ((τ − a)/2) d_j².     (18)

Thus, by defining

   a_1 ≡ a if a > 0, and a_1 ≡ τ otherwise,     (19)

and

   a_2 ≡ −y_i ∇f(α^k)_i + y_j ∇f(α^k)_j > 0,     (20)

problems (2) and (14) are essentially the following strictly convex optimization problem:

   min_{d_j}  ½ a_1 d_j² + a_2 d_j  subject to L ≤ d_j ≤ U.     (21)

Moreover, a_1 > 0, a_2 > 0, L < 0, and U ≥ 0. The quadratic objective function is shown in Figure 1, and the optimum is

   d̄_j = max(−a_2/a_1, L) < 0.     (22)

Fig. 1. Solving the convex sub-problem (21): (a) −a_2/a_1 ≥ L, so the minimum is at d_j = −a_2/a_1; (b) −a_2/a_1 < L, so the minimum is at d_j = L.

Therefore, once a_1 is defined in (19) according to whether a > 0 or not, the two sub-problems can be easily solved by the same formula (22). To check the decrease of the function value, the definition of the function g in (16), d̄_j < 0 and a_1 d̄_j + a_2 ≥ 0 from (22), and ‖α^{k+1} − α^k‖² = 2 d̄_j² imply

   f(α^{k+1}) − f(α^k) = g(d̄_j) = ½ a_1 d̄_j² + a_2 d̄_j
                       = (a_1 d̄_j + a_2) d̄_j − ½ a_1 d̄_j²
                       ≤ −½ a_1 d̄_j² = −(a_1/4) ‖α^{k+1} − α^k‖².

Therefore, by defining

   λ ≡ ¼ min( τ, min{ Q_ii + Q_jj − 2 y_i y_j Q_ij | Q_ii + Q_jj − 2 y_i y_j Q_ij > 0 } ),     (23)

we have the following lemma:

Lemma 1 Assume that at each iteration of Algorithm 2 the working set is a violating pair, and let {α^k} be the sequence generated. Then (11) holds with λ defined in (23).
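The case analysis (13)–(22) can be isolated as a small helper. The sketch below (ours) returns the new values of α_i and α_j; it mirrors the inline update in the earlier sketch of Algorithm 1, but makes the choice of a_1 in (19) explicit so that the indefinite case also yields the strictly convex problem (21).

```python
def solve_subproblem(i, j, alpha, y, grad, Q, C, tau=1e-3):
    a = Q[i][i] + Q[j][j] - 2.0 * y[i] * y[j] * Q[i][j]    # (13)
    a1 = a if a > 0 else tau                               # (19): tau replaces a <= 0
    a2 = -y[i] * grad[i] + y[j] * grad[j]                  # (20): > 0 for a violating pair
    # lower bound L of (17), from 0 <= alpha_i + y_i d_i, alpha_j + y_j d_j <= C and d_i = -d_j
    L = max(alpha[i] - C if y[i] == 1 else -alpha[i],
            -alpha[j] if y[j] == 1 else alpha[j] - C)
    d_j = max(-a2 / a1, L)                                 # (22)
    return alpha[i] - y[i] * d_j, alpha[j] + y[j] * d_j    # new alpha_i, alpha_j (d_i = -d_j)
```

For PSD Q with a = 0, Theorem 2 below shows that with τ ≤ 2/C this returns the same point as the unmodified sub-problem (2), so for PSD Q the helper never changes the method's behaviour.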
The above discussion is related to [19], which examines how to solve (2) when it is non-convex. Note that a = 0 may happen as long as Q is not PD (i.e., Q is PSD or indefinite). Then (16), becoming a linear function, is convex but not strictly convex. To have (11), an additional term makes it quadratic. Given that the function is already convex, we of course hope that there is no need to modify (2) to (14). The following theorem shows that if Q is PSD and τ ≤ 2/C, the two sub-problems indeed have the same optimal solution. Therefore, in a sense, for PSD Q the sub-problem (2) is always used.

Theorem 2 Assume Q is PSD, the working set of Algorithm 2 is a violating pair, and τ ≤ 2/C. If a = 0, then the optimal solution of (14) is the same as that of (2).

Proof: From (16) and a_2 > 0 in (20), if a = 0, then (2) has the optimum d̄_j = L. For (14) and the transformed problem (21), a = 0 implies a_1 = τ. Moreover, since Q is PSD,

   0 = a = Q_ii + Q_jj − 2 y_i y_j Q_ij = ‖φ(x_i) − φ(x_j)‖²,

and hence φ(x_i) = φ(x_j). Therefore, K_it = K_jt, ∀t, implies

   −y_i ∇f(α^k)_i + y_j ∇f(α^k)_j = −Σ_{t=1}^{l} y_t K_it α_t^k + y_i + Σ_{t=1}^{l} y_t K_jt α_t^k − y_j = y_i − y_j.
10 10 Since {i,j} is a violating pair, y i y j > 0 indicates that a = y i y j =. As τ C, () implies that dj = L is the optimal solution of (14). Thus, we have shown that () and (14) have the same optimal point. The derivation of a = was first developed in [17]. B. Proof of Asymptotic Convergence Using (11), we then have the following lemma: Lemma Assume the working set at each iteration of Algorithm is a violating pair. If a sub-sequence {α k },k K converges to ᾱ, then for any given positive integer s, the sequence {α k+s },k K converges to ᾱ as well. Proof: The lemma was first proved in [16, Lemma IV.3]. For completeness we give details here. For the sub-sequence {α k+1 },k K, from Lemma 1 and the fact that {f(α k )} is a bounded decreasing sequence, we have Thus = 0. lim α k+1 ᾱ k K,k lim k K,k lim k K,k ( α k+1 α k + α k ᾱ ) ( ) 1 λ (f(αk ) f(α k+1 )) + α k ᾱ lim k,k K αk+1 = ᾱ. From {α k+1 } we can prove lim k,k K αk+ = ᾱ too. Therefore, lim k,k K αk+s = ᾱ for any given s. Since the feasible region of (1) is compact, there is at least one convergent sub-sequence of {α k }. The following theorem shows that the limit point of any convergent sub-sequence is a stationary point of (1). Theorem 3 Let {α k } be the infinite sequence generated by the SMO-type method Algorithm using WSS 3. Then any limit point of {α k } is a stationary point of (1). Proof: Assume that ᾱ is the limit point of any convergent sub-sequence {α k },k K. If ᾱ is not a stationary point of (1), it does not satisfy the optimality condition (6). Thus, there is at least one maximal violating pair : ī arg m(ᾱ) j arg M(ᾱ) (4) such that y ī f(ᾱ) ī + y j f(ᾱ) j > 0. (5)
11 11 We further define min (, 1 { } min ) y t f(ᾱ) t + y s f(ᾱ) s y t f(ᾱ) t y s f(ᾱ) s > 0. (6) Lemma, the continuity of f(α), and h( ) > 0 from (6) imply that for any given r, there is k K such that for all k K,k k: For u = 0,...,r, y ī f(α k+u ) ī + y j f(α k+u ) j >. (7) If i I up (ᾱ), then i I up (α k ),...,i I up (α k+r ). (8) If i I low (ᾱ), then i I low (α k ),...,i I low (α k+r ). (9) If y i f(ᾱ) i > y j f(ᾱ) j, then for u = 0,...,r, y i f(α k+u ) i > y j f(α k+u ) j +. (30) If y i f(ᾱ) i = y j f(ᾱ) j, then for u = 0,...,r, y i f(α k+u ) i + y j f(α k+u ) j < h( ). (31) For u = 0,...,r 1, (τ â) α k+u+1 α k+u, (3) where â min{q ii + Q jj y i y j Q ij Q ii + Q jj y i y j Q ij < 0}. If y i f(ᾱ) i > y j f(ᾱ) j and {i,j} is the working set at the (k + u)th iteration, 0 u r 1, then i / I up (α k+u+1 ) or j / I low (α k+u+1 ). (33) The derivation of (7)-(3) is similar, so only the details of (7) are shown. Lemma implies sequences {α k }, {α k+1 },...,{α k+r },k K all converge to ᾱ. The continuity of f(α) and (6) imply that for any given 0 u r and the corresponding sequence {α k+u },k K, there is k u such that (7) holds for all k k u,k K. As r is finite, by selecting k to be the largest of these k u,u = 0,...,r, (7) is valid for all u = 0,...,r. Next we derive (33) [16, Lemma IV.4]. Similar to the optimality condition (6) for problem (1), for the sub-problem at the (k + u)th iteration, if problem (14) is considered and α B is a stationary point, then Now B = {i,j} and α k+u+1 B If max y t f t I up(α B) min y t f t I low (α B) ([ αb ]) α k+u N t ]) ([ αb α k+u N t y t(τ a) (α t αt k ) y t(τ a) (α t αt k ). is a stationary point of the sub-problem satisfying the above inequality. i I up (α k+u+1 ) and j I low (α k+u+1 ),
12 1 then from (3) and (18) y i f(α k+u+1 ) i y j f(α k+u+1 ) j + y i(τ a) (α k+u+1 i α k+u y j f(α k ) j + τ a α k+u+1 α k+u i ) y j(τ a) (α k+u+1 j α k+u j ) y j f(α k ) j +. (34) However, y i f(ᾱ) i > y j f(ᾱ) j implies (30) for α k+u+1, so there is a contradiction. If a > 0 and the sub-problem () is considered, (34) has no term, so immediately we have a contradiction to (30). For the convenience of writing the proof, we reorder indices of ᾱ so that We also define y 1 f(ᾱ) 1 y l f(ᾱ) l. (35) S 1 (k) {i i I up (α k )} and S (k) {l i i I low (α k )}. (36) Clearly, l S 1 (k) + S (k) l(l 1). (37) If {i,j} is selected at the (k + u)th iteration (u = 0,...,r), then we claim that y i f(ᾱ) i > y j f(ᾱ) j. (38) It is impossible that y i (ᾱ) i < y j (ᾱ) j as y i (α k+u ) i < y j (α k+u ) j from (30) then violates (9). If y i (ᾱ) i = y j (ᾱ) j, then y i (α k+u ) i + y j (α k+u ) j < h( ) < h( y ī f(α k+u ) ī + y j f(α k+u ) j) h(m(α k+u ) M(α k+u )). (39) With the property that h is strictly increasing, the first two inequalities come from (31) and (7), while the last is from ī I up (ᾱ), j I low (ᾱ), (8), and (9). Clearly, (39) contradicts (9), so (38) is valid. Next we use a counting procedure to obtain the contradiction of (4) and (5). From the kth to the (k + 1)st iteration, (38) and then (33) show that i / I up (α k+1 ) or j / I low (α k+1 ). For the first case, (8) implies i / I up (ᾱ) and hence i I low (ᾱ). From (9) and the selection rule (9), i I low (α k ) I up (α k ). Thus, i I low (α k ) I up (α k ) and i / I up (α k+1 ).
13 13 With j I low (α k ), S 1 (k + 1) S 1 (k) i + j S 1 (k) 1, S (k + 1) S (k), (40) where i + j 1 comes from (35). Similarly, for the second case, j I low (α k ) I up (α k ) and j / I low (α k+1 ). With i I up (α k ), S 1 (k + 1) S 1 (k), S (k + 1) S (k) (l j) + (l i) S (k) 1. (41) From iteration (k + 1) to (k + ), we can repeat the same argument. Note that (33) can still be used because (38) holds for working sets selected during iterations k to k + r. Using (40) and (41), at r l(l 1) iterations, S 1 (k) + S (k) is reduced to zero, a contradiction to (37). Therefore, the assumptions (4) and (5) are wrong and the proof is complete. Moreover, if (1) has a unique optimal solution, then the whole sequence {α k } globally converges. This happens, for example, when Q is PD. Corollary 1 If Q is positive definite, {α k } globally converges to the unique minimum of (1). Proof: Since Q is positive definite, (1) has a unique solution and we denote it as ᾱ. Assume {α k } does not globally converge to ᾱ. Then there is ǫ > 0 and an infinite subset K such that α k ᾱ ǫ, k K. Since {α k },k K are in a compact space, there is a further sub-sequence converging to α and α ᾱ ǫ. Since α is an optimal solution according to Theorem 3, this contradicts that ᾱ is the unique global minimum. For example, if the RBF kernel K(x i,x j ) = e γ xi xj is used and all x i x j, then Q is positive definite [] and we have the result of Corollary 1. We now discuss the difference between the proof above and earlier convergence work. The work [16] considers a working set selection which allows more than two elements. When the size of the working set is restricted to two, the selection is reduced to WSS 1, the maximal violating pair. This proof in [17] has used a counting procedure by considering two sets related to {ī, j}, the maximal violating pair at ᾱ: I 1 (k) {t t I up (α k ), y t f(ᾱ) t y j f(ᾱ) j}, and I (k) {t t I low (α k ), y t f(ᾱ) t y ī f(ᾱ) ī }. Clearly, if WSS 1 is used, ī I up (α k ) and j I low (α k ) imply that the selected {i,j} must satisfy i I 1 (k) and j I (k). Using (33) to show that I 1 (k) + I (k) decreases to zero, we obtain a contradiction to the fact that I 1 (k) + I (k) from ī I up (α k ) and j I low (α k ). However, now we do not have i I 1 (k) and j I (k) any more since our selection may not be the
14 14 maximal violating pair. Therefore, a new counting scheme is developed. By arranging y i f(ᾱ) i in an ascending (descending) order, in (36), we count the sum of their ranks. IV. STOPPING CONDITION, SHRINKING AND CACHING In this section we discuss other important properties of our method. In previous sections, Q is any symmetric matrix, but here we further require it to be positive semi-definite. To begin, the following theorem depicts some facts about the problem (1). Theorem 4 Assume Q is positive semi-definite. 1) If ᾱ ˆα are any two optimal solutions of (1), then y i f(ᾱ) i = y i f(ˆα) i,i = 1,...,l, (4) and M(ᾱ) = m(ᾱ) = M(ˆα) = m(ˆα). (43) ) If there is an optimal solution ᾱ satisfying m(ᾱ) < M(ᾱ), then ᾱ is the unique optimal solution of (1). 3) The following set is independent of any optimal solution ᾱ: I {i y i f(ᾱ) i > M(ᾱ) or y i f(ᾱ) i < m(ᾱ)}. (44) Moreover, problem (1) has unique and bounded optimal solutions at α i, i I. Proof: Since Q is positive semi-definite, (1) is a convex programming problem and ᾱ and ˆα are both global optima. Then f(ᾱ) = f(ˆα) = f(λᾱ + (1 λ)ˆα), for all 0 λ 1, implies (ᾱ ˆα) T Q(ᾱ ˆα) = 0. As Q is PSD, Q can be factorized to LL T. Thus, L T (ᾱ ˆα) = 0 and hence Qᾱ = Qˆα. Then (4) follows. To prove (43), we will show that m(ˆα) M(ᾱ) and m(ᾱ) M(ˆα). (45) With optimality conditions M(ᾱ) m(ᾱ) and M(ˆα) m(ˆα), (43) holds. Due to the symmetry, it is sufficient to show the first case of (45). If it is wrong, then m(ˆα) < M(ᾱ). (46)
15 15 We then investigate different situations by comparing y i f(ᾱ) i with M(ᾱ) and m(ˆα). If y i f(ᾱ) i M(ᾱ) > m(ˆα), then i / I up (ˆα) and 0 if y i = 1, ˆα i = C if y i = 1. (47) With 0 ᾱ i C, y iˆα i y i ᾱ i 0. (48) If y i f(ᾱ) i m(ˆα) < M(ᾱ), then i / I low (ᾱ) and C if y i = 1, ᾱ i = 0 if y i = 1. (49) Hence, (48) still holds. Other indices are in the following set S {i m(ˆα) < y i f(ˆα) i = y i f(ᾱ) i < M(ᾱ)}. If i S, then i / I up (ˆα) and i / I low (ᾱ). Hence (47) and (49) imply y iˆα i y i ᾱ i = C. (50) Thus, 0 = y T ˆα y T ᾱ = (y iˆα i y i ᾱ i ) + C S. i:i/ S Since (48) implies each term in the above summation is non-negative, S = 0 and ˆα i = ᾱ i, i / S. Therefore, ᾱ = ˆα. However, this contradicts the assumption that ᾱ and ˆα are different optimal solutions. Hence (46) is wrong and we have m(ˆα) M(ᾱ) in (45). The proof of (43) is thus complete. The second result of this theorem and the validity of the set I directly come from (43). Moreover, I is independent of any optimal solution. For any optimal α, if i I and y i f(α) i > M(α) m(α), then, i / I up (α) and α i is the same as (47). Thus, the optimal α i is unique and bounded. The situation for y i f(α) i < m(α) is similar. Lemma 3 in [1] shows a result similar to (43), but the proof is more complicated. It involves both primal and dual SVM formulations. Using Theorem 4, in the rest of this section we derive more properties of the proposed SMO-type methods.
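The quantities appearing in Theorem 4 are cheap to monitor during training. The sketch below (ours, with illustrative names) computes m(α^k) and M(α^k), the stopping test used in the next subsection, and the indices whose −y_t∇f(α^k)_t already falls outside [M(α^k), m(α^k)]; by the results of this section those indices are the natural candidates for the shrinking technique discussed later, and near an optimum they play the role of the set I in (44).

```python
import numpy as np

def m_and_M(alpha, y, grad, C):
    viol = -y * grad
    in_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    in_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    return np.where(in_up, viol, -np.inf).max(), np.where(in_low, viol, np.inf).min()

def reached_tolerance(alpha, y, grad, C, eps=1e-3):
    # the stopping condition m(alpha) - M(alpha) <= eps of the next subsection
    m, M = m_and_M(alpha, y, grad, C)
    return m - M <= eps

def outside_m_M_interval(alpha, y, grad, C):
    # indices t with -y_t grad_t > m(alpha) or < M(alpha): shrinking candidates
    m, M = m_and_M(alpha, y, grad, C)
    viol = -y * grad
    return np.flatnonzero((viol > m) | (viol < M))
```

All three functions only reuse the gradient that the decomposition method already maintains, so monitoring them adds essentially no cost per iteration.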
16 16 A. Stopping Condition and Finite Termination As the decomposition method only asymptotically approaches an optimum, in practice, it terminates after satisfying a stopping condition. For example, we can pre-specify a small stopping tolerance ǫ > 0 and check if m(α k ) M(α k ) ǫ (51) is satisfied. Though one can consider other stopping conditions, (51) is commonly used due to its closeness to the optimality condition (6). To justify its validity as a stopping condition, we must make sure that under any given stopping tolerance ǫ > 0, the proposed decomposition method stops in a finite number of iterations. Thus, an infinite loop never happens. To have (51), one can prove a stronger condition: lim k m(αk ) M(α k ) = 0. (5) This condition is not readily available as from the respective definitions of m(α) and M(α), it is unclear whether they are continuous functions of α. Proving (5) will be the main result of this subsection. The first study on the stopping condition of SMO-type methods is [11]. They consider a selection rule which involves the stopping tolerance, so the working set is a so-called ǫ violating pair. Since our selection rule is independent of ǫ, their analysis cannot be applied here. Another work [18] proves (5) for WSS 1 under the assumption that the sequence {α k } globally converges. Here, for more general selection WSS 3, we prove (5) with positive semi-definite Q. Theorem 5 Assume Q is positive semi-definite and the SMO-type decomposition method Algorithm using WSS 3 generates an infinite sequence {α k }. Then lim k m(αk ) M(α k ) = 0. (53) Proof: We prove the theorem by contradiction and assume that the result (53) is wrong. Then, there is an infinite set K and a value > 0 such that m(α k ) M(α k ), k K. (54) Since in the decomposition method m(α k ) > M(α k ), k, (54) can be rewritten as In this K there is a further sub-sequence K so that m(α k ) M(α k ), k K. (55) lim α k = ᾱ. k K,k As Q is positive semi-definite, from Theorem 4, f(α k ) globally converges: lim k f(αk ) i = f(ᾱ) i,i = 1,...,l. (56)
17 17 We then follow a similar counting strategy in Theorem 3. First we rewrite (55) as m(α k ) M(α k ) +, k K, (57) where min (, 1 { } min ) y t f(ᾱ) t + y s f(ᾱ) s y t f(ᾱ) t y s f(ᾱ) s. (58) We still require (7)-(33), but use (56) and (58) to extend (30) and (31) for all k k (i.e., not only k K): If y t f(ᾱ) t > y s f(ᾱ) s, then y t f(α k ) t > y s f(α k ) s. (59) If y t f(ᾱ) t y s f(ᾱ) s, then y t f(α k ) t + y s f(α k ) s >. (60) If y t f(ᾱ) t = y s f(ᾱ) s, then y t f(α k ) t + y s f(α k ) s < h( ). (61) Then the whole proof follows Theorem 3 except (39), in which we need m(α k+u ) M(α k+u ), u = 0,...,r. This condition does not follow from (57), which holds for only a sub-sequence. Thus, in the following we further prove m(α k ) M(α k ), k k. (6) Assume at one k k, 0 < m(α k ) M(α k ) < and {i,j} is the working set of this iteration. As i I up (α k ) and j I low (α k ) from the selection rule, we have M(α k ) y j f(α k ) j < y i f(α k ) i m(α k ). (63) Then according to (60), {i,j} and indices achieving m(α k ) and M(α k ) have the same value of y t f(ᾱ) t. They are all from the following set: {t y t f(ᾱ) t = y i f(ᾱ) i = y j f(ᾱ) j }. (64) For elements not in this set, (59), (60), and (63) imply that If y t f(ᾱ) t > y i f(ᾱ) i, then y t f(α k ) > y i f(α k ) i + > m(α k ) and hence t / I up (α k ). (65) Similarly, If y t f(ᾱ) t < y i f(ᾱ) i, then t / I low (α k ). (66) As we have explained, the working set is from the set (64), other components remain the same from iteration k to k + 1. Therefore, indices satisfying (65) and (66) have t / I up (α k +1 ) and t / I low (α k +1 ), respectively. Furthermore, indices in (65) have larger y t f(α k +1 ) t than others according to (59). Thus, their y t f(α k +1 ) t are greater than m(α k +1 ). Similarly, elements in (66) are smaller than M(α k +1 ). With the fact that m(α k +1 ) > M(α k +1 ), indices which achieve
18 18 m(α k +1 ) and M(α k +1 ) are again from the set (64). This situation holds for all k k. Using (61) and the condition on h, we have m(α k ) M(α k ) < h( ), k k, a contradiction to (57). Thus, the required condition (6) holds. B. Shrinking and Caching Shrinking and caching are two effective techniques to make the decomposition method faster. If an αi k remains at 0 or C for many iterations, eventually it may stay at the same value. Based on this principle, the shrinking technique [10] reduces the size of the optimization problem without considering some bounded variables. The decomposition method then works on a smaller problem which is less time consuming and requires less memory. In the end we put back the removed components and check if an optimal solution of the original problem is obtained. Another technique to reduce the training time is caching. Since Q may be too large to be stored, its elements are calculated as they are needed. We can allocate some space (called cache) to store recently use Q ij [10]. If during final iterations only few columns of Q are still needed and the cache contains them, we can save many kernel evaluations. [18, Theorem II.3] has proved that during the final iterations of using WSS 1, only a small subset of variables are still updated. Such a result supports the use of shrinking and caching techniques. However, this proof considers only any convergent subsequence of {α k } or assumes the global convergence. In this subsection, we provide a more general theory without these two assumptions. Theorem 6 Assume Q is positive semi-definite and the SMO-type method Algorithm uses WSS 3. Let I be the set defined in (44). 1) There is k such that after k k, every αi k,i I has reached the unique and bounded optimal solution. It remains the same during all subsequent iterations and i I is not in the following set: {t M(α k ) y t f(α k ) t m(α k )}. (67) ) If (1) has an optimal solution ᾱ satisfying m(ᾱ) < M(ᾱ), then ᾱ is the unique solution and the decomposition method reaches it at a finite number of iterations. 3) If {α k } is an infinite sequence, then the following two limits exist and are equal: lim k m(αk ) = lim k M(αk ) = m(ᾱ) = M(ᾱ), (68) where ᾱ is any optimal solution. Thus, (68) is independent of any optimal solution. Proof:
19 19 1) If the result is wrong, there is an index ī I and an infinite set ˆK such that α k ī ˆαī, k ˆK, (69) where ˆα ī is the unique optimal component according to Theorem 4. From Theorem 3, there is a further subset K of ˆK such that lim α k = ᾱ (70) k K,k is a stationary point. Moreover, Theorem 4 implies that ᾱ i, i I are unique optimal components. Thus, ᾱ ī = ˆα ī. As ī I, we first consider one of the two possible situations: Thus, ī / I up (ᾱ). (69) then implies y ī f(ᾱ) ī > M(ᾱ). (71) ī I up (α k ), k K. (7) For each j arg M(ᾱ), we have j I low (ᾱ). From (70), there is k such that Thus, (7) and (73) imply j I low (α k ), k K,k k. (73) m(α k ) M(α k ) y ī f(α k ) ī + y j f(α k ) j, k K,k k. (74) With (70), the continuity of f(α), and (53), taking the limit on both sides of (74) obtains 0 y ī f(ᾱ) ī + y j f(ᾱ) j = y ī f(ᾱ) ī M(ᾱ). This inequality violates (71), so there is a contradiction. For the other situation y ī f(ᾱ) ī < m(ᾱ), the proof is the same. The proof that i I is not in the set (67) is similar. If the result is wrong, there is an index ī I such that k K, ī is in the set (67). Then (74) holds and causes a contradiction. ) If the result does not hold, then {α k } is an infinite sequence. From Theorems 3 and 4, ᾱ is the unique optimal solution and {α k } globally converges to it. Define I 1 {i y i f(ᾱ) i = M(ᾱ)}, I {i y i f(ᾱ) i = m(ᾱ)}. Using the first result of this theorem, after k is sufficiently large, arg m(α k ) and arg M(α k ) must be subsets of I 1 I. Moreover, using (53), the continuity of f(α), and the property lim k α k = ᾱ, there is k such that for all k k, arg m(α k ) and arg M(α k ) are both subsets of I 1 (or I ). (75)
20 0 If at the kth iteration, arg m(α k ) and arg M(α k ) are both subsets of I 1, then following the same argument in (63)-(64), we have As the decomposition method maintains feasibility, the working set B I 1. (76) i B y i α k i = i B y i α k+1 i. (77) From (76) and the assumption that m(ᾱ) < M(ᾱ), every ᾱ i, i B satisfies i / I up (α). Thus, ᾱ i, i B is the same as (47). This and (77) then imply = i/ B = i/ B α k+1 ᾱ 1 α k+1 i ᾱ i + α k i ᾱ i + i B,y i=1 i B,y i=1 (C α k+1 i ) + (C α k i ) + i B,y i= 1 i B,y i= 1 (α k+1 (α k i 0) i 0) = α k ᾱ 1. (78) If arg m(α k ) and arg M(α k ) are both subsets of I, (78) still holds. Therefore, 0 α k ᾱ 1 = α k+1 ᾱ 1 =, a contradiction to the fact that {α k } converges to ᾱ. Thus, the decomposition method must stop in a finite number of iterations. 3) Since {α k } is an infinite sequence, using the result of ), problem (1) has no optimal solution ᾱ satisfying M(ᾱ) > m(ᾱ). From Theorem 4, we have M(ᾱ) = m(ᾱ) = y t f(ᾱ) t, t I, (79) and this is independent of different optimal solutions ᾱ. From the result of 1), there is k such that for all k k, i I is not in the set (67). Thus, is a superset of (67) and hence I {1,...,l} \ I (80) min y i f(α k ) i M(α k ) < m(α k ) max y i f(α k ) i. (81) i I i I Though {α k } may not be globally convergent, { y i f(α k ) i }, i = 1,...,l, are according to (4). The limit of both sides of (81) are equal using (79), so (68) follows. Theorem 6 implies that in many iterations, the SMO-type method involves only indices in I. Thus, caching is very effective. This theorem also illustrates two possible shrinking implementations: 1) Elements not in the set (67) are removed.
21 1 ) Any α i which has stayed at the same bound for a certain number of iterations is removed. The software LIBSVM [3] considers the former approach, while SV M light [10] uses the latter. V. CONVERGENCE RATE Though we proved the asymptotic convergence, it is important to investigate how fast the method converges. Under some assumptions, [15] was the first to prove the linear convergence of certain decomposition methods. The work [15] allows the working set to have more than two elements and WSS 1 is a special case. Here we show that when the SMO-type working set selection is extended from WSS 1 to WSS, the same analysis holds. Since [15] was not published, we include here all details of the proof. Note that WSS 3 uses a function h to control the quality of the selected pair. We will see in the proof that it may affect the convergence rate. Proving the linear convergence requires the condition (8), so results established in this section are for WSS but not WSS 3. First we make a few assumptions. Assumption 1 Q is positive definite. Then (1) is a strictly convex programming problem and hence has a unique global optimum ᾱ. By Theorem 6, after large enough iterations working sets are all from the set I defined in (80). From the optimality condition (6), the scalar b satisfies b = m(ᾱ) = M(ᾱ), and the set I corresponds to elements satisfying f(ᾱ) i + by i = 0. (8) From (4)-(5), another form of the optimality condition, if ᾱ i is not at a bound, (8) holds. We further assume that this is the only case that (8) happens. Assumption (Nondegeneracy) For the optimal solution ᾱ, f(ᾱ) i + by i = 0 if and only if 0 < ᾱ i < C. This assumption, commonly used in optimization analysis, implies that indices of all bounded ᾱ i are exactly the set I. Therefore, after enough iterations, Theorem 6 and Assumption imply that all bounded variables are fixed and are not included in the working set. By treating these variables as constants, essentially we solve a problem with the following form: min α f(α) = 1 αt Qα + p T α subject to y T α =, (83) where p is a vector by combining e and other terms related to the bounded components. Moreover, 0 < αi k < C for all i even though we do not explicitly write down inequality constraints in (83). Then
22 the optimal solution ᾱ with the corresponding b can be obtained by the following linear system: Q y ᾱ = p. (84) y T 0 b In each iteration, we consider minimizing f(α k B +d), where d is the direction from αk B the sub-problem () is written as min d 1 dt Q BB d + f(α k ) T Bd to αk+1 B, so subject to y T Bd = 0, (85) where f(α k ) = Qα k + p. If an optimal solution of (85) is d k, then α k+1 B α k+1 N Using (84), = α k B + dk and = αk N. With the corresponding bk, this sub-problem is solved by the following equation: y B Q BB yb T 0 dk b k = f(αk ) B. (86) 0 Q(α k ᾱ) = Qα k + p + by = f(α k ) + by. (87) By defining Y diag(y) to be a diagonal matrix with elements of y on the diagonal, and using y i = ±1, Y Q(α k ᾱ) = Y f(α k ) be. (88) The purpose of checking Q(α k ᾱ) is to see how close the current solution is to the optimal one. Then (88) links it to Y f(α k ), a vector used for the working set selection. Remember that for finding violating pairs, we first sort y i f(α k ) i in an ascending order. The following two theorems are the main results on the linear convergence. Theorem 7 Assume the SMO-type decomposition method uses WSS for the working set selection. If problem (1) satisfies Assumptions 1 and, then there are c < 1 and k such that for all k k (α k+1 ᾱ) T Q(α k+1 ᾱ) c(α k ᾱ) T Q(α k ᾱ). (89) Proof: First, Theorem 6 implies that there is k such that after k k, the problem is reduced to
23 3 (83). We then directly calculate the difference between the (k + 1)st and the kth iterations: (α k+1 ᾱ) T Q(α k+1 ᾱ) (α k ᾱ) T Q(α k ᾱ) (90) = (d k ) T (Q(α k ᾱ)) B + (d k ) T Q BB d k = (d k ) T ((Q(α k ᾱ)) B f(α k ) B b k y B ) (91) = (d k ) T ((Q(α k ᾱ)) B + ( b b k )y B ) (9) = (d k ) T ((Q(α k ᾱ)) B + (b k b)y B ) (93) = [ (Q(α k ᾱ)) B + ( b b k )y B ] T Q 1 BB [ (Q(αk ᾱ)) B + ( b b k )y B ], where (91) is from (86), (9) is from (87), (93) is by using the fact y T B dk = 0 from (86), and the last equality is from (86) and (87). If we define ˆQ Y B Q 1 BB Y B and v Y (Q(α k ᾱ)), (94) where Y B diag(y B ), then v B = Y B (Q(α k ᾱ)) B and (90) becomes From (88), we define Thus, the selection rule (8) of WSS implies [v B + ( b b k )e B ] T ˆQ[vB + ( b b k )e B ]. (95) v 1 max(v t ) = m(α k ) b, t where {i,j} is the working set of the kth iteration. v l min t (v t ) = M(α k ) b. (96) v i v j σ(v 1 v l ), (97) We denote that min(eig( )) and max(eig( )) to be the minimal and maximal eigenvalues of a matrix,
24 4 respectively. A further calculation of (95) shows [v B + ( b b k )e B ] T ˆQ[vB + ( b b k )e B ] min(eig( ˆQ))[v B + ( b b k )e B ] T [v B + ( b b k )e B ] min(eig( ˆQ))max t B (v t + ( b b k )) min(eig( ˆQ))( v i v j ), where {i,j} is working set (98) min(eig( ˆQ))σ ( v1 v l ) (99) min(eig( ˆQ))( y T Q 1 y σ max( v 1, v l ) i,j Q 1 ij ) (100) min(eig( ˆQ)) y T Q 1 y ( l t,s Q 1 ts ) σ (Q(α k ᾱ)) T Q(α k ᾱ) (101) min(eig( ˆQ)) l max(eig(q 1 )) ( y T Q 1 y t,s Q 1 ts ) σ (Q(α k ᾱ)) T Q 1 Q(α k ᾱ) min(eig( ˆQ)) l max(eig(q 1 )) ( y T Q 1 y t,s Q 1 ts ) σ (α k ᾱ) T Q(α k ᾱ), (10) where (98) is from Lemma 3, (99) is from (97), (100) is from Lemma 4, and (101) follows from (96). Note that both lemmas are given in Appendix A. Next we give more details about the derivation of (100): If v 1 v l < 0, then of course v 1 v l max( v 1, v l ). y With y i = ±1, T Q 1 y 1 so (100) follows. In contrast, if v 1 v l 0, we consider v = Pt,s Q 1 ts (Y QY )( Y (α k ᾱ)) from (94). Since e T Y (α k ᾱ) = y T (α k ᾱ) = 0, we can apply Lemma 4: With we have which implies (100). (Y QY ) 1 ij = Q 1 ij y iy j = Q 1 ij and e T (Y QY ) 1 e = y T Q 1 y, v 1 v l max( v 1, v l ) min( v 1, v l ) ( yt Q 1 y t,s Q 1 ts )max( v1, v l ), Finally we can define a constant c as follows: ( min(eig(q 1 BB c 1 min )) B l max(eig(q 1 )) ( y T Q 1 ) y t,s Q 1 ts ) σ where B is any two-element subset of {1,...,l}. Combining (95) and (10), after k k, (89) holds. < 1,
25 5 The condition (8) of Algorithm is used in (97) and then (99). If WSS 3 is considered, in (99) we will have a term h((v 1 v l )/). Thus, the function h affects the convergence rate. Since h(x) x, linear rate is the best result using our derivation. The linear convergence of the objective function is as follows: Theorem 8 Under the same assumptions of Theorem 7, there are c < 1 and k such that for all k k, f(α k+1 ) f(ᾱ) c(f(α k ) f(ᾱ)). Proof: We show that for any k k, f(α k ) f(ᾱ) = 1 (αk ᾱ) T Q(α k ᾱ), and the proof immediately follows from Theorem 7. Using (84), f(α k ) f(ᾱ) = 1 (αk ) T Qα k + p T α k 1 (ᾱ)t Qᾱ p T ᾱ = 1 (αk ) T Qα k + ( Qᾱ by) T α k 1 (ᾱ)t Qᾱ ( Qᾱ by) T ᾱ = 1 (αk ) T Qα k (ᾱ) T Qα k + 1 (ᾱ)t Qᾱ (103) = 1 (αk ᾱ) T Q(α k ᾱ). Since we always keep the feasibility of α k, (103) comes from y T α k = y T ᾱ. VI. EXTENSIONS In this section we show that the same SMO-type methods can be applied to some variants of SVM. A. Support Vector Regression and One-class SVM First we extend (1) to the following general form: min α f(α) = 1 αt Qα + p T α subject to L i α i U i,i = 1,...,l, (104) y T α =, where < L i < U i <,i = 1,...,l, are lower and upper bounds, and Q is an l by l symmetric matrix. Clearly, if L i = 0,U i = C, and p = e, then (104) reduces to (1). The optimality condition is the same as (3) though in the definition of I up (α) and I low (α), 0 and C must be replaced by L i and U i, respectively. Therefore, SMO-type decomposition methods using WSS or 3 can be applied to solve (104). A careful check shows that all results in Sections III-V hold for (104).
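For the general form (104), only the index sets change: the bounds 0 and C in (3) are replaced by the individual bounds of each variable. A one-line sketch (ours) of the generalized I_up and I_low:

```python
import numpy as np

def general_index_sets(alpha, y, L, U):
    # I_up and I_low of (3) with 0 and C replaced by the bounds L_i and U_i of (104)
    in_up = ((alpha < U) & (y == 1)) | ((alpha > L) & (y == -1))
    in_low = ((alpha < U) & (y == -1)) | ((alpha > L) & (y == 1))
    return in_up, in_low
```

For the support vector regression problem rewritten below as (106), for instance, L_i = 0 and U_i = C for all 2l variables, and y consists of l entries equal to 1 followed by l entries equal to −1.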
26 6 Problem (104) covers some SVM formulations such as support vector regression (SVR) [8] and one-class SVM [6]. Next we discuss SVR in detail. Given a set of data points {(x 1,z 1 ),...,(x l,z l )} such that x i R n is an input vector and z i R 1 is a target output, support vector regression solves the following optimization problem: min α,α f(α,α ) = 1 (α α ) T K(α α ) + ǫ subject to l (α i αi ) = 0, i=1 l (α i + αi ) + i=1 l z i (α i αi ) 0 α i,α i C,i = 1,...,l, (105) where K ij = φ(x i ) T φ(x j ), and ǫ > 0 is the width of the ǫ-insensitive tube. We can rewrite (105) as min α,α f(α,α ) = 1 subject to y T ] [α T,α T K α = 0, α K K α + K α i=1 ] [ǫe T + z T,ǫe T z T α α 0 α i,α i C,i = 1,...,l, (106) where y is a l by 1 vector with y i = 1,i = 1,...,l and y i = 1,i = l + 1,...,l. Thus (106) is in the form of (104) and an SMO-type method with WSS 3 can be applied. Moreover, the procedure asymptotically converges and possesses all properties in Section IV. An interesting issue is about Corollary 1, which requires the Hessian matrix [ ] K K K K to be positive definite. This condition never holds as [ ] K K K K is only positive semi-definite. Note that in Corollary 1, the positive definite Hessian is used to have a unique optimal solution. For SVR, [4, Lemma 4] proves that if ǫ > 0 and K is positive definite, then (106) has a unique solution. Thus, for SVR, Corollary 1 can be modified to require only that K is positive definite. For the linear convergence result in Section V, Assumption 1 does not hold as now [ K K K K ] is not positive definite. However, we will show that similar to Corollary 1, a positive definite K is sufficient. Note that in Section V, the Hessian matrix of (83) is in fact Q I I as α i, i I can be removed after large enough iterations, where I and I are defined as in (44) and (80) except that the set {1,...,l} is replaced by {1,...,l}. Then in the linear convergence proof we need Q I I to be invertible and this condition holds if Q is positive definite (i.e., Assumption 1). For SVR, this means [ ] K K K K should be invertible. To prove it, we first claim that for any 1 i l, i and i + l are not both in the set I. (107) I I
27 7 As ǫ > 0 implies y i f(α,α ) i = (K(α α )) i + ǫ + z i (K(α α)) i ǫ + z i = y i+l f(α,α ) i+l, (108) the definition of I directly leads to (107). We further define Ī an index vector by replacing any t,l t l in I with t l. From (107), I and Ī have the same number of elements and furthermore, K K = Y I K Ī Ī Y I, (109) K I I K where Y I diag(y I ) is a diagonal matrix. Clearly, (109) indicates that if K is positive definite, so is K and [ ] K K Ī Ī K K. Therefore, by replacing Assumption 1 on the Hessian matrix with that on I I the kernel matrix, the linear convergence result holds for SVR. B. Extension to ν-svm ν-svm [7] is another SVM formulation which has a parameter ν instead of C. Its dual form is min α f(α) = 1 αt Qα subject to y T α = 0, e T α = ν, 0 α i 1/l,i = 1,...,l, (110) where e is the vector of all ones. Note that some use e T α ν as a constraint, but [4] and [7] have shown that decision functions are the same. Moreover, [4] proved that (110) is equivalent to problem (1) with certain C. It also discusses the decomposition method for training ν-svm. Via the KKT condition, a vector ᾱ is a stationary point of (110) if and only if there are two scalars ρ and b such that f(α) i ρ + by i 0 if α i < 1/l, f(α) i ρ + by i 0 if α i > 0. By separating the case of y i = 1 and y i = 1, we obtain two conditions on ρ + b and ρ b, respectively. Thus, similar to (6), there is the following optimality condition: m p (α) M p (α) and m n (α) M n (α), (111)
28 8 where m p (α) m n (α) max y i f(α) i, M p (α) min y i f(α) i, and i I up(α),y i=1 i I low (α),y i=1 max y i f(α) i, M n (α) min y i f(α) i. i I up(α),y i= 1 i I low (α),y i= 1 One can also define a violating pair as the following: Definition (Violating pair of (111)) If i I up (α),j I low (α), y i = y j, and y i f(α) i > y j f(α) j, then {i,j} is a violating pair. Clearly, the condition y i = y j is the main difference from Definition 1. In fact the selected pair B = {i,j} must satisfy y i = y j. If y i y j, then the two linear equalities result in the sub-problem having only one feasible point α k B. For the same y i and y j, the two equations in the sub-problem are identical, so we could move α k B to a better point. In addition, the sub-problem is then in the same form of (), so the procedure in Section III-A directly works. All working set selections discussed in Section II can be extended here. We modify WSS 3 as an example: WSS 4 (Extension of WSS 3 for ν-svm) 1) Select any i I up (α k ), j I low (α k ), y i = y j satisfying y i f(α k ) i + y j f(α k ) j h(d(α k )) > 0, (11) where D(α k ) max(m p (α k ) M p (α k ),m n (α k ) M n (α k )). (113) ) Return B = {i,j}. Results in Sections III-IV hold with minor modifications. They are listed in the following without detailed proofs. For easier description, we let α p (α n ) be the sub-vector of α corresponding to positive (negative) samples. Theorem 9 Let {α k } be the infinite sequence generated by the SMO-type decomposition method using WSS 4 for the working set selection. Then any limit point of {α k } is a stationary point of (110). Theorem 10 Assume Q is positive semi-definite. 1) If ᾱ ˆα are any two optimal solutions of (110), then y i f(ᾱ) i = y i f(ˆα) i,i = 1,...,l. (114)
29 9 If ᾱ p ˆα p (ᾱ n ˆα n ), then M p (ᾱ) = m p (ᾱ) = M p (ˆα) = m p (ˆα) (115) (M n (ᾱ) = m n (ᾱ) = M n (ˆα) = m n (ˆα)). (116) ) If there is an optimal solution ᾱ satisfying m p (ᾱ) < M p (ᾱ), then ᾱ p is unique for (110). Similarly, if m n (ᾱ) < M n (ᾱ), then ᾱ n is unique. 3) The following sets are independent of any optimal solution ᾱ: I p {i y i = 1, f(ᾱ) i > M p (ᾱ) or f(ᾱ) i < m p (ᾱ)}, (117) I n {i y i = 1, f(ᾱ) i > M n (ᾱ) or f(ᾱ) i < m n (ᾱ)}. (118) Moreover, problem (110) has unique and bounded optimal solutions at α i, i I p I n. Theorem 11 Assume Q is positive semi-definite. Let {α k } be the infinite sequence generated by the SMO-type decomposition method using WSS 4. If {(α k ) p } ({(α k ) n }) is updated in infinitely many iterations, then lim k m p (α k ) M p (α k ) = 0 (119) (lim k m n (α k ) M n (α k ) = 0). (10) Theorem 1 Assume Q is positive semi-definite and the SMO-type method Algorithm uses WSS 4. Let I p and I n be the sets defined in (117) and (118). Define K p and K n to be iterations in which the working set is from positive and negative samples, respectively. 1) There is k such that after k k, every αi k, i I p I n has reached the unique and bounded optimal solution. For any i I p, there is k such that after k k, k K p, i is not in the following set {t y t = 1,M p (α k ) y t f(α k ) t m p (α k )}. (11) The same result holds for the negative part. ) If (110) has an optimal solution ᾱ satisfying m p (ᾱ) < M p (ᾱ) (m n (ᾱ) < M n (ᾱ)), then ᾱ p (ᾱ n ) is unique for (110) and the decomposition method reaches it in a finite number of iterations. 3) If {(α k ) p } is updated in infinitely many iterations, then the following two limits exist and are equal: lim k m p (α k ) = lim k M p (α k ) = m p (ᾱ) = M p (ᾱ) (1) where ᾱ is any optimal solution. The same result holds for {(α k ) n }.
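For the ν-SVM dual (110) the working set must come from a single class, so one natural instance of WSS 4 searches the maximal violating pair within each class and keeps the class whose violation attains D(α^k) in (113); that pair trivially satisfies (112). A sketch (ours; the per-variable upper bound 1/l is passed in as C_i):

```python
import numpy as np

def wss4_maximal_pair(alpha, y, grad, C_i):
    viol = -y * grad
    best = None
    for cls in (1, -1):                       # positive and negative samples separately
        mask = (y == cls)
        in_up = mask & (((alpha < C_i) & (y == 1)) | ((alpha > 0) & (y == -1)))
        in_low = mask & (((alpha < C_i) & (y == -1)) | ((alpha > 0) & (y == 1)))
        if not in_up.any() or not in_low.any():
            continue
        i = int(np.where(in_up, viol, -np.inf).argmax())
        j = int(np.where(in_low, viol, np.inf).argmin())
        gap = viol[i] - viol[j]               # m_p - M_p or m_n - M_n
        if best is None or gap > best[0]:
            best = (gap, i, j)
    if best is None or best[0] <= 0:
        return None                           # both conditions of (111) hold
    return best[1], best[2]
```

Because the two selected indices share the same label, the equality constraints of the resulting sub-problem coincide, and the procedure of Section III-A applies unchanged, as noted above.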
VII. DISCUSSION AND CONCLUSIONS

This paper provides a comprehensive study on SMO-type decomposition methods. Below we discuss some issues for future investigation.

A. Faster Training via a Better Selection Rule

Under the general framework discussed in this paper, we can design various selection rules for practical implementations. Among them, the one using the maximal violating pair has been widely applied in SVM software. Developing better rules from the proposed framework is important; otherwise, this article has only theoretical value. A working set selection which leads to faster training should: 1) reduce the number of iterations of the decomposition method, in other words, make the convergence faster, and 2) keep the cost of identifying the working set B similar to that of finding the maximal violating pair. The challenge is that these two goals are often at odds. For example, fewer iterations may not reduce the training time if the cost of working set selections is higher. We proposed some rules with the above properties in [8]. That work shows that the maximal violating pair uses only the first-order information of the objective function, and derives a better selection rule using second-order information. The new rule is a special case of WSS 2. The training time is generally shorter than that of using the maximal violating pair.

B. Necessary Conditions for the Convergence

Previous studies and the discussion in Section III provide sufficient conditions for the convergence of decomposition methods. That is, under given working set selections, we prove the convergence. Investigating the necessary conditions is also interesting. When the decomposition method converges, which conditions does its working set selection satisfy? We may think that WSS 3 is general enough so that every convergent SMO-type method satisfies the condition (9). However, this may not be right. We suspect that even if some iterations select {i,j} without enough violation (i.e. −y_i ∇f(α^k)_i + y_j ∇f(α^k)_j < h(m(α^k) − M(α^k))), the method may still converge if other iterations have used appropriate working sets. Therefore, finding useful necessary conditions may be a challenging task.

It is worth mentioning another working set selection proposed in [21]. This work requires the following condition:

   There is N > 0 such that for all k, any violating pair of α^k is selected at least once in iterations k to k + N.     (123)
Clearly, a cyclic selection of {i,j} in every l(l−1)/2 iterations satisfies (123): {1,2}, {1,3},...,{1,l}, {2,3},...,{l−1,l}, where l is the number of data instances. With (123), the convergence proof in Theorem 3 becomes very simple. For the limit point ᾱ assumed not to be stationary, its maximal violating pair {ī, j̄} is also a violating pair at iteration k, where k ∈ K, k ≥ k̄. According to (123), this pair {ī, j̄} of ᾱ must be selected at an iteration k′ with k ≤ k′ ≤ k + N. Then at the (k′+1)st iteration, (28) and (29) imply ī ∈ I_up(α^{k′+1}) and j̄ ∈ I_low(α^{k′+1}), but (33) indicates that ī ∉ I_up(α^{k′+1}) or j̄ ∉ I_low(α^{k′+1}). Thus, there is immediately a contradiction. In a sense, (123) is a rule designed for the convergence proof.

The two conditions (9) and (123) on the working set selection are quite different, so neither is a necessary condition. From the counter-example in Section II we observed that the selected {i, j} has a much smaller violation than m(α^k) − M(α^k), and hence proposed WSS 2 and 3. That example also has a violating pair {4,5} that is never selected, a situation opposite to (123). Thus both WSS 3 and the condition (123) attempt to remedy problems exposed by this counter-example, but they take very different directions. One focuses on issues related to the maximal violating pair, while the other requires that all current violating pairs be selected within a finite number of subsequent iterations. In general, we think the former leads to faster convergence, as it more aggressively reduces the violation. However, this also complicates the convergence proof, as the counting procedure in Theorem 3 must be involved in order to obtain the contradiction.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Council of Taiwan via the grant NSC E

REFERENCES

[1] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992.
[2] C.-C. Chang, C.-W. Hsu, and C.-J. Lin. The analysis of decomposition methods for support vector machines. IEEE Transactions on Neural Networks, 11(4):1003-1008, 2000.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[4] C.-C. Chang and C.-J. Lin. Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9):2119-2147, 2001.
[5] C. Cortes, P. Haffner, and M. Mohri. Positive definite rational kernels. In Proceedings of the 16th Annual Conference on Learning Theory, pages 41-56, 2003.
[6] C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273-297, 1995.
[7] D. J. Crisp and C. J. C. Burges. A geometric interpretation of ν-SVM classifiers. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, Cambridge, MA, 2000. MIT Press.
[8] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research, 6:1889-1918, 2005.
[9] D. Hush and C. Scovel. Polynomial-time decomposition algorithms for support vector machines. Machine Learning, 51:51-71, 2003.
[10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[11] S. S. Keerthi and E. G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46:351-360, 2002.
[12] S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667-1689, 2003.
[13] S. S. Keerthi and C. J. Ong. On the role of the threshold parameter in SVM training algorithms. Technical Report CD-00-09, Department of Mechanical and Production Engineering, National University of Singapore, Singapore, 2000.
[14] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13:637-649, 2001.
[15] C.-J. Lin. Linear convergence of a decomposition method for support vector machines. Technical report, Department of Computer Science, National Taiwan University, 2001.
[16] C.-J. Lin. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12(6):1288-1298, 2001.
[17] C.-J. Lin. Asymptotic convergence of an SMO algorithm without any assumptions. IEEE Transactions on Neural Networks, 13(1):248-250, 2002.
[18] C.-J. Lin. A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE Transactions on Neural Networks, 13(5):1045-1052, 2002.
[19] H.-T. Lin and C.-J. Lin. A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science, National Taiwan University, 2003.
[20] N. List and H. U. Simon. A general convergence theorem for the decomposition method. In Proceedings of the 17th Annual Conference on Learning Theory, 2004.
[21] S. Lucidi, L. Palagi, M. Sciandrone, and A. Risi. A convergent decomposition algorithm for support vector machines. Computational Optimization and Applications, 2006. To appear.
[22] C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.
[23] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of CVPR'97, New York, NY, 1997. IEEE.
[24] L. Palagi and M. Sciandrone. On the convergence of a modified version of the SVM light algorithm. Optimization Methods and Software, 20(2-3):315-332, 2005.
[25] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[26] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001.
[27] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207-1245, 2000.
[28] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, 1995.
APPENDIX

A. Proof of Two Lemmas Used in Section V

Lemma 3: If v_1 ≥ v_i ≥ v_l for i = 1,...,l, then max_i |v_i| ≥ (v_1 - v_l)/2.

Proof: Note that max_i |v_i| must occur at v_1 or v_l. It is then easy to see that

v_1 - v_l ≤ |v_1| + |v_l| ≤ 2 max(|v_1|, |v_l|).

Lemma 4: If Q is invertible, then for any x such that 1) e^T x = 0 and 2) v ≡ Qx with max_i((Qx)_i) = v_1 > v_l = min_i((Qx)_i) and v_1 v_l ≥ 0, we have

min(|v_1|, |v_l|) ≤ (1 - e^T Q^{-1} e / Σ_{i,j} |Q^{-1}_{ij}|) max(|v_1|, |v_l|).

Proof: Since v_1 > v_l and v_1 v_l ≥ 0, we have either v_1 > v_l ≥ 0 or 0 ≥ v_1 > v_l. For the first case, if the result were wrong, then

v_l > (1 - e^T Q^{-1} e / Σ_{i,j} |Q^{-1}_{ij}|) v_1,

so for j = 1,...,l,

v_1 - v_j ≤ v_1 - v_l < (e^T Q^{-1} e / Σ_{i,j} |Q^{-1}_{ij}|) v_1. (14)

With x = Q^{-1} v and (14),

e^T x = e^T Q^{-1} v = Σ_{i,j} Q^{-1}_{ij} v_j = Σ_{i,j} Q^{-1}_{ij} (v_1 - (v_1 - v_j)) ≥ v_1 e^T Q^{-1} e - (v_1 - v_l) Σ_{i,j} |Q^{-1}_{ij}| > v_1 e^T Q^{-1} e - (e^T Q^{-1} e / Σ_{i,j} |Q^{-1}_{ij}|) v_1 Σ_{i,j} |Q^{-1}_{ij}| = 0,

which contradicts e^T x = 0. The case 0 ≥ v_1 > v_l is similar.
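As a small numerical illustration (ours, not part of the paper, and relying on the lemma statements as reconstructed above), the following sketch checks the elementary bound of Lemma 3 on random vectors and evaluates the constant of Lemma 4 for a Gaussian kernel matrix. Because e^T Q^{-1} e = Σ_{i,j} Q^{-1}_{ij} ≤ Σ_{i,j} |Q^{-1}_{ij}|, this constant is at most 1, and it is non-negative when Q is positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)

# Lemma 3: for any vector v, max_i |v_i| >= (max_i v_i - min_i v_i) / 2,
# since the largest magnitude is attained at the largest or smallest entry.
for _ in range(1000):
    v = rng.normal(size=8)
    assert np.abs(v).max() >= (v.max() - v.min()) / 2 - 1e-12

# The factor appearing in Lemma 4 for an invertible kernel matrix Q:
# 1 - e^T Q^{-1} e / sum_{i,j} |Q^{-1}_{ij}|, which lies in [0, 1) when Q
# is positive definite (here: a Gaussian kernel matrix on random points).
X = rng.normal(size=(10, 3))
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Q = np.exp(-sq_dist)
Q_inv = np.linalg.inv(Q)
e = np.ones(len(Q))
factor = 1.0 - (e @ Q_inv @ e) / np.abs(Q_inv).sum()
print(f"Lemma 4 factor for this Q: {factor:.4f}")
```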