# A Study on SMO-type Decomposition Methods for Support Vector Machines


## Transcription

Pai-Hsuen Chen, Rong-En Fan, and Chih-Jen Lin
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan

Abstract

Decomposition methods are currently one of the major methods for training support vector machines. They vary mainly according to different working set selections. Existing implementations and analysis usually consider some specific selection rules. This article studies Sequential Minimal Optimization (SMO)-type decomposition methods under a general and flexible way of choosing the two-element working set. Main results include: 1) a simple asymptotic convergence proof, 2) a general explanation of the shrinking and caching techniques, and 3) the linear convergence of the methods. Extensions to some SVM variants are also discussed.

### I. INTRODUCTION

Support vector machines (SVMs) [1], [6] are useful classification tools. Given training vectors $x_i \in R^n$, $i = 1,\dots,l$, in two classes, and a vector $y \in R^l$ such that $y_i \in \{1,-1\}$, SVMs require the solution of the following optimization problem:

$$
\begin{aligned}
\min_{\alpha}\quad & f(\alpha) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \\
\text{subject to}\quad & 0 \le \alpha_i \le C,\ i = 1,\dots,l, \\
& y^T\alpha = 0,
\end{aligned}
\tag{1}
$$

where $e$ is the vector of all ones, $C < \infty$ is the upper bound of all variables, and $Q$ is an $l$ by $l$ symmetric matrix. Training vectors $x_i$ are mapped into a higher (maybe infinite) dimensional space by the function $\phi$, $Q_{ij} \equiv y_i y_j K(x_i,x_j) = y_i y_j \phi(x_i)^T\phi(x_j)$, and $K(x_i,x_j)$ is the kernel function. Then $Q$ is a positive semi-definite (PSD) matrix. Occasionally some researchers use kernel functions not in the form of $\phi(x_i)^T\phi(x_j)$, so $Q$ may be indefinite. Here we also consider such cases and, hence, require only $Q$ to be symmetric.

Because the matrix $Q$ is dense, it usually cannot be held in memory, so traditional optimization methods cannot be directly applied to solve (1). Currently the decomposition method is one of the major methods to train SVM (e.g. [3], [10], [3], [5]). 
This method, an iterative procedure, considers only a small subset of variables per iteration. This subset, denoted by $B$, is called the working set. Since each iteration involves only $|B|$ columns of the matrix $Q$, the memory problem is solved.
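As a minimal sketch of why only $|B|$ columns of $Q$ are needed per iteration, the following evaluates the objective and gradient of (1) and then refreshes the gradient after changing the two variables in a working set. The toy data, the linear kernel, and the helper names (`dual_objective`, `dual_gradient`, `update_gradient`) are assumptions made for illustration, not part of the paper.

```python
import numpy as np

def dual_objective(alpha, Q):
    """f(alpha) = 1/2 alpha^T Q alpha - e^T alpha from problem (1)."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def dual_gradient(alpha, Q):
    """Gradient of f, i.e. Q alpha - e."""
    return Q @ alpha - 1.0

def update_gradient(grad, Q_cols_B, delta_B):
    """Refresh the gradient after only the variables in B changed.

    Q_cols_B holds the |B| columns of Q indexed by B; no other part of Q is read.
    """
    return grad + Q_cols_B @ delta_B

# Hypothetical toy data with a linear kernel (illustration only).
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j K(x_i, x_j)

alpha_old = np.array([0.5, 0.0, 0.5])       # feasible: y^T alpha = 0
B = [0, 2]                                  # a two-element working set
alpha_new = alpha_old.copy()
alpha_new[B] = [0.25, 0.25]                 # still satisfies y^T alpha = 0

grad_old = dual_gradient(alpha_old, Q)
grad_new = update_gradient(grad_old, Q[:, B], alpha_new[B] - alpha_old[B])
```

Here `grad_new` agrees with a full recomputation `dual_gradient(alpha_new, Q)`, although only two columns of `Q` were touched.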

A special decomposition method is the Sequential Minimal Optimization (SMO) [5], which restricts $B$ to only two elements. Then at each iteration one solves a simple two-variable problem without needing optimization software. It is sketched in the following:

Algorithm 1 (SMO-type decomposition methods)

1) Find $\alpha^1$ as the initial feasible solution. Set $k = 1$.

2) If $\alpha^k$ is a stationary point of (1), stop. Otherwise, find a two-element working set $B = \{i,j\} \subset \{1,\dots,l\}$. Define $N \equiv \{1,\dots,l\}\setminus B$, and $\alpha_B^k$ and $\alpha_N^k$ as sub-vectors of $\alpha^k$ corresponding to $B$ and $N$, respectively.

3) Solve the following sub-problem with the variable $\alpha_B$:

$$
\begin{aligned}
\min_{\alpha_B}\quad & \frac{1}{2}\begin{bmatrix}\alpha_B^T & (\alpha_N^k)^T\end{bmatrix}
\begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN}\end{bmatrix}
\begin{bmatrix}\alpha_B \\ \alpha_N^k\end{bmatrix}
- \begin{bmatrix} e_B^T & e_N^T\end{bmatrix}\begin{bmatrix}\alpha_B \\ \alpha_N^k\end{bmatrix} \\
=\ & \frac{1}{2}\begin{bmatrix}\alpha_i & \alpha_j\end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj}\end{bmatrix}
\begin{bmatrix}\alpha_i \\ \alpha_j\end{bmatrix}
+ (-e_B + Q_{BN}\alpha_N^k)^T\begin{bmatrix}\alpha_i \\ \alpha_j\end{bmatrix} + \text{constant} \\
\text{subject to}\quad & 0 \le \alpha_i, \alpha_j \le C, \\
& y_i\alpha_i + y_j\alpha_j = -y_N^T\alpha_N^k,
\end{aligned}
\tag{2}
$$

where $\begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN}\end{bmatrix}$ is a permutation of the matrix $Q$. Set $\alpha_B^{k+1}$ to be the optimal point of (2).

4) Set $\alpha_N^{k+1} \equiv \alpha_N^k$. Set $k \leftarrow k + 1$ and go to Step 2.

Note that the set $B$ changes from one iteration to another. To simplify the notation, we use $B$ instead of $B^k$. Algorithm 1 does not specify how to choose the working set $B$ as there are many possible ways. Regarding the sub-problem (2), if $Q$ is positive semi-definite, then it is convex and can be easily solved. For indefinite $Q$, the situation is more complicated, and this issue is addressed in Section III.

If a proper working set is used at each iteration, the function value $f(\alpha^k)$ strictly decreases. However, this property does not imply that the sequence $\{\alpha^k\}$ converges to a stationary point of (1). Hence, proving the convergence of decomposition methods is usually a challenging task. Under certain rules for selecting the working set, the asymptotic convergence has been established (e.g. [16], [], [9], [0]). These selection rules may allow an arbitrary working set size, so they reduce to SMO-type methods when the size is limited to two. 
This article provides a comprehensive study of SMO-type decomposition methods. The two-element working set $B$ is selected under a very general rule, so all results here apply to any SMO-type implementation whose selection rule meets the criteria of this paper. In Section II, we discuss existing working set selections for SMO-type methods and propose a general scheme. Section III proves the asymptotic convergence and compares the proof with earlier convergence studies.

Shrinking and caching are two effective techniques to speed up the decomposition methods. Earlier research [18] gives the theoretical foundation of these two techniques, but requires some assumptions. In Section IV, we provide a better and more general explanation without these assumptions. Convergence rates are another important issue; they indicate how fast the method approaches an optimal solution. We establish the linear convergence of the proposed method in Section V. All the above results hold not only for support vector classification, but also for regression and other variants of SVM. Such extensions are presented in Section VI. Finally, Section VII is the conclusion.

### II. EXISTING AND NEW WORKING SET SELECTIONS FOR SMO-TYPE METHODS

In this section, we discuss existing working set selections and then propose a general scheme for SMO-type methods.

#### A. Existing Selections

Currently a popular way to select the working set $B$ is via the maximal violating pair:

WSS 1 (Working set selection via the maximal violating pair)

1) Select

$$
i \in \arg\max_{t \in I_{up}(\alpha^k)} -y_t\nabla f(\alpha^k)_t, \qquad
j \in \arg\min_{t \in I_{low}(\alpha^k)} -y_t\nabla f(\alpha^k)_t,
$$

where

$$
\begin{aligned}
I_{up}(\alpha) &\equiv \{t \mid \alpha_t < C, y_t = 1 \text{ or } \alpha_t > 0, y_t = -1\}, \text{ and} \\
I_{low}(\alpha) &\equiv \{t \mid \alpha_t < C, y_t = -1 \text{ or } \alpha_t > 0, y_t = 1\}.
\end{aligned}
\tag{3}
$$

2) Return $B = \{i,j\}$.

This working set was first proposed in [14] and has been used in many software packages, for example, LIBSVM [3].

WSS 1 can be derived through the Karush-Kuhn-Tucker (KKT) optimality condition of (1): A vector $\alpha$ is a stationary point of (1) if and only if there is a number $b$ and two nonnegative vectors $\lambda$ and $\mu$ such that

$$
\nabla f(\alpha) + by = \lambda - \mu, \qquad
\lambda_i\alpha_i = 0,\ \mu_i(C - \alpha)_i = 0,\ \lambda_i \ge 0,\ \mu_i \ge 0,\ i = 1,\dots,l,
$$

where $\nabla f(\alpha) \equiv Q\alpha - e$ is the gradient of $f(\alpha)$. This condition can be rewritten as

$$
\nabla f(\alpha)_i + by_i \ge 0 \quad \text{if } \alpha_i < C, \tag{4}
$$

$$
\nabla f(\alpha)_i + by_i \le 0 \quad \text{if } \alpha_i > 0. \tag{5}
$$
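WSS 1 can be sketched in a few lines. This is an illustrative sketch only: the function names (`index_sets`, `wss1`), the 0-based indexing, and the numeric gradient values are assumptions, not taken from the paper or from any particular software package.

```python
import numpy as np

def index_sets(alpha, y, C):
    """Boolean masks for I_up(alpha) and I_low(alpha) as defined in (3)."""
    up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    return up, low

def wss1(alpha, y, grad, C):
    """Maximal violating pair: i maximizes -y_t grad_t over I_up,
    j minimizes it over I_low. Also returns the violation v[i] - v[j]."""
    up, low = index_sets(alpha, y, C)
    v = -y * grad
    i = int(np.argmax(np.where(up, v, -np.inf)))   # ignore indices outside I_up
    j = int(np.argmin(np.where(low, v, np.inf)))   # ignore indices outside I_low
    return i, j, v[i] - v[j]

# Hypothetical values chosen only to exercise the selection rule.
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.0, 0.5, 0.5, 0.0])
grad = np.array([-1.0, 0.2, -0.4, -1.0])
i, j, violation = wss1(alpha, y, grad, C=1.0)
```

A positive `violation` means the selected pair violates the optimality condition, so the current point is not yet stationary.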

Since $y_i = \pm 1$, by defining $I_{up}(\alpha)$ and $I_{low}(\alpha)$ as in (3), and rewriting (4)-(5) to have the range of $b$, a feasible $\alpha$ is a stationary point of (1) if and only if

$$
m(\alpha) \le M(\alpha), \tag{6}
$$

where

$$
m(\alpha) \equiv \max_{i \in I_{up}(\alpha)} -y_i\nabla f(\alpha)_i, \quad \text{and} \quad
M(\alpha) \equiv \min_{i \in I_{low}(\alpha)} -y_i\nabla f(\alpha)_i.
$$

Note that $m(\alpha)$ and $M(\alpha)$ are well defined except in a rare situation where all $y_i = 1$ (or $-1$). In this case the zero vector is the only feasible solution of (1), so the decomposition method stops at the first iteration.

Following [14], we define a violating pair of the condition (6) as:

Definition 1 (Violating pair) If $i \in I_{up}(\alpha)$, $j \in I_{low}(\alpha)$, and $-y_i\nabla f(\alpha)_i > -y_j\nabla f(\alpha)_j$, then $\{i,j\}$ is a violating pair.

From (6), the indices $\{i,j\}$ which most violate the condition are a natural choice of the working set. They are called a maximal violating pair in WSS 1.

Violating pairs play an important role in making the decomposition methods work, as demonstrated by:

Theorem 1 ([9]) Assume $Q$ is positive semi-definite. The decomposition method has the strict decrease of the objective function value (i.e. $f(\alpha^{k+1}) < f(\alpha^k)$, $\forall k$) if and only if at each iteration $B$ includes at least one violating pair.

For SMO-type methods, if $Q$ is PSD, this theorem implies that $B$ must be a violating pair.

Unfortunately, having a violating pair in $B$, and hence the strict decrease of $f(\alpha^k)$, does not guarantee convergence to a stationary point. An interesting example from [13] is as follows: Given five data points $x_1 = [1,0,0]^T$, $x_2 = [0,1,0]^T$, $x_3 = [0,0,1]^T$, $x_4 = [0.1,0.1,0.1]^T$, and $x_5 = [0,0,0]^T$. If $y_1 = \cdots = y_4 = -1$, $y_5 = 1$, $C = 100$, and the linear kernel $K(x_i,x_j) = x_i^T x_j$ is used, then the optimal solution of (1) is $[0,0,0,200/3,200/3]^T$ and the optimal objective value is $-200/3$. Starting with $\alpha^0 = [0,0,0,0,0]^T$, if in the first iteration the working set is $\{1,5\}$, we have $\alpha^1 = [2,0,0,0,2]^T$ with the objective value $-2$. 
Then if we choose the following working sets at the next three iterations:

$$
\{1,2\} \rightarrow \{2,3\} \rightarrow \{3,1\}, \tag{7}
$$

the next three $\alpha$ are:

$$
[1,1,0,0,2]^T \rightarrow [1,0.5,0.5,0,2]^T \rightarrow [0.75,0.5,0.75,0,2]^T.
$$
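The iterates of this example can be reproduced numerically. The sketch below is only an illustration under stated assumptions: indices are 0-based, the helper names (`objective`, `smo_step`, `run_cycle`) are invented here, and the two-variable update is the standard closed-form step along the direction that keeps $y^T\alpha = 0$, clipped to the box $[0, C]$.

```python
import numpy as np

# Data of the example: five points, linear kernel, C = 100.
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [0.1, 0.1, 0.1], [0, 0, 0]], dtype=float)
y = np.array([-1.0, -1.0, -1.0, -1.0, 1.0])
C = 100.0
K = X @ X.T
Q = (y[:, None] * y[None, :]) * K

def objective(alpha):
    """f(alpha) = 1/2 alpha^T Q alpha - e^T alpha."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def smo_step(alpha, p, q):
    """Optimize over the pair {p, q} only (0-based indices)."""
    v = -y * (Q @ alpha - 1.0)
    i, j = (p, q) if v[p] > v[q] else (q, p)     # i takes the I_up role
    a = K[i, i] + K[j, j] - 2.0 * K[i, j]        # curvature, > 0 for the pairs used here
    a2 = v[i] - v[j]                             # violation of the pair
    # Lower bounds on d_j from 0 <= alpha_j + y_j d_j <= C and 0 <= alpha_i - y_i d_j <= C.
    lo_j = -alpha[j] if y[j] > 0 else alpha[j] - C
    lo_i = alpha[i] - C if y[i] > 0 else -alpha[i]
    d_j = max(-a2 / a, lo_j, lo_i)
    new = alpha.copy()
    new[i] -= y[i] * d_j                         # alpha_i + y_i d_i with d_i = -d_j
    new[j] += y[j] * d_j
    return new

def run_cycle(n_rounds):
    """First use working set {1,5}, then repeat {1,2} -> {2,3} -> {3,1}."""
    alpha = smo_step(np.zeros(5), 0, 4)
    iterates = [alpha]
    for _ in range(n_rounds):
        for p, q in [(0, 1), (1, 2), (2, 0)]:
            alpha = smo_step(alpha, p, q)
            iterates.append(alpha)
    return iterates

iterates = run_cycle(20)
```

Under these assumptions, `iterates[0]` through `iterates[3]` match the four vectors listed above, and `objective(iterates[-1])` stays far from the optimal value $-200/3$ however long the cycle is continued.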

If we continue as in (7) for choosing the working set, the algorithm requires an infinite number of iterations. In addition, the sequence converges to a non-optimal point $[2/3,2/3,2/3,0,2]^T$ with the objective value $-10/3$. All working sets used are violating pairs.

A careful look at the above procedure reveals why it fails to converge to the optimal solution. In (7), we have the following for the three consecutive iterations:

| $-y_t\nabla f(\alpha^k)_t,\ t = 1,\dots,5$ | $-y_i\nabla f(\alpha^k)_i + y_j\nabla f(\alpha^k)_j$ | $m(\alpha^k) - M(\alpha^k)$ |
| --- | --- | --- |
| $[0,0,-1,-0.8,1]^T$ | $1$ | $2$ |
| $[0,-0.5,-0.5,-0.8,1]^T$ | $0.5$ | $1.8$ |
| $[-0.25,-0.5,-0.25,-0.8,1]^T$ | $0.25$ | $1.8$ |

Clearly, the selected $B = \{i,j\}$, though a violating pair, does not lead to the reduction of the maximal violation $m(\alpha^k) - M(\alpha^k)$. This discussion shows the importance of the maximal violating pair in the decomposition method. When such a pair is used as the working set (i.e. WSS 1), the convergence has been fully established in [16] and [17]. However, if $B$ is not the maximal violating pair, it is unclear whether SMO-type methods still converge to a stationary point. Motivated by the above analysis, we conjecture that a sufficiently violated pair is enough and propose a general working set selection in the following subsection.

#### B. A General Working Set Selection for SMO-type Decomposition Methods

We propose choosing a constant-factor violating pair as the working set. That is, the violation of the selected pair must be at least a constant fraction of the violation of the maximal violating pair.

WSS 2 (Working set selection: constant-factor violating pair)

1) Consider a fixed $0 < \sigma \le 1$ for all iterations.

2) Select any $i \in I_{up}(\alpha^k)$, $j \in I_{low}(\alpha^k)$ satisfying

$$
-y_i\nabla f(\alpha^k)_i + y_j\nabla f(\alpha^k)_j \ge \sigma\left(m(\alpha^k) - M(\alpha^k)\right) > 0. \tag{8}
$$

3) Return $B = \{i,j\}$.

Clearly (8) ensures the quality of the selected pair by linking it to the maximal violating pair. 
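The selection rule (8) leaves freedom in which pair to return. A minimal sketch, assuming the values $-y_t\nabla f(\alpha^k)_t$ and the index-set masks are already available, enumerates every admissible pair; the function name `constant_factor_pairs`, the 0-based indices, and the sample numbers are illustrative assumptions.

```python
import numpy as np

def constant_factor_pairs(v, up, low, sigma):
    """All (i, j) with i in I_up, j in I_low and v[i] - v[j] >= sigma * (m - M) > 0,
    where v[t] = -y_t * grad_t, m = max of v over I_up, M = min of v over I_low."""
    m = v[up].max()
    M = v[low].min()
    if m - M <= 0:                 # optimality condition (6) holds: nothing to select
        return []
    threshold = sigma * (m - M)
    return [(int(i), int(j))
            for i in np.flatnonzero(up)
            for j in np.flatnonzero(low)
            if v[i] - v[j] >= threshold]

# Illustrative values (0-based indices).
v = np.array([0.0, 0.0, -1.0, -0.8, 1.0])
up = np.array([True, True, False, False, True])
low = np.array([True, True, True, True, True])

pairs_half = constant_factor_pairs(v, up, low, sigma=0.5)
pairs_max = constant_factor_pairs(v, up, low, sigma=1.0)
```

With `sigma=1.0` only maximal violating pairs remain, so the rule coincides with WSS 1; smaller values of `sigma` admit more pairs, and any of them may be returned.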
We can consider an even more general relationship between the two pairs:

WSS 3 (Working set selection: a generalization of WSS 2)

1) Let $h : R^1 \rightarrow R^1$ be any function satisfying

a) $h$ strictly increases on $x \ge 0$,

b) $h(x) \le x,\ \forall x \ge 0$, $h(0) = 0$.

2) Select any $i \in I_{up}(\alpha^k)$, $j \in I_{low}(\alpha^k)$ satisfying

$$
-y_i\nabla f(\alpha^k)_i + y_j\nabla f(\alpha^k)_j \ge h\left(m(\alpha^k) - M(\alpha^k)\right) > 0. \tag{9}
$$

3) Return $B = \{i,j\}$.

The condition $h(x) \le x$ ensures that there is at least one pair $\{i,j\}$ satisfying (9). Clearly, $h(x) = \sigma x$ with $0 < \sigma \le 1$ fulfills all required conditions and WSS 3 reduces to WSS 2. The function $h$ can be in a more complicated form. For example, if

$$
h\left(m(\alpha^k) - M(\alpha^k)\right) \equiv \min\left(m(\alpha^k) - M(\alpha^k),\ \sqrt{m(\alpha^k) - M(\alpha^k)}\right), \tag{10}
$$

then (10) also satisfies all requirements.

Subsequently, we mainly analyze the SMO-type method using WSS 3 for the working set selection. The only exception is the linear convergence analysis, which considers WSS 2.

### III. ASYMPTOTIC CONVERGENCE

The decomposition method generates a sequence $\{\alpha^k\}$. If it is finite, then a stationary point is obtained. Hence we consider only the case of an infinite sequence. This section establishes the asymptotic convergence of using WSS 3 for the working set selection. First we discuss the sub-problem in Step 3) of Algorithm 1 with indefinite $Q$. Solving it relates to the convergence proof.

#### A. Solving the sub-problem

Past convergence analysis usually requires an important property, namely that the function value is sufficiently decreased: there is $\lambda > 0$ such that

$$
f(\alpha^{k+1}) \le f(\alpha^k) - \lambda\|\alpha^{k+1} - \alpha^k\|^2, \quad \text{for all } k. \tag{11}
$$

If $Q$ is positive semi-definite and the working set $\{i,j\}$ is a violating pair, [17] has proved (11). However, it is difficult to obtain the same result if $Q$ is indefinite. Examples of kernels that may give an indefinite $Q$ are the sigmoid kernel $K(x_i,x_j) = \tanh(\gamma x_i^T x_j + r)$ [19] and the edit-distance kernel $K(x_i,x_j) = e^{-d(x_i,x_j)}$, where $d(x_i,x_j)$ is the edit distance of two strings $x_i$ and $x_j$ [5]. To obtain (11), [4] modified the sub-problem (2) to the following form:

$$
\begin{aligned}
\min_{\alpha_B}\quad & f\left(\begin{bmatrix}\alpha_B \\ \alpha_N^k\end{bmatrix}\right) + \tau\|\alpha_B - \alpha_B^k\|^2 \\
\text{subject to}\quad & y_B^T\alpha_B = -y_N^T\alpha_N^k, \\
& 0 \le \alpha_i \le C,\ i \in B,
\end{aligned}
\tag{12}
$$

where $\tau > 0$ is a constant. Clearly, an optimal $\alpha_B^{k+1}$ of (12) leads to

$$
f(\alpha^{k+1}) + \tau\|\alpha_B^{k+1} - \alpha_B^k\|^2 \le f(\alpha^k)
$$

and then (11). However, this change also departs from most SVM implementations, which consider positive semi-definite $Q$ and use the sub-problem (2). Moreover, (12) may still be non-convex and possess more than one local minimum. Then for implementations, convex and non-convex cases may have to be handled separately. Therefore, we propose another way to solve the sub-problem where

1) the same sub-problem (2) is used when $Q$ is positive definite (PD), and

2) the sub-problem is always convex.

An SMO-type decomposition method with special handling of indefinite $Q$ is given in the following algorithm.

Algorithm 2 (Algorithm 1 with specific handling on indefinite $Q$)

Steps 1), 2), and 4) are the same as those in Algorithm 1.

Step 3') Let $\tau > 0$ be a constant throughout all iterations, and define

$$
a \equiv Q_{ii} + Q_{jj} - 2y_iy_jQ_{ij}. \tag{13}
$$

a) If $a > 0$, then solve the sub-problem (2). Set $\alpha_B^{k+1}$ to be the optimal point of (2).

b) If $a \le 0$, then solve a modified sub-problem:

$$
\begin{aligned}
\min_{\alpha_i,\alpha_j}\quad & \frac{1}{2}\begin{bmatrix}\alpha_i & \alpha_j\end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj}\end{bmatrix}
\begin{bmatrix}\alpha_i \\ \alpha_j\end{bmatrix}
+ (-e_B + Q_{BN}\alpha_N^k)^T\begin{bmatrix}\alpha_i \\ \alpha_j\end{bmatrix}
+ \frac{\tau - a}{4}\left((\alpha_i - \alpha_i^k)^2 + (\alpha_j - \alpha_j^k)^2\right) \\
\text{subject to}\quad & 0 \le \alpha_i, \alpha_j \le C, \\
& y_i\alpha_i + y_j\alpha_j = -y_N^T\alpha_N^k.
\end{aligned}
\tag{14}
$$

Set $\alpha_B^{k+1}$ to be the optimal point of (14).

The additional term in (14) has a coefficient $(\tau - a)/4$ related to the matrix $Q$, but that in (12) does not. We will show later that our modification leads (14) to a convex optimization problem.

Next we discuss how to easily solve the two sub-problems (2) and (14). Consider $\alpha_i \equiv \alpha_i^k + y_id_i$ and $\alpha_j \equiv \alpha_j^k + y_jd_j$. The sub-problem (2) is equivalent to

$$
\begin{aligned}
\min_{d_i,d_j}\quad & \frac{1}{2}\begin{bmatrix} d_i & d_j\end{bmatrix}
\begin{bmatrix} Q_{ii} & y_iy_jQ_{ij} \\ y_iy_jQ_{ij} & Q_{jj}\end{bmatrix}
\begin{bmatrix} d_i \\ d_j\end{bmatrix}
+ (-e_B + (Q\alpha^k)_B)^T\begin{bmatrix} y_id_i \\ y_jd_j\end{bmatrix} \\
\text{subject to}\quad & d_i + d_j = 0, \\
& 0 \le \alpha_i^k + y_id_i,\ \alpha_j^k + y_jd_j \le C.
\end{aligned}
\tag{15}
$$

Substituting $d_i = -d_j$ into the objective function leads to a one-variable optimization problem:

$$
\begin{aligned}
\min_{d_j}\quad & g(d_j) \equiv f(\alpha) - f(\alpha^k)
= \frac{1}{2}\left(Q_{ii} + Q_{jj} - 2y_iy_jQ_{ij}\right)d_j^2 + \left(-y_i\nabla f(\alpha^k)_i + y_j\nabla f(\alpha^k)_j\right)d_j \\
\text{subject to}\quad & L \le d_j \le U,
\end{aligned}
\tag{16}
$$

where $L$ and $U$ are lower and upper bounds of $d_j$ after including information on $d_i$: $d_i = -d_j$ and $0 \le \alpha_i^k + y_id_i \le C$. As $a$ defined in (13) is the leading coefficient of the quadratic function (16), $a > 0$ if and only if the sub-problem (2) is strictly convex.

We then claim that

$$
L < 0 \text{ and } U \ge 0. \tag{17}
$$

If $y_i = y_j = 1$, $0 \le \alpha_i^k + y_id_i,\ \alpha_j^k + y_jd_j \le C$ implies

$$
L = \max(-\alpha_j^k,\ \alpha_i^k - C) \le d_j \le \min(C - \alpha_j^k,\ \alpha_i^k) = U.
$$

Clearly, $U \ge 0$. Since $i \in I_{up}(\alpha^k)$ and $j \in I_{low}(\alpha^k)$, we have $\alpha_j^k > 0$ and $\alpha_i^k < C$. Thus, $L < 0$. For other values of $y_i$ and $y_j$, the situations are similar.

When the sub-problem (2) is not strictly convex, we consider (14), which adds an additional term. With $d_i = -d_j$,

$$
\frac{\tau - a}{4}\|\alpha_B - \alpha_B^k\|^2 = \frac{\tau - a}{2}d_j^2. \tag{18}
$$

Thus, by defining

$$
a_1 \equiv \begin{cases} a & \text{if } a > 0, \\ \tau & \text{otherwise,} \end{cases} \tag{19}
$$

and

$$
a_2 \equiv -y_i\nabla f(\alpha^k)_i + y_j\nabla f(\alpha^k)_j > 0, \tag{20}
$$

problems (2) and (14) are essentially the following strictly convex optimization problem:

$$
\begin{aligned}
\min_{d_j}\quad & \frac{1}{2}a_1d_j^2 + a_2d_j \\
\text{subject to}\quad & L \le d_j \le U.
\end{aligned}
\tag{21}
$$

Moreover, $a_1 > 0$, $a_2 > 0$, $L < 0$, and $U \ge 0$. The quadratic objective function is shown in Figure 1 and the optimum is

$$
\bar{d}_j = \max\left(-\frac{a_2}{a_1},\ L\right) < 0. \tag{22}
$$

Therefore, once $a_1$ is defined in (19) according to whether $a > 0$ or not, the two sub-problems can be easily solved by the same method as in (22).

To check the decrease of the function value, the definition of the function $g$ in (16), $\bar{d}_j < 0$ and $a_1\bar{d}_j + a_2 \ge 0$ from (22), and $\|\alpha^{k+1} - \alpha^k\|^2 = 2\bar{d}_j^2$ imply

$$
\begin{aligned}
f(\alpha^{k+1}) - f(\alpha^k) = g(\bar{d}_j)
&= \frac{1}{2}a_1\bar{d}_j^2 + a_2\bar{d}_j \\
&= (a_1\bar{d}_j + a_2)\bar{d}_j - \frac{a_1}{2}\bar{d}_j^2 \\
&\le -\frac{a_1}{2}\bar{d}_j^2 \\
&= -\frac{a_1}{4}\|\alpha^{k+1} - \alpha^k\|^2.
\end{aligned}
$$
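The closed-form step (19)-(22) translates directly into code. The following is a sketch under assumptions: the function name `solve_pair`, its argument order, and the default value of `tau` are invented for illustration, and the caller is expected to pass a violating pair with $i \in I_{up}(\alpha^k)$ and $j \in I_{low}(\alpha^k)$.

```python
def solve_pair(alpha_i, alpha_j, y_i, y_j, Q_ii, Q_jj, Q_ij,
               grad_i, grad_j, C, tau=1e-12):
    """Return the updated (alpha_i, alpha_j) for a violating pair {i, j}.

    tau is the positive constant of Step 3'; its default here is an assumption.
    """
    a = Q_ii + Q_jj - 2.0 * y_i * y_j * Q_ij     # (13)
    a1 = a if a > 0 else tau                     # (19)
    a2 = -y_i * grad_i + y_j * grad_j            # (20)
    if a2 <= 0:
        raise ValueError("{i, j} is not a violating pair")
    # Lower bound L on d_j from 0 <= alpha_j + y_j d_j <= C
    # and 0 <= alpha_i + y_i d_i <= C with d_i = -d_j.
    lo_j = -alpha_j if y_j > 0 else alpha_j - C
    lo_i = alpha_i - C if y_i > 0 else -alpha_i
    L = max(lo_j, lo_i)
    d_j = max(-a2 / a1, L)                       # (22)
    d_i = -d_j
    return alpha_i + y_i * d_i, alpha_j + y_j * d_j

# Strictly convex case: a = 2, a2 = 2, the unconstrained minimizer d_j = -1 is feasible.
case_convex = solve_pair(2.0, 0.0, -1, -1, 1.0, 1.0, 0.0, 1.0, -1.0, C=100.0)

# Degenerate case a = 0 (identical mapped points with opposite labels): d_j = L.
case_flat = solve_pair(0.0, 0.0, 1, -1, 1.0, 1.0, -1.0, -1.0, -1.0, C=1.0)
```

The upper bound $U$ is never needed because the optimum (22) is negative. In the second call the step is clipped at $L$, so both variables move to the bound $C$.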

Fig. 1. Solving the convex sub-problem (21): (a) $L \le -a_2/a_1$; (b) $L > -a_2/a_1$.

Therefore, by defining

$$
\lambda \equiv \frac{1}{4}\min\left(\tau,\ \min\{Q_{ii} + Q_{jj} - 2y_iy_jQ_{ij} \mid Q_{ii} + Q_{jj} - 2y_iy_jQ_{ij} > 0\}\right), \tag{23}
$$

we have the following lemma:

Lemma 1 Assume at each iteration of Algorithm 2 the working set is a violating pair and let $\{\alpha^k\}$ be the sequence generated. Then (11) holds with $\lambda$ defined in (23).

The above discussion is related to [19], which examines how to solve (2) when it is non-convex. Note that $a = 0$ may happen as long as $Q$ is not PD (i.e., $Q$ is PSD or indefinite). Then, (16), becoming a linear function, is convex but not strictly convex. To have (11), an additional term makes it quadratic. Given that the function is already convex, we of course hope that there is no need to modify (2) to (14). The following theorem shows that if $Q$ is PSD and $\tau < 2/C$, the two sub-problems indeed have the same optimal solution. Therefore, in a sense, for PSD $Q$, the sub-problem (2) is always used.

Theorem 2 Assume $Q$ is PSD, the working set of Algorithm 2 is a violating pair, and $\tau \le 2/C$. If $a = 0$, then the optimal solution of (14) is the same as that of (2).

Proof: From (16) and $a_2 > 0$ in (20), if $a = 0$, (2) has optimal $\bar{d}_j = L$. For (14) and the transformed problem (21), $a = 0$ implies $a_1 = \tau$. Moreover, since $Q$ is PSD,

$$
a = 0 = Q_{ii} + Q_{jj} - 2y_iy_jQ_{ij} = \|\phi(x_i) - \phi(x_j)\|^2,
$$

and hence $\phi(x_i) = \phi(x_j)$. Therefore, $K_{it} = K_{jt},\ \forall t$ implies

$$
\begin{aligned}
-y_i\nabla f(\alpha^k)_i + y_j\nabla f(\alpha^k)_j
&= -\sum_{t=1}^{l} y_tK_{it}\alpha_t^k + y_i + \sum_{t=1}^{l} y_tK_{jt}\alpha_t^k - y_j \\
&= y_i - y_j.
\end{aligned}
$$

Since $\{i,j\}$ is a violating pair, $y_i - y_j > 0$ indicates that $a_2 = y_i - y_j = 2$. As $\tau \le 2/C$, (22) implies that $\bar{d}_j = L$ is the optimal solution of (14). Thus, we have shown that (2) and (14) have the same optimal point.

The derivation of $a_2 = 2$ was first developed in [17].

#### B. Proof of Asymptotic Convergence

Using (11), we then have the following lemma:

Lemma 2 Assume the working set at each iteration of Algorithm 2 is a violating pair. If a sub-sequence $\{\alpha^k\}, k \in K$ converges to $\bar{\alpha}$, then for any given positive integer $s$, the sequence $\{\alpha^{k+s}\}, k \in K$ converges to $\bar{\alpha}$ as well.

Proof: The lemma was first proved in [16, Lemma IV.3]. For completeness we give details here. For the sub-sequence $\{\alpha^{k+1}\}, k \in K$, from Lemma 1 and the fact that $\{f(\alpha^k)\}$ is a bounded decreasing sequence, we have

$$
\begin{aligned}
\lim_{k \in K, k \to \infty}\|\alpha^{k+1} - \bar{\alpha}\|
&\le \lim_{k \in K, k \to \infty}\left(\|\alpha^{k+1} - \alpha^k\| + \|\alpha^k - \bar{\alpha}\|\right) \\
&\le \lim_{k \in K, k \to \infty}\left(\sqrt{\frac{1}{\lambda}\left(f(\alpha^k) - f(\alpha^{k+1})\right)} + \|\alpha^k - \bar{\alpha}\|\right) \\
&= 0.
\end{aligned}
$$

Thus

$$
\lim_{k \to \infty, k \in K}\alpha^{k+1} = \bar{\alpha}.
$$

From $\{\alpha^{k+1}\}$ we can prove $\lim_{k \to \infty, k \in K}\alpha^{k+2} = \bar{\alpha}$ too. Therefore, $\lim_{k \to \infty, k \in K}\alpha^{k+s} = \bar{\alpha}$ for any given $s$.

Since the feasible region of (1) is compact, there is at least one convergent sub-sequence of $\{\alpha^k\}$. The following theorem shows that the limit point of any convergent sub-sequence is a stationary point of (1).

Theorem 3 Let $\{\alpha^k\}$ be the infinite sequence generated by the SMO-type method Algorithm 2 using WSS 3. Then any limit point of $\{\alpha^k\}$ is a stationary point of (1).

Proof: Assume that $\bar{\alpha}$ is the limit point of any convergent sub-sequence $\{\alpha^k\}, k \in K$. If $\bar{\alpha}$ is not a stationary point of (1), it does not satisfy the optimality condition (6). Thus, there is at least one maximal violating pair:

$$
\bar{i} \in \arg m(\bar{\alpha}), \qquad \bar{j} \in \arg M(\bar{\alpha}) \tag{24}
$$

such that

$$
\Delta \equiv -y_{\bar{i}}\nabla f(\bar{\alpha})_{\bar{i}} + y_{\bar{j}}\nabla f(\bar{\alpha})_{\bar{j}} > 0. \tag{25}
$$

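We further define

$$
\Delta' \equiv \min\left(\Delta,\ \frac{1}{2}\min\left\{\left|-y_t\nabla f(\bar{\alpha})_t + y_s\nabla f(\bar{\alpha})_s\right| \;\middle|\; -y_t\nabla f(\bar{\alpha})_t \ne -y_s\nabla f(\bar{\alpha})_s\right\}\right) > 0. \tag{26}
$$

Lemma 2, the continuity of $\nabla f(\alpha)$, and $h(\Delta') > 0$ from (26) imply that for any given $r$, there is $\bar{k} \in K$ such that for all $k \in K, k \ge \bar{k}$:

For $u = 0,\dots,r$,

$$
-y_{\bar{i}}\nabla f(\alpha^{k+u})_{\bar{i}} + y_{\bar{j}}\nabla f(\alpha^{k+u})_{\bar{j}} > \Delta'. \tag{27}
$$

If $i \in I_{up}(\bar{\alpha})$, then

$$
i \in I_{up}(\alpha^k),\dots,i \in I_{up}(\alpha^{k+r}). \tag{28}
$$

If $i \in I_{low}(\bar{\alpha})$, then

$$
i \in I_{low}(\alpha^k),\dots,i \in I_{low}(\alpha^{k+r}). \tag{29}
$$

If $-y_i\nabla f(\bar{\alpha})_i > -y_j\nabla f(\bar{\alpha})_j$, then for $u = 0,\dots,r$,

$$
-y_i\nabla f(\alpha^{k+u})_i > -y_j\nabla f(\alpha^{k+u})_j + \frac{\Delta'}{\sqrt{2}}. \tag{30}
$$

If $-y_i\nabla f(\bar{\alpha})_i = -y_j\nabla f(\bar{\alpha})_j$, then for $u = 0,\dots,r$,

$$
\left|-y_i\nabla f(\alpha^{k+u})_i + y_j\nabla f(\alpha^{k+u})_j\right| < h(\Delta'). \tag{31}
$$

For $u = 0,\dots,r-1$,

$$
(\tau - \hat{a})\|\alpha^{k+u+1} - \alpha^{k+u}\| \le \Delta', \tag{32}
$$

where $\hat{a} \equiv \min\{Q_{ii} + Q_{jj} - 2y_iy_jQ_{ij} \mid Q_{ii} + Q_{jj} - 2y_iy_jQ_{ij} < 0\}$.

If $-y_i\nabla f(\bar{\alpha})_i > -y_j\nabla f(\bar{\alpha})_j$ and $\{i,j\}$ is the working set at the $(k+u)$th iteration, $0 \le u \le r-1$, then

$$
i \notin I_{up}(\alpha^{k+u+1}) \text{ or } j \notin I_{low}(\alpha^{k+u+1}). \tag{33}
$$

The derivation of (27)-(32) is similar, so only the details of (27) are shown. Lemma 2 implies sequences $\{\alpha^k\}, \{\alpha^{k+1}\},\dots,\{\alpha^{k+r}\}, k \in K$ all converge to $\bar{\alpha}$. The continuity of $\nabla f(\alpha)$ and (26) imply that for any given $0 \le u \le r$ and the corresponding sequence $\{\alpha^{k+u}\}, k \in K$, there is $k_u$ such that (27) holds for all $k \ge k_u, k \in K$. As $r$ is finite, by selecting $\bar{k}$ to be the largest of these $k_u, u = 0,\dots,r$, (27) is valid for all $u = 0,\dots,r$.

Next we derive (33) [16, Lemma IV.4]. Similar to the optimality condition (6) for problem (1), for the sub-problem at the $(k+u)$th iteration, if problem (14) is considered and $\alpha_B$ is a stationary point, then

$$
\max_{t \in I_{up}(\alpha_B)} -y_t\nabla f\left(\begin{bmatrix}\alpha_B \\ \alpha_N^{k+u}\end{bmatrix}\right)_t - \frac{y_t(\tau - a)}{2}(\alpha_t - \alpha_t^{k+u})
\ \le\
\min_{t \in I_{low}(\alpha_B)} -y_t\nabla f\left(\begin{bmatrix}\alpha_B \\ \alpha_N^{k+u}\end{bmatrix}\right)_t - \frac{y_t(\tau - a)}{2}(\alpha_t - \alpha_t^{k+u}).
$$

Now $B = \{i,j\}$ and $\alpha_B^{k+u+1}$ is a stationary point of the sub-problem satisfying the above inequality. If

$$
i \in I_{up}(\alpha^{k+u+1}) \text{ and } j \in I_{low}(\alpha^{k+u+1}),
$$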

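then from (32) and (18)

$$
\begin{aligned}
-y_i\nabla f(\alpha^{k+u+1})_i
&\le -y_j\nabla f(\alpha^{k+u+1})_j + \frac{y_i(\tau - a)}{2}(\alpha_i^{k+u+1} - \alpha_i^{k+u}) - \frac{y_j(\tau - a)}{2}(\alpha_j^{k+u+1} - \alpha_j^{k+u}) \\
&\le -y_j\nabla f(\alpha^{k+u+1})_j + \frac{\tau - a}{\sqrt{2}}\|\alpha^{k+u+1} - \alpha^{k+u}\| \\
&\le -y_j\nabla f(\alpha^{k+u+1})_j + \frac{\Delta'}{\sqrt{2}}.
\end{aligned}
\tag{34}
$$

However, $-y_i\nabla f(\bar{\alpha})_i > -y_j\nabla f(\bar{\alpha})_j$ implies (30) for $\alpha^{k+u+1}$, so there is a contradiction. If $a > 0$ and the sub-problem (2) is considered, (34) has no term $\Delta'/\sqrt{2}$, so immediately we have a contradiction to (30).

For the convenience of writing the proof, we reorder indices of $\bar{\alpha}$ so that

$$
-y_1\nabla f(\bar{\alpha})_1 \le \cdots \le -y_l\nabla f(\bar{\alpha})_l. \tag{35}
$$

We also define

$$
S_1(k) \equiv \sum\{i \mid i \in I_{up}(\alpha^k)\} \quad \text{and} \quad S_2(k) \equiv \sum\{l - i \mid i \in I_{low}(\alpha^k)\}. \tag{36}
$$

Clearly,

$$
l \le S_1(k) + S_2(k) \le l(l-1). \tag{37}
$$

If $\{i,j\}$ is selected at the $(k+u)$th iteration ($u = 0,\dots,r$), then we claim that

$$
-y_i\nabla f(\bar{\alpha})_i > -y_j\nabla f(\bar{\alpha})_j. \tag{38}
$$

It is impossible that $-y_i\nabla f(\bar{\alpha})_i < -y_j\nabla f(\bar{\alpha})_j$, as $-y_i\nabla f(\alpha^{k+u})_i < -y_j\nabla f(\alpha^{k+u})_j$ from (30) then violates (9). If $-y_i\nabla f(\bar{\alpha})_i = -y_j\nabla f(\bar{\alpha})_j$, then

$$
\left|-y_i\nabla f(\alpha^{k+u})_i + y_j\nabla f(\alpha^{k+u})_j\right| < h(\Delta') < h\left(-y_{\bar{i}}\nabla f(\alpha^{k+u})_{\bar{i}} + y_{\bar{j}}\nabla f(\alpha^{k+u})_{\bar{j}}\right) \le h\left(m(\alpha^{k+u}) - M(\alpha^{k+u})\right). \tag{39}
$$

With the property that $h$ is strictly increasing, the first two inequalities come from (31) and (27), while the last is from $\bar{i} \in I_{up}(\bar{\alpha})$, $\bar{j} \in I_{low}(\bar{\alpha})$, (28), and (29). Clearly, (39) contradicts (9), so (38) is valid.

Next we use a counting procedure to obtain the contradiction of (24) and (25). From the $k$th to the $(k+1)$st iteration, (38) and then (33) show that

$$
i \notin I_{up}(\alpha^{k+1}) \text{ or } j \notin I_{low}(\alpha^{k+1}).
$$

For the first case, (28) implies $i \notin I_{up}(\bar{\alpha})$ and hence $i \in I_{low}(\bar{\alpha})$. From (29) and the selection rule (9), $i \in I_{low}(\alpha^k) \cap I_{up}(\alpha^k)$. Thus,

$$
i \in I_{low}(\alpha^k) \cap I_{up}(\alpha^k) \text{ and } i \notin I_{up}(\alpha^{k+1}).
$$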

With $j \in I_{low}(\alpha^k)$,

$$
S_1(k+1) \le S_1(k) - i + j \le S_1(k) - 1, \qquad S_2(k+1) \le S_2(k), \tag{40}
$$

where $-i + j \le -1$ comes from (35). Similarly, for the second case,

$$
j \in I_{low}(\alpha^k) \cap I_{up}(\alpha^k) \text{ and } j \notin I_{low}(\alpha^{k+1}).
$$

With $i \in I_{up}(\alpha^k)$,

$$
S_1(k+1) \le S_1(k), \qquad S_2(k+1) \le S_2(k) - (l - j) + (l - i) \le S_2(k) - 1. \tag{41}
$$

From iteration $(k+1)$ to $(k+2)$, we can repeat the same argument. Note that (33) can still be used because (38) holds for working sets selected during iterations $k$ to $k + r$. Using (40) and (41), at $r \equiv l(l-1)$ iterations, $S_1(k) + S_2(k)$ is reduced to zero, a contradiction to (37). Therefore, the assumptions (24) and (25) are wrong and the proof is complete.

Moreover, if (1) has a unique optimal solution, then the whole sequence $\{\alpha^k\}$ globally converges. This happens, for example, when $Q$ is PD.

Corollary 1 If $Q$ is positive definite, $\{\alpha^k\}$ globally converges to the unique minimum of (1).

Proof: Since $Q$ is positive definite, (1) has a unique solution and we denote it as $\bar{\alpha}$. Assume $\{\alpha^k\}$ does not globally converge to $\bar{\alpha}$. Then there is $\epsilon > 0$ and an infinite subset $\bar{K}$ such that $\|\alpha^k - \bar{\alpha}\| \ge \epsilon,\ \forall k \in \bar{K}$. Since $\{\alpha^k\}, k \in \bar{K}$ are in a compact space, there is a further sub-sequence converging to $\alpha^*$ and $\|\alpha^* - \bar{\alpha}\| \ge \epsilon$. Since $\alpha^*$ is an optimal solution according to Theorem 3, this contradicts that $\bar{\alpha}$ is the unique global minimum.

For example, if the RBF kernel $K(x_i,x_j) = e^{-\gamma\|x_i - x_j\|^2}$ is used and all $x_i \ne x_j$, then $Q$ is positive definite [] and we have the result of Corollary 1.

We now discuss the difference between the proof above and earlier convergence work. The work [16] considers a working set selection which allows more than two elements. When the size of the working set is restricted to two, the selection is reduced to WSS 1, the maximal violating pair. The proof in [17] uses a counting procedure by considering two sets related to $\{\bar{i},\bar{j}\}$, the maximal violating pair at $\bar{\alpha}$:

$$
\begin{aligned}
I_1(k) &\equiv \{t \mid t \in I_{up}(\alpha^k),\ -y_t\nabla f(\bar{\alpha})_t \ge -y_{\bar{j}}\nabla f(\bar{\alpha})_{\bar{j}}\}, \text{ and} \\
I_2(k) &\equiv \{t \mid t \in I_{low}(\alpha^k),\ -y_t\nabla f(\bar{\alpha})_t \le -y_{\bar{i}}\nabla f(\bar{\alpha})_{\bar{i}}\}.
\end{aligned}
$$
Clearly, if WSS 1 is used, $\bar{i} \in I_{up}(\alpha^k)$ and $\bar{j} \in I_{low}(\alpha^k)$ imply that the selected $\{i,j\}$ must satisfy $i \in I_1(k)$ and $j \in I_2(k)$. Using (33) to show that $|I_1(k)| + |I_2(k)|$ decreases to zero, we obtain a contradiction to the fact that $|I_1(k)| + |I_2(k)| \ge 2$ from $\bar{i} \in I_{up}(\alpha^k)$ and $\bar{j} \in I_{low}(\alpha^k)$. However, now we do not have $i \in I_1(k)$ and $j \in I_2(k)$ any more since our selection may not be the

maximal violating pair. Therefore, a new counting scheme is developed. By arranging $-y_i\nabla f(\bar{\alpha})_i$ in an ascending (descending) order, in (36), we count the sum of their ranks.

### IV. STOPPING CONDITION, SHRINKING AND CACHING

In this section we discuss other important properties of our method. In previous sections, $Q$ is any symmetric matrix, but here we further require it to be positive semi-definite. To begin, the following theorem states some facts about the problem (1).

Theorem 4 Assume $Q$ is positive semi-definite.

1) If $\bar{\alpha} \ne \hat{\alpha}$ are any two optimal solutions of (1), then

$$
-y_i\nabla f(\bar{\alpha})_i = -y_i\nabla f(\hat{\alpha})_i,\ i = 1,\dots,l, \tag{42}
$$

and

$$
M(\bar{\alpha}) = m(\bar{\alpha}) = M(\hat{\alpha}) = m(\hat{\alpha}). \tag{43}
$$

2) If there is an optimal solution $\bar{\alpha}$ satisfying $m(\bar{\alpha}) < M(\bar{\alpha})$, then $\bar{\alpha}$ is the unique optimal solution of (1).

3) The following set is independent of any optimal solution $\bar{\alpha}$:

$$
I \equiv \{i \mid -y_i\nabla f(\bar{\alpha})_i > M(\bar{\alpha}) \text{ or } -y_i\nabla f(\bar{\alpha})_i < m(\bar{\alpha})\}. \tag{44}
$$

Moreover, problem (1) has unique and bounded optimal solutions at $\alpha_i$, $i \in I$.

Proof: Since $Q$ is positive semi-definite, (1) is a convex programming problem and $\bar{\alpha}$ and $\hat{\alpha}$ are both global optima. Then

$$
f(\bar{\alpha}) = f(\hat{\alpha}) = f(\lambda\bar{\alpha} + (1 - \lambda)\hat{\alpha}), \quad \text{for all } 0 \le \lambda \le 1,
$$

implies

$$
(\bar{\alpha} - \hat{\alpha})^T Q(\bar{\alpha} - \hat{\alpha}) = 0.
$$

As $Q$ is PSD, $Q$ can be factorized to $LL^T$. Thus, $\|L^T(\bar{\alpha} - \hat{\alpha})\| = 0$ and hence $Q\bar{\alpha} = Q\hat{\alpha}$. Then (42) follows.

To prove (43), we will show that

$$
m(\hat{\alpha}) \ge M(\bar{\alpha}) \text{ and } m(\bar{\alpha}) \ge M(\hat{\alpha}). \tag{45}
$$

With optimality conditions $M(\bar{\alpha}) \ge m(\bar{\alpha})$ and $M(\hat{\alpha}) \ge m(\hat{\alpha})$, (43) holds.

Due to the symmetry, it is sufficient to show the first case of (45). If it is wrong, then

$$
m(\hat{\alpha}) < M(\bar{\alpha}). \tag{46}
$$

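We then investigate different situations by comparing $-y_i\nabla f(\bar{\alpha})_i$ with $M(\bar{\alpha})$ and $m(\hat{\alpha})$. If $-y_i\nabla f(\bar{\alpha})_i \ge M(\bar{\alpha}) > m(\hat{\alpha})$, then $i \notin I_{up}(\hat{\alpha})$ and

$$
\hat{\alpha}_i = \begin{cases} 0 & \text{if } y_i = -1, \\ C & \text{if } y_i = 1. \end{cases} \tag{47}
$$

With $0 \le \bar{\alpha}_i \le C$,

$$
y_i\hat{\alpha}_i - y_i\bar{\alpha}_i \ge 0. \tag{48}
$$

If $-y_i\nabla f(\bar{\alpha})_i \le m(\hat{\alpha}) < M(\bar{\alpha})$, then $i \notin I_{low}(\bar{\alpha})$ and

$$
\bar{\alpha}_i = \begin{cases} C & \text{if } y_i = -1, \\ 0 & \text{if } y_i = 1. \end{cases} \tag{49}
$$

Hence, (48) still holds.

Other indices are in the following set:

$$
S \equiv \{i \mid m(\hat{\alpha}) < -y_i\nabla f(\hat{\alpha})_i = -y_i\nabla f(\bar{\alpha})_i < M(\bar{\alpha})\}.
$$

If $i \in S$, then $i \notin I_{up}(\hat{\alpha})$ and $i \notin I_{low}(\bar{\alpha})$. Hence (47) and (49) imply

$$
y_i\hat{\alpha}_i - y_i\bar{\alpha}_i = C. \tag{50}
$$

Thus,

$$
0 = y^T\hat{\alpha} - y^T\bar{\alpha} = \sum_{i: i \notin S}(y_i\hat{\alpha}_i - y_i\bar{\alpha}_i) + C|S|.
$$

Since (48) implies each term in the above summation is non-negative,

$$
|S| = 0 \text{ and } \hat{\alpha}_i = \bar{\alpha}_i,\ \forall i \notin S.
$$

Therefore, $\bar{\alpha} = \hat{\alpha}$. However, this contradicts the assumption that $\bar{\alpha}$ and $\hat{\alpha}$ are different optimal solutions. Hence (46) is wrong and we have $m(\hat{\alpha}) \ge M(\bar{\alpha})$ in (45). The proof of (43) is thus complete.

The second result of this theorem and the validity of the set $I$ directly come from (43). Moreover, $I$ is independent of any optimal solution.

For any optimal $\alpha$, if $i \in I$ and $-y_i\nabla f(\alpha)_i > M(\alpha) \ge m(\alpha)$, then $i \notin I_{up}(\alpha)$ and $\alpha_i$ is the same as (47). Thus, the optimal $\alpha_i$ is unique and bounded. The situation for $-y_i\nabla f(\alpha)_i < m(\alpha)$ is similar.

Lemma 3 in [1] shows a result similar to (43), but the proof is more complicated. It involves both primal and dual SVM formulations. Using Theorem 4, in the rest of this section we derive more properties of the proposed SMO-type methods.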

#### A. Stopping Condition and Finite Termination

As the decomposition method only asymptotically approaches an optimum, in practice, it terminates after satisfying a stopping condition. For example, we can pre-specify a small stopping tolerance $\epsilon > 0$ and check if

$$
m(\alpha^k) - M(\alpha^k) \le \epsilon \tag{51}
$$

is satisfied. Though one can consider other stopping conditions, (51) is commonly used due to its closeness to the optimality condition (6). To justify its validity as a stopping condition, we must make sure that under any given stopping tolerance $\epsilon > 0$, the proposed decomposition method stops in a finite number of iterations. Thus, an infinite loop never happens. To have (51), one can prove a stronger condition:

$$
\lim_{k \to \infty} m(\alpha^k) - M(\alpha^k) = 0. \tag{52}
$$

This condition is not readily available because, from the respective definitions of $m(\alpha)$ and $M(\alpha)$, it is unclear whether they are continuous functions of $\alpha$. Proving (52) is the main result of this subsection.

The first study on the stopping condition of SMO-type methods is [11]. They consider a selection rule which involves the stopping tolerance, so the working set is a so-called $\epsilon$-violating pair. Since our selection rule is independent of $\epsilon$, their analysis cannot be applied here. Another work [18] proves (52) for WSS 1 under the assumption that the sequence $\{\alpha^k\}$ globally converges. Here, for the more general selection WSS 3, we prove (52) with positive semi-definite $Q$.

Theorem 5 Assume $Q$ is positive semi-definite and the SMO-type decomposition method Algorithm 2 using WSS 3 generates an infinite sequence $\{\alpha^k\}$. Then

$$
\lim_{k \to \infty} m(\alpha^k) - M(\alpha^k) = 0. \tag{53}
$$

Proof: We prove the theorem by contradiction and assume that the result (53) is wrong. Then, there is an infinite set $\bar{K}$ and a value $\Delta > 0$ such that

$$
|m(\alpha^k) - M(\alpha^k)| \ge \Delta,\ \forall k \in \bar{K}. \tag{54}
$$

Since in the decomposition method $m(\alpha^k) > M(\alpha^k),\ \forall k$, (54) can be rewritten as

$$
m(\alpha^k) - M(\alpha^k) \ge \Delta,\ \forall k \in \bar{K}. \tag{55}
$$

In this $\bar{K}$ there is a further sub-sequence $K$ so that

$$
\lim_{k \in K, k \to \infty}\alpha^k = \bar{\alpha}.
$$

As $Q$ is positive semi-definite, from Theorem 4, $\nabla f(\alpha^k)$ globally converges:

$$
\lim_{k \to \infty}\nabla f(\alpha^k)_i = \nabla f(\bar{\alpha})_i,\ i = 1,\dots,l. \tag{56}
$$

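We then follow a counting strategy similar to that in Theorem 3. First we rewrite (55) as

$$
m(\alpha^k) \ge M(\alpha^k) + \Delta',\ \forall k \in K, \tag{57}
$$

where

$$
\Delta' \equiv \min\left(\Delta,\ \frac{1}{2}\min\left\{\left|-y_t\nabla f(\bar{\alpha})_t + y_s\nabla f(\bar{\alpha})_s\right| \;\middle|\; -y_t\nabla f(\bar{\alpha})_t \ne -y_s\nabla f(\bar{\alpha})_s\right\}\right). \tag{58}
$$

We still require (27)-(33), but use (56) and (58) to extend (30) and (31) for all $k \ge \bar{k}$ (i.e., not only $k \in K$):

$$
\text{If } -y_t\nabla f(\bar{\alpha})_t > -y_s\nabla f(\bar{\alpha})_s, \text{ then } -y_t\nabla f(\alpha^k)_t > -y_s\nabla f(\alpha^k)_s. \tag{59}
$$

$$
\text{If } -y_t\nabla f(\bar{\alpha})_t \ne -y_s\nabla f(\bar{\alpha})_s, \text{ then } \left|-y_t\nabla f(\alpha^k)_t + y_s\nabla f(\alpha^k)_s\right| > \Delta'. \tag{60}
$$

$$
\text{If } -y_t\nabla f(\bar{\alpha})_t = -y_s\nabla f(\bar{\alpha})_s, \text{ then } \left|-y_t\nabla f(\alpha^k)_t + y_s\nabla f(\alpha^k)_s\right| < h(\Delta'). \tag{61}
$$

Then the whole proof follows Theorem 3 except (39), in which we need $m(\alpha^{k+u}) - M(\alpha^{k+u}) \ge \Delta',\ \forall u = 0,\dots,r$. This condition does not follow from (57), which holds for only a sub-sequence. Thus, in the following we further prove

$$
m(\alpha^k) - M(\alpha^k) \ge \Delta',\ \forall k \ge \bar{k}. \tag{62}
$$

Assume at one $k' \ge \bar{k}$, $0 < m(\alpha^{k'}) - M(\alpha^{k'}) < \Delta'$ and $\{i,j\}$ is the working set of this iteration. As $i \in I_{up}(\alpha^{k'})$ and $j \in I_{low}(\alpha^{k'})$ from the selection rule, we have

$$
M(\alpha^{k'}) \le -y_j\nabla f(\alpha^{k'})_j < -y_i\nabla f(\alpha^{k'})_i \le m(\alpha^{k'}). \tag{63}
$$

Then according to (60), $\{i,j\}$ and indices achieving $m(\alpha^{k'})$ and $M(\alpha^{k'})$ have the same value of $-y_t\nabla f(\bar{\alpha})_t$. They are all from the following set:

$$
\{t \mid -y_t\nabla f(\bar{\alpha})_t = -y_i\nabla f(\bar{\alpha})_i = -y_j\nabla f(\bar{\alpha})_j\}. \tag{64}
$$

For elements not in this set, (59), (60), and (63) imply that

$$
\text{If } -y_t\nabla f(\bar{\alpha})_t > -y_i\nabla f(\bar{\alpha})_i, \text{ then } -y_t\nabla f(\alpha^{k'})_t > -y_i\nabla f(\alpha^{k'})_i + \Delta' > m(\alpha^{k'}) \text{ and hence } t \notin I_{up}(\alpha^{k'}). \tag{65}
$$

Similarly,

$$
\text{If } -y_t\nabla f(\bar{\alpha})_t < -y_i\nabla f(\bar{\alpha})_i, \text{ then } t \notin I_{low}(\alpha^{k'}). \tag{66}
$$

As we have explained, the working set is from the set (64), so other components remain the same from iteration $k'$ to $k'+1$. Therefore, indices satisfying (65) and (66) have $t \notin I_{up}(\alpha^{k'+1})$ and $t \notin I_{low}(\alpha^{k'+1})$, respectively. Furthermore, indices in (65) have larger $-y_t\nabla f(\alpha^{k'+1})_t$ than others according to (59). Thus, their $-y_t\nabla f(\alpha^{k'+1})_t$ are greater than $m(\alpha^{k'+1})$. Similarly, elements in (66) are smaller than $M(\alpha^{k'+1})$. With the fact that $m(\alpha^{k'+1}) > M(\alpha^{k'+1})$, indices which achieve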


### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### 2.3 Convex Constrained Optimization Problems

42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

### Density Level Detection is Classification

Density Level Detection is Classification Ingo Steinwart, Don Hush and Clint Scovel Modeling, Algorithms and Informatics Group, CCS-3 Los Alamos National Laboratory {ingo,dhush,jcs}@lanl.gov Abstract We

### By W.E. Diewert. July, Linear programming problems are important for a number of reasons:

APPLIED ECONOMICS By W.E. Diewert. July, 3. Chapter : Linear Programming. Introduction The theory of linear programming provides a good introduction to the study of constrained maximization (and minimization)

### Several Views of Support Vector Machines

Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

### Definition of a Linear Program

Definition of a Linear Program Definition: A function f(x 1, x,..., x n ) of x 1, x,..., x n is a linear function if and only if for some set of constants c 1, c,..., c n, f(x 1, x,..., x n ) = c 1 x 1

### Support Vector Machines

Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### Support Vector Machines

CS229 Lecture notes Andrew Ng Part V Support Vector Machines This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best)

### Solution based on matrix technique Rewrite. ) = 8x 2 1 4x 1x 2 + 5x x1 2x 2 2x 1 + 5x 2

8.2 Quadratic Forms Example 1 Consider the function q(x 1, x 2 ) = 8x 2 1 4x 1x 2 + 5x 2 2 Determine whether q(0, 0) is the global minimum. Solution based on matrix technique Rewrite q( x1 x 2 = x1 ) =

### 24. The Branch and Bound Method

24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NP-complete. Then one can conclude according to the present state of science that no

### Approximation Algorithms

Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

### An example of a computable

An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

### Factorization Theorems

Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization

### 1 Polyhedra and Linear Programming

CS 598CSC: Combinatorial Optimization Lecture date: January 21, 2009 Instructor: Chandra Chekuri Scribe: Sungjin Im 1 Polyhedra and Linear Programming In this lecture, we will cover some basic material

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems

Journal of Machine Learning Research 7 (2006) 1467 1492 Submitted 10/05; Revised 2/06; Published 7/06 Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems Luca Zanni

### An Efficient Implementation of an Active Set Method for SVMs

Journal of Machine Learning Research 7 (2006) 2237-2257 Submitted 10/05; Revised 2/06; Published 10/06 An Efficient Implementation of an Active Set Method for SVMs Katya Scheinberg Mathematical Science

### ME128 Computer-Aided Mechanical Design Course Notes Introduction to Design Optimization

ME128 Computer-ided Mechanical Design Course Notes Introduction to Design Optimization 2. OPTIMIZTION Design optimization is rooted as a basic problem for design engineers. It is, of course, a rare situation

### INTRODUCTION TO PROOFS: HOMEWORK SOLUTIONS

INTRODUCTION TO PROOFS: HOMEWORK SOLUTIONS STEVEN HEILMAN Contents 1. Homework 1 1 2. Homework 2 6 3. Homework 3 10 4. Homework 4 16 5. Homework 5 19 6. Homework 6 21 7. Homework 7 25 8. Homework 8 28

### DETERMINANTS. b 2. x 2

DETERMINANTS 1 Systems of two equations in two unknowns A system of two equations in two unknowns has the form a 11 x 1 + a 12 x 2 = b 1 a 21 x 1 + a 22 x 2 = b 2 This can be written more concisely in

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### 1 Limiting distribution for a Markov chain

Copyright c 2009 by Karl Sigman Limiting distribution for a Markov chain In these Lecture Notes, we shall study the limiting behavior of Markov chains as time n In particular, under suitable easy-to-check

### Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

### Distributed Machine Learning and Big Data

Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya

### Iterative Techniques in Matrix Algebra. Jacobi & Gauss-Seidel Iterative Techniques II

Iterative Techniques in Matrix Algebra Jacobi & Gauss-Seidel Iterative Techniques II Numerical Analysis (9th Edition) R L Burden & J D Faires Beamer Presentation Slides prepared by John Carroll Dublin

### Linear Systems. Singular and Nonsingular Matrices. Find x 1, x 2, x 3 such that the following three equations hold:

Linear Systems Example: Find x, x, x such that the following three equations hold: x + x + x = 4x + x + x = x + x + x = 6 We can write this using matrix-vector notation as 4 {{ A x x x {{ x = 6 {{ b General

### Massive Data Classification via Unconstrained Support Vector Machines

Massive Data Classification via Unconstrained Support Vector Machines Olvi L. Mangasarian and Michael E. Thompson Computer Sciences Department University of Wisconsin 1210 West Dayton Street Madison, WI

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Nonlinear Programming Methods.S2 Quadratic Programming

Nonlinear Programming Methods.S2 Quadratic Programming Operations Research Models and Methods Paul A. Jensen and Jonathan F. Bard A linearly constrained optimization problem with a quadratic objective

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Cost Minimization and the Cost Function

Cost Minimization and the Cost Function Juan Manuel Puerta October 5, 2009 So far we focused on profit maximization, we could look at a different problem, that is the cost minimization problem. This is

### IRREDUCIBLE OPERATOR SEMIGROUPS SUCH THAT AB AND BA ARE PROPORTIONAL. 1. Introduction

IRREDUCIBLE OPERATOR SEMIGROUPS SUCH THAT AB AND BA ARE PROPORTIONAL R. DRNOVŠEK, T. KOŠIR Dedicated to Prof. Heydar Radjavi on the occasion of his seventieth birthday. Abstract. Let S be an irreducible

### Classification of Cartan matrices

Chapter 7 Classification of Cartan matrices In this chapter we describe a classification of generalised Cartan matrices This classification can be compared as the rough classification of varieties in terms

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced

### 2.5 Gaussian Elimination

page 150 150 CHAPTER 2 Matrices and Systems of Linear Equations 37 10 the linear algebra package of Maple, the three elementary 20 23 1 row operations are 12 1 swaprow(a,i,j): permute rows i and j 3 3

### Linear Least Squares

Linear Least Squares Suppose we are given a set of data points {(x i,f i )}, i = 1,...,n. These could be measurements from an experiment or obtained simply by evaluating a function at some points. One

### Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725

Duality in General Programs Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: duality in linear programs Given c R n, A R m n, b R m, G R r n, h R r : min x R n c T x max u R m, v R r b T

### Notes from Week 1: Algorithms for sequential prediction

CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### A Network Flow Approach in Cloud Computing

1 A Network Flow Approach in Cloud Computing Soheil Feizi, Amy Zhang, Muriel Médard RLE at MIT Abstract In this paper, by using network flow principles, we propose algorithms to address various challenges

### 3. Linear Programming and Polyhedral Combinatorics

Massachusetts Institute of Technology Handout 6 18.433: Combinatorial Optimization February 20th, 2009 Michel X. Goemans 3. Linear Programming and Polyhedral Combinatorics Summary of what was seen in the

### 61. REARRANGEMENTS 119

61. REARRANGEMENTS 119 61. Rearrangements Here the difference between conditionally and absolutely convergent series is further refined through the concept of rearrangement. Definition 15. (Rearrangement)

### The multi-integer set cover and the facility terminal cover problem

The multi-integer set cover and the facility teral cover problem Dorit S. Hochbaum Asaf Levin December 5, 2007 Abstract The facility teral cover problem is a generalization of the vertex cover problem.

### Notes V General Equilibrium: Positive Theory. 1 Walrasian Equilibrium and Excess Demand

Notes V General Equilibrium: Positive Theory In this lecture we go on considering a general equilibrium model of a private ownership economy. In contrast to the Notes IV, we focus on positive issues such

### Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition

Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition Jack Gilbert Copyright (c) 2014 Jack Gilbert. Permission is granted to copy, distribute and/or modify this document under the terms

### Geometry of Linear Programming

Chapter 2 Geometry of Linear Programming The intent of this chapter is to provide a geometric interpretation of linear programming problems. To conceive fundamental concepts and validity of different algorithms

### Notes on Support Vector Machines

Notes on Support Vector Machines Fernando Mira da Silva Fernando.Silva@inesc.pt Neural Network Group I N E S C November 1998 Abstract This report describes an empirical study of Support Vector Machines

### Online Classification on a Budget

Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk

### Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

### Adapting Codes and Embeddings for Polychotomies

Adapting Codes and Embeddings for Polychotomies Gunnar Rätsch, Alexander J. Smola RSISE, CSL, Machine Learning Group The Australian National University Canberra, 2 ACT, Australia Gunnar.Raetsch, Alex.Smola

### Which Is the Best Multiclass SVM Method? An Empirical Study

Which Is the Best Multiclass SVM Method? An Empirical Study Kai-Bo Duan 1 and S. Sathiya Keerthi 2 1 BioInformatics Research Centre, Nanyang Technological University, Nanyang Avenue, Singapore 639798 askbduan@ntu.edu.sg

### Introduction to Flocking {Stochastic Matrices}

Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur - Yvette May 21, 2012 CRAIG REYNOLDS - 1987 BOIDS The Lion King CRAIG REYNOLDS

### Ideal Class Group and Units

Chapter 4 Ideal Class Group and Units We are now interested in understanding two aspects of ring of integers of number fields: how principal they are (that is, what is the proportion of principal ideals

### Large Linear Classification When Data Cannot Fit In Memory

Large Linear Classification When Data Cannot Fit In Memory ABSTRACT Hsiang-Fu Yu Dept. of Computer Science National Taiwan University Taipei 106, Taiwan b93107@csie.ntu.edu.tw Kai-Wei Chang Dept. of Computer

### 7 - Linear Transformations

7 - Linear Transformations Mathematics has as its objects of study sets with various structures. These sets include sets of numbers (such as the integers, rationals, reals, and complexes) whose structure

### ORDERS OF ELEMENTS IN A GROUP

ORDERS OF ELEMENTS IN A GROUP KEITH CONRAD 1. Introduction Let G be a group and g G. We say g has finite order if g n = e for some positive integer n. For example, 1 and i have finite order in C, since

### A Tutorial on Support Vector Machines for Pattern Recognition

c,, 1 43 () Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. A Tutorial on Support Vector Machines for Pattern Recognition CHRISTOPHER J.C. BURGES Bell Laboratories, Lucent Technologies

### Math 315: Linear Algebra Solutions to Midterm Exam I

Math 35: Linear Algebra s to Midterm Exam I # Consider the following two systems of linear equations (I) ax + by = k cx + dy = l (II) ax + by = 0 cx + dy = 0 (a) Prove: If x = x, y = y and x = x 2, y =

### Support Vector Machines for Classification and Regression

UNIVERSITY OF SOUTHAMPTON Support Vector Machines for Classification and Regression by Steve R. Gunn Technical Report Faculty of Engineering, Science and Mathematics School of Electronics and Computer

### PUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include 2 + 5.

PUTNAM TRAINING POLYNOMIALS (Last updated: November 17, 2015) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include

### MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Week 5 Integral Polyhedra

Week 5 Integral Polyhedra We have seen some examples 1 of linear programming formulation that are integral, meaning that every basic feasible solution is an integral vector. This week we develop a theory

### t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

### Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

### Fast Support Vector Machine Training and Classification on Graphics Processors

Fast Support Vector Machine Training and Classification on Graphics Processors Bryan Christopher Catanzaro Narayanan Sundaram Kurt Keutzer Electrical Engineering and Computer Sciences University of California

### Lecture 5: Conic Optimization: Overview

EE 227A: Conve Optimization and Applications January 31, 2012 Lecture 5: Conic Optimization: Overview Lecturer: Laurent El Ghaoui Reading assignment: Chapter 4 of BV. Sections 3.1-3.6 of WTB. 5.1 Linear

### Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

### Inner Product Spaces

Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

### Summary of week 8 (Lectures 22, 23 and 24)

WEEK 8 Summary of week 8 (Lectures 22, 23 and 24) This week we completed our discussion of Chapter 5 of [VST] Recall that if V and W are inner product spaces then a linear map T : V W is called an isometry

### Early defect identification of semiconductor processes using machine learning

STANFORD UNIVERISTY MACHINE LEARNING CS229 Early defect identification of semiconductor processes using machine learning Friday, December 16, 2011 Authors: Saul ROSA Anton VLADIMIROV Professor: Dr. Andrew

### Universal Algorithm for Trading in Stock Market Based on the Method of Calibration

Universal Algorithm for Trading in Stock Market Based on the Method of Calibration Vladimir V yugin Institute for Information Transmission Problems, Russian Academy of Sciences, Bol shoi Karetnyi per.

### Notes on Factoring. MA 206 Kurt Bryan

The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

### Date: April 12, 2001. Contents

2 Lagrange Multipliers Date: April 12, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 12 2.3. Informative Lagrange Multipliers...........

### 1 Sets and Set Notation.

LINEAR ALGEBRA MATH 27.6 SPRING 23 (COHEN) LECTURE NOTES Sets and Set Notation. Definition (Naive Definition of a Set). A set is any collection of objects, called the elements of that set. We will most

### Lecture 5 Principal Minors and the Hessian

Lecture 5 Principal Minors and the Hessian Eivind Eriksen BI Norwegian School of Management Department of Economics October 01, 2010 Eivind Eriksen (BI Dept of Economics) Lecture 5 Principal Minors and

### Offline sorting buffers on Line

Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

### I. GROUPS: BASIC DEFINITIONS AND EXAMPLES

I GROUPS: BASIC DEFINITIONS AND EXAMPLES Definition 1: An operation on a set G is a function : G G G Definition 2: A group is a set G which is equipped with an operation and a special element e G, called

### Linear Programming. March 14, 2014

Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1

### equations Karl Lundengård December 3, 2012 MAA704: Matrix functions and matrix equations Matrix functions Matrix equations Matrix equations, cont d

and and Karl Lundengård December 3, 2012 Solving General, Contents of todays lecture and (Kroenecker product) Solving General, Some useful from calculus and : f (x) = x n, x C, n Z + : f (x) = n x, x R,

### THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS

THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS KEITH CONRAD 1. Introduction The Fundamental Theorem of Algebra says every nonconstant polynomial with complex coefficients can be factored into linear

### The Dirichlet Unit Theorem

Chapter 6 The Dirichlet Unit Theorem As usual, we will be working in the ring B of algebraic integers of a number field L. Two factorizations of an element of B are regarded as essentially the same if

### Introduction and message of the book

1 Introduction and message of the book 1.1 Why polynomial optimization? Consider the global optimization problem: P : for some feasible set f := inf x { f(x) : x K } (1.1) K := { x R n : g j (x) 0, j =

### Lecture 13 Linear quadratic Lyapunov theory

EE363 Winter 28-9 Lecture 13 Linear quadratic Lyapunov theory the Lyapunov equation Lyapunov stability conditions the Lyapunov operator and integral evaluating quadratic integrals analysis of ARE discrete-time

### 2.3 Scheduling jobs on identical parallel machines

2.3 Scheduling jobs on identical parallel machines There are jobs to be processed, and there are identical machines (running in parallel) to which each job may be assigned Each job = 1,,, must be processed