# Working Set Selection Using Second Order Information for Training Support Vector Machines

## Transcription

Journal of Machine Learning Research 6 (2005). Submitted 4/05; Revised 10/05; Published 12/05.

Working Set Selection Using Second Order Information for Training Support Vector Machines

Rong-En Fan, Pai-Hsuen Chen, Chih-Jen Lin
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan

Editor: Thorsten Joachims

© 2005 Rong-En Fan, Pai-Hsuen Chen, and Chih-Jen Lin.

### Abstract

Working set selection is an important step in decomposition methods for training support vector machines (SVMs). This paper develops a new technique for working set selection in SMO-type decomposition methods. It uses second order information to achieve fast convergence. Theoretical properties such as linear convergence are established. Experiments demonstrate that the proposed method is faster than existing selection methods using first order information.

Keywords: support vector machines, decomposition methods, sequential minimal optimization, working set selection

### 1. Introduction

Support vector machines (SVMs) (Boser et al., 1992; Cortes and Vapnik, 1995) are a useful classification method. Given instances $x_i$, $i = 1, \ldots, l$, with labels $y_i \in \{1, -1\}$, the main task in training SVMs is to solve the following quadratic optimization problem:

$$
\begin{aligned}
\min_{\alpha}\quad & f(\alpha) = \frac{1}{2}\alpha^T Q \alpha - e^T \alpha \\
\text{subject to}\quad & 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \\
& y^T \alpha = 0,
\end{aligned}
\tag{1}
$$

where $e$ is the vector of all ones, $C$ is the upper bound of all variables, $Q$ is an $l$ by $l$ symmetric matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$, and $K(x_i, x_j)$ is the kernel function. The matrix $Q$ is usually fully dense and may be too large to be stored. Decomposition methods are designed to handle such difficulties (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998; Chang and Lin, 2001). Unlike most optimization methods, which update the whole vector $\alpha$ in each step of an iterative process, the decomposition method modifies only a subset of $\alpha$ per iteration. This subset, denoted as the working set $B$, leads to a small sub-problem to be minimized in each iteration.

An extreme case is Sequential Minimal Optimization (SMO) (Platt, 1998), which restricts $B$ to have only two elements. In each iteration one then solves a simple two-variable problem, which requires no optimization software. This method is sketched in the following:

Algorithm 1 (SMO-type decomposition method)

1. Find $\alpha^1$ as the initial feasible solution. Set $k = 1$.
2. If $\alpha^k$ is an optimal solution of (1), stop. Otherwise, find a two-element working set $B = \{i, j\} \subset \{1, \ldots, l\}$. Define $N \equiv \{1, \ldots, l\} \setminus B$ and $\alpha_B^k$ and $\alpha_N^k$ to be sub-vectors of $\alpha^k$ corresponding to $B$ and $N$, respectively.
3. Solve the following sub-problem with the variable $\alpha_B$:

   $$
   \begin{aligned}
   \min_{\alpha_B}\quad & \frac{1}{2}\begin{bmatrix}\alpha_B^T & (\alpha_N^k)^T\end{bmatrix}
   \begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix}
   \begin{bmatrix}\alpha_B \\ \alpha_N^k\end{bmatrix}
   - \begin{bmatrix} e_B^T & e_N^T \end{bmatrix}\begin{bmatrix}\alpha_B \\ \alpha_N^k\end{bmatrix} \\
   & = \frac{1}{2}\alpha_B^T Q_{BB}\alpha_B + (-e_B + Q_{BN}\alpha_N^k)^T \alpha_B + \text{constant} \\
   & = \frac{1}{2}\begin{bmatrix}\alpha_i & \alpha_j\end{bmatrix}
   \begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{bmatrix}
   \begin{bmatrix}\alpha_i \\ \alpha_j\end{bmatrix}
   + (-e_B + Q_{BN}\alpha_N^k)^T \begin{bmatrix}\alpha_i \\ \alpha_j\end{bmatrix} + \text{constant} \\
   \text{subject to}\quad & 0 \le \alpha_i, \alpha_j \le C, \\
   & y_i\alpha_i + y_j\alpha_j = -y_N^T \alpha_N^k,
   \end{aligned}
   \tag{2}
   $$

   where $\begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix}$ is a permutation of the matrix $Q$.
4. Set $\alpha_B^{k+1}$ to be the optimal solution of (2) and $\alpha_N^{k+1} \equiv \alpha_N^k$. Set $k \leftarrow k + 1$ and go to Step 2.

Note that the set $B$ changes from one iteration to another, but to simplify the notation, we just use $B$ instead of $B^k$.

Since only a few components are updated per iteration, the decomposition method suffers from slow convergence on difficult problems. Better methods of working set selection could reduce the number of iterations and hence are an important research issue. Existing methods mainly rely on the violation of the optimality condition, which also corresponds to first order (i.e., gradient) information of the objective function. Past optimization research indicates that using second order information generally leads to faster convergence. Now (1) is a quadratic programming problem, so second order information directly relates to the decrease of the objective function. There are several attempts (e.g., Lai et al., 2003a,b) to find working sets based on the reduction of the objective value, but these selection methods are only heuristics without convergence proofs. Moreover, as such techniques cost more than existing ones, fewer iterations may not lead to shorter training time.

This paper develops a simple working set selection using second order information. It can be extended for indefinite kernel matrices. Experiments demonstrate that the training time is shorter than existing implementations.

This paper is organized as follows. In Section 2, we discuss existing methods of working set selection and propose a new strategy. Theoretical properties of using the new selection technique are in Section 3. In Section 4 we extend the proposed selection method to other SVM formulas such as ν-SVM. A detailed experiment is in Section 5. We then in Section 6 discuss and compare some variants of the proposed selection method. Finally, Section 7 concludes this research work. A pseudo code of the proposed method is in the Appendix.

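### 2. Existing and New Working Set Selections

In this section, we discuss existing methods of working set selection and then propose a new approach.

#### 2.1 Existing Selections

Currently a popular way to select the working set $B$ is via the "maximal violating pair":

WSS 1 (Working set selection via the "maximal violating pair")

1. Select

   $$
   \begin{aligned}
   i &\in \arg\max_t \{ -y_t \nabla f(\alpha^k)_t \mid t \in I_{\text{up}}(\alpha^k) \}, \\
   j &\in \arg\min_t \{ -y_t \nabla f(\alpha^k)_t \mid t \in I_{\text{low}}(\alpha^k) \},
   \end{aligned}
   $$

   where

   $$
   \begin{aligned}
   I_{\text{up}}(\alpha) &\equiv \{ t \mid \alpha_t < C,\ y_t = 1 \ \text{ or } \ \alpha_t > 0,\ y_t = -1 \}, \text{ and} \\
   I_{\text{low}}(\alpha) &\equiv \{ t \mid \alpha_t < C,\ y_t = -1 \ \text{ or } \ \alpha_t > 0,\ y_t = 1 \}.
   \end{aligned}
   \tag{3}
   $$

2. Return $B = \{i, j\}$.

This working set was first proposed in Keerthi et al. (2001) and is used in, for example, the software LIBSVM (Chang and Lin, 2001). WSS 1 can be derived through the Karush-Kuhn-Tucker (KKT) optimality condition of (1): A vector $\alpha$ is a stationary point of (1) if and only if there is a number $b$ and two nonnegative vectors $\lambda$ and $\mu$ such that

$$
\nabla f(\alpha) + b y = \lambda - \mu, \qquad
\lambda_i \alpha_i = 0,\ \mu_i (C - \alpha_i) = 0,\ \lambda_i \ge 0,\ \mu_i \ge 0,\ i = 1, \ldots, l,
$$

where $\nabla f(\alpha) \equiv Q\alpha - e$ is the gradient of $f(\alpha)$. This condition can be rewritten as

$$\nabla f(\alpha)_i + b y_i \ge 0 \quad \text{if } \alpha_i < C, \tag{4}$$

$$\nabla f(\alpha)_i + b y_i \le 0 \quad \text{if } \alpha_i > 0. \tag{5}$$

Since $y_i = \pm 1$, by defining $I_{\text{up}}(\alpha)$ and $I_{\text{low}}(\alpha)$ as in (3), and rewriting (4)-(5) to

$$
-y_i \nabla f(\alpha)_i \le b,\ \forall i \in I_{\text{up}}(\alpha), \quad \text{and} \quad
-y_i \nabla f(\alpha)_i \ge b,\ \forall i \in I_{\text{low}}(\alpha),
$$

a feasible $\alpha$ is a stationary point of (1) if and only if

$$m(\alpha) \le M(\alpha), \tag{6}$$

where

$$
m(\alpha) \equiv \max_{i \in I_{\text{up}}(\alpha)} -y_i \nabla f(\alpha)_i, \quad \text{and} \quad
M(\alpha) \equiv \min_{i \in I_{\text{low}}(\alpha)} -y_i \nabla f(\alpha)_i.
$$

Note that $m(\alpha)$ and $M(\alpha)$ are well defined except in the rare situation where all $y_i = 1$ (or $-1$). In this case the zero vector is the only feasible solution of (1), so the decomposition method stops at the first iteration.

Following Keerthi et al. (2001), we define a "violating pair" of the condition (6).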

Definition 1 (Violating pair). If $i \in I_{\text{up}}(\alpha)$, $j \in I_{\text{low}}(\alpha)$, and $-y_i \nabla f(\alpha)_i > -y_j \nabla f(\alpha)_j$, then $\{i, j\}$ is a "violating pair."

From (6), indices $\{i, j\}$ which most violate the optimality condition are a natural choice of the working set. They are called a "maximal violating pair" in WSS 1. It is known that violating pairs are important in the working set selection:

Theorem 2 (Hush and Scovel, 2003). Assume $Q$ is positive semi-definite. SMO-type methods have the strict decrease of the function value (i.e., $f(\alpha^{k+1}) < f(\alpha^k), \forall k$) if and only if $B$ is a violating pair.

Interestingly, the maximal violating pair is related to the first order approximation of $f(\alpha)$. As explained below, $\{i, j\}$ selected via WSS 1 satisfies

$$\{i, j\} = \arg\min_{B: |B| = 2} \text{Sub}(B), \tag{7}$$

where

$$
\begin{aligned}
\text{Sub}(B) \equiv \min_{d_B}\quad & \nabla f(\alpha^k)_B^T d_B && \text{(8a)} \\
\text{subject to}\quad & y_B^T d_B = 0, \\
& d_t \ge 0, \ \text{if } \alpha_t^k = 0,\ t \in B, && \text{(8b)} \\
& d_t \le 0, \ \text{if } \alpha_t^k = C,\ t \in B, && \text{(8c)} \\
& -1 \le d_t \le 1,\ t \in B. && \text{(8d)}
\end{aligned}
$$

Problem (7) was first considered in Joachims (1998). By defining $d^T \equiv [d_B^T, 0_N^T]$, the objective function (8a) comes from minimizing the first order approximation of $f(\alpha^k + d)$:

$$
f(\alpha^k + d) \approx f(\alpha^k) + \nabla f(\alpha^k)^T d = f(\alpha^k) + \nabla f(\alpha^k)_B^T d_B.
$$

The constraint $y_B^T d_B = 0$ is from $y^T(\alpha^k + d) = 0$ and $y^T \alpha^k = 0$. The condition $0 \le \alpha_t \le C$ leads to inequalities (8b) and (8c). As (8a) is a linear function, the inequalities $-1 \le d_t \le 1$, $t \in B$, prevent the objective value from going to $-\infty$.

A first look at (7) indicates that we may have to check all $\binom{l}{2}$ $B$'s in order to find an optimal set. Instead, WSS 1 efficiently solves (7) in $O(l)$ steps. This result is discussed in Lin (2001a, Section II), where more general settings ($|B|$ is any even integer) are considered. The proof for $|B| = 2$ is easy, so we give it in Appendix A for completeness. The convergence of the decomposition method using WSS 1 is proved in Lin (2001a, 2002).

#### 2.2 A New Working Set Selection

Instead of using the first order approximation, we may consider more accurate second order information. As $f$ is a quadratic,

$$
\begin{aligned}
f(\alpha^k + d) - f(\alpha^k) &= \nabla f(\alpha^k)^T d + \frac{1}{2} d^T \nabla^2 f(\alpha^k) d \\
&= \nabla f(\alpha^k)_B^T d_B + \frac{1}{2} d_B^T \nabla^2 f(\alpha^k)_{BB} d_B
\end{aligned}
\tag{9}
$$

is exactly the reduction of the objective value. Thus, by replacing the objective function of (8) with (9), a selection method using second order information is

$$\min_{B: |B| = 2} \text{Sub}(B), \tag{10}$$

where

$$
\begin{aligned}
\text{Sub}(B) \equiv \min_{d_B}\quad & \frac{1}{2} d_B^T \nabla^2 f(\alpha^k)_{BB} d_B + \nabla f(\alpha^k)_B^T d_B && \text{(11a)} \\
\text{subject to}\quad & y_B^T d_B = 0, && \text{(11b)} \\
& d_t \ge 0, \ \text{if } \alpha_t^k = 0,\ t \in B, && \text{(11c)} \\
& d_t \le 0, \ \text{if } \alpha_t^k = C,\ t \in B. && \text{(11d)}
\end{aligned}
$$

Note that the inequality constraints $-1 \le d_t \le 1$, $t \in B$, in (8) are removed, as later we will see that the optimal value of (11) does not go to $-\infty$. Though one expects (11) to be better than (8), $\min_{B: |B| = 2} \text{Sub}(B)$ in (10) becomes a challenging task. Unlike (7)-(8), which can be efficiently solved by WSS 1, for (10) and (11) there is no available way to avoid checking all $\binom{l}{2}$ $B$'s. Note that apart from the working set selection, the main task per decomposition iteration is calculating the two kernel columns $Q_{ti}$ and $Q_{tj}$, $t = 1, \ldots, l$. This requires $O(l)$ operations and is needed only if $Q$ is not stored. Therefore, each iteration can become $l$ times more expensive if an $O(l^2)$ working set selection is used. Moreover, simple experiments show that the number of iterations does not decrease $l$ times in return. Therefore, an $O(l^2)$ working set selection is impractical.

A viable implementation of using second order information is thus to heuristically check only several $B$'s. We propose the following new selection:

WSS 2 (Working set selection using second order information)

1. Select

   $$i \in \arg\max_t \{ -y_t \nabla f(\alpha^k)_t \mid t \in I_{\text{up}}(\alpha^k) \}.$$

2. Consider $\text{Sub}(B)$ defined in (11) and select

   $$j \in \arg\min_t \{ \text{Sub}(\{i, t\}) \mid t \in I_{\text{low}}(\alpha^k),\ -y_t \nabla f(\alpha^k)_t < -y_i \nabla f(\alpha^k)_i \}. \tag{12}$$

3. Return $B = \{i, j\}$.

By using the same $i$ as in WSS 1, we check only $O(l)$ possible $B$'s to decide $j$. Alternatively, one may choose $j \in \arg M(\alpha^k)$ (see Footnote 1) and search for $i$ in a way similar to (12). In fact, such a selection is the same as swapping labels $y$ first and then applying WSS 2, so the performance should not differ much. It is certainly possible to consider other heuristics, and the main concern is how good they are compared to the one that fully checks all $\binom{l}{2}$ sets. In Section 7 we will address this issue. Experiments indicate that a full check does not reduce the iterations of WSS 2 by much. Thus WSS 2 is already a very good way of using second order information.

Footnote 1. To simplify the notation, we denote $\arg M(\alpha)$ as $\arg\min_{t \in I_{\text{low}}(\alpha)} -y_t \nabla f(\alpha)_t$ and $\arg m(\alpha)$ as $\arg\max_{t \in I_{\text{up}}(\alpha)} -y_t \nabla f(\alpha)_t$, respectively.
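As a minimal sketch (not the authors' code), the first order rule WSS 1 of Section 2.1 can be written in pure Python as follows. The function name `wss1` and its list-based arguments are illustrative assumptions; `grad` is assumed to hold $\nabla f(\alpha^k) = Q\alpha^k - e$, and labels are assumed to be the integers `1` and `-1`.

```python
def wss1(grad, alpha, y, C):
    """Return the maximal violating pair (i, j) of WSS 1.

    Returns None when m(alpha) <= M(alpha), i.e. when condition (6)
    holds and alpha is already a stationary point.
    """
    m_val, i = float("-inf"), -1   # running max over I_up
    M_val, j = float("inf"), -1    # running min over I_low
    for t in range(len(alpha)):
        v = -y[t] * grad[t]
        # Index sets from (3).
        in_up = (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)
        in_low = (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)
        if in_up and v > m_val:
            m_val, i = v, t
        if in_low and v < M_val:
            M_val, j = v, t
    if i < 0 or j < 0 or m_val <= M_val:
        return None
    return i, j


# At alpha = 0 the gradient is -e, so -y_t * grad_t equals y_t.
pair = wss1(grad=[-1.0, -1.0, -1.0], alpha=[0.0, 0.0, 0.0], y=[1, -1, 1], C=1.0)
print(pair)
```

The single pass over all indices reflects the $O(l)$ cost mentioned in Section 2.1; WSS 2 keeps the same first loop for $i$ and only changes how $j$ is scored.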

Despite the above issue of how to effectively use second order information, the real challenge is whether the new WSS 2 can lead to shorter training time than WSS 1. The two selection methods differ only in selecting $j$, so we can also consider WSS 2 as a direct attempt to improve WSS 1. The following theorem shows that one can efficiently solve (11), so the working set selection WSS 2 does not cost a lot more than WSS 1.

Theorem 3. If $B = \{i, j\}$ is a violating pair and $K_{ii} + K_{jj} - 2K_{ij} > 0$, then (11) has the optimal objective value

$$-\frac{(-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j)^2}{2(K_{ii} + K_{jj} - 2K_{ij})}.$$

Proof. Define $\hat{d}_i \equiv y_i d_i$ and $\hat{d}_j \equiv y_j d_j$. From $y_B^T d_B = 0$, we have $\hat{d}_i = -\hat{d}_j$ and

$$
\begin{aligned}
& \frac{1}{2}\begin{bmatrix} d_i & d_j \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{bmatrix}
\begin{bmatrix} d_i \\ d_j \end{bmatrix}
+ \begin{bmatrix} \nabla f(\alpha^k)_i & \nabla f(\alpha^k)_j \end{bmatrix}
\begin{bmatrix} d_i \\ d_j \end{bmatrix} \\
&\quad = \frac{1}{2}(K_{ii} + K_{jj} - 2K_{ij})\hat{d}_j^2 + (-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j)\hat{d}_j.
\end{aligned}
\tag{13}
$$

Since $K_{ii} + K_{jj} - 2K_{ij} > 0$ and $B$ is a violating pair, we can define

$$a_{ij} \equiv K_{ii} + K_{jj} - 2K_{ij} > 0 \quad \text{and} \quad b_{ij} \equiv -y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j > 0. \tag{14}$$

Then (13) has the minimum at

$$\hat{d}_j = -\hat{d}_i = -\frac{b_{ij}}{a_{ij}} < 0, \tag{15}$$

and

$$\text{the objective function (11a)} = -\frac{b_{ij}^2}{2a_{ij}}.$$

Moreover, we can show that $\hat{d}_i$ and $\hat{d}_j$ ($d_i$ and $d_j$) indeed satisfy (11c)-(11d). If $j \in I_{\text{low}}(\alpha^k)$, $\alpha_j^k = 0$ implies $y_j = -1$ and hence $d_j = y_j \hat{d}_j > 0$, a condition required by (11c). Other cases are similar. Thus $\hat{d}_i$ and $\hat{d}_j$ defined in (15) are optimal for (11). ∎

Note that if $K$ is positive definite, then for any $i \ne j$, $K_{ii} + K_{jj} - 2K_{ij} > 0$. Using Theorem 3, (12) in WSS 2 is reduced to a very simple form:

$$j \in \arg\min_t \left\{ -\frac{b_{it}^2}{a_{it}} \;\middle|\; t \in I_{\text{low}}(\alpha^k),\ -y_t \nabla f(\alpha^k)_t < -y_i \nabla f(\alpha^k)_i \right\},$$

where $a_{it}$ and $b_{it}$ are defined in (14). If $K$ is not positive definite, the leading coefficient $a_{ij}$ of (13) may be non-positive. This situation will be addressed in the next sub-section.

Note that (8) and (11) are used only for selecting the working set, so they do not have to maintain the feasibility $0 \le \alpha_i^k + d_i \le C$, $\forall i \in B$. On the contrary, feasibility must hold for the sub-problem (2) used to obtain $\alpha^{k+1}$ after $B$ is determined. There are some earlier attempts to use second order information for selecting working sets (e.g., Lai et al., 2003a,b), but they always check the feasibility. Solving sub-problems during the selection procedure is then more complicated than solving the sub-problem (11). Besides, these earlier approaches do not provide any convergence analysis. In Section 6 we will investigate the issue of maintaining feasibility in the selection procedure, and explain why using (11) is better.

#### 2.3 Non-Positive Definite Kernel Matrices

Theorem 3 does not hold if $K_{ii} + K_{jj} - 2K_{ij} \le 0$. For the linear kernel, sometimes $K$ is only positive semi-definite, so it is possible that $K_{ii} + K_{jj} - 2K_{ij} = 0$. Moreover, some existing kernel functions (e.g., the sigmoid kernel) are not the inner product of two vectors, so $K$ is not even positive semi-definite. Then $K_{ii} + K_{jj} - 2K_{ij} < 0$ may occur and (13) is a concave objective function.

Once $B$ is decided, the same difficulty occurs for the sub-problem (2) to obtain $\alpha^{k+1}$. Note that (2) differs from (11) only in constraints; (2) strictly requires the feasibility $0 \le \alpha_t^k + d_t \le C$, $\forall t \in B$. Therefore, (2) also has a concave objective function if $K_{ii} + K_{jj} - 2K_{ij} < 0$. In this situation, (2) may possess multiple local minima. Moreover, there are difficulties in proving the convergence of the decomposition methods (Palagi and Sciandrone, 2005; Chen et al., 2006). Thus, Chen et al. (2006) proposed adding an additional term to (2)'s objective function if $a_{ij} \equiv K_{ii} + K_{jj} - 2K_{ij} \le 0$:

$$
\begin{aligned}
\min_{\alpha_i, \alpha_j}\quad & \frac{1}{2}\begin{bmatrix}\alpha_i & \alpha_j\end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{bmatrix}
\begin{bmatrix}\alpha_i \\ \alpha_j\end{bmatrix}
+ (-e_B + Q_{BN}\alpha_N^k)^T \begin{bmatrix}\alpha_i \\ \alpha_j\end{bmatrix}
+ \frac{\tau - a_{ij}}{4}\left((\alpha_i - \alpha_i^k)^2 + (\alpha_j - \alpha_j^k)^2\right) \\
\text{subject to}\quad & 0 \le \alpha_i, \alpha_j \le C, \\
& y_i\alpha_i + y_j\alpha_j = -y_N^T \alpha_N^k,
\end{aligned}
\tag{16}
$$

where $\tau$ is a small positive number. By defining $\hat{d}_i \equiv y_i(\alpha_i - \alpha_i^k)$ and $\hat{d}_j \equiv y_j(\alpha_j - \alpha_j^k)$, (16)'s objective function, in a form similar to (13), is

$$\frac{1}{2}\tau \hat{d}_j^2 + b_{ij}\hat{d}_j, \tag{17}$$

where $b_{ij}$ is defined as in (14). The new objective function is thus strictly convex. If $\{i, j\}$ is a violating pair, then a careful check shows that there is $\hat{d}_j < 0$ which leads to a negative value in (17) and maintains the feasibility of (16). Therefore, we can find $\alpha^{k+1} \ne \alpha^k$ satisfying $f(\alpha^{k+1}) < f(\alpha^k)$. More details are in Chen et al. (2006).

For selecting the working set, we consider a similar modification: If $B = \{i, j\}$ and $a_{ij}$ is defined as in (14), then (11) is modified to:

$$
\begin{aligned}
\text{Sub}(B) \equiv \min_{d_B}\quad & \frac{1}{2} d_B^T \nabla^2 f(\alpha^k)_{BB} d_B + \nabla f(\alpha^k)_B^T d_B + \frac{\tau - a_{ij}}{4}(d_i^2 + d_j^2) \\
\text{subject to}\quad & \text{constraints of (11)}.
\end{aligned}
\tag{18}
$$

Note that (18) differs from (16) only in constraints. In (18) we do not maintain the feasibility of $\alpha_t^k + d_t$, $t \in B$. We are allowed to do so because (18) is used only for identifying the working set $B$.
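The closed form of Theorem 3 can be checked numerically on a single pair. The sketch below is illustrative only: the function names and the sample values ($K_{ii} = K_{jj} = 1$, $K_{ij} = 0.5$, $\alpha^k = 0$ so that the gradient is $-e$) are assumptions chosen for the example. It evaluates $-b_{ij}^2 / (2a_{ij})$ and compares it with the original two-variable objective (11a) at the minimizer (15).

```python
def theorem3_value(K_ii, K_jj, K_ij, y_i, y_j, g_i, g_j):
    """Closed-form optimum of (11) for a violating pair with a_ij > 0."""
    a = K_ii + K_jj - 2.0 * K_ij          # a_ij in (14)
    b = -y_i * g_i + y_j * g_j            # b_ij in (14)
    if a <= 0 or b <= 0:
        raise ValueError("Theorem 3 needs a_ij > 0 and a violating pair")
    d_hat_j = -b / a                      # minimizer (15)
    return -b * b / (2.0 * a), d_hat_j


def objective_11a(K_ii, K_jj, K_ij, y_i, y_j, g_i, g_j, d_i, d_j):
    """Evaluate (11a) directly, using Q_ij = y_i * y_j * K_ij."""
    Q_ij = y_i * y_j * K_ij
    quad = 0.5 * (K_ii * d_i * d_i + 2.0 * Q_ij * d_i * d_j + K_jj * d_j * d_j)
    return quad + g_i * d_i + g_j * d_j


# Hypothetical pair: i has label +1, j has label -1, alpha^k = 0.
args = dict(K_ii=1.0, K_jj=1.0, K_ij=0.5, y_i=1, y_j=-1, g_i=-1.0, g_j=-1.0)
value, d_hat_j = theorem3_value(**args)
d_i = args["y_i"] * (-d_hat_j)            # d_i = y_i * d_hat_i, d_hat_i = -d_hat_j
d_j = args["y_j"] * d_hat_j               # d_j = y_j * d_hat_j
direct = objective_11a(d_i=d_i, d_j=d_j, **args)
print(value, direct)
```

Both numbers agree, and the step satisfies $y_i d_i + y_j d_j = 0$, which is the equality constraint (11b).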

By reformulating (18) to (17) and following the same argument as in Theorem 3, the optimal objective value of (18) is

$$-\frac{b_{ij}^2}{2\tau}.$$

Therefore, a generalized working set selection is as follows:

WSS 3 (Working set selection using second order information: any symmetric $K$)

1. Define $a_{ts}$ and $b_{ts}$ as in (14), and

   $$
   \bar{a}_{ts} \equiv
   \begin{cases}
   a_{ts} & \text{if } a_{ts} > 0, \\
   \tau & \text{otherwise.}
   \end{cases}
   \tag{19}
   $$

   Select

   $$
   \begin{aligned}
   i &\in \arg\max_t \{ -y_t \nabla f(\alpha^k)_t \mid t \in I_{\text{up}}(\alpha^k) \}, \\
   j &\in \arg\min_t \left\{ -\frac{b_{it}^2}{\bar{a}_{it}} \;\middle|\; t \in I_{\text{low}}(\alpha^k),\ -y_t \nabla f(\alpha^k)_t < -y_i \nabla f(\alpha^k)_i \right\}.
   \end{aligned}
   \tag{20}
   $$

2. Return $B = \{i, j\}$.

In summary, an SMO-type decomposition method using WSS 3 for the working set selection is:

Algorithm 2 (An SMO-type decomposition method using WSS 3)

1. Find $\alpha^1$ as the initial feasible solution. Set $k = 1$.
2. If $\alpha^k$ is a stationary point of (1), stop. Otherwise, find a working set $B = \{i, j\}$ by WSS 3.
3. Let $a_{ij}$ be defined as in (14). If $a_{ij} > 0$, solve the sub-problem (2). Otherwise, solve (16). Set $\alpha_B^{k+1}$ to be the optimal point of the sub-problem.
4. Set $\alpha_N^{k+1} \equiv \alpha_N^k$. Set $k \leftarrow k + 1$ and go to Step 2.

In the next section we study theoretical properties of using WSS 3.

### 3. Theoretical Properties

To obtain theoretical properties of using WSS 3, we consider the work of Chen et al. (2006), which gives a general study of SMO-type decomposition methods. It considers Algorithm 2 but replaces WSS 3 with a general working set selection (see Footnote 2):

WSS 4 (A general working set selection discussed in Chen et al., 2006)

1. Consider a fixed $0 < \sigma \le 1$ for all iterations.
2. Select any $i \in I_{\text{up}}(\alpha^k)$, $j \in I_{\text{low}}(\alpha^k)$ satisfying

   $$-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j \ge \sigma (m(\alpha^k) - M(\alpha^k)) > 0. \tag{21}$$

3. Return $B = \{i, j\}$.

Footnote 2. In fact, Chen et al. (2006) consider an even more general framework for selecting working sets, but for ease of description, we discuss WSS 4 here.

Clearly (21) ensures the quality of the selected pair by linking it to the maximal violating pair. It is easy to see that WSS 3 is a special case of WSS 4: Assume $B = \{i, j\}$ is the set returned from WSS 3 and $\bar{j} \in \arg M(\alpha^k)$. Since WSS 3 selects $i \in \arg m(\alpha^k)$, with $\bar{a}_{ij} > 0$ and $\bar{a}_{i\bar{j}} > 0$, (20) in WSS 3 implies

$$-\frac{(-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j)^2}{\bar{a}_{ij}} \le -\frac{(m(\alpha^k) - M(\alpha^k))^2}{\bar{a}_{i\bar{j}}}.$$

Thus,

$$-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j \ge \sqrt{\frac{\min_{t,s} \bar{a}_{t,s}}{\max_{t,s} \bar{a}_{t,s}}}\,(m(\alpha^k) - M(\alpha^k)),$$

an inequality satisfying (21) for $\sigma = \sqrt{\min_{t,s} \bar{a}_{t,s} / \max_{t,s} \bar{a}_{t,s}}$. Therefore, all theoretical properties proved in Chen et al. (2006) hold here. They are listed below.

It is known that the decomposition method may not converge to an optimal solution if improper methods of working set selection are used. We thus must prove that the proposed selection leads to convergence.

Theorem 4 (Asymptotic convergence (Chen et al., 2006, Theorem 3 and Corollary 1)). Let $\{\alpha^k\}$ be the infinite sequence generated by the SMO-type method Algorithm 2. Then any limit point of $\{\alpha^k\}$ is a stationary point of (1). Moreover, if $Q$ is positive definite, $\{\alpha^k\}$ globally converges to the unique minimum of (1).

As the decomposition method only asymptotically approaches an optimum, in practice it is terminated after satisfying a stopping condition. For example, we can pre-specify a small tolerance $\epsilon > 0$ and check if the maximal violation is small enough:

$$m(\alpha^k) - M(\alpha^k) \le \epsilon. \tag{22}$$

Alternatively, one may check if the selected working set $\{i, j\}$ satisfies

$$-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j \le \epsilon, \tag{23}$$

because (21) implies $m(\alpha^k) - M(\alpha^k) \le \epsilon / \sigma$. These are reasonable stopping criteria due to their closeness to the optimality condition (6). To avoid an infinite loop, we must have that under any $\epsilon > 0$, Algorithm 2 stops in a finite number of iterations. The finite termination of using (22) or (23) as the stopping condition is implied by (26) of Theorem 5 stated below.

Shrinking and caching (Joachims, 1998) are two effective techniques to make the decomposition method faster. The former removes some bounded components during iterations, so smaller reduced problems are considered. The latter allocates some memory space (called the cache) to store recently used $Q_{ij}$, and may significantly reduce the number of kernel evaluations. The following theorem explains why these two techniques are useful in practice:

Theorem 5 (Finite termination and explanation of caching and shrinking techniques (Chen et al., 2006, Theorems 4 and 6)). Assume $Q$ is positive semi-definite.

1. The following set is independent of any optimal solution $\bar{\alpha}$:

   $$I \equiv \{ i \mid -y_i \nabla f(\bar{\alpha})_i > M(\bar{\alpha}) \ \text{ or } \ -y_i \nabla f(\bar{\alpha})_i < m(\bar{\alpha}) \}. \tag{24}$$

   Problem (1) has unique and bounded optimal solutions at $\alpha_i$, $i \in I$.
2. Assume Algorithm 2 generates an infinite sequence $\{\alpha^k\}$. There is $\bar{k}$ such that after $k \ge \bar{k}$, every $\alpha_i^k$, $i \in I$, has reached the unique and bounded optimal solution. It remains the same in all subsequent iterations and $\forall k \ge \bar{k}$:

   $$i \notin \{ t \mid M(\alpha^k) \le -y_t \nabla f(\alpha^k)_t \le m(\alpha^k) \}. \tag{25}$$

3. If (1) has an optimal solution $\bar{\alpha}$ satisfying $m(\bar{\alpha}) < M(\bar{\alpha})$, then $\bar{\alpha}$ is the unique solution and Algorithm 2 reaches it in a finite number of iterations.
4. If $\{\alpha^k\}$ is an infinite sequence, then the following two limits exist and are equal:

   $$\lim_{k \to \infty} m(\alpha^k) = \lim_{k \to \infty} M(\alpha^k) = m(\bar{\alpha}) = M(\bar{\alpha}), \tag{26}$$

   where $\bar{\alpha}$ is any optimal solution.

Finally, the following theorem shows that Algorithm 2 is linearly convergent under some assumptions:

Theorem 6 (Linear convergence (Chen et al., 2006, Theorem 8)). Assume problem (1) satisfies

1. $Q$ is positive definite. Therefore, (1) has a unique optimal solution $\bar{\alpha}$.
2. The nondegeneracy condition. That is, the optimal solution $\bar{\alpha}$ satisfies

   $$\nabla f(\bar{\alpha})_i + b y_i = 0 \ \text{ if and only if } \ 0 < \bar{\alpha}_i < C, \tag{27}$$

   where $b = m(\bar{\alpha}) = M(\bar{\alpha})$ according to Theorem 5.

For the sequence $\{\alpha^k\}$ generated by Algorithm 2, there are $c < 1$ and $\bar{k}$ such that for all $k \ge \bar{k}$,

$$f(\alpha^{k+1}) - f(\bar{\alpha}) \le c\,(f(\alpha^k) - f(\bar{\alpha})).$$

This theorem indicates how fast the SMO-type method Algorithm 2 converges. For any fixed problem (1) and a given tolerance $\epsilon$, there is $\bar{k}$ such that within $\bar{k} + O(\log(1/\epsilon))$ iterations,

$$f(\alpha^k) - f(\bar{\alpha}) \le \epsilon.$$

Note that $O(\log(1/\epsilon))$ iterations are necessary for decomposition methods according to the analysis in Lin (2001b) (see Footnote 3). Hence the result of linear convergence here is already the best worst case analysis.

Footnote 3. Lin (2001b) gave a three-variable example and explained that the SMO-type method using WSS 1 is linearly convergent. A careful check shows that the same result holds for any method of working set selection.
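Putting Sections 2 and 3 together, Algorithm 2 with WSS 3 and the stopping condition (22) can be sketched in pure Python as below. This is a hedged illustration, not the authors' implementation or LIBSVM's code: the function name `smo_wss3`, the dense list-of-lists kernel matrix `K`, the defaults `eps=1e-3` and `tau=1e-12`, and the `max_iter` safeguard are all assumptions made for the example, and neither shrinking nor caching is included. The two-variable sub-problem is solved by taking the unconstrained step (15) and clipping it back onto the feasible segment.

```python
def smo_wss3(K, y, C, eps=1e-3, tau=1e-12, max_iter=100000):
    """Sketch of Algorithm 2: SMO with WSS 3 on a dense kernel matrix K."""
    l = len(y)
    alpha = [0.0] * l
    grad = [-1.0] * l  # gradient Q*alpha - e at alpha = 0

    def select():
        # i attains m(alpha) over I_up.
        i, g_max = -1, float("-inf")
        for t in range(l):
            if (y[t] == 1 and alpha[t] < C) or (y[t] == -1 and alpha[t] > 0):
                if -y[t] * grad[t] > g_max:
                    i, g_max = t, -y[t] * grad[t]
        # j minimizes -b_it^2 / abar_it over violating t in I_low, see (20).
        j, g_min, obj_min = -1, float("inf"), float("inf")
        for t in range(l):
            if (y[t] == 1 and alpha[t] > 0) or (y[t] == -1 and alpha[t] < C):
                g_min = min(g_min, -y[t] * grad[t])
                b = g_max + y[t] * grad[t]
                if b > 0:
                    a = K[i][i] + K[t][t] - 2.0 * K[i][t]
                    if a <= 0:
                        a = tau  # abar in (19)
                    if -(b * b) / a < obj_min:
                        j, obj_min = t, -(b * b) / a
        # Stopping condition (22): m(alpha) - M(alpha) <= eps.
        if i < 0 or j < 0 or g_max - g_min <= eps:
            return None
        return i, j

    for _ in range(max_iter):
        pair = select()
        if pair is None:
            break
        i, j = pair
        a = K[i][i] + K[j][j] - 2.0 * K[i][j]
        if a <= 0:
            a = tau
        b = -y[i] * grad[i] + y[j] * grad[j]
        old_i, old_j = alpha[i], alpha[j]
        s = y[i] * old_i + y[j] * old_j  # kept fixed by the equality constraint
        # Unconstrained step, then clip onto the box along the constraint line.
        ai = min(C, max(0.0, old_i + y[i] * b / a))
        aj = min(C, max(0.0, y[j] * (s - y[i] * ai)))
        ai = y[i] * (s - y[j] * aj)
        alpha[i], alpha[j] = ai, aj
        di, dj = ai - old_i, aj - old_j
        for t in range(l):  # grad += Q[:, i] * di + Q[:, j] * dj
            grad[t] += y[t] * (y[i] * K[t][i] * di + y[j] * K[t][j] * dj)
    return alpha, grad


# Two 1-D points x = +1 (label +1) and x = -1 (label -1), linear kernel.
K = [[1.0, -1.0], [-1.0, 1.0]]
alpha, grad = smo_wss3(K, y=[1, -1], C=10.0)
print(alpha)
```

On this toy problem the constraint forces $\alpha_1 = \alpha_2$, and minimizing $2\alpha^2 - 2\alpha$ gives $\alpha_1 = \alpha_2 = 0.5$, which the sketch reaches in one iteration. With a smaller bound such as `C=0.25` both variables stop at the bound instead.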

### 4. Extensions

The proposed WSS 3 can be directly used for training support vector regression (SVR) and one-class SVM because they solve problems similar to (1). More detailed discussion about applying WSS 4 (and hence WSS 3) to SVR and one-class SVM is in Chen et al. (2006, Section IV).

Another formula which needs special attention is ν-SVM (Schölkopf et al., 2000), which solves a problem with one more linear constraint:

$$
\begin{aligned}
\min_{\alpha}\quad & f(\alpha) = \frac{1}{2}\alpha^T Q \alpha \\
\text{subject to}\quad & y^T \alpha = 0, \\
& e^T \alpha = \nu, \\
& 0 \le \alpha_i \le 1/l,\ i = 1, \ldots, l,
\end{aligned}
\tag{28}
$$

where $e$ is the vector of all ones and $0 \le \nu \le 1$.

Similar to (6), $\alpha$ is a stationary point of (28) if and only if it satisfies

$$m_p(\alpha) \le M_p(\alpha) \quad \text{and} \quad m_n(\alpha) \le M_n(\alpha), \tag{29}$$

where

$$
\begin{aligned}
m_p(\alpha) &\equiv \max_{i \in I_{\text{up}}(\alpha),\, y_i = 1} -y_i \nabla f(\alpha)_i, &
M_p(\alpha) &\equiv \min_{i \in I_{\text{low}}(\alpha),\, y_i = 1} -y_i \nabla f(\alpha)_i, \text{ and} \\
m_n(\alpha) &\equiv \max_{i \in I_{\text{up}}(\alpha),\, y_i = -1} -y_i \nabla f(\alpha)_i, &
M_n(\alpha) &\equiv \min_{i \in I_{\text{low}}(\alpha),\, y_i = -1} -y_i \nabla f(\alpha)_i.
\end{aligned}
$$

A detailed derivation is in, for example, Chen et al. (2006, Section VI).

In an SMO-type method for ν-SVM the selected working set $B = \{i, j\}$ must satisfy $y_i = y_j$. Otherwise, if $y_i \ne y_j$, then the two linear equalities make the sub-problem have only one feasible point $\alpha_B^k$. Therefore, to select the working set, one considers positive (i.e., $y_i = 1$) and negative (i.e., $y_i = -1$) instances separately. Existing implementations such as LIBSVM (Chang and Lin, 2001) check violating pairs in each part and select the one with the largest violation. This strategy is an extension of WSS 1. By a derivation similar to that in Section 2, the selection can also be from a first or second order approximation of the objective function. Using $\text{Sub}(\{i, j\})$ defined in (11), WSS 2 in Section 2 is modified to

WSS 5 (Extending WSS 2 for ν-SVM)

1. Find

   $$
   \begin{aligned}
   i_p &\in \arg m_p(\alpha^k), \\
   j_p &\in \arg\min_t \{ \text{Sub}(\{i_p, t\}) \mid y_t = 1,\ t \in I_{\text{low}}(\alpha^k),\ -y_t \nabla f(\alpha^k)_t < -y_{i_p} \nabla f(\alpha^k)_{i_p} \}.
   \end{aligned}
   $$

2. Find

   $$
   \begin{aligned}
   i_n &\in \arg m_n(\alpha^k), \\
   j_n &\in \arg\min_t \{ \text{Sub}(\{i_n, t\}) \mid y_t = -1,\ t \in I_{\text{low}}(\alpha^k),\ -y_t \nabla f(\alpha^k)_t < -y_{i_n} \nabla f(\alpha^k)_{i_n} \}.
   \end{aligned}
   $$

3. Check $\text{Sub}(\{i_p, j_p\})$ and $\text{Sub}(\{i_n, j_n\})$. Return the set with a smaller value.

By Theorem 3 in Section 2, it is easy to solve $\text{Sub}(\{i_p, t\})$ and $\text{Sub}(\{i_n, t\})$ in the above procedure.

### 5. Experiments

In this section we aim at comparing the proposed WSS 3 with WSS 1, which selects the maximal violating pair. As indicated in Section 2, they differ only in finding the second element $j$: WSS 1 checks the first order approximation of the objective function, but WSS 3 uses second order information.

#### 5.1 Data and Experimental Settings

First, some small data sets (around 1,000 samples) including ten binary classification and five regression problems are investigated under various settings. Secondly, observations are further confirmed by using four large (more than 30,000 instances) classification problems. Data statistics are in Tables 1 and 3.

Table 1: Data statistics for small problems (left two columns: classification, right column: regression). Columns: Problem, #data, #feat. ∗: subset of the original problem. [Table body not recoverable from the transcription.]

Problems […] and […] are from the Statlog collection (Michie et al., 1994). We select space ga and […] from StatLib. The data sets […], […], covtype, […], and […] are from the UCI machine learning repository (Newman et al., 1998). Problems a1a and a9a are compiled in Platt (1998) from the UCI "adult" data set. Problems w1a and w8a are also from Platt (1998). The data set […] was originally used in Bailey et al. (1993). The problem […] is a Mackey-Glass time series. The data sets […] and […] are from the Delve archive. Problem […] is from Ho and Kleinberg (1996) and we further transform it to a two-class set. The problem IJCNN1 is from the first problem of the IJCNN 2001 challenge (Prokhorov, 2001).

For most data sets each attribute is linearly scaled to $[-1, 1]$. We do not scale a1a, a9a, w1a, and w8a as they take the two values 0 and 1. Another exception is covtype, in which 44 of 54 features have 0/1 values. We scale only the other ten features to $[0, 1]$. All data are available at […].

We use LIBSVM (version 2.71) (Chang and Lin, 2001), an implementation of WSS 1, for experiments. An easy modification to WSS 3 ensures that the two codes differ only in the working set implementation. We set $\tau = 10^{-12}$ in WSS 3.

Different SVM parameters such as $C$ in (1) and kernel parameters affect the training time. It is difficult to evaluate the two methods under every parameter setting. To have a fair comparison, we simulate how one uses SVM in practice and consider the following procedure:

1. "Parameter selection" step: Conduct five-fold cross validation to find the best one within a given set of parameters.
2. "Final training" step: Train the whole set with the best parameter to obtain the final model.

For each step we check time and iterations using the two methods of working set selection. For some extreme parameters (e.g., very large or small values) in the "parameter selection" step, the decomposition method converges very slowly, so the comparison shows if the proposed WSS 3 saves time under difficult situations. On the other hand, the best parameter usually lies in a more normal region, so the "final training" step tests if WSS 3 is competitive with WSS 1 for easier cases.

The behavior of using different kernels is a concern, so we thoroughly test four commonly used kernels:

1. RBF kernel:

   $$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}.$$

2. Linear kernel:

   $$K(x_i, x_j) = x_i^T x_j.$$

3. Polynomial kernel:

   $$K(x_i, x_j) = (\gamma (x_i^T x_j + 1))^d.$$

4. Sigmoid kernel:

   $$K(x_i, x_j) = \tanh(\gamma x_i^T x_j + d).$$

   Note that this function cannot be represented as $\phi(x_i)^T \phi(x_j)$ under some parameters. Then the matrix $Q$ is not positive semi-definite. Experimenting with this kernel tests if our extension to indefinite kernels in Section 2.3 works well or not.

Parameters used for each kernel are listed in Table 2. Note that as SVR has an additional parameter $\epsilon$, to save running time we may not consider as many values of the other parameters as in classification.

It is important to check how WSS 3 performs after incorporating shrinking and caching strategies. Such techniques may effectively save kernel evaluations at each iteration, so the higher cost of WSS 3 is a concern. We consider various settings:

1. With or without shrinking.
2. Different cache size: First, a 40MB cache allows the whole kernel matrix to be stored in the computer memory. Second, we allocate only 100K, so cache misses may happen and more kernel evaluations are needed. The second setting simulates the training of large-scale sets whose kernel matrices cannot be stored.
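The four kernels listed above can be sketched in pure Python as follows. The function names and the sample points are illustrative assumptions. The last lines compute the coefficient $a_{ij} = K_{ii} + K_{jj} - 2K_{ij}$ from (14) for one hypothetical pair, showing that it can be negative for the sigmoid kernel (the situation handled by $\tau$ in Section 2.3) while it stays positive for the RBF kernel on distinct points.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(xi, xj, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))


def linear_kernel(xi, xj):
    return dot(xi, xj)


def polynomial_kernel(xi, xj, gamma, d):
    return (gamma * (dot(xi, xj) + 1.0)) ** d


def sigmoid_kernel(xi, xj, gamma, d):
    return math.tanh(gamma * dot(xi, xj) + d)


def a_coefficient(kernel, xi, xj):
    """a_ij = K_ii + K_jj - 2 K_ij as in (14), for a one-argument-pair kernel."""
    return kernel(xi, xi) + kernel(xj, xj) - 2.0 * kernel(xi, xj)


# Hypothetical 1-D points chosen only to illustrate the sign of a_ij.
xi, xj = [1.0], [3.0]
a_sigmoid = a_coefficient(lambda u, v: sigmoid_kernel(u, v, gamma=1.0, d=0.0), xi, xj)
a_rbf = a_coefficient(lambda u, v: rbf_kernel(u, v, gamma=1.0), xi, xj)
print(a_sigmoid < 0, a_rbf > 0)
```

For the sigmoid case here, $\tanh(1) + \tanh(9) - 2\tanh(3)$ is negative because $\tanh$ is concave on the positive axis, so (13) would be concave for this pair without the modification in (18).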

Table 2: Parameters used for various kernels: values of each parameter are from a uniform discretization of an interval. We list the left and right end points and the spacing of the discretization. For example, "−5, 15, 2" for $\log_2 C$ means $\log_2 C = -5, -3, \ldots, 15$. Columns: Kernel, Problem type, $\log_2 C$, $\log_2 \gamma$, $\log_2 \epsilon$, $d$; rows: RBF, Linear, Polynomial, and Sigmoid, each for Classification and Regression. [Table values not recoverable from the transcription.]

#### 5.2 Results

For each kernel, we give two figures showing results of the "parameter selection" and "final training" steps, respectively. We further separate each figure into two scenarios: without/with shrinking, and present three ratios between using WSS 3 and using WSS 1:

$$
\begin{aligned}
\text{ratio 1} &\equiv \frac{\text{\# iter. by Alg. 2 with WSS 3}}{\text{\# iter. by Alg. 2 with WSS 1}}, \\
\text{ratio 2} &\equiv \frac{\text{time by Alg. 2 (WSS 3, 100K cache)}}{\text{time by Alg. 2 (WSS 1, 100K cache)}}, \\
\text{ratio 3} &\equiv \frac{\text{time by Alg. 2 (WSS 3, 40M cache)}}{\text{time by Alg. 2 (WSS 1, 40M cache)}}.
\end{aligned}
$$

Note that the number of iterations is independent of the cache size. For the "parameter selection" step, time (or iterations) of all parameters is summed up before calculating the ratio. In general the "final training" step is very fast, so the timing result may not be accurate. Hence we repeat this step several times to obtain more reliable timing values. Figures 1-8 present the obtained ratios. They are in general smaller than one, so using WSS 3 is really better than using WSS 1. Before describing other results, we explain an interesting observation: In these figures, if shrinking is not used, in general

$$\text{ratio 1} < \text{ratio 2} < \text{ratio 3}. \tag{30}$$

Of the two very different cache sizes, one is too small to store the kernel matrix, but the other is large enough. Thus, roughly we have

$$
\begin{aligned}
\text{time per Alg. 2 iteration (100K cache)} &\approx \text{Calculating two } Q \text{ columns} + \text{Selection}, \\
\text{time per Alg. 2 iteration (40M cache)} &\approx \text{Selection}.
\end{aligned}
\tag{31}
$$

If shrinking is not used, the optimization problem is not reduced and hence

$$\frac{\text{time by Alg. 2}}{\text{\# iter. of Alg. 2}} \approx \text{cost per iteration} \approx \text{constant}. \tag{32}$$
