Dual unification of biclass Support Vector Machine formulations
L. González^a, C. Angulo^b, F. Velasco^a and A. Català^b

a COSDE Group, Depto. de Economía Aplicada I, Universidad de Sevilla, E408, Sevilla, Spain
b GREC, Universitat Politècnica de Catalunya, E08800, Vilanova i Geltrú, Spain

Abstract

Support Vector Machine (SVM) theory was originally developed on the basis of a linearly separable binary classification problem, and other approaches have later been introduced for this problem. In this paper it is demonstrated that all these approaches admit the same dual problem formulation in the linearly separable case and that all the solutions are equivalent. For the nonlinearly separable case, all the approaches can also be formulated as a unique dual optimization problem; however, their solutions are not equivalent. Discussions and remarks in the article point to an in-depth comparison between SVM formulations and their associated parameters.

Key words: SVM; large margin principle; biclassification; optimization; convex hull.

1 Introduction

Support Vector Machines are learning machines which implement the structural risk minimization inductive principle to obtain good generalization on a limited number of learning patterns. This theory was originally developed by V. Vapnik on the basis of a linearly separable binary classification problem with signed outputs ±1 [1]. SVM presents sound theoretical properties and behavior in problems of binary classification [2]. Many papers generalizing the original biclass approach to multiclassification problems [3,4] through different algorithms exist, such as 1-v-r SVM or 1-v-1 SVM. This paper unifies known dual formulations for biclass SVM approaches and improves their generalization ability when the proposed approach is used for multiclassification problems.

Preprint submitted to Elsevier Science, August 2005
This paper is organized as follows: in Section 2, the standard SVM classification learning paradigm is briefly presented in order to introduce some notation. Several SVM approaches are shown and it is demonstrated that all the approaches can be formulated as a unique dual optimization problem in the linearly separable case, where all the solutions are equivalent. Section 3 is devoted to the nonlinearly separable case: a theorem is derived which indicates that all the approaches can again be formulated as a unique dual optimization problem, but their solutions are no longer equivalent. Finally, some concluding remarks are made.

2 Biclass SVM Learning

The SVM is an implementation of a more general regularization principle known as the large margin principle [2]. Let Z = ((x_1, y_1), ..., (x_n, y_n)) = (z_1, ..., z_n) ∈ (X × Y)^n be a training set, with X the input space and Y = {θ_1, θ_2} = {+1, −1} the output space. Let φ : X → F ⊆ R^d, with φ = (φ_1, ..., φ_d), be a feature mapping for the usual kernel trick; F is named the feature space. Let x := φ(x) ∈ F be the representation of x ∈ X. A (binary) linear classifier, f_w(x) = ⟨φ(x), w⟩ − b = ⟨x, w⟩ − b, is sought in the space F, with f_w : X → R, and outputs are obtained by thresholding f_w, h_w(x) = sign(f_w(x)). The term b is called the bias. The optimal separating hyperplane π_b identified by the linear classifier is given by {x ∈ X : ⟨φ(x), w⟩ = b}. It is in canonical form [5] w.r.t. Z when

    min_{z_i ∈ Z} |⟨x_i, w⟩ − b| = 1.    (1)

The exact definition of the margin between classes varies according to the authors. Usually the margin is defined [6] as 2/||w||, by assuming that the restriction in (1) is achieved at some point in both classes. However, the exact equality for both classes is not absolutely necessary; it could be attained in only one class when nonlinear problems are considered.
Hence, in [5] the margin is defined as the distance from the point of either class which is closest to the hyperplane π_b, and is given by 1/||w||, whereas in [7] the margin is defined as twice this distance.

In this work, classes will be initially considered to be linearly separable in the feature space. Let w be the director vector of a separating hyperplane, which exists since the classes are linearly separable. Let β and α be the minimum and maximum values for each class in Y = {θ_1, θ_2} = {+1, −1}, effectively
attained for some patterns z ∈ Z_1 and z ∈ Z_2 since both sets are finite:

    β = min_{z_i ∈ Z_1} ⟨x_i, w⟩ = min_{z_i ∈ Z_1} y_i ⟨x_i, w⟩,
    α = max_{z_j ∈ Z_2} ⟨x_j, w⟩ = −min_{z_j ∈ Z_2} y_j ⟨x_j, w⟩,    (2)

where Z_1 and Z_2 are respectively the patterns belonging to the classes labeled {θ_1, θ_2} = {+1, −1}. We can consider α ≤ β; otherwise we consider the vector −w. A natural choice for the bias, ensuring positive and negative outputs for the patterns in the respective classes, would be

    b = (α + β)/2.    (3)

In this paper, the margin is defined as the distance between the parallel hyperplanes π_α : ⟨w, x⟩ − α = 0 and π_β : ⟨w, x⟩ − β = 0,

    d(π_α, π_β) = (β − α)/||w||.    (4)

Furthermore, it is always possible to scale w in π_b such that ||w|| = 1; in this case, the difference β − α is called the geometrical margin.

2.1 Finding the large margin classifier

From (4), it follows that the classifier w with the largest margin on a given training sample Z is

    w_LM := argmax_{w ∈ F; α, β ∈ R} (β − α)/||w||.    (5)

From (2), it is derived that

    (β − α)/||w|| = min { (1/||w||) (y_i ⟨x_i, w⟩ + y_j ⟨x_j, w⟩) : z_i ∈ Z_1, z_j ∈ Z_2 }.    (6)

Many possibilities exist to translate this problem into an optimization problem. Several formulation approaches are presented below, and some remarks and improvements are made.
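As a concrete illustration of definitions (2)-(4), the following minimal sketch (hypothetical toy data, identity feature map) computes β, α, the bias choice (3) and the margin (4) for a fixed separating direction w:

```python
import numpy as np

# Hypothetical linearly separable 2-D data in feature space.
X_pos = np.array([[2.0, 2.0], [3.0, 1.0]])   # class theta_1 = +1
X_neg = np.array([[0.0, 0.0], [-1.0, 1.0]])  # class theta_2 = -1

w = np.array([1.0, 1.0])                     # a separating direction

beta = min(X_pos @ w)                        # Eq. (2): min over positive class
alpha = max(X_neg @ w)                       # Eq. (2): max over negative class
b = (alpha + beta) / 2.0                     # Eq. (3): natural bias choice
margin = (beta - alpha) / np.linalg.norm(w)  # Eq. (4): distance between hyperplanes

print(beta, alpha, b, margin)
```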
2.1.1 Standard primal SVM norm formulation

The classifier w associated to (6) can be interpreted as [8]

    w_SVM := argmax_{w ∈ F; b ∈ R} min_{z_i ∈ Z} (1/||w||) y_i (⟨x_i, w⟩ − b).

A computationally straightforward method of casting problem (5) is to minimize the norm ||w|| while the margin is fixed to β − α = 2. The optimal separating hyperplane obtained must be in canonical form if the bias term is introduced by defining β = b + 1 and α = b − 1. Hence, the problem is translated into the optimization problem

    min_{w ∈ F; b ∈ R}  (1/2)||w||²
    s.t.  y_i (⟨x_i, w⟩ − b) ≥ 1,  z_i ∈ Z,    (7)

and the usual formulation for SVMs [1] is obtained. The bias term is calculated a posteriori by using the Karush-Kuhn-Tucker (KKT) conditions or other more computationally robust expressions [9].

2.1.2 2-classes ordinal regression formulation

Alternatively, (5) can be solved by maximizing the difference β − α while ||w|| is fixed to unity, i.e. maximizing the geometrical margin. This is a nonlinear nonconvex restriction and hence leads to an associated optimization problem which is harder to solve than (7). Recently, it has been proved in [10] that this nonconvex restriction can be replaced by the convex constraint ⟨w, w⟩ ≤ 1, since the optimal solution has unit magnitude in order to optimize the objective function. Hence, the resulting optimization problem can be expressed as

    min_{w ∈ F; α, β ∈ R}  α − β
    s.t.  ⟨x_i, w⟩ ≥ β,  z_i ∈ Z_1
          ⟨x_j, w⟩ ≤ α,  z_j ∈ Z_2
          ⟨w, w⟩ ≤ 1,    (8)

which is as straightforward to solve as (7).

The original formulation in [10] is designed for multiclass ordinal regression; therefore an additional constraint, α ≤ β, appears to prevent overlapping between classes. We establish that this restriction is no longer necessary when only two linearly separable classes are considered, since the ordering depends only on the sign of the vector w.
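The standard primal problem (7) is a small quadratic program and can be checked numerically with a general-purpose solver. A minimal sketch using SciPy's SLSQP on hypothetical toy data (not the solvers referenced in [9]):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable data; labels y_i in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Variables packed as t = (w1, w2, b); objective is (1/2)||w||^2.
obj = lambda t: 0.5 * (t[0] ** 2 + t[1] ** 2)
# One inequality constraint per pattern: y_i(<x_i, w> - b) - 1 >= 0.
cons = [{'type': 'ineq',
         'fun': (lambda t, i=i: y[i] * (X[i] @ t[:2] - t[2]) - 1.0)}
        for i in range(len(y))]

res = minimize(obj, x0=np.zeros(3), constraints=cons, method='SLSQP')
w, b = res.x[:2], res.x[2]
print(w, b)   # canonical form: beta = b + 1, alpha = b - 1
```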
2.1.3 C-Margin formulation

A geometrical approach to solving (5) is presented in [11]. The optimization problem can be written as

    min_{w ∈ F; α, β ∈ R}  (1/2)||w||² + α − β
    s.t.  ⟨x_i, w⟩ ≥ β,  z_i ∈ Z_1
          ⟨x_j, w⟩ ≤ α,  z_j ∈ Z_2.    (9)

Moreover, it is also shown that the standard primal SVM norm formulation (7) can be derived from (9) by setting β − α = 2 and using (3) for the bias choice. In a similar direct way, it could now be shown that the problem leading to formulation (8) can be obtained from (9) by fixing ⟨w, w⟩ = 1. A more general approach would be to consider the objective function in (9) as

    (1/2)||w||² + C (α − β),    (10)

where a tunable trade-off exists between ||w|| and α − β by means of C > 0.

2.1.4 Exact margin maximization

The alternative geometrical framework introduced in [11] and developed in a more general form in [12] has been employed in [7] for Banach spaces. Hence, the problem becomes

    min_{w ∈ F; α, β ∈ R}  ||w||² / (β − α)
    s.t.  ⟨x_i, w⟩ ≥ β,  z_i ∈ Z_1
          ⟨x_j, w⟩ ≤ α,  z_j ∈ Z_2,    (11)

as a result for Hilbert spaces, which is the same as (7) when a different bias, b = (α + β)/(β − α), is defined.

2.2 Dualization of all the approaches

The most important contribution in this article is the following demonstration that all the proposed Quadratic Programming problem formulations are the same when dualized. Firstly, a proposition will be demonstrated. Henceforth, notation is simplified by omitting ranges of subscripts and writing xy = ⟨x, y⟩.
Proposition 1  Let the problem be

    min_{x_1 ∈ C_1, x_2 ∈ C_2} ||x_1 − x_2||,

provided that C_1 and C_2 are convex sets with C_1 ∩ C_2 = ∅. Let us assume that x_1* ∈ C_1 and x_2* ∈ C_2 are solution points for each set; then, by defining w_0 = x_1* − x_2*, it is verified that

    x_1* w_0 = min_{x ∈ C_1} x w_0,    x_2* w_0 = max_{x ∈ C_2} x w_0.    (12)

PROOF. If we define β_0 = x_1* w_0 and α_0 = x_2* w_0, then β_0 − α_0 = ||w_0||² > 0 since C_1 ∩ C_2 = ∅ (see footnote 1). Let π_a : x w_0 − a = 0 be a hyperplane with a ∈ R. Let S = {x ∈ R^d : ||x − x_2*|| = ||w_0||} be the surface of the sphere centered at x_2*. Then two properties are verified:

Property 1: Let π_β be a hyperplane with β > β_0; then, since x_2* ∈ π_{α_0} and d(π_{α_0}, π_β) = (β − α_0)/||w_0|| > (β_0 − α_0)/||w_0|| = ||w_0||, it follows that π_β ∩ S = ∅.

Property 2: S ∩ C_1 = {x_1*}. Proof: obviously x_1* ∈ S since ||x_1* − x_2*|| = ||w_0||. Let us suppose that some x̄ ∈ S ∩ C_1, x̄ ≠ x_1*, exists; hence for 0 ≤ λ ≤ 1, x_λ = λ x̄ + (1 − λ) x_1* ∈ C_1 since C_1 is a convex set, with ||x_λ − x_2*|| ≤ ||w_0|| (triangle inequality). Strict inequality is not possible: if λ exists such that ||x_λ − x_2*|| < ||w_0||, then x_1* is not a solution point. On the other hand, if no λ exists such that ||x_λ − x_2*|| < ||w_0||, then x_λ ∈ S ∩ C_1 for all 0 ≤ λ ≤ 1. However, {x_λ} lies on a line, and the intersection of a line with the surface of a sphere contains at most two points, which is a contradiction.

Footnote 1: Note that if x_2* − x_1* is considered instead of x_1* − x_2*, then β_0 < α_0.

[Figure omitted.] Figure 1. Geometrical representation of Proposition 1 on R². Notation: P_1 = x_1*, P_2 = x_2*, r_1 = π_{β_0}, r_2 = π_{α_0} and Q = x_0.

Let us see that x_1* w_0 = min_{x ∈ C_1} x w_0 = β_0. Suppose that min_{x ∈ C_1} x w_0 = β < β_0; then x_0 ∈ C_1 exists such that x_0 w_0 = β < β_0 (Figure 1). We consider
the line r : x_1* + λ (x_0 − x_1*) ⊆ C_1 for 0 ≤ λ ≤ 1, with x_1* ∈ r and r ∩ π_{β_0} ≠ ∅; therefore 0 < λ_0 < 1 exists such that x_{λ_0} ∈ S. Hence, by Property 2 it follows that x_{λ_0} = x_1*, which is a contradiction. The proof that x_2* w_0 = max_{x ∈ C_2} x w_0 = α_0 is similar to the former proof but considers the surface of the sphere whose center is x_1*.

Let x_1* ∈ C_1 and x_2* ∈ C_2 be those points which provide the distance between C_1 and C_2. The norm of the vector w_0 is therefore the distance between C_1 and C_2, i.e. d(C_1, C_2) = d(x_1*, x_2*) = ||x_1* − x_2*|| = ||w_0||. On the other hand, by using β_0 − α_0 = ||w_0||², it follows that

    ||w_0|| = (β_0 − α_0)/||w_0|| = d(π_{α_0}, π_{β_0}).    (13)

The main result is demonstrated below.

Theorem 2  The dual expressions of the optimization problems (7), (8), (9) and (11) for linearly separable classes can all be formulated as

    min_{u ∈ R^{n_1}, v ∈ R^{n_2}}  (1/2) || Σ_{i=1}^{n_1} u_i x_i − Σ_{j=1}^{n_2} v_j x_j ||²
    s.t.  Σ_{i=1}^{n_1} u_i = Σ_{j=1}^{n_2} v_j = 1,  u_i, v_j ≥ 0,  z_i ∈ Z_1, z_j ∈ Z_2,    (14)

and the solutions of all the approaches are equivalent.

PROOF. Let us suppose that {u_{i0}}_{i=1}^{n_1} and {v_{j0}}_{j=1}^{n_2} are the solution of the unified dual problem (14). Hence, the vector

    w_0 = Σ_{i=1}^{n_1} u_{i0} x_i − Σ_{j=1}^{n_2} v_{j0} x_j    (15)

and the parameters α_0 = max_{z_j ∈ Z_2} w_0 x_j and β_0 = min_{z_i ∈ Z_1} w_0 x_i are considered. The bounds are certainly attained for at least one pattern in each class since the sets Z_1 and Z_2 are finite.

The standard SVM primal QP problem (7) is considered first. The constraint y_i (x_i w − b) ≥ 1 can be written as x_i w − (b + 1) ≥ 0, z_i ∈ Z_1, and (b − 1) − x_j w ≥ 0, z_j ∈ Z_2. Hence, the associated Lagrangian is

    L(w, b) = (1/2) ww − (Σ_i u_i x_i − Σ_j v_j x_j) w + b (Σ_i u_i − Σ_j v_j) + (Σ_i u_i + Σ_j v_j)
and its partial derivatives are

    ∂L/∂w = 0  ⇒  w − (Σ_i u_i x_i − Σ_j v_j x_j) = 0  ⇒  w = Σ_i u_i x_i − Σ_j v_j x_j,
    ∂L/∂b = 0  ⇒  Σ_i u_i − Σ_j v_j = 0  ⇒  Σ_i u_i = Σ_j v_j.

By substitution, L(u_i, v_j) = −(1/2) ||Σ_i u_i x_i − Σ_j v_j x_j||² + (Σ_i u_i + Σ_j v_j), and since max(−f(x)) is the same problem as min f(x), the dual problem can be written as

    min_{u_i, v_j}  (1/2) ||Σ_i u_i x_i − Σ_j v_j x_j||² − (Σ_i u_i + Σ_j v_j)
    s.t.  Σ_i u_i = Σ_j v_j,  u_i, v_j ≥ 0.    (16)

Dual problem (16) is now transformed by defining λ = Σ_i u_i and the dual variables u'_i = u_i/λ, v'_j = v_j/λ. It can be supposed that λ > 0, since λ = 0 is the trivial solution. The new dual problem becomes

    min_{λ, u'_i, v'_j}  (λ²/2) ||Σ_i u'_i x_i − Σ_j v'_j x_j||² − 2λ
    s.t.  Σ_i u'_i = Σ_j v'_j = 1,  u'_i, v'_j, λ ≥ 0,    (17)

where λ does not depend on u'_i, v'_j (see footnote 2). Let us define a = ||Σ_i u'_i x_i − Σ_j v'_j x_j||²; the function f(λ, a) = (λ²/2) a − 2λ is increasing in a, since ∂f/∂a = λ²/2 > 0. Furthermore, ∂f/∂λ = aλ − 2 = 0 and ∂²f/∂λ² = a > 0 imply that λ_0 = 2/a is a minimum. Therefore,

    min_{λ, u', v'} f(λ, a) = min_{u', v'} f(λ_0, a) = min_{u', v'} (−2/a),

which is minimized by minimizing a, i.e. by minimizing (1/2)||Σ_i u'_i x_i − Σ_j v'_j x_j||², so the dual problem in terms of u'_i, v'_j is identical to problem (14).

Footnote 2: Note that u' = (u'_1, ..., u'_{n_1}) ∈ R^{n_1} and v' = (v'_1, ..., v'_{n_2}) ∈ R^{n_2} verify Σ_i u'_i = Σ_j v'_j = 1.

On the other hand, if w_0 is the vector defined in (15) on the solution of dual problem (14), then the solution of dual problem (17) can be expressed as w = Σ_i u_{i0} x_i − Σ_j v_{j0} x_j = (Σ_i u_{i0}) (Σ_i u'_{i0} x_i − Σ_j v'_{j0} x_j) = λ_0 w_0. Moreover, using the constraints of primal problem (7) and the definition of the bounds in
(15) lead to λ_0 β_0 = b + 1 and λ_0 α_0 = b − 1, which is an equation system whose solution is λ_0 = 2/(β_0 − α_0) and b = (β_0 + α_0)/(β_0 − α_0).

The 2-classes ordinal regression formulation problem will now be solved (see footnote 3). The Lagrangian of the primal problem is

    L(w, α, β) = (α − β) − Σ_i u_i (x_i w − β) − Σ_j v_j (α − x_j w) − γ (1 − ww)

with partial derivatives

    ∂L/∂α = 0  ⇒  1 − Σ_j v_j = 0  ⇒  Σ_j v_j = 1,
    ∂L/∂β = 0  ⇒  −1 + Σ_i u_i = 0  ⇒  Σ_i u_i = 1,

so Σ_i u_i = Σ_j v_j = 1, and, considering that γ = 0 is not a valid solution (see footnote 4),

    ∂L/∂w = 0  ⇒  2γw − (Σ_i u_i x_i − Σ_j v_j x_j) = 0  ⇒  w = (1/2γ)(Σ_i u_i x_i − Σ_j v_j x_j).

Hence, the dual objective function is

    L(γ, u_i, v_j) = −(1/4γ) ||Σ_i u_i x_i − Σ_j v_j x_j||² − γ

and, as max(−f(x)) is the same problem as min f(x), the dual problem becomes

    min_{γ, u_i, v_j}  (1/4γ) ||Σ_i u_i x_i − Σ_j v_j x_j||² + γ
    s.t.  Σ_i u_i = 1,  Σ_j v_j = 1,  u_i, v_j, γ ≥ 0,    (18)

where γ does not depend on u_i, v_j. Let us define a = ||Σ_i u_i x_i − Σ_j v_j x_j||²; the function f(γ, a) = a/(4γ) + γ verifies ∂f/∂a = 1/(4γ) > 0 and ∂f/∂γ = −a/(4γ²) + 1. Therefore, for a > 0, the function is increasing with respect to a and min_{γ, a} f(γ, a) = min_a min_γ f(γ, a). The equation ∂f/∂γ = −a/(4γ²) + 1 = 0 has two solutions: γ_0 = −√a/2, which is not possible since γ > 0; and γ_0 = √a/2, which is a minimum since ∂²f/∂γ² = a/(2γ_0³) > 0. Hence, f(γ_0, a) = a/(4γ_0) + γ_0 = √a implies

    min_{γ, a} f(γ, a) = min_a f(γ_0(a), a) = min_a √a = min_{u_i, v_j} ||Σ_i u_i x_i − Σ_j v_j x_j||

and, as arg min f(x) = arg min f²(x), the final dual problem is problem (14).

Footnote 3: A more general demonstration is displayed in [10].
Footnote 4: The demonstration can be found in [10].
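The inner minimization over γ in (18) can be checked numerically: for a fixed a, the minimizer should be γ_0 = √a/2 with optimal value √a. A small sketch with a hypothetical value of a:

```python
import numpy as np

a = 8.0                               # hypothetical ||sum u x - sum v x||^2
gammas = np.linspace(0.01, 5.0, 100000)
f = a / (4.0 * gammas) + gammas       # objective of (18) as a function of gamma

g_star = gammas[np.argmin(f)]
print(g_star, f.min())                # close to sqrt(a)/2 and sqrt(a)
```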
On the other hand, if w_0 is the solution of dual problem (14), then the solution of dual problem (18) is w = w_0/(2γ_0) = w_0/||w_0||, i.e. the norm of the solution vector is one, and, by using the constraints of primal problem (8), α = α_0/||w_0||, β = β_0/||w_0||.

Thirdly, we solve the C-Margin formulation by using objective function (10) for C > 0:

    min_{w ∈ F; α, β ∈ R}  (1/2)||w||² + C (α − β)
    s.t.  ⟨x_i, w⟩ ≥ β,  z_i ∈ Z_1
          ⟨x_j, w⟩ ≤ α,  z_j ∈ Z_2.    (19)

The Lagrangian is

    L(w, α, β) = (1/2) ww + C (α − β) − Σ_i u_i (x_i w − β) − Σ_j v_j (α − x_j w)

and its partial derivatives are

    ∂L/∂α = 0  ⇒  C − Σ_j v_j = 0  ⇒  Σ_j v_j = C,
    ∂L/∂β = 0  ⇒  −C + Σ_i u_i = 0  ⇒  Σ_i u_i = C,  so Σ_i u_i = Σ_j v_j = C,
    ∂L/∂w = 0  ⇒  w − (Σ_i u_i x_i − Σ_j v_j x_j) = 0  ⇒  w = Σ_i u_i x_i − Σ_j v_j x_j,

which leads to the dual objective function L(u_i, v_j) = −(1/2) ||Σ_i u_i x_i − Σ_j v_j x_j||². By considering the dual variables u'_i = u_i/C, v'_j = v_j/C, the dual problem is therefore

    max_{u'_i, v'_j}  −(C²/2) ||Σ_i u'_i x_i − Σ_j v'_j x_j||²
    s.t.  Σ_i u'_i = 1,  Σ_j v'_j = 1,  u'_i, v'_j ≥ 0,    (20)

and, as max(−A f(x)) is the same problem as min f(x) for A > 0, the final dual problem is problem (14). By following the same line of reasoning as above, the solution for problem (19) is (see footnote 5): w = C w_0, α = C α_0 and β = C β_0.

Finally, the exact margin maximization problem will be considered. The Lagrangian associated to the primal problem is

    L(w, α, β) = ||w||²/(β − α) − Σ_i u_i (x_i w − β) − Σ_j v_j (α − x_j w)

Footnote 5: Problem (9) is obtained for C = 1.
and its partial derivatives are

    ∂L/∂α = 0  ⇒  Σ_j v_j = ||w||²/(β − α)²,
    ∂L/∂β = 0  ⇒  Σ_i u_i = ||w||²/(β − α)²,  so Σ_i u_i = Σ_j v_j = ||w||²/(β − α)²,
    ∂L/∂w = 0  ⇒  2w = (β − α)(Σ_i u_i x_i − Σ_j v_j x_j),

which leads to the dual problem

    max_{u_i, v_j}  Σ_i u_i / ||Σ_i u_i x_i − Σ_j v_j x_j||
    s.t.  Σ_i u_i = Σ_j v_j,  u_i, v_j ≥ 0.    (21)

By considering the dual variables u'_i = u_i/(Σ_i u_i), v'_j = v_j/(Σ_i u_i) (with Σ_i u_i > 0; otherwise, if Σ_i u_i = 0, the solution is trivial), the dual function is therefore

    max_{u'_i, v'_j}  1 / ||Σ_i u'_i x_i − Σ_j v'_j x_j||
    s.t.  Σ_i u'_i = Σ_j v'_j = 1,  u'_i, v'_j ≥ 0,

and, by applying arg max f(x) = arg min 1/f(x) = arg min (1/f(x))² when f(x) > 0, the final dual problem becomes problem (14).

On the other hand, the value Σ_i u_i is calculated using Σ_i u_i = ||w||²/(β − α)² together with w = λ w_0, α = λ α_0 and β = λ β_0; hence Σ_i u_i = ||w_0||²/(β_0 − α_0)² and, by applying (13), Σ_i u_i = 1/||w_0||². Hence, the solution to problem (11) is given in the form w = w_0/||w_0||³, α = α_0/||w_0||³ and β = β_0/||w_0||³.

2.3 Discussion

Given Z_1 and Z_2, the convex hull of Z_k is defined in [11] as

    C_k = { Σ_{i=1}^{n_k} u_i x_i : 0 ≤ u_i ≤ 1, Σ_{i=1}^{n_k} u_i = 1, x_i ∈ Z_k },  k = 1, 2,

i.e. the smallest convex set which contains the set of points of each class. It is demonstrated there that if Z_1 and Z_2 are linearly separable, then maximizing the margin separation of the two sets is equivalent to minimizing the distance between the two closest points of the convex hulls. By using this approach, it is shown that optimization problem (9) leads to problems (7) and (8).
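The unified dual (14) is precisely this nearest-points problem between the two convex hulls. A small Frank-Wolfe sketch (hypothetical toy data; not an algorithm proposed by the authors) solves it and checks the identity β_0 − α_0 = ||w_0||² from Proposition 1:

```python
import numpy as np

# Hypothetical linearly separable classes in feature space.
X1 = np.array([[2.0, 2.0], [3.0, 1.0]])    # class 1 points
X2 = np.array([[0.0, 0.0], [-1.0, 1.0]])   # class 2 points

# Frank-Wolfe on min (1/2)||u'X1 - v'X2||^2 over the two simplices.
u = np.ones(len(X1)) / len(X1)
v = np.ones(len(X2)) / len(X2)
for k in range(2000):
    w = u @ X1 - v @ X2                    # current difference vector
    # Linear minimization over each simplex: move toward the best vertex.
    su = np.zeros_like(u); su[np.argmin(X1 @ w)] = 1.0
    sv = np.zeros_like(v); sv[np.argmax(X2 @ w)] = 1.0
    gamma = 2.0 / (k + 2.0)                # standard Frank-Wolfe step size
    u += gamma * (su - u)
    v += gamma * (sv - v)

w0 = u @ X1 - v @ X2
beta0 = min(X1 @ w0)
alpha0 = max(X2 @ w0)
print(w0, beta0 - alpha0, w0 @ w0)         # beta0 - alpha0 equals ||w0||^2
```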
It has been demonstrated in Theorem 2 that this fact is more general: all the existing approaches have the same form when dualized. Moreover, the expression for each solution has been obtained: the result in (13) enables the comparison of the scale factor used by each SVM formulation studied. Given λ > 0, the solutions are similar, verifying w = λ w_0, β = λ β_0 and α = λ α_0, and therefore

    ||w_0|| = (β − α)/||w|| = d(π_α, π_β).

Hence, it is deduced that the scale factor for all the problems can be obtained from w, β and α in the form

    λ = ||w||²/(β − α).    (22)

For the standard primal SVM problem (7), the scale is provided by β − α = 2; by using equality (22): λ (β_0 − α_0) = 2, so λ = 2/||w_0||². Analogously, for problem (8) the scale is provided by ||w|| = 1, and therefore 1 = λ ||w_0||, so λ = 1/||w_0||. This result also provides a meaning for parameter C in problem (9) when objective function (10) is used: λ = C. For exact margin maximization, the scale is λ = 1/||w_0||³.

3 The nonlinearly separable case

Let us now suppose that the sets Z_1 and Z_2 are nonlinearly separable. By using slack variables, the constraints are relaxed in the form (see footnote 6)

    ⟨x_i, w⟩ ≥ β − ξ_i,  z_i ∈ Z_1
    ⟨x_j, w⟩ ≤ α + ξ_j,  z_j ∈ Z_2
    ξ_i, ξ_j ≥ 0,    (23)

and a penalty term is added to the functional for the primal optimization problem,

    C ( Σ_{i=1}^{n_1} ξ_i + Σ_{j=1}^{n_2} ξ_j ),    (24)

where C > 0 quantifies the loss produced when ξ_i, ξ_j > 0. The constraints in the dual optimization problems for the primal expressions considered are the same as in (16), (18), (20) and (21), except that the dual variables are upper-bounded, 0 ≤ u_i, v_j ≤ C, for all i, j.

Footnote 6: For the standard SVM, α and β must be replaced with b − 1 and b + 1, respectively.

Theorem 3  The dual expression of the optimization problems (7), (8), (9) and (11) with slack variables (23) in the constraints and penalty term (24) in the
functional, for nonlinearly separable classes, can be formulated as

    min_{u ∈ R^{n_1}, v ∈ R^{n_2}}  (1/2) || Σ_{i=1}^{n_1} u_i x_i − Σ_{j=1}^{n_2} v_j x_j ||²
    s.t.  Σ_{i=1}^{n_1} u_i = Σ_{j=1}^{n_2} v_j = 1,  0 ≤ u_i, v_j ≤ A,  z_i ∈ Z_1, z_j ∈ Z_2,    (25)

where A depends on ||Σ_{i=1}^{n_1} u_i x_i − Σ_{j=1}^{n_2} v_j x_j||. However, the solutions for these approaches are not equivalent.

PROOF. The dual objective functions to be minimized for each formulation can easily be derived, in a similar form to the Theorem 2 counterpart, by using the parameter λ = Σ_i u_i when necessary:

    Standard primal SVM norm:  f(λ, u'_i, v'_j) = (λ²/2) ||Σ_i u'_i x_i − Σ_j v'_j x_j||² − 2λ,
    2-classes ordinal regression and C-Margin:  f(λ, u'_i, v'_j) = λ ||Σ_i u'_i x_i − Σ_j v'_j x_j||,
    Exact margin maximization:  f(u'_i, v'_j) = ||Σ_i u'_i x_i − Σ_j v'_j x_j||,

where λ, u'_i, v'_j are now dependent variables, in contrast with the linearly separable case. The set of constraints for all the problems is

    Σ_{i=1}^{n_1} u'_i = Σ_{j=1}^{n_2} v'_j = 1,  0 ≤ u'_i, v'_j ≤ C/λ,  0 < λ ≤ NC,  z_i ∈ Z_1, z_j ∈ Z_2,

with N = min(n_1, n_2). Hence, by naming A = C/λ, the set of constraints in (25) is obtained for all the cases. However, as A = C/λ, the C parameter is implicitly used in the dualized form or, identically, a condition on the λ parameter exists in the original problem; therefore it is no longer possible to affirm that all the approaches lead to a similar solution. Let us examine this situation.
Let us suppose that {u_{i0}}_{i=1}^{n_1} and {v_{j0}}_{j=1}^{n_2} are the solution of (25), and that x_1* = Σ_{i=1}^{n_1} u_{i0} x_i, x_2* = Σ_{j=1}^{n_2} v_{j0} x_j and w_0 = x_1* − x_2* are considered together with the two sets (see footnote 7)

    C_k(A) = { Σ_{i=1}^{n_k} u_i x_i : 0 ≤ u_i ≤ A, Σ_{i=1}^{n_k} u_i = 1, z_i ∈ Z_k },  k = 1, 2.

When C_1(A) and C_2(A) are disjoint sets, Proposition 1 verifies that

    α_0 = max_{x ∈ C_2(A)} w_0 x = x_2* w_0,   β_0 = min_{x ∈ C_1(A)} w_0 x = x_1* w_0,   β_0 − α_0 = ||w_0||².

In a similar form to the linear case, the solution for each problem verifies w = λ w_0, β = λ β_0 and α = λ α_0. Therefore, β − α = λ ||w_0||² and λ = (β − α)/||w_0||², and so A = C/λ depends on ||Σ_{i=1}^{n_1} u_i x_i − Σ_{j=1}^{n_2} v_j x_j||.

On the other hand, the constraint β − α = 2 implies that λ = 2/||w_0||² is verified for the standard primal SVM norm; in 2-classes ordinal regression, the vector accomplishes ||w|| = 1, so λ = 1/||w_0||; in the exact margin, λ = 1/||w_0||³; and for the C-Margin, λ = C.

3.1 Discussion

Two restrictions must be considered when selecting A. If N = min(n_1, n_2), then 1 = Σ_i u_i ≤ n_1 A and 1 = Σ_j v_j ≤ n_2 A, therefore A ≥ 1/n_1 and A ≥ 1/n_2, i.e. A ≥ 1/N; and, since Σ_i u_i = 1 and 0 ≤ u_i, it implies that A ≤ 1. Hence

    1/min(n_1, n_2) ≤ A ≤ 1.    (26)

On the other hand, it is known that the parameter C in the original primal optimization problem is a trade-off between the smoothness of the solution and the number of errors allowed in the classification problem. Nevertheless, by using (26), it also follows that it is necessary to impose C ≤ 1 in the 2-classes ordinal regression formulation and a corresponding bound on C in the C-Margin expression.

Footnote 7: These sets are called soft convex hulls in [12].
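The bound (26) is a simple counting condition: in each class, n_k weights, each at most A, must be able to sum to 1. A quick check with hypothetical class sizes:

```python
# Feasibility of the box-constrained simplex in the soft dual (25):
# sum(u) = 1 with 0 <= u_i <= A is feasible iff n_k * A >= 1.
def feasible(n_k: int, A: float) -> bool:
    return n_k * A >= 1.0

n1, n2 = 5, 8                       # hypothetical class sizes
A_min = 1.0 / min(n1, n2)           # lower end of (26): A >= 1/min(n1, n2)
print(A_min, feasible(n1, A_min), feasible(n2, A_min), feasible(n1, A_min / 2))
```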
4 Conclusions and Future Work

Several approaches which deal with the large margin principle, especially related to SVM theory, have been introduced. It has been demonstrated in this paper that all these approaches admit the same dual problem formulation in the linearly separable case, and that all the solutions are equivalent. For the nonlinearly separable case, all the approaches can also be formulated as a unique dual optimization problem; however, the solutions are not equivalent. Relations between all the approaches and the proposed dual unifying method indicate the convex hull formulation to be the most interesting approach in order to deal with the most flexible framework. From the proposed unified dual approach it will be possible to formulate a new SVM with greater power to interpret the geometry of the solution, the margin and the associated parameters.

5 Acknowledgements

This study was partially supported by Junta de Andalucía grant ACPAI 003/4 and Spanish MCyT grant TIC C00.

References

[1] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, Inc., 1998.

[2] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.

[3] U. Kressel, Pairwise classification and support vector machines, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.

[4] E. Mayoraz, E. Alpaydin, Support vector machines for multi-class classification, in: IWANN (2), 1999. URL citeseer.ist.psu.edu/mayoraz98support.html

[5] B. Schölkopf, A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, MA, 2002.

[6] B. Schölkopf, C. J. Burges, A. J. Smola, Introduction to support vector learning, in: B. Schölkopf, C. Burges, A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[7] M. Hein, O. Bousquet, Maximal margin classification for metric spaces, in: B. Schölkopf, M. Warmuth (Eds.), Learning Theory and Kernel Machines, Springer Verlag, Heidelberg, Germany, 2003.

[8] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms, The MIT Press, 2002.

[9] T. Joachims, Making large-scale support vector machine learning practical, in: Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999.

[10] A. Shashua, A. Levin, Taxonomy of large margin principle algorithms for ordinal regression problems, 2002. URL citeseer.ist.psu.edu/shashua0taxonomy.html

[11] K. P. Bennett, E. J. Bredensteiner, Duality and geometry in SVM classifiers, in: Proceedings of the Seventeenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., 2000.

[12] D. Zhou, B. Xiao, H. Zhou, R. Dai, Global geometry of SVM classifiers, Technical report, AI Lab, Institute of Automation, Chinese Academy of Sciences, 2002.
6 Some comments

[Translated from Spanish.] I have worked out the primal problems of the four approaches in the nonseparable case (I can send it to you by fax if you want) and I have arrived, again, at what is formulated in Theorem 3; but, thinking about the dual problems, I have found some things that I describe here to see whether you agree with me.

First of all, it must be taken into account that the value 0 < λ = Σ_i u_i ≤ NC.

Ordinal Regression and C-Margin case: the dual problem is

    min_{u ∈ R^{n_1}, v ∈ R^{n_2}}  (1/2) || Σ_{i=1}^{n_1} u_i x_i − Σ_{j=1}^{n_2} v_j x_j ||²
    s.t.  Σ_{i=1}^{n_1} u_i = Σ_{j=1}^{n_2} v_j = 1,  0 ≤ u_i, v_j ≤ C,  z_i ∈ Z_1, z_j ∈ Z_2,    (27)

in the C-Margin case with C = 1, and in the Ordinal Regression case it is

    min_{u ∈ R^{n_1}, v ∈ R^{n_2}}  || Σ_{i=1}^{n_1} u_i x_i − Σ_{j=1}^{n_2} v_j x_j ||
    s.t.  Σ_{i=1}^{n_1} u_i = Σ_{j=1}^{n_2} v_j = 1,  0 ≤ u_i, v_j ≤ C,  z_i ∈ Z_1, z_j ∈ Z_2,    (28)

which evidently provide the same solution. We already know that if the original classification problem is nonseparable, then we must impose C < 1. Thus, when solving the original problem we will have to give a value to C and, at the end, we must check whether a nontrivial solution (w ≠ 0) has been obtained, since the convex hulls may fail to be separated with that value of C. On the other hand, we know that 1/N ≤ C, and it can happen that even in the extreme case C = 1/N the solution is trivial (it suffices that the arithmetic means of all the training values in each class coincide). We already knew this, but ugly things also happen in the following approach...

Exact margin: in this case, the dual problem for the nonseparable case is

    min_{u ∈ R^{n_1}, v ∈ R^{n_2}}  || Σ_{i=1}^{n_1} u_i x_i − Σ_{j=1}^{n_2} v_j x_j || / Σ_{i=1}^{n_1} u_i
    s.t.  Σ_{i=1}^{n_1} u_i = Σ_{j=1}^{n_2} v_j,  0 ≤ u_i, v_j ≤ C,  z_i ∈ Z_1, z_j ∈ Z_2.    (29)
(LET US SEE IF I EXPLAIN MYSELF WELL.) Since 0 < λ = Σ_i u_i ≤ NC and N > 1, I take the value λ = C and solve the above optimization problem; since I am restricting the problem, the solution u, v yields a value of the objective function greater than or equal to that provided by the solution when λ is variable. Now, if in this new problem one considers u'_i = u_i/λ and v'_j = v_j/λ, one obtains the problem (given in Theorem 2):

    min_{u' ∈ R^{n_1}, v' ∈ R^{n_2}}  || Σ_{i=1}^{n_1} u'_i x_i − Σ_{j=1}^{n_2} v'_j x_j ||
    s.t.  Σ_{i=1}^{n_1} u'_i = Σ_{j=1}^{n_2} v'_j = 1,  u'_i, v'_j ≥ 0,  z_i ∈ Z_1, z_j ∈ Z_2,    (30)

and, since the training set is nonseparable, the solution to this problem provides a trivial vector w (I HAVE CHECKED THIS IN MATHEMATICA WITH AN EXAMPLE); therefore its norm is zero and, since the objective functions are positive, this is the smallest value the objective function can attain in problem (29).

Regarding the classical approach, the Standard SVM: in this approach, if the problem is nonseparable, one can obtain w = 0 as in the previous case, but it does not have to be the optimal solution, since the objective function contains a second summand −2 Σ_i u_i, so that (as happens in practice) there may be another vector w ≠ 0 with Σ_i u_i large, which gives a smaller (negative) value of the objective function than in the previous case.

For all these reasons, the right approach is the standard one, since it avoids the problem of the solution being trivial, and it achieves this by imposing β − α = 2, because in this way the convex hulls are guaranteed to always be disjoint. Of course, it is not necessary that β − α = 2; it suffices to impose β − α = γ with γ fixed a priori and positive.

Comments: I believe that the crux of the (non-standard) formulations is that Gaussian kernels are used and, therefore, by choosing the parameters of the optimization problem appropriately, the problem becomes separable (recall that the Gaussian kernel generates an infinite-dimensional space), so all the approaches are equivalent.
On the other hand, since one works numerically, rounding errors mean that the trivial vector is not reached exactly.

Question: Are my arguments wrong?