Regression Using Support Vector Machines: Basic Foundations

Technical Report, December 2004

Aly Farag and Refaat M Mohamed
Computer Vision and Image Processing Laboratory
Electrical and Computer Engineering Department
University of Louisville
Louisville, KY 40292
Support Vector Machines (SVM) were developed by Vapnik [1] to solve the classification problem, but more recently SVM have been successfully extended to regression and density estimation problems [2]. SVM are gaining popularity due to many attractive features and promising empirical performance. For instance, the formulation of SVM density estimation employs the Structural Risk Minimization (SRM) principle, which has been shown to be superior to the traditional Empirical Risk Minimization (ERM) principle employed in conventional learning algorithms (e.g. neural networks) [3]. SRM minimizes an upper bound on the generalization error, whereas ERM minimizes the error on the training data. This difference makes SVM more attractive in statistical learning applications.

The traditional formulation of the SVM density estimation problem leads to a quadratic optimization problem of the same size as the training data set. This computationally demanding optimization problem prevents SVM from being the default choice of the pattern recognition community [4]. Several approaches have been introduced to circumvent this shortcoming of SVM learning. These include simpler optimization criteria for SVM design (e.g. the kernel ADATRON [5]), specialized QP algorithms such as the conjugate gradient method, decomposition techniques (which break the large QP problem down into a series of smaller QP sub-problems), the sequential minimal optimization (SMO) algorithm and its various extensions [6], Nyström approximations [7], greedy Bayesian methods [8], and the chunking algorithm [9].

Recently, active learning has become a popular paradigm for reducing the sample complexity of large-scale learning tasks (e.g. [10-12]). In active learning, instead of learning from random samples, the learner has the ability to select its own training data.
This is done iteratively, and the output of one step is used to select the examples for the next step.

This tutorial presents the mathematical foundations of the SVM regression algorithm. It then presents a new learning algorithm based on Mean Field (MF) theory. MF methods provide efficient approximations that are able to cope with the complexity of probabilistic data models [13]. They replace the intractable task of computing high dimensional sums and integrals with the much easier problem of solving a system of linear equations. The regression problem is formulated so that the MF method can be used to approximate the learning procedure in a way that avoids the quadratic programming optimization. The proposed approach is suitable for high dimensional regression problems, and several experimental examples are presented.

1 Problem Statement and Some Basic Principles

The regression problem can be stated as follows: given a training data set $D = \{(y_i, t_i) \mid i = 1, 2, \ldots, n\}$ of input vectors $y_i$ and associated targets $t_i$, the goal is to fit a function $g(y)$ that approximates the relation inherent in the data points and that can be used later to infer the output $t$ for a new input data point $y$. Any practical regression algorithm has a loss function $L(t, g(y))$, which describes how the estimated function deviates from the true one. Many forms of loss function can be found in the literature, e.g. linear, quadratic, and exponential loss functions. In this tutorial, Vapnik's loss function is used, which is known as the ε-insensitive loss function and is defined as:

$$L(t, g(y)) = \begin{cases} 0 & \text{if } |t - g(y)| \le \varepsilon \\ |t - g(y)| - \varepsilon & \text{otherwise} \end{cases} \tag{1}$$

where $\varepsilon > 0$ is a predefined constant which controls the noise tolerance.

Figure 1: The soft margin loss function.

With the ε-insensitive loss function, the goal is to find a $g(y)$ that deviates by at most $\varepsilon$ from the actually obtained targets $t_i$ for all training data and, at the same time, is as flat as possible. In other words, the regression algorithm does not care about errors as long as they are less than $\varepsilon$, but will not accept any deviation larger than this.
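As a minimal sketch of Eq. (1), the loss can be written in a few lines of Python. The function name `eps_insensitive_loss` and the sample values are assumptions made for illustration only; they do not come from the report.

```python
def eps_insensitive_loss(t, g, eps):
    """Eq. (1): zero inside the eps-tube, linear in the excess deviation outside it."""
    return max(0.0, abs(t - g) - eps)


# Hypothetical targets and predictions, chosen only to illustrate the three cases.
targets = [1.0, 1.0, 1.0]
predictions = [1.05, 1.3, 0.5]

# First prediction is inside the tube; the other two are outside it.
losses = [eps_insensitive_loss(t, g, eps=0.1) for t, g in zip(targets, predictions)]
```

The first prediction deviates by less than `eps` and therefore costs nothing, while the other two are charged only for the part of the deviation that exceeds `eps`.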
For pedagogical reasons, the following discussion begins with the case of linear functions $g$, taking the form:

$$g(y) = w \cdot y + b \tag{2}$$

where $w \in Y$, $Y$ is the input space, $b \in \mathbb{R}$, and $w \cdot y$ is the dot product of the vectors $w$ and $y$.

2 Classical Formulation of the Regression Problem

As stated before, the goal of a regression algorithm is to fit a flat function to the data points. Flatness in the case of Eq. (2) means that one seeks a small $w$. One way to ensure this flatness is to minimize the norm, i.e. $\|w\|^2$. Thus, the regression problem can be written as a convex optimization problem:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \tag{3}$$

$$\text{subject to} \quad \begin{cases} t_i - (w \cdot y_i + b) \le \varepsilon \\ (w \cdot y_i + b) - t_i \le \varepsilon \end{cases} \tag{4}$$

The implied assumption in Eq. (4) is that a function $g$ actually exists that approximates all pairs $(y_i, t_i)$ with $\varepsilon$ precision, or in other words, that the convex optimization problem is feasible. Sometimes, however, this may not be the case, or we may want to allow for some errors. Analogously to the soft margin loss function [14], which was adapted to SVM by Vapnik [15], slack variables $\zeta_i, \zeta_i^*$ can be introduced to cope with otherwise infeasible constraints of the optimization problem in Eq. (4). Hence the formulation stated in [15] is attained:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) \tag{5}$$

$$\text{subject to} \quad \begin{cases} t_i - (w \cdot y_i + b) \le \varepsilon + \zeta_i \\ (w \cdot y_i + b) - t_i \le \varepsilon + \zeta_i^* \\ \zeta_i, \zeta_i^* \ge 0 \end{cases} \tag{6}$$

The constant $C > 0$ determines the trade-off between the flatness of $g$ and the amount up to which deviations larger than $\varepsilon$ are tolerated. This corresponds to dealing with the so-called ε-insensitive loss function described before.
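The role of the slack variables can be sketched numerically. For a fixed pair $(w, b)$, the smallest slacks that satisfy Eq. (6) are $\zeta_i = \max(0, t_i - g(y_i) - \varepsilon)$ and $\zeta_i^* = \max(0, g(y_i) - t_i - \varepsilon)$, so the objective of Eq. (5) can be evaluated directly. The helper names (`dot`, `primal_objective`) and the toy data below are assumptions for illustration, not part of the report.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def primal_objective(w, b, inputs, targets, eps, C):
    """Value of Eq. (5) at (w, b), using the smallest slacks that satisfy Eq. (6)."""
    total_slack = 0.0
    for y, t in zip(inputs, targets):
        g = dot(w, y) + b
        zeta = max(0.0, t - g - eps)        # target lies above the eps-tube
        zeta_star = max(0.0, g - t - eps)   # target lies below the eps-tube
        total_slack += zeta + zeta_star
    return 0.5 * dot(w, w) + C * total_slack


# Hypothetical 1-D data: the third point lies 1.0 above the line g(y) = y.
inputs = [[0.0], [1.0], [2.0]]
targets = [0.0, 1.0, 3.0]
value = primal_objective(w=[1.0], b=0.0, inputs=inputs, targets=targets, eps=0.5, C=10.0)
```

Here only the third point needs a slack (of 0.5), so the objective is the flatness term plus `C` times that single slack. A larger `C` makes the same violation more expensive.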
As shown in Fig. 1, only the points outside the shaded region contribute to the cost, insofar as the deviations are penalized in a linear fashion.

It turns out that in most cases the optimization problem in Eqs. (5) and (6) can be solved more easily in its dual formulation. Moreover, the dual formulation provides the key for extending SVM to nonlinear functions. Hence, a standard dualization method utilizing Lagrange multipliers is described next.

2.1 Dual problem and quadratic programming

The minimization problem in Eqs. (5) and (6) is called the primal problem. The key idea of the dual problem is to construct a Lagrange function from the primal objective function and the corresponding constraints by introducing a dual set of variables. It can be shown that the Lagrange function has a saddle point with respect to the primal and dual variables at the solution (for details see e.g. [16], [17]). The primal objective function and its constraints are transformed into the Lagrange function as follows:

$$L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) - \sum_{i=1}^{n} (\lambda_i \zeta_i + \lambda_i^* \zeta_i^*) - \sum_{i=1}^{n} \alpha_i \left(\varepsilon + \zeta_i - t_i + (w \cdot y_i + b)\right) - \sum_{i=1}^{n} \alpha_i^* \left(\varepsilon + \zeta_i^* + t_i - (w \cdot y_i + b)\right) \tag{7}$$

Here $L$ is the Lagrangian and $\alpha_i$, $\alpha_i^*$, $\lambda_i$, and $\lambda_i^*$ are Lagrange multipliers. The dual variables in Eq. (7) have to satisfy positivity constraints:

$$\alpha_i, \alpha_i^*, \lambda_i, \lambda_i^* \ge 0 \tag{8}$$

It follows from the saddle point condition that the partial derivatives of $L$ with respect to the primal variables $(w, b, \zeta_i, \zeta_i^*)$ have to vanish for optimality. (Note: $\alpha_i^{(*)}$ refers to both $\alpha_i$ and $\alpha_i^*$.)

$$\partial_b L = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) = 0 \tag{9}$$

$$\partial_w L = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i = 0 \tag{10}$$

$$\partial_{\zeta_i^{(*)}} L = C - \alpha_i^{(*)} - \lambda_i^{(*)} = 0 \tag{11}$$
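As an intermediate step, sketched here for convenience and not taken from the report, the terms of Eq. (7) can be grouped by primal variable:

$$
\begin{aligned}
L ={}& \frac{1}{2}\|w\|^2 - w \cdot \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i - b \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \\
&+ \sum_{i=1}^{n} \zeta_i \,(C - \alpha_i - \lambda_i) + \sum_{i=1}^{n} \zeta_i^* \,(C - \alpha_i^* - \lambda_i^*) \\
&- \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} t_i \,(\alpha_i - \alpha_i^*)
\end{aligned}
$$

In this grouped form, the coefficients of $b$, $\zeta_i$, and $\zeta_i^*$ are the left-hand sides of Eqs. (9) and (11), so those terms vanish at the saddle point, and the two terms involving $w$ combine to $-\frac{1}{2}\|w\|^2$ once Eq. (10) is used.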
Substituting Eqs. (9), (10), and (11) into Eq. (7) yields the dual optimization problem:

$$
\begin{aligned}
\text{maximize} \quad & -\frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)(y_i \cdot y_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} t_i (\alpha_i - \alpha_i^*) \\
\text{subject to} \quad & \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C]
\end{aligned}
\tag{12}
$$

In deriving Eq. (12), the dual variables $\lambda_i, \lambda_i^*$ are eliminated through the condition in Eq. (11), which can be reformulated as $\lambda_i^{(*)} = C - \alpha_i^{(*)}$. Eq. (10) can be rewritten as follows:

$$w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i, \quad \text{thus} \quad g(y) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)(y_i \cdot y) + b \tag{13}$$

This is the so-called Support Vector Machines regression expansion, i.e. $w$ can be completely described as a linear combination of the training patterns $y_i$. In a sense, the complexity of a function's representation by support vectors (SVs) is independent of the dimensionality of the input space $Y$, and depends only on the number of SVs. Moreover, the complete algorithm can be described in terms of dot products between the data. Even when evaluating $g(y)$, the value of $w$ does not need to be computed explicitly. These observations will come in handy for the formulation of a nonlinear extension.

2.2 Support Vectors

The Karush-Kuhn-Tucker (KKT) conditions [18, 19] form the basis of the Lagrangian solution. These conditions state that at the solution point, the product between dual variables and constraints has to vanish, i.e.:

$$
\begin{aligned}
\alpha_i \left(\varepsilon + \zeta_i - t_i + w \cdot y_i + b\right) &= 0 \\
\alpha_i^* \left(\varepsilon + \zeta_i^* + t_i - w \cdot y_i - b\right) &= 0
\end{aligned}
\tag{14}
$$

$$
\begin{aligned}
(C - \alpha_i)\, \zeta_i &= 0 \\
(C - \alpha_i^*)\, \zeta_i^* &= 0
\end{aligned}
\tag{15}
$$
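Eqs. (13) to (15) can be checked on a problem small enough to solve by hand. The sketch below assumes a hypothetical two-point, one-dimensional data set with $\varepsilon = 0.25$ and $C = 1$, for which the flattest feasible line is $g(y) = 0.5\,y + 0.25$. The multipliers, the value of `b`, and all names in the code are assumptions worked out for this toy case, not values from the report.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy problem, solved by hand for illustration.
inputs = [[0.0], [1.0]]
targets = [0.0, 1.0]
eps, C = 0.25, 1.0
alpha = [0.0, 0.5]        # multipliers of the constraints t_i - g(y_i) <= eps + zeta_i
alpha_star = [0.5, 0.0]   # multipliers of the constraints g(y_i) - t_i <= eps + zeta_i*
zeta = [0.0, 0.0]         # no slack is needed: both points sit on the tube boundary
zeta_star = [0.0, 0.0]
b = 0.25


def g(y):
    """Eq. (13): the prediction uses only dot products with the training inputs."""
    coeffs = [a - a_s for a, a_s in zip(alpha, alpha_star)]
    return sum(c * dot(y_i, y) for c, y_i in zip(coeffs, inputs)) + b


# Explicit weight vector from Eq. (13), computed only for comparison.
w = [sum((a - a_s) * y_i[0] for a, a_s, y_i in zip(alpha, alpha_star, inputs))]

# Left-hand sides of the KKT conditions, Eqs. (14) and (15), for every sample.
kkt_products = []
for i, (y_i, t_i) in enumerate(zip(inputs, targets)):
    kkt_products.append(alpha[i] * (eps + zeta[i] - t_i + g(y_i)))
    kkt_products.append(alpha_star[i] * (eps + zeta_star[i] + t_i - g(y_i)))
    kkt_products.append((C - alpha[i]) * zeta[i])
    kkt_products.append((C - alpha_star[i]) * zeta_star[i])
```

All entries of `kkt_products` vanish for this candidate solution, the multipliers satisfy the equality constraint of Eq. (12), and `g` never touches `w` directly, which is the property the kernel extension relies on.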
Several useful conclusions can be drawn from these conditions. Firstly, only samples $(y_i, t_i)$ with corresponding $\alpha_i^{(*)} = C$ lie outside the ε-insensitive tube. Secondly, $\alpha_i \alpha_i^* = 0$, i.e. there can never be a set of dual variables $\alpha_i, \alpha_i^*$ which are both simultaneously nonzero. This allows one to conclude that:

$$\varepsilon - t_i + w \cdot y_i + b \ge 0 \ \text{ and } \ \zeta_i = 0 \quad \text{if } \alpha_i < C \tag{16}$$

$$\varepsilon - t_i + w \cdot y_i + b \le 0 \quad \text{if } \alpha_i > 0 \tag{17}$$

A final note has to be made regarding the sparsity of the SVM expansion. From Eq. (14) it follows that the Lagrange multipliers may be nonzero only for $|g(y_i) - t_i| \ge \varepsilon$. In other words, for all samples inside the ε-tube (i.e. the shaded region in Fig. 1) the $\alpha_i, \alpha_i^*$ vanish: for $|g(y_i) - t_i| < \varepsilon$ the second factor in Eq. (14) is nonzero, hence $\alpha_i, \alpha_i^*$ have to be zero for the KKT conditions to be satisfied. Therefore there is a sparse expansion of $w$ in terms of $y_i$ (i.e. not all $y_i$ are needed to describe $w$). The training samples that come with nonvanishing coefficients are called Support Vectors.

2.3 Computing b

There are many ways to compute the value of $b$ in Eq. (13). One such way can be found in [20]:

$$b = -\frac{1}{2}\, w \cdot (y_r + y_s) \tag{19}$$

where $y_r$ and $y_s$ are support vectors (i.e. input vectors with a nonzero value of $\alpha_i$ or $\alpha_i^*$, respectively).

3 Nonlinear Regression: The Kernel Trick

The next step is to make the SVM algorithm nonlinear. This could be achieved, for instance, by simply preprocessing the training patterns $y_i$ by a map $\Psi : Y \to I$ into some feature space $I$, as described in [1], and then applying the standard SVM regression algorithm. Here is a brief look at an example given in [1].

Example 1 (Quadratic features in $\mathbb{R}^2$)