Regression Using Support Vector Machines: Basic Foundations


Regression Using Support Vector Machines: Basic Foundations
Technical Report, December 2004
Aly Farag and Refaat M. Mohamed
Computer Vision and Image Processing Laboratory
Electrical and Computer Engineering Department
University of Louisville, Louisville, KY 40292

Support Vector Machines (SVMs) were developed by Vapnik [1] to solve the classification problem, and have since been successfully extended to regression and density estimation problems [2]. SVMs are gaining popularity due to many attractive features and promising empirical performance. For instance, the formulation of SVM density estimation employs the Structural Risk Minimization (SRM) principle, which has been shown to be superior to the traditional Empirical Risk Minimization (ERM) principle employed in conventional learning algorithms (e.g. neural networks) [3]. SRM minimizes an upper bound on the generalization error, whereas ERM minimizes the error on the training data. This difference makes SVMs more attractive in statistical learning applications.

The traditional formulation of the SVM density estimation problem leads to a quadratic optimization problem of the same size as the training data set. This computationally demanding optimization problem prevents the SVM from being the default choice of the pattern recognition community [4]. Several approaches have been introduced to circumvent this shortcoming of SVM learning. These include simpler optimization criteria for SVM design (e.g. the kernel ADATRON [5]), specialized QP algorithms such as the conjugate gradient method, decomposition techniques (which break the large QP problem down into a series of smaller QP sub-problems), the sequential minimal optimization (SMO) algorithm and its various extensions [6], Nyström approximations [7], greedy Bayesian methods [8], and the chunking algorithm [9].

Recently, active learning has become a popular paradigm for reducing the sample complexity of large-scale learning tasks (e.g. [10-12]). In active learning, instead of learning from random samples, the learner has the ability to select its own training data.
This is done iteratively, and the output of one step is used to select the examples for the next step.

This tutorial presents the mathematical foundations of the SVM regression algorithm. It then presents a new learning algorithm based on Mean Field (MF) theory. MF methods provide efficient approximations that can cope with the complexity of probabilistic data models [13]. They replace the intractable task of computing high-dimensional sums and integrals with the much easier problem of solving a system of linear equations. The regression problem is formulated so that the MF method can be used to approximate the learning procedure in a way that avoids the quadratic programming optimization. The proposed approach is suitable for high-dimensional regression problems, and several experimental examples are presented.

1 Problem Statement and Some Basic Principles

The regression problem can be stated as follows: given a training data set $D = \{(y_i, t_i) \mid i = 1, 2, \ldots, n\}$ of input vectors $y_i$ and associated targets $t_i$, the goal is to fit a function $g(y)$ that approximates the relation inherent in the data points and can later be used to infer the output $t$ for a new input point $y$. Any practical regression algorithm has a loss function $L(t, g(y))$, which describes how far the estimated function deviates from the true one. Many forms of loss function can be found in the literature, e.g. linear, quadratic, and exponential. In this tutorial, Vapnik's loss function is used, which is known as the ε-insensitive loss function and is defined as:

\[
L(t, g(y)) =
\begin{cases}
0 & \text{if } |t - g(y)| \le \varepsilon \\
|t - g(y)| - \varepsilon & \text{otherwise}
\end{cases}
\tag{1}
\]

where $\varepsilon > 0$ is a predefined constant that controls the noise tolerance. With the ε-insensitive loss function, the goal is to find a $g(y)$ that deviates by at most ε from the actually obtained targets $t_i$ for all training data and, at the same time, is as flat as possible. In other words, the regression algorithm does not care about errors as long as they are less than ε, but will not accept any deviation larger than this.

Figure 1: The soft margin loss function.
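The ε-insensitive loss of Eq. (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `eps_insensitive_loss` and the sample values are assumptions made for this example and do not come from the report.

```python
import numpy as np

def eps_insensitive_loss(t, g, eps):
    """Elementwise epsilon-insensitive loss of Eq. (1).

    Returns 0 where |t - g| <= eps and |t - g| - eps otherwise.
    """
    t = np.asarray(t, dtype=float)
    g = np.asarray(g, dtype=float)
    return np.maximum(0.0, np.abs(t - g) - eps)

# Hypothetical targets and predictions, chosen only for illustration.
targets = np.array([1.0, 2.0, 3.0])
preds = np.array([1.05, 2.5, 2.0])
losses = eps_insensitive_loss(targets, preds, eps=0.1)
# The first sample lies inside the tube, so its loss is zero; the other two
# are penalized linearly by the amount they exceed eps.
```

Writing the loss as `max(0, |t - g| - eps)` is equivalent to the two-case definition and avoids an explicit branch.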

For pedagogical reasons, the following discussion begins with the case of linear functions $g$, taking the form:

\[
g(y) = w \cdot y + b \tag{2}
\]

where $w \in Y$, $Y$ is the input space, $b \in \mathbb{R}$, and $w \cdot y$ is the dot product of the vectors $w$ and $y$.

2 Classical Formulation of the Regression Problem

As stated before, the goal of a regression algorithm is to fit a flat function to the data points. Flatness in the case of Eq. (2) means that one seeks a small $w$. One way to ensure this flatness is to minimize the norm, i.e. $\|w\|^2$. Thus, the regression problem can be written as a convex optimization problem:

\[
\text{minimize} \quad \tfrac{1}{2}\|w\|^2 \tag{3}
\]
\[
\text{subject to} \quad
\begin{cases}
t_i - (w \cdot y_i + b) \le \varepsilon \\
(w \cdot y_i + b) - t_i \le \varepsilon
\end{cases}
\tag{4}
\]

The implied assumption in Eq. (4) is that a function $g$ actually exists that approximates all pairs $(y_i, t_i)$ with ε precision, or in other words, that the convex optimization problem is feasible. Sometimes, however, this may not be the case, or we may also want to allow for some errors. Analogously to the soft margin loss function [14], which was adapted to SVMs by Vapnik [15], slack variables $\zeta_i, \zeta_i^*$ can be introduced to cope with otherwise infeasible constraints of the optimization problem in Eq. (4). Hence the formulation stated in [15] is obtained:

\[
\text{minimize} \quad \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) \tag{5}
\]
\[
\text{subject to} \quad
\begin{cases}
t_i - (w \cdot y_i + b) \le \varepsilon + \zeta_i \\
(w \cdot y_i + b) - t_i \le \varepsilon + \zeta_i^* \\
\zeta_i, \zeta_i^* \ge 0
\end{cases}
\tag{6}
\]

The constant $C > 0$ determines the trade-off between the flatness of $g$ and the amount up to which deviations larger than ε are tolerated. This corresponds to dealing with the so-called ε-insensitive loss function described above.
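To make the role of the slack variables in Eqs. (5) and (6) concrete, the following sketch evaluates the primal objective for a fixed linear model. For given $w$ and $b$, the smallest slacks satisfying Eq. (6) are $\zeta_i = \max(0, t_i - g(y_i) - \varepsilon)$ and $\zeta_i^* = \max(0, g(y_i) - t_i - \varepsilon)$. The function name `primal_objective` and the toy data are assumptions for illustration; the code does not solve the optimization problem, it only evaluates it.

```python
import numpy as np

def primal_objective(w, b, Y, t, C, eps):
    """Evaluate Eq. (5) at (w, b) using the smallest slacks feasible in Eq. (6)."""
    residual = t - (Y @ w + b)                        # t_i - g(y_i)
    zeta = np.maximum(0.0, residual - eps)            # target above the tube
    zeta_star = np.maximum(0.0, -residual - eps)      # target below the tube
    objective = 0.5 * (w @ w) + C * np.sum(zeta + zeta_star)
    return objective, zeta, zeta_star

# Hypothetical one-dimensional toy problem (values chosen for illustration).
Y = np.array([[0.0], [1.0], [2.0]])
t = np.array([0.0, 1.5, 1.0])
w = np.array([1.0])
objective, zeta, zeta_star = primal_objective(w, 0.0, Y, t, C=10.0, eps=0.2)
# Flatness term: 0.5 * 1.0; slack term: 10 * (0.3 + 0.8).
```

A larger `C` makes the slack term dominate, so the optimizer would trade flatness for a closer fit; a smaller `C` does the opposite.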

As shown in Fig. 1, only the points outside the shaded region contribute to the cost, insofar as the deviations are penalized in a linear fashion. It turns out that in most cases the optimization problem in Eqs. (5) and (6) can be solved more easily in its dual formulation. Moreover, the dual formulation provides the key to extending the SVM to nonlinear functions. Hence, a standard dualization method utilizing Lagrange multipliers is described next.

2.1 Dual problem and quadratic programming

The minimization problem in Eqs. (5) and (6) is called the primal problem. The key idea of the dual problem is to construct a Lagrange function from the primal objective function and the corresponding constraints by introducing a dual set of variables. It can be shown that the Lagrange function has a saddle point with respect to the primal and dual variables at the solution (for details see e.g. [16], [17]). The primal objective function and its constraints are combined into the Lagrange function as follows:

\[
\begin{aligned}
L = {} & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) - \sum_{i=1}^{n} (\lambda_i \zeta_i + \lambda_i^* \zeta_i^*) \\
& - \sum_{i=1}^{n} \alpha_i \bigl(\varepsilon + \zeta_i - t_i + (w \cdot y_i + b)\bigr) \\
& - \sum_{i=1}^{n} \alpha_i^* \bigl(\varepsilon + \zeta_i^* + t_i - (w \cdot y_i + b)\bigr)
\end{aligned}
\tag{7}
\]

Here $L$ is the Lagrangian and $\alpha_i, \alpha_i^*, \lambda_i, \lambda_i^*$ are Lagrange multipliers. Hence the dual variables in Eq. (7) have to satisfy positivity constraints:

\[
\alpha_i, \alpha_i^*, \lambda_i, \lambda_i^* \ge 0 \tag{8}
\]

It follows from the saddle point condition that the partial derivatives of $L$ with respect to the primal variables $(w, b, \zeta_i, \zeta_i^*)$ have to vanish for optimality (here $\alpha_i^{(*)}$ refers to both $\alpha_i$ and $\alpha_i^*$):

\[
\partial_b L = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) = 0 \tag{9}
\]
\[
\partial_w L = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i = 0 \tag{10}
\]
\[
\partial_{\zeta_i^{(*)}} L = C - \alpha_i^{(*)} - \lambda_i^{(*)} = 0 \tag{11}
\]
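The stationarity conditions in Eqs. (9) to (11) can be checked numerically. The sketch below codes the Lagrangian of Eq. (7), picks multipliers that satisfy Eqs. (9) to (11) by construction, and confirms with central finite differences that the gradient with respect to $w$, $b$, and $\zeta$ vanishes. All names (`lagrangian`, `num_grad`) and all numeric values are assumptions made for this illustration; the multipliers are not the solution of any particular training problem.

```python
import numpy as np

def lagrangian(w, b, zeta, zeta_s, alpha, alpha_s, lam, lam_s, Y, t, C, eps):
    """Lagrangian of Eq. (7); the *_s arguments are the starred variables."""
    g = Y @ w + b
    return (0.5 * (w @ w)
            + C * np.sum(zeta + zeta_s)
            - np.sum(lam * zeta + lam_s * zeta_s)
            - np.sum(alpha * (eps + zeta - t + g))
            - np.sum(alpha_s * (eps + zeta_s + t - g)))

def num_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at vector x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = h
        grad[k] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# Hypothetical toy data and multipliers (illustrative values only).
Y = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, -1.0], [3.0, 2.0]])
t = np.array([0.5, 1.0, 1.5, 3.0])
C, eps, b = 1.0, 0.1, 0.0
alpha = np.array([0.3, 0.0, 0.2, 0.0])
alpha_s = np.array([0.0, 0.4, 0.0, 0.1])    # sum(alpha_s - alpha) = 0, Eq. (9)
w = Y.T @ (alpha - alpha_s)                 # Eq. (10)
lam, lam_s = C - alpha, C - alpha_s         # Eq. (11)
zeta, zeta_s = np.zeros(4), np.zeros(4)

args = (alpha, alpha_s, lam, lam_s, Y, t, C, eps)
grad_w = num_grad(lambda v: lagrangian(v, b, zeta, zeta_s, *args), w)
grad_b = num_grad(lambda v: lagrangian(w, v[0], zeta, zeta_s, *args), np.array([b]))
grad_zeta = num_grad(lambda v: lagrangian(w, b, v, zeta_s, *args), zeta)
# All three gradients are zero up to rounding error.
```

Because the Lagrangian is quadratic in $w$ and linear in $b$ and $\zeta$, central differences are exact here up to floating-point rounding.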

Substituting Eqs. (9), (10), and (11) into Eq. (7) yields the dual optimization problem:

\[
\begin{aligned}
\text{maximize} \quad & -\tfrac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)(y_i \cdot y_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} t_i (\alpha_i - \alpha_i^*) \\
\text{subject to} \quad & \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C]
\end{aligned}
\tag{12}
\]

In deriving Eq. (12), the dual variables $\lambda_i, \lambda_i^*$ are eliminated through the condition in Eq. (11), which can be reformulated as $\lambda_i^{(*)} = C - \alpha_i^{(*)}$. Eq. (10) can be rewritten as follows:

\[
w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i, \quad \text{thus} \quad g(y) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)(y_i \cdot y) + b \tag{13}
\]

This is the so-called Support Vector Machine regression expansion, i.e. $w$ can be completely described as a linear combination of the training patterns $y_i$. In a sense, the complexity of a function's representation by support vectors (SVs) is independent of the dimensionality of the input space $Y$ and depends only on the number of SVs. Moreover, the complete algorithm can be described in terms of dot products between the data. Even when evaluating $g(y)$, the value of $w$ does not need to be computed explicitly. These observations will come in handy for the formulation of a nonlinear extension.

2.2 Support Vectors

The Karush-Kuhn-Tucker (KKT) conditions [18, 19] form the basis of the Lagrangian solution. These conditions state that at the solution point, the product between dual variables and constraints has to vanish, i.e.:

\[
\begin{aligned}
\alpha_i (\varepsilon + \zeta_i - t_i + w \cdot y_i + b) &= 0 \\
\alpha_i^* (\varepsilon + \zeta_i^* + t_i - w \cdot y_i - b) &= 0
\end{aligned}
\tag{14}
\]
\[
\begin{aligned}
(C - \alpha_i)\, \zeta_i &= 0 \\
(C - \alpha_i^*)\, \zeta_i^* &= 0
\end{aligned}
\tag{15}
\]
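Returning briefly to the expansion in Eq. (13), the sketch below illustrates that $g(y)$ can be evaluated either through an explicit $w$ or purely through dot products with the training patterns, and that both give the same value. The training patterns, the multipliers, the offset `b`, and the function names are hypothetical values chosen for illustration; they are not obtained by solving Eq. (12).

```python
import numpy as np

# Hypothetical training patterns and dual variables (illustration only).
Y = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, -1.0], [3.0, 2.0]])
alpha = np.array([0.3, 0.0, 0.2, 0.0])
alpha_star = np.array([0.0, 0.4, 0.0, 0.1])
b = 0.25
beta = alpha - alpha_star        # coefficients (alpha_i - alpha_i^*)

# Explicit weight vector: w = sum_i (alpha_i - alpha_i^*) y_i.
w = Y.T @ beta

def g_explicit(y):
    """Evaluate g(y) = w . y + b with w formed explicitly."""
    return w @ y + b

def g_dual(y):
    """Evaluate g(y) = sum_i (alpha_i - alpha_i^*) (y_i . y) + b, without w."""
    return beta @ (Y @ y) + b

y_new = np.array([0.5, -1.5])
same = np.isclose(g_explicit(y_new), g_dual(y_new))
```

The second form touches the data only through the dot products `Y @ y`, which is the property the nonlinear extension relies on.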

Several useful conclusions can be drawn from these conditions. Firstly, only samples $(y_i, t_i)$ with corresponding $\alpha_i^{(*)} = C$ lie outside the ε-insensitive tube. Secondly, $\alpha_i \alpha_i^* = 0$, i.e. there can never be a pair of dual variables $\alpha_i, \alpha_i^*$ that are both simultaneously nonzero. This allows us to conclude that:

\[
\varepsilon - t_i + w \cdot y_i + b \ge 0 \quad \text{and} \quad \zeta_i = 0 \quad \text{if } \alpha_i < C \tag{16}
\]
\[
\varepsilon - t_i + w \cdot y_i + b \le 0 \quad \text{if } \alpha_i > 0 \tag{17}
\]

A final note has to be made regarding the sparsity of the SVM expansion. From Eq. (14) it follows that the Lagrange multipliers may be nonzero only for $|g(y_i) - t_i| \ge \varepsilon$. In other words, for all samples inside the ε-tube (i.e. the shaded region in Fig. 1) the $\alpha_i, \alpha_i^*$ vanish: for $|g(y_i) - t_i| < \varepsilon$ the second factor in Eq. (14) is nonzero, hence $\alpha_i, \alpha_i^*$ have to be zero for the KKT conditions to be satisfied. Therefore there is a sparse expansion of $w$ in terms of the $y_i$ (i.e. not all $y_i$ are needed to describe $w$). The training samples that come with nonvanishing coefficients are called Support Vectors.

2.3 Computing b

There are many ways to compute the value of $b$ in Eq. (13). One such way can be found in [20]:

\[
b = -\tfrac{1}{2}\bigl(w \cdot (y_r + y_s)\bigr) \tag{19}
\]

where $y_r$ and $y_s$ are support vectors (i.e. input vectors that have a nonzero value of $\alpha_i$ or $\alpha_i^*$, respectively).

3 Nonlinear Regression: The Kernel Trick

The next step is to make the SVM algorithm nonlinear. This could be achieved, for instance, by simply preprocessing the training patterns $y_i$ with a map $\Psi : Y \to I$ into some feature space $I$, as described in [1], and then applying the standard SVM regression algorithm. Here is a brief look at an example given in [1].

Example 1 (Quadratic features in $\mathbb{R}^2$)