SVM Tutorial: Classification, Regression, and Ranking




Hwanjo Yu and Sungchul Kim

Hwanjo Yu, POSTECH, Pohang, South Korea, e-mail: hwanjoyu@postech.ac.kr
Sungchul Kim, POSTECH, Pohang, South Korea, e-mail: subright@postech.ac.kr

1 Introduction

Support Vector Machines (SVMs) have been extensively researched in the data mining and machine learning communities for the last decade and actively applied to applications in various domains. SVMs are typically used for learning classification, regression, or ranking functions, for which they are called classifying SVM, support vector regression (SVR), or ranking SVM (RankSVM), respectively. Two special properties of SVMs are that they achieve (1) high generalization by maximizing the margin and (2) efficient learning of nonlinear functions through the kernel trick. This chapter introduces these general concepts and techniques of SVMs for learning classification, regression, and ranking functions. In particular, we first present SVMs for binary classification in Section 2, SVR in Section 3, ranking SVM in Section 4, and another recently developed method for learning ranking SVMs, called Ranking Vector Machine (RVM), in Section 5.

2 SVM Classification

SVMs were initially developed for classification [5] and have been extended for regression [23] and preference (or rank) learning [14, 27]. The initial form of SVMs is a binary classifier where the output of the learned function is either positive or negative. A multiclass classification can be implemented by combining multiple binary classifiers using the pairwise coupling method [13, 15]. This section explains the motivation and formalization of SVM as a binary classifier, and the two key properties: margin maximization and the kernel trick.

Fig. 1 Linear classifiers (hyperplanes) in a two-dimensional space

Binary SVMs are classifiers which discriminate data points of two categories. Each data object (or data point) is represented by an n-dimensional vector, and each of these data points belongs to only one of two classes. A linear classifier separates them with a hyperplane. For example, Fig. 1 shows two groups of data and separating hyperplanes, which are lines in a two-dimensional space. There are many linear classifiers that correctly classify (or divide) the two groups of data, such as L1, L2 and L3 in Fig. 1. In order to achieve maximum separation between the two classes, SVM picks the hyperplane which has the largest margin. The margin is the summation of the shortest distances from the separating hyperplane to the nearest data points of both categories. Such a hyperplane is likely to generalize better, meaning that it will correctly classify unseen or testing data points.

To support nonlinear classification problems, SVMs map the input space into a feature space. The kernel trick makes this possible without an explicit formulation of the mapping function, which could otherwise cause the curse of dimensionality. A linear classification in the new space (the feature space) is then equivalent to a nonlinear classification in the original space (the input space). SVMs do this by mapping input vectors to a higher-dimensional space (the feature space) in which a maximal separating hyperplane is constructed.
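
The margin-maximizing behaviour described above can be seen directly with an off-the-shelf solver. The following is a minimal sketch, assuming the scikit-learn and NumPy packages and using made-up toy points (they are not the data of Fig. 1); a very large C approximates the hard-margin classifier introduced in the next subsection. Note that scikit-learn parameterizes the hyperplane as w · x + b rather than the w · x − b form used in this chapter.

import numpy as np
from sklearn.svm import SVC

# Toy two-dimensional data: two separable groups (hypothetical example data)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],      # class +1
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])     # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM of Section 2.1
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # weight vector
b = clf.intercept_[0]     # bias (scikit-learn uses w.x + b)
print("w =", w, " b =", b)
print("margin = 1/||w|| =", 1.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)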

2.1 Hard-margin SVM Classification

To understand how SVMs compute the hyperplane of maximal margin and support nonlinear classification, we first explain the hard-margin SVM, where the training data are free of noise and can be correctly classified by a linear function.

The data points D in Fig. 1 (the training set) can be expressed mathematically as follows:

D = {(x₁, y₁), (x₂, y₂), ..., (x_m, y_m)}   (1)

where x_i is an n-dimensional real vector and y_i is either 1 or -1, denoting the class to which the point x_i belongs. The SVM classification function F(x) takes the form

F(x) = w · x − b.   (2)

w is the weight vector and b is the bias, which will be computed by the SVM in the training process.

First, to correctly classify the training set, F(·) (or w and b) must return positive numbers for positive data points and negative numbers otherwise, that is, for every point x_i in D,

w · x_i − b > 0 if y_i = 1, and
w · x_i − b < 0 if y_i = −1.

These conditions can be revised into:

y_i (w · x_i − b) > 0, ∀(x_i, y_i) ∈ D   (3)

If there exists such a linear function F that correctly classifies every point in D, i.e., satisfies Eq. (3), D is called linearly separable.

Second, F (or the hyperplane) needs to maximize the margin, the distance from the hyperplane to the closest data points. An example of such a hyperplane is illustrated in Fig. 2. To achieve this, Eq. (3) is revised into the following Eq. (4):

y_i (w · x_i − b) ≥ 1, ∀(x_i, y_i) ∈ D   (4)

Note that Eq. (4) includes an equality sign, and the right side becomes 1 instead of 0. If D is linearly separable, i.e., every point in D satisfies Eq. (3), then there exists such an F that satisfies Eq. (4): if there exist w and b that satisfy Eq. (3), they can always be rescaled to satisfy Eq. (4).

The distance from the hyperplane to a vector x_i is formulated as |F(x_i)| / ||w||. Thus, the margin becomes

margin = 1 / ||w||   (5)

because when x_i are the closest vectors, F(x_i) returns 1 according to Eq. (4). The closest vectors, which satisfy Eq. (4) with the equality sign, are called support vectors.

Fig. 2 SVM classification function: the hyperplane maximizing the margin in a two-dimensional space

Maximizing the margin thus amounts to minimizing ||w||, and the training problem in SVM becomes the following constrained optimization problem:

minimize: Q(w) = (1/2) ||w||²   (6)
subject to: y_i (w · x_i − b) ≥ 1, ∀(x_i, y_i) ∈ D   (7)

The factor of 1/2 is used for mathematical convenience.

2.1.1 Solving the Constrained Optimization Problem

The constrained optimization problem of (6) and (7) is called the primal problem. It is characterized as follows: the objective function (6) is a convex function of w, and the constraints are linear in w. Accordingly, we may solve the constrained optimization problem using the method of Lagrange multipliers [3]. First, we construct the Lagrange function:

J(w, b, α) = (1/2) w · w − Σ_{i=1}^m α_i {y_i (w · x_i − b) − 1}   (8)

where the auxiliary nonnegative variables α_i are called Lagrange multipliers. The solution to the constrained optimization problem is determined by the saddle point of the Lagrange function J(w, b, α), which has to be minimized with respect to w and b and maximized with respect to α. Thus, differentiating J(w, b, α) with respect to w and b and setting the results equal to zero, we get the following two conditions of optimality:

Condition 1: ∂J(w, b, α)/∂w = 0   (9)
Condition 2: ∂J(w, b, α)/∂b = 0   (10)

After rearrangement of terms, Condition 1 yields

w = Σ_{i=1}^m α_i y_i x_i   (11)

and Condition 2 yields

Σ_{i=1}^m α_i y_i = 0   (12)

The solution vector w is thus defined in terms of an expansion that involves the m training examples.

As noted earlier, the primal problem deals with a convex cost function and linear constraints. Given such a constrained optimization problem, it is possible to construct another problem called the dual problem. The dual problem has the same optimal value as the primal problem, but with the Lagrange multipliers providing the optimal solution. To postulate the dual problem for our primal problem, we first expand Eq. (8), term by term, as follows:

J(w, b, α) = (1/2) w · w − Σ_{i=1}^m α_i y_i w · x_i + b Σ_{i=1}^m α_i y_i + Σ_{i=1}^m α_i   (13)

The third term on the right-hand side of Eq. (13) is zero by virtue of the optimality condition of Eq. (12). Furthermore, from Eq. (11) we have

w · w = Σ_{i=1}^m α_i y_i w · x_i = Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j x_i · x_j   (14)

Accordingly, setting the objective function J(w, b, α) = Q(α), we can reformulate Eq. (13) as

Q(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j x_i · x_j   (15)

where the α_i are nonnegative. We now state the dual problem:

maximize: Q(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i · x_j   (16)
subject to: Σ_i α_i y_i = 0   (17)
α_i ≥ 0   (18)

Note that the dual problem is cast entirely in terms of the training data. Moreover, the function Q(α) to be maximized depends only on the input patterns in the form of the set of dot products {x_i · x_j}, i, j = 1, ..., m.

Having determined the optimum Lagrange multipliers, denoted by α_i*, we may compute the optimum weight vector w* using Eq. (11) and so write

w* = Σ_i α_i* y_i x_i   (19)

Note that, according to the Kuhn-Tucker conditions of optimization theory, the solution α_i* of the dual problem must satisfy the condition

α_i* {y_i (w* · x_i − b) − 1} = 0 for i = 1, 2, ..., m   (20)

that is, either α_i* or its corresponding constraint {y_i (w* · x_i − b) − 1} must be zero. This condition implies that only when x_i is a support vector, i.e., y_i (w* · x_i − b) = 1, will its corresponding coefficient α_i* be nonzero (positive, by Eq. (18)). In other words, the x_i whose corresponding coefficients α_i are zero do not affect the optimum weight vector w* in Eq. (19). Thus, the optimum weight vector w* depends only on the support vectors, whose coefficients are positive. Once we compute the nonzero α_i* and their corresponding support vectors, we can compute the bias b using a positive support vector x_i from the following equation:

b = w* · x_i − 1   (21)

The classification function of Eq. (2) now becomes:

F(x) = Σ_i α_i y_i x_i · x − b   (22)
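
To trace the derivation numerically, the sketch below (illustrative only, assuming NumPy and SciPy; the four training points are made up) maximizes the dual objective of Eqs. (16)-(18) with a general-purpose solver and then recovers w from Eq. (19) and b from Eq. (21). In practice a dedicated SVM solver would be used; this is only meant to make the equations concrete.

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable training set (hypothetical data)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

H = (y[:, None] * y[None, :]) * (X @ X.T)    # H_ij = y_i y_j x_i.x_j

def neg_dual(a):                             # minimize -Q(alpha), Eq. (16)
    return -(a.sum() - 0.5 * a @ H @ a)

cons = {"type": "eq", "fun": lambda a: a @ y}            # Eq. (17)
bounds = [(0.0, None)] * m                               # Eq. (18)
res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=bounds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                          # Eq. (19)
i_pos = np.argmax(alpha * (y > 0))           # a positive support vector
b = w @ X[i_pos] - 1.0                       # Eq. (21): w.x_i - b = 1
print("alpha =", alpha.round(4))
print("w =", w, " b =", b)
print("F(x_i) =", X @ w - b)                 # Eq. (2): signs match y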

2.2 Soft-margin SVM Classification

The discussion so far has focused on linearly separable cases. However, the optimization problem of (6) and (7) will not have a solution if D is not linearly separable. To deal with such cases, the soft-margin SVM allows mislabeled data points while still maximizing the margin. The method introduces slack variables ξ_i, which measure the degree of misclassification. The following is the optimization problem for the soft-margin SVM.

minimize: Q₁(w, b, ξ) = (1/2) ||w||² + C Σ_i ξ_i   (23)
subject to: y_i (w · x_i − b) ≥ 1 − ξ_i, ∀(x_i, y_i) ∈ D   (24)
ξ_i ≥ 0   (25)

Due to the ξ_i in Eq. (24), data points are allowed to be misclassified, and the amount of misclassification is minimized while maximizing the margin according to the objective function (23). C is a parameter that determines the trade-off between the margin size and the amount of error in training.

Similarly to the case of the hard-margin SVM, this primal form can be transformed to the following dual form using the Lagrange multipliers:

maximize: Q₂(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i · x_j   (26)
subject to: Σ_i α_i y_i = 0   (27)
C ≥ α_i ≥ 0   (28)

Note that neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem. The dual problem for the case of nonseparable patterns is thus similar to that for the simple case of linearly separable patterns, except for a minor but important difference. The objective function Q(α) to be maximized is the same in both cases. The nonseparable case differs from the separable case in that the constraint α_i ≥ 0 is replaced with the more stringent constraint C ≥ α_i ≥ 0. Except for this modification, the constrained optimization for the nonseparable case and the computation of the optimum values of the weight vector w and bias b proceed in the same way as in the linearly separable case.

Just as in the hard-margin SVM, the α_i constitute a dual representation for the weight vector such that

w* = Σ_{i=1}^{m_s} α_i* y_i x_i   (29)

where m_s is the number of support vectors, i.e., those whose corresponding coefficients α_i* > 0. The determination of the optimum value of the bias also follows a procedure similar to that described before. Once α* and b* are computed, the function of Eq. (22) is used to classify new objects.

We can further disclose the relationships among α_i, ξ_i, and C through the Kuhn-Tucker conditions, which are defined by

α_i {y_i (w · x_i − b) − 1 + ξ_i} = 0, i = 1, 2, ..., m   (30)

and

μ_i ξ_i = 0, i = 1, 2, ..., m   (31)

Eq. (30) is a rewrite of Eq. (20) except that the unity term 1 is replaced by (1 − ξ_i). As for Eq. (31), the μ_i are Lagrange multipliers that have been introduced to enforce the nonnegativity of the slack variables ξ_i for all i. At the saddle point, the derivative of the Lagrange function for the primal problem with respect to the slack variable ξ_i is zero, the evaluation of which yields

α_i + μ_i = C   (32)

By combining Eqs. (31) and (32), we see that

ξ_i = 0 if α_i < C, and   (33)
ξ_i ≥ 0 if α_i = C   (34)

We can graphically display the relationships among α_i, ξ_i, and C as in Fig. 3.

Fig. 3 Graphical relationships among α_i, ξ_i, and C

Data points outside the margin will have α_i = 0 and ξ_i = 0, and those on the margin line will have C > α_i > 0 and still ξ_i = 0. Data points within the margin will have α_i = C; among them, those correctly classified will have 1 > ξ_i > 0, and misclassified points will have ξ_i > 1.
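
These relationships can be inspected empirically. The sketch below (a rough illustration, assuming scikit-learn and NumPy; the overlapping synthetic blobs and the C values are arbitrary) trains soft-margin SVMs with a small and a large C and counts how many support vectors sit at the bound α_i = C, i.e., lie inside the margin or are misclassified. In scikit-learn, dual_coef_ stores y_i α_i for the support vectors, so its absolute value gives α_i.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian blobs (synthetic, not from the chapter)
X = np.vstack([rng.randn(50, 2) + [1, 1], rng.randn(50, 2) + [-1, -1]])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alpha = np.abs(clf.dual_coef_[0])        # alpha_i of the support vectors
    at_bound = np.sum(np.isclose(alpha, C))  # alpha_i = C: inside margin or misclassified
    on_margin = np.sum((alpha > 1e-8) & ~np.isclose(alpha, C))
    print(f"C={C}: {len(alpha)} support vectors, "
          f"{at_bound} with alpha=C, {on_margin} strictly between 0 and C")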

2.3 Kernel Trick for Nonlinear Classification

If the training data are not linearly separable, there is no straight hyperplane that can separate the classes. In order to learn a nonlinear function in that case, linear SVMs must be extended to nonlinear SVMs for the classification of nonlinearly separable data. The process of finding classification functions using nonlinear SVMs consists of two steps. First, the input vectors are transformed into high-dimensional feature vectors where the training data can be linearly separated. Then, SVMs are used to find the hyperplane of maximal margin in the new feature space. The separating hyperplane is a linear function in the transformed feature space but a nonlinear function in the original input space.

Let x be a vector in the n-dimensional input space and ϕ(·) be a nonlinear mapping function from the input space to the high-dimensional feature space. The hyperplane representing the decision boundary in the feature space is defined as follows:

w · ϕ(x) − b = 0   (35)

where w denotes a weight vector that can map the training data in the high-dimensional feature space to the output space, and b is the bias. Using the ϕ(·) function, the weight becomes

w = Σ_i α_i y_i ϕ(x_i)   (36)

The decision function of Eq. (22) becomes

F(x) = Σ_i^m α_i y_i ϕ(x_i) · ϕ(x) − b   (37)

Furthermore, the dual problem of the soft-margin SVM (Eq. (26)) can be rewritten using the mapping function on the data vectors as follows:

Q(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j ϕ(x_i) · ϕ(x_j)   (38)

with the same constraints as before.

Note that the feature mapping functions in the optimization problem and also in the classifying function always appear as dot products, e.g., ϕ(x_i) · ϕ(x_j), the inner product between pairs of vectors in the transformed feature space. Computing the inner product in the transformed feature space directly would be quite complex and would suffer from the curse of dimensionality. To avoid this problem, the kernel trick is used. The kernel trick replaces the inner product in the feature space with a kernel function K evaluated in the original input space:

K(u, v) = ϕ(u) · ϕ(v)   (39)

Mercer's theorem states that a kernel function K is valid if and only if the following condition is satisfied for any function ψ with finite ∫ ψ(x)² dx (refer to [9] for the detailed proof):

∫∫ K(u, v) ψ(u) ψ(v) du dv ≥ 0   (40)

Mercer's theorem ensures that the kernel function can always be expressed as the inner product between pairs of input vectors in some high-dimensional space. Thus the inner product can be calculated using the kernel function with the input vectors in the original space alone, without transforming them into high-dimensional feature vectors.

The dual problem is now defined using the kernel function as follows:

maximize: Q₂(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)   (41)
subject to: Σ_i α_i y_i = 0   (42)
C ≥ α_i ≥ 0   (43)

The classification function becomes:

F(x) = Σ_i α_i y_i K(x_i, x) − b   (44)

Since K(·, ·) is computed in the input space, no feature transformation is actually performed and no ϕ(·) is computed; thus the weight vector w = Σ_i α_i y_i ϕ(x_i) is not computed either in nonlinear SVMs.

The following kernel functions are popularly used:

Polynomial: K(a, b) = (a · b + 1)^d
Radial Basis Function (RBF): K(a, b) = exp(−γ ||a − b||²)
Sigmoid: K(a, b) = tanh(κ a · b + c)

Note that the kernel function is a kind of similarity function between two vectors, whose output is maximized when the two vectors are equivalent. Because of this, an SVM can learn a function from data of any shape beyond vectors (such as trees or graphs), as long as we can compute a similarity function between any pair of data objects. Further discussion of the properties of these kernel functions is beyond our scope. We will instead give an example of using the polynomial kernel for learning an XOR function in the following section.
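
The three kernels listed above are straightforward to write down directly. The snippet below is a small sketch assuming only NumPy; the parameter values d, γ, κ and c are arbitrary choices, and the last line illustrates the similarity interpretation: the RBF kernel attains its maximum value of 1 exactly when the two vectors coincide.

import numpy as np

def polynomial_kernel(a, b, d=2):
    return (np.dot(a, b) + 1.0) ** d                 # K(a,b) = (a.b + 1)^d

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))     # K(a,b) = exp(-gamma ||a-b||^2)

def sigmoid_kernel(a, b, kappa=0.1, c=-1.0):
    return np.tanh(kappa * np.dot(a, b) + c)         # K(a,b) = tanh(kappa a.b + c)

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 2.0, 3.0])
print(polynomial_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
print(rbf_kernel(u, v) >= rbf_kernel(u, v + 0.5))    # similarity drops as v moves away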

2.3.1 Example: XOR problem

To illustrate the procedure of training a nonlinear SVM function, assume we are given the training set of Table 1. Figure 4 plots the training points in the 2-D input space; there is no linear function that can separate them.

Table 1 XOR problem

Input vector x   Desired output y
(−1, −1)         −1
(−1, +1)         +1
(+1, −1)         +1
(+1, +1)         −1

Fig. 4 XOR problem

To proceed, let

K(x, x_i) = (1 + x · x_i)²   (45)

If we denote x = (x₁, x₂) and x_i = (x_{i1}, x_{i2}), the kernel function is expressed in terms of monomials of various orders as follows:

K(x, x_i) = 1 + x₁² x_{i1}² + 2 x₁ x₂ x_{i1} x_{i2} + x₂² x_{i2}² + 2 x₁ x_{i1} + 2 x₂ x_{i2}   (46)

The image of the input vector x induced in the feature space is therefore deduced to be

ϕ(x) = (1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂)   (47)

Based on this mapping function, the objective function for the dual form can be derived from Eq. (41) as follows:

Q(α) = α₁ + α₂ + α₃ + α₄ − (1/2)(9α₁² − 2α₁α₂ − 2α₁α₃ + 2α₁α₄ + 9α₂² + 2α₂α₃ − 2α₂α₄ + 9α₃² − 2α₃α₄ + 9α₄²)   (48)

Optimizing Q(α) with respect to the Lagrange multipliers yields the following set of simultaneous equations:

9α₁ − α₂ − α₃ + α₄ = 1
−α₁ + 9α₂ + α₃ − α₄ = 1
−α₁ + α₂ + 9α₃ − α₄ = 1
α₁ − α₂ − α₃ + 9α₄ = 1

Hence, the optimal values of the Lagrange multipliers are

α₁* = α₂* = α₃* = α₄* = 1/8

This result denotes that all four input vectors are support vectors. The optimum value of Q(α) is

Q(α*) = 1/4

and

(1/2) ||w||² = 1/4, i.e., ||w|| = 1/√2

From Eq. (36), we find that the optimum weight vector is

w = (1/8) [−ϕ(x₁) + ϕ(x₂) + ϕ(x₃) − ϕ(x₄)] = (0, 0, −1/√2, 0, 0, 0)   (49)

The bias b is 0 because the first element of w is 0. The optimal hyperplane becomes

w · ϕ(x) = (0, 0, −1/√2, 0, 0, 0) · (1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂) = 0   (50)

which reduces to

−x₁ x₂ = 0   (51)

That is, x₁ x₂ = 0 is the optimal hyperplane, the solution of the XOR problem. The decision function F(x) = −x₁x₂ outputs y = −1 for both input points x₁ = x₂ = −1 and x₁ = x₂ = 1, and y = +1 for the input points (x₁ = −1, x₂ = 1) and (x₁ = 1, x₂ = −1). Figure 5 represents the four points in the transformed feature space.

Fig. 5 The four data points of the XOR problem in the transformed feature space
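
The arithmetic of this example can be checked mechanically. The following sketch (assuming only NumPy) solves the linear system above for the α_i, rebuilds w through the explicit mapping ϕ of Eq. (47), and confirms that the resulting decision function equals −x₁x₂ on the four training points.

import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)

# Linear system from optimizing Q(alpha): (y_i y_j K(x_i, x_j)) alpha = 1
K = (1.0 + X @ X.T) ** 2                       # polynomial kernel, Eq. (45)
A = (y[:, None] * y[None, :]) * K
alpha = np.linalg.solve(A, np.ones(4))
print("alpha =", alpha)                        # all equal to 1/8

def phi(x):                                    # explicit mapping, Eq. (47)
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1, x1**2, s*x1*x2, x2**2, s*x1, s*x2])

w = sum(a * yi * phi(xi) for a, yi, xi in zip(alpha, y, X))
print("w =", w.round(4))                       # (0, 0, -1/sqrt(2), 0, 0, 0)

for xi, yi in zip(X, y):
    print(xi, "F(x) =", w @ phi(xi), "desired", yi)   # F(x) = -x1*x2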

3 SVM Regression

SVM Regression (SVR) is a method to estimate a function that maps from an input object to a real number, based on training data. Similarly to the classifying SVM, SVR has the same properties of margin maximization and the kernel trick for nonlinear mapping.

A training set for regression is represented as follows:

D = {(x₁, y₁), (x₂, y₂), ..., (x_m, y_m)}   (52)

where x_i is an n-dimensional vector and y_i is the real-valued target for x_i. The SVR function F maps an input vector x_i to the target y_i and takes the form

F(x) = w · x + b   (53)

where w is the weight vector and b is the bias. The goal is to estimate the parameters (w and b) of the function that give the best fit of the data.

An SVR function F(x) approximates all pairs (x_i, y_i) while keeping the differences between the estimated values and the real values within ε precision. That is, for every input vector x_i in D,

y_i − w · x_i − b ≤ ε   (54)
w · x_i + b − y_i ≤ ε   (55)

The margin is

margin = 1 / ||w||   (56)

By minimizing ||w||² to maximize the margin, the training in SVR becomes the following constrained optimization problem:

minimize: L(w) = (1/2) ||w||²   (57)
subject to: y_i − w · x_i − b ≤ ε   (58)
w · x_i + b − y_i ≤ ε   (59)

The solution of this problem does not allow any errors. To allow some errors, in order to deal with noise in the training data, the soft-margin SVR uses slack variables ξ and ξ̂. The optimization problem is then revised as follows:

minimize: L(w, ξ) = (1/2) ||w||² + C Σ_i (ξ_i + ξ̂_i), C > 0   (60)
subject to: y_i − w · x_i − b ≤ ε + ξ_i, ∀(x_i, y_i) ∈ D   (61)
w · x_i + b − y_i ≤ ε + ξ̂_i, ∀(x_i, y_i) ∈ D   (62)
ξ_i, ξ̂_i ≥ 0   (63)

The constant C > 0 is the trade-off parameter between the margin size and the amount of error. The slack variables ξ_i and ξ̂_i deal with otherwise infeasible constraints of the optimization problem by imposing a penalty on the excess deviations that are larger than ε.

To solve the optimization problem of Eq. (60), we can construct a Lagrange function from the objective function with Lagrange multipliers as follows:

minimize: L = (1/2) ||w||² + C Σ_i (ξ_i + ξ̂_i) − Σ_i (η_i ξ_i + η̂_i ξ̂_i)
− Σ_i α_i (ε + ξ_i − y_i + w · x_i + b) − Σ_i α̂_i (ε + ξ̂_i + y_i − w · x_i − b)   (64)
subject to: η_i, η̂_i ≥ 0   (65)
α_i, α̂_i ≥ 0   (66)

where η_i, η̂_i, α_i, α̂_i are the Lagrange multipliers, which satisfy positivity constraints. The following is the process of finding the saddle point, using the partial derivatives of L with respect to each primal variable to minimize the function L:

∂L/∂b = Σ_i (α_i − α̂_i) = 0   (67)
∂L/∂w = w − Σ_i (α_i − α̂_i) x_i = 0,  i.e.,  w = Σ_i (α_i − α̂_i) x_i   (68)
∂L/∂ξ̂_i = C − α̂_i − η̂_i = 0,  i.e.,  η̂_i = C − α̂_i   (69)

The optimization problem with inequality constraints can be changed into the following dual optimization problem by substituting Eqs. (67), (68) and (69) into Eq. (64):

maximize: L(α) = Σ_i y_i (α_i − α̂_i) − ε Σ_i (α_i + α̂_i)   (70)
− (1/2) Σ_i Σ_j (α_i − α̂_i)(α_j − α̂_j) x_i · x_j   (71)
subject to: Σ_i (α_i − α̂_i) = 0   (72)
0 ≤ α_i, α̂_i ≤ C   (73)

The dual variables η_i, η̂_i are eliminated in revising Eq. (64) into Eq. (70). Eqs. (68) and (69) can be rewritten as follows:

w = Σ_i (α_i − α̂_i) x_i   (74)
η_i = C − α_i   (75)
η̂_i = C − α̂_i   (76)

where w is represented by a linear combination of the training vectors x_i. Accordingly, the SVR function F(x) becomes the following function:

F(x) = Σ_i (α_i − α̂_i) x_i · x + b   (77)

Eq. (77) can map the training vectors to target real values while allowing some errors, but it cannot handle the nonlinear SVR case. The same kernel trick can be applied by replacing the inner product of two vectors x_i, x_j with a kernel function K(x_i, x_j). The transformed feature space is usually high-dimensional, and the SVR function in this space becomes nonlinear in the original input space. Using the kernel function K, the inner product in the transformed feature space can be computed as fast as the inner product x_i · x_j in the original input space. The same kernel functions introduced in Section 2.3 can be applied here.

Once the original inner product is replaced with a kernel function K, the remaining process for solving the optimization problem is very similar to that for the linear SVR. The optimization problem can be rewritten using the kernel function as follows:

maximize: L(α) = Σ_i y_i (α_i − α̂_i) − ε Σ_i (α_i + α̂_i)
− (1/2) Σ_i Σ_j (α_i − α̂_i)(α_j − α̂_j) K(x_i, x_j)   (78)
subject to: Σ_i (α_i − α̂_i) = 0   (79)
α_i ≥ 0, α̂_i ≥ 0   (80)
0 ≤ α_i, α̂_i ≤ C   (81)

Finally, using the kernel function, the SVR function F(x) becomes

F(x) = Σ_i (α_i − α̂_i) K(x_i, x) + b   (82)
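
The ε-insensitive regression described in this section is available in standard libraries. The following sketch (illustrative only, assuming scikit-learn and NumPy; the noisy sine data and the values of C, ε and γ are made up) fits an RBF-kernel SVR; only the training points lying on or outside the ε-tube end up as support vectors.

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 60))[:, None]   # synthetic inputs
y = np.sin(X).ravel() + 0.1 * rng.randn(60)           # noisy targets

# epsilon is the tube width, C the error/margin trade-off of Eq. (60)
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)

print("number of support vectors:", len(model.support_))
print("prediction at x = pi/2:", model.predict([[np.pi / 2]])[0])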

4 SVM Ranking

Ranking SVM, which learns a ranking (or preference) function, has produced various applications in information retrieval [14, 16, 28]. The task of learning ranking functions is distinguished from that of learning classification functions as follows:

1. While a training set in classification is a set of data objects and their class labels, in ranking a training set is an ordering of data. Let "A is preferred to B" be specified as "A ≻ B". A training set for ranking SVM is denoted as R = {(x₁, y₁), ..., (x_m, y_m)} where y_i is the ranking of x_i, that is, y_i < y_j if x_i ≻ x_j.
2. Unlike a classification function, which outputs a distinct class for a data object, a ranking function outputs a score for each data object, from which a global ordering of the data is constructed. That is, the target function F(x_i) outputs a score such that F(x_i) > F(x_j) for any x_i ≻ x_j.

If not stated otherwise, R is assumed to be a strict ordering, which means that for all pairs x_i and x_j in a set D, either x_i ≻_R x_j or x_i ≺_R x_j. However, it can be straightforwardly generalized to weak orderings. Let R* be the optimal ranking of the data, in which the data are ordered perfectly according to the user's preference. A ranking function F is typically evaluated by how closely its ordering R_F approximates R*.

Using the techniques of SVM, a global ranking function F can be learned from an ordering R. For now, assume F is a linear ranking function such that:

∀{(x_i, x_j) : y_i < y_j ∈ R} : F(x_i) > F(x_j) ⟺ w · x_i > w · x_j   (83)

A weight vector w is adjusted by a learning algorithm. We say an ordering R is linearly rankable if there exists a function F (represented by a weight vector w) that satisfies Eq. (83) for all {(x_i, x_j) : y_i < y_j ∈ R}.

The goal is to learn an F which is concordant with the ordering R and also generalizes well beyond R, that is, to find the weight vector w such that w · x_i > w · x_j for most data pairs {(x_i, x_j) : y_i < y_j ∈ R}. Though this problem is known to be NP-hard [10], the solution can be approximated using SVM techniques by introducing (non-negative) slack variables ξ_ij and minimizing the upper bound Σ ξ_ij as follows [14]:

minimize: L₁(w, ξ_ij) = (1/2) w · w + C Σ ξ_ij   (84)
subject to: ∀{(x_i, x_j) : y_i < y_j ∈ R} : w · x_i ≥ w · x_j + 1 − ξ_ij   (85)
∀(i, j) : ξ_ij ≥ 0   (86)

By the constraint (85) and by minimizing the upper bound Σ ξ_ij in (84), the above optimization problem satisfies the orderings on the training set R with minimal error. By minimizing w · w, or equivalently maximizing the margin (= 1/||w||), it tries to maximize the generalization of the ranking function. We will explain how maximizing the margin corresponds to increasing the generalization of ranking in Section 4.1. C is the soft-margin parameter that controls the trade-off between the margin size and the training error. By rearranging the constraint (85) as

w · (x_i − x_j) ≥ 1 − ξ_ij   (87)

the optimization problem becomes equivalent to that of a classifying SVM on the pairwise difference vectors (x_i − x_j). Thus, we can extend an existing SVM implementation to solve the problem (a short illustration follows below).

Note that the support vectors are the data pairs (x_i^s, x_j^s) such that constraint (87) is satisfied with the equality sign, i.e., w · (x_i^s − x_j^s) = 1 − ξ_ij. Unbounded support vectors are the ones on the margin (i.e., their slack variables ξ_ij = 0), and bounded support vectors are the ones within the margin (i.e., 1 > ξ_ij > 0) or misranked (i.e., ξ_ij > 1). As in the classifying SVM, a function F in ranking SVM is expressed only by the support vectors.
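
The sketch below illustrates the reduction to classification on pairwise differences, under the assumption that scikit-learn and NumPy are available; the five-item ordering and the new items are made-up examples. Every pair with y_i < y_j contributes the difference x_i − x_j labelled +1 and its negation labelled −1, a linear SVM without intercept is fitted on these differences, and new items are then ranked by the score w · z of Eq. (91).

import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

# Hypothetical training items (rows) and their ranks: smaller y means preferred
X = np.array([[3.0, 1.0], [2.5, 0.5], [2.0, 2.0], [1.0, 1.5], [0.5, 0.0]])
y = np.array([1, 2, 3, 4, 5])

# Build pairwise difference vectors: x_i - x_j labelled +1 when x_i is preferred
diffs, labels = [], []
for i, j in combinations(range(len(y)), 2):
    if y[i] == y[j]:
        continue
    hi, lo = (i, j) if y[i] < y[j] else (j, i)
    diffs.append(X[hi] - X[lo]); labels.append(1)
    diffs.append(X[lo] - X[hi]); labels.append(-1)   # mirrored pair keeps classes balanced

rank_svm = LinearSVC(C=1.0, fit_intercept=False).fit(np.array(diffs), labels)
w = rank_svm.coef_[0]

# Score new items with F(z) = w.z and sort: higher score means higher rank
Z = np.array([[2.8, 1.2], [0.7, 0.3], [1.8, 1.9]])
scores = Z @ w
print("ranking of new items (best first):", np.argsort(-scores))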

Similarly to the classifying SVM, the primal problem of ranking SVM can be transformed to the following dual problem using the Lagrange multipliers:

maximize: L₂(α) = Σ_{ij} α_ij − (1/2) Σ_{ij} Σ_{uv} α_ij α_uv K(x_i − x_j, x_u − x_v)   (88)
subject to: C ≥ α_ij ≥ 0   (89)

Once transformed to the dual, the kernel trick can be applied to support nonlinear ranking functions. K(·, ·) is a kernel function, and α_ij is the coefficient of the pairwise difference vector (x_i − x_j).

Note that the kernel function is computed P² (≈ m⁴) times, where P is the number of data pairs and m is the number of data points in the training set; thus solving the ranking SVM takes at least O(m⁴) time. Fast training algorithms for ranking SVM have been proposed [17], but they are limited to linear kernels.

Once the α_ij are computed, w can be written in terms of the pairwise difference vectors and their coefficients such that:

w = Σ_{ij} α_ij (x_i − x_j)   (90)

The ranking function F on a new vector z can be computed using the kernel function replacing the dot product as follows:

F(z) = w · z = Σ_{ij} α_ij (x_i − x_j) · z = Σ_{ij} α_ij K(x_i − x_j, z).   (91)

4.1 Margin Maximization in Ranking SVM

Fig. 6 Linear projection of four data points

We now explain the margin maximization of the ranking SVM, to reason about how the ranking SVM generates a ranking function of high generalization. We first establish some essential properties of ranking SVM.

For convenience of explanation, we assume that a training set R is linearly rankable and thus use the hard-margin SVM, i.e., ξ_ij = 0 for all (i, j) in the objective (84) and the constraints (85).

In our ranking formulation, from Eq. (83), the linear ranking function F_w projects data vectors onto a weight vector w. For instance, Fig. 6 illustrates linear projections of four vectors {x₁, x₂, x₃, x₄} onto two different weight vectors w₁ and w₂, respectively, in a two-dimensional space. Both F_{w₁} and F_{w₂} make the same ordering R for the four vectors, that is, x₁ ≻_R x₂ ≻_R x₃ ≻_R x₄. The ranking difference of two vectors (x_i, x_j) according to a ranking function F_w is denoted by the geometric distance of the two vectors projected onto w, that is, formulated as w · (x_i − x_j) / ||w||.

Corollary 1. Suppose F_w is a ranking function computed by the hard-margin ranking SVM on an ordering R. Then, the support vectors of F_w represent the data pairs that are closest to each other when projected onto w, and thus closest in ranking.

Proof. The support vectors are the data pairs (x_i^s, x_j^s) such that w · (x_i^s − x_j^s) = 1 in constraint (87), which is the smallest possible value for all data pairs (x_i, x_j) ∈ R. Thus, their ranking difference according to F_w (= w · (x_i^s − x_j^s) / ||w||) is also the smallest among them [24].

Corollary 2. The ranking function F generated by the hard-margin ranking SVM maximizes the minimal difference of any data pairs in ranking.

Proof. By minimizing w · w, the ranking SVM maximizes the margin δ = 1/||w|| = w · (x_i^s − x_j^s) / ||w||, where (x_i^s, x_j^s) are the support vectors, which denotes, from the proof of Corollary 1, the minimal difference of any data pairs in ranking.

The soft-margin SVM allows bounded support vectors whose ξ_ij > 0 as well as unbounded support vectors whose ξ_ij = 0, in order to deal with noise and allow small error for an R that is not completely linearly rankable. However, the objective function in (84) also minimizes the amount of the slacks and thus the amount of error, and the support vectors are the close data pairs in ranking. Thus, maximizing the margin generates the effect of maximizing the differences of close data pairs in ranking.

From Corollaries 1 and 2, we observe that ranking SVM improves the generalization performance by maximizing the minimal ranking difference. For example, consider the two linear ranking functions F_{w₁} and F_{w₂} in Fig. 6. Although the two weight vectors w₁ and w₂ make the same ordering, intuitively w₁ generalizes better than w₂ because the distance between the closest vectors projected onto w₁ (i.e., δ₁) is larger than that onto w₂ (i.e., δ₂). SVM computes the weight vector w that maximizes the differences of close data pairs in ranking. Ranking SVMs find a ranking function of high generalization in this way.

5 Ranking Vector Machine: An Efficient Method for Learning the 1-norm Ranking SVM

This section presents another rank learning method, Ranking Vector Machine (RVM), a revised 1-norm ranking SVM that is better for feature selection and more scalable to large data sets than the standard ranking SVM.

We first develop a 1-norm ranking SVM, a ranking SVM that is based on a 1-norm objective function. (The standard ranking SVM is based on a 2-norm objective function.) The 1-norm ranking SVM learns a function with far fewer support vectors than the standard SVM. Thereby, its testing time is much faster than that of 2-norm SVMs, and it provides better feature selection properties. (The function of a 1-norm SVM is likely to utilize fewer features by using fewer support vectors [11].) Feature selection is also important in ranking. Ranking functions are relevance or preference functions in document or data retrieval, and identifying key features increases the interpretability of the function. Feature selection for nonlinear kernels is especially challenging, and the fewer the support vectors, the more efficiently feature selection can be done [12, 20, 6, 30, 8].

We next present the RVM, which revises the 1-norm ranking SVM for fast training. The RVM trains much faster than standard SVMs while not compromising accuracy when the training set is relatively large. The key idea of the RVM is to express the ranking function with "ranking vectors" instead of support vectors. The support vectors in ranking SVMs are pairwise difference vectors of the closest pairs, as discussed in Section 4; training therefore requires investigating every data pair as a potential candidate support vector, and the number of data pairs is quadratic in the size of the training set. The ranking function of the RVM, on the other hand, utilizes individual training data objects instead of data pairs. Thus, the number of variables for optimization is substantially reduced in the RVM.

5.1 1-norm Ranking SVM

The goal of the 1-norm ranking SVM is the same as that of the standard ranking SVM, that is, to learn an F that satisfies Eq. (83) for most {(x_i, x_j) : y_i < y_j ∈ R} and generalizes well beyond the training set. In the 1-norm ranking SVM, we express Eq. (83) using the F of Eq. (91) as follows:

F(x_u) > F(x_v) ⟺ Σ_{ij}^P α_ij (x_i − x_j) · x_u > Σ_{ij}^P α_ij (x_i − x_j) · x_v   (92)
⟺ Σ_{ij}^P α_ij (x_i − x_j) · (x_u − x_v) > 0   (93)

Then, replacing the inner product with a kernel function, the 1-norm ranking SVM is formulated as:

minimize: L(α, ξ) = Σ_{ij}^P α_ij + C Σ_{uv}^P ξ_uv   (94)
subject to: Σ_{ij}^P α_ij K(x_i − x_j, x_u − x_v) ≥ 1 − ξ_uv, ∀{(u, v) : y_u < y_v ∈ R}   (95)
α ≥ 0, ξ ≥ 0   (96)

While the standard ranking SVM suppresses the weight w to improve the generalization performance, the 1-norm ranking SVM suppresses α in the objective function. Since the weight is expressed by the sum of the coefficients times the pairwise ranking difference vectors, suppressing the coefficients α corresponds to suppressing the weight w in the standard SVM. (Mangasarian proves this in [18].) C is a user parameter controlling the trade-off between the margin size and the amount of error ξ, and K is the kernel function. P is the number of pairwise difference vectors (≈ m²).

The training of the 1-norm ranking SVM becomes a linear programming (LP) problem, and is thus solvable by LP algorithms such as the Simplex and Interior Point methods [18, 11, 19]. Just as in the standard ranking SVM, K needs to be computed P² (≈ m⁴) times, and there are P constraints (95) and P coefficients α to compute. Once α is computed, F is computed using the same ranking function as the standard ranking SVM, i.e., Eq. (91).

The accuracies of the 1-norm ranking SVM and the standard ranking SVM are comparable, and both methods need to compute the kernel function O(m⁴) times. In practice, training the standard SVM is more efficient because fast decomposition algorithms such as sequential minimal optimization (SMO) [21] have been developed for it, while the 1-norm ranking SVM uses common LP solvers.

It has been shown that 1-norm SVMs use far fewer support vectors than standard 2-norm SVMs, that is, the number of positive coefficients (i.e., α > 0) after training is much smaller in 1-norm SVMs than in standard 2-norm SVMs [19, 11]. This is because, unlike in the standard 2-norm SVM, the support vectors in the 1-norm SVM are not restricted to those close to the boundary in classification, or to the minimal ranking difference vectors in ranking. Thus, testing involves far fewer kernel evaluations, and the method is more robust when the training set contains noisy features [31].

5.2 Ranking Vector Machine

Although the 1-norm ranking SVM has merits over the standard ranking SVM in terms of testing efficiency and feature selection, its training complexity is very high with respect to the number of data points. In this section, we present the Ranking Vector Machine (RVM), which revises the 1-norm ranking SVM to reduce the training time substantially. The RVM significantly reduces the number of variables in the optimization problem while not compromising accuracy. The key idea of the RVM is to express the ranking function with ranking vectors instead of support vectors.

The support vectors in ranking SVMs are chosen from the pairwise difference vectors, and the number of pairwise difference vectors is quadratic in the size of the training set. The ranking vectors, on the other hand, are chosen from the training vectors, so the number of variables to optimize is substantially reduced.

To theoretically justify this approach, we first present the Representer Theorem.

Theorem 1 (Representer Theorem [22]). Denote by Ω: [0, ∞) → ℝ a strictly monotonically increasing function, by X a set, and by c : (X × ℝ²)^m → ℝ ∪ {∞} an arbitrary loss function. Then each minimizer F ∈ H of the regularized risk

c((x₁, y₁, F(x₁)), ..., (x_m, y_m, F(x_m))) + Ω(||F||_H)   (97)

admits a representation of the form

F(x) = Σ_{i=1}^m α_i K(x_i, x)   (98)

The proof of the theorem is presented in [22]. Note that, in the theorem, the loss function c is arbitrary, allowing coupling between data points (x_i, y_i), while the regularizer Ω has to be monotonic.

Given such a loss function and regularizer, the representer theorem states that although we might be trying to solve the optimization problem in an infinite-dimensional space H, containing linear combinations of kernels centered on arbitrary points of X, the solution lies in the span of m particular kernels, those centered on the training points [22].

Based on the theorem, we define our ranking function F as Eq. (98), which is based on the training points rather than on arbitrary points (or pairwise difference vectors). Function (98) is similar to function (91) except that, unlike the latter, which uses pairwise difference vectors (x_i − x_j) and their coefficients (α_ij), the former utilizes the training vectors (x_i) and their coefficients (α_i). With this function, Eq. (92) becomes the following:

F(x_u) > F(x_v) ⟺ Σ_i^m α_i K(x_i, x_u) > Σ_i^m α_i K(x_i, x_v)   (99)
⟺ Σ_i^m α_i (K(x_i, x_u) − K(x_i, x_v)) > 0.   (100)

Thus, we set our loss function c as follows:

c = Σ_{(u,v): y_u < y_v ∈ R} (1 − Σ_i^m α_i (K(x_i, x_u) − K(x_i, x_v)))   (101)

The loss function operates on couples of data points, penalizing misranked pairs; that is, it returns higher values as the number of misranked pairs increases. Thus, the loss function is order sensitive, and it is an instance of the function class c in Eq. (97).

We set the regularizer Ω(||f||_H) = Σ_i^m α_i (α_i ≥ 0), which is strictly monotonically increasing. Let P be the number of pairs (u, v) ∈ R such that y_u < y_v, and let ξ_uv = 1 − Σ_i^m α_i (K(x_i, x_u) − K(x_i, x_v)). Then, our RVM is formulated as follows:

minimize: L(α, ξ) = Σ_i^m α_i + C Σ^P ξ_uv   (102)
subject to: Σ_i^m α_i (K(x_i, x_u) − K(x_i, x_v)) ≥ 1 − ξ_uv, ∀{(u, v) : y_u < y_v ∈ R}   (103)
α, ξ ≥ 0   (104)

The solution of the optimization problem lies in the span of kernels centered on the training points (i.e., Eq. (98)), as suggested by the representer theorem. Just as the 1-norm ranking SVM, the RVM suppresses α to improve the generalization, and enforces Eq. (100) through constraint (103). Note that there are only m coefficients α_i in the RVM. Thus, the kernel function is evaluated O(m³) times, while the standard ranking SVM computes it O(m⁴) times.

Another rationale for the RVM, that is, for using training vectors instead of pairwise difference vectors in the ranking function, is that the support vectors in the 1-norm ranking SVM are not the closest pairwise difference vectors, so expressing the ranking function with pairwise difference vectors is not as beneficial in the 1-norm ranking SVM. To explain this further, consider classifying SVMs. Unlike in the 2-norm (classifying) SVM, the support vectors in the 1-norm (classifying) SVM are not limited to those close to the decision boundary, which makes it possible for the 1-norm (classifying) SVM to express a similar boundary function with fewer support vectors. Directly extended from the 2-norm (classifying) SVM, the 2-norm ranking SVM improves generalization by maximizing the closest pairwise ranking difference, which corresponds to the margin in the 2-norm (classifying) SVM, as discussed in Section 4; it therefore expresses the function with the closest pairwise difference vectors (i.e., the support vectors). The 1-norm ranking SVM, however, improves generalization by suppressing the coefficients α, just as the 1-norm (classifying) SVM does. Thus, the support vectors in the 1-norm ranking SVM are no longer the closest pairwise difference vectors, and expressing the ranking function with pairwise difference vectors loses its benefit.
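
Since (102)-(104) is a linear program over the m coefficients α_i and the P slacks ξ_uv, it can be handed to any generic LP solver; the 1-norm ranking SVM of (94)-(96) has the same shape with α indexed by pairs instead of points. The following sketch is illustrative only, assuming SciPy and NumPy; the four-item data, the RBF kernel parameter, and C are made up.

import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hypothetical training items and ranks (smaller y means preferred)
X = np.array([[3.0, 1.0], [2.0, 2.0], [1.0, 1.5], [0.5, 0.0]])
y = np.array([1, 2, 3, 4])
m, C = len(y), 1.0

K = np.array([[rbf(a, b) for b in X] for a in X])
pairs = [(u, v) for u, v in combinations(range(m), 2) if y[u] < y[v]]
P = len(pairs)

# Variables: [alpha_1..alpha_m, xi_1..xi_P]; minimize sum(alpha) + C*sum(xi), Eq. (102)
c = np.concatenate([np.ones(m), C * np.ones(P)])

# Constraint (103): sum_i alpha_i (K(x_i,x_u) - K(x_i,x_v)) + xi_uv >= 1
A_ub = np.zeros((P, m + P))
for k, (u, v) in enumerate(pairs):
    A_ub[k, :m] = -(K[:, u] - K[:, v])
    A_ub[k, m + k] = -1.0
b_ub = -np.ones(P)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (m + P))
alpha = res.x[:m]
print("alpha =", alpha.round(4))                 # typically only a few ranking vectors are nonzero
print("scores F(x_i) =", (K @ alpha).round(4))   # Eq. (98): preferred items score higher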

5.3 Experiment

This section evaluates the RVM on synthetic datasets (Section 5.3.1) and a real-world dataset (Section 5.3.2). The RVM is compared with the state-of-the-art ranking SVM provided in SVM-light. The experimental results show that the RVM trains substantially faster than SVM-light for nonlinear kernels while their accuracies are comparable. More importantly, the number of ranking vectors in the RVM is multiple orders of magnitude smaller than the number of support vectors in SVM-light. Experiments are performed on a Windows XP Professional machine with a Pentium IV 2.8 GHz and 1 GB of RAM. We implemented the RVM using C and used CPLEX¹ as the LP solver. The source code is freely available at http://s.postech.ac.kr/rvm [29].

¹ http://www.ilog.com/products/cplex/

Evaluation metric: MAP (mean average precision) is used to measure ranking quality when there are only two classes of ranking [26], and NDCG is used to evaluate ranking performance for IR applications when there are multiple levels of ranking [2, 4, 7, 25]. Kendall's τ is used when there is a global ordering of data and the training data is a subset of it. Ranking SVMs as well as the RVM minimize the amount of error or mis-ranking, which corresponds to optimizing Kendall's τ [16, 27]. Thus, we use Kendall's τ to compare their accuracy.

Kendall's τ computes the overall accuracy by comparing the similarity of two orderings R* and R_F. (R_F is the ordering of D according to the learned function F.) Kendall's τ is defined based on the number of concordant pairs and discordant pairs. If R* and R_F agree on how they order a pair x_i and x_j, the pair is concordant; otherwise, it is discordant. The accuracy of the function F is defined as the number of concordant pairs between R* and R_F divided by the total number of pairs in D:

τ(R*, R_F) = (# of concordant pairs) / ( |D| (|D| − 1) / 2 )

For example, suppose R* and R_F order five points x₁, ..., x₅ as follows:

(x₁, x₂, x₃, x₄, x₅)   R*
(x₃, x₂, x₁, x₄, x₅)   R_F

Then the accuracy of F is 0.7, as the number of discordant pairs is 3, i.e., {x₁, x₂}, {x₁, x₃}, {x₂, x₃}, while the remaining 7 pairs are all concordant.
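
This accuracy is simple to compute by counting concordant pairs. The helper below, a small sketch using only the Python standard library, reproduces the 0.7 value of the five-point example; orderings are given as sequences of item labels from most to least preferred.

from itertools import combinations

def rank_positions(ordering):
    return {item: pos for pos, item in enumerate(ordering)}

def kendall_accuracy(r_star, r_f):
    """Fraction of concordant pairs between two orderings of the same items."""
    p1, p2 = rank_positions(r_star), rank_positions(r_f)
    concordant, total = 0, 0
    for a, b in combinations(list(r_star), 2):
        total += 1
        if (p1[a] - p1[b]) * (p2[a] - p2[b]) > 0:   # same relative order in both
            concordant += 1
    return concordant / total

R_star = ["x1", "x2", "x3", "x4", "x5"]
R_F    = ["x3", "x2", "x1", "x4", "x5"]
print(kendall_accuracy(R_star, R_F))   # 0.7: 3 of the 10 pairs are discordant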

5.3.1 Experiments on Synthetic Datasets

Below is the description of our experiments on synthetic datasets.

1. We randomly generated a training and a testing dataset, D_train and D_test respectively, where D_train contains m_train (= 40, 80, 120, 160, 200) data points of n (e.g., 5) dimensions (i.e., an m_train-by-n matrix), and D_test contains m_test (= 50) data points of n dimensions (i.e., an m_test-by-n matrix). Each element in the matrices is a random number between zero and one. (We only experimented on data sets of up to 200 objects for performance reasons: ranking SVMs run intolerably slowly on data sets larger than 200.)
2. We randomly generated a global ranking function F*, by randomly generating the weight vector w in F*(x) = w · x for the linear function, and in F*(x) = exp(−||w − x||²) for the RBF function.

3. We rank D_train and D_test according to F*, which forms the global orderings R_train and R_test on the training and testing data.
4. We train a function F from R_train, and test the accuracy of F on R_test.

We tuned the soft-margin parameter C by trying C = 10⁻⁵, 10⁻⁴, ..., 10⁵, and used the highest accuracy for comparison. For the linear and RBF target functions, we used linear and RBF kernels accordingly. We repeated this entire process 30 times to get the mean accuracy.

Fig. 7 Accuracy (Kendall's τ versus size of training set): (a) linear, (b) RBF

Accuracy: Figure 7 compares the accuracies of the RVM and the ranking SVM of SVM-light. The ranking SVM outperforms the RVM when the data set is small, but their difference becomes trivial as the size of the data set increases. This phenomenon can be explained by the fact that when the training size is too small, the number of potential ranking vectors becomes too small to draw an accurate ranking function, whereas the number of potential support vectors is still large. However, as the size of the training set increases, the RVM becomes as accurate as the ranking SVM because the number of potential ranking vectors becomes large as well.

Fig. 8 Training time in seconds versus size of training set: (a) linear kernel, (b) RBF kernel

Training time: Figure 8 compares the training time of the RVM and SVM-light. While SVM-light trains much faster than the RVM for the linear kernel (SVM-light is specially optimized for the linear kernel), the RVM trains significantly faster than SVM-light for the RBF kernel.

Fig. 9 Number of support (or ranking) vectors versus size of training set: (a) linear kernel, (b) RBF kernel

Number of support (or ranking) vectors: Figure 9 compares the number of support (or ranking) vectors used in the functions of the RVM and SVM-light. The RVM's model uses a significantly smaller number of vectors than SVM-light's.

Fig. 10 Sensitivity to noise (m_train = 100): decrement in accuracy versus the amount of noise k, for (a) linear and (b) RBF kernels

Sensitivity to noise: In this experiment, we compare the sensitivity of each method to noise. We insert noise by switching the order of some data pairs in R_train. We set the size of the training set m_train = 100 and the dimension n = 5. After we make R_train from a random function F*, we randomly pick k vectors from R_train and switch each with its adjacent vector in the ordering to implant noise in the training set.

Figure 10 shows the decrement of the accuracies as the number of misorderings increases in the training set. The accuracies of both methods decrease moderately as the noise increases, and their sensitivities to noise are comparable.

5.3.2 Experiment on a Real Dataset

In this section, we experiment using the OHSUMED dataset obtained from LETOR, the site containing benchmark datasets for ranking [1]. OHSUMED is a collection of documents and queries on medicine, consisting of 348,566 references and 106 queries. There are in total 16,140 query-document pairs upon which relevance judgements are made. In this dataset the relevance judgements have three levels: definitely relevant, partially relevant, and irrelevant. The OHSUMED dataset in LETOR extracts 25 features. We report our experiments on the first three queries and their documents, and compare the performance of the RVM and SVM-light on them. We tuned the parameters by 3-fold cross validation, trying C and γ in {10⁻⁶, 10⁻⁵, ..., 10⁶} for the linear and RBF kernels, and compared the highest performance. The training time is measured for training the model with the tuned parameters. We repeated the whole process three times and report the mean values.

Table 2 Experiment results: accuracy (Acc), training time in seconds (Time), and number of support or ranking vectors (#SV or #RV)

                query 1 (|D| = 134)       query 2 (|D| = 128)       query 3 (|D| = 182)
                Acc    Time  #SV or #RV   Acc    Time  #SV or #RV   Acc    Time  #SV or #RV
RVM  linear     .5484  .23   1.4          .6730  .41   3.83         .6611  1.94  1.99
RVM  RBF        .5055  .85   4.3          .6637  .41   2.83         .6723  4.71  1
SVM  linear     .5634  1.83  92           .6723  1.03  101.66       .6588  4.24  156.55
SVM  RBF        .5490  3.05  92           .6762  3.50  102          .6710  55.08 156.66

Table 2 shows the results. The accuracies of the SVM and the RVM are comparable overall; the SVM shows slightly higher accuracy than the RVM for query 1, but for the other queries their accuracy differences are not statistically significant. More importantly, the number of ranking vectors in the RVM is significantly smaller than the number of support vectors in the SVM. For example, for query 3, the RVM with just one ranking vector outperformed the SVM with over 150 support vectors. The training time of the RVM is also significantly shorter than that of SVM-light.

References

1. LETOR: Learning to rank for information retrieval. http://research.microsoft.com/users/LETOR/
2. Baeza-Yates, R., Ribeiro-Neto, B. (eds.): Modern Information Retrieval. ACM Press (1999)
3. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (1995)

4. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proc. Int. Conf. Machine Learning (ICML'04) (2004)
5. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121-167 (1998)
6. Cao, B., Shen, D., Sun, J.T., Yang, Q., Chen, Z.: Feature selection in a kernel space. In: Proc. Int. Conf. Machine Learning (ICML'07) (2007)
7. Cao, Y., Xu, J., Liu, T.Y., Li, H., Huang, Y., Hon, H.W.: Adapting ranking SVM to document retrieval. In: Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'06) (2006)
8. Cho, B., Yu, H., Lee, J., Chee, Y., Kim, I.: Nonlinear support vector machine visualization for risk factor analysis using nomograms and localized radial basis function kernels. IEEE Transactions on Information Technology in Biomedicine (accepted)
9. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press (2000)
10. Cohen, W.W., Schapire, R.E., Singer, Y.: Learning to order things. In: Proc. Advances in Neural Information Processing Systems (NIPS'98) (1998)
11. Fung, G., Mangasarian, O.L.: A feature selection Newton method for support vector machine classification. Computational Optimization and Applications (2004)
12. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research (2003)
13. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. In: Advances in Neural Information Processing Systems (1998)
14. Herbrich, R., Graepel, T., Obermayer, K. (eds.): Large margin rank boundaries for ordinal regression. MIT Press (2000)
15. Friedman, J.H.: Another approach to polychotomous classification. Tech. rep., Stanford University, Department of Statistics (1998)
16. Joachims, T.: Optimizing search engines using clickthrough data. In: Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'02) (2002)
17. Joachims, T.: Training linear SVMs in linear time. In: Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'06) (2006)
18. Mangasarian, O.L.: Generalized support vector machines. MIT Press (2000)
19. Mangasarian, O.L.: Exact 1-norm support vector machines via unconstrained convex differentiable minimization. Journal of Machine Learning Research (2006)
20. Mangasarian, O.L., Wild, E.W.: Feature selection for nonlinear kernel support vector machines. Tech. rep., University of Wisconsin, Madison (1998)
21. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA (1998)
22. Schölkopf, B., Herbrich, R., Smola, A.J., Williamson, R.C.: A generalized representer theorem. In: Proc. COLT (2001)
23. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Tech. rep., NeuroCOLT2 Technical Report NC2-TR-1998-030 (1998)
24. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons (1998)
25. Xu, J., Li, H.: AdaRank: A boosting algorithm for information retrieval. In: Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'07) (2007)
26. Yan, L., Dodier, R., Mozer, M.C., Wolniewicz, R.: Optimizing classifier performance via the Wilcoxon-Mann-Whitney statistic. In: Proc. Int. Conf. Machine Learning (ICML'03) (2003)
27. Yu, H.: SVM selective sampling for ranking with application to data retrieval. In: Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD'05) (2005)
28. Yu, H., Hwang, S.W., Chang, K.C.C.: Enabling soft queries for data retrieval. Information Systems (2007)
29. Yu, H., Kim, Y., Hwang, S.W.: RVM: An efficient method for learning ranking SVM. Tech. rep., Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Korea, http://s.hwanjoyu.org/rvm (2008)