New Approaches to Support Vector Ordinal Regression




Wei Chu (chuwei@gatsby.ucl.ac.uk), Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, UK
S. Sathiya Keerthi (selvarak@yahoo-inc.com), Yahoo! Research Labs, 210 S. DeLacey Avenue, Pasadena, CA-91105, USA

Abstract

In this paper, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The SMO algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on benchmark datasets verify the usefulness of these approaches.

1. Introduction

We consider the supervised learning problem of predicting variables of ordinal scale, a setting that bridges metric regression and classification, and is referred to as ranking learning or ordinal regression. Ordinal regression arises frequently in social science and information retrieval, where human preferences play a major role. The training samples are labelled by a set of ranks, which exhibits an ordering among the different categories. In contrast to metric regression problems, these ranks are of finite types and the metric distances between the ranks are not defined. These ranks are also different from the labels of multiple classes in classification problems due to the existence of the ordering information.

(Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).)

There are several approaches to tackle ordinal regression problems in the domain of machine learning. The naive idea is to transform the ordinal scales into numeric values, and then solve the problem as a standard regression problem. Kramer et al. (2001) investigated the use of a regression tree learner in this way. A problem with this approach is that there might be no principled way of devising an appropriate mapping function, since the true metric distances between the ordinal scales are unknown in most of the tasks. Another idea is to decompose the original ordinal regression problem into a set of binary classification tasks. Frank and Hall (2001) converted an ordinal regression problem into nested binary classification problems that encode the ordering of the original ranks, and then organized the results of these binary classifiers in some ad hoc way for prediction. It is also possible to formulate the original problem as a large augmented binary classification problem. Har-Peled et al. (2002) proposed a constraint classification approach that provides a unified framework for solving ranking and multi-classification problems. Herbrich et al. (2000) applied the principle of Structural Risk Minimization (Vapnik, 1995) to ordinal regression, leading to a new distribution-independent learning algorithm based on a loss function between pairs of ranks. The main difficulty with these two algorithms (Har-Peled et al., 2002; Herbrich et al., 2000) is that the problem size of these formulations is a quadratic function of the training data size. As for sequential learning, Crammer and Singer (2002) proposed a perceptron-based online algorithm for rank prediction, known as the PRank algorithm. Shashua and Levin (2003) generalized the support vector formulation for ordinal regression by finding r−1 thresholds that divide the real line into r consecutive intervals for the r ordered categories. However, there is a problem with their approach: the ordinal inequalities on the thresholds, b_1 ≤ b_2 ≤ ... ≤ b_{r−1}, are not included in their formulation. This omission may result in disordered thresholds at the solution in some unfortunate cases (see section 4.1 for an example).

In this paper, we propose two new approaches for support vector ordinal regression. The first one takes only the adjacent ranks into account in determining the thresholds, exactly as Shashua and Levin (2003) proposed, but we introduce explicit constraints in the problem formulation that enforce the inequalities on the thresholds. The second approach is entirely new; it considers the training samples from all the ranks to determine each threshold. Interestingly, we show that, in this second approach, the ordinal inequality constraints on the thresholds are automatically satisfied at the optimal solution even though there are no explicit constraints on these thresholds. For both approaches the size of the optimization problem is linear in the number of training samples. We show that the popular SMO algorithm (Platt, 1999; Keerthi et al., 2001) for SVMs can be easily adapted for the two approaches. The resulting algorithms scale efficiently; empirical analysis shows that the cost is roughly a quadratic function of the problem size. Using several benchmark datasets, we demonstrate that the generalization capabilities of the two approaches are much better than that of the naive approach of doing standard regression on the ordinal labels.

The paper is organized as follows. In section 2 we present the first approach with explicit inequality constraints on the thresholds, derive the optimality conditions for the dual problem, and adapt the SMO algorithm for the solution. In section 3 we present the second approach with implicit constraints. In section 4 we carry out an empirical study to show the scaling properties of the two algorithms and their generalization performance. We conclude in section 5.

Notations. Throughout this paper we will use x to denote the input vector of the ordinal regression problem and φ(x) to denote the feature vector in a high dimensional reproducing kernel Hilbert space (RKHS) related to x by transformation. All computations will be done using the reproducing kernel function only, which is defined as

K(x, x′) = φ(x) · φ(x′)   (1)

where · denotes the inner product in the RKHS. Without loss of generality, we consider an ordinal regression problem with r ordered categories and denote these categories as consecutive integers Y = {1, 2, ..., r} to keep the known ordering information. In the j-th category, where j ∈ Y, the number of training samples is denoted as n^j, and the i-th training sample is denoted as x_i^j, where x_i^j ∈ R^d. The total number of training samples, Σ_{j=1}^{r} n^j, is denoted as n. b_j, j = 1, ..., r−1, denote the (r−1) thresholds.
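As a concrete illustration of the kernel computation in (1), the following minimal sketch (in Python with NumPy; the authors' own implementation is in ANSI C) evaluates the Gaussian kernel of equation (24) that is used in the experiments of section 4. The function name and vectorized form are our own, added for illustration only.

```python
import numpy as np

def gaussian_kernel(X1, X2, kappa):
    """K(x, x') = exp(-kappa * sum_s (x_s - x'_s)^2); cf. equations (1) and (24).

    X1: (m, d) array, X2: (p, d) array; returns the (m, p) kernel matrix.
    """
    # Squared Euclidean distances between all pairs of rows.
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-kappa * np.maximum(sq_dists, 0.0))
```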

2. Explicit Constraints on Thresholds

As a powerful computational tool for supervised learning, support vector machines (SVMs) map the input vectors into feature vectors in a high dimensional RKHS (Vapnik, 1995; Schölkopf & Smola, 2002), where a linear machine is constructed by minimizing a regularized functional. For binary classification (a special case of ordinal regression with r = 2), SVMs find an optimal direction that maps the feature vectors into function values on the real line, and a single optimized threshold is used to divide the real line into two regions for the two classes respectively. In the setting of ordinal regression, the support vector formulation could attempt to find an optimal mapping direction w and r−1 thresholds, which define r−1 parallel discriminant hyperplanes for the r ranks accordingly.

[Figure 1: samples from ranks y = 1, 2, 3 are mapped by f(x) = w·φ(x) onto the real line, with thresholds b_1, b_2, margins b_j − 1 and b_j + 1, and the slack variables ξ, ξ* marked.]
Figure 1. An illustration of the definition of slack variables ξ and ξ* for the thresholds. The samples from different ranks, represented as circles filled with different patterns, are mapped by w·φ(x) onto the axis of function values. Note that a sample from rank j+2 could be counted twice for errors if it is sandwiched by b_{j+1} − 1 and b_j + 1 where b_{j+1} − 1 < b_j + 1, and that the samples from rank j+3, etc. never give contributions to the threshold b_j.

For each threshold b_j, Shashua and Levin (2003) suggested considering the samples from the two adjacent categories, j and j+1, for empirical errors (see Figure 1 for an illustration). More exactly, each sample in the j-th category should have a function value that is less than the lower margin b_j − 1; otherwise w·φ(x_i^j) − (b_j − 1) is the error (denoted as ξ_i^j). Similarly, each sample from the (j+1)-th category should have a function value that is greater than the upper margin b_j + 1; otherwise (b_j + 1) − w·φ(x_i^{j+1}) is the error (denoted as ξ_i^{*j+1}). Shashua and Levin (2003) generalized the primal problem of SVMs to ordinal regression as follows:

min_{w,b,ξ,ξ*} (1/2) w·w + C Σ_{j=1}^{r−1} ( Σ_{i=1}^{n^j} ξ_i^j + Σ_{i=1}^{n^{j+1}} ξ_i^{*j+1} )   (2)

subject to
w·φ(x_i^j) − b_j ≤ −1 + ξ_i^j, ξ_i^j ≥ 0, for i = 1, ..., n^j;
w·φ(x_i^{j+1}) − b_j ≥ +1 − ξ_i^{*j+1}, ξ_i^{*j+1} ≥ 0, for i = 1, ..., n^{j+1};   (3)

where j runs over 1, ..., r−1 and C > 0.¹

¹ The superscript * in ξ_i^{*j+1} denotes that the error is associated with a sample in the adjacent upper category of the j-th threshold.
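To make the role of the slack variables in (2)–(3) concrete, here is a small sketch that computes the empirical errors ξ_i^j and ξ_i^{*j+1} for one fixed threshold b_j, given the function values w·φ(x) of the samples in the two adjacent categories. This is an illustration of the loss only, under our own naming; it is not the authors' solver, which optimizes w and b jointly via the dual.

```python
import numpy as np

def adjacent_slacks(f_lower, f_upper, b_j):
    """Slack variables for one threshold b_j in the loss of (2)-(3).

    f_lower: function values w.phi(x) of samples from category j
    f_upper: function values w.phi(x) of samples from category j+1
    """
    # xi_i^j = max(0, w.phi(x_i^j) - (b_j - 1)): lower-category samples
    # should stay below the lower margin b_j - 1.
    xi = np.maximum(0.0, f_lower - (b_j - 1.0))
    # xi*_i^{j+1} = max(0, (b_j + 1) - w.phi(x_i^{j+1})): upper-category
    # samples should stay above the upper margin b_j + 1.
    xi_star = np.maximum(0.0, (b_j + 1.0) - f_upper)
    return xi, xi_star
```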

A problem with the above formulation is that the natural ordinal inequalities on the thresholds, i.e., b_1 ≤ b_2 ≤ ... ≤ b_{r−1}, cannot be guaranteed to hold at the solution. To tackle this problem, we explicitly include the following constraints in (3):

b_{j−1} ≤ b_j, for j = 2, ..., r−1.   (4)

2.1. Primal and Dual Problems

By introducing two auxiliary variables b_0 = −∞ and b_r = +∞, the modified primal problem in (2)–(4) can be equivalently written as follows:

min_{w,b,ξ,ξ*} (1/2) w·w + C Σ_{j=1}^{r} Σ_{i=1}^{n^j} (ξ_i^j + ξ_i^{*j})   (5)

subject to
w·φ(x_i^j) − b_j ≤ −1 + ξ_i^j, ξ_i^j ≥ 0, ∀i, j;
w·φ(x_i^j) − b_{j−1} ≥ +1 − ξ_i^{*j}, ξ_i^{*j} ≥ 0, ∀i, j;
b_{j−1} ≤ b_j, ∀j.   (6)

The dual problem can be derived by standard Lagrangian techniques. Let α_i^j ≥ 0, γ_i^j ≥ 0, α_i^{*j} ≥ 0, γ_i^{*j} ≥ 0 and μ_j ≥ 0 be the Lagrangian multipliers for the inequalities in (6). The Lagrangian for the primal problem is:

L_e = (1/2) w·w + C Σ_{j=1}^{r} Σ_{i=1}^{n^j} (ξ_i^j + ξ_i^{*j})
  − Σ_{j=1}^{r} Σ_{i=1}^{n^j} α_i^j (−1 + ξ_i^j − w·φ(x_i^j) + b_j)
  − Σ_{j=1}^{r} Σ_{i=1}^{n^j} α_i^{*j} (−1 + ξ_i^{*j} + w·φ(x_i^j) − b_{j−1})
  − Σ_{j=1}^{r} Σ_{i=1}^{n^j} γ_i^j ξ_i^j − Σ_{j=1}^{r} Σ_{i=1}^{n^j} γ_i^{*j} ξ_i^{*j} − Σ_{j=1}^{r} μ_j (b_j − b_{j−1}).   (7)

The KKT conditions for the primal problem require the following to hold:

∂L_e/∂w = w − Σ_{j=1}^{r} Σ_{i=1}^{n^j} (α_i^{*j} − α_i^j) φ(x_i^j) = 0;   (8)
∂L_e/∂ξ_i^j = C − α_i^j − γ_i^j = 0, ∀i, j;   (9)
∂L_e/∂ξ_i^{*j} = C − α_i^{*j} − γ_i^{*j} = 0, ∀i, j;   (10)
∂L_e/∂b_j = 0, i.e. Σ_{i=1}^{n^j} α_i^j + μ_j = Σ_{i=1}^{n^{j+1}} α_i^{*j+1} + μ_{j+1}, ∀j.   (11)

Note that the dummy variables associated with b_0 and b_r, i.e. μ_1, μ_r, the α_i^{*1}'s and the α_i^r's, are always zero. The conditions (9) and (10) give rise to the constraints 0 ≤ α_i^j ≤ C and 0 ≤ α_i^{*j} ≤ C respectively.

Let us now apply Wolfe duality theory to the primal problem. By introducing the KKT conditions (8)–(10) into the Lagrangian (7) and applying the kernel trick (1), the dual problem becomes a maximization problem involving the Lagrangian multipliers α, α* and μ:

max_{α,α*,μ} Σ_{j,i} (α_i^j + α_i^{*j}) − (1/2) Σ_{j,i} Σ_{j′,i′} (α_i^{*j} − α_i^j)(α_{i′}^{*j′} − α_{i′}^{j′}) K(x_i^j, x_{i′}^{j′})   (12)

subject to
0 ≤ α_i^j ≤ C, ∀i, j;
0 ≤ α_i^{*j+1} ≤ C, ∀i, j;
Σ_{i=1}^{n^j} α_i^j + μ_j = Σ_{i=1}^{n^{j+1}} α_i^{*j+1} + μ_{j+1}, ∀j;
μ_j ≥ 0, ∀j;

where j runs over 1, ..., r−1. Leaving the dummy variables out of account, the size of the optimization problem is 2n − n^1 − n^r (for the α_i^j and α_i^{*j}) plus r − 2 (for the μ_j).

The dual problem (11)–(12) is a convex quadratic programming problem. Once the α_i^j, α_i^{*j} and μ_j are obtained by solving this problem, w is obtained from (8). The determination of the b_j's will be addressed in the next section. The discriminant function value for a new input vector x is

f(x) = w·φ(x) = Σ_{j,i} (α_i^{*j} − α_i^j) K(x_i^j, x).   (13)

The predictive ordinal decision function is given by arg min_j { j : f(x) < b_j }.
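Given a solved dual, equation (13) and the decision rule arg min_j { j : f(x) < b_j } are straightforward to evaluate. The sketch below (reusing gaussian_kernel from the earlier sketch) assumes the multipliers have been flattened into vectors aligned with the training inputs; the names are ours, and producing alpha, alpha_star and b is of course the job of the SMO solver of section 2.3.

```python
import numpy as np

def predict_rank(X_train, alpha, alpha_star, b, X_test, kappa):
    """Evaluate f(x) = sum_i (alpha*_i - alpha_i) K(x_i, x)  (eq. 13)
    and the ordinal decision rule arg min_j { j : f(x) < b_j }.

    alpha, alpha_star: per-training-sample dual coefficients, shape (n,)
    b: the r-1 ordered thresholds, shape (r-1,)
    """
    K = gaussian_kernel(X_train, X_test, kappa)   # (n, m) kernel matrix
    f = (alpha_star - alpha) @ K                  # (m,) function values
    # Append b_r = +inf so that every f(x) falls below some threshold.
    b_ext = np.append(b, np.inf)
    # Smallest j with f(x) < b_j; ranks are reported as 1, ..., r.
    return np.argmax(f[:, None] < b_ext[None, :], axis=1) + 1
```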

New Approaches to Support Vector Ordnal Regresson where =,...,r. The condtons n (4 can be re-grouped nto the followng sx cases: case : α =0 f(x + β case : 0 <α <C f(x +=β case 3 : α = C f(x + β case 4 : α + =0 f(x + β case 5 : 0 <α + <C f(x + =β case 6 : α + = C f(x + β We can classfy any varable nto one of the followng sx sets: I 0a = { {,...,n } :0<α <C} I 0b = { {,...,n+ } :0<α + <C} I = { {,...,n+ } : α + =0} I = { {,...,n } : α =0} I 3 = { {,...,n } : α = C} I 4 = { {,...,n+ } : α + = C} Let us denote I 0 = I 0a I 0b, I up = I 0 I I 3 and I low = I 0 I I 4. We further defne F up(β onthe set Iup as { Fup(β f(x = + f I 0a I 3 f(x + f I 0b I and Flow (β on the set I low as F low(β = { f(x + f I 0a I f(x + f I 0b I 4 Then the condtons can be smplfed as β Fup(β Iup and β Flow (β I low, whch can be compactly wrtten as: b low β b up (6 where b up = mn{fup(β : Iup} and b low = max{flow : I low }. The KKT condtons n (5 ndcate that the condton, β β always holds, and that β = β f µ > 0. To merge the condtons (5 and (6, let us defne B low = max{bk low : k =,...,} and B up = mn{b k up : k =,...,r }, where =,...,r. The overall optmalty condtons can be smply wrtten as B low β Bup where { B+ B low = low f µ + > 0 B low otherwse and { B Bup up f µ = > 0 B up otherwse. Table. The basc framework of the SMO algorthm for support vector ordnal regresson usng explct threshold constrants. SMO start at a vald pont, α, α and µ, that satsfy (, fnd the current B up and B low Loop do. determne the actve threshold J. optmze the par of actve varables and the set µ a 3. compute B up and B low at the new pont whle the optmalty condton (7 has not been satsfed Ext return α, α and b We ntroduce a tolerance parameter τ > 0, usually 0.00, to defne approxmate optmalty condtons. The overall stoppng condton becomes max{b low B up : =,...,r } τ. (7 From the condtons n (4 and (3, t s easy to see the close relatonshp between the b s n the prmal problem and the multplers β s. In partcular, at the optmal soluton, β and b are dentcal. Thus b can be taken to be any value from the nterval, [B low,b up]. We can resolve any non-unqueness by smply takng b = (B low + B up. Note that the KKT condtons n (5, comng from the addtve constrants n (4 we ntroduced n Shashua and Levn s formulaton, enforce B low B low and B up Bup at the soluton, whch guarantee that the thresholds specfed n these feasble regons wll satsfy the nequalty constrants b b ; wthout the constrants n (4, the thresholds mght be dsordered at the soluton!.3. SMO Algorthm In ths secton we adapt the SMO algorthm (Platt, 999; Keerth et al., 00 for the soluton of ( (. The key dea of SMO conssts of startng wth a vald ntal pont and optmzng only one par of varables at a tme whle fxng all the other varables. The suboptmzaton problem of the two actve varables can be solved analytcally. Table presents an outlne of the SMO mplementaton for our optmzaton problem. In order to determne the par of actve varables to optmze, we select the actve threshold frst. The ndex of the actve threshold s defned as J = arg max { : B low B up >τ}. Let us assume that Blow J and BJ up are actually defned by b o low and bu up respectvely, and that the two multplers assocated wth b o low and bu up are α o and α u. The par of multplers (α o,α u s optmzed from the current pont (αo old new pont, (αo new,αu new.,α old u toreachthe It s possble that o u. In ths case, named as cross update, more than one equalty constrant n ( s nvolved n the optmzaton that may update the

3. Implicit Constraints on Thresholds

In this section we present a new approach to support vector ordinal regression. Instead of considering only the empirical errors from the samples of adjacent categories to determine a threshold, we allow the samples in all the categories to contribute errors to each threshold. A very nice property of this approach is that the ordinal inequalities on the thresholds are satisfied automatically at the optimal solution, in spite of the fact that such constraints on the thresholds are not explicitly included in the new formulation.

Figure 2 explains the new definition of the slack variables ξ and ξ*. For a threshold b_j, the function values of all the samples from all the lower categories should be less than the lower margin b_j − 1; if that does not hold, then ξ_{ki}^j = w·φ(x_i^k) − (b_j − 1) is taken as the error associated with the sample x_i^k for b_j, where k ≤ j. Similarly, the function values of all the samples from the upper categories should be greater than the upper margin b_j + 1; otherwise ξ_{ki}^{*j} = (b_j + 1) − w·φ(x_i^k) is the error associated with the sample x_i^k for b_j, where k > j. Here, the subscript ki denotes that the slack variable is associated with the i-th input sample in the k-th category; the superscript j denotes that the slack variable is associated with the lower categories of b_j; and the superscript *j denotes that the slack variable is associated with the upper categories of b_j.

[Figure 2: all samples from ranks y = 1, 2, 3 are mapped by f(x) = w·φ(x) onto the real line, with thresholds b_1, b_2 and margins b_j − 1, b_j + 1 marked, and slack variables drawn for every threshold.]
Figure 2. An illustration of the new definition of slack variables ξ and ξ* that imposes implicit constraints on the thresholds. All the samples are mapped by w·φ(x) onto the axis of function values. Note the term ξ_{3i}^{*1} in this graph.

3.1. Primal Problem

By taking all the errors associated with all r−1 thresholds into account, the primal problem can be defined as follows:

min_{w,b,ξ,ξ*} (1/2) w·w + C Σ_{j=1}^{r−1} ( Σ_{k=1}^{j} Σ_{i=1}^{n^k} ξ_{ki}^j + Σ_{k=j+1}^{r} Σ_{i=1}^{n^k} ξ_{ki}^{*j} )   (18)

subject to
w·φ(x_i^k) − b_j ≤ −1 + ξ_{ki}^j, ξ_{ki}^j ≥ 0, for k = 1, ..., j and i = 1, ..., n^k;
w·φ(x_i^k) − b_j ≥ +1 − ξ_{ki}^{*j}, ξ_{ki}^{*j} ≥ 0, for k = j+1, ..., r and i = 1, ..., n^k;   (19)

where j runs over 1, ..., r−1. Note that there are r−1 inequality constraints for each sample x_i^k (one for each threshold).

To prove the inequalities on the thresholds at the optimal solution, let us consider the situation where w is fixed and only the b_j are optimized. Note that the ξ_{ki}^j and ξ_{ki}^{*j} are automatically determined once the b_j are given. To eliminate these variables, let us define, for 1 ≤ k ≤ r,

I_low^k(b) = { i ∈ {1, ..., n^k} : w·φ(x_i^k) ≥ b − 1 },
I_up^k(b) = { i ∈ {1, ..., n^k} : w·φ(x_i^k) ≤ b + 1 }.

It is easy to see that b_j is optimal iff it minimizes the function

e_j(b) = Σ_{k=1}^{j} Σ_{i ∈ I_low^k(b)} ( w·φ(x_i^k) − b + 1 ) + Σ_{k=j+1}^{r} Σ_{i ∈ I_up^k(b)} ( −w·φ(x_i^k) + b + 1 ).   (20)

Let B_j denote the set of all minimizers of e_j(b). By convexity, B_j is a closed interval. Given two intervals B_1 = [c_1, d_1] and B_2 = [c_2, d_2], we say B_1 ≤ B_2 if c_1 ≤ c_2 and d_1 ≤ d_2.

Lemma 1. B_1 ≤ B_2 ≤ ... ≤ B_{r−1}.

Proof. The right side derivative of e_j with respect to b is

g_j(b) = −Σ_{k=1}^{j} |I_low^k(b)| + Σ_{k=j+1}^{r} |I_up^k(b)|.   (21)

Take any one j and consider B_j = [c_j, d_j] and B_{j+1} = [c_{j+1}, d_{j+1}]. Suppose c_j > c_{j+1}. Define b_j = c_j and b_{j+1} = c_{j+1}. Since b_{j+1} is strictly to the left of the interval B_j that minimizes e_j, we have g_j(b_{j+1}) < 0. Since b_{j+1} is a minimizer of e_{j+1}, we also have g_{j+1}(b_{j+1}) ≥ 0. Thus we have g_{j+1}(b_{j+1}) − g_j(b_{j+1}) > 0; also, by (21) we get 0 < g_{j+1}(b_{j+1}) − g_j(b_{j+1}) = −|I_low^{j+1}(b_{j+1})| − |I_up^{j+1}(b_{j+1})|, which is impossible. In a similar way, d_j > d_{j+1} is also not possible. This proves the lemma.

If the optimal b_j are all unique, then Lemma 1 implies that the b_j satisfy the natural ordinal ordering. Even when one or more of the b_j are non-unique, Lemma 1 says that there exist choices for the b_j that obey the natural ordering.²

² If, in the primal problem, we regularize the b_j also (i.e., include the extra cost term Σ_j b_j²/2), then the b_j are guaranteed to be unique. Lemma 1 still holds in this case.
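Lemma 1 is easy to probe numerically: for fixed function values w·φ(x_i^k), the piecewise-linear convex functions e_j(b) of (20) can be evaluated on a grid and their minimizing intervals compared. The following self-contained check is our own illustration, using a small randomly generated example.

```python
import numpy as np

def e_j(j, b, f_by_cat):
    """The threshold objective e_j(b) of (20); f_by_cat[k-1] holds the
    function values w.phi(x_i^k) of category k."""
    err = 0.0
    for k, f in enumerate(f_by_cat, start=1):
        if k <= j:   # lower categories: penalize values with f >= b - 1
            err += np.sum(np.maximum(0.0, f - b + 1.0))
        else:        # upper categories: penalize values with f <= b + 1
            err += np.sum(np.maximum(0.0, -f + b + 1.0))
    return err

rng = np.random.default_rng(0)
# Three categories of function values, roughly increasing with rank.
f_by_cat = [rng.normal(loc, 1.0, size=20) for loc in (-2.0, 0.0, 2.0)]
grid = np.linspace(-6.0, 6.0, 2001)
for j in (1, 2):
    vals = np.array([e_j(j, b, f_by_cat) for b in grid])
    mins = grid[vals <= vals.min() + 1e-9]
    print(f"argmin interval of e_{j}: [{mins.min():.3f}, {mins.max():.3f}]")
# Lemma 1 predicts the interval for e_1 lies to the left of that for e_2.
```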

The fact that the order preservation comes about automatically is interesting and non-trivial; it differs from the PRank algorithm (Crammer & Singer, 2002), where the order preservation on the thresholds is easily brought in via their update rule. It is also worth noting that Lemma 1 holds even for an extended problem formulation that allows the use of different costs (different C values) for different misclassifications (class k misclassified as class j can have a cost C_{kj}). In applications such as collaborative filtering, such a problem formulation can be very appropriate; for example, an A-rated movie that is misrated as C may need to be penalized much more than a B-rated movie misrated as C. Shashua and Levin's formulation and its extension given in section 2 of this paper do not precisely support such a differential cost structure. This is another good reason in support of the implicit problem formulation of the current section.

3.2. Dual Problem

Let α_{ki}^j ≥ 0, γ_{ki}^j ≥ 0, α_{ki}^{*j} ≥ 0 and γ_{ki}^{*j} ≥ 0 be the Lagrangian multipliers for the inequalities in (19). Using ideas parallel to those in section 2.1, we can show that the dual of (18)–(19) is the following maximization problem that involves only the multipliers α and α*:

max_{α,α*} Σ_{k,i} ( Σ_{j=k}^{r−1} α_{ki}^j + Σ_{j=1}^{k−1} α_{ki}^{*j} )
  − (1/2) Σ_{k,i} Σ_{k′,i′} ( Σ_{j=1}^{k−1} α_{ki}^{*j} − Σ_{j=k}^{r−1} α_{ki}^j )( Σ_{j=1}^{k′−1} α_{k′i′}^{*j} − Σ_{j=k′}^{r−1} α_{k′i′}^j ) K(x_i^k, x_{i′}^{k′})   (22)

subject to

Σ_{k=1}^{j} Σ_{i=1}^{n^k} α_{ki}^j = Σ_{k=j+1}^{r} Σ_{i=1}^{n^k} α_{ki}^{*j}, ∀j;
0 ≤ α_{ki}^j ≤ C, ∀i and k ≤ j; 0 ≤ α_{ki}^{*j} ≤ C, ∀i and k > j.   (23)

The dual problem (22)–(23) is a convex quadratic programming problem. The size of the optimization problem is (r−1)·n, where n = Σ_{k=1}^{r} n^k is the total number of training samples. The discriminant function value for a new input vector x is

f(x) = w·φ(x) = Σ_{k,i} ( Σ_{j=1}^{k−1} α_{ki}^{*j} − Σ_{j=k}^{r−1} α_{ki}^j ) K(x_i^k, x).

The predictive ordinal decision function is given by arg min_j { j : f(x) < b_j }. The ideas for adapting SMO to (22)–(23) are similar to those in section 2.3. The resulting sub-optimization problem is analogous to the case of the standard update in section 2.3, where only one of the equality constraints from (23) is involved. Full details of the derivation of the dual problem as well as the SMO algorithm have been skipped for lack of space. These details are given in our longer technical report (Chu & Keerthi, 2005).

4. Numerical Experiments

We have implemented the two SMO algorithms for the ordinal regression formulations with explicit constraints (EXC) and implicit constraints (IMC),³ along with the algorithm of Shashua and Levin (2003) for comparison purposes. The function caching technique and the double-loop scheme proposed by Keerthi et al. (2001) have been incorporated in the implementation for efficiency. We begin this section with a simple dataset to illustrate the typical behavior of the three algorithms, and then empirically study the scaling properties of our algorithms. Then we compare the generalization performance of our algorithms against standard support vector regression on eight benchmark datasets for ordinal regression. The following Gaussian kernel was used in these experiments:

K(x, x′) = exp( −κ Σ_{ς=1}^{d} (x_ς − x′_ς)² )   (24)

where x_ς denotes the ς-th element of the input vector x. The tolerance parameter τ was set to 0.001 for all the algorithms. We have utilized two evaluation metrics which quantify the accuracy of predicted ordinal scales {ŷ_1, ..., ŷ_t} with respect to true targets {y_1, ..., y_t}: (a) the mean absolute error is the average deviation of the prediction from the true target, i.e. (1/t) Σ_{i=1}^{t} |ŷ_i − y_i|, in which we treat the ordinal scales as consecutive integers; (b) the mean zero-one error is simply the fraction of incorrect predictions.

³ The source code (written in ANSI C) of our implementation of the two algorithms can be found at http://www.gatsby.ucl.ac.uk/~chuwei/svor.htm.
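The two evaluation metrics are one-liners; a minimal sketch for clarity, with ranks represented as consecutive integers as in the paper:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Average deviation |y_hat - y|, treating ordinal scales as integers.
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))

def mean_zero_one_error(y_true, y_pred):
    # Fraction of incorrect rank predictions.
    return np.mean(np.asarray(y_pred) != np.asarray(y_true))
```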

4.1. Grading Dataset

The grading dataset was used in chapter 4 of Johnson and Albert (1999) as an example of the ordinal regression problem.⁴ There are 30 samples of students' scores. The sat-math score and the grade in a prerequisite probability course of these students are used as input features, and their final grades are taken as the targets. In our experiments, the six students with final grade A or E were not used, and the feature associated with the grade in the prerequisite probability course was treated as a continuous variable though it had an ordinal scale. In Figure 3 we present the solutions obtained by the three algorithms using the Gaussian kernel (24) with κ = 0.5 and the regularization factor value of C = 1. In this particular setting, the solution to Shashua and Levin's (2003) formulation has disordered thresholds b_2 < b_1, as shown in Figure 3 (left plot); the formulation with explicit constraints corrects this disorder and yields equal values for the two thresholds, as shown in Figure 3 (middle plot).

⁴ The grading dataset is available at http://www.mathworks.com/support/books/book593.jsp.

[Figure 3: three contour plots of the discriminant function over the (sat-math score, grade in probability course) plane, one per algorithm: (a) Shashua and Levin's formulation, with disordered thresholds; (b) with explicit constraints, with equal thresholds; (c) with implicit constraints, with ordered thresholds.]
Figure 3. The training results of the three algorithms using a Gaussian kernel on the grading dataset. The discriminant function values are presented as contour graphs indexed by the two thresholds. The circles denote the students with grade D, the dots denote grade C, and the squares denote grade B.

4.2. Scaling

In this experiment, we empirically studied how the two SMO algorithms scale with respect to training data size and the number of ordinal scales in the target. The California Housing dataset was used in the scaling experiments.⁵ Twenty-eight training datasets with sizes ranging from 100 to 5,000 were generated by random selection from the original dataset. The continuous target variable of the California Housing data was discretized to ordinal scale by using 5 or 10 equal-frequency bins. Standard support vector regression (SVR) was used as a baseline, in which the ordinal targets were treated as continuous values and ε = 0.1. These datasets were trained by the two algorithms using a Gaussian kernel with κ = 1 and a regularization factor value of C = 100. Figure 4 gives plots of the computational costs of the three algorithms as functions of the problem size, for the two cases of 5 and 10 target bins. Our algorithms scale well, with scaling exponents between 2.13 and 2.33, while the scaling exponent of SVR is about 2.40 in this case. This near-quadratic property in scaling comes from the sparseness property of SVMs, i.e., non-support vectors affect the computational cost only mildly. The EXC and IMC algorithms cost more than the SVR approach due to the larger problem size. For large sizes, the cost of EXC is only about 2 times that of SVR. As expected, we also noticed that the computational cost of IMC depends on r, the number of ordinal scales in the target: the cost for 10 ranks is observed to be roughly 5 times that for 5 ranks, whereas the cost of EXC is nearly the same for the two cases. These observations are consistent with the sizes of the optimization problems. The problem size of IMC is (r−1)·n, which is heavily influenced by r, while the problem size of EXC is about 2n + r, which largely depends on n only, since we usually have n ≫ r. This factor of efficiency can be a key advantage for the EXC formulation.

⁵ The California Housing dataset can be found at http://lib.stat.cmu.edu/datasets/.

[Figure 4: two log-log plots of CPU time in seconds versus training data size (100 to 5,000), for 5 ordinal scales (left; estimated slopes: implicit constraints 2.13, explicit constraints 2.18, support vector regression 2.43) and 10 ordinal scales (right; estimated slopes: 2.13, 2.33 and 2.39 respectively).]
Figure 4. Plots of CPU time versus training data size on a log-log scale, indexed by the estimated slopes respectively. We used the Gaussian kernel with κ = 1 and the regularization factor value of C = 100 in the experiment.

4.3. Benchmark Datasets

Next, we compared the generalization performance of the two approaches against the naive approach of using standard support vector regression (SVR) and the method (SLA) of Shashua and Levin (2003). We collected eight benchmark datasets that were used for metric regression problems.⁶ For each dataset, the target values were discretized into ten ordinal quantities using equal-frequency binning. We randomly partitioned each dataset into training/test splits as specified in Table 2. The partitioning was repeated 20 times independently. The input vectors were normalized to zero mean and unit variance, coordinate-wise. The Gaussian kernel (24) was used for all the algorithms. 5-fold cross validation was used to determine the optimal values of the model parameters (the Gaussian kernel parameter κ and the regularization factor C) involved in the problem formulations, and the test error was obtained using the optimal model parameters for each formulation. The initial search was done on a 7 × 7 coarse grid linearly spaced in the region {(log10 C, log10 κ) : −3 ≤ log10 C ≤ 3, −3 ≤ log10 κ ≤ 3}, followed by a fine search on a 9 × 9 uniform grid linearly spaced by 0.2 in the (log10 C, log10 κ) space. The ordinal targets were treated as continuous values in standard SVR, and the predictions for test cases were rounded to the nearest ordinal scale. The insensitive zone parameter ε of SVR was fixed at 0.1. The test results of the four algorithms are recorded in Table 2.

⁶ These regression datasets are available at http://www.liacc.up.pt/~ltorgo/regression/datasets.html.
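For reproducibility, equal-frequency discretization of a continuous target (used both in the scaling study above and for the benchmarks in Table 2) can be sketched as follows; the quantile conventions at the bin edges may differ slightly from the authors' original preprocessing.

```python
import numpy as np

def equal_frequency_bins(y, n_bins=10):
    """Discretize a continuous target into ordinal ranks 1..n_bins
    so that each rank receives (roughly) the same number of samples."""
    y = np.asarray(y, dtype=float)
    # Interior cut points at the 1/n_bins, 2/n_bins, ... quantiles.
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    # searchsorted assigns rank k when edges[k-2] < y <= edges[k-1].
    return np.searchsorted(edges, y, side="left") + 1
```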

New Approaches to Support Vector Ordnal Regresson Table. Test results of the four algorthms usng a Gaussan kernel. The targets of these benchmark datasets were dscretzed by 0 equal-frequency bns. The results are the averages over 0 trals, along wth the standard devaton. d denotes the nput dmenson and tranng/test denotes the partton sze. We use bold face to ndcate the lowest average value among the results of the four algorthms. The symbols are used to ndcate the cases sgnfcantly worse than the wnnng entry; A p-value threshold of 0.0 n Wlcoxon rank sum test was used to decde ths. Partton Mean zero-one error Mean absolute error Dataset d tranng/test SVR SLA EXC IMC SVR SLA EXC IMC Pyrmdnes 7 50/4 0.777±0.068 0.756±0.073 0.75±0.063 0.79±0.066.404±0.84.400±0.55.33±0.93.94±0.04 Machnecpu 6 50/59 0.693±0.056 0.643±0.057 0.66±0.056 0.655±0.045.048±0.4.00±0. 0.986±0.7 0.990±0.5 Boston 3 300/06 0.589±0.05 0.56±0.03 0.569±0.05 0.56±0.06 0.785±0.05 0.765±0.057 0.773±0.049 0.747±0.049 Abalone 8 000/377 0.758±0.07 0.739±0.008 0.736±0.0 0.73±0.007.407±0.0.389±0.07.39±0.0.36±0.03 Bank 3 3000/59 0.786±0.004 0.759±0.005 0.744±0.005 0.75±0.005.47±0.00.44±0.0.5±0.07.393±0.0 Computer 4000/49 0.494±0.006 0.46±0.006 0.46±0.005 0.473±0.005 0.63±0.0 0.597±0.00 0.60±0.009 0.596±0.008 Calforna 8 5000/5640 0.677±0.003 0.640±0.003 0.640±0.003 0.639±0.003.070±0.008.068±0.006.068±0.005.008±0.005 Census 6 6000/6784 0.735±0.004 0.699±0.00 0.699±0.00 0.705±0.00.83±0.009.7±0.007.70±0.007.05±0.007 of the three ordnal regresson algorthms are better than that of the approach of SVR. The performance of Shashua and Levn s method s smlar to our EXC approach, as expected, snce the two formulatons are pretty much the same. Our ordnal algorthms are comparable on the mean zero-one error, but the results also show the IMC algorthm yelds much more stable results on mean absolute error than the EXC algorthm. 7 From the vew of the formulatons, EXC only consders the extremely worst samples between successve ranks, whereas IMC takes all the samples nto account. Thus the outlers may affect the results of EXC sgnfcantly, whle the results of IMC are relatvely more stable n both valdaton and test. 5. Concluson In ths paper we proposed two new approaches to support vector ordnal regresson that determne r parallel dscrmnant hyperplanes for the r ranks by usng r thresholds. The ordnal nequalty constrants on the thresholds are mposed explctly n the frst approach and mplctly n the second one. The problem sze of the two approaches s lnear n the number of tranng samples. We also desgned SMO algorthms that scale only about quadratcally wth the problem sze. The results of numercal experments verfed that the generalzaton capabltes of these approaches are much better than the nave approach of applyng standard regresson. Acknowledgments A part of the work was carred out at IPAM of UCLA. WC was supported by the Natonal Insttutes of Health and ts Natonal Insttute of General Medcal Scences dvson 7 As ponted out by a revewer, ξ + ξ + n ( of EXC s an upper bound on the zero-one error of the -th example, whle, n (8 of IMC, k= ξ k + r k=+ ξ k s an upper bound on the absolute error. Note that, n all the examples we use consecutve ntegers to represent the ordnal scales. under Grant Number P0 GM6308. References Chu, W., & Keerth, S. S. (005. New approaches to support vector ordnal regresson (Techncal Report. Yahoo! Research Labs. Crammer, K., & Snger, Y. (00. Prankng wth rankng. Advances n Neural Informaton Processng Systems 4 (pp. 64 647. Cambrdge, MA: MIT Press. 
Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. Proceedings of the European Conference on Machine Learning (pp. 145–165).

Har-Peled, S., Roth, D., & Zimak, D. (2002). Constraint classification: A new approach to multiclass classification and ranking. Advances in Neural Information Processing Systems 15.

Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers (pp. 115–132). MIT Press.

Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling (Statistics for Social Science and Public Policy). Springer-Verlag.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13, 637–649.

Kramer, S., Widmer, G., Pfahringer, B., & DeGroeve, M. (2001). Prediction of ordinal classes using regression trees. Fundamenta Informaticae, 47, 1–13.

Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods: Support Vector Learning (pp. 185–208). MIT Press.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. MIT Press.

Shashua, A., & Levin, A. (2003). Ranking with large margin principle: Two approaches. Advances in Neural Information Processing Systems 15 (pp. 937–944).

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.