Multi-class kernel logistic regression: a fixed-size implementation




Peter Karsmakers 1,2, Kristiaan Pelckmans 2, Johan A.K. Suykens 2

1 K.H. Kempen (Associatie K.U.Leuven), IIBT, Kleinhoefstraat 4, B-2440 Geel, Belgium; 2 K.U.Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium (email: name.surname@esat.kuleuven.be).

Abstract: This research studies a practical iterative algorithm for multi-class kernel logistic regression (KLR). Starting from the negative penalized log likelihood criterion we show that the optimization problem in each iteration can be solved by a weighted version of Least Squares Support Vector Machines (LS-SVMs). In this derivation it turns out that the global regularization term is reflected as a usual regularization in each separate step. In the LS-SVM framework, fixed-size LS-SVM is known to perform well on large data sets. We therefore implement this model to solve large scale multi-class KLR problems with estimation in the primal space. To reduce the size of the Hessian, an alternating descent version of Newton's method is used, which has the extra advantage that it can easily be used in a distributed computing environment. It is investigated how a multi-class kernel logistic regression model compares to a one-versus-all coding scheme.

I. INTRODUCTION

Logistic regression (LR) and kernel logistic regression (KLR) have already proven their value in the statistical and machine learning community. As opposed to an empirical risk minimization approach such as employed by Support Vector Machines (SVMs), LR and KLR yield probabilistic outcomes based on a maximum likelihood argument. This framework provides a natural extension to multi-class classification tasks, which must be contrasted to the commonly used coding approach (see e.g. [3] or [1]).

In this paper we use the LS-SVM framework to solve the KLR problem. In our derivation we see that the minimization of the negative penalized log likelihood criterion is equivalent to solving in each iteration a weighted version of least squares support vector machines (wLS-SVMs) [1], [2]. In this derivation it turns out that the global regularization term is reflected as usual in each step. In [12] a similar iterative weighting of wLS-SVMs, with different weighting factors, is reported to converge to an SVM solution. Unlike SVMs, KLR is by its nature not sparse and needs all training samples in its final model. Different adaptations of the original algorithm were proposed to obtain sparseness, such as in [3], [4], [5] and [6]. The second of these uses a sequential minimal optimization (SMO) approach, and in the last case the binary KLR problem is reformulated into a geometric programming system which can be efficiently solved by an interior-point algorithm.

In the LS-SVM framework, fixed-size LS-SVM has shown its value on large data sets. It approximates the feature map using a spectral decomposition, which leads to a sparse representation of the model when estimating in the primal space. We therefore use this technique as a practical implementation of KLR with estimation in the primal space. To reduce the size of the Hessian, an alternating descent version of Newton's method is used, which has the extra advantage that it can easily be used in a distributed computing environment. The proposed algorithm is compared to existing algorithms using small size to large scale benchmark data sets.

The paper is organized as follows. In Section II we give an introduction to logistic regression. Section III describes the extension to kernel logistic regression. A fixed-size implementation is given in Section IV. Section V reports numerical results on several experiments, and finally we conclude in Section VI.

II. LOGISTIC REGRESSION

A. Multi-class logistic regression

After introducing some notation, we recall the principles of multi-class logistic regression. Suppose we have a multi-class problem with C classes (C >= 2) and a training set {(x_i, y_i)}_{i=1}^N \subset R^d x {1, 2, ..., C} with N samples, where the input samples x_i are i.i.d. from an unknown probability distribution over the random vectors (X, Y). We define the first element of each x_i to be 1, so that we can incorporate the intercept term in the
parameter vector. The goal is to find a classification rule from the training data, such that when given a new input x*, we can assign a class label to it. In multi-class penalized logistic regression the conditional class probabilities are estimated via logit stochastic models

Pr(Y = c | X = x; w) = \frac{\exp(\beta_c^T x)}{1 + \sum_{j=1}^{C-1} \exp(\beta_j^T x)}, \quad c = 1, \dots, C-1,
Pr(Y = C | X = x; w) = \frac{1}{1 + \sum_{j=1}^{C-1} \exp(\beta_j^T x)},     (1)

where w = [\beta_1; \beta_2; \dots; \beta_{C-1}], w \in R^{(C-1)d}, is the collection of the parameter vectors of the m = C-1 linear models. The class membership of a new point x* can be given by the classification rule

\arg\max_{c \in \{1, 2, \dots, C\}} Pr(Y = c | X = x*; w).     (2)

The common method to infer the parameters of the different models is via the use of a penalized negative log likelihood (PNLL) criterion

\min_{\beta_1, \beta_2, \dots, \beta_m} l(\beta_1, \beta_2, \dots, \beta_m) = -\sum_{i=1}^{N} \ln Pr(Y = y_i | X = x_i; w) + \frac{\nu}{2} \sum_{c=1}^{m} \beta_c^T \beta_c.     (3)
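To make the model above concrete, the following NumPy sketch evaluates the conditional class probabilities (1) with class C as the reference class, applies the classification rule (2), and computes the penalized criterion (3). It is only an illustration of these formulas under assumed array shapes and with `nu` standing for the regularization constant; it is not the authors' implementation.

```python
# Minimal sketch of (1)-(3), assuming W has shape (C-1, d) with one row per
# linear model, X has shape (N, d) with a leading column of ones (intercept),
# and y is an integer array with labels in {1, ..., C}. Illustrative only.
import numpy as np

def class_probabilities(W, X):
    """Eq. (1): Pr(Y = c | X = x_i; w), with class C as the reference class."""
    scores = np.hstack([X @ W.T, np.zeros((X.shape[0], 1))])  # zero score for class C
    scores -= scores.max(axis=1, keepdims=True)               # numerical stability
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)             # (N, C), rows sum to 1

def classify(W, X):
    """Eq. (2): assign the label with the largest estimated probability."""
    return np.argmax(class_probabilities(W, X), axis=1) + 1   # labels 1..C

def penalized_nll(W, X, y, nu):
    """Eq. (3): negative log likelihood plus ridge penalty (nu/2) * sum_c ||beta_c||^2."""
    P = class_probabilities(W, X)
    nll = -np.sum(np.log(P[np.arange(len(y)), y - 1]))
    return nll + 0.5 * nu * np.sum(W ** 2)
```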

We derive the objective function for penalized logistic regression by combining (1) with (3), which gives

l_{LR}(\beta_1, \beta_2, \dots, \beta_m) = \sum_{i \in D_1} \left( -\beta_1^T x_i + \ln(1 + e^{\beta_1^T x_i} + e^{\beta_2^T x_i} + \dots + e^{\beta_m^T x_i}) \right)
  + \sum_{i \in D_2} \left( -\beta_2^T x_i + \ln(1 + e^{\beta_1^T x_i} + e^{\beta_2^T x_i} + \dots + e^{\beta_m^T x_i}) \right) + \dots
  + \sum_{i \in D_C} \ln(1 + e^{\beta_1^T x_i} + e^{\beta_2^T x_i} + \dots + e^{\beta_m^T x_i}) + \frac{\nu}{2} \sum_{c=1}^{m} \beta_c^T \beta_c,     (4)

where D = {(x_i, y_i)}_{i=1}^N, D = D_1 \cup D_2 \cup \dots \cup D_C, D_i \cap D_j = \emptyset for all i \neq j, and y_i = c for all x_i \in D_c. In the sequel we use the shorthand notation

p_{c,i} = Pr(Y = c | X = x_i; \Theta),     (5)

where \Theta denotes a parameter vector which will be clear from the context. This PNLL criterion for penalized logistic regression is known to possess a number of useful properties, such as the fact that it is convex in the parameters w, smooth, and has asymptotic optimality properties.

B. Logistic regression algorithm: iteratively re-weighted least squares

Until now we have defined a model and an objective function which has to be optimized to fit the parameters to the observed data. Most often this optimization is performed by a Newton based strategy where the solution can be found by iterating

w^{(k)} = w^{(k-1)} + s^{(k)}     (6)

over k until convergence. We define w^{(k)} as the vector of all parameters in the k-th iteration. In each iteration the step s^{(k)} = -H^{(k)^{-1}} g^{(k)} can be computed, where the gradient and the (i,j)-th element of the Hessian are respectively defined as g_i^{(k)} = \partial l_{LR} / \partial w_i^{(k)} and H_{ij}^{(k)} = \partial^2 l_{LR} / \partial w_i^{(k)} \partial w_j^{(k)}. The gradient and Hessian can be formulated in matrix notation, which gives

g^{(k)} = [ X^T (u_1^{(k)} - v_1) + \nu \beta_1^{(k)}; \dots; X^T (u_m^{(k)} - v_m) + \nu \beta_m^{(k)} ],     (7)

H^{(k)} =
[ X^T T_{1,1}^{(k)} X + \nu I    X^T T_{1,2}^{(k)} X    \dots    X^T T_{1,m}^{(k)} X
  \vdots                                                          \vdots
  X^T T_{m,1}^{(k)} X    X^T T_{m,2}^{(k)} X    \dots    X^T T_{m,m}^{(k)} X + \nu I ],

where X \in R^{N x d} is the input matrix containing the values x_i for i = 1, ..., N. Next we define the indicator function I(y_i = j), which is equal to 1 if y_i is equal to j and 0 otherwise. Some other definitions are u_c^{(k)} = [p_{c,1}^{(k)}, \dots, p_{c,N}^{(k)}]^T, v_c = [I(y_1 = c), \dots, I(y_N = c)]^T, and T_{a,b}^{(k)} = diag([t_1^{a,b}, \dots, t_N^{a,b}]) with t_i^{a,b} = p_{a,i}^{(k)} (1 - p_{a,i}^{(k)}) if a is equal to b, and t_i^{a,b} = -p_{a,i}^{(k)} p_{b,i}^{(k)} otherwise.

The following matrix notation is convenient to reformulate the Newton sequence as an iteratively regularized re-weighted least squares (IRRLS) problem, which will be explained shortly. We define A^T \in R^{md x mN} as

A^T =
[ x_1  0    \dots  0      x_2  0    \dots  0      \dots   x_N  0    \dots  0
  0    x_1  \dots  0      0    x_2  \dots  0      \dots   0    x_N  \dots  0
  \vdots
  0    0    \dots  x_1    0    0    \dots  x_2    \dots   0    0    \dots  x_N ],     (8)

where we define a_i \in R^{dm} as a row of A. Next we define the following vector notations:

r_i = [I(y_i = 1); \dots; I(y_i = m)], \quad r = [r_1; \dots; r_N],
P^{(k)} = [p_{1,1}^{(k)}; \dots; p_{m,1}^{(k)}; \dots; p_{1,N}^{(k)}; \dots; p_{m,N}^{(k)}] \in R^{mN}.     (9)

The i-th block of a block diagonal weight matrix W^{(k)} can be written as

W_i^{(k)} =
[ t_i^{1,1}  t_i^{1,2}  \dots  t_i^{1,m}
  t_i^{2,1}  t_i^{2,2}  \dots  t_i^{2,m}
  \vdots
  t_i^{m,1}  t_i^{m,2}  \dots  t_i^{m,m} ].     (10)

This results in the block diagonal weight matrix

W^{(k)} = blockdiag(W_1^{(k)}, \dots, W_N^{(k)}).     (11)

Now we can reformulate the resulting gradient in iteration k as

g^{(k)} = A^T (P^{(k)} - r) + \nu w^{(k-1)}.     (12)

The k-th Hessian is given by

H^{(k)} = A^T W^{(k)} A + \nu I.     (13)

With the closed form expressions of the gradient and Hessian we can set up the second order approximation of the objective function used in Newton's method and use this to reformulate the optimization problem into a weighted least squares equivalent. It turns out that the global regularization term is reflected in each step as a usual regularization term, resulting in a robust algorithm when \nu is chosen appropriately. The following lemma summarizes these results.

Lemma 1 (IRRLS): Logistic regression can be expressed as an iteratively regularized re-weighted least squares method. The weighted regularized least squares minimization problem is defined as

\min_{s^{(k)}} \frac{1}{2} \| A s^{(k)} - z^{(k)} \|_{W^{(k)}}^2 + \frac{\nu}{2} (s^{(k)} + w^{(k-1)})^T (s^{(k)} + w^{(k-1)}),

where z^{(k)} = (W^{(k)})^{-1} (r - P^{(k)}) and r, P^{(k)}, A, W^{(k)} are respectively defined as in (9), (8) and (11).
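As an illustration of the Newton iteration (6)-(13), the sketch below forms the gradient and the block Hessian directly from the quantities u_c, v_c and T_{a,b} defined above and solves for the step s^{(k)}. It is a didactic translation of these formulas (dense Hessian, labels assumed in {1, ..., C}, `nu` as the regularization constant), not the paper's code.

```python
# One Newton step w <- w + s with s = -H^{-1} g, built from the blocks
# X^T T_{a,b} X of (7)-(13); W stacks the m = C-1 parameter vectors as rows.
import numpy as np

def newton_step(W, X, y, nu):
    N, d = X.shape
    m = W.shape[0]
    # class probabilities p_{c,i} for the m non-reference classes, cf. (1) and (5)
    scores = np.hstack([X @ W.T, np.zeros((N, 1))])
    scores -= scores.max(axis=1, keepdims=True)
    expo = np.exp(scores)
    P = (expo / expo.sum(axis=1, keepdims=True))[:, :m]       # (N, m)
    V = np.zeros((N, m))                                      # v_c: indicators I(y_i = c)
    rows = np.where(y <= m)[0]                                # reference class has no column
    V[rows, y[rows] - 1] = 1.0
    G = X.T @ (P - V) + nu * W.T                              # gradient blocks, cf. (7)/(12)
    g = G.T.ravel()                                           # stacked as [g_1; ...; g_m]
    H = np.zeros((m * d, m * d))
    for a in range(m):                                        # Hessian blocks, cf. (13)
        for b in range(m):
            t = P[:, a] * ((a == b) - P[:, b])                # t_i^{a,b} as defined above
            H[a*d:(a+1)*d, b*d:(b+1)*d] = X.T @ (t[:, None] * X)
        H[a*d:(a+1)*d, a*d:(a+1)*d] += nu * np.eye(d)
    s = np.linalg.solve(H, -g)                                # Newton step s^{(k)}
    return W + s.reshape(m, d)
```

Iterating this step until the update becomes small gives the same fixed point as the IRRLS reformulation of Lemma 1, since (15) below has the same optimum.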

Proof: Newton's method computes in each iteration k the optimal step s^{(k),opt} using the Taylor expansion of the objective function l_{LR}. This results in the following local objective function

s^{(k),opt} = \arg\min_{s^{(k)}} \; l_{LR}(w^{(k-1)}) + (A^T (P^{(k)} - r) + \nu w^{(k-1)})^T s^{(k)} + \frac{1}{2} s^{(k)T} (A^T W^{(k)} A + \nu I) s^{(k)}.     (14)

By rearranging terms one can show that (14) can be expressed as an iteratively regularized re-weighted least squares problem (IRRLS), which can be written as

\min_{s^{(k)}} \frac{1}{2} \| A s^{(k)} - z^{(k)} \|_{W^{(k)}}^2 + \frac{\nu}{2} (s^{(k)} + w^{(k-1)})^T (s^{(k)} + w^{(k-1)}),     (15)

where

z^{(k)} = (W^{(k)})^{-1} (r - P^{(k)}).     (16)

This classical result is described in e.g. [3].

III. KERNEL LOGISTIC REGRESSION

A. Multi-class kernel logistic regression

In this section the derivation of the kernel version of multi-class logistic regression is given. This result is based on an optimization argument as opposed to the use of an appropriate Representer Theorem [7]. We show that both steps of the IRRLS algorithm can easily be reformulated in terms of a scheme of iteratively re-weighted LS-SVMs (irLS-SVM). Note that in [3] the relation of KLR to Support Vector Machines (SVM) is stated.

The problem statement in Lemma 1 can be advanced with a nonlinear extension to kernel machines where the inputs x_i are mapped to a high dimensional space. Define \Phi \in R^{mN x m d_\varphi} as A in (8) where x_i is replaced by \varphi(x_i), and where \varphi: R^d \to R^{d_\varphi} denotes the feature map induced by a positive definite kernel. With the application of Mercer's theorem for the kernel matrix \Omega, defined as \Omega_{ij} = K(a_i, a_j) = \Phi_i^T \Phi_j, i, j = 1, ..., mN, it is not required to compute the nonlinear mapping \varphi(\cdot) explicitly, as this is done implicitly through the use of positive definite kernel functions K. For K there are usually the following choices: K(a_i, a_j) = a_i^T a_j (linear kernel); K(a_i, a_j) = (a_i^T a_j + h)^b (polynomial of degree b, with h a tuning parameter); K(a_i, a_j) = \exp(-\|a_i - a_j\|_2^2 / \sigma^2) (radial basis function, RBF), where \sigma is a tuning parameter. In the kernel version of LR the m models are defined as

Pr(Y = c | X = x; w) = \frac{\exp(\beta_c^T \varphi(x))}{1 + \sum_{j=1}^{m} \exp(\beta_j^T \varphi(x))}, \quad c = 1, \dots, m,
Pr(Y = C | X = x; w) = \frac{1}{1 + \sum_{j=1}^{m} \exp(\beta_j^T \varphi(x))}.     (17)

B. Kernel logistic regression algorithm: iteratively re-weighted least squares support vector machine

Starting from Lemma 1 we include a feature map and introduce the error variables e_i; this results in

\min_{s^{(k)}, e^{(k)}} \frac{1}{2} e^{(k)T} W^{(k)} e^{(k)} + \frac{\nu}{2} (s^{(k)} + w^{(k-1)})^T (s^{(k)} + w^{(k-1)}) \quad \text{such that} \quad z^{(k)} = \Phi s^{(k)} + e^{(k)},     (18)

which in the context of LS-SVMs is called the primal problem. In its dual formulation the solution to this optimization problem can be found by solving a linear system.

Lemma 2 (irLS-SVM): The solution to the kernel logistic regression problem can be found by iteratively solving the linear system

\left( \frac{1}{\nu} \Omega + W^{(k)^{-1}} \right) \alpha^{(k)} = z^{(k)} + \frac{1}{\nu} \Omega \alpha^{(k-1)},     (19)

where z^{(k)} is defined as in (16). The probabilities of a new point x* given by the m different models can be predicted using (17), where \beta_c^T \varphi(x*) = \frac{1}{\nu} \sum_{i=1}^{N} \alpha_{c,i} K(x*, x_i).

Proof: The Lagrangian of the constrained problem as stated in (18) becomes

L(s^{(k)}, e^{(k)}; \alpha^{(k)}) = \frac{1}{2} e^{(k)T} W^{(k)} e^{(k)} + \frac{\nu}{2} (s^{(k)} + w^{(k-1)})^T (s^{(k)} + w^{(k-1)}) - \alpha^{(k)T} (\Phi s^{(k)} + e^{(k)} - z^{(k)}),

with Lagrange multipliers \alpha^{(k)} \in R^{Nm}. The first order conditions for optimality are

\partial L / \partial s^{(k)} = 0 \Rightarrow s^{(k)} = \frac{1}{\nu} \Phi^T \alpha^{(k)} - w^{(k-1)},
\partial L / \partial e^{(k)} = 0 \Rightarrow \alpha^{(k)} = W^{(k)} e^{(k)},
\partial L / \partial \alpha^{(k)} = 0 \Rightarrow \Phi s^{(k)} + e^{(k)} = z^{(k)}.     (20)

This results in the following dual solution:

\left( \frac{1}{\nu} \Omega + W^{(k)^{-1}} \right) \alpha^{(k)} = z^{(k)} + \frac{1}{\nu} \Omega \alpha^{(k-1)}.     (21)

Remark that it can easily be shown that the block diagonal weight matrix W^{(k)} is positive definite when the probability of the reference class p_{C,i} > 0, i = 1, ..., N. The solution w^{(L)} can be expressed in terms of the \alpha^{(L)} computed in the last iteration. This can be seen when combining the formula for s^{(k)} in (20) with (6), which gives

w^{(L)} = \frac{1}{\nu} \Phi^T \alpha^{(L)}.     (22)

The linear system in (21) can be solved in each iteration by substituting w^{(k-1)} with \frac{1}{\nu} \Phi^T \alpha^{(k-1)}. Hence, Pr(Y = y* | X = x*; w) can be predicted by using (17), where \beta_c^T \varphi(x*) = \frac{1}{\nu} \sum_{i=1}^{N} \alpha_{c,i} K(x*, x_i).
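Lemma 2 reduces each KLR iteration to the linear system (19). The sketch below shows the corresponding linear algebra for an RBF kernel; the weight matrix W^{(k)} and target z^{(k)} are assumed to be assembled elsewhere per (9)-(11) and (16), so this is only a hedged illustration of the update, not a complete solver.

```python
# RBF kernel and one irLS-SVM update of Lemma 2, eq. (19)/(21):
#   (Omega/nu + W_k^{-1}) alpha_k = z_k + Omega @ alpha_prev / nu.
import numpy as np

def rbf_kernel(A, B, sigma):
    """K(a_i, a_j) = exp(-||a_i - a_j||_2^2 / sigma^2) for all pairs of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma ** 2)

def irls_svm_update(Omega, W_k, z_k, alpha_prev, nu):
    """Solve the dual system (19) for the new Lagrange multipliers alpha_k."""
    lhs = Omega / nu + np.linalg.inv(W_k)   # W_k positive definite when p_{C,i} > 0 (see remark above)
    rhs = z_k + Omega @ alpha_prev / nu
    return np.linalg.solve(lhs, rhs)
```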

IV. KERNEL LOGISTIC REGRESSION: A FIXED-SIZE IMPLEMENTATION

A. Nyström approximation

Suppose one takes a finite dimensional feature map (e.g. a linear kernel). Then one can equally well solve the primal as the dual problem. In fact, solving the primal problem is more advantageous for larger data sets because of the dimension of the unknowns, w \in R^{md} compared to \alpha \in R^{mN}. In order to work in the primal space using a kernel function other than the linear one, it is required to compute an explicit approximation of the nonlinear mapping \varphi. This leads to a sparse representation of the model when estimating in primal space. Explicit expressions for \varphi can be obtained by means of an eigenvalue decomposition of the kernel matrix \Omega with entries K(a_i, a_j). Given the integral equation

\int K(a, a_j) \phi_i(a) p(a) \, da = \lambda_i \phi_i(a_j),

with solutions \lambda_i and \phi_i for a variable a with probability density p(a), we can write

\varphi = [\sqrt{\lambda_1} \phi_1, \sqrt{\lambda_2} \phi_2, \dots, \sqrt{\lambda_{d_\varphi}} \phi_{d_\varphi}].     (23)

Given the data set, it is possible to approximate the integral by a sample average. This leads to the eigenvalue problem (Nyström approximation) [9]

\frac{1}{mN} \sum_{l=1}^{mN} K(a_l, a_j) u_i(a_l) = \lambda_i^{(s)} u_i(a_j),     (24)

where the eigenvalues \lambda_i and eigenfunctions \phi_i of the continuous problem can be approximated by the sample eigenvalues \lambda_i^{(s)} and the eigenvectors u_i \in R^{Nm} as

\hat{\lambda}_i = \frac{1}{Nm} \lambda_i^{(s)}, \quad \hat{\phi}_i = \sqrt{Nm} \, u_i.     (25)

Based on this approximation, it is possible to compute the eigendecomposition of the kernel matrix \Omega and use its eigenvalues and eigenvectors to compute the i-th required component of \hat{\varphi}(a) simply by applying (23) if a is a training point, or for any new point a* by means of

\hat{\phi}_i(a*) = \frac{\sqrt{Nm}}{\lambda_i^{(s)}} \sum_{j=1}^{Nm} u_{ji} K(a_j, a*).     (26)

B. Sparseness and large scale problems

Until now the entire training set, of size Nm, is used. Therefore the approximation of \varphi will yield at most Nm components, each of which can be computed by (25) for all a_i, where a_i is a row of A. However, for a large scale problem it has been motivated [1] to use a subsample of M << Nm data points to compute \hat{\varphi}. In this case, up to M components will be computed. External criteria such as entropy maximization can be applied for an optimal selection of the subsample: given a fixed size M, the aim is to select the support vectors that maximize the quadratic Renyi entropy [10]

H_R = -\ln \int p(a)^2 \, da,     (27)

which can be approximated by using \int \hat{p}(a)^2 \, da = \frac{1}{M^2} 1_M^T \Omega 1_M. The use of this active selection procedure can be important for large scale problems, as it is related to the underlying density distribution of the sample. In this sense, the optimality of this selection is related to the final accuracy of the model. This finite dimensional approximation \hat{\varphi}(a) can be used in the primal problem (18) to estimate w with a sparse representation [11].

C. Method of alternating descent

The dimension of the approximate feature map \hat{\varphi} can grow large when the number of subsamples M is large. When the number of classes is also large, the size of the Hessian, which is proportional to m and d, becomes very large and causes the matrix inversion to be computationally intractable. To overcome this problem we resort to an alternating descent version of Newton's method [8], where in each iteration the logistic regression objective function is minimized for each parameter vector \beta_c separately. The negative log likelihood criterion following this strategy is given by

l_{LR}(w(\beta_c)) = -\sum_{i=1}^{N} \ln Pr(Y = y_i | X = x_i; w(\beta_c)) + \frac{\nu}{2} \beta_c^T \beta_c,     (28)

for c = 1, ..., m. Here we define w(\beta_c) = [\beta_1; \dots; \beta_c; \dots; \beta_m], where only \beta_c is adjustable in this optimization problem; the other \beta-vectors are kept constant. This results in a complexity of O(mM^2) per update of w^{(k)} instead of O(m^2 M^2) for solving the linear system using conjugate gradient [8]. As a disadvantage, the convergence rate is worse. Remark that this formulation can easily be embedded in a distributed computing environment because the m different smaller optimization problems can be handled in parallel in each iteration.
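The following sketch illustrates the two fixed-size ingredients of subsections IV-A and IV-B: a subsample selection driven by the quadratic Renyi entropy criterion (27) and the Nyström feature map of (23)-(26) built from the eigendecomposition of the M x M kernel matrix. The greedy random-swap selection strategy and all names are illustrative assumptions, not the authors' procedure.

```python
# Fixed-size ingredients: Renyi-entropy subsample selection, eq. (27), and the
# Nystrom feature map, eqs. (23)-(26). A sketch under assumed names only.
import numpy as np

def rbf_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma ** 2)

def renyi_entropy(K_sub):
    """Entropy estimate -ln( (1/M^2) 1^T Omega 1 ), cf. eq. (27)."""
    M = K_sub.shape[0]
    return -np.log(K_sub.sum() / M ** 2)

def select_prototypes(X, M, sigma, n_trials=1000, seed=0):
    """Keep a size-M working set; accept random swaps that increase the entropy."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=M, replace=False)
    best = renyi_entropy(rbf_kernel(X[idx], X[idx], sigma))
    for _ in range(n_trials):
        cand = idx.copy()
        cand[rng.integers(M)] = rng.integers(len(X))
        h = renyi_entropy(rbf_kernel(X[cand], X[cand], sigma))
        if h > best and len(set(cand)) == M:      # keep M distinct points
            idx, best = cand, h
    return idx

def nystrom_features(X, X_sub, sigma, tol=1e-10):
    """Approximate feature map: component i of phi_hat(x) equals
    sum_j u_{ji} K(x_j, x) / sqrt(lambda_i^{(s)}), i.e. sqrt(lambda_hat_i) * phi_hat_i(x)."""
    lam, U = np.linalg.eigh(rbf_kernel(X_sub, X_sub, sigma))
    keep = lam > tol                              # discard numerically zero directions
    K_cross = rbf_kernel(X, X_sub, sigma)
    return K_cross @ U[:, keep] / np.sqrt(lam[keep])
```

With Psi = nystrom_features(X, X[idx], sigma), the primal problem (18) can then be solved directly in the M-dimensional approximate feature space, which is what the alternating descent scheme below exploits.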
Before stating the lemma, let us define

F_c^{(k)} = diag([t_1^{c,c}; t_2^{c,c}; \dots; t_N^{c,c}]), \quad \Psi = [\hat{\varphi}(x_1); \dots; \hat{\varphi}(x_N)],     (29)
E_c^{(k)} = [p_{c,1}^{(k)} - I(y_1 = c); \dots; p_{c,N}^{(k)} - I(y_N = c)].

Lemma 3 (alternating descent IRRLS): Kernel logistic regression can be expressed in terms of an iterative alternating descent method in which each iteration consists of m re-weighted least squares optimization problems

\min_{s_c^{(k)}} \frac{1}{2} \| \Psi s_c^{(k)} - z_c^{(k)} \|_{F_c^{(k)}}^2 + \frac{\nu}{2} (s_c^{(k)} + \beta_c^{(k-1)})^T (s_c^{(k)} + \beta_c^{(k-1)}),

where z_c^{(k)} = -(F_c^{(k)})^{-1} E_c^{(k)}, for c = 1, ..., m.

Proof: By substituting (17) into the criterion defined in (28) we obtain the alternating descent KLR objective function. Given fixed \beta_1, ..., \beta_{c-1}, \beta_{c+1}, ..., \beta_m, we consider

\min_{\beta_c} f(\beta_c, D_1) + \dots + f(\beta_c, D_C) + \frac{\nu}{2} \beta_c^T \beta_c,     (30)

for c = 1, ..., m, where

f(\beta_c, D_j) = \sum_{i \in D_j} \left( -\beta_c^T \varphi(x_i) + \ln(1 + e^{\beta_c^T \varphi(x_i)} + \kappa_i) \right)  if j = c,
f(\beta_c, D_j) = \sum_{i \in D_j} \ln(1 + e^{\beta_c^T \varphi(x_i)} + \kappa_i)  if j \neq c,

and \kappa_i denotes a constant. Again we use a Newton based strategy to infer the parameter vectors \beta_c for c = 1, ..., m. This results in m Newton updates per iteration:

s_c^{(k)} = -(\Psi^T F_c^{(k)} \Psi + \nu I)^{-1} (\Psi^T E_c^{(k)} + \nu \beta_c^{(k-1)}),     (31)
\beta_c^{(k)} = \beta_c^{(k-1)} + s_c^{(k)}.     (32)

Using an analogous reasoning as in (14)-(16), the previous Newton procedure can be reformulated into m IRRLS schemes

\min_{s_c^{(k)}} \frac{1}{2} \| \Psi s_c^{(k)} - z_c^{(k)} \|_{F_c^{(k)}}^2 + \frac{\nu}{2} (s_c^{(k)} + \beta_c^{(k-1)})^T (s_c^{(k)} + \beta_c^{(k-1)}), \quad z_c^{(k)} = -(F_c^{(k)})^{-1} E_c^{(k)},     (33)

for c = 1, ..., m. The resulting alternating descent fixed-size algorithm for KLR is presented in Algorithm 1.

Algorithm 1 Alternating descent Fixed-Size KLR
 1: Input: training data D = {(x_i, y_i)}_{i=1}^N
 2: Parameters: \nu
 3: Output: probabilities Pr(Y = y_i | X = x_i; w^{opt}), i = 1, ..., N, where w^{opt} is the converged parameter vector
 4: Initialize: \beta_c^{(0)} := 0 for c = 1, ..., m, k := 0
 5: Define: F_c^{(k)}, z_c^{(k)} according to resp. (29), (33)
 6: w^{(0)} = [\beta_1^{(0)}; \dots; \beta_m^{(0)}]
 7: support vector selection according to (27)
 8: compute features \Psi as in (29)
 9: repeat
10:   k := k + 1
11:   for c = 1 to m do
12:     compute Pr(Y = y_i | X = x_i; w^{(k-1)}), i = 1, ..., N
13:     construct F_c^{(k)}, z_c^{(k)}
14:     s_c^{(k)} = \arg\min_{s_c} \frac{1}{2} \| \Psi s_c - z_c^{(k)} \|_{F_c^{(k)}}^2 + \frac{\nu}{2} (s_c + \beta_c^{(k-1)})^T (s_c + \beta_c^{(k-1)})
15:     \beta_c^{(k)} = \beta_c^{(k-1)} + s_c^{(k)}
16:   end for
17:   w^{(k)} = [\beta_1^{(k)}; \dots; \beta_m^{(k)}]
18: until convergence
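A compact sketch of one sweep of the inner loop of Algorithm 1 (lines 11-16), i.e. the m Newton updates (31)-(32) on the Nyström features, is given below: for each class the diagonal weights F_c, the residual E_c and the regularized normal equations are formed and solved. Shapes and names (Psi, B, nu) are assumptions; this illustrates the update rule and is not the authors' MATLAB implementation.

```python
# One alternating descent sweep over the m = C-1 parameter vectors, eq. (31)-(32).
# Psi: (N, d_phi) approximate feature matrix, B: (m, d_phi) parameters beta_c as rows,
# y: integer labels in {1, ..., C}, nu: regularization constant.
import numpy as np

def alternating_descent_sweep(B, Psi, y, nu):
    N, d_phi = Psi.shape
    m = B.shape[0]
    for c in range(m):
        # class probabilities under the current parameters, cf. (17)
        scores = np.hstack([Psi @ B.T, np.zeros((N, 1))])
        scores -= scores.max(axis=1, keepdims=True)
        expo = np.exp(scores)
        p_c = (expo / expo.sum(axis=1, keepdims=True))[:, c]
        f_c = p_c * (1.0 - p_c)                       # diagonal of F_c^{(k)}, eq. (29)
        e_c = p_c - (y == c + 1)                      # E_c^{(k)} = p_{c,i} - I(y_i = c)
        H_c = Psi.T @ (f_c[:, None] * Psi) + nu * np.eye(d_phi)
        g_c = Psi.T @ e_c + nu * B[c]
        B[c] = B[c] - np.linalg.solve(H_c, g_c)       # beta_c <- beta_c + s_c
    return B
```

Repeating such sweeps until the parameters stop changing, on features produced by the Nyström sketch above, mirrors the repeat-until-convergence loop of Algorithm 1.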
V. EXPERIMENTS

All (K)LR experiments in this section are carried out in MATLAB. For the SVM experiments we used LIBSVM [14]. To benchmark the KLR algorithm according to (21) we ran experiments on several small data sets and compared with SVM. For each experiment we used an RBF kernel. The hyperparameters \nu and \sigma were tuned by a 10-fold cross-validation procedure. For each data set we used the provided realizations. (The data sets can be found at http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm.)

TABLE I. Mean and standard deviation of the error rates on different realizations of test and training set of different data sets, using KLR and SVM with RBF kernel.

                    KLR              SVM
  banana            10.39 +/- 0.47   11.53 +/- 0.66
  breast-cancer     26.86 +/- 4.67   26.04 +/- 0.66
  diabetes          23.18 +/- 1.74   23.53 +/- 1.73
  flare-solar       33.40 +/- 1.60   32.43 +/- 1.82
  german            23.73 +/- 2.15   23.61 +/- 2.07
  heart             17.38 +/- 3.00   15.95 +/- 3.26
  image              3.16 +/- 0.52    2.96 +/- 0.60
  ringnorm           2.33 +/- 0.15    1.66 +/- 0.12
  splice            11.43 +/- 0.70   10.88 +/- 0.66
  thyroid            4.53 +/- 2.25    4.80 +/- 2.19
  titanic           22.88 +/- 1.12   22.42 +/- 1.02
  twonorm            2.39 +/- 0.13    2.96 +/- 0.23
  waveform           9.68 +/- 0.48    9.88 +/- 0.43

In Table I it can be seen that the error rates of KLR are comparable with those achieved with SVM.

In Fig. 3 we plot the log likelihoods of test data produced by models inferred with two multi-class versions of LR, a model trained with LDA, and a naive baseline, as a function of the number of classes. The first multi-class model, which we will refer to as LRM, is as in (1); the second is built using binary subproblems coupled via a one-versus-all encoding scheme [3], which we call LROneVsAll. The baseline returns a likelihood which is inversely proportional to the number of classes, independent of the input. For this experiment we used a toy data set which consists of 600 data points in each of the K classes. The data in each class is generated by a mixture of 2-dimensional Gaussians. Each time a class is added, \nu is tuned using 10-fold cross validation and the log likelihood averaged over 20 runs is plotted. It can be seen that the KLR multi-class approach results in more accurate likelihood estimates on the test set compared to the alternatives.

To compare the convergence rate of KLR and its alternated descent version we used the same toy data set as before with 6 classes. The resulting curves are plotted in Fig. 1. As expected, the convergence rate of the alternated descent algorithm is lower than that of the original formulation, but the cost of each alternated descent iteration is smaller, which gives an acceptable total amount of cpu time. While KLR converges after 8 s, alternated descent KLR reaches the stopping criterion after 24 s; SVM converges after 3 s. The probability landscape of the first of the 6 classes modeled by KLR with an RBF kernel is plotted in Fig. 2.

Next we compared the fixed-size KLR implementation with the SMO implementation of LIBSVM on the UCI Adult data set [13]. In this data set one is asked to predict whether a household has an income greater than 50,000 dollars. It consists of 48,842 data points and has 14 input variables. Fig. 4 shows the percentage of correctly classified test examples as a function of M, the number of support vectors, together with the CPU time needed to train the fixed-size KLR model. For SVM we achieved a test set accuracy of 85%, which is comparable with the results shown in Fig. 4.

Finally we used the isolet task [13], which contains 26 spoken English alphabet letters characterized by 617 spectral components, to compare the multi-class fixed-size KLR algorithm with SVM binary subproblems coupled via a one-versus-one coding scheme. In total the data set contains 6,240 training examples and 1,560 test instances. Again we used 10-fold cross-validation to tune the hyperparameters. With fixed-size KLR and SVM we obtained a test set accuracy of respectively 96.4% and 96.86%, while the former additionally gives probabilistic outcomes, which are useful in the context of speech.

Fig. 1. Convergence plot of multi-class KLR and its alternating descent version.

Fig. 2. Probability landscape produced by KLR using an RBF kernel on one of the 6 classes from the Gaussian mixture data.

Fig. 4. CPU time and accuracy as a function of the number of support vectors when using the fixed-size KLR algorithm.

VI. CONCLUSIONS

In this paper we presented a fixed-size algorithm to compute a multi-class KLR model which is scalable to large data sets. We showed that the performance in terms of correct classifications is comparable to that of SVM, but with the advantage that KLR gives straightforward probabilistic outcomes, which is desirable in several applications. Experiments show the advantage of using a multi-class KLR model compared to the use of a coding scheme.

ACKNOWLEDGMENTS

Research supported by GOA AMBioRICS, CoE EF/05/006; (Flemish Government): FWO: PhD/postdoc grants, projects G.0407.02, G.0197.02, G.0141.03, G.0491.03, G.0120.03, G.0452.04, G.0499.04, G.0211.05, G.0226.06, G.0321.06, G.0553.06, G.0302.07 (ICCoS, ANMMM, MLDM); IWT: PhD Grants, GBOU (McKnow), Eureka-Flite2; Belgian Federal Science Policy Office: IUAP P5/22, PODO-II; EU: FP5-Quprodis, ERNSI; Contract Research/agreements: ISMC/IPCOS, Data4s, TML, Elia, LMS, Mastercard. JS is a professor and BDM is a full professor at K.U.Leuven, Belgium. This publication only reflects the authors' views.

REFERENCES

[1] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor and J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[2] J.A.K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers", Neural Processing Letters, 9(3):293-300, 1999.
[3] J. Zhu and T. Hastie, "Kernel logistic regression and the import vector machine", Advances in Neural Information Processing Systems, vol. 14, 2001.
[4] S.S. Keerthi, K. Duan, S.K. Shevade and A.N. Poo, "A Fast Dual Algorithm for Kernel Logistic Regression", International Conference on Machine Learning, 2002.
[5] J. Zhu and T. Hastie, "Classification of gene microarrays by penalized logistic regression", Biostatistics, vol. 5, pp. 427-444, 2004.
[6] K. Koh, S.-J. Kim and S. Boyd, "An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression", internal report, July 2006.
[7] G. Kimeldorf and G. Wahba, "Some results on Tchebycheffian spline functions", Journal of Mathematical Analysis and Applications, vol. 33, pp. 82-95, 1971.
[8] J. Nocedal and S.J. Wright, Numerical Optimization, Springer, 1999.
[9] C.K.I. Williams and M. Seeger, "Using the Nyström Method to Speed Up Kernel Machines", Proceedings of Neural Information Processing Systems, vol. 13, MIT Press, 2000.
[10] M. Girolami, "Orthogonal Series Density Estimation and the Kernel Eigenvalue Problem", Neural Computation, vol. 14(3), pp. 669-688, 2002.
[11] J.A.K. Suykens, J. De Brabanter, L. Lukas and J. Vandewalle, "Weighted least squares support vector machines: robustness and sparse approximation", Neurocomputing, vol. 48, no. 1-4, pp. 85-105, 2002.
[12] F. Pérez-Cruz, C. Bousoño-Calzón and A. Artés-Rodríguez, "Convergence of the IRWLS Procedure to the Support Vector Machine Solution", Neural Computation, vol. 17, pp. 7-18, 2005.
[13] C.J. Merz and P.M. Murphy, UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[14] C.C. Chang and C.J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
Fig. 3. Mean log likelihood as a function of the number of classes in the learning problem.