Journal of Machine Learning Research 3 (2003) 951-991    Submitted 2/02; Published 1/03

Ultraconservative Online Algorithms for Multiclass Problems

Koby Crammer    KOBICS@CS.HUJI.AC.IL
Yoram Singer    SINGER@CS.HUJI.AC.IL
School of Computer Science & Engineering
Hebrew University, Jerusalem 91904, Israel

Editor: Manfred K. Warmuth

© 2003 Koby Crammer and Yoram Singer.

Abstract

In this paper we study a paradigm to generalize online classification algorithms for binary classification problems to multiclass problems. The particular hypotheses we investigate maintain one prototype vector per class. Given an input instance, a multiclass hypothesis computes a similarity-score between each prototype and the input instance and sets the predicted label to be the index of the prototype achieving the highest similarity. To design and analyze the learning algorithms in this paper we introduce the notion of ultraconservativeness. Ultraconservative algorithms are algorithms that update only the prototypes attaining similarity-scores which are higher than the score of the correct label's prototype. We start by describing a family of additive ultraconservative algorithms where each algorithm in the family updates its prototypes by finding a feasible solution for a set of linear constraints that depend on the instantaneous similarity-scores. We then discuss a specific online algorithm that seeks a set of prototypes which have a small norm. The resulting algorithm, which we term MIRA (for Margin Infused Relaxed Algorithm), is ultraconservative as well. We derive mistake bounds for all the algorithms and provide further analysis of MIRA using a generalized notion of the margin for multiclass problems. We discuss the form the algorithms take in the binary case and show that all the algorithms from the first family reduce to the Perceptron algorithm while MIRA provides a new Perceptron-like algorithm with a margin-dependent learning rate. We then return to multiclass problems and describe an analogous multiplicative family of algorithms with corresponding mistake bounds. We end the formal part by deriving and analyzing a multiclass version of Li and Long's ROMMA algorithm. We conclude with a discussion of experimental results that demonstrate the merits of our algorithms.

1. Introduction

In this paper we present a general approach for deriving algorithms for multiclass prediction problems. In multiclass problems the goal is to assign one of $k$ labels to each input instance. Many machine learning problems can be phrased as a multiclass categorization problem. Examples of such problems include optical character recognition (OCR), text classification, and medical analysis. There are numerous specialized solutions for multiclass problems for specific models such as decision trees (Breiman et al., 1984, Quinlan, 1993) and neural networks. Another general approach is based on reducing a multiclass problem to multiple binary problems using output coding (Dietterich and Bakiri, 1995, Allwein et al., 2000). An example of a reduction that falls into the above framework is the one-against-rest approach. In one-against-rest a set of binary classifiers is trained, one classifier for each class. The $r$th classifier is trained to discriminate between the $r$th
class and the rest of the classes. New instances are classified by setting the predicted label to be the index of the classifier attaining the highest confidence in its prediction. In this paper we present a unified approach that operates directly on the multiclass problem by imposing constraints on the updates for the various classes. Thus, our approach is inherently different from methods based on output coding.

Our framework for analyzing the algorithms is the mistake bound model (Littlestone, 1988). The algorithms we study work in rounds. On each round the proposed algorithms get a new instance and output a prediction for the instance. They then receive the correct label and update their prediction rule in case they made a prediction error. The goal of the algorithms is to minimize the number of mistakes they make compared to the minimal number of errors that a hypothesis, built offline, can achieve.

The algorithms we consider in this paper maintain one prototype vector for each class. Given a new instance we compare each prototype to the instance by computing the similarity-score between the instance and each of the prototypes for the different classes. We then predict the class which achieves the highest similarity-score. In binary problems, this scheme reduces (under mild conditions) to a linear discriminator. After the algorithm makes a prediction it receives the correct label of the input instance and updates the set of prototypes. For a given input instance, the set of labels that attain similarity-scores higher than the score of the correct label is called the error set. The algorithms we describe share a common feature: they all update only the prototypes from the error sets and the prototype of the correct label. We call such algorithms ultraconservative algorithms.

We start in Section 3 by providing a motivation for our framework. We do that by revisiting the well known Perceptron algorithm and giving a new account of the algorithm using two prototype vectors, one for each class. We then extend the algorithm to a multiclass setting using the notion of ultraconservativeness.
In Section 4 we further generalize the multiclass version of the extended Perceptron algorithm and describe a new family of ultraconservative algorithms that we obtain by replacing the Perceptron's update with a set of linear equations. We give a few illustrative examples of specific updates from this family of algorithms. Going back to the Perceptron algorithm, we show that in the binary case all the different updates reduce to the Perceptron algorithm. We finish Section 4 by deriving a mistake bound that is common to all the additive algorithms in the family. We analyze both the separable and the non-separable case.

The fact that all algorithms from Section 4 achieve the same mistake bound implies that there are some undetermined degrees of freedom. We present in Section 5 a new online algorithm that gives a unique update and is based on a relaxation of the set of linear constraints employed by the family of algorithms from Section 4. The algorithm is derived by adding an objective function that incorporates the norm of the new matrix of prototypes and minimizing it subject to a subset of the linear constraints. Following a recent trend, we call the new algorithm MIRA for Margin Infused Relaxed Algorithm. We analyze MIRA and give a mistake bound related to the instantaneous margin of individual examples. This analysis leads to a modification of MIRA which incorporates the margin into the update rule. We describe a simple fixed-point algorithm that efficiently computes a single update of MIRA and prove its convergence. Both MIRA and the additive algorithms from Section 4 can be combined with kernel techniques and voting methods.

In Section 6 we derive an analogous ultraconservative family of multiplicative algorithms for multiclass problems. Here we describe two variants of multiplicative algorithms. The two variants differ in the way they normalize the set of prototypes. As in the additive case, we analyze both variants in the mistake bound model. Analogously to the additive family of algorithms, the
multiplicative family of algorithms reduces to Winnow (Littlestone, 1988) in the binary case. In Section 7 we combine the ultraconservative approach with Li and Long's (2002) algorithm to derive a multiclass version of it. In Section 8 we discuss experiments with synthetic data and real datasets that compare the additive algorithms. Our experiments indicate that MIRA outperforms the other algorithms at the expense of updating its hypothesis frequently. The algorithms presented in this paper underscore a general framework for deriving ultraconservative multiclass algorithms. This framework can be used in combination with other online techniques. To conclude, we outline some of our current research directions.

Related Work

A question that is common to numerous online algorithms is how to balance the following two demands. On one hand, we want to update the classifier we learn so that it will better predict the current input instance, in particular if an error occurs when using the current classifier. On the other hand, we do not want to change the current classifier too radically, especially if it classifies well most of the previously observed instances. The good old Perceptron algorithm suggested by Rosenblatt (1958) copes with these two requirements by replacing the classifier with a linear combination of the current hyperplane and the current instance vector. Although the algorithm uses a simple update rule, it performs well on many synthetic and real-world problems. The Perceptron algorithm spurred voluminous work which clearly cannot be covered here. For an overview of numerous additive and multiplicative online algorithms see the paper by Kivinen and Warmuth (1997). We also would like to note that a multiclass version of the Perceptron algorithm has already been provided in the widely read and cited book of Duda and Hart (1973). The multiclass version in the book is called Kesler's construction. We postpone the discussion of the relation of this construction to our family of online algorithms to Section 4. We now outline more recent research that is relevant to the work presented in this paper.

Kivinen and Warmuth (1997) presented numerous online algorithms for regression.
Their algorithms are based on minimization of an objective function which is a sum of two terms. The first term is equal to the distance between the new classifier and the current classifier while the second term is the loss on the current example. The resulting update rule can be viewed as a gradient-descent method. Although multiclass classification problems are a special case of regression problems, the algorithms for regression put emphasis on smooth loss functions which might not be suitable for classification problems.

The idea of seeking a hyperplane of a small norm is a primary goal in support vector machines (Cortes and Vapnik, 1995, Vapnik, 1998). Note that for SVMs minimizing the norm of the hyperplane is equivalent to maximizing the margin of the induced linear separator. Algorithms for constructing support vector machines solve optimization problems with a quadratic objective function and linear constraints. Anlauf and Biehl (1989) and Friess, Cristianini, and Campbell (1998) suggested an alternative approach which minimizes the objective function with a gradient-descent method. The minimization can be performed by going over the sample sequentially. Algorithms with a similar approach include the Sequential Minimal Optimization (SMO) algorithm introduced by Platt (1998). SMO works in rounds; on each round it chooses two examples of the sample and minimizes the objective function by modifying variables relevant only to these two examples. While these algorithms share some similarities with the algorithmic approaches described in this paper, they were all designed for batch problems and were not analyzed in the mistake bound model.
Another approach to the problem of designing an update rule which results in a linear classifier of a small norm was suggested by Li and Long (2002). The algorithm Li and Long proposed, called ROMMA, tackles the problem by finding a hyperplane with a minimal norm under two linear constraints. The first constraint is presented so that the new classifier will classify well previous examples, while the second constraint demands that the hyperplane will classify correctly the current new instance. Solving this minimization problem leads to an additive update rule with adaptive coefficients.

Grove, Littlestone, and Schuurmans (2001) introduced a general framework of quasi-additive binary algorithms, which contains the Perceptron and Winnow as special cases. Gentile (2001) proposed an extension to a subset of the quasi-additive algorithms, which uses an additive conservative update rule with decreasing learning rates.

All of the work described above is designed to solve binary classification problems. These binary classifiers can be used in a multiclass setting by reducing the multiclass problem to multiple binary problems using output coding such as one-against-rest. Mesterharm (1999) suggested a multiclass online algorithm which combines results from a set of sub-experts. Using this algorithm Mesterharm derives a Winnow-like algorithm and provides a corresponding mistake bound. The multiclass algorithm of Mesterharm is closely related to the multiplicative family of algorithms we present in Section 6, though our family of multiplicative algorithms is more general.

The algorithms presented in this paper are reminiscent of some of the widely used methods for constructing classifiers in multiclass problems. As mentioned above, a popular approach for solving classification problems with many classes is to learn a set of binary classifiers where each classifier is designed to separate one class from the rest of the classes. If we use the Perceptron algorithm to learn the binary classifiers, we need to maintain and update one vector for each possible class. This approach shares the same form of hypothesis as the algorithms presented in this paper, which maintain one prototype per class.
Nonetheless, there is one major difference between the ultraconservative algorithms we present and the one-against-rest approach. In one-against-rest we update and change each of the classifiers independently of the others. In fact we can construct them one after the other by re-running over the data. In contrast, ultraconservative algorithms update all the prototypes in tandem; thus updating one prototype has a global effect on the other prototypes. There are situations in which there is an error due to some classes, but not all the respective prototypes should be updated. Put another way, we might perform milder changes to the set of classifiers by changing them together with the prototypes so as to achieve the same goal. As a result we get better mistake bounds and empirically better algorithms.

2. Preliminaries

The focus of this paper is online algorithms for multiclass prediction problems. We observe a sequence $(\bar{x}^1,y^1),\ldots,(\bar{x}^t,y^t),\ldots$ of instance-label pairs. Each instance $\bar{x}^t$ is in $\mathbb{R}^n$ and each label belongs to a finite set $Y$ of size $k$. We assume without loss of generality that $Y=\{1,2,\ldots,k\}$. A multiclass classifier is a function $H(\bar{x})$ that maps instances from $\mathbb{R}^n$ into one of the possible labels in $Y$. In this paper we focus on classifiers of the form $H(\bar{x})=\arg\max_{r=1}^{k}\{\bar{M}_r\cdot\bar{x}\}$, where $M$ is a $k\times n$ matrix over the reals and $\bar{M}_r\in\mathbb{R}^n$ denotes the $r$th row of $M$. We call the inner product of $\bar{M}_r$ with the instance $\bar{x}$ the similarity-score for class $r$. Thus, the classifiers we consider in this paper set the label of an instance to be the index of the row of $M$ which achieves the highest similarity-score. The margin of $H$ on $\bar{x}$ is the difference between the similarity-score of the correct label $y$ and the
maximum among the similarity-scores of the rest of the rows of $M$. Formally, the margin that $M$ achieves on $(\bar{x},y)$ is,
\[ \bar{M}_y\cdot\bar{x}-\max_{r\neq y}\{\bar{M}_r\cdot\bar{x}\}. \]
The $l_p$ norm of a vector $\bar{u}=(u_1,\ldots,u_l)$ in $\mathbb{R}^l$ is
\[ \|\bar{u}\|_p=\left(\sum_{i=1}^{l}|u_i|^p\right)^{1/p}. \]
We define the $l_p$ vector-norm of a matrix $A$ to be the $l_p$ norm of the vector we get by concatenating the rows of $A$, that is, $\|A\|_p=\|(\bar{A}_1,\ldots,\bar{A}_k)\|_p$, where for $p=2$ the norm is known as the Frobenius norm. Similarly, we define the vector-scalar-product of two matrices $A$ and $B$ to be $A\cdot B=\sum_r\bar{A}_r\cdot\bar{B}_r$. Finally, $\delta_{i,j}$ denotes Kronecker's delta function, that is, $\delta_{i,j}=1$ if $i=j$ and $\delta_{i,j}=0$ otherwise.

The framework that we use in this paper is the mistake bound model for online learning. The algorithms we consider work in rounds. On round $t$ an online learning algorithm gets an instance $\bar{x}^t$. Given $\bar{x}^t$, the learning algorithm outputs a prediction, $\hat{y}^t=\arg\max_r\{\bar{M}_r\cdot\bar{x}^t\}$. It then receives the correct label $y^t$ and updates its classification rule by modifying the matrix $M$. We say that the algorithm made a (multiclass) prediction error if $\hat{y}^t\neq y^t$. Our goal is to make as few prediction errors as possible. When the algorithm makes a prediction error there might be more than one row of $M$ achieving a score higher than the score of the row corresponding to the correct label. We define the error-set for $(\bar{x},y)$ using a matrix $M$ to be the set of indices of all the rows in $M$ which achieve such high scores. Formally, the error-set for a matrix $M$ on an instance-label pair $(\bar{x},y)$ is,
\[ E=\{r\neq y:\bar{M}_r\cdot\bar{x}\ge\bar{M}_y\cdot\bar{x}\}. \]
Many online algorithms update their prediction rule only on rounds on which they made a prediction error. Such algorithms are called conservative. We now give a definition that extends the notion of conservativeness to multiclass settings.

Definition 1 (Ultraconservative) An online multiclass algorithm of the form $H(\bar{x})=\arg\max_r\{\bar{M}_r\cdot\bar{x}\}$ is ultraconservative if it modifies $M$ only when the error-set $E$ for $(\bar{x},y)$ is not empty and the indices of the rows that are modified are from $E\cup\{y\}$.

Note that our definition implies that an ultraconservative algorithm is also conservative. For binary problems the two definitions coincide.

3. From Binary to Multiclass

The Perceptron algorithm of Rosenblatt (1958) is a well known online algorithm for binary classification problems.
The algorithm maintains a weight vector $\bar{w}\in\mathbb{R}^n$ that is used for prediction. To motivate our multiclass algorithms let us now describe the Perceptron algorithm using the notation
employed in this paper. In our setting the label of each instance belongs to the set $\{1,2\}$. Given an input instance $\bar{x}$ the Perceptron algorithm predicts that its label is $\hat{y}=1$ iff $\bar{w}\cdot\bar{x}\ge0$ and otherwise it predicts $\hat{y}=2$. The algorithm modifies $\bar{w}$ only on rounds with prediction errors and is thus conservative. On such rounds $\bar{w}$ is changed to $\bar{w}+\bar{x}$ if the correct label is $y=1$ and to $\bar{w}-\bar{x}$ if $y=2$.

To implement the Perceptron algorithm using a prototype matrix $M$ with one row (prototype) per class, we set the first row $\bar{M}_1$ to $\bar{w}$ and the second row $\bar{M}_2$ to $-\bar{w}$. We now modify $M$ every time the algorithm misclassifies $\bar{x}$ as follows. If the correct label is 1 we replace $\bar{M}_1$ with $\bar{M}_1+\bar{x}$ and $\bar{M}_2$ with $\bar{M}_2-\bar{x}$. Similarly, we replace $\bar{M}_1$ with $\bar{M}_1-\bar{x}$ and $\bar{M}_2$ with $\bar{M}_2+\bar{x}$ when the correct label is 2 and $\bar{x}$ is misclassified. Thus, the row $\bar{M}_y$ is moved toward the misclassified instance $\bar{x}$ while the other row is moved away from $\bar{x}$. Note that this update implies that the total change to the two prototypes is zero. An illustration of this geometrical interpretation is given on the left-hand side of Figure 1. It is straightforward to verify that the algorithm is equivalent to the Perceptron algorithm.

[Figure 1: A geometrical illustration of the update for a binary problem (left) and a four-class problem (right) using the extended Perceptron algorithm.]

We can now use this interpretation and generalize the Perceptron algorithm to multiclass problems as follows. For $k$ classes we maintain a matrix $M$ of $k$ rows, one row per class. For each input instance $\bar{x}$, the multiclass generalization of the Perceptron calculates the similarity-score between the instance and each of the $k$ prototypes. The predicted label, $\hat{y}$, is the index of the row (prototype) of $M$ which achieves the highest score, that is, $\hat{y}=\arg\max_r\{\bar{M}_r\cdot\bar{x}\}$.
If $\hat{y}\neq y$ the algorithm moves $\bar{M}_y$ toward $\bar{x}$ by replacing $\bar{M}_y$ with $\bar{M}_y+\bar{x}$. In addition, the algorithm moves each row $\bar{M}_r$ ($r\neq y$) for which $\bar{M}_r\cdot\bar{x}\ge\bar{M}_y\cdot\bar{x}$ away from $\bar{x}$. The indices of these rows constitute the error set $E$. The algorithms presented in this paper, and in particular the multiclass version of the Perceptron algorithm, modify $M$ such that the following property holds: the total change in units of $\bar{x}$ in the rows of $M$ that are moved away from $\bar{x}$ is equal to the change of $\bar{M}_y$ (in units of $\bar{x}$). Specifically, for the multiclass Perceptron we replace $\bar{M}_y$ with $\bar{M}_y+\bar{x}$ and for each $r$ in $E$ we replace $\bar{M}_r$ with $\bar{M}_r-\bar{x}/|E|$. A geometric illustration of this update is given in the right-hand side of Figure 1. There are four classes in the example appearing in the figure. The correct label of $\bar{x}$ is $y=1$ and since $\bar{M}_1$ is not the most similar vector to $\bar{x}$, it is moved toward $\bar{x}$. The rows $\bar{M}_2$ and $\bar{M}_3$ are also modified by subtracting $\bar{x}/2$ from each one. The last row $\bar{M}_4$ is not in the error-set since $\bar{M}_1\cdot\bar{x}>\bar{M}_4\cdot\bar{x}$ and therefore it is not modified. We defer the analysis of the algorithm to the next section in which we describe and analyze a family of online multiclass algorithms that also includes this algorithm.
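The multiclass Perceptron round described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `error_set` and `multiclass_perceptron_step` are hypothetical, labels are 0-indexed (the text uses 1-indexed labels), and the numbers are made up to mirror the four-class situation of Figure 1.

```python
import numpy as np

def error_set(M, x, y):
    """Indices r != y whose score M_r . x is at least the score of the correct label y."""
    scores = M @ x
    return [r for r in range(M.shape[0]) if r != y and scores[r] >= scores[y]]

def multiclass_perceptron_step(M, x, y):
    """One round of the uniform multiclass Perceptron update; returns the new matrix."""
    E = error_set(M, x, y)
    if not E:                  # ultraconservative: empty error set, no update
        return M
    M = M.copy()
    M[y] += x                  # move the correct prototype toward x
    M[E] -= x / len(E)         # move each error-set prototype away from x by x/|E|
    return M

# Made-up four-class example: label 0 is correct, rows 1 and 2 outscore it, row 3 does not.
M = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.5], [-1.0, 0.0]])
x = np.array([1.0, 0.5])
y = 0
M_new = multiclass_perceptron_step(M, x, y)
```

In this example the two offending rows each lose $\bar{x}/2$, the correct row gains $\bar{x}$, and the fourth row is untouched, so the rows of the matrix still sum to the same vector as before the update.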
4. A Family of Additive Multiclass Algorithms

We describe a family of ultraconservative algorithms by using the algorithm of the previous section as our starting point. The algorithm is ultraconservative and thus updates $M$ only on rounds with prediction errors. The row $\bar{M}_y$ is changed to $\bar{M}_y+\bar{x}$ while for each $r\in E$ we modify $\bar{M}_r$ to $\bar{M}_r-\bar{x}/|E|$. Let us introduce a vector of weights $\bar{\tau}=(\tau_1,\ldots,\tau_k)$ and rewrite the update of the $r$th row as $\bar{M}_r+\tau_r\bar{x}$. Thus, for $r=y$ we have $\tau_r=1$, for $r\in E$ we set $\tau_r=-1/|E|$, and for $r\notin E\cup\{y\}$, $\tau_r$ is zero. The weights $\bar{\tau}$ were chosen such that the total change of the rows of $M$ whose indices are from $E$ is equal to the change in $\bar{M}_y$, that is, $1=\tau_y=-\sum_{r\in E}\tau_r$. If we do not impose the condition that for $r\in E$ all the $\tau_r$'s attain the same value, then the constraints on $\bar{\tau}$ become $\sum_{r\in E\cup\{y\}}\tau_r=0$. This constraint enables us to move the prototypes from the error-set $E$ away from $\bar{x}$ in different proportions as long as the total change sums to one. The result is a whole family of multiclass algorithms. A pseudo-code of the family of algorithms is provided in Figure 2. Note that the constraints on $\bar{\tau}$ are redundant and we could have used fewer constraints. We make use of this more elaborate set of constraints in the next section.

Before analyzing the family of algorithms we have just introduced, we give a few examples of specific schemes to set $\bar{\tau}$. We have already described one update above which sets $\bar{\tau}$ to,
\[ \tau_r=\begin{cases}-\frac{1}{|E|} & r\in E\\ 1 & r=y\\ 0 & \text{otherwise.}\end{cases} \]
Since all the $\tau$'s for rows in the error-set are equal, we call this the uniform multiclass update. We can also be further conservative and modify in addition to $\bar{M}_y$ only one other row in $M$. A reasonable choice is to modify the row that achieves the highest similarity-score. That is, we set $\bar{\tau}$ to,
\[ \tau_r=\begin{cases}-1 & r=\arg\max_s\{\bar{M}_s\cdot\bar{x}\}\\ 1 & r=y\\ 0 & \text{otherwise.}\end{cases} \]
We call this form of updating $\bar{\tau}$ the max-score multiclass update. The two examples above set $\tau_r$ for $r\in E$ to a fixed value, ignoring the actual values of the similarity-scores each row achieves. We can also set $\bar{\tau}$ in proportion to the excess in the similarity-score of each row in the error set (with respect to $\bar{M}_y$).
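For instance, we can set $\bar{\tau}$ to be,
\[ \tau_r=\begin{cases}-\dfrac{[\bar{M}_r\cdot\bar{x}-\bar{M}_y\cdot\bar{x}]_+}{\sum_{s=1}^{k}[\bar{M}_s\cdot\bar{x}-\bar{M}_y\cdot\bar{x}]_+} & r\neq y\\[2ex] 1 & r=y,\end{cases} \]
where $[x]_+$ is equal to $x$ if $x\ge0$ and zero otherwise. Note that the above update implies that $\tau_r=0$ for $r\notin E\cup\{y\}$.

We describe experiments comparing the above updates in Section 8. We proceed to analyze the family of algorithms.

4.1 Analysis

Before giving the analysis of the algorithms of Figure 2 we prove the following auxiliary lemma.

Lemma 2 For any set $\{\tau_1,\ldots,\tau_k\}$ such that $\sum_{r=1}^{k}\tau_r=0$ and $\tau_r\le\delta_{r,y}$ for $r=1,\ldots,k$, we have $\sum_r\tau_r^2\le2\tau_y\le2$.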
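The three update schemes (uniform, max-score, and proportional) and the inequality of Lemma 2 can be checked numerically with a small sketch. The function names below are hypothetical, labels are 0-indexed, and two details are assumptions not taken from the text: the max-score variant picks the highest-scoring label within the error set, and the proportional variant falls back to the uniform update when every excess is exactly zero (the formula would otherwise divide by zero).

```python
import numpy as np

def _error_set(scores, y):
    return [r for r in range(len(scores)) if r != y and scores[r] >= scores[y]]

def tau_uniform(scores, y):
    tau = np.zeros(len(scores))
    E = _error_set(scores, y)
    if E:
        tau[E] = -1.0 / len(E)
        tau[y] = 1.0
    return tau

def tau_max_score(scores, y):
    tau = np.zeros(len(scores))
    E = _error_set(scores, y)
    if E:
        worst = max(E, key=lambda r: scores[r])  # assumption: argmax restricted to E
        tau[worst] = -1.0
        tau[y] = 1.0
    return tau

def tau_proportional(scores, y):
    if not _error_set(scores, y):
        return np.zeros(len(scores))
    excess = np.maximum(scores - scores[y], 0.0)  # [M_r.x - M_y.x]_+ (zero at r = y)
    total = excess.sum()
    if total == 0.0:                              # assumption: all ties, use uniform
        return tau_uniform(scores, y)
    tau = -excess / total
    tau[y] = 1.0
    return tau

scores = np.array([0.5, 1.0, 1.25, -1.0])  # made-up similarity-scores, correct label 0
y = 0
taus = [f(scores, y) for f in (tau_uniform, tau_max_score, tau_proportional)]
```

Each of the three vectors sums to zero, has $\tau_y=1$, leaves the fourth coordinate at zero, and satisfies $\sum_r\tau_r^2\le2\tau_y$, as Lemma 2 states.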
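Proof Since for $r\neq y$ the value of $\tau_r$ cannot be positive we have,
\[ \|\bar{\tau}\|_1=\sum_{r=1}^{k}|\tau_r|=|\tau_y|+\sum_{r\neq y}(-\tau_r). \]
Using the equality $\sum_{r=1}^{k}\tau_r=0$ we get,
\[ \|\bar{\tau}\|_1=2\tau_y. \]
Applying Hölder's inequality, and noting that $\|\bar{\tau}\|_\infty=\tau_y$ since $-\tau_r\le\tau_y$ for every $r\neq y$, we get,
\[ \sum_{r=1}^{k}\tau_r^2=\sum_{r=1}^{k}(\tau_r\tau_r)\le\|\bar{\tau}\|_1\|\bar{\tau}\|_\infty=2\tau_y\,\tau_y\le2\tau_y\le2, \]
where for the last two inequalities we used the fact that $0\le\tau_y\le1$.

    Initialize: Set $M=0$ ($M\in\mathbb{R}^{k\times n}$).
    Loop: For $t=1,2,\ldots,T$
      Get a new instance $\bar{x}^t\in\mathbb{R}^n$.
      Predict $\hat{y}^t=\arg\max_{r=1}^{k}\{\bar{M}_r\cdot\bar{x}^t\}$.
      Get a new label $y^t$.
      Set $E=\{r\neq y^t:\bar{M}_r\cdot\bar{x}^t\ge\bar{M}_{y^t}\cdot\bar{x}^t\}$.
      If $E\neq\emptyset$ update $M$ by choosing any $\tau_1^t,\ldots,\tau_k^t$ that satisfy:
        1. $\tau_r^t\le0$ for $r\neq y^t$ and $\tau_{y^t}^t\le1$.
        2. $\sum_{r=1}^{k}\tau_r^t=0$.
        3. $\tau_r^t=0$ for $r\notin E\cup\{y^t\}$.
        4. $\tau_{y^t}^t=1$.
      For $r=1,2,\ldots,k$ update: $\bar{M}_r\leftarrow\bar{M}_r+\tau_r^t\bar{x}^t$.
    Output: $H(\bar{x})=\arg\max_r\{\bar{M}_r\cdot\bar{x}\}$.

Figure 2: A family of additive multiclass algorithms.

We now give the main theorem of this section.

Theorem 3 Let $(\bar{x}^1,y^1),\ldots,(\bar{x}^T,y^T)$ be an input sequence for any multiclass algorithm from the family described in Figure 2 where $\bar{x}^t\in\mathbb{R}^n$ and $y^t\in\{1,2,\ldots,k\}$. Denote by $R=\max_t\|\bar{x}^t\|$. Assume that there is a matrix $M^*$ of a unit vector-norm, $\|M^*\|=1$, that classifies the entire sequence correctly with margin $\gamma=\min_t\{\bar{M}^*_{y^t}\cdot\bar{x}^t-\max_{r\neq y^t}\bar{M}^*_r\cdot\bar{x}^t\}>0$. Then, the number of mistakes that the algorithm makes is at most
\[ 2\frac{R^2}{\gamma^2}. \]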
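Proof Assume that an error occurred when classifying the $t$th example $(\bar{x}^t,y^t)$ using the matrix $M$. Denote by $M'$ the updated matrix after round $t$. That is, for $r=1,2,\ldots,k$ we have $\bar{M}'_r=\bar{M}_r+\tau_r^t\bar{x}^t$. To prove the theorem we bound $\|M\|_2^2$ from above and below. First, we derive a lower bound on $\|M\|^2$ by bounding the term,
\[ \sum_{r=1}^{k}\bar{M}^*_r\cdot\bar{M}'_r=\sum_{r=1}^{k}\bar{M}^*_r\cdot\left(\bar{M}_r+\tau_r^t\bar{x}^t\right)=\sum_{r=1}^{k}\bar{M}^*_r\cdot\bar{M}_r+\sum_r\tau_r^t\left(\bar{M}^*_r\cdot\bar{x}^t\right). \tag{1} \]
We further develop the second term of Equation (1) using the second constraint of the algorithm ($\sum_{r=1}^{k}\tau_r^t=0$). Substituting $\tau_{y^t}^t=-\sum_{r\neq y^t}\tau_r^t$ we get,
\[ \begin{aligned}\sum_r\tau_r^t\left(\bar{M}^*_r\cdot\bar{x}^t\right)&=\sum_{r\neq y^t}\tau_r^t\left(\bar{M}^*_r\cdot\bar{x}^t\right)+\tau_{y^t}^t\left(\bar{M}^*_{y^t}\cdot\bar{x}^t\right)\\&=\sum_{r\neq y^t}\tau_r^t\left(\bar{M}^*_r\cdot\bar{x}^t\right)-\sum_{r\neq y^t}\tau_r^t\left(\bar{M}^*_{y^t}\cdot\bar{x}^t\right)\\&=\sum_{r\neq y^t}\left(-\tau_r^t\right)\left(\bar{M}^*_{y^t}-\bar{M}^*_r\right)\cdot\bar{x}^t.\end{aligned} \tag{2} \]
Using the assumption that $M^*$ classifies each instance with a margin of at least $\gamma$ and that $\tau_{y^t}^t=1$ (fourth constraint) we obtain,
\[ \sum_r\tau_r^t\left(\bar{M}^*_r\cdot\bar{x}^t\right)\ge\sum_{r\neq y^t}\left(-\tau_r^t\right)\gamma=\tau_{y^t}^t\gamma=\gamma. \tag{3} \]
Combining Equation (1) and Equation (3) we get,
\[ M^*\cdot M'\ge M^*\cdot M+\gamma. \]
Thus, if the algorithm made $m$ mistakes in $T$ rounds then the matrix $M$ satisfies,
\[ M^*\cdot M\ge m\gamma. \tag{4} \]
Using the vector-norm definition and applying the Cauchy-Schwartz inequality we get,
\[ \|M^*\|^2\|M\|^2=\left(\sum_{r=1}^{k}\|\bar{M}^*_r\|^2\right)\left(\sum_{r=1}^{k}\|\bar{M}_r\|^2\right)\ge\left(\bar{M}^*_1\cdot\bar{M}_1+\ldots+\bar{M}^*_k\cdot\bar{M}_k\right)^2=\left(\sum_{r=1}^{k}\bar{M}^*_r\cdot\bar{M}_r\right)^2. \tag{5} \]
Plugging Equation (4) into Equation (5) and using the assumption that $M^*$ is of a unit vector-norm we get the following lower bound,
\[ \|M\|^2\ge m^2\gamma^2. \tag{6} \]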
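Next, we bound the vector-norm of $M$ from above. As before, assume that an error occurred when classifying the example $(\bar{x}^t,y^t)$ using the matrix $M$ and denote by $M'$ the matrix after the update. Then,
\[ \begin{aligned}\|M'\|^2=\sum_r\|\bar{M}'_r\|^2&=\sum_r\|\bar{M}_r+\tau_r^t\bar{x}^t\|^2\\&=\sum_r\|\bar{M}_r\|^2+2\sum_r\tau_r^t\left(\bar{M}_r\cdot\bar{x}^t\right)+\sum_r\|\tau_r^t\bar{x}^t\|^2\\&=\|M\|^2+2\sum_r\tau_r^t\left(\bar{M}_r\cdot\bar{x}^t\right)+\|\bar{x}^t\|^2\sum_r(\tau_r^t)^2.\end{aligned} \tag{7} \]
We further develop the second term using the second constraint of the algorithm and analogously to Equation (2) we get,
\[ \sum_r\tau_r^t\left(\bar{M}_r\cdot\bar{x}^t\right)=\sum_{r\neq y^t}\left(-\tau_r^t\right)\left(\bar{M}_{y^t}-\bar{M}_r\right)\cdot\bar{x}^t. \]
Since $\bar{x}^t$ was misclassified we need to consider the following two cases. The first case is when the label $r$ was not the source of the error, that is $(\bar{M}_{y^t}-\bar{M}_r)\cdot\bar{x}^t>0$. Then, using the third constraint ($r\notin E\cup\{y^t\}\Rightarrow\tau_r^t=0$) we get that $\tau_r^t=0$ and thus $(-\tau_r^t)(\bar{M}_{y^t}-\bar{M}_r)\cdot\bar{x}^t=0$. The second case is when one of the sources of error was the label $r$. In that case $(\bar{M}_{y^t}-\bar{M}_r)\cdot\bar{x}^t\le0$. Using the first constraint of the algorithm we know that $\tau_r^t\le0$ and thus $(-\tau_r^t)(\bar{M}_{y^t}-\bar{M}_r)\cdot\bar{x}^t\le0$. Finally, summing over all $r$ we get,
\[ \sum_r\tau_r^t\left(\bar{M}_r\cdot\bar{x}^t\right)\le0. \tag{8} \]
Plugging Equation (8) into Equation (7) we get,
\[ \|M'\|^2\le\|M\|^2+\|\bar{x}^t\|^2\sum_r(\tau_r^t)^2. \]
Using the bound $\|\bar{x}^t\|\le R$ and Lemma 2 we obtain,
\[ \|M'\|^2\le\|M\|^2+2R^2. \tag{9} \]
Thus, if the algorithm made $m$ mistakes in $T$ rounds, the matrix $M$ satisfies,
\[ \|M\|^2\le2mR^2. \tag{10} \]
Combining Equation (6) and Equation (10), we have that,
\[ m^2\gamma^2\le\|M\|^2\le2mR^2, \]
and therefore,
\[ m\le2\frac{R^2}{\gamma^2}. \tag{11} \]

We would like to note that the bound of the above theorem reduces to the Perceptron's mistake bound in the binary case ($k=2$). To conclude this section we analyze the non-separable case by generalizing Theorem 2 of Freund and Schapire (1999) to a multiclass setting. The proof technique follows the proof outline of Freund and Schapire and is given in Appendix A.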
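The separable-case bound $m\le2R^2/\gamma^2$ of Theorem 3 can be sanity-checked on synthetic data. The sketch below is an assumption-laden illustration rather than an experiment from the paper: the target matrix `M_star`, the sizes, the random seed, and the margin threshold of 0.1 are all made up, labels are 0-indexed, and the uniform update is used as one member of the family. Every update (a round with a non-empty error set) is counted, and several passes over the data are treated as one longer input sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, T = 4, 5, 400

# Hypothetical target matrix with unit vector-norm (Frobenius norm).
M_star = rng.normal(size=(k, n))
M_star /= np.linalg.norm(M_star)

# Label random instances with M_star and keep only those with margin above 0.1.
X = rng.normal(size=(T, n))
S = X @ M_star.T
Y = S.argmax(axis=1)
top2 = np.sort(S, axis=1)[:, -2:]        # second-highest and highest score per row
margins = top2[:, 1] - top2[:, 0]
keep = margins > 0.1
X, Y, margins = X[keep], Y[keep], margins[keep]

gamma = margins.min()                    # margin of M_star on the kept sequence
R = np.linalg.norm(X, axis=1).max()
bound = 2 * R**2 / gamma**2

M = np.zeros((k, n))
updates = 0
for _ in range(20):                      # repeated passes form one longer sequence
    for x, y in zip(X, Y):
        scores = M @ x
        E = [r for r in range(k) if r != y and scores[r] >= scores[y]]
        if E:                            # uniform multiclass update
            M[y] += x
            M[E] -= x / len(E)
            updates += 1
```

The quantity `updates` should never exceed `bound`; in practice it is typically far smaller, since the bound is a worst-case guarantee.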
Theorem 4 Let $(\bar{x}^1,y^1),\ldots,(\bar{x}^T,y^T)$ be an input sequence for any multiclass algorithm from the family described in Figure 2, where $\bar{x}^t\in\mathbb{R}^n$ and $y^t\in\{1,2,\ldots,k\}$. Denote by $R=\max_t\|\bar{x}^t\|$. Let $M^*$ be a prototype matrix of a unit vector-norm, $\|M^*\|=1$, and fix some $\gamma>0$. Define,
\[ d^t=\max\left\{0,\ \gamma-\left[\bar{M}^*_{y^t}\cdot\bar{x}^t-\max_{r\neq y^t}\bar{M}^*_r\cdot\bar{x}^t\right]\right\}, \]
and denote by $D^2=\sum_{t=1}^{T}(d^t)^2$. Then the number of mistakes the algorithm makes is at most
\[ 2\frac{(R+D)^2}{\gamma^2}. \]

    Initialize: Set $M\neq0$, $M\in\mathbb{R}^{k\times n}$.
    Loop: For $t=1,2,\ldots,T$
      Get a new instance $\bar{x}^t$.
      Predict $\hat{y}^t=\arg\max_r\{\bar{M}_r\cdot\bar{x}^t\}$.
      Get a new label $y^t$.
      Find $\bar{\tau}^t$ that solves the following optimization problem:
        $\min_{\bar{\tau}}\ \frac{1}{2}\sum_r\|\bar{M}_r+\tau_r\bar{x}^t\|_2^2$
        subject to: (1) $\tau_r\le\delta_{r,y^t}$ for $r=1,\ldots,k$
                    (2) $\sum_{r=1}^{k}\tau_r=0$
      Update: $\bar{M}_r\leftarrow\bar{M}_r+\tau_r^t\bar{x}^t$ for $r=1,2,\ldots,k$.
    Output: $H(\bar{x})=\arg\max_r\{\bar{M}_r\cdot\bar{x}\}$.

Figure 3: The Margin Infused Relaxed Algorithm (MIRA).

4.2 The Relation to Kesler's Construction

Before turning to a more complex multiclass version, we would like to discuss the relation of the family of updates described in this section to Kesler's construction (Duda and Hart, 1973). Kesler's construction is attributed to Carl Kesler and was described by Nilsson (1965). The construction reduces a multiclass classification problem to a binary problem by expanding each instance in $\mathbb{R}^n$ into an instance in $\mathbb{R}^{n(k-1)}$. By unravelling Kesler's expansion the resulting update in the original space amounts to a succession of our max update. Specifically, the update due to Kesler is ultraconservative as it modifies only the prototypes whose indices constitute the error set. Given an example $(\bar{x}^t,y^t)$ Kesler's update rule cycles through the labels $y\neq y^t$ and if $\bar{M}_y\cdot\bar{x}^t>\bar{M}_{y^t}\cdot\bar{x}^t$ it applies the max-update to the prototypes indexed $y$ and $y^t$. Therefore, the family of online algorithms presented thus far is a generalization of Kesler's construction in terms of the form of the specific update.

5. A Norm-Optimized Multiclass Algorithm

In the previous section we have described a family of algorithms where each algorithm of the family achieves the same mistake bound given by Theorem 3 and Theorem 4. This variety of equivalent
algorithms suggests that there are some degrees of freedom that we might be able to exploit. In this section we describe an online algorithm that chooses a feasible vector $\bar{\tau}$ such that the vector-norm of the matrix $M$ will be as small as possible.

To derive the new algorithm we omit the fourth constraint ($\tau_y=1$) and thus allow more flexibility in choosing $\bar{\tau}$, or smaller changes in the prototype matrix. Previous bounds provide motivation for the algorithms in this section. We choose a vector $\bar{\tau}$ which minimizes the vector-norm of the new matrix $M$ subject to the first two constraints only. As we show in the sequel, the solution of the optimization problem automatically satisfies the third constraint. The algorithm attempts to update the matrix $M$ on each round regardless of whether there was a prediction error or not. We show below that the algorithm is ultraconservative and thus $\bar{\tau}$ is the zero vector if $\bar{x}$ is correctly classified (and no update takes place). Following the trend set by Li and Long (2002) and Gentile (2001), we term our algorithm MIRA for Margin Infused Relaxed Algorithm. The algorithm is described in Figure 3.

Before investigating the properties of the algorithm, we rewrite the optimization problem that MIRA solves on each round in a more convenient form. Omitting the example index the objective function becomes,
\[ \frac{1}{2}\sum_r\|\bar{M}_r+\tau_r\bar{x}\|^2=\frac{1}{2}\sum_r\|\bar{M}_r\|^2+\sum_r\tau_r\left(\bar{M}_r\cdot\bar{x}\right)+\frac{1}{2}\sum_r\tau_r^2\|\bar{x}\|^2. \]
Omitting $\frac{1}{2}\sum_r\|\bar{M}_r\|^2$ which is constant, the quadratic optimization problem becomes,
\[ \min_{\bar{\tau}}\ Q(\bar{\tau})=\frac{1}{2}A\sum_{r=1}^{k}\tau_r^2+\sum_{r=1}^{k}B_r\tau_r \tag{12} \]
\[ \text{subject to: }\forall r\ \ \tau_r\le\delta_{r,y}\ \text{ and }\ \sum_r\tau_r=0, \]
where,
\[ A=\|\bar{x}\|^2, \tag{13} \]
and
\[ B_r=\bar{M}_r\cdot\bar{x}. \tag{14} \]
Since $Q$ is a quadratic function with $A>0$ (for $\bar{x}\neq0$), and thus strictly convex, and the constraints are linear, the problem has a unique solution.

We now show that MIRA automatically satisfies the third constraint of the family of algorithms from Section 4, which implies that it is ultraconservative. We first prove the following auxiliary lemma.

Lemma 5 Let $\bar{\tau}$ be the optimal solution of the constrained optimization problem given by Equation (12) for an instance-label pair $(\bar{x},y)$. For each $r\neq y$ such that $B_r\le B_y$ we have $\tau_r=0$.

Proof Assume by contradiction that there is a vector $\bar{\tau}$ which minimizes the objective function of Equation (12) and for some $s\neq y$ we have that both $B_s\le B_y$ and $\tau_s<0$.
Note that this implies that $\tau_y>0$. Define a new vector $\bar{\tau}'$ as follows,
\[ \tau'_r=\begin{cases}0 & r=s\\ \tau_y+\tau_s & r=y\\ \tau_r & \text{otherwise.}\end{cases} \]
It is easy to verify that the two linear constraints of MIRA are still satisfied by $\bar{\tau}'$. Since $\bar{\tau}'$ and $\bar{\tau}$ differ only at their $s$ and $y$ components we get,
\[ Q(\bar{\tau}')-Q(\bar{\tau})=\frac{1}{2}A\left(\tau_s'^2+\tau_y'^2\right)+\tau'_sB_s+\tau'_yB_y-\left[\frac{1}{2}A\left(\tau_s^2+\tau_y^2\right)+\tau_sB_s+\tau_yB_y\right]. \]
Expanding $\bar{\tau}'$ we get,
\[ \begin{aligned}Q(\bar{\tau}')-Q(\bar{\tau})&=\frac{1}{2}A\left(\tau_s+\tau_y\right)^2+\left(\tau_y+\tau_s\right)B_y-\left[\frac{1}{2}A\left(\tau_s^2+\tau_y^2\right)+\tau_sB_s+\tau_yB_y\right]\\&=A\tau_s\tau_y+\tau_s\left(B_y-B_s\right).\end{aligned} \]
From the fact that $\tau_s<0$ and the assumption ($B_s\le B_y$) we get that the right term is less than or equal to zero. Also, since $A\tau_y>0$ we get that the left term is less than zero. We therefore get that $Q(\bar{\tau}')-Q(\bar{\tau})<0$, which contradicts the assumption that $\bar{\tau}$ is a solution of Equation (12).

The lemma implies that if a label $r$ is not a source of error, then the $r$th prototype, $\bar{M}_r$, is not updated after $(\bar{x},y)$ has been observed. In other words, the solution of Equation (12) satisfies $\tau_r=0$ for all $r\neq y$ with $\bar{M}_r\cdot\bar{x}\le\bar{M}_y\cdot\bar{x}$.

Corollary 6 MIRA is ultraconservative.

Proof Let $(\bar{x},y)$ be a new example fed to the algorithm, and let $\bar{\tau}$ be the coefficients found by the algorithm. From Lemma 5 we get that for each label $r$ whose score ($\bar{M}_r\cdot\bar{x}$) is not larger than the score of the correct label ($\bar{M}_y\cdot\bar{x}$) its corresponding value $\tau_r$ is set to zero. This implies that only the indices which belong to the set $E\cup\{y\}=\{r\neq y:\bar{M}_r\cdot\bar{x}\ge\bar{M}_y\cdot\bar{x}\}\cup\{y\}$ may be updated. Furthermore, if the algorithm predicts correctly that the label is $y$, we get that $E=\emptyset$ and $\tau_r=0$ for all $r\neq y$. In this case $\tau_y$ is set to zero due to the constraint $\sum_r\tau_r=\tau_y+\sum_{r\neq y}\tau_r=0$. Hence, $\bar{\tau}=0$ and the algorithm does not modify $M$ on $(\bar{x},y)$. Thus, the conditions required for ultraconservativeness are satisfied.

In Section 5.3 we give a detailed analysis of MIRA that incorporates the margin achieved on each example, and can be used to derive a mistake bound. Let us first show that the cumulative $l_1$-norm of the coefficients $\bar{\tau}^t$ is bounded.

Theorem 7 Let $(\bar{x}^1,y^1),\ldots,(\bar{x}^T,y^T)$ be an input sequence to MIRA where $\bar{x}^t\in\mathbb{R}^n$ and $y^t\in\{1,2,\ldots,k\}$. Let $R=\max_t\|\bar{x}^t\|$ and assume that there is a prototype matrix $M^*$ of a unit vector-norm, $\|M^*\|=1$, which classifies the entire sequence correctly with margin $\gamma=\min_t\{\bar{M}^*_{y^t}\cdot\bar{x}^t-\max_{r\neq y^t}\bar{M}^*_r\cdot\bar{x}^t\}>0$.
Let $\bar{\tau}^t$ be the coefficients that MIRA finds for $(\bar{x}^t,y^t)$. Then, the following bound holds,
\[ \sum_{t=1}^{T}\|\bar{\tau}^t\|_1\le4\frac{R^2}{\gamma^2}. \]
The proof employs the technique used in the proof of Theorem 3. The proof is given for completeness in Appendix A.
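The properties established so far (feasibility of $\bar{\tau}$, Lemma 5, and Corollary 6) can be checked numerically with a small solver for the per-round problem of Equation (12). The sketch below is one possible approach, not the authors' code: it relies on the thresholded form $\tau_r=\min\{\theta-B_r/A,\delta_{y,r}\}$ that Section 5.1 states for the optimum and finds $\theta$ by plain bisection. The function name `mira_tau`, the 0-indexed labels, the iteration count, and the example numbers are all assumptions.

```python
import numpy as np

def mira_tau(M, x, y, iters=100):
    """Solve MIRA's per-round problem for tau by bisection on a scalar threshold theta."""
    A = float(x @ x)                 # A = ||x||^2, assumed to be positive
    B = M @ x                        # B_r = M_r . x
    delta = np.zeros(len(B))
    delta[y] = 1.0

    def total(theta):                # sum_r min(theta - B_r/A, delta_{r,y}), nondecreasing
        return np.minimum(theta - B / A, delta).sum()

    lo = (B / A).min()               # every term is <= 0 here, so total(lo) <= 0
    hi = (B / A).max() + 1.0         # every term equals delta here, so total(hi) = 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) < 0:
            lo = mid
        else:
            hi = mid
    return np.minimum(hi - B / A, delta)

# Made-up example: scores B = (0, 2, 1, -1) and A = 4.
M = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [-0.5, 0.0]])
x = np.array([2.0, 0.0])

tau_err = mira_tau(M, x, y=0)        # label 0 is misclassified
tau_ok = mira_tau(M, x, y=1)         # label 1 is predicted correctly
M_new = M + np.outer(tau_err, x)     # updated prototypes after the erroneous round
```

On the misclassified round only the correct row and the top-scoring wrong row move, and their updated scores meet at a common value; on the correctly classified round the solver returns the zero vector, in line with Corollary 6.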
5.1 Characteristics of the Solution

Let us now further examine the characteristics of the solution obtained by MIRA. In a recent paper (Crammer and Singer, 2000) we investigated a related setting that uses error correcting output codes for multiclass problems. Using these results, it is simple to show that the optimal $\bar{\tau}$ in Equation (12) is given by
\[ \tau_r=\min\left\{\theta-\frac{B_r}{A},\ \delta_{y,r}\right\}, \tag{15} \]
where $A=\|\bar{x}\|^2$ and $B_r=\bar{M}_r\cdot\bar{x}$ is the similarity-score of $(\bar{x},y)$ for label $r$, as defined by Equation (13) and Equation (14), respectively. The optimal value $\theta$ is uniquely defined by the equality constraint $\sum_r\tau_r=0$ of Equation (12) and satisfies,
\[ \sum_{r=1}^{k}\min\left\{\theta-\frac{B_r}{A},\ \delta_{y,r}\right\}=0. \]
The value $\theta$ can be found by a binary search (Crammer and Singer, 2000) or iteratively by solving a fixed point equation (Crammer and Singer, 2001).

We now can view MIRA in the following alternative light. Assume that the instance $(\bar{x},y)$ was misclassified by MIRA and set $E=\{r\neq y:\bar{M}_r\cdot\bar{x}\ge\bar{M}_y\cdot\bar{x}\}\neq\emptyset$. The similarity-score for label $r$ of the updated matrix on the current instance $\bar{x}$ is,
\[ \left(\bar{M}_r+\tau_r\bar{x}\right)\cdot\bar{x}=B_r+\tau_rA. \tag{16} \]
Plugging Equation (15) into Equation (16) we get that the similarity-score for class $r$ on the current instance is,
\[ \min\left\{A\theta,\ B_r+A\delta_{y,r}\right\}. \]
Since $\tau_r\le\delta_{y,r}$, the maximal similarity score the updated matrix can attain on $\bar{x}$ is $B_r+A\delta_{r,y}$. Thus, the similarity-score for class $r$ after the update is either a constant that is common to all classes, $A\theta$, or the largest similarity-score the class $r$ can attain, $B_r+A\delta_{r,y}$. The constant $A\theta$ places an upper bound on the similarity-score for all classes after the update. This bound is tight, that is, at least one similarity-score value is equal to $A\theta$.

5.2 Using MIRA for Binary Classification Problems

In this section we discuss MIRA in the special case in which there are only two possible labels. First, note that any algorithm that belongs to the family of algorithms from Figure 2 reduces to the Perceptron algorithm in the binary case. We now further analyze MIRA, assuming that the labels are drawn from the set $y\in\{-1,+1\}$. In this case the first row of $M$ corresponds to the label $y=+1$ and the second row to the label $y=-1$. We now derive the equations for the case $y=+1$.
The case $y = -1$ is derived similarly by exchanging the indices 1 and 2 in all the equations. The constraints of MIRA reduce to $\tau_1 \le 1$, $\tau_2 \le 0$ and $\tau_1 + \tau_2 = 0$. Thus, if the algorithm is initialized with a matrix $M$ such that $\bar{M}_1 + \bar{M}_2 = 0$, this property is conserved along its execution. Therefore, we can replace the matrix $M$ with a single vector $\bar{w}$ such that $\bar{M}_1 = \bar{w}$ and $\bar{M}_2 = -\bar{w}$. The objective function of Equation (12) now becomes,

$$Q = \frac{1}{2}\|\bar{x}\|^2\left(\tau_1^2 + \tau_2^2\right) + y(\bar{w} \cdot \bar{x})\tau_1 - y(\bar{w} \cdot \bar{x})\tau_2.$$
    Initialize: Set $\bar{w} \ne 0$.
    Loop: For $t = 1, 2, \ldots, T$
        Get a new instance $\bar{x}^t$.
        Predict $\hat{y}^t = \mathrm{sign}(\bar{w} \cdot \bar{x}^t)$.
        Get a new label $y^t \in \{-1, +1\}$.
        Define $\tau^t = G\left(-\frac{y^t(\bar{w} \cdot \bar{x}^t)}{\|\bar{x}^t\|^2}\right)$, where $G(x) = 0$ for $x < 0$, $G(x) = x$ for $0 \le x \le 1$, and $G(x) = 1$ for $1 < x$.
        Update: $\bar{w} \leftarrow \bar{w} + \tau^t y^t \bar{x}^t$
    Output: $H(\bar{x}) = \mathrm{sign}(\bar{w} \cdot \bar{x})$.

Figure 4: Binary MIRA.

We now omit the label index and identify $\tau$ with $\tau_1$ and $-\tau$ with $\tau_2$ to get the following optimization problem,

$$\min_{\tau}\ Q = \|\bar{x}\|^2\tau^2 + 2y(\bar{w} \cdot \bar{x})\tau \quad \text{subject to: } 0 \le \tau \le 1. \tag{17}$$

It is easy to verify that the solution of this problem is given by,

$$\tau = G\left(-\frac{y(\bar{w} \cdot \bar{x})}{\|\bar{x}\|^2}\right), \tag{18}$$

where

$$G(x) = \begin{cases} 0 & x < 0 \\ x & 0 \le x \le 1 \\ 1 & 1 < x. \end{cases}$$

Clearly, the binary version of MIRA is conservative since if $\bar{x}$ is classified correctly ($y(\bar{w} \cdot \bar{x})/\|\bar{x}\|^2 > 0$) then $\bar{w}$ is not modified. Furthermore, the coefficient $\tau$ is equal to the absolute value of the normalized margin $y(\bar{w} \cdot \bar{x})/\|\bar{x}\|^2$, as long as this normalized margin is smaller than one. The bound on the norm ensures that a new example does not change the prediction vector $\bar{w}$ too radically, even if the margin is a large negative number. The algorithm is described in Figure 4. Note that the algorithm is very similar to the Perceptron algorithm. The only difference between binary MIRA and the Perceptron is the function used for determining the value of $\tau$. For the Perceptron we use the function

$$S(x) = \begin{cases} 0 & x \le 0 \\ 1 & 0 < x \end{cases}$$

instead of $G(x)$. One interesting question that comes to mind is whether we can use other functions of the normalized margin to derive other online algorithms with corresponding mistake bounds. We leave this for future research.
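The binary update described above is short enough to sketch directly. The following is a minimal illustration under stated assumptions, not the authors' code: instances are plain Python lists, the initial vector `w0` is supplied by the caller and must be non-zero (with $\bar{w} = 0$ every normalized margin is zero and no update would ever occur), and all-zero instances are skipped to avoid a division by zero. The names `clip_unit` and `binary_mira` are hypothetical.

```python
def clip_unit(v):
    """G(v): 0 for v < 0, v on [0, 1], and 1 for v > 1."""
    return max(0.0, min(1.0, v))


def binary_mira(examples, w0):
    """Run one pass of the binary update over (x, y) pairs.

    examples -- iterable of (x, y), x a list of floats and y in {-1, +1}
    w0       -- non-zero initial weight vector (assumption of this sketch)
    Returns the final weight vector.
    """
    w = list(w0)
    for x, y in examples:
        sq_norm = sum(v * v for v in x)
        if sq_norm == 0.0:
            continue  # assumption: ignore all-zero instances
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        tau = clip_unit(-margin / sq_norm)  # zero when the example is classified correctly
        w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return w


# The first example is misclassified (normalized margin -0.5), so tau = 0.5;
# the second is classified correctly, so the vector is left unchanged.
w = binary_mira([([2.0, 0.0], -1), ([0.0, 1.0], -1)], w0=[1.0, -1.0])
```

Replacing `clip_unit` with the step function $S$ turns the same loop into the Perceptron, which makes the role of the margin-dependent coefficient easy to see.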
5.3 Margin Analysis of MIRA

In this section we further analyze MIRA by relating its mistake bound to the instantaneous margin of the individual examples. Note that MIRA was derived from the family of algorithms in Figure 2 by dropping the fourth constraint. Therefore, Theorems 3 and 4 do not hold and we thus need to derive an alternative analysis. The margin analysis we present in this section sheds some more light on the source of difficulty in achieving a mistake bound for MIRA. Our analysis here also leads to an alternative version of MIRA that incorporates the margin into the quadratic optimization problem that we need to solve on each round. Our starting point is Theorem 7. We first give a lower bound on $\tau_y$ on each round. If MIRA made a mistake on $(\bar{x}, y)$, then we know that $\max_{r \ne y} B_r - B_y > 0$, where $B_r = \bar{M}_r \cdot \bar{x}$ (see Equation (14)). Therefore, we can bound the minimal value of $\tau_y$ by a function of the (negative) margin, $B_y - \max_{r \ne y} B_r$.

Lemma 8 Let $\bar{\tau}$ be the optimal solution of the constrained optimization problem given by Equation (12) for an instance-label pair $(\bar{x}, y)$ with $A \le R^2$. Assume that the margin $B_y - \max_{r \ne y} B_r$ is bounded from above by $-\beta$, where $0 < \beta \le 2R^2$. Then $\tau_y$ is at least $\beta/(2R^2)$.

Proof Assume by contradiction that the solution of the quadratic problem of Equation (12) satisfies $\tau_y < \beta/(2R^2)$. Note that $\tau_y > 0$ since $\max_{r \ne y} B_r - B_y \ge \beta > 0$. Let us define $\Delta = \beta/(2R^2) - \tau_y > 0$ and let $s = \arg\max_r B_r$ (ties are broken arbitrarily). Define a new vector $\bar{\tau}'$ as follows,

$$\tau'_r = \begin{cases} \tau_s - \Delta & r = s \\ \tau_y + \Delta & r = y \\ \tau_r & \text{otherwise.} \end{cases}$$

The vector $\bar{\tau}'$ satisfies the constraints of the quadratic optimization problem because $\tau'_y = \beta/(2R^2) \le 1$. Since $\bar{\tau}'$ and $\bar{\tau}$ differ only at their $s$ and $y$ components we get,

$$Q(\bar{\tau}') - Q(\bar{\tau}) = \frac{1}{2}A\left({\tau'_y}^2 + {\tau'_s}^2\right) + \tau'_y B_y + \tau'_s B_s - \left[\frac{1}{2}A\left(\tau_y^2 + \tau_s^2\right) + \tau_y B_y + \tau_s B_s\right].$$

Substituting $\bar{\tau}'$ we get,

$$Q(\bar{\tau}') - Q(\bar{\tau}) = \frac{1}{2}A\left[(\tau_y + \Delta)^2 + (\tau_s - \Delta)^2\right] + B_y(\tau_y + \Delta) + B_s(\tau_s - \Delta) - \left[\frac{1}{2}A\left(\tau_y^2 + \tau_s^2\right) + \tau_y B_y + \tau_s B_s\right] = \Delta\left[A(\tau_y - \tau_s) + A\Delta + B_y - B_s\right].$$

Using the second constraint of MIRA ($\sum_r \tau_r = 0$) we get that $\|\bar{\tau}\|_1 = 2\tau_y$ and thus $\tau_y - \tau_s \le 2\tau_y$. Hence,

$$Q(\bar{\tau}') - Q(\bar{\tau}) \le \Delta\left(A(2\tau_y + \Delta) + B_y - B_s\right).$$
Substituting $\tau_y + \Delta = \beta/(2R^2)$ and using the assumption that $\tau_y < \beta/(2R^2)$ we get,

$$Q(\bar{\tau}') - Q(\bar{\tau}) \le \Delta\left(\frac{\beta A}{R^2} + B_y - B_s\right).$$
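Since $B_s - B_y \ge \beta$ for $(\bar{x}, y)$ we get,

$$Q(\bar{\tau}') - Q(\bar{\tau}) \le \Delta\left(\frac{\beta A}{R^2} - \beta\right) = \Delta\beta\,\frac{A - R^2}{R^2}.$$

Finally, since $A = \|\bar{x}\|^2 \le R^2$ and $\Delta\beta > 0$ we obtain that,

$$Q(\bar{\tau}') - Q(\bar{\tau}) \le 0.$$

Now, either $Q(\bar{\tau}') = Q(\bar{\tau})$, which contradicts the uniqueness of the solution, or $Q(\bar{\tau}') < Q(\bar{\tau})$ which implies that $\bar{\tau}$ is not the optimal value and again we reach a contradiction.

We would like to note that for the above lemma if $\beta \ge 2R^2$ then $\tau_y = 1$ regardless of the margin achieved. We are now ready to prove the main result of this section.

Theorem 9 Let $(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)$ be an input sequence to MIRA where $\bar{x}^t \in \mathbb{R}^n$ and $y^t \in \{1, 2, \ldots, k\}$. Denote by $R = \max_t \|\bar{x}^t\|$ and assume that there is a prototype matrix $M^*$ of a unit vector-norm, $\|M^*\|_2 = 1$, which classifies the entire sequence correctly with margin $\gamma = \min_t \{\bar{M}^*_{y^t} \cdot \bar{x}^t - \max_{r \ne y^t} \bar{M}^*_r \cdot \bar{x}^t\} > 0$. Denote by $n_\beta$ the number of rounds for which $B^t_{y^t} - \max_{r \ne y^t} B^t_r \le -\beta$, for some $0 < \beta \le 2R^2$. Then the following bound holds,

$$n_\beta \le 4\,\frac{R^4}{\beta\gamma^2}.$$

Proof The proof is a simple application of Theorem 7 and Lemma 8. Using the second constraint of MIRA ($\sum_r \tau_r = 0$) and Theorem 7 we get that,

$$\sum_{t=1}^{T} \tau^t_{y^t} \le 2\,\frac{R^2}{\gamma^2}. \tag{19}$$

From Lemma 8 we know that whenever $\max_{r \ne y^t} B^t_r - B^t_{y^t} \ge \beta$ then $1 \le \frac{2R^2}{\beta}\tau^t_{y^t}$ and therefore,

$$n_\beta \le \sum_{t=1}^{T} \frac{2R^2}{\beta}\,\tau^t_{y^t}. \tag{20}$$

Combining Equation (19) and Equation (20) we obtain the required bound,

$$n_\beta \le \frac{2R^2}{\beta}\sum_{t=1}^{T}\tau^t_{y^t} \le \frac{2R^2}{\beta} \cdot \frac{2R^2}{\gamma^2} \le 4\,\frac{R^4}{\beta\gamma^2}.$$

Note that Theorem 9 still does not provide a mistake bound for MIRA since in the limit of $\beta \to 0$ the bound diverges. Note also that for $\beta = 2R^2$ the bound reduces to the bounds of Theorem 3 and Theorem 7. The source of the difficulty in obtaining a mistake bound is rounds on which MIRA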
achieves a small negative margin and thus makes small changes to $M$. On such rounds $\tau_y$ can be arbitrarily small and we cannot translate the bound on $\sum_t \tau^t_{y^t}$ into a mistake bound. This implies that MIRA is not robust to small changes in the input instances. We therefore describe now a simple modification to MIRA for which we can prove a mistake bound and which, as we later see, performs well empirically.

The modified MIRA aggressively updates $M$ on every round for which the margin is smaller than some predefined value denoted again by $\beta$. This technique is by no means new, see for instance the paper of Li and Long (2002). The result is a mixed algorithm which is both aggressive and ultraconservative. On one hand, the algorithm updates $M$ whenever a minimal margin is not achieved, including rounds on which $(\bar{x}, y)$ is classified correctly but with a small margin. On the other hand, on each update of $M$ only the rows whose corresponding similarity-scores are mistakenly too high are updated. We now describe how to modify MIRA along these lines.

To achieve a minimal margin of at least $\beta \le 2R^2$ we modify the optimization problem given by Equation (12). A minimal margin of $\beta$ is achieved if for all $r \ne y$ we require $\bar{M}_y \cdot \bar{x} - \bar{M}_r \cdot \bar{x} \ge \beta$ or, alternatively, $(\bar{M}_y \cdot \bar{x} - \beta) - (\bar{M}_r \cdot \bar{x}) \ge 0$. Thus, if we replace $B_y$ with $B_y - \beta$, $M$ will be updated whenever the margin is smaller than $\beta$. We thus let MIRA solve for each example $(\bar{x}, y)$ the following constrained optimization problem,

$$\min_{\bar{\tau}}\ Q(\bar{\tau}) = \frac{1}{2}\tilde{A}\sum_{r=1}^{k}\tau_r^2 + \sum_{r=1}^{k}\tilde{B}_r\tau_r$$
$$\text{subject to: } \tau_r \le \delta_{r,y} \ \text{ and } \ \sum_{r=1}^{k}\tau_r = 0$$
$$\text{where: } \tilde{A} = A = \|\bar{x}\|^2\ ;\quad \tilde{B}_r = B_r - \beta\delta_{y,r} = \bar{M}_r \cdot \bar{x} - \beta\delta_{y,r}.$$

To get a mistake bound for this modified version of MIRA we apply Theorem 9 almost verbatim by replacing $B_r$ with $\tilde{B}_r$ in the theorem. Note that if $\tilde{B}_y - \max_{r \ne y}\tilde{B}_r \le -\beta$ then $B_y - \beta - \max_{r \ne y} B_r \le -\beta$ and hence $B_y - \max_{r \ne y} B_r \le 0$. Therefore, for any $0 \le \beta \le 2R^2$ we get that the number of mistakes of the modified algorithm is equal to $n_\beta$ which is bounded by $4R^4/(\beta\gamma^2)$. This gives the following corollary.

Corollary 10 Let $(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)$ be an input sequence to the aggressive version of MIRA with margin $0 \le \beta \le 2R^2$, where $\bar{x}^t \in \mathbb{R}^n$ and $y^t \in \{1, 2, \ldots, k\}$.
Denote by $R = \max_t \|\bar{x}^t\|$ and assume that there is a prototype matrix $M^*$ of a unit vector-norm, $\|M^*\|_2 = 1$, which classifies the entire sequence correctly with margin $\gamma = \min_t \{\bar{M}^*_{y^t} \cdot \bar{x}^t - \max_{r \ne y^t} \bar{M}^*_r \cdot \bar{x}^t\} > 0$. Then, the number of mistakes the algorithm makes is bounded above by,

$$4\,\frac{R^4}{\beta\gamma^2}.$$

Note that the bound is a decreasing function of $\beta$. This means that the more aggressive we are by requiring a minimal margin the smaller the bound on the number of mistakes the aggressively modified MIRA makes. However, this also implies that the algorithm will update $M$ more often and the solution will be less sparse.

We conclude this section with the binary version of the aggressive algorithm. As in the multiclass setting, we replace the non-aggressive version given by
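Equation (17) with the corresponding aggressive version and get,

$$\min_{\tau}\ Q = \|\bar{x}\|^2\tau^2 + \left[2y(\bar{w} \cdot \bar{x}) - \beta\right]\tau \quad \text{subject to: } 0 \le \tau \le 1.$$

Analogously to Equation (18) the solution of the problem is given by,

$$\tau = G\left(-\frac{y(\bar{w} \cdot \bar{x}) - \frac{1}{2}\beta}{\|\bar{x}\|^2}\right).$$

    Initialize: Fix $\eta > 0$.
        (version 1) Set $M^1_{r,i} = \frac{1}{n}$
        (version 2) Set $M^1_{r,i} = \frac{1}{nk}$
    Loop: For $t = 1, 2, \ldots, T$
        Get a new instance $\bar{x}^t \in \mathbb{R}^n$.
        Predict $\hat{y}^t = \arg\max_{r=1}^{k} \{\bar{M}^t_r \cdot \bar{x}^t\}$.
        Get a new label $y^t$.
        Set $E = \{r \ne y^t : \bar{M}^t_r \cdot \bar{x}^t \ge \bar{M}^t_{y^t} \cdot \bar{x}^t\}$.
        If $E \ne \emptyset$ update $M^t$:
            Choose any $\tau^t_1, \ldots, \tau^t_k$ subject to:
                1. $\tau^t_r \le \delta_{r,y^t}$ for $r = 1, \ldots, k$.
                2. $\sum_{r=1}^{k} \tau^t_r = 0$
                3. $\tau^t_r = 0$ for $r \notin E \cup \{y^t\}$.
                4. $\tau^t_{y^t} = 1$.
            (version 1) Define: $Z^t_r = \sum_i M^t_{r,i}\, e^{\eta\tau^t_r x^t_i}$
                        Update: $M^{t+1}_{r,i} \leftarrow \frac{1}{Z^t_r} M^t_{r,i}\, e^{\eta\tau^t_r x^t_i}$
            (version 2) Define: $Z^t = \sum_{r,i} M^t_{r,i}\, e^{\eta\tau^t_r x^t_i}$
                        Update: $M^{t+1}_{r,i} \leftarrow \frac{1}{Z^t} M^t_{r,i}\, e^{\eta\tau^t_r x^t_i}$
    Output: $H(\bar{x}) = \arg\max_r \{\bar{M}^{T+1}_r \cdot \bar{x}\}$.

Figure 5: A family of multiclass multiplicative algorithms.

All the algorithms presented so far can be straightforwardly combined with kernel methods (Vapnik, 1998). Assume that we have determined a matrix $M$ by learning the coefficients $\bar{\tau}^1, \ldots, \bar{\tau}^T$ from a sequence $\{(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)\}$. Formally, the $r$th row of $M$ is,

$$\bar{M}_r = \sum_{t=1}^{T}\tau^t_r\,\bar{x}^t.$$

To use $M$ for classifying new instances we compute the similarity-score of an instance $\bar{x}$ for class $r$ by multiplying $\bar{x}$ with the $r$th row of $M$ and get,

$$\bar{M}_r \cdot \bar{x} = \sum_{t=1}^{T}\tau^t_r\,(\bar{x}^t \cdot \bar{x}). \tag{21}$$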
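Equation (21) shows that the prototypes never need to be stored explicitly: the scores depend on the training instances only through inner products. The sketch below evaluates the scores from stored instances and coefficients, with the inner product passed in as a function. It is an illustration under assumptions rather than the authors' implementation: the names `kernel_scores` and `predict`, the 0-based class indices, and the polynomial kernel used as an example are all hypothetical choices.

```python
def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2):
    """Non-homogeneous polynomial kernel (u.v + 1)^degree; an illustrative choice."""
    return (dot(u, v) + 1.0) ** degree


def kernel_scores(support, taus, x, kernel=dot):
    """Similarity-scores sum_t tau_r^t K(x^t, x) for every class r.

    support -- list of stored instances x^t
    taus    -- list of coefficient vectors tau^t, one entry per class
    kernel  -- function of two instances; `dot` recovers Equation (21)
    """
    k = len(taus[0])
    scores = [0.0] * k
    for x_t, tau_t in zip(support, taus):
        value = kernel(x_t, x)
        for r in range(k):
            scores[r] += tau_t[r] * value
    return scores


def predict(support, taus, x, kernel=dot):
    """Index of the class with the highest similarity-score (0-based)."""
    scores = kernel_scores(support, taus, x, kernel)
    return max(range(len(scores)), key=scores.__getitem__)


support = [[1.0, 0.0], [0.0, 1.0]]
taus = [[1.0, -1.0], [-1.0, 1.0]]
linear = kernel_scores(support, taus, [1.0, 0.0])
poly = kernel_scores(support, taus, [1.0, 0.0], polynomial_kernel)
```

Only rounds with a non-zero coefficient vector need to be stored, which is why the number of updates an algorithm makes determines the size of the resulting kernel classifier.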
As in many additive online algorithms, the value of the similarity-score is a linear combination of inner-products of the form $(\bar{x}^t \cdot \bar{x})$. We therefore can replace the inner-product in Equation (21) (and also in the algorithms outlined in Figure 2 and Figure 3) with a general inner-product kernel $K(\cdot, \cdot)$ that satisfies Mercer's conditions (Vapnik, 1998). We now obtain algorithms that work in a high dimensional space. It is also simple to incorporate voting schemes (Helmbold and Warmuth, 1995, Freund and Schapire, 1999) into the above algorithms.

Before proceeding to multiplicative algorithms, let us summarize the results we have presented so far. We started with the Perceptron algorithm and extended it to multiclass problems. By replacing the specific update of the extended Perceptron algorithm with a relaxed set of linear constraints we obtained a whole family of ultraconservative additive algorithms. We derived a mistake bound that is common to all the algorithms in the family. We then added a constraint on the norm of the coefficients used in each update to obtain MIRA. By incorporating minimal margin requirements into MIRA we get a more robust algorithm. Finally, we closed the circle by analyzing MIRA for binary problems. The result is a Perceptron-like update with a margin dependent learning rate.

6. A Family of Multiplicative Multiclass Algorithms

We now derive a family of ultraconservative multiplicative algorithms for the multiclass setting in an analogous way to the additive family of algorithms. We give the pseudo code for the multiplicative family in Figure 5. Note that two slightly different versions are described. The difference between the versions is due to the different normalization of $M$. In the first version we normalize $M$ after each update such that the norm of each of its rows is 1, while in the second version the vector-norm of $M$ is fixed to 1. The mistake bounds of the two versions are similar as the next theorem shows.

Theorem 11 Let $(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)$ be an input sequence for either the first or the second version of the multiclass algorithm from Figure 5, where $\bar{x}^t \in \mathbb{R}^n$ and $y^t \in \{1, 2, \ldots, k\}$. Assume also that for all $t$, $\|\bar{x}^t\|_\infty \le 1$.
Assume that there is a matrix $M^*$ such that either $\|\bar{M}^*_r\|_1 = 1$ for $r = 1, \ldots, k$ (first version) or $\|M^*\|_1 = 1$ (second version) and that the input sequence is classified correctly with margin $\gamma = \min_t \{\bar{M}^*_{y^t} \cdot \bar{x}^t - \max_{r \ne y^t} \bar{M}^*_r \cdot \bar{x}^t\} > 0$. Then there is some $\eta > 0$ for which the number of mistakes that the algorithm makes is,

$$O\left(k^2\,\frac{\log(n)}{\gamma^2}\right),$$

for the first version, and

$$O\left(\frac{\log(n) + \log(k)}{\gamma^2}\right),$$

for the second version.

To compare the bounds of the two versions we need to examine the value of the minimal margin. The first version normalizes each row separately while the second normalizes the concatenation of the rows to 1. In the first version we therefore have that for all $r$, $\|\bar{M}^*_r\|_1 = 1$ and thus, using our definition of vector-norms we have $\|M^*\|_1 = k$. Thus, if we scale the margin in the second version so that $\|M^*\|_1 = k$, the mistake bound becomes

$$O\left(k^2\,\frac{\log(n) + \log(k)}{\gamma^2}\right),$$
which is larger than the mistake bound of the first version by an additive factor of $k^2\log(k)/\gamma^2$. We prove the theorem for the first version. The proof for the second version is slightly simpler and follows the same line of proof. Since the proofs of both versions are fairly mundane, they are deferred to Appendix A.

7. A Family of Relaxed Maximum Margin Algorithms

In this section we describe and analyze Li and Long's (2002) Relaxed Online Maximum Margin Algorithm (ROMMA) with our ultraconservative framework. The result is a third family of ultraconservative algorithms. We start with a review of the underlying ideas that motivated ROMMA and then present our related family of multiclass algorithms.

    Initialize: Set $M^1 = 0$.
    Loop: For $t = 1, 2, \ldots, T$
        Get a new instance $\bar{x}^t \in \mathbb{R}^n$.
        Predict $\hat{y}^t = \arg\max_{r=1}^{k} \{\bar{M}^t_r \cdot \bar{x}^t\}$.
        Get a new label $y^t$.
        Set $E = \{r \ne y^t : \bar{M}^t_r \cdot \bar{x}^t \ge \bar{M}^t_{y^t} \cdot \bar{x}^t\}$.
        If $E \ne \emptyset$ update $M^t$ (otherwise $M^{t+1} = M^t$):
            Choose any $\tau^t_1, \ldots, \tau^t_k$ which satisfy the constraints:
                1. $\tau^t_r \le \delta_{r,y^t}$ for $r = 1, \ldots, k$.
                2. $\sum_{r=1}^{k} \tau^t_r = 0$
                3. $\tau^t_r = 0$ for $r \notin E \cup \{y^t\}$.
                4. $\tau^t_{y^t} = 1$.
            Set $M^{t+1}$ to be the solution of:
                $\min_M \frac{1}{2}\|M\|_2^2$
                subject to: (1) $\sum_{r=1}^{k} \tau^t_r (\bar{M}_r \cdot \bar{x}^t) \ge 1$
                            (2) $M \cdot M^t \ge \|M^t\|_2^2$          (22)
    Output: $H(\bar{x}) = \arg\max_r \{\bar{M}^{T+1}_r \cdot \bar{x}\}$.

Figure 6: A multiclass version of ROMMA.

ROMMA (Li and Long, 2002) is an elegant online algorithm that employs a hyperplane which is updated after each prediction error, hence denoted $\bar{w}^t \in \mathbb{R}^n$. On round $t$ ROMMA is fed with an instance $\bar{x}^t$ and its prediction is set to $\mathrm{sign}(\bar{w}^t \cdot \bar{x}^t)$. In case of a prediction error, $y^t(\bar{w}^t \cdot \bar{x}^t) < 0$, ROMMA updates the weight vector $\bar{w}^t$ as follows. The new weight vector $\bar{w}^{t+1}$ is chosen such that it is the vector $\bar{w}$ which attains the minimal norm subject to the following two linear constraints. The first constraint, $y^t(\bar{w} \cdot \bar{x}^t) \ge 1$, requires that the prediction of the weight vector after the update, $\bar{w}^{t+1}$, on $\bar{x}^t$ is correct and is at least 1, namely, $y^t(\bar{w}^{t+1} \cdot \bar{x}^t) \ge 1$. The second constraint, $\bar{w} \cdot \bar{w}^t \ge \|\bar{w}^t\|^2$, imposes, rather tacitly, that the new vector $\bar{w}^{t+1}$ classifies accurately the previous examples. Li and Long showed that the half-space $\{\bar{w} : \bar{w} \cdot \bar{w}^t \ge \|\bar{w}^t\|^2\}$ contains the sub-space $\bigcap_{i=1}^{t-1}\{y^i(\bar{w} \cdot \bar{x}^i) \ge 1\}$. Hence, the second constraint can be viewed as an approximation to the set of
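constraints $y^i(\bar{x}^i \cdot \bar{w}) \ge 1$ for $i = 1, \ldots, t-1$. ROMMA is a conservative algorithm: on rounds on which it predicts correctly it does not modify the weight vector and simply sets $\bar{w}^{t+1} = \bar{w}^t$.

We now describe how to construct an ultraconservative family based on ROMMA. As before, the ROMMA-based algorithms maintain a prototype matrix $M$. Given a new instance $\bar{x}^t$, any algorithm in the family sets the predicted label to be the index of the prototype from $M$ which attains the highest similarity-score, $H(\bar{x}^t) = \arg\max_{r=1}^{k}\{\bar{M}_r \cdot \bar{x}^t\}$. The prototype matrix is updated only on rounds on which a prediction error was made. In such cases the new prototype matrix $M^{t+1}$ is set to be the matrix $M$ with minimal vector-norm under the following two linear constraints. First, we require that the new prototype-matrix classifies the instance $\bar{x}^t$ correctly with a margin of at least one, that is, $\bar{M}_{y^t} \cdot \bar{x}^t - \bar{M}_r \cdot \bar{x}^t \ge 1$ for $r \ne y^t$. These $k - 1$ linear constraints replace the first constraint of ROMMA. Second, we want the new prototype-matrix to classify accurately the previous examples, thus, similarly to the second constraint of ROMMA we impose a second linear constraint $M \cdot M^t \ge \|M^t\|^2$, where the vector inner-product between two matrices is as defined in Section 2. The result of the generalized version is a multiclass algorithm which finds a prototype matrix of a minimal norm subject to $k$ linear constraints in total. However, the algorithm is not necessarily ultraconservative and there is no simple solution to this constrained minimization problem. We therefore further approximate the constrained optimization problem by replacing the first $k - 1$ linear constraints $\bar{M}_{y^t} \cdot \bar{x}^t - \bar{M}_r \cdot \bar{x}^t \ge 1$ for $r \ne y^t$, with a single linear constraint as follows. We pick a set of $(k - 1)$ negative coefficients $\tau^t_1, \ldots, \tau^t_k$ (excluding $\tau^t_{y^t}$) which sum to $-1$ and define the linear constraint to be,

$$\sum_{r \ne y^t}(-\tau^t_r)\left(\bar{M}_{y^t} \cdot \bar{x}^t - \bar{M}_r \cdot \bar{x}^t\right) \ge \sum_{r \ne y^t}(-\tau^t_r) \cdot 1 = 1.$$

This constraint is a convex combination of the above $k - 1$ linear constraints. To further simplify the last constraint we also define $\tau^t_{y^t} = 1$ and rewrite the left hand side of the inequality,

$$\sum_{r \ne y^t}(-\tau^t_r)\left(\bar{M}_{y^t} \cdot \bar{x}^t - \bar{M}_r \cdot \bar{x}^t\right) = \sum_{r \ne y^t}(-\tau^t_r)\left(\bar{M}_{y^t} \cdot \bar{x}^t\right) + \sum_{r \ne y^t}\tau^t_r\left(\bar{M}_r \cdot \bar{x}^t\right)$$
$$= \left(\bar{M}_{y^t} \cdot \bar{x}^t\right)\sum_{r \ne y^t}(-\tau^t_r) + \sum_{r \ne y^t}\tau^t_r\left(\bar{M}_r \cdot \bar{x}^t\right)$$
$$= \tau^t_{y^t}\left(\bar{M}_{y^t} \cdot \bar{x}^t\right) + \sum_{r \ne y^t}\tau^t_r\left(\bar{M}_r \cdot \bar{x}^t\right)$$
$$= \sum_{r}\tau^t_r\left(\bar{M}_r \cdot \bar{x}^t\right).$$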
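The rewriting above can be checked numerically on a small instance. The following sketch evaluates both sides of the identity for hand-picked values; the matrix, instance and coefficients are made-up illustrative numbers (not data from the paper), class indices are 0-based, and the function names are hypothetical.

```python
def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def combined_constraint(M, x, y, tau):
    """Left-hand side as a weighted sum of the k-1 margin constraints."""
    return sum(-tau[r] * (dot(M[y], x) - dot(M[r], x))
               for r in range(len(M)) if r != y)


def compact_constraint(M, x, tau):
    """The same quantity written as sum_r tau_r (M_r . x)."""
    return sum(tau[r] * dot(M[r], x) for r in range(len(M)))


# Illustrative values: tau_y = 1 and the remaining coefficients sum to -1.
M = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
x = [1.0, 1.0]
y = 0
tau = [1.0, -0.25, -0.75]

lhs = combined_constraint(M, x, y, tau)
rhs = compact_constraint(M, x, tau)
```

Both expressions agree, and the value is below 1 here, so this particular matrix would violate the single combined constraint and an update would be required.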
Finally, to ensure that the solution yields an ultraconservative update we impose another constraint on the coefficients $\tau^t_r$. We again define the error set, $E = \{r \ne y^t : \bar{M}_r \cdot \bar{x}^t \ge \bar{M}_{y^t} \cdot \bar{x}^t\}$, to be the set of indices of the rows in $M$ which achieve similarity-scores that are higher than the score of the correct label $y^t$. We now set $\tau^t_r$ to be zero for $r \notin E \cup \{y^t\}$. The family of multiclass algorithms based on ROMMA, which we call MC-ROMMA, is described in Figure 6. We now turn to prove a mistake bound for this family by generalizing the proof techniques of Li and Long to the multiclass setting. In order to prove the mistake bound we need a couple of technical lemmas which are given below. The proofs of the lemmas generalize the proof of the original ROMMA algorithm and are deferred to Appendix A. We then prove in Theorem 15 that MC-ROMMA is indeed ultraconservative.
Lemma 12 Let $(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)$ be a separable input sequence for MC-ROMMA, where $\bar{x}^t \in \mathbb{R}^n$ and $y^t \in \{1, 2, \ldots, k\}$. If MC-ROMMA made a prediction error on the $t$th example ($E \ne \emptyset$) then $\sum_{r=1}^{k}\tau^t_r\left(\bar{M}^{t+1}_r \cdot \bar{x}^t\right) = 1$.

Lemma 13 Let $(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)$ be a separable input sequence for MC-ROMMA where $\bar{x}^t \in \mathbb{R}^n$ and $y^t \in \{1, 2, \ldots, k\}$. If MC-ROMMA makes a prediction error on the $t$th example ($E \ne \emptyset$) for $t > 1$ then $M^{t+1} \cdot M^t = \|M^t\|^2$.

We are now ready to state and prove the mistake bound for MC-ROMMA.

Theorem 14 Let $(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)$ be an input sequence for MC-ROMMA where $\bar{x}^t \in \mathbb{R}^n$ and $y^t \in \{1, 2, \ldots, k\}$. Denote by $R = \max_t \|\bar{x}^t\|$. Assume that there is a matrix $M^*$ which classifies the entire sequence correctly with a margin of at least one, $\forall t = 1, \ldots, T,\ \forall r \ne y^t : \bar{M}^*_{y^t} \cdot \bar{x}^t - \bar{M}^*_r \cdot \bar{x}^t \ge 1$. Then, the number of mistakes that MC-ROMMA makes is at most $2R^2\|M^*\|^2$.

Proof First, since $M^*$ separates the data with a unit margin we have that $M^* \cdot M^t \ge \|M^t\|^2$ for $t = 1, \ldots, T$. Second, since $M^{t+1}$ attains the minimal norm in the corresponding optimization problem, we have $\|M^*\| \ge \|M^t\|$ for all $t$. Also, since $M^1 = 0$ we can combine Lemma 12 with the proof of Lemma 13 and get that $M^2 = a^1$, i.e.

$$\bar{M}^2_r = \frac{\tau^1_r\,\bar{x}^1}{\|\bar{x}^1\|^2\left[\sum_s(\tau^1_s)^2\right]}.$$

Computing the vector-norm of $M^2$ we get,

$$\|M^2\|_2^2 = \frac{1}{\|\bar{x}^1\|^2\left[\sum_s(\tau^1_s)^2\right]}.$$

Finally, by applying Lemma 2 and the assumption that $R \ge \|\bar{x}^t\|$ we get,

$$\|M^2\|_2^2 = \frac{1}{\|\bar{x}^1\|^2\sum_s(\tau^1_s)^2} \ge \frac{1}{2R^2}.$$

We show below that for all $t > 1$ whenever a prediction error occurred then $\|M^{t+1}\|^2 \ge \|M^t\|^2 + 1/(2R^2)$. This implies that if MC-ROMMA made $m$ mistakes on the sequence of instances and labels then, $\|M^{T+1}\|^2 \ge \|M^1\|^2 + m/(2R^2) = m/(2R^2)$. Since $\|M^{T+1}\|^2 \le \|M^*\|^2$ we get,

$$m \le 2\|M^*\|^2R^2,$$

which would complete the proof. Therefore, it remains to show that $\|M^{t+1}\|^2 \ge \|M^t\|^2 + 1/(2R^2)$ for any round $t > 1$ on which MC-ROMMA made a prediction error. To bound the growth of the norm $\|M^{t+1}\|$ with respect to the norm of $M^t$ we examine the distance $d(M^t, A^t)$ between the matrix $M^t$ and the hyperplane $A^t = \{M : \sum_r \tau^t_r(\bar{M}_r \cdot \bar{x}^t) = 1\}$ which was defined in the proof of Lemma 13.
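We now use the assumption that the $t$th example was misclassified ($\sum_r \tau^t_r(\bar{M}^t_r \cdot \bar{x}^t) < 0$) and Lemma 2 to get,

$$d(M^t, A^t) = \frac{\left|\sum_r \tau^t_r\left(\bar{M}^t_r \cdot \bar{x}^t\right) - 1\right|}{\|\bar{x}^t\|\sqrt{\sum_s(\tau^t_s)^2}} \ge \frac{1}{\sqrt{2}\,\|\bar{x}^t\|} \ge \frac{1}{\sqrt{2}\,R}. \tag{23}$$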
Also, since the new matrix $M^{t+1}$ is in the set $A^t$, the distance between $M^t$ and $M^{t+1}$ is at least as big as the distance between $M^t$ and $A^t$, that is,

$$d(M^t, M^{t+1}) \ge d(M^t, A^t). \tag{24}$$

Combining Equations (23) and (24) we get,

$$\|M^{t+1} - M^t\|^2 \ge \frac{1}{2R^2}. \tag{25}$$

We now expand the norm $\|M^{t+1}\|^2$,

$$\|M^{t+1}\|^2 = \|(M^{t+1} - M^t) + M^t\|^2$$
$$= \|M^{t+1} - M^t\|^2 + \|M^t\|^2 + 2(M^{t+1} - M^t) \cdot M^t$$
$$= \|M^{t+1} - M^t\|^2 + \|M^t\|^2 + 2\left(M^{t+1} \cdot M^t - \|M^t\|^2\right).$$

Using Lemma 13 we know that $M^{t+1} \cdot M^t - \|M^t\|^2 = 0$ and thus,

$$\|M^{t+1}\|^2 = \|M^{t+1} - M^t\|^2 + \|M^t\|^2. \tag{26}$$

Combining Equations (25) and (26) we get,

$$\|M^{t+1}\|^2 \ge \|M^t\|^2 + \frac{1}{2R^2},$$

which completes the proof.

Finally, we conclude this section by showing that MC-ROMMA is ultraconservative.

Theorem 15 MC-ROMMA is ultraconservative.

Proof We first show that the optimization problem given in Equation (22) can be rewritten as a constrained optimization problem where the unknown variables are grouped into a single vector in $\mathbb{R}^{nk}$. We replace the prototype-matrix $M$ with the vector $(\bar{M}_1, \ldots, \bar{M}_k)$ and the instance $\bar{x}^t$ with the vector $(\tau^t_1\bar{x}^t, \ldots, \tau^t_k\bar{x}^t)$. It is straightforward to verify that the optimization problem of Equation (22) can now be rewritten as,

$$\min\ \|(\bar{M}_1, \ldots, \bar{M}_k)\|^2$$
$$\text{subject to: } (\bar{M}_1, \ldots, \bar{M}_k) \cdot (\tau^t_1\bar{x}^t, \ldots, \tau^t_k\bar{x}^t) \ge 1$$
$$(\bar{M}_1, \ldots, \bar{M}_k) \cdot (\bar{M}^t_1, \ldots, \bar{M}^t_k) \ge \|(\bar{M}^t_1, \ldots, \bar{M}^t_k)\|^2.$$

Applying Lemma 12 and Lemma 13 we get that the optimum of this problem is achieved when the inequalities hold as equalities. The same property holds for the original version of ROMMA. We therefore can use Li and Long's closed form solution and get that the solution is of the form,

$$(\bar{M}^{t+1}_1, \ldots, \bar{M}^{t+1}_k) = c^t(\bar{M}^t_1, \ldots, \bar{M}^t_k) + d^t(\tau^t_1\bar{x}^t, \ldots, \tau^t_k\bar{x}^t),$$

for some values $c^t > 0$ and $d^t$. Going back to the matrix representation we get that the value of the prototype-matrix after the update is,

$$\bar{M}^{t+1}_r = c^t\left(\bar{M}^t_r + \frac{d^t}{c^t}\tau^t_r\bar{x}^t\right). \tag{27}$$
The update given by Equation (27) can be decomposed into two stages. First, similar to the family of additive algorithms of Figure 2 and to MIRA (Figure 3), the algorithm replaces the prototype $\bar{M}^t_r$ with the sum $\bar{M}^t_r + (d^t/c^t)\tau^t_r\bar{x}^t$. Using the third condition of MC-ROMMA (Figure 6) we get that if the label $r$ was not one of the sources for an error then $\tau^t_r = 0$ and therefore this additive stage leaves $\bar{M}^t_r$ unchanged. Therefore the update is ultraconservative. After the additive change to $M^t$, MC-ROMMA scales all the prototypes by a multiplicative factor $c^t$. Although all of the prototypes are modified in this stage, including those which are not in the error set ($r \notin E$), the classification function $H(\bar{x})$ induced by $M$ is not affected by this scaling and thus the update rule can be viewed as ultraconservative.

8. Experiments

In this section we describe and discuss the results of experiments we performed with both synthetic data and natural datasets. The experiments are by no means exhaustive and the main goal of these experiments is to underscore the merits of the various online algorithms discussed in this paper.

    Name          No. of Training   No. of Test   No. of    No. of
                  Examples          Examples      Classes   Attributes
    Chess-Board   10,000            10,000         8          2
    MNIST         60,000            10,000        10        784
    USPS           7,291             2,007        10        256
    Letter        16,000             4,000        26         16

Table 1: Data sets of the learning problems used in the experiments.

Algorithms: We compared the following five algorithms. The first algorithm is a multiclass classifier based on the Perceptron algorithm obtained by training several copies of the Perceptron. Each copy is trained to discriminate one class from the rest of the classes. To classify a new instance we compute the output of each of the trained Perceptrons and predict the label which attains the highest similarity-score. This approach can be viewed as a special case of error correcting output codes (ECOC), used for reducing a multiclass problem to multiple binary problems (Dietterich and Bakiri, 1995, Allwein et al., 2000). The next three algorithms belong to the family of algorithms discussed in Section 4 and whose pseudo-code is given in Figure 2. Each of the three algorithms corresponds to a different update.
All three algorithms replace $\bar{M}_y$ with $\bar{M}_y + \bar{x}$ whenever the prediction is incorrect. In addition each of the algorithms modifies the set of prototypes constituting the error set. Specifically, the first update changes the prototypes in the error set in a uniform manner by adding the vector $-\bar{x}/|E|$ to each prototype and is thus referred to as the uniform update. The second update is more conservative and changes only two of the prototypes on each round: the prototype $\bar{M}_y$ corresponding to the correct label $y$ and the prototype $\bar{M}_r$ which attains the highest similarity-score. This update is therefore referred to as the max update. Last, the third update modifies each prototype from the error set in proportion to the similarity-score it attains (see Section 4 for a formal description) and is abbreviated as the prop update. We ran all the algorithms above in an aggressive fashion: on each round a value of $\beta = 0.01$ was deducted from the similarity-score of the correct label $y$ right before computing the error set and the corresponding update. This modification
of the score forces the algorithms to perform an update even on rounds with no prediction error as long as the margin is smaller than $\beta = 0.01$. The fifth algorithm that we tested is an aggressive version of MIRA with a minimal margin requirement of $\beta = 0.01$. All of the algorithms were used in conjunction with Mercer kernels. The kernels were fixed for each dataset we experimented with and we did not attempt to tune their parameters. Each of the five algorithms was fed with the training set in an online fashion, i.e. example by example, and generated a multiclass classification rule. We then evaluated the algorithms by applying their final set of prototypes to the test data and computed their test error. We repeated these experiments multiple times. (The specific number of repetitions varies between the datasets and is reported below.)

[Figure 7: four bar plots. Vertical axes: Relative Test Error % (left) and Relative No. of Updates % (right). Bar groups: Uniform, Max, Prop, MIRA. Bars within each group: Chess, MNIST, USPS, Letter.]

Figure 7: The relative test error (left) and relative number of updates (right) of four of the algorithms presented in this paper after one epoch (top row) and after three epochs (bottom row).

Data-Sets: We evaluated the algorithms on a synthetic dataset and on three natural datasets: MNIST$^1$, USPS$^2$ and Letter$^3$. The characteristics of the sets are summarized in Table 1. A comprehensive overview of the performance of various algorithms on these sets can be found in a recent paper by Gentile (2001).

1. Available from http://www.research.att.com/~yann/exdb/mnist/index.html
2. Available from ftp.kyb.tuebingen.mpg.de
3. Available from http://www.ics.uci.edu/~mlearn/MLRepository.html
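The synthetic Chess-Board set listed in Table 1 is defined by a simple labeling rule on the unit square (an 8 by 8 grid of squares with label ((i + j) mod 8) + 1), so a set of the same kind can be regenerated. The sketch below is one possible reading of that rule and not the authors' generator: it assumes 0-based row and column indices, clamps points on the upper boundary into the last square, and uses arbitrary random seeds.

```python
import random


def chessboard_label(point, cells=8):
    """Label of a point in [0, 1] x [0, 1] under the ((i + j) mod 8) + 1 rule.

    Assumption: i and j are 0-based indices of the square containing the point.
    """
    u, v = point
    i = min(int(u * cells), cells - 1)  # clamp u == 1.0 into the last square
    j = min(int(v * cells), cells - 1)
    return (i + j) % 8 + 1


def make_chessboard(size, seed=0):
    """Draw `size` uniform points and label them; the seed is illustrative."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(size)]
    return [(p, chessboard_label(p)) for p in points]


train = make_chessboard(10000, seed=0)
test = make_chessboard(10000, seed=1)
```

Because the label depends only on the sum i + j, the classes form diagonal stripes of squares, which no single linear prototype per class can separate in the input space; this is what makes the set a useful test for the kernelized algorithms.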
[Figure 8: two scatter plots. Horizontal axis: Test Error / Perceptron Test Error. Vertical axis: No. Updates / Perceptron No. Updates. Points: Perceptron, Uniform, Max, Prop, MIRA, agg ROMMA, ALMA_2(0.9).]

Figure 8: Summary of the test error and the number of updates for various online algorithms. Please refer to the text for the exact setting used for each of the algorithms.

The synthetic data-set has eight classes. Each instance is a two dimensional vector from $[0, 1] \times [0, 1]$. We used the uniform distribution to randomly draw examples. Each example was associated with a unique label according to the following rule. The domain $[0, 1] \times [0, 1]$ was partitioned into $8 \times 8 = 64$ squares of the same size. Each square was uniquely identified with a row-column index $(i, j)$. The label of all instances from a given square indexed $(i, j)$ was set to be $((i + j) \bmod 8) + 1$. We then generated a training set and a test set, each of size 10,000.

Results: The complete results obtained in the experiments are summarized in Appendix B. The appendix also cites performance results for ROMMA (Li and Long, 2002) and ALMA (Gentile, 2001). A graphical illustration that compares the algorithms described in this paper is given in Figure 7. This figure contains four bar-plots. Each bar in the plots corresponds to a ratio of a performance measure of one of the algorithms and the Perceptron algorithm: the left two plots show the relative test error and the right two plots show the relative number of updates each algorithm performed. Formally, the height of each bar in the left two plots is proportional to $(\varepsilon_a - \varepsilon_p)/\varepsilon_p$ where $\varepsilon_p$ is the test error of the Perceptron algorithm and $\varepsilon_a$ is the test error of one of the other four algorithms (Uniform, Max, Prop and MIRA). Similarly, the height of each bar in the right two plots is proportional to $(u_a - u_p)/u_p$ where $u_p$ ($u_a$) is the number of updates the Perceptron algorithm (one of the four algorithms: Uniform, Max, Prop and MIRA) made.
The top two plots refer to the results after cycling once through the training data and the bottom two plots refer to the results after three cycles through the training data. In each plot there are four groups of bars, one for each of the four multiclass algorithms described in this paper (Uniform, Max, Prop and MIRA). The results for each algorithm consist of four bars corresponding to the four datasets: Chess-Board, MNIST, USPS and Letter (from left to right). From the figure we see that MIRA outperforms the other algorithms described in this paper, but this improved performance has a price in terms of the sparseness of the solution. The test error of the Perceptron is lower than the test error of the rest of the algorithms (Uniform, Max, and Prop) but the Perceptron performs more updates than the three, hence the resulting classifier is less sparse. For instance, for the USPS dataset, the test error of Uniform, Max, and Prop is about 10% higher than the error of the Perceptron while the test error of MIRA is around 20% lower than that of the
Perceptron. The advantage of MIRA over the Perceptron is even more evident in the Letter dataset where MIRA's test error is lower by 50% than the Perceptron's error. After three epochs the test error of the Uniform update becomes only 8% higher than the error of the Perceptron algorithm on three datasets and the Uniform update outperforms the Perceptron on MNIST. Whether one epoch or three, MIRA outperforms all of the algorithms. However, MIRA makes many more updates which results in a larger number of support patterns when kernels are used. The number of support patterns used by MIRA after one epoch is about four times the number used by the Perceptron (two times on the Letter data-set). Uniform, Max and Prop, on the other hand, make about half of the number of updates compared to the Perceptron algorithm. This behaviour does not change after three epochs.

Another perspective of the results on the MNIST data-set is illustrated in Figure 8. The plot on the left hand side corresponds to results obtained after one epoch while the right hand side plot corresponds to results obtained after three epochs. In each of the two plots the x-axis designates the test error of an algorithm divided by the test error of the Perceptron algorithm and the y-axis is the number of updates the algorithm made divided by the number of updates of the Perceptron. Each of the algorithms is thus associated with a coordinate in each plot. By definition, the Perceptron algorithm is the point (1,1). We added to the plots the results obtained by two more algorithms: Li and Long's (2002) ROMMA algorithm and Gentile's (2001) ALMA algorithm. These algorithms were designed for binary classification problems and were adapted for multiclass problems using the one-vs-rest reduction. Li and Long evaluated ROMMA on MNIST using a non-homogeneous polynomial kernel of degree four in an aggressive manner. ALMA was evaluated using a non-homogeneous polynomial kernel of degree six. In the experiments with these algorithms, each input instance was normalized to have a norm of one. The plots appearing in Figure 8 further underscore the tradeoff between accuracy and sparseness. While MIRA exhibits the lowest error rate, with the exception of ROMMA, it is also the algorithm that makes the largest number of updates.
Analogously, the three updates from Figure 2 make far fewer updates at the expense of inferior performance. ROMMA seems to exhibit somewhat poorer performance in terms of the accuracy versus number of updates ratio while ALMA seems to be comparable in terms of that ratio. We would like to note that these performance differences might be attributed to the different pre-processing and different kernels used in our experiments. Nonetheless, all algorithms do exhibit a natural tradeoff between accuracy and sparseness of the solution.

9. Summary

In this paper we described a general framework for deriving ultraconservative algorithms for multiclass categorization problems and analyzed the proposed algorithms in the mistake bound model. We investigated in detail an additive family of online algorithms. The entire family reduces to the Perceptron algorithm in the binary case. In addition, we gave a method for choosing a unique member of the family by imposing a quadratic objective function that minimizes the norm of the prototype matrix after each update. We then gave an analogous family of multiplicative algorithms. A question that remains open is how to impose constraints similar to the one MIRA employs in the multiplicative case. We also described an ultraconservative version of Li and Long's ROMMA algorithm. We believe that the ultraconservative approach to multiclass problems can also be applied to quasi-additive algorithms (Grove et al., 2001) and p-norm algorithms (Gentile, 2001). Another interesting direction for research that generalizes our framework is the design and analysis of algorithms that maintain more than one prototype per class. While this approach is clearly useful
in cases where the distribution of instances from a given class is not concentrated in one direction, it seems rather tricky to generalize the ultraconservative paradigm to the case of multiple prototypes.

We would like to note that this work is part of a general line of research on multiclass learning. Allwein et al. (2000) described and analyzed a general approach for multiclass problems using error correcting output codes (Dietterich and Bakiri, 1995). Building on that work, we (Crammer and Singer, 2000) investigated the problem of designing good output codes for multiclass problems. Although the model of learning using output codes differs substantially from the framework studied in this paper, a few of the techniques presented here build upon other results (Crammer and Singer, 2000). Finally, a few of the techniques used in this paper can also be applied in batch settings to construct Multiclass Support Vector Machines (MSVM). The implementation details on how to efficiently build MSVMs appear elsewhere (Crammer and Singer, 2001).

Acknowledgements

We would like to thank Elisheva Bonchek for carefully reading a draft of the manuscript and Noam Slonim for useful comments. We also would like to thank the anonymous reviewers and the action editor for their constructive comments. Last, we would like to acknowledge the financial support of EU project KerMIT No. IST-2000-25341.

Appendix A. Technical Proofs

Proof of Theorem 4: The case $D = 0$ follows from Theorem 3, thus we can assume that $D > 0$. The theorem is proved by transforming the non-separable setting to a separable one. To do so, we extend each instance $\bar{x}^t \in \mathbb{R}^n$ to $\bar{z}^t \in \mathbb{R}^{n+T}$ as follows. The first $n$ coordinates of $\bar{z}^t$ are set to $\bar{x}^t$. The $(n+t)$th coordinate of $\bar{z}^t$ is set to $\Delta$, which is a positive real number whose value is determined later; the rest of the coordinates of $\bar{z}^t$ are set to zero. We similarly extend the matrix $M^*$ to $W^* \in \mathbb{R}^{k \times (n+T)}$ as follows. We set the first $n$ columns of $W^*$ to be $\frac{1}{Z}M^*$. For each row $r$ we set $W^*_{r,n+t}$ to $\frac{d^t}{Z\Delta}$ if $r = y^t$ and zero otherwise. To summarize, the structure of $W^*$ is,

$$W^* = \frac{1}{Z}\left[\ M^* \ \Big|\ \delta_{r,y^t}\frac{d^t}{\Delta}\ \right].$$

We choose the value of $Z$ so that $\|W^*\|_2 = 1$, hence,

$$1 = \|W^*\|_2^2 = \frac{1}{Z^2}\left(1 + \frac{D^2}{\Delta^2}\right)$$

which gives that,

$$Z = \sqrt{1 + \frac{D^2}{\Delta^2}}.$$
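We now show that $W^*$ achieves a margin of $\frac{\gamma}{Z}$ on the extended data sequence. Note that for all $r$ and $t$,

$$\bar{W}^*_r \cdot \bar{z}^t = \frac{1}{Z}\left(\bar{M}^*_r \cdot \bar{x}^t + \delta_{r,y^t}\,\frac{d^t}{\Delta}\,\Delta\right) = \frac{1}{Z}\left(\bar{M}^*_r \cdot \bar{x}^t + \delta_{r,y^t}\, d^t\right).$$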
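Now, using the definition of $d^t$ we get,

$$\begin{aligned}
\bar{W}^*_{y^t} \cdot \bar{z}^t - \max_{r \neq y^t}\left\{\bar{W}^*_r \cdot \bar{z}^t\right\}
&= \frac{1}{Z}\left(\bar{M}^*_{y^t} \cdot \bar{x}^t + d^t\right) - \max_{r \neq y^t}\left\{\frac{1}{Z}\left(\bar{M}^*_r \cdot \bar{x}^t\right)\right\} \\
&= \frac{1}{Z}\, d^t + \frac{1}{Z}\left[\bar{M}^*_{y^t} \cdot \bar{x}^t - \max_{r \neq y^t}\left\{\bar{M}^*_r \cdot \bar{x}^t\right\}\right] \\
&\geq \frac{1}{Z}\left(\gamma - \left[\bar{M}^*_{y^t} \cdot \bar{x}^t - \max_{r \neq y^t}\left\{\bar{M}^*_r \cdot \bar{x}^t\right\}\right]\right) + \frac{1}{Z}\left[\bar{M}^*_{y^t} \cdot \bar{x}^t - \max_{r \neq y^t}\left\{\bar{M}^*_r \cdot \bar{x}^t\right\}\right] \\
&= \frac{\gamma}{Z}.
\end{aligned} \tag{28}$$

We also have that,

$$\|\bar{z}^t\|^2 = \|\bar{x}^t\|^2 + \Delta^2 \leq R^2 + \Delta^2. \tag{29}$$

In summary, Equation (28) and Equation (29) imply that the sequence $(\bar{z}^1, y^1), \ldots, (\bar{z}^T, y^T)$ is classified correctly with margin $\frac{\gamma}{Z}$ and that the squared norm of each instance $\bar{z}^t$ is bounded above by $R^2 + \Delta^2$. Thus, we can use Theorem 3 and conclude that the number of mistakes that the algorithm makes on $(\bar{z}^1, y^1), \ldots, (\bar{z}^T, y^T)$ is bounded from above by,

$$2\,\frac{R^2 + \Delta^2}{\left(\frac{\gamma}{Z}\right)^2}. \tag{30}$$

Minimizing Equation (30) over $\Delta$ we get that the optimal value for $\Delta$ is $\sqrt{DR}$ and the tightest mistake bound is,

$$2\,\frac{(D + R)^2}{\gamma^2}.$$

To complete the proof we show that the predictions of the algorithm in the extended space and in the original space are equal. Namely, let $M^t$ and $W^t$ be the value of the parameter matrix just before receiving $\bar{x}^t$ and $\bar{z}^t$, respectively. We need to show that the following conditions hold for $t = 1, \ldots, T$:

1. The first $n$ columns of $W^t$ are equal to $M^t$.
2. The $(n+t)$th column of $W^t$ is equal to zero.
3. $\bar{M}^t_r \cdot \bar{x}^t = \bar{W}^t_r \cdot \bar{z}^t$ for $r = 1, \ldots, k$.

The proof of these conditions is straightforward by induction on $t$.

Proof of Theorem 7: Let $M$ be the prototype matrix just before round $t$ and denote by $M'$ the updated matrix after round $t$, that is,

$$\bar{M}'_r = \bar{M}_r + \tau^t_r\, \bar{x}^t \qquad (r = 1, 2, \ldots, k).$$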
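As in Theorem 3, we bound $\|M\|_2^2$ from above and below. First, we develop the lower bound on $\|M\|_2^2$ by bounding the term,

$$\sum_{r=1}^{k} \bar{M}^*_r \cdot \bar{M}'_r = \sum_{r=1}^{k} \bar{M}^*_r \cdot \left(\bar{M}_r + \tau^t_r\, \bar{x}^t\right) = \sum_{r=1}^{k} \bar{M}^*_r \cdot \bar{M}_r + \sum_{r} \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right). \tag{31}$$

We further develop the second term using the second constraint of MIRA. Substituting $\tau^t_{y^t} = -\sum_{r \neq y^t} \tau^t_r$ we get,

$$\begin{aligned}
\sum_{r} \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right)
&= \sum_{r \neq y^t} \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right) + \tau^t_{y^t} \left(\bar{M}^*_{y^t} \cdot \bar{x}^t\right) \\
&= \sum_{r \neq y^t} \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right) - \sum_{r \neq y^t} \tau^t_r \left(\bar{M}^*_{y^t} \cdot \bar{x}^t\right) \\
&= \sum_{r \neq y^t} \left(-\tau^t_r\right) \left(\bar{M}^*_{y^t} - \bar{M}^*_r\right) \cdot \bar{x}^t.
\end{aligned}$$

Using the fact that $M^*$ classifies all the instances with margin $\gamma$ we obtain,

$$\sum_{r} \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right) \geq \sum_{r \neq y^t} \left(-\tau^t_r\right) \gamma = \tau^t_{y^t}\, \gamma. \tag{32}$$

Combining Equation (31) and Equation (32) we get,

$$\sum_{r} \bar{M}^*_r \cdot \bar{M}'_r \geq \sum_{r} \bar{M}^*_r \cdot \bar{M}_r + \tau^t_{y^t}\, \gamma.$$

Thus, after $T$ rounds the matrix $M$ satisfies,

$$\sum_{r} \bar{M}^*_r \cdot \bar{M}_r \geq \gamma \sum_{t} \tau^t_{y^t}. \tag{33}$$

Using the definition of the vector-norm and applying the Cauchy-Schwarz inequality we get,

$$\|M^*\|^2\, \|M\|^2 = \left(\sum_{r=1}^{k} \|\bar{M}^*_r\|^2\right)\left(\sum_{r=1}^{k} \|\bar{M}_r\|^2\right) \geq \left(\bar{M}^*_1 \cdot \bar{M}_1 + \ldots + \bar{M}^*_k \cdot \bar{M}_k\right)^2 = \left(\sum_{r=1}^{k} \bar{M}^*_r \cdot \bar{M}_r\right)^2. \tag{34}$$

Plugging Equation (33) into Equation (34) and using the assumption that $M^*$ is of a unit vector-norm we get the following lower bound,

$$\|M\|^2 \geq \gamma^2 \left(\sum_{t} \tau^t_{y^t}\right)^2. \tag{35}$$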
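Next, we bound the vector-norm of $M$ from above,

$$\begin{aligned}
\|M'\|^2 = \sum_{r} \|\bar{M}'_r\|^2 &= \sum_{r} \|\bar{M}_r + \tau^t_r\, \bar{x}^t\|^2 \\
&= \sum_{r} \|\bar{M}_r\|^2 + 2 \sum_{r} \tau^t_r \left(\bar{M}_r \cdot \bar{x}^t\right) + \sum_{r} \|\tau^t_r\, \bar{x}^t\|^2 \\
&= \|M\|^2 + 2 \sum_{r} \tau^t_r \left(\bar{M}_r \cdot \bar{x}^t\right) + \|\bar{x}^t\|^2 \sum_{r} \left(\tau^t_r\right)^2.
\end{aligned} \tag{36}$$

Using the definition of MIRA (Figure 3) we know that the $\tau^t_r$ are chosen to minimize $\|M'\|^2$. Note that $\bar{\tau}^t = 0$ satisfies the constraints of MIRA and then $M'$ reduces to $M$. Therefore we have that,

$$2 \sum_{r} \tau^t_r \left(\bar{M}_r \cdot \bar{x}^t\right) + \|\bar{x}^t\|^2 \sum_{r} \left(\tau^t_r\right)^2 \leq 0.$$

But $\|\bar{x}^t\|^2 \sum_{r} \left(\tau^t_r\right)^2 \geq 0$ and finally we get,

$$\sum_{r} \tau^t_r \left(\bar{M}_r \cdot \bar{x}^t\right) \leq 0. \tag{37}$$

Plugging Equation (37) into Equation (36) while using the bound $\|\bar{x}^t\|^2 \leq R^2$ and Lemma 2 we obtain,

$$\|M'\|^2 \leq \|M\|^2 + 2 R^2 \left(\tau^t_{y^t}\right)^2 \leq \|M\|^2 + 2 R^2\, \tau^t_{y^t}. \tag{38}$$

Thus, after $T$ rounds the matrix $M$ satisfies,

$$\|M\|^2 \leq 2 R^2 \sum_{t} \tau^t_{y^t}. \tag{39}$$

Combining Equation (35) and Equation (39) we obtain,

$$\gamma^2 \left(\sum_{t} \tau^t_{y^t}\right)^2 \leq \|M\|^2 \leq 2 R^2 \sum_{t} \tau^t_{y^t},$$

and therefore,

$$\sum_{t} \tau^t_{y^t} \leq 2\, \frac{R^2}{\gamma^2}.$$

Using the second constraint of the algorithm we get,

$$\|\bar{\tau}^t\|_1 = \sum_{r} |\tau^t_r| = \sum_{r \neq y^t} \left(-\tau^t_r\right) + \tau^t_{y^t} = 2\, \tau^t_{y^t},$$

and therefore,

$$\sum_{t} \|\bar{\tau}^t\|_1 \leq 4\, \frac{R^2}{\gamma^2}.$$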
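The per-round facts used in this proof can be illustrated numerically. The sketch below assumes that MIRA's round-$t$ problem is to minimize $\sum_r \|\bar{M}_r + \tau_r \bar{x}\|^2$ subject to $\tau_r \leq \delta_{r,y}$ and $\sum_r \tau_r = 0$, as used above. It solves this by bisection on a Lagrange multiplier, which is a solver chosen here for brevity and not necessarily the procedure the paper uses; the name `mira_tau` and the random data are hypothetical.

```python
import numpy as np

def mira_tau(M, x, y, iters=200):
    """Approximate MIRA's tau for one round by bisection on a multiplier theta.

    Minimizes sum_r ||M_r + tau_r x||^2 subject to tau_r <= delta_{r,y}
    and sum_r tau_r = 0. Assumes x is not the zero vector.
    """
    A = float(x @ x)                         # ||x||^2
    B = M @ x                                # similarity scores B_r = M_r . x
    delta = np.zeros(M.shape[0])
    delta[y] = 1.0

    def tau_of(theta):
        # KKT form of the solution: tau_r = min{delta_{r,y}, (theta - B_r) / A}
        return np.minimum(delta, (theta - B) / A)

    # The sum of tau_of(theta) is nondecreasing in theta,
    # negative at lo and positive at hi.
    lo, hi = B.min() - A, B.max() + A
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tau_of(mid).sum() < 0.0:
            lo = mid
        else:
            hi = mid
    return tau_of(hi)

rng = np.random.default_rng(1)
k, n = 5, 8
M = rng.normal(size=(k, n))                  # arbitrary current prototypes
x = rng.normal(size=n)
y = 2
tau = mira_tau(M, x, y)
M_new = M + np.outer(tau, x)                 # additive update M_r + tau_r x

print("sum of tau:", tau.sum())
print("norm before/after:", np.linalg.norm(M), np.linalg.norm(M_new))
print("||tau||_1 and 2*tau_y:", np.abs(tau).sum(), 2 * tau[y])
```

Up to numerical tolerance, the printed quantities reflect Equation (37) (the updated norm does not exceed the previous one) and the identity $\|\bar{\tau}^t\|_1 = 2\tau^t_{y^t}$ used in the last step of the proof.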
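Proof of Theorem 11: Let $\Phi^t = \sum_{r=1}^{k} D_{\mathrm{KL}}\left(\bar{M}^*_r \,\|\, \bar{M}^t_r\right)$, and define $\Delta^t = \Phi^{t+1} - \Phi^t$. Note that these definitions imply that,

$$\begin{aligned}
\Delta^t = \Phi^{t+1} - \Phi^t
&= \sum_{r} \sum_{i} \left[M^*_{r,i} \log\left(\frac{M^*_{r,i}}{M^{t+1}_{r,i}}\right)\right] - \sum_{r} \sum_{i} \left[M^*_{r,i} \log\left(\frac{M^*_{r,i}}{M^{t}_{r,i}}\right)\right] \\
&= \sum_{r} \sum_{i} \left[M^*_{r,i} \log\left(\frac{M^{t}_{r,i}}{M^{t+1}_{r,i}}\right)\right].
\end{aligned}$$

Recall that if no error was made on the $t$th example ($\hat{y}^t = y^t$) then $\bar{\tau}^t = 0$, $M^{t+1} = M^t$ and $\Delta^t = 0$. We therefore further develop the expression for $\Delta^t$ for the case when there was a prediction error on round $t$. Substituting the multiplicative update $M^{t+1}_{r,i} = M^t_{r,i}\, e^{\eta \tau^t_r x^t_i} / Z^t_r$ we get,

$$\begin{aligned}
\Delta^t &= \sum_{r} \sum_{i} \left[M^*_{r,i} \log\left(\frac{Z^t_r}{e^{\eta \tau^t_r x^t_i}}\right)\right] \\
&= \sum_{r} \left[\log\left(Z^t_r\right) \sum_{i} M^*_{r,i} - \eta\, \tau^t_r \sum_{i} M^*_{r,i}\, x^t_i\right] \\
&= \sum_{r} \left[\log\left(Z^t_r\right) \|\bar{M}^*_r\|_1 - \eta\, \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right)\right] \\
&= \sum_{r} \log\left(Z^t_r\right) \|\bar{M}^*_r\|_1 - \eta \sum_{r} \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right).
\end{aligned}$$

Using the assumption $\|\bar{M}^*_r\|_1 = 1$ for all $r = 1, \ldots, k$ we get,

$$\Delta^t = \sum_{r} \log\left(Z^t_r\right) - \eta \sum_{r} \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right). \tag{40}$$

Let us now further develop both terms of the expression above. For the right term we use the second constraint of the algorithm and substitute $\tau^t_{y^t} = -\sum_{r \neq y^t} \tau^t_r$ to get that,

$$\sum_{r} \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right) = \sum_{r \neq y^t} \left(-\tau^t_r\right) \left(\bar{M}^*_{y^t} - \bar{M}^*_r\right) \cdot \bar{x}^t.$$

Using the assumption that $M^*$ classifies all the instances with margin $\gamma$ and the fourth constraint of the algorithm ($\tau^t_{y^t} = 1$) we obtain,

$$\sum_{r} \tau^t_r \left(\bar{M}^*_r \cdot \bar{x}^t\right) \geq \sum_{r \neq y^t} \left(-\tau^t_r\right) \gamma = \gamma\, \tau^t_{y^t} = \gamma. \tag{41}$$

To bound the left term we use the inequality:

$$\forall \eta > 0,\ x \in [-1, 1]: \qquad e^{\eta x} \leq \frac{1 + x}{2}\, e^{\eta} + \frac{1 - x}{2}\, e^{-\eta}.$$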
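The inequality above follows from the convexity of the exponential, since $\eta x$ is a convex combination of $\eta$ and $-\eta$ with weights $\frac{1+x}{2}$ and $\frac{1-x}{2}$. As a quick sanity check, the sketch below evaluates the gap between the two sides on a grid; the grid ranges and the helper name `convex_bound_gap` are arbitrary choices made for illustration.

```python
import numpy as np

def convex_bound_gap(eta, x):
    """Right-hand side minus left-hand side of
    e^{eta x} <= (1 + x)/2 * e^{eta} + (1 - x)/2 * e^{-eta}."""
    upper = (1 + x) / 2 * np.exp(eta) + (1 - x) / 2 * np.exp(-eta)
    return upper - np.exp(eta * x)

etas = np.linspace(0.01, 5.0, 200)           # eta > 0
xs = np.linspace(-1.0, 1.0, 201)             # x in [-1, 1]
gap = convex_bound_gap(etas[:, None], xs[None, :])
min_gap = gap.min()
print("smallest gap on the grid:", min_gap)
```

The gap vanishes at the endpoints $x = \pm 1$, where the bound holds with equality, and is positive in between.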
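Since $|\tau^t_r| \leq 1$ and $\|\bar{x}^t\|_\infty \leq 1$ then $|\tau^t_r\, x^t_i| \leq 1$ and thus,

$$\begin{aligned}
Z^t_r = \sum_{i} M^t_{r,i}\, e^{\eta \tau^t_r x^t_i}
&\leq \sum_{i} M^t_{r,i} \left[\frac{1 + \tau^t_r x^t_i}{2}\, e^{\eta} + \frac{1 - \tau^t_r x^t_i}{2}\, e^{-\eta}\right] \\
&= \sum_{i} M^t_{r,i}\, \frac{e^{\eta} + e^{-\eta}}{2} + \sum_{i} M^t_{r,i}\, \frac{e^{\eta} - e^{-\eta}}{2}\, \tau^t_r x^t_i \\
&= \frac{e^{\eta} + e^{-\eta}}{2} \sum_{i} M^t_{r,i} + \frac{e^{\eta} - e^{-\eta}}{2}\, \tau^t_r \left(\bar{M}^t_r \cdot \bar{x}^t\right) \\
&= \frac{e^{\eta} + e^{-\eta}}{2}\, \|\bar{M}^t_r\|_1 + \frac{e^{\eta} - e^{-\eta}}{2} \left(-\tau^t_r\right) \left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t + \frac{e^{\eta} - e^{-\eta}}{2}\, \tau^t_r \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right).
\end{aligned}$$

Note that $\|\bar{M}^t_r\|_1 = 1$ since the algorithm normalizes the rows of the matrix on every step. We assumed that there is an error in classifying $\bar{x}^t$ and, as in the additive family of algorithms, we need to consider two cases. The first case is when the label $r$ was not the source of the error, that is $\left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t > 0$. Then by using the third constraint of the algorithm we get that $\tau^t_r = 0$ and thus $\left(-\tau^t_r\right)\left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t = 0$. In the second case, if the label $r$ was a possible source of error, then $\left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t \leq 0$. Using the first constraint of the algorithm we know that $\tau^t_r \leq 0$ and thus $\left(-\tau^t_r\right)\left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t \leq 0$. Since $\eta > 0$ we have that $\frac{1}{2}\left(e^{\eta} - e^{-\eta}\right) > 0$ and therefore we get,

$$Z^t_r \leq \frac{e^{\eta} + e^{-\eta}}{2} + \frac{e^{\eta} - e^{-\eta}}{2}\, \tau^t_r \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right). \tag{42}$$

Taking the log of Equation (42) we get,

$$\begin{aligned}
\log\left(Z^t_r\right)
&\leq \log\left[\frac{e^{\eta} + e^{-\eta}}{2} + \frac{e^{\eta} - e^{-\eta}}{2}\, \tau^t_r \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right)\right] \\
&= \log\left[\frac{e^{\eta} + e^{-\eta}}{2} \left(1 + \frac{e^{\eta} - e^{-\eta}}{e^{\eta} + e^{-\eta}}\, \tau^t_r \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right)\right)\right] \\
&= \log\left(\frac{e^{\eta} + e^{-\eta}}{2}\right) + \log\left[1 + \frac{e^{\eta} - e^{-\eta}}{e^{\eta} + e^{-\eta}}\, \tau^t_r \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right)\right].
\end{aligned}$$

We use the fact that $\log(x)$ is concave and therefore $\log(1 + x) \leq x$ for $x \geq -1$. Since $|\tau^t_r| \leq 1$, $\|\bar{M}^t_{y^t}\|_1 = 1$, $\|\bar{x}^t\|_\infty \leq 1$ and $\frac{e^{\eta} - e^{-\eta}}{e^{\eta} + e^{-\eta}} \leq 1$, we conclude that,

$$\log\left(Z^t_r\right) \leq \log\left(\frac{e^{\eta} + e^{-\eta}}{2}\right) + \frac{e^{\eta} - e^{-\eta}}{e^{\eta} + e^{-\eta}}\, \tau^t_r \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right). \tag{43}$$

Plugging Equations (41) and (43) into Equation (40) we get that if there is an error on the $t$th instance then

$$\begin{aligned}
\Delta^t &\leq \sum_{r} \left[\log\left(\frac{e^{\eta} + e^{-\eta}}{2}\right) + \frac{e^{\eta} - e^{-\eta}}{e^{\eta} + e^{-\eta}}\, \tau^t_r \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right)\right] - \eta \gamma \\
&= k \log\left(\frac{e^{\eta} + e^{-\eta}}{2}\right) + \frac{e^{\eta} - e^{-\eta}}{e^{\eta} + e^{-\eta}} \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right) \sum_{r} \tau^t_r - \eta \gamma.
\end{aligned}$$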
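Using the second constraint of the algorithm ($\sum_{r} \tau^t_r = 0$) we obtain,

$$\Delta^t \leq k \log\left(\frac{e^{\eta} + e^{-\eta}}{2}\right) - \eta \gamma.$$

Therefore, if the algorithm makes $m$ mistakes on the sequence $(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)$ then

$$\sum_{t=1}^{T} \Delta^t \leq m \left[k \log\left(\frac{e^{\eta} + e^{-\eta}}{2}\right) - \eta \gamma\right]. \tag{44}$$

On the other hand,

$$\sum_{t=1}^{T} \Delta^t = \sum_{t=1}^{T} \left(\Phi^{t+1} - \Phi^{t}\right) = \Phi^{T+1} - \Phi^{1} \geq -\Phi^{1} \geq -k \log(n). \tag{45}$$

Combining Equations (44) and (45) we obtain,

$$m \left[k \log\left(\frac{e^{\eta} + e^{-\eta}}{2}\right) - \eta \gamma\right] \geq -k \log(n).$$

Solving for $m$ we get,

$$m \leq \frac{\log(n)}{\eta\, \frac{\gamma}{k} + \log\left(\frac{2}{e^{\eta} + e^{-\eta}}\right)}.$$

Minimizing over $\eta$ we obtain the required bound,

$$O\left(k^2\, \frac{\log(n)}{\gamma^2}\right).$$

Proof of Lemma 12: Note that the claim implies that the first inequality constraint of MC-ROMMA's optimization problem is satisfied with equality after the update. Assume, by contradiction, that this is not the case. That is, after an update we get,

$$\sum_{r} \tau^t_r \left(\bar{M}^{t+1}_r \cdot \bar{x}^t\right) > 1. \tag{46}$$

We now show that there exists a matrix $M'$ which satisfies the constraints of the optimization problem, but achieves a norm which is smaller than the norm of $M^{t+1}$. This yields a contradiction to the assumption that $M^{t+1}$ is the optimal solution.

Since $\bar{x}^t$ was misclassified we need to consider the following two cases for each label $r$. The first case is when the label $r$ was not the source of the error, that is $\left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t > 0$. Then, using the third constraint ($r \notin E^t \cup \{y^t\} \Rightarrow \tau^t_r = 0$) we get that $\tau^t_r = 0$ and thus $\left(-\tau^t_r\right)\left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t = 0$. The second case is when one of the sources of error was the label $r$, i.e. $\left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t \leq 0$. From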
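the first constraint of the algorithm we know that $\tau^t_r \leq 0$ and thus $\left(-\tau^t_r\right)\left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t \leq 0$. Finally, summing over all $r$ we get,

$$\sum_{r \neq y^t} \left(-\tau^t_r\right) \left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t \leq 0. \tag{47}$$

We further develop the left hand-side of the above inequality using the second constraint of the algorithm ($\sum_{r} \tau^t_r = 0$) and get,

$$\begin{aligned}
\sum_{r \neq y^t} \left(-\tau^t_r\right) \left(\bar{M}^t_{y^t} - \bar{M}^t_r\right) \cdot \bar{x}^t
&= \sum_{r \neq y^t} \tau^t_r \left(\bar{M}^t_r \cdot \bar{x}^t\right) - \sum_{r \neq y^t} \tau^t_r \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right) \\
&= \sum_{r \neq y^t} \tau^t_r \left(\bar{M}^t_r \cdot \bar{x}^t\right) + \tau^t_{y^t} \left(\bar{M}^t_{y^t} \cdot \bar{x}^t\right) \\
&= \sum_{r} \tau^t_r \left(\bar{M}^t_r \cdot \bar{x}^t\right).
\end{aligned} \tag{48}$$

Combining Equations (47) and (48) we get,

$$\sum_{r} \tau^t_r \left(\bar{M}^t_r \cdot \bar{x}^t\right) \leq 0. \tag{49}$$

From Equations (46) and (49) we get that there exist $\alpha \in (0, 1)$ and $M' = \alpha M^t + (1 - \alpha) M^{t+1}$ such that $M'$ satisfies the first constraint of the algorithm with equality, i.e. $\sum_{r} \tau^t_r \left(\bar{M}'_r \cdot \bar{x}^t\right) = 1$. Using the definition of $M'$ and the convexity of the squared $L_2$ norm we get that,

$$\|M'\|^2 \leq \alpha \|M^t\|^2 + (1 - \alpha) \|M^{t+1}\|^2. \tag{50}$$

Note that $M^t$ is the optimal solution of the quadratic optimization problem if we omit the first inequality constraint given in Equation (22). In addition, $M^t$ does not satisfy that first constraint, therefore $\|M^t\|^2 < \|M^{t+1}\|^2$. Plugging this inequality into Equation (50) we get,

$$\|M'\|^2 < \|M^{t+1}\|^2.$$

Since both $M^t$ and $M^{t+1}$ satisfy the second inequality constraint of Equation (22) and $M'$ is a convex combination of $M^t$ and $M^{t+1}$, then $M'$ also satisfies the second constraint. Therefore, $M'$ is a feasible point and thus we get a contradiction.

Proof of Lemma 13: Let $A$ denote the set of all matrices which satisfy the first constraint with equality, that is,

$$A = \left\{ M : \sum_{r} \tau^t_r \left(\bar{M}_r \cdot \bar{x}^t\right) = 1 \right\}.$$

From Lemma 12 we know that $M^{t+1} \in A$. Define

$$\bar{a}_r = \frac{\tau^t_r\, \bar{x}^t}{\|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right]},$$

and let,

$$a = \begin{pmatrix} \bar{a}_1 \\ \vdots \\ \bar{a}_k \end{pmatrix}$$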
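be the matrix whose $r$th row is $\bar{a}_r$. It is straightforward to verify that $a \in A$. We now show that it attains the minimal vector-norm among all of the matrices in $A$. From the definitions above the norm of $a$ is,

$$a \cdot a = \sum_{r} \left\| \frac{\tau^t_r\, \bar{x}^t}{\|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right]} \right\|^2 = \frac{\|\bar{x}^t\|^2 \left[\sum_{r} \left(\tau^t_r\right)^2\right]}{\|\bar{x}^t\|^4 \left[\sum_{s} \left(\tau^t_s\right)^2\right]^2} = \frac{1}{\|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right]}.$$

Also note that for every $M \in A$ we have,

$$\begin{aligned}
M \cdot a = \sum_{r} \bar{M}_r \cdot \bar{a}_r
&= \sum_{r} \bar{M}_r \cdot \frac{\tau^t_r\, \bar{x}^t}{\|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right]} \\
&= \frac{1}{\|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right]} \sum_{r} \tau^t_r \left(\bar{M}_r \cdot \bar{x}^t\right) \\
&= \frac{1}{\|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right]},
\end{aligned}$$

where for the last equation we used the fact that $M \in A$. Combining the last two equalities we get that for all $M \in A$,

$$\begin{aligned}
\|M\|^2 = \|(M - a) + a\|^2
&= \|M - a\|^2 + \|a\|^2 + 2 \left(M \cdot a - a \cdot a\right) \\
&= \|M - a\|^2 + \|a\|^2 + 2 \left(\frac{1}{\|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right]} - \frac{1}{\|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right]}\right) \\
&= \|M - a\|^2 + \|a\|^2.
\end{aligned} \tag{51}$$

Since the second term on the right-hand side of Equation (51), $\|a\|^2$, is constant, the norm of $M$ is minimized when the first term, $\|M - a\|^2$, equals zero, that is $M = a$. However, $M^{t+1} \in A$ and it attains the minimal norm. We therefore get $M^{t+1} = a$.

We now assume by contradiction that the second inequality constraint of the optimization problem does not hold with equality for $M^{t+1}$, that is $\sum_{r} \bar{M}^{t+1}_r \cdot \bar{M}^t_r > \|M^t\|^2$. Plugging the value of $M^{t+1} = a$ into the inequality we get,

$$\sum_{r} \frac{\tau^t_r\, \bar{x}^t}{\|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right]} \cdot \bar{M}^t_r > \|M^t\|^2.$$

Rearranging the terms we finally get,

$$\sum_{r} \tau^t_r \left(\bar{x}^t \cdot \bar{M}^t_r\right) > \|M^t\|^2\, \|\bar{x}^t\|^2 \left[\sum_{s} \left(\tau^t_s\right)^2\right].$$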
However, $M^t \neq 0$ (since $t > 1$), $\bar{x}^t \neq 0$ (since the input sequence is separable) and $\sum_{s} \left(\tau^t_s\right)^2 > 0$ (since $E^t \neq \emptyset$), therefore,

$$\sum_{r} \tau^t_r \left(\bar{x}^t \cdot \bar{M}^t_r\right) > 0,$$

which is a contradiction to the assumption that there was a prediction error on round $t$.

Appendix B. Summary of Experimental Results

The results of the experiments are summarized in Tables 2 through 5. Each table contains results for a different dataset. The datasets are Chess-Board, MNIST, USPS and Letter. Each column gives results after a single pass through the training set. Each row in the tables corresponds to a specific algorithm. The top row in each pair of rows corresponds to the test error while the bottom row gives the cumulative number of updates each algorithm made.

Some of the tables also contain results for ALMA (Gentile, 2001) and ROMMA (Li and Long, 2002). Both algorithms used the one-vs-rest reduction of multiclass to binary. ROMMA was trained using a non-homogeneous polynomial kernel of degree four and the data was normalized to have an $\ell_\infty$ norm of 1. See (Li and Long, 2002) for further details. ALMA was designed and analyzed by Gentile (2001). ALMA was trained using different kernels than in this paper. On the MNIST data-set it was trained using a non-homogeneous polynomial kernel of degree six and the data was normalized to have an $\ell_\infty$ norm of 1. On the USPS data-set it was trained using a Gaussian kernel with a standard deviation of 3.5 and on the Letter data-set it was trained using a poly-Gaussian kernel. Further details are provided by Gentile (2001). For prediction we used the last set of prototypes each algorithm outputs after cycling through the training set. However, Gentile (2001) reports that better results can be obtained by combining ALMA with a voting technique (Freund and Schapire, 1999). In the tables below we report results that were obtained without any voting or averaging techniques.

| Algorithm | Row | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 |
|---|---|---|---|---|---|---|
| Perceptron | test error | 5.6 | 4.9 | 4.7 | 4.7 | 4.6 |
| | support patterns | 1891 | 2029 | 2050 | 2059 | 2062 |
| Uniform | test error | 6.3 | 5.1 | 4.7 | 4.7 | 4.7 |
| | support patterns | 1745 | 1933 | 1966 | 1971 | 1973 |
| Max | test error | 6.1 | 5.4 | 5.2 | 5.1 | 5.1 |
| | support patterns | 1758 | 1912 | 1936 | 1944 | 1947 |
| Prop | test error | 6.2 | 5.3 | 5.2 | 5.1 | 5.1 |
| | support patterns | 1723 | 1900 | 1927 | 1934 | 1938 |
| MIRA | test error | 4.3 | 4.0 | 3.9 | 4.0 | 4.0 |
| | support patterns | 7229 | 7259 | 7260 | 7261 | 7261 |

Table 2: Experimental results for Chess-Board data. The test error (top) and number of support patterns (bottom) for five multiclass online algorithms after j = 1,...,10 epochs of training on 10,000 examples.
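The tradeoff between accuracy and sparseness can be read off Table 2 with a little arithmetic. The sketch below copies the epoch-5 figures from the table and expresses each algorithm's test error and number of support patterns relative to the Perceptron; the dictionary layout and variable names are illustrative choices only.

```python
# Epoch-5 figures copied from Table 2: (test error, number of support patterns).
table2_epoch5 = {
    "Perceptron": (4.6, 2062),
    "Uniform": (4.7, 1973),
    "Max": (5.1, 1947),
    "Prop": (5.1, 1938),
    "MIRA": (4.0, 7261),
}

base_err, base_sv = table2_epoch5["Perceptron"]

# Ratios relative to the Perceptron row: (error ratio, support-pattern ratio).
relative = {
    name: (err / base_err, sv / base_sv)
    for name, (err, sv) in table2_epoch5.items()
}

for name, (err_ratio, sv_ratio) in relative.items():
    print(f"{name:10s} error x{err_ratio:.2f}  support patterns x{sv_ratio:.2f}")
```

On these figures MIRA attains the lowest test error of the five algorithms while keeping more than three times as many support patterns as the Perceptron, whereas the three updates from Figure 2 keep slightly fewer support patterns at a somewhat higher error.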
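| Algorithm | Row | Epoch 1 | Epoch 2 | Epoch 3 | Kernel |
|---|---|---|---|---|---|
| Perceptron | test error | 1.83 | 1.58 | 1.68 | Homogeneous Polynomial |
| | support patterns | 5299 | 6633 | 7112 | |
| agg-ROMMA | test error | 2.05 | 1.76 | 1.67 | Non-Homogeneous Polynomial |
| | support patterns | 30088 | 44495 | 58583 | |
| ALMA$_2$(0.9) | test error | 1.84 | 1.53 | 1.45 | Non-Homogeneous Polynomial |
| | support patterns | 11652 | 13712 | 14598 | |
| Uniform | test error | 2.31 | 1.89 | 1.62 | Homogeneous Polynomial |
| | support patterns | 2726 | 3271 | 3458 | |
| Max | test error | 2.61 | 2.13 | 1.89 | Homogeneous Polynomial |
| | support patterns | 2823 | 3423 | 3605 | |
| Prop | test error | 2.46 | 2.04 | 1.85 | Homogeneous Polynomial |
| | support patterns | 3050 | 3722 | 3957 | |
| MIRA | test error | 1.45 | 1.37 | 1.36 | Homogeneous Polynomial |
| | support patterns | 20162 | 23878 | 26176 | |

Table 3: Experimental results for the MNIST data-set. The test error (top) and number of support patterns (bottom) for five multiclass online algorithms after j = 1,...,3 epochs.

| Algorithm | Row | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 | Kernel |
|---|---|---|---|---|---|---|---|
| Perceptron | test error | 5.93 | 5.63 | 4.98 | 4.78 | 4.83 | Homogeneous Polynomial |
| | support patterns | 936 | 1167 | 1240 | 1266 | 1281 | |
| ALMA$_2$(0.95) | test error | 5.72 | 5.05 | 4.85 | | | Gaussian |
| | support patterns | 1752 | 2087 | 2239 | | | |
| ALMA$_2$(0.9) | test error | 5.43 | 5.06 | 4.90 | | | Gaussian |
| | support patterns | 2251 | 2606 | 2746 | | | |
| Uniform | test error | 6.73 | 5.53 | 5.38 | 5.48 | 5.43 | Homogeneous Polynomial |
| | support patterns | 492 | 578 | 603 | 614 | 621 | |
| Max | test error | 6.08 | 6.38 | 5.48 | 5.38 | 5.38 | Homogeneous Polynomial |
| | support patterns | 527 | 607 | 639 | 645 | 647 | |
| Prop | test error | 6.63 | 5.98 | 5.73 | 5.58 | 5.43 | Homogeneous Polynomial |
| | support patterns | 494 | 575 | 600 | 612 | 615 | |
| MIRA | test error | 4.78 | 4.68 | 4.63 | 4.63 | 4.58 | Homogeneous Polynomial |
| | support patterns | 3242 | 3864 | 4250 | 4517 | 4726 | |

Table 4: Experimental results for the USPS data-set. The test error (top) and number of support patterns (bottom) for five multiclass online algorithms after j = 1,...,5 epochs.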
| Algorithm | Row | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 | Kernel |
|---|---|---|---|---|---|---|---|
| Perceptron | test error | 7.45 | 5.13 | 4.60 | 4.32 | 3.95 | Gaussian |
| | support patterns | 4215 | 5635 | 6469 | 7023 | 7359 | |
| ALMA$_2$(0.8) | test error | 4.20 | 3.55 | 3.27 | | | Poly-Gaussian |
| | support patterns | 11258 | 13003 | 13673 | | | |
| Uniform | test error | 7.07 | 5.40 | 4.90 | 4.88 | 4.28 | Gaussian |
| | support patterns | 2202 | 2754 | 3057 | 3293 | 3432 | |
| Max | test error | 7.40 | 6.08 | 4.63 | 4.73 | 4.73 | Gaussian |
| | support patterns | 2334 | 2951 | 3313 | 3510 | 3635 | |
| Prop | test error | 8.00 | 7.03 | 4.98 | 4.83 | 4.45 | Gaussian |
| | support patterns | 2205 | 2784 | 3117 | 3336 | 3475 | |
| MIRA | test error | 3.68 | 3.08 | 2.70 | 2.50 | 2.38 | Gaussian |
| | support patterns | 8184 | 11964 | 14929 | 17453 | 19701 | |

Table 5: Experimental results for the Letter data-set. The test error (top) and number of support patterns (bottom) for five multiclass online algorithms after j = 1,...,5 epochs.

References

E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.

J. K. Anlauf and M. Biehl. The adatron: an adaptive perceptron algorithm. Europhysics Letters, 10(7):687–692, Dec 1989.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks, 1984.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, September 1995.

K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, January 1995.

R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.
T. Friess, N. Cristianini, and C. Campbell. The kernel-adatron: A fast and simple learning procedure for support vector machines. In Machine Learning: Proceedings of the Fifteenth International Conference, 1998.

C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.

A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3):173–210, 2001.

D. P. Helmbold and M. K. Warmuth. On weak learning. Journal of Computer and System Sciences, 50:551–573, 1995.

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1–64, January 1997.

Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1–3):361–387, 2002.

N. Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.

C. Mesterharm. A multi-class linear learning algorithm related to winnow. In Advances in Neural Information Processing Systems 13, 1999.

N. J. Nilsson. Learning Machines: Foundations of trainable pattern classification systems. McGraw-Hill, New York, 1965.

J. C. Platt. Fast training of Support Vector Machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing (MIT Press, 1988).)

V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.