Learning Permutations with Exponential Weights


Journal of Machine Learning Research 10 (2009) Submitted 9/08; Published 7/09

Learning Permutations with Exponential Weights

David P. Helmbold
Manfred K. Warmuth
Computer Science Department
University of California, Santa Cruz
Santa Cruz, CA

Editor: Yoav Freund

Abstract

We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala's more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert.

Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing

1. Introduction

Finding a good permutation is a key aspect of many problems such as the ranking of search results or matching workers to tasks. In this paper we present an efficient and effective on-line algorithm for learning permutations in a model related to the on-line allocation model of learning with experts (Freund and Schapire, 1997). In each trial, the algorithm probabilistically chooses a permutation and then incurs a linear loss based on how appropriate the permutation was for that trial. The regret is the total expected loss of the algorithm on the whole sequence of trials minus the total loss of the best permutation chosen in hindsight for the whole sequence, and the goal is to find algorithms that have provably small worst-case regret.

For example, one could consider a commuter airline which owns n airplanes of various sizes and flies n routes.(1) Each day the airline must match airplanes to routes. If too small an airplane is assigned to a route then the airline will lose revenue and reputation due to unserved potential passengers. On the other hand, if too large an airplane is used on a long route then the airline could have larger than necessary fuel costs. If the number of passengers wanting each flight were known ahead of time, then choosing an assignment is a weighted matching problem.

An earlier version of this paper appears in Proceedings of the Twentieth Annual Conference on Computational Learning Theory (COLT 2007), published by Springer in the LNAI series. Manfred K. Warmuth acknowledges the support of NSF grant IIS.
(1) We assume that each route starts and ends at the airline's home airport.

(c) 2009 David P. Helmbold and Manfred K. Warmuth.

In the on-line allocation model, the airline first chooses a distribution over possible assignments of airplanes to routes and then randomly selects an assignment from the distribution. The regret of the airline is the earnings of the single best assignment for the whole sequence of passenger requests minus the total expected earnings of the on-line assignments. When airplanes and routes are each numbered from 1 to n, then an assignment is equivalent to selecting a permutation. The randomness helps protect the on-line algorithm from adversaries and allows one to prove good bounds on the algorithm's regret for arbitrary sequences of requests.

Since there are n! permutations on n elements, it is infeasible to simply treat each permutation as an expert and apply one of the expert algorithms that uses exponential weights. Previous work has exploited the combinatorial structure of other large sets of experts to create efficient algorithms (see Helmbold and Schapire, 1997; Takimoto and Warmuth, 2003; Warmuth and Kuzmin, 2008, for examples). Our solution is to make a simplifying assumption on the loss function which allows the new algorithm, called PermELearn, to maintain a sufficient amount of information about the distribution over n! permutations while using only n^2 weights.

We represent a permutation of n elements as an n x n permutation matrix Π where Π_{i,j} = 1 if the permutation maps element i to position j and Π_{i,j} = 0 otherwise. As the algorithm randomly selects a permutation Π̂ at the beginning of a trial, an adversary simultaneously selects an arbitrary loss matrix L ∈ [0,1]^{n×n} which specifies the loss of all permutations for the trial. Each entry L_{i,j} of the loss matrix gives the loss for mapping element i to position j, and the loss of any whole permutation is the sum of the losses of the permutation's mappings, that is, the loss of permutation Π is Σ_i L_{i,Π(i)} = Σ_{i,j} Π_{i,j} L_{i,j}. Note that the per-trial expected losses can be as large as n, as opposed to the common assumption for the expert setting that the losses are bounded in [0,1]. In Section 3 we show how a variety of intuitive loss motifs can be expressed in this matrix form.

This assumption that the loss has a linear matrix form ensures the expected loss of the algorithm can be expressed as Σ_{i,j} W_{i,j} L_{i,j}, where W = E(Π̂). This expectation W is an n x n weight matrix which is doubly stochastic, that is, it has non-negative entries and the property that every row and column sums to 1. The algorithm's uncertainty about which permutation is the target is summarized by W; each weight W_{i,j} is the probability that the algorithm predicts with a permutation mapping element i to position j. It is worth emphasizing that the W matrix is only a summary of the distribution over permutations used by any algorithm (it doesn't indicate which permutations have non-zero probability, for example). However, this summary is sufficient to determine the algorithm's expected loss when the losses of permutations have the assumed loss matrix form.

Our PermELearn algorithm stores the weight matrix W and must convert W into an efficiently sampled distribution over permutations in order to make predictions. By Birkhoff's Theorem, every doubly stochastic matrix can be expressed as the convex combination of at most n^2 - 2n + 2 permutations (see, e.g., Bhatia, 1997). In Appendix A we show that a greedy matching-based algorithm efficiently decomposes any doubly stochastic matrix into a convex combination of at most n^2 - 2n + 2 permutations. Although the efficacy of this algorithm is implied by standard dimensionality arguments, we give a new combinatorial proof that provides independent insight as to why the algorithm finds a convex combination matching Birkhoff's bound.
Our algorithm for learning permutations predicts with a random Π̂ sampled from the convex combination of permutations created by decomposing weight matrix W. It has been applied recently for pricing combinatorial markets when the outcomes are permutations of objects (Chen et al., 2008). The PermELearn algorithm updates the entries of its weight matrix using exponential factors commonly used for updating the weights of experts in on-line learning algorithms (Littlestone and Warmuth, 1994; Vovk, 1990; Freund and Schapire, 1997): each entry W_{i,j} is multiplied by a factor e^{-η L_{i,j}}.

Here η is a positive learning rate that controls the strength of the update (when η = 0, all the factors are one and the update is vacuous). After this update, the weight matrix no longer has the doubly stochastic property, and the weight matrix must be projected back into the space of doubly stochastic matrices (called Sinkhorn balancing, see Section 4) before the next prediction can be made. In Theorem 4 we bound the expected loss of PermELearn over any sequence of trials by

    (n ln n + η L_best) / (1 - e^{-η}),        (1)

where n is the number of elements being permuted, η is the learning rate, and L_best is the loss of the best permutation on the entire sequence. If an upper bound L_est ≥ L_best is known, then η can be tuned (as in Freund and Schapire, 1997) and the expected loss bound becomes

    L_best + sqrt(2 L_est n ln n) + n ln n,        (2)

giving a bound of sqrt(2 L_est n ln n) + n ln n on the worst-case expected regret of the tuned PermELearn algorithm. We also prove a matching lower bound (Theorem 6) of Ω(sqrt(L_best n ln n)) for the expected regret of any algorithm solving our permutation learning problem.

A simpler and more efficient algorithm than PermELearn maintains the sum of the loss matrices on the previous trials. Each trial it adds random perturbations to the cumulative loss matrix and then predicts with the permutation having minimum perturbed loss. This Follow the Perturbed Leader algorithm (Kalai and Vempala, 2005) has good regret bounds for many on-line learning settings. However, the regret bound we can obtain for it in the permutation setting is about a factor of n worse than the bound for PermELearn and the lower bound.

Although computationally expensive, one can also consider running the Hedge algorithm while explicitly representing each of the n! permutations as an expert. If T is the sum of the loss matrices over the past trials and F is the n x n matrix with entries F_{i,j} = e^{-η T_{i,j}}, then the weight of each permutation expert Π is proportional to the product ∏_i F_{i,Π(i)} and the normalization constant is the permanent of the matrix F. Calculating the permanent is a known #P-complete problem and sampling from this distribution over permutations is very inefficient (Jerrum et al., 2004). Moreover, since the loss range of a permutation is [0,n], the standard loss bound for the algorithm that uses one expert per permutation must be scaled up by a factor of n, becoming

    L_best + sqrt(2 L_est n ln(n!)) + n ln(n!)  ≤  L_best + sqrt(2 L_est n^2 ln n) + n^2 ln n.

This expected loss bound is similar to our expected loss bound for PermELearn in Equation (2), except that the n ln n terms are replaced by n^2 ln n. Our method based on Sinkhorn balancing bypasses the estimation of permanents and somehow PermELearn's implicit representation and prediction method exploit the structure of permutations and let us obtain the improved bound. We also give a matching lower bound that shows PermELearn has the optimum regret bound (up to a small constant factor). It is an interesting open question whether the structure of permutations can be exploited to prove bounds like (2) for the Hedge algorithm with one expert per permutation.

PermELearn's weight updates belong to the Exponentiated Gradient family of updates (Kivinen and Warmuth, 1997) since the components L_{i,j} of the loss matrix that appear in the exponential factor are the derivatives of our linear loss with respect to the weights W_{i,j}.

This family of updates usually maintains a probability vector as its weight vector. In that case the normalization of the weight vector is straightforward and is folded directly into the update formula. Our new algorithm PermELearn for learning permutations maintains a doubly stochastic matrix with n^2 weights. The normalization alternately normalizes the rows and columns of the matrix until convergence (Sinkhorn balancing). This may require an unbounded number of steps and the resulting matrix does not have a closed form. Despite this fact, we are able to prove bounds for our algorithm. We first show that our update minimizes a tradeoff between the loss and a relative entropy between doubly stochastic matrices. This relative entropy becomes our measure of progress in the analysis. Luckily, the un-normalized multiplicative update already makes enough progress (towards the best permutation) to achieve the loss bound quoted above. Finally, we interpret the iterations of Sinkhorn balancing as Bregman projections with respect to the same relative entropy and show using the properties of Bregman projections that these projections can only increase the progress and thus don't hurt the analysis (Herbster and Warmuth, 2001).

Our new insight of splitting the update into an un-normalized step followed by a normalization step also leads to a streamlined proof of the loss bound for the Hedge algorithm in the standard expert setting that is interesting in its own right. Since the loss in the allocation setting is linear, the bounds can be proven in many different ways, including potential based methods (see, e.g., Kivinen and Warmuth, 1999; Gordon, 2006; Cesa-Bianchi and Lugosi, 2006). For the sake of completeness we reprove our main loss bound for PermELearn using potential based methods in Appendix B. We show how potential based proof methods can be extended to handle linear equality constraints that don't have a solution in closed form, paralleling a related extension to linear inequality constraints in Kuzmin and Warmuth (2007). In this appendix we also discuss the relationship between the projection and potential based proof methods. In particular, we show how the Bregman projection step corresponds to plugging in suboptimal dual variables into the potential.

The remainder of the paper is organized as follows. We introduce our notation in the next section. Section 3 presents the permutation learning model and gives several intuitive examples of appropriate loss motifs. Section 4 gives the PermELearn algorithm and discusses its computational requirements. One part of the algorithm is to decompose the current doubly stochastic matrix into a small convex combination of permutations using a greedy algorithm. The bound on the number of permutations needed to decompose the weight matrix is deferred to Appendix A. We then bound PermELearn's regret in Section 5 in a two-step analysis that uses a relative entropy as a measure of progress. To exemplify the new techniques, we also analyze the basic Hedge algorithm with the same methodology. The regret bounds for Hedge and PermELearn are re-proven in Appendix B using potential based methods. In Section 6, we apply the Follow the Perturbed Leader algorithm to learning permutations and show that the resulting regret bounds are not as good. In Section 7 we prove a lower bound on the regret when learning permutations that is within a small constant factor of our regret bound on the tuned PermELearn algorithm. The concluding section describes extensions and directions for further work.

2. Notation

All matrices will be n x n matrices.
When A is a matrix, A_{i,j} denotes the entry of A in row i and column j. We use A • B to denote the dot product between matrices A and B, that is, Σ_{i,j} A_{i,j} B_{i,j}. We use single superscripts (e.g., A^k) to identify matrices/permutations from a sequence.

Permutations on n elements are frequently represented in two ways: as a bijective mapping of the elements {1,...,n} into the positions {1,...,n} or as a permutation matrix which is an n x n binary matrix with exactly one 1 in each row and each column. We use the notation Π (and Π̂) to represent a permutation in either format, using the context to indicate the appropriate representation. Thus, for each i ∈ {1,...,n}, we use Π(i) to denote the position that the i-th element is mapped to by permutation Π, and matrix element Π_{i,j} = 1 if Π(i) = j and 0 otherwise. If L is a matrix with n rows then the product ΠL permutes the rows of L. For example, writing the permutation (2,4,3,1) as a matrix and an arbitrary matrix L by its rows l_1,...,l_4:

    Π = ( 0 1 0 0 )        L = ( l_1 )        ΠL = ( l_2 )
        ( 0 0 0 1 )            ( l_2 )             ( l_4 )
        ( 0 0 1 0 )            ( l_3 )             ( l_3 )
        ( 1 0 0 0 )            ( l_4 )             ( l_1 )

    perm. (2,4,3,1) as matrix    an arbitrary matrix    permuting the rows

Convex combinations of permutations create doubly stochastic or balanced matrices: nonnegative matrices whose n rows and n columns each sum to one. Our algorithm maintains its uncertainty about which permutation is best as a doubly stochastic weight matrix W and needs to randomly select a permutation from some distribution whose expectation is W. By Birkhoff's Theorem (see, e.g., Bhatia, 1997), for every doubly stochastic matrix W there is a decomposition into a convex combination of at most n^2 - 2n + 2 permutation matrices. We show in Appendix A how a decomposition of this size can be found effectively. This decomposition gives a distribution over permutations whose expectation is W that now can be effectively sampled because its support is at most n^2 - 2n + 2 permutations.

3. On-line Protocol

We are interested in learning permutations in a model related to the on-line allocation model of learning with experts (Freund and Schapire, 1997). In that model there are N experts and at the beginning of each trial the algorithm allocates a probability distribution w over the experts. The algorithm picks expert i with probability w_i and then receives a loss vector l ∈ [0,1]^N. Each expert i incurs loss l_i and the expected loss of the algorithm is w · l. Finally, the algorithm updates its distribution w for the next trial.

In case of permutations we could have one expert per permutation and allocate a distribution over the n! permutations. Explicitly tracking this distribution is computationally expensive, even for moderate n. As discussed in the introduction, we assume that the losses in each trial can be specified by a loss matrix L ∈ [0,1]^{n×n} where the loss of each permutation Π has the linear form Σ_i L_{i,Π(i)} = Π • L. If the algorithm's prediction Π̂ is chosen probabilistically in each trial then the algorithm's expected loss is E[Π̂ • L] = W • L, where W = E[Π̂]. This expected prediction W is an n x n doubly stochastic matrix and algorithms for learning permutations under the linear loss assumption can be viewed as implicitly maintaining such a doubly stochastic weight matrix.

More precisely, the on-line algorithm follows the following protocol in each trial:

- The learner (probabilistically) chooses a permutation Π̂, and let W = E(Π̂).
- Nature simultaneously chooses a loss matrix L ∈ [0,1]^{n×n} for the trial.
- At the end of the trial, the algorithm is given L. The loss of Π̂ is Π̂ • L and the expected loss of the algorithm is W • L.

- Finally, the algorithm updates its distribution over permutations for the next trial, implicitly updating matrix W.

Although our algorithm can handle arbitrary sequences of loss matrices L ∈ [0,1]^{n×n}, nature could be significantly more restricted. Many ranking applications have an associated loss motif M and nature is constrained to choose (row) permutations of M as its loss matrix L. In effect, at each trial nature chooses a correct permutation Π and uses the loss matrix L = ΠM. Note that the permutation left-multiplies the loss motif, and thus permutes the rows of M. If nature chooses the identity permutation then the loss matrix L is the motif M itself. When M is known to the algorithm, it suffices to give the algorithm only the permutation Π at the end of the trial, rather than the loss matrix L itself.

Figure 1 gives examples of loss motifs. The last loss in Figure 1 is related to a competitive List Update Problem where an algorithm services requests to a list of n items. In the List Update Problem the cost of a request is the requested item's current position in the list. After each request, the requested item can be moved forward in the list for free, and additional rearrangement can be done at a cost of one per transposition. The goal is for the algorithm to be cost-competitive with the best static ordering of the elements in hindsight. Note that the transposition cost for additional list rearrangement is not represented in the permutation loss motif. Blum et al. (2003) give very efficient algorithms for the List Update Problem that do not do additional rearranging of the list (and thus do not incur the cost neglected by the loss motif). In our notation, their bound has the same form as ours (1) but with the n ln n factors replaced by O(n). However, our lower bound (see Section 7) shows that the n ln n factors in (2) are necessary in the general permutation setting.

Note that many compositions of loss motifs are possible. For example, given two motifs with their associated losses, any convex combination of the motifs creates a new motif for the same convex combination of the associated losses. Other component-wise combinations of two motifs (such as product or max) can also produce interesting loss motifs, but the combination usually cannot be distributed across the matrix dot-product calculation, and so cannot be expressed as a simple linear function of the original losses.

4. PermELearn Algorithm

Our permutation learning algorithm uses exponential weights and we call it PermELearn. It maintains an n x n doubly stochastic weight matrix W as its main data structure, where W_{i,j} is the probability that PermELearn predicts with a permutation mapping element i to position j. In the absence of prior information it is natural to start with uniform weights, that is, the matrix with 1/n in each entry.

In each trial PermELearn does two things:

1. Choose a permutation Π̂ from some distribution such that E[Π̂] = W.

2. Create a new doubly stochastic matrix W̃ for use in the next trial based on the current weight matrix W and loss matrix L.
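To make the loss-motif construction above concrete, here is a small numpy sketch of our own (the helper names and the specific motif are chosen for illustration, not taken from the paper): it builds the motif whose loss counts misplaced elements, forms L = ΠM for a correct permutation Π, and evaluates Π̂ • L and W • L.

import numpy as np

def perm_matrix(perm):
    """Permutation (perm[i] = position of element i, 0-indexed) as a 0/1 matrix."""
    n = len(perm)
    P = np.zeros((n, n))
    P[np.arange(n), perm] = 1.0
    return P

n = 4
M = 1.0 - np.eye(n)                 # motif: loss 1 whenever element i is NOT mapped to position i
Pi = perm_matrix([1, 3, 2, 0])      # nature's correct permutation (2,4,3,1), written 0-indexed
L = Pi @ M                          # loss matrix: row-permuted motif

Pi_hat = perm_matrix([1, 3, 0, 2])  # the algorithm's prediction
print(np.sum(Pi_hat * L))           # Π̂ • L = number of elements where the permutations disagree (2.0)

W = np.full((n, n), 1.0 / n)        # uniform doubly stochastic weight matrix
print(np.sum(W * L))                # expected loss W • L (= 3.0 for this motif)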

Figure 1 (loss motifs): each loss L(Π̂, Π) below has an associated motif matrix M (the small example matrices shown in the original figure are not reproduced here):

- the number of elements i where Π̂(i) ≠ Π(i)
- (1/(n-1)) Σ_{i=1}^n |Π(i) - Π̂(i)|, how far the elements are from their correct positions (the division by n-1 ensures that the entries of M are in [0,1])
- (1/(n-1)) Σ_{i=1}^n |Π(i) - Π̂(i)| / Π(i), a position-weighted version of the above emphasizing the early positions in Π
- the number of elements mapped to the first half by Π but the second half by Π̂, or vice versa
- the number of elements mapped to the first two positions by Π that fail to appear in the top three positions of Π̂
- the number of links traversed to find the first element of Π in a list ordered by Π̂

Choosing a permutation is done by Algorithm 1. The algorithm greedily decomposes W into a convex combination of at most n^2 - 2n + 2 permutations (see Theorem 7), and then randomly selects one of these permutations for the prediction.(2) Our decomposition algorithm uses a temporary matrix A initialized to the weight matrix W. Each iteration of Algorithm 1 finds a permutation Π where each A_{i,Π(i)} > 0. This can be done by finding a perfect matching on the n x n bipartite graph containing the edge (i,j) whenever A_{i,j} > 0. We shall soon see that each matrix A is a constant times a doubly stochastic matrix, so the existence of a suitable permutation Π follows from Birkhoff's Theorem. Given such a permutation Π, the algorithm updates A to A - αΠ where α = min_i A_{i,Π(i)}. The updated matrix A has non-negative entries and has strictly more zeros than the original A.

(2) The decomposition is usually not unique and the implementation may have a bias as to exactly which convex combination is chosen.

Algorithm 1  PermELearn: Selecting a permutation
  Require: a doubly stochastic n x n matrix W
  A := W; q := 0
  repeat
    q := q + 1
    Find permutation Π^q such that A_{i,Π^q(i)} is positive for each i ∈ {1,...,n}
    α_q := min_i A_{i,Π^q(i)}
    A := A - α_q Π^q
  until all entries of A are zero      {at end of loop W = Σ_{k=1}^q α_k Π^k}
  Randomly select and return a Π̂ ∈ {Π^1,...,Π^q} using probabilities α_1,...,α_q.

Algorithm 2  PermELearn: Weight Matrix Update
  Require: learning rate η, loss matrix L, and doubly stochastic weight matrix W
  Create W' where each W'_{i,j} = W_{i,j} e^{-η L_{i,j}}        (3)
  Create doubly stochastic W̃ by re-balancing the rows and columns of W' (Sinkhorn balancing) and update W to W̃.

Since the update decreases each row and column sum by α and the original matrix W was doubly stochastic, each matrix A will have rows and columns that sum to the same amount. In other words, each matrix A created during Algorithm 1 is a constant times a doubly stochastic matrix, and thus (by Birkhoff's Theorem) is a constant times a convex combination of permutations. After at most n^2 - n iterations the algorithm arrives at a matrix A having exactly n non-zero entries, so this A is a constant times a permutation matrix. Therefore, Algorithm 1 decomposes the original doubly stochastic matrix into the convex combination of (at most) n^2 - n + 1 permutation matrices. The more refined argument in Appendix A shows that Algorithm 1 never uses more than n^2 - 2n + 2 permutations, matching the bound given by Birkhoff's Theorem.

Several improvements are possible. In particular, we need not compute each perfect matching from scratch. If only z entries of A are zeroed by a permutation, then that permutation is still a matching of size n - z in the graph for the updated matrix. Thus we need to find only z augmenting paths to complete the perfect matching. The entire process thus requires finding O(n^2) augmenting paths at a cost of O(n^2) each, for a total cost of O(n^4) to decompose weight matrix W into a convex combination of permutations.

4.1 Updating the Weights

In the second step, Algorithm 2 updates the weight matrix by multiplying each W_{i,j} entry by the factor e^{-η L_{i,j}}. These factors destroy the row and column normalization, so the matrix must be re-balanced to restore the doubly stochastic property. There is no closed form for the normalization step. The standard iterative re-balancing method for non-negative matrices is called Sinkhorn balancing. This method first normalizes each row of the matrix to sum to one, and then normalizes the columns. Since normalizing the columns typically destroys the row normalization, the process must be iterated until convergence (Sinkhorn, 1964).
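The following Python sketch (ours, not the authors' implementation; it omits the incremental-matching speedup described above and uses a simple augmenting-path matching) mirrors Algorithm 1: repeatedly find a perfect matching on the positive entries of A and subtract off the largest feasible multiple of that permutation.

def find_matching(A, tol=1e-12):
    """Return perm with A[i][perm[i]] > tol for all i, or None if no perfect matching exists."""
    n = len(A)
    match_col = [-1] * n                    # match_col[j] = row currently matched to column j

    def try_row(i, seen):
        for j in range(n):
            if A[i][j] > tol and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            return None
    perm = [0] * n
    for j, i in enumerate(match_col):
        perm[i] = j
    return perm

def birkhoff_decompose(W, tol=1e-12):
    """Greedily decompose a doubly stochastic matrix W (list of lists) into (alpha, perm) pieces."""
    A = [list(row) for row in W]            # work on a copy
    pieces = []
    while max(max(row) for row in A) > tol:
        perm = find_matching(A, tol)
        alpha = min(A[i][perm[i]] for i in range(len(A)))
        for i in range(len(A)):
            A[i][perm[i]] -= alpha          # strictly more zeros after each subtraction
        pieces.append((alpha, perm))
    return pieces

print(birkhoff_decompose([[0.5, 0.5], [0.5, 0.5]]))   # two permutation pieces, each with weight 0.5

Sampling a prediction Π̂ then amounts to choosing one of the returned permutations with probability equal to its coefficient.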

    ( 1/2  1/2 )                            ( sqrt(2)/(1+sqrt(2))   1/(1+sqrt(2))       )
    ( 1/4  1/2 )   --Sinkhorn balancing-->  ( 1/(1+sqrt(2))         sqrt(2)/(1+sqrt(2)) )

Figure 2: Example where Sinkhorn balancing requires infinitely many steps.

Normalizing the rows corresponds to pre-multiplying by a diagonal matrix. The product of these diagonal matrices thus represents the combined effect of the multiple row normalization steps. Similarly, the combined effect of the column normalization steps can be represented by post-multiplying the matrix by a diagonal matrix. Therefore we get the well known fact that Sinkhorn balancing a matrix A results in a doubly stochastic matrix RAC where R and C are diagonal matrices. Each entry R_{i,i} is the positive multiplier applied to row i, and each entry C_{j,j} is the positive multiplier of column j needed to convert A into a doubly stochastic matrix. In Figure 2 we give a rational matrix that balances to an irrational matrix. Since each row and column balancing step creates rationals, Sinkhorn balancing produces irrationals only in the limit (after infinitely many steps).

Multiplying a weight matrix from the left and/or right by non-negative diagonal matrices (e.g., row or column normalization) preserves the ratio of product weights between permutations. That is, if A' = RAC, then for any two permutations Π_1 and Π_2,

    ∏_i A'_{i,Π_1(i)} / ∏_i A'_{i,Π_2(i)} = (∏_i A_{i,Π_1(i)} R_{i,i} C_{Π_1(i),Π_1(i)}) / (∏_i A_{i,Π_2(i)} R_{i,i} C_{Π_2(i),Π_2(i)}) = ∏_i A_{i,Π_1(i)} / ∏_i A_{i,Π_2(i)}.

Therefore the matrix of Figure 2 must balance to a doubly stochastic matrix

    ( a    1-a )
    ( 1-a  a   )

such that the ratio of the product weight between the two permutations (1,2) and (2,1) is preserved. This means (1/2 · 1/2)/(1/2 · 1/4) = a^2/(1-a)^2 and thus a = sqrt(2)/(1+sqrt(2)) ≈ 0.586.

This example leads to another important observation: PermELearn's predictions are different than Hedge's when each permutation is treated as an expert. If each permutation is explicitly represented as an expert, then the Hedge algorithm predicts permutation Π with probability proportional to the product weight ∏_i e^{-η Σ_t L^t_{i,Π(i)}}. However, algorithm PermELearn predicts differently. With the weight matrix in Figure 2, Hedge puts probability 2/3 on permutation (1,2) and probability 1/3 on permutation (2,1) while PermELearn puts probability sqrt(2)/(1+sqrt(2)) ≈ 0.586 on permutation (1,2) and probability 1/(1+sqrt(2)) ≈ 0.414 on permutation (2,1).

There has been much written on the balancing of matrices, and we briefly describe only a few of the results here. Sinkhorn showed that this procedure converges and that the RAC balancing of any matrix A into a doubly stochastic matrix is unique (up to canceling multiples of R and C) if it exists(3) (Sinkhorn, 1964).

A number of authors consider balancing a matrix A so that the row and column sums are 1 ± ε. Franklin and Lorenz (1989) show that O(length(A)/ε) Sinkhorn iterations suffice, where length(A) is the bit-length of matrix A's binary representation. Kalantari and Khachiyan (1996) show that O(n^4 ln(n/ε) ln(1/min_{i,j} A_{i,j})) operations suffice using an interior point method.

(3) Some non-negative matrices cannot be converted into doubly stochastic matrices because of their pattern of zeros. The weight matrices we deal with have strictly positive entries, and thus can always be made doubly stochastic with an RAC balancing.
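For completeness, a minimal Sinkhorn balancing sketch of our own (numpy-based; the stopping tolerance eps and iteration cap are our choices). Run on the 2 x 2 matrix discussed above, the iterates approach the balanced matrix with a ≈ 0.586.

import numpy as np

def sinkhorn_balance(A, eps=1e-9, max_iters=100000):
    """Alternately normalize rows and columns until every row/column sum is within eps of 1."""
    A = np.array(A, dtype=float)
    for _ in range(max_iters):
        A = A / A.sum(axis=1, keepdims=True)   # row normalization (pre-multiply by a diagonal matrix)
        A = A / A.sum(axis=0, keepdims=True)   # column normalization (post-multiply by a diagonal matrix)
        if (np.abs(A.sum(axis=1) - 1) < eps).all() and (np.abs(A.sum(axis=0) - 1) < eps).all():
            return A
    return A

print(sinkhorn_balance([[0.5, 0.5], [0.25, 0.5]]))
# approaches [[0.5858, 0.4142], [0.4142, 0.5858]], i.e. a = sqrt(2)/(1+sqrt(2))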

Linial et al. (2000) give a preprocessing step after which only O((n/ε)^2) Sinkhorn iterations suffice. They also present a strongly polynomial time iterative procedure requiring Õ(n^7 log(1/ε)) iterations. Balakrishnan et al. (2004) give an interior point method with complexity O(n^6 log(n/ε)). Finally, Fürer (2004) shows that if the row and column sums of A are 1 ± ε then every matrix entry changes by at most ±nε when A is balanced to a doubly stochastic matrix.

4.2 Dealing with Approximate Balancing

With slight modifications, Algorithm PermELearn can handle the situation where its weight matrix is imperfectly balanced (and thus not quite doubly stochastic). As before, let W be the fully balanced doubly stochastic weight matrix, but we now assume that only an approximately balanced Ŵ is available to predict from. In particular, we assume that each row and column of Ŵ sums to 1 ± ε for some ε < 1/3. Let s ≥ 1 - ε be the smallest row or column sum in Ŵ.

We modify Algorithm 1 in two ways. First, A is initialized to (1/s)Ŵ rather than W. This ensures every row and column in the initial A sums to at least one, to at most 1 + 3ε, and at least one row or column sums to exactly 1. Second, the loop exits as soon as A has an all-zero row or column. Since the smallest row or column sum starts at 1, is decreased by α_k each iteration k, and ends at zero, we have that Σ_{k=1}^q α_k = 1 and the modified Algorithm 1 still outputs a convex combination of permutations C = Σ_{k=1}^q α_k Π^k. Furthermore, each entry C_{i,j} ≤ (1/s)Ŵ_{i,j}. We now bound the additional loss of this modified algorithm.

Lemma 1 If the weight matrix Ŵ is approximately balanced so each row and column sum is in 1 ± ε (for ε ≤ 1/3) then the modified Algorithm 1 has an expected loss C • L at most 3n^3 ε greater than the expected loss W • L of the original algorithm that uses the completely balanced doubly stochastic matrix W.

Proof Let s be the smallest row or column sum in Ŵ. Since each row and column sum of (1/s)Ŵ lies in [1, 1 + 3ε], each entry of (1/s)Ŵ is close to the corresponding entry of the fully balanced W. In particular each (1/s)Ŵ_{i,j} ≤ W_{i,j} + 3nε (Fürer, 2004). This allows us to bound the expected loss when predicting with the convex combination C in terms of the expected loss using a decomposition of the perfectly balanced W:

    C • L ≤ (1/s)Ŵ • L = Σ_{i,j} (Ŵ_{i,j}/s) L_{i,j} ≤ Σ_{i,j} (W_{i,j} + 3nε) L_{i,j} ≤ W • L + 3n^3 ε.

Therefore the extra loss incurred by using an ε-approximately balanced weight matrix at a particular trial is at most 3n^3 ε, as desired.
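A sketch of the Section 4.2 modification (ours; it reuses the find_matching helper from the decomposition sketch above): scale the approximately balanced Ŵ by 1/s and stop the greedy loop once some row or column is exhausted, so the coefficients sum to (roughly) one.

def decompose_approx(W_hat, tol=1e-12):
    """Greedy decomposition of an approximately balanced non-negative matrix (Section 4.2 variant)."""
    n = len(W_hat)
    row_sums = [sum(row) for row in W_hat]
    col_sums = [sum(W_hat[i][j] for i in range(n)) for j in range(n)]
    s = min(row_sums + col_sums)                    # smallest row or column sum
    A = [[w / s for w in row] for row in W_hat]     # now every row/column sums to at least 1
    pieces = []
    while min(min(map(sum, A)), min(sum(A[i][j] for i in range(n)) for j in range(n))) > tol:
        perm = find_matching(A, tol)                # perfect matching on the positive entries
        if perm is None:                            # defensive exit for numerical edge cases
            break
        alpha = min(A[i][perm[i]] for i in range(n))
        for i in range(n):
            A[i][perm[i]] -= alpha
        pieces.append((alpha, perm))
    return pieces                                   # the alphas sum to (approximately) one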

If in a sequence of T trials the matrices Ŵ are ε = 1/(3T n^3) balanced (so that each row and column sum is 1 ± 1/(3T n^3)) then Lemma 1 implies that the total additional expected loss for using approximate balancing is at most 1. The algorithm of Balakrishnan et al. (2004) ε-balances a matrix in O(n^6 log(n/ε)) time (note that this dominates the time for the loss update and constructing the convex combination). This balancing algorithm with ε = 1/(3T n^3) together with the modified prediction algorithm gives a method requiring O(T n^6 log(T n)) total time over the T trials and having a bound of sqrt(2 L_est n ln n) + n ln n + 1 on the worst-case regret.

If the number of trials T is not known in advance then setting ε as a function of t can be helpful. A natural choice is ε_t = 1/(3 t^2 n^3). In this case the total extra regret for not having perfect balancing is bounded by Σ_{t=1}^T 1/t^2 ≤ 5/3 and the total computation time over the T trials is still bounded by O(T n^6 log(T n)).

One might be concerned about the effects of approximate balancing propagating between trials. However this is not an issue. In the following section we show that the loss updates and balancing can be arbitrarily interleaved. Therefore the modified algorithm can either keep a cumulative loss matrix L^{≤t} = Σ_{t'=1}^t L^{t'} and create its next Ŵ by (approximately) balancing the matrix with entries (1/n) e^{-η L^{≤t}_{i,j}}, or apply the multiplicative updates to the previous approximately balanced Ŵ.

5. Bounds for PermELearn

Our analysis of PermELearn follows the entropy-based analysis of the exponentiated gradient family of algorithms (Kivinen and Warmuth, 1997). This style of analysis first shows a per-trial progress bound using relative entropy to a comparator as a measure of progress, and then sums this invariant over the trials to bound the expected total loss of the algorithm. We also show that PermELearn's weight update belongs to the exponentiated gradient family of updates (Kivinen and Warmuth, 1997) since it is the solution to a minimization problem that trades off the loss (in this case a linear loss) against a relative entropy regularization.

Recall that the expected loss of PermELearn on a trial is a linear function of its weight matrix W. Therefore the gradient of the loss is independent of the current value of W. This property of the loss greatly simplifies the analysis. Our analysis for this setting provides a good foundation for learning permutation matrices and lays the groundwork for the future study of other permutation loss functions.

We start our analysis with an attempt to mimic the standard analysis (Kivinen and Warmuth, 1997) for the exponentiated gradient family updates which multiply by exponential factors and renormalize. The per-trial invariant used to analyze the exponentiated gradient family bounds the decrease in relative entropy from any (normalized) vector u to the algorithm's weight vector by a linear combination of the algorithm's loss and the loss of u on the trial. In our case the weight vectors are matrices and we use the following (un-normalized) relative entropy between matrices A and B with non-negative entries:

    Δ(A,B) = Σ_{i,j} ( A_{i,j} ln(A_{i,j}/B_{i,j}) + B_{i,j} - A_{i,j} ).

Note that this is just the sum of the relative entropies between the corresponding rows (or equivalently, between the corresponding columns):

    Δ(A,B) = Σ_i Δ(A_{i,·}, B_{i,·}) = Σ_j Δ(A_{·,j}, B_{·,j})

(here A_{i,·} is the i-th row of A and A_{·,j} is its j-th column).

Unfortunately, the lack of a closed form for the matrix balancing procedure makes it difficult to prove bounds on the loss of the algorithm. Our solution is to break PermELearn's update (Algorithm 2) into two steps, and use only the progress made to the intermediate un-balanced matrix in our per-trial bound (8). After showing that balancing to a doubly stochastic matrix only increases the progress, we can sum the per-trial bound to obtain our main theorem.

5.1 A Dead End

In each trial, PermELearn multiplies each entry of its weight matrix by an exponential factor and then uses one additional factor per row and column to make the matrix doubly stochastic (Algorithm 2 described in Section 4.1):

    W̃_{i,j} := r_i c_j W_{i,j} e^{-η L_{i,j}},        (4)

where the r_i and c_j factors are chosen so that all rows and columns of the matrix W̃ sum to one. We now show that PermELearn's update (4) gives the matrix A solving the following minimization problem:

    argmin_{A : ∀i Σ_j A_{i,j} = 1, ∀j Σ_i A_{i,j} = 1} ( Δ(A,W) + η (A • L) ).        (5)

Since the linear constraints are feasible and the divergence is strictly convex, there always is a unique solution, even though the solution does not have a closed form.

Lemma 2 PermELearn's updated weight matrix W̃ (4) is the solution of (5).

Proof We form a Lagrangian for the optimization problem:

    l(A, ρ, γ) = Δ(A,W) + η (A • L) + Σ_i ρ_i (Σ_j A_{i,j} - 1) + Σ_j γ_j (Σ_i A_{i,j} - 1).

Setting the derivative with respect to A_{i,j} to 0 yields A_{i,j} = W_{i,j} e^{-η L_{i,j}} e^{-ρ_i} e^{-γ_j}. By enforcing the row and column sum constraints we see that the factors r_i = e^{-ρ_i} and c_j = e^{-γ_j} function as row and column normalizers, respectively.

We now examine the progress Δ(U,W) - Δ(U,W̃) towards an arbitrary doubly stochastic matrix U. Using Equation (4) and noting that all three matrices are doubly stochastic (so their entries sum to n), we see that

    Δ(U,W) - Δ(U,W̃) = -η U • L + Σ_i ln r_i + Σ_j ln c_j.

Making this a useful invariant requires lower bounding the sums on the r.h.s. by a constant times W • L, the loss of the algorithm. Unfortunately we are stuck because the r_i and c_j normalization factors don't even have a closed form.

5.2 Successful Analysis

Our successful analysis splits the update (4) into two steps:

    W'_{i,j} := W_{i,j} e^{-η L_{i,j}}   and   W̃_{i,j} := r_i c_j W'_{i,j},        (6)

where (as before) r_i and c_j are chosen so that each row and column of the matrix W̃ sums to one. Using the Lagrangian (as in the proof of Lemma 2), it is easy to see that these W' and W̃ matrices solve the following minimization problems:

    W' = argmin_A ( Δ(A,W) + η (A • L) )   and   W̃ := argmin_{A : ∀i Σ_j A_{i,j} = 1, ∀j Σ_i A_{i,j} = 1} Δ(A,W').        (7)

The second problem shows that the doubly stochastic matrix W̃ is the projection of W' onto the linear row and column sum constraints. The strict convexity of the relative entropy between non-negative matrices and the feasibility of the linear constraints ensure that the solutions for both steps are unique.

We now lower bound the progress Δ(U,W) - Δ(U,W') in the following lemma to get our per-trial invariant.

Lemma 3 For any η > 0, any doubly stochastic matrices U and W and any trial with loss matrix L ∈ [0,1]^{n×n},

    Δ(U,W) - Δ(U,W') ≥ (1 - e^{-η})(W • L) - η (U • L),

where W' is the unbalanced intermediate matrix (6) constructed by PermELearn from W.

Proof The proof manipulates the difference of relative entropies and uses the inequality e^{-ηx} ≤ 1 - (1 - e^{-η})x, which holds for any η and any x ∈ [0,1]:

    Δ(U,W) - Δ(U,W') = Σ_{i,j} ( U_{i,j} ln(W'_{i,j}/W_{i,j}) + W_{i,j} - W'_{i,j} )
                     = Σ_{i,j} ( U_{i,j} ln(e^{-η L_{i,j}}) + W_{i,j} - W_{i,j} e^{-η L_{i,j}} )
                     ≥ Σ_{i,j} ( -η L_{i,j} U_{i,j} + W_{i,j} - W_{i,j} (1 - (1 - e^{-η}) L_{i,j}) )
                     = -η (U • L) + (1 - e^{-η})(W • L).

Relative entropy is a Bregman divergence, so the Generalized Pythagorean Theorem (Bregman, 1967) applies. Specialized to our setting, this theorem states that if S is a closed convex set containing some matrix U with non-negative entries, W' is any matrix with strictly positive entries, and W̃ is the relative entropy projection of W' onto S, then

    Δ(U,W') ≥ Δ(U,W̃) + Δ(W̃,W').

Furthermore, this holds with equality when S is affine, which is the case here since S is the set of matrices whose rows and columns each sum to 1. Rearranging and noting that Δ(A,B) is non-negative yields Corollary 3 of Herbster and Warmuth (2001), which is the inequality we need:

    Δ(U,W') - Δ(U,W̃) = Δ(W̃,W') ≥ 0.

Combining this with the inequality of Lemma 3 gives the critical per-trial invariant:

    Δ(U,W) - Δ(U,W̃) ≥ (1 - e^{-η})(W • L) - η (U • L).        (8)

We now introduce some notation and bound the expected total loss by summing the above inequality over a sequence of trials. When considering a sequence of trials, L^t is the loss matrix at trial t, W^{t-1} is PermELearn's weight matrix W at the start of trial t (so W^0 is the initial weight matrix) and W^t is the updated weight matrix W̃ at the end of the trial.

Theorem 4 For any learning rate η > 0, any doubly stochastic matrices U and initial W^0, and any sequence of T trials with loss matrices L^t ∈ [0,1]^{n×n} (for 1 ≤ t ≤ T), the expected loss of PermELearn is bounded by:

    Σ_{t=1}^T W^{t-1} • L^t ≤ ( Δ(U,W^0) - Δ(U,W^T) + η Σ_{t=1}^T U • L^t ) / (1 - e^{-η}).

Proof Applying (8) to trial t gives:

    Δ(U,W^{t-1}) - Δ(U,W^t) ≥ (1 - e^{-η})(W^{t-1} • L^t) - η (U • L^t).

By summing the above over all T trials we get:

    Δ(U,W^0) - Δ(U,W^T) ≥ (1 - e^{-η}) Σ_{t=1}^T W^{t-1} • L^t - η Σ_{t=1}^T U • L^t.

The bound then follows by solving for the total expected loss, Σ_{t=1}^T W^{t-1} • L^t, of the algorithm.

When the entries of W^0 are all initialized to 1/n and U is a permutation then Δ(U,W^0) = n ln n. Since each doubly stochastic matrix U is a convex combination of permutation matrices, at least one minimizer of the total loss Σ_{t=1}^T U • L^t will be a permutation matrix. If L_best denotes the loss of such a permutation U, then Theorem 4 implies that the total loss of the algorithm is bounded by

    ( Δ(U,W^0) + η L_best ) / (1 - e^{-η}).

If upper bounds Δ(U,W^0) ≤ D_est ≤ n ln n and L_est ≥ L_best are known, then by choosing η = ln(1 + sqrt(2 D_est / L_est)), the above bound becomes (Freund and Schapire, 1997):

    L_best + sqrt(2 L_est D_est) + Δ(U,W^0).        (9)

A natural choice for D_est is n ln n. In this case the tuned bound becomes

    L_best + sqrt(2 L_est n ln n) + n ln n.
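As a numerical sanity check of our own (it assumes numpy and the sinkhorn_balance sketch from Section 4.1), the code below draws random doubly stochastic U and W, applies the two-step update (6), and verifies the per-trial invariant (8).

import numpy as np

def rel_entropy(A, B):
    """Un-normalized relative entropy Δ(A,B) between matrices with positive entries."""
    return float(np.sum(A * np.log(A / B) + B - A))

rng = np.random.default_rng(0)
n, eta = 5, 0.5
U = sinkhorn_balance(rng.random((n, n)))          # a doubly stochastic comparator
W = sinkhorn_balance(rng.random((n, n)))          # the current weight matrix
L = rng.random((n, n))                            # a loss matrix in [0,1]^{n x n}

W_prime = W * np.exp(-eta * L)                    # un-normalized multiplicative step of (6)
W_tilde = sinkhorn_balance(W_prime)               # projection back onto doubly stochastic matrices

lhs = rel_entropy(U, W) - rel_entropy(U, W_tilde)
rhs = (1 - np.exp(-eta)) * np.sum(W * L) - eta * np.sum(U * L)
print(lhs >= rhs - 1e-6)                          # invariant (8): progress ≥ (1-e^{-η}) W•L - η U•L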

5.3 Approximate Balancing

The preceding analysis assumes that PermELearn's weight matrix is perfectly balanced each iteration. However, balancing techniques are only capable of approximately balancing the weight matrix in finite time, so implementations of PermELearn must handle approximately balanced matrices. In Section 4.2, we describe an implementation that uses an approximately balanced Ŵ^{t-1} at the start of iteration t rather than the completely balanced W^{t-1} of the preceding analysis. Lemma 1 shows that when this implementation of PermELearn uses an approximately balanced Ŵ^{t-1} where each row and column sum is in 1 ± ε_t, then the expected loss on trial t is at most W^{t-1} • L^t + 3n^3 ε_t. Summing over all trials and using Theorem 4, this implementation's total loss is at most

    Σ_{t=1}^T ( W^{t-1} • L^t + 3n^3 ε_t ) ≤ ( Δ(U,W^0) - Δ(U,W^T) + η Σ_{t=1}^T U • L^t ) / (1 - e^{-η}) + Σ_{t=1}^T 3n^3 ε_t.

As discussed in Section 4.2, setting ε_t = 1/(3n^3 t^2) leads to an additional loss of less than 5/3 over the bound of Theorem 4 and its subsequent tunings while incurring a total running time (over all T trials) in O(T n^6 log(T n)). In fact, the additional loss for approximate balancing can be made less than any positive c by setting ε_t = c/(5 n^3 t^2). Since the time to approximately balance depends only logarithmically on 1/ε, the total time taken over T trials remains in O(T n^6 log(T n)).

5.4 Split Analysis for the Hedge Algorithm

Perhaps the simplest case where the loss is linear in the parameter vector is the on-line allocation setting of Freund and Schapire (1997). It is instructive to apply our method of splitting the update in this simpler setting. There are N experts and the algorithm keeps a probability distribution w over the experts. In each trial the algorithm picks expert i with probability w_i and then gets a loss vector l ∈ [0,1]^N. Each expert i incurs loss l_i and the algorithm's expected loss is w · l. Finally w is updated to w̃ for the next trial.

The Hedge algorithm (Freund and Schapire, 1997) updates its weight vector to

    w̃_i = w_i e^{-η l_i} / Σ_j w_j e^{-η l_j}.

This update can be motivated by a tradeoff between the un-normalized relative entropy to the old weight vector and expected loss in the last trial (Kivinen and Warmuth, 1999):

    w̃ := argmin_{Σ_i ŵ_i = 1} ( Δ(ŵ,w) + η ŵ · l ).

For vectors, the relative entropy is simply Δ(ŵ,w) := Σ_i ( ŵ_i ln(ŵ_i/w_i) + w_i - ŵ_i ). As in the permutation case, we can split this update (and motivation) into two steps: setting each w'_i = w_i e^{-η l_i} and then w̃ = w' / Σ_i w'_i. These are the solutions to:

    w' := argmin_ŵ ( Δ(ŵ,w) + η ŵ · l )   and   w̃ := argmin_{Σ_i ŵ_i = 1} Δ(ŵ,w').
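A minimal sketch (ours) of the split Hedge update just described: the un-normalized exponential step followed by the projection onto the probability simplex.

import numpy as np

def hedge_two_step(w, loss, eta):
    """One Hedge trial written as the split update: multiplicative step, then normalization."""
    w_prime = w * np.exp(-eta * loss)     # un-normalized exponential update
    return w_prime / w_prime.sum()        # relative-entropy projection onto the probability simplex

w = np.full(4, 0.25)                      # uniform prior over N = 4 experts
for loss in ([0.0, 1.0, 1.0, 0.5], [0.2, 0.9, 1.0, 0.4]):
    w = hedge_two_step(w, np.array(loss), eta=0.5)
print(w)                                  # weight concentrates on the low-loss experts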

The following lower bound has been shown on the progress towards any probability vector u serving as a comparator:(4)

    Δ(u,w) - Δ(u,w̃) = -η u · l - ln Σ_i w_i e^{-η l_i}
                     ≥ -η u · l - ln Σ_i w_i (1 - (1 - e^{-η}) l_i)
                     ≥ -η u · l + (w · l)(1 - e^{-η}),        (10)

where the first inequality uses e^{-ηx} ≤ 1 - (1 - e^{-η})x, for any x ∈ [0,1], and the second uses -ln(1-x) ≥ x, for x ∈ [0,1].

Surprisingly, the same inequality already holds for the un-normalized update:(5)

    Δ(u,w) - Δ(u,w') = -η u · l + Σ_i w_i (1 - e^{-η l_i}) ≥ (w · l)(1 - e^{-η}) - η u · l.

Since the normalization is a projection w.r.t. a Bregman divergence onto a linear constraint satisfied by the comparator u, Δ(u,w') - Δ(u,w̃) ≥ 0 by the Generalized Pythagorean Theorem (Herbster and Warmuth, 2001). The total progress for both steps is again Inequality (10).

With the key Inequality (10) in hand, it is easy to introduce trial dependent notation and sum over trials (as done in the proof of Theorem 4), arriving at the familiar bound for Hedge (Freund and Schapire, 1997): For any η > 0, any probability vectors w^0 and u, and any loss vectors l^t ∈ [0,1]^N,

    Σ_{t=1}^T w^{t-1} · l^t ≤ ( Δ(u,w^0) - Δ(u,w^T) + η Σ_{t=1}^T u · l^t ) / (1 - e^{-η}).        (11)

Note that the r.h.s. is actually constant in the comparator u (Kivinen and Warmuth, 1999), that is, for all u,

    ( Δ(u,w^0) - Δ(u,w^T) + η Σ_{t=1}^T u · l^t ) / (1 - e^{-η}) = ( -ln Σ_i w^0_i e^{-η l^{≤T}_i} ) / (1 - e^{-η}),

where l^{≤T}_i = Σ_{t=1}^T l^t_i is the total loss of expert i. The r.h.s. of the above equality is often used as a potential in proving bounds for expert algorithms. We discuss this further in Appendix B.

5.5 When to Normalize?

Probably the most surprising aspect about the proof methodology is the flexibility about how and when to project onto the constraints. Instead of projecting a nonnegative matrix onto all 2n constraints at once (as in optimization problem (7)), we could mimic the Sinkhorn balancing algorithm by first projecting onto the row constraints and then the column constraints and alternating until convergence. The Generalized Pythagorean Theorem shows that projecting onto any convex constraint that is satisfied by the comparator class of doubly stochastic matrices brings the weight matrix closer to every doubly stochastic matrix.(6) Therefore our bound on Σ_t W^{t-1} • L^t (Theorem 4) holds if the exponential updates are interleaved with any sequence of projections to some subsets of the constraints.

(4) This is essentially Lemma 5.2 of Littlestone and Warmuth (1994). The reformulation of this type of inequality with relative entropies goes back to Kivinen and Warmuth (1999).
(5) Note that if the algorithm does not normalize the weights then w is no longer a distribution. When Σ_i w_i < 1, the loss w · l amounts to incurring 0 loss with probability 1 - Σ_i w_i, and predicting as expert i with probability w_i.
(6) There is a large body of work on finding a solution subject to constraints via iterated Bregman projections (see, e.g., Censor and Lent, 1981).

However, if the normalization constraints are not enforced then W is no longer a convex combination of permutations. Furthermore, the exponential update factors only decrease the entries of W and without any normalization all of the entries of W can get arbitrarily small. If this is allowed to happen then the loss W • L can approach 0 for any loss matrix, violating the spirit of the prediction model.

There is a direct argument that shows that the same final doubly stochastic matrix is reached if we interleave the exponential updates with projections to any of the constraints as long as all 2n constraints hold at the end. To see this we partition the class of matrices with positive entries into equivalence classes. Call two such matrices A and B equivalent if there are diagonal matrices R and C with positive diagonal entries such that B = RAC. Note that [RAC]_{i,j} = R_{i,i} A_{i,j} C_{j,j} and therefore B is just a rescaled version of A. Projecting onto any row and/or column sum constraints amounts to pre- and/or post-multiplying the matrix by some positive diagonal matrices R and C. Therefore if matrices A and B are equivalent then the projection of A (or B) onto a set of row and/or column sum constraints results in another matrix equivalent to both A and B. The importance of equivalent matrices is that they balance to the same doubly stochastic matrix.

Lemma 5 For any two equivalent matrices A and RAC, where the entries of A and the diagonal entries of R and C are positive,

    argmin_{Â : ∀i Σ_j Â_{i,j} = 1, ∀j Σ_i Â_{i,j} = 1} Δ(Â,A) = argmin_{Â : ∀i Σ_j Â_{i,j} = 1, ∀j Σ_i Â_{i,j} = 1} Δ(Â,RAC).

Proof The strict convexity of the relative entropy implies that both problems have a unique matrix as their solution. We will now reason that the unique solutions for both problems are the same. By using a Lagrangian (as in the proof of Lemma 2) we see that the solution of the left optimization problem is a square matrix with r'_i A_{i,j} c'_j in position (i,j). Similarly the solution of the problem on the right has r''_i R_{i,i} A_{i,j} C_{j,j} c''_j in position (i,j). Here the factors r'_i, r''_i function as row normalizers and c'_j, c''_j as column normalizers. Given a solution r'_i, c'_j to the left problem, then r'_i/R_{i,i}, c'_j/C_{j,j} is a solution of the right problem of the same value. Also if r''_i, c''_j is a solution of the right problem, then r''_i R_{i,i}, c''_j C_{j,j} is a solution to the left problem of the same value. This shows that both minimization problems have the same value and the matrix solutions for both problems are the same and unique (even though the normalization factors r'_i, c'_j of, say, the left problem are not necessarily unique). Note that it is crucial for the above argument that the diagonal entries of R and C are positive.

The analogous phenomenon is much simpler in the weighted majority case: two non-negative vectors a and b are equivalent if a = cb, where c is any nonnegative scalar, and again each equivalence class has exactly one normalized weight vector.

PermELearn's intermediate matrix W'_{i,j} := W_{i,j} e^{-η L_{i,j}} can be written W ⊙ M where ⊙ denotes the Hadamard (entry-wise) product and M_{i,j} = e^{-η L_{i,j}}. Note that the Hadamard product commutes with matrix multiplication by diagonal matrices: if C is diagonal and P = (A ⊙ B)C then P_{i,j} = (A_{i,j} B_{i,j}) C_{j,j} = (A_{i,j} C_{j,j}) B_{i,j}, so we also have P = (AC) ⊙ B. Similarly, R(A ⊙ B) = (RA) ⊙ B when R is diagonal.

Hadamard products also preserve equivalence: for equivalent matrices A and B = RAC (for diagonal R and C) the matrices A ⊙ M and B ⊙ M are equivalent (although they are not likely to be equivalent to A and B) since B ⊙ M = (RAC) ⊙ M = R(A ⊙ M)C. This means that any two runs of PermELearn-like algorithms that have the same bag of loss matrices and equivalent initial matrices end with equivalent final matrices even if they project onto different subsets of the constraints at the end of the various trials.

In summary, the proof method discussed so far uses a relative entropy as a measure of progress and relies on Bregman projections as its fundamental tool. In Appendix B we re-derive the bound for PermELearn using the value of the optimization problem (5) as a potential. This value is expressed using the dual optimization problem and intuitively the application of the Generalized Pythagorean Theorem now is replaced by plugging in a non-optimal choice for the dual variables. Both proof techniques are useful.

5.6 Learning Mappings

We have an algorithm that has small regret against the best permutation. Permutations are a subset of all mappings from {1,...,n} to {1,...,n}. We continue using Π for a permutation and introduce Ψ to denote an arbitrary mapping from {1,...,n} to {1,...,n}. Mappings differ from permutations in that the n dimensional vector (Ψ(i))_{i=1}^n can have repeats, that is, Ψ(i) might equal Ψ(j) for i ≠ j. Again we alternately represent a mapping Ψ as an n x n matrix where Ψ_{i,j} = 1 if Ψ(i) = j and 0 otherwise. Note that such square(7) mapping matrices have the special property that they have exactly one 1 in each row. Again the loss is specified by a loss matrix L and the loss of mapping Ψ is Ψ • L.

It is straightforward to design an algorithm MapELearn for learning mappings with exponential weights: simply run n independent copies of the Hedge algorithm, one for each of the n rows of the received loss matrices. That is, the r-th copy of Hedge always receives the r-th row of the loss matrix L as its loss vector. Even though learning mappings is easy, it is nevertheless instructive to discuss the differences with PermELearn.

Note that MapELearn's combined weight matrix is now a convex combination of mappings, that is, a singly stochastic matrix with the constraint that each row sums to one. Again, after the exponential update (3), the constraints are typically not satisfied any more, but they can be easily re-established by simply normalizing each row. The row normalization only needs to be done once in each trial: no iterative process is needed. Furthermore, no fancy decomposition algorithm is needed in MapELearn: for (singly) stochastic weight matrix W, the prediction Ψ̂(i) is simply a random element chosen from the row distribution W_{i,·}. This sampling procedure produces a mapping Ψ̂ such that W = E(Ψ̂) and thus E(Ψ̂ • L) = W • L as needed.

We can use the same relative entropy between the singly stochastic matrices, and the lower bound on the progress for the exponential update given in Lemma 3 still holds. Also our main bound (Theorem 4) is still true for MapELearn and we arrive at the same tuned bound for the total loss of MapELearn:

    L_best + sqrt(2 L_est D_est) + Δ(U,W^0),

where L_best, L_est, and D_est are now the total loss of the best mapping, a known upper bound on L_best, and an upper bound on Δ(U,W^0), respectively. Recall that L_est and D_est are needed to tune the η parameter.

(7) In the case of mappings the restriction to square matrices is not essential.
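A small sketch of the MapELearn scheme described above (ours; numpy-based): the weight matrix is row-normalized after each multiplicative update, and a mapping is sampled by drawing each Ψ̂(i) independently from row i, so that E[Ψ̂] = W.

import numpy as np

def mapelearn_update(W, L, eta):
    """Multiplicative update followed by row normalization (each row is one Hedge copy)."""
    W = W * np.exp(-eta * L)
    return W / W.sum(axis=1, keepdims=True)

def mapelearn_predict(W, rng):
    """Sample a mapping Ψ̂ with Ψ̂(i) drawn from the row distribution W_{i,·}."""
    n = W.shape[0]
    return np.array([rng.choice(n, p=W[i]) for i in range(n)])

rng = np.random.default_rng(1)
n = 3
W = np.full((n, n), 1.0 / n)                     # uniform singly stochastic start
W = mapelearn_update(W, rng.random((n, n)), eta=0.5)
print(mapelearn_predict(W, rng))                 # a mapping: repeats are allowed, unlike a permutation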

Our algorithm PermELearn for permutations may be seen as the above algorithm for mappings while enforcing the column sum constraints in addition to the row constraints used in MapELearn. Since PermELearn's row balancing messes up the column sums and vice versa, an iterative procedure (i.e., Sinkhorn balancing) is needed to create a matrix in which each row and column sums to one. The enforcement of the additional column sum constraints results in a doubly stochastic matrix, an apparently necessary step to produce predictions that are permutations (and an expected prediction equal to the doubly stochastic weight matrix). When it is known that the comparator is a permutation, then the algorithm always benefits from enforcing the additional column constraints. In general we should always make use of any constraints that the comparator is known to satisfy (see, e.g., Warmuth and Vishwanathan, 2005, for a discussion of this).

As discussed in Section 4.1, if Ã is a Sinkhorn-balanced version of a non-negative matrix A, then for any permutations Π_1 and Π_2,

    ∏_i Ã_{i,Π_1(i)} / ∏_i Ã_{i,Π_2(i)} = ∏_i A_{i,Π_1(i)} / ∏_i A_{i,Π_2(i)}.        (12)

An analogous invariant holds for mappings: if Ã is a row-balanced version of a non-negative matrix A, then for any mappings Ψ_1 and Ψ_2,

    ∏_i Ã_{i,Ψ_1(i)} / ∏_i Ã_{i,Ψ_2(i)} = ∏_i A_{i,Ψ_1(i)} / ∏_i A_{i,Ψ_2(i)}.

However it is important to note that column balancing does not preserve the above invariant for mappings. In fact, permutations are the subclass of mappings where invariant (12) holds.

There is another important difference between PermELearn and MapELearn. For MapELearn, the probability of predicting mapping Ψ with weight matrix W is always the product ∏_i W_{i,Ψ(i)}. The analogous property does not hold for PermELearn. Consider the balanced 2 x 2 weight matrix W on the right of Figure 2. This matrix decomposes into sqrt(2)/(1+sqrt(2)) times the permutation (1,2) plus 1/(1+sqrt(2)) times the permutation (2,1). Thus the probability of predicting with permutation (1,2) is sqrt(2) times the probability of permutation (2,1) for the PermELearn algorithm. However, when the probabilities are proportional to the intuitive product form ∏_i W_{i,Π(i)}, then the probability ratio for these two permutations is 2. Notice that this intuitive product weight measure is the distribution used by the Hedge algorithm that explicitly treats each permutation as a separate expert. Therefore PermELearn is clearly different from a concise implementation of Hedge for permutations.

6. Follow the Perturbed Leader Algorithm

Perhaps the simplest on-line algorithm is the Follow the Leader (FL) algorithm: at each trial predict with one of the best models on the data seen so far. Thus FL predicts at trial t with an expert in argmin_i l^{<t}_i or any permutation in argmin_Π Π • L^{<t}, where the superscript <t indicates that we sum over the past trials, that is, l^{<t} := Σ_{q=1}^{t-1} l^q. The FL algorithm is clearly non-optimal; in the expert setting there is a simple adversary strategy that forces FL to have loss at least n times larger than the loss of the best expert in hindsight.

The expected total loss of tuned Hedge is one times the loss of the best expert plus lower order terms. Hedge achieves this by randomly choosing experts. The probability w^{t-1}_i for choosing expert i at trial t is proportional to e^{-η l^{<t}_i}. As the learning rate η → ∞, Hedge becomes FL (when there are no ties) and the same holds for PermELearn.

Thus the exponential weights with moderate η may be seen as a soft min calculation: the algorithm hedges its bets and does not put all its probability on the expert with minimum loss so far.

The Follow the Perturbed Leader (FPL) algorithm of Kalai and Vempala (2005) is an alternate on-line prediction algorithm that works in a very general setting. It adds random perturbations to the total losses of the experts incurred so far and then predicts with the expert of minimum perturbed loss. Their FPL algorithm has bounds closely related to Hedge and other multiplicative weight algorithms and in some cases Hedge can be simulated exactly (Kuzmin and Warmuth, 2005) by judiciously choosing the distribution of perturbations. However, for the permutation problem the bounds we were able to obtain for FPL are weaker than the bounds we obtained for PermELearn that uses exponential weights, despite the apparent similarity between our representations and the general formulation of FPL.

The FPL setting uses an abstract k-dimensional decision space used to encode predictors as well as a k-dimensional state space used to represent the losses of the predictors. At any trial, the current loss of a particular predictor is the dot product between that predictor's representation in the decision space and the state-space vector for the trial. This general setting can explicitly represent each permutation and its loss when k = n!. The FPL setting also easily handles the encodings of permutations and losses used by PermELearn by representing each permutation matrix Π and loss matrix L as n^2-dimensional vectors.

The FPL algorithm (Kalai and Vempala, 2005) takes a parameter ε and maintains a cumulative loss matrix C (initially C is the zero matrix). At each trial, FPL:

1. Generates a random perturbation matrix P where each P_{i,j} is proportional to ±r_{i,j} where r_{i,j} is drawn from the standard exponential distribution.

2. Predicts with a permutation Π minimizing Π • (C + P).

3. After getting the loss matrix L, updates C to C + L.

Note that FPL is more computationally efficient than PermELearn. It takes only O(n^3) time to make its prediction (the time to compute a minimum weight bipartite matching) and only O(n^2) time to update C. Unfortunately the generic FPL loss bounds are not as good as the bounds on PermELearn. In particular, they show that the loss of FPL on any sequence of trials is at most(8)

    (1 + ε) L_best + (8 n^3 (1 + ln n)) / ε,

where ε is a parameter of the algorithm. When the loss of the best expert is known ahead of time, ε can be tuned and the bound becomes

    L_best + 4 sqrt(2 L_best n^3 (1 + ln n)) + 8 n^3 (1 + ln n).

Although FPL gets the same L_best leading term, the excess loss over the best permutation grows as n^3 ln n rather than the n ln n growth of PermELearn's bound. Of course, PermELearn pays for the improved bound by requiring more computation.

(8) The n^3 terms in the bounds for FPL are n times the sum of the entries in the loss matrix. So if the application has a loss motif whose entries sum to only n, then the n^3 factors become n^2.
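A sketch of an FPL trial for permutations (ours, not the authors' code; it assumes scipy's linear_sum_assignment and a two-sided exponential perturbation with scale 1/ε, one standard choice): perturb the cumulative loss matrix and predict with a minimum-cost assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment

def fpl_predict(C, eps, rng):
    """Predict with the permutation minimizing Π • (C + P) for a random perturbation P."""
    n = C.shape[0]
    signs = rng.choice([-1.0, 1.0], size=(n, n))
    P = signs * rng.exponential(scale=1.0 / eps, size=(n, n))   # two-sided exponential perturbation
    _, perm = linear_sum_assignment(C + P)                      # minimum-weight bipartite matching
    return perm                                                 # perm[i] = position assigned to element i

rng = np.random.default_rng(0)
n, eps = 4, 0.1
C = np.zeros((n, n))                 # cumulative loss matrix
for _ in range(3):                   # a few trials with random losses
    perm = fpl_predict(C, eps, rng)
    L = rng.random((n, n))
    print(perm, float(L[np.arange(n), perm].sum()))             # prediction and its loss Π • L
    C += L                                                      # step 3: update the cumulative losses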


More information

Power-of-Two Policies for Single- Warehouse Multi-Retailer Inventory Systems with Order Frequency Discounts

Power-of-Two Policies for Single- Warehouse Multi-Retailer Inventory Systems with Order Frequency Discounts Power-of-wo Polces for Sngle- Warehouse Mult-Retaler Inventory Systems wth Order Frequency Dscounts José A. Ventura Pennsylvana State Unversty (USA) Yale. Herer echnon Israel Insttute of echnology (Israel)

More information

8 Algorithm for Binary Searching in Trees

8 Algorithm for Binary Searching in Trees 8 Algorthm for Bnary Searchng n Trees In ths secton we present our algorthm for bnary searchng n trees. A crucal observaton employed by the algorthm s that ths problem can be effcently solved when the

More information

Fisher Markets and Convex Programs

Fisher Markets and Convex Programs Fsher Markets and Convex Programs Nkhl R. Devanur 1 Introducton Convex programmng dualty s usually stated n ts most general form, wth convex objectve functons and convex constrants. (The book by Boyd and

More information

+ + + - - This circuit than can be reduced to a planar circuit

+ + + - - This circuit than can be reduced to a planar circuit MeshCurrent Method The meshcurrent s analog of the nodeoltage method. We sole for a new set of arables, mesh currents, that automatcally satsfy KCLs. As such, meshcurrent method reduces crcut soluton to

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

How To Calculate The Accountng Perod Of Nequalty

How To Calculate The Accountng Perod Of Nequalty Inequalty and The Accountng Perod Quentn Wodon and Shlomo Ytzha World Ban and Hebrew Unversty September Abstract Income nequalty typcally declnes wth the length of tme taen nto account for measurement.

More information

where the coordinates are related to those in the old frame as follows.

where the coordinates are related to those in the old frame as follows. Chapter 2 - Cartesan Vectors and Tensors: Ther Algebra Defnton of a vector Examples of vectors Scalar multplcaton Addton of vectors coplanar vectors Unt vectors A bass of non-coplanar vectors Scalar product

More information

L10: Linear discriminants analysis

L10: Linear discriminants analysis L0: Lnear dscrmnants analyss Lnear dscrmnant analyss, two classes Lnear dscrmnant analyss, C classes LDA vs. PCA Lmtatons of LDA Varants of LDA Other dmensonalty reducton methods CSCE 666 Pattern Analyss

More information

Implementation of Deutsch's Algorithm Using Mathcad

Implementation of Deutsch's Algorithm Using Mathcad Implementaton of Deutsch's Algorthm Usng Mathcad Frank Roux The followng s a Mathcad mplementaton of Davd Deutsch's quantum computer prototype as presented on pages - n "Machnes, Logc and Quantum Physcs"

More information

The OC Curve of Attribute Acceptance Plans

The OC Curve of Attribute Acceptance Plans The OC Curve of Attrbute Acceptance Plans The Operatng Characterstc (OC) curve descrbes the probablty of acceptng a lot as a functon of the lot s qualty. Fgure 1 shows a typcal OC Curve. 10 8 6 4 1 3 4

More information

Calculation of Sampling Weights

Calculation of Sampling Weights Perre Foy Statstcs Canada 4 Calculaton of Samplng Weghts 4.1 OVERVIEW The basc sample desgn used n TIMSS Populatons 1 and 2 was a two-stage stratfed cluster desgn. 1 The frst stage conssted of a sample

More information

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP) 6.3 / -- Communcaton Networks II (Görg) SS20 -- www.comnets.un-bremen.de Communcaton Networks II Contents. Fundamentals of probablty theory 2. Emergence of communcaton traffc 3. Stochastc & Markovan Processes

More information

We are now ready to answer the question: What are the possible cardinalities for finite fields?

We are now ready to answer the question: What are the possible cardinalities for finite fields? Chapter 3 Fnte felds We have seen, n the prevous chapters, some examples of fnte felds. For example, the resdue class rng Z/pZ (when p s a prme) forms a feld wth p elements whch may be dentfed wth the

More information

General Auction Mechanism for Search Advertising

General Auction Mechanism for Search Advertising General Aucton Mechansm for Search Advertsng Gagan Aggarwal S. Muthukrshnan Dávd Pál Martn Pál Keywords game theory, onlne auctons, stable matchngs ABSTRACT Internet search advertsng s often sold by an

More information

1. Measuring association using correlation and regression

1. Measuring association using correlation and regression How to measure assocaton I: Correlaton. 1. Measurng assocaton usng correlaton and regresson We often would lke to know how one varable, such as a mother's weght, s related to another varable, such as a

More information

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic Lagrange Multplers as Quanttatve Indcators n Economcs Ivan Mezník Insttute of Informatcs, Faculty of Busness and Management, Brno Unversty of TechnologCzech Republc Abstract The quanttatve role of Lagrange

More information

Chapter 4 ECONOMIC DISPATCH AND UNIT COMMITMENT

Chapter 4 ECONOMIC DISPATCH AND UNIT COMMITMENT Chapter 4 ECOOMIC DISATCH AD UIT COMMITMET ITRODUCTIO A power system has several power plants. Each power plant has several generatng unts. At any pont of tme, the total load n the system s met by the

More information

Extending Probabilistic Dynamic Epistemic Logic

Extending Probabilistic Dynamic Epistemic Logic Extendng Probablstc Dynamc Epstemc Logc Joshua Sack May 29, 2008 Probablty Space Defnton A probablty space s a tuple (S, A, µ), where 1 S s a set called the sample space. 2 A P(S) s a σ-algebra: a set

More information

The Mathematical Derivation of Least Squares

The Mathematical Derivation of Least Squares Pscholog 885 Prof. Federco The Mathematcal Dervaton of Least Squares Back when the powers that e forced ou to learn matr algera and calculus, I et ou all asked ourself the age-old queston: When the hell

More information

n + d + q = 24 and.05n +.1d +.25q = 2 { n + d + q = 24 (3) n + 2d + 5q = 40 (2)

n + d + q = 24 and.05n +.1d +.25q = 2 { n + d + q = 24 (3) n + 2d + 5q = 40 (2) MATH 16T Exam 1 : Part I (In-Class) Solutons 1. (0 pts) A pggy bank contans 4 cons, all of whch are nckels (5 ), dmes (10 ) or quarters (5 ). The pggy bank also contans a con of each denomnaton. The total

More information

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College Feature selecton for ntruson detecton Slobodan Petrovć NISlab, Gjøvk Unversty College Contents The feature selecton problem Intruson detecton Traffc features relevant for IDS The CFS measure The mrmr measure

More information

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis The Development of Web Log Mnng Based on Improve-K-Means Clusterng Analyss TngZhong Wang * College of Informaton Technology, Luoyang Normal Unversty, Luoyang, 471022, Chna wangtngzhong2@sna.cn Abstract.

More information

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK Sample Stablty Protocol Background The Cholesterol Reference Method Laboratory Network (CRMLN) developed certfcaton protocols for total cholesterol, HDL

More information

Logistic Regression. Steve Kroon

Logistic Regression. Steve Kroon Logstc Regresson Steve Kroon Course notes sectons: 24.3-24.4 Dsclamer: these notes do not explctly ndcate whether values are vectors or scalars, but expects the reader to dscern ths from the context. Scenaro

More information

Loop Parallelization

Loop Parallelization - - Loop Parallelzaton C-52 Complaton steps: nested loops operatng on arrays, sequentell executon of teraton space DECLARE B[..,..+] FOR I :=.. FOR J :=.. I B[I,J] := B[I-,J]+B[I-,J-] ED FOR ED FOR analyze

More information

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression Novel Methodology of Workng Captal Management for Large Publc Constructons by Usng Fuzzy S-curve Regresson Cheng-Wu Chen, Morrs H. L. Wang and Tng-Ya Hseh Department of Cvl Engneerng, Natonal Central Unversty,

More information

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel

More information

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6 PAR TESTS If a WEIGHT varable s specfed, t s used to replcate a case as many tmes as ndcated by the weght value rounded to the nearest nteger. If the workspace requrements are exceeded and samplng has

More information

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements Lecture 3 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Next lecture: Matlab tutoral Announcements Rules for attendng the class: Regstered for credt Regstered for audt (only f there

More information

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School Robust Desgn of Publc Storage Warehouses Yemng (Yale) Gong EMLYON Busness School Rene de Koster Rotterdam school of management, Erasmus Unversty Abstract We apply robust optmzaton and revenue management

More information

Generalizing the degree sequence problem

Generalizing the degree sequence problem Mddlebury College March 2009 Arzona State Unversty Dscrete Mathematcs Semnar The degree sequence problem Problem: Gven an nteger sequence d = (d 1,...,d n ) determne f there exsts a graph G wth d as ts

More information

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits Lnear Crcuts Analyss. Superposton, Theenn /Norton Equalent crcuts So far we hae explored tmendependent (resste) elements that are also lnear. A tmendependent elements s one for whch we can plot an / cure.

More information

Joe Pimbley, unpublished, 2005. Yield Curve Calculations

Joe Pimbley, unpublished, 2005. Yield Curve Calculations Joe Pmbley, unpublshed, 005. Yeld Curve Calculatons Background: Everythng s dscount factors Yeld curve calculatons nclude valuaton of forward rate agreements (FRAs), swaps, nterest rate optons, and forward

More information

Logical Development Of Vogel s Approximation Method (LD-VAM): An Approach To Find Basic Feasible Solution Of Transportation Problem

Logical Development Of Vogel s Approximation Method (LD-VAM): An Approach To Find Basic Feasible Solution Of Transportation Problem INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME, ISSUE, FEBRUARY ISSN 77-866 Logcal Development Of Vogel s Approxmaton Method (LD- An Approach To Fnd Basc Feasble Soluton Of Transportaton

More information

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications CMSC828G Prncples of Data Mnng Lecture #9 Today s Readng: HMS, chapter 9 Today s Lecture: Descrptve Modelng Clusterng Algorthms Descrptve Models model presents the man features of the data, a global summary

More information

How To Solve A Problem In A Powerline (Powerline) With A Powerbook (Powerbook)

How To Solve A Problem In A Powerline (Powerline) With A Powerbook (Powerbook) MIT 8.996: Topc n TCS: Internet Research Problems Sprng 2002 Lecture 7 March 20, 2002 Lecturer: Bran Dean Global Load Balancng Scrbe: John Kogel, Ben Leong In today s lecture, we dscuss global load balancng

More information

Period and Deadline Selection for Schedulability in Real-Time Systems

Period and Deadline Selection for Schedulability in Real-Time Systems Perod and Deadlne Selecton for Schedulablty n Real-Tme Systems Thdapat Chantem, Xaofeng Wang, M.D. Lemmon, and X. Sharon Hu Department of Computer Scence and Engneerng, Department of Electrcal Engneerng

More information

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

On the Optimal Control of a Cascade of Hydro-Electric Power Stations On the Optmal Control of a Cascade of Hydro-Electrc Power Statons M.C.M. Guedes a, A.F. Rbero a, G.V. Smrnov b and S. Vlela c a Department of Mathematcs, School of Scences, Unversty of Porto, Portugal;

More information

Forecasting the Direction and Strength of Stock Market Movement

Forecasting the Direction and Strength of Stock Market Movement Forecastng the Drecton and Strength of Stock Market Movement Jngwe Chen Mng Chen Nan Ye cjngwe@stanford.edu mchen5@stanford.edu nanye@stanford.edu Abstract - Stock market s one of the most complcated systems

More information

Formulating & Solving Integer Problems Chapter 11 289

Formulating & Solving Integer Problems Chapter 11 289 Formulatng & Solvng Integer Problems Chapter 11 289 The Optonal Stop TSP If we drop the requrement that every stop must be vsted, we then get the optonal stop TSP. Ths mght correspond to a ob sequencng

More information

An Interest-Oriented Network Evolution Mechanism for Online Communities

An Interest-Oriented Network Evolution Mechanism for Online Communities An Interest-Orented Network Evoluton Mechansm for Onlne Communtes Cahong Sun and Xaopng Yang School of Informaton, Renmn Unversty of Chna, Bejng 100872, P.R. Chna {chsun,yang}@ruc.edu.cn Abstract. Onlne

More information

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES The goal: to measure (determne) an unknown quantty x (the value of a RV X) Realsaton: n results: y 1, y 2,..., y j,..., y n, (the measured values of Y 1, Y 2,..., Y j,..., Y n ) every result s encumbered

More information

Lecture 2: Single Layer Perceptrons Kevin Swingler

Lecture 2: Single Layer Perceptrons Kevin Swingler Lecture 2: Sngle Layer Perceptrons Kevn Sngler kms@cs.str.ac.uk Recap: McCulloch-Ptts Neuron Ths vastly smplfed model of real neurons s also knon as a Threshold Logc Unt: W 2 A Y 3 n W n. A set of synapses

More information

NON-CONSTANT SUM RED-AND-BLACK GAMES WITH BET-DEPENDENT WIN PROBABILITY FUNCTION LAURA PONTIGGIA, University of the Sciences in Philadelphia

NON-CONSTANT SUM RED-AND-BLACK GAMES WITH BET-DEPENDENT WIN PROBABILITY FUNCTION LAURA PONTIGGIA, University of the Sciences in Philadelphia To appear n Journal o Appled Probablty June 2007 O-COSTAT SUM RED-AD-BLACK GAMES WITH BET-DEPEDET WI PROBABILITY FUCTIO LAURA POTIGGIA, Unversty o the Scences n Phladelpha Abstract In ths paper we nvestgate

More information

A Probabilistic Theory of Coherence

A Probabilistic Theory of Coherence A Probablstc Theory of Coherence BRANDEN FITELSON. The Coherence Measure C Let E be a set of n propostons E,..., E n. We seek a probablstc measure C(E) of the degree of coherence of E. Intutvely, we want

More information

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12 14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed

More information

The Geometry of Online Packing Linear Programs

The Geometry of Online Packing Linear Programs The Geometry of Onlne Packng Lnear Programs Marco Molnaro R. Rav Abstract We consder packng lnear programs wth m rows where all constrant coeffcents are n the unt nterval. In the onlne model, we know the

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Dstrb. Comput. 71 (2011) 62 76 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. journal homepage: www.elsever.com/locate/jpdc Optmzng server placement n dstrbuted systems n

More information

Stochastic Bandits with Side Observations on Networks

Stochastic Bandits with Side Observations on Networks Stochastc Bandts wth Sde Observatons on Networks Swapna Buccapatnam, Atlla Erylmaz Department of ECE The Oho State Unversty Columbus, OH - 430 buccapat@eceosuedu, erylmaz@osuedu Ness B Shroff Departments

More information

Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems

Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems Jont Schedulng of Processng and Shuffle Phases n MapReduce Systems Fangfe Chen, Mural Kodalam, T. V. Lakshman Department of Computer Scence and Engneerng, The Penn State Unversty Bell Laboratores, Alcatel-Lucent

More information

A Lyapunov Optimization Approach to Repeated Stochastic Games

A Lyapunov Optimization Approach to Repeated Stochastic Games PROC. ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING, OCT. 2013 1 A Lyapunov Optmzaton Approach to Repeated Stochastc Games Mchael J. Neely Unversty of Southern Calforna http://www-bcf.usc.edu/

More information

On Robust Network Planning

On Robust Network Planning On Robust Network Plannng Al Tzghadam School of Electrcal and Computer Engneerng Unversty of Toronto, Toronto, Canada Emal: al.tzghadam@utoronto.ca Alberto Leon-Garca School of Electrcal and Computer Engneerng

More information

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance Calbraton Method Instances of the Cell class (one nstance for each FMS cell) contan ADC raw data and methods assocated wth each partcular FMS cell. The calbraton method ncludes event selecton (Class Cell

More information

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo

More information

Learning the Best K-th Channel for QoS Provisioning in Cognitive Networks

Learning the Best K-th Channel for QoS Provisioning in Cognitive Networks 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Time Value of Money Module

Time Value of Money Module Tme Value of Money Module O BJECTIVES After readng ths Module, you wll be able to: Understand smple nterest and compound nterest. 2 Compute and use the future value of a sngle sum. 3 Compute and use the

More information

Online Advertisement, Optimization and Stochastic Networks

Online Advertisement, Optimization and Stochastic Networks Onlne Advertsement, Optmzaton and Stochastc Networks Bo (Rambo) Tan and R. Srkant Department of Electrcal and Computer Engneerng Unversty of Illnos at Urbana-Champagn Urbana, IL, USA 1 arxv:1009.0870v6

More information

7.5. Present Value of an Annuity. Investigate

7.5. Present Value of an Annuity. Investigate 7.5 Present Value of an Annuty Owen and Anna are approachng retrement and are puttng ther fnances n order. They have worked hard and nvested ther earnngs so that they now have a large amount of money on

More information

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers Foundatons and Trends R n Machne Learnng Vol. 3, No. 1 (2010) 1 122 c 2011 S. Boyd, N. Parkh, E. Chu, B. Peleato and J. Ecksten DOI: 10.1561/2200000016 Dstrbuted Optmzaton and Statstcal Learnng va the

More information

Efficient Project Portfolio as a tool for Enterprise Risk Management

Efficient Project Portfolio as a tool for Enterprise Risk Management Effcent Proect Portfolo as a tool for Enterprse Rsk Management Valentn O. Nkonov Ural State Techncal Unversty Growth Traectory Consultng Company January 5, 27 Effcent Proect Portfolo as a tool for Enterprse

More information

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 819-840 (2008) Data Broadcast on a Mult-System Heterogeneous Overlayed Wreless Network * Department of Computer Scence Natonal Chao Tung Unversty Hsnchu,

More information

CHAPTER 14 MORE ABOUT REGRESSION

CHAPTER 14 MORE ABOUT REGRESSION CHAPTER 14 MORE ABOUT REGRESSION We learned n Chapter 5 that often a straght lne descrbes the pattern of a relatonshp between two quanttatve varables. For nstance, n Example 5.1 we explored the relatonshp

More information

Traffic State Estimation in the Traffic Management Center of Berlin

Traffic State Estimation in the Traffic Management Center of Berlin Traffc State Estmaton n the Traffc Management Center of Berln Authors: Peter Vortsch, PTV AG, Stumpfstrasse, D-763 Karlsruhe, Germany phone ++49/72/965/35, emal peter.vortsch@ptv.de Peter Möhl, PTV AG,

More information

Section 5.4 Annuities, Present Value, and Amortization

Section 5.4 Annuities, Present Value, and Amortization Secton 5.4 Annutes, Present Value, and Amortzaton Present Value In Secton 5.2, we saw that the present value of A dollars at nterest rate per perod for n perods s the amount that must be deposted today

More information

Product-Form Stationary Distributions for Deficiency Zero Chemical Reaction Networks

Product-Form Stationary Distributions for Deficiency Zero Chemical Reaction Networks Bulletn of Mathematcal Bology (21 DOI 1.17/s11538-1-9517-4 ORIGINAL ARTICLE Product-Form Statonary Dstrbutons for Defcency Zero Chemcal Reacton Networks Davd F. Anderson, Gheorghe Cracun, Thomas G. Kurtz

More information

Enabling P2P One-view Multi-party Video Conferencing

Enabling P2P One-view Multi-party Video Conferencing Enablng P2P One-vew Mult-party Vdeo Conferencng Yongxang Zhao, Yong Lu, Changja Chen, and JanYn Zhang Abstract Mult-Party Vdeo Conferencng (MPVC) facltates realtme group nteracton between users. Whle P2P

More information

Compiling for Parallelism & Locality. Dependence Testing in General. Algorithms for Solving the Dependence Problem. Dependence Testing

Compiling for Parallelism & Locality. Dependence Testing in General. Algorithms for Solving the Dependence Problem. Dependence Testing Complng for Parallelsm & Localty Dependence Testng n General Assgnments Deadlne for proect 4 extended to Dec 1 Last tme Data dependences and loops Today Fnsh data dependence analyss for loops General code

More information

NMT EE 589 & UNM ME 482/582 ROBOT ENGINEERING. Dr. Stephen Bruder NMT EE 589 & UNM ME 482/582

NMT EE 589 & UNM ME 482/582 ROBOT ENGINEERING. Dr. Stephen Bruder NMT EE 589 & UNM ME 482/582 NMT EE 589 & UNM ME 482/582 ROBOT ENGINEERING Dr. Stephen Bruder NMT EE 589 & UNM ME 482/582 7. Root Dynamcs 7.2 Intro to Root Dynamcs We now look at the forces requred to cause moton of the root.e. dynamcs!!

More information

Research Article Enhanced Two-Step Method via Relaxed Order of α-satisfactory Degrees for Fuzzy Multiobjective Optimization

Research Article Enhanced Two-Step Method via Relaxed Order of α-satisfactory Degrees for Fuzzy Multiobjective Optimization Hndaw Publshng Corporaton Mathematcal Problems n Engneerng Artcle ID 867836 pages http://dxdoorg/055/204/867836 Research Artcle Enhanced Two-Step Method va Relaxed Order of α-satsfactory Degrees for Fuzzy

More information

Ants Can Schedule Software Projects

Ants Can Schedule Software Projects Ants Can Schedule Software Proects Broderck Crawford 1,2, Rcardo Soto 1,3, Frankln Johnson 4, and Erc Monfroy 5 1 Pontfca Unversdad Católca de Valparaíso, Chle FrstName.Name@ucv.cl 2 Unversdad Fns Terrae,

More information

Matrix Multiplication I

Matrix Multiplication I Matrx Multplcaton I Yuval Flmus February 2, 2012 These notes are based on a lecture gven at the Toronto Student Semnar on February 2, 2012. The materal s taen mostly from the boo Algebrac Complexty Theory

More information

Answer: A). There is a flatter IS curve in the high MPC economy. Original LM LM after increase in M. IS curve for low MPC economy

Answer: A). There is a flatter IS curve in the high MPC economy. Original LM LM after increase in M. IS curve for low MPC economy 4.02 Quz Solutons Fall 2004 Multple-Choce Questons (30/00 ponts) Please, crcle the correct answer for each of the followng 0 multple-choce questons. For each queston, only one of the answers s correct.

More information

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING Matthew J. Lberatore, Department of Management and Operatons, Vllanova Unversty, Vllanova, PA 19085, 610-519-4390,

More information

"Research Note" APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES *

Research Note APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES * Iranan Journal of Scence & Technology, Transacton B, Engneerng, ol. 30, No. B6, 789-794 rnted n The Islamc Republc of Iran, 006 Shraz Unversty "Research Note" ALICATION OF CHARGE SIMULATION METHOD TO ELECTRIC

More information

From Selective to Full Security: Semi-Generic Transformations in the Standard Model

From Selective to Full Security: Semi-Generic Transformations in the Standard Model An extended abstract of ths work appears n the proceedngs of PKC 2012 From Selectve to Full Securty: Sem-Generc Transformatons n the Standard Model Mchel Abdalla 1 Daro Fore 2 Vadm Lyubashevsky 1 1 Département

More information

Quantization Effects in Digital Filters

Quantization Effects in Digital Filters Quantzaton Effects n Dgtal Flters Dstrbuton of Truncaton Errors In two's complement representaton an exact number would have nfntely many bts (n general). When we lmt the number of bts to some fnte value

More information

Multiple-Period Attribution: Residuals and Compounding

Multiple-Period Attribution: Residuals and Compounding Multple-Perod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens

More information

1.1 The University may award Higher Doctorate degrees as specified from time-to-time in UPR AS11 1.

1.1 The University may award Higher Doctorate degrees as specified from time-to-time in UPR AS11 1. HIGHER DOCTORATE DEGREES SUMMARY OF PRINCIPAL CHANGES General changes None Secton 3.2 Refer to text (Amendments to verson 03.0, UPR AS02 are shown n talcs.) 1 INTRODUCTION 1.1 The Unversty may award Hgher

More information

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

More information

Chapter 7: Answers to Questions and Problems

Chapter 7: Answers to Questions and Problems 19. Based on the nformaton contaned n Table 7-3 of the text, the food and apparel ndustres are most compettve and therefore probably represent the best match for the expertse of these managers. Chapter

More information

Forecasting the Demand of Emergency Supplies: Based on the CBR Theory and BP Neural Network

Forecasting the Demand of Emergency Supplies: Based on the CBR Theory and BP Neural Network 700 Proceedngs of the 8th Internatonal Conference on Innovaton & Management Forecastng the Demand of Emergency Supples: Based on the CBR Theory and BP Neural Network Fu Deqang, Lu Yun, L Changbng School

More information

A Fast Incremental Spectral Clustering for Large Data Sets

A Fast Incremental Spectral Clustering for Large Data Sets 2011 12th Internatonal Conference on Parallel and Dstrbuted Computng, Applcatons and Technologes A Fast Incremental Spectral Clusterng for Large Data Sets Tengteng Kong 1,YeTan 1, Hong Shen 1,2 1 School

More information

21 Vectors: The Cross Product & Torque

21 Vectors: The Cross Product & Torque 21 Vectors: The Cross Product & Torque Do not use our left hand when applng ether the rght-hand rule for the cross product of two vectors dscussed n ths chapter or the rght-hand rule for somethng curl

More information

DEFINING %COMPLETE IN MICROSOFT PROJECT

DEFINING %COMPLETE IN MICROSOFT PROJECT CelersSystems DEFINING %COMPLETE IN MICROSOFT PROJECT PREPARED BY James E Aksel, PMP, PMI-SP, MVP For Addtonal Informaton about Earned Value Management Systems and reportng, please contact: CelersSystems,

More information

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching)

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching) Face Recognton Problem Face Verfcaton Problem Face Verfcaton (1:1 matchng) Querymage face query Face Recognton (1:N matchng) database Applcaton: Access Control www.vsage.com www.vsoncs.com Bometrc Authentcaton

More information