Learning Permutations with Exponential Weights




Journal of Machine Learning Research 10 (2009) 1705-1736. Submitted 9/08; Published 7/09.

Learning Permutations with Exponential Weights

David P. Helmbold (DPH@CSE.UCSC.EDU)
Manfred K. Warmuth (MANFRED@CSE.UCSC.EDU)
Computer Science Department, University of California, Santa Cruz, Santa Cruz, CA 95064

Editor: Yoav Freund

Abstract

We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala's more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert.

Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing

1. Introduction

Finding a good permutation is a key aspect of many problems, such as the ranking of search results or matching workers to tasks. In this paper we present an efficient and effective on-line algorithm for learning permutations in a model related to the on-line allocation model of learning with experts (Freund and Schapire, 1997). In each trial, the algorithm probabilistically chooses a permutation and then incurs a linear loss based on how appropriate the permutation was for that trial. The regret is the total expected loss of the algorithm on the whole sequence of trials minus the total loss of the best permutation chosen in hindsight for the whole sequence, and the goal is to find algorithms that have provably small worst-case regret.
For example, one could consider a commuter airline which owns n airplanes of various sizes and flies n routes. (We assume that each route starts and ends at the airline's home airport.) Each day the airline must match airplanes to routes. If too small an airplane is assigned to a route then the airline will lose revenue and reputation due to unserved potential passengers. On the other hand, if too large an airplane is used on a long route then the airline could have larger than necessary fuel costs. If the number of passengers wanting each flight were known ahead of time, then choosing an assignment is a weighted matching problem.

(An earlier version of this paper appears in Proceedings of the Twentieth Annual Conference on Computational Learning Theory (COLT 2007), published by Springer as LNAI 4539. Manfred K. Warmuth acknowledges the support of NSF grant IIS 0325363. (c) 2009 David P. Helmbold and Manfred K. Warmuth.)

In the on-line

allocation model, the airline first chooses a distribution over possible assignments of airplanes to routes and then randomly selects an assignment from the distribution. The regret of the airline is the earnings of the single best assignment for the whole sequence of passenger requests minus the total expected earnings of the on-line assignments. When airplanes and routes are each numbered from 1 to n, then an assignment is equivalent to selecting a permutation. The randomness helps protect the on-line algorithm from adversaries and allows one to prove good bounds on the algorithm's regret for arbitrary sequences of requests.

Since there are n! permutations on n elements, it is infeasible to simply treat each permutation as an expert and apply one of the expert algorithms that uses exponential weights. Previous work has exploited the combinatorial structure of other large sets of experts to create efficient algorithms (see Helmbold and Schapire, 1997; Takimoto and Warmuth, 2003; Warmuth and Kuzmin, 2008, for examples). Our solution is to make a simplifying assumption on the loss function which allows the new algorithm, called PermELearn, to maintain a sufficient amount of information about the distribution over n! permutations while using only n^2 weights.

We represent a permutation of n elements as an n x n permutation matrix Π where Π_{i,j} = 1 if the permutation maps element i to position j, and Π_{i,j} = 0 otherwise. As the algorithm randomly selects a permutation Π̂ at the beginning of a trial, an adversary simultaneously selects an arbitrary loss matrix L ∈ [0,1]^{n x n} which specifies the loss of all permutations for the trial. Each entry L_{i,j} of the loss matrix gives the loss for mapping element i to position j, and the loss of any whole permutation is the sum of the losses of the permutation's mappings; that is, the loss of permutation Π is Σ_i L_{i,Π(i)} = Σ_{i,j} Π_{i,j} L_{i,j}. Note that the per-trial expected losses can be as large as n, as opposed to the common assumption for the expert setting that the losses are bounded in [0,1].
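As a concrete sketch of this setup (illustrative names, not the paper's code), the matrix form of a permutation and the two equivalent ways of writing its loss can be checked in a few lines:

```python
def perm_matrix(perm):
    """Entry (i, j) is 1 iff the permutation maps element i to position j."""
    n = len(perm)
    return [[1 if perm[i] == j else 0 for j in range(n)] for i in range(n)]

def dot(A, B):
    """Matrix dot product A . B = sum_{i,j} A[i][j] * B[i][j]."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

L = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.5, 0.0]]                    # a loss matrix in [0,1]^{n x n}
perm = [2, 1, 0]                         # element i is mapped to position perm[i]
loss = sum(L[i][perm[i]] for i in range(3))   # sum_i L_{i, Pi(i)}
assert loss == dot(perm_matrix(perm), L)      # equals the linear form Pi . L
assert loss == 2.0                       # 1.0 + 0.0 + 1.0 for this example

W = [[1 / 3] * 3 for _ in range(3)]      # uniform doubly stochastic mean matrix
assert abs(dot(W, L) - 4 / 3) < 1e-12    # expected loss W . L
```

The last assertion illustrates the point made above: the expected loss of a randomized prediction depends only on the mean matrix W, not on which permutations carry the probability mass.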
In Section 3 we show how a variety of intuitive loss motifs can be expressed in this matrix form. This assumption that the loss has a linear matrix form ensures the expected loss of the algorithm can be expressed as Σ_{i,j} W_{i,j} L_{i,j}, where W = E(Π̂). This expectation W is an n x n weight matrix which is doubly stochastic, that is, it has non-negative entries and the property that every row and column sums to 1. The algorithm's uncertainty about which permutation is the target is summarized by W; each weight W_{i,j} is the probability that the algorithm predicts with a permutation mapping element i to position j. It is worth emphasizing that the W matrix is only a summary of the distribution over permutations used by any algorithm (it doesn't indicate which permutations have non-zero probability, for example). However, this summary is sufficient to determine the algorithm's expected loss when the losses of permutations have the assumed loss matrix form.

Our PermELearn algorithm stores the weight matrix W and must convert W into an efficiently sampled distribution over permutations in order to make predictions. By Birkhoff's Theorem, every doubly stochastic matrix can be expressed as the convex combination of at most n^2 - 2n + 2 permutations (see, e.g., Bhatia, 1997). In Appendix A we show that a greedy matching-based algorithm efficiently decomposes any doubly stochastic matrix into a convex combination of at most n^2 - 2n + 2 permutations. Although the efficacy of this algorithm is implied by standard dimensionality arguments, we give a new combinatorial proof that provides independent insight as to why the algorithm finds a convex combination matching Birkhoff's bound. Our algorithm for learning permutations predicts with a random Π̂ sampled from the convex combination of permutations created by decomposing weight matrix W. It has been applied recently for pricing combinatorial markets when the outcomes are permutations of objects (Chen et al., 2008).
The PermELearn algorithm updates the entries of its weight matrix using exponential factors commonly used for updating the weights of experts in on-line learning algorithms (Littlestone and

Warmuth, 1994; Vovk, 1990; Freund and Schapire, 1997): each entry W_{i,j} is multiplied by a factor e^{-η L_{i,j}}. Here η is a positive learning rate that controls the strength of the update (when η = 0, all the factors are one and the update is vacuous). After this update, the weight matrix no longer has the doubly stochastic property, and the weight matrix must be projected back into the space of doubly stochastic matrices (called Sinkhorn balancing, see Section 4) before the next prediction can be made.

In Theorem 4 we bound the expected loss of PermELearn over any sequence of trials by

    (n ln n + η L_best) / (1 - e^{-η}),     (1)

where n is the number of elements being permuted, η is the learning rate, and L_best is the loss of the best permutation on the entire sequence. If an upper bound L_est ≥ L_best is known, then η can be tuned (as in Freund and Schapire, 1997) and the expected loss bound becomes

    L_best + √(2 L_est n ln n) + n ln n,     (2)

giving a bound of √(2 L_est n ln n) + n ln n on the worst-case expected regret of the tuned PermELearn algorithm. We also prove a matching lower bound (Theorem 6) of Ω(√(L_best n ln n)) for the expected regret of any algorithm solving our permutation learning problem.

A simpler and more efficient algorithm than PermELearn maintains the sum of the loss matrices on the previous trials. Each trial it adds random perturbations to the cumulative loss matrix and then predicts with the permutation having minimum perturbed loss. This Follow the Perturbed Leader algorithm (Kalai and Vempala, 2005) has good regret bounds for many on-line learning settings. However, the regret bound we can obtain for it in the permutation setting is about a factor of n worse than the bound for PermELearn and the lower bound. Although computationally expensive, one can also consider running the Hedge algorithm while explicitly representing each of the n! permutations as an expert.
If T is the sum of the loss matrices over the past trials and F is the n x n matrix with entries F_{i,j} = e^{-η T_{i,j}}, then the weight of each permutation expert Π is proportional to the product ∏_i F_{i,Π(i)} and the normalization constant is the permanent of the matrix F. Calculating the permanent is a known #P-complete problem and sampling from this distribution over permutations is very inefficient (Jerrum et al., 2004). Moreover, since the loss range of a permutation is [0,n], the standard loss bound for the algorithm that uses one expert per permutation must be scaled up by a factor of n, becoming

    L_best + √(2 L_est n ln(n!)) + n ln(n!)  ≤  L_best + √(2 L_est n^2 ln n) + n^2 ln n.

This expected loss bound is similar to our expected loss bound for PermELearn in Equation (2), except that the n ln n terms are replaced by n^2 ln n. Our method based on Sinkhorn balancing bypasses the estimation of permanents, and somehow PermELearn's implicit representation and prediction method exploit the structure of permutations and let us obtain the improved bound. We also give a matching lower bound that shows PermELearn has the optimum regret bound (up to a small constant factor). It is an interesting open question whether the structure of permutations can be exploited to prove bounds like (2) for the Hedge algorithm with one expert per permutation.

PermELearn's weight updates belong to the Exponentiated Gradient family of updates (Kivinen and Warmuth, 1997), since the components L_{i,j} of the loss matrix that appear in the exponential

factor are the derivatives of our linear loss with respect to the weights W_{i,j}. This family of updates usually maintains a probability vector as its weight vector. In that case the normalization of the weight vector is straightforward and is folded directly into the update formula. Our new algorithm PermELearn for learning permutations maintains a doubly stochastic matrix with n^2 weights. The normalization alternately normalizes the rows and columns of the matrix until convergence (Sinkhorn balancing). This may require an unbounded number of steps and the resulting matrix does not have a closed form. Despite this fact, we are able to prove bounds for our algorithm.

We first show that our update minimizes a tradeoff between the loss and a relative entropy between doubly stochastic matrices. This relative entropy becomes our measure of progress in the analysis. Luckily, the un-normalized multiplicative update already makes enough progress (towards the best permutation) to achieve the loss bound quoted above. Finally, we interpret the iterations of Sinkhorn balancing as Bregman projections with respect to the same relative entropy and show, using the properties of Bregman projections, that these projections can only increase the progress and thus don't hurt the analysis (Herbster and Warmuth, 2001).

Our new insight of splitting the update into an un-normalized step followed by a normalization step also leads to a streamlined proof of the loss bound for the Hedge algorithm in the standard expert setting that is interesting in its own right. Since the loss in the allocation setting is linear, the bounds can be proven in many different ways, including potential based methods (see, e.g., Kivinen and Warmuth, 1999; Gordon, 2006; Cesa-Bianchi and Lugosi, 2006). For the sake of completeness we reprove our main loss bound for PermELearn using potential based methods in Appendix B.
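The relative entropy that serves as the measure of progress can be sketched as follows. This is a standard un-normalized relative entropy between non-negative matrices, written here as an assumption; the paper's exact definition appears in its analysis (Section 5), and the function name is illustrative:

```python
import math

def rel_entropy(A, B):
    """Un-normalized relative entropy between non-negative matrices:
    sum_{i,j} (A_ij * ln(A_ij / B_ij) - A_ij + B_ij).  Non-negative,
    and zero exactly when A == B (for strictly positive entries)."""
    return sum(a * math.log(a / b) - a + b
               for ra, rb in zip(A, B) for a, b in zip(ra, rb))

U = [[0.5, 0.5], [0.5, 0.5]]   # uniform doubly stochastic matrix
V = [[0.9, 0.1], [0.1, 0.9]]   # a more concentrated doubly stochastic matrix
assert rel_entropy(U, U) == 0.0
assert rel_entropy(U, V) > 0.0
```

For doubly stochastic arguments the linear correction terms -A_ij + B_ij cancel in total, leaving the familiar sum of per-row relative entropies.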
We show how potential based proof methods can be extended to handle linear equality constraints that don't have a solution in closed form, paralleling a related extension to linear inequality constraints in Kuzmin and Warmuth (2007). In this appendix we also discuss the relationship between the projection and potential based proof methods. In particular, we show how the Bregman projection step corresponds to plugging suboptimal dual variables into the potential.

The remainder of the paper is organized as follows. We introduce our notation in the next section. Section 3 presents the permutation learning model and gives several intuitive examples of appropriate loss motifs. Section 4 gives the PermELearn algorithm and discusses its computational requirements. One part of the algorithm is to decompose the current doubly stochastic matrix into a small convex combination of permutations using a greedy algorithm. The bound on the number of permutations needed to decompose the weight matrix is deferred to Appendix A. We then bound PermELearn's regret in Section 5 in a two-step analysis that uses a relative entropy as a measure of progress. To exemplify the new techniques, we also analyze the basic Hedge algorithm with the same methodology. The regret bounds for Hedge and PermELearn are re-proven in Appendix B using potential based methods. In Section 6, we apply the Follow the Perturbed Leader algorithm to learning permutations and show that the resulting regret bounds are not as good. In Section 7 we prove a lower bound on the regret when learning permutations that is within a small constant factor of our regret bound on the tuned PermELearn algorithm. The concluding section describes extensions and directions for further work.

2. Notation

All matrices will be n x n matrices. When A is a matrix, A_{i,j} denotes the entry of A in row i and column j. We use A • B to denote the dot product between matrices A and B, that is, Σ_{i,j} A_{i,j} B_{i,j}. We use single superscripts (e.g., A^k) to identify matrices/permutations from a sequence.

Permutations on n elements are frequently represented in two ways: as a bijective mapping of the elements {1,...,n} into the positions {1,...,n}, or as a permutation matrix, which is an n x n binary matrix with exactly one 1 in each row and each column. We use the notation Π (and Π̂) to represent a permutation in either format, using the context to indicate the appropriate representation. Thus, for each i ∈ {1,...,n}, we use Π(i) to denote the position that the i-th element is mapped to by permutation Π, and matrix element Π_{i,j} = 1 if Π(i) = j and 0 otherwise. If L is a matrix with n rows then the product ΠL permutes the rows of L. For example, with the permutation (2,4,3,1) in matrix form and an arbitrary matrix L:

    Π = | 0 1 0 0 |    L = | 11 12 13 14 |    ΠL = | 21 22 23 24 |
        | 0 0 0 1 |        | 21 22 23 24 |         | 41 42 43 44 |
        | 0 0 1 0 |        | 31 32 33 34 |         | 31 32 33 34 |
        | 1 0 0 0 |        | 41 42 43 44 |         | 11 12 13 14 |

Convex combinations of permutations create doubly stochastic or balanced matrices: non-negative matrices whose n rows and n columns each sum to one. Our algorithm maintains its uncertainty about which permutation is best as a doubly stochastic weight matrix W and needs to randomly select a permutation from some distribution whose expectation is W. By Birkhoff's Theorem (see, e.g., Bhatia, 1997), for every doubly stochastic matrix W there is a decomposition into a convex combination of at most n^2 - 2n + 2 permutation matrices. We show in Appendix A how a decomposition of this size can be found effectively. This decomposition gives a distribution over permutations whose expectation is W that now can be effectively sampled because its support is at most n^2 - 2n + 2 permutations.

3. On-line Protocol

We are interested in learning permutations in a model related to the on-line allocation model of learning with experts (Freund and Schapire, 1997). In that model there are N experts and at the beginning of each trial the algorithm allocates a probability distribution w over the experts. The algorithm picks expert i with probability w_i and then receives a loss vector ℓ ∈ [0,1]^N. Each expert i incurs loss ℓ_i and the expected loss of the algorithm is w · ℓ.
Finally, the algorithm updates its distribution w for the next trial.

In the case of permutations we could have one expert per permutation and allocate a distribution over the n! permutations. Explicitly tracking this distribution is computationally expensive, even for moderate n. As discussed in the introduction, we assume that the losses in each trial can be specified by a loss matrix L ∈ [0,1]^{n x n} where the loss of each permutation Π has the linear form Σ_i L_{i,Π(i)} = Π • L. If the algorithm's prediction Π̂ is chosen probabilistically in each trial then the algorithm's expected loss is E[Π̂ • L] = W • L, where W = E[Π̂]. This expected prediction W is an n x n doubly stochastic matrix, and algorithms for learning permutations under the linear loss assumption can be viewed as implicitly maintaining such a doubly stochastic weight matrix.

More precisely, the on-line algorithm follows this protocol in each trial:

- The learner (probabilistically) chooses a permutation Π̂, and let W = E(Π̂).
- Nature simultaneously chooses a loss matrix L ∈ [0,1]^{n x n} for the trial.
- At the end of the trial, the algorithm is given L. The loss of Π̂ is Π̂ • L and the expected loss of the algorithm is W • L.

- Finally, the algorithm updates its distribution over permutations for the next trial, implicitly updating matrix W.

Although our algorithm can handle arbitrary sequences of loss matrices L ∈ [0,1]^{n x n}, nature could be significantly more restricted. Many ranking applications have an associated loss motif M, and nature is constrained to choose (row) permutations of M as its loss matrix L. In effect, at each trial nature chooses a "correct" permutation Π and uses the loss matrix L = ΠM. Note that the permutation left-multiplies the loss motif, and thus permutes the rows of M. If nature chooses the identity permutation then the loss matrix L is the motif M itself. When M is known to the algorithm, it suffices to give the algorithm only the permutation Π at the end of the trial, rather than the loss matrix L itself. Figure 1 gives examples of loss motifs.

The last loss in Figure 1 is related to a competitive List Update Problem where an algorithm services requests to a list of n items. In the List Update Problem the cost of a request is the requested item's current position in the list. After each request, the requested item can be moved forward in the list for free, and additional rearrangement can be done at a cost of one per transposition. The goal is for the algorithm to be cost-competitive with the best static ordering of the elements in hindsight. Note that the transposition cost for additional list rearrangement is not represented in the permutation loss motif. Blum et al. (2003) give very efficient algorithms for the List Update Problem that do not do additional rearranging of the list (and thus do not incur the cost neglected by the loss motif). In our notation, their bound has the same form as ours (1), but with the n ln n factors replaced by O(n). However, our lower bound (see Section 7) shows that the n ln n factors in (2) are necessary in the general permutation setting.

Note that many compositions of loss motifs are possible.
For example, given two motifs with their associated losses, any convex combination of the motifs creates a new motif for the same convex combination of the associated losses. Other component-wise combinations of two motifs (such as product or max) can also produce interesting loss motifs, but the combination usually cannot be distributed across the matrix dot-product calculation, and so cannot be expressed as a simple linear function of the original losses.

4. PermELearn Algorithm

Our permutation learning algorithm uses exponential weights and we call it PermELearn. It maintains an n x n doubly stochastic weight matrix W as its main data structure, where W_{i,j} is the probability that PermELearn predicts with a permutation mapping element i to position j. In the absence of prior information it is natural to start with uniform weights, that is, the matrix with 1/n in each entry.

In each trial PermELearn does two things:

1. Choose a permutation Π̂ from some distribution such that E[Π̂] = W.
2. Create a new doubly stochastic matrix W̃ for use in the next trial, based on the current weight matrix W and loss matrix L.

Figure 1 lists example losses L(Π̂, Π), each with its motif matrix M (shown for n = 4):

- the number of elements i where Π̂(i) ≠ Π(i):

      M = | 0 1 1 1 |
          | 1 0 1 1 |
          | 1 1 0 1 |
          | 1 1 1 0 |

- (1/(n-1)) Σ_i |Π̂(i) - Π(i)|, how far the elements are from their correct positions (the division by n-1 ensures that the entries of M are in [0,1]):

      M = (1/3) | 0 1 2 3 |
                | 1 0 1 2 |
                | 2 1 0 1 |
                | 3 2 1 0 |

- (1/(n-1)) Σ_i |Π̂(i) - Π(i)| / Π(i), a position-weighted version of the above emphasizing the early positions in Π:

      M = (1/3) | 0    1    2    3   |
                | 1/2  0    1/2  1   |
                | 2/3  1/3  0    1/3 |
                | 3/4  1/2  1/4  0   |

- the number of elements mapped to the first half by Π but the second half by Π̂, or vice versa:

      M = | 0 0 1 1 |
          | 0 0 1 1 |
          | 1 1 0 0 |
          | 1 1 0 0 |

- the number of elements mapped to the first two positions by Π that fail to appear in the top three positions of Π̂:

      M = | 0 0 0 1 |
          | 0 0 0 1 |
          | 0 0 0 0 |
          | 0 0 0 0 |

- the number of links traversed to find the first element of Π in a list ordered by Π̂ (scaled by 1/(n-1) so that the entries of M lie in [0,1]):

      M = (1/3) | 0 1 2 3 |
                | 0 0 0 0 |
                | 0 0 0 0 |
                | 0 0 0 0 |

Figure 1: Loss motifs.

Choosing a permutation is done by Algorithm 1. The algorithm greedily decomposes W into a convex combination of at most n^2 - 2n + 2 permutations (see Theorem 7), and then randomly selects one of these permutations for the prediction. (The decomposition is usually not unique, and the implementation may have a bias as to exactly which convex combination is chosen.) Our decomposition algorithm uses a temporary matrix A initialized to the weight matrix W. Each iteration of Algorithm 1 finds a permutation Π where each A_{i,Π(i)} > 0. This can be done by finding a perfect matching on the n x n bipartite graph containing the edge (i, j) whenever A_{i,j} > 0. We shall soon see that each matrix A is a constant times a doubly stochastic matrix, so the existence of a suitable permutation Π follows from Birkhoff's Theorem. Given such a permutation Π, the algorithm updates A to A - αΠ where α = min_i A_{i,Π(i)}. The updated matrix A has non-negative entries and has strictly more zeros than the original A. Since the update decreases each row and

column sum by α, and the original matrix W was doubly stochastic, each matrix A will have rows and columns that sum to the same amount. In other words, each matrix A created during Algorithm 1 is a constant times a doubly stochastic matrix, and thus (by Birkhoff's Theorem) is a constant times a convex combination of permutations. After at most n^2 - n iterations the algorithm arrives at a matrix A having exactly n non-zero entries, so this A is a constant times a permutation matrix. Therefore, Algorithm 1 decomposes the original doubly stochastic matrix into the convex combination of (at most) n^2 - n + 1 permutation matrices. The more refined argument in Appendix A shows that Algorithm 1 never uses more than n^2 - 2n + 2 permutations, matching the bound given by Birkhoff's Theorem.

Algorithm 1 PermELearn: Selecting a permutation
Require: a doubly stochastic n x n matrix W
  A := W; q := 0
  repeat
    q := q + 1
    Find a permutation Π^q such that A_{i,Π^q(i)} is positive for each i ∈ {1,...,n}
    α_q := min_i A_{i,Π^q(i)}
    A := A - α_q Π^q
  until all entries of A are zero    {at end of loop W = Σ_{k=1}^q α_k Π^k}
  Randomly select and return a Π̂ ∈ {Π^1,...,Π^q} using probabilities α_1,...,α_q

Algorithm 2 PermELearn: Weight Matrix Update
Require: learning rate η, loss matrix L, and doubly stochastic weight matrix W
  Create W' where each W'_{i,j} = W_{i,j} e^{-η L_{i,j}}     (3)
  Create doubly stochastic W̃ by re-balancing the rows and columns of W' (Sinkhorn balancing) and update W to W̃

Several improvements are possible. In particular, we need not compute each perfect matching from scratch. If only z entries of A are zeroed by a permutation, then that permutation is still a matching of size n - z in the graph for the updated matrix. Thus we need to find only z augmenting paths to complete the perfect matching. The entire process thus requires finding O(n^2) augmenting paths at a cost of O(n^2) each, for a total cost of O(n^4) to decompose weight matrix W into a convex combination of permutations.
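The greedy decomposition of Algorithm 1 can be sketched in Python as follows. The perfect matching on positive entries is found with plain DFS augmenting paths; function names are illustrative, and a production version would use the incremental matching improvement described above:

```python
def positive_matching(A, tol=1e-12):
    """Return perm with A[i][perm[i]] > tol, via augmenting paths on the
    bipartite graph that has an edge (i, j) whenever A[i][j] is positive."""
    n = len(A)
    owner = [-1] * n                    # owner[j] = row currently matched to column j

    def augment(i, visited):
        for j in range(n):
            if A[i][j] > tol and j not in visited:
                visited.add(j)
                if owner[j] == -1 or augment(owner[j], visited):
                    owner[j] = i
                    return True
        return False

    for i in range(n):
        if not augment(i, set()):
            raise ValueError("no perfect matching on positive entries")
    perm = [0] * n
    for j, i in enumerate(owner):
        perm[i] = j
    return perm

def decompose(W, tol=1e-12):
    """Greedily write doubly stochastic W as [(alpha_k, perm_k), ...], as in
    Algorithm 1: repeatedly match, subtract, and record the coefficient."""
    A = [row[:] for row in W]
    parts = []
    while max(max(row) for row in A) > tol:
        perm = positive_matching(A, tol)
        alpha = min(A[i][perm[i]] for i in range(len(A)))
        parts.append((alpha, perm))
        for i in range(len(A)):
            A[i][perm[i]] -= alpha
    return parts
```

Sampling a prediction is then a single draw of Π^k with probability α_k from the returned list, whose length stays within the n^2 - 2n + 2 Birkhoff bound.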
4.1 Updating the Weights

In the second step, Algorithm 2 updates the weight matrix by multiplying each W_{i,j} entry by the factor e^{-η L_{i,j}}. These factors destroy the row and column normalization, so the matrix must be re-balanced to restore the doubly stochastic property. There is no closed form for the normalization step. The standard iterative re-balancing method for non-negative matrices is called Sinkhorn balancing. This method first normalizes each row of the matrix to sum to one, and then normalizes the columns. Since normalizing the columns typically destroys the row normalization, the process must be iterated until convergence (Sinkhorn, 1964).
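A minimal sketch of Algorithm 2, assuming a fixed iteration budget and tolerance (names and stopping rule are illustrative, not the paper's code):

```python
import math

def perm_elearn_update(W, L, eta, max_iters=10_000, tol=1e-9):
    """Multiply in the exponential factors e^{-eta * L_ij}, then Sinkhorn-balance
    by alternating row and column normalization until the row sums survive a
    column step to within tol of 1."""
    n = len(W)
    A = [[W[i][j] * math.exp(-eta * L[i][j]) for j in range(n)] for i in range(n)]
    for _ in range(max_iters):
        for row in A:                               # normalize each row to sum 1
            s = sum(row)
            for j in range(n):
                row[j] /= s
        for j in range(n):                          # normalize each column to sum 1
            s = sum(A[i][j] for i in range(n))
            for i in range(n):
                A[i][j] /= s
        if all(abs(sum(row) - 1.0) < tol for row in A):
            break                                   # (approximately) doubly stochastic
    return A
```

In practice one stops after finitely many iterations; Section 4.2 bounds the extra loss incurred by predicting from an approximately balanced matrix.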

    | 1/2  1/2 |   --Sinkhorn balancing-->   | √2/(1+√2)   1/(1+√2)  |
    | 1/2   1  |                             | 1/(1+√2)   √2/(1+√2)  |

Figure 2: Example where Sinkhorn balancing requires infinitely many steps.

Normalizing the rows corresponds to pre-multiplying by a diagonal matrix. The product of these diagonal matrices thus represents the combined effect of the multiple row normalization steps. Similarly, the combined effect of the column normalization steps can be represented by post-multiplying the matrix by a diagonal matrix. Therefore we get the well known fact that Sinkhorn balancing a matrix A results in a doubly stochastic matrix RAC, where R and C are diagonal matrices. Each entry R_{i,i} is the positive multiplier applied to row i, and each entry C_{j,j} is the positive multiplier of column j needed to convert A into a doubly stochastic matrix.

In Figure 2 we give a rational matrix that balances to an irrational matrix. Since each row and column balancing step creates rationals, Sinkhorn balancing produces irrationals only in the limit (after infinitely many steps). Multiplying a weight matrix from the left and/or right by non-negative diagonal matrices (e.g., row or column normalization) preserves the ratio of product weights between permutations. That is, if Ã = RAC, then for any two permutations Π₁ and Π₂,

    ∏_i Ã_{i,Π₁(i)} / ∏_i Ã_{i,Π₂(i)} = ∏_i (A_{i,Π₁(i)} R_{i,i} C_{Π₁(i),Π₁(i)}) / ∏_i (A_{i,Π₂(i)} R_{i,i} C_{Π₂(i),Π₂(i)}) = ∏_i A_{i,Π₁(i)} / ∏_i A_{i,Π₂(i)},

since each row multiplier R_{i,i} and each column multiplier C_{j,j} appears exactly once in both the numerator and denominator products. Therefore the matrix

    | 1/2  1/2 |
    | 1/2   1  |

must balance to a doubly stochastic matrix

    |  a   1-a |
    | 1-a   a  |

such that the ratio of the product weight between the two permutations (1,2) and (2,1) is preserved. This means (1/2)/(1/4) = a^2/(1-a)^2, and thus a = √2/(1+√2) ≈ 0.586.

This example leads to another important observation: PermELearn's predictions are different from Hedge's when each permutation is treated as an expert. If each permutation is explicitly represented as an expert, then the Hedge algorithm predicts permutation Π with probability proportional to the product weight ∏_i e^{-η Σ_t L^t_{i,Π(i)}}. However, algorithm PermELearn predicts differently.
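The Figure 2 example can be checked numerically with a standalone sketch: the rational matrix converges toward the irrational entry a = √2/(1+√2), while Hedge's product weights on the same matrix give the ratio 2:1.

```python
import math

# Sinkhorn-balance the Figure 2 matrix [[1/2, 1/2], [1/2, 1]].
A = [[0.5, 0.5], [0.5, 1.0]]
for _ in range(500):                      # alternate row and column normalization
    for i in range(2):
        s = A[i][0] + A[i][1]
        A[i][0] /= s; A[i][1] /= s
    for j in range(2):
        s = A[0][j] + A[1][j]
        A[0][j] /= s; A[1][j] /= s

a = math.sqrt(2) / (1 + math.sqrt(2))     # predicted limit of the diagonal entries
assert abs(A[0][0] - a) < 1e-6 and abs(A[1][1] - a) < 1e-6

# Hedge with one expert per permutation uses product weights on the original matrix:
w_id, w_swap = 0.5 * 1.0, 0.5 * 0.5       # permutations (1,2) and (2,1)
assert abs(w_id / (w_id + w_swap) - 2 / 3) < 1e-12
```

The two probabilities (about 0.586 versus 2/3 on the identity permutation) confirm that the two algorithms predict from genuinely different distributions.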
With the weight matrix in Figure 2, Hedge puts probability 2/3 on permutation (1,2) and probability 1/3 on permutation (2,1), while PermELearn puts probability $\frac{\sqrt2}{1+\sqrt2} \approx 0.59$ on permutation (1,2) and probability $\frac{1}{1+\sqrt2} \approx 0.41$ on permutation (2,1).

There has been much written on the balancing of matrices, and we briefly describe only a few of the results here. Sinkhorn showed that this procedure converges and that the RAC balancing of any matrix A into a doubly stochastic matrix is unique (up to canceling multiples of R and C) if it exists³ (Sinkhorn, 1964).

A number of authors consider balancing a matrix A so that the row and column sums are 1 ± ε. Franklin and Lorenz (1989) show that O(length(A)/ε) Sinkhorn iterations suffice, where length(A) is the bit-length of matrix A's binary representation. Kalantari and Khachiyan (1996) show that

3. Some non-negative matrices, like $\bigl(\begin{smallmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{smallmatrix}\bigr)$, cannot be converted into doubly stochastic matrices because of their pattern of zeros. The weight matrices we deal with have strictly positive entries, and thus can always be made doubly stochastic with an RAC balancing.
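The Hedge-versus-PermELearn probabilities in the 2 × 2 example above can be checked numerically (a small sketch of ours; the variable names are not from the paper):

```python
import math

# The balanced matrix of Figure 2 is [[a, 1-a], [1-a, a]] with a = sqrt(2)/(1+sqrt(2)).
a = math.sqrt(2) / (1 + math.sqrt(2))

# Hedge weighs each of the two permutations by its *product* weight.
p_id, p_swap = a * a, (1 - a) * (1 - a)
hedge_id = p_id / (p_id + p_swap)      # a^2/(1-a)^2 = 2, so this is 2/3

# PermELearn decomposes the matrix as a*(identity) + (1-a)*(swap),
# so it plays the identity permutation with probability a ≈ 0.586.
perm_id = a
```

The two distributions differ (2/3 vs. ≈ 0.59), illustrating that PermELearn is not a concise implementation of Hedge over permutations.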

HELMBOLD AND WARMUTH

$O\bigl(n^4 \ln\frac{n}{\varepsilon}\, \ln\frac{1}{\min_{i,j} A_{i,j}}\bigr)$ operations suffice using an interior point method. Linial et al. (2000) give a preprocessing step after which only $O((n/\varepsilon)^2)$ Sinkhorn iterations suffice. They also present a strongly polynomial time iterative procedure requiring $\tilde O(n^7 \log(1/\varepsilon))$ iterations. Balakrishnan et al. (2004) give an interior point method with complexity $O(n^6 \log(n/\varepsilon))$. Finally, Fürer (2004) shows that if the row and column sums of A are $1 \pm \varepsilon$ then every matrix entry changes by at most $\pm n\varepsilon$ when A is balanced to a doubly stochastic matrix.

4.2 Dealing with Approximate Balancing

With slight modifications, Algorithm PermELearn can handle the situation where its weight matrix is imperfectly balanced (and thus not quite doubly stochastic). As before, let W be the fully balanced doubly stochastic weight matrix, but we now assume that only an approximately balanced Ŵ is available to predict from. In particular, we assume that each row and column of Ŵ sums to $1 \pm \varepsilon$ for some $\varepsilon \le \frac13$. Let $s \ge 1-\varepsilon$ be the smallest row or column sum in Ŵ.

We modify Algorithm 1 in two ways. First, A is initialized to $\frac1s \hat W$ rather than W. This ensures every row and column in the initial A sums to at least one, to at most $1+3\varepsilon$, and at least one row or column sums to exactly 1. Second, the loop exits as soon as A has an all-zero row or column. Since the smallest row or column sum starts at 1, is decreased by $\alpha_k$ each iteration k, and ends at zero, we have that $\sum_{k=1}^q \alpha_k = 1$ and the modified Algorithm 1 still outputs a convex combination of permutations $C = \sum_{k=1}^q \alpha_k \Pi_k$. Furthermore, each entry $C_{i,j} \le \frac1s \hat W_{i,j}$. We now bound the additional loss of this modified algorithm.

Lemma 1 If the weight matrix Ŵ is approximately balanced so each row and column sum is in $1 \pm \varepsilon$ (for $\varepsilon \le \frac13$) then the modified Algorithm 1 has an expected loss $C \bullet L$ at most $3n^3\varepsilon$ greater than the expected loss $W \bullet L$ of the original algorithm that uses the completely balanced doubly stochastic matrix W.

Proof Let s be the smallest row or column sum in Ŵ.
Since each row and column sum of $\frac1s \hat W$ lies in $[1, 1+3\varepsilon]$, each entry of $\frac1s \hat W$ is close to the corresponding entry of the fully balanced W. In particular, each $\frac1s \hat W_{i,j} \le W_{i,j} + 3n\varepsilon$ (Fürer, 2004). This allows us to bound the expected loss when predicting with the convex combination C in terms of the expected loss using a decomposition of the perfectly balanced W:

$$C \bullet L \;\le\; \tfrac1s \hat W \bullet L \;=\; \sum_{i,j} \frac{\hat W_{i,j}}{s}\, L_{i,j} \;\le\; \sum_{i,j} (W_{i,j} + 3n\varepsilon)\, L_{i,j} \;\le\; W \bullet L + 3n^3\varepsilon.$$

Therefore the extra loss incurred by using an ε-approximately balanced weight matrix at a particular trial is at most $3n^3\varepsilon$, as desired.
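The decomposition step underlying Algorithm 1 (writing a doubly stochastic matrix as a convex combination of permutations) can be sketched as follows. This is our own illustration for small n: the permutation search is brute-forced with `itertools`, whereas a real implementation would use an O(n³) bipartite matching algorithm.

```python
from itertools import permutations

def decompose(A, cutoff=1e-12):
    """Greedily write A (approximately doubly stochastic) as sum_k alpha_k * Pi_k."""
    n = len(A)
    A = [row[:] for row in A]
    terms = []
    while True:
        # Find a permutation supported entirely on positive entries of A.
        found = None
        for perm in permutations(range(n)):
            if all(A[i][perm[i]] > cutoff for i in range(n)):
                found = perm
                break
        if found is None:
            break
        # Remove the largest coefficient possible along this permutation,
        # zeroing at least one entry of A, so the loop terminates.
        alpha = min(A[i][found[i]] for i in range(n))
        for i in range(n):
            A[i][found[i]] -= alpha
        terms.append((alpha, found))
    return terms
```

The coefficients sum to (approximately) one, so sampling permutation $\Pi_k$ with probability $\alpha_k$ gives a prediction whose expectation is the weight matrix.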

If in a sequence of T trials the matrices Ŵ are $\varepsilon = 1/(3Tn^3)$ balanced (so that each row and column sum is $1 \pm 1/(3Tn^3)$) then Lemma 1 implies that the total additional expected loss for using approximate balancing is at most 1. The algorithm of Balakrishnan et al. (2004) ε-balances a matrix in $O(n^6 \log(n/\varepsilon))$ time (note that this dominates the time for the loss update and constructing the convex combination). This balancing algorithm with $\varepsilon = 1/(3Tn^3)$ together with the modified prediction algorithm give a method requiring $O(Tn^6 \log(Tn))$ total time over the T trials and having a bound of $\sqrt{2 L_{\text{est}}\, n\ln n} + n\ln n + 1$ on the worst-case regret.

If the number of trials T is not known in advance then setting ε as a function of t can be helpful. A natural choice is $\varepsilon_t = 1/(3t^2 n^3)$. In this case the total extra regret for not having perfect balancing is bounded by $\sum_{t=1}^T 1/t^2 \le 5/3$ and the total computation time over the T trials is still bounded by $O(Tn^6 \log(Tn))$.

One might be concerned about the effects of approximate balancing propagating between trials. However this is not an issue. In the following section we show that the loss updates and balancing can be arbitrarily interleaved. Therefore the modified algorithm can either keep a cumulative loss matrix $L^{\le t} = \sum_{t'=1}^{t} L^{t'}$ and create its next Ŵ by (approximately) balancing the matrix with entries $\frac1n e^{-\eta L^{\le t}_{i,j}}$, or apply the multiplicative updates to the previous approximately balanced Ŵ.

5. Bounds for PermELearn

Our analysis of PermELearn follows the entropy-based analysis of the exponentiated gradient family of algorithms (Kivinen and Warmuth, 1997). This style of analysis first shows a per-trial progress bound using relative entropy to a comparator as a measure of progress, and then sums this invariant over the trials to bound the expected total loss of the algorithm.
We also show that PermELearn's weight update belongs to the exponentiated gradient family of updates (Kivinen and Warmuth, 1997) since it is the solution to a minimization problem that trades off the loss (in this case a linear loss) against a relative entropy regularization. Recall that the expected loss of PermELearn on a trial is a linear function of its weight matrix W. Therefore the gradient of the loss is independent of the current value of W. This property of the loss greatly simplifies the analysis. Our analysis for this setting provides a good foundation for learning permutation matrices and lays the groundwork for the future study of other permutation loss functions.

We start our analysis with an attempt to mimic the standard analysis (Kivinen and Warmuth, 1997) for the exponentiated gradient family updates which multiply by exponential factors and renormalize. The per-trial invariant used to analyze the exponentiated gradient family bounds the decrease in relative entropy from any (normalized) vector u to the algorithm's weight vector by a linear combination of the algorithm's loss and the loss of u on the trial. In our case the weight vectors are matrices and we use the following (un-normalized) relative entropy between matrices A and B with non-negative entries:

$$\Delta(A,B) = \sum_{i,j} \Bigl( A_{i,j} \ln \frac{A_{i,j}}{B_{i,j}} + B_{i,j} - A_{i,j} \Bigr).$$

Note that this is just the sum of the relative entropies between the corresponding rows (or equivalently, between the corresponding columns):

$$\Delta(A,B) = \sum_i \Delta(A_{i,\cdot},\, B_{i,\cdot}) = \sum_j \Delta(A_{\cdot,j},\, B_{\cdot,j})$$
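The un-normalized relative entropy above translates directly into code (a sketch of ours; the convention $0 \ln 0 = 0$ handles zero entries of A):

```python
import math

def rel_entropy(A, B):
    """Delta(A, B) = sum_ij A_ij * ln(A_ij / B_ij) + B_ij - A_ij,
    for matrices with non-negative entries (B positive wherever A is)."""
    total = 0.0
    for row_a, row_b in zip(A, B):
        for a, b in zip(row_a, row_b):
            if a > 0:
                total += a * math.log(a / b)   # 0 * ln(0) is taken to be 0
            total += b - a
    return total
```

For example, the divergence from a permutation matrix to the uniform matrix with entries 1/n is $n \ln n$, the quantity that appears in the tuned bounds below.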

(here $A_{i,\cdot}$ is the $i$th row of A and $A_{\cdot,j}$ is its $j$th column).

Unfortunately, the lack of a closed form for the matrix balancing procedure makes it difficult to prove bounds on the loss of the algorithm. Our solution is to break PermELearn's update (Algorithm 2) into two steps, and use only the progress made to the intermediate un-balanced matrix in our per-trial bound (8). After showing that balancing to a doubly stochastic matrix only increases the progress, we can sum the per-trial bound to obtain our main theorem.

5.1 A Dead End

In each trial, PermELearn multiplies each entry of its weight matrix by an exponential factor and then uses one additional factor per row and column to make the matrix doubly stochastic (Algorithm 2 described in Section 4.1):

$$\tilde W_{i,j} := r_i\, c_j\, W_{i,j}\, e^{-\eta L_{i,j}} \qquad (4)$$

where the $r_i$ and $c_j$ factors are chosen so that all rows and columns of the matrix $\tilde W$ sum to one. We now show that PermELearn's update (4) gives the matrix A solving the following minimization problem:

$$\operatorname*{argmin}_{\substack{\forall i:\ \sum_j A_{i,j} = 1 \\ \forall j:\ \sum_i A_{i,j} = 1}} \bigl( \Delta(A,W) + \eta\,(A \bullet L) \bigr). \qquad (5)$$

Since the linear constraints are feasible and the divergence is strictly convex, there always is a unique solution, even though the solution does not have a closed form.

Lemma 2 PermELearn's updated weight matrix $\tilde W$ (4) is the solution of (5).

Proof We form a Lagrangian for the optimization problem:

$$l(A,\rho,\gamma) = \Delta(A,W) + \eta\,(A \bullet L) + \sum_i \rho_i \Bigl(\sum_j A_{i,j} - 1\Bigr) + \sum_j \gamma_j \Bigl(\sum_i A_{i,j} - 1\Bigr).$$

Setting the derivative with respect to $A_{i,j}$ to 0 yields $A_{i,j} = W_{i,j}\, e^{-\eta L_{i,j}}\, e^{-\rho_i}\, e^{-\gamma_j}$. By enforcing the row and column sum constraints we see that the factors $r_i = e^{-\rho_i}$ and $c_j = e^{-\gamma_j}$ function as row and column normalizers, respectively.

We now examine the progress $\Delta(U,W) - \Delta(U,\tilde W)$ towards an arbitrary doubly stochastic matrix U. Using Equation (4) and noting that all three matrices are doubly stochastic (so their entries sum to n), we see that

$$\Delta(U,W) - \Delta(U,\tilde W) = -\eta\, U \bullet L + \sum_i \ln r_i + \sum_j \ln c_j.$$

Making this a useful invariant requires lower bounding the sums on the rhs by a constant times $W \bullet L$, the loss of the algorithm.
Unfortunately we are stuck because the $r_i$ and $c_j$ normalization factors don't even have a closed form.

5.2 Successful Analysis

Our successful analysis splits the update (4) into two steps:

$$W'_{i,j} := W_{i,j}\, e^{-\eta L_{i,j}} \qquad\text{and}\qquad \tilde W_{i,j} := r_i\, c_j\, W'_{i,j}, \qquad (6)$$

where (as before) $r_i$ and $c_j$ are chosen so that each row and column of the matrix $\tilde W$ sums to one. Using the Lagrangian (as in the proof of Lemma 2), it is easy to see that these $W'$ and $\tilde W$ matrices solve the following minimization problems:

$$W' = \operatorname*{argmin}_A \bigl( \Delta(A,W) + \eta\,(A \bullet L) \bigr) \qquad\text{and}\qquad \tilde W := \operatorname*{argmin}_{\substack{\forall i:\ \sum_j A_{i,j} = 1 \\ \forall j:\ \sum_i A_{i,j} = 1}} \Delta(A,W'). \qquad (7)$$

The second problem shows that the doubly stochastic matrix $\tilde W$ is the projection of $W'$ onto the linear row and column sum constraints. The strict convexity of the relative entropy between non-negative matrices and the feasibility of the linear constraints ensure that the solutions for both steps are unique.

We now lower bound the progress $\Delta(U,W) - \Delta(U,W')$ in the following lemma to get our per-trial invariant.

Lemma 3 For any $\eta > 0$, any doubly stochastic matrices U and W and any trial with loss matrix $L \in [0,1]^{n\times n}$,

$$\Delta(U,W) - \Delta(U,W') \ge (1 - e^{-\eta})(W \bullet L) - \eta\,(U \bullet L),$$

where $W'$ is the unbalanced intermediate matrix (6) constructed by PermELearn from W.

Proof The proof manipulates the difference of relative entropies and uses the inequality $e^{-\eta x} \le 1 - (1 - e^{-\eta})x$, which holds for any $\eta$ and any $x \in [0,1]$:

$$\begin{aligned} \Delta(U,W) - \Delta(U,W') &= \sum_{i,j} \Bigl( U_{i,j} \ln \frac{W'_{i,j}}{W_{i,j}} + W_{i,j} - W'_{i,j} \Bigr) \\ &= \sum_{i,j} \Bigl( U_{i,j} \ln\bigl(e^{-\eta L_{i,j}}\bigr) + W_{i,j} - W_{i,j}\, e^{-\eta L_{i,j}} \Bigr) \\ &\ge \sum_{i,j} \Bigl( -\eta L_{i,j} U_{i,j} + W_{i,j} - W_{i,j}\bigl(1 - (1 - e^{-\eta}) L_{i,j}\bigr) \Bigr) \\ &= -\eta\,(U \bullet L) + (1 - e^{-\eta})(W \bullet L). \end{aligned}$$

Relative entropy is a Bregman divergence, so the Generalized Pythagorean Theorem (Bregman, 1967) applies. Specialized to our setting, this theorem states that if S is a closed convex set containing some matrix U with non-negative entries, $W'$ is any matrix with strictly positive entries, and $\tilde W$ is the relative entropy projection of $W'$ onto S, then

$$\Delta(U,W') \ge \Delta(U,\tilde W) + \Delta(\tilde W,W').$$
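The key inequality in the proof of Lemma 3 is just convexity of $e^{-\eta x}$, which lies below its chord between $x = 0$ and $x = 1$. A quick numeric sanity check of ours:

```python
import math

def check_inequality(eta, steps=1000):
    """Verify e^{-eta*x} <= 1 - (1 - e^{-eta})*x on a grid of x in [0, 1]."""
    for k in range(steps + 1):
        x = k / steps
        if math.exp(-eta * x) > 1 - (1 - math.exp(-eta)) * x + 1e-12:
            return False
    return True
```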

Furthermore, this holds with equality when S is affine, which is the case here since S is the set of matrices whose rows and columns each sum to 1. Rearranging and noting that $\Delta(A,B)$ is non-negative yields Corollary 3 of Herbster and Warmuth (2001), which is the inequality we need:

$$\Delta(U,W') - \Delta(U,\tilde W) = \Delta(\tilde W,W') \ge 0.$$

Combining this with the inequality of Lemma 3 gives the critical per-trial invariant:

$$\Delta(U,W) - \Delta(U,\tilde W) \ge (1 - e^{-\eta})(W \bullet L) - \eta\,(U \bullet L). \qquad (8)$$

We now introduce some notation and bound the expected total loss by summing the above inequality over a sequence of trials. When considering a sequence of trials, $L^t$ is the loss matrix at trial t, $W^{t-1}$ is PermELearn's weight matrix W at the start of trial t (so $W^0$ is the initial weight matrix) and $W^t$ is the updated weight matrix $\tilde W$ at the end of the trial.

Theorem 4 For any learning rate $\eta > 0$, any doubly stochastic matrices U and initial $W^0$, and any sequence of T trials with loss matrices $L^t \in [0,1]^{n\times n}$ (for $1 \le t \le T$), the expected loss of PermELearn is bounded by:

$$\sum_{t=1}^T W^{t-1} \bullet L^t \;\le\; \frac{\Delta(U,W^0) - \Delta(U,W^T) + \eta \sum_{t=1}^T U \bullet L^t}{1 - e^{-\eta}}.$$

Proof Applying (8) to trial t gives:

$$\Delta(U,W^{t-1}) - \Delta(U,W^t) \ge (1 - e^{-\eta})(W^{t-1} \bullet L^t) - \eta\,(U \bullet L^t).$$

By summing the above over all T trials we get:

$$\Delta(U,W^0) - \Delta(U,W^T) \ge (1 - e^{-\eta}) \sum_{t=1}^T W^{t-1} \bullet L^t - \eta \sum_{t=1}^T U \bullet L^t.$$

The bound then follows by solving for the total expected loss, $\sum_{t=1}^T W^{t-1} \bullet L^t$, of the algorithm.

When the entries of $W^0$ are all initialized to $\frac1n$ and U is a permutation then $\Delta(U,W^0) = n\ln n$. Since each doubly stochastic matrix U is a convex combination of permutation matrices, at least one minimizer of the total loss $\sum_{t=1}^T U \bullet L^t$ will be a permutation matrix. If $L_{\text{best}}$ denotes the loss of such a permutation U, then Theorem 4 implies that the total loss of the algorithm is bounded by

$$\frac{\Delta(U,W^0) + \eta L_{\text{best}}}{1 - e^{-\eta}}.$$

If upper bounds $\Delta(U,W^0) \le D_{\text{est}} \le n\ln n$ and $L_{\text{est}} \ge L_{\text{best}}$ are known, then by choosing $\eta = \ln\bigl(1 + \sqrt{2D_{\text{est}}/L_{\text{est}}}\bigr)$, the above bound becomes (Freund and Schapire, 1997):

$$L_{\text{best}} + \sqrt{2 L_{\text{est}} D_{\text{est}}} + \Delta(U,W^0). \qquad (9)$$

A natural choice for $D_{\text{est}}$ is $n\ln n$. In this case the tuned bound becomes $L_{\text{best}} + \sqrt{2 L_{\text{est}}\, n\ln n} + n\ln n$.

5.3 Approximate Balancing

The preceding analysis assumes that PermELearn's weight matrix is perfectly balanced each iteration. However, balancing techniques are only capable of approximately balancing the weight matrix in finite time, so implementations of PermELearn must handle approximately balanced matrices. In Section 4.2, we describe an implementation that uses an approximately balanced $\hat W^{t-1}$ at the start of iteration t rather than the completely balanced $W^{t-1}$ of the preceding analysis. Lemma 1 shows that when this implementation of PermELearn uses an approximately balanced $\hat W^{t-1}$ where each row and column sum is in $1 \pm \varepsilon_t$, then the expected loss on trial t is at most $W^{t-1} \bullet L^t + 3n^3\varepsilon_t$. Summing over all trials and using Theorem 4, this implementation's total loss is at most

$$\sum_{t=1}^T \bigl( W^{t-1} \bullet L^t + 3n^3\varepsilon_t \bigr) \;\le\; \frac{\Delta(U,W^0) - \Delta(U,W^T) + \eta \sum_{t=1}^T U \bullet L^t}{1 - e^{-\eta}} + \sum_{t=1}^T 3n^3\varepsilon_t.$$

As discussed in Section 4.2, setting $\varepsilon_t = 1/(3n^3 t^2)$ leads to an additional loss of less than 5/3 over the bound of Theorem 4 and its subsequent tunings while incurring a total running time (over all T trials) in $O(Tn^6 \log(Tn))$. In fact, the additional loss for approximate balancing can be made less than any positive c by setting $\varepsilon_t = c/(5n^3 t^2)$. Since the time to approximately balance depends only logarithmically on $1/\varepsilon$, the total time taken over T trials remains in $O(Tn^6 \log(Tn))$.

5.4 Split Analysis for the Hedge Algorithm

Perhaps the simplest case where the loss is linear in the parameter vector is the on-line allocation setting of Freund and Schapire (1997). It is instructive to apply our method of splitting the update in this simpler setting. There are N experts and the algorithm keeps a probability distribution w over the experts. In each trial the algorithm picks expert i with probability $w_i$ and then gets a loss vector $l \in [0,1]^N$. Each expert i incurs loss $l_i$ and the algorithm's expected loss is $w \cdot l$. Finally w is updated to $\tilde w$ for the next trial. The Hedge algorithm (Freund and Schapire, 1997) updates its weight vector to

$$\tilde w_i = \frac{w_i\, e^{-\eta l_i}}{\sum_j w_j\, e^{-\eta l_j}}.$$
This update can be motivated by a tradeoff between the un-normalized relative entropy to the old weight vector and the expected loss in the last trial (Kivinen and Warmuth, 1999):

$$\tilde w := \operatorname*{argmin}_{\sum_i \hat w_i = 1} \bigl( \Delta(\hat w, w) + \eta\, \hat w \cdot l \bigr).$$

For vectors, the relative entropy is simply $\Delta(\hat w, w) := \sum_i \hat w_i \ln \frac{\hat w_i}{w_i} + w_i - \hat w_i$. As in the permutation case, we can split this update (and motivation) into two steps: setting each $w'_i = w_i\, e^{-\eta l_i}$ and then $\tilde w = w' / \sum_i w'_i$. These are the solutions to:

$$w' := \operatorname*{argmin}_{\hat w} \bigl( \Delta(\hat w, w) + \eta\, \hat w \cdot l \bigr) \qquad\text{and}\qquad \tilde w := \operatorname*{argmin}_{\sum_i \hat w_i = 1} \Delta(\hat w, w').$$
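The split update is straightforward in code (a sketch of ours): a multiplicative loss step producing the un-normalized $w'$, followed by the projection step, which here is a single normalization.

```python
import math

def hedge_split(w, losses, eta):
    """Two-step (split) Hedge update: loss step, then normalization step."""
    w_prime = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]  # loss update
    total = sum(w_prime)
    w_tilde = [x / total for x in w_prime]                             # projection
    return w_prime, w_tilde
```

By construction `w_tilde` coincides with the usual one-step Hedge update; the point of the split is that the progress bound already holds for `w_prime`.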

The following lower bound has been shown on the progress towards any probability vector u serving as a comparator:⁴

$$\begin{aligned} \Delta(u,w) - \Delta(u,\tilde w) &= -\eta\, u \cdot l - \ln \sum_i w_i\, e^{-\eta l_i} \\ &\ge -\eta\, u \cdot l - \ln \sum_i w_i \bigl(1 - (1 - e^{-\eta})\, l_i\bigr) \\ &\ge -\eta\, u \cdot l + (w \cdot l)(1 - e^{-\eta}), \end{aligned} \qquad (10)$$

where the first inequality uses $e^{-\eta x} \le 1 - (1 - e^{-\eta})x$, for any $x \in [0,1]$, and the second uses $\ln(1-x) \le -x$, for $x \in [0,1]$. Surprisingly the same inequality already holds for the un-normalized update:⁵

$$\Delta(u,w) - \Delta(u,w') = -\eta\, u \cdot l + \sum_i w_i \bigl(1 - e^{-\eta l_i}\bigr) \ge (w \cdot l)(1 - e^{-\eta}) - \eta\, u \cdot l.$$

Since the normalization is a projection w.r.t. a Bregman divergence onto a linear constraint satisfied by the comparator u, $\Delta(u,w') - \Delta(u,\tilde w) \ge 0$ by the Generalized Pythagorean Theorem (Herbster and Warmuth, 2001). The total progress for both steps is again Inequality (10).

With the key Inequality (10) in hand, it is easy to introduce trial dependent notation and sum over trials (as done in the proof of Theorem 4), arriving at the familiar bound for Hedge (Freund and Schapire, 1997): For any $\eta > 0$, any probability vectors $w^0$ and u, and any loss vectors $l^t \in [0,1]^n$,

$$\sum_{t=1}^T w^{t-1} \cdot l^t \;\le\; \frac{\Delta(u,w^0) - \Delta(u,w^T) + \eta \sum_{t=1}^T u \cdot l^t}{1 - e^{-\eta}}. \qquad (11)$$

Note that the r.h.s. is actually constant in the comparator u (Kivinen and Warmuth, 1999), that is, for all u,

$$\frac{\Delta(u,w^0) - \Delta(u,w^T) + \eta \sum_{t=1}^T u \cdot l^t}{1 - e^{-\eta}} = \frac{-\ln \sum_i w^0_i\, e^{-\eta\, l^{\le T}_i}}{1 - e^{-\eta}}.$$

The r.h.s. of the above equality is often used as a potential in proving bounds for expert algorithms. We discuss this further in Appendix B.

5.5 When to Normalize?

Probably the most surprising aspect about the proof methodology is the flexibility about how and when to project onto the constraints. Instead of projecting a nonnegative matrix onto all 2n constraints at once (as in optimization problem (7)), we could mimic the Sinkhorn balancing algorithm by first projecting onto the row constraints and then the column constraints and alternating until convergence. The Generalized Pythagorean Theorem shows that projecting onto any convex constraint that is satisfied by the comparator class of doubly stochastic matrices brings the weight matrix closer to every doubly stochastic matrix.
Therefore our bound on $\sum_t W^{t-1} \bullet L^t$ (Theorem 4) holds if the exponential updates are interleaved with any sequence of projections⁶ to some subsets of the

4. This is essentially Lemma 5.2 of Littlestone and Warmuth (1994). The reformulation of this type of inequality with relative entropies goes back to Kivinen and Warmuth (1999).

5. Note that if the algorithm does not normalize the weights then w is no longer a distribution. When $\sum_i w_i < 1$, the loss $w \cdot l$ amounts to incurring 0 loss with probability $1 - \sum_i w_i$, and predicting as expert i with probability $w_i$.

6. There is a large body of work on finding a solution subject to constraints via iterated Bregman projections (see, e.g., Censor and Lent, 1981).

constraints. However, if the normalization constraints are not enforced then W is no longer a convex combination of permutations. Furthermore, the exponential update factors only decrease the entries of W and without any normalization all of the entries of W can get arbitrarily small. If this is allowed to happen then the loss $W \bullet L$ can approach 0 for any loss matrix, violating the spirit of the prediction model.

There is a direct argument that shows that the same final doubly stochastic matrix is reached if we interleave the exponential updates with projections to any of the constraints as long as all 2n constraints hold at the end. To see this we partition the class of matrices with positive entries into equivalence classes. Call two such matrices A and B equivalent if there are diagonal matrices R and C with positive diagonal entries such that B = RAC. Note that $[RAC]_{i,j} = R_{i,i}\, A_{i,j}\, C_{j,j}$ and therefore B is just a rescaled version of A. Projecting onto any row and/or column sum constraints amounts to pre- and/or post-multiplying the matrix by some positive diagonal matrices R and C. Therefore if matrices A and B are equivalent then the projection of A (or B) onto a set of row and/or column sum constraints results in another matrix equivalent to both A and B. The importance of equivalent matrices is that they balance to the same doubly stochastic matrix.

Lemma 5 For any two equivalent matrices A and RAC, where the entries of A and the diagonal entries of R and C are positive,

$$\operatorname*{argmin}_{\substack{\forall i:\ \sum_j \hat A_{i,j} = 1 \\ \forall j:\ \sum_i \hat A_{i,j} = 1}} \Delta(\hat A,\, A) \;=\; \operatorname*{argmin}_{\substack{\forall i:\ \sum_j \hat A_{i,j} = 1 \\ \forall j:\ \sum_i \hat A_{i,j} = 1}} \Delta(\hat A,\, RAC).$$

Proof The strict convexity of the relative entropy implies that both problems have a unique matrix as their solution. We will now reason that the unique solutions for both problems are the same. By using a Lagrangian (as in the proof of Lemma 2) we see that the solution of the left optimization problem is a square matrix with $\dot r_i\, A_{i,j}\, \dot c_j$ in position $i,j$. Similarly the solution of the problem on the right has $\bar r_i\, R_{i,i}\, A_{i,j}\, C_{j,j}\, \bar c_j$ in position $i,j$.
Here the factors $\dot r_i, \bar r_i$ function as row normalizers and $\dot c_j, \bar c_j$ as column normalizers. Given a solution $\dot r_i, \dot c_j$ to the left problem, then $\dot r_i / R_{i,i},\ \dot c_j / C_{j,j}$ is a solution of the right problem of the same value. Also if $\bar r_i, \bar c_j$ is a solution of the right problem, then $\bar r_i R_{i,i},\ \bar c_j C_{j,j}$ is a solution to the left problem of the same value. This shows that both minimization problems have the same value and the matrix solutions for both problems are the same and unique (even though the normalization factors $\dot r_i, \dot c_j$ of say the left problem are not necessarily unique). Note that it is crucial for the above argument that the diagonal entries of R and C are positive.

The analogous phenomenon is much simpler in the weighted majority case: Two non-negative vectors a and b are equivalent if a = cb, where c is any nonnegative scalar, and again each equivalence class has exactly one normalized weight vector.

PermELearn's intermediate matrix $W'_{i,j} := W_{i,j}\, e^{-\eta L_{i,j}}$ can be written $W \odot M$, where $\odot$ denotes the Hadamard (entry-wise) product and $M_{i,j} = e^{-\eta L_{i,j}}$. Note that the Hadamard product commutes with matrix multiplication by diagonal matrices: if C is diagonal and $P = (A \odot B)C$ then $P_{i,j} = (A_{i,j} B_{i,j}) C_{j,j} = (A_{i,j} C_{j,j}) B_{i,j}$, so we also have $P = (AC) \odot B$. Similarly, $R(A \odot B) = (RA) \odot B$ when R is diagonal.
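The commuting identity $R(A \odot B) = (RA) \odot B$ is entry-wise arithmetic and easy to check numerically (a small sketch of ours):

```python
def hadamard(A, B):
    """Entry-wise (Hadamard) product of two same-shape matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale_rows(r, A):
    """Left-multiplication by the diagonal matrix diag(r): scales row i by r[i]."""
    return [[r[i] * x for x in A[i]] for i in range(len(A))]
```

Both orders of operations scale entry $(i,j)$ by the same factor $r_i\, A_{i,j}\, B_{i,j}$, which is exactly why row/column projections preserve equivalence classes under the exponential update.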

Hadamard products also preserve equivalence. For equivalent matrices A and B = RAC (for diagonal R and C) the matrices $A \odot M$ and $B \odot M$ are equivalent (although they are not likely to be equivalent to A and B) since $B \odot M = (RAC) \odot M = R(A \odot M)C$. This means that any two runs of PermELearn-like algorithms that have the same bag of loss matrices and equivalent initial matrices end with equivalent final matrices even if they project onto different subsets of the constraints at the end of the various trials.

In summary, the proof method discussed so far uses a relative entropy as a measure of progress and relies on Bregman projections as its fundamental tool. In Appendix B we re-derive the bound for PermELearn using the value of the optimization problem (5) as a potential. This value is expressed using the dual optimization problem and intuitively the application of the Generalized Pythagorean Theorem now is replaced by plugging in a non-optimal choice for the dual variables. Both proof techniques are useful.

5.6 Learning Mappings

We have an algorithm that has small regret against the best permutation. Permutations are a subset of all mappings from {1,...,n} to {1,...,n}. We continue using Π for a permutation and introduce Ψ to denote an arbitrary mapping from {1,...,n} to {1,...,n}. Mappings differ from permutations in that the n dimensional vector $(\Psi(i))_{i=1}^n$ can have repeats, that is, $\Psi(i)$ might equal $\Psi(j)$ for $i \neq j$. Again we alternately represent a mapping Ψ as an n × n matrix where $\Psi_{i,j} = 1$ if $\Psi(i) = j$ and 0 otherwise. Note that such square⁷ mapping matrices have the special property that they have exactly one 1 in each row. Again the loss is specified by a loss matrix L and the loss of mapping Ψ is $\Psi \bullet L$.

It is straightforward to design an algorithm MapELearn for learning mappings with exponential weights: Simply run n independent copies of the Hedge algorithm, one for each of the n rows of the received loss matrices. That is, the r-th copy of Hedge always receives the r-th row of the loss matrix L as its loss vector.
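The row-wise Hedge construction above can be sketched directly (our own illustration; the paper describes MapELearn only in prose):

```python
import math

def mapelearn_update(W, L, eta):
    """MapELearn update: run one Hedge step per row.
    W is singly stochastic (each row sums to 1); L is the loss matrix."""
    new_W = []
    for row_w, row_l in zip(W, L):
        updated = [w * math.exp(-eta * l) for w, l in zip(row_w, row_l)]
        z = sum(updated)
        # One normalization per row suffices -- no iterative balancing needed.
        new_W.append([u / z for u in updated])
    return new_W
```

Prediction is equally simple: for each row i, sample $\Psi(i)$ from the row distribution $W_{i,\cdot}$, so that $\mathbb{E}(\Psi) = W$.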
Even though learning mappings is easy, it is nevertheless instructive to discuss the differences with PermELearn. Note that MapELearn's combined weight matrix is now a convex combination of mappings, that is, a singly stochastic matrix with the constraint that each row sums to one. Again, after the exponential update (3), the constraints are typically not satisfied any more, but they can be easily re-established by simply normalizing each row. The row normalization only needs to be done once in each trial: no iterative process is needed. Furthermore, no fancy decomposition algorithm is needed in MapELearn: for a (singly) stochastic weight matrix W, the prediction Ψ(i) is simply a random element chosen from the row distribution $W_{i,\cdot}$. This sampling procedure produces a mapping Ψ such that $W = \mathbb{E}(\Psi)$ and thus $\mathbb{E}(\Psi \bullet L) = W \bullet L$, as needed.

We can use the same relative entropy between the singly stochastic matrices, and the lower bound on the progress for the exponential update given in Lemma 3 still holds. Also our main bound (Theorem 4) is still true for MapELearn and we arrive at the same tuned bound for the total loss of MapELearn:

$$L_{\text{best}} + \sqrt{2 L_{\text{est}} D_{\text{est}}} + \Delta(U,W^0),$$

where $L_{\text{best}}$, $L_{\text{est}}$, and $D_{\text{est}}$ are now the total loss of the best mapping, a known upper bound on $L_{\text{best}}$, and an upper bound on $\Delta(U,W^0)$, respectively. Recall that $L_{\text{est}}$ and $D_{\text{est}}$ are needed to tune the η parameter.

7. In the case of mappings the restriction to square matrices is not essential.

Our algorithm PermELearn for permutations may be seen as the above algorithm for mappings while enforcing the column sum constraints in addition to the row constraints used in MapELearn. Since PermELearn's row balancing messes up the column sums and vice versa, an iterative procedure (i.e., Sinkhorn balancing) is needed to create a matrix in which each row and column sums to one. The enforcement of the additional column sum constraints results in a doubly stochastic matrix, an apparently necessary step to produce predictions that are permutations (and an expected prediction equal to the doubly stochastic weight matrix).

When it is known that the comparator is a permutation, then the algorithm always benefits from enforcing the additional column constraints. In general we should always make use of any constraints that the comparator is known to satisfy (see, e.g., Warmuth and Vishwanathan, 2005, for a discussion of this).

As discussed in Section 4.1, if $\tilde A$ is a Sinkhorn-balanced version of a non-negative matrix A, then for any permutations $\Pi_1$ and $\Pi_2$,

$$\frac{\prod_i \tilde A_{i,\Pi_1(i)}}{\prod_i \tilde A_{i,\Pi_2(i)}} = \frac{\prod_i A_{i,\Pi_1(i)}}{\prod_i A_{i,\Pi_2(i)}}. \qquad (12)$$

An analogous invariant holds for mappings: If $\tilde A$ is a row-balanced version of a non-negative matrix A, then for any mappings $\Psi_1$ and $\Psi_2$,

$$\frac{\prod_i \tilde A_{i,\Psi_1(i)}}{\prod_i \tilde A_{i,\Psi_2(i)}} = \frac{\prod_i A_{i,\Psi_1(i)}}{\prod_i A_{i,\Psi_2(i)}}.$$

However it is important to note that column balancing does not preserve the above invariant for mappings. In fact, permutations are the subclass of mappings where invariant (12) holds.

There is another important difference between PermELearn and MapELearn. For MapELearn, the probability of predicting mapping Ψ with weight matrix W is always the product $\prod_i W_{i,\Psi(i)}$. The analogous property does not hold for PermELearn. Consider the balanced 2 × 2 weight matrix W on the right of Figure 2. This matrix decomposes into $\frac{\sqrt2}{1+\sqrt2}$ times the permutation (1,2) plus $\frac{1}{1+\sqrt2}$ times the permutation (2,1). Thus the probability of predicting with permutation (1,2) is $\sqrt2$ times the probability of permutation (2,1) for the PermELearn algorithm.
However, when the probabilities are proportional to the intuitive product form $\prod_i W_{i,\Pi(i)}$, then the probability ratio for these two permutations is 2. Notice that this intuitive product weight measure is the distribution used by the Hedge algorithm that explicitly treats each permutation as a separate expert. Therefore PermELearn is clearly different from a concise implementation of Hedge for permutations.

6. Follow the Perturbed Leader Algorithm

Perhaps the simplest on-line algorithm is the Follow the Leader (FL) algorithm: at each trial predict with one of the best models on the data seen so far. Thus FL predicts at trial t with an expert in $\operatorname{argmin}_i l^{<t}_i$ or any permutation in $\operatorname{argmin}_\Pi \Pi \bullet L^{<t}$, where the superscript $<t$ indicates that we sum over the past trials, that is, $l^{<t} := \sum_{q=1}^{t-1} l^q$. The FL algorithm is clearly non-optimal; in the expert setting there is a simple adversary strategy that forces FL to have loss at least n times larger than the loss of the best expert in hindsight.

The expected total loss of tuned Hedge is one times the loss of the best expert plus lower order terms. Hedge achieves this by randomly choosing experts. The probability $w^{t-1}_i$ for choosing expert i at trial t is proportional to $e^{-\eta l^{<t}_i}$. As the learning rate $\eta \to \infty$, Hedge becomes FL (when there are

no ties) and the same holds for PermELearn. Thus the exponential weights with moderate η may be seen as a soft min calculation: the algorithm hedges its bets and does not put all its probability on the expert with minimum loss so far.

The Follow the Perturbed Leader (FPL) algorithm of Kalai and Vempala (2005) is an alternate on-line prediction algorithm that works in a very general setting. It adds random perturbations to the total losses of the experts incurred so far and then predicts with the expert of minimum perturbed loss. Their FPL algorithm has bounds closely related to Hedge and other multiplicative weight algorithms and in some cases Hedge can be simulated exactly (Kuzmin and Warmuth, 2005) by judiciously choosing the distribution of perturbations. However, for the permutation problem the bounds we were able to obtain for FPL are weaker than the bounds we obtained for PermELearn, which uses exponential weights, despite the apparent similarity between our representations and the general formulation of FPL.

The FPL setting uses an abstract k-dimensional decision space used to encode predictors as well as a k-dimensional state space used to represent the losses of the predictors. At any trial, the current loss of a particular predictor is the dot product between that predictor's representation in the decision space and the state-space vector for the trial. This general setting can explicitly represent each permutation and its loss when k = n!. The FPL setting also easily handles the encodings of permutations and losses used by PermELearn by representing each permutation matrix Π and loss matrix L as n²-dimensional vectors.

The FPL algorithm (Kalai and Vempala, 2005) takes a parameter ε and maintains a cumulative loss matrix C (initially C is the zero matrix). At each trial, FPL:

1. Generates a random perturbation matrix P where each $P_{i,j}$ is proportional to $\pm r_{i,j}$, where $r_{i,j}$ is drawn from the standard exponential distribution.

2. Predicts with a permutation Π minimizing $\Pi \bullet (C + P)$.

3. After getting the loss matrix L, updates C to C + L.
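The three steps above can be sketched as follows for small n. This is our own illustration: the minimization is brute-forced with `itertools`, where a real implementation would use an O(n³) minimum-weight bipartite matching, and the perturbation scale is an assumption of ours rather than the paper's exact tuning.

```python
import random
from itertools import permutations

def fpl_predict(C, epsilon, rng=random):
    """One FPL prediction: perturb the cumulative loss matrix C with two-sided
    exponential noise, then return the permutation of minimum perturbed loss."""
    n = len(C)
    P = [[rng.choice([-1, 1]) * rng.expovariate(epsilon) for _ in range(n)]
         for _ in range(n)]
    perturbed = [[C[i][j] + P[i][j] for j in range(n)] for i in range(n)]
    # Brute-force argmin over permutations (fine for small n only).
    return min(permutations(range(n)),
               key=lambda perm: sum(perturbed[i][perm[i]] for i in range(n)))
```

With large `epsilon` the perturbations are tiny and FPL behaves like Follow the Leader; smaller `epsilon` hedges more.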
Note that FPL is more computationally efficient than PermELearn. It takes only $O(n^3)$ time to make its prediction (the time to compute a minimum weight bipartite matching) and only $O(n^2)$ time to update C. Unfortunately the generic FPL loss bounds are not as good as the bounds on PermELearn. In particular, they show that the loss of FPL on any sequence of trials is at most⁸

$$(1+\varepsilon)\, L_{\text{best}} + \frac{8n^3(1+\ln n)}{\varepsilon},$$

where ε is a parameter of the algorithm. When the loss of the best expert is known ahead of time, ε can be tuned and the bound becomes

$$L_{\text{best}} + 4\sqrt{2 L_{\text{best}}\, n^3 (1+\ln n)} + 8n^3(1+\ln n).$$

Although FPL gets the same $L_{\text{best}}$ leading term, the excess loss over the best permutation grows as $n^3 \ln n$ rather than the $n \ln n$ growth of PermELearn's bound. Of course, PermELearn pays for the improved bound by requiring more computation.

8. The $n^3$ terms in the bounds for FPL are n times the sum of the entries in the loss matrix. So if the application has a loss motif whose entries sum to only n, then the $n^3$ factors become $n^2$.

It is important to note that Kalai and Vempala also present a refined analysis of FPL when the perturbed leader changes only rarely. This analysis leads to bounds that are similar to the bounds given by the entropic analysis of the Hedge algorithm (although the constant on the square-root term is not quite as good). However, this refined analysis cannot be directly applied with the efficient representations of permutations because the total perturbations associated with different permutations are no longer independent exponentials. We leave the adaptation of the refined analysis to the permutation case as an open problem.

7. Lower Bounds

In this section we prove lower bounds on the worst-case regret of any algorithm for our permutation learning problem by reducing the expert allocation problem for n experts with loss range [0, n] to the permutation learning problem. We then show in Appendix C a lower bound for this n-expert allocation problem that uses a known lower bound in the expert advice setting with losses in [0,1].

For the reduction we choose any set of n permutations $\{\Pi^1, \ldots, \Pi^n\}$ that use disjoint positions, that is, $\sum_{i=1}^n \Pi^i$ is the n × n matrix of all ones. Using disjoint positions ensures that the losses of these n permutations can be set independently. Each $\Pi^i$ matrix in this set corresponds to the i-th expert in the n-expert allocation problem. To simulate an n-expert trial with loss vector $l \in [0,n]^n$ we use a loss matrix L such that $\Pi^i \bullet L = l_i$. This is done by setting all entries in $\{L_{q,\Pi^i(q)} : 1 \le q \le n\}$ to $l_i/n \in [0,1]$; that is, $L = \sum_i \Pi^i\, (l_i/n)$. Now for any doubly stochastic matrix W,

$$W \bullet L = \sum_i \frac{\Pi^i \bullet W}{n}\, l_i.$$

Note that the n-dimensional vector with the components $(\Pi^i \bullet W)/n$ is a probability vector, and therefore any algorithm for the n-element permutation problem can be used as an algorithm for the n-expert allocation problem with losses in the range [0,n]. Thus any lower bound for the latter model is also a lower bound on the n-element permutation problem.

We first prove a lower bound for the case when at least one expert has loss zero for the entire sequence of trials. If the algorithm allocates any weight to experts that have already incurred positive loss, then the adversary can assign loss only to those experts and force the algorithm to increase its expected loss without reducing the number of experts of loss zero. Thus we can assume w.l.o.g. that the algorithm allocates positive weight only to experts of zero loss. The algorithm minimizes its expected loss and the adversary maximizes it. We get a lower bound by fixing the adversary: This adversary assigns loss n to one of the experts which received the highest probability from the algorithm, and all other experts are assigned loss zero. Clearly the optimal allocation against such an adversary uses the uniform distribution over those experts with zero loss. The number of experts with loss zero is reduced by one in each trial. At trial t = 1,...,n−1, there are n+1−t experts left and the expected loss is n/(n+1−t). In the first n−1 trials the algorithm incurs expected loss $n \sum_{i=2}^n \frac{1}{i} \approx n \ln n$.

When the loss of the best expert is large then the following theorem follows from Corollary 11:
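A concrete family of permutations using disjoint positions is the set of n cyclic shifts, $\Pi^i(q) = (q+i) \bmod n$ (our own choice of construction for illustration; the reduction only requires that the position sets be disjoint):

```python
def cyclic_shift_matrices(n):
    """Return the n cyclic-shift permutation matrices Pi^i with
    Pi^i[q][j] = 1 iff j = (q + i) mod n; their positions are disjoint."""
    return [[[1 if (q + i) % n == j else 0 for j in range(n)] for q in range(n)]
            for i in range(n)]
```

Because each matrix position is used by exactly one shift, the matrices sum to the all-ones matrix, so the n simulated expert losses can be set independently.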
If the algorithm allocates any weight to experts that have already incurred positive loss, then the adversary can assign loss only to those experts and force the algorithm to increase its expected loss without reducing the number of experts of loss zero. Thus we can assume w.l.o.g. that the algorithm allocates positive weight only to experts of zero loss. The algorithm minimizes its expected loss and the adversary maximizes it. We get a lower bound by fixing the adversary: this adversary assigns loss n to one of the experts that received the highest probability from the algorithm, and all other experts are assigned loss zero. Clearly the optimal allocation against such an adversary uses the uniform distribution over those experts with zero loss. The number of experts with loss zero is reduced by one in each trial. At trial t = 1,...,n−1, there are n+1−t experts left and the expected loss is n/(n+1−t). In the first n−1 trials the algorithm therefore incurs expected loss Σ_{i=2}^n n/i ≈ n ln n. When the loss of the best expert is large, the following theorem follows from Corollary 11:
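The claim that the forced loss Σ_{i=2}^n n/i grows like n ln n can be checked numerically; a small sketch (the value of n is an arbitrary illustration):

```python
import math

# Expected loss forced by the adversary: at trial t = 1,...,n-1 there are
# n+1-t zero-loss experts left, and the uniform-optimal algorithm expects
# per-trial loss n/(n+1-t). The total is sum_{i=2}^n n/i = n*(H_n - 1).
def forced_loss(n):
    return sum(n / (n + 1 - t) for t in range(1, n))

n = 1000
total = forced_loss(n)
print(total, n * math.log(n))  # agree up to an O(n) lower-order term
```

The gap between the two quantities is n·(γ − 1) + o(n) ≈ −0.42 n, so the n ln n growth rate is correct while the constant on the lower-order term is absorbed.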

HELMBOLD AND WARMUTH

Theorem 6 There exists n_0 such that for each dimension n ≥ n_0, there is a T_n where for any number of trials T ≥ T_n the following holds for any algorithm A for learning permutations of n elements in the allocation setting: there is a sequence S of T trials such that L_best(S) ≤ nT/2 and

    L_A(S) − L_best(S) ≥ √((nT/2) n ln n).

These two lower bounds can be combined into the following lower bound on the expected regret for our permutation learning problem:

    (√(L_best n ln n) + n ln n) / 2 ≥ max(√(L_best n ln n), n ln n) / 2.

This means that the tuned upper bound on the expected regret of PermELearn given after Theorem 4 cannot be improved by more than a small (2√2) constant factor.

8. Conclusions

We considered the problem of learning a permutation on-line, when the per-trial loss is specified by a matrix L ∈ [0,1]^{n×n} and the loss of a permutation matrix Π is the linear loss Π • L. The standard approach would treat each permutation as an expert. However this is computationally inefficient and introduces an additional factor of n in the regret bounds (since the per-trial loss of a permutation is in [0,n] rather than [0,1]). We do not know if this factor of n is necessary for permutations, and it remains open whether their special structure allows better regret bounds for the standard expert algorithms when the experts are permutations.

We developed a new algorithm called PermELearn that uses a doubly stochastic matrix to maintain its uncertainty over the hidden permutation. PermELearn decomposes this doubly stochastic matrix into a small mixture of permutation matrices and predicts with a random permutation from this mixture. A similar decomposition was used by Warmuth and Kuzmin (2008) to learn as well as the best fixed-size subset of experts. PermELearn belongs to the Exponentiated Gradient family of updates and the analysis uses a relative entropy as a measure of progress. The main technical insight is that the per-trial progress bound already holds for the un-normalized update and that re-balancing the matrix only increases the progress.
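The decomposition of a doubly stochastic matrix into a mixture of permutations can be sketched as follows. This is our own brute-force illustration of the iterative matching idea (exhaustive search over permutations stands in for a real matching algorithm, so it is only feasible for small n):

```python
import numpy as np
from itertools import permutations

# Illustrative sketch: repeatedly find a permutation supported on the
# positive entries of the remainder, subtract the largest feasible
# coefficient, and stop when the remainder is zero.
def decompose(W, tol=1e-12):
    M = W.astype(float).copy()
    n = M.shape[0]
    parts = []
    while M.max() > tol:
        perm = next(p for p in permutations(range(n))
                    if all(M[q, p[q]] > tol for q in range(n)))
        alpha = min(M[q, perm[q]] for q in range(n))
        P = np.zeros((n, n))
        P[range(n), perm] = 1.0
        parts.append((alpha, P))
        M -= alpha * P  # at least one more entry of the remainder is zeroed
    return parts

W = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.25, 0.25, 0.50]])
parts = decompose(W)
print(len(parts))  # at most n^2 - 2n + 2 = 5 pieces (Theorem 7's bound)
print(np.allclose(sum(a * P for a, P in parts), W))
```

Choosing the coefficient as the minimum of the selected entries guarantees that each subtraction zeroes at least one entry, which is what drives the n² − 2n + 2 bound on the number of pieces.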
Since the re-balancing step does not have a closed form, accounting for it in the analysis would otherwise be problematic. We also showed that the update for the Hedge algorithm can be split into an un-normalized update and a normalization. In this more basic setting the per-trial progress bound also holds for the un-normalized update.

Our analysis techniques rely on Bregman projection methods,9 and the regret bounds hold not only for permutations but also for mixtures of permutations. This means that if we have additional convex constraints that are satisfied by the mixture that we compare against, then we can project the algorithm's weight matrix onto these constraints without hurting the analysis (Herbster and Warmuth, 2001). With these kinds of side constraints we can enforce some relationships between the parameters, such as W_{i,j} ≥ W_{i,k} (i is more likely mapped to j than k).

Our main contribution is showing how to apply the analysis techniques from the expert advice setting to the problem of efficiently learning a permutation. This means that many of the tools from

9. Following Kuzmin and Warmuth (2007), we also showed in Appendix B that the regret bounds proven in this paper can be reproduced with potential based methods.

the expert setting are likely to carry over to permutations: lower bounding the weights when the comparator is shifting (Herbster and Warmuth, 1998), long-term memory when shifting between a small set of comparators (Bousquet and Warmuth, 2002), capping the weights from the top if the goal is to be close to the best set of disjoint permutations of fixed size (Warmuth and Kuzmin, 2008), adapting the updates to the multi-armed bandit setting when less feedback is provided (Auer et al., 2002),10 and PAC-Bayes analysis of the exponential updates (McAllester, 2003).

We also applied the Follow the Perturbed Leader techniques to our permutation problem. This algorithm adds randomness to the total losses and then predicts with a minimum weighted matching, which costs O(n^3), whereas our more complicated algorithm is at least O(n^4) and has precision issues. However the bounds currently provable for the FPL algorithm of Kalai and Vempala (2005) are much worse than for our PermELearn algorithm. The key open problem is whether we can have the best of both worlds: add randomness to the loss matrix so that the expected minimum weighted matching is the stochastic matrix produced by the PermELearn update (4). This would mean that we could use the faster algorithm together with our tighter analysis. In the simpler weighted majority setting this has been done already (Kuzmin and Warmuth, 2005; Kalai, 2005). However we do not yet know how to simulate the PermELearn update this way.

Our on-line learning problem requires the learner's prediction to be an actual permutation. This requirement makes sense for the linear loss we focus on in this paper, but may be less appropriate for on-line regression problems. Consider the case where on each trial the algorithm selects a doubly stochastic matrix M while nature simultaneously picks a matrix X ∈ [0,1]^{n×n} and a real number y. The prediction is ŷ = M • X and the loss on the trial is (ŷ − y)^2.
With this convex quadratic loss, it is generally better for the algorithm to hedge its bets between competing permutations and select its doubly stochastic parameter matrix W as M instead of a random permutation matrix Π chosen s.t. E(Π) = W. The Exponentiated Gradient algorithm can be applied to this type of nonlinear regression problem (see, e.g., Helmbold et al., 1999) and Sinkhorn balancing can project the parameter matrix W onto the row and column sum constraints.

We close with an open problem involving higher order loss functions. In this paper we considered linear losses specified by a square matrix L where L_{i,j} gives the loss when entry (i,j) is used in the permutation. Can one prove good regret bounds when the loss depends on how the permutation assigns multiple elements? A pairwise loss could be represented with a four-dimensional matrix L where L_{i,j,k,l} is added to the loss only when the predicted permutation maps both i to j and k to l. The recently developed Fourier analysis techniques for permutations (Kondor et al., 2007; Huang et al., 2009) may be helpful in generalizing our techniques to this kind of higher order loss.

Acknowledgments

We thank Vishy Vishwanathan for helping us simplify the lower bounds, and David DesJardins for helpful discussions and pointers to the literature on Sinkhorn balancing.

10. Recently, a less efficient algorithm which explicitly maintains one expert per permutation has been analyzed in the bandit setting by Cesa-Bianchi and Lugosi (2009). However the bounds they obtain have the loss range as an additional factor in the regret bound (a factor of n for permutations).

Appendix A. Size of the Decomposition

Here we show that the iterative matching method of Algorithm 1 requires at most n^2 − 2n + 2 permutations to decompose a doubly stochastic matrix. This matches the bound provided by Birkhoff's Theorem. Note that the discussion in Section 4 shows why Algorithm 1 can always find a suitable permutation.

Theorem 7 Algorithm 1 decomposes any doubly stochastic matrix into a convex combination of at most n^2 − 2n + 2 permutations.

Proof Let W be a doubly stochastic matrix and let Π_1,...,Π_l and α_1,...,α_l be any sequence of permutations and coefficients created by Algorithm 1 on input W. For 0 ≤ j ≤ l, define M^j = W − Σ_{i=1}^j α_i Π_i. By permuting rows and columns we can assume without loss of generality that Π_l is the identity permutation. Let G^j (for 1 ≤ j ≤ l) be the (undirected) graph on the n vertices {1,...,n} where the undirected edge {p,q} between nodes p ≠ q is present if and only if either M^j_{p,q} or M^j_{q,p} is non-zero. Thus both G^{l−1} and G^l are the empty graph and each G^{j+1} has a (not necessarily strict) subset of the edges in G^j. Note the natural correspondences between vertices in the graphs and rows and columns in the matrices.

The proof is based on the following key invariant:

    # of zero entries in M^j ≥ j + (# connected components in G^j) − 1.

This holds for the initial M^0. Furthermore, when the connected components of G^j and G^{j+1} are the same, the algorithm ensures that M^{j+1} has at least one more zero than M^j. We now analyze the case when new connected components are created.

Let vertex set V be a connected component in G^{j+1} that was split off a larger connected component in G^j. We overload the notation, and use V also for the set of matrix rows and/or columns associated with the vertices in the connected component. Since V is a connected component of G^{j+1} there are no edges going between V and the rest of the graph, so if M^{j+1} is viewed as a (conserved) flow, there is no flow either into or out of V:

    Σ_{r∈V} Σ_{c∉V} M^{j+1}_{r,c} = Σ_{r∉V} Σ_{c∈V} M^{j+1}_{r,c} = 0.
Thus all entries of M^j in the sets {M^j_{r,c} > 0 : r ∈ V, c ∉ V} and {M^j_{r,c} > 0 : r ∉ V, c ∈ V} are set to zero in M^{j+1}. Since V was part of a larger connected component in G^j, at least one of these sets must be non-empty. We now show that both of these sets of entries are non-empty.

Each row and column of M^j sums to 1 − Σ_{i=1}^j α_i. Therefore

    (1 − Σ_{i=1}^j α_i) |V| = Σ_{r∈V} Σ_{c=1}^n M^j_{r,c} = Σ_{c∈V} Σ_{r=1}^n M^j_{r,c}.

By splitting the inner sums we get:

    Σ_{r∈V} Σ_{c∈V} M^j_{r,c} + Σ_{r∈V} Σ_{c∉V} M^j_{r,c} = Σ_{c∈V} Σ_{r∈V} M^j_{r,c} + Σ_{c∈V} Σ_{r∉V} M^j_{r,c}.

By canceling the first sums and viewing M^j as a flow in G^j, we conclude that the total flow out of V in M^j equals the total flow into V in M^j, that is,

    Σ_{r∈V} Σ_{c∉V} M^j_{r,c} = Σ_{c∈V} Σ_{r∉V} M^j_{r,c},

and both sets {M^j_{r,c} > 0 : r ∈ V, c ∉ V} and {M^j_{r,c} > 0 : r ∉ V, c ∈ V} sum to the same positive total, and thus are non-empty. This establishes the following fact that we can use in the remainder of the proof: for each new connected component V in G^{j+1}, some entry M^j_{r,c} from a row r in V was set to zero.

Now let k_j (and k_{j+1}) be the number of connected components in graph G^j (and G^{j+1} respectively). Since the edges in G^{j+1} are a subset of the edges in G^j, k_{j+1} ≥ k_j. We already verified the invariant when k_j = k_{j+1}, so we proceed assuming k_{j+1} > k_j. In this case at most k_j − 1 components of G^j survive when going to G^{j+1}, and at least k_{j+1} − (k_j − 1) new connected components are created. The vertex sets of the new connected components are disjoint, and in the rows corresponding to each new connected component there is at least one non-zero entry in M^j that is zero in M^{j+1}. Therefore, M^{j+1} has at least k_{j+1} − k_j + 1 more zeros than M^j, verifying the invariant for the case when k_{j+1} > k_j.

Since G^{l−1} has n connected components, the invariant shows that the number of zeros in M^{l−1} is at least l − 1 + n − 1. Furthermore, M^l has n more zeros than M^{l−1}, so M^l has at least l + 2n − 2 zeros. Since M^l has only n^2 entries, n^2 ≥ l + 2n − 2 and l ≤ n^2 − 2n + 2 as desired.

The fact that Algorithm 1 uses at most n^2 − 2n + 2 permutations can also be established with a dimensionality argument like that in Section 2.7 of Bazaraa et al. (1977).

Appendix B. Potential Based Bounds

Let us begin with the on-line allocation problem in the simpler expert setting. There are always two ways to motivate on-line updates.
One trades the divergence to the last weight vector against the loss in the last trial, and the other trades the divergence to the initial weight vector against the loss in all past trials (Azoury and Warmuth, 2001):

    w^t := argmin_{Σ_i w_i = 1} ( Δ(w, w^{t−1}) + η w · ℓ^t ),    w^t := argmin_{Σ_i w_i = 1} ( Δ(w, w^0) + η w · ℓ^{≤t} ).

By differentiating the Lagrangian for each optimization problem we obtain the solutions to both minimization problems:

    w^t_i = w^{t−1}_i e^{−η ℓ^t_i + β_t},    w^t_i = w^0_i e^{−η ℓ^{≤t}_i + β̃_t},

where the signs of the Lagrange multipliers β_t and β̃_t are unconstrained and their values are chosen so that the equality constraints are satisfied. The left update can be unrolled to obtain w^t_i = w^0_i e^{−η ℓ^{≤t}_i + Σ_{q=1}^t β_q}. This means the Lagrange multipliers for both problems are related by the equality Σ_{q=1}^t β_q = β̃_t and both problems have the same solution:11

    w^t_i = w^0_i e^{−η ℓ^{≤t}_i} / Σ_{j=1}^n w^0_j e^{−η ℓ^{≤t}_j}.

We use the value of the right convex

11. The solutions can differ if the minimization is over linear inequality constraints (Kuzmin and Warmuth, 2007).

optimization problem's objective function as our potential v^t. Its Lagrangian is

    Σ_i ( w_i ln(w_i / w^0_i) + w^0_i − w_i + η w_i ℓ^{≤t}_i ) + β ( Σ_i w_i − 1 )

and since there is no duality gap:12

    v^t := min_{Σ_i w_i = 1} ( Δ(w, w^0) + η w · ℓ^{≤t} ) = max_β θ^t(β),  where the dual function is  θ^t(β) = Σ_i w^0_i (1 − e^{−η ℓ^{≤t}_i − β}) − β.

Here β is the (unconstrained) dual variable for the primal equality constraint and the w_i have been optimized out. By differentiating we can optimize β in the dual problem and arrive at

    v^t = −ln Σ_i w^0_i e^{−η ℓ^{≤t}_i}.

This form of the potential has been used extensively for analyzing expert algorithms (see, e.g., Kivinen and Warmuth, 1999; Cesa-Bianchi and Lugosi, 2006). One can easily show the following key inequality (essentially Lemma 5.2 of Littlestone and Warmuth, 1994):

    v^t − v^{t−1} = −ln ( Σ_i w^0_i e^{−η ℓ^{≤t}_i} / Σ_i w^0_i e^{−η ℓ^{<t}_i} ) = −ln Σ_i w^{t−1}_i e^{−η ℓ^t_i}
                  ≥ −ln Σ_i w^{t−1}_i (1 − (1 − e^{−η}) ℓ^t_i) ≥ (1 − e^{−η}) w^{t−1} · ℓ^t.    (13)

Summing over all trials and using v^0 = 0 gives the familiar bound:

    Σ_{t=1}^T w^{t−1} · ℓ^t ≤ v^T / (1 − e^{−η}) = (1 / (1 − e^{−η})) min_{Σ_i w_i = 1} ( Δ(w, w^0) + η w · ℓ^{≤T} ).

Note that by Kivinen and Warmuth (1999),

    v^t − v^{t−1} = −ln ( Σ_i w^{t−1}_i e^{−η ℓ^t_i} ) = Δ(u, w^{t−1}) − Δ(u, w^t) + η u · ℓ^t,

and therefore Inequality (13) is the same as Inequality (10). Since Σ_t (v^t − v^{t−1}) = v^T = −ln ( Σ_i w^0_i e^{−η ℓ^{≤T}_i} ), summing the bound (13) over t coincides with the bound (11).

12. There is no duality gap in this case because the primal problem is a feasible convex optimization problem subject to linear constraints.
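The potential and the per-trial bound (13) are easy to verify numerically; a small illustrative sketch (η, the dimensions, and the random losses are arbitrary choices of ours):

```python
import math
import random

# Numeric sanity check of the potential v^t = -ln sum_i w0_i e^{-eta L_i}
# and the per-trial progress bound v^t - v^{t-1} >= (1-e^{-eta}) w^{t-1}.l^t.
eta = 0.5
n, T = 5, 20
random.seed(1)
w0 = [1.0 / n] * n
cum = [0.0] * n  # cumulative losses before the current trial

def potential(cum_losses):
    return -math.log(sum(wi * math.exp(-eta * Li)
                         for wi, Li in zip(w0, cum_losses)))

v_prev = 0.0  # v^0 = 0 since all cumulative losses start at zero
ok = True
for t in range(T):
    z = sum(wi * math.exp(-eta * Li) for wi, Li in zip(w0, cum))
    w = [wi * math.exp(-eta * Li) / z for wi, Li in zip(w0, cum)]  # Hedge weights
    l = [random.random() for _ in range(n)]  # losses in [0, 1]
    cum = [Li + li for Li, li in zip(cum, l)]
    v = potential(cum)
    progress = (1 - math.exp(-eta)) * sum(wi * li for wi, li in zip(w, l))
    ok = ok and (v - v_prev >= progress - 1e-12)
    v_prev = v
print(ok)
```

The check exercises exactly the two steps used in the proof: e^{−ηℓ} ≤ 1 − (1 − e^{−η})ℓ for ℓ ∈ [0,1], and −ln(1 − x) ≥ x.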

We now reprove the key inequality (13) using the dual function θ^t(β). Note that β_t maximizes this function, that is, v^t = θ^t(β_t), and the optimal primal solution is w^t_i = w^0_i e^{−η ℓ^{≤t}_i − β_t}. Now,

    v^t − v^{t−1} = θ^t(β_t) − θ^{t−1}(β_{t−1})
                  ≥ θ^t(β_{t−1}) − θ^{t−1}(β_{t−1})
                  = Σ_i w^0_i e^{−η ℓ^{<t}_i − β_{t−1}} (1 − e^{−η ℓ^t_i})        [the first factor is w^{t−1}_i]
                  ≥ Σ_i w^{t−1}_i (1 − (1 − (1 − e^{−η}) ℓ^t_i))
                  = (1 − e^{−η}) w^{t−1} · ℓ^t,

where we used e^{−η ℓ^t_i} ≤ 1 − (1 − e^{−η}) ℓ^t_i to get the fourth line. Notice that in the first inequality above we used θ^t(β_t) ≥ θ^t(β_{t−1}). This is true because β_t maximizes θ^t(β) and the old choice β_{t−1} is non-optimal. The dual parameter β_{t−1} assures that w^{t−1} is normalized, and θ^t(β_{t−1}) is related to plugging the intermediate unnormalized weights w̃^t_i := w^0_i e^{−η ℓ^{≤t}_i − β_{t−1}} into the primal problem for trial t. This means that the inequality θ^t(β_t) ≥ θ^t(β_{t−1}) corresponds to the Bregman projection of the unnormalized update onto the equality constraint. The difference θ^t(β_{t−1}) − θ^{t−1}(β_{t−1}) in the second line above is the progress in the value when going from w^{t−1} at the end of trial t−1 to the intermediate unnormalized update w̃^t at trial t. Therefore this proof also does not exploit the normalization.

The bound for the permutation problem follows the same outline. We use the value of the following optimization problem as our potential:

    v^t := min_{A : ∀j Σ_i A_{i,j} = 1, ∀i Σ_j A_{i,j} = 1} ( Δ(A, W^0) + η (A • L^{≤t}) )
         = max_{α,β} ( Σ_{i,j} W^0_{i,j} (1 − e^{−η L^{≤t}_{i,j} − α_i − β_j}) − Σ_i α_i − Σ_j β_j ),

where the right-hand side is the dual function θ^t(α,β). The α_i and β_j are the dual variables for the row and column constraints. Now we cannot optimize out the dual variables: the dual function θ^t(α,β) does not have a maximum in closed form. Nevertheless the above proof technique based on duality still works. Let α^t and β^t be the optimizers of θ^t(α,β).
Then the optimum primal solution (the parameter matrix of PermELearn) becomes W^t_{i,j} = W^0_{i,j} e^{−η L^{≤t}_{i,j} − α^t_i − β^t_j} and we can analyze the increase in value as before:

    v^t − v^{t−1} = θ^t(α^t, β^t) − θ^{t−1}(α^{t−1}, β^{t−1})
                  ≥ θ^t(α^{t−1}, β^{t−1}) − θ^{t−1}(α^{t−1}, β^{t−1})
                  = Σ_{i,j} W^0_{i,j} e^{−η L^{<t}_{i,j} − α^{t−1}_i − β^{t−1}_j} (1 − e^{−η L^t_{i,j}})        [the first factor is W^{t−1}_{i,j}]
                  ≥ Σ_{i,j} W^{t−1}_{i,j} (1 − (1 − (1 − e^{−η}) L^t_{i,j}))
                  = (1 − e^{−η}) W^{t−1} • L^t.
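The normalization enforced by the dual variables α_i and β_j has no closed form; numerically it is computed by Sinkhorn balancing, i.e., alternately normalizing rows and columns. A minimal sketch (the iteration cap and tolerance are our own illustrative choices):

```python
import numpy as np

# Sinkhorn balancing sketch: alternate row and column normalizations until
# the matrix is approximately doubly stochastic. Convergence holds for
# strictly positive input matrices.
def sinkhorn(W, max_iters=10_000, tol=1e-10):
    W = W.astype(float).copy()
    for _ in range(max_iters):
        W /= W.sum(axis=1, keepdims=True)  # make row sums 1
        W /= W.sum(axis=0, keepdims=True)  # make column sums 1
        if np.abs(W.sum(axis=1) - 1.0).max() < tol:
            break
    return W

rng = np.random.default_rng(0)
B = sinkhorn(rng.random((5, 5)) + 0.1)  # strictly positive entries
print(np.allclose(B.sum(axis=0), 1.0), np.allclose(B.sum(axis=1), 1.0))
```

Each row (or column) normalization corresponds to adjusting the α_i (or β_j) for one set of constraints while temporarily violating the other, which is why no finite closed form exists.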

Summing over trials in the usual way gives the bound

    Σ_{t=1}^T W^{t−1} • L^t ≤ v^T / (1 − e^{−η}) = (1 / (1 − e^{−η})) min_{A : ∀j Σ_i A_{i,j} = 1, ∀i Σ_j A_{i,j} = 1} ( Δ(A, W^0) + η (A • L^{≤T}) ),

which is the same as the bound of Theorem 4.

Appendix C. Lower Bounds for the Expert Advice Setting

We first modify a known lower bound from the expert advice setting with the absolute loss (Cesa-Bianchi et al., 1997). We begin by describing that setting and show how it relates to the allocation setting for experts. In the expert advice setting there are n experts. Each trial t starts with nature selecting a prediction x^t_i ∈ [0,1] for each expert i ∈ {1,...,n}. The algorithm is given these predictions and then produces its own prediction ŷ^t ∈ [0,1]. Finally, nature selects a label y^t ∈ {0,1} for the trial. The algorithm is charged loss |ŷ^t − y^t| and expert i gets loss |x^t_i − y^t|.

Any algorithm in the allocation setting leads to an algorithm in the above expert advice setting: keep the weight update unchanged, predict with the weighted average (i.e., ŷ^t = w^{t−1} · x^t), and define the loss vector ℓ^t ∈ [0,1]^n in terms of the absolute loss, ℓ^t_i = |x^t_i − y^t|:

    |ŷ^t − y^t| = w^{t−1} · |x^t − y^t| = w^{t−1} · ℓ^t,

where the first equality holds because x^t_i ∈ [0,1] and y^t ∈ {0,1}. This means that any lower bound on the regret in the above expert advice setting immediately leads to a lower bound on the expected loss in the allocation setting for experts when the loss vectors lie in [0,1]^n.

We now introduce some more notation and state the lower bound from the expert advice setting that we build on. Let S_{n,T} be the set of all sequences of T trials with n experts in the expert advice setting with the absolute loss. Let V_{n,T} be the minimum over algorithms of the worst-case regret over sequences in S_{n,T}.

Theorem 8 (Cesa-Bianchi et al., 1997, Theorem 4.5.2)

    lim_{n→∞} lim_{T→∞} V_{n,T} / √((T/2) ln n) = 1.

This means that for all ε > 0 there exists n_ε such that for each n ≥ n_ε, there is a T_{ε,n} where for all T ≥ T_{ε,n}, V_{n,T} ≥ (1 − ε) √((T/2) ln n).
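The equality |ŷ^t − y^t| = w^{t−1} · ℓ^t underlying the reduction at the start of this appendix is easy to check numerically (the sizes and random seed are arbitrary choices of ours):

```python
import random

# Check: for binary labels y in {0,1} and predictions x in [0,1]^n, all the
# terms x_i - y share the same sign, so the weighted-average prediction has
# absolute loss exactly w . l with l_i = |x_i - y|.
random.seed(0)
n = 6
ok = True
for _ in range(100):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    w = [wi / s for wi in w]                 # allocation weights on the simplex
    x = [random.random() for _ in range(n)]  # expert predictions
    y = random.choice([0, 1])                # binary label
    yhat = sum(wi * xi for wi, xi in zip(w, x))
    l = [abs(xi - y) for xi in x]            # absolute losses of the experts
    ok = ok and abs(abs(yhat - y) - sum(wi * li for wi, li in zip(w, l))) < 1e-12
print(ok)
```

Note that equality (not just an inequality) holds precisely because y^t is binary; with real-valued labels the weighted average would only upper bound the mixture loss.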
By further expanding the definition of V_{n,T} we get the following version of the above lower bound that avoids the use of limits:

Corollary 9 For all ε > 0 there exists n_ε such that for each number of experts n ≥ n_ε, there is a T_{ε,n} where for any number of trials T ≥ T_{ε,n} the following holds for any algorithm A in the expert advice setting with the absolute loss: there is a sequence S of T trials with n experts such that

    L_A(S) − L_best(S) ≥ (1 − ε) √((T/2) ln n).

This lower bound on the regret depends on the number of trials T. We now use a reduction to bound L_best(S) by T/2. Define R(S′) as the transformation that takes a sequence S′ of trials in S_{n−1,T} and produces a sequence of trials in S_{n,T} by adding an extra expert whose predictions are simply 1 minus the predictions of the first expert. On each trial the absolute loss of the additional expert on sequence R(S′) is 1 minus the loss of the first expert. Therefore either the first expert or the additional expert will have loss at most T/2 on R(S′).

Theorem 10 For all ε > 0 there exists n_ε such that for each number of experts n ≥ n_ε, there is a T_{ε,n} where for any number of trials T ≥ T_{ε,n} the following holds for any algorithm A in the expert advice setting with the absolute loss: there is a sequence S of T trials with n experts such that L_best(S) ≤ T/2 and

    L_A(S) − L_best(S) ≥ (1 − ε) √((T/2) ln n).

Proof We begin by showing that the regret on a transformed sequence in {R(S′) : S′ ∈ S_{n−1,T}} is at least (1 − ε/2) √((T/2) ln(n−1)). Note that for all R(S′), L_best(R(S′)) ≤ T/2, and assume to the contrary that some algorithm A has regret strictly less than (1 − ε/2) √((T/2) ln(n−1)) on every sequence in {R(S′) : S′ ∈ S_{n−1,T}}. We then create an algorithm A′ that runs transformation R(·) on-the-fly and predicts as A does on the transformed sequence. Therefore A′ on S′ and A on R(S′) make the same predictions and have the same total loss. On every sequence S′ ∈ S_{n−1,T} we have L_best(S′) ≥ L_best(R(S′)) and therefore

    L_{A′}(S′) − L_best(S′) ≤ L_{A′}(S′) − L_best(R(S′)) = L_A(R(S′)) − L_best(R(S′)) < (1 − ε/2) √((T/2) ln(n−1)).

Now if n − 1 is at least the n_{ε/2} of Corollary 9 and T is at least the T_{ε/2,n−1} of the same corollary, then this contradicts that corollary.

This means that for any algorithm A and large enough n and T, there is a sequence S for which the algorithm has regret at least (1 − ε/2) √((T/2) ln(n−1)) and L_best(S) ≤ T/2. By choosing the lower bound on n large enough,

    (1 − ε/2) √((T/2) ln(n−1)) ≥ (1 − ε) √((T/2) ln n),

and the theorem follows.

Note that the tuned upper bounds in the allocation setting (9) have an additional factor of √2. This is due to the fact that in the allocation setting the algorithm predicts with the weighted average and this is non-optimal. In the expert setting with the absolute loss, the upper bound (based on a different prediction function) and the lower bound on the regret are asymptotically tight (see Theorem 8).

We are now ready to prove our lower bound for the allocation setting with experts when the losses of the experts are in [0,n]^n instead of [0,1]^n.

Corollary 11 There exists n_0 such that for each dimension n ≥ n_0, there is a T_n where for any number of trials T ≥ T_n the following holds for any algorithm A for the allocation setting with n experts: there is a sequence S of T trials with loss vectors in [0,n]^n such that L_best(S) ≤ nT/2 and

    L_A(S) − L_best(S) ≥ √((nT/2) n ln n).

Proof Via the reduction we stated at the beginning of this appendix, the following lower bound for the allocation setting with n experts immediately follows from the previous theorem: For any algorithm in the allocation setting for n experts there is a sequence S̃ of T trials where the losses of the experts lie in [0,1] such that L_best(S̃) ≤ T/2 and

    L_A(S̃) − L_best(S̃) ≥ √((T/2) ln n).

Now we simply scale the loss vectors by the factor n, that is, the scaled sequences S have loss vectors in the range [0,n]^n and L_best(S) ≤ nT/2. The lower bound becomes n √((T/2) ln n) = √((nT/2) n ln n).

References

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

K. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, June 2001. Special issue on Theoretical Advances in On-line Learning, Game Theory and Boosting, edited by Yoram Singer.

H. Balakrishnan, I. Hwang, and C. Tomlin. Polynomial approximation algorithms for belief matrix maintenance in identity management. In 43rd IEEE Conference on Decision and Control, pages 4874–4879, December 2004.

M. S. Bazaraa, J. J. Jarvis, and H. D. Sherali. Linear Programming and Network Flows. Wiley, second edition, 1977.

R. Bhatia. Matrix Analysis. Springer-Verlag, Berlin, 1997.

A. Blum, S. Chawla, and A. Kalai. Static optimality and dynamic search-optimality in lists and trees. Algorithmica, 36:249–260, 2003.

O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3:363–396, 2002.

L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Physics, 7:200–217, 1967.

Y. Censor and A. Lent. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321–353, July 1981.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 09), 2009.

N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997.

Y. Chen, L. Fortnow, N. Lambert, D. Pennock, and J. Wortman. Complexity of combinatorial market makers. In Ninth ACM Conference on Electronic Commerce (EC 08). ACM Press, July 2008.

J. Franklin and J. Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114/115:717–735, 1989.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to Boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.

M. Fürer. Quadratic convergence for scaling of matrices. In Proceedings of ALENEX/ANALCO, pages 216–223. SIAM, 2004.

Geoffrey J. Gordon. No-regret algorithms for online convex programs. In Bernhard Schölkopf, John C. Platt, and Thomas Hoffman, editors, NIPS, pages 489–496. MIT Press, 2006.

D. P. Helmbold and R. E. Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(01):51–68, 1997.

D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10(6):1291–1304, November 1999.

M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998. Earlier version in 12th ICML, 1995.

M. Herbster and M. K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.

J. Huang, C. Guestrin, and L. Guibas. Fourier theoretic probabilistic inference over permutations. Journal of Machine Learning Research, 10:997–1070, 2009.

M. Jerrum, A. Sinclair, and E. Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. Journal of the ACM, 51(4):671–697, July 2004.

A. Kalai. Simulating weighted majority with FPL. Private communication, 2005.

A. Kalai and S. Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291–307, 2005. Special issue Learning Theory 2003.

B. Kalantari and L. Khachiyan. On the complexity of nonnegative-matrix scaling. Linear Algebra and its Applications, 240:87–103, 1996.

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1–64, January 1997.

J. Kivinen and M. K. Warmuth. Averaging expert predictions. In Computational Learning Theory, 4th European Conference (EuroCOLT 99), Nordkirchen, Germany, March 29–31, 1999, Proceedings, volume 1572 of Lecture Notes in Artificial Intelligence, pages 153–167. Springer, 1999.

R. Kondor, A. Howard, and T. Jebara. Multi-object tracking with representations of the symmetric group. In Proc. of the 11th International Conference on Artificial Intelligence and Statistics, March 2007.

D. Kuzmin and M. K. Warmuth. Optimum follow the leader algorithm. In Proceedings of the 18th Annual Conference on Learning Theory (COLT 05), pages 684–686. Springer-Verlag, June 2005. Open problem.

D. Kuzmin and M. K. Warmuth. Online kernel PCA with entropic matrix updates. In Proceedings of the 24th International Conference on Machine Learning (ICML 07), pages 465–471. ACM International Conference Proceedings Series, June 2007.

N. Linial, A. Samorodnitsky, and A. Wigderson. A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. Combinatorica, 20(4):545–568, 2000.

N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Inform. Comput., 108(2):212–261, 1994. Preliminary version in FOCS 89.

D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.

R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, June 1964.

E. Takimoto and M. K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773–818, 2003.

V. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 371–383. Morgan Kaufmann, 1990.

M. K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9:2217–2250, 2008.

M. K. Warmuth and S.V.N. Vishwanathan. Leaving the span. In Proceedings of the 18th Annual Conference on Learning Theory (COLT 05), Bertinoro, Italy, June 2005. Springer-Verlag. Journal version in progress.