LARGE CLASSES OF EXPERTS
Csaba Szepesvári, University of Alberta
CMPUT 654
E-mail: szepesva@ualberta.ca
UofA, October 31, 2006
OUTLINE
1. TRACKING THE BEST EXPERT
2. FIXED-SHARE FORECASTER
3. VARIABLE-SHARE FORECASTER
4. OTHER LARGE CLASSES OF EXPERTS
5. BIBLIOGRAPHY
TRACKING THE BEST EXPERT [HERBSTER AND WARMUTH, 1998]

- Discrete prediction problem.
- We want to compete with the set of compound actions
$$B_{n,m} = \{(i_1, \dots, i_n) : s(i_1, \dots, i_n) \le m\},$$
where $s(i_1, \dots, i_n) = \sum_{t=2}^n \mathbb{I}\{i_{t-1} \ne i_t\}$ is the number of switches.
- Shorthand notation: $i_{1:n} = (i_1, \dots, i_n)$; a sequence assembled from pieces, such as $(i_{1:t}, i, i_{t+2:n})$, is abbreviated $\hat{i}_{1:n}$, etc.
- Regret:
$$R_{n,m} \stackrel{\text{def}}{=} \sum_{t=1}^n \ell(p_t, y_t) - \min_{i_{1:n} \in B_{n,m}} \sum_{t=1}^n \ell(i_t, y_t).$$
- Equivalently, we work with $R_{n,m} = \max_{i_{1:n} \in B_{n,m}} R(i_{1:n})$, where
$$R(i_{1:n}) \stackrel{\text{def}}{=} \sum_{t=1}^n \ell(p_t, y_t) - \sum_{t=1}^n \ell(i_t, y_t).$$
RANDOMIZED EWA APPLIED TO TRACKING PROBLEMS

- Action set: $B_{n,m}$. We always select a compound action, but play only its next primitive action.
- The previous regret bound gives $R_{n,m} \le \sqrt{\frac{n}{2} \ln |B_{n,m}|}$.
- How large is $M = |B_{n,m}|$?
$$M = \sum_{k=0}^m \binom{n-1}{k} N (N-1)^k \le N^{m+1} \exp\left((n-1) H\left(\tfrac{m}{n-1}\right)\right),$$
where $H(x) = -x \ln x - (1-x) \ln(1-x)$, $x \in [0,1]$, is the binary entropy function.
- Hence
$$R_{n,m} \le \sqrt{\frac{n}{2} \left((m+1) \ln N + (n-1) H\left(\tfrac{m}{n-1}\right)\right)}.$$
- Problem: randomized EWA is not efficient ($M$ weights!)
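To make the counting concrete, here is a small sketch that evaluates $|B_{n,m}|$ exactly and compares it with the entropy bound and the resulting regret bound (the function names and the example values of $n$, $N$, $m$ are our own choices):

```python
import math

def num_compound_actions(n, N, m):
    # |B_{n,m}|: choose the k switch positions among the n-1 steps,
    # the initial action (N ways) and each post-switch action (N-1 ways).
    return sum(math.comb(n - 1, k) * N * (N - 1) ** k for k in range(m + 1))

def binary_entropy(x):
    # H(x) = -x ln x - (1-x) ln(1-x), with H(0) = H(1) = 0.
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

n, N, m = 100, 10, 5
ln_M = math.log(num_compound_actions(n, N, m))
ln_M_bound = (m + 1) * math.log(N) + (n - 1) * binary_entropy(m / (n - 1))
print(f"ln M = {ln_M:.2f} <= {ln_M_bound:.2f}")               # counting bound
print(f"regret bound = {math.sqrt(n / 2 * ln_M_bound):.2f}")  # sqrt((n/2) ln M)
```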
TOWARDS AN EFFICIENT IMPLEMENTATION

A useful observation:

LEMMA (EWA WITH NON-UNIFORM PRIORS)
Assume that $W_0 = \sum_i w_{i0} \le 1$, $w_{i0} \ge 0$. Consider randomized EWA. Then
$$\sum_{t=1}^n \ell(p_t, y_t) \le \frac{1}{\eta} \ln \frac{1}{W_n} + \frac{\eta}{8} n,$$
where $W_n = \sum_{i=1}^N w_{in} = \sum_{i=1}^N w_{i0} e^{-\eta L_{in}}$ and $L_{in} = \sum_{t=1}^n \ell(i, y_t)$.

How does this help? The initial weights act like priors: if $L_{1n} \le L_{2n}$, then $e^{-\eta L_{1n}} \ge e^{-\eta L_{2n}}$. Hence it is good to assign large initial weights to actions that are expected to suffer small losses.
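As a sanity check of the lemma, one can simulate randomized EWA and compare its cumulative expected loss against $\frac{1}{\eta}\ln\frac{1}{W_n} + \frac{\eta}{8}n$ (a minimal sketch; the uniform prior, the random losses in $[0,1]$, and all names are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, eta = 5, 200, 0.1
losses = rng.random((n, N))             # l(i, y_t) in [0, 1]

w = np.full(N, 1.0 / N)                 # uniform prior, W_0 = 1
cum_expected_loss = 0.0
for t in range(n):
    p = w / w.sum()                     # EWA distribution p_t
    cum_expected_loss += p @ losses[t]  # adds l(p_t, y_t)
    w *= np.exp(-eta * losses[t])       # w_{it} = w_{i0} e^{-eta L_{it}}

bound = np.log(1.0 / w.sum()) / eta + eta * n / 8
print(cum_expected_loss, "<=", bound)   # the lemma's guarantee
```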
TOWARDS AN EFFICIENT IMPLEMENTATION

IDEA
Consider EWA on all action sequences, but with an appropriate prior, reflecting our belief that many switches are unlikely.

Let $w_t(i_{1:n})$ be the weight of EWA after observing $y_{1:t}$. What is a good set of initial weights?
$$w_0(i_{1:n}) \stackrel{\text{def}}{=} \frac{1}{N} \left(\frac{\alpha}{N}\right)^{s(i_{1:n})} \left(1 - \alpha + \frac{\alpha}{N}\right)^{n-1-s(i_{1:n})}.$$
Here $0 < \alpha < 1$ is the prior belief in a switch. If $i_{1:n}$ has many switches, it will be assigned a small weight by this prior!
WHAT??

Marginalized weights: $w_0(i_{1:t}) = \sum_{i_{t+1:n}} w_0(i_{1:n})$.

LEMMA (MARKOV PROCESS VIEW)
The following hold:
$$w_0(i_1) = \frac{1}{N}, \qquad w_0(i_{1:t+1}) = w_0(i_{1:t}) \left(\frac{\alpha}{N} + (1-\alpha)\, \mathbb{I}\{i_{t+1} = i_t\}\right).$$

Interpretation: the non-marginalized $w_0$ is the distribution underlying a Markov process:
- stay at the same primitive action with probability $1 - \alpha + \frac{\alpha}{N}$;
- switch to any other particular action with probability $\alpha/N$.
Role of $\alpha$: prior belief in a switch.
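The lemma is easy to verify numerically by brute force for small $N$ and $n$ (a sketch; the parameter values are arbitrary):

```python
import itertools

N, n, alpha = 3, 5, 0.3

def w0(seq):
    # Prior of a compound action: 1/N for the first action, then
    # alpha/N + (1 - alpha) * I{stay} per transition (the Markov view).
    w = 1.0 / N
    for a, b in zip(seq, seq[1:]):
        w *= alpha / N + (1 - alpha) * (a == b)
    return w

# Summing the full prior over all suffixes must reproduce the same
# product formula on the prefix -- exactly the statement of the lemma.
for t in range(1, n + 1):
    for prefix in itertools.product(range(N), repeat=t):
        marginal = sum(w0(prefix + tail)
                       for tail in itertools.product(range(N), repeat=n - t))
        assert abs(marginal - w0(prefix)) < 1e-12
print("marginalization lemma verified")
```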
TOWARDS AN EFFICIENT ALGORITHM

$w_t(i_{1:n})$: weight assigned to the compound action $i_{1:n}$ by EWA after seeing $y_{1:t}$.

What is the probability of a primitive action $i$? Define
$$w_{it} \stackrel{\text{def}}{=} \sum_{i_{1:t},\, i_{t+2:n}} w_t(i_1, \dots, i_t, i, i_{t+2}, \dots, i_n), \qquad t \ge 1.$$
Clearly, $w_{i0} = 1/N$, in line with the previous definition. Then
$$p_{it} = w_{it}/W_t, \qquad W_t = \sum_{i=1}^N w_{it}.$$
RECURSION FOR THE WEIGHTS

Recall $w_t(i_{1:n}) = w_0(i_{1:n}) e^{-\eta L(i_{1:t})}$, where $L(i_{1:t}) = \sum_{s=1}^t \ell(i_s, y_s)$, and let
$$\gamma_{i,i_t} = \frac{\alpha}{N} + (1-\alpha)\, \mathbb{I}\{i_t = i\} = w_0(i_{1:t}, i)/w_0(i_{1:t}).$$
Writing $\hat{i}_{1:n} = (i_{1:t}, i, i_{t+2:n})$,
\begin{align*}
w_{it} &= \sum_{i_{1:t},\, i_{t+2:n}} w_t(i_{1:t}, i, i_{t+2:n})
        = \sum_{i_{1:t},\, i_{t+2:n}} e^{-\eta L(i_{1:t})} w_0(\hat{i}_{1:n}) \\
       &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} \sum_{i_{t+2:n}} w_0(\hat{i}_{1:n}) \\
       &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} w_0(i_{1:t}, i) \\
       &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} w_0(i_{1:t}) \frac{w_0(i_{1:t}, i)}{w_0(i_{1:t})} \\
       &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \gamma_{i,i_t} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} w_0(i_{1:t}).
\end{align*}
RECURSION, CONTINUED

Continuing, using the marginalization $w_0(i_{1:t}) = \sum_{i_{t+1:n}} w_0(i_{1:n})$ and $w_{t-1}(i_{1:n}) = w_0(i_{1:n}) e^{-\eta L(i_{1:t-1})}$:
\begin{align*}
w_{it} &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \gamma_{i,i_t} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} w_0(i_{1:t}) \\
       &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \gamma_{i,i_t} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} \sum_{i_{t+1:n}} w_0(i_{1:n}) \\
       &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \gamma_{i,i_t} \sum_{i_{1:t-1}} \sum_{i_{t+1:n}} w_{t-1}(i_{1:n}) \\
       &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \gamma_{i,i_t}\, w_{i_t, t-1}.
\end{align*}
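The whole derivation can be checked numerically: running EWA over all $N^n$ compound actions and marginalizing must give the same $w_{it}$ as the recursion (a brute-force sketch with our own small parameter values):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, n, alpha, eta = 3, 4, 0.3, 0.5
losses = rng.random((n, N))

def w0(seq):
    # The Markov prior from the earlier slide.
    w = 1.0 / N
    for a, b in zip(seq, seq[1:]):
        w *= alpha / N + (1 - alpha) * (a == b)
    return w

def brute_marginal(t):
    # w_{it}: sum of w_t over compound actions whose (t+1)-th entry is i.
    w = np.zeros(N)
    for seq in itertools.product(range(N), repeat=n):
        L = sum(losses[s, seq[s]] for s in range(t))
        w[seq[t]] += w0(seq) * np.exp(-eta * L)
    return w

w = np.full(N, 1.0 / N)
for t in range(n - 1):
    v = w * np.exp(-eta * losses[t])           # loss update
    w = alpha / N * v.sum() + (1 - alpha) * v  # mixing step from the recursion
    assert np.allclose(w, brute_marginal(t + 1))
print("recursion matches brute-force marginalization")
```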
FIXED-SHARE FORECASTER

\begin{align*}
w_{it} &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} w_{i_t, t-1}\, \gamma_{i, i_t}
        = \sum_{i_t} e^{-\eta \ell(i_t, y_t)} w_{i_t, t-1} \left(\frac{\alpha}{N} + (1-\alpha)\, \mathbb{I}\{i_t = i\}\right) \\
       &= (1-\alpha) e^{-\eta \ell(i, y_t)} w_{i, t-1} + \frac{\alpha}{N} \sum_j e^{-\eta \ell(j, y_t)} w_{j, t-1}.
\end{align*}

FIXED-SHARE FORECASTER (FSF)
Initialize: $w_{i0} = 1/N$. For $t = 1, 2, \dots$:
1. Draw primitive action $I_t$ from the distribution $w_{i,t-1} / \sum_{j=1}^N w_{j,t-1}$.
2. Observe $y_t$ and the losses $\ell(i, y_t)$ (suffer loss $\ell(I_t, y_t)$).
3. Compute $v_{it} = w_{i,t-1} e^{-\eta \ell(i, y_t)}$.
4. Let $w_{it} = \frac{\alpha}{N} W_t + (1-\alpha) v_{it}$, where $W_t = \sum_{j=1}^N v_{jt}$.
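A direct NumPy transcription of the four steps above (a sketch; the interface, which takes all losses up front as an array and assumes full-loss feedback, is our own choice):

```python
import numpy as np

def fixed_share(losses, eta, alpha, rng=None):
    # losses[t, i] = l(i, y_t); returns the drawn primitive actions I_1..I_n.
    if rng is None:
        rng = np.random.default_rng()
    n, N = losses.shape
    w = np.full(N, 1.0 / N)                           # w_{i0} = 1/N
    actions = []
    for t in range(n):
        actions.append(rng.choice(N, p=w / w.sum()))  # step 1: draw I_t
        v = w * np.exp(-eta * losses[t])              # step 3: loss update
        w = alpha / N * v.sum() + (1 - alpha) * v     # step 4: share update
    return np.array(actions)
```

Note that step 4 redistributes an $\alpha$-fraction of the total weight uniformly: this keeps every action's weight bounded away from zero, which is what lets the forecaster re-activate an expert quickly after a switch.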
REGRET BOUND FOR FSF

THEOREM ([HERBSTER AND WARMUTH, 1998])
Consider a discrete prediction problem and any sequence $y_{1:n}$. For any compound action $i_{1:n}$,
$$R(i_{1:n}) \le \frac{s(i_{1:n}) + 1}{\eta} \ln N + \frac{1}{\eta} \ln \frac{1}{\alpha^{s(i_{1:n})} (1-\alpha)^{n - s(i_{1:n})}} + \frac{\eta}{8} n.$$
For $0 \le m \le n$ and $\alpha = m/(n-1)$, with a specific choice of $\eta = \eta(n, m, N)$,
$$R_{n,m} \le \sqrt{\frac{n}{2} \left((m+1) \ln N + (n-1) H\left(\frac{m}{n-1}\right) + \ln \frac{1}{1 - \frac{m}{n-1}}\right)}.$$
EASY PROOF

- We use the lemma on EWA with non-uniform priors; we just need an upper bound on $\ln(1/W_n) = -\ln(W_n)$.
- $W_n \ge w_n(i_{1:n})$. Hence, $-\ln(W_n) \le -\ln w_n(i_{1:n})$.
- $-\ln w_n(i_{1:n}) = -\ln w_0(i_{1:n}) + \eta L(i_{1:n})$.
- So we need an upper bound on $-\ln w_0(i_{1:n})$. Remember the definition
$$w_0(i_{1:n}) = \frac{1}{N} \left(\frac{\alpha}{N}\right)^{s(i_{1:n})} \left(1 - \alpha + \frac{\alpha}{N}\right)^{n-1-s(i_{1:n})}$$
and use $1 - \alpha + \alpha/N \ge 1 - \alpha$ together with $(1-\alpha)^{n-1-s(i_{1:n})} \ge (1-\alpha)^{n-s(i_{1:n})}$ to get the bound
$$-\ln w_0(i_{1:n}) \le (1 + s(i_{1:n})) \ln N + \ln \frac{1}{\alpha^{s(i_{1:n})} (1-\alpha)^{n - s(i_{1:n})}}.$$
- Put together. Q.e.d.
VARIABLE-SHARE FORECASTER

GOAL
The regret should be small whenever some compound action achieves a small loss with a small number of switches.

Tool: change the initial prior, penalizing switches away from good primitive actions!
$$w_0(i_{1:t+1}) = w_0(i_{1:t}) \left(\frac{1 - (1-\alpha)^{\ell(i_t, y_t)}}{N-1} + (1-\alpha)^{\ell(i_t, y_t)}\, \mathbb{I}\{i_{t+1} = i_t\}\right).$$

- This makes the prior dependent on the losses.
- Cheating? ...no: we do not need the prior $w_0(i_{1:t+1})$ before observing $y_t$!
- What does it do? If the loss of the current action is small, stay with it; otherwise encourage switching!
VARIABLE-SHARE FORECASTER: ALGORITHM

VARIABLE-SHARE FORECASTER (VSF)
Initialize: $w_{i0} = 1/N$. For $t = 1, 2, \dots$:
1. Draw primitive action $I_t$ from the distribution $w_{i,t-1} / \sum_{j=1}^N w_{j,t-1}$.
2. Observe $y_t$ and the losses $\ell(i, y_t)$ (suffer loss $\ell(I_t, y_t)$).
3. Compute $v_{it} = w_{i,t-1} e^{-\eta \ell(i, y_t)}$.
4. Let
$$w_{it} = \frac{1}{N-1} \sum_{j \ne i} \left(1 - (1-\alpha)^{\ell(j, y_t)}\right) v_{jt} + (1-\alpha)^{\ell(i, y_t)} v_{it}.$$

Result: for binary losses, the term $\frac{n - s(i_{1:n})}{\eta} \ln \frac{1}{1-\alpha}$ of the FSF bound is replaced by $\frac{s(i_{1:n}) + L(i_{1:n})}{\eta} \ln \frac{1}{1-\alpha}$.
Small complexity and small loss: big win!
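The same NumPy transcription for VSF; only step 4 differs from the fixed-share sketch (again, the interface is our own choice):

```python
import numpy as np

def variable_share(losses, eta, alpha, rng=None):
    # Identical to fixed_share except for step 4.
    if rng is None:
        rng = np.random.default_rng()
    n, N = losses.shape
    w = np.full(N, 1.0 / N)
    actions = []
    for t in range(n):
        actions.append(rng.choice(N, p=w / w.sum()))
        v = w * np.exp(-eta * losses[t])
        keep = (1 - alpha) ** losses[t]  # stay factor (1-alpha)^{l(i, y_t)}
        shared = (1 - keep) * v          # weight each action gives away
        # Each action receives an equal 1/(N-1) share of what the others shed.
        w = keep * v + (shared.sum() - shared) / (N - 1)
    return np.array(actions)
```

With loss 0, `keep` is 1 and no weight moves; with loss 1, an $\alpha$-fraction of the action's weight is spread over the others. Weight is thus shed only by actions that are currently losing, which is exactly the loss-dependent prior described above.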
OTHER EXAMPLES

- Tree experts (side information); e.g., [Helmbold and Schapire, 1997]
- Shortest path with FPL: [Kalai and Vempala, 2003]; additive losses
- Shortest path with EWA: [György et al., 2005]; compression with the best scalar quantizers [György et al., 2004]
- Shortest path tracking
- Applications: sequential allocation, motion planning (robot arms), opponent modeling
REFERENCES

Helmbold, D. P. and Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27:51-68.

György, A., Linder, T., and Lugosi, G. (2004). Efficient algorithms and minimax bounds for zero-delay lossy source coding. IEEE Transactions on Signal Processing, 52:2337-2347.

György, A., Linder, T., and Lugosi, G. (2005). Tracking the best of many experts. In Proceedings of the 18th Annual Conference on Learning Theory, pages 204-216.

Herbster, M. and Warmuth, M. (1998). Tracking the best expert. Machine Learning, 32:151-178.

Kalai, A. and Vempala, S. (2003). Efficient algorithms for the online decision problem. In Proceedings of the 16th Annual Conference on Learning Theory, pages 26-40. Springer.