LARGE CLASSES OF EXPERTS

Csaba Szepesvári, University of Alberta
CMPUT, UofA, October 31, 2006

OUTLINE

1. TRACKING THE BEST EXPERT
2. FIXED-SHARE FORECASTER
3. VARIABLE-SHARE FORECASTER
4. OTHER LARGE CLASSES OF EXPERTS
5. BIBLIOGRAPHY

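TRACKING THE BEST EXPERT [HERBSTER AND WARMUTH, 1998]

Discrete prediction problem. We want to compete with compound action sets:

$$B_{n,m} = \{(i_1,\dots,i_n) : s(i_1,\dots,i_n) \le m\},$$

where $s(i_1,\dots,i_n) = \sum_{t=2}^{n} \mathbb{I}_{\{i_{t-1} \ne i_t\}}$ is the number of switches.

Shorthand notation: $i_{1:n} = (i_1,\dots,i_n)$, $(i_{1:t-1}, i_t, i_{t+1:n})$, $\hat{i}_{1:n}$, etc.

Regret, where $I_t$ is the action played in round $t$:

$$R_{n,m} \stackrel{\text{def}}{=} \sum_{t=1}^{n} \ell(I_t, y_t) - \min_{i_{1:n} \in B_{n,m}} \sum_{t=1}^{n} \ell(i_t, y_t).$$

Instead we use $\bar{R}_{n,m}$, where

$$\bar{R}_{n,m} \stackrel{\text{def}}{=} \max_{i_{1:n} \in B_{n,m}} R(i_{1:n}), \qquad R(i_{1:n}) \stackrel{\text{def}}{=} \sum_{t=1}^{n} \ell(p_t, y_t) - \sum_{t=1}^{n} \ell(i_t, y_t).$$

Here $\ell(p_t, y_t) = \sum_{i} p_{it}\,\ell(i, y_t)$ is the expected loss of the randomized forecaster in round $t$.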

RANDOMIZED EWA APPLIED TO TRACKING PROBLEMS

Action set: $B_{n,m}$. We always select a compound action, but just play its next primitive action.

Previous regret bound gives:

$$\bar{R}_{n,m} \le \sqrt{\frac{n}{2}\ln |B_{n,m}|}.$$

$M = |B_{n,m}| = \,?$

$$M = \sum_{k=0}^{m} \binom{n-1}{k} N (N-1)^k, \qquad M \le N^{m+1} \exp\left((n-1)\, H\!\left(\frac{m}{n-1}\right)\right),$$

where $H(x) = -x \ln x - (1-x)\ln(1-x)$, $x \in [0,1]$, is the binary entropy function. Hence

$$\bar{R}_{n,m} \le \sqrt{\frac{n}{2}\left((m+1)\ln N + (n-1)\, H\!\left(\frac{m}{n-1}\right)\right)}.$$

Problem: randomized EWA is not efficient ($M$ weights!).
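The counting formula for $M$ can be sanity-checked on tiny instances. The following is a minimal sketch, not part of the slides: the function names are illustrative, the brute-force count enumerates all $N^n$ sequences and is only feasible for very small $n$, and the entropy bound is evaluated only for $m \le (n-1)/2$ (an assumption made here).

```python
from itertools import product
from math import comb, exp, log


def switches(seq):
    """s(i_1..i_n): number of positions t >= 2 with i_{t-1} != i_t."""
    return sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)


def count_brute_force(n, m, N):
    """|B_{n,m}| by enumerating all N**n compound actions (tiny cases only)."""
    return sum(1 for seq in product(range(N), repeat=n) if switches(seq) <= m)


def count_formula(n, m, N):
    """M = sum_{k=0}^{m} C(n-1, k) * N * (N-1)**k."""
    return sum(comb(n - 1, k) * N * (N - 1) ** k for k in range(m + 1))


def binary_entropy(x):
    """H(x) = -x ln x - (1-x) ln(1-x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log(x) - (1 - x) * log(1 - x)


def entropy_bound(n, m, N):
    """N**(m+1) * exp((n-1) * H(m/(n-1))); used here only for m <= (n-1)/2."""
    return N ** (m + 1) * exp((n - 1) * binary_entropy(m / (n - 1)))
```

For example, `count_formula(6, 2, 3)` agrees with `count_brute_force(6, 2, 3)`, and both stay below `entropy_bound(6, 2, 3)`. Even for these small values the count grows quickly, which is the efficiency problem the slide points out.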

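TOWARDS AN EFFICIENT IMPLEMENTATION

A useful observation:

LEMMA (EWA WITH NON-UNIFORM PRIORS)
Assume that $W_0 = \sum_i w_{i0} \le 1$, $w_{i0} \ge 0$. Consider randomized EWA. Then

$$\sum_{t=1}^{n} \ell(p_t, y_t) \le \frac{1}{\eta}\ln\frac{1}{W_n} + \frac{\eta}{8}\, n,$$

where $W_n = \sum_{i=1}^{N} w_{in} = \sum_{i=1}^{N} w_{i0}\, e^{-\eta L_{in}}$ and $L_{in} = \sum_{t=1}^{n} \ell(i, y_t)$.

How does this help? Initial weights act like priors:

$$L_{1n} \le L_{2n} \;\Rightarrow\; e^{-\eta L_{1n}} \ge e^{-\eta L_{2n}}.$$

It is good to assign large initial weights to actions that are expected to have small loss: this makes $W_n$ large and the bound small.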
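The lemma can be checked numerically on a small example. This is a sketch under stated assumptions rather than the lecture's code: losses are taken in $[0,1]$, the prior sums to at most one, and the names `ewa_expected_loss` and `lemma_bound` are hypothetical helpers introduced here.

```python
import math


def ewa_expected_loss(prior, losses, eta):
    """Run EWA started from the (possibly non-uniform) weights `prior`.

    losses[t][i] is the loss of action i in round t+1, assumed to lie in [0, 1].
    Returns (cumulative expected loss sum_t l(p_t, y_t), final total weight W_n).
    """
    w = list(prior)
    total = 0.0
    for loss_t in losses:
        W = sum(w)
        # Expected loss under p_t = w / W.
        total += sum(wi / W * li for wi, li in zip(w, loss_t))
        # Exponential weight update.
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss_t)]
    return total, sum(w)


def lemma_bound(W_n, eta, n):
    """Right-hand side of the lemma: (1/eta) ln(1/W_n) + eta * n / 8."""
    return math.log(1.0 / W_n) / eta + eta * n / 8.0
```

With a prior that puts most mass on an action that turns out to be good, `W_n` is larger and `lemma_bound` is correspondingly smaller, which is the sense in which initial weights act like priors.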

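TOWARDS AN EFFICIENT IMPLEMENTATION

IDEA
Consider EWA on all action sequences, but with an appropriate prior, reflecting our belief that many switches are unlikely.

Let $w_t(i_{1:n})$ be the weight EWA assigns to $i_{1:n}$ after observing $y_{1:t}$. What is a good set of initial weights?

$$w_0(i_{1:n}) \stackrel{\text{def}}{=} \frac{1}{N}\left(\frac{\alpha}{N}\right)^{s(i_{1:n})}\left(1 - \alpha + \frac{\alpha}{N}\right)^{n - s(i_{1:n}) - 1}.$$

$0 < \alpha < 1$: prior belief in a switch. The exponent $n - s(i_{1:n}) - 1$ counts the transitions without a switch among the $n-1$ transitions, so these weights sum to one. If $i_{1:n}$ has many switches, it will be assigned a small weight by this prior!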

WHAT??

Marginalized weights: $w_0(i_{1:t}) = \sum_{i_{t+1:n}} w_0(i_{1:n})$.

LEMMA (MARKOV PROCESS VIEW)
The following hold:

$$w_0(i_1) = \frac{1}{N}, \qquad w_0(i_{1:t+1}) = w_0(i_{1:t})\left(\frac{\alpha}{N} + (1-\alpha)\,\mathbb{I}_{\{i_{t+1} = i_t\}}\right).$$

Interpretation: the non-marginalized $w_0$ is the distribution underlying a Markov process.
Stay at the same primitive action with probability $1 - \alpha + \alpha/N$.
Switch to any other particular action with probability $\alpha/N$.
Role of $\alpha$: prior belief in a switch.
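The Markov-process view can be verified by brute force on a tiny instance. The sketch below is illustrative only; it assumes the closed-form prior uses the exponent $n - s - 1$ for the non-switching transitions (which is what the recursion in the lemma produces), and the function names are made up for this example.

```python
from itertools import product


def switches(seq):
    """Number of positions where consecutive actions differ."""
    return sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)


def prior_closed_form(seq, N, alpha):
    """(1/N) (alpha/N)^s (1 - alpha + alpha/N)^(n - s - 1); exponent is an assumption."""
    n, s = len(seq), switches(seq)
    return (1.0 / N) * (alpha / N) ** s * (1 - alpha + alpha / N) ** (n - s - 1)


def prior_markov(seq, N, alpha):
    """Same prior built step by step from the lemma's recursion."""
    w = 1.0 / N  # w_0(i_1) = 1/N
    for prev, cur in zip(seq, seq[1:]):
        w *= alpha / N + (1 - alpha) * (1.0 if cur == prev else 0.0)
    return w


def total_mass(n, N, alpha):
    """Sum of the prior over all N**n compound actions (tiny cases only)."""
    return sum(prior_markov(seq, N, alpha) for seq in product(range(N), repeat=n))
```

On small cases the two constructions agree sequence by sequence and `total_mass` returns one up to rounding, consistent with reading $w_0$ as the law of a Markov chain that stays put with probability $1 - \alpha + \alpha/N$.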

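TOWARDS AN EFFICIENT ALGORITHM

$w_t(i_{1:n})$: weight assigned to the compound action $i_{1:n}$ by EWA after seeing $y_{1:t}$.

What is the probability of a primitive action $i$?

$$w_{it} \stackrel{\text{def}}{=} \sum_{i_{1:t},\, i_{t+2:n}} w_t(i_1,\dots,i_t, i, i_{t+2},\dots,i_n), \qquad t \ge 1.$$

Clearly, $w_{i0} = 1/N$, in line with the previous definition. Normalizing gives the probability of playing $i$ in round $t+1$:

$$p_{i,t+1} = \frac{w_{it}}{W_t}, \qquad W_t = \sum_{i=1}^{N} w_{it}.$$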

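RECURSION FOR THE WEIGHTS

$w_t(i_{1:n}) = w_0(i_{1:n})\, e^{-\eta L(i_{1:t})}$, where $L(i_{1:t}) = \sum_{s=1}^{t} \ell(i_s, y_s)$.

$\gamma_{i,i_t} = \frac{\alpha}{N} + (1-\alpha)\,\mathbb{I}_{\{i_t = i\}} = w_0(i_{1:t}, i) / w_0(i_{1:t})$.

With $\hat{i}_{1:n} = (i_{1:t}, i, i_{t+2:n})$:

$$
\begin{aligned}
w_{it} &= \sum_{i_{1:t},\, i_{t+2:n}} w_t(i_{1:t}, i, i_{t+2:n}) = \sum_{i_{1:t},\, i_{t+2:n}} e^{-\eta L(i_{1:t})}\, w_0(\hat{i}_{1:n}) \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} \sum_{i_{t+2:n}} w_0(\hat{i}_{1:n}) \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:t}, i) \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:t})\, \frac{w_0(i_{1:t}, i)}{w_0(i_{1:t})} \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:t})\, \gamma_{i,i_t} \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:t}).
\end{aligned}
$$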

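RECURSION, CONTINUED

$w_t(i_{1:n}) = w_0(i_{1:n})\, e^{-\eta L(i_{1:t})}$, where $L(i_{1:t}) = \sum_{s=1}^{t} \ell(i_s, y_s)$.

$\gamma_{i,i_t} = \frac{\alpha}{N} + (1-\alpha)\,\mathbb{I}_{\{i_t = i\}} = w_0(i_{1:t}, i) / w_0(i_{1:t})$.

$$
\begin{aligned}
w_{it} &= \dots = \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:t}) \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} \sum_{i_{t+1:n}} w_0(i_{1:n}) \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1}} \sum_{i_{t+1:n}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:n}) \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1}} \sum_{i_{t+1:n}} w_{t-1}(i_{1:n}) \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t}\, w_{i_t, t-1}.
\end{aligned}
$$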
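The recursion can be compared against the definition of the marginal weights by direct enumeration. The following is a sketch for very small $n$ and $N$ only; it assumes the Markov prior with stay factor $\alpha/N + (1-\alpha)$ and switch factor $\alpha/N$, uses 0-based indexing (so `seq[t]` is the action in position $t+1$), and all function names are hypothetical.

```python
import math
from itertools import product


def markov_prior(seq, N, alpha):
    """w_0(seq) built from w_0(i_1) = 1/N and the factors gamma."""
    w = 1.0 / N
    for prev, cur in zip(seq, seq[1:]):
        w *= alpha / N + (1 - alpha) * (1.0 if cur == prev else 0.0)
    return w


def marginal_brute_force(losses, t, n, N, alpha, eta):
    """w_{i,t} for all i, summing w_t(seq) over every seq with seq[t] == i.

    Requires 0 <= t <= n - 1; losses[s][j] is the loss of action j in round s+1.
    """
    w = [0.0] * N
    for seq in product(range(N), repeat=n):
        L = sum(losses[s][seq[s]] for s in range(t))  # L(i_{1:t})
        w[seq[t]] += markov_prior(seq, N, alpha) * math.exp(-eta * L)
    return w


def marginal_recursion(losses, t, N, alpha, eta):
    """w_{i,t} from w_{i,t} = sum_j exp(-eta l(j, y_t)) gamma_{i,j} w_{j,t-1}."""
    w = [1.0 / N] * N
    for s in range(t):
        v = [w[j] * math.exp(-eta * losses[s][j]) for j in range(N)]
        w = [
            sum(v[j] * (alpha / N + (1 - alpha) * (1.0 if i == j else 0.0))
                for j in range(N))
            for i in range(N)
        ]
    return w
```

The brute-force version touches all $N^n$ compound actions, while the recursion needs only $O(N^2)$ work per round as written here, which is the point of the derivation.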

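FIXED-SHARE FORECASTER

$$
\begin{aligned}
w_{it} &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, w_{i_t, t-1}\, \gamma_{i,i_t} \\
&= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, w_{i_t, t-1} \left(\frac{\alpha}{N} + (1-\alpha)\,\mathbb{I}_{\{i_t = i\}}\right) \\
&= (1-\alpha)\, e^{-\eta \ell(i, y_t)}\, w_{i,t-1} + \frac{\alpha}{N} \sum_{j} e^{-\eta \ell(j, y_t)}\, w_{j,t-1}.
\end{aligned}
$$

FIXED-SHARE FORECASTER (FSF)
Initialize: $w_{i0} = 1/N$.
1. Draw primitive action $I_t$ from $w_{i,t-1} / \sum_{j=1}^{N} w_{j,t-1}$.
2. Observe $y_t$ and the losses $\ell(i, y_t)$ (the forecaster suffers loss $\ell(I_t, y_t)$).
3. Compute $v_{it} = w_{i,t-1}\, e^{-\eta \ell(i, y_t)}$.
4. Let $w_{it} = \frac{\alpha}{N} W_t + (1-\alpha)\, v_{it}$, where $W_t = \sum_{j=1}^{N} v_{jt}$.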
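The four steps of the forecaster translate almost line by line into code. This is a minimal sketch, not the lecture's implementation: the loss matrix is assumed to be given in advance as `losses[t][i]`, the names `fixed_share_update` and `run_fixed_share` are hypothetical, and weights are not renormalized, so very long runs may underflow.

```python
import math
import random


def fixed_share_update(w, loss_t, eta, alpha):
    """Steps 3 and 4: exponential update followed by sharing a fraction alpha."""
    N = len(w)
    v = [wi * math.exp(-eta * li) for wi, li in zip(w, loss_t)]  # step 3
    W = sum(v)
    return [alpha / N * W + (1 - alpha) * vi for vi in v]        # step 4


def run_fixed_share(losses, eta, alpha, rng):
    """Play the forecaster on a fixed loss sequence.

    Returns (list of drawn actions I_t, cumulative expected loss sum_t l(p_t, y_t)).
    """
    N = len(losses[0])
    w = [1.0 / N] * N                                   # initialize w_{i0} = 1/N
    actions, expected_loss = [], 0.0
    for loss_t in losses:
        W = sum(w)
        p = [wi / W for wi in w]
        actions.append(rng.choices(range(N), weights=p)[0])       # step 1
        expected_loss += sum(pi * li for pi, li in zip(p, loss_t))  # step 2
        w = fixed_share_update(w, loss_t, eta, alpha)
    return actions, expected_loss
```

A call such as `run_fixed_share(losses, eta=0.5, alpha=0.05, rng=random.Random(0))` uses $O(N)$ time per round. Setting `alpha=0` recovers plain EWA, since then step 4 leaves `v` unchanged.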

REGRET BOUND FOR FSF

THEOREM ([HERBSTER AND WARMUTH, 1998])
Consider a discrete prediction problem and any sequence $y_{1:n}$. For any compound action $i_{1:n}$,

$$R(i_{1:n}) \le \frac{s(i_{1:n}) + 1}{\eta}\ln N + \frac{1}{\eta}\ln\frac{1}{\alpha^{s(i_{1:n})}(1-\alpha)^{n - s(i_{1:n})}} + \frac{\eta}{8}\, n.$$

For $0 \le m < n$, $\alpha = m/(n-1)$, and a specific choice of $\eta = \eta(n, m, N)$,

$$\bar{R}_{n,m} \le \sqrt{\frac{n}{2}\left((m+1)\ln N + (n-1)\, H\!\left(\frac{m}{n-1}\right) + \ln(\dots)\right)}.$$
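The first inequality of the theorem can be checked empirically against a fixed comparator sequence. The sketch below makes several assumptions that are not in the slides: losses lie in $[0,1]$, the comparator is chosen by hand, $\eta$ is picked arbitrarily rather than tuned, and the helper names are illustrative. The logarithm is expanded as $s\ln(1/\alpha) + (n-s)\ln(1/(1-\alpha))$ to avoid underflow.

```python
import math


def switches(seq):
    """Number of positions where consecutive actions differ."""
    return sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)


def fsf_expected_loss(losses, eta, alpha):
    """Cumulative expected loss sum_t l(p_t, y_t) of the fixed-share forecaster."""
    N = len(losses[0])
    w = [1.0 / N] * N
    total = 0.0
    for loss_t in losses:
        W = sum(w)
        total += sum(wi / W * li for wi, li in zip(w, loss_t))
        v = [wi * math.exp(-eta * li) for wi, li in zip(w, loss_t)]
        V = sum(v)
        w = [alpha / N * V + (1 - alpha) * vi for vi in v]
    return total


def fsf_regret_bound(s, n, N, eta, alpha):
    """(s+1)/eta ln N + (1/eta) ln(1 / (alpha^s (1-alpha)^(n-s))) + eta n / 8."""
    log_term = s * math.log(1 / alpha) + (n - s) * math.log(1 / (1 - alpha))
    return (s + 1) / eta * math.log(N) + log_term / eta + eta * n / 8


def regret_against(losses, comparator, eta, alpha):
    """R(i_{1:n}) for a given compound action `comparator`."""
    comp_loss = sum(loss_t[i] for loss_t, i in zip(losses, comparator))
    return fsf_expected_loss(losses, eta, alpha) - comp_loss
```

For a comparator with two switches over $n = 120$ rounds and $\alpha = 2/119$, `regret_against` should not exceed `fsf_regret_bound(2, 120, N, eta, alpha)` for any positive `eta`, since the theorem holds for every loss sequence.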

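EASY PROOF

We use the Lemma on EWA with non-uniform prior; we just need an upper bound on $\ln(1/W_n) = -\ln W_n$.

$W_n \ge w_n(i_{1:n})$. Hence $-\ln W_n \le -\ln w_n(i_{1:n})$.

$-\ln w_n(i_{1:n}) = -\ln w_0(i_{1:n}) + \eta L(i_{1:n})$.

Need an upper bound on $-\ln w_0(i_{1:n})$. Remember the definition:

$$w_0(i_{1:n}) = \frac{1}{N}\left(\frac{\alpha}{N}\right)^{s(i_{1:n})}\left(1 - \alpha + \frac{\alpha}{N}\right)^{n - s(i_{1:n}) - 1}.$$

Using $1 - \alpha + \alpha/N \ge 1 - \alpha$ and $1/(1-\alpha) \ge 1$, we get the bound

$$-\ln w_0(i_{1:n}) \le (1 + s(i_{1:n}))\ln N + \ln\frac{1}{\alpha^{s(i_{1:n})}(1-\alpha)^{n - s(i_{1:n})}}.$$

Put together. Q.E.D.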

VARIABLE-SHARE FORECASTER

GOAL
Regret should be small when there is a compound action that achieves a small loss with a small number of switches.

Tool: change the initial prior so that it penalizes switches away from good primitive actions!

$$w_0(i_{1:t+1}) = w_0(i_{1:t})\left(\frac{1 - (1-\alpha)^{\ell(i_t, y_t)}}{N-1}\,\mathbb{I}_{\{i_t \ne i_{t+1}\}} + (1-\alpha)^{\ell(i_t, y_t)}\,\mathbb{I}_{\{i_t = i_{t+1}\}}\right).$$

This makes the prior dependent on the losses. Cheating? No, we do not need the prior $w_0(i_{1:t+1})$ before observing $y_t$!

What does it do? If the loss of the current action is small, stay at it; otherwise encourage switching!

VARIABLE-SHARE FORECASTER: ALGORITHM

VARIABLE-SHARE FORECASTER (VSF)
Initialize: $w_{i0} = 1/N$.
1. Draw primitive action $I_t$ from $w_{i,t-1} / \sum_{j=1}^{N} w_{j,t-1}$.
2. Observe $y_t$ and the losses $\ell(i, y_t)$ (the forecaster suffers loss $\ell(I_t, y_t)$).
3. Compute $v_{it} = w_{i,t-1}\, e^{-\eta \ell(i, y_t)}$.
4. Let $w_{it} = \frac{1}{N-1} \sum_{j \ne i} \left(1 - (1-\alpha)^{\ell(j, y_t)}\right) v_{jt} + (1-\alpha)^{\ell(i, y_t)}\, v_{it}$.

Result: for binary losses,

$$\frac{n - s(i_{1:n}) - 1}{\eta}\ln\frac{1}{1-\alpha} \quad\text{is replaced by}\quad \left(s(i_{1:n}) + \frac{1}{\eta} L(i_{1:n})\right)\ln\frac{1}{1-\alpha}.$$

Small complexity, small loss: big win.
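Step 4 of the variable-share update can be written so that each action gives away a loss-dependent fraction of its weight. The following is a sketch under stated assumptions: losses lie in $[0,1]$, there are at least two actions, and the function name `variable_share_update` is illustrative rather than taken from the slides.

```python
import math


def variable_share_update(w, loss_t, eta, alpha):
    """One round of the variable-share update (steps 3 and 4).

    Action j keeps the fraction (1 - alpha) ** loss_t[j] of its updated weight
    and spreads the remainder evenly over the other N - 1 actions.
    """
    N = len(w)
    if N < 2:
        raise ValueError("variable share needs at least two actions")
    v = [wi * math.exp(-eta * li) for wi, li in zip(w, loss_t)]   # step 3
    keep = [(1 - alpha) ** li for li in loss_t]
    given = [(1 - k) * vj for k, vj in zip(keep, v)]              # mass each action shares
    pool = sum(given)
    # Step 4: action i receives the shared mass of every other action j != i.
    return [(pool - given[i]) / (N - 1) + keep[i] * v[i] for i in range(N)]
```

An action with zero loss keeps all of its weight, so the forecaster only moves mass away from actions that are currently doing badly. The total weight after the update equals the sum of the `v` values, as in the fixed-share case.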

OTHER EXAMPLES

Tree experts (side info); e.g. [D.P. Helmbold, 1997]
Shortest path FPL: [Kalai and Vempala, 2003]; additive losses
Shortest path EWA: [György et al., 2005]; compression: best scalar quantizers [György et al., 2004]
Shortest path tracking

Applications:
Sequential allocation
Motion planning (robot arms)
Opponent modeling

REFERENCES

D.P. Helmbold, R. S. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27.
György, A., Linder, T., and Lugosi, G. (2004). Efficient algorithms and minimax bounds for zero-delay lossy source coding. IEEE Transactions on Signal Processing, 52.
György, A., Linder, T., and Lugosi, G. (2005). Tracking the best of many experts. Pages 204-216.
Herbster, M. and Warmuth, M. (1998). Tracking the best expert. Machine Learning, 32.
Kalai, A. and Vempala, S. (2003). Efficient algorithms for the online decision problem. In Proceedings of the 16th Annual Conference on Learning Theory. Springer.
