
LARGE CLASSES OF EXPERTS
Csaba Szepesvári
University of Alberta, CMPUT 654
E-mail: szepesva@ualberta.ca
UofA, October 31, 2006

OUTLINE
1 TRACKING THE BEST EXPERT
2 FIXED SHARE FORECASTER
3 VARIABLE-SHARE FORECASTER
4 OTHER LARGE CLASSES OF EXPERTS
5 BIBLIOGRAPHY

TRACKING THE BEST EXPERT [HERBSTER AND WARMUTH, 1998]
Discrete prediction problem. Want to compete with compound action sets:
$$B_{n,m} = \{(i_1,\dots,i_n) : s(i_1,\dots,i_n) \le m\},$$
where $s(i_1,\dots,i_n) = \sum_{t=2}^n \mathbb{I}_{\{i_{t-1} \ne i_t\}}$ is the number of switches.
Shorthand notation: $i_{1:n} = (i_1,\dots,i_n)$, $\hat\imath_{1:n} = (i_{1:t-1}, i_t, i_{t+1:n})$, etc.
Regret:
$$R_{n,m} \stackrel{\mathrm{def}}{=} \sum_{t=1}^n \ell(p_t, y_t) - \min_{i_{1:n} \in B_{n,m}} \sum_{t=1}^n \ell(i_t, y_t).$$
Instead we use $\hat R_{n,m}$, where
$$\hat R_{n,m} \stackrel{\mathrm{def}}{=} \max_{i_{1:n} \in B_{n,m}} R(i_{1:n}), \qquad R(i_{1:n}) \stackrel{\mathrm{def}}{=} \sum_{t=1}^n \ell(p_t, y_t) - \sum_{t=1}^n \ell(i_t, y_t).$$
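
A tiny illustrative sketch (not from the slides) of the switch count and the class B_{n,m}; the helper names count_switches and compound_class are made up for this example.

```python
from itertools import product

def count_switches(seq):
    """s(i_1,...,i_n): number of positions t >= 2 with i_{t-1} != i_t."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def compound_class(n, m, N):
    """Enumerate B_{n,m}: all length-n sequences over N primitive actions with at most m switches."""
    return [seq for seq in product(range(N), repeat=n) if count_switches(seq) <= m]

# Example: N = 3 primitive actions, horizon n = 4, at most m = 1 switch.
B = compound_class(n=4, m=1, N=3)
print(len(B))                   # 21 = 3 (no switch) + 3*3*2 (one switch at t = 2, 3, or 4)
print((0, 0, 2, 2) in set(B))   # True: exactly one switch
```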

RANDOMIZED EWA APPLIED TO TRACKING PROBLEMS
Action set: $B_{n,m}$. We always select a compound action, but just play the next primitive action.
The previous regret bound gives:
$$\hat R_{n,m} \le \sqrt{\tfrac{n}{2} \ln |B_{n,m}|}.$$
How large is $M = |B_{n,m}|$?
$$M = \sum_{k=0}^{m} \binom{n-1}{k} N (N-1)^k, \qquad
M \le N^{m+1} \exp\!\left((n-1)\,H\!\left(\tfrac{m}{n-1}\right)\right),$$
where $H(x) = -x \ln x - (1-x)\ln(1-x)$, $x \in [0,1]$, is the binary entropy function.
Hence
$$\hat R_{n,m} \le \sqrt{\frac{n}{2}\left((m+1)\ln N + (n-1)\,H\!\left(\frac{m}{n-1}\right)\right)}.$$
Problem: randomized EWA is not efficient ($M$ weights!)
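
A quick numeric sanity check (illustrative only, not from the slides) of the counting formula for M and its entropy upper bound; the function names are made up.

```python
import math

def class_size(n, m, N):
    """M = |B_{n,m}| = sum_{k=0}^m C(n-1, k) * N * (N-1)^k."""
    return sum(math.comb(n - 1, k) * N * (N - 1) ** k for k in range(m + 1))

def entropy_bound(n, m, N):
    """Upper bound N^(m+1) * exp((n-1) * H(m/(n-1))), H = binary entropy with natural logs."""
    x = m / (n - 1)
    H = 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)
    return N ** (m + 1) * math.exp((n - 1) * H)

n, m, N = 100, 5, 10
M = class_size(n, m, N)
print(M <= entropy_bound(n, m, N))      # True
print(math.sqrt(n / 2 * math.log(M)))   # the (inefficient) EWA regret bound sqrt((n/2) ln M)
```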

TOWARDS AN EFFICIENT IMPLEMENTATION
A useful observation:
LEMMA (EWA WITH NON-UNIFORM PRIORS)
Assume that $W_0 = \sum_i w_{i0} \le 1$, $w_{i0} \ge 0$. Consider randomized EWA. Then
$$\sum_{t=1}^n \ell(p_t, y_t) \le \frac{1}{\eta}\ln\frac{1}{W_n} + \frac{\eta}{8}\, n,$$
where $W_n = \sum_{i=1}^N w_{in} = \sum_{i=1}^N w_{i0}\, e^{-\eta L_{in}}$, $L_{in} = \sum_{t=1}^n \ell(i, y_t)$.
How does this help? Initial weights act like priors: $L_{1n} \le L_{2n} \iff e^{-\eta L_{1n}} \ge e^{-\eta L_{2n}}$.
It is good to assign large initial weights to actions that are expected to have small loss.
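
To see the lemma's message numerically, here is a small illustrative computation (not from the slides) of the bound (1/eta) ln(1/W_n) + (eta/8) n under two priors: it is smaller when the prior puts more mass on the expert that ends up with the smaller cumulative loss.

```python
import math

def ewa_prior_bound(prior, cum_losses, eta, n):
    """(1/eta) * ln(1/W_n) + (eta/8) * n, with W_n = sum_i w_{i0} * exp(-eta * L_{in})."""
    W_n = sum(w * math.exp(-eta * L) for w, L in zip(prior, cum_losses))
    return math.log(1.0 / W_n) / eta + eta * n / 8.0

n, eta = 100, 0.3
L = [20.0, 60.0]                               # cumulative losses of two experts after n rounds
print(ewa_prior_bound([0.5, 0.5], L, eta, n))  # uniform prior
print(ewa_prior_bound([0.9, 0.1], L, eta, n))  # more prior mass on the eventually better expert: smaller bound
```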

TOWARDS AN EFFICIENT IMPLEMENTATION
IDEA: Consider EWA on all action sequences, but with an appropriate prior, reflecting our belief that many switches are unlikely.
Let $w_t(i_{1:n})$ be the weight of EWA after observing $y_{1:t}$. What is a good set of initial weights?
$$w_0(i_{1:n}) \stackrel{\mathrm{def}}{=} \frac{1}{N}\left(\frac{\alpha}{N}\right)^{s(i_{1:n})}\left(1 - \alpha + \frac{\alpha}{N}\right)^{n-1-s(i_{1:n})}.$$
$0 < \alpha < 1$: prior belief in a switch.
If $i_{1:n}$ has many switches, it will be assigned a small weight by this prior!

WHAT??
Marginalized weights: $w_0(i_{1:t}) = \sum_{i_{t+1:n}} w_0(i_{1:n})$.
LEMMA (MARKOV PROCESS VIEW)
The following hold:
$$w_0(i_1) = \frac{1}{N}, \qquad
w_0(i_{1:t+1}) = w_0(i_{1:t})\left(\frac{\alpha}{N} + (1-\alpha)\,\mathbb{I}_{\{i_{t+1}=i_t\}}\right).$$
Interpretation: the non-marginalized $w_0$ is the distribution underlying a Markov process.
Stay at the same primitive action with probability $1 - \alpha + \alpha/N$.
Switch to any other particular action with probability $\alpha/N$.
Role of $\alpha$: prior belief in a switch.
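
The closed-form prior of the previous slide and the Markov-chain factorization above can be checked against each other on small examples; this is an illustrative sketch (the function names are made up).

```python
import math

def prior_closed_form(seq, alpha, N):
    """w_0(i_{1:n}) = (1/N) * (alpha/N)^s * (1 - alpha + alpha/N)^(n-1-s), s = number of switches."""
    s = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    n = len(seq)
    return (1 / N) * (alpha / N) ** s * (1 - alpha + alpha / N) ** (n - 1 - s)

def prior_markov(seq, alpha, N):
    """Same prior built from the transition factors alpha/N + (1 - alpha) * I{i_{t+1} = i_t}."""
    w = 1 / N
    for a, b in zip(seq, seq[1:]):
        w *= alpha / N + (1 - alpha) * (1.0 if b == a else 0.0)
    return w

alpha, N = 0.1, 5
few_switches = (0, 0, 0, 1, 1, 1)
many_switches = (0, 1, 2, 3, 4, 0)
for seq in (few_switches, many_switches):
    assert math.isclose(prior_closed_form(seq, alpha, N), prior_markov(seq, alpha, N))
print(prior_closed_form(few_switches, alpha, N) > prior_closed_form(many_switches, alpha, N))  # True
```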

TOWARDS AN EFFICIENT ALGORITHM
$w_t(i_{1:n})$: weight assigned to compound action $i_{1:n}$ by EWA after seeing $y_{1:t}$.
What is the probability of a primitive action $i$?
$$w_{it} \stackrel{\mathrm{def}}{=} \sum_{i_{1:t},\, i_{t+2:n}} w_t(i_1,\dots,i_t, i, i_{t+2},\dots,i_n), \qquad t \ge 1.$$
Clearly, $w_{i0} = 1/N$, in line with the previous definition.
$$p_{it} = w_{it}/W_t, \qquad W_t = \sum_{i=1}^N w_{it}.$$

RECURSION FOR THE WEIGHTS
$w_t(i_{1:n}) = w_0(i_{1:n})\, e^{-\eta L(i_{1:t})}$, where $L(i_{1:t}) = \sum_{s=1}^t \ell(i_s, y_s)$.
$\gamma_{i,i_t} = \alpha/N + (1-\alpha)\,\mathbb{I}_{\{i_t=i\}} = w_0(i_{1:t}, i)/w_0(i_{1:t})$.
Writing $\hat\imath_{1:n} = (i_{1:t}, i, i_{t+2:n})$,
$$\begin{aligned}
w_{it} &= \sum_{i_{1:t},\, i_{t+2:n}} w_t(\hat\imath_{1:n})
        = \sum_{i_{1:t},\, i_{t+2:n}} e^{-\eta L(i_{1:t})}\, w_0(\hat\imath_{1:n}) \\
 &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} \sum_{i_{t+2:n}} w_0(\hat\imath_{1:n}) \\
 &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:t}, i) \\
 &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:t}) \frac{w_0(i_{1:t}, i)}{w_0(i_{1:t})} \\
 &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:t}).
\end{aligned}$$

RECURSION, CONTINUED
$w_t(i_{1:n}) = w_0(i_{1:n})\, e^{-\eta L(i_{1:t})}$, where $L(i_{1:t}) = \sum_{s=1}^t \ell(i_s, y_s)$.
$\gamma_{i,i_t} = \alpha/N + (1-\alpha)\,\mathbb{I}_{\{i_t=i\}} = w_0(i_{1:t}, i)/w_0(i_{1:t})$.
$$\begin{aligned}
w_{it} &= \dots = \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:t}) \\
 &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1}} e^{-\eta L(i_{1:t-1})} \sum_{i_{t+1:n}} w_0(i_{1:n}) \\
 &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1},\, i_{t+1:n}} e^{-\eta L(i_{1:t-1})}\, w_0(i_{1:n}) \\
 &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t} \sum_{i_{1:t-1},\, i_{t+1:n}} w_{t-1}(i_{1:n})
  = \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, \gamma_{i,i_t}\, w_{i_t,\, t-1}.
\end{aligned}$$
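
The recursion just derived can be sanity-checked by brute force on a tiny instance: marginalizing the full EWA weights over all compound actions gives the same primitive-action weights as the recursion. This is an illustrative sketch with random binary losses; all names are made up.

```python
import math
import random
from itertools import product

random.seed(0)
N, n, alpha, eta = 3, 5, 0.2, 0.5
loss = [[random.randint(0, 1) for _ in range(N)] for _ in range(n)]   # loss[t][i] = l(i, y_{t+1})

def w0(seq):
    """Prior of a compound action, built from the Markov transition factors."""
    w = 1 / N
    for a, b in zip(seq, seq[1:]):
        w *= alpha / N + (1 - alpha) * (b == a)
    return w

def marginal_brute_force(t, i):
    """w_{it}: sum of w_0(seq) * exp(-eta * L(seq_{1:t})) over all seq whose (t+1)-th entry is i."""
    total = 0.0
    for seq in product(range(N), repeat=n):
        if seq[t] != i:
            continue
        L = sum(loss[s][seq[s]] for s in range(t))
        total += w0(seq) * math.exp(-eta * L)
    return total

# Recursion: w_{i0} = 1/N;  w_{it} = sum_j exp(-eta*l(j,y_t)) * (alpha/N + (1-alpha)*I{j=i}) * w_{j,t-1}.
w = [1 / N] * N
for t in range(1, n):
    v = [w[j] * math.exp(-eta * loss[t - 1][j]) for j in range(N)]
    w = [alpha / N * sum(v) + (1 - alpha) * v[i] for i in range(N)]
    for i in range(N):
        assert math.isclose(w[i], marginal_brute_force(t, i)), (t, i)
print("recursion matches brute-force marginalization")
```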

FIXED-SHARE FORECASTER
$$\begin{aligned}
w_{it} &= \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, w_{i_t,t-1}\, \gamma_{i,i_t}
        = \sum_{i_t} e^{-\eta \ell(i_t, y_t)}\, w_{i_t,t-1} \left(\frac{\alpha}{N} + (1-\alpha)\,\mathbb{I}_{\{i_t=i\}}\right) \\
       &= (1-\alpha)\, e^{-\eta \ell(i, y_t)}\, w_{i,t-1} + \frac{\alpha}{N} \sum_j e^{-\eta \ell(j, y_t)}\, w_{j,t-1}.
\end{aligned}$$
FIXED-SHARE FORECASTER (FSF)
Initialize: $w_{i0} = 1/N$.
1 Draw primitive action $I_t$ from the distribution $w_{i,t-1} / \sum_{j=1}^N w_{j,t-1}$.
2 Observe $y_t$ and the losses $\ell(i, y_t)$ (suffer loss $\ell(I_t, y_t)$).
3 Compute $v_{it} = w_{i,t-1}\, e^{-\eta \ell(i, y_t)}$.
4 Let $w_{it} = \frac{\alpha}{N} W_t + (1-\alpha)\, v_{it}$, where $W_t = \sum_{j=1}^N v_{jt}$.
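
A compact illustrative implementation of the FSF steps above (not the authors' code); the toy loss matrix and the function name are made up, and losses are assumed to lie in [0, 1].

```python
import math
import random

def fixed_share(losses, eta, alpha):
    """Fixed-Share Forecaster on a precomputed loss matrix losses[t][i] in [0, 1].
    Returns, per round, the sampling distribution p_t and the drawn action I_t."""
    n, N = len(losses), len(losses[0])
    w = [1.0 / N] * N
    history = []
    for t in range(n):
        W = sum(w)
        p = [wi / W for wi in w]                          # step 1: sampling distribution
        I_t = random.choices(range(N), weights=p)[0]      # draw primitive action I_t
        v = [w[i] * math.exp(-eta * losses[t][i]) for i in range(N)]      # step 3
        Wt = sum(v)
        w = [alpha / N * Wt + (1 - alpha) * v[i] for i in range(N)]       # step 4: share update
        history.append((p, I_t))
    return history

# Toy run: expert 0 is best in the first half, expert 2 in the second half.
random.seed(1)
n, N = 40, 3
losses = [[0.0 if (i == 0 and t < n // 2) or (i == 2 and t >= n // 2) else 1.0
           for i in range(N)] for t in range(n)]
hist = fixed_share(losses, eta=0.5, alpha=0.1)
print([round(pi, 2) for pi in hist[-1][0]])   # mass has shifted towards expert 2
```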

REGRET BOUND FOR FSF
THEOREM ([HERBSTER AND WARMUTH, 1998])
Consider a discrete prediction problem and any sequence $y_{1:n}$. For any compound action $i_{1:n}$,
$$R(i_{1:n}) \le \frac{s(i_{1:n}) + 1}{\eta}\,\ln N + \frac{1}{\eta}\,\ln\frac{1}{\alpha^{s(i_{1:n})}\,(1-\alpha)^{n-s(i_{1:n})}} + \frac{\eta}{8}\, n.$$
For $0 \le m \le n-1$ and $\alpha = m/(n-1)$, with a specific choice of $\eta = \eta(n, m, N)$,
$$\hat R_{n,m} \le \sqrt{\frac{n}{2}\left((m+1)\ln N + (n-1)\,H\!\left(\frac{m}{n-1}\right) + \ln\frac{1}{1 - \frac{m}{n-1}}\right)}.$$
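
The "specific choice of $\eta$" is presumably the usual tuning of a bound of the form $B/\eta + (\eta/8)\, n$, with $B$ the bracketed quantity; a one-line worked step (my reconstruction, not spelled out on the slide):
$$\min_{\eta > 0}\left(\frac{B}{\eta} + \frac{\eta}{8}\, n\right) = \sqrt{\frac{nB}{2}}, \qquad \text{attained at } \eta = \sqrt{\frac{8B}{n}}.$$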

EASY PROOF
We use the lemma on EWA with a non-uniform prior; we just need an upper bound on $\ln(1/W_n) = -\ln W_n$.
$W_n \ge w_n(i_{1:n})$. Hence, $-\ln W_n \le -\ln w_n(i_{1:n})$.
$-\ln w_n(i_{1:n}) = -\ln w_0(i_{1:n}) + \eta L(i_{1:n})$.
Need an upper bound on $-\ln w_0(i_{1:n})$.
Remember the definition:
$$w_0(i_{1:n}) = \frac{1}{N}\left(\frac{\alpha}{N}\right)^{s(i_{1:n})}\left(1-\alpha+\frac{\alpha}{N}\right)^{n-1-s(i_{1:n})}$$
...to get the bound
$$-\ln w_0(i_{1:n}) \le (1 + s(i_{1:n}))\ln N + \ln\frac{1}{\alpha^{s(i_{1:n})}\,(1-\alpha)^{n-s(i_{1:n})}}.$$
Put together. Q.e.d.

VARIABLE-SHARE FORECASTER
GOAL: Regret should be small when there is a compound action that achieves a small loss with a small number of switches.
Tool: change the initial prior, penalizing switches away from good primitive actions!
$$w_0(i_{1:t+1}) = w_0(i_{1:t})\left(\frac{1 - (1-\alpha)^{\ell(i_t, y_t)}}{N-1} + (1-\alpha)^{\ell(i_t, y_t)}\,\mathbb{I}_{\{i_t = i_{t+1}\}}\right).$$
Makes the prior dependent on the losses. Cheating? ...no, we don't need the prior $w_0(i_{1:t+1})$ before observing $y_t$!
What does it do? If the loss of the current action is small, stay with it; otherwise encourage switching!
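
For binary losses the two extreme cases of this prior factor are easy to read off (a small worked check, not on the slide):
$$\ell(i_t, y_t) = 0:\quad \frac{1-(1-\alpha)^0}{N-1} + (1-\alpha)^0\,\mathbb{I}_{\{i_t=i_{t+1}\}} = \mathbb{I}_{\{i_t=i_{t+1}\}} \quad\text{(a zero-loss action is never abandoned a priori);}$$
$$\ell(i_t, y_t) = 1:\quad \frac{1-(1-\alpha)}{N-1} + (1-\alpha)\,\mathbb{I}_{\{i_t=i_{t+1}\}} = \frac{\alpha}{N-1} + (1-\alpha)\,\mathbb{I}_{\{i_t=i_{t+1}\}} \quad\text{(fixed-share-like switching with total probability } \alpha\text{).}$$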

VARIABLE-SHARE FORECASTER: ALGORITHM
VARIABLE-SHARE FORECASTER (VSF)
Initialize: $w_{i0} = 1/N$.
1 Draw primitive action $I_t$ from the distribution $w_{i,t-1} / \sum_{j=1}^N w_{j,t-1}$.
2 Observe $y_t$ and the losses $\ell(i, y_t)$ (suffer loss $\ell(I_t, y_t)$).
3 Compute $v_{it} = w_{i,t-1}\, e^{-\eta \ell(i, y_t)}$.
4 Let
$$w_{it} = \frac{1}{N-1}\sum_{j \ne i}\left(1 - (1-\alpha)^{\ell(j, y_t)}\right) v_{jt} + (1-\alpha)^{\ell(i, y_t)}\, v_{it}.$$
Result: for binary losses, the term $\frac{n - s(i_{1:n})}{\eta}\ln\frac{1}{1-\alpha}$ is replaced by $\frac{s(i_{1:n}) + L(i_{1:n})}{\eta}\ln\frac{1}{1-\alpha}$.
Small complexity, small loss: big win!
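
A compact illustrative implementation of the VSF steps above (not the authors' code); the toy data and the function name are made up, and losses are assumed to lie in [0, 1].

```python
import math
import random

def variable_share(losses, eta, alpha):
    """Variable-Share Forecaster on a precomputed loss matrix losses[t][i] in [0, 1]."""
    n, N = len(losses), len(losses[0])
    w = [1.0 / N] * N
    history = []
    for t in range(n):
        W = sum(w)
        p = [wi / W for wi in w]                          # step 1: sampling distribution
        I_t = random.choices(range(N), weights=p)[0]
        v = [w[i] * math.exp(-eta * losses[t][i]) for i in range(N)]            # step 3
        share = [(1 - (1 - alpha) ** losses[t][j]) * v[j] for j in range(N)]    # mass given away by each j
        total_share = sum(share)
        # step 4: i keeps (1-alpha)^{l(i,y_t)} of its own mass and receives 1/(N-1) of what the others share
        w = [(total_share - share[i]) / (N - 1) + (1 - alpha) ** losses[t][i] * v[i]
             for i in range(N)]
        history.append((p, I_t))
    return history

# Toy run: experts incurring zero loss share no mass away, so the switch to expert 2 happens quickly.
random.seed(2)
n, N = 40, 3
losses = [[0.0 if (i == 0 and t < n // 2) or (i == 2 and t >= n // 2) else 1.0
           for i in range(N)] for t in range(n)]
hist = variable_share(losses, eta=0.5, alpha=0.1)
print([round(pi, 2) for pi in hist[-1][0]])
```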

OTHER EXAMPLES
Tree experts (side info); e.g. [Helmbold and Schapire, 1997]
Shortest path, FPL: [Kalai and Vempala, 2003]; additive losses
Shortest path, EWA: [György et al., 2005]
Compression: best scalar quantizers [György et al., 2004]
Shortest path tracking
Applications: sequential allocation, motion planning (robot arms), opponent modeling

REFERENCES
Helmbold, D. P. and Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27:51–68.
György, A., Linder, T., and Lugosi, G. (2004). Efficient algorithms and minimax bounds for zero-delay lossy source coding. IEEE Transactions on Signal Processing, 52:2337–2347.
György, A., Linder, T., and Lugosi, G. (2005). Tracking the best of many experts. In Proceedings of the 18th Annual Conference on Learning Theory, pages 204–216.
Herbster, M. and Warmuth, M. (1998). Tracking the best expert. Machine Learning, 32:151–178.
Kalai, A. and Vempala, S. (2003). Efficient algorithms for the online decision problem. In Proceedings of the 16th Annual Conference on Learning Theory, pages 26–40. Springer.