Week 1: Introduction to Online Learning



1 Introduction

These notes are written based on Prediction, Learning, and Games (ISBN: 0521841089 / 0-521-84108-9), Cesa-Bianchi, Nicolò; Lugosi, Gábor.

1.1 A Gentle Start

Consider the problem of predicting an unknown sequence $y_1, y_2, \dots$ of bits $y_t \in \{0, 1\}$. At each time $t$ the forecaster first makes his guess $\hat{p}_t \in \{0, 1\}$ for $y_t$. Then the true bit $y_t$ is revealed and the forecaster finds out whether his prediction was correct. To compute $\hat{p}_t$ the forecaster listens to the advice of $N$ experts. This advice takes the form of a binary vector $(f_{1,t}, \dots, f_{N,t})$, where $f_{i,t} \in \{0, 1\}$ is the prediction that expert $i$ makes for the next bit $y_t$.

Goal: Bound the number of time steps $t$ in which $\hat{p}_t \neq y_t$, that is, bound the number of mistakes made by the forecaster.

1.1.1 If we know a "god" exists among the experts

If there is an expert that predicts $y_t$ 100% correctly, we can define a forecaster (algorithm) that makes at most $\log_2 N$ mistakes.

Algorithm: Do a majority vote using the experts that haven't made any mistakes yet.

1. Set $t = 1$. Define $w_k = 1$ for experts $k = 1..N$.
2. Predict $\hat{p}_t = 1$ if the weighted average of $f_{1,t}, \dots, f_{N,t}$ is greater than $1/2$, and $0$ otherwise.

3. $y_t$ is revealed. Set $w_k = 0$ for those experts who made a mistake.
4. Increase $t \leftarrow t + 1$. Go to 2.

Proof: Let $m$ be the number of mistakes the forecaster has made so far, $w_k \in \{0, 1\}$ the weight of expert $k$, and $W_m$ the sum of the weights of the experts after the forecaster has made $m$ mistakes. Each time the forecaster errs, at least half of the surviving weight is set to $0$, so $W_m \leq N / 2^m$; since the perfect expert always keeps weight $1$, $W_m \geq 1$, and therefore $m \leq \log_2 N$.

1.1.2 What if we don't know anything about the experts?

Let's come up with a simple idea and bound its performance.

Algorithm: Redefine $w_k \in [0, 1]$. We do a weighted majority vote with weight updates: at each round, apply $w_k \leftarrow \beta w_k$ to all experts who made a mistake, where $\beta \in (0, 1)$.

1. Set $t = 1$. Define $w_k = 1$ for experts $k = 1..N$.
2. Predict $\hat{p}_t = 1$ if the weighted average of $f_{1,t}, \dots, f_{N,t}$ is greater than $1/2$, and $0$ otherwise.
3. $y_t$ is revealed. Set $w_k \leftarrow \beta w_k$ for those experts who made a mistake.
4. Increase $t \leftarrow t + 1$. Go to 2.

Analysis and Bound: Consider a time step at which the forecaster makes his $m$-th mistake. Let $k^*$ be the index of the expert with the fewest mistakes so far, and $m^*$ the number of mistakes made by expert $k^*$. Then
$$ m \leq \frac{\log N + m^* \log(1/\beta)}{\log \frac{2}{1+\beta}} \qquad (1.1) $$
By choosing $\beta = 1/e$,
$$ m \leq \frac{\ln(1/\beta)}{\ln \frac{2}{1+\beta}}\, m^* + \frac{1}{\ln \frac{2}{1+\beta}} \ln N \qquad (1.2) $$
$$ \approx 2.63\, m^* + 2.63 \ln N \qquad (1.3) $$
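The following is a minimal Python sketch of the weighted majority forecaster of Section 1.1.2; setting $\beta = 0$ recovers the majority vote over mistake-free experts from Section 1.1.1. It is an illustration of the pseudocode above, not code from the book, and the data in the usage example is made up.

```python
import numpy as np

def weighted_majority(expert_preds, outcomes, beta=1 / np.e):
    """Weighted majority vote over N experts.

    expert_preds: (T, N) array of {0,1} expert predictions f_{i,t}
    outcomes:     (T,)  array of {0,1} true bits y_t
    beta:         discount in [0, 1); beta=0 recovers the "perfect expert" algorithm
    Returns the number of mistakes made by the forecaster.
    """
    T, N = expert_preds.shape
    w = np.ones(N)                      # step 1: w_k = 1 for all experts
    mistakes = 0
    for t in range(T):
        # step 2: predict 1 if the weighted average of the advice exceeds 1/2
        p_hat = 1 if w @ expert_preds[t] > w.sum() / 2 else 0
        # step 3: outcome is revealed; discount the experts that were wrong
        mistakes += int(p_hat != outcomes[t])
        wrong = expert_preds[t] != outcomes[t]
        w[wrong] *= beta
    return mistakes

# tiny usage example: 3 experts, expert 0 is always right (m* = 0)
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=50)
f = np.stack([y, 1 - y, rng.integers(0, 2, size=50)], axis=1)
print(weighted_majority(f, y))          # bounded by 2.63*m* + 2.63*ln(3), per (1.3)
```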

2 Prediction with Expert Advice

2.1 Formal Protocol

Parameters: decision space $\mathcal{D}$ for $\hat{p}_t$ and $f_{i,t}$, outcome space $\mathcal{Y}$ for $y_t$, loss function $\ell : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$, set $\mathcal{E}$ of expert indices.

For each round $t = 1, 2, \dots$

1. the environment chooses the next outcome $y_t$ and the expert advice $\{f_{E,t} \in \mathcal{D} : E \in \mathcal{E}\}$; the expert advice is revealed to the forecaster;
2. the forecaster chooses the prediction $\hat{p}_t \in \mathcal{D}$;
3. the environment reveals the next outcome $y_t \in \mathcal{Y}$;
4. the forecaster incurs loss $\ell(\hat{p}_t, y_t)$ and each expert $E$ incurs loss $\ell(f_{E,t}, y_t)$.

2.2 Goal

The forecaster's goal is to keep the cumulative regret (or simply regret) with respect to each expert as small as possible. The regret w.r.t. expert $E$ is defined by the sum
$$ R_{E,n} = \sum_{t=1}^{n} \big( \ell(\hat{p}_t, y_t) - \ell(f_{E,t}, y_t) \big) = \hat{L}_n - L_{E,n} \qquad (2.1) $$
where
$$ \text{cumulative loss of the forecaster:} \quad \hat{L}_n = \sum_{t=1}^{n} \ell(\hat{p}_t, y_t) \qquad (2.2) $$
$$ \text{cumulative loss of expert } E: \quad L_{E,n} = \sum_{t=1}^{n} \ell(f_{E,t}, y_t) \qquad (2.3) $$
Sometimes it is also convenient to define the instantaneous regret
$$ r_{E,t} = \ell(\hat{p}_t, y_t) - \ell(f_{E,t}, y_t) \qquad (2.4) $$
The appropriate goal is to achieve vanishing per-round regret, that is,
$$ \max_{i=1..N} R_{i,n} = \hat{L}_n - \min_{i=1..N} L_{i,n} = o(n) \qquad (2.5) $$
Note that this has to hold for any outcome sequence $(y_t,\ t = 1..n)$ and any experts. In other words, we make no assumption about the outcome sequence or the experts. It is easy to see that the per-round regret $\frac{1}{n} \max_{i=1..N} R_{i,n}$ then converges to $0$ as $n$ goes to infinity.

* Don't be confused: $R_{i,n}$ is the regret w.r.t. expert $i$, while $\max_{i=1..N} R_{i,n}$ is the regret of the forecaster.

3 Weighted Average Prediction

In this section we assume a regression setting where $\mathcal{D}$ is continuous. We will show how to adapt this scheme to classification later. We now generalize the weighting scheme introduced earlier:
$$ \hat{p}_t = \frac{\sum_{i=1}^{N} w_{i,t-1} f_{i,t}}{\sum_{j=1}^{N} w_{j,t-1}} \qquad (3.1) $$
where $w_{i,t-1}$ is the weight assigned to expert $i = 1..N$.
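The following sketch runs the protocol of Section 2.1 and tracks the regret vector of (2.1) and (2.4), using the weighted average prediction (3.1) with a pluggable weight rule. The squared loss and the exponential weight rule in the usage example are illustrative assumptions (the exponential choice anticipates Section 3.2); any convex loss and nonnegative weight rule fit the same skeleton.

```python
import numpy as np

def weighted_average_forecaster(expert_preds, outcomes, loss, weight_fn):
    """Protocol of Section 2 with the weighted average prediction (3.1).

    expert_preds: (T, N) array of real-valued expert advice f_{i,t}
    outcomes:     (T,)  array of outcomes y_t
    loss:         convex loss loss(p, y), e.g. squared loss
    weight_fn:    maps the regret vector R_{t-1} to nonnegative weights w_{i,t-1}
    Returns the regret vector R_n = (R_{1,n}, ..., R_{N,n}).
    """
    T, N = expert_preds.shape
    R = np.zeros(N)                              # regret vector R_0 = 0
    for t in range(T):
        w = weight_fn(R)                         # w_{i,t-1}, e.g. phi'(R_{i,t-1})
        p_hat = w @ expert_preds[t] / w.sum()    # prediction (3.1)
        # instantaneous regrets r_{i,t} = loss(p_hat, y_t) - loss(f_{i,t}, y_t)
        r = loss(p_hat, outcomes[t]) - loss(expert_preds[t], outcomes[t])
        R += r
    return R

# usage: squared loss, exponential weights (the choice of Section 3.2)
sq_loss = lambda p, y: (p - y) ** 2
eta = 0.5
rng = np.random.default_rng(1)
f = rng.random((100, 5))                         # 5 experts, 100 rounds
y = rng.random(100)
R_n = weighted_average_forecaster(f, y, sq_loss, lambda R: np.exp(eta * R))
print(R_n.max())                                 # forecaster's regret max_i R_{i,n}
```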

Proper choice of $w$?

Weight function: We define $\phi : \mathbb{R} \to \mathbb{R}_{+} \cup \{0\}$, a nonnegative, convex, twice differentiable, increasing function. We use it to define the weight $w_{i,t-1} = \phi'(R_{i,t-1})$.

Sanity check for $\phi'(R_{i,t-1})$:

3.1 Blackwell Condition

Lemma 2.1 If the loss function $\ell$ is convex in its first argument, then
$$ \sup_{y_t \in \mathcal{Y}} \sum_{i=1}^{N} r_{i,t}\, \phi'(R_{i,t-1}) \leq 0 \qquad (3.2) $$

Proof Let's rephrase it:
$$ \forall y_t \in \mathcal{Y}, \quad \sum_{i=1}^{N} r_{i,t}\, \phi'(R_{i,t-1}) \leq 0 \qquad (3.3) $$
Further rephrase using the definitions of $r_{i,t}$ and $\hat{p}_t$...

cf. Jensen's inequality: if a real function $f$ is convex, then
$$ f\left( \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{j=1}^{n} w_j} \right) \leq \frac{\sum_{i=1}^{n} w_i f(x_i)}{\sum_{j=1}^{n} w_j}, \quad \text{for } 0 \leq w_i < \infty,\ i = 1..n. $$

Scholars love new notations? If we develop the framework a bit further, we can cast the above lemma in an interesting way. We introduce
$$ \text{instantaneous regret vector:} \quad r_t = (r_{1,t}, \dots, r_{N,t}) \in \mathbb{R}^N \qquad (3.4) $$
$$ \text{regret vector:} \quad R_t = \sum_{s=1}^{t} r_s \qquad (3.5) $$
We also introduce a potential function $\Phi : \mathbb{R}^N \to \mathbb{R}$ of the form
$$ \Phi(u) = \psi\left( \sum_{i=1}^{N} \phi(u_i) \right) \qquad (3.6) $$
where $\psi : \mathbb{R} \to \mathbb{R}$ is a nonnegative, strictly increasing, concave, and twice differentiable function, and $\phi$ is as defined before. Now we can describe $\hat{p}_t$ using the new notation:
$$ \hat{p}_t = \frac{\sum_{i=1}^{N} \nabla\Phi(R_{t-1})_i\, f_{i,t}}{\sum_{j=1}^{N} \nabla\Phi(R_{t-1})_j} \qquad (3.7) $$
where $\nabla\Phi(R_{t-1})_i = \partial\Phi(R_{t-1}) / \partial R_{i,t-1}$. This is exactly the same definition of $\hat{p}_t$ as before, since $\nabla\Phi(R_{t-1})_i = \psi'\big(\sum_{j} \phi(R_{j,t-1})\big)\, \phi'(R_{i,t-1})$ and the common factor $\psi'(\cdot)$ cancels out of the ratio.
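As a quick numerical sanity check that the $\psi'$ factor really cancels, the following sketch compares the prediction computed with the weights $\phi'(R_{i,t-1})$ against the prediction (3.7) computed from a numerical gradient of $\Phi$. The specific pair $\phi(u) = e^u$, $\psi(v) = \ln v$ and the regret and advice values are made up for illustration; any valid $(\phi, \psi)$ pair would do.

```python
import numpy as np

# Illustrative choice (an assumption for this check): phi(u) = exp(u), psi(v) = log(v),
# so Phi(u) = log(sum_i exp(u_i)).
phi, dphi = np.exp, np.exp           # phi and its derivative phi'
Phi = lambda u: np.log(np.sum(np.exp(u)))

R = np.array([0.3, -1.2, 2.0, 0.0])  # made-up regret vector R_{t-1}
f = np.array([0.1, 0.7, 0.4, 0.9])   # made-up expert advice f_{i,t}

# prediction with weights phi'(R_{i,t-1}), as in (3.1)
w = dphi(R)
p_weights = w @ f / w.sum()

# prediction (3.7) with a numerical gradient of Phi (central differences)
eps = 1e-6
grad = np.array([(Phi(R + eps * e) - Phi(R - eps * e)) / (2 * eps)
                 for e in np.eye(len(R))])
p_gradient = grad @ f / grad.sum()

print(p_weights, p_gradient)         # the two agree: psi' cancels in the ratio
```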

Figure 1: Examples of potential functions, plotted over the regrets of expert 1 and expert 2. Left: the polynomial potential $\Phi_p(R_{t-1}) = \big( \sum_{i} ((R_{i,t-1})_+)^p \big)^{2/p}$. Right: the exponential potential $\Phi_\eta(R_{t-1}) = (1/\eta) \ln\big( \sum_{i} \exp(\eta R_{i,t-1}) \big)$. We chose $p = 2$ and $\eta = 1$ to plot.

Polynomially Weighted Average Forecaster: Which $\psi(\cdot)$ and $\phi(\cdot)$ are chosen?
$$ \Phi_p(R_{t-1}) = \left( \sum_{i=1}^{N} \big( (R_{i,t-1})_+ \big)^p \right)^{2/p}, \quad \text{i.e. } \phi(u) = (u)_+^p \text{ and } \psi(u) = u^{2/p} \qquad (3.8) $$
See Figure 1.

Exponentially Weighted Average Forecaster: Which $\psi(\cdot)$ and $\phi(\cdot)$ are chosen?
$$ \Phi_\eta(R_{t-1}) = (1/\eta) \ln\left( \sum_{i=1}^{N} e^{\eta R_{i,t-1}} \right), \quad \text{i.e. } \phi(u) = e^{\eta u} \text{ and } \psi(u) = (1/\eta) \ln u \qquad (3.9) $$
See Figure 1.

Scenario: calculate $R_{t-1}$, predict $\hat{p}_t$. Suppose we are at time $t-1$ and we have just received the outcome value $y_{t-1}$. The regret vector $R_{t-1} = (R_{1,t-1}, \dots, R_{N,t-1})$ is calculated using the convex loss function $\ell(\cdot,\cdot)$. Proceeding to time $t$ with $R_{t-1}$ and the revealed expert advice, we calculate the weights and take the weighted average to get the prediction $\hat{p}_t$. We then receive $y_t$ and calculate $R_t$ in the same way.

Blackwell condition: Lemma 2.1 can be written compactly as
$$ \sup_{y_t \in \mathcal{Y}} r_t \cdot \nabla\Phi(R_{t-1}) \leq 0 \qquad (3.10) $$
This is called the Blackwell condition. Interpretation: regret & potential plot.

* Note that a convex loss function is not the only way to achieve the Blackwell condition.
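Here is a small numerical check of the Blackwell condition (3.10) for the exponential potential and the squared loss. The regret vector and expert advice are made-up values, and the check sweeps the outcome $y$ over $[0, 1]$; it is a sketch, not part of the lecture's analysis.

```python
import numpy as np

# Numerical check of the Blackwell condition (3.10), exponential potential + squared loss.
eta = 1.0
R_prev = np.array([0.5, -0.3, 1.2])          # made-up regret vector R_{t-1}
f_t = np.array([0.2, 0.8, 0.5])              # made-up expert advice f_{i,t}

grad = np.exp(eta * R_prev)                  # nabla Phi_eta(R_{t-1}) up to a positive factor
p_hat = grad @ f_t / grad.sum()              # prediction (3.7)

loss = lambda p, y: (p - y) ** 2             # convex in its first argument
worst = max((loss(p_hat, y) - loss(f_t, y)) @ grad   # sum_i r_{i,t} * grad_i
            for y in np.linspace(0, 1, 101))
print(worst)                                 # <= 0, as (3.10) requires
```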

Theorem 2.1 Assume that a forecaster satisfies the Blackwell condition for a potential $\Phi(u) = \psi\big( \sum_{i=1}^{N} \phi(u_i) \big)$. Then, for all $n = 1, 2, \dots$,
$$ \Phi(R_n) \leq \Phi(0) + \frac{1}{2} \sum_{t=1}^{n} C(r_t) \qquad (3.11) $$
where
$$ C(r_t) = \sup_{u \in \mathbb{R}^N} \psi'\left( \sum_{i=1}^{N} \phi(u_i) \right) \sum_{i=1}^{N} \phi''(u_i)\, r_{i,t}^2 \qquad (3.12) $$
We will use this to bound the regret of the forecaster.

3.2 Exponentially Weighted Average Forecaster

The polynomially weighted average forecaster also gives a good bound, but a slightly worse one. Let's focus on the exponentially weighted average forecaster for now. Recall that it is defined by
$$ \Phi_\eta(R_{t-1}) = (1/\eta) \ln\left( \sum_{i=1}^{N} e^{\eta R_{i,t-1}} \right) \qquad (3.13) $$
Now the forecaster's prediction becomes
$$ \hat{p}_t = \frac{\sum_{i=1}^{N} \exp\big(\eta(\hat{L}_{t-1} - L_{i,t-1})\big)\, f_{i,t}}{\sum_{j=1}^{N} \exp\big(\eta(\hat{L}_{t-1} - L_{j,t-1})\big)} = \frac{\sum_{i=1}^{N} e^{-\eta L_{i,t-1}}\, f_{i,t}}{\sum_{j=1}^{N} e^{-\eta L_{j,t-1}}} \qquad (3.14) $$

Corollary 2.2 Assume that the loss function $\ell$ is convex in its first argument and that it takes values in $[0, 1]$. For any $n$ and $\eta > 0$, and for all $y_1, \dots, y_n \in \mathcal{Y}$, the regret of the exponentially weighted average forecaster satisfies
$$ \hat{L}_n - \min_{i=1..N} L_{i,n} \leq \frac{\ln N}{\eta} + \frac{n\eta}{2} \qquad (3.15) $$
By choosing $\eta = \sqrt{2 \ln N / n}$, the upper bound becomes $\sqrt{2 n \ln N}$.

Proof

Compare this bound with the one introduced earlier.
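To tie the pieces together, below is a minimal Python sketch of the exponentially weighted average forecaster (3.14) with the learning rate $\eta = \sqrt{2 \ln N / n}$ from Corollary 2.2. The absolute loss is an illustrative choice of a convex loss with values in $[0, 1]$, and the data in the usage example is random; this is a sketch, not the book's reference implementation.

```python
import numpy as np

def exp_weighted_average_forecaster(expert_preds, outcomes, eta=None):
    """Exponentially weighted average forecaster, prediction (3.14).

    expert_preds: (T, N) array of expert advice in [0, 1]
    outcomes:     (T,)  array of outcomes in [0, 1]
    eta:          learning rate; defaults to sqrt(2 ln N / n) from Corollary 2.2
    Returns (forecaster's cumulative loss, each expert's cumulative loss).
    """
    T, N = expert_preds.shape
    if eta is None:
        eta = np.sqrt(2 * np.log(N) / T)
    loss = lambda p, y: np.abs(p - y)          # convex, values in [0, 1] (illustrative)
    L_experts = np.zeros(N)                    # L_{i,t-1}
    L_hat = 0.0                                # \hat{L}_{t-1}
    for t in range(T):
        w = np.exp(-eta * L_experts)           # weights e^{-eta L_{i,t-1}}
        p_hat = w @ expert_preds[t] / w.sum()  # prediction (3.14)
        L_hat += loss(p_hat, outcomes[t])
        L_experts += loss(expert_preds[t], outcomes[t])
    return L_hat, L_experts

# usage: the regret should stay below sqrt(2 n ln N), per Corollary 2.2
rng = np.random.default_rng(2)
f = rng.random((1000, 10))
y = rng.random(1000)
L_hat, L_exp = exp_weighted_average_forecaster(f, y)
print(L_hat - L_exp.min(), np.sqrt(2 * 1000 * np.log(10)))
```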