Online Learning: Prediction with Expert Advice (9.520 Lecture 09). Lecturer: Sasha Rakhlin



1 Introduction

Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two.

In batch learning, an algorithm takes the data (training samples) and returns a hypothesis. We assume a static nature of the world: a new example that we encounter will be similar to the training set. More precisely, we suppose that all the training samples, as well as the test point, are independent and identically distributed. Hence, given the bunch $z_1, \ldots, z_n$ of training samples drawn from a distribution $p$, the quality of the learned function $f$ is $\mathbb{E}_{z \sim p}\, \ell(f, z)$, where $\ell$ is some loss function. The questions addressed by statistical learning theory are: how many examples are needed to have small expected error with a certain confidence? what is the lowest error that can be achieved under certain conditions on $p$? etc.

If the world is not static, it might be difficult to take advantage of large amounts of data. We can no longer rely on statistical assumptions. In fact, we take an even more dramatic view of the world: we suppose that there are no correlations whatsoever between any two days. As there is no stationary distribution responsible for the data, we no longer want to minimize some expected error. All we want is to survive, no matter how adversarial the world is. By surviving, we mean that we do not do too badly relative to other agents in the world. In fact, the goal is to do not much worse than the best agent (this difference is called regret). Note that this does not guarantee that we will be doing well in absolute terms, as the best agent might be quite bad. However, this is the price we have to pay for not having any coherence between today and tomorrow. The goal of online learning is, therefore, to do almost as well as the best comparator.

Today we will describe the most famous setting, prediction with expert advice. We will then try to understand why the algorithm for this setting works by making some abstractions and moving to the Online Convex Optimization framework. Finally, we will unify the two via regularization and provide some powerful tools for proving regret guarantees.

2 Prediction with Expert Advice

Suppose we have access to predictions of $N$ experts. Denote these predictions at time $t$ by $f_{1,t}, \ldots, f_{N,t}$. Fix a convex loss function $\ell$. We suppose that $\ell(a, b) \in [0, 1]$ for simplicity.

At each time step $t = 1$ to $T$:
- Player observes $f_{1,t}, \ldots, f_{N,t}$ and predicts $p_t$
- Outcome $y_t$ is revealed
- Player suffers loss $\ell(p_t, y_t)$ and experts suffer $\ell(f_{i,t}, y_t)$

Denote the cumulative loss of expert $i$ by $L_{i,t} = \sum_{s=1}^{t} \ell(f_{i,s}, y_s)$ and the cumulative loss of the player by $L_t = \sum_{s=1}^{t} \ell(p_s, y_s)$. The goal of the Player is to minimize the regret

$$R_T = \sum_{t=1}^{T} \ell(p_t, y_t) - \min_{i = 1, \ldots, N} \sum_{t=1}^{T} \ell(f_{i,t}, y_t) = L_T - \min_i L_{i,T}.$$
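Since this protocol is the backbone of everything that follows, here is a minimal simulation harness for it, written as a sketch in Python/NumPy (our own illustration; the `play_game` name and the `predict`/`update` interface are assumptions of this sketch, not part of the notes). The strategies of the following sections plug into it.

```python
import numpy as np

def play_game(strategy, F, y, loss):
    """Run the expert-advice protocol for T rounds.

    F:    T x N array with F[t, i] = f_{i,t}, the prediction of expert i at time t.
    y:    length-T array of outcomes y_t.
    loss: a convex loss function loss(a, b) taking values in [0, 1].
    Returns the player's cumulative loss L_T and the regret R_T.
    """
    T, N = F.shape
    player_loss = 0.0
    expert_loss = np.zeros(N)
    for t in range(T):
        p = strategy.predict(F[t])      # Player observes f_{1,t},...,f_{N,t}, predicts p_t
        player_loss += loss(p, y[t])    # outcome y_t is revealed; Player suffers l(p_t, y_t)
        losses = np.array([loss(F[t, i], y[t]) for i in range(N)])
        expert_loss += losses           # experts suffer l(f_{i,t}, y_t)
        strategy.update(losses)         # Player may use the revealed losses
    return player_loss, player_loss - expert_loss.min()
```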

How can we make predictions based on the history so that $R_T$ is small? Solution (called Exponential Weights or Weighted Majority [4]): keep weights $w_{1,t}, \ldots, w_{N,t}$ over the experts and predict

$$p_t = \frac{\sum_{i=1}^{N} w_{i,t-1} f_{i,t}}{\sum_{i=1}^{N} w_{i,t-1}}.$$

Once the outcome is revealed and the losses $\ell(f_{i,t}, y_t)$ can be calculated, the weights are updated as

$$w_{i,t} = w_{i,t-1} \exp(-\eta\, \ell(f_{i,t}, y_t)),$$

where $\eta$ is the learning rate parameter.

Theorem 2.1. For the Exponential Weights algorithm with $\eta = \sqrt{8 \ln N / T}$,

$$R_T \le \sqrt{(T/2) \ln N}.$$

Proof (see [3, 2]). Let $W_t = \sum_{i=1}^{N} w_{i,t}$. Suppose we initialize $w_{i,0} = 1$ for all experts $i$. Then $W_0 = N$ and $w_{i,t} = \exp(-\eta L_{i,t})$. On one hand,

$$\ln \frac{W_T}{W_0} = \ln\Big( \sum_{i=1}^{N} \exp(-\eta L_{i,T}) \Big) - \ln N \ge \ln\Big( \max_{i=1,\ldots,N} \exp(-\eta L_{i,T}) \Big) - \ln N = -\eta \min_{i=1,\ldots,N} L_{i,T} - \ln N.$$

On the other hand,

$$\ln \frac{W_t}{W_{t-1}} = \ln \frac{\sum_{i=1}^{N} w_{i,t-1} \exp(-\eta\, \ell(f_{i,t}, y_t))}{\sum_{i=1}^{N} w_{i,t-1}} \le -\eta\, \frac{\sum_{i=1}^{N} \ell(f_{i,t}, y_t)\, w_{i,t-1}}{\sum_{i=1}^{N} w_{i,t-1}} + \frac{\eta^2}{8} \le -\eta\, \ell(p_t, y_t) + \frac{\eta^2}{8},$$

where the last inequality follows by the definition of $p_t$ and convexity of $\ell$ (via an application of Jensen's inequality). The next-to-last inequality holds because for a random variable $X \in [a, b]$,

$$\ln \mathbb{E}\, e^{sX} \le s\, \mathbb{E} X + \frac{s^2 (b-a)^2}{8}$$

for any $s \in \mathbb{R}$; see [3, 2] for more details. Summing the last inequality over $t = 1, \ldots, T$ and observing that the logs telescope,

$$\ln \frac{W_T}{W_0} \le -\eta \sum_{t=1}^{T} \ell(p_t, y_t) + \frac{\eta^2 T}{8}.$$

Combining the upper and lower bounds for $\ln W_T / W_0$,

$$L_T - \min_i L_{i,T} \le \frac{\ln N}{\eta} + \frac{\eta T}{8}.$$

Balancing the two terms with $\eta = \sqrt{8 \ln N / T}$ gives the bound.
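To make the algorithm concrete, here is a hedged sketch of Exponential Weights in the interface of the harness above (again our own illustration; the class name and the choice of the absolute loss are assumptions made for the example, not part of the notes).

```python
import numpy as np

class ExponentialWeights:
    """Sketch of Exponential Weights: w_{i,t} = w_{i,t-1} exp(-eta * l(f_{i,t}, y_t))."""

    def __init__(self, N, T):
        self.eta = np.sqrt(8.0 * np.log(N) / T)  # the tuning from Theorem 2.1
        self.w = np.ones(N)                      # w_{i,0} = 1 for all i

    def predict(self, f):
        # p_t = sum_i w_{i,t-1} f_{i,t} / sum_i w_{i,t-1}
        return np.dot(self.w, f) / self.w.sum()

    def update(self, losses):
        self.w *= np.exp(-self.eta * losses)

# Example with the absolute loss l(a, b) = |a - b|, which is convex and lies in
# [0, 1] when predictions and outcomes lie in [0, 1] (our choice for the demo):
T, N = 1000, 10
rng = np.random.default_rng(0)
F, y = rng.random((T, N)), rng.random(T)
L_T, R_T = play_game(ExponentialWeights(N, T), F, y, lambda a, b: abs(a - b))
print(R_T <= np.sqrt((T / 2) * np.log(N)))  # True: Theorem 2.1 guarantees this
```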

We've presented the Prediction with Expert Advice framework and the Exponential Weights algorithm because they are the most widely known. However, the proof above does not give us much insight into why things work. Let us now go to a somewhat simpler setting by removing the extra layer of combining the predictions of experts to make our own prediction. The following setting is closely related, but simpler to understand.

3 Online Gradient Descent

Consider the following repeated game. At each time step $t = 1$ to $T$:
- Player predicts $w_t$, a distribution over the $N$ experts
- Vector of losses $\ell_t \in \mathbb{R}^N$ is revealed
- Player suffers $\ell_t \cdot w_t$, the inner product

The goal is to minimize the regret, defined as

$$\sum_{t=1}^{T} \ell_t \cdot w_t - \min_{w \in \text{$N$-simplex}} \sum_{t=1}^{T} \ell_t \cdot w.$$

Since the minimum over the simplex occurs at a vertex, we could equivalently write $\min_i \sum_{t=1}^{T} \ell_{i,t}$.

One can see that this game is closely related to the one introduced in the beginning of the lecture: here $w_t$ are the normalized versions of the weights kept by the Exponential Weights algorithm, and the loss vector $\ell_t$ plays the role of the vector $(\ell(f_{1,t}, y_t), \ldots, \ell(f_{N,t}, y_t))$. The repeated game we just defined is called an Online Linear Optimization game. In fact, we can define Online Linear Optimization over any convex set $K$; in this case the regret is

$$\sum_{t=1}^{T} \ell_t \cdot w_t - \min_{w \in K} \sum_{t=1}^{T} \ell_t \cdot w.$$

Suppose we just perform gradient steps $w_{t+1} = w_t - \eta \ell_t$, followed by a projection onto the set $K$. It turns out that this is a very good (often optimal) algorithm; a sketch in code appears below, and then we prove that the algorithm is good.
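Here is a minimal sketch of the projected update (our own illustration). To keep the projection step trivial we assume, for this sketch only, that $K$ is a Euclidean ball; for other sets $K$ one would substitute the appropriate projection. Note also that the sketch computes $G$ from the whole loss sequence, which an online player would instead have to know or estimate in advance.

```python
import numpy as np

def ogd(loss_vectors, radius=1.0):
    """Projected online gradient descent on the ball K = {w : ||w||_2 <= radius}.

    loss_vectors: T x d array whose rows are the revealed vectors l_t.
    Returns the sequence of plays w_1, ..., w_T.
    """
    T, d = loss_vectors.shape
    G = np.linalg.norm(loss_vectors, axis=1).max()  # max gradient norm (hindsight)
    D = 2.0 * radius                                # Euclidean diameter of K
    eta = D / (G * np.sqrt(T))                      # the tuning from Theorem 3.1
    w = np.zeros(d)                                 # w_1: center of K
    plays = []
    for l in loss_vectors:
        plays.append(w.copy())
        w = w - eta * l                             # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                           # projection onto K
            w *= radius / norm
    return np.array(plays)
```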

Theorem 3.1 (Online Gradient Descent [5]). Suppose the Online Linear Optimization game is played by updating $w_{t+1} = w_t - \eta \ell_t$ with $\eta = \frac{D}{G\sqrt{T}}$, followed by a projection onto the set $K$. Then the regret satisfies $R_T \le GD\sqrt{T}$, where $G$ is the maximum norm of the loss vectors $\ell_t$ (the gradients) and $D$ is the diameter of $K$.

Proof. By the definition of the update (the projection onto $K$ can only decrease the distance to any $w^* \in K$),

$$\|w_{t+1} - w^*\|^2 \le \|w_t - \eta \ell_t - w^*\|^2 = \|w_t - w^*\|^2 + \eta^2 \|\ell_t\|^2 - 2\eta\, \ell_t \cdot (w_t - w^*),$$

where $w^* \in K$ is the optimum, which can only be computed in hindsight. Solving for the last term,

$$\ell_t \cdot (w_t - w^*) \le \frac{\|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2}{2\eta} + \frac{\eta}{2} \|\ell_t\|^2.$$

Summing over time, the distance terms telescope:

$$\sum_{t=1}^{T} \ell_t \cdot w_t - \sum_{t=1}^{T} \ell_t \cdot w^* \le \frac{\|w_1 - w^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|\ell_t\|^2.$$

Assuming the upper bounds $\|\ell_t\| \le G$ and $\|w_1 - w^*\| \le D$, and choosing $\eta = \frac{D}{G\sqrt{T}}$, we obtain the bound on the regret.

4 Regularization

Let us compare the bounds $GD\sqrt{T}$ and $\sqrt{(T/2) \ln N}$. The asymptotic dependence on $T$ is the same; this is typical for online learning if no second-order curvature information is taken into account. However, the other factors are different. For the simplex, the Euclidean bound scales with $\sqrt{N}$ (the loss vectors $\ell_t \in [0,1]^N$ can have norm as large as $\sqrt{N}$) instead of $\sqrt{\ln N}$. This difference is noticeable whenever one has a large (exponential in the parameters) number of experts. So, have we lost something by considering Online Gradient Descent instead of Exponential Weights? The answer is no. It turns out that Exponential Weights can be viewed as Online Gradient Descent, but in a different (dual) space. So the difference of $\sqrt{N}$ vs $\sqrt{\ln N}$ arises from the choice of the Euclidean potential vs the entropy potential. The dual-space gradient descent is known as Mirror Descent (see Chapter 11 of [3]).

Somewhat surprisingly, regularization leads to Mirror Descent, and this is a very general framework subsuming many known online algorithms. By regularization we mean the following Follow the Regularized Leader (FTRL) algorithm:

$$w_{t+1} = \arg\min_{w \in K} \left[ \sum_{s=1}^{t} \ell_s \cdot w + \eta^{-1} R(w) \right],$$

where $R$ is an appropriate convex differentiable regularizer. When we take $R(w) = \frac{1}{2} \|w\|^2$, we obtain the Online Gradient Descent algorithm described above. When we take $R(w) = \sum_{i=1}^{N} w(i) \ln w(i)$, we get Exponential Weights on the simplex (or, more generally, the Exponentiated Gradient algorithm).
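These two instantiations of FTRL have closed forms that are easy to check in code. The derivations are standard calculations from the FTRL objective (ours, not spelled out in the notes): with the entropic regularizer on the simplex, setting the gradient of the Lagrangian to zero gives $w_{t+1}(i) \propto \exp(-\eta L_{i,t})$, exactly the normalized Exponential Weights; with the quadratic regularizer and no constraint, the arg-min is $w_{t+1} = -\eta \sum_{s \le t} \ell_s$, the "lazy" form of gradient descent.

```python
import numpy as np

def ftrl_entropy(cum_loss, eta):
    """FTRL on the simplex with R(w) = sum_i w(i) ln w(i).

    Closed form: w_{t+1}(i) proportional to exp(-eta * L_{i,t}).
    cum_loss: length-N vector of cumulative losses L_{i,t}.
    """
    v = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for numerical stability
    return v / v.sum()

def ftrl_quadratic(cum_loss, eta):
    """Unconstrained FTRL with R(w) = (1/2) ||w||^2.

    Closed form: w_{t+1} = -eta * sum_{s<=t} l_s (lazy gradient descent;
    this sketch omits the projection onto K).
    """
    return -eta * cum_loss

# Numerical check that entropic FTRL reproduces the Exponential Weights:
rng = np.random.default_rng(1)
N, eta = 5, 0.1
L = 10 * rng.random(N)         # some cumulative losses L_{i,t}
w_ew = np.exp(-eta * L)        # unnormalized Exponential Weights
print(np.allclose(ftrl_entropy(L, eta), w_ew / w_ew.sum()))  # True
```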

References

[1] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. J. ACM, 44(3):427-485, 1997.

[2] Nicolò Cesa-Bianchi and Gábor Lugosi. On prediction of individual sequences. Ann. Statist., 27(6):1865-1895, 1999.

[3] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[4] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.

[5] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.