Online Algorithms: Learning & Optimization with No Regret.



Daniel Golovin

The Setup

Optimization: model the problem (objective, constraints), then pick the best decision from a feasible set.
Learning: model the problem (objective, hypothesis class), then pick the best hypothesis from a feasible set.

Online Learning/Optimization

In each round $t$: choose an action $x_t \in X$, then get $f_t(x_t)$ and feedback about $f_t : X \to [0,1]$. The feasible set $X$ is the same in each round. Different reward models: stochastic; arbitrary but oblivious; adaptive and arbitrary.

Concrete Example: Commuting

Pick a path $x_t$ from home to school. Pay cost $f_t(x_t) := \sum_{e \in x_t} c_t(e)$, then see all edge costs for that round. Dealing with limited feedback comes later in the course.
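As a minimal sketch of this full-information protocol (the path representation and the per-round cost table below are hypothetical placeholders):

```python
# One round of the commuting game: pay the cost of the chosen path, then
# observe the cost of EVERY edge (full-information feedback).

def commute_round(chosen_path, edge_costs):
    """chosen_path: list of edge ids; edge_costs: dict edge id -> cost in [0, 1]."""
    paid = sum(edge_costs[e] for e in chosen_path)   # f_t(x_t) = sum of edge costs
    return paid, edge_costs                          # all edge costs revealed afterwards
```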

Other Applications

- Sequential decision problems
- Streaming algorithms for optimization/learning with large data sets
- Combining weak learners into strong ones ("boosting")
- Fast approximate solvers for certain classes of convex programs
- Playing repeated games

Binary prediction with a perfect expert

There are $n$ hypotheses ("experts") $h_1, h_2, \ldots, h_n$, and we are guaranteed that some hypothesis is perfect. Each round, we get a data point $p_t$ and classifications $h_i(p_t) \in \{0,1\}$, output a binary prediction $x_t$, and observe the correct label. Goal: minimize the number of mistakes. Any suggestions?

A Weighted Majority Algorithm

Each expert votes for its classification; only votes from experts who have never been wrong are counted. Go with the majority. The number of mistakes satisfies $M \le \log_2(n)$.

Analysis: let $w_{i,t} = \mathbf{1}(h_i \text{ correct on first } t \text{ rounds})$ and $W_t = \sum_i w_{i,t}$. Then $W_0 = n$ and $W_T \ge 1$, since the perfect expert survives. A mistake on round $t$ implies $W_{t+1} \le W_t/2$, so $1 \le W_T \le W_0/2^M = n/2^M$, which gives $M \le \log_2(n)$.
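A minimal sketch of this rule (the expert set and data stream are hypothetical placeholders):

```python
# Sketch of the majority-vote-over-consistent-experts ("halving") algorithm.
# Assumes some expert is perfect, so the set of survivors never empties.

def run_halving(experts, stream):
    """experts: list of functions point -> {0, 1}; stream yields (point, label)."""
    alive = set(range(len(experts)))      # experts that have never been wrong
    mistakes = 0
    for point, label in stream:
        votes = sum(experts[i](point) for i in alive)
        prediction = 1 if 2 * votes >= len(alive) else 0   # majority of survivors
        if prediction != label:
            mistakes += 1
        alive = {i for i in alive if experts[i](point) == label}
    return mistakes
```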

Weighted Majority [Littlestone & Warmuth '89]

What if there's no perfect expert? Each expert $i$ has a weight $w(i)$ and votes for its classification in $\{-1, 1\}$. Go with the weighted majority: predict $\mathrm{sign}(\sum_i w_i x_i)$, then halve the weights of wrong experts. Let $m$ be the number of mistakes of the best expert. How many mistakes $M$ do we make?

Analysis: the weights satisfy $w_{i,t} = (1/2)^{\#\text{ mistakes by } i \text{ on first } t \text{ rounds}}$. Let $W_t := \sum_i w_{i,t}$. Note $W_0 = n$ and $W_T \ge (1/2)^m$. A mistake on round $t$ implies $W_{t+1} \le \frac{3}{4} W_t$, so $(1/2)^m \le W_T \le W_0 (3/4)^M = n (3/4)^M$. Thus $(4/3)^M \le n\, 2^m$ and $M \le 2.41\,(m + \log_2(n))$.
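A minimal sketch of this deterministic algorithm, assuming each round supplies every expert's vote in $\{-1, +1\}$ followed by the true label:

```python
# Sketch of deterministic Weighted Majority with weight-halving.

def run_weighted_majority(n, rounds):
    """rounds yields (votes, label) with votes a length-n list over {-1, +1}."""
    w = [1.0] * n
    mistakes = 0
    for votes, label in rounds:
        score = sum(wi * v for wi, v in zip(w, votes))
        prediction = 1 if score >= 0 else -1      # weighted majority vote
        if prediction != label:
            mistakes += 1
        # Halve the weight of every expert that voted incorrectly.
        w = [wi * 0.5 if v != label else wi for wi, v in zip(w, votes)]
    return mistakes
```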

Can we do better?

We have $M \le 2.41\,(m + \log_2(n))$. Consider two experts whose predictions alternate:

Experts\Time:  1  2  3  4
e1:            0  1  0  1
e2:            1  0  1  0

By always revealing the label opposite to our prediction, an adversary forces a deterministic algorithm to err every round, while the better expert errs at most half the time. So no deterministic algorithm can get $M < 2m$. What if there are more than 2 choices?

Regret

"Maybe all one can do is hope to end up with the right regrets." (Arthur Miller)

Notation: define loss (or cost) functions $c_t$ and define the regret of $x_1, x_2, \ldots, x_T$ as

$$R_T = \sum_{t=1}^{T} c_t(x_t) - \sum_{t=1}^{T} c_t(x^*), \qquad x^* = \operatorname{argmin}_{x \in X} \sum_{t=1}^{T} c_t(x).$$

A sequence has "no regret" if $R_T = o(T)$. Questions: How can we improve Weighted Majority? What is the lowest regret we can hope for?

The Hedge/WMR Algorithm* [Freund & Schapire '97]

Hedge($\epsilon$):
  Initialize $w_{i,0} = 1$ for all $i$. Let $p_t(i) := w_{i,t} / \sum_j w_{j,t}$.
  In each round $t$:
    Choose expert $e_t$ from the categorical distribution $p_t$.
    Select $x_t = x(e_t, t)$, the advice/prediction of $e_t$.
    For each $i$, set $w_{i,t+1} = w_{i,t}\,(1-\epsilon)^{c_t(x(e_i, t))}$.

How does this compare to WM?

* Pedantic note: Hedge is often called Randomized Weighted Majority, and abbreviated WMR, though WMR was published in the context of binary classification, unlike Hedge.

The Hedge/WMR Algorithm

Key features: randomization, and an expert's influence shrinks exponentially with its cumulative loss. Intuitively: either we do well on a round, or the total weight drops, and the total weight can't drop too much unless every expert is lousy.
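A minimal sketch of Hedge($\epsilon$) in Python, assuming full-information feedback with costs in $[0,1]$; here `cost(t, i)` is a hypothetical oracle giving expert $i$'s cost in round $t$:

```python
import random

def hedge(n, T, eps, cost):
    w = [1.0] * n                                      # w_{i,0} = 1 for all i
    total = 0.0
    for t in range(T):
        s = sum(w)
        p = [wi / s for wi in w]                       # p_t(i) = w_{i,t} / sum_j w_{j,t}
        e_t = random.choices(range(n), weights=p)[0]   # sample expert e_t ~ p_t
        total += cost(t, e_t)                          # pay the chosen expert's cost
        # Multiplicative update: w_{i,t+1} = w_{i,t} * (1 - eps)^{c_t(i)}
        w = [w[i] * (1.0 - eps) ** cost(t, i) for i in range(n)]
    return total
```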

Hedge Performance

Theorem: Let $x_1, x_2, \ldots$ be the choices of Hedge($\epsilon$). Then

$$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] \le \frac{1}{1-\epsilon}\,\mathrm{OPT}_T + \frac{\ln(n)}{\epsilon},$$

where $\mathrm{OPT}_T := \min_i \sum_{t=1}^{T} c_t(x(e_i, t))$. If $\epsilon = \sqrt{\ln(n)/\mathrm{OPT}}$, the regret is $O\!\left(\sqrt{\mathrm{OPT}\,\ln(n)}\right)$.

Hedge Analysis

Intuitively: either we do well on a round, or the total weight drops, and the total weight can't drop too much unless every expert is lousy.

Let $W_t := \sum_i w_{i,t}$. Then $W_0 = n$ and $W_{T+1} \ge (1-\epsilon)^{\mathrm{OPT}}$.

$$\begin{aligned}
W_{t+1} &= \sum_i w_{i,t} (1-\epsilon)^{c_t(x_{i,t})} && (1)\\
&= \sum_i W_t\, p_t(i)\, (1-\epsilon)^{c_t(x_{i,t})} && \text{[def.\ of } p_t(i)\text{]} \quad (2)\\
&\le \sum_i W_t\, p_t(i)\, (1 - \epsilon\, c_t(x_{i,t})) && \text{[Bernoulli's ineq.: } (1+x)^r \le 1+rx \text{ for } x > -1,\ r \in (0,1)\text{]} \quad (3)\\
&= W_t\, (1 - \epsilon\, \mathbb{E}[c_t(x_t)]) && (4)\\
&\le W_t \exp(-\epsilon\, \mathbb{E}[c_t(x_t)]) && [1-x \le e^{-x}] \quad (5)
\end{aligned}$$

Hedge Analysis (continued)

Chaining inequality (5) over all rounds gives

$$W_{T+1} \le W_0 \exp\left(-\epsilon \sum_{t=1}^{T} \mathbb{E}[c_t(x_t)]\right), \qquad \text{so} \qquad \sum_{t=1}^{T} \mathbb{E}[c_t(x_t)] \le \frac{1}{\epsilon} \ln\!\left(\frac{W_0}{W_{T+1}}\right).$$

Recall $W_0 = n$ and $W_{T+1} \ge (1-\epsilon)^{\mathrm{OPT}}$. Therefore

$$\mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t)\right] \le \frac{\ln(n)}{\epsilon} - \frac{\mathrm{OPT}\,\ln(1-\epsilon)}{\epsilon} \le \frac{\ln(n)}{\epsilon} + \frac{\mathrm{OPT}}{1-\epsilon},$$

using $-\ln(1-\epsilon) \le \epsilon/(1-\epsilon)$ in the last step.

Lower Bound

If $\epsilon = \sqrt{\ln(n)/\mathrm{OPT}}$, the regret is $O\!\left(\sqrt{\mathrm{OPT}\,\ln(n)}\right)$. Can we do better?

Let $c_t(x) \sim \mathrm{Bernoulli}(1/2)$ for all $x$ and $t$. Let $Z_i := \sum_{t=1}^{T} c_t(x(e_i, t))$. Then $Z_i \sim \mathrm{Bin}(T, 1/2)$ is roughly normally distributed, with $\sigma = \frac{1}{2}\sqrt{T}$ and $\Pr[Z_i \le \mu - k\sigma] = \exp(-\Theta(k^2))$. Every algorithm gets expected cost about $\mu = T/2$, while the best choice is likely to get $\mu - \Omega(\sqrt{T \ln(n)}) = \mu - \Omega(\sqrt{\mathrm{OPT}\,\ln(n)})$.
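A quick simulation can illustrate this floor; the following sketch (parameter values are arbitrary) compares the best expert's advantage over the mean $T/2$ against $\sqrt{T \ln n}$:

```python
import math
import random

# Illustrative sketch: with n experts whose costs are independent fair coin
# flips, the best expert beats the average cost T/2 by roughly sqrt(T ln n),
# which is the regret floor matching the Hedge upper bound.

def best_expert_gap(n=50, T=10000, trials=10):
    gaps = []
    for _ in range(trials):
        totals = [sum(random.random() < 0.5 for _ in range(T)) for _ in range(n)]
        gaps.append(T / 2 - min(totals))   # how far the best expert is below the mean
    return sum(gaps) / trials

print(best_expert_gap(), math.sqrt(10000 * math.log(50)))  # same order of magnitude
```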

What have we shown?

A simple algorithm that learns to do nearly as well as the best fixed choice. Hedge can exploit any pattern that the best choice does. It works against adaptive adversaries and is suitable for playing repeated games. Related ideas appear in the Algorithmic Game Theory literature.

Related Questions

Optimize and get no-regret against richer classes of strategies/experts:

- All distributions over experts
- All sequences of experts that have K transitions [Auer et al. '02]
- Various classes of functions of input features [Blum & Mansour '05]; e.g., consider the time of day when choosing a driving route
- An arbitrary convex set of experts, a metric space of experts, etc., with linear, convex, or Lipschitz costs [Zinkevich '03, Kleinberg et al. '08]
- All policies of a K-state, initially unknown Markov Decision Process that models the world [Auer et al. '08]
- Arbitrary sets of strategies in $\mathbb{R}^n$ with linear costs that we can optimize offline [Hannan '57, Kalai & Vempala '02]

Related Questions

Other notions of regret (see, e.g., [Blum & Mansour '05]):

- Time selection functions: get low regret on Mondays, rainy days, etc.
- Sleeping experts: if the rule "if P then predict Q" is right 90% of the time it applies, be right 89% of the time P applies.
- Internal regret & swap regret: if you played $x_1, \ldots, x_T$, then have no regret against $g(x_1), \ldots, g(x_T)$ for every $g : X \to X$.

Sleeping Experts

If the rule "if P then predict Q" is right 90% of the time it applies, be right 89% of the time P applies, and get this for every rule simultaneously. Idea: generate lots of hypotheses that specialize on certain inputs, some good, some lousy, and combine them into a great classifier. Many applications [Freund et al. '97, Blum '97, Blum & Mansour '05]: document classification, spam filtering, adaptive UIs, ... E.g., if ("physics" in D) then classify D as science. Predicates can overlap.

Sleeping Experts

Predicates can overlap. E.g., predict college major given the set of classes C you're enrolled in:

  if (ML-101, CS-201 in C) then CS
  if (ML-101, Stats-201 in C) then Stats

What do we predict for students enrolled in ML-101, CS-201, and Stats-201?

Sleeping Experts [Algorithm from Blum & Mansour '05]

SleepingExperts($\beta$, E, F)
Input: $\beta \in (0,1)$, experts $E$, time selection functions $F$.
Initialize $w^0_{e,f} = 1$ for all $e \in E$, $f \in F$.
In each round $t$:
  Let $w^t_e = \sum_f f(t)\, w^t_{e,f}$. Let $W^t = \sum_e w^t_e$. Let $p^t_e = w^t_e / W^t$.
  Choose expert $e_t$ from the categorical distribution $p^t$.
  Select $x_t = x(e_t, t)$, the advice/prediction of $e_t$.
  For each $e \in E$, $f \in F$: $w^{t+1}_{e,f} = w^t_{e,f}\, \beta^{f(t)\,(c_t(e) - \mathbb{E}[c_t(e_t)])}$.
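A minimal sketch of this update, assuming costs in $[0,1]$ and time selection functions mapping a round to $[0,1]$; `cost(t, e)` is a hypothetical oracle for expert $e$'s cost in round $t$:

```python
import random

def sleeping_experts(beta, experts, selectors, T, cost):
    """experts: list of hashable expert ids; selectors: list of functions t -> [0, 1]."""
    w = {(e, f): 1.0 for e in experts for f in selectors}
    for t in range(T):
        w_e = {e: sum(f(t) * w[(e, f)] for f in selectors) for e in experts}
        W = sum(w_e.values())
        p = [w_e[e] / W for e in experts]
        e_t = random.choices(experts, weights=p)[0]        # sample expert e_t ~ p^t
        expected = sum(pe * cost(t, e) for pe, e in zip(p, experts))  # E[c_t(e_t)]
        # Multiplicative update with base beta, scaled by the selector value f(t).
        for e in experts:
            for f in selectors:
                w[(e, f)] *= beta ** (f(t) * (cost(t, e) - expected))
    return w
```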

Sleeping Experts [Algorithm from Blum & Mansour '05]

The update $w^{t+1}_{e,f} = w^t_{e,f}\, \beta^{f(t)(c_t(e) - \mathbb{E}[c_t(e_t)])}$ ensures the total sum of weights can never increase:

$$\sum_{e,f} w^t_{e,f} \le nm \quad \text{for all } t,$$

and

$$w^T_{e,f} = \prod_{t' \le T} \beta^{f(t')(c_{t'}(e) - \mathbb{E}[c_{t'}(e_{t'})])} = \beta^{\sum_{t' \le T} f(t')(c_{t'}(e) - \mathbb{E}[c_{t'}(e_{t'})])} \le nm.$$

Sleeping Experts Performance

Let $n = |E|$, $m = |F|$, and fix $T \in \mathbb{N}$. Let $C(e,f) := \sum_{t=1}^{T} f(t)\, c_t(e)$ and $C_{\mathrm{alg}}(f) := \sum_{t=1}^{T} f(t)\, c_t(e_t)$. Then for all $e \in E$, $f \in F$:

$$\mathbb{E}[C_{\mathrm{alg}}(f)] \le \frac{1}{\beta}\left( C(e,f) + \log_{1/\beta}(nm) \right).$$

If $\beta = 1 - \epsilon$ is close to 1,

$$\mathbb{E}[C_{\mathrm{alg}}(f)] = (1 + O(\epsilon))\, C(e,f) + O\!\left(\frac{\log_2(nm)}{\epsilon}\right).$$

Optimizing $\epsilon$ yields a regret bound of $O\!\left(\sqrt{C(e,f)\log(nm)} + \log(nm)\right)$.