Multi-Armed Bandit Problems


Machine Learning Coms-4771 Multi-Armed Bandit Problems Lecture 20

Multi-armed Bandit Problems. The setting: there are K arms (or actions). At each time t, each arm i pays off a bounded real-valued reward $x_i(t)$, say in $[0,1]$. At each time t, the learner chooses a single arm $i_t \in \{1,\dots,K\}$ and receives reward $x_{i_t}(t)$. The goal is to maximize the total return. This is the simplest instance of the exploration-exploitation problem.

Bandits for targeting content: choose the best content to display to the next visitor of your website. Content options = slot machines; reward = the user's response (e.g., a click on an ad). A simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems.

Stochastic bandits: each arm i is associated with some unknown probability distribution with expectation $\mu_i$. Rewards are drawn i.i.d. The largest expected reward is $\mu^* = \max_{i \in \{1,\dots,K\}} \mu_i$. Regret after T plays:
$$\mu^* T - \mathbb{E}\left[\sum_{t=1}^T x_{i_t}(t)\right],$$
where the expectation is over the draws of rewards and the randomness in the player's strategy.

Adversarial (nonstochastic) bandits: no assumption is made about the reward sequence (other than that it is bounded). Regret after T plays:
$$\max_i \sum_{t=1}^T x_i(t) - \mathbb{E}\left[\sum_{t=1}^T x_{i_t}(t)\right],$$
where the expectation is only over the randomness in the player's strategy.
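
To make the stochastic regret definition concrete, here is a minimal simulation sketch; the Bernoulli arms, their means, and the uniformly random player are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli arms: arm i pays 1 with probability mu[i], else 0.
mu = np.array([0.2, 0.5, 0.7])   # unknown to the learner
K, T = len(mu), 10_000

def pull(i):
    """Draw an i.i.d. reward in [0, 1] from arm i."""
    return float(rng.random() < mu[i])

# A deliberately naive player that picks arms uniformly at random,
# just to illustrate how regret is measured.
rewards = np.array([pull(rng.integers(K)) for _ in range(T)])

# Stochastic regret: mu* T minus the player's total reward
# (here the realized total stands in for the expectation).
regret = mu.max() * T - rewards.sum()
print(f"estimated regret of random play: {regret:.1f}")  # roughly (mu* - mean(mu)) * T
```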

Stochastic Bandits: Upper Confidence Bounds. Strategy UCB: play each arm once. At any time t > K, (deterministically) play the machine $i_t$ maximizing, over $j \in \{1,\dots,K\}$,
$$\bar{x}_j(t) + \sqrt{\frac{2\ln t}{T_{j,t}}},$$
where $\bar{x}_j(t)$ is the average reward obtained from machine j and $T_{j,t}$ is the number of times j has been played so far.

UCB intuition: the second term $\sqrt{2\ln t / T_{j,t}}$ is the size of the one-sided $(1 - 1/t)$-confidence interval for the average reward (using Chernoff-Hoeffding bounds). Theorem (Auer, Cesa-Bianchi, Fischer): at time T, the regret of the UCB policy is at most
$$\frac{8K\ln T}{\Delta} + 5K,$$
where $\Delta = \mu^* - \max_{i:\mu_i < \mu^*} \mu_i$ (the gap between the best expected reward and the expected reward of the runner-up).
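
A compact runnable sketch of the UCB rule just described; the Bernoulli arms and the horizon are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.2, 0.5, 0.7])          # hypothetical arm means, unknown to the learner
K, T = len(mu), 10_000

counts = np.zeros(K)                     # T_{j,t}: number of plays of arm j
sums = np.zeros(K)                       # running sum of rewards of arm j

for t in range(1, T + 1):
    if t <= K:                           # play each arm once
        j = t - 1
    else:                                # play arm maximizing mean + sqrt(2 ln t / T_{j,t})
        ucb = sums / counts + np.sqrt(2.0 * math.log(t) / counts)
        j = int(np.argmax(ucb))
    reward = float(rng.random() < mu[j])
    counts[j] += 1
    sums[j] += reward

print("plays per arm:", counts)          # most plays should concentrate on the best arm
```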

Stochastic Bandits: ε-greedy. Randomized policy: ε_t-greedy. Parameter: a schedule $\epsilon_1, \epsilon_2, \dots$, where $0 \le \epsilon_t \le 1$. At each time t: (exploit) with probability $1 - \epsilon_t$, play the arm $i_t$ with the highest current average reward; (explore) with probability $\epsilon_t$, play an arm chosen uniformly at random. Is there a schedule of $\epsilon_t$ which guarantees logarithmic regret? A constant ε causes linear regret. Fix: let $\epsilon_t$ go to 0 as our estimates of the expected rewards become more accurate.

Theorem (Auer, Cesa-Bianchi, Fischer): if $\epsilon_t = 12K/(d^2 t)$ where $0 < d \le \Delta$, then the instantaneous regret (i.e., the probability of choosing a suboptimal arm) at any time t of ε_t-greedy is at most $O\!\left(\frac{K}{d^2 t}\right)$. The regret of ε_t-greedy at time T (summing over the steps) is thus at most $O\!\left(\frac{K}{d^2}\log T\right)$ (using $\sum_{t=1}^T \frac{1}{t} \le \ln T + \gamma$, where $\gamma \approx 0.5772$ is the Euler constant).
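
A minimal sketch of the ε_t-greedy schedule above; the arm means, the choice d = 0.2, and the horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.2, 0.5, 0.7])                 # hypothetical arm means
K, T, d = len(mu), 10_000, 0.2                 # d should lower-bound the gap Delta

counts = np.zeros(K)
means = np.zeros(K)

for t in range(1, T + 1):
    eps_t = min(1.0, 12.0 * K / (d * d * t))   # decaying exploration schedule
    if rng.random() < eps_t:
        j = int(rng.integers(K))               # explore: uniformly random arm
    else:
        j = int(np.argmax(means))              # exploit: best empirical mean
    reward = float(rng.random() < mu[j])
    counts[j] += 1
    means[j] += (reward - means[j]) / counts[j]  # incremental average update

print("plays per arm:", counts)
```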

Practical performance (from Auer, Cesa-Bianchi and Fischer): Tuning UCB: replace $\sqrt{2\ln t / T_{i,t}}$ with
$$\sqrt{\frac{\ln t}{T_{i,t}}\,\min\{1/4,\; V_{i,t}\}},$$
where $V_{i,t}$ is an upper confidence bound for the variance of arm i. (The factor 1/4 is an upper bound on the variance of any $[0,1]$-bounded variable.) This performs significantly better in practice. ε-greedy is quite sensitive to bad parameter tuning and to large differences in response rates; otherwise an optimally tuned ε-greedy performs very well. UCB-Tuned performs comparably to a well-tuned ε-greedy and is not very sensitive to large differences in response rates.
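
As a sketch, the tuned exploration bonus could be computed as below. The specific form of $V_{i,t}$ (empirical variance plus a confidence term) follows the flavor suggested by Auer et al., but treat the exact expression as an assumption rather than the lecture's definition.

```python
import math
import numpy as np

def tuned_bonus(rewards_of_arm_i, t):
    """Exploration bonus sqrt((ln t / T_i) * min(1/4, V_i)) for one arm.

    rewards_of_arm_i: rewards in [0, 1] observed so far for arm i.
    V_i is assumed to be the empirical variance plus sqrt(2 ln t / T_i),
    an upper confidence bound on the variance.
    """
    r = np.asarray(rewards_of_arm_i, dtype=float)
    T_i = len(r)
    v_i = r.var() + math.sqrt(2.0 * math.log(t) / T_i)
    return math.sqrt((math.log(t) / T_i) * min(0.25, v_i))

print(tuned_bonus([0.0, 1.0, 1.0, 0.0, 1.0], t=100))
```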

Nonstochastic Bandits: Recap. No assumptions are made about the generation of rewards. They are modeled by an arbitrary sequence of reward vectors $(x_1(t), \dots, x_K(t))$, where $x_i(t) \in [0,1]$ is the reward obtained if action i is chosen at time t. At step t, the player chooses arm $i_t$ and receives $x_{i_t}(t)$. Regret after T plays (with respect to the best single action):
$$\underbrace{\max_j \sum_{t=1}^T x_j(t)}_{G_{\max}\,=\,\text{reward of the best action in hindsight}} \;-\; \underbrace{\mathbb{E}\left[\sum_{t=1}^T x_{i_t}(t)\right]}_{\text{expected reward of the player}}$$

Exp3 Algorithm (Auer, Cesa-Bianchi, Freund, and Schapire)
Initialization: $w_i(1) = 1$ for $i \in \{1,\dots,K\}$. Set $\gamma = \min\left\{1, \sqrt{\frac{K\ln K}{(e-1)g}}\right\}$, where $g \ge G_{\max}$.
For each $t = 1, 2, \dots$:
Set $p_i(t) = (1-\gamma)\,\frac{w_i(t)}{\sum_{j=1}^K w_j(t)} + \frac{\gamma}{K}$.
Draw $i_t$ randomly according to $p_1(t), \dots, p_K(t)$.
Receive reward $x_{i_t}(t) \in [0,1]$.
For $j = 1, \dots, K$ set the estimated rewards and update the weights:
$$\hat{x}_j(t) = \begin{cases} x_j(t)/p_j(t) & \text{if } j = i_t \\ 0 & \text{otherwise} \end{cases}, \qquad w_j(t+1) = w_j(t)\,\exp\!\big(\gamma\,\hat{x}_j(t)/K\big).$$
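
A runnable sketch of Exp3 as stated above. The reward sequence here is just a fixed random matrix, an assumption for illustration; taking $g = T$ is valid because every per-round reward lies in $[0,1]$, so $G_{\max} \le T$.

```python
import numpy as np

rng = np.random.default_rng(3)
K, T = 3, 10_000
x = rng.random((T, K))                             # arbitrary reward sequence in [0, 1]

g = T                                              # upper bound on G_max
gamma = min(1.0, np.sqrt(K * np.log(K) / ((np.e - 1) * g)))

w = np.ones(K)
total_reward = 0.0

for t in range(T):
    p = (1 - gamma) * w / w.sum() + gamma / K      # mix in uniform exploration
    i_t = rng.choice(K, p=p)
    reward = x[t, i_t]
    total_reward += reward
    x_hat = np.zeros(K)
    x_hat[i_t] = reward / p[i_t]                   # importance-weighted reward estimate
    w *= np.exp(gamma * x_hat / K)                 # exponential weight update
    w /= w.max()                                   # rescale for numerical stability (p unchanged)

best_in_hindsight = x.sum(axis=0).max()
bound = 2.63 * np.sqrt(T * K * np.log(K))
print(f"regret: {best_in_hindsight - total_reward:.1f}  (theorem bound ~ {bound:.1f})")
```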

Exp3 Algorithm (Auer, Cesa-Bianchi, Freund, and Schapire). Theorem: for any T > 0 and for any sequence of rewards, the regret of the player is bounded by
$$2\sqrt{(e-1)\,gK\ln K} \;\le\; 2.63\,\sqrt{TK\ln K}$$
(taking $g = T$, which is valid since each reward lies in $[0,1]$). Observation: setting $\hat{x}_{i_t}(t)$ to $x_{i_t}(t)/p_{i_t}(t)$ guarantees that the expectations of the estimates are equal to the actual rewards for each action:
$$\mathbb{E}[\hat{x}_j(t) \mid i_1, \dots, i_{t-1}] = p_j(t)\,\frac{x_j(t)}{p_j(t)} = x_j(t),$$
where the expectation is with respect to the random choice of $i_t$ at time t (given the choices in the previous rounds). So dividing by $p_{i_t}$ compensates for the reward of actions with a small probability of being drawn.
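
A quick numerical check of this unbiasedness claim; the probabilities and rewards below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(4)
p = np.array([0.6, 0.3, 0.1])     # arbitrary sampling distribution over 3 arms
x = np.array([0.2, 0.9, 0.5])     # arbitrary rewards at this round

draws = rng.choice(3, size=1_000_000, p=p)
x_hat = np.zeros((len(draws), 3))
x_hat[np.arange(len(draws)), draws] = x[draws] / p[draws]   # importance-weighted estimates

print(x_hat.mean(axis=0))          # close to x = [0.2, 0.9, 0.5]
```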

Proof: let $W_t = \sum_j w_j(t)$. We have
$$\frac{W_{t+1}}{W_t} = \sum_{i=1}^K \frac{w_i(t+1)}{W_t} = \sum_{i=1}^K \frac{w_i(t)\exp(\gamma\hat{x}_i(t)/K)}{W_t} = \sum_{i=1}^K \frac{p_i(t) - \gamma/K}{1-\gamma}\,\exp\!\big(\gamma\hat{x}_i(t)/K\big)$$
$$\le \sum_{i=1}^K \frac{p_i(t) - \gamma/K}{1-\gamma}\left(1 + \frac{\gamma}{K}\hat{x}_i(t) + (e-2)\frac{\gamma^2}{K^2}\hat{x}_i^2(t)\right) \quad \text{(using } e^z \le 1 + z + (e-2)z^2 \text{ for } z \in [0,1])$$
$$\le \underbrace{\sum_{i=1}^K \frac{p_i(t) - \gamma/K}{1-\gamma}}_{=1} + \frac{\gamma}{(1-\gamma)K}\sum_{i=1}^K p_i(t)\hat{x}_i(t) + \frac{(e-2)\gamma^2}{(1-\gamma)K^2}\sum_{i=1}^K p_i(t)\hat{x}_i^2(t)$$
$$= 1 + \frac{\gamma}{(1-\gamma)K}\underbrace{x_{i_t}(t)}_{\text{the only non-zero term}} + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{i=1}^K \hat{x}_i(t),$$
where the last step uses $p_i(t)\hat{x}_i^2(t) = x_i(t)\hat{x}_i(t) \le \hat{x}_i(t)$.

Take logs (using the approximation $1 + z \le e^z$):
$$\ln\frac{W_{t+1}}{W_t} \le \frac{\gamma}{(1-\gamma)K}\,x_{i_t}(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{i=1}^K \hat{x}_i(t).$$
Summing over t,
$$\ln\frac{W_{T+1}}{W_1} \le \frac{\gamma}{(1-\gamma)K}\underbrace{G_{\text{Exp3}}}_{\text{reward of Exp3}} + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{t=1}^T\sum_{i=1}^K \hat{x}_i(t).$$
Now, for any fixed arm j,
$$\ln\frac{W_{T+1}}{W_1} \ge \ln\frac{w_j(T+1)}{W_1} = \ln\frac{w_j(1)}{W_1} + \frac{\gamma}{K}\sum_{t=1}^T \hat{x}_j(t) = -\ln K + \frac{\gamma}{K}\sum_{t=1}^T \hat{x}_j(t).$$
Combine with the upper bound:
$$\frac{\gamma}{K}\sum_{t=1}^T \hat{x}_j(t) - \ln K \le \frac{\gamma}{(1-\gamma)K}\,G_{\text{Exp3}} + \frac{(e-2)(\gamma/K)^2}{1-\gamma}\sum_{t=1}^T\sum_{i=1}^K \hat{x}_i(t).$$
Solve for $G_{\text{Exp3}}$ (multiply through by $(1-\gamma)K/\gamma$):
$$G_{\text{Exp3}} \ge (1-\gamma)\sum_{t=1}^T \hat{x}_j(t) - \frac{K\ln K}{\gamma} - (e-2)\frac{\gamma}{K}\sum_{t=1}^T\sum_{i=1}^K \hat{x}_i(t).$$

Take the expectation of both sides with respect to the distribution of $i_1, \dots, i_T$. Since $\mathbb{E}[\hat{x}_i(t)] = x_i(t)$ and $\sum_{t=1}^T\sum_{i=1}^K x_i(t) \le K\,G_{\max}$,
$$\mathbb{E}[G_{\text{Exp3}}] \ge (1-\gamma)\sum_{t=1}^T x_j(t) - \frac{K\ln K}{\gamma} - (e-2)\frac{\gamma}{K}\,K\,G_{\max}.$$
Since j was chosen arbitrarily, this holds in particular for the best arm:
$$\mathbb{E}[G_{\text{Exp3}}] \ge (1-\gamma)G_{\max} - \frac{K\ln K}{\gamma} - (e-2)\gamma\,G_{\max}.$$
Thus
$$G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le \frac{K\ln K}{\gamma} + (e-1)\gamma\,G_{\max}.$$
The value of $\gamma$ in the algorithm is chosen to minimize this regret bound.

Comments: You don't need to know T in advance (guess and double). It is possible to get high-probability bounds (with a modified version of Exp3 that uses upper confidence bounds). Stronger notions of regret: compete with the best in a class of strategies. The difference between the $\sqrt{T}$ bounds and the $\log T$ bounds is a bit misleading: the difference is not due to the adversarial nature of the rewards but to the asymptotic quantification. The $\log T$ bounds hold for any fixed set of reward distributions (so $\Delta$ is fixed before T, not after).