The UCB Algorithm


Paper: "Finite-time Analysis of the Multiarmed Bandit Problem" by Auer, Cesa-Bianchi, and Fischer, Machine Learning 47, 2002. Presented by Markus Enzenberger. Go Seminar, University of Alberta, March 14, 2007.

Outline
- Introduction: Multiarmed Bandit Problem; Regret; Lai and Robbins (1985); In this Paper
- Main Results: Theorem 1 (UCB1); Theorem 2 (UCB2); Theorem 3 (ɛ_n-GREEDY); Theorem 4 (UCB1-NORMAL); Independence Assumptions
- Experiments: UCB1-TUNED; Setup; Best value for α; Summary of Results; Comparison on Distribution 11
- Conclusions

Multiarmed Bandit Problem
Example of the exploration vs. exploitation dilemma.
- K independent gambling machines (armed bandits)
- Each machine has an unknown stationary probability distribution for generating the reward
- Observed rewards when playing machine i: X_{i,1}, X_{i,2}, ...
- Policy A chooses the next machine based on the previous sequence of plays and rewards
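To make the setting concrete, here is a minimal Python sketch of the setup just described: K machines, each with its own stationary reward distribution. Bernoulli rewards are used purely as an illustrative assumption, and the class and names are mine, not the paper's.

    import random

    class BernoulliBandit:
        """K independent machines; machine i returns reward 1 with (hidden) probability p[i]."""
        def __init__(self, p):
            self.p = list(p)   # true success probabilities, unknown to the policy
            self.K = len(p)

        def play(self, i):
            # Draw reward X_{i,t} from machine i's stationary distribution
            return 1.0 if random.random() < self.p[i] else 0.0

A policy interacts with such an environment by repeatedly calling play(i) for a chosen machine i and updating its statistics for that machine.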

Regret
Regret of policy A after n plays:
$\mu^* n - \sum_{j=1}^{K} \mu_j E[T_j(n)]$
- µ_i: expected reward of machine i
- µ*: expected reward of the optimal machine
- T_j(n): number of times machine j was played in the first n plays
The regret is the expected loss after n plays due to the fact that the policy does not always play the optimal machine.
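As a small worked example, the regret can be computed directly from the true means and the play counts; the helper below is a sketch with names of my choosing.

    def regret(mu, plays):
        """Expected regret mu* * n - sum_j mu_j * T_j(n) from true means and play counts."""
        n = sum(plays)
        mu_star = max(mu)
        return mu_star * n - sum(m * t for m, t in zip(mu, plays))

    # Three machines with means 0.5, 0.9, 0.6; the optimal machine was played 80 of 100 times:
    print(regret([0.5, 0.9, 0.6], [10, 80, 10]))   # 0.9*100 - (5 + 72 + 6) = 7.0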

Lai and Robbins (1985)
Policy for a class of reward distributions (including normal, Bernoulli, Poisson) with regret asymptotically bounded by the logarithm of n:
$E[T_j(n)] \approx \ln(n) / D(p_j \| p^*)$ as $n \to \infty$
- D(p_j ‖ p*): Kullback-Leibler divergence between the reward densities of machine j and of the optimal machine
- This is the best possible regret
- The policy computes an upper confidence index for each machine
- It needs the entire sequence of rewards for each machine
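For example, for Bernoulli reward distributions with means p_j and p*, the divergence in the denominator is
$D(p_j \| p^*) = p_j \ln(p_j / p^*) + (1 - p_j) \ln((1 - p_j) / (1 - p^*))$,
so machines whose means are close to the optimal mean have a small divergence and must be played more often before they can be ruled out.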

In this Paper
- Show policies with logarithmic regret uniformly over time
- Policies are simple and efficient
- Notation: $\Delta_i := \mu^* - \mu_i$

Theorem 1 (UCB1)
Policy with finite-time regret logarithmically bounded for arbitrary sets of reward distributions with bounded support.

Theorem 1 (UCB1)
$E[T_j(n)] \le (8 / \Delta_j^2) \ln(n) + O(1)$
- Worse than Lai and Robbins: $D(p_j \| p^*) \ge 2\Delta_j^2$, so the best possible main constant is at most $1 / (2\Delta_j^2)$
- UCB2 brings the main constant arbitrarily close to $1 / (2\Delta_j^2)$
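The UCB1 policy itself is simple: play each machine once, then always play the machine maximizing the sample mean plus the exploration bonus $\sqrt{2 \ln(n) / n_j}$. Below is a minimal Python sketch of that rule (class and variable names are mine).

    import math

    class UCB1:
        def __init__(self, K):
            self.counts = [0] * K      # n_j: number of plays of machine j
            self.means = [0.0] * K     # sample mean reward of machine j
            self.n = 0                 # total number of plays so far

        def select(self):
            # Initialization: play every machine once
            for j, c in enumerate(self.counts):
                if c == 0:
                    return j
            # Index: sample mean + sqrt(2 ln n / n_j)
            return max(range(len(self.counts)),
                       key=lambda j: self.means[j]
                       + math.sqrt(2.0 * math.log(self.n) / self.counts[j]))

        def update(self, j, reward):
            self.n += 1
            self.counts[j] += 1
            self.means[j] += (reward - self.means[j]) / self.counts[j]

With the BernoulliBandit sketch above, one round is simply j = policy.select() followed by policy.update(j, bandit.play(j)).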

Theorem 2 (UCB2)
More complicated version of UCB1 with better constants for the bound on the regret.

Theorem 2 (UCB2)

Theorem 2 (UCB2)
- The first term of the bound brings the main constant arbitrarily close to $1 / (2\Delta_j^2)$
- However, $c_\alpha \to \infty$ as $\alpha \to 0$
- Remedy for small α: let $\alpha = \alpha_n$ decrease slowly with n
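For orientation, here is a rough Python sketch of the quantities UCB2 works with, as I reconstruct them from the paper (the epoch lengths τ(r) = ⌈(1+α)^r⌉ and the bonus term below are my reading of the paper's definitions, so treat this as a sketch rather than a definitive implementation). Machines are played in epochs: the machine with the largest index is played τ(r_j + 1) − τ(r_j) times, after which its epoch counter r_j is incremented.

    import math

    def tau(r, alpha):
        # Epoch length schedule tau(r) = ceil((1 + alpha)^r)
        return math.ceil((1.0 + alpha) ** r)

    def ucb2_bonus(n, r, alpha):
        # Exploration bonus a_{n,r} = sqrt((1 + alpha) * ln(e * n / tau(r)) / (2 * tau(r)))
        t = tau(r, alpha)
        return math.sqrt((1.0 + alpha) * math.log(math.e * n / t) / (2.0 * t))

    # Machine j with sample mean mean_j and epoch counter r_j gets the index
    #   mean_j + ucb2_bonus(n, r_j, alpha)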

Theorem 3 (ɛ_n-GREEDY)
Similar result for the ɛ-greedy heuristic. (ɛ needs to go to 0; a constant ɛ gives linear regret.)

Theorem 3 (ɛ_n-GREEDY)
- For c large enough, the bound is of order c/(d² n) + o(1/n)
- This is a bound on the instantaneous regret (the probability of playing a suboptimal machine at play n); summing it over n gives a logarithmic bound on the cumulative regret
- Requires a known lower bound d on the difference in expectation between the best and the second-best machine
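A minimal sketch of the ɛ_n-greedy rule, assuming the exploration schedule $\varepsilon_n = \min\{1, cK / (d^2 n)\}$ from Theorem 3 (function names are mine):

    import random

    def epsilon_n(n, K, c, d):
        # Exploration probability at play n (n >= 1): eps_n = min(1, c*K / (d^2 * n))
        return min(1.0, c * K / (d * d * n))

    def eps_greedy_select(means, n, c, d):
        K = len(means)
        if random.random() < epsilon_n(n, K, c, d):
            return random.randrange(K)                    # explore: uniformly random machine
        return max(range(K), key=lambda j: means[j])      # exploit: highest sample mean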

Theorem 4 (UCB1-NORMAL)
Index-based policy with logarithmically bounded finite-time regret for normally distributed rewards with unknown mean and variance.

Theorem 4 (UCB1-NORMAL)

Theorem 4 (UCB1-NORMAL)
- Like UCB1, but since the form of the distribution is known, the sample variance is used to estimate the variance of each machine's distribution
- The proof depends on bounds for the tails of the χ² and Student distributions, which were only verified numerically
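As I reconstruct the index from the paper (the exact constants below are my reading of Theorem 4, not taken from these slides): any machine that has been played fewer than ⌈8 log n⌉ times is played first; otherwise the policy plays the machine maximizing
$\bar{x}_j + \sqrt{16 \cdot \frac{q_j - n_j \bar{x}_j^2}{n_j - 1} \cdot \frac{\ln(n-1)}{n_j}}$,
where q_j is the sum of squared rewards from machine j, so that $(q_j - n_j \bar{x}_j^2)/(n_j - 1)$ is the sample variance.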

Independence Assumptions
- Theorems 1–3 also hold for rewards that are not independent across machines: X_{i,s} and X_{j,t} may be dependent for any s, t and i ≠ j
- The rewards of a single machine do not need to be independent and identically distributed
- Weaker assumption: $E[X_{i,t} \mid X_{i,1}, \ldots, X_{i,t-1}] = \mu_i$ for all $1 \le t \le n$

UCB1-TUNED
- Fine-tuned version of UCB1 taking the measured variance into account (no proven regret bounds)
- Uses an upper confidence bound on the variance of machine j
- The upper confidence bound in UCB1 is replaced by one that incorporates this variance estimate; 1/4 is an upper bound on the variance of a Bernoulli random variable
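The replacement exploration term, as I transcribe it from the paper (the formula itself was dropped from this slide), is $\sqrt{(\ln(n) / n_j) \cdot \min\{1/4, V_j(n_j)\}}$ with the variance upper-confidence estimate $V_j(s) = (1/s) \sum_{\tau=1}^{s} X_{j,\tau}^2 - \bar{x}_{j,s}^2 + \sqrt{2 \ln(n) / s}$. A small Python sketch:

    import math

    def ucb1_tuned_bonus(n, n_j, mean_j, sumsq_j):
        """Exploration term sqrt((ln n / n_j) * min(1/4, V_j)), where
        V_j = (sum of squared rewards)/n_j - mean_j^2 + sqrt(2 ln n / n_j)."""
        v_j = sumsq_j / n_j - mean_j ** 2 + math.sqrt(2.0 * math.log(n) / n_j)
        return math.sqrt((math.log(n) / n_j) * min(0.25, v_j))

    # Machine j's index is mean_j + ucb1_tuned_bonus(n, n_j, mean_j, sumsq_j),
    # with sumsq_j the running sum of squared rewards from machine j.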

Setup
Distributions used in the experiments (table not reproduced in this transcription).

Best value for α
- Results are relatively insensitive to the choice, as long as α is small
- Use fixed α = 0.001

Summary of Results
- An optimally tuned ɛ_n-GREEDY almost always performs best
- The performance of a poorly tuned ɛ_n-GREEDY degrades rapidly
- In most cases UCB1-TUNED performs comparably to a well-tuned ɛ_n-GREEDY
- UCB1-TUNED is not sensitive to the variances of the machines
- UCB2 performs similarly to UCB1-TUNED, but always slightly worse

Comparison on Distribution 11
(plot not reproduced in this transcription)

Conclusions
- Simple, efficient policies for the bandit problem on any set of reward distributions with known bounded support, with uniform logarithmic regret
- Based on upper confidence bounds (with the exception of ɛ_n-GREEDY)
- Robust with respect to the introduction of moderate dependencies
- Many extensions of this work are possible:
  - Generalize to non-stationary problems
  - Policies based on Gittins allocation indices (need preliminary knowledge or learning of the indices)