A Bayesian point of view for better-performing bandit algorithms
Emilie Kaufmann, Telecom ParisTech
Rencontre des Jeunes Statisticiens, Aussois, August 28, 2013
1 Two bandit problems
2 Regret minimization: Bayesian bandits, frequentist bandits
3 Two Bayesian bandit algorithms
  The Bayes-UCB algorithm
  Thompson Sampling
4 Conclusion and perspectives
1 Two bandit problems
Bandit model
A multi-armed bandit model is a set of K arms where
- arm a is an unknown probability distribution $\nu_a$ with mean $\mu_a$,
- drawing arm a means observing a realization of $\nu_a$,
- arms are assumed to be independent.
In a bandit game, at round t, a forecaster
- chooses an arm $A_t$ to draw, based on past observations, according to its sampling strategy (or bandit algorithm),
- observes the reward $X_t \sim \nu_{A_t}$.
The forecaster has to learn which arm(s) is (are) the best: $a^\star = \operatorname{argmax}_a \mu_a$.
Bernoulli bandit model
A multi-armed bandit model is a set of K arms where
- arm a is a Bernoulli distribution $\mathcal{B}(p_a)$ with unknown mean $\mu_a = p_a$,
- drawing arm a means observing a realization of $\mathcal{B}(p_a)$ (0 or 1),
- arms are assumed to be independent.
In a bandit game, at round t, a forecaster
- chooses an arm $A_t$ to draw, based on past observations, according to its sampling strategy (or bandit algorithm),
- observes the reward $X_t \sim \mathcal{B}(p_{A_t})$.
The forecaster has to learn which arm(s) is (are) the best: $a^\star = \operatorname{argmax}_a p_a$.
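Below is a minimal Python sketch (not part of the original slides) of such a Bernoulli bandit environment; the class name and interface are illustrative choices, and the example arm means simply reuse the 10-arm instance shown later in the talk.

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli bandit: arm a yields a reward in {0, 1} with mean p_a."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)   # (p_1, ..., p_K)
        self.K = len(self.means)
        self.best_arm = int(np.argmax(self.means))    # a* = argmax_a p_a
        self.rng = rng or np.random.default_rng()

    def draw(self, a):
        """Draw arm a and observe a Bernoulli(p_a) reward."""
        return float(self.rng.random() < self.means[a])

# Example: the 10-arm instance used in the experiments later in the talk
env = BernoulliBandit([0.10, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01])
```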
The classical bandit problem: regret minimization
The forecaster wants to maximize the reward accumulated during learning, or equivalently minimize its regret:
$$R_n = n\mu^\star - \mathbb{E}\left[\sum_{t=1}^{n} X_t\right]$$
He has to find a sampling strategy (or bandit algorithm) that realizes a tradeoff between exploration and exploitation.
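To make the definition concrete, here is a hedged Monte Carlo sketch (my own code, reusing the BernoulliBandit class sketched above); the run_regret name and the policy interface are assumptions, not part of the talk.

```python
import numpy as np

def run_regret(env, policy, n, runs=100):
    """Average cumulative regret R_n of `policy` over independent runs.

    `policy(t, counts, sums)` must return the arm to draw at round t, given the
    number of draws N_a(t) and the sum of rewards S_a(t) of each arm so far.
    """
    regrets = np.zeros(runs)
    mu_star = env.means.max()
    for r in range(runs):
        counts = np.zeros(env.K)   # N_a(t)
        sums = np.zeros(env.K)     # S_a(t)
        total_reward = 0.0
        for t in range(1, n + 1):
            a = policy(t, counts, sums)
            x = env.draw(a)
            counts[a] += 1
            sums[a] += x
            total_reward += x
        regrets[r] = n * mu_star - total_reward
    return regrets.mean()

# A purely exploratory baseline: draw arms uniformly at random
uniform = lambda t, counts, sums: np.random.randint(len(counts))
```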
An alternative: pure exploration
The forecaster has to find the best arm(s), and does not suffer a loss when drawing bad arms.
He has to find a sampling strategy that optimally explores the environment, together with a stopping criterion, and to recommend a set $\mathcal{S}$ of m arms such that
$$\mathbb{P}(\mathcal{S} \text{ is the set of } m \text{ best arms}) \geq 1 - \delta.$$
Zoom on an application: online advertisement
Yahoo! has to choose, among K different advertisements, the one to display on its webpage for each user (indexed by $t \in \mathbb{N}$).
- Ad number a has an unknown probability of click $p_a$.
- The unknown best advertisement is $a^\star = \operatorname{argmax}_a p_a$.
- If ad a is displayed to user t, he clicks on it with probability $p_a$.
Yahoo!:
- chooses the ad $A_t$ to display to user number t,
- observes whether the user has clicked or not: $X_t \sim \mathcal{B}(p_{A_t})$.
Yahoo! can adjust its strategy $(A_t)$ so as to
- Regret minimization: maximize the number of clicks during n interactions
- Pure exploration: identify the best advertisement with probability at least $1 - \delta$
2 Regret minimization: Bayesian bandits, frequentist bandits
Two probabilistic modellings
K independent arms; $\mu^\star = \mu_{a^\star}$ denotes the highest expectation among the arms.
Frequentist: $\theta = (\theta_1, \ldots, \theta_K)$ is an unknown parameter; $(Y_{a,t})_t$ is i.i.d. with distribution $\nu_{\theta_a}$ and mean $\mu_a = \mu(\theta_a)$.
Bayesian: $\theta_a \overset{\text{i.i.d.}}{\sim} \pi_a$; $(Y_{a,t})_t$ is i.i.d. conditionally on $\theta_a$, with distribution $\nu_{\theta_a}$.
At time t, arm $A_t$ is chosen and the reward $X_t = Y_{A_t,t}$ is observed.
Two measures of performance:
- Minimize the regret $R_n(\theta) = \mathbb{E}_\theta\left[\sum_{t=1}^{n} (\mu^\star - \mu_{A_t})\right]$
- Minimize the Bayes risk $\mathrm{Risk}_n = \int R_n(\theta)\,d\pi(\theta)$
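To make the distinction concrete, here is an illustrative sketch (assuming the BernoulliBandit and run_regret helpers sketched earlier, and a uniform prior on each arm, which is only one possible choice): the regret is evaluated for a fixed θ, while the Bayes risk averages it over θ drawn from the prior.

```python
import numpy as np

def bayes_risk(policy, n, K=2, runs=200, rng=None):
    """Monte Carlo estimate of the Bayes risk under a uniform prior:
    draw theta ~ U([0,1])^K, then average the regret R_n(theta)."""
    rng = rng or np.random.default_rng()
    risks = []
    for _ in range(runs):
        theta = rng.uniform(0, 1, size=K)          # theta_a ~ pi_a = U([0,1])
        env = BernoulliBandit(theta, rng=rng)
        risks.append(run_regret(env, policy, n, runs=1))
    return float(np.mean(risks))
```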
Frequentist tools, Bayesian tools
Bandit algorithms based on frequentist tools use:
- the MLE of the parameter of each arm,
- confidence intervals for the mean of each arm.
Bandit algorithms based on Bayesian tools use:
- $\Pi^t = (\pi_1^t, \ldots, \pi_K^t)$, the current posterior over $(\theta_1, \ldots, \theta_K)$, with $\pi_a^t = p(\theta_a \mid \text{past observations from arm } a)$.
One can separate tools and objectives:

Objective   | Frequentist algorithms | Bayesian algorithms
Regret      | ?                      | ?
Bayes risk  | ?                      | ?
Our goal
We want to design Bayesian algorithms that are optimal with respect to the frequentist regret.
Asymptotically optimal algorithms with respect to the regret
Let $N_a(t)$ be the number of draws of arm a up to time t:
$$R_n(\theta) = \sum_{a=1}^{K} (\mu^\star - \mu_a)\,\mathbb{E}_\theta[N_a(n)]$$
Lai and Robbins, 1985: every consistent algorithm satisfies
$$\mu_a < \mu^\star \;\Rightarrow\; \liminf_{n\to\infty} \frac{\mathbb{E}_\theta[N_a(n)]}{\ln n} \geq \frac{1}{KL(\nu_{\theta_a}, \nu_{\theta^\star})}.$$
A bandit algorithm is asymptotically optimal if
$$\mu_a < \mu^\star \;\Rightarrow\; \limsup_{n\to\infty} \frac{\mathbb{E}_\theta[N_a(n)]}{\ln n} \leq \frac{1}{KL(\nu_{\theta_a}, \nu_{\theta^\star})}.$$
The optimism principle: a family of frequentist algorithms
The following heuristic defines a family of optimistic index policies:
- For each arm a, compute a confidence interval on the unknown mean: $\mu_a \leq \mathrm{UCB}_a(t)$ w.h.p.
- Use the optimism-in-face-of-uncertainty principle: act as if the best possible model were the true model.
The algorithm chooses at time t:
$$A_t = \operatorname{argmax}_a \mathrm{UCB}_a(t)$$
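A minimal sketch of this generic optimistic loop (illustrative only; the index_policy name and the index-function interface are my own choices). The UCB, KL-UCB and Bayes-UCB indices sketched further below all plug into it.

```python
import numpy as np

def index_policy(env, ucb_index, n):
    """Generic optimistic index policy: play each arm once, then
    always draw the arm with the largest index UCB_a(t)."""
    counts = np.zeros(env.K)   # N_a(t)
    sums = np.zeros(env.K)     # S_a(t), sum of rewards from arm a
    for t in range(1, n + 1):
        if t <= env.K:
            a = t - 1                                   # initialization: draw each arm once
        else:
            indices = [ucb_index(t, counts[b], sums[b]) for b in range(env.K)]
            a = int(np.argmax(indices))                 # A_t = argmax_a UCB_a(t)
        x = env.draw(a)
        counts[a] += 1
        sums[a] += x
    return counts, sums
```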
Towards optimal algorithms for Bernoulli bandits: UCB
UCB [Auer et al. 02] uses Hoeffding bounds:
$$\mathrm{UCB}_a(t) = \hat{p}_a(t) + \sqrt{\frac{\alpha \log t}{2 N_a(t)}}$$
where $\hat{p}_a(t) = \frac{S_a(t)}{N_a(t)}$ is the empirical mean of arm a.
Finite-time bound:
$$\mathbb{E}[N_a(n)] \leq \frac{K_1}{2(p^\star - p_a)^2} \ln n + K_2, \quad \text{with } K_1 > 1.$$
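A possible implementation of this Hoeffding-based index (the function name and the default value of α are illustrative choices):

```python
import numpy as np

def ucb_index(t, n_a, s_a, alpha=2.0):
    """Hoeffding-based UCB index for a Bernoulli arm:
    empirical mean + sqrt(alpha * log t / (2 * N_a(t)))."""
    if n_a == 0:
        return np.inf
    return s_a / n_a + np.sqrt(alpha * np.log(t) / (2 * n_a))

# usage with the generic loop sketched above (hypothetical names):
# counts, sums = index_policy(env, ucb_index, n=10000)
```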
Towards optimal algorithms for Bernoulli bandits: KL-UCB
KL-UCB [Cappé et al. 2013] uses the index:
$$u_a(t) = \max\left\{q \geq \hat{p}_a(t) : N_a(t)\, K(\hat{p}_a(t), q) \leq \log t + c \log\log t\right\}$$
with
$$K(p, q) := KL(\mathcal{B}(p), \mathcal{B}(q)) = p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.$$
[Figure: illustration of the index $u_a(t)$, the largest $q$ at which $N_a(t) K(S_a(t)/N_a(t), q)$ stays below the threshold $\beta(t, \delta)$.]
Finite-time bound:
$$\mathbb{E}[N_a(n)] \leq \frac{1}{K(p_a, p^\star)} \ln n + C$$
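Since $q \mapsto K(\hat{p}_a(t), q)$ is increasing on $[\hat{p}_a(t), 1]$, the index can be computed by bisection; the sketch below assumes that approach (kl_bernoulli and klucb_index are my names, not from the slides).

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence K(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def klucb_index(t, n_a, s_a, c=0.0, tol=1e-6):
    """KL-UCB index: largest q >= p_hat with N_a(t) K(p_hat, q) <= log t + c log log t,
    computed by bisection."""
    if n_a == 0:
        return 1.0
    p_hat = s_a / n_a
    threshold = (np.log(t) + c * np.log(max(np.log(t), 1e-12))) / n_a
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo
```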
3 Two Bayesian bandit algorithms
UCBs versus Bayesian algorithms
[Figure: confidence intervals on the arms' means after t rounds of a bandit game.]
[Figure: posterior distributions over the arms' means after t rounds of a bandit game.]
How do we exploit the posterior in a Bayesian bandit algorithm?
The Bayes-UCB algorithm
The algorithm
Let:
- $\Pi^0 = (\pi_1^0, \ldots, \pi_K^0)$ be a prior distribution over $(\theta_1, \ldots, \theta_K)$,
- $\Lambda^t = (\lambda_1^t, \ldots, \lambda_K^t)$ be the posterior over the means $(\mu_1, \ldots, \mu_K)$ at the end of round t.
The Bayes-UCB algorithm chooses at time t
$$A_t = \operatorname{argmax}_a\; Q\left(1 - \frac{1}{t(\log t)^c},\; \lambda_a^{t-1}\right)$$
where $Q(\alpha, \pi)$ is the quantile of order $\alpha$ of the distribution $\pi$.
For Bernoulli bandits with a uniform prior on the means: $\theta_a = \mu_a = p_a$, $\Lambda^t = \Pi^t$,
$$\theta_a \overset{\text{i.i.d.}}{\sim} \mathcal{U}([0,1]) = \mathrm{Beta}(1,1), \qquad \lambda_a^t = \pi_a^t = \mathrm{Beta}(S_a(t) + 1,\, N_a(t) - S_a(t) + 1).$$
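A sketch of the corresponding index in the Bernoulli case, using the Beta posterior quantile from scipy (the function name and the guard for small t are illustrative choices):

```python
import numpy as np
from scipy.stats import beta

def bayes_ucb_index(t, n_a, s_a, c=0):
    """Bayes-UCB index for a Bernoulli arm with uniform prior:
    quantile of order 1 - 1/(t (log t)^c) of Beta(S_a + 1, N_a - S_a + 1)."""
    level = 1.0 - 1.0 / (t * max(np.log(t), 1.0) ** c)   # guard avoids (log t)^c < 1 for small t
    return beta.ppf(level, s_a + 1, n_a - s_a + 1)

# usage with the generic optimistic loop sketched earlier (hypothetical names):
# counts, sums = index_policy(env, bayes_ucb_index, n=10000)
```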
Theoretical results for Bernoulli bandits
Bayes-UCB is asymptotically optimal.
Theorem [K., Cappé, Garivier 2012]
Let $\epsilon > 0$. The Bayes-UCB algorithm using a uniform prior over the arms and with parameter $c \geq 5$ satisfies
$$\mathbb{E}_\theta[N_a(n)] \leq \frac{1 + \epsilon}{K(p_a, p^\star)} \log n + o_{\epsilon,c}(\log n).$$
Link to a frequentist algorithm
The Bayes-UCB index is close to the KL-UCB index: $\tilde{u}_a(t) \leq q_a(t) \leq u_a(t)$, with
$$u_a(t) = \max\left\{q \geq \frac{S_a(t)}{N_a(t)} : N_a(t)\, K\!\left(\frac{S_a(t)}{N_a(t)}, q\right) \leq \log t + c \log\log t\right\}$$
$$\tilde{u}_a(t) = \max\left\{q \geq \frac{S_a(t)}{N_a(t) + 1} : (N_a(t) + 1)\, K\!\left(\frac{S_a(t)}{N_a(t) + 1}, q\right) \leq \log\left(\frac{t}{N_a(t) + 2}\right) + c \log\log t\right\}$$
Bayes-UCB thus automatically builds confidence intervals based on the Kullback-Leibler divergence, adapted to the geometry of the problem in this specific case.
Thompson Sampling
The algorithm
A randomized Bayesian algorithm:
$$\forall a \in \{1, \ldots, K\}, \quad \theta_a(t) \sim \pi_a^t, \qquad A_t = \operatorname{argmax}_a\, \mu(\theta_a(t))$$
(Recent) interest in this algorithm:
- a very old algorithm [Thompson 1933]
- partial analyses proposed [Granmo 2010] [May, Korda, Lee, Leslie 2012]
- extensive numerical study beyond the Bernoulli case [Chapelle, Li 2011]
- first logarithmic upper bound on the regret [Agrawal, Goyal 2012]
Thompson Sampling (Bernoulli bandits)
For Bernoulli bandits with a uniform prior, the algorithm becomes:
$$\forall a \in \{1, \ldots, K\}, \quad \theta_a(t) \sim \mathrm{Beta}(S_a(t) + 1,\, N_a(t) - S_a(t) + 1), \qquad A_t = \operatorname{argmax}_a\, \theta_a(t)$$
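A sketch of Thompson Sampling in this Bernoulli case (illustrative naming; it reuses the BernoulliBandit environment sketched earlier):

```python
import numpy as np

def thompson_sampling(env, n, rng=None):
    """Thompson Sampling for Bernoulli bandits with a uniform (Beta(1,1)) prior:
    sample theta_a(t) ~ Beta(S_a + 1, N_a - S_a + 1) and play the argmax."""
    rng = rng or np.random.default_rng()
    counts = np.zeros(env.K)   # N_a(t)
    sums = np.zeros(env.K)     # S_a(t)
    for t in range(n):
        samples = rng.beta(sums + 1, counts - sums + 1)   # one posterior sample per arm
        a = int(np.argmax(samples))
        x = env.draw(a)
        counts[a] += 1
        sums[a] += x
    return counts, sums
```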
An optimal regret bound for Bernoulli bandits
Assume the first arm is the unique optimal arm.
Known result [Agrawal, Goyal 2012]:
$$\mathbb{E}[R_n] \leq C\left(\sum_{a=2}^{K} \frac{1}{p^\star - p_a}\right)\ln n + o_\mu(\ln n)$$
Our improvement [K., Korda, Munos 2012]:
Theorem. For all $\epsilon > 0$,
$$\mathbb{E}[R_n] \leq (1 + \epsilon)\left(\sum_{a=2}^{K} \frac{p^\star - p_a}{K(p_a, p^\star)}\right)\ln n + o_{\mu,\epsilon}(\ln n)$$
In practice
$\theta = [0.1\;\ 0.05\;\ 0.05\;\ 0.05\;\ 0.02\;\ 0.02\;\ 0.02\;\ 0.01\;\ 0.01\;\ 0.01]$
[Figure: cumulative regret as a function of time (up to 10000 rounds) for UCB, KL-UCB, Bayes-UCB and Thompson Sampling on this 10-arm Bernoulli instance.]
In practice
In the Bernoulli case, for each arm:
- KL-UCB requires solving an optimization problem: $u_a(t) = \max\{q \geq \hat{p}_a(t) : N_a(t) K(\hat{p}_a(t), q) \leq \log t + c \log\log t\}$,
- Bayes-UCB requires computing one quantile of a Beta distribution,
- Thompson Sampling requires drawing one sample from a Beta distribution.
Other advantages of Bayesian algorithms:
- they easily generalize to more complex models...
- ...even when the posterior is not directly computable (using MCMC),
- the prior can incorporate correlations between arms.
4 Conclusion and perspectives
Summary for regret minimization

Objective   | Frequentist algorithms | Bayesian algorithms
Regret      | KL-UCB                 | Bayes-UCB, Thompson Sampling
Bayes risk  | KL-UCB                 | Gittins algorithm (finite horizon)

Future work:
- Is the Gittins algorithm optimal with respect to the regret?
- Are our Bayesian algorithms efficient with respect to the Bayes risk?
Bayesian algorithm for pure exploration?
At round t, the KL-LUCB algorithm [K., Kalyanakrishnan 2013]:
- draws two well-chosen arms, $u_t$ and $l_t$,
- stops when the confidence intervals for the arms in $J(t)$ and in $J(t)^c$ are separated,
- recommends the set of m empirical best arms.
[Figure: confidence intervals on the arms' means for m = 3, with the numbers of draws per arm; the set $J(t)$ with arm $l_t$ in bold, and the set $J(t)^c$ with arm $u_t$ in bold.]
Bayesian algorithm for pure exploration?
KL-LUCB uses KL confidence intervals:
$$L_a(t) = \min\left\{q \leq \hat{p}_a(t) : N_a(t)\, K(\hat{p}_a(t), q) \leq \beta(t, \delta)\right\},$$
$$U_a(t) = \max\left\{q \geq \hat{p}_a(t) : N_a(t)\, K(\hat{p}_a(t), q) \leq \beta(t, \delta)\right\}.$$
We use $\beta(t, \delta) = \log\left(\frac{k_1 K t^\alpha}{\delta}\right)$ to make sure that $\mathbb{P}(\mathcal{S} = \mathcal{S}_m^\star) \geq 1 - \delta$.
[Figure: the resulting KL confidence intervals on the arms' means.]
How can we propose a Bayesian algorithm that adapts to $\delta$?
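A sketch of these KL confidence intervals, computed by bisection on each side of the empirical mean (reusing kl_bernoulli from the earlier sketch; the constants $k_1$ and $\alpha$ in $\beta(t, \delta)$ are left as parameters since the slide does not give their values):

```python
import numpy as np

def kl_confidence_interval(n_a, s_a, beta_val, tol=1e-6):
    """KL confidence interval [L_a(t), U_a(t)] for a Bernoulli arm: all q such
    that N_a(t) * K(p_hat, q) <= beta_val, found by bisection on each side."""
    p_hat = s_a / n_a
    threshold = beta_val / n_a

    def search(lo, hi, keep_upper):
        # K(p_hat, .) is 0 at p_hat and increases as q moves away from p_hat
        while hi - lo > tol:
            mid = (lo + hi) / 2
            inside = kl_bernoulli(p_hat, mid) <= threshold
            if keep_upper:          # searching for U_a(t) on [p_hat, 1]
                lo, hi = (mid, hi) if inside else (lo, mid)
            else:                   # searching for L_a(t) on [0, p_hat]
                lo, hi = (lo, mid) if inside else (mid, hi)
        return lo if keep_upper else hi

    lower = search(0.0, p_hat, keep_upper=False)
    upper = search(p_hat, 1.0, keep_upper=True)
    return lower, upper

def beta_rate(t, delta, K, k1, alpha):
    """Exploration rate beta(t, delta) = log(k1 * K * t^alpha / delta);
    k1 and alpha must be chosen as in the KL-LUCB analysis."""
    return np.log(k1 * K * t**alpha / delta)
```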
Conclusion
Regret minimization: go Bayesian!
- Bayes-UCB shows striking similarities with KL-UCB.
- Thompson Sampling is an easy-to-implement alternative to the optimistic approach.
- Both algorithms are asymptotically optimal with respect to the frequentist regret (and more efficient in practice).
TODO list:
- Go deeper into the link between Bayes risk and (frequentist) regret (is the Gittins algorithm frequentist-optimal?)
- Obtain theoretical guarantees for Bayes-UCB and Thompson Sampling beyond Bernoulli bandit models (e.g. when rewards belong to an exponential family)
- Develop Bayesian algorithms for the pure-exploration objective
References
- E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. AISTATS 2012.
- E. Kaufmann, N. Korda, and R. Munos. Thompson Sampling: an asymptotically optimal finite-time analysis. ALT 2012.
- E. Kaufmann and S. Kalyanakrishnan. Information Complexity in Bandit Subset Selection. COLT 2013.