# Un point de vue bayésien pour des algorithmes de bandit plus performants

Save this PDF as:

Size: px
Start display at page:

Download "Un point de vue bayésien pour des algorithmes de bandit plus performants"

## Transcription

1 Un point de vue bayésien pour des algorithmes de bandit plus performants Emilie Kaufmann, Telecom ParisTech Rencontre des Jeunes Statisticiens, Aussois, 28 août 2013 Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 1 / 36

2 1 Two bandit problems 2 Regret minimization: Bayesian bandits, frequentist bandits 3 Two Bayesian bandit algorithms The Bayes-UCB algorithm Thompson Sampling 4 Conclusion and perspectives Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 2 / 36

3 Two bandit problems 1 Two bandit problems 2 Regret minimization: Bayesian bandits, frequentist bandits 3 Two Bayesian bandit algorithms The Bayes-UCB algorithm Thompson Sampling 4 Conclusion and perspectives Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 3 / 36

4 Two bandit problems Bandit model A multi-armed bandit model is a set of K arms where Arm a is an unknown probability distribution ν a with mean µ a Drawing arm a is observing a realization of ν a Arms are assumed to be independent In a bandit game, at round t, a forecaster chooses arm A t to draw based on past observations, according to its sampling strategy (or bandit algorithm) observes reward X t ν At The forecaster is to learn which arm(s) is (are) the best a = argmax a µ a Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 4 / 36

5 Two bandit problems Bernoulli bandit model A multi-armed bandit model is a set of K arms where Arm a is a Bernoulli distribution B(p a ) with unknown mean µ a = p a Drawing arm a is observing a realization of B(p a ) (0 or 1) Arms are assumed to be independent In a bandit game, at round t, a forecaster chooses arm A t to draw based on past observations, according to its sampling strategy (or bandit algorithm) observes reward X t B(p At ) The forecaster is to learn which arm(s) is (are) the best a = argmax a p a Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 5 / 36

6 Two bandit problems Regret minimization The classical bandit problem: regret minimization The forecaster wants to maximize the reward accumulated during learning or equivalentely minimize its regret: [ n ] R n = nµ a E X t He has to find a sampling strategy (or bandit algorithm) that realizes a tradeoff between exploration and exploitation t=1 Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 6 / 36

7 Two bandit problems An alternative: pure-exploration Regret minimization The forecaster has to find the best arm(s), and does not suffer a loss when drawing bad arms. He has to find a sampling strategy that optimaly explores the environnement, together with a stopping criterion and to recommand a set S of m arms such that P(S is the set of m best arms) 1 δ. Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 7 / 36

8 Two bandit problems Regret minimization Zoom on an application: Online advertisement Yahoo!(c) has to choose between K different advertisement the one to display on its webpage for each user (indexed by t N). Ad number a unknown probability of click p a Unknown best advertisement a = argmax a p a If ad a is displayed for user t, he clicks on it with probability p a Yahoo!(c): chooses ad A t to display for user number t observes whether the user has clicked or not: X t B(p At ) Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 8 / 36

11 Regret minimization: Bayesian bandits, frequentist bandits 1 Two bandit problems 2 Regret minimization: Bayesian bandits, frequentist bandits 3 Two Bayesian bandit algorithms The Bayes-UCB algorithm Thompson Sampling 4 Conclusion and perspectives Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 10 / 36

12 Regret minimization: Bayesian bandits, frequentist bandits Two probabilistic modellings Two probabilistic models K independent arms. µ = µ a Frequentist : θ = (θ 1,..., θ K ) unknown parameter (Y a,t ) t is i.i.d. with distribution ν θa with mean µ a = µ(θ a ) highest expectation among the arms. Bayesian : i.i.d. θ a π a (Y a,t ) t is i.i.d. conditionally to θ a with distribution ν θa At time t, arm A t is chosen and reward X t = Y At,t is observed Two measures of performance Minimize regret R n (θ) = E θ [ n t=1 µ µ At ] Minimize Bayes risk Risk n = R n (θ)dπ(θ) Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 11 / 36

13 Regret minimization: Bayesian bandits, frequentist bandits Frequentist tools, Bayesian tools Two probabilistic models Bandit algorithms based on frequentist tools use: MLE for the parameter of each arm confidence intervals for the mean of each arm Bandit algorithms based on Bayesian tools use: Π t = (π t 1,..., πt K ) the current posterior over (θ 1,..., θ K ) π t a = p(θ a past observations from arm a) Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 12 / 36

14 Regret minimization: Bayesian bandits, frequentist bandits Frequentist tools, Bayesian tools Two probabilistic models Bandit algorithms based on frequentist tools use: MLE for the parameter of each arm confidence intervals for the mean of each arm Bandit algorithms based on Bayesian tools use: Π t = (π t 1,..., πt K ) the current posterior over (θ 1,..., θ K ) π t a = p(θ a past observations from arm a) One can separate tools and objectives: Objective Frequentist Bayesian algorithms algorithms Regret?? Bayes risk?? Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 12 / 36

15 Regret minimization: Bayesian bandits, frequentist bandits Frequentist tools, Bayesian tools Two probabilistic models Bandit algorithms based on frequentist tools use: MLE for the parameter of each arm confidence intervals for the mean of each arm Bandit algorithms based on Bayesian tools use: Π t = (π t 1,..., πt K ) the current posterior over (θ 1,..., θ K ) π t a = p(θ a past observations from arm a) One can separate tools and objectives: Objective Frequentist Bayesian algorithms algorithms Regret?? Bayes risk?? Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 13 / 36

16 Regret minimization: Bayesian bandits, frequentist bandits Our goal Two probabilistic models We want to design Bayesian algorithm that are optimal with respect to the frequentist regret Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 14 / 36

17 Regret minimization: Bayesian bandits, frequentist bandits Frequentist optimality Asymptotically optimal algorithms towards the regret N a (t) the number of draws of arm a up to time t K R n (θ) = (µ µ a )E θ [N a (n)] a=1 Lai and Robbins,1985 : every consistent algorithm satisfies µ a < µ lim inf n E θ [N a (n)] ln n 1 KL(ν θa, ν θ ) A bandit algorithm is asymptotically optimal if µ a < µ lim sup n E θ [N a (n)] ln n 1 KL(ν θa, ν θ ) Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 15 / 36

18 Regret minimization: Bayesian bandits, frequentist bandits The optimism principle A family of frequentist algorithms The following heuristic defines a family of optimistic index policies: For each arm a, compute a confidence interval on the unknown mean: µ a UCB a (t) w.h.p Use the optimism-in-face-of-uncertainty principle: act as if the best possible model was the true model The algorithm chooses at time t A t = arg max a UCB a (t) Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 16 / 36

19 Regret minimization: Bayesian bandits, frequentist bandits UCB Towards optimal algorithms for Bernoulli bandits UCB [Auer et al. 02] uses Hoeffding bounds: α log(t) UCB a (t) = ˆp a (t) + 2N a (t) where ˆp a (t) = Sa(t) N a(t) Finite-time bound: E[N a (n)] is the empirical mean of arm a. K 1 2(p p a ) 2 ln n + K 2, with K 1 > 1. Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 17 / 36

20 Regret minimization: Bayesian bandits, frequentist bandits KL-UCB Towards optimal algorithms for Bernoulli bandits KL-UCB[Cappé et al. 2013] uses the index: u a (t) = max {q ˆp a (t) : N a (t)k(ˆp a (t), q) log t + c log log t} β(t,δ)/n a (t) with K(p, q) := KL (B(p), B(q)) = p log Finite-time bound: E[N a (n)] S a (t)/n a (t) ( ) ( ) p 1 p + (1 p) log q 1 q 1 K(p a, p ) ln n + C Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 18 / 36

21 Two Bayesian bandit algorithms 1 Two bandit problems 2 Regret minimization: Bayesian bandits, frequentist bandits 3 Two Bayesian bandit algorithms The Bayes-UCB algorithm Thompson Sampling 4 Conclusion and perspectives Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 19 / 36

22 Two Bayesian bandit algorithms UCBs versus Bayesian algorithms Bayesian algorithms Figure : Confidence intervals for the arms means after t rounds of a bandit game Figure : Posterior over the arms means after t rounds of a bandit game Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 20 / 36

23 Two Bayesian bandit algorithms UCBs versus Bayesian algorithms Bayesian algorithms Figure : Confidence intervals for the arms means after t rounds of a bandit game Figure : Posterior over the arms means after t rounds of a bandit game How do we exploit the posterior in a Bayesian bandit algorithm? Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 20 / 36

24 Two Bayesian bandit algorithms The Bayes-UCB algorithm 1 Two bandit problems 2 Regret minimization: Bayesian bandits, frequentist bandits 3 Two Bayesian bandit algorithms The Bayes-UCB algorithm Thompson Sampling 4 Conclusion and perspectives Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 21 / 36

25 Two Bayesian bandit algorithms The Bayes-UCB algorithm The algorithm Let : Π 0 = (π 0 1,..., π0 K ) be a prior distribution over (θ 1,..., θ K ) Λ t = (λ t 1,..., λt K ) be the posterior over the means (µ 1,..., µ K ) a the end of round t The Bayes-UCB algorithm chooses at time t ( Q 1 A t = argmax a 1 t(log t) c, λt 1 a where Q(α, π) is the quantile of order α of the distribution π. For Benoulli bandits with uniform prior on the means: θ a = µ a = p a Λ t = Π t i.i.d θ a U([0, 1]) = Beta(1, 1) λ t a = π t a = Beta(S a (t) + 1, N a (t) S a (t) + 1) Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 22 / 36 )

26 Two Bayesian bandit algorithms Theoretical results Theoretical results for Bernoulli bandits Bayes-UCB is asymptotically optimal Theorem [K.,Cappé,Garivier 2012] Let ɛ > 0. The Bayes-UCB algorithm using a uniform prior over the arms and with parameter c 5 satisfies E θ [N a (n)] 1 + ɛ K(p a, p ) log(n) + o ɛ,c (log(n)). Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 23 / 36

27 Two Bayesian bandit algorithms Link to a frequentist algorithm Theoretical results Bayes-UCB index is close to KL-UCB index: ũ a (t) q a (t) u a (t) with: { u a (t) = max q S ( ) } a(t) N a (t) : N Sa (t) a(t)k N a (t), q log t + c log log t { ũ a (t) = max q S ( ) a(t) N a (t) + 1 : (N Sa (t) a(t) + 1)K N a (t) + 1, q ( ) } t log + c log log t N a (t) + 2 Bayes-UCB appears to build automatically confidence intervals based on Kullback-Leibler divergence, that are adapted to the geometry of the problem in this specific case. Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 24 / 36

28 Two Bayesian bandit algorithms Thompson Sampling 1 Two bandit problems 2 Regret minimization: Bayesian bandits, frequentist bandits 3 Two Bayesian bandit algorithms The Bayes-UCB algorithm Thompson Sampling 4 Conclusion and perspectives Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 25 / 36

29 Thompson Sampling Two Bayesian bandit algorithms The algorithm A randomized Bayesian algorithm: a {1..K}, θ a (t) πa t A t = argmax a µ(θ a (t)) (Recent) interest for this algorithm: a very old algorithm [Thompson 1933] partial analysis proposed [Granmo 2010][May, Korda, Lee, Leslie 2012] extensive numerical study beyond the Bernoulli case [Chapelle, Li 2011] first logarithmic upper bound on the regret [Agrawal,Goyal 2012] Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 26 / 36

30 Two Bayesian bandit algorithms The algorithm Thompson Sampling (Bernoulli bandits) A randomized Bayesian algorithm: a {1..K}, θ a (t) Beta(S a (t) + 1, N a (t) S a (t) + 1) A t = argmax a θ a (t) (Recent) interest for this algorithm: a very old algorithm [Thompson 1933] partial analysis proposed [Granmo 2010][May, Korda, Lee, Leslie 2012] extensive numerical study beyond the Bernoulli case [Chapelle, Li 2011] first logarithmic upper bound on the regret [Agrawal,Goyal 2012] Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 27 / 36

31 Two Bayesian bandit algorithms Theoretical results An optimal regret bound for Bernoulli bandits Assume the first arm is the unique optimal arm. Known result : [Agrawal,Goyal 2012] ( K ) 1 E[R n ] C p ln(n) + o µ (ln(n)) p a a=2 Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 28 / 36

32 Two Bayesian bandit algorithms Theoretical results An optimal regret bound for Bernoulli bandits Assume the first arm is the unique optimal arm. Known result : [Agrawal,Goyal 2012] ( K ) 1 E[R n ] C p ln(n) + o µ (ln(n)) p a a=2 Our improvement : [K.,Korda,Munos 2012] Theorem ɛ > 0, E[R n ] (1 + ɛ) ( K a=2 p p a K(p a, p ) ) ln(n) + o µ,ɛ (ln(n)) Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 28 / 36

33 In practise Two Bayesian bandit algorithms Practical conclusion θ = [ ] 400 UCB 400 KLUCB regret time time 400 BayesUCB 400 Thompson time time Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 29 / 36

34 Two Bayesian bandit algorithms Practical conclusion In practise In the Bernoulli case, for each arm, KL-UCB requires to solve an optimization problem: u a (t) = max {q ˆp a (t) : N a (t)k(ˆp a (t), q) log t + c log log t} Bayes-UCB requires to compute one quantile of a Beta distribution Thompson requires to compute one sample of a Beta distribution Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 30 / 36

35 Two Bayesian bandit algorithms Practical conclusion In practise In the Bernoulli case, for each arm, KL-UCB requires to solve an optimization problem: u a (t) = max {q ˆp a (t) : N a (t)k(ˆp a (t), q) log t + c log log t} Bayes-UCB requires to compute one quantile of a Beta distribution Thompson requires to compute one sample of a Beta distribution Other advantages of Bayesian algorithms: they easily generalize to more complex models......even when the posterior is not directly computable (using MCMC) the prior can incorporate correlation between arms Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 30 / 36

36 Conclusion and perspectives 1 Two bandit problems 2 Regret minimization: Bayesian bandits, frequentist bandits 3 Two Bayesian bandit algorithms The Bayes-UCB algorithm Thompson Sampling 4 Conclusion and perspectives Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 31 / 36

37 Conclusion and perspectives Summary for regret minimization Future work: Objective Frequentist Bayesian algorithms algorithms Regret KL-UCB Bayes-UCB Thompson Sampling Bayes risk KL-UCB Gittins algorithm for finite horizon Is Gittins algorithm optimal with respect to the regret? Are our Bayesian algorithms efficient with respect to the Bayes risk? Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 32 / 36

38 Conclusion and perspectives Bayesian algorithm for pure-exploration? At round t, the KL-LUCB algorithm ([K., Kalyanakrishnan, 13]) draws two well-chosen arms: u t and l t stops when CI for arms in J(t) and J(t) c are separated recommends the set of m empirical best arms m=3. Set J(t), arm l t in bold Set J(t) c, arm u t in bold Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 33 / 36

39 Conclusion and perspectives Bayesian algorithm for pure-exploration? KL-LUCB uses KL-confidence intervals: L a (t) = min {q ˆp a (t) : N a (t)k(ˆp a (t), q) β(t, δ)}, U a (t) = max {q ˆp a (t) : N a (t)k(ˆp a (t), q) β(t, δ)}. ( ) We use β(t, δ) = log k1 Kt α to make sure P(S = Sm) 1 δ. δ Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 34 / 36

40 Conclusion and perspectives Bayesian algorithm for pure-exploration? KL-LUCB uses KL-confidence intervals: L a (t) = min {q ˆp a (t) : N a (t)k(ˆp a (t), q) β(t, δ)}, U a (t) = max {q ˆp a (t) : N a (t)k(ˆp a (t), q) β(t, δ)}. ( ) We use β(t, δ) = log k1 Kt α to make sure P(S = Sm) 1 δ. δ How to propose a Bayesian algorithm that adapts to δ? Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 34 / 36

41 Conclusion and perspectives Conclusion Regret minimization: Go Bayesian! Bayes-UCB show striking similarities with KL-UCB Thompson Sampling is an easy-to-implement alternative to the optimistic approach both algorithms are asymptotically optimal towards frequentist regret (and more efficient in practise) Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 35 / 36

42 Conclusion and perspectives Conclusion Regret minimization: Go Bayesian! Bayes-UCB show striking similarities with KL-UCB Thompson Sampling is an easy-to-implement alternative to the optimistic approach both algorithms are asymptotically optimal towards frequentist regret (and more efficient in practise) TODO list: Go deeper into the link between Bayes risk and (frequentist) regret (Gittins frequentist optimality?) Obtain theoretical guarantees for Bayes-UCB and Thompson Sampling beyond Bernoulli bandit models (e.g. when rewards belong to the exponential family) Develop Bayesian algorithm for the pure-exploration objective? Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 35 / 36

43 Conclusion and perspectives References E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. AISTATS 2012 E.Kaufmann, N.Korda, and R.Munos. Thompson Sampling: an asymptotically optimal finite-time analysis. ALT 2012 E.Kaufmann, S.Kalyanakrishnan. Information Complexity in Bandit Subset Selection, COLT 2013 Emilie Kaufmann (Telecom ParisTech) Bandits bayésiens Aussois, 28/08/13 36 / 36

### Introduction Main Results Experiments Conclusions. The UCB Algorithm

Paper Finite-time Analysis of the Multiarmed Bandit Problem by Auer, Cesa-Bianchi, Fischer, Machine Learning 27, 2002 Presented by Markus Enzenberger. Go Seminar, University of Alberta. March 14, 2007

### UCB REVISITED: IMPROVED REGRET BOUNDS FOR THE STOCHASTIC MULTI-ARMED BANDIT PROBLEM

UCB REVISITED: IMPROVED REGRET BOUNDS FOR THE STOCHASTIC MULTI-ARMED BANDIT PROBLEM PETER AUER AND RONALD ORTNER ABSTRACT In the stochastic multi-armed bandit problem we consider a modification of the

### Multi-Armed Bandit Problems

Machine Learning Coms-4771 Multi-Armed Bandit Problems Lecture 20 Multi-armed Bandit Problems The Setting: K arms (or actions) Each time t, each arm i pays off a bounded real-valued reward x i (t), say

### On Bayesian Upper Confidence Bounds for Bandit Problems

Emilie Kaufmann Olivier Cappé Aurélien Garivier LTCI, CNRS & Telecom ParisTech Abstract Stochastic bandit problems have been analyzed from two different perspectives: a frequentist view, where the parameter

### Lecture 3: UCB Algorithm, Worst-Case Regret Bound

IEOR 8100-001: Learning and Optimization for Sequential Decision Making 01/7/16 Lecture 3: UCB Algorithm, Worst-Case Regret Bound Instructor: Shipra Agrawal Scribed by: Karl Stratos = Jang Sun Lee 1 UCB

### Finite-time Analysis of the Multiarmed Bandit Problem*

Machine Learning, 47, 35 56, 00 c 00 Kluwer Academic Publishers. Manufactured in The Netherlands. Finite-time Analysis of the Multiarmed Bandit Problem* PETER AUER University of Technology Graz, A-8010

### Online Network Revenue Management using Thompson Sampling

Online Network Revenue Management using Thompson Sampling Kris Johnson Ferreira David Simchi-Levi He Wang Working Paper 16-031 Online Network Revenue Management using Thompson Sampling Kris Johnson Ferreira

### Yishay Mansour Google & Tel Aviv Univ. Many thanks for my co-authors: A. Blum, N. Cesa-Bianchi, and G. Stoltz

Regret Minimization: Algorithms and Applications Yishay Mansour Google & Tel Aviv Univ. Many thanks for my co-authors: A. Blum, N. Cesa-Bianchi, and G. Stoltz 2 Weather Forecast Sunny: Rainy: No meteorological

### Online Combinatorial Optimization under Bandit Feedback MOHAMMAD SADEGH TALEBI MAZRAEH SHAHI

Online Combinatorial Optimization under Bandit Feedback MOHAMMAD SADEGH TALEBI MAZRAEH SHAHI Licentiate Thesis Stockholm, Sweden 2016 TRITA-EE 2016:001 ISSN 1653-5146 ISBN 978-91-7595-836-1 KTH Royal Institute

### An Introduction to Game-theoretic Online Learning Tim van Erven

General Mathematics Colloquium Leiden, December 4, 2014 An Introduction to Game-theoretic Online Learning Tim van Erven Statistics Bayesian Statistics Frequentist Statistics Statistics Bayesian Statistics

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

Handling Advertisements of Unknown Quality in Search Advertising Sandeep Pandey Carnegie Mellon University spandey@cs.cmu.edu Christopher Olston Yahoo! Research olston@yahoo-inc.com Abstract We consider

### Revenue Optimization against Strategic Buyers

Revenue Optimization against Strategic Buyers Mehryar Mohri Courant Institute of Mathematical Sciences 251 Mercer Street New York, NY, 10012 Andrés Muñoz Medina Google Research 111 8th Avenue New York,

### Chapter 7. BANDIT PROBLEMS.

Chapter 7. BANDIT PROBLEMS. Bandit problems are problems in the area of sequential selection of experiments, and they are related to stopping rule problems through the theorem of Gittins and Jones (974).

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Beat the Mean Bandit

Yisong Yue H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, USA Thorsten Joachims Department of Computer Science, Cornell University, Ithaca, NY, USA yisongyue@cmu.edu tj@cs.cornell.edu

### Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising

Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising Min Xu Machine Learning Department Carnegie Mellon University minx@cs.cmu.edu ao Qin Microsoft Research Asia taoqin@microsoft.com

### Fast and Robust Normal Estimation for Point Clouds with Sharp Features

1/37 Fast and Robust Normal Estimation for Point Clouds with Sharp Features Alexandre Boulch & Renaud Marlet University Paris-Est, LIGM (UMR CNRS), Ecole des Ponts ParisTech Symposium on Geometry Processing

### Dynamic Pay-Per-Action Mechanisms and Applications to Online Advertising

Submitted to Operations Research manuscript OPRE-2009-06-288 Dynamic Pay-Per-Action Mechanisms and Applications to Online Advertising Hamid Nazerzadeh Microsoft Research, Cambridge, MA, hamidnz@microsoft.com

### Algorithms for the multi-armed bandit problem

Journal of Machine Learning Research 1 (2000) 1-48 Submitted 4/00; Published 10/00 Algorithms for the multi-armed bandit problem Volodymyr Kuleshov Doina Precup School of Computer Science McGill University

### Contextual-Bandit Approach to Recommendation Konstantin Knauf

Contextual-Bandit Approach to Recommendation Konstantin Knauf 22. Januar 2014 Prof. Ulf Brefeld Knowledge Mining & Assesment 1 Agenda Problem Scenario Scenario Multi-armed Bandit Model for Online Recommendation

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

### The Knowledge Gradient Algorithm for a General Class of Online Learning Problems

OPERATIONS RESEARCH Vol. 6, No. 1, January February 212, pp. 18 195 ISSN 3-364X (print) ISSN 1526-5463 (online) http://d.doi.org/1.1287/opre.111.999 212 INFORMS The Knowledge Gradient Algorithm for a General

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Online Learning in Bandit Problems

Online Learning in Bandit Problems by Cem Tekin A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical Engineering: Systems) in The University

### Message-passing sequential detection of multiple change points in networks

Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal

### Week 1: Introduction to Online Learning

Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / -21-8418-9 Cesa-Bianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### In Optimum Design 2000, A. Atkinson, B. Bogacka, and A. Zhigljavsky, eds., Kluwer, 2001, pp. 195 208.

In Optimum Design 2000, A. Atkinson, B. Bogacka, and A. Zhigljavsky, eds., Kluwer, 2001, pp. 195 208. ÔØ Ö ½ ÇÈÌÁÅÁ ÁÆ ÍÆÁÅÇ Ä Ê ËÈÇÆË ÍÆ ÌÁÇÆ ÇÊ ÁÆ Ê Î ÊÁ Ä Ë Â Ò À Ö Û ÈÙÖ Ù ÍÒ Ú Ö ØÝ Ï Ø Ä Ý ØØ ÁÆ ¼

### Trading regret rate for computational efficiency in online learning with limited feedback

Trading regret rate for computational efficiency in online learning with limited feedback Shai Shalev-Shwartz TTI-C Hebrew University On-line Learning with Limited Feedback Workshop, 2009 June 2009 Shai

### The Variability of P-Values. Summary

The Variability of P-Values Dennis D. Boos Department of Statistics North Carolina State University Raleigh, NC 27695-8203 boos@stat.ncsu.edu August 15, 2009 NC State Statistics Departement Tech Report

Online Learning with Switching Costs and Other Adaptive Adversaries Nicolò Cesa-Bianchi Università degli Studi di Milano Italy Ofer Dekel Microsoft Research USA Ohad Shamir Microsoft Research and the Weizmann

### Signal detection and goodness-of-fit: the Berk-Jones statistics revisited

Signal detection and goodness-of-fit: the Berk-Jones statistics revisited Jon A. Wellner (Seattle) INET Big Data Conference INET Big Data Conference, Cambridge September 29-30, 2015 Based on joint work

### Bayesian logistic betting strategy against probability forecasting. Akimichi Takemura, Univ. Tokyo. November 12, 2012

Bayesian logistic betting strategy against probability forecasting Akimichi Takemura, Univ. Tokyo (joint with Masayuki Kumon, Jing Li and Kei Takeuchi) November 12, 2012 arxiv:1204.3496. To appear in Stochastic

### The Intersection of Statistics and Topology:

The Intersection of Statistics and Topology: Confidence Sets Brittany Terese Fasy joint work with S. Balakrishnan, F. Chazal, F. Lecci, A. Rinaldo, A. Singh, L. Wasserman 18 January 2014 B. Fasy (Tulane)

### Regret Minimization for Reserve Prices in Second-Price Auctions

Regret Minimization for Reserve Prices in Second-Price Auctions Nicolò Cesa-Bianchi Università degli Studi di Milano Joint work with: Claudio Gentile (Varese) and Yishay Mansour (Tel-Aviv) N. Cesa-Bianchi

### Introduction to Detection Theory

Introduction to Detection Theory Reading: Ch. 3 in Kay-II. Notes by Prof. Don Johnson on detection theory, see http://www.ece.rice.edu/~dhj/courses/elec531/notes5.pdf. Ch. 10 in Wasserman. EE 527, Detection

### arxiv:1306.0155v1 [cs.lg] 1 Jun 2013

Dynamic ad allocation: bandits with budgets Aleksandrs Slivkins May 2013 arxiv:1306.0155v1 [cs.lg] 1 Jun 2013 Abstract We consider an application of multi-armed bandits to internet advertising (specifically,

Year 2008-2009. Research Internship Report Machine Learning tools for Online Advertisement Victor Gabillon victorgabillon@inria.fr Supervisors INRIA : Philippe Preux Jérémie Mary Supervisors Orange Labs

### A Survey of Preference-based Online Learning with Bandit Algorithms

A Survey of Preference-based Online Learning with Bandit Algorithms Róbert Busa-Fekete and Eyke Hüllermeier Department of Computer Science University of Paderborn, Germany {busarobi,eyke}@upb.de Abstract.

### The Price of Truthfulness for Pay-Per-Click Auctions

Technical Report TTI-TR-2008-3 The Price of Truthfulness for Pay-Per-Click Auctions Nikhil Devanur Microsoft Research Sham M. Kakade Toyota Technological Institute at Chicago ABSTRACT We analyze the problem

### Sufficient Statistics and Exponential Family. 1 Statistics and Sufficient Statistics. Math 541: Statistical Theory II. Lecturer: Songfeng Zheng

Math 541: Statistical Theory II Lecturer: Songfeng Zheng Sufficient Statistics and Exponential Family 1 Statistics and Sufficient Statistics Suppose we have a random sample X 1,, X n taken from a distribution

### Monotone multi-armed bandit allocations

JMLR: Workshop and Conference Proceedings 19 (2011) 829 833 24th Annual Conference on Learning Theory Monotone multi-armed bandit allocations Aleksandrs Slivkins Microsoft Research Silicon Valley, Mountain

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### On the Complexity of A/B Testing

JMLR: Workshop and Conference Proceedings vol 35:1 3, 014 On the Complexity of A/B Testing Emilie Kaufmann LTCI, Télécom ParisTech & CNRS KAUFMANN@TELECOM-PARISTECH.FR Olivier Cappé CAPPE@TELECOM-PARISTECH.FR

### Summary of Probability

Summary of Probability Mathematical Physics I Rules of Probability The probability of an event is called P(A), which is a positive number less than or equal to 1. The total probability for all possible

Adaptive Bidding for Display Advertising ABSTRACT Arpita Ghosh Yahoo! Research 2821 Mission College Blvd. Santa Clara, CA 95054 arpita@yahoo-inc.com Sergei Vassilvitskii Yahoo! Research 111 West 40th St.,

### Online Algorithms: Learning & Optimization with No Regret.

Online Algorithms: Learning & Optimization with No Regret. Daniel Golovin 1 The Setup Optimization: Model the problem (objective, constraints) Pick best decision from a feasible set. Learning: Model the

### Big Data, Statistics, and the Internet

Big Data, Statistics, and the Internet Steven L. Scott April, 4 Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39 Summary Big data live on more than one machine. Computing takes

### Lecture 4: Random Variables

Lecture 4: Random Variables 1. Definition of Random variables 1.1 Measurable functions and random variables 1.2 Reduction of the measurability condition 1.3 Transformation of random variables 1.4 σ-algebra

### Performance measures for Neyman-Pearson classification

Performance measures for Neyman-Pearson classification Clayton Scott Abstract In the Neyman-Pearson (NP) classification paradigm, the goal is to learn a classifier from labeled training data such that

### Notes from Week 1: Algorithms for sequential prediction

CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

### Two-Sided Bandits and the Dating Market

Two-Sided Bandits and the Dating Market Sanmay Das Center for Biological and Computational Learning and Computer Science and Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge,

### Random graphs with a given degree sequence

Sourav Chatterjee (NYU) Persi Diaconis (Stanford) Allan Sly (Microsoft) Let G be an undirected simple graph on n vertices. Let d 1,..., d n be the degrees of the vertices of G arranged in descending order.

### Introduction to Markov Chain Monte Carlo

Introduction to Markov Chain Monte Carlo Monte Carlo: sample from a distribution to estimate the distribution to compute max, mean Markov Chain Monte Carlo: sampling using local information Generic problem

### Set theory, and set operations

1 Set theory, and set operations Sayan Mukherjee Motivation It goes without saying that a Bayesian statistician should know probability theory in depth to model. Measure theory is not needed unless we

### ECONOMIC MECHANISMS FOR ONLINE PAY PER CLICK ADVERTISING: COMPLEXITY, ALGORITHMS AND LEARNING

POLITECNICO DI MILANO DIPARTIMENTO DI ELETTRONICA, INFORMAZIONE E BIOINFORMATICA DOCTORAL PROGRAMME IN INFORMATION TECHNOLOGY ECONOMIC MECHANISMS FOR ONLINE PAY PER CLICK ADVERTISING: COMPLEXITY, ALGORITHMS

### our algorithms will work and our results will hold even when the context space is discrete given that it is bounded. arm (or an alternative).

1 Context-Adaptive Big Data Stream Mining - Online Appendix Cem Tekin*, Member, IEEE, Luca Canzian, Member, IEEE, Mihaela van der Schaar, Fellow, IEEE Electrical Engineering Department, University of California,

### Abstract. Keywords. restless bandits, reinforcement learning, sequential decision-making, change detection, non-stationary environments

Modeling Human Performance in Restless Bandits with Particle Filters Sheng Kung M. Yi 1, Mark Steyvers 1, and Michael Lee 1 Abstract Bandit problems provide an interesting and widely-used setting for the

### Monte Carlo Methods in Finance

Author: Yiyang Yang Advisor: Pr. Xiaolin Li, Pr. Zari Rachev Department of Applied Mathematics and Statistics State University of New York at Stony Brook October 2, 2012 Outline Introduction 1 Introduction

### Model for dynamic website optimization

Chapter 10 Model for dynamic website optimization In this chapter we develop a mathematical model for dynamic website optimization. The model is designed to optimize websites through autonomous management

### A HYBRID GENETIC ALGORITHM FOR THE MAXIMUM LIKELIHOOD ESTIMATION OF MODELS WITH MULTIPLE EQUILIBRIA: A FIRST REPORT

New Mathematics and Natural Computation Vol. 1, No. 2 (2005) 295 303 c World Scientific Publishing Company A HYBRID GENETIC ALGORITHM FOR THE MAXIMUM LIKELIHOOD ESTIMATION OF MODELS WITH MULTIPLE EQUILIBRIA:

### ( ) = ( ) = {,,, } β ( ), < 1 ( ) + ( ) = ( ) + ( )

{ } ( ) = ( ) = {,,, } ( ) β ( ), < 1 ( ) + ( ) = ( ) + ( ) max, ( ) [ ( )] + ( ) [ ( )], [ ( )] [ ( )] = =, ( ) = ( ) = 0 ( ) = ( ) ( ) ( ) =, ( ), ( ) =, ( ), ( ). ln ( ) = ln ( ). + 1 ( ) = ( ) Ω[ (

Amit Daniely Alon Gonen Shai Shalev-Shwartz The Hebrew University AMIT.DANIELY@MAIL.HUJI.AC.IL ALONGNN@CS.HUJI.AC.IL SHAIS@CS.HUJI.AC.IL Abstract Strongly adaptive algorithms are algorithms whose performance

### 4 Learning, Regret minimization, and Equilibria

4 Learning, Regret minimization, and Equilibria A. Blum and Y. Mansour Abstract Many situations involve repeatedly making decisions in an uncertain environment: for instance, deciding what route to drive

### SECOND PART, LECTURE 4: CONFIDENCE INTERVALS

Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 4: CONFIDENCE INTERVALS Lecture 4: Confidence Intervals

### Universal Algorithm for Trading in Stock Market Based on the Method of Calibration

Universal Algorithm for Trading in Stock Market Based on the Method of Calibration Vladimir V yugin Institute for Information Transmission Problems, Russian Academy of Sciences, Bol shoi Karetnyi per.

### An Introduction to Extreme Value Theory

An Introduction to Extreme Value Theory Petra Friederichs Meteorological Institute University of Bonn COPS Summer School, July/August, 2007 Applications of EVT Finance distribution of income has so called

### ABC methods for model choice in Gibbs random fields

ABC methods for model choice in Gibbs random fields Jean-Michel Marin Institut de Mathématiques et Modélisation Université Montpellier 2 Joint work with Aude Grelaud, Christian Robert, François Rodolphe,

### Consistent Binary Classification with Generalized Performance Metrics

Consistent Binary Classification with Generalized Performance Metrics Nagarajan Natarajan Joint work with Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon UT Austin Nov 4, 2014 Problem and Motivation

### Dynamic Cost-Per-Action Mechanisms and Applications to Online Advertising

Dynamic Cost-Per-Action Mechanisms and Applications to Online Advertising Hamid Nazerzadeh Amin Saberi Rakesh Vohra July 13, 2007 Abstract We examine the problem of allocating a resource repeatedly over

### FULL LIST OF REFEREED JOURNAL PUBLICATIONS Qihe Tang

FULL LIST OF REFEREED JOURNAL PUBLICATIONS Qihe Tang 87. Li, J.; Tang, Q. Interplay of insurance and financial risks in a discrete-time model with strongly regular variation. Bernoulli 21 (2015), no. 3,

### Dynamic Pay-Per-Action Mechanisms and Applications to Online Advertising

Dynamic Pay-Per-Action Mechanisms and Applications to Online Advertising Hamid Nazerzadeh Amin Saberi Rakesh Vohra September 24, 2012 Abstract We examine the problem of allocating an item repeatedly over

### Average Y Average Predicted. Calibration. Dean P. Foster Amazon

0.5 0.4 Average Y 0.3 0.2 0.1 0.1 0.15 0.2 0.25 Average Predicted Calibration Dean P. Foster Amazon Outline Calibration for humans Calibration for big data Theory of calibrated Game theory: Convergence

### L1 vs. L2 Regularization and feature selection.

L1 vs. L2 Regularization and feature selection. Paper by Andrew Ng (2004) Presentation by Afshin Rostami Main Topics Covering Numbers Definition Convergence Bounds L1 regularized logistic regression L1

### Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

### Big Data: The Computation/Statistics Interface

Big Data: The Computation/Statistics Interface Michael I. Jordan University of California, Berkeley September 2, 2013 What Is the Big Data Phenomenon? Big Science is generating massive datasets to be used

### Information Compexity in Bandit Subset Selection Emilie Kaufmann and Shivaram Kalyanakrishnan

Information Compexity in Bandit Subset Selection Emilie Kaufmann and Shivaram Kalyanakrishnan In JMLR Workshop and Conference Proceedings Conference on Learning Theory, 203), 30:228 25, 203. JMLR: Workshop

### The Kelly Betting System for Favorable Games.

The Kelly Betting System for Favorable Games. Thomas Ferguson, Statistics Department, UCLA A Simple Example. Suppose that each day you are offered a gamble with probability 2/3 of winning and probability

### Likelihood Approaches for Trial Designs in Early Phase Oncology

Likelihood Approaches for Trial Designs in Early Phase Oncology Clinical Trials Elizabeth Garrett-Mayer, PhD Cody Chiuzan, PhD Hollings Cancer Center Department of Public Health Sciences Medical University

### Computational Game Theory and Clustering

Computational Game Theory and Clustering Martin Hoefer mhoefer@mpi-inf.mpg.de 1 Computational Game Theory? 2 Complexity and Computation of Equilibrium 3 Bounding Inefficiencies 4 Conclusion Computational

CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.

### MIN-MAX CONFIDENCE INTERVALS

MIN-MAX CONFIDENCE INTERVALS Johann Christoph Strelen Rheinische Friedrich Wilhelms Universität Bonn Römerstr. 164, 53117 Bonn, Germany E-mail: strelen@cs.uni-bonn.de July 2004 STOCHASTIC SIMULATION Random

### Senior Secondary Australian Curriculum

Senior Secondary Australian Curriculum Mathematical Methods Glossary Unit 1 Functions and graphs Asymptote A line is an asymptote to a curve if the distance between the line and the curve approaches zero

### A direct approach to false discovery rates

J. R. Statist. Soc. B (2002) 64, Part 3, pp. 479 498 A direct approach to false discovery rates John D. Storey Stanford University, USA [Received June 2001. Revised December 2001] Summary. Multiple-hypothesis

### Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach

Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach Rémi Bardenet Arnaud Doucet Chris Holmes Department of Statistics, University of Oxford, Oxford OX1 3TG, UK REMI.BARDENET@GMAIL.COM

### 2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

### Stochastic Online Greedy Learning with Semi-bandit Feedbacks

Stochastic Online Greedy Learning with Semi-bandit Feedbacks Tian Lin Tsinghua University Beijing, China lintian06@gmail.com Jian Li Tsinghua University Beijing, China lapordge@gmail.com Wei Chen Microsoft

### Examples of decision problems:

Examples of decision problems: The Investment Problem: a. Problem Statement: i. Give n possible shares to invest in, provide a distribution of wealth on stocks. ii. The game is repetitive each time we

### Preliminaries: Problem Definition Agent model, POMDP, Bayesian RL

POMDP Tutorial Preliminaries: Problem Definition Agent model, POMDP, Bayesian RL Observation Belief ACTOR Transition Dynamics WORLD b Policy π Action Markov Decision Process -X: set of states [x s,x r

### Master s Theory Exam Spring 2006

Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

### Detection of changes in variance using binary segmentation and optimal partitioning

Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the

### LARGE CLASSES OF EXPERTS

LARGE CLASSES OF EXPERTS Csaba Szepesvári University of Alberta CMPUT 654 E-mail: szepesva@ualberta.ca UofA, October 31, 2006 OUTLINE 1 TRACKING THE BEST EXPERT 2 FIXED SHARE FORECASTER 3 VARIABLE-SHARE

### Introduction to Machine Learning

Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 5: Decision Theory & ROC Curves Gaussian ML Estimation Many figures courtesy Kevin Murphy s textbook,

### S = k=1 k. ( 1) k+1 N S N S N

Alternating Series Last time we talked about convergence of series. Our two most important tests, the integral test and the (limit) comparison test, both required that the terms of the series were (eventually)

Electronic Commerce Research, 5: 75 98 (2005) 2005 Springer Science + Business Media, Inc. Manufactured in the Netherlands. Improvements to the Linear Programming Based Scheduling of Web Advertisements