Machine Learning COMS 4771, Lecture 20: Multi-Armed Bandit Problems
Multi-armed Bandit Problems
The Setting:
- K arms (or actions).
- At each time t, each arm i pays off a bounded real-valued reward x_i(t), say in [0, 1].
- At each time t, the learner chooses a single arm i_t ∈ {1, ..., K} and receives reward x_{i_t}(t).
- The goal is to maximize the return.
The simplest instance of the exploration-exploitation problem.
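The protocol can be made concrete with a short harness; this is purely illustrative (the reward_fn and choose_arm interfaces are assumptions, not part of the lecture):

```python
import random

def run_bandit(reward_fn, K, T, choose_arm):
    """Generic bandit protocol: at each time t the learner picks a single arm i_t,
    observes only that arm's reward x_{i_t}(t), and accumulates the return."""
    total = 0.0
    history = []                       # (arm, reward) pairs observed so far
    for t in range(1, T + 1):
        arm = choose_arm(t, history)   # learner's decision rule
        reward = reward_fn(arm, t)     # bounded reward in [0, 1]
        history.append((arm, reward))
        total += reward
    return total

# A purely exploring baseline: pick an arm uniformly at random each round.
print(run_bandit(lambda a, t: random.random(), K=3, T=100,
                 choose_arm=lambda t, h: random.randrange(3)))
```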
Bandits for targeting content
- Choose the best content to display to the next visitor of your website.
- Content options = slot machines; reward = user's response (e.g., a click on an ad).
- A simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems.
Stochastic bandits:
- Each arm i is associated with some unknown probability distribution with expectation µ_i; rewards are drawn i.i.d.
- The largest expected reward: µ* = max_{i ∈ {1,...,K}} µ_i.
- Regret after T plays: µ* T − Σ_{t=1}^T E[x_{i_t}(t)], where the expectation is over the draws of rewards and the randomness in the player's strategy.
Adversarial (nonstochastic) bandits:
- No assumption is made about the reward sequence (other than that it is bounded).
- Regret after T plays: max_i Σ_{t=1}^T x_i(t) − Σ_{t=1}^T E[x_{i_t}(t)], where the expectation is only over the randomness in the player's strategy.
Stochastic Bandits: Upper Confidence Bounds
Strategy UCB:
- Play each arm once.
- At any time t > K, (deterministically) play the arm i_t maximizing, over j ∈ {1,...,K},
    x̄_j(t) + √(2 ln t / T_{j,t}),
  where x̄_j(t) is the average reward obtained from machine j so far and T_{j,t} is the number of times j has been played so far.
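A minimal sketch of this strategy in Python; the reward_fn(arm) callback and the Bernoulli example at the end are illustrative assumptions, not part of the lecture.

```python
import math
import random

def ucb1(reward_fn, K, T):
    """UCB sketch: play each arm once, then play the arm maximizing
    (average reward) + sqrt(2 ln t / T_j)."""
    counts = [0] * K       # T_{j,t}: number of times arm j has been played
    sums = [0.0] * K       # cumulative reward of arm j
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1    # initialization: play each arm once
        else:
            arm = max(range(K),
                      key=lambda j: sums[j] / counts[j]
                                    + math.sqrt(2 * math.log(t) / counts[j]))
        r = reward_fn(arm)                  # observe reward in [0, 1]
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

# Example: three Bernoulli arms with (illustrative) means 0.3, 0.5, 0.7.
means = [0.3, 0.5, 0.7]
print(ucb1(lambda a: float(random.random() < means[a]), K=3, T=10000))
```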
UCB Intuition: The second term √(2 ln t / T_{j,t}) is the size of the one-sided (1 − 1/t)-confidence interval for the average reward (using Chernoff-Hoeffding bounds).
Theorem (Auer, Cesa-Bianchi, Fischer) At time T, the regret of the UCB policy is at most (8K ln T)/∆ + 5K, where ∆ = µ* − max_{i: µ_i < µ*} µ_i (the gap between the best expected reward and the expected reward of the runner-up).
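To see where the confidence width comes from, here is a quick sketch using Hoeffding's inequality for [0, 1]-bounded rewards (the exponent worked out below is only meant to illustrate why the interval holds with probability at least 1 − 1/t):

\[
\Pr\big[\bar{x}_j(t) \le \mu_j - \varepsilon\big] \le e^{-2 T_{j,t}\varepsilon^2},
\qquad
\varepsilon = \sqrt{\frac{2\ln t}{T_{j,t}}}
\;\Rightarrow\;
e^{-2 T_{j,t}\varepsilon^2} = t^{-4} \le \frac{1}{t},
\]

so each arm's true mean lies below its upper confidence bound x̄_j(t) + √(2 ln t / T_{j,t}) with probability at least 1 − 1/t.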
Stochastic Bandits: ɛ-greedy
Randomized policy: ɛ_t-greedy
Parameter: a schedule ɛ_1, ɛ_2, ..., where 0 ≤ ɛ_t ≤ 1.
At each time t:
- (exploit) with probability 1 − ɛ_t, play the arm i_t with the highest current average return;
- (explore) with probability ɛ_t, play an arm chosen uniformly at random.
Is there a schedule of ɛ_t which guarantees logarithmic regret? A constant ɛ causes linear regret. Fix: let ɛ_t go to 0 as our estimates of the expected rewards become more accurate.
Theorem (Auer, Cesa-Bianchi, Fischer) If ɛ_t = 12K/(d² t), where 0 < d ≤ ∆, then the instantaneous regret (i.e., the probability of choosing a suboptimal arm) of ɛ_t-greedy at any time t is at most O(K/(d² t)). The regret of ɛ_t-greedy at time T (summing over the steps) is thus at most O((K/d²) log T) (using Σ_{t=1}^T 1/t ≈ ln T + γ, where γ ≈ 0.5772 is the Euler constant).
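A minimal ɛ_t-greedy sketch in Python using this decaying schedule (capped at 1); as before, reward_fn(arm) is an assumed callback and the parameter names are illustrative.

```python
import math
import random

def eps_greedy(reward_fn, K, T, d):
    """epsilon_t-greedy sketch with the schedule eps_t = min(1, 12*K / (d^2 * t))."""
    counts = [0] * K
    sums = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, 12 * K / (d * d * t))
        if random.random() < eps_t:
            arm = random.randrange(K)   # explore: uniform random arm
        else:
            # exploit: arm with highest current average (unplayed arms treated optimistically)
            arm = max(range(K),
                      key=lambda j: sums[j] / counts[j] if counts[j] else float("inf"))
        r = reward_fn(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total
```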
Practical performance (from Auer, Cesa-Bianchi and Fischer):
- Tuned UCB: replace √(2 ln t / T_{i,t}) with √((ln t / T_{i,t}) · min{1/4, V_{i,t}}), where V_{i,t} is an upper confidence bound for the variance of arm i. (The factor 1/4 is an upper bound on the variance of any [0, 1]-bounded variable.) The tuned version performs significantly better in practice.
- ɛ-greedy is quite sensitive to bad parameter tuning and to large differences in response rates; otherwise an optimally tuned ɛ-greedy performs very well.
- Tuned UCB performs comparably to a well-tuned ɛ-greedy and is not very sensitive to large differences in response rates.
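A small sketch of the tuned confidence width; here V_{i,t} is taken to be the empirical variance of arm i's observed rewards plus an exploration term √(2 ln t / T_{i,t}) (the variance estimate used by Auer et al.; the function and variable names are illustrative assumptions).

```python
import math

def tuned_width(rewards_i, t):
    """Tuned UCB confidence width for one arm, given the list of rewards
    observed from that arm so far and the current time t."""
    n = len(rewards_i)                                   # T_{i,t}
    mean = sum(rewards_i) / n
    var = sum((r - mean) ** 2 for r in rewards_i) / n    # empirical variance
    v = var + math.sqrt(2 * math.log(t) / n)             # V_{i,t}: variance UCB
    return math.sqrt(math.log(t) / n * min(0.25, v))
```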
Nonstochastic Bandits: Recap
- No assumptions are made about the generation of rewards.
- Modeled by an arbitrary sequence of reward vectors (x_1(t), ..., x_K(t)), t = 1, 2, ..., where x_i(t) ∈ [0, 1] is the reward obtained if action i is chosen at time t.
- At step t, the player chooses arm i_t and receives x_{i_t}(t).
- Regret after T plays (with respect to the best single action):
    max_j Σ_{t=1}^T x_j(t) − Σ_{t=1}^T E[x_{i_t}(t)],
  where the first term is G_max, the reward of the best action in hindsight, and the second term is the expected reward of the player.
Exp3 Algorithm (Auer, Cesa-Bianchi, Freund, and Schapire)
Initialization: w_i(1) = 1 for i ∈ {1,...,K}. Set γ = min{1, √(K ln K / ((e − 1) g))}, where g ≥ G_max.
For each t = 1, 2, ...:
- Set p_i(t) = (1 − γ) w_i(t) / Σ_{j=1}^K w_j(t) + γ/K for i = 1, ..., K.
- Draw i_t randomly according to p_1(t), ..., p_K(t).
- Receive reward x_{i_t}(t) ∈ [0, 1].
- For j = 1, ..., K, set the estimated rewards and update the weights:
    x̂_j(t) = x_j(t)/p_j(t) if j = i_t, and 0 otherwise;
    w_j(t+1) = w_j(t) exp(γ x̂_j(t)/K).
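A minimal Python sketch of Exp3 under these definitions; the reward_fn(arm, t) callback standing in for the adversarial sequence x_i(t) is an assumption, and only the chosen arm's reward is ever observed.

```python
import math
import random

def exp3(reward_fn, K, T):
    """Exp3 sketch: exponential weights over arms, gamma-uniform exploration,
    and importance-weighted reward estimates."""
    g = T                                  # upper bound g >= G_max (rewards in [0, 1])
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * g)))
    w = [1.0] * K                          # weights w_i(1) = 1
    total = 0.0
    for t in range(1, T + 1):
        W = sum(w)
        p = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]
        i_t = random.choices(range(K), weights=p)[0]   # draw i_t ~ p(t)
        x = reward_fn(i_t, t)              # observe only the chosen arm's reward
        total += x
        x_hat = x / p[i_t]                 # importance-weighted estimate
        w[i_t] *= math.exp(gamma * x_hat / K)          # only j = i_t is updated
    return total
```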
Exp3 Algorithm (Auer, Cesa-Bianchi, Freund, and Schapire)
Theorem: For any T > 0 and for any sequence of rewards, the regret of the player is bounded by
    2 √((e − 1) g K ln K) ≤ 2.63 √(T K ln K)   (taking g = T).
Observation: Setting x̂_{i_t}(t) to x_{i_t}(t)/p_{i_t}(t) guarantees that the estimated rewards equal the actual rewards in expectation, for each action:
    E[x̂_j(t) | i_1, ..., i_{t−1}] = p_j(t) · x_j(t)/p_j(t) = x_j(t),
where the expectation is with respect to the random choice of i_t at time t (given the choices in the previous rounds). So dividing by p_{i_t}(t) compensates for actions with a small probability of being drawn.
Proof: Let W_t = Σ_j w_j(t). We have

  W_{t+1}/W_t = Σ_{i=1}^K w_i(t+1)/W_t
              = Σ_{i=1}^K (w_i(t)/W_t) exp(γ x̂_i(t)/K)
              = Σ_{i=1}^K ((p_i(t) − γ/K)/(1 − γ)) exp(γ x̂_i(t)/K)
              ≤ Σ_{i=1}^K ((p_i(t) − γ/K)/(1 − γ)) [1 + (γ/K) x̂_i(t) + (e − 2)(γ/K)² x̂_i²(t)]
                  (using e^x ≤ 1 + x + (e − 2)x² for x ∈ [0, 1])
              ≤ 1 + ((γ/K)/(1 − γ)) Σ_{i=1}^K p_i(t) x̂_i(t) + ((e − 2)(γ/K)²/(1 − γ)) Σ_{i=1}^K p_i(t) x̂_i²(t)
                  (since Σ_{i=1}^K (p_i(t) − γ/K)/(1 − γ) = 1)
              = 1 + ((γ/K)/(1 − γ)) x_{i_t}(t) + ((e − 2)(γ/K)²/(1 − γ)) Σ_{i=1}^K p_i(t) x̂_i²(t)
                  (x̂_i(t) is non-zero only for i = i_t, and p_{i_t}(t) x̂_{i_t}(t) = x_{i_t}(t))
              ≤ 1 + ((γ/K)/(1 − γ)) x_{i_t}(t) + ((e − 2)(γ/K)²/(1 − γ)) Σ_{i=1}^K x̂_i(t)
                  (since p_i(t) x̂_i²(t) = x_i(t) x̂_i(t) ≤ x̂_i(t)).

Now use the approximation 1 + z ≤ e^z.
Take logs:

  ln(W_{t+1}/W_t) ≤ ((γ/K)/(1 − γ)) x_{i_t}(t) + ((e − 2)(γ/K)²/(1 − γ)) Σ_{i=1}^K x̂_i(t).

Summing over t,

  ln(W_{T+1}/W_1) ≤ ((γ/K)/(1 − γ)) G_Exp3 + ((e − 2)(γ/K)²/(1 − γ)) Σ_{t=1}^T Σ_{i=1}^K x̂_i(t),

where G_Exp3 = Σ_{t=1}^T x_{i_t}(t) is the reward of Exp3. Now, for any fixed arm j,

  ln(W_{T+1}/W_1) ≥ ln(w_j(T+1)/W_1) = ln(w_j(1)/W_1) + (γ/K) Σ_{t=1}^T x̂_j(t) = −ln K + (γ/K) Σ_{t=1}^T x̂_j(t).

Combine with the upper bound:

  (γ/K) Σ_{t=1}^T x̂_j(t) − ln K ≤ ((γ/K)/(1 − γ)) G_Exp3 + ((e − 2)(γ/K)²/(1 − γ)) Σ_{t=1}^T Σ_{i=1}^K x̂_i(t).

Solve for G_Exp3:

  G_Exp3 ≥ (1 − γ) Σ_{t=1}^T x̂_j(t) − (K ln K)/γ − (e − 2)(γ/K) Σ_{t=1}^T Σ_{i=1}^K x̂_i(t).
Take expectations of both sides with respect to the distribution of i_1, ..., i_T (using E[Σ_t Σ_i x̂_i(t)] = Σ_t Σ_i x_i(t) ≤ K G_max):

  E[G_Exp3] ≥ (1 − γ) Σ_{t=1}^T x_j(t) − (K ln K)/γ − (e − 2)(γ/K) · K G_max.

Since j was chosen arbitrarily, this holds for the best arm j, so

  E[G_Exp3] ≥ (1 − γ) G_max − (K ln K)/γ − (e − 2)γ G_max.

Thus

  G_max − E[G_Exp3] ≤ (K ln K)/γ + (e − 1)γ G_max.

The value of γ in the algorithm is chosen to minimize this regret bound.
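Minimizing the bound over γ is a quick calculus step; here is a sketch (with the known upper bound g ≥ G_max in place of G_max, as in the algorithm):

\[
\frac{d}{d\gamma}\left(\frac{K\ln K}{\gamma} + (e-1)\gamma g\right)
= -\frac{K\ln K}{\gamma^{2}} + (e-1)g = 0
\quad\Rightarrow\quad
\gamma = \sqrt{\frac{K\ln K}{(e-1)g}},
\]

which, plugged back in, gives the regret bound

\[
G_{\max} - E[G_{\mathrm{Exp3}}] \le 2\sqrt{(e-1)\,g\,K\ln K} \le 2.63\sqrt{T K \ln K}
\quad (\text{taking } g = T,\ \text{since } 2\sqrt{e-1} \approx 2.62).
\]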
Comments:
- We don't need to know T in advance (guess and double).
- It is possible to get high-probability bounds (with a modified version of Exp3 that uses upper confidence bounds).
- Stronger notions of regret: compete with the best in a class of strategies.
- The difference between √T bounds and log T bounds is a bit misleading. The difference is not due to the adversarial nature of the rewards but to the asymptotic quantification: the log T bounds hold for any fixed set of reward distributions (so ∆ is fixed before T, not after).