Follow the Leader with Dropout Perturbations

Similar documents

Adaptive Online Gradient Descent

1 Introduction. 2 Prediction with Expert Advice. Online Learning Lecture 09

Follow the Perturbed Leader

How To Find Out How Regret Works In An Online Structured Learning Problem

Online Convex Optimization

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research

The p-norm generalization of the LMS algorithm for adaptive filtering

Achieving All with No Parameters: AdaNormalHedge

Online Algorithms: Learning & Optimization with No Regret.

The Advantages and Disadvantages of Online Linear Optimization

Monotone multi-armed bandit allocations

1 Portfolio Selection

Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.

Numerical Analysis Lecture Notes

Notes from Week 1: Algorithms for sequential prediction

Week 1: Introduction to Online Learning

How to Use Expert Advice

The Multiplicative Weights Update method

Strongly Adaptive Online Learning

Notes on Factoring. MA 206 Kurt Bryan

arxiv: v1 [math.pr] 5 Dec 2011

Online Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes

LOOKING FOR A GOOD TIME TO BET

Lecture 13: Martingales

Continued Fractions and the Euclidean Algorithm

Nonparametric adaptive age replacement with a one-cycle criterion

2.3 Convex Constrained Optimization Problems

Introduction to Online Learning Theory

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

The Online Set Cover Problem

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

The Goldberg Rao Algorithm for the Maximum Flow Problem

Course Notes for Math 162: Mathematical Statistics Approximation Methods in Statistics

The Trip Scheduling Problem

Applied Algorithm Design Lecture 5

Online Convex Programming and Generalized Infinitesimal Gradient Ascent

Gambling Systems and Multiplication-Invariant Measures

BX in ( u, v) basis in two ways. On the one hand, AN = u+

Load balancing of temporary tasks in the l p norm

24. The Branch and Bound Method

Master s Theory Exam Spring 2006

Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA

Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary.

Chapter 4 Lecture Notes

1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let

0.1 Phase Estimation Technique

7 Gaussian Elimination and LU Factorization

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

Load Balancing and Switch Scheduling

Using Generalized Forecasts for Online Currency Conversion

CPC/CPA Hybrid Bidding in a Second Price Auction

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Analysis of Algorithms I: Binary Search Trees

Stochastic Inventory Control

Linear Codes. Chapter Basics

The Relation between Two Present Value Formulae

Note on some explicit formulae for twin prime counting function

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Introduction to Online Optimization

The van Hoeij Algorithm for Factoring Polynomials

LARGE CLASSES OF EXPERTS

Integer Factorization using the Quadratic Sieve

Important Probability Distributions OPRE 6301

Offline sorting buffers on Line

Projection-free Online Learning

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL

Lecture 4 Online and streaming algorithms for clustering

Fairness in Routing and Load Balancing

Random variables, probability distributions, binomial random variable

Factoring & Primality

Data Mining: Algorithms and Applications Matrix Math Review

Single-Period Balancing of Pay Per-Click and Pay-Per-View Online Display Advertisements

Practical Guide to the Simplex Method of Linear Programming

A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University

Approximation Algorithms

Simple and efficient online algorithms for real world applications

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

Machine Learning and Pattern Recognition Logistic Regression

Wald s Identity. by Jeffery Hein. Dartmouth College, Math 100

Analysis of Load Frequency Control Performance Assessment Criteria

1 Formulating The Low Degree Testing Problem

Online Lazy Updates for Portfolio Selection with Transaction Costs

Chapter 2: Binomial Methods and the Black-Scholes Formula

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

LOGNORMAL MODEL FOR STOCK PRICES

U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, Notes on Algebra

DUOL: A Double Updating Approach for Online Learning

Transcription:

JMLR: Worshop and Conference Proceedings vol 35: 26, 204 Follow the Leader with Dropout Perturbations Tim van Erven Département de Mathématiques, Université Paris-Sud, France Wojciech Kotłowsi Institute of Computing Science, Poznań University of Technology, Poland Manfred K. Warmuth Department of Computer Science, University of California, Santa Cruz TIM@TIMVANERVEN.NL WKOTLOWSKI@CS.PUT.POZNAN.PL MANFRED@CSE.UCSC.EDU Abstract We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss i.e. regret compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader FPL. The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss the leader where ties are broen uniformly at random. To achieve the optimal worst-case regret as a function of the loss L of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or before selecting the leader. We show that our perturbations are an instance of dropout because experts may be interpreted as features although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O L ln K + ln K regret as a function of L, and optimal Oln K regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family AdaHedge and FlipFlop also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert because they depend on the expert s total loss and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants. Keywords: Online learning, regret bounds, expert setting, Follow the Leader, Follow the Perturbed Leader, Dropout. Introduction We address the following online sequential prediction problem, nown as prediction with expert advice see e.g. Littlestone and Warmuth, 994; Vov, 998: in each trial t =, 2,... the algorithm randomly chooses one of K experts as its predictor before being told the loss vector l t [0, ] K of all experts. Let L T, be the cumulative loss of the -th expert after T rounds, and let L = min L T, be the cumulative loss of the best expert. Then the goal for the algorithm is to minimize the difference between its expected cumulative loss and L, which is nown as the regret. In this paper, we consider not just the standard worst-case setting, in which losses are generated by an 204 T. van Erven, W. Kotłowsi & M.K. Warmuth.

VAN ERVEN KOTŁOWSKI WARMUTH adversary that tries to maximize the regret, but also a setting with easier data, in which loss vectors are sampled from a fixed probability distribution. The simplest algorithm is to Follow the Leader FTL, i.e. to choose an expert of minimal cumulative loss, with ties broen uniformly at random. This algorithm wors well for probabilistic data, but unfortunately its regret will grow linearly with L and T on worst-case data, which means that it is not able to learn. This problem can be avoided by the Weighted Majority algorithm or its stratified version, the Hedge algorithm Littlestone and Warmuth, 994; Freund and Schapire, 997, which chooses expert with probability proportional to the exponential weight exp ηl t, in trial t, for some positive learning rate η. For the appropriate tuning of η, Hedge is guaranteed to achieve sublinear regret of order O L ln K + ln K, which is worst-case optimal Cesa-Bianchi et al., 997; Jiazhong et al., 203. In addition to using exponential weights, there is a second method that can achieve sublinear worst-case regret: perturb the cumulative losses of the experts by independent additive noise from a fixed distribution and then apply FTL to the perturbed losses. For exponentially distributed noise with its parameter appropriately chosen as a function of L, the resulting Follow the Perturbed Leader FPL algorithm can also guarantee regret O L ln K + ln K Kalai and Vempala, 2005. FPL can also exactly simulate Hedge with any learning rate η by using log-exponential random variables for the noise that are scaled as a function of η Kuzmin and Warmuth, 2005; Kalai, 2005. Instead of using additive noise, we perturb the loss vector of the current trial: for binary losses l t {0, } K, we set each component l t, to zero with fixed probability α. As we explain in Section 2, this is equivalent to dropout Hinton et al., 202, because l t, may be interpreted as the -th feature value at trial t. Surprisingly, this simple method achieves the same optimal regret bound for any value of α 0,, without tuning. We can also handle the general case when the loss vector l t taes values in the continuous range [0, ] K. In this case, we show that, on worst-case data, the vanilla dropout procedure that simply sets coordinates of the loss vector to zero, gets a suboptimal ΩK dependence on the number of experts K. However, by binarizing the dropout loss, by which we mean setting the loss l t, of the -th expert to with probability αl t, and to 0 otherwise, FPL again achieves the optimal regret of O L ln K + ln K. Most Hedge or FPL algorithms depend on a learning rate or noise parameter η > 0, which needs to be tuned in terms of L to get the worst-case bound O L ln K + ln K. A simple way is to maintain an estimate of L and tune η based on the current estimate. As soon as the loss of the best expert exceeds the current estimate, the estimate is doubled and the algorithm re-tuned see e.g. Cesa-Bianchi et al., 996, 997. This method achieves the optimal regret for Hedge and the related previous versions of FPL, albeit with a large constant factor. Much fancier tuning methods were introduced by Auer et al. 2002. Amongst others, their techniques have recently led to the AdaHedge and FlipFlop algorithms van Erven et al., 20; de Rooij et al., 204, which wor well in the case when the loss vector is drawn i.i.d. from a fixed distribution. In this case they achieve constant regret Oln K when there is a gap between the loss of the best expert and the expected loss of all others. Surprisingly, we can show that FPL based on the binarized dropout perturbations also achieves regret Oln K in the i.i.d. case, without the need of any tuning. On the other side of the spectrum, there exists another simple perturbation based on Random Wal Perturbation RWP Devroye et al., 203: simply add an independent Bernoulli coin flip to each loss component in each trial without any tuning. We highly benefited from the techniques 2

DROPOUT PERTURBATIONS used in the analysis of this version of FPL. However, we can show that RWP already has Ω T regret in the noise-free case L = 0, where T is the number of trials. In contrast, FPL with binarized dropout perturbation achieves constant regret of order Oln K. It seems that since the perturbation of the loss of expert depends on the total current loss of expert, this maes FPL more adaptive and no additional tuning is required. Our research opens up a large number of questions regarding tuning-free methods for learning with the linear loss.. Does FPL with dropout perturbations lead to efficient algorithms with close to optimal regret for shifting experts or for various combinatorial expert settings such as the online shortest path problem Taimoto and Warmuth, 2003? See Devroye et al., 203 for a similar discussion. 2. Are there other versions of dropout such as dropping the entire loss vector with probability α that achieve the same feats as the dropout perturbations used in this paper? 3. There is a natural matrix generalization of the expert setting Warmuth and Kuzmin, 20. Is there a version of dropout perturbation that avoids the expensive matrix decompositions used by all algorithms with good regret bounds for this generalization Hazan et al., 200 and replaces the matrix decompositions by maximum eigenvector calculations? 4. Wager et al. 203 consider a Follow the Regularized Leader type algorithm for the batch setting. They use a regularizer that is a variation of the squared Euclidean distance, which they motivate as FTL on an approximation to the expected dropout loss. Although this use of dropout is entirely different from ours, it would be interesting to now whether it leads to novel regret or generalization bounds. Outline The paper is organized as follows. In Section 2 we formally define dropout perturbation, and explain how dropping loss components l t, is the same as dropping features. Section 3 contains the core of our paper: We prove that the regret of FPL with binarized dropout perturbations is bounded by O L ln K + ln K. Even though our algorithm is simple to state, this proof is quite involved at this point. We construct a worst-case sequence of losses using a sequence of reductions. This sequence consists of two regimes and we prove bounds for each regime separately, which we then combine. We also clarify the difference to the RWP algorithm, which provably does not achieve this bound. Then, in Section 4, we show that dropout perturbation automatically adapts to i.i.d. loss vectors, for which it gets constant regret of order Oln K. Some of the proof details are postponed to the appendix. 2. Dropout Perturbation Setting Online prediction with expert advice Cesa-Bianchi and Lugosi, 2006 is formalized as a repeated game between the algorithm and an adversary. At each iteration t =,..., T, the algorithm randomly pics an expert ˆ t {,..., K} according to a probability distribution w t = w t,,..., w t,k of its choice. Then, the loss vector l t [0, ] K is revealed by the adversary, and the algorithm suffers loss l t,ˆt. Let L T, = T t= l t, be the cumulative loss of expert. Then the goal of the algorithm is to minimize its regret, which is the difference between its expected 3

VAN ERVEN KOTŁOWSKI WARMUTH Algorithm : Follow the Leader with Binarized Dropout Perturbation BDP Parameter : Dropout probability α 0, Initialization: L 0, = 0 for all =,..., K. for t = to T do Pic expert ˆ t = argmin Lt, with ties broen uniformly at random. Observe loss vector l t and suffer loss l t,ˆt. Draw l t, according to 2., independently for all. Update L t, = L t, + l t, for all. end cumulative loss and the cumulative loss of the best expert chosen in hindsight: [ T ] R T = E l t,ˆt L, where L = min L T,. t= [ T The expectation is taen with respect to the random choices of the algorithm, i.e. E t= l t,ˆt] = T t= w t l t. For the process that generates the losses, we will consider two different models: In the worst-case setting, studied in Section 3, we need to guarantee small regret for every possible sequence of losses l,..., l T [0, ] K, including the sequence that is the most difficult for the algorithm; in the independent, identically distributed i.i.d. setting, considered in Section 4, we assume the loss vectors l t [0, ] K all come from a fixed, but unnown probability distribution, which maes it possible to give stronger guarantees on the regret. Follow the Perturbed Leader The general class of Follow the Perturbed Leader FPL algorithms Kalai and Vempala, 2005; Hannan, 957 is defined by the choice { } ˆ t = argmin Lt, + ξ t,, where ξ t, is an additive random perturbation of the cumulative losses, which is chosen independently for every expert. Kalai and Vempala 2005 consider exponential and uniform distributions for ξ t, that depend on a parameter η > 0, and in the RWP algorithm of Devroye et al., 203 ξ t, Binomial/2, t t/2 is a binomial variable on t outcomes with success probability /2 that is centered around its mean. Binarized Dropout Perturbations We introduce an instance of FPL based on binarized dropout perturbation BDP. For any dropout probability α 0,, the binarized dropout perturbation of loss l t, is defined as: { with probability αl t,, l t, = 2. 0 otherwise. Let L T, Algorithm chooses = T t= l t, be the cumulative BDP loss for expert. Then the BDP algorithm see ˆ t = argmin 4 L t,,

DROPOUT PERTURBATIONS with ties broen uniformly at random. BDP is conceptually very simple. Computationally, it might be of interest that it does sparse updates, which only operate on non-zero features: if l t, = 0, then l t, = 0 as well, so, if one uses the same variable to store L t, and L t,, then no update to the internal state of the algorithm is required. There is a parameter α, but this parameter only affects the constants in the theorems, so simply setting it to α = /2 without any tuning already gives good performance. BDP may be viewed as an instance of FPL with the additive data-dependent perturbations ξ t, = L t, L t,. If the losses are binary, α = /2 and the cumulative loss grows approximately as L t, t for all experts, then ξ t, is approximately distributed as Binomial/2, t t. Consequently, in this special case BDP is very similar to RWP because shifts by a constant do not change the minimizer ˆ t. Standard Dropout Perturbations Dropout is normally defined as dropping hidden units or features in a neural networ while training the parameters of the networ on batch data using Gradient Descent Hinton et al., 202; Wang and Manning, 203; Wager et al., 203. In each iteration, hidden units or features are dropped with a fixed probability. We are in the single neuron case, in which each expert may be identified with a feature and the standard dropout perturbations without binarization become : { l s t, = l t, with probability α, 0 otherwise. For binary losses, standard dropout perturbation is the same as BDP, but for general losses in the interval [0, ] they are different. We initially tried to prove our results for standard dropout perturbation, but it turns out that this approach cannot achieve the right dependence on the number of experts: Theorem 2. Consider the FPL algorithm based on standard dropout perturbation with parameter α 0,. Then, for any B > 0 and any K 2, there exists a loss sequence l,..., l T with l t [0, ] K for which L B, while the algorithm incurs regret at least 2 K. Proof Tae ɛ and δ to be very small positive numbers specified later. The sequence is constructed as follows. First, all experts except expert incur losses ɛ for a number of iterations that is sufficiently large that, with probability w.p. at least δ, the algorithm will choose expert as a leader. Note that the required number of iterations depends on δ, K and α, but it does not depend on ɛ. Then, expert incurs unit loss and so does the algorithm w.p. δ, while all other experts incur no loss. Next, all experts except expert 2 incur losses ɛ for sufficiently many iterations that, w.p. at least δ, the algorithm will choose expert 2 as a leader. Then, expert 2 incurs unit loss and so does the algorithm w.p. δ, while all other experts incur no loss. This process is then repeated for = 3, 4,..., K. At the end of the game, the algorithm incurred loss at least K w.p.. The loss w l t in round t that we consider here is linear in the weights w. For a general convex loss of the form f tw x t, where x t is the feature vector at round t, learning the weights w is often achieved by locally linearizing the loss with a first-order Taylor expansion: f tw t x t f tw x t w t l t w l t for the surrogate loss l t, = f tw t x tx t,. See the discussion in Section 4. of Kivinen and Warmuth 997 and Section 2.4 of Shalev- Shwartz 20. Thus, experts correspond to features, and setting the -th feature x t, to 0 is equivalent to setting l t, to 0. 5

VAN ERVEN KOTŁOWSKI WARMUTH at least K δ, while expert K incurred no more loss than T ɛ. Taing δ small enough that δ < 2K, the expected loss of the algorithm is at least 2 K. Finally, we tae ɛ small enough to satisfy 0 < T ɛ B, which gives L B. For the case that L B ln K, a better bound of order O L ln K + ln K = Oln K is possible. So we see that standard dropout leads to a suboptimal algorithm, and therefore we use binarized dropout for the rest of the paper. 3. Regret Bounded in Terms of the Cumulative Loss of the Best Expert We will prove that the regret for the BDP algorithm is bounded by O L ln K + ln K: Theorem 3. For any sequence of losses taing values in [0, ] with min L T, = L, the regret of the BDP algorithm is bounded by R T α 2 4 2L α ln3k + 3 ln + L + ln3k α α 2 + 3 α. Since RWP is similar to BDP and is nown to have regret bounded by O T ln K, one might try to show that the regret for RWP satisfies the stronger bound O L ln K + ln K as well, but this turns out to be impossible: Theorem 3.2 For every T > 0, there exists a loss sequence l,..., l T for which L = 0, while RWP suffers Ω T regret. This theorem, proved in Section A. of the appendix, shows that there is a fundamental difference between RWP and BDP, and that the data dependent perturbations used in BDP are crucial to obtaining a bound in terms of L. We now turn to proving Theorem 3., starting with the observation that it is in fact sufficient to prove the theorem only for binary losses: Lemma 3.3 Let a 0 and b be any constants. If the regret for the BDP algorithm is bounded by R T a L + b 3. for all sequences of binary losses, then it also satisfies 3. for any sequence of losses with values in the whole interval [0, ]. Proof Let l,..., l T be an arbitrary sequence of loss vectors with components l t, taing values in the whole interval [0, ]. We need to show that the regret for this sequence satisfies 3.. To this end, let l,..., l T be an alternative sequence of losses that is generated randomly from l,..., l T by letting { l t, = 0 with probability l t,, with probability l t,, independently for all t and. Accordingly, let a prime on any quantity denote that it is evaluated on the alternative losses. For example, L T, = T t= l t, is the cumulative alternative loss for expert. Let w t be the probability distribution on experts induced by the internal randomization of the 6

DROPOUT PERTURBATIONS BDP algorithm, i.e. w t, = P ˆ t =, and let w t be its counterpart on the alternative losses. Then, because the BDP algorithm internally generates a binary sequence of losses with the same probabilities as l t,, we have E l,...,l [w t t] = w t, and, independently, E l t [l t] = l t. Consequently, E[w t l t] = w t l t and E[ L T ] = L T, where L T and L T are the expected with respect to the internal randomization cumulative losses of the algorithm on the original and the alternative losses, respectively. Applying 3. for the alternative losses and taing expectations, it now follows by Jensen s inequality that [ ] E[ L T ] E[min L T, ] a E min L T, + b a E[min L T, ] + b E[ L T ] min E[L T, ] a min E[L T, ] + b L T min L T, a min L T, + b. Thus, from now on, we assume the losses are binary, i.e. l t, {0, }. Following the standard approach for FPL algorithms of Kalai and Vempala 2005, we start by bounding the regret by applying the so-called be-the-leader lemma to the perturbed losses. Lie in the analysis of Devroye et al. 203, the result simplifies, because we can relate the expectation of the perturbed losses bac to the original losses: Lemma 3.4 be-the-leader algorithm is bounded by For any sequence of loss vectors l,..., l T, the regret for the BDP R T α t= E [ lt,ˆt t,ˆt+] l, where l,..., l T are the dropout perturbed losses and ˆ t is the random expert chosen by the BDP algorithm based on the perturbed losses before trial t. Proof The expected regret is given by R T = t= ] E [l t,ˆt min L T, = α E t= l t,ˆt min ] α [ lt,ˆt E ] min [ Lt, α E α E lt,ˆt l t,ˆt+, t= LT, where the second equality holds because ˆ t only depends on l,..., l t, which are independent of l t, and E [ lt, ] = αl t, ; and the last inequality is from the be-the-leader lemma, applied to the perturbed losses, which shows that T t= l t,ˆt+ min LT, Kalai and Vempala, 2005, Equation 4, Cesa-Bianchi and Lugosi, 2006, Lemma 3.. t= 7

VAN ERVEN KOTŁOWSKI WARMUTH Every single term in the bound from Lemma 3.4 may be bounded further as follows: E [ lt,ˆt t,ˆt+] l = α α K Prˆ t = ˆ t+ l t, = l t, E [ lt,ˆt l t,ˆt+ ˆ t = ˆ t+, l ] t, = l t, = K Prˆ t = ˆ t+ l t, = l t, l t,. 3.2 = In the case considered by Kalai and Vempala 2005, this expression can be easily controlled, because, for their specific choice of perturbations, the conditional probability P m, = Prˆ t+ ˆ t =, l t, = l t,, min j L t, = m is small, uniformly for all m see the second display on p. 298 of their paper, which is easy to show, because P m, depends only on the cumulative loss and the perturbations for a single expert. Thus their choice of perturbations is what maes their analysis simple. Unfortunately, in our case and also for the RWP algorithm of Devroye et al. 203, this simple approach breas down, because, for the perturbations we consider, the probability P m, may be large for some m although such m will have small probability. We therefore cannot use a uniform bound on P m,, and as a consequence our proof is more complicated. We also cannot follow the line of reasoning of Devroye et al., because it relies on the fact that, at any time t, the perturbation for every expert has a standard deviation of order t. For our algorithm this is only true if the cumulative losses for the experts grow at least at a linear rate ct for some absolute constant c > 0, which need not be the case in general. Put differently, if the expert losses grow sublinearly, then our perturbations are too small to use the approach of Devroye et al.. Thus, we cannot use any of the existing approaches to control 3.2 for arbitrary losses, and, instead, we proceed as follows:. Given any integers L,..., L K, we find a canonical worst-case sequence of losses, which maximizes the regret of the BDP algorithm among all binary loss sequences of arbitrary length T such that L T, = L for all experts. 2. We bound 3.2 on this sequence. 3.. The Canonical Worst-case Sequence We will show that the worst-case regret is achieved for a sequence of unit vectors: let u {0, } K denote the unit vector 0,..., 0,, 0,..., 0 where the is in the -th position. This restriction to unit vectors has been used before Abernethy and Warmuth, 200; Koolen and Warmuth, 200, but here we go one step further. We build a canonical worst-case sequence of unit vectors from the following alternating schemes: Definition 3.5 For any {,..., K}, we call a sequence of loss vectors u,..., u K a..kalternating scheme. 8

DROPOUT PERTURBATIONS To simplify the notation in the proofs, we will henceforth assume that the experts are numbered from best to worst, i.e. L... L K. In the canonical worst-case sequence, alternating schemes are repeated as follows: Definition 3.6 Canonical Worst-case Sequence Let L... L K be nonnegative integers. Among all sequences of binary losses of arbitrary length T such that L T, = L for all, we call the following sequence the canonical worst-case sequence: First repeat the..k-alternating scheme L times; then repeat the 2..K-alternating scheme L 2 L times; then repeat the 3..K-alternating scheme L 3 L 2 times; and so on until finally we repeat the K..K-alternating scheme which consists of just the unit vector u K L K L K times. Note that the canonical worst-case sequence always consists of L K alternating schemes. For example, the canonical worst-case sequence for cumulative losses L = 2, L 2 = 3, L 3 = 6 consists of the following L 3 = 6 alternating schemes: 0 0 0 0 0 0 0 0 0 0,, 0, 0,, 0,, 0, 0, 0, 0. 0 0 } {{ }..3-alternating scheme 0 0 } {{ }..3-alternating scheme 0 } {{ } 2..3-alternating scheme } {{ } 3..3-a.s. } {{ } 3..3-a.s. } {{ } 3..3-a.s. For the BDP algorithm, the following three operations do not increase the regret of a sequence of binary losses see Section A.2 of the appendix for proofs:. If more than one expert gets loss in a single trial, then we split that trial into K trials in which we hand out the losses for the experts in turn Lemma A.. 2. If, in some trial, all experts get loss 0, then we remove that trial Lemma A.2. 3. If, in two consecutive trials t and t +, l t = u and l t+ = u for some experts such that L t, L t,, then we swap the trials, setting l t = u and l t+ = u Lemma A.3. These operations can be used to turn any sequence of binary losses into the canonical worst-case sequence: Lemma 3.7 Let L... L K be nonnegative integers. Among all sequences of binary losses of arbitrary length T such that L T, = L for all, the regret of the BDP algorithm is non-uniquely maximized by its regret on the canonical worst-case sequence. Proof Start with an arbitrary sequence of binary losses such that L T, = L. By repeated application of operations and 2, we turn this loss sequence into a sequence of unit vectors. We then repeatedly apply the swapping operation 3 to any trials t and t+ such that l t = u and l t+ = u for where we have strict inequality L t, > L t,. We do this until no such consecutive trials remain. This arrives at the canonical worst-case sequence except that the unit vectors within each alternating scheme may be permuted. However, by applying property 3 for the case L t, = L t,, we can sort each alternating scheme into the canonical order. Together with Lemma 3.3, the previous lemma implies that we just need to bound the regret of the BDP algorithm on the canonical worst-case loss sequence. 9

VAN ERVEN KOTŁOWSKI WARMUTH 3.2. Regret on the Canonical Worst-case Sequence By Lemma 3.4 and 3.2, the regret for the BDP algorithm is bounded by R T t= = K Prˆ t = ˆ t+ l t, = l t, l t,. 3.3 On the canonical worst-case sequence, l t, will be zero for all but a single expert, which we can exploit to abbreviate notation: let ˆ t + denote the leader with ties still broen uniformly at random on the perturbed cumulative losses if we would add unit of loss to the perturbed cumulative loss L t,ˆt of the expert selected by the BDP algorithm; and define the event Then 3.3 simplifies to A t = {ˆt = ˆ t + } R T for the expert such that l t, =. PrA t. 3.4 t= We split the L T,K alternating schemes of the canonical worst-case sequence into two regimes, and use different techniques to bound PrA t in each case. Let r L T,K L T, be some nonnegative integer, which we will optimize later:. The first regime consists of the initial L T, + r alternating schemes, where the difference in the cumulative loss between all experts is relatively small at most r. 2. The second regime consists of the remaining L T,K r L T, alternating schemes. Now any expert that is still receiving losses will have a cumulative loss that is at least r larger than the cumulative loss of expert, and r will be chosen large enough to ensure that, with high probability, no such expert will the leader. The First Regime Define the lead pac D t = { : Lt, < min j Lt,j + 2}, which is the random set of experts that might potentially be the leader in trials t or t+. As observed by Devroye et al. 203, there can only be a leader change at time t if D t contains more than one expert. Our goal will be to control the probability that this happens using the following lemma proved in Section A.3., which is an adaptation of Lemma 3 of Devroye et al.: Lemma 3.8 Suppose L t, L > 0 for all. Then Pr D t > 2 ln K α α L + 3. L Unfortunately, we cannot apply Lemma 3.8 immediately: in the first regime there are K trials for each of the first L T, repetitions of the..k-alternating scheme, and applying Lemma 3.8 to all of these trials would already give a bound of order K L T, ln K, which overshoots the right rate by a factor of K. To avoid this factor, we only apply the lemma once per alternating scheme, which is made possible by the following result: 0

DROPOUT PERTURBATIONS Lemma 3.9 Suppose an alternating scheme starts at trial s and ends at trial v. Then v PrA t α v s + PrA v+ α Prˆ v+ ˆ v+ + α Pr D v+ >. t=s Proof The first inequality follows from Lemmas A.4 and A.5 in Section A.3.2 of the appendix, which show that PrA s PrA s+... PrA v α PrA v+. Let = K v s be the first expert in the alternating scheme. At time v + a new alternating scheme will start, so that the situation is entirely symmetrical between all experts that are part of the current alternating scheme: PrA v+ = Prˆ v+ = ˆ + v+ for all {,..., K}. This implies the second inequality: K v s + PrA v+ = Prˆ v+ = ˆ v+ + = = Prˆ v+ {,..., K}, ˆ v+ ˆ v+ + Prˆ v+ ˆ v+ +. Finally, the last inequality follows by definition of the lead pac. Applying either Lemma 3.8 or the trivial bound Pr D t > only to the trials v + that immediately follow an alternating scheme, we obtain the following result for the first regime of the canonical worst-case sequence see Section A.3.3 for the proof: Lemma 3.0 Suppose the first regime ends with trial v. Then v PrA t 2 2L T, ln K + 3 ln + L T, + r +. α α α t= The Second Regime We proceed to bound the probability of the event A t during the second regime of the canonical worst-case sequence, using that PrA t Prˆ t = for the expert such that l t, =. This probability will be easy to control, because during the second regime the difference in cumulative loss between expert and the experts that are still receiving losses, is sufficiently large that the BDP algorithm will prefer expert over these experts with high probability. Lemma 3. Suppose the second regime starts in trial s. Then PrA t K 2L T, + 3r 2 α 2 r exp 2 α 2 r 2. 2L T, + r t=s

VAN ERVEN KOTŁOWSKI WARMUTH Proof Let t be a trial during the L-th alternating scheme, for some L in the second regime, and let t be the expert that gets loss in trial t. Then, by Hoeffding s inequality, PrA t Prˆ t = t Pr Lt,t L 2 α 2 L L T, 2 t, exp L + L T, 2 α 2 L L T, 2 2 α 2 rl L T, = exp exp, 2L T, + L L T, 2L T, + r where the last inequality follows from the fact that a+x is increasing in x for x 0 and a > 0. Let sl and vl denote the first and the last trial in the L-th alternating scheme. Then, summing up over all trials in the second regime, we get L T,K vl L T,K vl 2 α 2 rl L T, PrA t = PrA t exp 2L T, + r t=s L=L T, +r+ t=sl x L=L T, +r+ t=sl L T,K 2 α 2 rl L T, K exp 2L T, + r L=L T, +r+ 2 α 2 r 2 K exp + K exp 2L T, + r x=r+ Here the second term is bounded by: 2 α 2 rx exp 2L T, + r r L T,K L T, = K x=r 2 α 2 rx 2L T, + r dx = 2L T, + r 2 α 2 r exp exp. 2 α 2 rx 2L T, + r 2 α 2 r 2. 2L T, + r Putting things together we obtain 2 α 2 r 2 PrA t K exp + K 2L T, + r 2L t=s T, + r 2 α 2 r exp 2 α 2 r 2 2L T, + r K 2L T, + 3r 2 α 2 r exp 2 α 2 r 2, 2L T, + r which was to be shown. The First and the Second Regime Together Combining the bounds for the first and second regimes from Lemmas 3.0 and 3., and optimizing r, we obtain the following result, which is proved in Section A.3.4 of the appendix: Lemma 3.2 On a canonical worst-case sequence of any length T for which L T, = L, PrA t t= α 2 4 2L α ln3k + 3 ln + L + ln3k α α 2 + 3 α. Plugging this bound into 3.4 completes the proof of Theorem 3.. 2

DROPOUT PERTURBATIONS 4. Constant Regret on IID Losses In this section we show that our BDP algorithm has optimal Oln K regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a fixed gap γ between the expected loss of the best expert and all others. Theorem 4. Let γ 0, ] and δ 0, ] be constants, and let be a fixed expert. Suppose the loss vectors l t are independent random variables such that the expected differences in loss satisfy min E[l t, l t, ] γ for all t. 4. Then, with probability at least δ, the regret of the BDP algorithm is bounded by a constant: R T 8 α 2 γ 2 ln 8K α 2 γ 2 + 3. 4.2 δ As discussed in the proof in Section B., it is possible to improve the dependence on γ at the cost of getting a more complicated expression. Acnowledgments Tim van Erven was supported by NWO Rubicon grant 680-50-2, Wojciech Kotłowsi by the Foundation for Polish Science under the Homing Plus Program, and Manfred K. Warmuth by NSF grant IIS-8028. References Jacob Abernethy and Manfred K. Warmuth. Repeated games against budgeted adversaries. In Neural Information Processing Systems NIPS, pages 9, 200. Peter Auer, Nicolò Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48 75, 2002. Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. Nicolò Cesa-Bianchi, Philip M. Long, and Manfred K. Warmuth. Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. IEEE Transactions on Neural Networs, 72:604 69, May 996. Nicolò Cesa-Bianchi, Yaov Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 443:427 485, 997. Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research. To appear., 204. Luc Devroye, Gábor Lugosi, and Gergely Neu. Prediction by random-wal perturbation. In Conference on Learning Theory COLT, pages 460 473, 203. 3

VAN ERVEN KOTŁOWSKI WARMUTH Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:9 39, 997. James Hannan. Approximation to Bayes ris in repeated play. Contributions to the Theory of Games, 3:97 39, 957. Elad Hazan, Satyen Kale, and Manfred K. Warmuth. On-line variance minimization in On 2 per trial? In Conference on Learning Theory COLT, pages 34 35, 200. Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsy, Ilya Sutsever, and Ruslan R. Salahutdinov. Improving neural networs by preventing co-adaptation of feature detectors. CoRR, abs/207.0580, 202. Nie Jiazhong, Wojciech Kotłowsi, and Manfred K. Warmuth. On-line PCA with optimal regrets. In Algorithmic Learning Theory ALT, pages 98 2, 203. Adam Kalai. A perturbation that maes Follow the Leader equivalent to Randomized Weighted Majority. Private communication, December 2005. Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 73:29 307, 2005. Jyri Kivinen and Manfred K. Warmuth. Additive versus Exponentiated Gradient updates for linear prediction. Information and Computation, 32: 64, 997. Wouter M. Koolen and Manfred K. Warmuth. Hedging structured concepts. In 23rd Annual Conference on Learning Theory - COLT 200, pages 93 04. Omnipress, June 200. Dima Kuzmin and Manfred K. Warmuth. Optimum follow the leader algorithm. In Conference on Learning Theory COLT, pages 684 686, 2005. Open problem. Nic Littlestone and Manfred K. Warmuth. The Weighted Majority algorithm. Information and Computation, 082:22 26, 994. Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 42:07 94, 20. Eiji Taimoto and Manfred K. Warmuth. Path ernels and multiplicative updates. Journal of Machine Learning Research, 4:773 88, 2003. Tim van Erven, Peter D. Grünwald, Wouter Koolen, and Steven de Rooij. Adaptive hedge. In Neural Information Processing Systems NIPS, pages 656 664, 20. Volodya Vov. A game of prediction with expert advice. Journal of Computer and System Sciences, 562:53 73, 998. Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In Neural Information Processing Systems NIPS, pages 35 359, 203. Sida I. Wang and Christopher D. Manning. Fast dropout training. In International Conference on Machine Learning ICML, pages 8 26, 203. 4

DROPOUT PERTURBATIONS Manfred K. Warmuth and Dima Kuzmin. Online variance minimization. Machine Learning, 87: 32, 20. Additional Material for Follow the Leader with Dropout Perturbations Appendix A. Proof Details for Section 3 A.. Proof of Theorem 3.2 Proof We prove the theorem by explicit construction of the sequence. We tae K = 2 experts. The loss sequence is such that expert never gets any loss hence L = L T, = 0, while expert 2 gets losses ɛ, ɛ 2,..., ɛ T, where ɛ t = t. This means that its cumulative loss L t,2 = t s= s is bounded by 2 t + 2 L t,2 2 t +. We now bound the total loss of the RWP algorithm from below. Consider the situation just before trial t. Then expert 2 has accumulated at most L t,2 2 t loss so far. The expected loss of the algorithm in this iteration is ɛ t times the probability that expert becomes the leader. This probability is at least P ξ t, > ξ t,2 + 2 t, where ξ t, Binomial/2, t t/2. Consequently, U t = ξ t, ξ t,2 + t is distributed as Binomial 2, 2t. Therefore, the expected loss of the algorithm in iteration t is lower bounded by ] E [l t,ˆt ɛ t P U t > 2 t + t = ɛ t P U t t t/2 > 2 2. We first give a rough idea of what happens. Due to Central Limit Theorem, the probability on the right hand side is approximately P Y > 2 2, where Y N0,. Summing over trials we get that the expected cumulative loss of the algorithm is approximately L T,2 P Z > 2 2 = Ω T. To be more precise, we use the Berry-Esséen theorem, which says that for X Binomialn, p, and Y N0,, for any z, P X np > z P Y > z np p fp, n where fp depends on p, but not on n. Using this fact with n = 2t and p = 2 results in U t t P > 2 2 P Y > 2 2 c t/2 t for some constant c. Therefore the expected cumulative loss of the algorithm can be lower bounded by P Y > 2 c 2L T,2 ɛ t t P Y > 2 2L T,2 cln T + = Ω T, which proves the theorem. t= 5

VAN ERVEN KOTŁOWSKI WARMUTH A.2. Operations to Reduce to the Canonical Worst-case Sequence In this section we prove that the three operations on losses from Section 3. can only ever increase the regret of the BDP algorithm. As proved in Lemma 3.7, this implies that the canonical worstcase sequence maximizes the regret among all sequences of binary losses with the same cumulative losses L,..., L K. THE UNIT RULE HOLDS We can assume without loss of generality that in every round exactly one expert gets loss. We call this the unit rule. It follows from two results, which will be proved in this subsection: first, Lemma A. shows that, if multiple experts get loss in the same trial, then the regret can only increase if we split that trial into multiple consecutive trials in which the experts get their losses in turn. Secondly, it is shown by Lemma A.2 that rounds in which all experts get zero loss do not change the regret, and can therefore be ignored in the analysis. Although we only need them for binary losses, both lemmas hold for general losses with values in [0, ]. Lemma A. One Expert Gets Loss Per Trial Suppose l,..., l t, l t, l t+,..., l T is a sequence of losses. Now consider the alternative sequence of losses l,..., l t, l t,..., l K t, l t+,..., l T with { l t, = l t, if =, 0 otherwise, for, =,..., K. Then the regret of the BDP algorithm on the original losses R T never exceeds its regret on the alternative losses R T : R T R T. Proof The cumulative loss of the best expert L is the same on both sequences of losses, so we only need to consider the cumulative loss of the BDP algorithm. On trials,..., t and t +,..., T the algorithm s probability distributions on experts are the same for both sequences of losses, so we only need to compare its expected loss on l t with its expected losses on l t,..., l K t. To this end, we observe that the algorithm s probability of choosing expert given losses l,..., l t is no more than its probability of choosing that expert given losses l,..., l t, l t,..., l t, because during the trials l t,..., l t expert has received no loss whereas experts,..., have respectively received losses l t,,..., l t,, which implies that the algorithm s expected loss can only increase on the alternative sequence of losses. Lemma A.2 Ignore All-zero Loss Vectors Suppose l,..., l t, l t, l t+,..., l T is a sequence of losses such that l t, = 0 for all. Then the regret of the BDP algorithm is the same on the subsequence l,..., l t, l t+,..., l T with trial t removed. Proof Both the best expert and the BDP algorithm have loss 0 on trial t. Trial t also does not influence the actions of the BDP algorithm for any trial t t. Hence the regret does not change if trial t is removed. 6

DROPOUT PERTURBATIONS SWAPPING ARGUMENT We call a loss vector l t a unit loss if there exists a single such that l t, = while l t, = 0 for all. Let ˆ t denote the expert that is randomly selected by the BDP algorithm based on some past losses l,..., l t to predict trial t. Let Binomialp, n denote the distribution of a binomial random variable on n trials with success probability p. Lemma A.3 Swapping Suppose that l,..., l t, l t, l t+, l t+2,..., l T is a sequence of unit losses with l t, = l t+, = for and that L t, L t,. Now consider the alternative sequence of losses l,..., l t, l t+, l t, l t+2,..., l T in which l t and l t+ are swapped. Then the regret of the BDP algorithm on the original losses R T never exceeds its regret on the alternative losses R T : R T R T. Proof The cumulative loss of the best expert L is the same on both sequences of losses, so we only need to consider the cumulative loss of the BDP algorithm. On trials,..., t and t +,..., T the algorithm s weights are the same for both sequences of loss, so we only need to compare its loss on l t, l t+ in the original sequence with its loss on l t+, l t in the alternative sequence. Let Pr and Pr respectively denote probability with respect to the algorithm s randomness on the original and on the alternative losses. Since we assumed unit losses, we need to show that Prˆ t = + Prˆ t+ = Pr ˆ t = + Pr ˆ t+ =. A. As will be made precise below, we can express all these probabilities in terms of binomial random variables L t,j Binomial α, L t,j for j =,..., K that represent the cumulative perturbed losses for the experts after t trials, and an additional Bernoulli variable X Binomial α, that represents whether the next loss is dropped out depending on context, X is either l t, or l t,. It will also be convenient to define the derived values M = min j, Lt,j, C = {j, : Lt,j = M}, which represent the minimum perturbed loss and the number of experts achieving that minimum among all experts except and. We will show that Prˆ t = M = m, C = c, X = x + Prˆ t+ = M = m, C = c, X = x Pr ˆ t = M = m, C = c, X = x Pr ˆ t+ = M = m, C = c, X = x A.2 7

VAN ERVEN KOTŁOWSKI WARMUTH is nonpositive for all m, c and x, from which A. follows. In terms of the random variables we have defined, these conditional probabilities are Prˆ t = M = m, C = c, X = x = Pr L < L, L < m + Pr L = L < m 2 + Pr L = m < L c + + Pr L = m = L c + 2, Prˆ t+ = M = m, C = c, X = x = Pr L < L + x, L < m + Pr L = L + x < m 2 + Pr L = m < L + x c + + Pr L = m = L + x c + 2, Pr ˆ t = M = m, C = c, X = x = Pr L < L, L < m + Pr L = L < m 2 + Pr L = m < L c + + Pr L = m = L c + 2, Pr ˆ t+ = M = m, C = c, X = x = Pr L < L + x, L < m + Pr L = L + x < m 2 + Pr L = m < L + x c + + Pr L = m = L + x c + 2, where for notational convenience, we sipped the subscript t in L t,j for all j. For x = 0, we see immediately that A.2 is 0, so it remains only to consider the case x =. For that case, A.2 simplifies to Pr L < L, L < m Pr L < L +, L < m + Pr L < L +, L < m Pr L < L, L < m + Pr L = 2 L + < m Pr L = L + < m + Pr L = m < c + L Pr L = m < L + + Pr L = m < c + L + Pr L = m < L + Pr L = m = c + 2 L + Pr L = m = L + = Pr L = L < m + Pr L = L < m + Pr L = 2 L + < m Pr L = L + < m Pr L = m = c + L + Pr L = m = L + Pr L = m = L + Pr L = m = L + = 2 c + 2 m a= Pr L = a = L + Pr L = a = L + + Pr L = m = c + 2 L + Pr L = m = L +. 8

DROPOUT PERTURBATIONS To prove that this is nonpositive, it is sufficient to show that Pr L = a = L + Pr L = a = L + 0 A.3 for all nonnegative integers a. Abbreviate L = L t, and L = L t,. If a = 0 or a > L, then the left-most probability is 0 and A.3 holds. Alternatively, for a {,..., L }, the left-hand side of A.3 is equal to L a L a L a L α 2a α L+L 2a+, a so it is enough to show that L a L a L L a a 0. But this holds, because the left-hand side is equal to L!L! a!l a!a!l a +! = L!L! a!l a +!a!l a +! because L = L t, L t, = L by assumption. L!L! a!l a +!a!l a! = L a + L a + L!L! a!l a +!a!l a +! L L 0, A.3. Bounding Leader Changes on the Canonical Worst-case Sequence A.3.. PROOF OF LEMMA 3.8 Proof Within this proof, abbreviate L = L t,. Let V Binomial α, L and W Binomial α, L t, L, and assume L = V + W. Also define pv = PrV = v. Then Pr D = = = K = L 2 Prmin j K v=0 = L v=2 L j L + 2 = pv Prmin j pv 2 pv K L = v=0 L j v + W + 2 = pv Prmin j K pv Prmin L j v + W j = } {{ } fv L K v=2 = L j v + W + 2 pv 2 Prmin j L j v + W. A.4 9

VAN ERVEN KOTŁOWSKI WARMUTH Let S = { : L min j Lj } be the set of leaders. Then fv = K = Prmin j = Pr : min j L j L, V = v Pr : min j Lj L, V = v Prmin S V = v, L j L, V = v where the first inequality follows by the union bound, and the second because the latter event implies the former. We remar that S is never empty, so that the minimum is always well-defined. We also have L pv 2 v 2 α v 2 α L v+2 α 2 = pv L = v α v α L v α 2 vv L v + 2L v +. Let gv = max{ vv L v+2l v+ α 2 α 2 L v=0, 0}. Then, since g0 = g = 0, A.4 is at least as large as gv Prmin S V = v = for V = min S V. The function gv may be written as α 2 E[gV ] α 2 v gv = h vh 2 v for h v = L v + 2, h 2v = v L v +. For v, both h and h 2 are nonnegative, nondecreasing and convex, which implies that gv also has these properties. Moreover, since gv = 0 for v [0, ] all properties extend to all v 0. Suppose that E[V ] αl B for some B 0. Then Jensen s inequality and monotonicity of g imply that α 2 α 2 E[gV ] α 2 αl α 2 ge[v ] α 2 α 2 g α α B αl α α B + αl + B + 2αL + B + = αl B α B + 2 αl + B + 2 α B + αl + B + 2B + 3 ααl. Putting everything together, we find that α B + 2 αl + B + 2 α B + αl + B + Pr D t > = Pr D t = 2B + 3 ααl, A.5 so that it remains to find a good bound B. Let Y = L V Binomialα, L. Then [ ] [ ] E[V ] + αl = E max αl V E max αl V S [ = E max Y αl ] L ln K 2 =: B, where the last inequality follows by a standard argument for sub-gaussian random variables see, for example, Lemmas A.3 and A. in the textboo by Cesa-Bianchi and Lugosi 2006. Plugging this bound into A.5 leads to the desired result. 20

DROPOUT PERTURBATIONS A.3.2. PROOF DETAILS FOR LEMMA 3.9 Suppose that t is a trial during the first regime in which expert gets a unit of loss. First we consider the case that is not the last expert in a round of the alternating scheme: Lemma A.4 Suppose t is a trial during the first L T, + r repetitions of the alternating scheme in which l t, = for some < K. Then PrA t PrA t+. Proof For j, +, let L t,j Binomial α, L t,j be the perturbed cumulative losses for all experts except experts and +. Also define M = min L t,j, j,+ C = {j, + : Lt,j = M}. By definition of the canonical worst-case sequence, expert gets a unit of loss in trial t, expert + will get a unit of loss in trial t +, and L t, = L t,+. We will construct the perturbed cumulative losses for experts and + from the following variables: V, W Binomial α, L t, and X Binomial α,. To express PrA t, we define the perturbed cumulative losses L t, = V and L t,+ = W, but to express PrA t+ we let L t, = W + X and L t,+ = V. This leads to PrA t M = m, C = c c + = PrV = m, W > m + PrV = m, W = mc c + c + 2 + PrV = m, W > m c + + PrV = m, W = m c + 2 + PrV = W, W < m + PrV = W, W < m 2, PrA t+ M = m, C = c c + = PrV = m, W + X > m + PrV = m, W + X = mc c + c + 2 + PrV = m, W + X > m c + + PrV = m, W + X = m c + 2 + PrV = W + X, W + X < m + PrV = W + X, W + X < m 2 for any m and c. Thus PrA t+ M = m, C = c PrA t M = m, C = c = α PrA t+ M = m, C = c, X = 0 PrA t M = m, C = c, X = 0 + α PrA t+ M = m, C = c, X = PrA t M = m, C = c, X = = α PrA t+ M = m, C = c, X = PrA t M = m, C = c, A.6 2

VAN ERVEN KOTŁOWSKI WARMUTH where PrA t+ M = m, C = c, X = PrA t M = m, C = c c = PrV = m, W + > m PrV = m, W > m c + c + + PrV = m, W + = m PrV = m, W = m c + 2 + PrV = m, W + > m PrV = m, W > m c + + PrV = m, W + = m PrV = m, W = m c + 2 + PrV = W, W + < m + PrV = W +, W + < m PrV = W, W < m PrV = W, W < m 2. Using that V and W have the same distribution, so that we may switch their roles, this simplifies to PrA t+ M = m, C = c, X = PrA t M = m, C = c c = PrV = m, W = m c + c + c + 2 + c + 2 c + + PrV = W = m c + 2 2 + PrV = W = m c + c + 2 0. Substituting bac into A.6, we see that PrA t+ M = m, C = c PrA t M = m, C = c 0 for all m and c. Hence PrA t+ PrA t 0 also holds unconditionally, from which the lemma follows. Secondly, we consider the case that is the last expert in a round of the alternating scheme: Lemma A.5 Suppose t is a trial during the first L T, + r repetitions of the alternating scheme in which l t,k =. Then PrA t α PrA t+. To mae PrA t+ well-defined in case t = T, we adopt the convention that expert K gets a unit of loss in trial T +. Proof Let be the expert that gets a unit of loss in trial t + so that l t+, = and l t+, = 0 for. Trial t + is at the beginning of a round of the alternating scheme, so by symmetry between the experts that are part of the alternating scheme, PrA t+ would remain the same if we changed l t+ so that l t+,k = and l t+, = 0 for all K. But then we would have PrA t+ l t,k = 0 = PrA t, 22

DROPOUT PERTURBATIONS where l t,k is the perturbed loss for expert K in round t, and consequently PrA t+ = α PrA t+ l t,k = 0 + α PrA t+ l t,k = α PrA t+ l t,k = 0 = α PrA t, from which the lemma follows. A.3.3. PROOF OF LEMMA 3.0 Proof Let vl denote the last trial in the L-th alternating scheme. Then, by Lemmas 3.9, 3.8 and the trivial bound Pr D vl+ >, v PrA t L T, +r Pr D α vl+ > L T, 2 ln K α α α L + 3 + r + L t= L= L=2 2 2L T, ln K + 3 ln + L T, + r +, α α α where the last step follows from the fact that L T, L=2 fl L T, fldl for any nonincreasing function f. A.3.4. PROOF OF LEMMA 3.2 Here we show how to optimize r to obtain Lemma 3.2: Proof Abbreviate fr = 2 α2 r 2 2L +r. Then the bound for the second regime from Lemma 3. can be written as PrA t Kr fr 2L + 3r 2L + r e fr 3Kr fr e fr. t=s We will choose r to be the smallest nonnegative integer such that fr ln3k, so that PrA t r. t=s A.7 It is no problem if this maes r exceed its maximal value L T,K L T,, because in that case the second regime is empty, so A.7 still holds, and since the bound for the first regime from Lemma 3.0 is increasing in r it also still holds. To find r, we need to tae the largest solution to 2 α 2 r 2 2L + r = ln3k, and round it up to the nearest integer. Abbreviating a = 2 α 2 and b = ln3k, this gives b + b r = 2 + 8aL b b 2L 2a a + b + = ln3k L a 2 α 2 + ln3k +, α 23

VAN ERVEN KOTŁOWSKI WARMUTH where the inequality follows from x + y x + y for nonnegative x, y and x x +. Combining this bound on r with A.7 and the bound for the first regime from Lemma 3.0, we find that PrA t α α α α α α α α α α α α t= 2 2L ln K + 3 ln + L + r + + r 2 2L ln K + 3 ln + L + 2r + 2 2L ln K + 3 ln + L + ln3k 4 2L ln3k + 3 ln + L which is equivalent to the statement of the lemma. α 2 + 2 L ln3k α + ln3k α 2 + 3, + 3 Appendix B. Proofs for Section 4 B.. Proof of Theorem 4. Proof [Theorem 4.] We will show that the conditions of Lemma B. below, with c = γ/2, are satisfied with probability at least δ if τ is chosen as 8 τ = α 2 γ 2 ln 8K α 2 γ 2. δ Because the second term in the bound from Lemma B. is bounded by this shows that 4K α 2 γ 2 exp α2 γ 2 τ, 4 R T R τ+ + with probability at least δ. We now get the inequality in 4.2 from the trivial bound R τ+ τ + and x x +. Alternatively, one might also get a better dependence on γ by applying Theorem 3. to R τ+ and using that L τ +, which leads to ln K R T = O + ln γ γ γ 2 δ ln K. To verify that the conditions of Lemma B. are satisfied with sufficient probability, define the following events for t τ + : A t, : L t, E[L t, ] γ 4 t for, B t : L t, E[L t, ] + γ 4 t, 24

DROPOUT PERTURBATIONS and let D t = B t A t, be the event that they all hold simultaneously. By the assumption in 4. we have that L t, L t, E[L t, ] E[L t, ] γ 2 t γ 2 t on D t, so that the conditions of Lemma B. are satisfied with c = γ/2 if D t holds for all t τ +. By Hoeffding s inequality, the probabilities of the complementary events Āt, and B t are all bounded by exp γ 2 t/8 and hence by the union bound the probability of D t is bounded by K exp γ 2 t/8. Combining this with another application of the union bound, we find that the probability that D t fails to hold for any t τ + is bounded by Pr t=τ+,...,t D t t=τ+ K t=τ Pr Dt K t=τ+ exp γ2 8 t exp γ2 8K tdt = 8 γ 2 exp γ2 8 τ. The reader may verify that, for our choice of τ, this probability is bounded by δ, which completes the proof. Lemma B. Suppose that, for some, L t, L t, ct for all t τ + and, where c > 0 is a constant, and τ is a nonnegative integer. Then the regret for the BDP algorithm is bounded by a constant: R T R τ+ + K α 2 c 2 exp α 2 c 2 τ. Proof From the assumption of the lemma we now that expert will be the best expert at least for all t τ +. Consequently the regret is bounded by R T R τ+ + t=τ+2 Prˆ t. For any t τ + 2 and any, Hoeffding s inequality implies that Prˆ t = Pr L t, L t, = Pr Lt, L t, E[ L t, L t, ] αl t, L t, 2 α 2 L t, L t, 2 exp exp α 2 c 2 t. 2t 25

VAN ERVEN KOTŁOWSKI WARMUTH Consequently, by the union bound, t=τ+2 Prˆ t = K t=τ+ Prˆ t = K exp α 2 c 2 t K t=τ+2 K α 2 c 2 exp α 2 c 2 τ, from which the lemma follows. t=τ+2 τ exp α 2 c 2 t exp α 2 c 2 t dt 26