Follow the Leader with Dropout Perturbations


 Kevin Hudson
 2 years ago
 Views:
Transcription
1 JMLR: Worshop and Conference Proceedings vol 35: 26, 204 Follow the Leader with Dropout Perturbations Tim van Erven Département de Mathématiques, Université ParisSud, France Wojciech Kotłowsi Institute of Computing Science, Poznań University of Technology, Poland Manfred K. Warmuth Department of Computer Science, University of California, Santa Cruz Abstract We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss i.e. regret compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader FPL. The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss the leader where ties are broen uniformly at random. To achieve the optimal worstcase regret as a function of the loss L of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or before selecting the leader. We show that our perturbations are an instance of dropout because experts may be interpreted as features although for nonbinary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuningfree version of the FPL algorithm achieves two feats: optimal worstcase O L ln K + ln K regret as a function of L, and optimal Oln K regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family AdaHedge and FlipFlop also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert because they depend on the expert s total loss and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants. Keywords: Online learning, regret bounds, expert setting, Follow the Leader, Follow the Perturbed Leader, Dropout. Introduction We address the following online sequential prediction problem, nown as prediction with expert advice see e.g. Littlestone and Warmuth, 994; Vov, 998: in each trial t =, 2,... the algorithm randomly chooses one of K experts as its predictor before being told the loss vector l t [0, ] K of all experts. Let L T, be the cumulative loss of the th expert after T rounds, and let L = min L T, be the cumulative loss of the best expert. Then the goal for the algorithm is to minimize the difference between its expected cumulative loss and L, which is nown as the regret. In this paper, we consider not just the standard worstcase setting, in which losses are generated by an 204 T. van Erven, W. Kotłowsi & M.K. Warmuth.
2 VAN ERVEN KOTŁOWSKI WARMUTH adversary that tries to maximize the regret, but also a setting with easier data, in which loss vectors are sampled from a fixed probability distribution. The simplest algorithm is to Follow the Leader FTL, i.e. to choose an expert of minimal cumulative loss, with ties broen uniformly at random. This algorithm wors well for probabilistic data, but unfortunately its regret will grow linearly with L and T on worstcase data, which means that it is not able to learn. This problem can be avoided by the Weighted Majority algorithm or its stratified version, the Hedge algorithm Littlestone and Warmuth, 994; Freund and Schapire, 997, which chooses expert with probability proportional to the exponential weight exp ηl t, in trial t, for some positive learning rate η. For the appropriate tuning of η, Hedge is guaranteed to achieve sublinear regret of order O L ln K + ln K, which is worstcase optimal CesaBianchi et al., 997; Jiazhong et al., 203. In addition to using exponential weights, there is a second method that can achieve sublinear worstcase regret: perturb the cumulative losses of the experts by independent additive noise from a fixed distribution and then apply FTL to the perturbed losses. For exponentially distributed noise with its parameter appropriately chosen as a function of L, the resulting Follow the Perturbed Leader FPL algorithm can also guarantee regret O L ln K + ln K Kalai and Vempala, FPL can also exactly simulate Hedge with any learning rate η by using logexponential random variables for the noise that are scaled as a function of η Kuzmin and Warmuth, 2005; Kalai, Instead of using additive noise, we perturb the loss vector of the current trial: for binary losses l t {0, } K, we set each component l t, to zero with fixed probability α. As we explain in Section 2, this is equivalent to dropout Hinton et al., 202, because l t, may be interpreted as the th feature value at trial t. Surprisingly, this simple method achieves the same optimal regret bound for any value of α 0,, without tuning. We can also handle the general case when the loss vector l t taes values in the continuous range [0, ] K. In this case, we show that, on worstcase data, the vanilla dropout procedure that simply sets coordinates of the loss vector to zero, gets a suboptimal ΩK dependence on the number of experts K. However, by binarizing the dropout loss, by which we mean setting the loss l t, of the th expert to with probability αl t, and to 0 otherwise, FPL again achieves the optimal regret of O L ln K + ln K. Most Hedge or FPL algorithms depend on a learning rate or noise parameter η > 0, which needs to be tuned in terms of L to get the worstcase bound O L ln K + ln K. A simple way is to maintain an estimate of L and tune η based on the current estimate. As soon as the loss of the best expert exceeds the current estimate, the estimate is doubled and the algorithm retuned see e.g. CesaBianchi et al., 996, 997. This method achieves the optimal regret for Hedge and the related previous versions of FPL, albeit with a large constant factor. Much fancier tuning methods were introduced by Auer et al Amongst others, their techniques have recently led to the AdaHedge and FlipFlop algorithms van Erven et al., 20; de Rooij et al., 204, which wor well in the case when the loss vector is drawn i.i.d. from a fixed distribution. In this case they achieve constant regret Oln K when there is a gap between the loss of the best expert and the expected loss of all others. Surprisingly, we can show that FPL based on the binarized dropout perturbations also achieves regret Oln K in the i.i.d. case, without the need of any tuning. On the other side of the spectrum, there exists another simple perturbation based on Random Wal Perturbation RWP Devroye et al., 203: simply add an independent Bernoulli coin flip to each loss component in each trial without any tuning. We highly benefited from the techniques 2
3 DROPOUT PERTURBATIONS used in the analysis of this version of FPL. However, we can show that RWP already has Ω T regret in the noisefree case L = 0, where T is the number of trials. In contrast, FPL with binarized dropout perturbation achieves constant regret of order Oln K. It seems that since the perturbation of the loss of expert depends on the total current loss of expert, this maes FPL more adaptive and no additional tuning is required. Our research opens up a large number of questions regarding tuningfree methods for learning with the linear loss.. Does FPL with dropout perturbations lead to efficient algorithms with close to optimal regret for shifting experts or for various combinatorial expert settings such as the online shortest path problem Taimoto and Warmuth, 2003? See Devroye et al., 203 for a similar discussion. 2. Are there other versions of dropout such as dropping the entire loss vector with probability α that achieve the same feats as the dropout perturbations used in this paper? 3. There is a natural matrix generalization of the expert setting Warmuth and Kuzmin, 20. Is there a version of dropout perturbation that avoids the expensive matrix decompositions used by all algorithms with good regret bounds for this generalization Hazan et al., 200 and replaces the matrix decompositions by maximum eigenvector calculations? 4. Wager et al. 203 consider a Follow the Regularized Leader type algorithm for the batch setting. They use a regularizer that is a variation of the squared Euclidean distance, which they motivate as FTL on an approximation to the expected dropout loss. Although this use of dropout is entirely different from ours, it would be interesting to now whether it leads to novel regret or generalization bounds. Outline The paper is organized as follows. In Section 2 we formally define dropout perturbation, and explain how dropping loss components l t, is the same as dropping features. Section 3 contains the core of our paper: We prove that the regret of FPL with binarized dropout perturbations is bounded by O L ln K + ln K. Even though our algorithm is simple to state, this proof is quite involved at this point. We construct a worstcase sequence of losses using a sequence of reductions. This sequence consists of two regimes and we prove bounds for each regime separately, which we then combine. We also clarify the difference to the RWP algorithm, which provably does not achieve this bound. Then, in Section 4, we show that dropout perturbation automatically adapts to i.i.d. loss vectors, for which it gets constant regret of order Oln K. Some of the proof details are postponed to the appendix. 2. Dropout Perturbation Setting Online prediction with expert advice CesaBianchi and Lugosi, 2006 is formalized as a repeated game between the algorithm and an adversary. At each iteration t =,..., T, the algorithm randomly pics an expert ˆ t {,..., K} according to a probability distribution w t = w t,,..., w t,k of its choice. Then, the loss vector l t [0, ] K is revealed by the adversary, and the algorithm suffers loss l t,ˆt. Let L T, = T t= l t, be the cumulative loss of expert. Then the goal of the algorithm is to minimize its regret, which is the difference between its expected 3
4 VAN ERVEN KOTŁOWSKI WARMUTH Algorithm : Follow the Leader with Binarized Dropout Perturbation BDP Parameter : Dropout probability α 0, Initialization: L 0, = 0 for all =,..., K. for t = to T do Pic expert ˆ t = argmin Lt, with ties broen uniformly at random. Observe loss vector l t and suffer loss l t,ˆt. Draw l t, according to 2., independently for all. Update L t, = L t, + l t, for all. end cumulative loss and the cumulative loss of the best expert chosen in hindsight: [ T ] R T = E l t,ˆt L, where L = min L T,. t= [ T The expectation is taen with respect to the random choices of the algorithm, i.e. E t= l t,ˆt] = T t= w t l t. For the process that generates the losses, we will consider two different models: In the worstcase setting, studied in Section 3, we need to guarantee small regret for every possible sequence of losses l,..., l T [0, ] K, including the sequence that is the most difficult for the algorithm; in the independent, identically distributed i.i.d. setting, considered in Section 4, we assume the loss vectors l t [0, ] K all come from a fixed, but unnown probability distribution, which maes it possible to give stronger guarantees on the regret. Follow the Perturbed Leader The general class of Follow the Perturbed Leader FPL algorithms Kalai and Vempala, 2005; Hannan, 957 is defined by the choice { } ˆ t = argmin Lt, + ξ t,, where ξ t, is an additive random perturbation of the cumulative losses, which is chosen independently for every expert. Kalai and Vempala 2005 consider exponential and uniform distributions for ξ t, that depend on a parameter η > 0, and in the RWP algorithm of Devroye et al., 203 ξ t, Binomial/2, t t/2 is a binomial variable on t outcomes with success probability /2 that is centered around its mean. Binarized Dropout Perturbations We introduce an instance of FPL based on binarized dropout perturbation BDP. For any dropout probability α 0,, the binarized dropout perturbation of loss l t, is defined as: { with probability αl t,, l t, = 2. 0 otherwise. Let L T, Algorithm chooses = T t= l t, be the cumulative BDP loss for expert. Then the BDP algorithm see ˆ t = argmin 4 L t,,
5 DROPOUT PERTURBATIONS with ties broen uniformly at random. BDP is conceptually very simple. Computationally, it might be of interest that it does sparse updates, which only operate on nonzero features: if l t, = 0, then l t, = 0 as well, so, if one uses the same variable to store L t, and L t,, then no update to the internal state of the algorithm is required. There is a parameter α, but this parameter only affects the constants in the theorems, so simply setting it to α = /2 without any tuning already gives good performance. BDP may be viewed as an instance of FPL with the additive datadependent perturbations ξ t, = L t, L t,. If the losses are binary, α = /2 and the cumulative loss grows approximately as L t, t for all experts, then ξ t, is approximately distributed as Binomial/2, t t. Consequently, in this special case BDP is very similar to RWP because shifts by a constant do not change the minimizer ˆ t. Standard Dropout Perturbations Dropout is normally defined as dropping hidden units or features in a neural networ while training the parameters of the networ on batch data using Gradient Descent Hinton et al., 202; Wang and Manning, 203; Wager et al., 203. In each iteration, hidden units or features are dropped with a fixed probability. We are in the single neuron case, in which each expert may be identified with a feature and the standard dropout perturbations without binarization become : { l s t, = l t, with probability α, 0 otherwise. For binary losses, standard dropout perturbation is the same as BDP, but for general losses in the interval [0, ] they are different. We initially tried to prove our results for standard dropout perturbation, but it turns out that this approach cannot achieve the right dependence on the number of experts: Theorem 2. Consider the FPL algorithm based on standard dropout perturbation with parameter α 0,. Then, for any B > 0 and any K 2, there exists a loss sequence l,..., l T with l t [0, ] K for which L B, while the algorithm incurs regret at least 2 K. Proof Tae ɛ and δ to be very small positive numbers specified later. The sequence is constructed as follows. First, all experts except expert incur losses ɛ for a number of iterations that is sufficiently large that, with probability w.p. at least δ, the algorithm will choose expert as a leader. Note that the required number of iterations depends on δ, K and α, but it does not depend on ɛ. Then, expert incurs unit loss and so does the algorithm w.p. δ, while all other experts incur no loss. Next, all experts except expert 2 incur losses ɛ for sufficiently many iterations that, w.p. at least δ, the algorithm will choose expert 2 as a leader. Then, expert 2 incurs unit loss and so does the algorithm w.p. δ, while all other experts incur no loss. This process is then repeated for = 3, 4,..., K. At the end of the game, the algorithm incurred loss at least K w.p.. The loss w l t in round t that we consider here is linear in the weights w. For a general convex loss of the form f tw x t, where x t is the feature vector at round t, learning the weights w is often achieved by locally linearizing the loss with a firstorder Taylor expansion: f tw t x t f tw x t w t l t w l t for the surrogate loss l t, = f tw t x tx t,. See the discussion in Section 4. of Kivinen and Warmuth 997 and Section 2.4 of Shalev Shwartz 20. Thus, experts correspond to features, and setting the th feature x t, to 0 is equivalent to setting l t, to 0. 5
6 VAN ERVEN KOTŁOWSKI WARMUTH at least K δ, while expert K incurred no more loss than T ɛ. Taing δ small enough that δ < 2K, the expected loss of the algorithm is at least 2 K. Finally, we tae ɛ small enough to satisfy 0 < T ɛ B, which gives L B. For the case that L B ln K, a better bound of order O L ln K + ln K = Oln K is possible. So we see that standard dropout leads to a suboptimal algorithm, and therefore we use binarized dropout for the rest of the paper. 3. Regret Bounded in Terms of the Cumulative Loss of the Best Expert We will prove that the regret for the BDP algorithm is bounded by O L ln K + ln K: Theorem 3. For any sequence of losses taing values in [0, ] with min L T, = L, the regret of the BDP algorithm is bounded by R T α 2 4 2L α ln3k + 3 ln + L + ln3k α α α. Since RWP is similar to BDP and is nown to have regret bounded by O T ln K, one might try to show that the regret for RWP satisfies the stronger bound O L ln K + ln K as well, but this turns out to be impossible: Theorem 3.2 For every T > 0, there exists a loss sequence l,..., l T for which L = 0, while RWP suffers Ω T regret. This theorem, proved in Section A. of the appendix, shows that there is a fundamental difference between RWP and BDP, and that the data dependent perturbations used in BDP are crucial to obtaining a bound in terms of L. We now turn to proving Theorem 3., starting with the observation that it is in fact sufficient to prove the theorem only for binary losses: Lemma 3.3 Let a 0 and b be any constants. If the regret for the BDP algorithm is bounded by R T a L + b 3. for all sequences of binary losses, then it also satisfies 3. for any sequence of losses with values in the whole interval [0, ]. Proof Let l,..., l T be an arbitrary sequence of loss vectors with components l t, taing values in the whole interval [0, ]. We need to show that the regret for this sequence satisfies 3.. To this end, let l,..., l T be an alternative sequence of losses that is generated randomly from l,..., l T by letting { l t, = 0 with probability l t,, with probability l t,, independently for all t and. Accordingly, let a prime on any quantity denote that it is evaluated on the alternative losses. For example, L T, = T t= l t, is the cumulative alternative loss for expert. Let w t be the probability distribution on experts induced by the internal randomization of the 6
7 DROPOUT PERTURBATIONS BDP algorithm, i.e. w t, = P ˆ t =, and let w t be its counterpart on the alternative losses. Then, because the BDP algorithm internally generates a binary sequence of losses with the same probabilities as l t,, we have E l,...,l [w t t] = w t, and, independently, E l t [l t] = l t. Consequently, E[w t l t] = w t l t and E[ L T ] = L T, where L T and L T are the expected with respect to the internal randomization cumulative losses of the algorithm on the original and the alternative losses, respectively. Applying 3. for the alternative losses and taing expectations, it now follows by Jensen s inequality that [ ] E[ L T ] E[min L T, ] a E min L T, + b a E[min L T, ] + b E[ L T ] min E[L T, ] a min E[L T, ] + b L T min L T, a min L T, + b. Thus, from now on, we assume the losses are binary, i.e. l t, {0, }. Following the standard approach for FPL algorithms of Kalai and Vempala 2005, we start by bounding the regret by applying the socalled betheleader lemma to the perturbed losses. Lie in the analysis of Devroye et al. 203, the result simplifies, because we can relate the expectation of the perturbed losses bac to the original losses: Lemma 3.4 betheleader algorithm is bounded by For any sequence of loss vectors l,..., l T, the regret for the BDP R T α t= E [ lt,ˆt t,ˆt+] l, where l,..., l T are the dropout perturbed losses and ˆ t is the random expert chosen by the BDP algorithm based on the perturbed losses before trial t. Proof The expected regret is given by R T = t= ] E [l t,ˆt min L T, = α E t= l t,ˆt min ] α [ lt,ˆt E ] min [ Lt, α E α E lt,ˆt l t,ˆt+, t= LT, where the second equality holds because ˆ t only depends on l,..., l t, which are independent of l t, and E [ lt, ] = αl t, ; and the last inequality is from the betheleader lemma, applied to the perturbed losses, which shows that T t= l t,ˆt+ min LT, Kalai and Vempala, 2005, Equation 4, CesaBianchi and Lugosi, 2006, Lemma 3.. t= 7
8 VAN ERVEN KOTŁOWSKI WARMUTH Every single term in the bound from Lemma 3.4 may be bounded further as follows: E [ lt,ˆt t,ˆt+] l = α α K Prˆ t = ˆ t+ l t, = l t, E [ lt,ˆt l t,ˆt+ ˆ t = ˆ t+, l ] t, = l t, = K Prˆ t = ˆ t+ l t, = l t, l t,. 3.2 = In the case considered by Kalai and Vempala 2005, this expression can be easily controlled, because, for their specific choice of perturbations, the conditional probability P m, = Prˆ t+ ˆ t =, l t, = l t,, min j L t, = m is small, uniformly for all m see the second display on p. 298 of their paper, which is easy to show, because P m, depends only on the cumulative loss and the perturbations for a single expert. Thus their choice of perturbations is what maes their analysis simple. Unfortunately, in our case and also for the RWP algorithm of Devroye et al. 203, this simple approach breas down, because, for the perturbations we consider, the probability P m, may be large for some m although such m will have small probability. We therefore cannot use a uniform bound on P m,, and as a consequence our proof is more complicated. We also cannot follow the line of reasoning of Devroye et al., because it relies on the fact that, at any time t, the perturbation for every expert has a standard deviation of order t. For our algorithm this is only true if the cumulative losses for the experts grow at least at a linear rate ct for some absolute constant c > 0, which need not be the case in general. Put differently, if the expert losses grow sublinearly, then our perturbations are too small to use the approach of Devroye et al.. Thus, we cannot use any of the existing approaches to control 3.2 for arbitrary losses, and, instead, we proceed as follows:. Given any integers L,..., L K, we find a canonical worstcase sequence of losses, which maximizes the regret of the BDP algorithm among all binary loss sequences of arbitrary length T such that L T, = L for all experts. 2. We bound 3.2 on this sequence. 3.. The Canonical Worstcase Sequence We will show that the worstcase regret is achieved for a sequence of unit vectors: let u {0, } K denote the unit vector 0,..., 0,, 0,..., 0 where the is in the th position. This restriction to unit vectors has been used before Abernethy and Warmuth, 200; Koolen and Warmuth, 200, but here we go one step further. We build a canonical worstcase sequence of unit vectors from the following alternating schemes: Definition 3.5 For any {,..., K}, we call a sequence of loss vectors u,..., u K a..kalternating scheme. 8
9 DROPOUT PERTURBATIONS To simplify the notation in the proofs, we will henceforth assume that the experts are numbered from best to worst, i.e. L... L K. In the canonical worstcase sequence, alternating schemes are repeated as follows: Definition 3.6 Canonical Worstcase Sequence Let L... L K be nonnegative integers. Among all sequences of binary losses of arbitrary length T such that L T, = L for all, we call the following sequence the canonical worstcase sequence: First repeat the..kalternating scheme L times; then repeat the 2..Kalternating scheme L 2 L times; then repeat the 3..Kalternating scheme L 3 L 2 times; and so on until finally we repeat the K..Kalternating scheme which consists of just the unit vector u K L K L K times. Note that the canonical worstcase sequence always consists of L K alternating schemes. For example, the canonical worstcase sequence for cumulative losses L = 2, L 2 = 3, L 3 = 6 consists of the following L 3 = 6 alternating schemes: ,, 0, 0,, 0,, 0, 0, 0, } {{ }..3alternating scheme 0 0 } {{ }..3alternating scheme 0 } {{ } 2..3alternating scheme } {{ } 3..3a.s. } {{ } 3..3a.s. } {{ } 3..3a.s. For the BDP algorithm, the following three operations do not increase the regret of a sequence of binary losses see Section A.2 of the appendix for proofs:. If more than one expert gets loss in a single trial, then we split that trial into K trials in which we hand out the losses for the experts in turn Lemma A.. 2. If, in some trial, all experts get loss 0, then we remove that trial Lemma A If, in two consecutive trials t and t +, l t = u and l t+ = u for some experts such that L t, L t,, then we swap the trials, setting l t = u and l t+ = u Lemma A.3. These operations can be used to turn any sequence of binary losses into the canonical worstcase sequence: Lemma 3.7 Let L... L K be nonnegative integers. Among all sequences of binary losses of arbitrary length T such that L T, = L for all, the regret of the BDP algorithm is nonuniquely maximized by its regret on the canonical worstcase sequence. Proof Start with an arbitrary sequence of binary losses such that L T, = L. By repeated application of operations and 2, we turn this loss sequence into a sequence of unit vectors. We then repeatedly apply the swapping operation 3 to any trials t and t+ such that l t = u and l t+ = u for where we have strict inequality L t, > L t,. We do this until no such consecutive trials remain. This arrives at the canonical worstcase sequence except that the unit vectors within each alternating scheme may be permuted. However, by applying property 3 for the case L t, = L t,, we can sort each alternating scheme into the canonical order. Together with Lemma 3.3, the previous lemma implies that we just need to bound the regret of the BDP algorithm on the canonical worstcase loss sequence. 9
10 VAN ERVEN KOTŁOWSKI WARMUTH 3.2. Regret on the Canonical Worstcase Sequence By Lemma 3.4 and 3.2, the regret for the BDP algorithm is bounded by R T t= = K Prˆ t = ˆ t+ l t, = l t, l t,. 3.3 On the canonical worstcase sequence, l t, will be zero for all but a single expert, which we can exploit to abbreviate notation: let ˆ t + denote the leader with ties still broen uniformly at random on the perturbed cumulative losses if we would add unit of loss to the perturbed cumulative loss L t,ˆt of the expert selected by the BDP algorithm; and define the event Then 3.3 simplifies to A t = {ˆt = ˆ t + } R T for the expert such that l t, =. PrA t. 3.4 t= We split the L T,K alternating schemes of the canonical worstcase sequence into two regimes, and use different techniques to bound PrA t in each case. Let r L T,K L T, be some nonnegative integer, which we will optimize later:. The first regime consists of the initial L T, + r alternating schemes, where the difference in the cumulative loss between all experts is relatively small at most r. 2. The second regime consists of the remaining L T,K r L T, alternating schemes. Now any expert that is still receiving losses will have a cumulative loss that is at least r larger than the cumulative loss of expert, and r will be chosen large enough to ensure that, with high probability, no such expert will the leader. The First Regime Define the lead pac D t = { : Lt, < min j Lt,j + 2}, which is the random set of experts that might potentially be the leader in trials t or t+. As observed by Devroye et al. 203, there can only be a leader change at time t if D t contains more than one expert. Our goal will be to control the probability that this happens using the following lemma proved in Section A.3., which is an adaptation of Lemma 3 of Devroye et al.: Lemma 3.8 Suppose L t, L > 0 for all. Then Pr D t > 2 ln K α α L + 3. L Unfortunately, we cannot apply Lemma 3.8 immediately: in the first regime there are K trials for each of the first L T, repetitions of the..kalternating scheme, and applying Lemma 3.8 to all of these trials would already give a bound of order K L T, ln K, which overshoots the right rate by a factor of K. To avoid this factor, we only apply the lemma once per alternating scheme, which is made possible by the following result: 0
11 DROPOUT PERTURBATIONS Lemma 3.9 Suppose an alternating scheme starts at trial s and ends at trial v. Then v PrA t α v s + PrA v+ α Prˆ v+ ˆ v+ + α Pr D v+ >. t=s Proof The first inequality follows from Lemmas A.4 and A.5 in Section A.3.2 of the appendix, which show that PrA s PrA s+... PrA v α PrA v+. Let = K v s be the first expert in the alternating scheme. At time v + a new alternating scheme will start, so that the situation is entirely symmetrical between all experts that are part of the current alternating scheme: PrA v+ = Prˆ v+ = ˆ + v+ for all {,..., K}. This implies the second inequality: K v s + PrA v+ = Prˆ v+ = ˆ v+ + = = Prˆ v+ {,..., K}, ˆ v+ ˆ v+ + Prˆ v+ ˆ v+ +. Finally, the last inequality follows by definition of the lead pac. Applying either Lemma 3.8 or the trivial bound Pr D t > only to the trials v + that immediately follow an alternating scheme, we obtain the following result for the first regime of the canonical worstcase sequence see Section A.3.3 for the proof: Lemma 3.0 Suppose the first regime ends with trial v. Then v PrA t 2 2L T, ln K + 3 ln + L T, + r +. α α α t= The Second Regime We proceed to bound the probability of the event A t during the second regime of the canonical worstcase sequence, using that PrA t Prˆ t = for the expert such that l t, =. This probability will be easy to control, because during the second regime the difference in cumulative loss between expert and the experts that are still receiving losses, is sufficiently large that the BDP algorithm will prefer expert over these experts with high probability. Lemma 3. Suppose the second regime starts in trial s. Then PrA t K 2L T, + 3r 2 α 2 r exp 2 α 2 r 2. 2L T, + r t=s
12 VAN ERVEN KOTŁOWSKI WARMUTH Proof Let t be a trial during the Lth alternating scheme, for some L in the second regime, and let t be the expert that gets loss in trial t. Then, by Hoeffding s inequality, PrA t Prˆ t = t Pr Lt,t L 2 α 2 L L T, 2 t, exp L + L T, 2 α 2 L L T, 2 2 α 2 rl L T, = exp exp, 2L T, + L L T, 2L T, + r where the last inequality follows from the fact that a+x is increasing in x for x 0 and a > 0. Let sl and vl denote the first and the last trial in the Lth alternating scheme. Then, summing up over all trials in the second regime, we get L T,K vl L T,K vl 2 α 2 rl L T, PrA t = PrA t exp 2L T, + r t=s L=L T, +r+ t=sl x L=L T, +r+ t=sl L T,K 2 α 2 rl L T, K exp 2L T, + r L=L T, +r+ 2 α 2 r 2 K exp + K exp 2L T, + r x=r+ Here the second term is bounded by: 2 α 2 rx exp 2L T, + r r L T,K L T, = K x=r 2 α 2 rx 2L T, + r dx = 2L T, + r 2 α 2 r exp exp. 2 α 2 rx 2L T, + r 2 α 2 r 2. 2L T, + r Putting things together we obtain 2 α 2 r 2 PrA t K exp + K 2L T, + r 2L t=s T, + r 2 α 2 r exp 2 α 2 r 2 2L T, + r K 2L T, + 3r 2 α 2 r exp 2 α 2 r 2, 2L T, + r which was to be shown. The First and the Second Regime Together Combining the bounds for the first and second regimes from Lemmas 3.0 and 3., and optimizing r, we obtain the following result, which is proved in Section A.3.4 of the appendix: Lemma 3.2 On a canonical worstcase sequence of any length T for which L T, = L, PrA t t= α 2 4 2L α ln3k + 3 ln + L + ln3k α α α. Plugging this bound into 3.4 completes the proof of Theorem 3.. 2
13 DROPOUT PERTURBATIONS 4. Constant Regret on IID Losses In this section we show that our BDP algorithm has optimal Oln K regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a fixed gap γ between the expected loss of the best expert and all others. Theorem 4. Let γ 0, ] and δ 0, ] be constants, and let be a fixed expert. Suppose the loss vectors l t are independent random variables such that the expected differences in loss satisfy min E[l t, l t, ] γ for all t. 4. Then, with probability at least δ, the regret of the BDP algorithm is bounded by a constant: R T 8 α 2 γ 2 ln 8K α 2 γ δ As discussed in the proof in Section B., it is possible to improve the dependence on γ at the cost of getting a more complicated expression. Acnowledgments Tim van Erven was supported by NWO Rubicon grant , Wojciech Kotłowsi by the Foundation for Polish Science under the Homing Plus Program, and Manfred K. Warmuth by NSF grant IIS References Jacob Abernethy and Manfred K. Warmuth. Repeated games against budgeted adversaries. In Neural Information Processing Systems NIPS, pages 9, 200. Peter Auer, Nicolò CesaBianchi, and Claudio Gentile. Adaptive and selfconfident online learning algorithms. Journal of Computer and System Sciences, 64:48 75, Nicolò CesaBianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, Nicolò CesaBianchi, Philip M. Long, and Manfred K. Warmuth. Worstcase quadratic loss bounds for online prediction of linear functions by gradient descent. IEEE Transactions on Neural Networs, 72:604 69, May 996. Nicolò CesaBianchi, Yaov Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 443: , 997. Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research. To appear., 204. Luc Devroye, Gábor Lugosi, and Gergely Neu. Prediction by randomwal perturbation. In Conference on Learning Theory COLT, pages ,
14 VAN ERVEN KOTŁOWSKI WARMUTH Yoav Freund and Robert E. Schapire. A decisiontheoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55:9 39, 997. James Hannan. Approximation to Bayes ris in repeated play. Contributions to the Theory of Games, 3:97 39, 957. Elad Hazan, Satyen Kale, and Manfred K. Warmuth. Online variance minimization in On 2 per trial? In Conference on Learning Theory COLT, pages 34 35, 200. Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsy, Ilya Sutsever, and Ruslan R. Salahutdinov. Improving neural networs by preventing coadaptation of feature detectors. CoRR, abs/ , 202. Nie Jiazhong, Wojciech Kotłowsi, and Manfred K. Warmuth. Online PCA with optimal regrets. In Algorithmic Learning Theory ALT, pages 98 2, 203. Adam Kalai. A perturbation that maes Follow the Leader equivalent to Randomized Weighted Majority. Private communication, December Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 73:29 307, Jyri Kivinen and Manfred K. Warmuth. Additive versus Exponentiated Gradient updates for linear prediction. Information and Computation, 32: 64, 997. Wouter M. Koolen and Manfred K. Warmuth. Hedging structured concepts. In 23rd Annual Conference on Learning Theory  COLT 200, pages Omnipress, June 200. Dima Kuzmin and Manfred K. Warmuth. Optimum follow the leader algorithm. In Conference on Learning Theory COLT, pages , Open problem. Nic Littlestone and Manfred K. Warmuth. The Weighted Majority algorithm. Information and Computation, 082:22 26, 994. Shai ShalevShwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 42:07 94, 20. Eiji Taimoto and Manfred K. Warmuth. Path ernels and multiplicative updates. Journal of Machine Learning Research, 4:773 88, Tim van Erven, Peter D. Grünwald, Wouter Koolen, and Steven de Rooij. Adaptive hedge. In Neural Information Processing Systems NIPS, pages , 20. Volodya Vov. A game of prediction with expert advice. Journal of Computer and System Sciences, 562:53 73, 998. Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In Neural Information Processing Systems NIPS, pages , 203. Sida I. Wang and Christopher D. Manning. Fast dropout training. In International Conference on Machine Learning ICML, pages 8 26,
15 DROPOUT PERTURBATIONS Manfred K. Warmuth and Dima Kuzmin. Online variance minimization. Machine Learning, 87: 32, 20. Additional Material for Follow the Leader with Dropout Perturbations Appendix A. Proof Details for Section 3 A.. Proof of Theorem 3.2 Proof We prove the theorem by explicit construction of the sequence. We tae K = 2 experts. The loss sequence is such that expert never gets any loss hence L = L T, = 0, while expert 2 gets losses ɛ, ɛ 2,..., ɛ T, where ɛ t = t. This means that its cumulative loss L t,2 = t s= s is bounded by 2 t + 2 L t,2 2 t +. We now bound the total loss of the RWP algorithm from below. Consider the situation just before trial t. Then expert 2 has accumulated at most L t,2 2 t loss so far. The expected loss of the algorithm in this iteration is ɛ t times the probability that expert becomes the leader. This probability is at least P ξ t, > ξ t,2 + 2 t, where ξ t, Binomial/2, t t/2. Consequently, U t = ξ t, ξ t,2 + t is distributed as Binomial 2, 2t. Therefore, the expected loss of the algorithm in iteration t is lower bounded by ] E [l t,ˆt ɛ t P U t > 2 t + t = ɛ t P U t t t/2 > 2 2. We first give a rough idea of what happens. Due to Central Limit Theorem, the probability on the right hand side is approximately P Y > 2 2, where Y N0,. Summing over trials we get that the expected cumulative loss of the algorithm is approximately L T,2 P Z > 2 2 = Ω T. To be more precise, we use the BerryEsséen theorem, which says that for X Binomialn, p, and Y N0,, for any z, P X np > z P Y > z np p fp, n where fp depends on p, but not on n. Using this fact with n = 2t and p = 2 results in U t t P > 2 2 P Y > 2 2 c t/2 t for some constant c. Therefore the expected cumulative loss of the algorithm can be lower bounded by P Y > 2 c 2L T,2 ɛ t t P Y > 2 2L T,2 cln T + = Ω T, which proves the theorem. t= 5
16 VAN ERVEN KOTŁOWSKI WARMUTH A.2. Operations to Reduce to the Canonical Worstcase Sequence In this section we prove that the three operations on losses from Section 3. can only ever increase the regret of the BDP algorithm. As proved in Lemma 3.7, this implies that the canonical worstcase sequence maximizes the regret among all sequences of binary losses with the same cumulative losses L,..., L K. THE UNIT RULE HOLDS We can assume without loss of generality that in every round exactly one expert gets loss. We call this the unit rule. It follows from two results, which will be proved in this subsection: first, Lemma A. shows that, if multiple experts get loss in the same trial, then the regret can only increase if we split that trial into multiple consecutive trials in which the experts get their losses in turn. Secondly, it is shown by Lemma A.2 that rounds in which all experts get zero loss do not change the regret, and can therefore be ignored in the analysis. Although we only need them for binary losses, both lemmas hold for general losses with values in [0, ]. Lemma A. One Expert Gets Loss Per Trial Suppose l,..., l t, l t, l t+,..., l T is a sequence of losses. Now consider the alternative sequence of losses l,..., l t, l t,..., l K t, l t+,..., l T with { l t, = l t, if =, 0 otherwise, for, =,..., K. Then the regret of the BDP algorithm on the original losses R T never exceeds its regret on the alternative losses R T : R T R T. Proof The cumulative loss of the best expert L is the same on both sequences of losses, so we only need to consider the cumulative loss of the BDP algorithm. On trials,..., t and t +,..., T the algorithm s probability distributions on experts are the same for both sequences of losses, so we only need to compare its expected loss on l t with its expected losses on l t,..., l K t. To this end, we observe that the algorithm s probability of choosing expert given losses l,..., l t is no more than its probability of choosing that expert given losses l,..., l t, l t,..., l t, because during the trials l t,..., l t expert has received no loss whereas experts,..., have respectively received losses l t,,..., l t,, which implies that the algorithm s expected loss can only increase on the alternative sequence of losses. Lemma A.2 Ignore Allzero Loss Vectors Suppose l,..., l t, l t, l t+,..., l T is a sequence of losses such that l t, = 0 for all. Then the regret of the BDP algorithm is the same on the subsequence l,..., l t, l t+,..., l T with trial t removed. Proof Both the best expert and the BDP algorithm have loss 0 on trial t. Trial t also does not influence the actions of the BDP algorithm for any trial t t. Hence the regret does not change if trial t is removed. 6
17 DROPOUT PERTURBATIONS SWAPPING ARGUMENT We call a loss vector l t a unit loss if there exists a single such that l t, = while l t, = 0 for all. Let ˆ t denote the expert that is randomly selected by the BDP algorithm based on some past losses l,..., l t to predict trial t. Let Binomialp, n denote the distribution of a binomial random variable on n trials with success probability p. Lemma A.3 Swapping Suppose that l,..., l t, l t, l t+, l t+2,..., l T is a sequence of unit losses with l t, = l t+, = for and that L t, L t,. Now consider the alternative sequence of losses l,..., l t, l t+, l t, l t+2,..., l T in which l t and l t+ are swapped. Then the regret of the BDP algorithm on the original losses R T never exceeds its regret on the alternative losses R T : R T R T. Proof The cumulative loss of the best expert L is the same on both sequences of losses, so we only need to consider the cumulative loss of the BDP algorithm. On trials,..., t and t +,..., T the algorithm s weights are the same for both sequences of loss, so we only need to compare its loss on l t, l t+ in the original sequence with its loss on l t+, l t in the alternative sequence. Let Pr and Pr respectively denote probability with respect to the algorithm s randomness on the original and on the alternative losses. Since we assumed unit losses, we need to show that Prˆ t = + Prˆ t+ = Pr ˆ t = + Pr ˆ t+ =. A. As will be made precise below, we can express all these probabilities in terms of binomial random variables L t,j Binomial α, L t,j for j =,..., K that represent the cumulative perturbed losses for the experts after t trials, and an additional Bernoulli variable X Binomial α, that represents whether the next loss is dropped out depending on context, X is either l t, or l t,. It will also be convenient to define the derived values M = min j, Lt,j, C = {j, : Lt,j = M}, which represent the minimum perturbed loss and the number of experts achieving that minimum among all experts except and. We will show that Prˆ t = M = m, C = c, X = x + Prˆ t+ = M = m, C = c, X = x Pr ˆ t = M = m, C = c, X = x Pr ˆ t+ = M = m, C = c, X = x A.2 7
18 VAN ERVEN KOTŁOWSKI WARMUTH is nonpositive for all m, c and x, from which A. follows. In terms of the random variables we have defined, these conditional probabilities are Prˆ t = M = m, C = c, X = x = Pr L < L, L < m + Pr L = L < m 2 + Pr L = m < L c + + Pr L = m = L c + 2, Prˆ t+ = M = m, C = c, X = x = Pr L < L + x, L < m + Pr L = L + x < m 2 + Pr L = m < L + x c + + Pr L = m = L + x c + 2, Pr ˆ t = M = m, C = c, X = x = Pr L < L, L < m + Pr L = L < m 2 + Pr L = m < L c + + Pr L = m = L c + 2, Pr ˆ t+ = M = m, C = c, X = x = Pr L < L + x, L < m + Pr L = L + x < m 2 + Pr L = m < L + x c + + Pr L = m = L + x c + 2, where for notational convenience, we sipped the subscript t in L t,j for all j. For x = 0, we see immediately that A.2 is 0, so it remains only to consider the case x =. For that case, A.2 simplifies to Pr L < L, L < m Pr L < L +, L < m + Pr L < L +, L < m Pr L < L, L < m + Pr L = 2 L + < m Pr L = L + < m + Pr L = m < c + L Pr L = m < L + + Pr L = m < c + L + Pr L = m < L + Pr L = m = c + 2 L + Pr L = m = L + = Pr L = L < m + Pr L = L < m + Pr L = 2 L + < m Pr L = L + < m Pr L = m = c + L + Pr L = m = L + Pr L = m = L + Pr L = m = L + = 2 c + 2 m a= Pr L = a = L + Pr L = a = L + + Pr L = m = c + 2 L + Pr L = m = L +. 8
19 DROPOUT PERTURBATIONS To prove that this is nonpositive, it is sufficient to show that Pr L = a = L + Pr L = a = L + 0 A.3 for all nonnegative integers a. Abbreviate L = L t, and L = L t,. If a = 0 or a > L, then the leftmost probability is 0 and A.3 holds. Alternatively, for a {,..., L }, the lefthand side of A.3 is equal to L a L a L a L α 2a α L+L 2a+, a so it is enough to show that L a L a L L a a 0. But this holds, because the lefthand side is equal to L!L! a!l a!a!l a +! = L!L! a!l a +!a!l a +! because L = L t, L t, = L by assumption. L!L! a!l a +!a!l a! = L a + L a + L!L! a!l a +!a!l a +! L L 0, A.3. Bounding Leader Changes on the Canonical Worstcase Sequence A.3.. PROOF OF LEMMA 3.8 Proof Within this proof, abbreviate L = L t,. Let V Binomial α, L and W Binomial α, L t, L, and assume L = V + W. Also define pv = PrV = v. Then Pr D = = = K = L 2 Prmin j K v=0 = L v=2 L j L + 2 = pv Prmin j pv 2 pv K L = v=0 L j v + W + 2 = pv Prmin j K pv Prmin L j v + W j = } {{ } fv L K v=2 = L j v + W + 2 pv 2 Prmin j L j v + W. A.4 9
20 VAN ERVEN KOTŁOWSKI WARMUTH Let S = { : L min j Lj } be the set of leaders. Then fv = K = Prmin j = Pr : min j L j L, V = v Pr : min j Lj L, V = v Prmin S V = v, L j L, V = v where the first inequality follows by the union bound, and the second because the latter event implies the former. We remar that S is never empty, so that the minimum is always welldefined. We also have L pv 2 v 2 α v 2 α L v+2 α 2 = pv L = v α v α L v α 2 vv L v + 2L v +. Let gv = max{ vv L v+2l v+ α 2 α 2 L v=0, 0}. Then, since g0 = g = 0, A.4 is at least as large as gv Prmin S V = v = for V = min S V. The function gv may be written as α 2 E[gV ] α 2 v gv = h vh 2 v for h v = L v + 2, h 2v = v L v +. For v, both h and h 2 are nonnegative, nondecreasing and convex, which implies that gv also has these properties. Moreover, since gv = 0 for v [0, ] all properties extend to all v 0. Suppose that E[V ] αl B for some B 0. Then Jensen s inequality and monotonicity of g imply that α 2 α 2 E[gV ] α 2 αl α 2 ge[v ] α 2 α 2 g α α B αl α α B + αl + B + 2αL + B + = αl B α B + 2 αl + B + 2 α B + αl + B + 2B + 3 ααl. Putting everything together, we find that α B + 2 αl + B + 2 α B + αl + B + Pr D t > = Pr D t = 2B + 3 ααl, A.5 so that it remains to find a good bound B. Let Y = L V Binomialα, L. Then [ ] [ ] E[V ] + αl = E max αl V E max αl V S [ = E max Y αl ] L ln K 2 =: B, where the last inequality follows by a standard argument for subgaussian random variables see, for example, Lemmas A.3 and A. in the textboo by CesaBianchi and Lugosi Plugging this bound into A.5 leads to the desired result. 20
Adaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More information1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09
1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch
More information17.3.1 Follow the Perturbed Leader
CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.
More informationAn Introduction to Gametheoretic Online Learning Tim van Erven
General Mathematics Colloquium Leiden, December 4, 2014 An Introduction to Gametheoretic Online Learning Tim van Erven Statistics Bayesian Statistics Frequentist Statistics Statistics Bayesian Statistics
More informationA DriftingGames Analysis for Online Learning and Applications to Boosting
A DriftingGames Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire
More informationFollowing the Perturbed Leader for Online Structured Learning
Alon Cohen ALONCOH@CSWEB.HAIFA.AC.IL University of Haifa, University of Haifa, Dept. of Computer Science, 31905 Haifa, Israel Tamir Hazan TAMIR@CS.HAIFA.AC.IL University of Haifa, University of Haifa,
More informationOnline Convex Optimization
E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting
More informationOnline Learning with Switching Costs and Other Adaptive Adversaries
Online Learning with Switching Costs and Other Adaptive Adversaries Nicolò CesaBianchi Università degli Studi di Milano Italy Ofer Dekel Microsoft Research USA Ohad Shamir Microsoft Research and the Weizmann
More informationFoundations of Machine Learning OnLine Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Foundations of Machine Learning OnLine Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationAchieving All with No Parameters: AdaNormalHedge
JMLR: Workshop and Conference Proceedings vol 40:1 19, 2015 Achieving All with No Parameters: AdaNormalHedge Haipeng Luo Department of Computer Science, Princeton University Robert E. Schapire Microsoft
More informationThe pnorm generalization of the LMS algorithm for adaptive filtering
The pnorm generalization of the LMS algorithm for adaptive filtering Jyrki Kivinen University of Helsinki Manfred Warmuth University of California, Santa Cruz Babak Hassibi California Institute of Technology
More informationOnline Algorithms: Learning & Optimization with No Regret.
Online Algorithms: Learning & Optimization with No Regret. Daniel Golovin 1 The Setup Optimization: Model the problem (objective, constraints) Pick best decision from a feasible set. Learning: Model the
More informationLINEAR PROGRAMMING WITH ONLINE LEARNING
LINEAR PROGRAMMING WITH ONLINE LEARNING TATSIANA LEVINA, YURI LEVIN, JEFF MCGILL, AND MIKHAIL NEDIAK SCHOOL OF BUSINESS, QUEEN S UNIVERSITY, 143 UNION ST., KINGSTON, ON, K7L 3N6, CANADA EMAIL:{TLEVIN,YLEVIN,JMCGILL,MNEDIAK}@BUSINESS.QUEENSU.CA
More informationMultiArmed Bandit Problems
Machine Learning Coms4771 MultiArmed Bandit Problems Lecture 20 Multiarmed Bandit Problems The Setting: K arms (or actions) Each time t, each arm i pays off a bounded realvalued reward x i (t), say
More informationMonotone multiarmed bandit allocations
JMLR: Workshop and Conference Proceedings 19 (2011) 829 833 24th Annual Conference on Learning Theory Monotone multiarmed bandit allocations Aleksandrs Slivkins Microsoft Research Silicon Valley, Mountain
More information1 Portfolio Selection
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture # Scribe: Nadia Heninger April 8, 008 Portfolio Selection Last time we discussed our model of the stock market N stocks start on day with
More informationTable 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.
Online PassiveAggressive Algorithms Koby Crammer Ofer Dekel Shai ShalevShwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationNumerical Analysis Lecture Notes
Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number
More informationNotes from Week 1: Algorithms for sequential prediction
CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 2226 Jan 2007 1 Introduction In this course we will be looking
More informationOptimal Strategies and Minimax Lower Bounds for Online Convex Games
Optimal Strategies and Minimax Lower Bounds for Online Convex Games Jacob Abernethy UC Berkeley jake@csberkeleyedu Alexander Rakhlin UC Berkeley rakhlin@csberkeleyedu Abstract A number of learning problems
More informationWeek 1: Introduction to Online Learning
Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / 2184189 CesaBianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider
More informationNotes on Factoring. MA 206 Kurt Bryan
The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor
More informationThe Multiplicative Weights Update method
Chapter 2 The Multiplicative Weights Update method The Multiplicative Weights method is a simple idea which has been repeatedly discovered in fields as diverse as Machine Learning, Optimization, and Game
More informationHow to Use Expert Advice
NICOLÒ CESABIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California
More informationarxiv:1112.0829v1 [math.pr] 5 Dec 2011
How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly
More informationStrongly Adaptive Online Learning
Amit Daniely Alon Gonen Shai ShalevShwartz The Hebrew University AMIT.DANIELY@MAIL.HUJI.AC.IL ALONGNN@CS.HUJI.AC.IL SHAIS@CS.HUJI.AC.IL Abstract Strongly adaptive algorithms are algorithms whose performance
More informationOnline Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes
Advanced Course in Machine Learning Spring 2011 Online Learning Lecturer: Shai ShalevShwartz Scribe: Shai ShalevShwartz In this lecture we describe a different model of learning which is called online
More informationTracking a Small Set of Experts by Mixing Past Posteriors
Journal of Machine Learning Research 3 (22) 363396 Submitted /; Published /2 Tracing a Small Set of Experts by Mixing Past Posteriors Olivier Bousquet Centre de Mathématiques Appliquées Ecole Polytechnique
More informationLOOKING FOR A GOOD TIME TO BET
LOOKING FOR A GOOD TIME TO BET LAURENT SERLET Abstract. Suppose that the cards of a well shuffled deck of cards are turned up one after another. At any timebut once only you may bet that the next card
More informationLecture 13: Martingales
Lecture 13: Martingales 1. Definition of a Martingale 1.1 Filtrations 1.2 Definition of a martingale and its basic properties 1.3 Sums of independent random variables and related models 1.4 Products of
More informationCourse Notes for Math 162: Mathematical Statistics Approximation Methods in Statistics
Course Notes for Math 16: Mathematical Statistics Approximation Methods in Statistics Adam Merberg and Steven J. Miller August 18, 6 Abstract We introduce some of the approximation methods commonly used
More informationThe Goldberg Rao Algorithm for the Maximum Flow Problem
The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }
More information2.3 Convex Constrained Optimization Problems
42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions
More informationNonparametric adaptive age replacement with a onecycle criterion
Nonparametric adaptive age replacement with a onecycle criterion P. CoolenSchrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK email: Pauline.Schrijner@durham.ac.uk
More information4.4 The recursiontree method
4.4 The recursiontree method Let us see how a recursion tree would provide a good guess for the recurrence = 3 4 ) Start by nding an upper bound Floors and ceilings usually do not matter when solving
More informationEfficient Algorithms for Online Game Playing and Universal Portfolio Management
Electronic Colloquium on Computational Complexity, Report No. 33 (2006) Efficient Algorithms for Online Game Playing and Universal Portfolio Management Amit Agarwal Elad Hazan Abstract A natural algorithmic
More informationA Potentialbased Framework for Online Multiclass Learning with Partial Feedback
A Potentialbased Framework for Online Multiclass Learning with Partial Feedback Shijun Wang Rong Jin Hamed Valizadegan Radiology and Imaging Sciences Computer Science and Engineering Computer Science
More informationThe Online Set Cover Problem
The Online Set Cover Problem Noga Alon Baruch Awerbuch Yossi Azar Niv Buchbinder Joseph Seffi Naor ABSTRACT Let X = {, 2,..., n} be a ground set of n elements, and let S be a family of subsets of X, S
More informationContinued Fractions and the Euclidean Algorithm
Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationApplied Algorithm Design Lecture 5
Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design
More informationCSC2420 Fall 2012: Algorithm Design, Analysis and Theory
CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms
More informationProbability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur
Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special DistributionsVI Today, I am going to introduce
More informationIntroduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
More informationBX in ( u, v) basis in two ways. On the one hand, AN = u+
1. Let f(x) = 1 x +1. Find f (6) () (the value of the sixth derivative of the function f(x) at zero). Answer: 7. We expand the given function into a Taylor series at the point x = : f(x) = 1 x + x 4 x
More informationThe Trip Scheduling Problem
The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems
More informationIntroduction to Flocking {Stochastic Matrices}
Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur  Yvette May 21, 2012 CRAIG REYNOLDS  1987 BOIDS The Lion King CRAIG REYNOLDS
More informationUCB REVISITED: IMPROVED REGRET BOUNDS FOR THE STOCHASTIC MULTIARMED BANDIT PROBLEM
UCB REVISITED: IMPROVED REGRET BOUNDS FOR THE STOCHASTIC MULTIARMED BANDIT PROBLEM PETER AUER AND RONALD ORTNER ABSTRACT In the stochastic multiarmed bandit problem we consider a modification of the
More informationRobbing the bandit: Less regret in online geometric optimization against an adaptive adversary.
Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. Varsha Dani Thomas P. Hayes November 12, 2005 Abstract We consider online bandit geometric optimization,
More informationSome Notes on Taylor Polynomials and Taylor Series
Some Notes on Taylor Polynomials and Taylor Series Mark MacLean October 3, 27 UBC s courses MATH /8 and MATH introduce students to the ideas of Taylor polynomials and Taylor series in a fairly limited
More informationOnline Convex Programming and Generalized Infinitesimal Gradient Ascent
Online Convex Programming and Generalized Infinitesimal Gradient Ascent Martin Zinkevich February 003 CMUCS03110 School of Computer Science Carnegie Mellon University Pittsburgh, PA 1513 Abstract Convex
More information1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let
Copyright c 2009 by Karl Sigman 1 Stopping Times 1.1 Stopping Times: Definition Given a stochastic process X = {X n : n 0}, a random time τ is a discrete random variable on the same probability space as
More informationMind the Duality Gap: Logarithmic regret algorithms for online optimization
Mind the Duality Gap: Logarithmic regret algorithms for online optimization Sham M. Kakade Toyota Technological Institute at Chicago sham@ttic.org Shai ShalevShartz Toyota Technological Institute at
More informationSome Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.
Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,
More informationChapter 4 Lecture Notes
Chapter 4 Lecture Notes Random Variables October 27, 2015 1 Section 4.1 Random Variables A random variable is typically a realvalued function defined on the sample space of some experiment. For instance,
More informationPrime Numbers. Chapter Primes and Composites
Chapter 2 Prime Numbers The term factoring or factorization refers to the process of expressing an integer as the product of two or more integers in a nontrivial way, e.g., 42 = 6 7. Prime numbers are
More informationGambling Systems and MultiplicationInvariant Measures
Gambling Systems and MultiplicationInvariant Measures by Jeffrey S. Rosenthal* and Peter O. Schwartz** (May 28, 997.. Introduction. This short paper describes a surprising connection between two previously
More information2 The Euclidean algorithm
2 The Euclidean algorithm Do you understand the number 5? 6? 7? At some point our level of comfort with individual numbers goes down as the numbers get large For some it may be at 43, for others, 4 In
More informationLecture Notes on Online Learning DRAFT
Lecture Notes on Online Learning DRAFT Alexander Rakhlin These lecture notes contain material presented in the Statistical Learning Theory course at UC Berkeley, Spring 08. Various parts of these notes
More informationOn the Union of Arithmetic Progressions
On the Union of Arithmetic Progressions Shoni Gilboa Rom Pinchasi August, 04 Abstract We show that for any integer n and real ɛ > 0, the union of n arithmetic progressions with pairwise distinct differences,
More informationMathematical Induction
Chapter 2 Mathematical Induction 2.1 First Examples Suppose we want to find a simple formula for the sum of the first n odd numbers: 1 + 3 + 5 +... + (2n 1) = n (2k 1). How might we proceed? The most natural
More information24. The Branch and Bound Method
24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NPcomplete. Then one can conclude according to the present state of science that no
More informationLog Wealth In each period, algorithm A performs 85% as well as algorithm B
1 Online Investing Portfolio Selection Consider n stocks Our distribution of wealth is some vector b e.g. (1/3, 1/3, 1/3) At end of one period, we get a vector of price relatives x e.g. (0.98, 1.02, 1.00)
More informationThe Relation between Two Present Value Formulae
James Ciecka, Gary Skoog, and Gerald Martin. 009. The Relation between Two Present Value Formulae. Journal of Legal Economics 15(): pp. 6174. The Relation between Two Present Value Formulae James E. Ciecka,
More informationMaster s Theory Exam Spring 2006
Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem
More information7 Gaussian Elimination and LU Factorization
7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method
More informationmax cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x
Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study
More information0.1 Phase Estimation Technique
Phase Estimation In this lecture we will describe Kitaev s phase estimation algorithm, and use it to obtain an alternate derivation of a quantum factoring algorithm We will also use this technique to design
More informationInformation Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture  17 ShannonFanoElias Coding and Introduction to Arithmetic Coding
More informationLoad balancing of temporary tasks in the l p norm
Load balancing of temporary tasks in the l p norm Yossi Azar a,1, Amir Epstein a,2, Leah Epstein b,3 a School of Computer Science, Tel Aviv University, Tel Aviv, Israel. b School of Computer Science, The
More informationRandom variables, probability distributions, binomial random variable
Week 4 lecture notes. WEEK 4 page 1 Random variables, probability distributions, binomial random variable Eample 1 : Consider the eperiment of flipping a fair coin three times. The number of tails that
More informationLinear Codes. Chapter 3. 3.1 Basics
Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length
More informationEfficient Minimax Strategies for Square Loss Games
Efficient Minimax Strategies for Square Loss Games Wouter M. Koolen Queensland University of Technology and UC Berkeley wouter.koolen@qut.edu.au Alan Malek University of California, Berkeley malek@eecs.berkeley.edu
More informationCOS 511: Foundations of Machine Learning. Rob Schapire Lecture #14 Scribe: Qian Xi March 30, 2006
COS 511: Foundations of Machine Learning Rob Schapire Lecture #14 Scribe: Qian Xi March 30, 006 In the previous lecture, we introduced a new learning model, the Online Learning Model. After seeing a concrete
More informationStochastic Inventory Control
Chapter 3 Stochastic Inventory Control 1 In this chapter, we consider in much greater details certain dynamic inventory control problems of the type already encountered in section 1.3. In addition to the
More informationLecture 3: Linear Programming Relaxations and Rounding
Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can
More informationBinomial distribution From Wikipedia, the free encyclopedia See also: Negative binomial distribution
Binomial distribution From Wikipedia, the free encyclopedia See also: Negative binomial distribution In probability theory and statistics, the binomial distribution is the discrete probability distribution
More informationSinglePeriod Balancing of Pay PerClick and PayPerView Online Display Advertisements
SinglePeriod Balancing of Pay PerClick and PayPerView Online Display Advertisements Changhyun Kwon Department of Industrial and Systems Engineering University at Buffalo, the State University of New
More informationThe Advantage Testing Foundation Solutions
The Advantage Testing Foundation 013 Problem 1 The figure below shows two equilateral triangles each with area 1. The intersection of the two triangles is a regular hexagon. What is the area of the union
More informationAnalysis of Algorithms I: Binary Search Trees
Analysis of Algorithms I: Binary Search Trees Xi Chen Columbia University Hash table: A data structure that maintains a subset of keys from a universe set U = {0, 1,..., p 1} and supports all three dictionary
More informationChapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling
Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NPhard problem. What should I do? A. Theory says you're unlikely to find a polytime algorithm. Must sacrifice one
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete
More information2.3 Scheduling jobs on identical parallel machines
2.3 Scheduling jobs on identical parallel machines There are jobs to be processed, and there are identical machines (running in parallel) to which each job may be assigned Each job = 1,,, must be processed
More informationUsing Generalized Forecasts for Online Currency Conversion
Using Generalized Forecasts for Online Currency Conversion Kazuo Iwama and Kouki Yonezawa School of Informatics Kyoto University Kyoto 6068501, Japan {iwama,yonezawa}@kuis.kyotou.ac.jp Abstract. ElYaniv
More informationNote on some explicit formulae for twin prime counting function
Notes on Number Theory and Discrete Mathematics Vol. 9, 03, No., 43 48 Note on some explicit formulae for twin prime counting function Mladen VassilevMissana 5 V. Hugo Str., 4 Sofia, Bulgaria email:
More informationOffline sorting buffers on Line
Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com
More informationImportant Probability Distributions OPRE 6301
Important Probability Distributions OPRE 6301 Important Distributions... Certain probability distributions occur with such regularity in reallife applications that they have been given their own names.
More informationLARGE CLASSES OF EXPERTS
LARGE CLASSES OF EXPERTS Csaba Szepesvári University of Alberta CMPUT 654 Email: szepesva@ualberta.ca UofA, October 31, 2006 OUTLINE 1 TRACKING THE BEST EXPERT 2 FIXED SHARE FORECASTER 3 VARIABLESHARE
More informationElementary Number Theory We begin with a bit of elementary number theory, which is concerned
CONSTRUCTION OF THE FINITE FIELDS Z p S. R. DOTY Elementary Number Theory We begin with a bit of elementary number theory, which is concerned solely with questions about the set of integers Z = {0, ±1,
More informationThe van Hoeij Algorithm for Factoring Polynomials
The van Hoeij Algorithm for Factoring Polynomials Jürgen Klüners Abstract In this survey we report about a new algorithm for factoring polynomials due to Mark van Hoeij. The main idea is that the combinatorial
More informationECEN 5682 Theory and Practice of Error Control Codes
ECEN 5682 Theory and Practice of Error Control Codes Convolutional Codes University of Colorado Spring 2007 Linear (n, k) block codes take k data symbols at a time and encode them into n code symbols.
More information1 Limiting distribution for a Markov chain
Copyright c 2009 by Karl Sigman Limiting distribution for a Markov chain In these Lecture Notes, we shall study the limiting behavior of Markov chains as time n In particular, under suitable easytocheck
More informationBasic building blocks for a tripledouble intermediate format
Basic building blocks for a tripledouble intermediate format corrected version) Christoph Quirin Lauter February 26, 2009 Abstract The implementation of correctly rounded elementary functions needs high
More informationIntroduction to Online Optimization
Princeton University  Department of Operations Research and Financial Engineering Introduction to Online Optimization Sébastien Bubeck December 14, 2011 1 Contents Chapter 1. Introduction 5 1.1. Statistical
More informationConstrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing
Communications for Statistical Applications and Methods 2013, Vol 20, No 4, 321 327 DOI: http://dxdoiorg/105351/csam2013204321 Constrained Bayes and Empirical Bayes Estimator Applications in Insurance
More informationFEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL
FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL STATIsTICs 4 IV. RANDOm VECTORs 1. JOINTLY DIsTRIBUTED RANDOm VARIABLEs If are two rom variables defined on the same sample space we define the joint
More informationU.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009. Notes on Algebra
U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009 Notes on Algebra These notes contain as little theory as possible, and most results are stated without proof. Any introductory
More informationLecture 4 Online and streaming algorithms for clustering
CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 Online kclustering To the extent that clustering takes place in the brain, it happens in an online
More informationTangent and normal lines to conics
4.B. Tangent and normal lines to conics Apollonius work on conics includes a study of tangent and normal lines to these curves. The purpose of this document is to relate his approaches to the modern viewpoints
More informationChapter 6. Number Theory. 6.1 The Division Algorithm
Chapter 6 Number Theory The material in this chapter offers a small glimpse of why a lot of facts that you ve probably nown and used for a long time are true. It also offers some exposure to generalization,
More information