
VAN ERVEN KOTŁOWSKI WARMUTH

Algorithm 1: Follow the Leader with Binarized Dropout Perturbation (BDP)
Parameter: dropout probability $\alpha \in (0,1)$.
Initialization: $\tilde L_{0,k} = 0$ for all $k = 1, \ldots, K$.
for $t = 1$ to $T$ do
    Pick expert $\hat k_t = \operatorname{argmin}_k \tilde L_{t-1,k}$, with ties broken uniformly at random.
    Observe loss vector $l_t$ and suffer loss $l_{t,\hat k_t}$.
    Draw $\tilde l_{t,k}$ according to (2.1), independently for all $k$.
    Update $\tilde L_{t,k} = \tilde L_{t-1,k} + \tilde l_{t,k}$ for all $k$.
end

cumulative loss and the cumulative loss of the best expert chosen in hindsight:
\[ R_T = E\Big[\sum_{t=1}^T l_{t,\hat k_t}\Big] - L^*, \qquad \text{where } L^* = \min_k L_{T,k}. \]
The expectation is taken with respect to the random choices of the algorithm, i.e. $E\big[\sum_{t=1}^T l_{t,\hat k_t}\big] = \sum_{t=1}^T w_t \cdot l_t$. For the process that generates the losses, we will consider two different models: in the worst-case setting, studied in Section 3, we need to guarantee small regret for every possible sequence of losses $l_1, \ldots, l_T \in [0,1]^K$, including the sequence that is most difficult for the algorithm; in the independent, identically distributed (i.i.d.) setting, considered in Section 4, we assume the loss vectors $l_t \in [0,1]^K$ all come from a fixed but unknown probability distribution, which makes it possible to give stronger guarantees on the regret.

Follow the Perturbed Leader  The general class of Follow the Perturbed Leader (FPL) algorithms (Kalai and Vempala, 2005; Hannan, 1957) is defined by the choice
\[ \hat k_t = \operatorname{argmin}_k \big\{ L_{t-1,k} + \xi_{t,k} \big\}, \]
where $\xi_{t,k}$ is an additive random perturbation of the cumulative losses, which is chosen independently for every expert. Kalai and Vempala (2005) consider exponential and uniform distributions for $\xi_{t,k}$ that depend on a parameter $\eta > 0$, and in the RWP algorithm of Devroye et al. (2013), $\xi_{t,k} \sim \mathrm{Binomial}(1/2, t) - t/2$ is a binomial variable on $t$ outcomes with success probability $1/2$ that is centered around its mean.

Binarized Dropout Perturbations  We introduce an instance of FPL based on binarized dropout perturbations (BDP).
For any dropout probability $\alpha \in (0,1)$, the binarized dropout perturbation of loss $l_{t,k}$ is defined as
\[ \tilde l_{t,k} = \begin{cases} 1 & \text{with probability } (1-\alpha)\, l_{t,k}, \\ 0 & \text{otherwise.} \end{cases} \tag{2.1} \]
Let $\tilde L_{T,k} = \sum_{t=1}^T \tilde l_{t,k}$ be the cumulative BDP loss for expert $k$. Then the BDP algorithm (see Algorithm 1) chooses
\[ \hat k_t = \operatorname{argmin}_k \tilde L_{t-1,k}, \]
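The BDP algorithm can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the function name is ours, and it assumes the reconstruction of (2.1) above, i.e. that a unit of perturbed loss is drawn with probability $(1-\alpha)\, l_{t,k}$.

```python
import random

def bdp(loss_vectors, alpha=0.5, seed=0):
    """Follow the Leader with Binarized Dropout Perturbation (sketch).

    loss_vectors[t][k] is the loss of expert k in round t, in [0, 1].
    alpha is the dropout probability; the text notes that alpha = 1/2
    already works well without tuning.  Returns the realized cumulative
    loss of the algorithm.
    """
    rng = random.Random(seed)
    K = len(loss_vectors[0])
    perturbed = [0] * K  # cumulative binarized-dropout losses (tilde-L)
    total = 0.0
    for l_t in loss_vectors:
        # follow the leader on the perturbed losses, ties broken at random
        m = min(perturbed)
        leader = rng.choice([k for k in range(K) if perturbed[k] == m])
        total += l_t[leader]
        for k in range(K):
            # binarized dropout perturbation (2.1): add a unit of loss
            # with probability (1 - alpha) * l_t[k]
            if rng.random() < (1 - alpha) * l_t[k]:
                perturbed[k] += 1
    return total
```

Note the sparse updates mentioned below: when $l_{t,k} = 0$ the inner test never fires, so experts with zero loss require no state change.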

DROPOUT PERTURBATIONS

with ties broken uniformly at random. BDP is conceptually very simple. Computationally, it might be of interest that it does sparse updates, which only operate on non-zero features: if $l_{t,k} = 0$, then $\tilde l_{t,k} = 0$ as well, so, if one uses the same variable to store $L_{t,k}$ and $\tilde L_{t,k}$, then no update to the internal state of the algorithm is required. There is a parameter $\alpha$, but this parameter only affects the constants in the theorems, so simply setting it to $\alpha = 1/2$ without any tuning already gives good performance.

BDP may be viewed as an instance of FPL with the additive data-dependent perturbations $\xi_{t,k} = \tilde L_{t-1,k} - L_{t-1,k}$. If the losses are binary, $\alpha = 1/2$ and the cumulative loss grows approximately as $L_{t,k} \approx t$ for all experts, then $\xi_{t,k}$ is approximately distributed as $\mathrm{Binomial}(1/2, t) - t$. Consequently, in this special case BDP is very similar to RWP, because shifts by a constant do not change the minimizer $\hat k_t$.

Standard Dropout Perturbations  Dropout is normally defined as dropping hidden units or features in a neural network while training the parameters of the network on batch data using Gradient Descent (Hinton et al., 2012; Wang and Manning, 2013; Wager et al., 2013). In each iteration, hidden units or features are dropped with a fixed probability. We are in the single neuron case, in which each expert may be identified with a feature, and the standard dropout perturbations (without binarization) become¹:
\[ l^s_{t,k} = \begin{cases} l_{t,k} & \text{with probability } 1-\alpha, \\ 0 & \text{otherwise.} \end{cases} \]
For binary losses, standard dropout perturbation is the same as BDP, but for general losses in the interval $[0,1]$ they are different. We initially tried to prove our results for standard dropout perturbation, but it turns out that this approach cannot achieve the right dependence on the number of experts:

Theorem 2.1 Consider the FPL algorithm based on standard dropout perturbation with parameter $\alpha \in (0,1)$.
Then, for any $B > 0$ and any $K \ge 2$, there exists a loss sequence $l_1, \ldots, l_T$ with $l_t \in [0,1]^K$ for which $L^* \le B$, while the algorithm incurs regret at least $\frac{1}{2}(K-1)$.

Proof  Take $\epsilon$ and $\delta$ to be very small positive numbers, specified later. The sequence is constructed as follows. First, all experts except expert 1 incur losses $\epsilon$ for a number of iterations that is sufficiently large that, with probability (w.p.) at least $1-\delta$, the algorithm will choose expert 1 as the leader. Note that the required number of iterations depends on $\delta$, $K$ and $\alpha$, but it does not depend on $\epsilon$. Then, expert 1 incurs unit loss (and so does the algorithm w.p. $1-\delta$), while all other experts incur no loss. Next, all experts except expert 2 incur losses $\epsilon$ for sufficiently many iterations that, w.p. at least $1-\delta$, the algorithm will choose expert 2 as the leader. Then, expert 2 incurs unit loss (and so does the algorithm w.p. $1-\delta$), while all other experts incur no loss. This process is then repeated for $k = 3, 4, \ldots, K$. At the end of the game, the algorithm has incurred loss

¹ The loss $w \cdot l_t$ in round $t$ that we consider here is linear in the weights $w$. For a general convex loss of the form $f_t(w \cdot x_t)$, where $x_t$ is the feature vector at round $t$, learning the weights $w$ is often achieved by locally linearizing the loss with a first-order Taylor expansion: $f_t(w \cdot x_t) \approx f_t(w_t \cdot x_t) + w \cdot l_t - w_t \cdot l_t$ for the surrogate loss $l_{t,k} = f_t'(w_t \cdot x_t)\, x_{t,k}$. See the discussion in Section 4.1 of Kivinen and Warmuth (1997) and Section 2.4 of Shalev-Shwartz (2011). Thus, experts correspond to features, and setting the $k$-th feature $x_{t,k}$ to 0 is equivalent to setting $l_{t,k}$ to 0.
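The difference between the two perturbations is easy to check empirically. The sketch below is ours, assuming losses are kept with probability $1-\alpha$: both perturbations have the same expectation, $(1-\alpha)\,l$, but for a non-binary loss their supports differ, which is exactly the distinction that Theorem 2.1 exploits.

```python
import random

def standard_dropout(l, alpha, rng):
    # standard dropout: keep the raw loss with probability 1 - alpha
    return l if rng.random() < 1 - alpha else 0.0

def binarized_dropout(l, alpha, rng):
    # binarized dropout (2.1): emit a unit loss w.p. (1 - alpha) * l
    return 1.0 if rng.random() < (1 - alpha) * l else 0.0

rng = random.Random(0)
n, alpha, loss = 100_000, 0.5, 0.5
s = sum(standard_dropout(loss, alpha, rng) for _ in range(n)) / n
b = sum(binarized_dropout(loss, alpha, rng) for _ in range(n)) / n
# both empirical means are close to (1 - alpha) * loss = 0.25,
# but standard dropout takes values in {0, 0.5}, binarized in {0, 1}
```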

at least $(K-1)(1-\delta)$, while expert $K$ incurred no more loss than $T\epsilon$. Taking $\delta$ small enough that $\delta < 1/(2K)$, the expected loss of the algorithm is at least $\frac{1}{2}(K-1)$. Finally, we take $\epsilon$ small enough to satisfy $0 < T\epsilon \le B$, which gives $L^* \le B$.

For the case that $L^* \le B \le \ln K$, a better bound of order $O(\sqrt{L^*\ln K} + \ln K) = O(\ln K)$ is possible. So we see that standard dropout leads to a suboptimal algorithm, and therefore we use binarized dropout for the rest of the paper.

3. Regret Bounded in Terms of the Cumulative Loss of the Best Expert

We will prove that the regret for the BDP algorithm is bounded by $O(\sqrt{L^*\ln K} + \ln K)$:

Theorem 3.1 For any sequence of losses taking values in $[0,1]$ with $\min_k L_{T,k} = L^*$, the regret of the BDP algorithm is bounded by
\[ R_T \le \frac{4\sqrt{2L^*\ln(3K)}}{\alpha(1-\alpha)} + \frac{3\ln(1+L^*)}{\alpha(1-\alpha)} + \frac{\ln(3K)}{(1-\alpha)^2}. \]

Since RWP is similar to BDP and is known to have regret bounded by $O(\sqrt{T\ln K})$, one might try to show that the regret for RWP satisfies the stronger bound $O(\sqrt{L^*\ln K} + \ln K)$ as well, but this turns out to be impossible:

Theorem 3.2 For every $T > 0$, there exists a loss sequence $l_1, \ldots, l_T$ for which $L^* = 0$, while RWP suffers $\Omega(\sqrt{T})$ regret.

This theorem, proved in Section A.1 of the appendix, shows that there is a fundamental difference between RWP and BDP, and that the data-dependent perturbations used in BDP are crucial to obtaining a bound in terms of $L^*$. We now turn to proving Theorem 3.1, starting with the observation that it is in fact sufficient to prove the theorem only for binary losses:

Lemma 3.3 Let $a \ge 0$ and $b$ be any constants. If the regret for the BDP algorithm is bounded by
\[ R_T \le a\sqrt{L^*} + b \tag{3.1} \]
for all sequences of binary losses, then it also satisfies (3.1) for any sequence of losses with values in the whole interval $[0,1]$.

Proof  Let $l_1, \ldots, l_T$ be an arbitrary sequence of loss vectors with components $l_{t,k}$ taking values in the whole interval $[0,1]$. We need to show that the regret for this sequence satisfies (3.1).
To this end, let $l'_1, \ldots, l'_T$ be an alternative sequence of losses that is generated randomly from $l_1, \ldots, l_T$ by letting
\[ l'_{t,k} = \begin{cases} 0 & \text{with probability } 1 - l_{t,k}, \\ 1 & \text{with probability } l_{t,k}, \end{cases} \]
independently for all $t$ and $k$. Accordingly, let a prime on any quantity denote that it is evaluated on the alternative losses. For example, $L'_{T,k} = \sum_{t=1}^T l'_{t,k}$ is the cumulative alternative loss for expert $k$. Let $w_t$ be the probability distribution on experts induced by the internal randomization of the

BDP algorithm, i.e. $w_{t,k} = \Pr(\hat k_t = k)$, and let $w'_t$ be its counterpart on the alternative losses. Then, because the BDP algorithm internally generates a binary sequence of losses with the same probabilities as $l'_{t,k}$, we have $E_{l'_1,\ldots,l'_{t-1}}[w'_t] = w_t$, and, independently, $E_{l'_t}[l'_t] = l_t$. Consequently, $E[w'_t \cdot l'_t] = w_t \cdot l_t$ and $E[\tilde L'_T] = \tilde L_T$, where $\tilde L_T$ and $\tilde L'_T$ are the expected (with respect to the internal randomization) cumulative losses of the algorithm on the original and the alternative losses, respectively. Applying (3.1) for the alternative losses and taking expectations, it now follows by Jensen's inequality that
\[ E[\tilde L'_T] - E\big[\min_k L'_{T,k}\big] \le a\, E\Big[\sqrt{\min_k L'_{T,k}}\Big] + b \le a \sqrt{E\big[\min_k L'_{T,k}\big]} + b, \]
and hence, since $E[\min_k L'_{T,k}] \le \min_k E[L'_{T,k}] = \min_k L_{T,k}$,
\[ \tilde L_T - \min_k L_{T,k} \le a\sqrt{\min_k L_{T,k}} + b. \]

Thus, from now on, we assume the losses are binary, i.e. $l_{t,k} \in \{0,1\}$. Following the standard approach for FPL algorithms of Kalai and Vempala (2005), we start by bounding the regret by applying the so-called be-the-leader lemma to the perturbed losses. Like in the analysis of Devroye et al. (2013), the result simplifies, because we can relate the expectation of the perturbed losses back to the original losses:

Lemma 3.4 (Be-the-leader) For any sequence of loss vectors $l_1, \ldots, l_T$, the regret for the BDP algorithm is bounded by
\[ R_T \le \frac{1}{1-\alpha} \sum_{t=1}^T E\big[\tilde l_{t,\hat k_t} - \tilde l_{t,\hat k_{t+1}}\big], \]
where $\tilde l_1, \ldots, \tilde l_T$ are the dropout perturbed losses and $\hat k_t$ is the random expert chosen by the BDP algorithm based on the perturbed losses before trial $t$.

Proof  The expected regret is given by
\[ R_T = \sum_{t=1}^T E\big[l_{t,\hat k_t}\big] - \min_k L_{T,k} = \frac{1}{1-\alpha} E\Big[\sum_{t=1}^T \tilde l_{t,\hat k_t}\Big] - \frac{1}{1-\alpha} \min_k E\big[\tilde L_{T,k}\big] \le \frac{1}{1-\alpha} E\Big[\sum_{t=1}^T \tilde l_{t,\hat k_t}\Big] - \frac{1}{1-\alpha} E\Big[\min_k \tilde L_{T,k}\Big] \le \frac{1}{1-\alpha} \sum_{t=1}^T E\big[\tilde l_{t,\hat k_t} - \tilde l_{t,\hat k_{t+1}}\big], \]
where the second equality holds because $\hat k_t$ only depends on $\tilde l_1, \ldots, \tilde l_{t-1}$, which are independent of $\tilde l_t$, and $E[\tilde l_{t,k}] = (1-\alpha)\, l_{t,k}$; and the last inequality is from the be-the-leader lemma, applied to the perturbed losses, which shows that $\sum_{t=1}^T \tilde l_{t,\hat k_{t+1}} \ge \min_k \tilde L_{T,k}$ (Kalai and Vempala, 2005, Equation 4; Cesa-Bianchi and Lugosi, 2006, Lemma 3.1).
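The random rounding behind Lemma 3.3 can be sketched directly. The function below is an illustration of ours, not part of the paper's code: each loss in $[0,1]$ is replaced by a Bernoulli draw with the same mean, so cumulative expected losses are preserved.

```python
import random

def binarize(loss_vectors, seed=0):
    """Random rounding used in the reduction to binary losses (Lemma 3.3).

    Each loss l in [0, 1] becomes 1 with probability l and 0 otherwise,
    independently over rounds and experts, so expectations match.
    """
    rng = random.Random(seed)
    return [[1 if rng.random() < l else 0 for l in l_t] for l_t in loss_vectors]

# empirical check that per-expert average losses are preserved:
losses = [[0.3, 0.7] for _ in range(10_000)]
totals = [0, 0]
for l_t in binarize(losses):
    totals[0] += l_t[0]
    totals[1] += l_t[1]
# totals[0] / 10_000 is close to 0.3, totals[1] / 10_000 close to 0.7
```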

Every single term in the bound from Lemma 3.4 may be bounded further as follows:
\[ E\big[\tilde l_{t,\hat k_t} - \tilde l_{t,\hat k_{t+1}}\big] = \sum_{k=1}^K \Pr\big(\hat k_t = k \ne \hat k_{t+1},\ \tilde l_{t,k} = l_{t,k}\big)\, E\big[\tilde l_{t,\hat k_t} - \tilde l_{t,\hat k_{t+1}} \,\big|\, \hat k_t = k \ne \hat k_{t+1},\ \tilde l_{t,k} = l_{t,k}\big] \le (1-\alpha) \sum_{k=1}^K \Pr\big(\hat k_t = k \ne \hat k_{t+1} \,\big|\, \tilde l_{t,k} = l_{t,k}\big)\, l_{t,k}. \tag{3.2} \]
In the case considered by Kalai and Vempala (2005), this expression can be easily controlled, because, for their specific choice of perturbations, the conditional probability $P_{m,k} = \Pr(\hat k_{t+1} \ne \hat k_t = k \mid \tilde l_{t,k} = l_{t,k},\ \min_j \tilde L_{t,j} = m)$ is small, uniformly for all $m$ (see the second display on p. 298 of their paper), which is easy to show, because $P_{m,k}$ depends only on the cumulative loss and the perturbations for a single expert. Thus their choice of perturbations is what makes their analysis simple. Unfortunately, in our case (and also for the RWP algorithm of Devroye et al., 2013) this simple approach breaks down, because, for the perturbations we consider, the probability $P_{m,k}$ may be large for some $m$ (although such $m$ will have small probability). We therefore cannot use a uniform bound on $P_{m,k}$, and as a consequence our proof is more complicated. We also cannot follow the line of reasoning of Devroye et al., because it relies on the fact that, at any time $t$, the perturbation for every expert has a standard deviation of order $\sqrt{t}$. For our algorithm this is only true if the cumulative losses for the experts grow at least at a linear rate $ct$ for some absolute constant $c > 0$, which need not be the case in general. Put differently, if the expert losses grow sublinearly, then our perturbations are too small to use the approach of Devroye et al. Thus, we cannot use any of the existing approaches to control (3.2) for arbitrary losses, and, instead, we proceed as follows:

1. Given any integers $L_1, \ldots, L_K$, we find a canonical worst-case sequence of losses, which maximizes the regret of the BDP algorithm among all binary loss sequences of arbitrary length $T$ such that $L_{T,k} = L_k$ for all experts $k$.
2. We bound (3.2) on this sequence.

3.1.
The Canonical Worst-case Sequence

We will show that the worst-case regret is achieved for a sequence of unit vectors: let $u_k \in \{0,1\}^K$ denote the unit vector $(0, \ldots, 0, 1, 0, \ldots, 0)$, where the 1 is in the $k$-th position. This restriction to unit vectors has been used before (Abernethy and Warmuth, 2010; Koolen and Warmuth, 2010), but here we go one step further. We build a canonical worst-case sequence of unit vectors from the following alternating schemes:

Definition 3.5 For any $k \in \{1, \ldots, K\}$, we call the sequence of loss vectors $u_k, u_{k+1}, \ldots, u_K$ a $k..K$-alternating scheme.

To simplify the notation in the proofs, we will henceforth assume that the experts are numbered from best to worst, i.e. $L_1 \le \ldots \le L_K$. In the canonical worst-case sequence, alternating schemes are repeated as follows:

Definition 3.6 (Canonical Worst-case Sequence) Let $L_1 \le \ldots \le L_K$ be nonnegative integers. Among all sequences of binary losses of arbitrary length $T$ such that $L_{T,k} = L_k$ for all $k$, we call the following sequence the canonical worst-case sequence: first repeat the $1..K$-alternating scheme $L_1$ times; then repeat the $2..K$-alternating scheme $L_2 - L_1$ times; then repeat the $3..K$-alternating scheme $L_3 - L_2$ times; and so on, until finally we repeat the $K..K$-alternating scheme (which consists of just the unit vector $u_K$) $L_K - L_{K-1}$ times.

Note that the canonical worst-case sequence always consists of $L_K$ alternating schemes. For example, the canonical worst-case sequence for cumulative losses $L_1 = 2$, $L_2 = 3$, $L_3 = 6$ consists of the following $L_3 = 6$ alternating schemes:
\[ \underbrace{u_1, u_2, u_3}_{1..3\text{-alt. scheme}},\ \underbrace{u_1, u_2, u_3}_{1..3\text{-alt. scheme}},\ \underbrace{u_2, u_3}_{2..3\text{-alt. scheme}},\ \underbrace{u_3}_{3..3\text{-a.s.}},\ \underbrace{u_3}_{3..3\text{-a.s.}},\ \underbrace{u_3}_{3..3\text{-a.s.}} \]

For the BDP algorithm, the following three operations do not increase the regret of a sequence of binary losses (see Section A.2 of the appendix for proofs):
1. If more than one expert gets loss in a single trial, then we split that trial into $K$ trials in which we hand out the losses for the experts in turn (Lemma A.1).
2. If, in some trial, all experts get loss 0, then we remove that trial (Lemma A.2).
3. If, in two consecutive trials $t$ and $t+1$, $l_t = u_k$ and $l_{t+1} = u_{k'}$ for some experts such that $L_{t-1,k'} \le L_{t-1,k}$, then we swap the trials, setting $l_t = u_{k'}$ and $l_{t+1} = u_k$ (Lemma A.3).

These operations can be used to turn any sequence of binary losses into the canonical worst-case sequence:

Lemma 3.7 Let $L_1 \le \ldots \le L_K$ be nonnegative integers.
Among all sequences of binary losses of arbitrary length $T$ such that $L_{T,k} = L_k$ for all $k$, the regret of the BDP algorithm is (non-uniquely) maximized by its regret on the canonical worst-case sequence.

Proof  Start with an arbitrary sequence of binary losses such that $L_{T,k} = L_k$. By repeated application of operations 1 and 2, we turn this loss sequence into a sequence of unit vectors. We then repeatedly apply the swapping operation 3 to any trials $t$ and $t+1$ such that $l_t = u_k$ and $l_{t+1} = u_{k'}$ for which we have strict inequality $L_{t-1,k} > L_{t-1,k'}$. We do this until no such consecutive trials remain. This arrives at the canonical worst-case sequence, except that the unit vectors within each alternating scheme may be permuted. However, by applying property 3 for the case $L_{t-1,k} = L_{t-1,k'}$, we can sort each alternating scheme into the canonical order.

Together with Lemma 3.3, the previous lemma implies that we just need to bound the regret of the BDP algorithm on the canonical worst-case loss sequence.
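The construction in Definition 3.6 is mechanical enough to write down directly. The helper below is our own sketch; experts are 0-indexed and L is the nondecreasing list $L_1 \le \ldots \le L_K$.

```python
def canonical_worst_case(L):
    """Build the canonical worst-case sequence of Definition 3.6 (sketch).

    L: nondecreasing nonnegative cumulative losses L_1 <= ... <= L_K,
    with experts numbered from best to worst.  Returns a list of unit
    loss vectors: the k..K-alternating scheme u_k, ..., u_K is repeated
    L_k - L_{k-1} times, for k = 1, ..., K.
    """
    K = len(L)
    seq = []
    prev = 0
    for k in range(K):                      # 0-based expert index
        for _ in range(L[k] - prev):        # repetitions of the scheme
            for j in range(k, K):           # one k..K-alternating scheme
                seq.append([1 if i == j else 0 for i in range(K)])
        prev = L[k]
    return seq
```

For the worked example $L_1 = 2$, $L_2 = 3$, $L_3 = 6$ this produces 6 alternating schemes (11 unit vectors), and the column sums recover the prescribed cumulative losses.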

3.2. Regret on the Canonical Worst-case Sequence

By Lemma 3.4 and (3.2), the regret for the BDP algorithm is bounded by
\[ R_T \le \sum_{t=1}^T \sum_{k=1}^K \Pr\big(\hat k_t = k \ne \hat k_{t+1} \,\big|\, \tilde l_{t,k} = l_{t,k}\big)\, l_{t,k}. \tag{3.3} \]
On the canonical worst-case sequence, $l_{t,k}$ will be zero for all but a single expert $k$, which we can exploit to abbreviate notation: let $\hat k_t^+$ denote the leader (with ties still broken uniformly at random) on the perturbed cumulative losses if we would add 1 unit of loss to the perturbed cumulative loss $\tilde L_{t-1,\hat k_t}$ of the expert selected by the BDP algorithm; and define the event
\[ A_t = \{\hat k_t = k \ne \hat k_t^+\} \quad \text{for the expert } k \text{ such that } l_{t,k} = 1. \]
Then (3.3) simplifies to
\[ R_T \le \sum_{t=1}^T \Pr(A_t). \tag{3.4} \]

We split the $L_{T,K}$ alternating schemes of the canonical worst-case sequence into two regimes, and use different techniques to bound $\Pr(A_t)$ in each case. Let $r \le L_{T,K} - L_{T,1}$ be some nonnegative integer, which we will optimize later:
1. The first regime consists of the initial $L_{T,1} + r$ alternating schemes, where the difference in the cumulative loss between all experts is relatively small (at most $r$).
2. The second regime consists of the remaining $L_{T,K} - r - L_{T,1}$ alternating schemes. Now any expert that is still receiving losses will have a cumulative loss that is at least $r$ larger than the cumulative loss of expert 1, and $r$ will be chosen large enough to ensure that, with high probability, no such expert will be the leader.

The First Regime  Define the lead pack
\[ D_t = \Big\{k : \tilde L_{t-1,k} < \min_j \tilde L_{t-1,j} + 2\Big\}, \]
which is the random set of experts that might potentially be the leader in trials $t$ or $t+1$. As observed by Devroye et al. (2013), there can only be a leader change at time $t$ if $D_t$ contains more than one expert. Our goal will be to control the probability that this happens using the following lemma (proved in Section A.3.1), which is an adaptation of Lemma 3 of Devroye et al.:

Lemma 3.8 Suppose $L_{t,k} \ge L > 0$ for all $k$. Then
\[ \Pr(|D_t| > 1) \le \frac{1}{\alpha(1-\alpha)}\left(\sqrt{\frac{2\ln K}{L}} + \frac{3}{L}\right). \]
Unfortunately, we cannot apply Lemma 3.8 immediately: in the first regime there are $K$ trials for each of the first $L_{T,1}$ repetitions of the $1..K$-alternating scheme, and applying Lemma 3.8 to all of these trials would already give a bound of order $K\sqrt{L_{T,1}\ln K}$, which overshoots the right rate by a factor of $K$. To avoid this factor, we only apply the lemma once per alternating scheme, which is made possible by the following result:

Lemma 3.9 Suppose an alternating scheme starts at trial $s$ and ends at trial $v$. Then
\[ \sum_{t=s}^{v} \Pr(A_t) \le \frac{v-s+1}{1-\alpha}\, \Pr(A_{v+1}) \le \frac{1}{1-\alpha}\, \Pr\big(\hat k_{v+1} \ne \hat k_{v+1}^+\big) \le \frac{1}{1-\alpha}\, \Pr\big(|D_{v+1}| > 1\big). \]

Proof  The first inequality follows from Lemmas A.4 and A.5 in Section A.3.2 of the appendix, which show that
\[ \Pr(A_s) \le \Pr(A_{s+1}) \le \ldots \le \Pr(A_v) \le \frac{1}{1-\alpha}\, \Pr(A_{v+1}). \]
Let $k = K - (v - s)$ be the first expert in the alternating scheme. At time $v+1$ a new alternating scheme will start, so the situation is entirely symmetrical between all experts that are part of the current alternating scheme: $\Pr(A_{v+1}) = \Pr(\hat k_{v+1} = k' \ne \hat k_{v+1}^+)$ for all $k' \in \{k, \ldots, K\}$. This implies the second inequality:
\[ (v-s+1)\, \Pr(A_{v+1}) = \sum_{k'=k}^{K} \Pr\big(\hat k_{v+1} = k' \ne \hat k_{v+1}^+\big) = \Pr\big(\hat k_{v+1} \in \{k, \ldots, K\},\ \hat k_{v+1} \ne \hat k_{v+1}^+\big) \le \Pr\big(\hat k_{v+1} \ne \hat k_{v+1}^+\big). \]
Finally, the last inequality follows by definition of the lead pack.

Applying either Lemma 3.8 or the trivial bound $\Pr(|D_t| > 1) \le 1$ only to the trials $v+1$ that immediately follow an alternating scheme, we obtain the following result for the first regime of the canonical worst-case sequence (see Section A.3.3 for the proof):

Lemma 3.10 Suppose the first regime ends with trial $v$. Then
\[ \sum_{t=1}^{v} \Pr(A_t) \le \frac{2\sqrt{2L_{T,1}\ln K} + 3\ln(1+L_{T,1})}{\alpha(1-\alpha)} + \frac{r+1}{1-\alpha}. \]

The Second Regime  We proceed to bound the probability of the event $A_t$ during the second regime of the canonical worst-case sequence, using that $\Pr(A_t) \le \Pr(\hat k_t = k)$ for the expert $k$ such that $l_{t,k} = 1$. This probability will be easy to control, because during the second regime the difference in cumulative loss between expert 1 and the experts that are still receiving losses is sufficiently large that the BDP algorithm will prefer expert 1 over these experts with high probability.

Lemma 3.11 Suppose the second regime starts in trial $s$. Then
\[ \sum_{t=s}^{T} \Pr(A_t) \le K\, \frac{2L_{T,1} + 3r}{2(1-\alpha)^2 r}\, \exp\!\left(-\frac{2(1-\alpha)^2 r^2}{2L_{T,1} + r}\right). \]

Proof  Let $t$ be a trial during the $L$-th alternating scheme, for some $L$ in the second regime, and let $k_t$ be the expert that gets loss 1 in trial $t$. Then, by Hoeffding's inequality,
\[ \Pr(A_t) \le \Pr(\hat k_t = k_t) \le \Pr\big(\tilde L_{t-1,k_t} \le \tilde L_{t-1,1}\big) \le \exp\!\left(-\frac{2(1-\alpha)^2 (L-1-L_{T,1})^2}{2L_{T,1} + (L-1-L_{T,1})}\right) \le \exp\!\left(-\frac{2(1-\alpha)^2\, r\,(L-1-L_{T,1})}{2L_{T,1} + r}\right), \]
where the last inequality follows from the fact that $\frac{x}{a+x}$ is increasing in $x$ for $x \ge 0$ and $a > 0$. Let $s(L)$ and $v(L)$ denote the first and the last trial in the $L$-th alternating scheme. Then, summing up over all trials in the second regime, we get
\[ \sum_{t=s}^{T} \Pr(A_t) = \sum_{L=L_{T,1}+r+1}^{L_{T,K}} \sum_{t=s(L)}^{v(L)} \Pr(A_t) \le K \sum_{L=L_{T,1}+r+1}^{L_{T,K}} \exp\!\left(-\frac{2(1-\alpha)^2 r (L-1-L_{T,1})}{2L_{T,1}+r}\right) \le K \exp\!\left(-\frac{2(1-\alpha)^2 r^2}{2L_{T,1}+r}\right) + K \int_{x=r}^{\infty} \exp\!\left(-\frac{2(1-\alpha)^2 r x}{2L_{T,1}+r}\right) dx. \]
Here the second term is bounded by
\[ K \int_{x=r}^{\infty} \exp\!\left(-\frac{2(1-\alpha)^2 r x}{2L_{T,1}+r}\right) dx = K\, \frac{2L_{T,1}+r}{2(1-\alpha)^2 r}\, \exp\!\left(-\frac{2(1-\alpha)^2 r^2}{2L_{T,1}+r}\right). \]
Putting things together, we obtain
\[ \sum_{t=s}^{T} \Pr(A_t) \le K\left(1 + \frac{2L_{T,1}+r}{2(1-\alpha)^2 r}\right) \exp\!\left(-\frac{2(1-\alpha)^2 r^2}{2L_{T,1}+r}\right) \le K\, \frac{2L_{T,1}+3r}{2(1-\alpha)^2 r}\, \exp\!\left(-\frac{2(1-\alpha)^2 r^2}{2L_{T,1}+r}\right), \]
which was to be shown.

The First and the Second Regime Together  Combining the bounds for the first and second regimes from Lemmas 3.10 and 3.11, and optimizing $r$, we obtain the following result, which is proved in Section A.3.4 of the appendix:

Lemma 3.12 On a canonical worst-case sequence of any length $T$ for which $L_{T,1} = L^*$,
\[ \sum_{t=1}^{T} \Pr(A_t) \le \frac{4\sqrt{2L^*\ln(3K)}}{\alpha(1-\alpha)} + \frac{3\ln(1+L^*)}{\alpha(1-\alpha)} + \frac{\ln(3K)}{(1-\alpha)^2}. \]
Plugging this bound into (3.4) completes the proof of Theorem 3.1.

4. Constant Regret on IID Losses

In this section we show that our BDP algorithm has optimal $O(\ln K)$ regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a fixed gap $\gamma$ between the expected loss of the best expert and all others.

Theorem 4.1 Let $\gamma \in (0,1]$ and $\delta \in (0,1]$ be constants, and let $k^*$ be a fixed expert. Suppose the loss vectors $l_t$ are independent random variables such that the expected differences in loss satisfy
\[ \min_{k \ne k^*} E\big[l_{t,k} - l_{t,k^*}\big] \ge \gamma \quad \text{for all } t. \tag{4.1} \]
Then, with probability at least $1-\delta$, the regret of the BDP algorithm is bounded by a constant:
\[ R_T \le \frac{8}{(1-\alpha)^2\gamma^2}\, \ln \frac{8K}{(1-\alpha)^2\gamma\delta}. \]
As discussed in the proof in Section B.1, it is possible to improve the dependence on $\gamma$ at the cost of getting a more complicated expression.

Acknowledgments

Tim van Erven was supported by an NWO Rubicon grant, Wojciech Kotłowski by the Foundation for Polish Science under the Homing Plus Program, and Manfred K. Warmuth by NSF grant IIS.

References

Jacob Abernethy and Manfred K. Warmuth. Repeated games against budgeted adversaries. In Neural Information Processing Systems (NIPS), 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48-75, 2002.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Nicolò Cesa-Bianchi, Philip M. Long, and Manfred K. Warmuth. Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. IEEE Transactions on Neural Networks, 7(2):604-619, May 1996.

Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.

Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 2014. To appear.

Luc Devroye, Gábor Lugosi, and Gergely Neu.
Prediction by random-walk perturbation. In Conference on Learning Theory (COLT), 2013.

Manfred K. Warmuth and Dima Kuzmin. Online variance minimization. Machine Learning, 87(1):1-32, 2012.

Additional Material for "Follow the Leader with Dropout Perturbations"

Appendix A. Proof Details for Section 3

A.1. Proof of Theorem 3.2

Proof  We prove the theorem by explicit construction of the sequence. We take $K = 2$ experts. The loss sequence is such that expert 1 never gets any loss (hence $L^* = L_{T,1} = 0$), while expert 2 gets losses $\epsilon_1, \epsilon_2, \ldots, \epsilon_T$, where $\epsilon_t = \frac{1}{\sqrt{t}}$. This means that its cumulative loss $L_{t,2} = \sum_{s=1}^t \frac{1}{\sqrt{s}}$ is bounded by
\[ 2\sqrt{t+1} - 2 \le L_{t,2} \le 2\sqrt{t} - 1. \]
We now bound the total loss of the RWP algorithm from below. Consider the situation just before trial $t$. Then expert 2 has accumulated at most $L_{t-1,2} \le 2\sqrt{t}$ loss so far. The expected loss of the algorithm in this iteration is $\epsilon_t$ times the probability that expert 2 becomes the leader. This probability is at least $P\big(\xi_{t,1} > \xi_{t,2} + 2\sqrt{t}\big)$, where $\xi_{t,k} \sim \mathrm{Binomial}(1/2, t) - t/2$. Consequently, $U_t = \xi_{t,1} - \xi_{t,2} + t$ is distributed as $\mathrm{Binomial}(1/2, 2t)$. Therefore, the expected loss of the algorithm in iteration $t$ is lower bounded by
\[ E\big[l_{t,\hat k_t}\big] \ge \epsilon_t\, P\big(U_t > 2\sqrt{t} + t\big) = \epsilon_t\, P\left(\frac{U_t - t}{\sqrt{t/2}} > 2\sqrt{2}\right). \]
We first give a rough idea of what happens. By the Central Limit Theorem, the probability on the right-hand side is approximately $P(Y > 2\sqrt{2})$, where $Y \sim N(0,1)$. Summing over trials, we get that the expected cumulative loss of the algorithm is approximately $L_{T,2}\, P(Y > 2\sqrt{2}) = \Omega(\sqrt{T})$. To be more precise, we use the Berry-Esseen theorem, which says that for $X \sim \mathrm{Binomial}(n, p)$ and $Y \sim N(0,1)$, for any $z$,
\[ \left| P\left(\frac{X - np}{\sqrt{np(1-p)}} > z\right) - P(Y > z) \right| \le \frac{f(p)}{\sqrt{n}}, \]
where $f(p)$ depends on $p$, but not on $n$. Using this fact with $n = 2t$ and $p = \frac{1}{2}$ results in
\[ P\left(\frac{U_t - t}{\sqrt{t/2}} > 2\sqrt{2}\right) \ge P\big(Y > 2\sqrt{2}\big) - \frac{c}{\sqrt{t}} \]
for some constant $c$. Therefore the expected cumulative loss of the algorithm can be lower bounded by
\[ \sum_{t=1}^T \epsilon_t \left( P\big(Y > 2\sqrt{2}\big) - \frac{c}{\sqrt{t}} \right) \ge P\big(Y > 2\sqrt{2}\big)\, L_{T,2} - c\,(\ln T + 1) = \Omega(\sqrt{T}), \]
which proves the theorem.

A.2. Operations to Reduce to the Canonical Worst-case Sequence

In this section we prove that the three operations on losses from Section 3.1 can only ever increase the regret of the BDP algorithm. As proved in Lemma 3.7, this implies that the canonical worst-case sequence maximizes the regret among all sequences of binary losses with the same cumulative losses $L_1, \ldots, L_K$.

THE UNIT RULE HOLDS

We can assume without loss of generality that in every round exactly one expert gets loss 1. We call this the unit rule. It follows from two results, which will be proved in this subsection: first, Lemma A.1 shows that, if multiple experts get loss in the same trial, then the regret can only increase if we split that trial into multiple consecutive trials in which the experts get their losses in turn. Secondly, it is shown by Lemma A.2 that rounds in which all experts get zero loss do not change the regret, and can therefore be ignored in the analysis. Although we only need them for binary losses, both lemmas hold for general losses with values in $[0,1]$.

Lemma A.1 (One Expert Gets Loss Per Trial) Suppose $l_1, \ldots, l_{t-1}, l_t, l_{t+1}, \ldots, l_T$ is a sequence of losses. Now consider the alternative sequence of losses $l_1, \ldots, l_{t-1}, l_t^1, \ldots, l_t^K, l_{t+1}, \ldots, l_T$ with
\[ l^k_{t,k'} = \begin{cases} l_{t,k'} & \text{if } k' = k, \\ 0 & \text{otherwise,} \end{cases} \]
for $k, k' = 1, \ldots, K$. Then the regret of the BDP algorithm on the original losses, $R_T$, never exceeds its regret on the alternative losses, $R'_T$: $R_T \le R'_T$.

Proof  The cumulative loss of the best expert $L^*$ is the same on both sequences of losses, so we only need to consider the cumulative loss of the BDP algorithm. On trials $1, \ldots, t-1$ and $t+1, \ldots, T$ the algorithm's probability distributions on experts are the same for both sequences of losses, so we only need to compare its expected loss on $l_t$ with its expected losses on $l_t^1, \ldots, l_t^K$.
To this end, we observe that the algorithm's probability of choosing expert $k$ given losses $l_1, \ldots, l_{t-1}$ is no more than its probability of choosing that expert given losses $l_1, \ldots, l_{t-1}, l_t^1, \ldots, l_t^{k-1}$, because during the trials $l_t^1, \ldots, l_t^{k-1}$ expert $k$ has received no loss, whereas experts $1, \ldots, k-1$ have respectively received losses $l_{t,1}, \ldots, l_{t,k-1}$, which implies that the algorithm's expected loss can only increase on the alternative sequence of losses.

Lemma A.2 (Ignore All-zero Loss Vectors) Suppose $l_1, \ldots, l_{t-1}, l_t, l_{t+1}, \ldots, l_T$ is a sequence of losses such that $l_{t,k} = 0$ for all $k$. Then the regret of the BDP algorithm is the same on the subsequence $l_1, \ldots, l_{t-1}, l_{t+1}, \ldots, l_T$ with trial $t$ removed.

Proof  Both the best expert and the BDP algorithm have loss 0 on trial $t$. Trial $t$ also does not influence the actions of the BDP algorithm for any trial $t' \ne t$. Hence the regret does not change if trial $t$ is removed.

SWAPPING ARGUMENT

We call a loss vector $l_t$ a unit loss if there exists a single $k$ such that $l_{t,k} = 1$ while $l_{t,k'} = 0$ for all $k' \ne k$. Let $\hat k_t$ denote the expert that is randomly selected by the BDP algorithm, based on some past losses $l_1, \ldots, l_{t-1}$, to predict in trial $t$. Let $\mathrm{Binomial}(p, n)$ denote the distribution of a binomial random variable on $n$ trials with success probability $p$.

Lemma A.3 (Swapping) Suppose that $l_1, \ldots, l_{t-1}, l_t, l_{t+1}, l_{t+2}, \ldots, l_T$ is a sequence of unit losses with $l_{t,k} = l_{t+1,k'} = 1$ for $k \ne k'$, and that $L_{t-1,k'} \le L_{t-1,k}$. Now consider the alternative sequence of losses $l_1, \ldots, l_{t-1}, l_{t+1}, l_t, l_{t+2}, \ldots, l_T$, in which $l_t$ and $l_{t+1}$ are swapped. Then the regret of the BDP algorithm on the original losses, $R_T$, never exceeds its regret on the alternative losses, $R'_T$: $R_T \le R'_T$.

Proof  The cumulative loss of the best expert $L^*$ is the same on both sequences of losses, so we only need to consider the cumulative loss of the BDP algorithm. On trials $1, \ldots, t-1$ and $t+2, \ldots, T$ the algorithm's weights are the same for both sequences of losses, so we only need to compare its loss on $l_t, l_{t+1}$ in the original sequence with its loss on $l_{t+1}, l_t$ in the alternative sequence. Let $\Pr$ and $\Pr'$ respectively denote probability with respect to the algorithm's randomness on the original and on the alternative losses. Since we assumed unit losses, we need to show that
\[ \Pr(\hat k_t = k) + \Pr(\hat k_{t+1} = k') \le \Pr{}'(\hat k_t = k') + \Pr{}'(\hat k_{t+1} = k). \tag{A.1} \]
As will be made precise below, we can express all these probabilities in terms of binomial random variables $\tilde L_{t-1,j} \sim \mathrm{Binomial}(1-\alpha, L_{t-1,j})$ for $j = 1, \ldots, K$ that represent the cumulative perturbed losses for the experts after $t-1$ trials, and an additional Bernoulli variable $X \sim \mathrm{Binomial}(1-\alpha, 1)$ that represents whether the next loss is dropped out (depending on context, $X$ is either $\tilde l_{t,k}$ or $\tilde l_{t,k'}$). It will also be convenient to define the derived values
\[ M = \min_{j \ne k, k'} \tilde L_{t-1,j}, \qquad C = \big|\{j \ne k, k' : \tilde L_{t-1,j} = M\}\big|, \]
which represent the minimum perturbed loss and the number of experts achieving that minimum among all experts except $k$ and $k'$.
We will show that
\[ \Pr(\hat k_t = k \mid M = m, C = c, X = x) + \Pr(\hat k_{t+1} = k' \mid M = m, C = c, X = x) - \Pr{}'(\hat k_t = k' \mid M = m, C = c, X = x) - \Pr{}'(\hat k_{t+1} = k \mid M = m, C = c, X = x) \tag{A.2} \]

is nonpositive for all $m$, $c$ and $x$, from which (A.1) follows. In terms of the random variables we have defined, and abbreviating $\tilde L_k = \tilde L_{t-1,k}$ and $\tilde L_{k'} = \tilde L_{t-1,k'}$ for notational convenience, these conditional probabilities are
\[ \Pr(\hat k_t = k \mid M = m, C = c, X = x) = \Pr(\tilde L_k < \tilde L_{k'},\ \tilde L_k < m) + \tfrac{1}{2}\Pr(\tilde L_k = \tilde L_{k'} < m) + \tfrac{1}{c+1}\Pr(\tilde L_k = m < \tilde L_{k'}) + \tfrac{1}{c+2}\Pr(\tilde L_k = m = \tilde L_{k'}), \]
\[ \Pr(\hat k_{t+1} = k' \mid M = m, C = c, X = x) = \Pr(\tilde L_{k'} < \tilde L_k + x,\ \tilde L_{k'} < m) + \tfrac{1}{2}\Pr(\tilde L_{k'} = \tilde L_k + x < m) + \tfrac{1}{c+1}\Pr(\tilde L_{k'} = m < \tilde L_k + x) + \tfrac{1}{c+2}\Pr(\tilde L_{k'} = m = \tilde L_k + x), \]
\[ \Pr{}'(\hat k_t = k' \mid M = m, C = c, X = x) = \Pr(\tilde L_{k'} < \tilde L_k,\ \tilde L_{k'} < m) + \tfrac{1}{2}\Pr(\tilde L_{k'} = \tilde L_k < m) + \tfrac{1}{c+1}\Pr(\tilde L_{k'} = m < \tilde L_k) + \tfrac{1}{c+2}\Pr(\tilde L_{k'} = m = \tilde L_k), \]
\[ \Pr{}'(\hat k_{t+1} = k \mid M = m, C = c, X = x) = \Pr(\tilde L_k < \tilde L_{k'} + x,\ \tilde L_k < m) + \tfrac{1}{2}\Pr(\tilde L_k = \tilde L_{k'} + x < m) + \tfrac{1}{c+1}\Pr(\tilde L_k = m < \tilde L_{k'} + x) + \tfrac{1}{c+2}\Pr(\tilde L_k = m = \tilde L_{k'} + x). \]
For $x = 0$, we see immediately that (A.2) is 0, so it remains only to consider the case $x = 1$. For that case, (A.2) simplifies, after cancelling matching terms, to
\[ \Big(\tfrac{1}{2} - \tfrac{1}{c+2}\Big) \sum_{a=1}^{m-1}\Big[\Pr(\tilde L_{k'} = a = \tilde L_k + 1) - \Pr(\tilde L_k = a = \tilde L_{k'} + 1)\Big] + \Big(\tfrac{1}{c+1} - \tfrac{1}{c+2}\Big)\Big[\Pr(\tilde L_{k'} = m = \tilde L_k + 1) - \Pr(\tilde L_k = m = \tilde L_{k'} + 1)\Big]. \]

To prove that this is nonpositive, it is sufficient to show that
\[ \Pr(\tilde L_{k'} = a = \tilde L_k + 1) - \Pr(\tilde L_k = a = \tilde L_{k'} + 1) \le 0 \tag{A.3} \]
for all nonnegative integers $a$. Abbreviate $L = L_{t-1,k}$ and $L' = L_{t-1,k'}$. If $a = 0$ or $a > L'$, then the left-most probability is 0 and (A.3) holds. Alternatively, for $a \in \{1, \ldots, L'\}$, the left-hand side of (A.3) is equal to
\[ \left[\binom{L'}{a}\binom{L}{a-1} - \binom{L}{a}\binom{L'}{a-1}\right] (1-\alpha)^{2a-1}\, \alpha^{L+L'-2a+1}, \]
so it is enough to show that $\binom{L'}{a}\binom{L}{a-1} - \binom{L}{a}\binom{L'}{a-1} \le 0$. But this holds, because the left-hand side is equal to
\[ \frac{L'!\,L!}{a!\,(L'-a)!\,(a-1)!\,(L-a+1)!} - \frac{L!\,L'!}{a!\,(L-a)!\,(a-1)!\,(L'-a+1)!} = \frac{L'!\,L!\,\big[(L'-a+1) - (L-a+1)\big]}{a!\,(a-1)!\,(L'-a+1)!\,(L-a+1)!} = \frac{L'!\,L!\,(L'-L)}{a!\,(a-1)!\,(L'-a+1)!\,(L-a+1)!} \le 0, \]
because $L' = L_{t-1,k'} \le L_{t-1,k} = L$ by assumption.

A.3. Bounding Leader Changes on the Canonical Worst-case Sequence

A.3.1. PROOF OF LEMMA 3.8

Proof  Within this proof, abbreviate $L = L_{t,1}$. For every expert $k$, let $V_k \sim \mathrm{Binomial}(1-\alpha, L)$ and $W_k \sim \mathrm{Binomial}(1-\alpha, L_{t,k} - L)$ be independent, so that we may take $\tilde L_{t,k} = V_k + W_k$. Also define $p(v) = \Pr(V_k = v)$. Then
\[ \Pr(|D_t| = 1) = \sum_{k=1}^{K} \Pr\Big(\min_{j \ne k} \tilde L_{t,j} \ge \tilde L_{t,k} + 2\Big) = \sum_{k=1}^{K} \sum_{v=0}^{L} p(v)\, \Pr\Big(\min_{j \ne k} \tilde L_{t,j} \ge v + W_k + 2\Big) \ge \sum_{v=2}^{L} \frac{p(v-2)}{p(v)} \sum_{k=1}^{K} p(v)\, \underbrace{\Pr\Big(\min_{j \ne k} \tilde L_{t,j} \ge v + W_k\Big)}_{f_k(v)}. \tag{A.4} \]

Let S = { k : L_{t,k} ≤ min_j L_{t,j} } be the set of leaders. Then

f(v) = (1/K) Σ_{k=1}^{K} Pr( min_{j≠k} L̃_j ≥ L̃_k + 2, V_k = v )
 ≥ (1/K) Pr( ∃k : min_{j≠k} L̃_j ≥ L̃_k + 2, V_k = v )
 ≥ (1/K) Pr( min_{k∈S} V_k = v, ∃k ∈ S : min_{j≠k} L̃_j ≥ L̃_k + 2 ),

where the first inequality follows by the union bound, and the second because the latter event implies the former. We remark that S is never empty, so that the minimum is always well-defined. We also have

p(v−2) = C(L, v−2) α^{v−2} (1−α)^{L−v+2} = p(v) · ( (1−α)²/α² ) · v(v−1) / ( (L−v+2)(L−v+1) ).

Let g(v) = max{ v(v−1)/((L−v+2)(L−v+1)) − α²/(1−α)², 0 }. Then, since g(0) = g(1) = 0, (A.4) is at least as large as

( (1−α)²/α² ) Σ_v g(v) Pr( min_{k∈S} V_k = v ) = ( (1−α)²/α² ) E[g(V*)]

for V* = min_{k∈S} V_k. The function g(v) may be written as

g(v) = max{ h₁(v) h₂(v) − α²/(1−α)², 0 }   for h₁(v) = (v−1)/(L−v+2), h₂(v) = v/(L−v+1).

For v ≥ 1, both h₁ and h₂ are nonnegative, nondecreasing and convex, which implies that g(v) also has these properties. Moreover, since g(v) = 0 for v ∈ [0, 1], all properties extend to all v ≥ 0. Suppose that E[V*] ≥ αL − B for some B ≥ 0. Then Jensen's inequality and monotonicity of g imply that

( (1−α)²/α² ) E[g(V*)] ≥ ( (1−α)²/α² ) g(E[V*]) ≥ ( (1−α)²/α² ) g(αL − B)
 ≥ (1−α)² (αL−B)(αL−B−1) / ( α² ((1−α)L+B+2)((1−α)L+B+1) ) − 1
 ≥ −(2B+3) / ( α(1−α)L ).

Putting everything together, we find that

Pr(D_t > 0) = Pr(D_t = 1) ≥ 1 − (2B+3) / ( α(1−α)L ),    (A.5)

so that it remains to find a good bound B. Let Y_k = L − V_k ∼ Binomial(1−α, L). Then

αL − E[V*] = E[ max_{k∈S} (αL − V_k) ] ≤ E[ max_k (αL − V_k) ] = E[ max_k (Y_k − (1−α)L) ] ≤ √( L ln K / 2 ) =: B,

where the last inequality follows by a standard argument for sub-Gaussian random variables (see, for example, Lemmas A.13 and A.1 in the textbook by Cesa-Bianchi and Lugosi, 2006). Plugging this bound into (A.5) leads to the desired result.
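The final sub-Gaussian maximal bound can be illustrated by simulation. The script below is an assumption-laden sketch, not from the paper: it estimates E[max_k (Y_k − (1−α)L)] for K independent Binomial(1−α, L) variables by Monte Carlo and compares the estimate with √(L ln K / 2); the parameter values and sample count are arbitrary.

```python
# Monte-Carlo illustration of E[max_k (Y_k - (1-alpha)L)] <= sqrt(L ln K / 2)
# for independent Y_k ~ Binomial(L, 1-alpha). By Hoeffding's lemma each
# centered Y_k is (L/4)-sub-Gaussian, so the expected maximum of K of them
# is at most sqrt(2 (L/4) ln K) = sqrt(L ln K / 2).
import math
import random

random.seed(0)
K, L, alpha = 20, 50, 0.5
samples = 5000

def binomial(n, p):
    # Sum of n Bernoulli(p) trials.
    return sum(random.random() < p for _ in range(n))

est = sum(max(binomial(L, 1 - alpha) - (1 - alpha) * L for _ in range(K))
          for _ in range(samples)) / samples
bound = math.sqrt(L * math.log(K) / 2)
print(est <= bound)  # True: the empirical mean stays below the bound
```

With these values the empirical mean is typically well below the bound, consistent with the fact that the sub-Gaussian maximal inequality is not tight for binomials with α = 1/2.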
