arXiv: v2 [cs.LG] 27 Jun 2008
Prediction with expert advice for the Brier game

Vladimir Vovk and Fedor Zhdanov
Computer Learning Research Centre, Department of Computer Science
Royal Holloway, University of London
Egham, Surrey TW20 0EX, England

February 18, 2013

Abstract

We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches. The theoretical performance guarantee turns out to be rather tight on these data sets, especially in the case of the more extensive tennis data.

1 Introduction

The paradigm of prediction with expert advice was introduced in the late 1980s (see, e.g., [5], [11], [2]) and has been applied to various loss functions; see [3] for a recent book-length review. An especially important class of loss functions is that of "mixable" ones, for which the learner's loss can be made as small as the best expert's loss plus a constant (depending on the number of experts). It is known [8, 14] that the optimal additive constant is attained by the "strong aggregating algorithm" proposed in [13] (we use the adjective "strong" to distinguish it from the "weak aggregating algorithm" of [9]).

There are several important loss functions that have been shown to be mixable and for which the optimal additive constant has been found. The prime examples in the case of binary observations are the log loss function and the square loss function. The log loss function, whose mixability is obvious, has been explored extensively, along with its important generalizations, the Kullback–Leibler divergence and Cover's loss function.

In this paper we concentrate on the square loss function. In the binary case, its mixability was demonstrated in [13]. There are two natural directions in which this result could be generalized:

Regression: observations are real numbers (square-loss regression is a standard problem in statistics).
Classification: observations take values in a finite set (this leads to the "Brier game", to be defined below, a standard way of measuring the quality of predictions in meteorology and other applied fields: see, e.g., [4]).

The mixability of the square loss function in the case of observations belonging to a bounded interval of real numbers was demonstrated in [8]; Haussler et al.'s algorithm was simplified in [16]. Surprisingly, the case of square-loss non-binary classification has never been analysed in the framework of prediction with expert advice. The purpose of this paper is to fill this gap. Its short conference version [17] appeared in the ICML 2008 proceedings.

2 Prediction algorithm and loss bound

A game of prediction consists of three components: the observation space $\Omega$, the decision space $\Gamma$, and the loss function $\lambda : \Omega \times \Gamma \to \mathbb{R}$. In this paper we are interested in the following Brier game [1]: $\Omega$ is a finite and non-empty set, $\Gamma := \mathcal{P}(\Omega)$ is the set of all probability measures on $\Omega$, and

$$\lambda(\omega, \gamma) = \sum_{o \in \Omega} \left(\gamma\{o\} - \delta_\omega\{o\}\right)^2,$$

where $\delta_\omega \in \mathcal{P}(\Omega)$ is the probability measure concentrated at $\omega$: $\delta_\omega\{\omega\} = 1$ and $\delta_\omega\{o\} = 0$ for $o \ne \omega$. (For example, if $\Omega = \{1, 2, 3\}$, $\omega = 1$, $\gamma\{1\} = 1/2$, $\gamma\{2\} = 1/4$, and $\gamma\{3\} = 1/4$, then $\lambda(\omega, \gamma) = (1/2 - 1)^2 + (1/4 - 0)^2 + (1/4 - 0)^2 = 3/8$.)

The game of prediction is being played repeatedly by a learner having access to decisions made by a pool of experts, which leads to the following prediction protocol:

Protocol 1 Prediction with expert advice
  $L_0 := 0$.
  $L_0^k := 0$, $k = 1, \ldots, K$.
  for $N = 1, 2, \ldots$ do
    Expert $k$ announces $\gamma_N^k \in \Gamma$, $k = 1, \ldots, K$.
    Learner announces $\gamma_N \in \Gamma$.
    Reality announces $\omega_N \in \Omega$.
    $L_N := L_{N-1} + \lambda(\omega_N, \gamma_N)$.
    $L_N^k := L_{N-1}^k + \lambda(\omega_N, \gamma_N^k)$, $k = 1, \ldots, K$.
  end for

At each step of Protocol 1 Learner is given $K$ experts' advice and is required to come up with his own decision; $L_N$ is his cumulative loss over the first $N$ steps, and $L_N^k$ is the $k$th expert's cumulative loss over the first $N$ steps. In the case of the Brier game, the decisions are probability forecasts for the next observation.

An optimal (in the sense of Theorem 1 below) strategy for Learner in prediction with expert advice for the Brier game is given by the strong aggregating algorithm. For each expert $k$, the algorithm maintains its weight $w^k$, constantly slashing the weights of less successful experts. Its description uses the notation $t^+ := \max(t, 0)$.

Algorithm 1 Strong aggregating algorithm for the Brier game
  $w_0^k := 1$, $k = 1, \ldots, K$.
  for $N = 1, 2, \ldots$ do
    Read the Experts' predictions $\gamma_N^k$, $k = 1, \ldots, K$.
    Set $G_N(\omega) := -\ln \sum_{k=1}^K w_{N-1}^k e^{-\lambda(\omega, \gamma_N^k)}$, $\omega \in \Omega$.
    Solve $\sum_{\omega \in \Omega} (s - G_N(\omega))^+ = 2$ in $s \in \mathbb{R}$.
    Set $\gamma_N\{\omega\} := (s - G_N(\omega))^+ / 2$, $\omega \in \Omega$.
    Output prediction $\gamma_N \in \mathcal{P}(\Omega)$.
    Read observation $\omega_N$.
    $w_N^k := w_{N-1}^k e^{-\lambda(\omega_N, \gamma_N^k)}$.
  end for

The algorithm will be derived in Section 5. The following result (to be proved in Section 4) gives a performance guarantee for it that cannot be improved by any other prediction algorithm.

Theorem 1. Using Algorithm 1 as Learner's strategy in Protocol 1 for the Brier game guarantees that

$$L_N \le \min_{k=1,\ldots,K} L_N^k + \ln K \tag{1}$$

for all $N = 1, 2, \ldots$. If $A < \ln K$, Learner does not have a strategy guaranteeing

$$L_N \le \min_{k=1,\ldots,K} L_N^k + A \tag{2}$$

for all $N = 1, 2, \ldots$.

The second part of this theorem follows from its special case with $|\Omega| = 2$ (the binary case). However, we are not aware of a proof of this result in the binary case, and we will not use this reduction.

3 Experimental results

In our first empirical study of Algorithm 1 we use historical data about 6473 matches in various English football league competitions, namely: the Premier League (the pinnacle of the English football system), the Football League Championship, Football League One, Football League Two, and the Football Conference. Our data, provided by Football-Data, cover three seasons, 2005/2006, 2006/2007, and 2007/2008. (The 2007/2008 season ended in May shortly after the ICML 2008 submission deadline, and so the data set used in the conference version [17] of this paper covered only part of that season, with 6416 matches in total.) The matches are sorted first by date, then by league, and then by the name of the home team. In the terminology of our prediction protocol, the
outcome of each match is the observation, taking one of three possible values, "home win", "draw", or "away win"; we will encode the possible values as 1, 2, and 3.

For each match we have forecasts made by a range of bookmakers. We chose eight bookmakers for which we have enough data over a long period of time, namely Bet365, Bet&Win, Gamebookers, Interwetten, Ladbrokes, Sportingbet, Stan James, and VC Bet. (And the seasons mentioned above were chosen because the forecasts of these bookmakers are available for them.)

A probability forecast for the next observation is essentially a vector $(p_1, p_2, p_3)$ consisting of positive numbers summing to 1. The bookmakers do not announce these numbers directly; instead, they quote three betting odds, $a_1$, $a_2$, and $a_3$. Each number $a_i$ is the amount which the bookmaker undertakes to pay out to a client betting on outcome $i$ per unit stake in the event that $i$ happens (the stake itself is never returned to the bettor, which makes all betting odds greater than 1; i.e., the odds are announced according to the "continental" rather than "traditional" system). The inverse value $1/a_i$, $i \in \{1, 2, 3\}$, can be interpreted as the bookmaker's quoted probability for the observation $i$. The bookmaker's quoted probabilities are usually slightly (because of the competition with other bookmakers) in his favour: the sum $1/a_1 + 1/a_2 + 1/a_3$ exceeds 1 by the amount called the overround (at most 0.15 in the vast majority of cases). We used

$$p_i := \frac{1/a_i}{1/a_1 + 1/a_2 + 1/a_3}, \quad i = 1, 2, 3, \tag{3}$$

as the bookmaker's forecasts; it is clear that $p_1 + p_2 + p_3 = 1$.

The results of applying Algorithm 1 to the football data, with 8 experts and 3 possible observations, are shown in Figure 1. Let $L_N^k$ be the cumulative loss of Expert $k$, $k = 1, \ldots, 8$, over the first $N$ matches and $L_N$ be the corresponding number for Algorithm 1 (i.e., we essentially continue to use the notation of Theorem 1). The dashed line corresponding to Expert $k$ shows the excess loss $N \mapsto L_N^k - L_N$ of Expert $k$ over Algorithm 1. The excess loss can be negative, but from Theorem 1 we know that it cannot be less than $-\ln 8$; this lower bound is also shown in Figure 1. Finally, the thick line (the positive part of the x axis) is drawn for comparison: this is the excess loss of Algorithm 1 over itself. We can see that at each moment in time the algorithm's cumulative loss is fairly close to the cumulative loss of the best expert (at that time; the best expert keeps changing over time).

Figure 2 shows the distribution of the bookmakers' overrounds. We can see that in most cases overrounds are between 0.05 and 0.15, but there are also occasional extreme values, near zero or in excess of 0.3.

In Figure 1 one bookmaker clearly performs worse than the others. His poor performance may be explained by his mean overround being about 0.13, near the top end of the distribution in Figure 2. (On one hand, a high overround diminishes the need for accurate probability forecasts, and on the other, our estimates (3) of the probabilities implicit in the announced odds also become less precise.)
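The conversion (3) from quoted odds to a probability forecast, and the Brier loss such a forecast incurs, can be illustrated with a minimal Python sketch. The function names and the sample odds are hypothetical (they do not come from the data sets used in the paper), and outcomes are indexed from 0 rather than from 1.

```python
def implied_probabilities(odds):
    """Normalise the inverse odds 1/a_i as in (3); also return the overround."""
    inverse = [1.0 / a for a in odds]
    total = sum(inverse)                  # equals 1 + overround
    return [q / total for q in inverse], total - 1.0


def brier_loss(outcome, forecast):
    """Sum over outcomes of (forecast probability - indicator of the outcome)^2."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(forecast))


# Hypothetical continental odds for home win, draw, away win.
odds = [2.10, 3.30, 3.60]
probs, overround = implied_probabilities(odds)
loss_if_home_win = brier_loss(0, probs)
print(probs, overround, loss_if_home_win)
```

With the forecast (1/2, 1/4, 1/4) and the first outcome observed, `brier_loss` gives 3/8, in agreement with the worked example in Section 2.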
Figure 1: The difference between the cumulative loss of each of the 8 bookmakers (experts) and of Algorithm 1 on the football data. The theoretical lower bound $-\ln 8$ from Theorem 1 is also shown.

Figure 3 shows the results of another empirical study, involving data about a large number of tennis tournaments in 2004, 2005, 2006, and 2007, with the total number of matches 10,087. The tournaments include, e.g., Australian Open, French Open, US Open, and Wimbledon; the data is provided by Tennis-Data. The matches are sorted by date, then by tournament, and then by the winner's name. The data contain information about the winner of each match and the betting odds of 4 bookmakers for his/her win and for the opponent's win. Therefore, now there are two possible observations (player 1's win and player 2's win). There are four bookmakers: Bet365, Centrebet, Expekt, and Pinnacle Sports. The results in Figure 3 are presented in the same way as in Figure 1. Typical values of the overround are below 0.1, as shown in Figure 4 (analogous to Figure 2).

In both Figure 1 and Figure 3 the cumulative loss of Algorithm 1 is close to the cumulative loss of the best expert, despite the fact that some of the experts perform poorly. The theoretical bound is not hopelessly loose for the football data and is rather tight for the tennis data. The pictures look exactly the same when Algorithm 1 is applied in the more realistic manner where the experts' weights $w^k$ are not updated over the matches that are played simultaneously.

Our second empirical study (Figure 3) is about binary prediction, and so the algorithm of [13] could have also been used (and would have given similar
results). We included it since we are not aware of any empirical studies even for the binary case. For comparison with several other popular prediction algorithms, see Appendix B.

The data used for producing all the figures and tables in this section and in Appendix B can be downloaded from

Figure 2: The overround distribution histogram for the football data, with 200 bins of equal size between the minimum and maximum values of the overround.

4 Proof of Theorem 1

This proof will use some basic notions of elementary differential geometry, especially those connected with the Gauss–Kronecker curvature of surfaces. (The use of curvature in this kind of results is standard: see, e.g., [13] and [8].) All definitions that we will need can be found in, e.g., [12].

A vector $f \in \mathbb{R}^\Omega$ (understood to be a function $f : \Omega \to \mathbb{R}$) is a superprediction if there is $\gamma \in \Gamma$ such that, for all $\omega \in \Omega$, $\lambda(\omega, \gamma) \le f(\omega)$; the set $\Sigma$ of all superpredictions is the superprediction set. For each learning rate $\eta > 0$, let $\Phi_\eta : \mathbb{R}^\Omega \to (0, \infty)^\Omega$ be the homeomorphism defined by

$$\Phi_\eta(f) : \omega \in \Omega \mapsto e^{-\eta f(\omega)}, \quad f \in \mathbb{R}^\Omega. \tag{4}$$

The image $\Phi_\eta(\Sigma)$ of the superprediction set will be called the $\eta$-exponential superprediction set. It is known that

$$L_N \le \min_{k=1,\ldots,K} L_N^k + \frac{\ln K}{\eta}, \quad N = 1, 2, \ldots,$$
can be guaranteed if and only if the $\eta$-exponential superprediction set is convex (part "if" for all $K$ and part "only if" for $K \to \infty$ are proved in [14]; part "only if" for all $K$ is proved by Chris Watkins, and the details can be found in Appendix A). Comparing this with (1) and (2) we can see that we are required to prove that

$\Phi_\eta(\Sigma)$ is convex when $\eta \le 1$;
$\Phi_\eta(\Sigma)$ is not convex when $\eta > 1$.

Figure 3: The difference between the cumulative loss of each of the 4 bookmakers and of Algorithm 1 on the tennis data. Now the theoretical bound is $-\ln 4$.

Define the $\eta$-exponential superprediction surface to be the part of the boundary of the $\eta$-exponential superprediction set $\Phi_\eta(\Sigma)$ lying inside $(0, \infty)^\Omega$. The idea of the proof is to check that, for all $\eta < 1$, the Gauss–Kronecker curvature of this surface is nowhere vanishing. Even when this is done, however, there is still uncertainty as to in which direction the surface is bulging (towards the origin or away from it). The standard argument (as in [12], Chapter 12, Theorem 6) based on the continuity of the smallest principal curvature shows that the $\eta$-exponential superprediction set is bulging away from the origin for small enough $\eta$: indeed, since it is true at some point, it is true everywhere on the surface. By the continuity in $\eta$ this is also true for all $\eta < 1$. Now, since the $\eta$-exponential superprediction set is convex for all $\eta < 1$, it is also convex for $\eta = 1$.

Let us now check that the Gauss–Kronecker curvature of the $\eta$-exponential superprediction surface is always positive when $\eta < 1$ and is sometimes negative
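when $\eta > 1$ (the rest of the proof, an elaboration of the above argument, will be easy). Set $n := |\Omega|$; without loss of generality we assume $\Omega = \{1, \ldots, n\}$.

Figure 4: The overround distribution histogram for the tennis data.

A convenient parametric representation of the $\eta$-exponential superprediction surface is

$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix} = \begin{pmatrix} e^{-\eta((u_1-1)^2 + (u_2)^2 + \cdots + (u_n)^2)} \\ e^{-\eta((u_1)^2 + (u_2-1)^2 + \cdots + (u_n)^2)} \\ \vdots \\ e^{-\eta((u_1)^2 + \cdots + (u_{n-1}-1)^2 + (u_n)^2)} \\ e^{-\eta((u_1)^2 + \cdots + (u_{n-1})^2 + (u_n-1)^2)} \end{pmatrix}, \tag{5}$$

where $u_1, \ldots, u_{n-1}$ are the coordinates on the surface, $u_1, \ldots, u_{n-1} \in (0, 1)$ subject to $u_1 + \cdots + u_{n-1} < 1$, and $u_n$ is a shorthand for $1 - u_1 - \cdots - u_{n-1}$. The derivative of (5) in $u_1$ is

$$\frac{\partial}{\partial u_1} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix} = 2\eta \begin{pmatrix} (u_n - u_1 + 1)\, e^{-\eta((u_1-1)^2 + (u_2)^2 + \cdots + (u_{n-1})^2 + (u_n)^2)} \\ (u_n - u_1)\, e^{-\eta((u_1)^2 + (u_2-1)^2 + \cdots + (u_{n-1})^2 + (u_n)^2)} \\ \vdots \\ (u_n - u_1)\, e^{-\eta((u_1)^2 + (u_2)^2 + \cdots + (u_{n-1}-1)^2 + (u_n)^2)} \\ (u_n - u_1 - 1)\, e^{-\eta((u_1)^2 + (u_2)^2 + \cdots + (u_{n-1})^2 + (u_n-1)^2)} \end{pmatrix} \propto \begin{pmatrix} (u_n - u_1 + 1)\, e^{2\eta u_1} \\ (u_n - u_1)\, e^{2\eta u_2} \\ \vdots \\ (u_n - u_1)\, e^{2\eta u_{n-1}} \\ (u_n - u_1 - 1)\, e^{2\eta u_n} \end{pmatrix},$$

the derivative in $u_2$ is

$$\frac{\partial}{\partial u_2} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix} \propto \begin{pmatrix} (u_n - u_2)\, e^{2\eta u_1} \\ (u_n - u_2 + 1)\, e^{2\eta u_2} \\ \vdots \\ (u_n - u_2)\, e^{2\eta u_{n-1}} \\ (u_n - u_2 - 1)\, e^{2\eta u_n} \end{pmatrix},$$

and so on, up to

$$\frac{\partial}{\partial u_{n-1}} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix} \propto \begin{pmatrix} (u_n - u_{n-1})\, e^{2\eta u_1} \\ (u_n - u_{n-1})\, e^{2\eta u_2} \\ \vdots \\ (u_n - u_{n-1} + 1)\, e^{2\eta u_{n-1}} \\ (u_n - u_{n-1} - 1)\, e^{2\eta u_n} \end{pmatrix},$$

all coefficients of proportionality being equal and positive.

A normal vector to the surface can be found as

$$Z := \begin{vmatrix} e_1 & \cdots & e_{n-1} & e_n \\ (u_n - u_1 + 1)\, e^{2\eta u_1} & \cdots & (u_n - u_1)\, e^{2\eta u_{n-1}} & (u_n - u_1 - 1)\, e^{2\eta u_n} \\ \vdots & \ddots & \vdots & \vdots \\ (u_n - u_{n-1})\, e^{2\eta u_1} & \cdots & (u_n - u_{n-1} + 1)\, e^{2\eta u_{n-1}} & (u_n - u_{n-1} - 1)\, e^{2\eta u_n} \end{vmatrix},$$

where $e_i$ is the $i$th vector in the standard basis of $\mathbb{R}^n$. The coefficient in front of $e_1$ is the $(n-1) \times (n-1)$ determinant

$$\begin{vmatrix} (u_n - u_1)\, e^{2\eta u_2} & \cdots & (u_n - u_1)\, e^{2\eta u_{n-1}} & (u_n - u_1 - 1)\, e^{2\eta u_n} \\ (u_n - u_2 + 1)\, e^{2\eta u_2} & \cdots & (u_n - u_2)\, e^{2\eta u_{n-1}} & (u_n - u_2 - 1)\, e^{2\eta u_n} \\ \vdots & \ddots & \vdots & \vdots \\ (u_n - u_{n-1})\, e^{2\eta u_2} & \cdots & (u_n - u_{n-1} + 1)\, e^{2\eta u_{n-1}} & (u_n - u_{n-1} - 1)\, e^{2\eta u_n} \end{vmatrix}$$

$$\propto e^{-2\eta u_1} \begin{vmatrix} u_n - u_1 & \cdots & u_n - u_1 & u_n - u_1 - 1 \\ u_n - u_2 + 1 & \cdots & u_n - u_2 & u_n - u_2 - 1 \\ \vdots & \ddots & \vdots & \vdots \\ u_n - u_{n-1} & \cdots & u_n - u_{n-1} + 1 & u_n - u_{n-1} - 1 \end{vmatrix}$$

$$= e^{-2\eta u_1} \begin{vmatrix} 1 & 1 & \cdots & 1 & u_n - u_1 - 1 \\ 2 & 1 & \cdots & 1 & u_n - u_2 - 1 \\ 1 & 2 & \cdots & 1 & u_n - u_3 - 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & 1 & \cdots & 2 & u_n - u_{n-1} - 1 \end{vmatrix} = e^{-2\eta u_1} \begin{vmatrix} 1 & 1 & \cdots & 1 & u_n - u_1 - 1 \\ 1 & 0 & \cdots & 0 & u_1 - u_2 \\ 0 & 1 & \cdots & 0 & u_1 - u_3 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & u_1 - u_{n-1} \end{vmatrix}$$

$$= e^{-2\eta u_1} \left( (-1)^n (u_n - u_1 - 1) + (-1)^{n+1} (u_1 - u_2) + (-1)^{n+1} (u_1 - u_3) + \cdots + (-1)^{n+1} (u_1 - u_{n-1}) \right)$$

$$= e^{-2\eta u_1} (-1)^n \left( (u_2 + u_3 + \cdots + u_n) - (n-1) u_1 - 1 \right) = -e^{-2\eta u_1} (-1)^n n u_1 \propto u_1 e^{-2\eta u_1} \tag{6}$$

(with a positive coefficient of proportionality, $e^{2\eta}$, in the first $\propto$; the third equality follows from the expansion of the determinant along the last column and then along the first row). Similarly, the coefficient in front of $e_i$ is proportional (with the same coefficient of proportionality) to $u_i e^{-2\eta u_i}$ for $i = 2, \ldots, n-1$; indeed, the $(n-1) \times (n-1)$ determinant representing the coefficient in front of $e_i$ can be reduced to the form analogous to (6) by moving the $i$th row to the top.

The coefficient in front of $e_n$ is proportional to

$$e^{-2\eta u_n} \begin{vmatrix} u_n - u_1 + 1 & u_n - u_1 & \cdots & u_n - u_1 & u_n - u_1 \\ u_n - u_2 & u_n - u_2 + 1 & \cdots & u_n - u_2 & u_n - u_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ u_n - u_{n-2} & u_n - u_{n-2} & \cdots & u_n - u_{n-2} + 1 & u_n - u_{n-2} \\ u_n - u_{n-1} & u_n - u_{n-1} & \cdots & u_n - u_{n-1} & u_n - u_{n-1} + 1 \end{vmatrix}$$

$$= e^{-2\eta u_n} \begin{vmatrix} 1 & 0 & \cdots & 0 & u_n - u_1 \\ 0 & 1 & \cdots & 0 & u_n - u_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & u_n - u_{n-2} \\ -1 & -1 & \cdots & -1 & u_n - u_{n-1} + 1 \end{vmatrix} = e^{-2\eta u_n} \begin{vmatrix} 1 & 0 & \cdots & 0 & u_n - u_1 \\ 0 & 1 & \cdots & 0 & u_n - u_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & u_n - u_{n-2} \\ 0 & 0 & \cdots & 0 & n u_n \end{vmatrix} = n u_n e^{-2\eta u_n}$$

(with the coefficient of proportionality $e^{2\eta} (-1)^{n-1}$).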
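The outcome of this computation, that the normal vector $Z$ has components proportional to $u_i e^{-2\eta u_i}$, can be sanity-checked numerically. The following Python sketch is only an illustration under stated assumptions: the function names, the sample point `u` and the learning rate `eta` are hypothetical, and indices are 0-based. It evaluates the tangent vectors from the closed form $2\eta x_i (u_n - u_k + \delta_{ik} - \delta_{in})$ and confirms that each is orthogonal to the claimed normal direction.

```python
import math


def surface_point(u, eta):
    """Point x of (5) for a full probability vector u = (u_1, ..., u_n)."""
    norm2 = sum(v * v for v in u)
    # (u_i - 1)^2 + sum_{j != i} (u_j)^2 = norm2 - 2 u_i + 1
    return [math.exp(-eta * (norm2 - 2.0 * ui + 1.0)) for ui in u]


def tangent(u, eta, k):
    """Derivative of x in u_k (0-based, k < n-1), with u_n = 1 - u_1 - ... - u_{n-1}."""
    n = len(u)
    x = surface_point(u, eta)
    out = []
    for i in range(n):
        delta = (1.0 if i == k else 0.0) - (1.0 if i == n - 1 else 0.0)
        out.append(2.0 * eta * x[i] * (u[n - 1] - u[k] + delta))
    return out


def normal_direction(u, eta):
    """Direction of Z up to a common nonzero factor: u_i * exp(-2 eta u_i)."""
    return [ui * math.exp(-2.0 * eta * ui) for ui in u]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


u = [0.2, 0.3, 0.1, 0.4]   # hypothetical point, components sum to 1
eta = 0.7                  # hypothetical learning rate
residuals = [dot(tangent(u, eta, k), normal_direction(u, eta))
             for k in range(len(u) - 1)]
print(residuals)           # each entry should vanish up to rounding error
```

The check relies on $x_i e^{-2\eta u_i}$ being the same for all $i$, so each scalar product reduces to a multiple of $(u_n - u_k)\sum_i u_i + u_k - u_n = 0$.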
The Gauss–Kronecker curvature at the point with coordinates $(u_1, \ldots, u_{n-1})$ is proportional (with a positive coefficient of proportionality, possibly depending on the point) to

$$\begin{vmatrix} \partial Z^{\mathrm{T}} / \partial u_1 \\ \vdots \\ \partial Z^{\mathrm{T}} / \partial u_{n-1} \\ Z^{\mathrm{T}} \end{vmatrix} \tag{7}$$

([12], Chapter 12, Theorem 5, with T standing for transposition). A straightforward calculation allows us to rewrite determinant (7) (ignoring the positive coefficient $((-1)^{n-1} n e^{2\eta})^n$) as

$$\begin{vmatrix} (1 - 2\eta u_1) e^{-2\eta u_1} & 0 & \cdots & 0 & (2\eta u_n - 1) e^{-2\eta u_n} \\ 0 & (1 - 2\eta u_2) e^{-2\eta u_2} & \cdots & 0 & (2\eta u_n - 1) e^{-2\eta u_n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & (1 - 2\eta u_{n-1}) e^{-2\eta u_{n-1}} & (2\eta u_n - 1) e^{-2\eta u_n} \\ u_1 e^{-2\eta u_1} & u_2 e^{-2\eta u_2} & \cdots & u_{n-1} e^{-2\eta u_{n-1}} & u_n e^{-2\eta u_n} \end{vmatrix}$$

$$\propto \begin{vmatrix} 1 - 2\eta u_1 & 0 & \cdots & 0 & 2\eta u_n - 1 \\ 0 & 1 - 2\eta u_2 & \cdots & 0 & 2\eta u_n - 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 - 2\eta u_{n-1} & 2\eta u_n - 1 \\ u_1 & u_2 & \cdots & u_{n-1} & u_n \end{vmatrix}$$

$$= u_1 (1 - 2\eta u_2)(1 - 2\eta u_3) \cdots (1 - 2\eta u_n) + u_2 (1 - 2\eta u_1)(1 - 2\eta u_3) \cdots (1 - 2\eta u_n) + \cdots + u_n (1 - 2\eta u_1)(1 - 2\eta u_2) \cdots (1 - 2\eta u_{n-1}) \tag{8}$$

(with a positive coefficient of proportionality; to avoid calculation of the parities of various permutations, the reader might prefer to prove the last equality by induction in $n$, expanding the last determinant along the first column). Our next goal is to show that the last expression in (8) is positive when $\eta < 1$ but can be negative when $\eta > 1$.

If $\eta > 1$, set $u_1 = u_2 := 1/2$ and $u_3 = \cdots = u_n := 0$. The last expression in (8) becomes negative. It will remain negative if $u_1$ and $u_2$ are sufficiently close to $1/2$ and $u_3, \ldots, u_n$ are sufficiently close to 0.

It remains to consider the case $\eta < 1$. Set $t_i := 1 - 2\eta u_i$, $i = 1, \ldots, n$; the constraints on the $t_i$ are

$$-1 < 1 - 2\eta < t_i < 1, \quad i = 1, \ldots, n, \qquad t_1 + \cdots + t_n = n - 2\eta > n - 2. \tag{9}$$

Our goal is to prove

$$(1 - t_1) t_2 t_3 \cdots t_n + \cdots + (1 - t_n) t_1 t_2 \cdots t_{n-1} > 0,$$

i.e.,

$$t_2 t_3 \cdots t_n + \cdots + t_1 t_2 \cdots t_{n-1} > n t_1 \cdots t_n. \tag{10}$$

This reduces to

$$\frac{1}{t_1} + \cdots + \frac{1}{t_n} > n \tag{11}$$

if $t_1 \cdots t_n > 0$, and to

$$\frac{1}{t_1} + \cdots + \frac{1}{t_n} < n \tag{12}$$

if $t_1 \cdots t_n < 0$. The remaining case is where some of the $t_i$ are zero; for concreteness, let $t_n = 0$. By (9) we have $t_1 + \cdots + t_{n-1} > n - 2$, and so all of $t_1, \ldots, t_{n-1}$ are positive; this shows that (10) is indeed true.

Let us prove (11). Since $t_1 \cdots t_n > 0$, all of $t_1, \ldots, t_n$ are positive (if two of them were negative, the sum $t_1 + \cdots + t_n$ would be less than $n - 2$; cf. (9)). Therefore,

$$\frac{1}{t_1} + \cdots + \frac{1}{t_n} > \underbrace{1 + \cdots + 1}_{n \text{ times}} = n.$$

To establish (10) it remains to prove (12). Suppose, without loss of generality, that $t_1 > 0$, $t_2 > 0$, ..., $t_{n-1} > 0$, and $t_n < 0$. We will prove a slightly stronger statement allowing $t_1, \ldots, t_{n-2}$ to take value 1 and removing the lower bound on $t_n$. Since the function $t \in (0, 1] \mapsto 1/t$ is convex, we can also assume, without loss of generality, $t_1 = \cdots = t_{n-2} = 1$. Then $t_{n-1} + t_n > 0$, and so

$$\frac{1}{t_{n-1}} + \frac{1}{t_n} < 0;$$

therefore,

$$\frac{1}{t_1} + \cdots + \frac{1}{t_n} < n - 2 < n.$$

Finally, let us check that the positivity of the Gauss–Kronecker curvature implies the convexity of the $\eta$-exponential superprediction set in the case $\eta \le 1$, and the lack of positivity of the Gauss–Kronecker curvature implies the lack of convexity of the $\eta$-exponential superprediction set in the case $\eta > 1$. The $\eta$-exponential superprediction surface will be oriented by choosing the normal vector field directed towards the origin. This can be done since

$$\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \propto \begin{pmatrix} e^{2\eta u_1} \\ \vdots \\ e^{2\eta u_n} \end{pmatrix}, \qquad Z \propto (-1)^{n-1} \begin{pmatrix} u_1 e^{-2\eta u_1} \\ \vdots \\ u_n e^{-2\eta u_n} \end{pmatrix}, \tag{13}$$

with both coefficients of proportionality positive (cf. (5) and the bottom row of the first determinant in (8)), and the sign of the scalar product of the two vectors on the right-hand sides in (13) does not depend on the point $(u_1, \ldots, u_{n-1})$. Namely, we take $(-1)^n Z$ as the normal vector field directed towards the origin. The Gauss–Kronecker curvature will not change sign after the re-orientation:
if $n$ is even, the new orientation coincides with the old, and for odd $n$ the Gauss–Kronecker curvature does not depend on the orientation.

In the case $\eta > 1$, the Gauss–Kronecker curvature is negative at some point, and so the $\eta$-exponential superprediction set is not convex ([12], Chapter 13, Theorem 1 and its proof).

It remains to consider the case $\eta \le 1$. Because of the continuity of the $\eta$-exponential superprediction surface in $\eta$ we can and will assume, without loss of generality, that $\eta < 1$.

Let us first check that the smallest principal curvature $k_1 = k_1(u_1, \ldots, u_{n-1}, \eta)$ of the $\eta$-exponential superprediction surface is always positive (among the arguments of $k_1$ we list not only the coordinates $u_1, \ldots, u_{n-1}$ of a point on the surface (5) but also the learning rate $\eta \in (0, 1)$). At least at some $(u_1, \ldots, u_{n-1}, \eta)$ the value of $k_1(u_1, \ldots, u_{n-1}, \eta)$ is positive: take a sufficiently small $\eta$ and the point on the surface (5) at which the maximum of $x_1 + \cdots + x_n$ is attained (the point of the $\eta$-exponential superprediction set at which the maximum is attained will lie on the surface since the maximum is attained at $(x_1, \ldots, x_n) = (1, \ldots, 1)$ when $\eta = 0$). Therefore, for all $(u_1, \ldots, u_{n-1}, \eta)$ the value of $k_1(u_1, \ldots, u_{n-1}, \eta)$ is positive: if $k_1$ had different signs at two points in the set

$$\left\{ (u_1, \ldots, u_{n-1}, \eta) \mid u_1 \in (0, 1), \ldots, u_{n-1} \in (0, 1),\ u_1 + \cdots + u_{n-1} < 1,\ \eta \in (0, 1) \right\}, \tag{14}$$

we could connect these points by a continuous curve lying completely inside (14); at some point on the curve, $k_1$ would be zero, in contradiction to the positivity of the Gauss–Kronecker curvature $k_1 \cdots k_{n-1}$.

Now it is easy to show that the $\eta$-exponential superprediction set is convex. Suppose there are two points $A$ and $B$ on the $\eta$-exponential superprediction surface such that the interval $[A, B]$ contains points outside the $\eta$-exponential superprediction set. The intersection of the plane $OAB$, where $O$ is the origin, with the $\eta$-exponential superprediction surface is a planar curve; the curvature of this curve at some point between $A$ and $B$ will be
negative (remember that the curve is oriented by directing the normal vector field towards the origin), contradicting the positivity of $k_1$ at that point.

5 Derivation of the prediction algorithm

To achieve the loss bound (1) in Theorem 1 Learner can use, as discussed earlier, the strong aggregating algorithm (see, e.g., [16], Section 1, (15)) with $\eta = 1$. In this section we will find a substitution function for the strong aggregating algorithm for the Brier game with $\eta \le 1$, which is the only component of the algorithm not described explicitly in [16]. Our substitution function will not require that its input, the generalized prediction, should be computed from the normalized distribution $(w^k)_{k=1}^K$ on the experts; this is a valuable feature for generalizations to an infinite number of experts (as demonstrated in, e.g., [16], Appendix A.1).

Suppose that we are given a generalized prediction $(l_1, \ldots, l_n)^{\mathrm{T}}$ computed by the aggregating pseudo-algorithm from a normalized distribution on the experts. Since $(l_1, \ldots, l_n)^{\mathrm{T}}$ is a superprediction (remember that we are assuming $\eta \le 1$), we are only required to find a permitted prediction

$$\begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{pmatrix} = \begin{pmatrix} (u_1 - 1)^2 + (u_2)^2 + \cdots + (u_n)^2 \\ (u_1)^2 + (u_2 - 1)^2 + \cdots + (u_n)^2 \\ \vdots \\ (u_1)^2 + (u_2)^2 + \cdots + (u_n - 1)^2 \end{pmatrix} \tag{15}$$

(cf. (5)) satisfying

$$\lambda_1 \le l_1, \ \ldots, \ \lambda_n \le l_n. \tag{16}$$

Now suppose we are given a generalized prediction $(L_1, \ldots, L_n)^{\mathrm{T}}$ computed by the aggregating pseudo-algorithm from an unnormalized distribution on the experts; in other words, we are given

$$\begin{pmatrix} L_1 \\ \vdots \\ L_n \end{pmatrix} = \begin{pmatrix} l_1 + c \\ \vdots \\ l_n + c \end{pmatrix}$$

for some $c \in \mathbb{R}$. To find (15) satisfying (16) we can first find the largest $t \in \mathbb{R}$ such that $(L_1 - t, \ldots, L_n - t)^{\mathrm{T}}$ is still a superprediction and then find (15) satisfying

$$\lambda_1 \le L_1 - t, \ \ldots, \ \lambda_n \le L_n - t. \tag{17}$$

Since $t \ge c$, it is clear that $(\lambda_1, \ldots, \lambda_n)^{\mathrm{T}}$ will also satisfy the required (16).

Proposition 1. Define $s \in \mathbb{R}$ by the requirement

$$\sum_{i=1}^n (s - L_i)^+ = 2. \tag{18}$$

The unique solution to the optimization problem $t \to \max$ under the constraints (17) with $\lambda_1, \ldots, \lambda_n$ as in (15) will be

$$u_i = \frac{(s - L_i)^+}{2}, \quad i = 1, \ldots, n, \tag{19}$$

$$t = s - 1 - (u_1)^2 - \cdots - (u_n)^2. \tag{20}$$

There exists a unique $s$ satisfying (18) since the left-hand side of (18) is a continuous, increasing (strictly increasing when positive) and unbounded above function of $s$. The substitution function is given by (19).
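The substitution function (18)–(19) and the way Algorithm 1 uses it can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: all function names are hypothetical, outcomes are indexed from 0, and $s$ is found by a sort-based search (any root-finding method for (18) would do). Because the weights are left unnormalized, as in Algorithm 1, they shrink over time; on very long runs one may wish to rescale them, which only shifts every $G_N(\omega)$ by the same constant and leaves the prediction unchanged.

```python
import math


def brier_loss(outcome, forecast):
    """Brier loss of a probability forecast when `outcome` (0-based) occurs."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(forecast))


def substitution(G):
    """Solve sum_i (s - G_i)^+ = 2 for s, as in (18), and return u_i = (s - G_i)^+ / 2."""
    g = sorted(G)
    # Assume exactly the m smallest entries are active; keep the largest consistent m.
    for m in range(len(g), 0, -1):
        s = (2.0 + sum(g[:m])) / m
        if s > g[m - 1]:
            break          # always reached by m = 1, since 2 + g[0] > g[0]
    return [max(s - Gi, 0.0) / 2.0 for Gi in G]


def predict(weights, expert_forecasts):
    """Learner's forecast gamma_N from the weights w_{N-1}^k and the experts' forecasts."""
    n = len(expert_forecasts[0])
    G = [-math.log(sum(w * math.exp(-brier_loss(o, f))
                       for w, f in zip(weights, expert_forecasts)))
         for o in range(n)]
    return substitution(G)


def update(weights, expert_forecasts, outcome):
    """Weight update w_N^k := w_{N-1}^k * exp(-loss of expert k)."""
    return [w * math.exp(-brier_loss(outcome, f))
            for w, f in zip(weights, expert_forecasts)]


# Two hypothetical experts forecasting three outcomes.
weights = [1.0, 1.0]
forecasts = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
gamma = predict(weights, forecasts)
weights = update(weights, forecasts, outcome=0)
print(gamma, weights)
```

Solving (18) amounts to projecting the vector $-G/2$ onto the probability simplex in the Euclidean norm, which is why a single pass over the sorted entries suffices.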
Proof of Proposition 1. Let us denote the $u_i$ and $t$ defined by (19) and (20) as $\bar{u}_i$ and $\bar{t}$, respectively. To see that they satisfy the constraints (17), notice that the $i$th constraint can be spelt out as

$$(u_1)^2 + \cdots + (u_n)^2 - 2u_i + 1 \le L_i - t,$$

which immediately follows from (19) and (20). As a by-product, we can see that the inequality becomes an equality, i.e.,

$$\bar{t} = L_i - 1 + 2\bar{u}_i - (\bar{u}_1)^2 - \cdots - (\bar{u}_n)^2, \tag{21}$$

for all $i$ with $\bar{u}_i > 0$.

We can rewrite (17) as

$$\begin{aligned} t &\le L_1 - 1 + 2u_1 - (u_1)^2 - \cdots - (u_n)^2, \\ &\ \ \vdots \\ t &\le L_n - 1 + 2u_n - (u_1)^2 - \cdots - (u_n)^2, \end{aligned} \tag{22}$$

and our goal is to prove that these inequalities imply $t < \bar{t}$ (unless $u_1 = \bar{u}_1, \ldots, u_n = \bar{u}_n$). Choose $\bar{u}_i$ (necessarily $\bar{u}_i > 0$ unless $u_1 = \bar{u}_1, \ldots, u_n = \bar{u}_n$; in the latter case, however, we can, and will, also choose $\bar{u}_i > 0$) for which $\epsilon_i := \bar{u}_i - u_i$ is maximal. Then every value of $t$ satisfying (22) will also satisfy

$$\begin{aligned} t &\le L_i - 1 + 2u_i - \sum_{j=1}^n (u_j)^2 \\ &= L_i - 1 + 2\bar{u}_i - 2\epsilon_i - \sum_{j=1}^n (\bar{u}_j)^2 + 2 \sum_{j=1}^n \epsilon_j \bar{u}_j - \sum_{j=1}^n (\epsilon_j)^2 \\ &\le L_i - 1 + 2\bar{u}_i - \sum_{j=1}^n (\bar{u}_j)^2 - \sum_{j=1}^n (\epsilon_j)^2 \le \bar{t}, \end{aligned}$$

with the last $\le$ following from (21) and becoming $<$ when not all $u_j$ coincide with $\bar{u}_j$.

The detailed description of the resulting prediction algorithm was given as Algorithm 1 in Section 2. As discussed, that algorithm uses the generalized prediction $G_N(\omega)$ computed from unnormalized weights.

6 Conclusion

In this paper we only considered the simplest prediction problem for the Brier game: competing with a finite pool of experts. In the case of square-loss regression, it is possible to find efficient closed-form prediction algorithms competitive with linear functions (see, e.g., [3], Chapter 11). Such algorithms can often be "kernelized" to obtain prediction algorithms competitive with reproducing kernel Hilbert spaces of prediction rules. This would be an appealing research programme in the case of the Brier game as well.
Acknowledgments

We are grateful to Football-Data and Tennis-Data for providing access to the data used in this paper. This work was partly supported by EPSRC (grant EP/F002998/1). Comments by Alexey Chernov, Yuri Kalnishkan, Alex Gammerman, Bob Vickers, and the anonymous referees for the conference version have helped us improve the presentation. The latter also suggested comparing our results to the Weighted Average Algorithm and the Hedge algorithm.

References

[1] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1–3, 1950.
[2] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the Association for Computing Machinery, 44:427–485, 1997.
[3] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, England, 2006.
[4] A. Philip Dawid. Probability forecasting. In Samuel Kotz, Norman L. Johnson, and Campbell B. Read, editors, Encyclopedia of Statistical Sciences, volume 7. Wiley, New York, 1986.
[5] Alfredo DeSantis, George Markowsky, and Mark N. Wegman. Learning probabilistic prediction functions. In Proceedings of the Twenty Ninth Annual IEEE Symposium on Foundations of Computer Science, Los Alamitos, CA, 1988. IEEE Computer Society.
[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 1997.
[7] G. H. Hardy, John E. Littlewood, and George Pólya. Inequalities. Cambridge University Press, Cambridge, England, second edition, 1952.
[8] David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44, 1998.
[9] Yuri Kalnishkan and Michael V. Vyugin. The Weak Aggregating Algorithm and weak mixability. In Peter Auer and Ron Meir, editors, Proceedings of the Eighteenth Annual Conference on Learning Theory, volume 3559 of Lecture Notes in Computer Science, Berlin, 2005. Springer.
[10] Jyrki Kivinen and Manfred K. Warmuth. Averaging expert predictions. In Paul Fischer and Hans U. Simon, editors, Proceedings of the Fourth European Conference on Computational Learning Theory, volume 1572 of Lecture Notes in Artificial Intelligence, Berlin, 1999. Springer.
[11] Nick Littlestone and Manfred K. Warmuth. The Weighted Majority Algorithm. Information and Computation, 108:212–261, 1994.
[12] John A. Thorpe. Elementary Topics in Differential Geometry. Springer, New York, 1979.
[13] Vladimir Vovk. Aggregating strategies. In Mark Fulk and John Case, editors, Proceedings of the Third Annual Workshop on Computational Learning Theory, San Mateo, CA, 1990. Morgan Kaufmann.
[14] Vladimir Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56, 1998.
[15] Vladimir Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35:247–282, 1999.
[16] Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.
[17] Vladimir Vovk and Fedor Zhdanov. Prediction with expert advice for the Brier game. In Andrew McCallum and Sam Roweis, editors, Proceedings of the Twenty Fifth International Conference on Machine Learning, 2008.

A Watkins's theorem

Watkins's theorem is stated in [15] (Theorem 8) not in sufficient generality: it presupposes that the loss function is perfectly mixable. The proof, however, shows that this assumption is irrelevant (it can be made part of the conclusion), and the goal of this appendix is to give a self-contained statement of a suitable version of the theorem.

In this appendix we will use a slightly more general notion of a game of prediction $(\Omega, \Gamma, \lambda)$: namely, the loss function $\lambda : \Omega \times \Gamma \to \overline{\mathbb{R}}$ is now allowed to take values in the extended real line $\overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty, \infty\}$ (although the value $-\infty$ will be later disallowed).

Partly following [14], for each $K = 1, 2, \ldots$ and each $a > 0$ we consider the following perfect-information game $G_K(a)$ (the "global game") between two players, Learner and Environment. Environment is a team of $K + 1$ players called Expert 1 to Expert $K$ and Reality, who play with Learner according to Protocol 1. Learner wins if, for all $N = 1, 2, \ldots$ and all $k \in \{1, \ldots, K\}$,

$$L_N \le L_N^k + a; \tag{23}$$

otherwise, Environment wins. It is possible that $L_N = \infty$ or $L_N^k = \infty$ in (23); the
interpretation of inequalities involving infinities is natural.

For each $K$ we will be interested in the set of those $a > 0$ for which Learner has a winning strategy in the game $G_K(a)$ (we will denote this by $\mathrm{L} \smile G_K(a)$). It is obvious that

$$\mathrm{L} \smile G_K(a) \ \&\ a' > a \implies \mathrm{L} \smile G_K(a');$$

therefore, for each $K$ there exists a unique borderline value $a_K$ such that $\mathrm{L} \smile G_K(a)$ holds when $a > a_K$ and fails when $a < a_K$. It is possible that $a_K = \infty$ (but remember that we are only interested in finite values of $a$).

These are our assumptions about the game of prediction (similar to those in [14]):

$\Gamma$ is a compact topological space;
for each $\omega \in \Omega$, the function $\gamma \in \Gamma \mapsto \lambda(\omega, \gamma)$ is continuous ($\overline{\mathbb{R}}$ is equipped with the standard topology);
there exists $\gamma \in \Gamma$ such that, for all $\omega \in \Omega$, $\lambda(\omega, \gamma) < \infty$;
the function $\lambda$ is bounded below.

We say that the game of prediction $(\Omega, \Gamma, \lambda)$ is $\eta$-mixable, where $\eta > 0$, if

$$\forall \gamma_1 \in \Gamma,\ \gamma_2 \in \Gamma,\ \alpha \in [0, 1]\ \ \exists \delta \in \Gamma\ \ \forall \omega \in \Omega : \quad e^{-\eta \lambda(\omega, \delta)} \ge \alpha e^{-\eta \lambda(\omega, \gamma_1)} + (1 - \alpha) e^{-\eta \lambda(\omega, \gamma_2)}. \tag{24}$$

In the case of finite $\Omega$, this condition says that the image of the superprediction set under the mapping $\Phi_\eta$ (see (4)) is convex. The game of prediction is perfectly mixable if it is $\eta$-mixable for some $\eta > 0$.

It follows from [7] (Theorem 9, applied to the means $M_\phi$ with $\phi(x) = e^{-\eta x}$) that if the prediction game is $\eta$-mixable it will remain $\eta'$-mixable for any positive $\eta' < \eta$. (For another proof, see the end of the proof of Lemma 9 in [14].) Let $\eta^*$ be the supremum of the $\eta$ for which the prediction game is $\eta$-mixable (with $\eta^* := 0$ when the game is not perfectly mixable). The compactness of $\Gamma$ implies that the prediction game is $\eta^*$-mixable.

Theorem 2 (Chris Watkins). For any $K \in \{1, 2, \ldots\}$,

$$a_K = \frac{\ln K}{\eta^*}.$$

In particular, $a_K < \infty$ if and only if the game is perfectly mixable.

The theorem does not say explicitly, but it is easy to check, that $\mathrm{L} \smile G_K(a_K)$: this follows both from general considerations (cf. Lemma 3 in [14]) and from the fact that the strong aggregating algorithm wins $G_K(a_K) = G_K(\ln K / \eta^*)$.

Proof of Theorem 2. The proof will use some notions and notation used in the statement and proof of Theorem 1 of [14]. Without loss of generality we can, and will, assume that the loss function satisfies $\lambda > 1$ (add a suitable constant to $\lambda$ if needed). Therefore, Assumption 4 of [14] (the only assumption in [14] not directly made in this paper) is satisfied. In view of the fact that $\mathrm{L} \smile G_K(\ln K / \eta^*)$, we only need to show that $\mathrm{L} \smile G_K(a)$ does not hold for $a < \ln K / \eta^*$. Fix $a < \ln K / \eta^*$.
The separation curve, as defined in [14], consists of the points $(c(\beta), c(\beta)/\eta) \in [0, \infty)^2$, where $\beta := e^{-\eta}$ and $\eta$ ranges over $[0, \infty]$ (see [14], Theorem 1). Since the two-fold convex mixture in (4) can be replaced by any finite convex mixture (apply two-fold mixtures repeatedly), setting $\eta := \eta^*$ shows that the point $(1, 1/\eta^*)$ is Northeast of (actually belongs to) the separation curve. On the other hand, the point $(1, a/\ln K)$ is Southwest and outside of the separation curve (use Lemmas 8 1 of [14]). Therefore, E (= Environment) has a winning strategy in the game $G(1, a/\ln K)$, as defined in [14]. It is easy to see from the proof of Theorem 1 in [14] that the definition of the game $G$ in [14] can be modified, without changing the conclusion about $G(1, a/\ln K)$, by replacing the line

    E chooses $n \ge 1$ {size of the pool}

in the protocol on p. 153 of [14] by

    E chooses $n^* \ge 1$ {lower bound on the size of the pool}
    L chooses $n \ge n^*$ {size of the pool}

(indeed, the proof in Section 6 of [14] only requires that there should be sufficiently many experts). Let $n^*$ be the first move by Environment according to her winning strategy.

Now suppose L $G_K(a)$. From the fact that there exists Learner's strategy $\mathcal{L}_1$ winning $G_K(a)$ we can deduce: there exists Learner's strategy $\mathcal{L}_2$ winning $G_{K^2}(2a)$ (we can split the $K^2$ experts into $K$ groups of $K$, merge the experts' decisions in each group with $\mathcal{L}_1$, and finally merge the groups' decisions with $\mathcal{L}_1$); there exists Learner's strategy $\mathcal{L}_3$ winning $G_{K^3}(3a)$ (we can split the $K^3$ experts into $K$ groups of $K^2$, merge the experts' decisions in each group with $\mathcal{L}_2$, and finally merge the groups' decisions with $\mathcal{L}_1$); and so on. When the number $K^m$ of experts exceeds $n^*$, we obtain a contradiction: Learner can guarantee

$$L_N \le L^k_N + ma$$

for all $N$ and all $K^m$ experts $k$, and Environment can guarantee that for some $N$ and $k$

$$L_N > L^k_N + \frac{a}{\ln K} \ln(K^m) = L^k_N + ma.$$

B Comparison with other prediction algorithms

Other popular algorithms for prediction with expert advice that could be used instead of Algorithm 1 in our empirical studies reported in Section 3 are, among others, Kivinen and Warmuth's [10] Weighted Average Algorithm (WdAA), Kalnishkan and Vyugin's [9] Weak Aggregating Algorithm (WkAA), and Freund and Schapire's [6] Hedge algorithm (HA). In this appendix we consider these three algorithms and three more naive algorithms (which, nevertheless, perform surprisingly well).
Figure 5: The difference between the cumulative loss of each of the 8 bookmakers and of the Weighted Average Algorithm (WdAA) on the football data. The chosen value of the parameter $c = 1/\eta$ for the WdAA, $c := 16/3$, minimizes its theoretical loss bound. The theoretical lower bound $-\ln 8$ for Algorithm 1 is also shown (the theoretical lower bound for the Weighted Average Algorithm can be extracted from Table 1 below).

The Weighted Average Algorithm is very similar to the Strong Aggregating Algorithm (SAA) used in this paper: the WdAA maintains the same weights for the experts as the SAA, and the only difference is that the WdAA merges the experts' predictions by averaging them according to their weights, whereas the SAA uses a more complicated minimax optimal merging scheme (given by (19) for the Brier game). The performance guarantee for the WdAA applied to the Brier game is weaker than the optimal (1), but of course this does not mean that its empirical performance is necessarily worse than that of the SAA (i.e., Algorithm 1). Figures 5 and 6 show the performance of this algorithm, in the same format as before (see Figures 1 and 3). We can see that for the football data the maximal difference between the cumulative loss of the WdAA and the cumulative loss of the best expert is larger than for Algorithm 1 but still well within the optimal bound $\ln K$ given by (1). For the tennis data the maximal difference is about twice as large as for Algorithm 1, violating the optimal bound $\ln K$.

Figure 6: The difference between the cumulative loss of each of the 4 bookmakers and of the WdAA for $c := 4$ on the tennis data.

In its most basic form ([10], the beginning of Section 6), the WdAA works in the following protocol. At each step each expert, Learner, and Reality choose an element of the unit ball in $\mathbb{R}^n$, and the loss function is the squared distance between the decision (Learner's or an expert's move) and the observation (Reality's move). This covers the Brier game with $\Omega = \{1, \ldots, n\}$, each observation $\omega \in \Omega$ represented as the vector $(\delta_\omega\{1\}, \ldots, \delta_\omega\{n\})$, and each decision $\gamma \in \mathcal{P}(\Omega)$ represented as the vector $(\gamma\{1\}, \ldots, \gamma\{n\})$. However, in the Brier game the decision makers' moves are known to belong to the simplex

$$\Bigl\{ (u_1, \ldots, u_n) \in [0, \infty)^n \Bigm| \sum_{i=1}^n u_i = 1 \Bigr\},$$

and Reality's move is known to be one of the vertices of this simplex. Therefore, we can optimize the ball radius by considering the smallest ball containing the simplex rather than the unit ball. This is what we did for the results reported here (although the results reported in the conference version of this paper [17] are for the WdAA applied to the unit cube in $\mathbb{R}^n$). The radius of the smallest ball is

$$R := \sqrt{1 - \frac{1}{n}} \approx \begin{cases} 0.8165 & \text{if } n = 3, \\ 0.7071 & \text{if } n = 2, \\ 1 & \text{if } n \text{ is large.} \end{cases}$$

As described in [10], the WdAA is parameterized by $c := 1/\eta$ instead of $\eta$, and the optimal value of $c$ is $c = 8R^2$, leading to the guaranteed loss bound

$$L_N \le \min_{k=1,\ldots,K} L^k_N + 8R^2 \ln K$$

for all $N = 1, 2, \ldots$ (see [10], Section 6). This is significantly looser than the bound (1) for Algorithm 1.
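The merging rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming exponential weights proportional to $e^{-\eta L^k}$ with $\eta = 1/c$ and uniform initial weights, and the Brier loss written as the squared distance to a vertex of the simplex; the names `simplex_radius`, `brier_loss` and `wdaa_run` and the toy data are hypothetical and are not taken from [10] or from the implementation used for the figures.

```python
import math

def simplex_radius(n):
    """Radius of the smallest ball containing the probability simplex in R^n."""
    return math.sqrt(1.0 - 1.0 / n)

def brier_loss(prediction, outcome):
    """Squared distance between a probability vector and the vertex for `outcome`."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(prediction))

def wdaa_run(expert_predictions, outcomes, c):
    """Weighted-average merging with parameter c = 1/eta (assumed weight scheme).

    expert_predictions[t][k] is expert k's probability vector at step t;
    outcomes[t] is the index of the observed outcome.
    Returns (Learner's cumulative loss, list of experts' cumulative losses).
    """
    eta = 1.0 / c
    num_experts = len(expert_predictions[0])
    expert_losses = [0.0] * num_experts
    learner_loss = 0.0
    for preds, outcome in zip(expert_predictions, outcomes):
        # Exponential weights; subtracting the minimum loss avoids underflow.
        base = min(expert_losses)
        weights = [math.exp(-eta * (loss - base)) for loss in expert_losses]
        total = sum(weights)
        n = len(preds[0])
        # Learner's decision: weighted average of the experts' predictions.
        decision = [sum(w * p[i] for w, p in zip(weights, preds)) / total
                    for i in range(n)]
        learner_loss += brier_loss(decision, outcome)
        for k, p in enumerate(preds):
            expert_losses[k] += brier_loss(p, outcome)
    return learner_loss, expert_losses

# Toy data (hypothetical): 2 experts, 3 outcomes, 4 steps.
n_outcomes, n_experts = 3, 2
c = 8 * simplex_radius(n_outcomes) ** 2      # c = 8 R^2, i.e. 16/3 for n = 3
preds = [[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]] * 4
outs = [0, 1, 0, 2]
loss, exp_losses = wdaa_run(preds, outs, c)
bound = min(exp_losses) + c * math.log(n_experts)
```

On this toy sequence `loss` stays below `bound`, as the guarantee $L_N \le \min_k L^k_N + 8R^2 \ln K$ requires; passing `c = 1` instead gives the variant whose weights use the learning rate $\eta = 1$.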
Figure 7: The maximal difference (5) for the WdAA as function of the parameter $c$ on the football data. The theoretical guarantee $\ln 8$ for the maximal difference for Algorithm 1 is also shown (the theoretical guarantee for the WdAA is given in Table 1).

The values $c = 16/3$ and $c = 4$ used in Figures 5 and 6, respectively, are obtained by minimizing the WdAA's performance guarantee, but minimizing a loose bound might not be such a good idea. Figure 7 shows the maximal difference

$$\max_{N=1,\ldots,6473} \Bigl( L_N(c) - \min_{k=1,\ldots,8} L^k_N \Bigr), \tag{5}$$

where $L_N(c)$ is the loss of the WdAA with parameter $c$ on the football data over the first $N$ steps and $L^k_N$ is the analogous loss of the $k$th expert, as a function of $c$. Similarly, Figure 8 shows the maximal difference

$$\max_{N=1,\ldots,10087} \Bigl( L_N(c) - \min_{k=1,\ldots,4} L^k_N \Bigr) \tag{6}$$

for the tennis data. And indeed, in both cases the value of $c$ minimizing the empirical loss is far from the value minimizing the bound; as could be expected, the empirical optimal value for the WdAA is not so different from the optimal value for Algorithm 1. The following two figures, 9 and 10, demonstrate that there is no such anomaly for Algorithm 1.

Figures 11 and 12 show the behaviour of the WdAA for the value of parameter $c = 1$, i.e., $\eta = 1$, that is optimal for Algorithm 1. They look remarkably similar to Figures 1 and 3, respectively.

Figure 8: The maximal difference (6) for the WdAA as function of the parameter $c$ on the tennis data. The theoretical bound for the WdAA is 5.545 (see Table 1).

Figure 9: The maximal difference ((5) with $\eta$ in place of $c$) for Algorithm 1 as function of the parameter $\eta$ on the football data.

Figure 10: The maximal difference ((6) with $\eta$ in place of $c$) for Algorithm 1 as function of the parameter $\eta$ on the tennis data.

Figure 11: The difference between the cumulative loss of each of the 8 bookmakers and of the WdAA on the football data for $c = 1$ (the value of parameter minimizing the theoretical performance guarantee for Algorithm 1).

Figure 12: The difference between the cumulative loss of each of the 4 bookmakers and of the WdAA for $c = 1$ on the tennis data.

The following two algorithms, the Weak Aggregating Algorithm (WkAA) and the Hedge algorithm (HA), make increasingly weaker assumptions about the prediction game being played. Algorithm 1 computes the experts' weights taking full account of the degree of convexity of the loss function and uses a minimax optimal substitution function. Not surprisingly, it leads to the optimal loss bound of the form (). The WdAA computes the experts' weights in the same way, but uses a suboptimal substitution function; this naturally leads to a suboptimal loss bound. The WkAA does not know that the loss function is strictly convex; it computes the experts' weights in a way that leads to decent results for all convex functions. The WkAA uses the same substitution function as the WdAA, but this appears less important than the way it computes the weights. The HA knows even less: it does not even know that its and the experts' performance is measured using a loss function. At each step the HA decides which expert it is going to follow, and at the end of the step it is only told the losses suffered by all experts.

Therefore, it is not surprising that the WkAA does not perform as well as Algorithm 1 and the WdAA with $c = 1$; the performance of the HA is even weaker: see Figures 13–16. The HA is a randomized algorithm, so we show the expected performance.

Figure 13: The maximal difference for the Weak Aggregating Algorithm (WkAA) as function of $c$ on the football data.

Figure 14: The maximal difference for the WkAA as function of $c$ on the tennis data.

Figure 15: The expected maximal difference for the Hedge algorithm (HA) and for the SAA Hedge algorithm (SAA-HA) as a function of $\beta$ on the football data.

Figure 16: The expected maximal difference for the HA and for the SAA-HA as a function of $\beta$ on the tennis data.

Figures 13–16 show the performance of the WkAA and the HA for all possible values of their parameters ($c$ and $\beta$, respectively). We do not show the optimal values of parameters since neither algorithm satisfies a loss bound of the form () (typical loss bounds for these algorithms allow $A$ to depend on $N$, and the optimal value would also depend on $N$).

In the case of the HA, the loss bound given in the original paper [6] was replaced, in the same framework, by a stronger bound in [14] (Example 7). The stronger bound is achieved by the SAA applied to the HA framework described above (with no loss function); this algorithm is referred to as SAA-HA in the captions. The description of the SAA-HA given in [14] admits some freedom in the choice of Learner's decision; our implementation replaces the HA's weights $p^k$, $k = 1, \ldots, K$, with

$$\frac{\ln\bigl(1 + (\beta - 1)p^k\bigr)}{\sum_{k=1}^{K} \ln\bigl(1 + (\beta - 1)p^k\bigr)}, \quad k = 1, \ldots, K.$$

The losses suffered by the HA and the SAA-HA are very close.

An interesting observation is that, for both football and tennis data, the loss of the HA is almost minimized by setting its parameter $\beta$ to 0 (the qualification "almost" is necessary in the case of the tennis data as well: the lines of maximal difference in Figure 16 are not monotonic for $\beta$ extremely close to 0). The HA with $\beta = 0$ coincides with the Follow the Leader Algorithm (FLA), which chooses the same decision as the best (with the smallest loss up to now) expert; if there are several best experts (which almost never happens after the first step), their predictions are averaged with equal weights. Standard examples (see, e.g., [3], Section 4.3) show that this algorithm (unlike its version Follow the Perturbed Leader) can fail badly on some data sequences. However, its empirical performance (Figures 17 and 18) on our data sets is not so bad: it violates the loss bounds for Algorithm 1 only slightly.

Figure 17: The difference between the cumulative loss of each of the 8 bookmakers and of the Follow the Leader Algorithm on the football data.

Figure 18: The difference between the cumulative loss of each of the 4 bookmakers and of the Follow the Leader Algorithm on the tennis data.

The decent performance of the Follow the Leader Algorithm suggests checking the empirical performance of other similarly naive algorithms. The Simple Average Algorithm's decision is defined as the arithmetic mean of the experts' decisions (with equal weights). Figures 19 and 20 show the performance of this algorithm. It does violate the theoretical loss bound for Algorithm 1, but not significantly (especially in the case of football data).

Figure 19: The difference between the cumulative loss of each of the 8 bookmakers and of the Simple Average Algorithm on the football data.

Figure 20: The difference between the cumulative loss of each of the 4 bookmakers and of the Simple Average Algorithm on the tennis data.

The last naive algorithm that we consider is in fact optimal, but for a different loss function. The Bayes Mixture Algorithm (BMA) is the Strong Aggregating Algorithm applied to the log loss function. This algorithm has a very simple description [13], and was studied from the point of view of prediction with expert advice already in [5]. Figures 21 and 22 show the performance of the BMA measured by the Brier loss function, as usual. The performance is excellent for the football data but much weaker for tennis.

Figure 21: The difference between the cumulative loss of each of the 8 bookmakers and of the Bayes Mixture Algorithm on the football data.

Figure 22: The difference between the cumulative loss of each of the 4 bookmakers and of the Bayes Mixture Algorithm on the tennis data.

Despite the decent performance of the three naive algorithms on our two data sets, there is always a danger of catastrophic performance on some data set: there are no performance guarantees for these algorithms whatsoever. It is an important advantage of more sophisticated algorithms that they establish some upper bound on the algorithm's regret.

Precise numbers associated with the figures referred to above are given in Tables 1 and 2: the second column gives the maximal differences (5) and (6), respectively. The marked numbers are the maximal differences corresponding to the best value of parameter chosen in hindsight, after seeing the data set. Therefore, the corresponding numbers involve data snooping and cannot serve as a fair measure of performance. The third column gives the theoretical performance guarantees (if available).
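The two tabulated quantities can be recomputed along the following lines. This is a sketch under stated assumptions rather than the code behind Tables 1 and 2: `max_regret` evaluates the maximal differences (5) and (6) for arbitrary per-step losses, and the bounds use $\ln K$ for Algorithm 1 and $8R^2 \ln K$ for the WdAA, where $R^2 = 1 - 1/n$ for the smallest ball containing the simplex with $n$ vertices. The function names and the toy losses are hypothetical.

```python
import math
from itertools import accumulate

def max_regret(learner_step_losses, expert_step_losses):
    """Maximal difference, over N, between Learner's cumulative loss and
    the cumulative loss of the best expert up to step N.

    expert_step_losses[k][t] is the loss of expert k at step t.
    """
    learner_cum = list(accumulate(learner_step_losses))
    experts_cum = [list(accumulate(losses)) for losses in expert_step_losses]
    return max(
        learner_cum[t] - min(cum[t] for cum in experts_cum)
        for t in range(len(learner_cum))
    )

def bound_algorithm1(num_experts):
    """Bound ln K for Algorithm 1."""
    return math.log(num_experts)

def bound_wdaa(num_experts, num_outcomes):
    """Bound 8 R^2 ln K for the WdAA with c = 8 R^2 and R^2 = 1 - 1/n."""
    return 8.0 * (1.0 - 1.0 / num_outcomes) * math.log(num_experts)

# Football: K = 8 bookmakers, n = 3 outcomes; tennis: K = 4, n = 2.
football = (bound_algorithm1(8), bound_wdaa(8, 3))
tennis = (bound_algorithm1(4), bound_wdaa(4, 2))

# Hypothetical per-step losses, for illustration only.
toy_regret = max_regret([0.5, 0.4, 0.9], [[0.3, 0.6, 0.2], [0.7, 0.1, 0.8]])
```

With these definitions `football` is approximately `(2.079, 11.090)` and `tennis` approximately `(1.386, 5.545)`, consistent with the values $c = 16/3$ and $c = 4$ used for the WdAA.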
Algorithm                      Maximal difference   Theoretical bound
Algorithm 1
WdAA (c = 16/3)
WdAA (c = 1)                   1181                 none of the form ()
WkAA                                                none of the form ()
HA (expected)                  3694                 none of the form ()
SAA-HA (expected)              388                  none of the form ()
Follow the Leader Algorithm    7983                 none
Simple Average Algorithm       54                   none
Bayes Mixture Algorithm        1060                 none

Table 1: The maximal difference between the loss of each algorithm and the loss of the best expert for the football data (second column); the theoretical upper bound on this difference (third column).

Algorithm                      Maximal difference   Theoretical bound
Algorithm 1
WdAA (c = 4)
WdAA (c = 1)                                        none of the form ()
WkAA                                                none of the form ()
HA (expected)                                       none of the form ()
SAA-HA (expected)                                   none of the form ()
Follow the Leader Algorithm                         none
Simple Average Algorithm       3798                 none
Bayes Mixture Algorithm                             none

Table 2: The maximal difference between the loss of each algorithm and the loss of the best expert for the tennis data (second column); the theoretical upper bound on this difference (third column).
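For concreteness, the three naive merging rules listed in the tables (Follow the Leader, Simple Average, Bayes Mixture) can be sketched in Python as follows. The sketch assumes uniform initial weights, and for the Bayes Mixture Algorithm it uses the usual Bayesian posterior update, in which each expert's weight is multiplied by the probability that expert assigned to the observed outcome. All function names and the toy inputs are hypothetical.

```python
def simple_average(preds):
    """Arithmetic mean of the experts' probability vectors (equal weights)."""
    n = len(preds[0])
    return [sum(p[i] for p in preds) / len(preds) for i in range(n)]

def follow_the_leader(preds, cumulative_losses):
    """Average the predictions of the experts with the smallest loss so far."""
    best = min(cumulative_losses)
    leaders = [p for p, loss in zip(preds, cumulative_losses) if loss == best]
    return simple_average(leaders)

def bayes_mixture(preds, weights):
    """Weighted average of the predictions under the current posterior weights."""
    total = sum(weights)
    n = len(preds[0])
    return [sum(w * p[i] for w, p in zip(weights, preds)) / total
            for i in range(n)]

def bayes_update(preds, weights, outcome):
    """Multiply each weight by the probability the expert gave to `outcome`."""
    return [w * p[outcome] for w, p in zip(weights, preds)]

# Toy step (hypothetical numbers): 2 experts, 3 outcomes.
preds = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
avg = simple_average(preds)                            # about [0.4, 0.4, 0.2]
ftl = follow_the_leader(preds, [1.2, 0.7])             # the second expert leads
weights = bayes_update(preds, [1.0, 1.0], outcome=1)   # posterior after outcome 1
bma = bayes_mixture(preds, weights)                    # about [0.35, 0.425, 0.225]
```

None of these rules uses a learning rate tuned to the Brier loss, which is in line with the absence of a theoretical bound for them in the third column.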
D-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
Chapter 7. Sealed-bid Auctions
Chapter 7 Sealed-bid Auctions An auction is a procedure used for selling and buying items by offering them up for bid. Auctions are often used to sell objects that have a variable price (for example oil)
Solution of Linear Systems
Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start
Practical Guide to the Simplex Method of Linear Programming
Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear
The p-norm generalization of the LMS algorithm for adaptive filtering
The p-norm generalization of the LMS algorithm for adaptive filtering Jyrki Kivinen University of Helsinki Manfred Warmuth University of California, Santa Cruz Babak Hassibi California Institute of Technology
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
The Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
Numerical Algorithms for Predicting Sports Results
Numerical Algorithms for Predicting Sports Results by Jack David Blundell, 1 School of Computing, Faculty of Engineering ABSTRACT Numerical models can help predict the outcome of sporting events. The features
Notes on Determinant
ENGG2012B Advanced Engineering Mathematics Notes on Determinant Lecturer: Kenneth Shum Lecture 9-18/02/2013 The determinant of a system of linear equations determines whether the solution is unique, without
Introduction to Algebraic Geometry. Bézout s Theorem and Inflection Points
Introduction to Algebraic Geometry Bézout s Theorem and Inflection Points 1. The resultant. Let K be a field. Then the polynomial ring K[x] is a unique factorisation domain (UFD). Another example of a
Supplement to Call Centers with Delay Information: Models and Insights
Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290
17.3.1 Follow the Perturbed Leader
CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.
TAKE-AWAY GAMES. ALLEN J. SCHWENK California Institute of Technology, Pasadena, California INTRODUCTION
TAKE-AWAY GAMES ALLEN J. SCHWENK California Institute of Technology, Pasadena, California L INTRODUCTION Several games of Tf take-away?f have become popular. The purpose of this paper is to determine the
MOP 2007 Black Group Integer Polynomials Yufei Zhao. Integer Polynomials. June 29, 2007 Yufei Zhao [email protected]
Integer Polynomials June 9, 007 Yufei Zhao [email protected] We will use Z[x] to denote the ring of polynomials with integer coefficients. We begin by summarizing some of the common approaches used in dealing
TOPIC 4: DERIVATIVES
TOPIC 4: DERIVATIVES 1. The derivative of a function. Differentiation rules 1.1. The slope of a curve. The slope of a curve at a point P is a measure of the steepness of the curve. If Q is a point on the
Two-Stage Stochastic Linear Programs
Two-Stage Stochastic Linear Programs Operations Research Anthony Papavasiliou 1 / 27 Two-Stage Stochastic Linear Programs 1 Short Reviews Probability Spaces and Random Variables Convex Analysis 2 Deterministic
The number of generalized balanced lines
The number of generalized balanced lines David Orden Pedro Ramos Gelasio Salazar Abstract Let S be a set of r red points and b = r + 2δ blue points in general position in the plane, with δ 0. A line l
Mathematics Course 111: Algebra I Part IV: Vector Spaces
Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation
LARGE CLASSES OF EXPERTS
LARGE CLASSES OF EXPERTS Csaba Szepesvári University of Alberta CMPUT 654 E-mail: [email protected] UofA, October 31, 2006 OUTLINE 1 TRACKING THE BEST EXPERT 2 FIXED SHARE FORECASTER 3 VARIABLE-SHARE
LECTURE 15: AMERICAN OPTIONS
LECTURE 15: AMERICAN OPTIONS 1. Introduction All of the options that we have considered thus far have been of the European variety: exercise is permitted only at the termination of the contract. These
