E0 370 Statistical Learning Theory                Lecture 19                Oct 22, 2013

                              Online Convex Optimization

Lecturer: Shivani Agarwal                                          Scribe: Aadirupa

1  Introduction

In this lecture we shall look at a fairly general setting of online convex optimization which, as we shall see, encompasses some of the online learning problems we have seen so far as special cases. A standard offline convex optimization problem involves a convex set $\Omega \subseteq \mathbb{R}^n$ and a fixed convex cost function $c : \Omega \to \mathbb{R}$; the goal is to find a point $x \in \Omega$ that minimizes $c(x)$ over $\Omega$. An online convex optimization problem also involves a convex set $\Omega \subseteq \mathbb{R}^n$, but proceeds in trials: at each trial $t$, the player/algorithm must choose a point $x_t \in \Omega$, after which a convex cost function $c_t : \Omega \to \mathbb{R}$ is revealed by the environment/adversary, and a loss of $c_t(x_t)$ is incurred for the trial; the goal is to minimize the total loss incurred over a sequence of $T$ trials, relative to the best single point in $\Omega$ in hindsight.

Online Convex Optimization
    Convex set $\Omega \subseteq \mathbb{R}^n$
    For $t = 1, \ldots, T$:
        Play $x_t \in \Omega$
        Receive cost function $c_t : \Omega \to \mathbb{R}$
        Incur loss $c_t(x_t)$

We will denote the total loss incurred by an algorithm $A$ on $T$ trials as
$$L_T[A] = \sum_{t=1}^T c_t(x_t),$$
and the total loss of a fixed point $x \in \Omega$ as
$$L_T[x] = \sum_{t=1}^T c_t(x);$$
the regret of $A$ w.r.t. the best fixed point in $\Omega$ is then simply $L_T[A] - \inf_{x \in \Omega} L_T[x]$.

Let us consider which of the online learning problems we have seen earlier can be viewed as instances of online convex optimization (OCO):

1. Online linear regression (GD analysis). Here one selects a weight vector $w_t \in \mathbb{R}^n$, and incurs a loss given by $c_t(w_t) = \ell(y_t, w_t \cdot x_t)$, which is assumed to be convex in its second argument, and is therefore convex in $w_t$; however the total loss is compared against the best fixed weight vector only in $\{u \in \mathbb{R}^n : \|u\|_2 \le U\}$ for some $U > 0$. Call this latter set $\Omega$. Clearly, if the weight vectors $w_t$ are also selected to be in $\Omega$, e.g. by Euclidean projection onto $\Omega$ (which in this case amounts to a simple normalization), then this becomes an instance of OCO.
2. Online linear regression (EG analysis). Here one selects $w_t \in \Delta_n$, and incurs a loss given by $c_t(w_t) = \ell(y_t, w_t \cdot x_t)$, which is assumed to be convex in its second argument, and is therefore convex in $w_t$; the total loss is also compared against the best fixed weight vector in $\Delta_n$. Thus this can be viewed as an instance of OCO with $\Omega = \Delta_n$.
3. Online prediction from expert advice. Here one selects a probability vector $\hat p_t \in \Delta_n$, and incurs a loss $c_t(\hat p_t) = \ell(y_t, \hat p_t \cdot \xi_t)$, which is assumed to be convex in its second argument, and is therefore convex in $\hat p_t$. However the total loss is compared only against the best vector in $\{e_i \in \Delta_n : i \in [n]\}$, where $e_i$ denotes the unit vector with 1 in the $i$-th position and 0 elsewhere. In general, the smallest loss over this set is not the same as the smallest loss over $\Delta_n$, and therefore this is not an instance of OCO.

4. Online decision/allocation among experts. Here one selects a probability vector $\hat p_t \in \Delta_n$, and incurs a linear loss $c_t(\hat p_t) = \hat p_t \cdot l_t$, which is clearly convex in $\hat p_t$. The total loss is again compared against the best vector in $\{e_i \in \Delta_n : i \in [n]\}$ as above, but in this case, comparison against this set is actually equivalent to comparison against $\Delta_n$, since
$$\inf_{\hat p \in \Delta_n} L_T[\hat p] = \inf_{\hat p \in \Delta_n} \sum_{t=1}^T \hat p \cdot l_t = \inf_{\hat p \in \Delta_n} \hat p \cdot \sum_{t=1}^T l_t = \min_{i \in [n]} e_i \cdot \sum_{t=1}^T l_t = \min_{i \in [n]} \sum_{t=1}^T e_i \cdot l_t.$$
Thus online allocation with a linear loss as above can be viewed as an instance of OCO with $\Omega = \Delta_n$.

2  Online Gradient Descent Algorithm

The online gradient descent (OGD) algorithm [7] generalizes the basic gradient descent algorithm used for standard offline optimization problems, and can be described as follows:

Algorithm Online Gradient Descent (OGD)
    Parameter: $\eta > 0$
    Initialize: $x_1 \in \Omega$
    For $t = 1, \ldots, T$:
        Play $x_t \in \Omega$
        Receive cost function $c_t : \Omega \to \mathbb{R}$
        Incur loss $c_t(x_t)$
        Update: $\tilde x_{t+1} = x_t - \eta \nabla c_t(x_t)$
                $x_{t+1} = P_\Omega(\tilde x_{t+1})$, where $P_\Omega(x) = \operatorname{argmin}_{x' \in \Omega} \|x - x'\|_2$   (Euclidean projection of $x$ onto $\Omega$)

Theorem 2.1 (Zinkevich, 2003). Let $\Omega \subseteq \mathbb{R}^n$ be a closed, non-empty, convex set with $\|x - x_1\|_2 \le D \ \forall x \in \Omega$. Let $c_t : \Omega \to \mathbb{R}$ ($t = 1, 2, \ldots$) be any sequence of convex cost functions with bounded subgradients, $\|\nabla c_t(x)\|_2 \le G \ \forall t, \forall x \in \Omega$. Then
$$L_T[\mathrm{OGD}_\eta] - \inf_{x \in \Omega} L_T[x] \le \frac{1}{2}\left(\frac{D^2}{\eta} + \eta G^2 T\right).$$
Moreover, setting $\eta = \frac{D}{G\sqrt{T}}$ to minimize the right-hand side of the above bound (assuming $D$, $G$ and $T$, or upper bounds on them, are known in advance), we have
$$L_T[\mathrm{OGD}_\eta] - \inf_{x \in \Omega} L_T[x] \le DG\sqrt{T}.$$

Notes.
Before proving the above theorem, we note the following:

1. The analysis of the GD algorithm for online regression can be viewed as a special case of the above, with $D = U$ and $G = ZR$ (see Lecture 17 notes).

2. With $\eta$ as above (which depends on knowledge of $D$, $G$ and $T$), one gets an $O(\sqrt{T})$ regret bound. Such an $O(\sqrt{T})$ regret bound (with worse constants) can also be obtained by using a time-varying parameter $\eta_t \propto \frac{1}{\sqrt{t}}$; in fact this was the version contained in the original paper of Zinkevich [7].
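To make the OGD update above concrete, here is a minimal Python sketch. The function names, and the choice of an $\ell_2$-ball for $\Omega$ (for which Euclidean projection is just rescaling, as in Note 1), are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def project_l2_ball(x, radius):
    """Euclidean projection P_Omega onto the ball {u : ||u||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd(grads, eta, radius, n):
    """Online gradient descent with fixed step size eta.

    grads: list of callables; grads[t](x) returns a (sub)gradient of c_t at x.
    Returns the list of points x_1, ..., x_T played by the algorithm.
    """
    x = np.zeros(n)                        # x_1 in Omega
    iterates = []
    for g in grads:
        iterates.append(x.copy())          # play x_t
        x = x - eta * g(x)                 # gradient step: x~_{t+1}
        x = project_l2_ball(x, radius)     # projection: x_{t+1} = P_Omega(x~_{t+1})
    return iterates
```

For instance, with hypothetical quadratic costs $c_t(x) = \frac{1}{2}\|x - z_t\|_2^2$ one would pass gradient callables $x \mapsto x - z_t$; every played point then stays inside $\Omega$ by construction.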
Proof of Theorem 2.1. The proof is similar to that of the regret bound we obtained for the GD algorithm for online regression in Lecture 17. Let $x \in \Omega$. We have $\forall t \in [T]$:
$$c_t(x_t) - c_t(x) \le \nabla c_t(x_t) \cdot (x_t - x), \quad\text{by convexity of } c_t$$
$$= \frac{1}{\eta}\left(x_t - \tilde x_{t+1}\right) \cdot (x_t - x)$$
$$= \frac{1}{2\eta}\left(\|x - x_t\|_2^2 - \|x - \tilde x_{t+1}\|_2^2 + \|x_t - \tilde x_{t+1}\|_2^2\right)$$
$$\le \frac{1}{2\eta}\left(\|x - x_t\|_2^2 - \|x - x_{t+1}\|_2^2 + \|x_t - \tilde x_{t+1}\|_2^2\right), \quad\text{since } \|x - x_{t+1}\|_2 \le \|x - \tilde x_{t+1}\|_2$$
$$= \frac{1}{2\eta}\left(\|x - x_t\|_2^2 - \|x - x_{t+1}\|_2^2\right) + \frac{\eta}{2}\|\nabla c_t(x_t)\|_2^2.$$
Summing over $t = 1, \ldots, T$, we thus get
$$\sum_{t=1}^T \left(c_t(x_t) - c_t(x)\right) \le \frac{1}{2\eta}\left(\|x - x_1\|_2^2 - \|x - x_{T+1}\|_2^2\right) + \frac{\eta}{2} G^2 T \le \frac{1}{2\eta} D^2 + \frac{\eta}{2} G^2 T.$$
It is easy to see that the right-hand side is minimized at $\eta = \frac{D}{G\sqrt{T}}$; substituting this back in the above inequality gives the desired result. ∎

2.1  Logarithmic regret bound for OGD when cost functions are strongly convex (using time-varying parameter $\eta_t$)

The above result gives an $O(\sqrt{T})$ regret bound for the OGD algorithm in general. When the cost functions $c_t$ are known to be $\alpha$-strongly convex, one can in fact obtain an $O(\ln T)$ regret bound with a slight variant of the OGD algorithm that uses a suitable time-varying parameter $\eta_t$ [3]:

Theorem 2.2 (Hazan et al., 2007). Let $\Omega \subseteq \mathbb{R}^n$ be a closed, non-empty, convex set with $\|x - x_1\|_2 \le D \ \forall x \in \Omega$. Let $c_t : \Omega \to \mathbb{R}$ ($t = 1, 2, \ldots$) be any sequence of $\alpha$-strongly convex cost functions with bounded subgradients, $\|\nabla c_t(x)\|_2 \le G \ \forall t, \forall x \in \Omega$. Then the OGD algorithm with time-varying parameter $\eta_t = \frac{1}{\alpha t}$ satisfies
$$L_T\big[\mathrm{OGD}_{\eta_t = \frac{1}{\alpha t}}\big] - \inf_{x \in \Omega} L_T[x] \le \frac{G^2}{2\alpha}(\ln T + 1).$$

Proof. Let $x \in \Omega$. Proceeding similarly as before, we have $\forall t \in [T]$:
$$c_t(x_t) - c_t(x) \le \nabla c_t(x_t) \cdot (x_t - x) - \frac{\alpha}{2}\|x_t - x\|_2^2, \quad\text{by $\alpha$-strong convexity of } c_t$$
$$= \frac{1}{\eta_t}\left(x_t - \tilde x_{t+1}\right) \cdot (x_t - x) - \frac{\alpha}{2}\|x_t - x\|_2^2$$
$$= \frac{1}{2\eta_t}\left(\|x - x_t\|_2^2 - \|x - \tilde x_{t+1}\|_2^2 + \|x_t - \tilde x_{t+1}\|_2^2\right) - \frac{\alpha}{2}\|x_t - x\|_2^2$$
$$\le \frac{1}{2\eta_t}\left(\|x - x_t\|_2^2 - \|x - x_{t+1}\|_2^2 + \|x_t - \tilde x_{t+1}\|_2^2\right) - \frac{\alpha}{2}\|x_t - x\|_2^2$$
$$= \frac{1}{2\eta_t}\left(\|x - x_t\|_2^2 - \|x - x_{t+1}\|_2^2\right) + \frac{\eta_t}{2}\|\nabla c_t(x_t)\|_2^2 - \frac{\alpha}{2}\|x_t - x\|_2^2.$$
Summing over $t = 1, \ldots, T$ then gives
$$\sum_{t=1}^T\left(c_t(x_t) - c_t(x)\right) \le \frac{G^2}{2}\sum_{t=1}^T \eta_t + \frac{1}{2}\left(\frac{1}{\eta_1} - \alpha\right)\|x - x_1\|_2^2 + \frac{1}{2}\sum_{t=2}^T\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \alpha\right)\|x - x_t\|_2^2$$
$$\le \frac{G^2}{2\alpha}(\ln T + 1) + 0 + 0,$$
since with $\eta_t = \frac{1}{\alpha t}$ we have $\frac{1}{\eta_1} - \alpha = 0$ and $\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \alpha = 0$, and $\sum_{t=1}^T \eta_t = \frac{1}{\alpha}\sum_{t=1}^T \frac{1}{t} \le \frac{1}{\alpha}(\ln T + 1)$. ∎
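The strongly convex variant of Theorem 2.2 is a one-line change to OGD: the fixed $\eta$ becomes $\eta_t = \frac{1}{\alpha t}$. A small Python sketch (the helper names and the `project` callable are assumptions for illustration):

```python
import numpy as np

def ogd_time_varying(grads, alpha, x1, project):
    """OGD variant of Theorem 2.2: step size eta_t = 1/(alpha*t),
    intended for alpha-strongly convex cost functions.

    grads:   list of callables, grads[t](x) = (sub)gradient of c_t at x
    project: Euclidean projection onto Omega (identity if Omega = R^n)
    """
    x = np.asarray(x1, dtype=float).copy()
    iterates = []
    for t, g in enumerate(grads, start=1):
        iterates.append(x.copy())        # play x_t
        eta_t = 1.0 / (alpha * t)        # time-varying step size
        x = project(x - eta_t * g(x))    # gradient step + projection
    return iterates
```

As a sanity check, for the 1-strongly convex costs $c_t(x) = \frac{1}{2}\|x - z\|_2^2$ with a fixed $z$ and no projection, the update $x_{t+1} = x_t - \frac{1}{t}(x_t - z)$ makes each iterate a running average of the gradient targets, so the iterates settle at $z$.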
3  Online Mirror Descent Algorithm

The online mirror descent (OMD) algorithm, described for example in [6], generalizes the basic mirror descent algorithm used for standard offline optimization problems [4, 5, 1], much as OGD generalizes basic gradient descent. The OMD framework allows one to obtain $O(\sqrt{T})$ regret bounds for cost functions with subgradients bounded not only in the Euclidean norm (as assumed by the OGD analysis), but in any suitable norm.

Let $\tilde\Omega \subseteq \mathbb{R}^n$ be a convex set containing $\Omega$ (i.e. $\Omega \subseteq \tilde\Omega$), and let $F : \tilde\Omega \to \mathbb{R}$ be a strictly convex and differentiable function. Let $B_F : \tilde\Omega \times \tilde\Omega \to \mathbb{R}_+$ denote the Bregman divergence associated with $F$; recall that for all $u, v \in \tilde\Omega$, this is defined as
$$B_F(u, v) = F(u) - F(v) - (u - v) \cdot \nabla F(v).$$
Then the OMD algorithm using the set $\tilde\Omega$ and function $F$ can be described as follows:

Algorithm Online Mirror Descent (OMD)
    Parameters: $\eta > 0$; convex set $\tilde\Omega \supseteq \Omega$; strictly convex, differentiable function $F : \tilde\Omega \to \mathbb{R}$
    Initialize: $x_1 = \operatorname{argmin}_{x \in \Omega} F(x)$
    For $t = 1, \ldots, T$:
        Play $x_t \in \Omega$
        Receive cost function $c_t : \Omega \to \mathbb{R}$
        Incur loss $c_t(x_t)$
        Update: $\nabla F(\tilde x_{t+1}) = \nabla F(x_t) - \eta \nabla c_t(x_t)$   (assume this yields $\tilde x_{t+1} \in \tilde\Omega$)
                $x_{t+1} = P_\Omega^F(\tilde x_{t+1})$, where $P_\Omega^F(x) = \operatorname{argmin}_{x' \in \Omega} B_F(x', x)$   (Bregman projection of $x$ onto $\Omega$)

Below we will obtain a regret bound for OMD with an appropriate choice of the function $F$ for cost functions with subgradients bounded in an arbitrary norm. Before we do so, let us recall the notion of a dual norm. Specifically, let $\|\cdot\|$ be any norm on $\mathbb{R}^n$. Then the corresponding dual norm is defined as
$$\|u\|_* = \sup_{x : \|x\| \le 1} x \cdot u$$
(for example, the $\ell_2$ norm is its own dual).

Theorem 3.1. Let $\Omega \subseteq \mathbb{R}^n$ be a closed, non-empty, convex set. Let $c_t : \Omega \to \mathbb{R}$ ($t = 1, 2, \ldots$) be any sequence of convex cost functions with bounded subgradients, $\|\nabla c_t(x)\| \le G \ \forall t, \forall x \in \Omega$, where $\|\cdot\|$ is an arbitrary norm. Let $\tilde\Omega \supseteq \Omega$ be a convex set and $F : \tilde\Omega \to \mathbb{R}$ be a strictly convex, differentiable function such that (a) $F(x) - F(x_1) \le D^2 \ \forall x \in \Omega$; and (b) the restriction of $F$ to $\Omega$ is $\alpha$-strongly convex w.r.t. $\|\cdot\|_*$, the dual norm of $\|\cdot\|$. Then
$$L_T[\mathrm{OMD}_\eta] - \inf_{x \in \Omega} L_T[x] \le \frac{1}{\eta} D^2 + \frac{\eta G^2 T}{2\alpha}.$$
Moreover, setting $\eta = \frac{D}{G}\sqrt{\frac{2\alpha}{T}}$ to minimize the right-hand side of the above bound (assuming $D$, $G$, $\alpha$ and $T$, or upper bounds on them, are known in advance), we have
$$L_T[\mathrm{OMD}_\eta] - \inf_{x \in \Omega} L_T[x] \le DG\sqrt{\frac{2T}{\alpha}}.$$
Proof. Let $x \in \Omega$. Proceeding as before, we have $\forall t \in [T]$:
$$c_t(x_t) - c_t(x) \le \nabla c_t(x_t) \cdot (x_t - x), \quad\text{by convexity of } c_t$$
$$= \frac{1}{\eta}\left(\nabla F(x_t) - \nabla F(\tilde x_{t+1})\right) \cdot (x_t - x)$$
$$= \frac{1}{\eta}\left(B_F(x, x_t) - B_F(x, \tilde x_{t+1}) + B_F(x_t, \tilde x_{t+1})\right)$$
$$\le \frac{1}{\eta}\left(B_F(x, x_t) - B_F(x, x_{t+1}) - B_F(x_{t+1}, \tilde x_{t+1}) + B_F(x_t, \tilde x_{t+1})\right),$$
where the last step uses the generalized Pythagorean inequality for the Bregman projection, $B_F(x, \tilde x_{t+1}) \ge B_F(x, x_{t+1}) + B_F(x_{t+1}, \tilde x_{t+1})$. Summing over $t = 1, \ldots, T$, we thus get
$$\sum_{t=1}^T\left(c_t(x_t) - c_t(x)\right) \le \frac{1}{\eta}\left(B_F(x, x_1) - B_F(x, x_{T+1})\right) + \frac{1}{\eta}\sum_{t=1}^T\left(B_F(x_t, \tilde x_{t+1}) - B_F(x_{t+1}, \tilde x_{t+1})\right).$$
Now, since $x_1$ minimizes $F$ over $\Omega$, it can be shown that $(x - x_1) \cdot \nabla F(x_1) \ge 0$, and therefore
$$B_F(x, x_1) \le F(x) - F(x_1) \le D^2.$$
Also, we have
$$B_F(x_t, \tilde x_{t+1}) - B_F(x_{t+1}, \tilde x_{t+1}) = F(x_t) - F(x_{t+1}) - (x_t - x_{t+1}) \cdot \nabla F(\tilde x_{t+1})$$
$$\le \nabla F(x_t) \cdot (x_t - x_{t+1}) - \frac{\alpha}{2}\|x_t - x_{t+1}\|_*^2 - (x_t - x_{t+1}) \cdot \nabla F(\tilde x_{t+1}), \quad\text{by $\alpha$-strong convexity of $F$ on $\Omega$}$$
$$= \eta\, \nabla c_t(x_t) \cdot (x_t - x_{t+1}) - \frac{\alpha}{2}\|x_t - x_{t+1}\|_*^2$$
$$\le \eta G \|x_t - x_{t+1}\|_* - \frac{\alpha}{2}\|x_t - x_{t+1}\|_*^2, \quad\text{by H\"older's inequality}$$
$$\le \frac{\eta^2 G^2}{2\alpha}, \quad\text{since } \frac{\eta^2 G^2}{2\alpha} + \frac{\alpha}{2}\|x_t - x_{t+1}\|_*^2 - \eta G\|x_t - x_{t+1}\|_* = \frac{1}{2\alpha}\left(\eta G - \alpha\|x_t - x_{t+1}\|_*\right)^2 \ge 0.$$
Thus we get
$$\sum_{t=1}^T\left(c_t(x_t) - c_t(x)\right) \le \frac{1}{\eta} D^2 + \frac{\eta G^2 T}{2\alpha}.$$
It is easy to verify that the right-hand side is minimized at $\eta = \frac{D}{G}\sqrt{\frac{2\alpha}{T}}$; substituting this back in the above inequality gives the desired result. ∎

Examples. Examples of OMD algorithms include the following:

1. OGD. Here $\tilde\Omega = \mathbb{R}^n$ and $F(x) = \frac{1}{2}\|x\|_2^2$. Note that $F$ is 1-strongly convex (w.r.t. $\|\cdot\|_2$, which is its own dual) on $\mathbb{R}^n$, and therefore on any $\Omega \subseteq \mathbb{R}^n$.

2. Exponential weights. Here $\Omega = a\Delta_n$ for some $a > 0$; $\tilde\Omega = \mathbb{R}^n_{++}$; and $F(x) = \sum_{i=1}^n x_i \ln x_i - \sum_{i=1}^n x_i$. Note that $F$ is strictly convex on $\tilde\Omega$ and 1-strongly convex (w.r.t. $\|\cdot\|_1$) on $\Delta_n$.
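For the exponential-weights example above, the OMD update can be written out explicitly: since $\nabla F(x)_i = \ln x_i$, the mirror step $\nabla F(\tilde x_{t+1}) = \nabla F(x_t) - \eta \nabla c_t(x_t)$ multiplies each coordinate by $e^{-\eta \nabla_i}$, and the Bregman (KL) projection onto the simplex reduces to renormalization. A minimal Python sketch, assuming the case $a = 1$ with linear costs $c_t(x) = x \cdot l_t$ (the function name is illustrative):

```python
import numpy as np

def omd_exp_weights(losses, eta):
    """OMD with F(x) = sum_i x_i ln x_i - sum_i x_i on the simplex.

    losses: list of loss vectors l_t; the cost is c_t(x) = x . l_t,
            so grad c_t = l_t.  The mirror step becomes a multiplicative
            update and the KL projection onto the simplex is renormalization.
    Returns the list of probability vectors played.
    """
    n = len(losses[0])
    x = np.full(n, 1.0 / n)    # x_1 = argmin_{x in simplex} F(x) = uniform
    plays = []
    for l in losses:
        plays.append(x.copy())                  # play x_t
        x = x * np.exp(-eta * np.asarray(l))    # mirror step (multiplicative)
        x = x / x.sum()                         # Bregman (KL) projection
    return plays
```

This recovers the familiar multiplicative/exponential-weights update for the allocation problem of Example 4 in Section 1: weights of experts with larger cumulative loss decay exponentially.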
Exercise. Suppose the cost functions $c_t$ are all $\beta$-strongly convex for some $\beta > 0$ and have subgradients bounded in some arbitrary norm as above. Can you obtain a logarithmic ($O(\ln T)$) regret bound for the OMD algorithm in this case, as we did for the OGD algorithm? Why or why not?

More details on online convex optimization, including missing details in some of the above proofs, can be found for example in [2].

4  Next Lecture

In the next lecture, we will consider how algorithms for online supervised learning problems such as online regression and classification can be used in a batch setting, and will see how regret bounds for online learning algorithms can be converted to generalization error bounds for the resulting batch algorithms.

References

[1] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[2] Sébastien Bubeck. Introduction to online optimization. Lecture Notes, Princeton University, 2011.

[3] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.

[4] A. Nemirovski. Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody, 15, 1979.

[5] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.

[6] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.

[7] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.