Mind the Duality Gap: Logarithmic regret algorithms for online optimization


Sham M. Kakade, Toyota Technological Institute at Chicago
Shai Shalev-Shwartz, Toyota Technological Institute at Chicago

Abstract

We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions.

1 Introduction

In recent years, online regret minimizing algorithms have become widely used and empirically successful algorithms for many machine learning problems. Notable examples include efficient learning algorithms for structured prediction and ranking problems [Collins, 2002, Crammer et al., 2006]. Most of these empirically successful algorithms are based on algorithms which are tailored to general convex functions, whose regret is O(√T). Rather recently, there is a growing body of work providing online algorithms for strongly convex loss functions, with regret guarantees that are only O(log T). These algorithms have the potential to be highly applicable, since many machine learning optimization problems are in fact strongly convex, either with strongly convex loss functions (e.g. log loss, square loss) or, indirectly, via strongly convex regularizers (e.g. L2 or KL based regularization). Note that in this latter case, the loss function itself may only be just convex, but a strongly convex regularizer effectively makes this a strongly convex optimization problem (e.g. the SVM optimization problem uses the hinge loss with L2 regularization).
The aim of this paper is to provide a template for deriving a wider class of regret-minimizing algorithms for online strongly convex programming. Online convex optimization takes place in a sequence of consecutive rounds. At each round, the learner predicts a vector w_t ∈ S ⊆ R^n, and the environment responds with a convex loss function, l_t : S → R. The goal of the learner is to minimize the difference between his cumulative loss and the cumulative loss of the optimal fixed vector,

  Σ_{t=1}^T l_t(w_t) − min_{w∈S} Σ_{t=1}^T l_t(w).

This is termed regret since it measures how sorry the learner is, in retrospect, not to have predicted the optimal vector. Roughly speaking, the family of regret minimizing algorithms (for general convex functions) can be seen as varying on two axes, the style and the aggressiveness of the update. In addition to the relative simplicity of online algorithms, the empirical successes are also due to having these two knobs to tune for the problem at hand (which determine the nature of the regret bound). By style, we mean updates which favor either rotational invariance (such as gradient descent like update rules) or sparsity (like the multiplicative updates). Of course there is a much richer family here, including the L_p updates. By the aggressiveness of the update, we mean how much the algorithm moves its decision to be consistent with the most recent loss functions. For example, the perceptron algorithm makes no update

when there is no error. In contrast, there is a family of algorithms which more aggressively update when there is a margin mistake. These algorithms are shown to have improved performance (see for example the experimental study in Shalev-Shwartz and Singer [2007b]). While historically much of the analysis of these algorithms has been done on a case by case basis, in retrospect, the proof techniques have become somewhat boilerplate, which has led to a growing body of work to unify these analyses (see Cesa-Bianchi and Lugosi [2006] for a review). Perhaps the most unified view of these algorithms is the primal-dual framework of Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007], for which the gamut of these algorithms can be largely viewed as special cases. Two aspects are central in providing this unification. First, the framework works with a complexity function, which determines the style of algorithm and the nature of the regret guarantee (if this function is the L2 norm, then one obtains gradient like updates, and if this function is the KL-distance, then one obtains multiplicative updates). Second, the algorithm maintains both primal and dual variables. Here, the primal objective function is Σ_{t=1}^T l_t(w) (where l_t is the loss function provided at round t), and one can construct a dual objective function D_t(·), which only depends on the loss functions l_1, l_2, ..., l_{t−1}. The algorithm works by incrementally increasing the dual objective value (in an online manner), which can be done since each D_t is only a function of the previous loss functions. By weak duality, this can be seen as decreasing the duality gap. The level of aggressiveness is seen to be how fast the algorithm is attempting to increase the dual objective value. This paper focuses on extending the duality framework for online convex programming to the case of strongly convex functions. This analysis provides a more unified and intuitive view of the extant algorithms for online strongly convex programming.
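Before proceeding, the online protocol from the introduction can be made concrete in a few lines. This is an illustrative sketch only (the 1-D square losses and the trivial learner below are hypothetical stand-ins, not anything from the paper):

```python
# Sketch of the online convex optimization protocol: at each round the
# learner predicts w_t, then sees a loss l_t and pays l_t(w_t).  Regret
# compares the learner's cumulative loss to the best fixed w in hindsight.
# Hypothetical 1-D example with square losses l_t(w) = (w - mu_t)^2.

def regret(predictions, mus):
    cum_loss = sum((w - mu) ** 2 for w, mu in zip(predictions, mus))
    # the best fixed w for square losses is the mean of the mu_t
    best_w = sum(mus) / len(mus)
    best_loss = sum((best_w - mu) ** 2 for mu in mus)
    return cum_loss - best_loss

# A trivial learner that always predicts 0 still has finite regret here.
mus = [1.0, 2.0, 3.0, 4.0]
print(regret([0.0] * len(mus), mus))  # prints 25.0
```

The rest of the paper is about learners whose regret grows only logarithmically in T when the l_t are strongly convex.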
An important observation we make is that any σ-strongly convex loss function can be rewritten as l_i(w) = f(w) + g_i(w), where f is a fixed σ-strongly convex function (i.e. f does not depend on i), and g_i is a convex function. Therefore, after t online rounds, the amount of intrinsic strong convexity we have in the primal objective Σ_{i=1}^t l_i(w) is at least σt. In particular, this explains the learning rate of 1/(σt) proposed in the gradient descent algorithm of Hazan et al. [2006]. Indeed, we show that our framework includes the gradient descent algorithm of Hazan et al. [2006] as an important special case, in which the aggressiveness level is minimal. At the most aggressive end, our framework yields the Follow-The-Leader algorithm. Furthermore, the template algorithm serves as a vehicle for deriving new algorithms (which enjoy logarithmic regret guarantees).

The remainder of the paper is outlined as follows. We first provide background on convex duality. As a warmup, in Section 3, we present an intuitive primal-dual analysis of Follow-The-Leader (FTL), when f is the Euclidean norm. This naturally leads to a more general primal-dual algorithm (for which FTL is a special case), which we present in Section 4. Next, we further generalize our algorithmic framework to include strongly convex complexity functions f with respect to arbitrary norms. We note that the introduction of a complexity function was already provided in Shalev-Shwartz and Singer [2007a], but the analysis is rather specialized and does not have a knob which can tune the aggressiveness of the algorithm. Finally, in Sec. 6 we conclude with a side-by-side comparison of our algorithmic framework for strongly convex functions and the framework for (non-strongly) convex functions given in Shalev-Shwartz [2007].

2 Mathematical Background

We denote scalars with lower case letters (e.g. w and λ), and vectors with bold face letters (e.g. w and λ). The inner product between vectors x and w is denoted by ⟨x, w⟩.
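The decomposition l_i = f + g_i above can be checked concretely on the square loss (an arbitrary numeric illustration, not the paper's): l(w) = (w − μ)² is 2-strongly convex, f(w) = w² carries all of the strong convexity and does not depend on μ, and the remainder g(w) = −2μw + μ² is affine, hence convex.

```python
# Decompose l(w) = (w - mu)^2 as f(w) + g(w) with f(w) = w^2 (2-strongly
# convex, independent of mu) and g(w) = -2*mu*w + mu^2 (affine, convex).
mu = 3.0
f = lambda w: w ** 2
g = lambda w: -2.0 * mu * w + mu ** 2
l = lambda w: (w - mu) ** 2

# The decomposition is exact at every point we try.
for w in [-2.0, 0.0, 0.5, 4.0]:
    assert abs(l(w) - (f(w) + g(w))) < 1e-12

# Strong convexity of f with sigma = 2: the defining inequality holds
# with equality for f(w) = w^2.
sigma, a, u, v = 2.0, 0.3, -1.0, 2.0
lhs = f(a * u + (1 - a) * v)
rhs = a * f(u) + (1 - a) * f(v) - (sigma / 2) * a * (1 - a) * (u - v) ** 2
assert abs(lhs - rhs) < 1e-12
```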
To simplify our notation, given a sequence of vectors λ_1, ..., λ_t or a sequence of scalars σ_1, ..., σ_t, we use the shorthand λ_{1:t} = Σ_{i=1}^t λ_i and σ_{1:t} = Σ_{i=1}^t σ_i. Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R_+. For any k ≥ 1, the set of integers {1, ..., k} is denoted by [k]. A norm of a vector x is denoted by ‖x‖. The dual norm is defined as ‖λ‖_* = sup{⟨x, λ⟩ : ‖x‖ ≤ 1}. For example, the Euclidean norm, ‖x‖_2 = ⟨x, x⟩^{1/2}, is dual to itself, and the L1 norm, ‖x‖_1 = Σ_i |x_i|, is dual to the L∞ norm, ‖x‖_∞ = max_i |x_i|.

FOR t = 1, 2, ..., T:
  Define w_t = −(1/σ_{1:(t−1)}) λ^t_{1:(t−1)}
  Receive a function l_t(w) = (σ_t/2)‖w‖² + g_t(w) and suffer loss l_t(w_t)
  Update the dual variables s.t. the following holds:
    (λ^{t+1}_1, ..., λ^{t+1}_t) ∈ argmax_{λ_1,...,λ_t} D_{t+1}(λ_1, ..., λ_t)

Figure 1: A primal-dual view of Follow-The-Leader. Here the algorithm's decision w_t is the best decision with respect to the previous losses. This presentation exposes the implicit role of the dual variables. Slightly abusing notation, λ^1_{1:0} = 0, so that w_1 = 0. See text.

We next recall a few definitions from convex analysis. A function f is σ-strongly convex if
  f(αu + (1−α)v) ≤ α f(u) + (1−α) f(v) − (σ/2) α(1−α) ‖u − v‖²_2 .
In Sec. 5 we generalize the above definition to arbitrary norms. If a function f is σ-strongly convex, then the function g(w) = f(w) − (σ/2)‖w‖² is convex. The Fenchel conjugate of a function f : S → R is defined as
  f*(θ) = sup_{w∈S} ⟨w, θ⟩ − f(w).
If f is closed and convex, then the Fenchel conjugate of f* is f itself (a function is closed if for all α > 0 the level set {w : f(w) ≤ α} is a closed set). It is straightforward to verify that the function f(w) = ½‖w‖²_2 is conjugate to itself. The definition of f* also implies that for c > 0 we have (c f)*(θ) = c f*(θ/c). A vector λ is a sub-gradient of a function f at w if for all w′ ∈ S we have that f(w′) − f(w) ≥ ⟨w′ − w, λ⟩. The differential set of f at w, denoted ∂f(w), is the set of all sub-gradients of f at w. If f is differentiable at w, then ∂f(w) consists of a single vector which amounts to the gradient of f at w and is denoted by ∇f(w). The Fenchel-Young inequality states that for any w and θ we have that f(w) + f*(θ) ≥ ⟨w, θ⟩. Sub-gradients play an important role in the definition of the Fenchel conjugate. In particular, the following lemma, whose proof can be found in Borwein and Lewis [2006], states that if λ ∈ ∂f(w) then the Fenchel-Young inequality holds with equality.

Lemma 1 Let f be a closed and convex function and let ∂f(w) be its differential set at w. Then, for all λ ∈ ∂f(w), we have f(w) + f*(λ) = ⟨λ, w⟩.
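Lemma 1 and the Fenchel-Young inequality are easy to verify numerically for the self-conjugate function f(w) = ½w², whose only subgradient at w is w itself (an illustrative check with arbitrary test points, not part of the paper):

```python
# Fenchel-Young for f(w) = 0.5 * w^2, which is its own conjugate:
# f(w) + f*(theta) >= <w, theta>, with equality iff theta is a
# (sub)gradient of f at w, i.e. theta = w (Lemma 1).
f = lambda w: 0.5 * w ** 2
f_star = lambda theta: 0.5 * theta ** 2   # conjugate of f is f itself

w = 1.7
for theta in [-1.0, 0.0, 1.7, 3.0]:
    gap = f(w) + f_star(theta) - w * theta
    assert gap >= -1e-12                  # Fenchel-Young inequality
    if theta == w:                        # equality at theta = grad f(w)
        assert abs(gap) < 1e-12
```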
We make use of the following variant of Fenchel duality (see the appendix for more details):
  max_{λ_1,...,λ_T} −f*(−λ_{1:T}) − Σ_{t=1}^T g*_t(λ_t) ≤ min_{w∈S} f(w) + Σ_{t=1}^T g_t(w). (1)

3 Warmup: A Primal-Dual View of Follow-The-Leader

In this section, we provide a dual analysis for the FTL algorithm. The dual view of FTL will help us to derive a family of logarithmic regret algorithms for online convex optimization with strongly convex functions. Recall that the FTL algorithm is defined as follows:
  w_t = argmin_w Σ_{i=1}^{t−1} l_i(w). (2)
For each i ∈ [t−1] define g_i(w) = l_i(w) − (σ_i/2)‖w‖², where σ_i is the largest scalar such that g_i is still a convex function. The assumption that l_i is σ-strongly convex guarantees that σ_i ≥ σ. We can

therefore rewrite the objective function on the right-hand side of Eq. (2) as
  P_t(w) = (σ_{1:(t−1)}/2)‖w‖² + Σ_{i=1}^{t−1} g_i(w), (3)
(recall that σ_{1:(t−1)} = Σ_{i=1}^{t−1} σ_i). The Fenchel dual optimization problem (see Sec. 2) is to maximize the following dual objective function:
  D_t(λ_1, ..., λ_{t−1}) = −(1/(2σ_{1:(t−1)})) ‖λ_{1:(t−1)}‖² − Σ_{i=1}^{t−1} g*_i(λ_i). (4)
Let (λ^t_1, ..., λ^t_{t−1}) be the maximizer of D_t. The relation between the optimal dual variables and the optimal primal vector is given by (see again Sec. 2)
  w_t = −(1/σ_{1:(t−1)}) λ^t_{1:(t−1)}. (5)
Throughout this section we assume that strong duality holds (i.e. Eq. (1) holds with equality). See the appendix for sufficient conditions. In particular, under this assumption, we have that the above setting for w_t is in fact a minimizer of the primal objective, since (λ^t_1, ..., λ^t_{t−1}) maximizes the dual objective (see the appendix). The primal-dual view of Follow-The-Leader is presented in Figure 1. Denote
  Δ_t = D_{t+1}(λ^{t+1}_1, ..., λ^{t+1}_t) − D_t(λ^t_1, ..., λ^t_{t−1}). (6)
To analyze the FTL algorithm, we first note that (by strong duality)
  Σ_{t=1}^T Δ_t = D_{T+1}(λ^{T+1}_1, ..., λ^{T+1}_T) = min_w P_{T+1}(w) = min_w Σ_{t=1}^T l_t(w). (7)
Second, the fact that (λ^{t+1}_1, ..., λ^{t+1}_t) is the maximizer of D_{t+1} implies that for any λ we have
  Δ_t ≥ D_{t+1}(λ^t_1, ..., λ^t_{t−1}, λ) − D_t(λ^t_1, ..., λ^t_{t−1}). (8)
The following central lemma shows that there exists λ such that the right-hand side of the above is sufficiently large.

Lemma 2 Let (λ_1, ..., λ_{t−1}) be an arbitrary sequence of vectors. Denote w = −(1/σ_{1:(t−1)}) λ_{1:(t−1)}, let v ∈ ∂l_t(w), and let λ = v − σ_t w. Then, λ ∈ ∂g_t(w) and
  D_{t+1}(λ_1, ..., λ_{t−1}, λ) − D_t(λ_1, ..., λ_{t−1}) = l_t(w) − (1/(2σ_{1:t})) ‖v‖².

Proof We prove the lemma for the case t > 1. The case t = 1 can be proved similarly. Since l_t(w) = (σ_t/2)‖w‖² + g_t(w) and v ∈ ∂l_t(w), we have that λ ∈ ∂g_t(w). Denote Δ_t = D_{t+1}(λ_1, ..., λ_{t−1}, λ) − D_t(λ_1, ..., λ_{t−1}). Simple algebraic manipulations yield
  Δ_t = −(1/(2σ_{1:t})) ‖λ_{1:(t−1)} + λ‖² − g*_t(λ) + (1/(2σ_{1:(t−1)})) ‖λ_{1:(t−1)}‖²
      = (‖λ_{1:(t−1)}‖²/2) (1/σ_{1:(t−1)} − 1/σ_{1:t}) − (1/σ_{1:t}) ⟨λ_{1:(t−1)}, λ⟩ − (1/(2σ_{1:t})) ‖λ‖² − g*_t(λ)
      = (σ_{1:(t−1)}/σ_{1:t}) ((σ_t/2)‖w‖² + ⟨w, λ⟩) − (1/(2σ_{1:t})) ‖λ‖² − g*_t(λ)
      = [(σ_t/2)‖w‖² + ⟨w, λ⟩ − g*_t(λ)] − (1/(2σ_{1:t})) ‖σ_t w + λ‖²,
where the third equality uses w = −(1/σ_{1:(t−1)}) λ_{1:(t−1)} and the fourth follows by expanding ‖σ_t w + λ‖². Denote A = (σ_t/2)‖w‖² + ⟨w, λ⟩ − g*_t(λ) and B = (1/(2σ_{1:t})) ‖σ_t w + λ‖², so that Δ_t = A − B. Since λ ∈ ∂g_t(w), Lemma 1 thus implies that ⟨w, λ⟩ − g*_t(λ) = g_t(w).
Therefore, A = l_t(w). Next, we note that B = (1/(2σ_{1:t})) ‖σ_t w + λ‖². We have thus shown that Δ_t = l_t(w) − (1/(2σ_{1:t})) ‖σ_t w + λ‖². Plugging the definition of λ into the above we conclude our proof.

Combining Lemma 2 with Eq. (7) and Eq. (8) we obtain the following:

FOR t = 1, 2, ..., T:
  Define w_t = −(1/σ_{1:(t−1)}) λ^t_{1:(t−1)}
  Receive a function l_t(w) = (σ_t/2)‖w‖² + g_t(w) and suffer loss l_t(w_t)
  Update (λ^{t+1}_1, ..., λ^{t+1}_t) s.t. the following holds for some λ′_t ∈ ∂g_t(w_t):
    D_{t+1}(λ^{t+1}_1, ..., λ^{t+1}_t) ≥ D_{t+1}(λ^t_1, ..., λ^t_{t−1}, λ′_t)

Figure 2: A primal-dual algorithmic framework for online convex optimization. Again, w_1 = 0.

Corollary 1 Let l_1, ..., l_T be a sequence of functions such that for all t ∈ [T], l_t is σ_t-strongly convex. Assume that the FTL algorithm runs on this sequence and for each t ∈ [T], let v_t be in ∂l_t(w_t). Then,
  Σ_{t=1}^T l_t(w_t) − min_w Σ_{t=1}^T l_t(w) ≤ Σ_{t=1}^T (1/(2σ_{1:t})) ‖v_t‖². (9)
Furthermore, let L = max_t ‖v_t‖ and assume that for all t ∈ [T], σ_t ≥ σ. Then, the regret is bounded by (L²/(2σ)) (log(T) + 1).

If we are dealing with the square loss, l_t(w) = ‖w − μ_t‖²_2 (where nature chooses μ_t), then it is straightforward to see that Eq. (8) holds with equality, and this leads to the previous regret bound holding with equality. This equality is the underlying reason that the FTL strategy is a minimax strategy (see Abernethy et al. [2008] for a proof of this claim).

4 A Primal-Dual Algorithm for Online Strongly Convex Optimization

In the previous section, we provided a dual analysis for FTL. Here, we extend the analysis and derive a more general algorithmic framework for online optimization. We start by examining the analysis of the FTL algorithm. We first make the important observation that Lemma 2 is not specific to the FTL algorithm and in fact holds for any configuration of dual variables. Consider an arbitrary sequence of dual variables, (λ^2_1), (λ^3_1, λ^3_2), ..., (λ^{T+1}_1, ..., λ^{T+1}_T), and denote Δ_t as in Eq. (6). Using weak duality, we can replace the equality in Eq. (7) with the following inequality that holds for any sequence of dual variables:
  Σ_{t=1}^T Δ_t = D_{T+1}(λ^{T+1}_1, ..., λ^{T+1}_T) ≤ min_w P_{T+1}(w) = min_w Σ_{t=1}^T l_t(w). (10)
A summary of the algorithmic framework is given in Fig. 2. The following theorem, a direct corollary of the previous equation and Lemma 2, shows that all instances of the framework achieve logarithmic regret.
Theorem 1 Let l_1, ..., l_T be a sequence of functions such that for all t ∈ [T], l_t is σ_t-strongly convex. Then, any algorithm that can be derived from Fig. 2 satisfies
  Σ_{t=1}^T l_t(w_t) − min_w Σ_{t=1}^T l_t(w) ≤ Σ_{t=1}^T (1/(2σ_{1:t})) ‖v_t‖², (11)
where v_t ∈ ∂l_t(w_t).

Proof Let Δ_t be as defined in Eq. (6). The last condition in the algorithm implies that
  Δ_t ≥ D_{t+1}(λ^t_1, ..., λ^t_{t−1}, v_t − σ_t w_t) − D_t(λ^t_1, ..., λ^t_{t−1}).
The proof follows directly by combining the above with Eq. (10) and Lemma 2.

We conclude this section by deriving several algorithms from the framework.
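The logarithmic bound of Corollary 1 and Theorem 1 can be exercised on a toy instance (a sketch under assumptions: random 1-D square losses and horizon T = 200, neither taken from the paper). For l_t(w) = (w − μ_t)², the leader after t − 1 rounds is the mean of μ_1, ..., μ_{t−1}, so FTL is a running mean, and its regret stays below (L²/(2σ))(log T + 1):

```python
import math
import random

random.seed(0)
T = 200
mus = [random.uniform(-1.0, 1.0) for _ in range(T)]

# FTL on l_t(w) = (w - mu_t)^2: the leader after t-1 rounds is the mean
# of mu_1..mu_{t-1}, with w_1 = 0 by convention.
preds, s = [], 0.0
for t in range(T):
    preds.append(s / t if t > 0 else 0.0)
    s += mus[t]

cum = sum((w - mu) ** 2 for w, mu in zip(preds, mus))
best_w = s / T
best = sum((best_w - mu) ** 2 for mu in mus)
regret = cum - best

# sigma = 2 and |l_t'(w_t)| = 2|w_t - mu_t| <= L with L = 4 here,
# since both w_t and mu_t lie in [-1, 1].
sigma, L = 2.0, 4.0
bound = (L ** 2) / (2 * sigma) * (math.log(T) + 1)
assert regret <= bound
print(regret, bound)
```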

Example 1 (Follow-The-Leader) As we have shown in Sec. 3, the FTL algorithm (Fig. 1) is equivalent to optimizing the dual variables at each online round. This update clearly satisfies the condition in Fig. 2 and is therefore a special case.

Example 2 (Gradient-Descent) Following Hazan et al. [2006], Bartlett et al. [2007] suggested the following update rule for differentiable strongly convex functions:
  w_{t+1} = w_t − (1/σ_{1:t}) ∇l_t(w_t). (12)
Using a simple inductive argument, it is possible to show that the above update rule is equivalent to the following update rule of the dual variables:
  (λ^{t+1}_1, ..., λ^{t+1}_t) = (λ^t_1, ..., λ^t_{t−1}, ∇l_t(w_t) − σ_t w_t). (13)
Clearly, this update rule satisfies the condition in Fig. 2. Therefore our framework encompasses this algorithm as a special case.

Example 3 (Online Coordinate-Dual-Ascent) The FTL and the Gradient-Descent updates are two extreme cases of our algorithmic framework. The former makes the largest possible increase of the dual while the latter makes the smallest possible increase that still satisfies the sufficient dual increase requirement. Intuitively, the FTL method should have smaller regret as it consumes more of its potential earlier in the optimization process. However, its computational complexity is large as it requires a full blown optimization procedure at each online round. A possible compromise is to fully optimize the dual objective but only with respect to a small number of dual variables. In the extreme case, we optimize only with respect to the last dual variable. Formally, we let
  λ^{t+1}_i = λ^t_i for i < t,  and  λ^{t+1}_t = argmax_{λ_t} D_{t+1}(λ^t_1, ..., λ^t_{t−1}, λ_t).
Clearly, the above update satisfies the requirement in Fig. 2 and therefore enjoys a logarithmic regret bound. The computational complexity of performing this update is often small as we optimize over a single dual vector.
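As a concrete illustration of Example 2 (a hypothetical sketch, not the authors' code), take square losses l_t(w) = (w − μ_t)², for which σ_t = 2 and σ_{1:t} = 2t. The update of Eq. (12) becomes w_{t+1} = w_t − (w_t − μ_t)/t, an incremental running mean, so on this particular loss family the gradient-descent and FTL extremes coincide:

```python
# Gradient descent with learning rate 1/sigma_{1:t} on l_t(w) = (w - mu_t)^2.
# grad l_t(w) = 2*(w - mu_t), sigma_t = 2, so sigma_{1:t} = 2t and the
# update is w <- w - (w - mu_t)/t: exactly the running mean of the mu_t.
mus = [4.0, -2.0, 6.0, 0.0, 7.0]   # arbitrary illustrative targets
w = 0.0                            # w_1 = 0
for t, mu in enumerate(mus, start=1):
    grad = 2.0 * (w - mu)
    w = w - grad / (2.0 * t)

assert abs(w - sum(mus) / len(mus)) < 1e-9
print(w)   # approximately the mean of the mu_t
```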
Occasionally, it is possible to obtain a closed-form solution of the optimization problem, and in these cases the computational complexity of the coordinate-dual-ascent update is identical to that of the gradient-descent method.

5 Generalized strongly convex functions

In this section, we extend our algorithmic framework to deal with generalized strongly convex functions. We first need the following generalized definition of strong convexity.

Definition 1 A continuous function f is σ-strongly convex over a convex set S with respect to a norm ‖·‖ if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have
  f(α v + (1−α) u) ≤ α f(v) + (1−α) f(u) − (σ/2) α(1−α) ‖v − u‖². (14)

It is straightforward to show that the function f(w) = ½‖w‖²_2 is strongly convex with respect to the Euclidean norm. Less trivial examples are given below.

Example 4 The function f(w) = Σ_{i=1}^n w_i log(w_i/(1/n)) is strongly convex over the probability simplex, S = {w ∈ R^n_+ : ‖w‖_1 = 1}, with respect to the L1 norm. Its conjugate function is f*(θ) = log((1/n) Σ_{i=1}^n exp(θ_i)).

Example 5 For q ∈ (1, 2), the function f(w) = (1/(2(q−1))) ‖w‖²_q is strongly convex over S = R^n with respect to the L_q norm. Its conjugate function is f*(θ) = (1/(2(p−1))) ‖θ‖²_p, where p = (1 − 1/q)^{−1}.

For proofs, see for example Shalev-Shwartz [2007]. In the appendix, we list several important properties of strongly convex functions. In particular, the Fenchel conjugate of a strongly convex function is differentiable.
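The conjugate pair in Example 4 can be verified numerically (an illustrative check with arbitrary θ, not from the paper): the supremum defining f*(θ) over the simplex is attained at the softmax point w_i ∝ exp(θ_i), where it equals log((1/n) Σ_i exp(θ_i)), and no other point on the simplex does better.

```python
import math
import random

random.seed(1)
n = 4
theta = [random.uniform(-2.0, 2.0) for _ in range(n)]

def f(w):
    # relative entropy to the uniform distribution: sum_i w_i log(w_i/(1/n))
    return sum(wi * math.log(wi * n) for wi in w if wi > 0)

# Claimed conjugate value: log((1/n) * sum_i exp(theta_i)).
f_star = math.log(sum(math.exp(t) for t in theta) / n)

# The maximizer of <w, theta> - f(w) over the simplex is the softmax of theta.
z = sum(math.exp(t) for t in theta)
w_star = [math.exp(t) / z for t in theta]
val_at_softmax = sum(wi * ti for wi, ti in zip(w_star, theta)) - f(w_star)
assert abs(val_at_softmax - f_star) < 1e-9

# No random point on the simplex exceeds the claimed supremum.
for _ in range(1000):
    raw = [random.random() for _ in range(n)]
    w = [r / sum(raw) for r in raw]
    assert sum(wi * ti for wi, ti in zip(w, theta)) - f(w) <= f_star + 1e-9
```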

INPUT: A strongly convex function f
FOR t = 1, 2, ..., T:
  1. Define w_t = ∇f*(−(1/√t) λ^t_{1:(t−1)})
  2. Receive a function l_t
  3. Suffer loss l_t(w_t)
  4. Update (λ^{t+1}_1, ..., λ^{t+1}_t) s.t. there exists λ′_t ∈ ∂l_t(w_t) with
     D_{t+1}(λ^{t+1}_1, ..., λ^{t+1}_t) ≥ D_{t+1}(λ^t_1, ..., λ^t_{t−1}, λ′_t)

INPUT: A σ-strongly convex function f
FOR t = 1, 2, ..., T:
  1. Define w_t = ∇f*(−(1/σ_{1:t}) λ^t_{1:(t−1)})
  2. Receive a function l_t = σ_t f + g_t
  3. Suffer loss l_t(w_t)
  4. Update (λ^{t+1}_1, ..., λ^{t+1}_t) s.t. there exists λ′_t ∈ ∂g_t(w_t) with
     D_{t+1}(λ^{t+1}_1, ..., λ^{t+1}_t) ≥ D_{t+1}(λ^t_1, ..., λ^t_{t−1}, λ′_t)

Figure 3: Primal-dual template algorithms for general online convex optimization (left) and online strongly convex optimization (right). Here a_{1:t} = Σ_{i=1}^t a_i, and for notational convenience, we implicitly assume that a_{1:0} = 0. See text for description.

Consider the case where for all t, l_t can be written as σ_t f + g_t, where f is 1-strongly convex with respect to some norm ‖·‖ and g_t is a convex function. We also make the simplifying assumption that σ_t is known to the forecaster before he defines w_t. For each round t, we now define the primal objective to be
  P_t(w) = σ_{1:(t−1)} f(w) + Σ_{i=1}^{t−1} g_i(w). (15)
The dual objective is (see again Sec. 2)
  D_t(λ_1, ..., λ_{t−1}) = −σ_{1:(t−1)} f*(−(1/σ_{1:(t−1)}) λ_{1:(t−1)}) − Σ_{i=1}^{t−1} g*_i(λ_i). (16)
An algorithmic framework for online optimization in the presence of general strongly convex functions is given on the right-hand side of Fig. 3. The following theorem provides a logarithmic regret bound for the algorithmic framework given on the right-hand side of Fig. 3.

Theorem 2 Let l_1, ..., l_T be a sequence of functions such that for all t ∈ [T], l_t = σ_t f + g_t, where f is strongly convex w.r.t. a norm ‖·‖ and g_t is a convex function. Then, any algorithm that can be derived from Fig. 3 (right) satisfies
  Σ_{t=1}^T l_t(w_t) − min_w Σ_{t=1}^T l_t(w) ≤ Σ_{t=1}^T (1/(2σ_{1:t})) ‖v_t‖²_*, (17)
where v_t ∈ ∂g_t(w_t) and ‖·‖_* is the norm dual to ‖·‖. The proof of the theorem is given in Sec. B.

6 Summary

In this paper, we extended the primal-dual algorithmic framework for general convex functions from Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007] to strongly convex functions. The template algorithms are outlined in Fig.
3. The left algorithm is the primal-dual algorithm for general convex functions from Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007]. Here, f is the complexity function, (λ^t_1, ..., λ^t_t) are the dual variables at time t, and D_t(·) is the dual objective

function at time t (which is a lower bound on the primal value). The function ∇f* is the gradient of the conjugate function of f, which can be viewed as a projection of the dual variables back into the primal space. At the least aggressive extreme, in order to obtain √T regret, it is sufficient to set λ^{t+1}_i (for all i ≤ t) to be a subgradient of the loss l_i at w_i. We then recover an online mirror descent algorithm [Beck and Teboulle, 2003, Grove et al., 2001, Kivinen and Warmuth, 1997], which specializes to gradient descent when f is the squared 2-norm or to the exponentiated gradient descent algorithm when f is the relative entropy. At the most aggressive extreme, where D_t is maximized at each round, we have Follow the Regularized Leader, which is w_t = argmin_w Σ_{i=1}^{t−1} l_i(w) + √t f(w). Intermediate algorithms can also be devised, such as the passive-aggressive algorithms [Crammer et al., 2006, Shalev-Shwartz, 2007].

The right algorithm in Figure 3 is our new contribution for strongly convex functions. Any σ-strongly convex loss function can be decomposed into l_t = σf + g_t, where g_t is convex. The algorithm for strongly convex functions is different in two ways. First, the effective learning rate is now 1/σ_{1:t} rather than 1/√t (see Step 1 in both algorithms). Second, more subtly, the condition on the dual variables (in Step 4) is only determined by the subgradient of g_t, rather than the subgradient of l_t. At the most aggressive end of the spectrum, where D_t is maximized at each round, we have the Follow the Leader (FTL) algorithm: w_t = argmin_w Σ_{i=1}^{t−1} l_i(w). At the least aggressive end, we have the gradient descent algorithm of Hazan et al. [2006] (which uses learning rate 1/σ_{1:t}). Furthermore, we provide algorithms which lie in between these two extremes; it is these algorithms which have the potential for most practical impact. Empirical observations suggest that algorithms which most aggressively close the duality gap tend to perform most favorably [Crammer et al., 2006, Shalev-Shwartz and Singer, 2007b].
However, at the FTL extreme, this is often computationally prohibitive to implement (as one must solve a full blown optimization problem at each round). Our template algorithm suggests a natural compromise, which is to optimize the dual objective but only with respect to a small number of dual variables (say, the most current dual variable); we coin this algorithm online coordinate-dual-ascent. In fact, it is sometimes possible to obtain a closed-form solution of this optimization problem, so that the computational complexity of the coordinate-dual-ascent update is identical to that of a vanilla gradient-descent method. This variant update still enjoys a logarithmic regret bound.

References

J. Abernethy, P. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Annual Conference on Computational Learning Theory, 2008.
P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems 21, 2007.
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31, 2003.
J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Conference on Empirical Methods in Natural Language Processing, 2002.
K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.
A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001.
E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization.
In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-64, 1997.
S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.
S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 20, 2006.
S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strictly convex repeated games. Technical report, The Hebrew University, 2007a.
S. Shalev-Shwartz and Y. Singer. A unified algorithmic approach for efficient online label ranking. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007b.

A Fenchel duality

Consider the optimization problem
  inf_{w∈S} f(w) + Σ_{t=1}^T g_t(w).
An equivalent problem is
  inf_{w_0, w_1, ..., w_T} f(w_0) + Σ_{t=1}^T g_t(w_t)  s.t. w_0 ∈ S and ∀t ∈ [T], w_t = w_0. (18)
Introducing T vectors λ_1, ..., λ_T, where each λ_t ∈ R^n is a vector of Lagrange multipliers for the equality constraint w_t = w_0, we obtain the following Lagrangian:
  L(w_0, w_1, ..., w_T, λ_1, ..., λ_T) = f(w_0) + Σ_{t=1}^T g_t(w_t) + Σ_{t=1}^T ⟨λ_t, w_0 − w_t⟩.
The dual problem is the task of maximizing the following dual objective value:
  D(λ_1, ..., λ_T) = inf_{w_0∈S, w_1, ..., w_T} L(w_0, w_1, ..., w_T, λ_1, ..., λ_T)
    = −sup_{w_0∈S} (⟨w_0, −Σ_t λ_t⟩ − f(w_0)) − Σ_t sup_{w_t} (⟨w_t, λ_t⟩ − g_t(w_t))
    = −f*(−Σ_{t=1}^T λ_t) − Σ_{t=1}^T g*_t(λ_t),
where f*, g*_1, ..., g*_T are the Fenchel conjugate functions of f, g_1, ..., g_T. Therefore, the generalized Fenchel dual problem is
  sup_{λ_1,...,λ_T} −f*(−λ_{1:T}) − Σ_{t=1}^T g*_t(λ_t). (19)
Note that when T = 1 the above duality is the so-called Fenchel duality [Borwein and Lewis, 2006]. The weak duality theorem tells us that the primal objective upper bounds the dual objective:
  sup_{λ_1,...,λ_T} −f*(−λ_{1:T}) − Σ_t g*_t(λ_t) ≤ inf_{w∈S} f(w) + Σ_{t=1}^T g_t(w).
A sufficient condition for equality to hold (i.e. strong duality) is that f is a strongly convex function, g_1, ..., g_T are convex functions, and the intersection of the domains of g_1, ..., g_T is polyhedral. Assume that strong duality holds, let (λ*_1, ..., λ*_T) be a maximizer of the dual objective function, and let (w*_0, ..., w*_T) be a minimizer of the problem given in Eq. (18). Then, optimality conditions imply that
  (w*_0, ..., w*_T) = argmin_{w_0, ..., w_T} L(w_0, ..., w_T, λ*_1, ..., λ*_T).
See for example Boyd and Vandenberghe [2004]. In particular, the above implies that
  w*_0 = argmax_{w_0∈S} ⟨w_0, −Σ_t λ*_t⟩ − f(w_0) = ∇f*(−λ*_{1:T}),
where in the last equality we used Lemma 1. Since w*_0 is also a minimizer of our original problem, we obtain the primal-dual link function w* = ∇f*(−λ*_{1:T}).
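The duality above can be illustrated numerically in one dimension (a hedged sketch: f(w) = ½w², linear g_t(w) = v_t·w, and the slopes below are arbitrary choices). Since g*_t(λ) is 0 at λ = v_t and +∞ elsewhere, the dual collapses to −f*(−Σ_t v_t), which matches the primal infimum, and the primal-dual link returns the primal minimizer:

```python
# f(w) = 0.5*w^2, so f*(theta) = 0.5*theta^2 and grad f*(theta) = theta.
# Linear g_t(w) = v_t * w has conjugate 0 at lambda = v_t, +inf elsewhere.
vs = [0.7, -1.3, 2.1]          # arbitrary linear-loss slopes
s = sum(vs)

# Primal: inf_w f(w) + sum_t g_t(w), attained at w = -s.
primal = 0.5 * s ** 2 + s * (-s)          # = -0.5*s^2

# Dual: the conjugates force lambda_t = v_t, leaving -f*(-lambda_{1:T}).
dual = -0.5 * (-s) ** 2                   # = -0.5*s^2

assert abs(primal - dual) < 1e-12         # strong duality holds here
w_star = -s                               # primal-dual link: grad f*(-lambda_{1:T})
# w_star indeed minimizes the primal: the derivative w + s vanishes at w = -s.
assert abs(w_star + s) < 1e-12
```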

B Proof of Thm. 2

We first use the following property of the Fenchel conjugate of strongly convex functions. The proof of this lemma follows from Lemma 18 in Shalev-Shwartz [2007].

Lemma 3 Let f be a σ-strongly convex function over S with respect to a norm ‖·‖, and let f* be the Fenchel conjugate of f. Then, f* is differentiable and for all θ_1, θ_2 ∈ R^n we have
  f*(θ_1 + θ_2) − f*(θ_1) ≤ ⟨∇f*(θ_1), θ_2⟩ + (1/(2σ)) ‖θ_2‖²_* .

We also need the following technical lemma.

Lemma 4 Assume f is strongly convex, let a, b > 0, and let w_b = ∇f*(θ/b). Then,
  a f*(θ/a) − b f*(θ/b) ≥ (b − a) f(w_b).

Proof Since f is strongly convex we know that f* is differentiable. Using Lemma 1 we have
  f*(θ/b) = ⟨w_b, θ/b⟩ − f(w_b).
The definition of f* now implies that
  f*(θ/a) = max_w ⟨w, θ/a⟩ − f(w) ≥ ⟨w_b, θ/a⟩ − f(w_b).
Therefore,
  a f*(θ/a) − b f*(θ/b) ≥ (⟨w_b, θ⟩ − a f(w_b)) − (⟨w_b, θ⟩ − b f(w_b)) = (b − a) f(w_b),
which concludes our proof.

Next, we show that the gradient descent update rule yields a sufficient increase of the dual objective.

Lemma 5 Let (λ_1, ..., λ_{t−1}) be an arbitrary sequence of vectors. Denote w = ∇f*(−(1/σ_{1:t}) λ_{1:(t−1)}) and let λ ∈ ∂g_t(w). Then,
  D_{t+1}(λ_1, ..., λ_{t−1}, λ) − D_t(λ_1, ..., λ_{t−1}) ≥ l_t(w) − (1/(2σ_{1:t})) ‖λ‖²_* .

Proof Denote Δ_t = D_{t+1}(λ_1, ..., λ_{t−1}, λ) − D_t(λ_1, ..., λ_{t−1}). Since σ_{1:t} f is σ_{1:t}-strongly convex, we can apply Lemma 3 to its conjugate, θ ↦ σ_{1:t} f*(θ/σ_{1:t}), to get that
  Δ_t = −σ_{1:t} f*(−(λ_{1:(t−1)} + λ)/σ_{1:t}) + σ_{1:(t−1)} f*(−λ_{1:(t−1)}/σ_{1:(t−1)}) − g*_t(λ)
      ≥ −σ_{1:t} f*(−λ_{1:(t−1)}/σ_{1:t}) + ⟨w, λ⟩ − (1/(2σ_{1:t})) ‖λ‖²_* + σ_{1:(t−1)} f*(−λ_{1:(t−1)}/σ_{1:(t−1)}) − g*_t(λ)
      = [σ_{1:(t−1)} f*(−λ_{1:(t−1)}/σ_{1:(t−1)}) − σ_{1:t} f*(−λ_{1:(t−1)}/σ_{1:t})] + [⟨w, λ⟩ − g*_t(λ)] − (1/(2σ_{1:t})) ‖λ‖²_* ,
where the inequality uses ∇f*(−λ_{1:(t−1)}/σ_{1:t}) = w. Denote the first bracketed term by A and the second by B. Since λ ∈ ∂g_t(w), we get from Lemma 1 that B = g_t(w). Next, we use Lemma 4 (with θ = −λ_{1:(t−1)}, a = σ_{1:(t−1)}, and b = σ_{1:t}, so that w_b = w) to get that A ≥ (σ_{1:t} − σ_{1:(t−1)}) f(w) = σ_t f(w). Thus, A + B ≥ σ_t f(w) + g_t(w) = l_t(w), and this concludes our proof.

The proof of Thm. 2 now easily follows.

Proof [of Thm. 2] Denote Δ_t = D_{t+1}(λ^{t+1}_1, ..., λ^{t+1}_t) − D_t(λ^t_1, ..., λ^t_{t−1}) and note that Eq. (10) still holds. The definition of the update in Fig. 3 and Lemma 5 imply that there exists v_t ∈ ∂g_t(w_t) such that Δ_t ≥ l_t(w_t) − (1/(2σ_{1:t})) ‖v_t‖²_*. Summing over t and combining with Eq. (10) we conclude our proof.


Lecture Notes on Online Learning DRAFT Lecture Notes on Online Learning DRAFT Alexander Rakhlin These lecture notes contain material presented in the Statistical Learning Theory course at UC Berkeley, Spring 08. Various parts of these notes

More information

The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

More information

The p-norm generalization of the LMS algorithm for adaptive filtering

The p-norm generalization of the LMS algorithm for adaptive filtering The p-norm generalization of the LMS algorithm for adaptive filtering Jyrki Kivinen University of Helsinki Manfred Warmuth University of California, Santa Cruz Babak Hassibi California Institute of Technology

More information

Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725

Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725 Duality in General Programs Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: duality in linear programs Given c R n, A R m n, b R m, G R r n, h R r : min x R n c T x max u R m, v R r b T

More information

ALMOST COMMON PRIORS 1. INTRODUCTION

ALMOST COMMON PRIORS 1. INTRODUCTION ALMOST COMMON PRIORS ZIV HELLMAN ABSTRACT. What happens when priors are not common? We introduce a measure for how far a type space is from having a common prior, which we term prior distance. If a type

More information

Introduction to Online Learning Theory

Introduction to Online Learning Theory Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

More information

10. Proximal point method

10. Proximal point method L. Vandenberghe EE236C Spring 2013-14) 10. Proximal point method proximal point method augmented Lagrangian method Moreau-Yosida smoothing 10-1 Proximal point method a conceptual algorithm for minimizing

More information

17.3.1 Follow the Perturbed Leader

17.3.1 Follow the Perturbed Leader CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.

More information

Date: April 12, 2001. Contents

Date: April 12, 2001. Contents 2 Lagrange Multipliers Date: April 12, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 12 2.3. Informative Lagrange Multipliers...........

More information

Strongly Adaptive Online Learning

Strongly Adaptive Online Learning Amit Daniely Alon Gonen Shai Shalev-Shwartz The Hebrew University AMIT.DANIELY@MAIL.HUJI.AC.IL ALONGNN@CS.HUJI.AC.IL SHAIS@CS.HUJI.AC.IL Abstract Strongly adaptive algorithms are algorithms whose performance

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

Online Lazy Updates for Portfolio Selection with Transaction Costs

Online Lazy Updates for Portfolio Selection with Transaction Costs Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence Online Lazy Updates for Portfolio Selection with Transaction Costs Puja Das, Nicholas Johnson, and Arindam Banerjee Department

More information

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

More information

Adaptive Bound Optimization for Online Convex Optimization

Adaptive Bound Optimization for Online Convex Optimization Adaptive Bound Optimization for Online Convex Optimization H. Brendan McMahan Google, Inc. mcmahan@google.com Matthew Streeter Google, Inc. mstreeter@google.com Abstract We introduce a new online convex

More information

12. Inner Product Spaces

12. Inner Product Spaces 1. Inner roduct Spaces 1.1. Vector spaces A real vector space is a set of objects that you can do to things ith: you can add to of them together to get another such object, and you can multiply one of

More information

Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

More information

Vectors Math 122 Calculus III D Joyce, Fall 2012

Vectors Math 122 Calculus III D Joyce, Fall 2012 Vectors Math 122 Calculus III D Joyce, Fall 2012 Vectors in the plane R 2. A vector v can be interpreted as an arro in the plane R 2 ith a certain length and a certain direction. The same vector can be

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

More information

DUOL: A Double Updating Approach for Online Learning

DUOL: A Double Updating Approach for Online Learning : A Double Updating Approach for Online Learning Peilin Zhao School of Comp. Eng. Nanyang Tech. University Singapore 69798 zhao6@ntu.edu.sg Steven C.H. Hoi School of Comp. Eng. Nanyang Tech. University

More information

Notes from Week 1: Algorithms for sequential prediction

Notes from Week 1: Algorithms for sequential prediction CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

More information

OPTIMIZING WEB SERVER'S DATA TRANSFER WITH HOTLINKS

OPTIMIZING WEB SERVER'S DATA TRANSFER WITH HOTLINKS OPTIMIZING WEB SERVER'S DATA TRANSFER WIT OTLINKS Evangelos Kranakis School of Computer Science, Carleton University Ottaa,ON. K1S 5B6 Canada kranakis@scs.carleton.ca Danny Krizanc Department of Mathematics,

More information

Online Learning and Online Convex Optimization. Contents

Online Learning and Online Convex Optimization. Contents Foundations and Trends R in Machine Learning Vol. 4, No. 2 (2011) 107 194 c 2012 S. Shalev-Shwartz DOI: 10.1561/2200000018 Online Learning and Online Convex Optimization By Shai Shalev-Shwartz Contents

More information

α = u v. In other words, Orthogonal Projection

α = u v. In other words, Orthogonal Projection Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v

More information

Online Learning with Switching Costs and Other Adaptive Adversaries

Online Learning with Switching Costs and Other Adaptive Adversaries Online Learning with Switching Costs and Other Adaptive Adversaries Nicolò Cesa-Bianchi Università degli Studi di Milano Italy Ofer Dekel Microsoft Research USA Ohad Shamir Microsoft Research and the Weizmann

More information

Further Study on Strong Lagrangian Duality Property for Invex Programs via Penalty Functions 1

Further Study on Strong Lagrangian Duality Property for Invex Programs via Penalty Functions 1 Further Study on Strong Lagrangian Duality Property for Invex Programs via Penalty Functions 1 J. Zhang Institute of Applied Mathematics, Chongqing University of Posts and Telecommunications, Chongqing

More information

Online Classification on a Budget

Online Classification on a Budget Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk

More information

Several Views of Support Vector Machines

Several Views of Support Vector Machines Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

More information

Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

Duality of linear conic problems

Duality of linear conic problems Duality of linear conic problems Alexander Shapiro and Arkadi Nemirovski Abstract It is well known that the optimal values of a linear programming problem and its dual are equal to each other if at least

More information

Efficient Projections onto the l 1 -Ball for Learning in High Dimensions

Efficient Projections onto the l 1 -Ball for Learning in High Dimensions Efficient Projections onto the l 1 -Ball for Learning in High Dimensions John Duchi Google, Mountain View, CA 94043 Shai Shalev-Shwartz Toyota Technological Institute, Chicago, IL, 60637 Yoram Singer Tushar

More information

4.6 Linear Programming duality

4.6 Linear Programming duality 4.6 Linear Programming duality To any minimization (maximization) LP we can associate a closely related maximization (minimization) LP. Different spaces and objective functions but in general same optimal

More information

CONTINUED FRACTIONS AND PELL S EQUATION. Contents 1. Continued Fractions 1 2. Solution to Pell s Equation 9 References 12

CONTINUED FRACTIONS AND PELL S EQUATION. Contents 1. Continued Fractions 1 2. Solution to Pell s Equation 9 References 12 CONTINUED FRACTIONS AND PELL S EQUATION SEUNG HYUN YANG Abstract. In this REU paper, I will use some important characteristics of continued fractions to give the complete set of solutions to Pell s equation.

More information

1 Portfolio mean and variance

1 Portfolio mean and variance Copyright c 2005 by Karl Sigman Portfolio mean and variance Here we study the performance of a one-period investment X 0 > 0 (dollars) shared among several different assets. Our criterion for measuring

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

Schooling, Political Participation, and the Economy. (Online Supplementary Appendix: Not for Publication)

Schooling, Political Participation, and the Economy. (Online Supplementary Appendix: Not for Publication) Schooling, Political Participation, and the Economy Online Supplementary Appendix: Not for Publication) Filipe R. Campante Davin Chor July 200 Abstract In this online appendix, we present the proofs for

More information

1 Introduction. Linear Programming. Questions. A general optimization problem is of the form: choose x to. max f(x) subject to x S. where.

1 Introduction. Linear Programming. Questions. A general optimization problem is of the form: choose x to. max f(x) subject to x S. where. Introduction Linear Programming Neil Laws TT 00 A general optimization problem is of the form: choose x to maximise f(x) subject to x S where x = (x,..., x n ) T, f : R n R is the objective function, S

More information

Convex Programming Tools for Disjunctive Programs

Convex Programming Tools for Disjunctive Programs Convex Programming Tools for Disjunctive Programs João Soares, Departamento de Matemática, Universidade de Coimbra, Portugal Abstract A Disjunctive Program (DP) is a mathematical program whose feasible

More information

Lecture 2: The SVM classifier

Lecture 2: The SVM classifier Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

More information

TD(0) Leads to Better Policies than Approximate Value Iteration

TD(0) Leads to Better Policies than Approximate Value Iteration TD(0) Leads to Better Policies than Approximate Value Iteration Benjamin Van Roy Management Science and Engineering and Electrical Engineering Stanford University Stanford, CA 94305 bvr@stanford.edu Abstract

More information

Mathematical finance and linear programming (optimization)

Mathematical finance and linear programming (optimization) Mathematical finance and linear programming (optimization) Geir Dahl September 15, 2009 1 Introduction The purpose of this short note is to explain how linear programming (LP) (=linear optimization) may

More information

Online Learning and Competitive Analysis: a Unified Approach

Online Learning and Competitive Analysis: a Unified Approach Online Learning and Competitive Analysis: a Unified Approach Shahar Chen Online Learning and Competitive Analysis: a Unified Approach Research Thesis Submitted in partial fulfillment of the requirements

More information

The Multiplicative Weights Update method

The Multiplicative Weights Update method Chapter 2 The Multiplicative Weights Update method The Multiplicative Weights method is a simple idea which has been repeatedly discovered in fields as diverse as Machine Learning, Optimization, and Game

More information

Online Learning of Optimal Strategies in Unknown Environments

Online Learning of Optimal Strategies in Unknown Environments 1 Online Learning of Optimal Strategies in Unknown Environments Santiago Paternain and Alejandro Ribeiro Abstract Define an environment as a set of convex constraint functions that vary arbitrarily over

More information

Optimal energy trade-off schedules

Optimal energy trade-off schedules Optimal energy trade-off schedules Neal Barcelo, Daniel G. Cole, Dimitrios Letsios, Michael Nugent, Kirk R. Pruhs To cite this version: Neal Barcelo, Daniel G. Cole, Dimitrios Letsios, Michael Nugent,

More information

The Advantages and Disadvantages of Online Linear Optimization

The Advantages and Disadvantages of Online Linear Optimization LINEAR PROGRAMMING WITH ONLINE LEARNING TATSIANA LEVINA, YURI LEVIN, JEFF MCGILL, AND MIKHAIL NEDIAK SCHOOL OF BUSINESS, QUEEN S UNIVERSITY, 143 UNION ST., KINGSTON, ON, K7L 3N6, CANADA E-MAIL:{TLEVIN,YLEVIN,JMCGILL,MNEDIAK}@BUSINESS.QUEENSU.CA

More information

BANACH AND HILBERT SPACE REVIEW

BANACH AND HILBERT SPACE REVIEW BANACH AND HILBET SPACE EVIEW CHISTOPHE HEIL These notes will briefly review some basic concepts related to the theory of Banach and Hilbert spaces. We are not trying to give a complete development, but

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization Adaptive Subgradient Methods Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi Computer Science Division University of California, Berkeley Berkeley, CA 94720 USA

More information

Numerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Numerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen (für Informatiker) M. Grepl J. Berger & J.T. Frings Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2010/11 Problem Statement Unconstrained Optimality Conditions Constrained

More information

Duality in Linear Programming

Duality in Linear Programming Duality in Linear Programming 4 In the preceding chapter on sensitivity analysis, we saw that the shadow-price interpretation of the optimal simplex multipliers is a very useful concept. First, these shadow

More information

Proximal mapping via network optimization

Proximal mapping via network optimization L. Vandenberghe EE236C (Spring 23-4) Proximal mapping via network optimization minimum cut and maximum flow problems parametric minimum cut problem application to proximal mapping Introduction this lecture:

More information

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method Robert M. Freund February, 004 004 Massachusetts Institute of Technology. 1 1 The Algorithm The problem

More information

Robust Median Reversion Strategy for On-Line Portfolio Selection

Robust Median Reversion Strategy for On-Line Portfolio Selection Proceedings of the Tenty-Third International Joint Conference on Artificial Intelligence Robust Median Reversion Strategy for On-Line Portfolio Selection Dingjiang Huang,2,3, Junlong Zhou, Bin Li 4, Steven

More information

Optimization Theory for Large Systems

Optimization Theory for Large Systems Optimization Theory for Large Systems LEON S. LASDON CASE WESTERN RESERVE UNIVERSITY THE MACMILLAN COMPANY COLLIER-MACMILLAN LIMITED, LONDON Contents 1. Linear and Nonlinear Programming 1 1.1 Unconstrained

More information

On Adaboost and Optimal Betting Strategies

On Adaboost and Optimal Betting Strategies On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: pm@dcs.qmul.ac.uk Fabrizio Smeraldi School of

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

Metric Spaces. Chapter 7. 7.1. Metrics

Metric Spaces. Chapter 7. 7.1. Metrics Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some

More information

Introduction to Online Optimization

Introduction to Online Optimization Princeton University - Department of Operations Research and Financial Engineering Introduction to Online Optimization Sébastien Bubeck December 14, 2011 1 Contents Chapter 1. Introduction 5 1.1. Statistical

More information

A Quantitative Approach to the Performance of Internet Telephony to E-business Sites

A Quantitative Approach to the Performance of Internet Telephony to E-business Sites A Quantitative Approach to the Performance of Internet Telephony to E-business Sites Prathiusha Chinnusamy TransSolutions Fort Worth, TX 76155, USA Natarajan Gautam Harold and Inge Marcus Department of

More information

Randomization Approaches for Network Revenue Management with Customer Choice Behavior

Randomization Approaches for Network Revenue Management with Customer Choice Behavior Randomization Approaches for Network Revenue Management with Customer Choice Behavior Sumit Kunnumkal Indian School of Business, Gachibowli, Hyderabad, 500032, India sumit kunnumkal@isb.edu March 9, 2011

More information

The Relative Worst Order Ratio for On-Line Algorithms

The Relative Worst Order Ratio for On-Line Algorithms The Relative Worst Order Ratio for On-Line Algorithms Joan Boyar 1 and Lene M. Favrholdt 2 1 Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark, joan@imada.sdu.dk

More information

Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

More information

Pacific Journal of Mathematics

Pacific Journal of Mathematics Pacific Journal of Mathematics GLOBAL EXISTENCE AND DECREASING PROPERTY OF BOUNDARY VALUES OF SOLUTIONS TO PARABOLIC EQUATIONS WITH NONLOCAL BOUNDARY CONDITIONS Sangwon Seo Volume 193 No. 1 March 2000

More information

1 Norms and Vector Spaces

1 Norms and Vector Spaces 008.10.07.01 1 Norms and Vector Spaces Suppose we have a complex vector space V. A norm is a function f : V R which satisfies (i) f(x) 0 for all x V (ii) f(x + y) f(x) + f(y) for all x,y V (iii) f(λx)

More information

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

More information

Notes on Determinant

Notes on Determinant ENGG2012B Advanced Engineering Mathematics Notes on Determinant Lecturer: Kenneth Shum Lecture 9-18/02/2013 The determinant of a system of linear equations determines whether the solution is unique, without

More information

Leveraging Multipath Routing and Traffic Grooming for an Efficient Load Balancing in Optical Networks

Leveraging Multipath Routing and Traffic Grooming for an Efficient Load Balancing in Optical Networks Leveraging ultipath Routing and Traffic Grooming for an Efficient Load Balancing in Optical Netorks Juliana de Santi, André C. Drummond* and Nelson L. S. da Fonseca University of Campinas, Brazil Email:

More information

z 0 and y even had the form

z 0 and y even had the form Gaussian Integers The concepts of divisibility, primality and factoring are actually more general than the discussion so far. For the moment, we have been working in the integers, which we denote by Z

More information

SECOND DERIVATIVE TEST FOR CONSTRAINED EXTREMA

SECOND DERIVATIVE TEST FOR CONSTRAINED EXTREMA SECOND DERIVATIVE TEST FOR CONSTRAINED EXTREMA This handout presents the second derivative test for a local extrema of a Lagrange multiplier problem. The Section 1 presents a geometric motivation for the

More information

Universal Algorithm for Trading in Stock Market Based on the Method of Calibration

Universal Algorithm for Trading in Stock Market Based on the Method of Calibration Universal Algorithm for Trading in Stock Market Based on the Method of Calibration Vladimir V yugin Institute for Information Transmission Problems, Russian Academy of Sciences, Bol shoi Karetnyi per.

More information

Some stability results of parameter identification in a jump diffusion model

Some stability results of parameter identification in a jump diffusion model Some stability results of parameter identification in a jump diffusion model D. Düvelmeyer Technische Universität Chemnitz, Fakultät für Mathematik, 09107 Chemnitz, Germany Abstract In this paper we discuss

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete

More information

Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom. Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

More information

LARGE CLASSES OF EXPERTS

LARGE CLASSES OF EXPERTS LARGE CLASSES OF EXPERTS Csaba Szepesvári University of Alberta CMPUT 654 E-mail: szepesva@ualberta.ca UofA, October 31, 2006 OUTLINE 1 TRACKING THE BEST EXPERT 2 FIXED SHARE FORECASTER 3 VARIABLE-SHARE

More information

6.207/14.15: Networks Lecture 15: Repeated Games and Cooperation

6.207/14.15: Networks Lecture 15: Repeated Games and Cooperation 6.207/14.15: Networks Lecture 15: Repeated Games and Cooperation Daron Acemoglu and Asu Ozdaglar MIT November 2, 2009 1 Introduction Outline The problem of cooperation Finitely-repeated prisoner s dilemma

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number

More information

24. The Branch and Bound Method

24. The Branch and Bound Method 24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NP-complete. Then one can conclude according to the present state of science that no

More information

Online Convex Programming and Generalized Infinitesimal Gradient Ascent

Online Convex Programming and Generalized Infinitesimal Gradient Ascent Online Convex Programming and Generalized Infinitesimal Gradient Ascent Martin Zinkevich February 003 CMU-CS-03-110 School of Computer Science Carnegie Mellon University Pittsburgh, PA 1513 Abstract Convex

More information

Basic Linear Algebra

Basic Linear Algebra Basic Linear Algebra by: Dan Sunday, softsurfer.com Table of Contents Coordinate Systems 1 Points and Vectors Basic Definitions Vector Addition Scalar Multiplication 3 Affine Addition 3 Vector Length 4

More information

CONVEX optimization forms the backbone of many

CONVEX optimization forms the backbone of many IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 58, NO 5, MAY 2012 3235 Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization Alekh Agarwal, Peter L Bartlett, Member,

More information

Lecture 4: BK inequality 27th August and 6th September, 2007

Lecture 4: BK inequality 27th August and 6th September, 2007 CSL866: Percolation and Random Graphs IIT Delhi Amitabha Bagchi Scribe: Arindam Pal Lecture 4: BK inequality 27th August and 6th September, 2007 4. Preliminaries The FKG inequality allows us to lower bound

More information

We shall turn our attention to solving linear systems of equations. Ax = b

We shall turn our attention to solving linear systems of equations. Ax = b 59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Gambling Systems and Multiplication-Invariant Measures

Gambling Systems and Multiplication-Invariant Measures Gambling Systems and Multiplication-Invariant Measures by Jeffrey S. Rosenthal* and Peter O. Schwartz** (May 28, 997.. Introduction. This short paper describes a surprising connection between two previously

More information

A HYBRID GENETIC ALGORITHM FOR THE MAXIMUM LIKELIHOOD ESTIMATION OF MODELS WITH MULTIPLE EQUILIBRIA: A FIRST REPORT

A HYBRID GENETIC ALGORITHM FOR THE MAXIMUM LIKELIHOOD ESTIMATION OF MODELS WITH MULTIPLE EQUILIBRIA: A FIRST REPORT New Mathematics and Natural Computation Vol. 1, No. 2 (2005) 295 303 c World Scientific Publishing Company A HYBRID GENETIC ALGORITHM FOR THE MAXIMUM LIKELIHOOD ESTIMATION OF MODELS WITH MULTIPLE EQUILIBRIA:

More information

Trading regret rate for computational efficiency in online learning with limited feedback

Trading regret rate for computational efficiency in online learning with limited feedback Trading regret rate for computational efficiency in online learning with limited feedback Shai Shalev-Shwartz TTI-C Hebrew University On-line Learning with Limited Feedback Workshop, 2009 June 2009 Shai

More information