Online Convex Optimization



Similar documents
Adaptive Online Gradient Descent

1 Introduction. 2 Prediction with Expert Advice. Online Learning Lecture 09

Follow the Perturbed Leader

Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.

Introduction to Online Optimization

Online Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes

Week 1: Introduction to Online Learning

The p-norm generalization of the LMS algorithm for adaptive filtering

Further Study on Strong Lagrangian Duality Property for Invex Programs via Penalty Functions 1

GI01/M055 Supervised Learning Proximal Methods

1 Portfolio Selection

10. Proximal point method

Strongly Adaptive Online Learning

Big Data - Lecture 1 Optimization reminders

Load balancing of temporary tasks in the l p norm

Convex analysis and profit/cost/support functions

Example 4.1 (nonlinear pendulum dynamics with friction) Figure 4.1: Pendulum. asin. k, a, and b. We study stability of the origin x

Projection-free Online Learning

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

Online Lazy Updates for Portfolio Selection with Transaction Costs

Adaptive Bound Optimization for Online Convex Optimization

Online Learning, Stability, and Stochastic Gradient Descent

Online Convex Programming and Generalized Infinitesimal Gradient Ascent

TD(0) Leads to Better Policies than Approximate Value Iteration

The Advantages and Disadvantages of Online Linear Optimization

Online Algorithms: Learning & Optimization with No Regret.

2.3 Convex Constrained Optimization Problems

Online Learning and Online Convex Optimization. Contents

Introduction to Online Learning Theory

ALMOST COMMON PRIORS 1. INTRODUCTION

Definition Given a graph G on n vertices, we define the following quantities:

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Efficient Projections onto the l 1 -Ball for Learning in High Dimensions

Mathematical Methods of Engineering Analysis

The Relative Worst Order Ratio for On-Line Algorithms

Notes from Week 1: Algorithms for sequential prediction

Numerical Analysis Lecture Notes

Simple and efficient online algorithms for real world applications

Follow the Leader with Dropout Perturbations

Distributed Machine Learning and Big Data

(Quasi-)Newton methods

Stochastic gradient methods for machine learning

Algorithms for sustainable data centers

Numerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen

4.1 Introduction - Online Learning Model

1 Norms and Vector Spaces

Stochastic gradient methods for machine learning

Nonlinear Programming Methods.S2 Quadratic Programming

Convex Programming Tools for Disjunctive Programs

Separation Properties for Locally Convex Cones

Completely Positive Cone and its Dual

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

We shall turn our attention to solving linear systems of equations. Ax = b

The Expectation Maximization Algorithm A short tutorial

Classification/Decision Trees (II)

An interval linear programming contractor

Date: April 12, Contents

LS.6 Solution Matrices

Lecture 7: Finding Lyapunov Functions 1

CS229T/STAT231: Statistical Learning Theory (Winter 2015)

Operations Research and Financial Engineering. Courses

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

Several Views of Support Vector Machines

Linear Risk Management and Optimal Selection of a Limited Number

Lecture 4: BK inequality 27th August and 6th September, 2007

Online Learning: Theory, Algorithms, and Applications. Thesis submitted for the degree of Doctor of Philosophy by Shai Shalev-Shwartz

Properties of BMO functions whose reciprocals are also BMO

Big Data Analytics. Lucas Rego Drumond

Tail inequalities for order statistics of log-concave vectors and applications

Mathematical finance and linear programming (optimization)

Quasi-static evolution and congested transport

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

Duality in General Programs. Ryan Tibshirani Convex Optimization /36-725

Accelerated Parallel Optimization Methods for Large Scale Machine Learning

The Goldberg Rao Algorithm for the Maximum Flow Problem

Big Data Analytics: Optimization and Randomization

GenOpt (R) Generic Optimization Program User Manual Version 3.0.0β1

Stiffie's On Line Scheduling Algorithm

Online Learning and Competitive Analysis: a Unified Approach

Beyond stochastic gradient descent for large-scale machine learning

Mathematics Course 111: Algebra I Part IV: Vector Spaces

SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH

OPTIMAL CONTROL OF A COMMERCIAL LOAN REPAYMENT PLAN. E.V. Grigorieva. E.N. Khailov

DRAFT: Introduction to Online Convex Optimization DRAFT

LECTURE 1: DIFFERENTIAL FORMS forms on R n

Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization

1 if 1 x 0 1 if 0 x 1

Transcription:

E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting of online convex optimization which, as we shall see, encompasses some of the online learning problems we have seen so far as special cases. A standard offline convex optimization problem involves a convex set Ω R n and a fixed convex cost function c : Ω R; the goal is to find a point x Ω that minimizes cx over Ω. An online convex optimization problem also involves a convex set Ω R n, but proceeds in trials: at each trial t, the player/algorithm must choose a point x t Ω, after which a convex cost function c t : Ω R is revealed by the environment/adversary, and a loss of c t x t is incurred for the trial; the goal is to minimize the total loss incurred over a sequence of trials, relative to the best single point in Ω in hindsight. Online Convex Optimization Convex set Ω R n For t 1,..., : Play x t Ω Receive cost function c t : Ω R Incur loss c t x t We will denote the total loss incurred by an algorithm A on trials as and the total loss of a fixed point x Ω as L [A] c t x t, L [x] c t x ; the regret of A w.r.t. the best fixed point in Ω is then simply L [A] inf x Ω L [x]. Let us consider which of the online learning problems we have seen earlier can be viewed as instances of online convex optimization OCO: 1. Online linear regression GD analysis. Here one selects a weight vector w t R n, and incurs a loss given by c t w t l y tw t x t, which is assumed to be convex in its argument, and is therefore convex in w t ; however the total loss is compared against the best fixed weight vector in only { u R n : u 2 U } for some U > 0. Call this latter set Ω. Clearly, if the weight vectors w t are also selected to be in Ω, e.g. by Euclidean projection onto Ω which in this case amounts to a simple normalization, then this becomes an instance of OCO. 2. Online linear regression EG analysis. Here one selects w t n, and incurs a loss given by c t w t l y tw t x t, which is assumed to be convex in its argument, and is therefore convex in w t ; the total loss is also compared against the best fixed weight vector in n. hus this can be viewed as an instance of OCO with Ω n. 1

2 Online Convex Optimization 3. Online prediction from expert advice. Here one selects a probability vector ˆp t n, and incurs a loss c t ˆp t l y tˆp t ξ t, which is assumed to be convex in its argument, and is therefore convex in ˆp t. However the total loss is compared only against the best vector in { e i n : i [n] }, where e i denotes a unit vector with 1 in the i-th position and 0 elsewhere. In general, the smallest loss in this set is not the same as the smallest loss in n, and therefore this is not an instance of OCO. 4. Online decision/allocation among experts. Here one selects a probability vector ˆp t n, and incurs a linear loss c t ˆp t ˆp t l t, which is clearly convex in ˆp t. he total loss is again compared against the best vector in { e i n : i [n] } as above, but in this case, comparison against this set is actually equivalent to comparison against n, since inf L [ˆp] ˆp n inf ˆp n ˆp l t inf ˆp ˆp n l t min i [n] e i l t min e i l t. i [n] hus online allocation with a linear loss as above can be viewed as an instance of OCO with Ω n. 2 Online Gradient Descent Algorithm he online gradient descent OGD algorithm [7] generalizes the basic gradient descent algorithm used for standard offline optimization problems, and can be described as follows: Algorithm Online Gradient Descent OGD Parameter: η > 0 Initialize: x 1 Ω For t 1,..., : Play x t Ω Receive cost function c t : Ω R Incur loss c t x t Update: x t+1 x t η c t x t x t+1 P Ω x t+1, where P Ω x argmin x x 2 Euclidean projection of x onto Ω x Ω heorem 2.1 Zinkevich, 2003. Let Ω R n be a closed, non-empty, convex set with x x 1 2 D x Ω. Let c t : Ω R t 1, 2,... be any sequence of convex cost functions with bounded subgradients, c t x 2 G t, x Ω. hen D G [ ] L OGDη inf L [x] 1 D 2 x Ω 2 η + η G2. Moreover, setting η upper bounds on them, are known in advance, we have to minimize the right-hand side of the above bound assuming D, G and, or L [ OGDη ] inf x Ω L [x] DG. Notes. Before proving the above theorem, we note the following: 1. he analysis of the GD algorithm for online regression can be viewed as a special case of the above, with D U and G ZR 2 see Lecture 17 notes. 2. With η as above which depends on knowledge of D, G and, one gets an O regret bound. Such an O regret bound with worse constants can also be obtained by using a time-varying parameter η t 1/ t t in fact this was the version contained in the original paper of Zinkevich [7].

Online Convex Optimization 3 Proof of heorem 2.1. he proof is similar to that of the regret bound we obtained for the GD algorithm for online regression in Lecture 17. Let x Ω. We have t [ ]: ct x t c t x c t x t x t x, by convexity of c t x t x t+1 x t x Summing over t 1..., we thus get 1 1 1 η x x t 2 2 x x t+1 2 2 + x t x t+1 2 2 x x t 2 2 x x t+1 2 2 + x t x t+1 2 2 x x t 2 2 x x t+1 2 2 + η 2 c t x t 2 2 ct x t c t x 1 x x 1 2 2 x x +1 2 2 + η 2 G 2 1 D 2 + η 2 G 2. It is easy to see that the right-hand side is minimized at η inequality gives the desired result. D G ; substituting this back in the above 2.1 Logarithmic regret bound for OGD when cost functions are strongly convex using time-varying parameter η t he above result gives an O regret bound for the OGD algorithm in general. When the cost functions c t are known to be α-strongly convex, one can in fact obtain an Oln regret bound with a slight variant of the OGD algorithm that uses a suitable time-varying parameter η t [3]: heorem 2.2 Hazan et al, 2007. Let Ω R n be a closed, non-empty, convex set with x x 1 2 D x Ω. Let c t : Ω R t 1, 2,... be any sequence of α-strongly convex cost functions with bounded subgradients, c t x t 2 G t, x Ω. hen the OGD algorithm with time-varying parameter η t 1 αt satisfies L [OGD η t 1 αt ] inf L [x] G2 ln + 1. x Ω Proof. Let x Ω. Proceeding similarly as before, we have t [ ]: ct x t c t x c t x t x t x α 2 xt x 2 2, by α-strong convexity of c t x t x t+1 x t x α η t 2 xt x 2 2 1 x x t 2 2 x x t+1 2 2 + x t x t+1 2 2 αη t x t x 2 2 t 1 x x t 2 2 x x t+1 2 2 + x t x t+1 2 2 αη t x t x 2 2 t 1 x x t 2 2 x x t+1 2 2 + η 2 t c t x t 2 2 αη t x t x 2 2. t Summing over t 1... then gives ct x t c t x G2 2 G2 η t + 1 1 α x x 2 1 2 2 + 1 η 1 2 1 + ln + 0 + 0. t2 1 1 α x x t 2 2 η t η t 1

4 Online Convex Optimization 3 Online Mirror Descent Algorithm he online mirror descent OMD algorithm, described for example in [6], generalizes the basic mirror descent algorithm used for standard offline optimization problems [4, 5, 1] much as OGD generalizes basic gradient descent. he OMD framework allows one to obtain O regret bounds for cost functions with bounded subgradients not only in Euclidean norm assumed by the OGD analysis, but in any suitable norm. Let Ω R n be a convex set containing Ω i.e. Ω Ω, and let F : Ω R be a strictly convex and differentiable function. Let B F : Ω Ω R denote the Bregman divergence associated with F ; recall that for all u, v Ω, this is defined as B F u, v F u F v u v F v. hen the OMD algorithm using the set Ω and function F can be described as follows: Algorithm Online Mirror Descent OMD Parameters: η > 0; convex set Ω Ω; strictly convex, differentiable function F : Ω R Initialize: x 1 argmin F x x Ω For t 1,..., : Play x t Ω Receive cost function c t : Ω R Incur loss c t x t Update: F x t+1 F x t η c t x t x t+1 P F Ω xt+1, where P F Ω Assume this yields x t+1 Ω x argmin B F x, x Bregman projection of x onto Ω x Ω Below we will obtain a regret bound for OMD with appropriate choice of the function F for cost functions with subgradients bounded in an arbitrary norm. Before we do so, let us recall the notion of a dual norm. Specifically, let be any norm defined on a closed convex set Ω R n. hen the corresponding dual norm is defined as u sup x Ω: x 1 x u 2 sup x u 1 2 x 2. x Ω heorem 3.1. Let Ω R n be a closed, non-empty, convex set. Let c t : Ω R t 1, 2,... be any sequence of convex cost functions with bounded subgradients c t x t G t, x Ω where is an arbitrary norm. Let Ω Ω be a convex set and F : Ω R be a strictly convex function such that a F x F x 1 D 2 x Ω; and b the restriction of F to Ω is α-strongly convex w.r.t., the dual norm of. hen [ ] L OMDη inf L [x] 1 D 2 + η2 G 2. x Ω η Moreover, setting η D G α, or upper bounds on them, are known in advance, we have to minimize the right-hand side of the above bound assuming D, G, and L [ OMDη ] inf x Ω L [x] DG 2 α.

Online Convex Optimization 5 Proof. Let x Ω. Proceeding as before, we have t [ ]: ct x t c t x c t x t x t x, by convexity of c t F x t F x t+1 x t x η B F x, x t B F x, x t+1 + B F x t, x t+1 1 η Summing over t 1..., we thus get 1 η B F x, x t B F x, x t+1 B F x t+1, x t+1 + B F x t, x t+1. ct x t c t x 1 B F x, x 1 B F x, x +1 + 1 η η Now, it can be shown that Also, we have hus we get B F x t, x t+1 B F x t+1, x t+1 B F x, x 1 F x F x 1 D 2. F x t F x t+1 F x t+1 x t x t+1 B F x t, x t+1 B F x t+1, x t+1. F x t x t x t+1 α 2 xt x t+1 2 F x t+1.x t x t+1, by α-strong convexity of F on Ω η c t x t x t x t+1 α 2 xt x t+1 2 ηg x t x t+1 α 2 xt x t+1 2, by Hölders inequality η 2 G 2, η2 G 2. since η2 G 2 + α 2 xt x t+1 2 ηg x t x t+1 ct x t c t x 1 η It is easy to verify that the right-hand side is minimized at η D G inequality gives the desired result. Examples. Examples of OMD algorithms include the following: D 2 + η2 G 2. ηg α 2 xt x t+1 2 0 ; substituting this back in the above 1. OGD. Here Ω R n and F x 1 2 x 2 2. Note that F is 1-strongly convex on R n and therefore on any Ω R n. 2. Exponential weights. Here Ω a n for some a > 0; Ω R n ++; and F x n i1 x i ln x n i1 x i. Note that F is strictly convex on Ω and 1-strongly convex on Ω n.

6 Online Convex Optimization Exercise. Suppose the cost functions c t are all β-strongly convex for some β > 0 and have subgradients bounded in some arbitrary norm as above. Can you obtain a logarithmic Oln regret bound for the OMD algorithm in this case, as we did for the OGD algorithm? Why or why not? More details on online convex optimization, including missing details in some of the above proofs, can be found for example in [2]. 4 Next Lecture In the next lecture, we will consider how algorithms for online supervised learning problems such as online regression and classification can be used in a batch setting, and will see how regret bounds for online learning algorithms can be converted to generalization error bounds for the resulting batch algorithms. References [1] Amir Beck and Marc eboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 313:167 175, 2003. [2] Sébastion Bubeck. Introduction to online optimization. Lecture Notes, Princeton University, 2011. [3] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169 192, 2007. [4] A. Nemirovski. Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody, 15, 1979. [5] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983. [6] Shai Shalev-Shwartz. Online Learning: heory, Algorithms, and Applications. PhD thesis, he Hebrew University of Jerusalem, 2007. [7] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning ICML, 2003.