Online Learning and Competitive Analysis: a Unified Approach


Shahar Chen


Online Learning and Competitive Analysis: a Unified Approach

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Shahar Chen

Submitted to the Senate of the Technion, Israel Institute of Technology
Iyar 5775, Haifa, April 2015


This research was carried out under the supervision of Prof. Seffi Naor and Dr. Niv Buchbinder, in the Department of Computer Science. Some results in this thesis have been published as articles by the author and research collaborators in conferences and journals during the course of the author's doctoral research period, the most up-to-date versions being:

Niv Buchbinder, Shahar Chen, Anupam Gupta, Viswanath Nagarajan, and Joseph Naor. Online packing and covering framework with convex objectives. CoRR.

Niv Buchbinder, Shahar Chen, and Joseph Naor. Competitive algorithms for restricted caching and matroid caching. In Algorithms, ESA: Annual European Symposium, Wroclaw, Poland, September 8-10, Proceedings.

Niv Buchbinder, Shahar Chen, and Joseph Naor. Competitive analysis via regularization. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014.

Niv Buchbinder, Shahar Chen, Joseph Naor, and Ohad Shamir. Unified algorithms for online learning and competitive analysis. In COLT: The Twenty-fifth Annual Conference on Learning Theory.

Acknowledgements

I would like to thank Seffi and Niv, my advisors, who wisely led me and graciously guided me throughout the years of my work. It has been a fascinating period, and I highly appreciate the good fortune of having had the chance to work with you. The generous financial help of Irwin and Joan Jacobs, the Zeff Fellowship, and the Technion is gratefully acknowledged.


Contents

List of Figures
Abstract
Abbreviations and Notations
1 Introduction
2 Preliminaries
  Linear Programming and Convex Programming
  Lagrangian Duality and Optimality Conditions
  Approximation Algorithms using Linear Programming
  Matroids and Submodular Functions
  Matroids
  Submodular Functions
  Brief Introduction to Online Computation
3 Competitiveness via Regularization
  Introduction
  Online Regularization Algorithm
  Analysis
  General Covering Constraints with Variable Upper Bounds
  Online Set Cover with Service Cost
4 Unified Algorithms for Online Learning and Competitive Analysis
  Introduction
  Preliminaries: Online Learning and Competitive Analysis
  Algorithms and Results
  Proofs and Algorithm Derivation: the Experts/MTS Case
  Proofs and Algorithm Derivation: the Matroid Case

5 Restricted Caching and Matroid Caching
  Introduction
  Definitions and Problem Formulation
  Main Algorithm
  Rounding the Fractional Solution Online
  A Lower Bound on the Auxiliary Graph Diameter
  Special Cases of Restricted Caching
  Concluding Remarks
6 Online Packing and Covering Framework with Convex Objectives
  Introduction
  Techniques and Chapter Outline
  The General Framework
  The Algorithm
  Monotone Online Maximization
  Applications
  l_p-norm of Packing Constraints
  Online Set Cover with Multiple Costs
  Profit Maximization with Nonseparable Production Costs
Hebrew Abstract

List of Figures

3.1 Primal and dual LP formulations for the online covering problem
The primal and dual LP formulations for the MTS problem
The primal and dual LP formulations for the matroid problem
(n, l)-companion cache
The primal and dual LP formulations for the matroid caching problem
Uniform decomposition into spanning trees of the initial fractional solution
Decomposition into spanning trees of the updated fractional solution


Abstract

Online learning and competitive analysis are two widely studied frameworks for online decision-making settings. Despite the frequent similarity of the problems they study, there are significant differences in their assumptions, goals and techniques, hindering a unified analysis and richer interplay between the two. In this research we provide several contributions in this direction. First, we provide a single unified algorithm which, by parameter tuning, interpolates between optimal regret for learning from experts in online learning and optimal competitive ratio for the metrical task systems (MTS) problem in competitive analysis, improving on the results of Blum and Burch. The algorithm also allows us to obtain new regret bounds against drifting experts, which might be of independent interest. Moreover, our approach allows us to go beyond experts/MTS, obtaining similar unifying results for structured action sets and combinatorial experts, whenever the setting has a certain matroid structure. A complementary direction of our research tries to borrow various learning techniques, specifically focusing on the online convex optimization domain, in order to obtain new results in the competitive analysis framework. We show how regularization, a fundamental method in machine learning and particularly in the field of online learning, can be applied to obtain new results in the area of competitive analysis. We also show how convex conjugacy and Fenchel duality, other powerful techniques used in online convex optimization and learning, can be used in the competitive analysis setting, allowing us to cope with a richer world of online optimization problems.


Abbreviations and Notations

LP : Linear program
P : A primal minimum program
D : A dual maximum program
ΔP, ΔD : The change in the cost of the primal and dual programs, respectively
G = (V, E) : Graph with set of vertices V and set of edges E
opt : The cost of the optimal offline solution
Δ(u ∥ v) : Relative entropy with respect to u and v
E : Ground set
n : Number of elements in the ground set
I : Collection of independent sets (a subset of 2^E)
M : Matroid
r_M : Matroid rank function
γ_M : Matroid density
c_M : Matroid circumference
B_M : The bases polytope corresponding to M
P_M : The independent sets polytope corresponding to M
P^ss_M : The spanning sets polytope corresponding to M
M* : The dual matroid of M
f* : The convex conjugate of the function f
∇_l f(x) : The l-th coordinate of the gradient of f at point x


Chapter 1

Introduction

Online learning, in its decision-theoretic formulation, captures the problem of a decision-maker who iteratively needs to make decisions in the face of future uncertainty. In each round, the decision-maker picks a certain action from an action set, and then suffers a cost associated with that action. The cost vector is not known in advance, and might even be chosen by an adversary with full knowledge of the decision-maker's strategy. The performance is typically measured in terms of the regret, namely the difference between the total accumulated cost and the cost of an arbitrary fixed policy from some comparison class. Non-trivial algorithms usually attain regret which is sublinear in the number of rounds.

While online learning is a powerful and compelling framework, with deep connections to statistical learning, it also has some shortcomings. In particular, it is well recognized that regret against a fixed policy is often too weak, especially when the environment changes over time and thus no single policy is always good. This has led to several papers (e.g., [HW98, HS09, CMEDV10, RST11]) which discuss performance with respect to stronger notions of regret, such as adaptive regret or tracking the best expert. A related shortcoming of online learning is that it does not capture well problems with states, where costs depend on the decision-maker's current configuration as well as on past actions. Consider, for instance, the problem of allocating jobs to servers in an online fashion. Clearly, the time it takes to process jobs strongly depends on the system state, such as its overall load, determined by all previous allocation decisions. The notion of regret does not capture this setting well, since it measures the regret with respect to a fixed policy, while assuming that at each step this policy faces the exact same costs.
Thus, one might desire algorithms for a much more ambitious framework, where we need to compete against arbitrary policies, including an optimal offline policy which has access to future unknown costs, and where we can model states. Such problems have been intensively studied in the field of competitive analysis (for a detailed background, see [BEY98]). In such a framework, attaining sublinear regret is hopeless in general. Instead, the main measure used is the competitive ratio, which bounds the ratio of the total cost of the decision-maker and the

total cost of an optimal offline policy, in a worst-case sense. This usually provides a weaker performance guarantee than online learning, but with respect to a much stronger optimality criterion.

While problems studied under these two frameworks are often rather similar, there has not been much research on general connections between the two. The main reason for this situation (other than social factors stemming from the separate communities studying them) lies in some crucial differences in the modeling assumptions. For example, in order to model the notion of state, competitive analysis usually assumes a movement cost for switching between states. In the online learning framework, this would be equivalent to having an additional cost associated with switching actions between rounds. Another difference is that in competitive analysis one assumes 1-lookahead, i.e., the decision-maker knows the cost vector in the current round. In contrast, online learning has 0-lookahead, and the decision-maker does not know the cost vector of the current round until a decision is made. Such differences, as stated in [CBL06, p. 3], have so far prevented the derivation of a general theory allowing a unified analysis of both types of problems.

In this work, we attempt to connect the two fields, online learning and competitive analysis. Our attempt can be classified into two lines of action. The first line of action is to exploit the similarities between the frameworks to provide a unified algorithmic approach that attains both optimal regret and an optimal competitive ratio for a large class of problems in both fields. Chapter 4 addresses this issue. The second line of action is to bridge the analytical gap between the two fields. That is, as the two communities have worked separately, different tools and methods have evolved in one field without much attention from the other.
Our research tries to borrow various techniques, especially from the learning domain, in order to obtain new results in the competitive analysis framework. Chapter 3 and Chapter 6 address this issue.

Our Contribution

In Chapter 3 we provide a framework for designing competitive online algorithms using regularization, a widely used technique in online learning, particularly in online convex optimization. In our new framework we exhibit a general competitive deterministic algorithm for generating a fractional solution that satisfies a time-varying set of online covering and precedence constraints. This framework allows us to incorporate both service costs over time and setup costs into a host of applications. We then provide a competitive randomized algorithm for the online set cover problem with service cost. This model allows sets to be both added and deleted over time from a solution.

Chapter 4 adapts the regularization approach studied in Chapter 3 to introduce a single unified algorithm which, by parameter tuning, interpolates between optimal regret for learning from experts in online learning and optimal competitive ratio for the metrical task systems (MTS) problem in competitive analysis, improving on previous results. The algorithm also

allows us to obtain new regret bounds against drifting experts, which might be of independent interest. Moreover, our approach allows us to go beyond experts/MTS, obtaining similar unifying results for structured action sets and combinatorial experts, whenever the setting has a certain matroid structure.

In Chapter 5 we exploit the techniques introduced in Chapter 4 to study the online restricted caching problem, where each memory item can be placed in only a restricted subset of cache locations. We solve this problem through a more general online caching problem, in which the cache architecture is subject to matroid constraints. Our main result is a polynomial-time approximation algorithm for the matroid caching problem, which guarantees an O(log² k)-approximation for any restricted cache of size k, independently of its structure. In addition, we study the (n, l)-companion caching problem, defined by [BETW01] as a special case of restricted caching, and prove that our algorithm achieves an optimal competitive factor of O(log n + log l), improving on previous results of [FMS02].

Chapter 6 considers online fractional covering problems with a convex objective, where the covering constraints arrive over time. We also consider the corresponding dual online packing problems with a concave objective. We provide an online primal-dual framework for both classes of problems with a competitive ratio depending on certain monotonicity and smoothness parameters of the objective function f, which match or improve on guarantees for some special classes of functions f considered previously. This framework extends the primal-dual linear programming techniques developed in competitive analysis, using the notion of convex conjugacy and Fenchel duality, well-studied techniques in online convex optimization.


Chapter 2

Preliminaries

2.1 Linear Programming and Convex Programming

A mathematical program, or a mathematical optimization problem, is a problem of minimizing or maximizing a function over a feasible set of constraints. More formally, we define a mathematical program in the following form:

  min f_0(x)
  subject to:                                  (2.1)
  f_j(x) ≥ b_j, for any 1 ≤ j ≤ m.

Here, the vector x = (x_1, ..., x_n) is the optimization variable of the problem, the function f_0 : R^n → R is the objective function, and the constraints are defined by m constraint functions f_j : R^n → R, j = 1, ..., m, and m constants b_1, ..., b_m. When minimization is considered we usually refer to the problem as the primal problem, denoted by P.

An important class of mathematical optimization problems is linear optimization problems. Optimization problem (2.1) is called a linear program (LP) if the functions f_0, ..., f_m are linear, i.e., satisfy

  f_j(αx + βy) = αf_j(x) + βf_j(y),            (2.2)

for all x, y ∈ R^n and α, β ∈ R. In our discussion, we usually consider the following linear

program formulation:

  (P): min Σ_{i=1}^n c_i x_i
  subject to:                                  (2.3)
  Σ_{i=1}^n a_{ij} x_i ≥ b_j, for any 1 ≤ j ≤ m,
  x_i ≥ 0, for any 1 ≤ i ≤ n.

It is well known that any linear program can be formulated in this way. The sparsity of a linear constraint refers to the number of non-zero coefficients in the latter formulation.

Another, more general, class of mathematical optimization problems is linear programs with a convex objective. In this work, we refer to optimization problem (2.1) as such if the constraint functions are linear and the objective function f_0 is convex, i.e., satisfies

  f_0(αx + βy) ≤ αf_0(x) + βf_0(y),            (2.4)

for all x, y ∈ R^n and α, β ∈ R with α + β = 1 and α, β ≥ 0. If strict inequality holds in (2.4) whenever x ≠ y and 0 < α, β < 1, then we say that the objective function f_0 is strictly convex.

2.1.1 Lagrangian Duality and Optimality Conditions

Given problem (2.1), the Lagrangian L : R^n × R^m → R is defined as

  L(x, λ) = f_0(x) − Σ_{j=1}^m λ_j (f_j(x) − b_j),

where the vector λ = (λ_1, ..., λ_m) contains the Lagrangian dual variables. The idea in Lagrangian duality is to use the Lagrangian in order to bound any feasible solution, and in particular the optimal solution, of optimization problem (2.1). To do so, we define the Lagrangian dual function g : R^m → R as the minimum of the Lagrangian L over x:

  g(λ) = inf_x L(x, λ) = inf_x ( f_0(x) − Σ_{j=1}^m λ_j (f_j(x) − b_j) ).

Weak Duality: The weak duality property states that for any optimization problem, its Lagrangian dual function yields a lower bound on its optimal value. More formally,

Theorem 2.1. Let p* denote the optimal value of the primal optimization problem (2.1), and

let g denote the corresponding dual function. Then for any λ ≥ 0, we have g(λ) ≤ p*.

Proof. Suppose that x is a feasible solution for the primal problem. This immediately implies that f_j(x) ≥ b_j and λ_j ≥ 0, for any 1 ≤ j ≤ m. Then we have

  Σ_{j=1}^m λ_j (f_j(x) − b_j) ≥ 0

and, therefore,

  g(λ) = inf_{x'} L(x', λ) ≤ L(x, λ) = f_0(x) − Σ_{j=1}^m λ_j (f_j(x) − b_j) ≤ f_0(x).

Since the latter inequality holds for any feasible x, the theorem follows.

Theorem 2.1 states that given a primal optimization problem, for any λ ≥ 0 the dual function gives a lower bound on the optimal value of the problem. In order to obtain the best lower bound using Lagrangian duality, we formulate the following Lagrangian dual problem, denoted by D:

  max g(λ_1, ..., λ_m)
  subject to:                                  (2.5)
  λ_j ≥ 0, for any 1 ≤ j ≤ m.

Lagrangian duality plays an important role in the area of mathematical optimization, and particularly in linear and convex optimization. In the case of linear optimization, problem (2.5) yields the dual program D corresponding to linear program P in (2.3):

  (D): max Σ_{j=1}^m b_j y_j
  subject to:                                  (2.6)
  Σ_{j=1}^m a_{ij} y_j ≤ c_i, for any 1 ≤ i ≤ n,
  y_j ≥ 0, for any 1 ≤ j ≤ m.

A special subclass of linear programs consists of programs in which the coefficients a_{ij}, b_j and c_i in (2.3) and (2.6) are all nonnegative. In this case the primal program is called a covering problem and the dual program is called a packing problem. Capturing various applications and classical optimization problems, this is an important subclass in the study of approximation

algorithms, see e.g. [Vaz01]. In the case of LPs with a convex objective, one can also develop problem (2.5) to obtain a more specific formulation based on the convex conjugate function of f_0. We refer the reader to Chapter 6 for further details, since this is where convex duality is used.

Strong Duality: When the gap between the optimal value of the primal problem (2.1) and the optimal value of the dual problem (2.5) is zero, we say that strong duality holds. It turns out that when the objective function f_0 is convex, and some basic conditions on the constraints hold, then we have strong duality. Specifically, for linear programs with a convex objective the following theorem holds.

Theorem 2.2. A primal linear program with a convex objective has a finite optimal solution if and only if its dual program has a finite optimal solution. In this case, the values of the optimal solutions of the primal and dual programs are equal.

Optimality Conditions: Let x* be an optimal solution for problem (2.1), and let λ* be an optimal solution for the dual problem. Then, if strong duality holds, the following conditions, called the Karush-Kuhn-Tucker (KKT) conditions, are satisfied:

  f_j(x*) ≥ b_j,                      j = 1, ..., m    (2.7)
  λ*_j ≥ 0,                           j = 1, ..., m    (2.8)
  λ*_j (f_j(x*) − b_j) = 0,           j = 1, ..., m    (2.9)
  ∇f_0(x*) − Σ_{j=1}^m λ*_j ∇f_j(x*) = 0               (2.10)

The first two conditions follow immediately from the feasibility of the primal and dual solutions. To obtain the last two conditions we use strong duality and get

  f_0(x*) = g(λ*) = inf_x ( f_0(x) − Σ_{j=1}^m λ*_j (f_j(x) − b_j) )
          ≤ f_0(x*) − Σ_{j=1}^m λ*_j (f_j(x*) − b_j)
          ≤ f_0(x*).

We conclude that the last two inequalities hold with equality. Condition (2.9), also known as complementary slackness, follows from the last inequality (now an equality), since each term in the sum Σ_{j=1}^m λ*_j (f_j(x*) − b_j) is nonnegative. Since the first inequality holds with equality, x* minimizes L(x, λ*) over x, and therefore the gradient of L with respect to x must vanish at x*, i.e., Condition (2.10) follows.

For convex problems, the KKT conditions are also sufficient to ensure optimality. That is, any primal-dual pair of solutions satisfying the above conditions is primal and dual optimal with zero optimality gap. For a comprehensive survey on convex programming and optimization see, e.g., [BV04].

2.1.2 Approximation Algorithms using Linear Programming

Many interesting optimization problems can be formulated as integer programs, i.e., mathematical programs in which the optimization variables are assigned integral values, x_1, ..., x_n ∈ Z. Unfortunately, adding an integrality restriction often makes the problem hard. A way of handling this hardness is by relaxing the formulation, i.e., removing the integrality restriction and allowing a fractional assignment to the variables. Thus, the optimal solution of a relaxation of a minimization problem is a lower bound on the optimal solution of the problem. The ratio between the optimal solutions is called the integrality gap of the relaxation. As a result, linear programming has become a very influential tool for the design and analysis of approximation algorithms and online algorithms. Given an integral optimization problem, we formulate a linear relaxation of the problem. Next, we solve (possibly approximately) the linear relaxation, obtaining a fractional solution. Finally, we apply a procedure which rounds the fractional solution (often using randomness) to obtain a feasible solution for the original problem. We shall demonstrate this extremely useful technique in the following chapters.
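The covering/packing LPs, weak duality, and the integrality gap can all be seen on one toy instance that is not taken from the thesis: vertex cover on a triangle. The covering LP admits the fractional solution (1/2, 1/2, 1/2) of cost 1.5, a matching packing (dual) solution certifies via weak duality that 1.5 is the LP optimum, while any integral cover costs 2.

```python
from itertools import product

# Toy illustration (not from the thesis): vertex cover on a triangle.
# Primal covering LP: min x0+x1+x2  s.t.  x_u + x_v >= 1 for every edge, x >= 0.
edges = [(0, 1), (1, 2), (0, 2)]

def primal_feasible(x):
    return all(x[u] + x[v] >= 1 for u, v in edges) and all(xi >= 0 for xi in x)

def dual_feasible(y):
    # Dual packing LP: max sum_e y_e  s.t.  sum of y_e over edges at vertex i <= 1.
    loads = [sum(ye for (u, v), ye in zip(edges, y) if i in (u, v)) for i in range(3)]
    return all(load <= 1 for load in loads) and all(ye >= 0 for ye in y)

# A fractional primal solution of cost 1.5 and a dual solution of value 1.5.
# By weak duality every dual value lower-bounds every primal value, so the
# LP optimum is exactly 1.5.
x_frac = (0.5, 0.5, 0.5)
y = (0.5, 0.5, 0.5)
assert primal_feasible(x_frac) and dual_feasible(y)

# Best integral cover, by brute force over {0,1}^3: two vertices are needed.
opt_int = min(sum(x) for x in product([0, 1], repeat=3) if primal_feasible(x))
print(sum(x_frac), sum(y), opt_int)   # 1.5 1.5 2, an integrality gap of 4/3
```

Here the ratio opt_int / LP-opt = 2 / 1.5 = 4/3 is exactly the integrality gap of this relaxation on this instance.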
For further information on approximation techniques we refer the reader to [Vaz01].

2.2 Matroids and Submodular Functions

2.2.1 Matroids

Matroids are extremely useful combinatorial objects that capture many natural collections of subsets, such as sparse subsets, forests in graphs, linearly independent sets in vector spaces, and sets of nodes in legal matchings of a given graph. Let E be a finite set and let I be a non-empty collection of subsets of E. M = (E, I) is called a matroid if I satisfies:

- for every S_1 ⊆ S_2, if S_2 ∈ I then also S_1 ∈ I;
- if S_1, S_2 ∈ I and |S_1| > |S_2|, then there exists an element e ∈ S_1 \ S_2 such that S_2 ∪ {e} ∈ I.

The latter property is called the exchange property. Given a matroid M = (E, I), we refer to E as the ground set, and every subset S ∈ I is called independent (any other subset is dependent). For S ⊆ E, a subset B of S is called a base of S if B is a maximal independent subset of S. A well-known fact is that for any subset S of E, any two bases of S have the same size, called the rank of S and denoted by r(S). For example, s-sparse subsets are the bases of an s-uniform matroid, where r(E) = s. Spanning trees in a connected graph G = (V, E) are bases of a graphic matroid whose ground set is E and whose collection I consists of all subsets of E that form a forest; its rank is r(E) = |V| − 1.

A subset of E is called spanning if it contains a base of E. Thus, bases are the inclusion-wise minimal spanning sets and the only independent spanning sets. Each matroid M = (E, I) is associated with a dual matroid M* = (E, I*), where I* = {I ⊆ E : E \ I is a spanning set of M}. This means that the bases of M* are precisely the complements of the bases of M, implying (M*)* = M. Moreover, for any S ⊆ E the rank function of the dual matroid satisfies r*(S) = |S| + r(E \ S) − r(E).

The density of a matroid M, γ(M), is defined as max_{S ⊆ E, S ≠ ∅} {|S| / r(S)}. For example, the density of the s-uniform matroid is n/s. The density of a graphic matroid (spanning trees in a graph G = (V, E)) is max_{S ⊆ V, |S| > 1} {|E(S)| / (|S| − 1)}, where E(S) is the set of edges in the subgraph induced by the vertices of S.

A circuit C in a matroid M is defined as an inclusion-wise minimal dependent set, that is, C \ {e} ∈ I for every e ∈ C. The circumference of a matroid M, c(M), is the cardinality of the largest circuit in M. For example, the circumference of an s-uniform matroid is s + 1, and the circumference of a graphic matroid in a graph G = (V, E) is the length of the longest simple cycle in it. A subset F of E is called nonseparable if every pair of elements in F lies in a common circuit; otherwise there is a partition of F into non-empty sets F_1 and F_2 with r(F) = r(F_1) + r(F_2).
See [HW69, Whi35] for further details.

2.2.2 Submodular Functions

A set function is a function f : 2^E → R which assigns a value to every subset of E. f is called submodular if

  f(S_1) + f(S_2) ≥ f(S_1 ∪ S_2) + f(S_1 ∩ S_2),    (2.11)

for all subsets S_1, S_2 of E. Similarly, f is supermodular if it satisfies (2.11) with the opposite inequality sign. Furthermore, f is called nondecreasing if f(S_2) ≤ f(S_1) for any S_2 ⊆ S_1 ⊆ E, and f is called normalized if f(∅) = 0. Matroids are closely related to submodularity, since the rank function of any matroid is submodular, nondecreasing and normalized. For a thorough survey of results on matroids we refer the reader to [Sch03, Law76].
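The definitions above can be exercised by brute force on a tiny example that is not taken from the thesis: the graphic matroid of a triangle, whose ground set consists of the 3 edges and where a set is independent iff it contains no cycle (i.e., is not the full edge set). The sketch checks the exchange property, the submodularity of the rank function, the dual rank formula, and the density.

```python
from itertools import combinations

# Toy check (not from the thesis): the graphic matroid of a triangle graph.
E = frozenset([0, 1, 2])          # ground set = the 3 edges of the triangle

def independent(S):
    # Only the full edge set contains a cycle; any <= 2 edges form a forest.
    return frozenset(S) != E

subsets = [frozenset(c) for r in range(4) for c in combinations(E, r)]

def rank(S):
    # r(S) = size of a maximum independent subset of S (brute force).
    return max(len(I) for I in subsets if I <= frozenset(S) and independent(I))

# Exchange property: independent S1, S2 with |S1| > |S2| => some e extends S2.
for S1 in subsets:
    for S2 in subsets:
        if independent(S1) and independent(S2) and len(S1) > len(S2):
            assert any(independent(S2 | {e}) for e in S1 - S2)

# The rank function is submodular: r(S1) + r(S2) >= r(S1 u S2) + r(S1 n S2).
for S1 in subsets:
    for S2 in subsets:
        assert rank(S1) + rank(S2) >= rank(S1 | S2) + rank(S1 & S2)

def dual_rank(S):
    # Dual rank formula: r*(S) = |S| + r(E \ S) - r(E).
    S = frozenset(S)
    return len(S) + rank(E - S) - rank(E)

density = max(len(S) / rank(S) for S in subsets if S)
print(rank(E), dual_rank(E), density)   # 2 1 1.5
```

The printed density 3/2 agrees with the graphic-matroid formula max |E(S)|/(|S| − 1): the triangle has 3 edges on 3 vertices.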

2.3 Brief Introduction to Online Computation

Most of this thesis deals with online algorithms and online optimization problems. We therefore start with an introduction to online computation. Since the theory of online computation, and this thesis as well, considers various models and settings, we keep this introduction general and brief. In Chapter 3 and Chapter 4 we further elaborate on the settings that we consider.

An online algorithm must respond to a sequence of events or requests by producing a sequence of decisions. Each decision is made based on the history of past events and decisions, but without knowledge of the future. The decisions made by the algorithm generate a cost or a profit, which the algorithm tries to minimize or to maximize, respectively. We consider several online settings, and follow previous work in adopting the popular notions of competitive ratio and regret to evaluate the performance of our online algorithms under these settings.

Let opt(I) be the cost of the optimal feasible solution for a sequence of events denoted by I.¹ An online algorithm is said to be c-competitive for a minimization problem if, for every sequence of events I, the algorithm generates a cost of at most c · opt(I) + d, where d is independent of the event sequence. Analysis of online algorithms with respect to this measure is referred to as competitive analysis. For maximization, the definition of competitiveness is analogous. When considering a maximization problem, a c-competitive algorithm is guaranteed to return a solution with profit of at least opt(I)/c − d, where opt(I) is the maximum profit solution, and d is independent of the event sequence. The second performance measure that we use is regret. For minimization, an algorithm obtains a regret of h if it is guaranteed to return a solution with cost at most opt(I) + h.
An equivalent way to view an online problem is as a game between an online player and a malicious adversary. The online player follows an online algorithm on an input that is created by the adversary. Knowing the strategy of the online player, the adversary produces the worst possible input. In other words, the adversary constructs a sequence of events that produces bad results (expensive cost, or low profit) for the online player, but at the same time good results for an optimal offline strategy.

It is also possible to consider competitiveness and regret when the online algorithm uses randomization. In this work, we only consider models where the adversary knows the algorithm and the probability distribution the algorithm uses to make its random decisions. The adversary is not aware, however, of the actual random choices made by the algorithm throughout its execution. This kind of adversary is called an oblivious adversary. In case randomization is allowed, the expected cost or profit of the algorithm is compared against the optimal solution opt(I). In Chapter 4 we show connections between these two performance measures in various settings.

¹ The notion of an optimal feasible solution may differ from one online setting to the other. See the following chapters for more details.
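The minimization definition above can be made concrete with a small numeric sketch; the per-instance costs below are hypothetical, purely to illustrate how the ratio c is extracted from (algorithm cost, optimum cost) pairs.

```python
# Toy check (not from the thesis) of c-competitiveness for minimization:
# ALG is c-competitive if ALG(I) <= c * opt(I) + d for every instance I,
# with d independent of the instance.
def is_c_competitive(costs, c, d):
    """costs: list of (alg_cost, opt_cost) pairs, one per instance."""
    return all(alg <= c * opt + d for alg, opt in costs)

# Hypothetical per-instance costs of some online algorithm vs. the offline optimum.
costs = [(3, 2), (10, 4), (7, 3)]

# With d = 0, the smallest c that works is the worst-case ratio over instances.
ratio = max(alg / opt for alg, opt in costs)
print(ratio, is_c_competitive(costs, ratio, 0))   # 2.5 True
```

Note the role of the additive term d: an algorithm may fail to be c-competitive with d = 0 yet satisfy the definition once a constant d, independent of the input, is allowed.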


Chapter 3

Competitiveness via Regularization

In this chapter we provide a framework for designing competitive online algorithms using regularization, a widely used technique in online learning, particularly in online convex optimization. An online algorithm that uses regularization serves requests by computing a solution, at each step, to an objective function involving a smooth convex regularization function. Applying the technique of regularization allows us to obtain new results in the domain of competitive analysis.

3.1 Introduction

Competitive analysis and online learning are two important research fields that study the problem of a decision-maker who iteratively needs to make decisions in the face of uncertainty. A typical online problem proceeds in rounds, where in each round an online algorithm is given a request and needs to serve it. We propose a general setting for studying online problems by letting the request sequence be a convex set that varies over time. To be more specific, in each round t ∈ {1, ..., T}, a feasible convex region P_t ⊆ R^n is revealed along with a service cost vector c_t. The online algorithm needs to choose a feasible point y_t ∈ P_t and move from y_{t−1} to y_t. The cost of the algorithm at round t is the sum of the service cost ⟨c_t, y_t⟩ and the movement cost ∥y_t − y_{t−1}∥_1.¹ The goal is to find an online algorithm which is competitive with respect to an offline solution minimizing Σ_{t=1}^T ⟨c_t, y_t⟩ + Σ_{t=1}^T ∥y_t − y_{t−1}∥_1. This setting captures many important problems in online computation, e.g., caching [ST85], online covering [AAA+03, BN05], the allocation problem [BBMN11], and hyperplane chasing [FL93]. Consider, for example, metrical task systems (MTS) [BLS92] and online set cover [AAA+03]. In the online set cover problem we are given a set system defined over a universe of elements. The elements appear one by one and need to be covered upon arrival.
To formulate the problem within our general setting, the feasible region is initially the whole space.² Upon arrival of element t, the convex set P_t is defined to be the intersection of P_{t−1} and the covering constraint

¹ More generally, we allow a movement cost of Σ_{i=1}^n w_i |y_{i,t} − y_{i,t−1}|.
² Without loss of generality, y_0 is initialized to be the origin.

corresponding to element t. Covering element t by set s means increasing y_s from 0 to 1. There is no service cost in the online set cover problem. In MTS, the feasible convex region remains {x ∈ R^n_+ : Σ_{i=1}^n x_i = 1} throughout all rounds; however, in each round t a new service cost vector c_t is given. Thus, both service and movement costs are incurred, yet the feasible region is always defined by a single fixed covering constraint.

We proceed to define the online set cover problem with service cost, which generalizes both online set cover and MTS. The basic setting is the same as in online set cover, where in each round a subset of the elements needs to be covered. Each chosen set pays an opening cost, as in set cover; it also pays a service cost in each of the rounds in which it is open. This means that it can be beneficial for an online algorithm to both add and delete sets from the cover throughout its execution, while paying a movement cost for that. This setting captures both service costs that should be paid as long as sets (facilities) remain open, and a fully dynamic environment in which both sets and elements arrive and depart over time. Cloud computing is an example of a practical setting captured by the online set cover problem with service cost. The covering constraints indicate the number of servers that are needed in certain regions; the movement cost corresponds to the cost of turning servers on and off, and the service cost corresponds to the energy consumption of the servers.

Let us now turn our attention to the area of online learning, and in particular to the domain of online convex optimization. In an online convex optimization problem, a bounded closed convex feasible set P ⊆ R^n is given as input, and in each round t ∈ {1, ..., T} a loss vector c_t is revealed.³ The online algorithm picks a point y_t ∈ P and the loss incurred is ⟨c_t, y_{t−1}⟩.
The performance of the online algorithm is usually measured via the notion of regret, defined as Σ_{t=1}^T ⟨c_t, y_{t−1}⟩ − min_{x∈P} Σ_{t=1}^T ⟨c_t, x⟩. We further elaborate on this setting in Chapter 4. For additional details and techniques see, for example, [Zin03, CBL06, HKKA06, FKM05]. An important technique used in online convex optimization is regularization. Generally speaking, regularization is achieved by adding a smooth convex function to a given objective function and then greedily solving the new online problem, in order to obtain good regret bounds. The goal of regularization is to stabilize the solution, that is, to avoid drastic shifts in the solution from round to round. Regularization in online learning appears in the literature at least as early as [KW97] and [Gor99], and more modern analyses can be found in [CBL06, Rak09]. Coming back to the world of competitive analysis, it is clear that any online algorithm for our general setting must try to mimic the configurations of an optimal offline solution on one hand, while minimizing the movement cost on the other hand. That is, in a sense, an online algorithm must maintain stability. Therefore, our goal is to apply a regularization approach, similarly to the way it is applied in online learning, in order to obtain good competitive bounds.

³ More generally, a convex loss function f_t is given.
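As a concrete illustration of the regret notion just defined, the following small sketch (our own, not part of the thesis) computes the regret of a sequence of plays when P is the probability simplex; in that case the benchmark min_{x∈P} Σ_t ⟨c_t, x⟩ is attained at a vertex, i.e., at the best single expert in hindsight.

```python
# Illustrative sketch: regret against the best fixed point of the simplex.
# plays[t] is the distribution y_{t-1} used in round t; costs[t] is c_t.

def regret(plays, costs):
    algo_loss = sum(sum(y_i * c_i for y_i, c_i in zip(y, c))
                    for y, c in zip(plays, costs))
    n = len(costs[0])
    # over the simplex, a linear loss is minimized at a single expert
    best_fixed = min(sum(c[i] for c in costs) for i in range(n))
    return algo_loss - best_fixed

costs = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5]] * 3
print(regret(uniform, costs))  # uniform pays 1.5, best expert pays 1.0 -> 0.5
```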

Our Results

We provide a novel framework for designing competitive online algorithms that uses regularization together with a primal-dual analysis. In this framework, the online algorithm's output in each round is chosen to be the solution to an optimization problem involving a smooth convex regularization function. The analysis of the competitive factor is based on recent online primal-dual LP techniques developed in competitive analysis (see the survey [BN09a]). We exhibit an online algorithm which is obtained from this framework for the case where P_t is defined by covering and precedence constraints⁴, and prove bounds on its competitive ratio. Our main result is:

Theorem 3.1. For any ε > 0, there is an O((1 + ε) log(1 + k/ε))-competitive online algorithm if P_t is a covering and precedence polytope, where k is the maximal sparsity of the covering constraints.

The proof of the theorem is simple and elegant. We start with the KKT optimality conditions of the regularized problem in each round. These conditions imply a simple construction for a dual solution to the original online problem (i.e., before regularization), which can thus serve as a lower bound on the optimal offline solution. The theorem yields an alternative algorithm and proof for many previously studied fundamental online problems, e.g., caching, MTS on a weighted star, online set cover, online connectivity, the allocation problem, and more. The importance of Theorem 3.1 is that it allows us to be competitive against a combination of both service and movement costs, while satisfying multiple constraints. Thus, we obtain competitive algorithms for online problems which previously did not seem within reach of polylogarithmic competitive factors. One example is the online set cover problem with service cost mentioned earlier. Note that even though the formulations of classic online problems (e.g., online set cover, caching, etc.)
contain multiple constraints, no service cost is incurred over time. The only exceptions are MTS and finely-competitive paging [BBK99], but both have a formulation that consists of a single fixed covering constraint. Another example is fractional shortest path with time-varying traffic loads, a problem studied extensively in the online learning community with respect to regret minimization. In this problem there is a graph together with a source node s and a target node t. An online algorithm needs to maintain a unit flow between s and t. In each round, new edge costs are revealed, and the online algorithm is allowed to fractionally increase and decrease capacities on the edges so as to maintain the unit flow between s and t. Thus, both movement and service costs are incurred here. A feasible solution is defined by multiple constraints corresponding to the cuts separating s from t. Unlike the learning variant, our online algorithm also pays a cost for increasing the capacities of the edges, capturing the fact that modifying a solution over time incurs an additional cost.

⁴ A precedence constraint is of the form x ≤ y.

The next issue we consider is rounding online a fractional solution to the online set cover problem with service cost. Recall that this problem captures both online set cover and MTS, which are rounded by very different techniques. Online set cover is rounded by adding each set to the cover independently, taking advantage of the fact that variables can only increase. In contrast, rounding MTS uses strong dependence between the variables, taking advantage of the fact that there is only a single fixed constraint. We utilize recent ideas for rounding fractional solutions via exponential clocks (see [BNS13]), thus allowing us to unify both independent and dependent choices into a single rounding scheme. We obtain the following tight result.

Theorem 3.2. There is a randomized algorithm which is O(log S_max · log m)-competitive for the online set cover problem with service costs, where m denotes the number of sets and S_max denotes the maximal set size.

Related Work: There are several works that discuss the connection between competitive analysis and online learning, such as [BB97, BBK99, BBN10, ABL+13]. We defer the full overview to Chapter 4. Abernethy et al. [ABBS10] discuss competitive-analysis algorithms using a regularized work function, yet limited to the MTS setting.

3.2 Online Regularization Algorithm

In this section we develop our main algorithm, which is based on regularization. The algorithm is given in each round a new polyhedron P_t ⊆ R^n, defined by a set of covering constraints, and a cost vector c_t ∈ R^n_+. The goal is to minimize Σ_{t=1}^T ⟨c_t, y_t⟩ + Σ_{t=1}^T Σ_{i=1}^n w_i |y_{i,t} − y_{i,t−1}|, where in each round t, y_t ∈ P_t. The algorithm is conceptually very simple, and is based on solving in each round a convex optimization problem with a regularized objective function which involves both the previous point y_{t−1} and the current cost vector c_t. Thus, our solution in each step is determined greedily and independently of rounds prior to t − 1.
The convex objective function is obtained by trading off relative entropy, plus a linear term, with the movement cost.

Algorithm 3.1 Regularization Algorithm
parameters: ε > 0, η = ln(1 + n/ε).
initialize y_{i,0} = 0 for all i = 1, ..., n.
for t = 1, 2, ..., T do
    let c_t ∈ R^n_+ be the cost vector and let P_t be the feasible set of solutions at time t.
    solve the following convex program (P) to obtain y_t:
        y_t = arg min_{x ∈ P_t} { ⟨c_t, x⟩ + (1/η) Σ_{i=1}^n w_i [ (x_i + ε/n) ln( (x_i + ε/n) / (y_{i,t−1} + ε/n) ) − x_i ] }
end for
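To make the regularized step concrete, here is a sketch (our own illustration, not code from the thesis) of one round of Algorithm 3.1 in the special case of a single covering constraint Σ_i x_i ≥ 1 with unit weights w_i = 1 (the MTS-like setting). The KKT conditions of the regularized program give the closed form x_i(a) = max(0, (y_i + ε/n)·e^{η(a − c_i)} − ε/n), where a ≥ 0 is the multiplier of the covering constraint; we find a by bisection so that the constraint holds with equality (or take a = 0 if it is already slack).

```python
import math

def regularized_step(y, c, eps=0.1):
    """One round of the regularization algorithm for sum_i x_i >= 1, w_i = 1."""
    n = len(y)
    eta = math.log(1 + n / eps)
    delta = eps / n

    def x_of(a):
        # closed-form stationarity solution for a given multiplier a
        return [max(0.0, (yi + delta) * math.exp(eta * (a - ci)) - delta)
                for yi, ci in zip(y, c)]

    if sum(x_of(0.0)) >= 1.0:
        return x_of(0.0)              # constraint slack: a = 0
    lo, hi = 0.0, max(c) + 1.0        # at a = max(c)+1 each x_i >= delta*(e^eta - 1) = 1
    for _ in range(100):              # bisection on the multiplier a
        mid = (lo + hi) / 2
        if sum(x_of(mid)) < 1.0:
            lo = mid
        else:
            hi = mid
    return x_of(hi)

y = [1.0, 0.0, 0.0]                   # previous point
c = [5.0, 0.0, 0.0]                   # it became expensive to stay on coordinate 1
x = regularized_step(y, c)
print(abs(sum(x) - 1.0) < 1e-6, x[0] < y[0])  # prints: True True
```

Note how the entropic term keeps the update multiplicative in (y_i + ε/n): mass moves smoothly away from the expensive coordinate rather than jumping, which is exactly the stability the regularizer is meant to provide.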

The relative entropy function, Δ(w‖u) = Σ_i [ w_i ln(w_i/u_i) + u_i − w_i ], is widely used as a regularizer in online learning problems that involve ℓ1-norm constraints, such as maintaining a distribution over a ground set of elements. We note that since in each round the objective function is convex and P_t is a convex set, the program (P) is solvable in polynomial time using standard convex optimization techniques, such as interior-point methods [NN94].

3.2.1 Analysis

We next analyze the algorithm, thus proving Theorem 3.1. First, we formulate the offline problem as a linear program, and also write its dual, which will serve as a lower bound. To demonstrate the ideas we assume that in round t, P_t is defined by m_t covering constraints of the form Σ_{i∈S_{j,t}} y_{i,t} ≥ 1. In Section 3.2.2 we show how to deal with the more general cases in which we have either precedence constraints, or each constraint is of the form Σ_{i∈S_{j,t}} y_{i,t} ≥ r_{j,t}, where r_{j,t} ∈ N and 0 ≤ y_{i,t} ≤ 1 for every i. Without loss of generality, we assume that both the optimal solution and our algorithm pay only for increasing the variables; thus the movement cost from y_{t−1} to y_t is equal to Σ_{i=1}^n w_i max{0, y_{i,t} − y_{i,t−1}}. The problem formulation appears in Figure 3.1. Let k denote the maximal sparsity of the covering constraints, i.e., k = max{ |S_{j,t}| : 1 ≤ t ≤ T, 1 ≤ j ≤ m_t }. Our proof is based on deriving the KKT optimality conditions of the regularized problem in each round. The conditions define dual variables that are then carefully plugged into the dual formulation in Figure 3.1 to yield a feasible dual solution. This, in turn, yields a lower bound on the performance of the online algorithm.
(P)  min  Σ_{t=1}^T Σ_{i=1}^n c_{i,t} y_{i,t} + Σ_{t=1}^T Σ_{i=1}^n w_i z_{i,t}
     s.t.  Σ_{i∈S_{j,t}} y_{i,t} ≥ 1                                ∀ t ≥ 1 and 1 ≤ j ≤ m_t
           z_{i,t} ≥ y_{i,t} − y_{i,t−1}                            ∀ t ≥ 1 and 1 ≤ i ≤ n
           z_{i,t}, y_{i,t} ≥ 0                                     ∀ t ≥ 1 and i

(D)  max  Σ_{t=1}^T Σ_{j=1}^{m_t} a_{j,t}
     s.t.  b_{i,t} ≤ w_i                                            ∀ t ≥ 1 and 1 ≤ i ≤ n
           b_{i,t+1} ≤ b_{i,t} + c_{i,t} − Σ_{j: i∈S_{j,t}} a_{j,t}  ∀ t and 1 ≤ i ≤ n
           a_{j,t}, b_{i,t} ≥ 0                                     ∀ t ≥ 1 and i, j

Figure 3.1: Primal and dual LP formulations for the online covering problem.

KKT optimality conditions. In each round, Algorithm 3.1 solves a convex program (P) with m_t covering constraints. We define a nonnegative Lagrangian variable a_{j,t} for each covering constraint in round t ∈ {1, ..., T}. The KKT optimality conditions define the following

relationship between the optimal values of y_t and a_{j,t}:

∀ 1 ≤ j ≤ m_t:  Σ_{i∈S_{j,t}} y_{i,t} − 1 ≥ 0,    (3.1)
∀ 1 ≤ j ≤ m_t:  a_{j,t} ( Σ_{i∈S_{j,t}} y_{i,t} − 1 ) = 0,    (3.2)
∀ 1 ≤ i ≤ n:  c_{i,t} + (w_i/η) ln( (y_{i,t} + ε/n) / (y_{i,t−1} + ε/n) ) − Σ_{j: i∈S_{j,t}} a_{j,t} ≥ 0,    (3.3)
∀ 1 ≤ i ≤ n:  y_{i,t} ( c_{i,t} + (w_i/η) ln( (y_{i,t} + ε/n) / (y_{i,t−1} + ε/n) ) − Σ_{j: i∈S_{j,t}} a_{j,t} ) = 0.    (3.4)

Proof of Theorem 3.1. We first construct a dual solution to the offline problem (D) using the values that are obtained by the KKT optimality conditions. We then show that the primal and dual solutions obtained are feasible, and finally we prove that the dual we constructed can pay for both the movement and the service cost of the online algorithm. To construct the dual we simply assign the same dual value obtained by the optimality conditions to a_{j,t} in (D), and we define b_{i,t+1} = (w_i/η) ln( (1 + ε/n) / (y_{i,t} + ε/n) ). We claim next that the primal and dual solutions are feasible. From Condition (3.1), all the primal covering constraints are satisfied, and by setting z_{i,t} = max{0, y_{i,t} − y_{i,t−1}} we get a feasible primal solution. To this end, note that for any t and i,

b_{i,t+1} − b_{i,t} = −(w_i/η) ln( (y_{i,t} + ε/n) / (y_{i,t−1} + ε/n) ) ≤ c_{i,t} − Σ_{j: i∈S_{j,t}} a_{j,t},

where the inequality follows from (3.3). Also, 0 ≤ b_{i,t+1} = (w_i / ln(1 + n/ε)) ln( (1 + ε/n) / (y_{i,t} + ε/n) ) ≤ w_i, which follows since 0 ≤ y_{i,t} ≤ 1; additionally, a_{j,t} ≥ 0 as it is a Lagrangian dual variable of a covering constraint.

Bounding the movement cost at time t: Let M_t be the movement cost at time t. As indicated, we charge both our algorithm and OPT only for increasing the fractional values of the elements. We get,

M_t = Σ_{i: y_{i,t} > y_{i,t−1}} w_i ( y_{i,t} − y_{i,t−1} )
    ≤ η Σ_{i: y_{i,t} > y_{i,t−1}} ( y_{i,t} + ε/n ) · (w_i/η) ln( (y_{i,t} + ε/n) / (y_{i,t−1} + ε/n) )    (3.5)
    = η Σ_{i: y_{i,t} > y_{i,t−1}} ( y_{i,t} + ε/n ) ( Σ_{j: i∈S_{j,t}} a_{j,t} − c_{i,t} ),    (3.6)

where Inequality (3.5) follows since a − b ≤ a ln(a/b) for any a, b > 0, and Equality (3.6) follows from Condition (3.4), since if y_{i,t} > y_{i,t−1} then also y_{i,t} > 0. Since c_{i,t}, y_{i,t} and a_{j,t} are nonnegative, we then get,

M_t ≤ η Σ_{i=1}^n ( y_{i,t} + ε/n ) Σ_{j: i∈S_{j,t}} a_{j,t} = η Σ_{j=1}^{m_t} a_{j,t} ( Σ_{i∈S_{j,t}} y_{i,t} + (ε/n)|S_{j,t}| ) ≤ η ( 1 + εk/n ) Σ_{j=1}^{m_t} a_{j,t}.    (3.7)

Inequality (3.7) follows from Condition (3.2). Summing over all times t, we get that the total movement cost is at most η(1 + εk/n) times the value of (D).

Bounding the service cost: Let S be the service cost paid by the algorithm. We rely on the following property, which follows from Jensen's inequality, to bound S.

Lemma (Log sum inequality, [CT91]). For any nonnegative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

Σ_{i=1}^n a_i log( a_i / b_i ) ≥ ( Σ_{i=1}^n a_i ) log( Σ_{i=1}^n a_i / Σ_{i=1}^n b_i ),

with equality if and only if a_i / b_i is constant. We get,

S = Σ_{t=1}^T Σ_{i=1}^n c_{i,t} y_{i,t}
  = Σ_{t=1}^T Σ_{j=1}^{m_t} a_{j,t} Σ_{i∈S_{j,t}} y_{i,t} − (1/η) Σ_{t=1}^T Σ_{i=1}^n w_i y_{i,t} ln( (y_{i,t} + ε/n) / (y_{i,t−1} + ε/n) )    (3.8)
  = Σ_{t=1}^T Σ_{j=1}^{m_t} a_{j,t} − (1/η) Σ_{i=1}^n w_i { Σ_{t=1}^T ( y_{i,t} + ε/n ) ln( (y_{i,t} + ε/n) / (y_{i,t−1} + ε/n) ) − (ε/n) Σ_{t=1}^T ln( (y_{i,t} + ε/n) / (y_{i,t−1} + ε/n) ) }    (3.9)
  ≤ D − (1/η) Σ_{i=1}^n w_i { ( Σ_{t=1}^T ( y_{i,t} + ε/n ) ) ln( Σ_{t=1}^T ( y_{i,t} + ε/n ) / Σ_{t=1}^T ( y_{i,t−1} + ε/n ) ) − (ε/n) ln( (y_{i,T} + ε/n) / (y_{i,0} + ε/n) ) }    (3.10)
  ≤ D.    (3.11)

Equality (3.8) follows from Condition (3.4). Equality (3.9) follows from Condition (3.2). Inequality (3.10) follows by a telescopic sum and the log sum inequality. Inequality (3.11) follows since y_{i,0} = 0, so for every i,

(ε/n) ln( (y_{i,T} + ε/n) / (y_{i,0} + ε/n) ) = −( y_{i,0} + ε/n ) ln( (y_{i,0} + ε/n) / (y_{i,T} + ε/n) ) ≤ y_{i,T} − y_{i,0},

and,

Σ_{t=1}^T ( y_{i,t} + ε/n ) ln( Σ_{t=1}^T ( y_{i,t} + ε/n ) / Σ_{t=1}^T ( y_{i,t−1} + ε/n ) ) ≥ Σ_{t=1}^T ( y_{i,t} + ε/n ) − Σ_{t=1}^T ( y_{i,t−1} + ε/n ) = y_{i,T} − y_{i,0},

both because a − b ≤ a ln(a/b) for any a, b ≥ 0. Hence, by choosing ε′ = εn/k as the parameter of the algorithm, one can conclude that the total cost is at most 1 + (1 + ε) ln(1 + k/ε) times the value of (D).

3.2.2 General Covering Constraints with Variable Upper Bounds

The latter proof also holds in the more general case where in round t, P_t is defined by m_t covering constraints of the form Σ_{i∈S_{j,t}} y_{i,t} ≥ r_{j,t}, where r_{j,t} ∈ N and 0 ≤ y_{i,t} ≤ 1 for every 1 ≤ i ≤ n. This captures settings like weighted paging, as well as more involved generalizations. To see this, we note that we can replace every box constraint by a set of knapsack cover (KC) inequalities, as suggested by [CFLP00]. Given a covering LP, the KC-inequalities for a particular covering constraint Σ_{i∈s} x_i ≥ r are defined as follows: for any subset s′ ⊆ s of variables, the maximum possible contribution of the variables in s′ to the constraint is |s′|, and if |s′| < r then a contribution of at least r − |s′| must come from the variables in s \ s′. Therefore, for every s′ ⊆ s with |s′| < r we get the valid inequality Σ_{i∈s\s′} x_i ≥ r − |s′|. By adding the KC constraints, the original box constraint becomes unnecessary: consider the first round t in which a variable y_{i,t} exceeds 1. For every constraint set S_{j,t} that contains i we know that Σ_{l∈S_{j,t}\{i}} y_{l,t} ≥ r_{j,t} − 1, and thus we may reduce y_{i,t} to 1, satisfying all constraints. This reduces both the value of the service cost and the value of the relative entropy (since y_{i,t−1} ≤ 1), contradicting the minimality of each step in the regularized problem. The rest of the analysis follows along the same lines as above, except for Inequality (3.7), where we now get M_t ≤ η(1 + εk/n) Σ_{j=1}^{m_t} r_{j,t} a_{j,t}, which is clearly bounded by η(1 + εk/n) D_t, since r_{j,t} ≥ 1 for all constraints, even after adding the KC-inequalities. A similar proof also holds for the case where we are given a fixed set of precedence constraints of the form x ≤ y, in addition to the varying covering constraints.
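The knapsack-cover inequalities described above are mechanical to enumerate; the following sketch (our own illustration, not from the thesis) generates them for a single covering constraint.

```python
from itertools import combinations

# Sketch of the KC-inequalities for the constraint sum_{i in s} x_i >= r:
# every subset s' of s with |s'| < r yields the valid inequality
#   sum_{i in s \ s'} x_i >= r - |s'|.

def kc_inequalities(s, r):
    """Yield (remaining_variables, rhs) pairs for the constraint sum(s) >= r."""
    s = tuple(s)
    for size in range(min(r, len(s) + 1)):      # only subsets with |s'| < r
        for sp in combinations(s, size):
            remaining = tuple(i for i in s if i not in sp)
            yield remaining, r - size

ineqs = list(kc_inequalities(['x1', 'x2', 'x3'], 2))
print(len(ineqs))   # 1 inequality for s' = empty set plus 3 singletons -> 4
print(ineqs[0])     # (('x1', 'x2', 'x3'), 2)
```

Of course, the number of inequalities is exponential in general; in LP-based algorithms one typically separates over them rather than listing them all, and the sketch is only meant to make the definition concrete.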
Such constraints appear, for example, in standard facility location formulations, or in the allocation problem [BBMN11]. Here, we obtain an additional KKT condition, as well as new dual variables which correspond to the precedence constraints. It is relatively easy to show that by assigning the new dual variables obtained by the KKT conditions to their corresponding variables in (D), we obtain a dual solution that can pay for both the movement and the service cost of the online algorithm.

3.3 Online Set Cover with Service Cost

In this section we show how to round a fractional solution to the online set cover problem with service cost. The problem statement is as follows. We are given a set of elements E = {e_1, e_2, ..., e_n}, and a family of subsets S = {s_1, ..., s_m}, where each s_i ⊆ E. At each time t ∈ {1, ..., T} we

have a set of elements E_t ⊆ E that the algorithm should cover, and a service cost c_{s,t} on each set s ∈ S. The algorithm pays the sum of service costs of the sets that are taken at time t. Also, the algorithm pays one unit for allocating each additional set that is not in the solution at time t − 1. It is easy to see that the problem can be formulated and solved fractionally using Algorithm 3.1. Next, we show how to round the fractional solution online. We present a randomized rounding algorithm for the fractional solution which is based on exponential clocks. A random variable X is distributed according to the exponential distribution with rate λ if it has density f_X(x) = λe^{−λx} for every x ≥ 0, and f_X(x) = 0 otherwise. We denote this by X ~ exp(λ). Exponential clocks are simply competing independent exponential random variables. An exponential clock wins a competition if it has the smallest value among all participating exponential clocks. The rounding is as follows.

Algorithm 3.2 Rounding Algorithm
1: parameter: α ≥ 0
2: for each s ∈ S, choose an i.i.d. random variable Z_s ~ exp(1).
3: for each e ∈ E, choose an i.i.d. random variable Z_e ~ exp(1).
4: at any time t, let y_{s,t} denote the current fractional value of s.
5: for t = 1, 2, ..., T do
6:     let A_t = { s ∈ S : Z_s / y_{s,t} < α }.⁵
7:     let B_t = ∪_{e∈E} { s : s = arg min_{s′: e∈s′} { Z_{s′} / y_{s′,t} }, and Z_s / y_{s,t} < Z_e / max{0, 1 − Σ_{s′: e∈s′} y_{s′,t}} }.
8:     output A_t ∪ B_t.
9: end for

First, we observe that the algorithm covers all elements in E_t at time t. This is true since for each such element e ∈ E_t, Σ_{s: e∈s} y_{s,t} ≥ 1, and thus B_t always contains a set s that covers e. We next prove the following theorem, which bounds the performance of the algorithm.

Theorem 3.3. The expected cost of the solution is O(log S_max) times the cost of the fractional solution, by choosing α = log(S_max), where S_max = max{ |s| : s ∈ S }.⁶
Since the fractional solution is O(log m)-competitive with respect to the optimal solution, we get that the integral algorithm is O(log S_max · log m)-competitive, thus proving Theorem 3.2.

Proof. We use the following well known properties of the exponential distribution:
1. If X ~ exp(λ) and c > 0, then X/c ~ exp(λc).
2. Let X_1, ..., X_k be independent random variables with X_i ~ exp(λ_i):
(a) min{X_1, ..., X_k} ~ exp(λ_1 + ... + λ_k).
(b) Pr[ X_i ≤ min_{j≠i} X_j ] = λ_i / (λ_1 + ... + λ_k).

⁵ In any case of division by 0, we assume the value to be ∞.
⁶ S_max does not have to be known in advance. We may set it to be the maximal set size known so far.
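A single round of the exponential-clocks rounding can be sketched as follows (our own simplified illustration, not code from the thesis): each set s gets a clock Z_s ~ exp(1), A collects the sets with Z_s / y_s < α, and B adds, for each element, the set with the smallest scaled clock among the sets containing it. For simplicity we omit the element clocks Z_e, i.e., we assume every element must be covered (y_{e,t} = 0, so Z_e / y_{e,t} = ∞ and the winning set always beats the element).

```python
import random

def round_once(sets, y, alpha, rng):
    """sets: name -> covered elements; y: name -> fractional value."""
    clocks = {s: rng.expovariate(1.0) for s in sets}
    scaled = {s: clocks[s] / y[s] if y[s] > 0 else float('inf') for s in sets}
    A = {s for s in sets if scaled[s] < alpha}
    elements = {e for s in sets for e in sets[s]}
    # for each element, the set with the smallest scaled clock among those containing it
    B = {min((s for s in sets if e in sets[s]), key=lambda s: scaled[s])
         for e in elements}
    return A | B

rng = random.Random(0)
sets = {'s1': {'a', 'b'}, 's2': {'b', 'c'}, 's3': {'c'}}
y = {'s1': 0.6, 's2': 0.4, 's3': 0.0}
out = round_once(sets, y, alpha=2.0, rng=rng)
covered = set().union(*(sets[s] for s in out))
print(covered == {'a', 'b', 'c'})  # every element is covered by some chosen set
```

In this simplified form the B-step alone already guarantees coverage in every round, regardless of the random clocks; the A-step and the α-threshold matter for bounding the expected movement and service cost, as in the analysis below.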

For simplicity we use y_s instead of y_{s,t} when t is clear from the context. Let y_{e,t} = max{0, 1 − Σ_{s: e∈s} y_{s,t}}. Note that by the guarantee of the fractional solution, y_{e,t} = 0 for any e ∈ E_t.

Bounding the movement cost: We can break down the movement in the fractional solution between rounds t − 1 and t into a sequence of at most m small steps. First we take all the sets whose fractional value has increased, S^+ = {s ∈ S : y_{s,t} > y_{s,t−1}}, and increase their variables, one set at a time. Next, we take all the sets whose fractional value has decreased, S^− = {s ∈ S : y_{s,t} < y_{s,t−1}}, and decrease their variables, one set at a time. Note that the total fractional movement remains the same, and the solution remains feasible throughout all steps. Moreover, the movement in Algorithm 3.2 at each round t is bounded by the total movement incurred during these update steps, since the initial output at time t − 1 and the final output at time t are identical, but there is a chance for sets to be added and then removed during the intermediate steps. From now on, we assume that at each round we perform a single small step. We would like to show that every step that incurs a fractional movement Δ_s for some set s incurs an expected increase of at most (α + 2|s|(α + 1)e^{−α})Δ_s in the randomized solution. Let Y_{e,s} denote the minimal clock of element e, other than that of s: Y_{e,s} = min{ min_{s′≠s: e∈s′} { Z_{s′} / y_{s′} }, Z_e / y_e }. By the exponential distribution properties we have Y_{e,s} ~ exp(λ), where λ = Σ_{s′≠s: e∈s′} y_{s′} + y_e ≥ 1 − y_s.

Decreasing variables: assume set s decreased its value in the fractional solution from y_{s,t−1} to y_{s,t} = y_{s,t−1} − Δ_s. As a result, every e ∈ s possibly increased its variable y_e by at most Δ_s. Let us first increase these variables: it is easy to see that no set can be newly selected due to this change. Next, we decrease the value of y_s. The set s cannot be selected due to this change, unless it was already selected in the previous round; however, other sets might turn minimal because of Line 7.
That is, some set s′ ≠ s may join B_t, where s′ ∉ A_{t−1} ∪ B_{t−1}. Let us bound the expected number of such sets:

E[ |B_t \ (B_{t−1} ∪ A_{t−1})| ]
= Σ_{e∈s} Pr[ new set added to B_t due to element e ]
= Σ_{e∈s} Pr[ Z_s / y_{s,t−1} ≤ Y_{e,s} < Z_s / (y_{s,t−1} − Δ_s), and Y_{e,s} ≥ α ]
= Σ_{e∈s} ∫_{x=α}^∞ f_{Y_{e,s}}(x) · Pr[ Z_s / y_{s,t−1} ≤ x < Z_s / (y_{s,t−1} − Δ_s) ] dx
= Σ_{e∈s} ∫_{x=α}^∞ λ e^{−λx} ( e^{−(y_{s,t−1} − Δ_s)x} − e^{−y_{s,t−1}x} ) dx
= Σ_{e∈s} [ ( λ / (y_{s,t−1} − Δ_s + λ) ) e^{−α(y_{s,t−1} − Δ_s + λ)} − ( λ / (y_{s,t−1} + λ) ) e^{−α(y_{s,t−1} + λ)} ].

Recall that λ = Σ_{s′≠s: e∈s′} y_{s′,t−1} + y_{e,t}. Therefore, y_{s,t−1} − Δ_s + λ ≥ 1, and the latter expression is maximized when y_{s,t−1} − Δ_s = 0 and λ = 1, implying,

( λ / (y_{s,t−1} − Δ_s + λ) ) e^{−α(y_{s,t−1} − Δ_s + λ)} − ( λ / (y_{s,t−1} + λ) ) e^{−α(y_{s,t−1} + λ)}
≤ e^{−α} − ( 1 / (Δ_s + 1) ) e^{−α(Δ_s + 1)}
= e^{−α} ( 1 − e^{−αΔ_s} / (1 + Δ_s) )
≤ e^{−α} (α + 1) Δ_s / (1 + Δ_s)
≤ (α + 1) Δ_s e^{−α},

where the last two inequalities follow since e^x ≥ x + 1. Hence, the total expected number of newly selected sets is bounded by |s| (α + 1) e^{−α} Δ_s.

Increasing variables: Assume set s increased its value from y_{s,t−1} to y_{s,t} = y_{s,t−1} + Δ_s. As a result, every e ∈ s possibly decreased its variable y_e by at most Δ_s. Let us first increase the value of y_s. It is easy to see that no additional set other than s can be selected due to this change. The probability of s being selected in this round due to Line 6 is

Pr[ s ∈ A_t and s ∉ A_{t−1} ] = Pr[ Z_s / (y_{s,t−1} + Δ_s) < α ≤ Z_s / y_{s,t−1} ] = e^{−α y_{s,t−1}} − e^{−α (y_{s,t−1} + Δ_s)} = e^{−α y_{s,t−1}} ( 1 − e^{−α Δ_s} ) ≤ α Δ_s.

The probability of s being selected at time t due to Line 7 and not Line 6 is

Pr[ s ∈ B_t, and s ∉ B_{t−1} ∪ A_t ]
≤ Σ_{e∈s} Pr[ Z_s / (y_{s,t−1} + Δ_s) ≤ Y_{e,s} < Z_s / y_{s,t−1}, and Y_{e,s} ≥ α ]
= Σ_{e∈s} ∫_{x=α}^∞ f_{Y_{e,s}}(x) · Pr[ Z_s / (y_{s,t−1} + Δ_s) ≤ x < Z_s / y_{s,t−1} ] dx
= Σ_{e∈s} ∫_{x=α}^∞ λ e^{−λx} ( e^{−y_{s,t−1}x} − e^{−(y_{s,t−1} + Δ_s)x} ) dx
= Σ_{e∈s} [ ( λ / (y_{s,t−1} + λ) ) e^{−α(y_{s,t−1} + λ)} − ( λ / (y_{s,t−1} + Δ_s + λ) ) e^{−α(y_{s,t−1} + Δ_s + λ)} ].

We note that y_{s,t−1} + λ ≥ 1. The latter expression is maximized when y_{s,t−1} = 0 and λ = 1, implying,

( λ / (y_{s,t−1} + λ) ) e^{−α(y_{s,t−1} + λ)} − ( λ / (y_{s,t−1} + Δ_s + λ) ) e^{−α(y_{s,t−1} + Δ_s + λ)}
≤ e^{−α} − ( 1 / (1 + Δ_s) ) e^{−α(1 + Δ_s)}
= e^{−α} ( 1 − e^{−αΔ_s} / (1 + Δ_s) )
≤ e^{−α} (α + 1) Δ_s / (1 + Δ_s)
≤ (α + 1) Δ_s e^{−α},

where the last two inequalities follow since e^x ≥ x + 1. Next, we decrease the variable y_e of every e ∈ s, each by at most Δ_s. As a result, other sets may turn minimal because of Line 7. Each such variable affects the competition only in one element, so the expected number of selections due to its decrease is bounded by (α + 1) Δ_s e^{−α}, and the total expected number of selections is bounded by |s| (α + 1) Δ_s e^{−α}. We proved that every decrease Δ_s in set s imposes an expected increase of at most |s| (α + 1) e^{−α} Δ_s in the integral solution. In addition, every increase Δ_s in set s imposes an expected increase of at most α Δ_s + 2 |s| (α + 1) e^{−α} Δ_s in the integral solution. Since the total fractional decrease is bounded by the total increase, then by choosing α = ln(S_max), the total movement in the randomized algorithm is at most 4 ln(S_max) + 3 times the fractional movement.

Bounding the service cost: We claim that at any round, Algorithm 3.2 outputs every set s ∈ S with probability at most (α + |s| e^{−α}) y_{s,t}. Note that the algorithm is memoryless, in the sense that the output at each round depends only on the current fractional values. Therefore, we may view the current rounded solution as an offline rounding of the current fractional solution. Similarly to [BNS13], we can prove that the claim holds. As a result, the expected service cost of the randomized solution is at most α + |s| e^{−α} times the service cost of the fractional solution, yielding a (ln(S_max) + 1)-approximation by choosing α = ln(S_max).
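The final choice of α is easy to sanity-check numerically (a small sketch of our own): with α = ln(S_max), the per-set factor α + |s| e^{−α} from the service-cost bound equals ln(S_max) + 1 for the largest sets, and is smaller for every smaller set.

```python
import math

# Factor in the selection-probability bound (alpha + |s| * e^{-alpha}) * y_s.
def factor(alpha, set_size):
    return alpha + set_size * math.exp(-alpha)

s_max = 64
alpha = math.log(s_max)
# for |s| = S_max the factor is exactly ln(S_max) + 1
print(abs(factor(alpha, s_max) - (math.log(s_max) + 1)) < 1e-12)  # True
print(factor(alpha, 4) < factor(alpha, s_max))                    # True
```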

Chapter 4

Unified Algorithms for Online Learning and Competitive Analysis

In this chapter we introduce a single unified algorithm which, by parameter tuning, interpolates between optimal regret for learning from experts in online learning and an optimal competitive ratio for the metrical task systems (MTS) problem in competitive analysis, improving on previous results. The algorithm also allows us to obtain new regret bounds against drifting experts, which might be of independent interest. Moreover, our approach allows us to go beyond experts/MTS, obtaining similar unifying results for structured action sets and combinatorial experts, whenever the setting has a certain matroid structure.

4.1 Introduction

Online learning and competitive analysis are two widely studied frameworks for online decision-making settings. While problems studied under these two frameworks are often rather similar, there has not been much research on general connections between the two. We note that one particular setting, known as learning from experts in the online learning framework and metrical task systems (MTS) with a uniform metric in the competitive analysis framework, has been jointly studied in [BB00]. In particular, the latter paper showed how certain algorithms, based on tuning some parameters, were able to interpolate between a reasonable regret bound and a reasonable competitive ratio. The interpolation was performed using the notion of α-unfair competitive ratio, which forces the policy we compete with to pay α times more for the movement cost. In the limit, α goes to infinity, and thus the competing policy becomes essentially static, and the setting becomes reminiscent of online learning. While these are important and interesting results, they are specific to the setting of experts/MTS.
In modern online learning, learning from experts is now known to be a very special case of much more general settings, such as combinatorial experts (see Chapter 5 in [CBL06]) and online convex optimization. Thus, a natural question is whether unifying analysis and

algorithms exist in such cases as well.

Our Results

We contribute to the joint study of the frameworks of online learning and competitive analysis by providing a novel unified algorithmic approach, based on the regularization technique introduced in Chapter 3, along with a primal-dual analysis. First, we show that in the experts/MTS setting, our algorithm attains both optimal regret and an optimal competitive ratio (unlike the results in [BB00], which do not obtain optimal competitive ratios), as well as optimal results for settings in between, such as shifting and drifting experts. The regret bound for drifting experts is new, to the best of our knowledge, and might be of independent interest. Furthermore, we show how our approach can be applied to more general, structured learning/competitive analysis settings which satisfy matroid constraints. Matroids are extremely useful combinatorial objects that have played an important role in combinatorial optimization since the pioneering work of Edmonds in the 1970s [Edm70, Edm71], and they naturally capture structured action sets such as spanning trees and sparse subsets. In the context of online convex optimization, our results may be viewed as online learning over the matroid base polytope. As in the experts/MTS case, we also get regret bounds against actions which shift or drift a limited amount. Moreover, this can be done in a fine-grained way which respects the problem structure (e.g., competing with spanning trees where only a bounded number of individual edges can change over time). Our algorithms are straightforward, and the various performance guarantees are all obtained just by tuning two parameters. In fact, our online algorithms have an equivalent form, based on recent primal-dual LP techniques developed in competitive analysis. The alternative algorithms are more explicit, and use multiplicative updates.
This equivalence emphasizes, once again, the strong connections between the fields of online learning and competitive analysis. In this chapter we focus on the regularization approach; we shall study the multiplicative updates approach in Chapter 5. A key technical feature of our algorithms is that we shift weights by an additive constant before applying regularization or multiplicative updates, thus deviating from the standard approach taken in multiplicative updates [LW94, FS97], weight sharing algorithms [HW98], and online learning algorithms involving regularization, such as Follow-the-Regularized-Leader (see the book by Shalev-Shwartz [SS11] and the survey by Hazan [Haz09]). As a result, in intermediate steps weights can have negative values, a situation uncommon in both approximation and online algorithms. We emphasize that although some of the settings we discuss might also be treatable by more conventional online learning tools, we obtain the relevant algorithms naturally from our framework, rather than requiring a case-by-case construction (which is common for online learning over structured sets; see [KWK10]). Overall, we hope that our work on combining online learning and competitive analysis

provides a step towards bringing these two rich and mature fields closer together. We also hope that the tools we develop may lead to practical algorithms which combine the advantages of both worlds: on one hand, the practical performance and usefulness of online learning, and on the other hand, the robustness to highly dynamic and state-dependent environments of competitive analysis.

Related Work: There are several works related to ours, other than [BB00], which we have already discussed. However, to the best of our knowledge, none of them attempts to provide a single algorithmic approach connecting online learning with competitive analysis. For example, [BBN10] show an analysis of experts and the unfair MTS problem using a primal-dual approach similar to ours. However, a different algorithm and analysis is applied to each of the problems, and the algorithms are considerably more complex and do not scale as well to the more general setting of matroids. [BCK02] discuss algorithms for decision making on lists and trees, for both a competitive analysis setting and an online learning setting, and show how they can be combined using the hedge algorithm [FS97] to provide simultaneous guarantees. Papers such as [BBK99] and [ABBS10] discuss competitive-analysis algorithms derived using tools from online learning, e.g., regularization. Other works attempt to strengthen the standard regret framework of online learning, such as learning with global cost functions [EDKMM09] and using more adaptive notions of regret, as discussed above. The matroid settings that we consider partially overlap with those of [KWK10], which were studied in the standard online learning framework. For these settings, we obtain similar optimal results for online learning without the need for case-by-case constructions, and again get an interpolation between online learning and competitive analysis.

Organization: The rest of the chapter is organized as follows.
We first introduce the frameworks of online learning and competitive analysis in Section 4.2. Then, we present the unified algorithms and specify our results in Section 4.3. In Section 4.4 we explain how we derived the algorithm for the simpler case of experts/MTS, and analyze it. The algorithm derivation in the matroid case is conceptually similar but technically more complex, and is provided in Section 4.5.

4.2 Preliminaries: Online Learning and Competitive Analysis

We begin by describing online learning and competitive analysis, as applied to the settings we consider. To facilitate our unified analysis, we will strive to use the same notation and terminology for both settings, sometimes using conventions from one to describe the other. Online learning in the experts setting proceeds in T rounds. We consider a finite action set E, where |E| = n. In the beginning of each round t, the decision-maker maintains a distribution vector y_{t−1} over E, which can be seen as a randomized policy for picking one out of n experts at that round. Then, a cost vector c_t is revealed, and the decision-maker incurs the expected cost ⟨y_{t−1}, c_t⟩. The vector c_t may be generated in an arbitrary, possibly adversarial way, and we only assume that each of its entries is bounded in [0, 1] (which can be easily relaxed by scaling). The

decision-maker then chooses a new vector y_t for the next round. The goal of the decision-maker is to minimize regret, defined as

Σ_{t=1}^T ⟨y_{t−1}, c_t⟩ − Σ_{t=1}^T ⟨y*, c_t⟩,  where y* = arg min_{y ≥ 0, ‖y‖_1 = 1} Σ_{t=1}^T ⟨y, c_t⟩.

For this bound to be non-trivial, we expect a regret which grows sublinearly with T. A more ambitious goal studied in the literature (e.g., [HW98]) is tracking the best expert, or regret against shifting experts. In the latter case, we wish to minimize Σ_{t=1}^T ⟨y_{t−1}, c_t⟩ − Σ_{t=1}^T ⟨y*_{t−1}, c_t⟩, where y*_0, ..., y*_{T−1} is the best sequence of distributions which changes at most k times (i.e., y*_i ≠ y*_{i+1} for at most k values of i). In this work, we will in fact study a more general framework, which we call drifting experts, in which the regret is measured against the optimal sequence y*_0, ..., y*_{T−1} such that Σ_{t=1}^T (1/2) ‖y*_t − y*_{t−1}‖_1 ≤ k. This generalizes shifting experts, since any k-shifting sequence is also a k-drifting sequence. We are not familiar with existing explicit results in the literature for drifting experts. In the more general framework that we consider here, rather than just picking single elements of E, we assume that the decision-maker can pick subsets of E from a family of subsets I which has some structure. Such settings were considered in several online learning papers, such as [KV05] and [KWK10]. For example, consider web advertising, where we can place exactly s ads on some website at any given timepoint, out of n ads overall. This can be naturally modeled as an online learning problem, where I is the family of all subsets of E of size s, and we want to compete against the set of the best s ads in hindsight. As another example, consider online learning of spanning trees, which is relevant in the context of communication networks. In that case, E is a set of edges in a graph, and C is the convex hull of all subsets of edges which form a spanning tree.
The goal in these settings is to minimize regret with respect to the best single element $y \in C$ in hindsight, namely
\[
\sum_{t=1}^{T} \langle y_{t-1}, c_t \rangle - \min_{y \in C} \sum_{t=1}^{T} \langle y, c_t \rangle.
\]
It turns out that the latter two settings, the basic experts setting, as well as many other settings, satisfy a matroid structure. Matroids are extremely useful combinatorial objects.1 We refer the reader to Section 2.2 for a formal definition and a brief description of the important matroid properties; for further details see, e.g., [Sch03]. Given a matroid $M = (E, \mathcal{I})$, recall that a subset $B$ of $E$ is called a base of $E$ if $B$ is a maximal independent subset of $E$. A well-known fact is that all matroid bases have the same size, called the rank of $E$, denoted by $r(E)$, or $r$ for short. The base polytope of a matroid $M$ is defined as the convex hull of the incidence vectors of the bases of $M$. We refer to this polytope as $B(M)$. We focus on algorithms which work over bases of matroids, interpolating online learning and

1 For instance, they play a crucial role in the analysis of greedy algorithms, and have deep connections to submodular functions, which have recently gained popularity in machine learning.

competitive analysis, and obtaining results in intermediate settings, such as competing against shifting and drifting targets. For computational efficiency, our algorithms maintain a solution $y_t \in B(M)$ rather than a distribution over the possibly exponentially large $\mathcal{I}$. This is known as a fractional solution. Since all vertices of $B(M)$ are matroid bases, any such fractional solution always corresponds to a valid distribution over the bases. In case an integral matroid base is required, we can use the fractional solution to actually sample from a consistent distribution over the bases of the matroid. Such a procedure is known as rounding. An example of a relevant rounding technique is pipage rounding, which is fast and easy to implement (see [CCPV11] for a description). We remark that techniques such as pipage rounding are only applicable when considering learning scenarios in which there are no switching costs. The case in which the algorithm also pays for movement is considerably more involved. We address this issue in Chapter 5.

We now turn to describe the general matroid setting in the competitive analysis framework. We first note that the analogue of the experts setting is known as the metrical task system (MTS) problem on a uniform metric, first formulated in [BLS92]. MTS abstracts many important online decision problems, e.g., process migration. In the online setting, the decision-maker sequentially needs to choose a vector $y_t$ in a high-dimensional simplex and incur costs depending on arbitrarily-chosen cost vectors. However, there are some important differences. First, the decision-maker pays a movement cost for changing from $y_{t-1}$ to $y_t$, which equals $\frac{1}{2}\|y_t - y_{t-1}\|_1$, and not only a cost depending on $c_t$ (known as the service cost). Second, the service cost incurred in round $t$ is defined to be $\langle y_t, c_t \rangle$, and not $\langle y_{t-1}, c_t \rangle$. In other words, the decision-maker is allowed to first see the cost vector $c_t$, and only then choose the new vector $y_t$ and pay accordingly.
This is called 1-lookahead. In contrast, in the experts setting the decision-maker first pays the cost $\langle y_{t-1}, c_t \rangle$ and only then chooses a vector $y_t$. This is called 0-lookahead. We decompose the total cost paid by the decision-maker into the service cost $S_1$ (with 1-lookahead) and the movement cost $M$ as follows:
\[
S_1 = \sum_{t=1}^{T} \langle y_t, c_t \rangle, \qquad
M = \sum_{t=1}^{T} \tfrac{1}{2}\|y_t - y_{t-1}\|_1 .
\]
To motivate these notions, we note that in the context of, say, MTS, one thinks of $y_t$ as a distribution over the $n$ possible states the algorithm might be in, of $\frac{1}{2}\|y_t - y_{t-1}\|_1$ as the cost associated with changing that state, and of $c_t$ as specifying the cost of processing a task in each of the $n$ states. Because of the movement cost, the ability to get the cost $c_t$ in advance does not trivialize the problem. To allow comparison to the experts setting, we also define
\[
S_0 = \sum_{t=1}^{T} \langle y_{t-1}, c_t \rangle
\]
as the service cost of an algorithm whose action at round $t$ does not depend on $c_t$. The framework

naturally extends to the context of matroids: the decision-maker needs to maintain over time a base of a matroid $M = (E, \mathcal{I})$. Another important difference, in comparison with the online learning framework, is the performance measure. In competitive analysis the goal is not to compete against the best fixed element in $B(M)$, but rather against the optimal offline sequence $y^*_1, \ldots, y^*_T$, which is a solution to
\[
\min_{y_t \in B(M),\ t=1,\ldots,T} \ \sum_{t=1}^{T} \langle y_t, c_t \rangle + \sum_{t=1}^{T} \tfrac{1}{2}\|y_t - y_{t-1}\|_1 .
\]
In other words, $y^*_1, \ldots, y^*_T$ is the optimal sequence of the decision-maker's choices, had she known all the cost vectors in advance and could have solved the problem offline. Clearly, this is a much more ambitious goal than minimizing the regret with respect to a fixed $y^*$. We let
\[
S^*_1 = \sum_{t=1}^{T} \langle y^*_t, c_t \rangle, \qquad
M^* = \sum_{t=1}^{T} \tfrac{1}{2}\|y^*_t - y^*_{t-1}\|_1
\]
denote the service cost and the movement cost of this optimal sequence, and let $\mathrm{opt} = S^*_1 + M^*$ denote its total cost. The competitive ratio is then defined as the minimal $c \ge 1$ such that for any sequence of cost vectors, $S_1 + M \le c \cdot \mathrm{opt} + d$, where $d$ is a constant independent of $T$. In competitive analysis, $c$ is usually strictly greater than one, and is independent of $T$. For example, in the MTS setting the attainable competitive ratio is known to be $\Theta(\ln n)$ [BLS92].

A crucial refinement of the competitive ratio, which we use for providing a unified analysis of the two settings, is the notion of the $\alpha$-unfair competitive ratio, for $\alpha \ge 1$. This notion modifies the sequence $y^*_1, \ldots, y^*_T$ we compete against. Rather than defining it as the sequence minimizing $\sum_{t=1}^{T} \langle y_t, c_t \rangle + \sum_{t=1}^{T} \frac{1}{2}\|y_t - y_{t-1}\|_1$, we define it as the solution to:
\[
\min_{y_t \in B(M),\ t=1,\ldots,T} \ \sum_{t=1}^{T} \langle y_t, c_t \rangle + \alpha \sum_{t=1}^{T} \tfrac{1}{2}\|y_t - y_{t-1}\|_1 .
\]
The optimal cost of the above is denoted by $\mathrm{opt}(\alpha)$. In words, the sequence we compete against pays $\alpha$ times more than the decision-maker for movement. The case $\alpha = 1$ corresponds to the standard competitive analysis setting.
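The cost decomposition above is easy to compute directly. A minimal sketch (hypothetical helper names), for a trajectory $y_0, \ldots, y_T$ and cost vectors $c_1, \ldots, c_T$; for distributions and costs in $[0,1]$ it can also be used to check the relation $S_0 \le S_1 + M$:

```python
def service_1(seq, costs):
    # 1-lookahead service cost: sum_t <y_t, c_t>, with seq = [y_0, ..., y_T]
    return sum(sum(y * c for y, c in zip(seq[t], costs[t - 1]))
               for t in range(1, len(seq)))

def service_0(seq, costs):
    # 0-lookahead service cost: sum_t <y_{t-1}, c_t>
    return sum(sum(y * c for y, c in zip(seq[t - 1], costs[t - 1]))
               for t in range(1, len(seq)))

def movement(seq):
    # movement cost: sum_t (1/2) * ||y_t - y_{t-1}||_1
    return sum(0.5 * sum(abs(a - b) for a, b in zip(seq[t], seq[t - 1]))
               for t in range(1, len(seq)))
```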
For $\alpha > 1$ the setting becomes easier, because it encourages the competing sequence to move less. In the limit $\alpha \to \infty$, the optimal sequence necessarily satisfies $y^*_1 = \ldots = y^*_T$, and the setting becomes reminiscent of online learning, where we compare ourselves against a fixed $y^*$ (although the 1-lookahead and the movement cost features remain). The $\alpha$-unfair competitive ratio was proposed in [BKRS00], and was used to show connections between online learning and competitive analysis for experts/MTS in [BB00].

To facilitate our regret bounds for $k$-drifting sequences, we let $\mathrm{opt}_k$ denote the cost of the

best $k$-drifting sequence of valid vectors in $B(M)$, i.e., the $k$-drifting sequence which minimizes $\sum_{t=1}^{T} \langle y^*_{t-1}, c_t \rangle$. It is easy to verify that
\[
\mathrm{opt}(\alpha) \le \mathrm{opt}_k + \alpha k,
\]
since the optimal $k$-drifting solution incurs a cost of at most $\mathrm{opt}_k + \alpha k$ in the $\alpha$-unfair setting. Another simple observation, based on the boundedness of $c_t$, is that
\[
S_0 = \sum_{t=1}^{T} \langle y_{t-1}, c_t \rangle \le \sum_{t=1}^{T} \langle y_t, c_t \rangle + \sum_{t=1}^{T} \tfrac{1}{2}\|y_t - y_{t-1}\|_1 = S_1 + M.
\]
Combining these two, we get the following useful observation relating the online learning and competitive analysis settings:2

Observation 4.1. Suppose we have an algorithm in the $\alpha$-unfair setting whose total cost is at most $c \cdot \mathrm{opt}(\alpha) + d$. Then we have an online learning algorithm with total cost
\[
S_0 \le S_1 + M \le c \cdot \mathrm{opt}(\alpha) + d \le c \cdot \mathrm{opt}_k + c\alpha k + d.
\]

4.3 Algorithms and Results

We first present Algorithm 4.1 for the experts/MTS setting. The main idea behind it is to apply multiplicative updates to a shifted value of the weights. As we have seen in the previous section, this can be easily achieved via regularization: in each round $t$ we maintain $n$ variables $y_{1,t}, \ldots, y_{n,t}$ denoting the current distribution over servers/experts. Consider the cost vector $(c_{1,t}, \ldots, c_{n,t})$ at round $t$. Then all we need to do is solve a regularized convex program.

Algorithm 4.1 Experts/MTS Algorithm (regularized formulation)
Parameters: $\alpha \ge 1$, $\eta > 0$
Initialize $y_{i,0} = \frac{1}{n}$ for all $i = 1, \ldots, n$.
for $t = 1, 2, \ldots$ do
  Let $(c_{1,t}, \ldots, c_{n,t})$ be the cost vector at time $t$, and let $P$ be the probability simplex.
  Solve the following convex program to obtain $y_t$:
  \[
  y_t = \arg\min_{x \in P} \left\{ \langle c_t, x \rangle + \frac{1}{\eta} \sum_{i=1}^{n} \left[ \left(x_i + \frac{1}{e^{\eta\alpha}-1}\right) \ln \frac{x_i + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1}} - x_i \right] \right\}. \quad (4.2)
  \]
end for

The use of regularization spares us the need to explicitly normalize the variables after every multiplicative update. In fact, Algorithm 4.1 has an equivalent explicit form using the classic multiplicative updates approach, as shown in Algorithm 4.2. Given the cost vector

2 An observation of a similar flavor was also given in [BB00].

$(c_{1,t}, \ldots, c_{n,t})$ at round $t$, we apply a shifted multiplicative update to the weights; that is, we set the new variables so that for every $i$,
\[
y_{i,t} + \frac{1}{e^{\eta\alpha}-1} = \left( y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1} \right) e^{-\eta c_{i,t}}.
\]
Then, we carefully fix the shifted weights $\left(y_{1,t} + \frac{1}{e^{\eta\alpha}-1}, \ldots, y_{n,t} + \frac{1}{e^{\eta\alpha}-1}\right)$ to ensure they induce a valid distribution.

Algorithm 4.2 Experts/MTS Algorithm (learning-style formulation)
Parameters: $\alpha \ge 1$, $\eta > 0$
Initialize $y_{i,0} = \frac{1}{n}$ for all $i = 1, \ldots, n$.
for $t = 1, 2, \ldots$ do
  Let $(c_{1,t}, \ldots, c_{n,t})$ be the cost vector at time $t$.
  Find the smallest value $a_t$ (e.g., using binary search) such that $\sum_{i=1}^{n} y_{i,t} = 1$, where
  \[
  y_{i,t} = \max\left\{ 0,\ \left( y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1} \right) e^{-\eta (c_{i,t} - a_t)} - \frac{1}{e^{\eta\alpha}-1} \right\}.
  \]
end for

We prove the following theorem.

Theorem 4.2. For any $\alpha \ge 1$, $\eta > 0$, Algorithm 4.1 attains
\[
S_1 \le \mathrm{opt}(\alpha) + \frac{\ln n}{\eta}, \quad (4.3)
\]
\[
M \le \left(1 + \frac{n}{e^{\eta\alpha}-1}\right) \left( \eta\,\mathrm{opt}(\alpha) + \ln n \right). \quad (4.4)
\]
In particular, for $\alpha \to \infty$ (regret against a fixed distribution), by Observation 4.1 we get
\[
S_0 \le S_1 + M \le (1+\eta)\,\mathrm{opt} + \frac{\ln n}{\eta} + \ln n. \quad (4.5)
\]
By setting3 $\alpha = \ln(n)/\eta$ and using Observation 4.1, we also obtain
\[
S_0 \le (1+3\eta)\,\mathrm{opt}_k + \frac{(k+1)\ln n}{\eta} + 3(k+1)\ln n. \quad (4.6)
\]

Let us try to understand the bounds in the theorem. For Equation (4.4), if we set $\alpha = 1$ and $\eta = \ln n + \ln\ln n$, we get the best known bound for MTS on uniform metrics [BBN10, ABBS10]. In particular, the bound is better than that obtained by the analysis of [BB00], who also interpolate between experts and MTS. For Equation (4.5), if we set $\eta = \sqrt{\ln(n)/\mathrm{opt}}$, then our analysis yields a virtually optimal regret bound of $2\sqrt{\mathrm{opt}\ln n} + \ln n$ for the experts setting. Moreover, it is not hard to see that when $\alpha \to \infty$, our algorithm reduces to the canonical multiplicative updates algorithm (see [CBL06]). Equation (4.6) is a regret bound with respect

3 This value is chosen for simplicity, and is not the tightest possible.

to the optimal $k$-drifting sequence. Setting $\eta = \sqrt{(k+1)\ln(n)/(3\,\mathrm{opt}_k)}$, we get an asymptotically optimal regret of less than
\[
2\sqrt{3(k+1)\ln(n)\,\mathrm{opt}_k} + 3(k+1)\ln n
\]
for this problem, which matches the lower bound for $k$-shifting experts. The lower bound can be derived, for example, by dividing the $T$ rounds into $k$ disjoint epochs of $T/k$ rounds each, and using the standard regret lower bound construction in each one of them; see also [CBL06, Corollary 4.2 and Section 5.2]. We emphasize that while there exist previous results for the case of shifting experts, here we provide an algorithm and analysis for the strictly more general setting of drifting experts.4 We note that although $\mathrm{opt}$ and $\mathrm{opt}_k$ may not be known in advance in order to tune $\eta$, one can use a standard doubling trick to circumvent this, or obtain bounds in which these quantities are replaced by the number of rounds $T$ [CBL06].

The general case of a matroid $M = (E, \mathcal{I})$ is handled by Algorithm 4.3, which works similarly to Algorithm 4.1. The algorithm maintains a vector $y_t \in B(M)$ over the elements of $E$. Initially, we pick $y_0$ to be a vector in $B(M)$ such that $\sum_{e \in E} -\ln\left(y_{0,e} + \frac{1}{e^{\eta\alpha}-1}\right)$ is minimized. It turns out that such an initialization ensures that $\min_{e \in E} \{y_{0,e}\} \ge \frac{1}{\gamma_M}$, where $\gamma_M$ is the matroid density, and this is the best possible lower bound (see Claim 4.5). Then, in each round all we have to do is solve a regularized convex program, which implicitly encapsulates an update step that decreases the value $y_{e,t}$, followed by a non-trivial normalization.

Algorithm 4.3 Matroid Algorithm (regularized formulation)
Parameters: $\alpha \ge 1$, $\eta > 0$
Start with a fractional base $y_0 \in B(M)$ such that $y_0 = \arg\min_{x \in B(M)} \sum_{e \in E} -\ln\left(x_e + \frac{1}{e^{\eta\alpha}-1}\right)$.
for $t = 1, 2, \ldots$ do
  Let $(c_{1,t}, \ldots, c_{n,t})$ be the current cost vector.
  Solve the following convex program to obtain $y_t$:
  \[
  y_t = \arg\min_{x \in B(M)} \left\{ \langle c_t, x \rangle + \frac{1}{\eta} \sum_{e \in E} \left[ \left(x_e + \frac{1}{e^{\eta\alpha}-1}\right) \ln \frac{x_e + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t-1} + \frac{1}{e^{\eta\alpha}-1}} - x_e \right] \right\}. \quad (4.7)
  \]
end for

The performance guarantee of the algorithm is provided below. We note that it is a natural generalization of Theorem 4.2, as the experts setting corresponds to a matroid with $r(E) = 1$ and $\gamma_M = n$.

Theorem 4.3. For a matroid $M = (E, \mathcal{I})$, and any $\alpha \ge 1$, $\eta > 0$, Algorithm 4.3 attains

4 There do exist results for regret against drifting targets in the $\ell_2$ norm [Zin03]. However, these results do not require a significant change in the algorithm. In contrast, the standard multiplicative updates algorithm can be shown to fail against $\ell_1$ drift, so a new algorithm is indeed required.

\[
S_1 \le \mathrm{opt}(\alpha) + \frac{r(E)\ln\gamma_M}{\eta}, \quad (4.8)
\]
\[
M \le \left(1 + \frac{n - r(E) + 1}{e^{\eta\alpha}-1}\right)\left(\eta\,\mathrm{opt}(\alpha) + r(E)\ln\gamma_M\right). \quad (4.9)
\]
For $\alpha \to \infty$ (regret against a fixed distribution), by Observation 4.1 we get
\[
S_0 \le S_1 + M \le (1+\eta)\,\mathrm{opt} + \frac{r(E)\ln\gamma_M}{\eta} + r(E)\ln\gamma_M. \quad (4.10)
\]
By setting $\alpha = \ln(n - r(E) + 1)/\eta$ and using Observation 4.1,
\[
S_0 \le (1+3\eta)\,\mathrm{opt}_k + \frac{(2k + r(E))\ln(n - r(E) + 1)}{\eta} + 3(k + r(E))\ln(n - r(E) + 1). \quad (4.11)
\]

For Equation (4.10), if we set $\eta = \sqrt{r(E)\ln(\gamma_M)/\mathrm{opt}}$, then our analysis yields a regret bound of
\[
2\sqrt{r(E)\ln(\gamma_M)\,\mathrm{opt}} + r(E)\ln\gamma_M .
\]
For example, for $s$-sparse subsets this corresponds to $O\left(\sqrt{s\ln(n/s)\,\mathrm{opt}} + s\ln(n/s)\right)$, and for spanning trees over $|E|$ edges and $|V|$ vertices we get $O\left(\sqrt{|V|\ln(|E|/|V| + 1)\,\mathrm{opt}} + |V|\ln(|E|/|V| + 1)\right)$. This corresponds to the results of [KWK10]; moreover, our latter result is for spanning trees over general graphs rather than complete graphs. Equation (4.11) provides a version for $k$-drifting sequences. Setting $\eta = \sqrt{(k + r(E))\ln(n - r(E) + 1)/\mathrm{opt}_k}$, we get a regret of less than
\[
4\sqrt{(k + r(E))\ln(n - r(E) + 1)\,\mathrm{opt}_k} + 3(k + r(E))\ln(n - r(E) + 1)
\]
for this problem. Since the drift is measured with respect to the $\ell_1$ norm over $B(M)$, it naturally captures the structure of the problem. In particular, the drift is measured with respect to changes in individual elements in the $s$-subsets, or in individual edges in the spanning trees.

4.4 Proofs and Algorithm Derivation: the Experts/MTS Case

In this section we explain how we derive and analyze our algorithms, focusing on the simpler case of experts/MTS (Algorithm 4.1 and Theorem 4.2). The derivation is based on a primal-dual linear programming analysis. It starts from a very

(P)  $\min \ \sum_{t=1}^{T} \sum_{i=1}^{n} c_{i,t}\, y_{i,t} + \sum_{t=1}^{T} \sum_{i=1}^{n} \alpha\, z_{i,t}$
  s.t.  $\sum_{i=1}^{n} y_{i,t} = 1$  for all $t \ge 0$
        $z_{i,t} \ge y_{i,t} - y_{i,t-1}$  for all $t \ge 1$ and expert $i$
        $z_{i,t},\ y_{i,t} \ge 0$  for all $t \ge 1$ and $i$

(D)  $\max \ \sum_{t=0}^{T} a_t$
  s.t.  $a_0 + b_{i,1} \le 0$  for all $i$ ($t = 0$)
        $b_{i,t+1} \le b_{i,t} + c_{i,t} - a_t$  for all $t \ge 1$ and expert $i$
        $0 \le b_{i,t} \le \alpha$  for all $t$ and expert $i$

Figure 4.1: The primal and dual LP formulations for the MTS problem.

simple LP formulation (Figure 4.1) of the optimal offline $\alpha$-unfair solution. Note that in order to charge for $\alpha \sum_{t=1}^{T} \frac{1}{2}\|y_t - y_{t-1}\|_1$, it suffices to charge only for increasing coordinates. Thus, we charge both the optimal solution and our algorithm for increasing variables. Figure 4.1 also contains a description of the dual program (D). This program plays a central role in our analysis. We denote by $D$ the value of the dual program. It is well known that $D$ is a lower bound on the value of any primal solution.

The analysis of Algorithm 4.1 is based on deriving the KKT optimality conditions of the regularized problem in each round. As demonstrated in the previous chapter, the primal LP constraints define dual variables that are then carefully plugged into the dual formulation in Figure 4.1 to construct a feasible dual solution. This, in turn, yields a lower bound on the performance of the online algorithm, which eventually leads to Equation (4.4) in Theorem 4.2. The other bounds in the theorem are simple corollaries obtained by a direct calculation.

KKT optimality conditions. In each round, Algorithm 4.1 solves a convex program (4.2) with a single equality constraint. We define a Lagrangian variable $a_t$ for that constraint in round $t \in \{1, \ldots, T\}$. The KKT optimality conditions define the following relationship between the optimal values of $y_t$ and $a_t$.
\[
\sum_{i=1}^{n} y_{i,t} - 1 = 0, \quad (4.12)
\]
\[
\forall\, 1 \le i \le n: \quad c_{i,t} + \frac{1}{\eta} \ln \frac{y_{i,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1}} - a_t \ge 0, \quad (4.13)
\]
\[
\forall\, 1 \le i \le n: \quad y_{i,t} \left( c_{i,t} + \frac{1}{\eta} \ln \frac{y_{i,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1}} - a_t \right) = 0. \quad (4.14)
\]

We first construct a dual solution to the offline problem (D) using the values obtained from the KKT optimality conditions. We then show that the primal and dual solutions obtained are feasible, and finally we prove that the dual we constructed can pay for both the movement and the service cost of the online algorithm. To construct the dual, we simply assign to $a_t$ in (D) the value obtained from the optimality conditions, and we define
\[
b_{i,t+1} = \frac{1}{\eta} \ln \frac{1 + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t} + \frac{1}{e^{\eta\alpha}-1}}.
\]
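The KKT characterization can be checked numerically. The sketch below (hypothetical helper names) computes one round of the explicit update of Algorithm 4.2, with the normalizer $a_t$ found by binary search, and then evaluates the gradient expressions appearing in (4.13)–(4.14):

```python
import math

def mts_step(y_prev, c, eta, alpha):
    """One round of the learning-style update (Algorithm 4.2): a shifted
    multiplicative update, truncated at zero and renormalized via a_t."""
    delta = 1.0 / (math.exp(eta * alpha) - 1.0)
    w = [(y + delta) * math.exp(-eta * ci) for y, ci in zip(y_prev, c)]

    def total(a):
        return sum(max(0.0, wi * math.exp(eta * a) - delta) for wi in w)

    lo, hi = 0.0, 1.0
    while total(hi) < 1.0:      # total(a) is nondecreasing in a
        hi *= 2.0
    for _ in range(200):        # binary search for the normalizer a_t
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) < 1.0 else (lo, mid)
    a = 0.5 * (lo + hi)
    y = [max(0.0, wi * math.exp(eta * a) - delta) for wi in w]
    return y, a

def kkt_residuals(y_prev, y, a, c, eta, alpha):
    """The left-hand sides of (4.13); they must be >= 0, and = 0 whenever
    y_{i,t} > 0 (complementary slackness, condition (4.14))."""
    delta = 1.0 / (math.exp(eta * alpha) - 1.0)
    return [ci + math.log((yi + delta) / (yp + delta)) / eta - a
            for yi, yp, ci in zip(y, y_prev, c)]
```

Experts that are truncated to zero have a strictly positive residual, while every expert kept at positive weight satisfies the gradient condition with equality.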

Primal (P) is feasible: By definition, both the initialization of $y_0$ and the updates $y_t$ are feasible. By setting $z_{i,t} = \max\{0, y_{i,t} - y_{i,t-1}\}$ we obtain a feasible primal solution.

Dual (D) is feasible: Since initially $y_{i,0} = \frac{1}{n}$, we can set for each $i$,
\[
b_{i,1} = \frac{1}{\eta} \ln \frac{1 + \frac{1}{e^{\eta\alpha}-1}}{\frac{1}{n} + \frac{1}{e^{\eta\alpha}-1}} = \alpha - \frac{1}{\eta}\ln\frac{e^{\eta\alpha}+n-1}{n}, \qquad
a_0 = -\alpha + \frac{1}{\eta}\ln\frac{e^{\eta\alpha}+n-1}{n},
\]
and we get that the first dual constraint is satisfied. Also, by the definition of the dual variables, note that for any $t \ge 1$ and $i$,
\[
b_{i,t+1} - b_{i,t} = -\frac{1}{\eta} \ln \frac{y_{i,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1}} \le c_{i,t} - a_t,
\]
where the inequality follows from (4.13). Finally, $0 \le b_{i,t+1} = \frac{1}{\eta} \ln \frac{1 + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t} + \frac{1}{e^{\eta\alpha}-1}} \le \alpha$, which follows since $0 \le y_{i,t} \le 1$.

Primal-dual relation: Let $D_t$ be the change in the cost of the dual solution at time $t$. We bound the cost of the algorithm in each iteration by the change in the cost of the dual.

Bounding the movement cost at time $t$: Let $M_t$ be the movement cost at time $t$. As indicated, we charge our algorithm and $\mathrm{opt}(\alpha)$ only for increasing the fractional value of the elements. We get,
\[
M_t = \sum_{i:\, y_{i,t} > y_{i,t-1}} (y_{i,t} - y_{i,t-1})
\le \sum_{i:\, y_{i,t} > y_{i,t-1}} \left(y_{i,t} + \frac{1}{e^{\eta\alpha}-1}\right) \ln \frac{y_{i,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1}} \quad (4.15)
\]
\[
= \eta \sum_{i:\, y_{i,t} > y_{i,t-1}} \left(y_{i,t} + \frac{1}{e^{\eta\alpha}-1}\right) (a_t - c_{i,t}) \quad (4.16)
\]
\[
\le \eta \sum_{i=1}^{n} \left(y_{i,t} + \frac{1}{e^{\eta\alpha}-1}\right) a_t \quad (4.17)
\]
\[
= \eta \left(1 + \frac{n}{e^{\eta\alpha}-1}\right) a_t \le \eta \left(1 + \frac{n}{e^{\eta\alpha}-1}\right) D_t. \quad (4.18)
\]
Inequality (4.15) follows as $a - b \le a\ln(a/b)$ for any $a, b > 0$. Equality (4.16) follows from Condition (4.14), since if $y_{i,t} > y_{i,t-1}$ then also $y_{i,t} > 0$. Inequality (4.17) follows as $y_{i,t}$, $c_{i,t}$ and $a_t$ are nonnegative. Inequality (4.18) follows from Condition (4.12), since the change in the dual objective at time $t \ge 1$ is $D_t = a_t$. Summing over all times $t$

we get,
\[
M \le \eta \left(1 + \frac{n}{e^{\eta\alpha}-1}\right) \sum_{t=1}^{T} D_t
= \eta \left(1 + \frac{n}{e^{\eta\alpha}-1}\right) (D - a_0)
= \eta \left(1 + \frac{n}{e^{\eta\alpha}-1}\right) \left( D + \alpha - \frac{1}{\eta}\ln\frac{e^{\eta\alpha}+n-1}{n} \right).
\]
Since $\eta\alpha - \ln\frac{e^{\eta\alpha}+n-1}{n} \le \eta\alpha - \ln\frac{e^{\eta\alpha}}{n} = \ln n$, we conclude that
\[
M \le \left(1 + \frac{n}{e^{\eta\alpha}-1}\right) (\eta D + \ln n).
\]

Bounding the service cost: We are now ready to bound the service cost $S_1$ paid by the algorithm.
\[
S_1 = \sum_{t=1}^{T} \langle y_t, c_t \rangle = \sum_{t=1}^{T} \sum_{i=1}^{n} c_{i,t}\, y_{i,t}
= \sum_{t=1}^{T} \sum_{i=1}^{n} y_{i,t} \left( a_t - \frac{1}{\eta} \ln \frac{y_{i,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1}} \right) \quad (4.19)
\]
\[
= \sum_{t=1}^{T} a_t - \frac{1}{\eta} \sum_{t=1}^{T} \sum_{i=1}^{n} \left(y_{i,t} + \frac{1}{e^{\eta\alpha}-1}\right) \ln \frac{y_{i,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1}} + \frac{1}{\eta(e^{\eta\alpha}-1)} \sum_{i=1}^{n} \sum_{t=1}^{T} \ln \frac{y_{i,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{i,t-1} + \frac{1}{e^{\eta\alpha}-1}} \quad (4.20)
\]
\[
\le \sum_{t=1}^{T} a_t - \frac{1}{\eta} \sum_{i=1}^{n} (y_{i,T} - y_{i,0}) + \frac{1}{\eta(e^{\eta\alpha}-1)} \sum_{i=1}^{n} \ln \frac{y_{i,T} + \frac{1}{e^{\eta\alpha}-1}}{y_{i,0} + \frac{1}{e^{\eta\alpha}-1}} \quad (4.21)
\]
\[
\le \sum_{t=1}^{T} a_t = D - a_0 = D + \alpha - \frac{1}{\eta}\ln\frac{e^{\eta\alpha}+n-1}{n} \le D + \frac{\ln n}{\eta}. \quad (4.22)
\]
Equality (4.19) follows from Condition (4.14). Equality (4.20) follows from Condition (4.12). Inequality (4.21) follows by a telescoping sum, and because $a - b \le a\ln(a/b)$ for any $a, b \ge 0$. Inequality (4.22) follows since $\sum_{i=1}^{n} y_{i,0} = \sum_{i=1}^{n} y_{i,T} = 1$, and since $y_0$ minimizes the expression $\sum_{i=1}^{n} -\ln\left(x_i + \frac{1}{e^{\eta\alpha}-1}\right)$ over the simplex, as $-\ln\left(x + \frac{1}{e^{\eta\alpha}-1}\right)$ is convex in $x$. Since $D \le \mathrm{opt}(\alpha)$, the bounds (4.3) and (4.4) in Theorem 4.2 follow.
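As a sanity check of this primal-dual analysis, the following sketch (hypothetical names) runs the explicit update for $T$ random rounds, accumulates the dual value $D = \sum_{t \ge 0} a_t$ with $a_0$ set as in the dual feasibility argument above, and verifies both the per-round movement bound from (4.18) and the final service-cost bound $S_1 \le D + \ln(n)/\eta$:

```python
import math, random

def run_and_check(n=5, T=40, eta=0.7, alpha=1.0, seed=0):
    rng = random.Random(seed)
    delta = 1.0 / (math.exp(eta * alpha) - 1.0)
    y = [1.0 / n] * n
    s1 = 0.0
    # a_0 = -alpha + (1/eta) * ln((e^{eta*alpha} + n - 1) / n)
    dual = -alpha + math.log((math.exp(eta * alpha) + n - 1) / n) / eta
    for _ in range(T):
        c = [rng.random() for _ in range(n)]
        w = [(yi + delta) * math.exp(-eta * ci) for yi, ci in zip(y, c)]
        total = lambda a: sum(max(0.0, wi * math.exp(eta * a) - delta)
                              for wi in w)
        lo, hi = 0.0, 1.0
        while total(hi) < 1.0:
            hi *= 2.0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if total(mid) < 1.0 else (lo, mid)
        a = 0.5 * (lo + hi)
        y_new = [max(0.0, wi * math.exp(eta * a) - delta) for wi in w]
        move = sum(max(0.0, b - d) for b, d in zip(y_new, y))
        # per-round movement bound (4.18): M_t <= eta * (1 + n*delta) * a_t
        assert move <= eta * (1.0 + n * delta) * a + 1e-6
        s1 += sum(yi * ci for yi, ci in zip(y_new, c))
        dual += a
        y = y_new
    # service-cost bound (4.22): S_1 <= D + ln(n)/eta
    assert s1 <= dual + math.log(n) / eta + 1e-6
    return s1, dual
```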

(P)  $\min \ \sum_{t=1}^{T} \sum_{e \in E} c_{e,t}\, y_{e,t} + \alpha \sum_{t=1}^{T} \sum_{e \in E} z_{e,t}$
  s.t.  $\sum_{e \in S} y_{e,t} \le r(S)$  for all $t \ge 0$ and $S \subseteq E$
        $\sum_{e \in E} y_{e,t} \ge r(E)$  for all $t \ge 0$
        $z_{e,t} \ge y_{e,t} - y_{e,t-1}$  for all $t \ge 1$ and $e \in E$
        $z_{e,t} \ge 0$  for all $t \ge 1$ and $e \in E$

(D)  $\max \ \sum_{t=0}^{T} \left[ r(E)\, a_t - \sum_{S \subseteq E} r(S)\, a_{S,t} \right]$
  s.t.  $a_0 - \sum_{S:\, e \in S} a_{S,0} + b_{e,1} \le 0$  for all $e \in E$ ($t = 0$)
        $b_{e,t+1} \le b_{e,t} + c_{e,t} - a_t + \sum_{S:\, e \in S} a_{S,t}$  for all $t \ge 1$ and $e \in E$
        $0 \le b_{e,t} \le \alpha$  for all $t$ and $e \in E$
        $a_t,\ a_{S,t} \ge 0$  for all $t$ and $S \subseteq E$

Figure 4.2: The primal and dual LP formulations for the Matroid problem.

4.5 Proofs and Algorithm Derivation: the Matroid Case

In this section we analyze Algorithm 4.3, which works for the general matroid setting, and prove Theorem 4.3. We start from an LP formulation (Figure 4.2) of the optimal $\alpha$-unfair solution, together with the dual program (D). Note that the first two primal constraints are a characterization of the matroid base polytope (see [Sch03], Chapter 40), and ensure that in each round $y_t \in B(M)$. For convenience, we replace the matroid rank constraint $\sum_{e \in E} y_{e,t} = r(E)$ with two inequality constraints. In addition, we note that the nonnegativity constraint $y_{e,t} \ge 0$ is redundant, since for every $e, t$ the matroid rank constraints on the subsets $E$ and $E \setminus \{e\}$ imply
\[
y_{e,t} = \sum_{e' \in E} y_{e',t} - \sum_{e' \in E\setminus\{e\}} y_{e',t} \ge r(E) - r(E \setminus \{e\}) \ge 0.
\]
As in the case of experts/MTS (the uniform matroid), Algorithm 4.3 has an equivalent explicit form using the classic multiplicative updates approach. The idea is to apply a shifted multiplicative update to the weights, followed by a sequence of up to $n$ normalization steps. The normalization phase is more complex than in the experts setting, as the matroid polytope consists of an exponential number of constraints; however, it can be carried out thanks to the special properties of matroids. For all that, we shall analyze the algorithm here using the regularization approach. In Chapter 5, we introduce the explicit version of multiplicative updates over matroid constraints, along with a corresponding analysis.

The analysis of Algorithm 4.3 is based on deriving the KKT optimality conditions of the regularized problem in each round.
The primal LP constraints define dual variables that are carefully plugged into the dual formulation in Figure 4.2 to construct a feasible dual solution. This, in turn, yields a lower bound on the performance of the online algorithm, which eventually leads to Inequalities (4.8) and (4.9) in Theorem 4.3. The other bounds in the theorem are simple corollaries obtained by a direct calculation.

KKT optimality conditions. In each round, Algorithm 4.3 solves a convex program (4.7) with a packing constraint for every $S \subseteq E$, and an additional covering constraint. Respectively,

we define a Lagrangian variable $a_{S,t}$ for each packing constraint in round $t \in \{1, \ldots, T\}$, and a Lagrangian variable $a_t$ for the covering constraint. The KKT optimality conditions define the following relationship between the optimal values of $y_t$ and $a_{S,t}, a_t$.
\[
\sum_{e \in E} y_{e,t} - r(E) = 0, \quad (4.23)
\]
\[
\forall S \subseteq E: \quad \sum_{e \in S} y_{e,t} - r(S) \le 0, \quad (4.24)
\]
\[
\forall S \subseteq E: \quad a_t \ge 0, \quad a_{S,t} \ge 0, \quad (4.25)
\]
\[
\forall S \subseteq E: \quad a_{S,t} \left( \sum_{e \in S} y_{e,t} - r(S) \right) = 0, \quad (4.26)
\]
\[
\forall e \in E: \quad c_{e,t} + \frac{1}{\eta} \ln \frac{y_{e,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t-1} + \frac{1}{e^{\eta\alpha}-1}} - a_t + \sum_{S:\, e \in S} a_{S,t} = 0, \quad (4.27)
\]
\[
a_t \left( \sum_{e \in E} y_{e,t} - r(E) \right) = 0. \quad (4.28)
\]

For our analysis, we need the following properties of matroids.

Claim 4.4. For any matroid $M = (E, \mathcal{I})$ and any nonempty set $S \subseteq E$, we have $\frac{|S|}{n - r(E) + 1} \le r(S)$; moreover, if $r(E \setminus S) < r(E)$, then
\[
\frac{|S|}{n - r(E) + 1} \le r(E) - r(E \setminus S) \le r(S).
\]

Proof. If $r(E \setminus S) = r(E)$, then $E \setminus S$ contains a base of $M$, hence $|E \setminus S| \ge r(E)$ and $|S| \le n - r(E)$, so
\[
\frac{|S|}{n - r(E) + 1} < 1 \le r(S).
\]
If $r(E \setminus S) < r(E)$, write $x = r(E) - r(E \setminus S) \ge 1$ and note that $x \le |S|$. Since $|E \setminus S| \ge r(E \setminus S)$, we have $n - r(E) + 1 \ge |S| - x + 1$, and hence
\[
\frac{|S|}{n - r(E) + 1} \le \frac{|S|}{|S| - x + 1} \le x = r(E) - r(E \setminus S) \quad (4.30)
\]
\[
\le r(S), \quad (4.31)
\]
where Inequality (4.30) follows as $(k-1)/(k-x) \le x$ for any $1 \le x \le k-1$ (applied with $k = |S| + 1$). Inequality (4.31) follows by the submodularity of the matroid rank function.
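The elementary inequality used in (4.30), and Claim 4.4 in the special case of the $s$-uniform matroid (where $r(S) = \min\{|S|, s\}$), can be checked numerically; a minimal sketch with hypothetical helper names:

```python
def check_key_inequality():
    # (k-1)/(k-x) <= x for all 1 <= x <= k-1, used in Inequality (4.30);
    # it is equivalent to (x-1)*(k-1-x) >= 0 on that range.
    for k in range(2, 40):
        for j in range(0, 101):
            x = 1.0 + (k - 2) * j / 100.0   # grid over [1, k-1]
            assert (k - 1) / (k - x) <= x + 1e-9
    return True

def check_claim_uniform(n, s):
    # Claim 4.4 for the s-uniform matroid on n elements:
    # |S| / (n - r(E) + 1) <= r(S) = min(|S|, s) for every nonempty S.
    for size in range(1, n + 1):
        assert size / (n - s + 1) <= min(size, s) + 1e-9
    return True
```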

Claim 4.5. Let $y_0$ be the fractional base of matroid $M = (E, \mathcal{I})$ which minimizes the expression $\sum_{e \in E} -\ln\left(y_{e,0} + \frac{1}{e^{\eta\alpha}-1}\right)$. Then for each $e \in E$:
\[
y_{e,0} \ge \frac{1}{\gamma_M}.
\]

Proof. Assume for contradiction that there exists $e \in E$ such that $y_{e,0} < \frac{1}{\gamma_M}$, and let $T$ denote the minimal tight set which contains $e$, with respect to $y_0$. A set $S \subseteq E$ is tight with respect to a fractional solution $y$ if $\sum_{f \in S} y_f = r(S)$, and it is well known that if $S_1$ and $S_2$ are tight, then so are $S_1 \cap S_2$ and $S_1 \cup S_2$. The proof follows immediately from the submodularity of the matroid rank function:
\[
r(S_1 \cap S_2) + r(S_1 \cup S_2) \ge \sum_{f \in S_1 \cap S_2} y_f + \sum_{f \in S_1 \cup S_2} y_f = \sum_{f \in S_1} y_f + \sum_{f \in S_2} y_f = r(S_1) + r(S_2) \ge r(S_1 \cap S_2) + r(S_1 \cup S_2).
\]
The existence of a minimal tight set $T$ is guaranteed by the latter property and the fact that $E$ is tight, by simply taking the intersection of all tight sets which contain $e$. By the definition of the matroid density $\gamma_M$, we have that $y_{e,0} < \frac{1}{\gamma_M} \le \frac{r(T)}{|T|}$. Then, since $\sum_{f \in T} y_{f,0} = r(T)$, there exists $e' \in T$ such that $y_{e',0} > \frac{r(T)}{|T|} > y_{e,0}$. Therefore, we can increase $y_{e,0}$ by a small amount $\varepsilon$ and decrease $y_{e',0}$ by a small amount $\varepsilon$, and remain in the matroid base polytope. This is true as all of the matroid constraints remain satisfied: for any $T'$ such that $e' \in T'$ and $e \notin T'$, the value of $\sum_{f \in T'} y_{f,0}$ can only decrease, implying that its corresponding matroid constraint remains satisfied. For any $T'$ such that $e, e' \in T'$, the value of $\sum_{f \in T'} y_{f,0}$ remains unchanged, implying that the corresponding matroid constraint is satisfied. Finally, for any $T'$ such that $e \in T'$ and $e' \notin T'$, the set $T'$ is not tight (otherwise $T' \cap T \subsetneq T$ would be tight, in contradiction to the minimality of $T$), so after increasing the value of $y_{e,0}$ by a sufficiently small $\varepsilon$, the corresponding matroid constraint remains satisfied.
We note that the function $-\ln\left(x + \frac{1}{e^{\eta\alpha}-1}\right)$ is strictly convex in $x$ when $x > 0$, and thus by choosing $\varepsilon \le \frac{1}{2}\left(y_{e',0} - y_{e,0}\right)$ and increasing $y_{e,0}$ and decreasing $y_{e',0}$ accordingly, we obtain a new feasible solution for which the value of $\sum_{e \in E} -\ln\left(y_{e,0} + \frac{1}{e^{\eta\alpha}-1}\right)$ strictly decreases. This contradicts the fact that $y_0$ minimizes the latter expression.

Claim 4.6. For any matroid $M = (E, \mathcal{I})$ and element $e \in E$, the Lagrangian dual variables satisfy $a_t - \sum_{S:\, e \in S} a_{S,t} \ge 0$.

Proof. Every $y \in B(M)$ satisfies $\sum_{e \in E} y_e = r(E)$. Therefore, we can replace each matroid packing constraint in $B(M)$ with a covering constraint to obtain the following alternative

formulation of $B(M)$:
\[
\sum_{e \in E} y_e = r(E), \quad (4.32)
\]
\[
\forall S \subseteq E: \quad \sum_{e \in S} y_e \ge r(E) - r(E \setminus S). \quad (4.33)
\]
Observe that for $S = E$, Constraint (4.33) requires $\sum_{e \in E} y_e \ge r(E)$, while Constraint (4.32) requires equality. Next, we prove that when minimizing the objective function (4.7) in Algorithm 4.3 over all $x$ satisfying (4.32)–(4.33), Constraint (4.32) is redundant. That is, minimizing (4.7) subject to Constraints (4.33) alone gives a solution $y$ which satisfies Equation (4.32), and since (4.7) is strictly convex, $y$ is unique and equals the solution obtained in Algorithm 4.3. As a result, we can view the optimization problem in each iteration $t \ge 1$ as convex minimization subject solely to covering constraints, and by defining a Lagrangian variable $\beta_{S,t}$ for each covering constraint we derive the following KKT optimality conditions:
\[
\forall S \subseteq E: \quad \sum_{e \in S} y_{e,t} - \left( r(E) - r(E \setminus S) \right) \ge 0, \quad (4.34)
\]
\[
\forall S \subseteq E: \quad \beta_{S,t} \ge 0, \quad (4.35)
\]
\[
\forall S \subseteq E: \quad \beta_{S,t} \left( \sum_{e \in S} y_{e,t} - r(E) + r(E \setminus S) \right) = 0, \quad (4.36)
\]
\[
\forall e \in E: \quad c_{e,t} + \frac{1}{\eta} \ln \frac{y_{e,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t-1} + \frac{1}{e^{\eta\alpha}-1}} - \sum_{S:\, e \in S} \beta_{S,t} = 0. \quad (4.37)
\]
Hence, combining Conditions (4.27) and (4.35), (4.37), we conclude that for any $e \in E$,
\[
a_t - \sum_{S:\, e \in S} a_{S,t} = \sum_{S:\, e \in S} \beta_{S,t} \ge 0.
\]
To prove property (4.32), we denote by $S^* \subseteq E$ the subset of all tight elements with respect to $y_t$; that is, $e \in S^*$ if there exists $S \subseteq E$ with $e \in S$ for which Constraint (4.33) holds with equality. Since the matroid rank function is submodular, the function $\hat{r}(T) = r(E) - r(E \setminus T)$ is supermodular. Hence, similarly to the statement in the proof of Claim 4.5, it is easy to see that if two subsets $T_1$ and $T_2$ are tight, then so are $T_1 \cap T_2$ and $T_1 \cup T_2$; in particular, $S^*$ is tight, implying $\sum_{e \in S^*} y_{e,t} = r(E) - r(E \setminus S^*)$. Furthermore, by the minimality of $y_t$, we note that $y_{e,t} \le y_{e,t-1}$ for any $e \notin S^*$. This is true since otherwise, as $e$ is not tight, we could decrease $y_{e,t}$ by a sufficiently small amount without violating any covering constraint, and consequently decrease the objective value, since the expression $c_{e,t} x_e + \frac{1}{\eta}\left[\left(x_e + \frac{1}{e^{\eta\alpha}-1}\right) \ln \frac{x_e + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t-1} + \frac{1}{e^{\eta\alpha}-1}} - x_e\right]$ is

strictly increasing in $x_e$ for any $x_e > y_{e,t-1}$. Therefore we get,
\[
\sum_{e \in E} y_{e,t} = \sum_{e \in S^*} y_{e,t} + \sum_{e \in E \setminus S^*} y_{e,t}
= r(E) - r(E \setminus S^*) + \sum_{e \in E \setminus S^*} y_{e,t}
\le r(E) - r(E \setminus S^*) + \sum_{e \in E \setminus S^*} y_{e,t-1}
\le r(E) - r(E \setminus S^*) + r(E \setminus S^*) = r(E),
\]
as required (here $S^*$ is the set of all tight elements with respect to $y_t$).

We now turn to derive Theorem 4.3. As in the case of Theorem 4.2, we focus on proving Inequalities (4.8) and (4.9). We first construct a dual solution to the offline problem (D) using the values that are obtained from the KKT optimality conditions. We then show that the primal and dual solutions obtained are feasible, and finally we prove that the dual we constructed can pay for both the movement and the service cost of the online algorithm. To construct the dual, we simply assign the values obtained from the optimality conditions to $a_t, a_{S,t}$ in (D), and we define
\[
b_{e,t+1} = \frac{1}{\eta} \ln \frac{1 + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t} + \frac{1}{e^{\eta\alpha}-1}}.
\]

Primal (P) is feasible: By construction, the algorithm's initialization assigns a feasible $y_0$, and $y_t$ is feasible by Conditions (4.23), (4.24). Setting $z_{e,t} = \max\{0, y_{e,t} - y_{e,t-1}\}$ yields a feasible primal solution.

Dual (D) is feasible: By Claim 4.5, initially $y_{e,0} \ge \frac{1}{\gamma_M}$, implying
\[
b_{e,1} \le \frac{1}{\eta} \ln \frac{1 + \frac{1}{e^{\eta\alpha}-1}}{\frac{1}{\gamma_M} + \frac{1}{e^{\eta\alpha}-1}} = \alpha - \frac{1}{\eta} \ln \frac{e^{\eta\alpha} + \gamma_M - 1}{\gamma_M}.
\]
Therefore, we can set $a_{E,0} = \alpha - \frac{1}{\eta} \ln \frac{e^{\eta\alpha} + \gamma_M - 1}{\gamma_M}$, and $a_0 = a_{S,0} = 0$ for all $S \ne E$, and we get that the first dual constraint is satisfied. Also, by the definition of the dual variables, we note that for any $t \ge 1$ and $e$,
\[
b_{e,t+1} - b_{e,t} = -\frac{1}{\eta} \ln \frac{y_{e,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t-1} + \frac{1}{e^{\eta\alpha}-1}} = c_{e,t} - a_t + \sum_{S:\, e \in S} a_{S,t},
\]
where the last equality follows from (4.27). Finally, $0 \le b_{e,t+1} = \frac{1}{\eta} \ln \frac{1 + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t} + \frac{1}{e^{\eta\alpha}-1}} \le \alpha$, which follows since $0 \le y_{e,t} \le 1$, and $a_t, a_{S,t} \ge 0$ by Condition (4.25).

Primal-dual relation: Let $D_t$ be the change in the cost of the dual solution at time $t$. We bound the cost of the algorithm in each iteration by the change in the cost of the dual.

Bounding the movement cost at time $t$: Let $M_t$ be the movement cost at time $t$. As we charge our algorithm only for increasing the fractional value of the elements, we get,
\[
M_t = \sum_{e:\, y_{e,t} > y_{e,t-1}} (y_{e,t} - y_{e,t-1})
\le \sum_{e:\, y_{e,t} > y_{e,t-1}} \left(y_{e,t} + \frac{1}{e^{\eta\alpha}-1}\right) \ln \frac{y_{e,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t-1} + \frac{1}{e^{\eta\alpha}-1}} \quad (4.38)
\]
\[
= \eta \sum_{e:\, y_{e,t} > y_{e,t-1}} \left(y_{e,t} + \frac{1}{e^{\eta\alpha}-1}\right) \left( a_t - \sum_{S:\, e \in S} a_{S,t} - c_{e,t} \right) \quad (4.39)
\]
\[
\le \eta \sum_{e \in E} y_{e,t} \left( a_t - \sum_{S:\, e \in S} a_{S,t} \right) + \frac{\eta}{e^{\eta\alpha}-1} \sum_{e \in E} \left( a_t - \sum_{S:\, e \in S} a_{S,t} \right) \quad (4.40)
\]
\[
= \eta \left( r(E)\, a_t - \sum_{S \subseteq E} r(S)\, a_{S,t} \right) + \frac{\eta}{e^{\eta\alpha}-1} \left( n\, a_t - \sum_{S \subseteq E} |S|\, a_{S,t} \right) \quad (4.41)
\]
\[
\le \eta \left( 1 + \frac{n - r(E) + 1}{e^{\eta\alpha}-1} \right) D_t. \quad (4.42)
\]
Inequality (4.38) follows as $a - b \le a \ln(a/b)$ for any $a, b > 0$. Equality (4.39) follows from Condition (4.27). Inequality (4.40) follows as $y_{e,t}$ and $c_{e,t}$ are nonnegative, and by Claim 4.6. Equality (4.41) follows from Conditions (4.23) and (4.26). Inequality (4.42) follows since
\[
n\, a_t - \sum_{S \subseteq E} |S|\, a_{S,t} \le (n - r(E) + 1) \left( r(E)\, a_t - \sum_{S \subseteq E} r(S)\, a_{S,t} \right),
\]
which follows as,
\[
\frac{n\, a_t - \sum_{S \subseteq E} |S|\, a_{S,t}}{n - r(E) + 1}
= \frac{n}{n - r(E) + 1} \left( a_t - \sum_{S \subseteq E} a_{S,t} \right) + \sum_{S \subseteq E} \frac{|E \setminus S|}{n - r(E) + 1}\, a_{S,t}
\]
\[
\le r(E) \left( a_t - \sum_{S \subseteq E} a_{S,t} \right) + \sum_{S \subseteq E} \left( r(E) - r(S) \right) a_{S,t} = r(E)\, a_t - \sum_{S \subseteq E} r(S)\, a_{S,t}, \quad (4.43)
\]
where Inequality (4.43) is implied by Claim 4.4 and Claim 4.6. Thus,
\[
M \le \sum_{t=1}^{T} \eta \left( 1 + \frac{n - r(E) + 1}{e^{\eta\alpha}-1} \right) D_t
= \eta \left( 1 + \frac{n - r(E) + 1}{e^{\eta\alpha}-1} \right) \left( D - r(E)\, a_0 + \sum_{S \subseteq E} r(S)\, a_{S,0} \right),
\]

and by simplifying we get,
\[
M \le \eta \left( 1 + \frac{n - r(E) + 1}{e^{\eta\alpha}-1} \right) D + \left( 1 + \frac{n - r(E) + 1}{e^{\eta\alpha}-1} \right) r(E) \left( \eta\alpha - \ln \frac{e^{\eta\alpha} + \gamma_M - 1}{\gamma_M} \right)
\]
\[
= \left( 1 + \frac{n - r(E) + 1}{e^{\eta\alpha}-1} \right) \left( \eta D + r(E) \ln \frac{\gamma_M\, e^{\eta\alpha}}{e^{\eta\alpha} + \gamma_M - 1} \right)
\le \left( 1 + \frac{n - r(E) + 1}{e^{\eta\alpha}-1} \right) \left( \eta D + r(E) \ln \gamma_M \right).
\]

Bounding the service cost: Finally, we wish to bound the service cost $S_1$ paid by the algorithm. First,
\[
S_1 = \sum_{t=1}^{T} \langle y_t, c_t \rangle = \sum_{t=1}^{T} \sum_{e \in E} c_{e,t}\, y_{e,t}
= \sum_{t=1}^{T} \sum_{e \in E} y_{e,t} \left( a_t - \sum_{S:\, e \in S} a_{S,t} \right) - \frac{1}{\eta} \sum_{t=1}^{T} \sum_{e \in E} y_{e,t} \ln \frac{y_{e,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t-1} + \frac{1}{e^{\eta\alpha}-1}},
\]
where the last equality follows from Condition (4.27). Next, by Conditions (4.23) and (4.26) we have,
\[
S_1 = \sum_{t=1}^{T} D_t - \frac{1}{\eta} \sum_{t=1}^{T} \sum_{e \in E} \left( y_{e,t} + \frac{1}{e^{\eta\alpha}-1} \right) \ln \frac{y_{e,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t-1} + \frac{1}{e^{\eta\alpha}-1}} + \frac{1}{\eta(e^{\eta\alpha}-1)} \sum_{e \in E} \sum_{t=1}^{T} \ln \frac{y_{e,t} + \frac{1}{e^{\eta\alpha}-1}}{y_{e,t-1} + \frac{1}{e^{\eta\alpha}-1}},
\]
and using the fact that $a - b \le a \ln(a/b)$ for any $a, b \ge 0$, and telescopic summation, we get,
\[
S_1 \le \sum_{t=1}^{T} D_t - \frac{1}{\eta} \sum_{e \in E} \left( y_{e,T} - y_{e,0} \right) + \frac{1}{\eta(e^{\eta\alpha}-1)} \sum_{e \in E} \ln \frac{y_{e,T} + \frac{1}{e^{\eta\alpha}-1}}{y_{e,0} + \frac{1}{e^{\eta\alpha}-1}}. \quad (4.44)
\]
Simplifying, we get,
\[
S_1 \le \sum_{t=1}^{T} D_t = D - r(E)\, a_0 + \sum_{S \subseteq E} r(S)\, a_{S,0} = D + r(E) \left( \alpha - \frac{1}{\eta} \ln \frac{e^{\eta\alpha} + \gamma_M - 1}{\gamma_M} \right) \quad (4.45)
\]
\[
\le D + \frac{r(E) \ln \gamma_M}{\eta}.
\]

Inequality (4.44) follows by telescopic summation. Inequality (4.45) follows since $\sum_{e \in E} y_{e,0} = \sum_{e \in E} y_{e,T} = r(E)$, and by the definition of $y_0$ as the minimizer of the expression $\sum_{e \in E} -\ln\left(x_e + \frac{1}{e^{\eta\alpha}-1}\right)$ over all fractional bases of $M$. Since $D \le \mathrm{opt}(\alpha)$, Inequalities (4.8) and (4.9) in Theorem 4.3 follow.
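For the special case of the $s$-uniform matroid (the $s$-sparse subsets of Section 4.3), whose base polytope is simply $\{y : 0 \le y_e \le 1,\ \sum_e y_e = s\}$, the regularized step (4.7) reduces to a clipped shifted multiplicative update. The sketch below uses hypothetical names, and assumes the clipped closed form solves (4.7) for this polytope (it follows from the KKT conditions of a separable strictly convex objective over a box plus one sum constraint); the normalizer is found by binary search, and the iterate provably stays in the base polytope:

```python
import math

def uniform_matroid_step(y_prev, c, s, eta, alpha):
    """One regularized step over the s-uniform matroid base polytope:
    y_e = clip((y_prev_e + delta) * exp(-eta * (c_e - a)) - delta, 0, 1),
    with a chosen so that sum_e y_e = s."""
    delta = 1.0 / (math.exp(eta * alpha) - 1.0)
    w = [(y + delta) * math.exp(-eta * ci) for y, ci in zip(y_prev, c)]

    def total(a):
        f = math.exp(eta * a)
        return sum(min(1.0, max(0.0, wi * f - delta)) for wi in w)

    lo, hi = 0.0, 1.0
    while total(hi) < s:        # total(a) is nondecreasing, and -> n >= s
        hi *= 2.0
    for _ in range(200):        # binary search for the normalizer
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) < s else (lo, mid)
    f = math.exp(eta * 0.5 * (lo + hi))
    return [min(1.0, max(0.0, wi * f - delta)) for wi in w]
```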


Chapter 5

Restricted Caching and Matroid Caching

We study the online restricted caching problem, where each memory item can be placed in only a restricted subset of cache locations. We solve this problem through a more general online caching problem in which the cache is subject to matroid constraints. Our main result is an $O(\min\{d, \log r\} \cdot \log c)$-competitive algorithm for the matroid caching problem, where $r$ and $c$ are the rank and circumference of the matroid, and $d$ is the diameter of an auxiliary graph defined over it. In general, this result guarantees $O(\log^2 k)$-competitiveness for any restricted cache of size $k$, independently of its structure. In addition, we study the special case of the $(n, \ell)$-companion caching problem [BETW01]. For companion caching we prove that our algorithm achieves an optimal competitive factor of $O(\log n + \log \ell)$, improving on previous results of [FMS02].

5.1 Introduction

Caches are key components in modern computer and networking architectures. Designing efficient caching (or paging) policies is a fundamental online optimization problem with multiple applications. Take, for example, the classical two-level memory system, consisting of a slow memory of infinite size and a small fast memory (the cache). The input is a sequence of page requests which are satisfied one by one. If a page $p$ being requested is already in the cache, then no action is required and no cost is incurred. Otherwise, page $p$ must be brought from the slow memory to the cache (a page fault), incurring some fetching cost, and possibly requiring the eviction of another page from the cache. The objective is to minimize the total fetching cost by wisely choosing which pages to evict.

In recent years significant progress has taken place in the areas of parallel and distributed computing, as well as in local and web storage, giving rise to new, more complex, cache architectures. One example is a multi-core processor, which has become the dominant processor

architecture today. In a multi-core processor, every core has a private fast cache, and in addition all the cores share a fully associative cache. Despite extensive research on paging problems, algorithms and performance guarantees for common real-life models such as multi-core processors are not yet fully understood. Web storage via content distribution networks provides another example of a non-traditional cache architecture. A content distribution network (CDN) is a large distributed system of servers deployed in multiple data centers over the web. The goal is to provide end-users fast and easy access to content from various devices and locations. Quite a few public companies offer such services nowadays, e.g., Akamai, Microsoft's Azure, and Amazon's CloudFront. The cache in this context can be a set of servers (e.g., in a data center) that maintains content and provides fast service to a set of end-users. Typically, placing content on this set of servers needs to adhere to various restrictions and constraints; e.g., not every piece of content can be placed on every server. Suppose an online company wishes to hold private, possibly encrypted, content via a CDN. It may be the case that, due to digital rights management or encryption schemes, content belonging to a particular company can only be located on certain dedicated servers which, naturally, have limited capacity. Content which is not placed on these servers will be fetched upon request from a distance, incurring a cache miss. Given that massive amounts of information are involved, management of content on the cache is of utmost importance for the end-user experience. We model this kind of constraint on cache architectures via the restricted caching model, defined by Brehob et al. [BETW01]. The idea is that pages (i.e., content) can only be placed in a restricted set of cache locations (i.e., servers).
Thus, the sets of legal cache locations for two distinct pages may not be identical, though they may have a non-empty intersection. This is in contrast to traditional fully associative caches, where all cache locations are identical and pages can be located anywhere in the cache. Restricted caches are sometimes referred to as having arbitrary associativity. As shown by [Pes03], various algorithms for fully associative caching can result in very poor performance in some settings with arbitrary associativity. One can hope to design general algorithms that can cope with this extended family of cache architectures. A hybrid cache architecture model that interests us in particular is the companion cache, a simple restricted caching model which includes victim caches and assist caches as special cases. A companion cache architecture has two components: a fully associative shared cache of size n and m private caches of size l. The private caches can store items corresponding to different types (e.g., users, locations, file types, etc.), while the fully associative cache can store items of any type. Companion caching was first considered by [BETW01] and further studied by [FMS02, EvS07]. A schematic description of a companion cache is presented in Figure 5.1.

5.1.1 Our Results

The starting point of our work is the key observation that the restricted caching problem is captured by the more general problem of maintaining over time an independent set of a matroid,

[Figure 5.1: An (n, l)-companion cache: a shared n-size fully associative cache and multiple l-size private caches, one per type (type 1, ..., type m).]

with respect to an online sequence of element requests. That is, at any point of time the online algorithm has to maintain an independent set of elements which includes the currently requested element. We call this problem the matroid caching problem (a formal definition of the problem is given in Section 5.2). Surprisingly, although the generalization itself is fairly simple, exploiting matroid properties turns out to be more convenient and powerful, and enables us to both improve and generalize previous results. We introduce a general randomized algorithm for the online matroid caching problem on a matroid M, consisting of two components. The first component is a fractional online O(log c(M))-competitive algorithm, where c(M) is the circumference (largest circuit) of M. The online algorithm and its analysis exploit the matroid properties, and obtain this improved upper bound based on primal-dual LP techniques developed in competitive analysis. The second component is a randomized rounding scheme which integrates two online rounding algorithms. The first algorithm maintains for every fractional solution a distribution on integral matroid bases (see, e.g., [BBN12a]). The main difficulty is to wisely update this distribution after every change in the fractional solution. These updates are done by reducing the problem to finding shortest paths in an auxiliary graph defined on the matroid. This auxiliary graph was first introduced by Cunningham [Cun84] for the purpose of determining in strongly polynomial time whether a point is inside a matroid polyhedron. By using the fact that we maintain a feasible fractional solution, and by proving additional properties of the auxiliary graph, we obtain a rounding algorithm which loses a factor of d_G(M), the diameter of the auxiliary graph.
The second algorithm is an O(log r(M))-approximate rounding scheme, recently obtained by [GTW14]. The idea behind this algorithm is to maintain spanning sets of M in every iteration, and then transform them into bases without incurring any additional loss. Combining the two components we get our main theorem. To our knowledge, this is the first randomized competitive algorithm for general restricted caching.

Theorem 5.1. There is an O(min{d_G(M), log r(M)}·log c(M))-competitive algorithm for matroid caching, where r(M) and c(M) are the rank and circumference of matroid M, and d_G(M) is the diameter of an auxiliary graph of M.

We remark that c(M) ≤ r(M) + 1, and thus Theorem 5.1 guarantees an O(log² k)-approximation for any restricted cache of size k, independently of its structure. Nevertheless, in many cases c(M) can be much smaller. For example, in graphic matroids the longest cycle in a graph can be much smaller than the total number of vertices. As for the diameter of the auxiliary graph, we show an example for which d_G(M) = Ω(r(M)). This is essentially a worst-case example, since we prove that d_G(M) ≤ r(M) + 1. However, in some interesting cases of restricted caching the diameter can be much smaller, even a constant. Specifically, for companion caching we show:

Theorem 5.2. For the companion caching matroid, c(M) = min{m, n + 1}·l + n + 1 and d_G(M) = 3. Thus, our algorithm is O(log n + log l)-competitive.

This result improves on the randomized upper bound of O(log n · log l) of [FMS02], and is optimal as it matches their lower bound of Ω(log n + log l). In Chapter 4, under a unified framework of online learning and competitive analysis, for a matroid M = (E, I) we considered the online problem of maintaining matroid bases over time, incurring both movement and service costs, and obtained an O(log(|E|/r(M)))-approximation via regularization and primal-dual analysis. In this chapter we exploit the ideas presented in Chapter 4; however, our approach differs in several respects. First, Algorithm 4.3 does not handle constraints such as forcing elements to be included in the matroid base, as in caching;¹ second, there we only obtained a fractional solution to the problem; third, instead of applying regularization, we present an alternative primal-dual algorithm which performs multiplicative updates. These modifications allow us to handle element requests and also improve on the analysis of the previous chapter.

Related Work: Caching over a fully associative cache, known as the paging problem, was introduced by [Bel66].
Sleator and Tarjan [ST85] showed that any deterministic algorithm is at least k-competitive, and also proved that LRU is exactly k-competitive. When randomization is allowed, [FKL+91] gave a tight O(log k)-competitive algorithm for this well-studied problem (later improved by [MS91, ACN00]). Restricted caching, defined by [BETW01], is a generalization of the paging problem to cache architectures which cannot be modeled as set-associative caches. There are very few results on this generalized setting (see [Pes03] for example); however, some specific restricted cache architectures have been studied. In particular, [BETW01] introduced the companion caching problem, further studied by [FMS02, Pes03, BEWT04, EvS07]. Primal-dual analysis is a fundamental approach for tackling online optimization problems [AAA+09, BN09a, BBN10], and has specifically been used for the design of caching algorithms [BBN12a, BBN12b, ACER12]. Recently, Gupta et al. [GTW14] also studied the problem of maintaining matroid bases online, giving an O(log |E|)-competitive fractional algorithm using a primal-dual approach. Similarly to Chapter 4, their algorithm cannot handle constraints such as forcing elements to be included in the matroid base. Interestingly, [GTW14] gives an O(log r(M))-approximation for rounding a fractional solution online. They further generalize the rounding to the weighted version and show that their result is tight. We note that their hardness result does not hold for the unweighted version, and specifically not for some interesting special cases, e.g., uniform matroids, partition matroids, and some restricted cache models.

¹Note that forcing inclusion via, e.g., a series of steps that incur small costs on other pages until the element is completely in the base, changes the objective function and therefore cannot be performed in that manner.

5.2 Definitions and Problem Formulation

In the restricted caching problem we are given a set E of n pages and a set S of available memory slots. Every page p can be located in some subset of S. In every time step t we are given a request for a page p_t. If the page is already in the cache, no cost is incurred. Otherwise, the page must be fetched from a slower memory to one of its feasible slots, incurring a cost of one unit. If the slot is not empty, the algorithm can reorganize the cache for free by either moving pages around between feasible slots or by evicting them. We assume that reorganization is free, since the time required for fetching a page from slower memory dominates by several orders of magnitude the time for moving pages inside the cache [BETW01, FMS02]. Thus, the goal is to minimize the total cost, i.e., the number of pages that are fetched. Companion caching [FMS02] is a special case of restricted caching which is a hybrid of two classic cache architectures: an l-way set-associative cache and a fully associative cache. There are m caches of size l and a single cache of size n. There are m types of pages, where a page of type i can be stored either in the i-th associative cache of size l, or in the fully associative cache of size n. We solve the restricted caching problem through a more general problem on matroids.
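As a concrete illustration of the companion-cache structure just described, the following small Python sketch (the helper name and slot labels are ours, purely illustrative) builds, for each page type, its set of legal cache slots:

```python
def companion_cache_slots(m, l, n):
    """Slot sets of an (n, l)-companion cache with m page types.

    Slots 'p{i}_{j}' are the l slots of the i-th private cache;
    slots 's{j}' are the n slots of the shared fully associative cache.
    Returns a dict mapping page type -> set of legal slot names.
    """
    shared = {f"s{j}" for j in range(n)}
    return {i: {f"p{i}_{j}" for j in range(l)} | shared for i in range(m)}

slots = companion_cache_slots(m=3, l=2, n=4)
# A page of type 0 may go to its own private cache or to the shared cache:
assert len(slots[0]) == 2 + 4
# Private slots of distinct types are disjoint; only the shared slots are common:
assert slots[0] & slots[1] == {f"s{j}" for j in range(4)}
```

Each type thus sees its l private slots plus the n shared slots, and any two distinct types overlap exactly on the shared cache, which is what makes companion caching a restricted (rather than fully associative) cache.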
As mentioned in Chapter 4, matroids are extremely useful combinatorial objects that play an important role in combinatorial optimization. We refer the reader to Section 2.2 for a brief description of the important matroid properties. Recall that a matroid M = (E, I) is defined over a finite set E of elements, together with a non-empty collection I of subsets of E, called independent sets. Three polytopes associated with a matroid M are the matroid polytope P(M), the matroid base polytope B(M), and the spanning set polytope P_ss(M) (see [Edm70, Sch03]). P(M) is the convex hull of the incidence vectors of the independent sets of M. Similarly, B(M) is the convex hull of the incidence vectors of the bases of M, and P_ss(M) is the convex hull of the incidence vectors of the spanning sets of M. Transversal matroids are one interesting example of matroids. Let A = (A_1, A_2, ..., A_n) be a family of subsets of a finite set E. A set T is called a transversal of A if there exist distinct elements a_1 ∈ A_1, a_2 ∈ A_2, ..., a_n ∈ A_n such that T = {a_1, a_2, ..., a_n}. A partial transversal is a transversal of some subfamily A_{i_1}, A_{i_2}, ..., A_{i_k} of A. Let I be the collection of all partial transversals of A. Then M = (E, I) is a matroid. The bases of this matroid are the inclusion-wise maximal partial transversals, and it follows from König's matching theorem that the rank function

(P)  min  Σ_{t=1}^{T} y_{p_t,t} + Σ_{t=1}^{T} Σ_{p∈E} z_{p,t}
     s.t. Σ_{p∈S} y_{p,t} ≥ |S| − r(S)          ∀ t ≥ 1, S ⊆ E
          z_{p,t} ≥ y_{p,t−1} − y_{p,t}          ∀ t ≥ 1, p ∈ E
          z_{p,t}, y_{p,t} ≥ 0                   ∀ t, p ∈ E

(D)  max  Σ_{t=0}^{T} Σ_{S⊆E} (|S| − r(S))·a_{S,t}
     s.t. b_{p,t+1} ≥ b_{p,t} + Σ_{S: p∈S} a_{S,t}   ∀ t ≥ 1, p ≠ p_t
          b_{p,1} ≥ Σ_{S: p∈S} a_{S,0}               ∀ p ∈ E
          b_{p,t} ≤ 1                                 ∀ t, p ∈ E
          b_{p,t}, a_{S,t} ≥ 0                        ∀ t, p ∈ E, S ⊆ E

Figure 5.2: The primal and dual LP formulations for the matroid caching problem.

r of the transversal matroid induced by A is given by r(S) = min_{T⊆S} ( |S ∖ T| + |{i : A_i ∩ T ≠ ∅}| ) for S ⊆ E. It is not hard to see that we can model restricted caching using a transversal matroid. For every page p ∈ E there is a subset of cache slots to which it can be assigned. Equivalently, for every cache slot s_i, let P(s_i) denote the subset of pages in E that can be assigned to it. Let P = (P(s_1), P(s_2), ..., P(s_k)) be the page subsets for all cache slots. Then, every subset of pages T inducing a valid assignment of pages to the cache is a partial transversal of P. Let I be the collection of all valid cache assignments. Then M = (E, I) is a matroid. We thus define a general caching problem on any matroid M = (E, I). In the matroid caching problem the online algorithm must maintain at any time t an independent set S_t ∈ I (the cache) of the matroid. At any time t, upon receiving a request for an element of the matroid (page) p_t: if p_t ∈ S_{t−1}, no cost is incurred; otherwise, p_t must be added to S_{t−1}, paying one unit. If S_{t−1} ∪ {p_t} is dependent, elements must be removed (evicted) so that S_t becomes independent. We can formulate the matroid caching problem as follows. Let the variable y_{p,t} denote the fraction of page p ∈ E missing from the cache at time t. This means that at any time t, 1 − y_{p_t,t} = 1 and 1 − y_t ∈ P(M). The fetching cost of page p is max{0, y_{p,t−1} − y_{p,t}}. Figure 5.2 contains the linear relaxation of this formulation (P), as well as the corresponding dual program (D). Both programs play a central role in our analysis. We define D as the value of the dual program, which is a lower bound on the value of any primal solution.
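Since the cache matroid is transversal, its rank function can be evaluated by bipartite matching. The following self-contained sketch (illustrative only; it is not part of the thesis algorithms) computes r(S) by augmenting paths and brute-force checks it against the König-type expression min_{T⊆S} ( |S ∖ T| + |{i : A_i ∩ T ≠ ∅}| ) stated above:

```python
from itertools import chain, combinations

def rank(pages, allowed):
    """Rank of a page set in the transversal matroid: the size of a maximum
    page-to-slot matching, computed by augmenting paths."""
    match = {}  # slot -> page currently matched to it
    def augment(p, seen):
        for s in allowed[p]:
            if s not in seen:
                seen.add(s)
                if s not in match or augment(match[s], seen):
                    match[s] = p
                    return True
        return False
    return sum(augment(p, set()) for p in pages)

def formula(pages, allowed):
    """r(S) = min over T <= S of |S \\ T| + #{slots whose page set meets T}."""
    subsets = chain.from_iterable(combinations(pages, k)
                                  for k in range(len(pages) + 1))
    return min(len(pages) - len(T) + len(set().union(*(allowed[p] for p in T)))
               for T in subsets)

# Tiny restricted cache: pages a, b, c; slots 1, 2; a and b fit only in slot 1.
allowed = {"a": {1}, "b": {1}, "c": {1, 2}}
S = ["a", "b", "c"]
assert rank(S, allowed) == 2       # {a, b, c} is dependent: a and b collide
assert formula(S, allowed) == 2    # the min-formula agrees with the matching
```

Here the minimizing T is {a, b}: |S ∖ T| = 1 and only slot 1 meets T, matching the rank of 2.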
5.3 Main Algorithm

In this section we solve the fractional version of the matroid caching problem, proving the following theorem.

Theorem 5.3. There is an online algorithm with competitive ratio 2 log(1 + c(M)) for the fractional matroid caching problem on a matroid M, where c(M) is the circumference of M.

At a high level, the algorithm works as follows. Without loss of generality we assume that the cache is initially full (this can be achieved by requesting a sequence of r(M) independent dummy pages before the actual input sequence). The algorithm maintains a

Algorithm 5.1 Matroid Caching Algorithm
Initialize η = log 2. During execution, maintain the following relation between primal and dual variables: y_{p,t} = f(b_{p,t+1}) = (e^{η·b_{p,t+1}} − 1)/(e^η − 1).
Start with an empty cache (y_{p,0} = 1), and set b_{p,1} = 1 accordingly.
for t = 1, 2, ... do
  Let p_t be the currently requested page.
  Update step: Set y_{p_t,t} = b_{p_t,t+1} = 0.
  Normalization step: As long as Σ_{p∈E} y_{p,t} < |E| − r(E):
    1. Let S be the set of evictable pages.
    2. Update η ← max{ η, log(1 + |S|/(|S| − r(S))) }, and update b_{p,t+1} to keep y_{p,t} unchanged.
    3. For each p ∈ S, aside from p_t, update b_{p,t+1} ← b_{p,t+1} + a_{S,t} and y_{p,t} accordingly, where a_{S,t} is the smallest value such that some page p ∈ S becomes unevictable.
end for

solution y_t such that 1 − y_t is in the base polytope of M. Whenever a page p_t is requested, the algorithm updates y_{p_t,t} = 0. This generates a solution y_t whose complement is in the spanning set polytope of M, i.e., 1 − y_t ∈ P_ss(M). Next, the algorithm performs a sequence of steps which gradually evict other pages from the cache, making it feasible again. We refer to each such step as a normalization step. In each normalization step we consider the set S of all evictable pages. A page is evictable if we can increase y_{p,t} by an infinitesimally small value ε and remain in P_ss(M). It is well known that x ∈ P(M) iff 1 − x ∈ P_ss(M*), where M* is the dual matroid of M. Thus, we can use the latter condition to compute S efficiently. Let us consider the maximal dual tight set with respect to our current solution. A set T ⊆ E is tight if Σ_{p∈T} y_{p,t} = r*(T), and using the submodularity of r*, if T_1 and T_2 are tight, then so are T_1 ∪ T_2 and T_1 ∩ T_2. In particular, there is a maximal tight set T_max containing all pages whose value y_{p,t} cannot be increased without violating the dual matroid constraints. Therefore, in each normalization step we define the evictable set S as all pages which are not in a dual tight set, S = E ∖ T_max, and increase their value until an additional page joins a tight set.
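The primal-dual relation f used by Algorithm 5.1 can be sanity-checked numerically. The sketch below (not part of the algorithm itself, just a check of the properties the analysis relies on) verifies that f(b) = (e^{ηb} − 1)/(e^η − 1) maps b = 0 to y = 0 and b = 1 to y = 1, is increasing in b, and is decreasing in η for fixed b ∈ (0, 1) — the last fact is what makes the η-increase in line 2 raise b_{p,t+1}:

```python
import math

def f(b, eta):
    """Primal-dual relation of Algorithm 5.1: y = (e^(eta*b) - 1) / (e^eta - 1)."""
    return (math.exp(eta * b) - 1.0) / (math.exp(eta) - 1.0)

eta = math.log(2)  # the initial value used by the algorithm
assert f(0.0, eta) == 0.0 and abs(f(1.0, eta) - 1.0) < 1e-12

# f is monotonically increasing in b ...
bs = [i / 100 for i in range(101)]
assert all(f(b1, eta) < f(b2, eta) for b1, b2 in zip(bs, bs[1:]))

# ... and monotonically decreasing in eta for fixed b in (0, 1), so raising
# eta while keeping y fixed forces b upward, preserving dual feasibility.
assert f(0.5, eta) > f(0.5, eta + 1.0) > f(0.5, eta + 2.0)
```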
In general, the above condition can be checked in polynomial time by a reduction to submodular function minimization [Sch03, Ch. 40]. For transversal matroids, S can be computed using flow techniques. The sequence of normalization steps ends when all pages become tight. Algorithm 5.1 formally describes the procedure. We remark that the algorithm can easily be generalized to the weighted version, where fetching page p incurs a cost of w_p, by maintaining the following primal-dual relationship between the variables: y_{p,t} = f(b_{p,t+1}) = (e^{η·b_{p,t+1}/w_p} − 1)/(e^η − 1). We prove Theorem 5.3 using the following technical lemma on matroids.

Lemma 5.3.1. c(M) = max_{nonseparable A ⊆ E} |A| / (|A| − r(A)).

To prove the lemma, we rely on the following arithmetic property of mediants [Gut11].

Lemma 5.3.2 (generalized mediant inequality, [Ben13]). For any nonnegative numbers a_1, a_2, ..., a_n and positive numbers b_1, b_2, ..., b_n,

 min_{1≤i≤n} a_i/b_i ≤ (a_1 + a_2 + ... + a_n)/(b_1 + b_2 + ... + b_n) ≤ max_{1≤i≤n} a_i/b_i,

with equality only if a_i/b_i is constant.

Proof of Lemma 5.3.1. Let A be a minimum-cardinality set among the nonseparable sets achieving max_{nonseparable A ⊆ E} |A|/(|A| − r(A)). In addition to the fact that A is nonseparable, we note that A is a circuit. To see this, we represent A as a union of circuits: let C_0 be a largest circuit in A and let p_0 be a page in it. If A ∖ C_0 is not empty, choose a page p_1 ∈ A ∖ C_0. Since A is nonseparable, there is a circuit containing any two pages in A; let C_1 denote a circuit containing p_0, p_1. If A ∖ (C_0 ∪ C_1) is not empty, choose a page p_2 ∈ A ∖ (C_0 ∪ C_1), and let C_2 denote a circuit containing p_0, p_2. We continue with this process until we obtain A = C_0 ∪ C_1 ∪ C_2 ∪ ... ∪ C_j. We observe the following two properties for every 1 ≤ i ≤ j:

1. ∪_{l=0}^{i} C_l is nonseparable, since any two pages in it share a circuit (each of them shares a circuit with p_0).
2. |C_i ∖ ∪_{l=0}^{i−1} C_l| < |C_0|, since otherwise |C_i| > |C_0|, contradicting the maximality of C_0 in A.

Then, using the above properties, for every 1 ≤ i ≤ j we have

 |C_i ∖ ∪_{l=0}^{i−1} C_l| / ( r(∪_{l=0}^{i} C_l) − r(∪_{l=0}^{i−1} C_l) )
  ≥ |C_i ∖ ∪_{l=0}^{i−1} C_l| / ( |C_i ∖ ∪_{l=0}^{i−1} C_l| − 1 )      (5.1)
  ≥ |C_0| / (|C_0| − 1) = |C_0| / r(C_0).                              (5.2)

Inequality (5.1) follows from Property 1, which implies r(∪_{l=0}^{i−1} C_l) + r(C_i ∖ ∪_{l=0}^{i−1} C_l) > r(∪_{l=0}^{i} C_l). Inequality (5.2) follows from Property 2. This gives us

 ( r(∪_{l=0}^{i} C_l) − r(∪_{l=0}^{i−1} C_l) ) / |C_i ∖ ∪_{l=0}^{i−1} C_l| ≤ r(C_0) / |C_0|,

since x/(x − 1) is monotonically decreasing in x, and thus,

 |A| / (|A| − r(A))
  = Σ_{i=0}^{j} |C_i ∖ ∪_{l=0}^{i−1} C_l| / Σ_{i=0}^{j} ( |C_i ∖ ∪_{l=0}^{i−1} C_l| − ( r(∪_{l=0}^{i} C_l) − r(∪_{l=0}^{i−1} C_l) ) )   (5.3)
  ≤ max_{0≤i≤j} { |C_i ∖ ∪_{l=0}^{i−1} C_l| / ( |C_i ∖ ∪_{l=0}^{i−1} C_l| − ( r(∪_{l=0}^{i} C_l) − r(∪_{l=0}^{i−1} C_l) ) ) }   (5.4)
  ≤ |C_0| / (|C_0| − r(C_0)),

where (5.4) follows from the generalized mediant inequality. By the minimality of A we have that A = C_0. Since A is a circuit, we conclude that max_{nonseparable A ⊆ E} |A|/(|A| − r(A)) = max_{C ∈ circuits(M)} |C|/(|C| − r(C)) = max_{C ∈ circuits(M)} |C| = c(M).

Next, we are ready to prove Theorem 5.3.

Proof of Theorem 5.3. The analysis of the algorithm's performance is done using the primal-dual method.

Primal (P) is feasible: Clearly y_0 is feasible. By induction on the steps, we prove that the algorithm produces a feasible solution, i.e., 1 − y_t ∈ B(M). The update step sets y_{p_t,t} = 0. Then, in the normalization step the value of each y_{p,t} only grows. Note that as long as the primal solution is not feasible, 1 − y_t ∈ P_ss(M) but 1 − y_t ∉ B(M); hence not all pages are in a dual tight set and S is non-empty. After at most n iterations all pages become tight, and y_t is feasible.

Dual (D) is feasible: Since initially the cache is empty, and for each p, y_{p,0} = b_{p,1} = 1, then by setting a_{S,0} = 0 for all S ⊆ E, the first set of dual constraints is feasible. The primal solution is feasible, thus 0 ≤ y_{p,t} ≤ 1, so as we preserve the primal-dual relation we get 0 ≤ (e^{η·b_{p,t+1}} − 1)/(e^η − 1) ≤ 1. Simplifying, we get 0 ≤ b_{p,t+1} ≤ 1. Finally, by the algorithm's construction we keep the dual constraints with equality: b_{p,t+1} = b_{p,t} + Σ_{S: p∈S} a_{S,t}. The only exception is due to line 2 in the normalization step: since f is monotonically decreasing in η and monotonically increasing in b_{p,t+1}, every increase of η in line 2 also induces an increase in b_{p,t+1}, implying b_{p,t+1} ≥ b_{p,t} + Σ_{S: p∈S} a_{S,t}.

Primal-dual relation: We bound the primal cost in each iteration by the change in the dual cost. Let dD/da_{S,t} be the increase rate of the dual solution at time t, when a_{S,t} gradually increases. Let η_t denote the current value of η, and let η_final denote its value at the end of the execution. The dual increase rate is |S| − r(S). For the primal cost, let us consider eviction costs instead of fetching costs (clearly, this adds at most r(E) to the overall cost). That is, we bound the change in the primal variables during the normalization step instead of the update step.

 Σ_{p∈S∖{p_t}} dy_{p,t}/da_{S,t}
  = η_t Σ_{p∈S∖{p_t}} ( y_{p,t} + 1/(e^{η_t} − 1) )    (5.5)
  < η_t ( |S| − r(S) + |S|/(e^{η_t} − 1) )             (5.6)
  ≤ 2η_t (|S| − r(S)) = 2η_t · dD/da_{S,t}             (5.7)
  ≤ 2η_final · dD/da_{S,t} ≤ 2 log(1 + c(M)) · dD/da_{S,t}.   (5.8)

Equality (5.5) follows from the fact that dy_{p,t}/da_{S,t} = dy_{p,t}/db_{p,t+1} = η_t ( y_{p,t} + 1/(e^{η_t} − 1) ). Inequality (5.6) follows as the pages of E ∖ S lie in a dual tight set, Σ_{p∈E∖S} y_{p,t} = r*(E ∖ S), while Σ_{p∈E} y_{p,t} < |E| − r(E) throughout the normalization step, so by the dual rank function definition Σ_{p∈S} y_{p,t} < |S| − r(S). Inequality (5.7) follows as η_t ≥ log(1 + |S|/(|S| − r(S))) in the algorithm, hence |S|/(e^{η_t} − 1) ≤ |S| − r(S). Finally, Inequality (5.8) follows by Lemma 5.3.1, since η_final ≤ max_S { log(1 + |S|/(|S| − r(S))) } and every evictable set S is nonseparable. To see the latter, assume by contradiction that S can be partitioned into S_1, S_2 such that r(S_1) + r(S_2) = r(S), and assume without loss of generality that p_t ∈ S_1. Then, clearly, since every page p ∈ S_2 is unevictable before p_t is fetched, it remains unevictable when y_{p_t,t} is set to 0.

5.4 Rounding the Fractional Solution Online

In this section we describe our rounding procedure for matroid caching. Our goal is to map the fractional solution produced by the algorithm of Section 5.3 into a distribution on the bases of the matroid, and show how to maintain the distribution while paying a small cost. Let y_{t−1} ∈ B(M) be the fractional solution at time t − 1. Moving from y_{t−1} ∈ B(M) to y_t ∈ B(M) can be divided, without loss of generality, into a sequence of changes, where in each one y_{u,t−1} is increased by ε and y_{v,t−1} is decreased by ε, for some choice of u and v. Thus, the change in the cost of the fractional solution is ε. Our algorithm holds at any time a decomposition D of the current fractional solution y into a subset of matroid bases, such that y = Σ_{B∈D} λ_B·1_B and Σ_{B∈D} λ_B = 1. We want to update the current distribution D so that it is consistent with the new fractional solution y_t, while making as few changes as possible. This immediately gives us an online mapping of our fractional algorithm to a randomized integral one (see, e.g., [BBN12a] for the explicit mapping).
To update the distribution D we use an auxiliary graph which was initially introduced by [Cun84] to determine whether a given point is inside P(M), and if so, to find a decomposition of it into independent sets. Given a decomposition D of a fractional solution y ∈ B(M), let G(D) = (V, E_G) be a directed graph with V = E. The edges of the graph are defined as follows:

 E_G = { (u, v) : u, v ∈ V, and there exists B ∈ D with λ_B > 0 such that B + u − v is a base }.
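For intuition, the auxiliary graph is easy to materialize for a small graphic matroid. The brute-force sketch below (illustrative only; helper names are ours) builds G(D) for a decomposition into spanning trees of the 4-cycle and finds a shortest swap path with BFS; an arc (u, v) is present when some tree of the decomposition contains v but not u and swapping v for u keeps it spanning:

```python
from collections import deque

def is_spanning_tree(tree, n, edges):
    """Check that a set of edge indices forms a spanning tree on n vertices
    (union-find connectivity test: n - 1 merges, no cycle)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    joined = 0
    for i in tree:
        a, b = find(edges[i][0]), find(edges[i][1])
        if a == b:
            return False
        parent[a] = b
        joined += 1
    return joined == n - 1

def aux_graph(decomp, n, edges):
    """Cunningham's auxiliary graph G(D): arc (u, v) iff some base B of the
    decomposition has u not in B, v in B, and B + u - v is again a base."""
    arcs = {u: set() for u in range(len(edges))}
    for B in decomp:
        for v in B:
            for u in set(range(len(edges))) - B:
                if is_spanning_tree((B - {v}) | {u}, n, edges):
                    arcs[u].add(v)
    return arcs

def shortest_path_len(arcs, s, t):
    dist, q = {s: 0}, deque([s])
    while q:
        x = q.popleft()
        if x == t:
            return dist[x]
        for y in arcs[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    return None

# Graphic matroid of the 4-cycle; bases are the four 3-edge spanning trees.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
decomp = [frozenset({0, 1, 2}), frozenset({1, 2, 3})]  # each with weight 1/2
arcs = aux_graph(decomp, 4, edges)
# Moving mass from edge 3 onto edge 0 takes a single swap here:
assert shortest_path_len(arcs, 0, 3) == 1
```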

Proposition 5.4.1. Let y ∈ B(M) be a fractional solution and let D be a decomposition of y. Let y' ∈ B(M) be a fractional solution such that y'_u = y_u + ε, y'_v = y_v − ε, and y'_w = y_w for any w ≠ u, v. Then, there exists a directed path in G(D) from u to v. Furthermore, if P_G(u, v) is the shortest path from u to v, then D can be converted into a decomposition D' of y' while paying ε·|P_G(u, v)|.

Proof. Let us look at the problem considered by [Cun84]. In that work, we are given x_0 ∈ P(M), a decomposition of x_0 into independent sets, and x_1 ∈ R^E, where x_0 ≤ x_1. The goal is to iteratively bring x_0 closer to x_1, i.e., find x ∈ P(M) such that x_0 ≤ x ≤ x_1, until we either reach x_1, or no such solution exists. To do so, an auxiliary graph similar to ours is constructed and an augmenting path in it is computed. In particular, it is shown that if x_1 ∈ P(M), then such a path always exists [Cun84, Theorem 2.2]; we follow this path, and for every edge (e, f) on it we add e and remove f in some base B ∈ D in which B + e − f is also a base², obtaining a feasible decomposition of x [Cun84, Lemma 4.3]. In fact, our setting is a special case of the latter setting. If we set x_0 = y − ε·1_v and x_1 = x_0 + ε·1_u = y', and define the decomposition of x_0 as D after removing v from an ε-measure of its bases, then our problem satisfies the conditions of [Cun84]. As a result, there exists a path from u to v, and to obtain D' all we need to do is follow the path P_G(u, v) and perform |P_G(u, v)| swaps.

Proposition 5.4.1 suggests a way of maintaining a decomposition of the fractional solution online. The payment of the rounding scheme depends on the length of the path in G(D). The property that determines the worst-case rounding quality is therefore the diameter of G(D), that is, the maximum shortest path among all pairs u, v ∈ V for which there exists a path from u to v.
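To see the swap mechanism of Proposition 5.4.1 in action, the following toy sketch (our own simplification: a uniform rank-2 matroid, i.e., classical paging, where any single swap B + u − v is again a base, so the path has length 1) moves an ε-measure from page v to page u and verifies that the new decomposition has the intended marginals:

```python
def marginals(decomp, ground):
    """y_p = total weight of bases containing p."""
    return {p: sum(w for B, w in decomp.items() if p in B) for p in ground}

def swap_measure(decomp, u, v, eps):
    """Move eps of measure from some base B (v in B, u not in B) onto the
    base B + u - v; for a uniform matroid this is always a base."""
    B = next(B for B, w in decomp.items() if v in B and u not in B and w >= eps)
    decomp[B] -= eps
    target = (B - {v}) | {u}
    decomp[target] = decomp.get(target, 0.0) + eps
    return decomp

ground = {"a", "b", "c", "d"}
# Rank-2 uniform matroid: a cache of size 2 over four pages.
decomp = {frozenset({"a", "b"}): 0.5, frozenset({"b", "c"}): 0.5}
assert marginals(decomp, ground) == {"a": 0.5, "b": 1.0, "c": 0.5, "d": 0.0}

swap_measure(decomp, u="d", v="c", eps=0.25)
y2 = marginals(decomp, ground)
assert abs(y2["d"] - 0.25) < 1e-12 and abs(y2["c"] - 0.25) < 1e-12
assert sum(decomp.values()) == 1.0 and all(len(B) == 2 for B in decomp)
```

Only an ε-measure of a single base along each arc of the path is touched, which is exactly why the rounding cost is ε times the path length.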
²Only an ε-measure of B is updated. That is, λ_B ← λ_B − ε, and the weight of B + e − f increases by ε.

For a matroid M, let d_G(M) be the maximum diameter of G(D) over any two fractional solutions y, y' ∈ B(M) such that y' differs from y in two coordinates, and over any valid decomposition D of y into bases. We obtain the following bound on d_G(M).

Lemma 5.4.2. For any matroid M, d_G(M) ≤ r(M) + 1.

Proof. Let y, y' ∈ B(M) denote two fractional solutions such that y'_u = y_u + ε, y'_v = y_v − ε, and y'_w = y_w for all other coordinates. Let D be a valid decomposition of y, so that y = Σ_{B∈D} λ_B·1_B. We prove that there exists a path from u to v in G(D) of length at most r(M) + 1. Let A ⊆ E denote a subset of pages that have a directed path from u. Starting from A = {u}, we would like to add in each step a new node w such that there is an edge from A to w and r(A ∪ {w}) > r(A). After at most r − 1 steps we reach a point where no such node w exists. By [Sch03, Theorem 39.12], for every base B ∈ D there is an independent set I_B ⊆ B of size and rank r(A), such that for every e ∈ I_B there exists e' ∈ A where B + e' − e is a base. This implies that, for every B, all pages in I_B are connected to A; but as A cannot be further extended, r(A ∪ ∪_{B∈D} I_B) = r(A). Therefore, either v ∈ A ∪ ∪_{B∈D} I_B and we are done, or

 Σ_{w ∈ ∪_{B∈D} I_B} y_w = Σ_{B∈D} λ_B |B ∩ ∪_{B'∈D} I_{B'}|   (5.9)
  ≥ Σ_{B∈D} λ_B |I_B|   (5.10)
  = Σ_{B∈D} λ_B · r(A) = r(A) = r(∪_{B∈D} I_B).   (5.11)

However, this contradicts the feasibility of y', as

 Σ_{w ∈ ∪_{B∈D} I_B ∪ A} y'_w ≥ ε + Σ_{w ∈ ∪_{B∈D} I_B ∪ A} y_w   (5.12)
  ≥ ε + r(∪_{B∈D} I_B ∪ A),   (5.13)

where (5.12) follows as u ∈ A and v ∉ A ∪ ∪_{B∈D} I_B.

The above lemma immediately provides an upper bound on the performance of our rounding algorithm. In Section 5.4.1 we show that this result is almost tight. In general, this bound may be quite large. However, in Section 5.5 we explore several interesting special cases of the restricted caching problem and show that the diameter in these cases is much smaller. Moreover, for the general matroid case, we are able to guarantee a logarithmic competitive ratio using an online rounding algorithm recently proposed by [GTW14]. The idea behind the algorithm is to maintain spanning sets in every iteration, and then transform them into bases without incurring any additional loss. Rounding spanning sets is based on recent results on contention resolution schemes [VCZ11] (see [GTW14], Section 4 for more details). Therefore, for rounding, one can always apply the auxiliary graph approach, and if d_G(M) is greater than log r(M), switch to the rounding approach of [GTW14]. Combining Theorem 5.3 with Proposition 5.4.1, as well as with the above insight, yields our main theorem, Theorem 5.1.

5.4.1 A Lower Bound on the Auxiliary Graph Diameter

In this section we show that the upper bound proven in Lemma 5.4.2 is almost tight. Consider the graphic matroid of the graph G = (V, E), with V = {v_1, v_2, ..., v_n}, and assume we are given a uniform distribution D over n − 2 spanning trees T_1, T_2, ..., T_{n−2}, such that T_i = P_n − (v_i, v_{i+1}) + (v_1, v_{i+2}), where P_n = {(v_j, v_{j+1})}_{j=1}^{n−1} is the path from v_1 to v_n. The distribution is presented in Figure 5.3.

[Figure 5.3: Uniform decomposition into spanning trees T_1, ..., T_{n−2} of the initial fractional solution.]

The current solution can be represented fractionally as

 y_{(v_{n−1},v_n)} = 1,
 y_{(v_i,v_{i+1})} = (n − 3)/(n − 2)   for 1 ≤ i ≤ n − 2,
 y_{(v_1,v_{i+2})} = 1/(n − 2)        for 1 ≤ i ≤ n − 2.

[Figure 5.4: Decomposition into spanning trees of the updated fractional solution.]

Now assume a new fractional solution y' is obtained by increasing y_{(v_1,v_2)} from (n − 3)/(n − 2) to 1, and decreasing y_{(v_{n−1},v_n)} from 1 to (n − 3)/(n − 2). We would like to apply a sequence of edge swaps to obtain a decomposition of y'. Let us observe the corresponding auxiliary graph G(D). For every 1 ≤ i ≤ n − 2, we note that there are only two edges in G(D) going forward from (v_i, v_{i+1}): to (v_{i+1}, v_{i+2}) and to (v_1, v_{i+2}), and both of them belong to T_i. Therefore, the shortest path from (v_1, v_2) to (v_{n−1}, v_n) is of length n − 2, and we must perform edge swaps in all n − 2 graph bases to obtain the new decomposition (see Figure 5.4). That is, the auxiliary graph diameter for this matroid is at least r − 1.

5.5 Special Cases of Restricted Caching

As already mentioned, although d_G(M) has a tight upper bound of r(M) + 1 in general, there are several special cases in which the diameter becomes significantly smaller. We demonstrate this on two previously studied cache architectures, obtaining tight performance guarantees.

Classical Paging: In the classical paging problem there is a cache of size k and n pages. Each page may be located anywhere in the cache. It is not hard to see that in this case c(M) = k + 1 and d_G(M) = 2, as for every (u, v) ε-update the out-degree of u is at least n − k and the in-degree of v is at least k. Thus, our algorithm is O(log k)-competitive, which is optimal up to constants.

Companion Caching: In the companion caching problem there are m types of pages, where a page of type i can be stored either in the i-th associative cache of size l, or in the fully associative cache of size n. As a companion cache is a restricted cache, we can represent it via a transversal matroid, and prove the following result.

Lemma 5.5.1. For the companion caching matroid, c(M) = min{m, n + 1}·l + n + 1 and d_G(M) = 3. Thus, our algorithm is O(log n + log l)-competitive, which is optimal up to constant factors.

Proof. Let M be the transversal matroid induced by an (n, l)-companion cache.

Bounding c(M): First observe that for every circuit C in M, the number of pages of each type t is either 0 or at least l + 1; otherwise, evicting a page of type t also reduces the rank of C by 1 (all pages of type t can be assigned to slots in the private cache, and thus an eviction of one of these pages decreases the assignment size by one), contradicting the rank property of a circuit. Moreover, the size of any circuit C with τ types of pages is at most τ·l + n + 1, because there are only τ·l + n possible cache slots for these pages. Let C be a circuit, and let τ be the number of page types in C. Then, from the above two observations we have

 τ·l + τ ≤ |C| ≤ τ·l + n + 1.

If m > n, then |C| is maximized when τ = n + 1, implying |C| = (n + 1)(l + 1). For example, a subset of pages containing l + 1 pages from each of any n + 1 types is a circuit. First note that the subset is dependent, as its size is nl + n + l + 1 > (n + 1)l + n. Next, note that evicting any of the pages leads to a feasible cache assignment, as the remaining size would be (n + 1)l + n, with at least l pages of each type. If m ≤ n, then |C| is maximized for τ = m, implying |C| = ml + n + 1. Any subset of ml + n + 1 pages containing at least l + 1 pages of every type is a circuit.
As before, the subset is dependent, as its size is $ml + n + 1 > ml + n$, and in addition the eviction of any page leads to a feasible cache, as the cache size would be $ml + n$, with at least $l$ pages of each type.

Bounding $d_G(\mathcal{M})$: Let there be some decomposition $D$ of the fractional solution into integral solutions. We would like to add an $\varepsilon$-measure of page $u$ and remove an $\varepsilon$-measure of page $v$ instead. Next we prove there is a path from $u$ to $v$ of length at most $3$. Let $A$ denote a base in $D$ which does not contain $u$, and let $B$ denote a base in $D$ which contains $v$ (such bases exist, otherwise we cannot make the fractional update). Let us start with a simple case in which all bases of $D$ have exactly $l$ pages of $v$'s type, which means $u$ and $v$ are of the same page type. If $v \in A$ then there is an edge in $G(D)$ from $u$ to $v$ and we are done. If $v \notin A$, then there exists a page of the same type $w \in A \setminus B$, implying $(u, w), (w, v) \in G(D)$. A complementary case is where there exists a base $B'$ in $D$ with at least $l+1$ pages of $v$'s type. In particular, either $v \in B'$, or there is a page $v'$ of the same type, $v' \in B'$, such that

$v' \notin B$, and thus there is an edge in $G(D)$ from $v'$ to $v$. Hence, all we need to show is that the distance from $u$ to $v'$ is at most $2$ (in case $v \in B'$ we denote $v' = v$). Let $T_A$ denote the subset of page types which have at least $l+1$ pages in $A$, and let $S$ denote the set of pages in $A$ of these types. Clearly $|S| = |T_A|\, l + n$, and $u$ is connected to all pages in $S$ in the graph $G(D)$. Similarly, let $T_B$ denote the subset of page types which have at least $l+1$ pages in $B'$. If $v' \in S$ then $(u, v') \in G(D)$ and we are done. Otherwise, there are two possibilities. The first is that $T_A \subseteq T_B$; as $v'$'s page type is in $T_B$, taking up at least one slot from the shared cache in $B'$, we have that at most $|T_A|\, l + n - 1$ of the pages in $S$ belong to $B'$. This means there is a page $q \in S \setminus B'$, implying $(u, q), (q, v') \in G(D)$. The second possibility is that there is a page type $j$ with at least $l+1$ pages in $A$ ($j \in T_A$), but only $l$ pages in $B'$ ($j \notin T_B$). Therefore, there exists a page $r \in A \setminus B'$ of type $j$, implying $(u, r), (r, v') \in G(D)$.

This result matches the lower bound shown by [FMS02]. As argued in [FMS02], any algorithm with free reorganization can be implemented online in the no-reorganization model while losing at most a factor of $3$. Thus, we also get tight competitiveness for the companion caching problem without free reorganization.

5.6 Concluding Remarks

In this chapter we studied the restricted caching problem, in which each page in memory can only be placed in a restricted subset of cache locations, presenting a framework which guarantees $O(\log^2 k)$-competitiveness for any restricted cache of size $k$, independently of its structure. This study suggests several future research directions. First, we showed that $d_G(\mathcal{M})$ can in some cases be as large as $r(\mathcal{M})$. However, we could not come up with an example where the rounding algorithm can be forced to use long paths repeatedly for many steps. Thus, a reasonable conjecture is that the amortized cost of our rounding algorithm via the auxiliary graph might only be a constant.
Proving this conjecture would give an optimal $O(\log k)$-competitive algorithm for any restricted caching problem. Another direction is finding and characterizing the circumference of a transversal matroid. This parameter is interesting since it serves as the performance bound of our fractional algorithm. The problem of finding the circumference of a general matroid has very poor approximation factors, as in graphic matroids it reduces to finding the longest simple cycle. However, we do not know anything about the hardness of the problem for transversal matroids.
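The circuit characterization for the companion cache proved in this chapter can also be verified computationally for small parameters. The following sketch (all helper names are mine) computes transversal-matroid ranks via bipartite matching and confirms that $l+1$ pages from each of $n+1$ types form a circuit of size $(n+1)(l+1)$:

```python
# Sketch (names mine): verify the companion-caching circuit bound for small
# parameters.  An (n, l)-companion cache with m page types has l private
# slots per type plus n shared slots; a page of type t may occupy a private
# slot of type t or any shared slot.  Rank = maximum bipartite matching.

def rank(pages, slots_of):
    """Transversal-matroid rank via augmenting paths (Kuhn's algorithm)."""
    match = {}                       # slot -> page
    def augment(p, seen):
        for s in slots_of(p):
            if s not in seen:
                seen.add(s)
                if s not in match or augment(match[s], seen):
                    match[s] = p
                    return True
        return False
    return sum(augment(p, set()) for p in pages)

n, l, m = 2, 2, 4                    # m > n, so c(M) should be (n+1)(l+1)
def slots_of(page):
    t, _ = page                      # page = (type, index)
    return [("prv", t, i) for i in range(l)] + [("shr", i) for i in range(n)]

# l+1 pages of each of n+1 types: dependent, and minimally so.
C = [(t, i) for t in range(n + 1) for i in range(l + 1)]
assert rank(C, slots_of) == len(C) - 1
assert all(rank([p for p in C if p != q], slots_of) == len(C) - 1 for q in C)
print("circuit size:", len(C), "= (n+1)(l+1) =", (n + 1) * (l + 1))
```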


Chapter 6

Online Packing and Covering Framework with Convex Objectives

We consider online fractional covering problems with a convex objective $f$, where the covering constraints arrive over time. We also consider the corresponding online dual packing problem with a concave objective. In this chapter we provide an online primal-dual framework for both classes of problems, with competitive ratio depending on certain monotonicity and smoothness parameters of $f$. Using the notions of convex conjugacy and Fenchel duality, well-studied techniques in online convex optimization, our analysis extends the online primal-dual linear programming techniques developed in competitive analysis. Our results match or improve on guarantees for some special classes of functions $f$ considered previously. Using the new fractional solver with problem-dependent randomized rounding procedures, we obtain competitive algorithms for various problems, such as online covering LPs minimizing $\ell_p$-norms of arbitrary packing constraints, set cover with multiple cost functions, capacity constrained facility location, and profit maximization with nonseparable production costs. Some of these results are new and others provide a unified view of previous results, with matching or slightly worse competitive ratios.

6.1 Introduction

We consider the following class of fractional covering problems:
$$\min\ \{f(x) : Ax \ge \mathbf{1},\ x \ge 0\}. \tag{6.1}$$
Above, $f : \mathbb{R}^n \to \mathbb{R}$ is a nondecreasing convex function and $A \in \mathbb{R}_+^{m \times n}$ is nonnegative. Observe that we can transform the more general constraints $Ax \ge b$ (with all nonnegative entries) into this form by scaling the constraints. The covering constraints $a_i \cdot x \ge 1$ arrive online over time, and must be satisfied upon arrival. We want to design an online algorithm that maintains a feasible

fractional solution $x$, where $x$ is required to be nondecreasing over time. We also consider the Fenchel dual of (6.1), which is the following packing problem:
$$\max\ \{\mathbf{1} \cdot y - f^*(\mu) : A^\top y \le \mu,\ y \ge 0\}. \tag{6.2}$$
Here, the variables $y_i$, along with columns of $A^\top$ (or, alternatively, rows of $A$), arrive over time, and $f^*$ is the convex conjugate of $f$, formally defined in (6.7); see, e.g., [Roc70] for background and properties. Let $d$ denote the row sparsity of the matrix $A$, i.e., the maximum number of non-zeroes in any row, and let $\nabla_l f(z)$ be the $l$th coordinate of the gradient of $f$ at the point $z \in \mathbb{R}^n$.

This chapter presents an online primal-dual algorithm for the pair of convex programs (6.1) and (6.2). This extends the widely-used online primal-dual framework for linear objective functions to the convex case. The competitive ratio is given as the ratio between the primal and dual objective functions.¹ It depends on certain smoothness parameters of the function $f$. We provide two general algorithms. In the first algorithm, the primal variables $x$ and dual variables $\mu$ are monotonically nondecreasing, while the dual variables $y$ are allowed to both increase and decrease over time. The competitive ratio of this algorithm for problem (6.1) is:
$$\frac{\text{Dual}}{\text{Primal}} \ \ge\ \max_{c>0} \min_z \left[ \frac{1}{4\ln(1+d)} \min_{l=1}^n \left\{ \frac{\nabla_l f(z)}{\nabla_l f(cz)} \right\} - \frac{z \cdot \nabla f(z) - f(z)}{f(cz)} \right], \tag{6.3}$$
and the competitive ratio of this algorithm for problem (6.2) is:
$$\frac{\text{Dual}}{\text{Primal}} \ \ge\ \max_{c>0} \left[ \min_z \frac{1}{4\ln(1+d)} \min_{l=1}^n \left\{ \frac{\nabla_l f(z)}{\nabla_l f(cz)} \right\} - \max_z \left\{ \frac{z \cdot \nabla f(z) - f(z)}{f(cz)} \right\} \right]. \tag{6.4}$$
In the second algorithm, all variables (the primal variables $x$ as well as the dual variables $y, \mu$) are required to be monotonically nondecreasing. The competitive ratio for problem (6.2) is slightly worse in this case, given by:
$$\frac{\text{Dual}}{\text{Primal}} \ \ge\ \max_{c>0} \left[ \min_z \frac{1}{2\ln(1+d\rho)} \min_{l=1}^n \left\{ \frac{\nabla_l f(z)}{\nabla_l f(cz)} \right\} - \max_z \left\{ \frac{z \cdot \nabla f(z) - f(z)}{f(cz)} \right\} \right]. \tag{6.5}$$
Observe that the difference from (6.4) is the additional parameter $\rho$, which is defined to be an upper bound on the maximum-to-minimum ratio of positive entries in any column of $A$.
The above expressions are difficult to parse because of their generality, so the first special case of interest is that of linear objectives. In this case $z \cdot \nabla f(z) = f(z)$, and also $\nabla f(z) = \nabla f(cz)$, hence the competitive ratios are $O(\ln d)$ for monotone primals, and $O(\ln(d\rho))$ for monotone primals and duals. Both of these competitive ratios are known to be best possible [BN09b, GN14].

¹However, for clarity of exposition we provide the ratio as Dual/Primal and not vice versa.
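The collapse of the ratio expression in the linear case can be checked numerically. The sketch below (function names and test values are mine) evaluates the bracket of (6.3) for a linear $f(x) = c \cdot x$: the gradient is the constant vector $c$, so the first term is $1/(4\ln(1+d))$ and the second term vanishes, independently of the scaling factor:

```python
import math

# Illustrative sketch (names mine): for a linear objective f(x) = c . x the
# gradient is constant, so min_l grad_l f(z)/grad_l f(cz) = 1 and
# z . grad f(z) - f(z) = 0.  The bound (6.3) becomes 1/(4 ln(1+d)).

def f(c_vec, x):
    return sum(ci * xi for ci, xi in zip(c_vec, x))

def bound(c_vec, z, scale, d):
    """Evaluate the bracket of (6.3) at point z with c = `scale`."""
    g = c_vec                                     # gradient of f, constant
    cz = [scale * zi for zi in z]
    term1 = min(gi / gi for gi in g) / (4 * math.log(1 + d))
    term2 = (sum(zi * gi for zi, gi in zip(z, g)) - f(c_vec, z)) / f(c_vec, cz)
    return term1 - term2

c_vec, z, d = [2.0, 5.0, 1.0], [0.3, 1.2, 0.7], 3
for scale in (0.5, 1.0, 4.0):                     # independent of the scaling
    assert abs(bound(c_vec, z, scale, d) - 1 / (4 * math.log(1 + d))) < 1e-12
print("linear f: bound = 1/(4 ln(1+d)) =", round(1 / (4 * math.log(1 + d)), 4))
```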

The applicability of our framework extends to a number of settings, most of which have been studied before in different works. We now outline some of these connections.

Mixed Covering and Packing LPs. In this problem, covering constraints $Ax \ge \mathbf{1}$ arrive online. There are also $K$ packing constraints $\sum_{j=1}^n b_{kj} x_j \le \lambda_k$, for $k \in [K]$, that are given up-front. The right-hand sides $\lambda_k$ of these packing constraints are themselves variables, and the objective is to minimize the $\ell_p$-norm $\left(\sum_{k=1}^K \lambda_k^p\right)^{1/p}$ of the load vector $\lambda = (\lambda_1, \dots, \lambda_K)$. All entries $a_{ij}$ and $b_{kj}$ are nonnegative. Clearly, the objective function is a monotonically nondecreasing convex function. We obtain an $O(p \ln d)$-competitive algorithm for this problem, where $d \le n$ is the row sparsity of the matrix $A$. Prior to our work, [ABFP13] gave an $O(\ln K \cdot \ln(d\kappa\gamma))$-competitive algorithm for the special case of $p = \ln K$, corresponding to $\|\lambda\|_\infty$, the makespan of the loads; here $\gamma$ and $\kappa$ are the maximum-to-minimum ratios of the entries in the covering and packing constraints, respectively.

Set Cover with Multiple Costs. Here the offline input is a collection of $n$ sets $\{S_j\}_{j=1}^n$ over a universe $U$, and $K$ different linear cost functions $B_k : [n] \to \mathbb{R}_+$ for $k \in [K]$. Elements from $U$ arrive online and must be covered by some set upon arrival, where the decision to select a set into the solution is irrevocable. The goal is to maintain a set cover that minimizes the $\ell_p$-norm of the $K$ cost functions. Combining our framework with a simple randomized rounding scheme gives an $O(p^3 \ln p \cdot \ln d \ln |U|)$-competitive randomized online algorithm; here $d$ is the maximum number of sets containing any element. The special case of $K = 1$ (when $p = 1$ without loss of generality) is the online set-cover problem [AAA+09], for which the resulting $O(\ln d \ln |U|)$-competitive bound is tight, at least for randomized polynomial-time online algorithms [Kor05].

Profit Maximization with Production Costs (PMPC).
This is an application of the dual packing problem (6.2), in contrast to the above applications, which are all applications of the primal covering problem. Consider a seller with $m$ items that can be produced and sold. The seller has a production cost function $g : \mathbb{R}_+^m \to \mathbb{R}_+$ which is monotone, convex, and satisfies some other technical conditions; the total cost incurred by the seller to produce $\mu_j$ units of every item $j \in [m]$ is given by $g(\mu)$.² There are $n$ buyers who arrive online. Each buyer $i \in [n]$ is interested in subsets of items (bundles) that belong to a set family $\mathcal{S}_i \subseteq 2^{[m]}$. The value of buyer $i$ for a subset $S \in \mathcal{S}_i$ is given by $v_i(S)$, where $v_i : \mathcal{S}_i \to \mathbb{R}_+$ is her valuation function. If buyer $i$ is allocated a bundle $T \in \mathcal{S}_i$, she pays the seller her valuation $v_i(T)$. (Observe: this is not an auction setting.) The goal in the PMPC problem is to produce items and allocate subsets to buyers so as to maximize the profit $\sum_{i=1}^n v_i(T_i) - g(\mu)$, where $T_i \in \mathcal{S}_i$ denotes the subset allocated to buyer $i$ and $\mu \in \mathbb{R}^m$ is the total quantity of all items produced. As

²See the Related Work section for important differences from prior work on such problems [BGMS11, HK15].

mentioned above, we consider a non-strategic setting, where the valuation of each buyer is known to the seller.

Our main result here is for the fractional version of the problem, where the allocation to each buyer $i$ is allowed to be any point in the convex hull of the set family $\mathcal{S}_i$. We show that for a large class of valuation functions (e.g., supermodular functions, or weighted rank functions of matroids) and production cost functions, our framework provides a polynomial-time online algorithm; the precise competitive ratio is given by expression (6.5) with $f = g$. As a concrete example, suppose the production cost function is $g(\mu) = \left(\sum_{j=1}^m \mu_j\right)^p$ for some $p > 1$. In this case, we get an $O(q \ln(\beta q))$-competitive algorithm, where $q > 1$ satisfies $\frac{1}{q} + \frac{1}{p} = 1$, and $\beta$ is the maximum-to-minimum ratio of the valuation functions $\{v_i\}$.

As the above list indicates, the framework for solving fractional convex programs is fairly versatile and gives good fractional results for a variety of problems. In some cases, solving the particular relaxation we consider and then rounding ends up being weaker than the best known results for the specific problem by a logarithmic factor; we hope that further investigation into this problem will help close this gap.

Bibliographic Note: In independent and concurrent work, Azar et al. [ACP14] consider online covering problems with convex objectives (i.e., problem (6.1)). They also obtain a competitive ratio that depends on properties of the function $f$, but their parameterization is somewhat different from ours. As an example, for online covering LPs minimizing the $\ell_p$-norm of packing constraints, they obtain an $O(p \ln(d\kappa\gamma))$-competitive algorithm, whereas we obtain a tighter $O(p \ln d)$ ratio.

6.1.1 Techniques and Chapter Outline

In Section 6.2.1, we give the first general algorithm for the convex covering problem (6.1), maintaining monotone primal variables but allowing dual variables to decrease.
The main observation is simple, yet powerful: convex optimization problems with a function $f$ can be reduced to linear optimization using the gradient of the convex function $f$. In the process we also give a cleaner algorithm and proof for linear optimization problems, significantly simplifying the previous algorithm from [GN14]. The resulting algorithm performs multiplicative increases on the primal variables; for the dual, it does an initial increase followed by a linear decrease after some point. We then give the second general algorithm, which is simpler. The primal updates are the same as above, but we skip the dual decreases. This results in a worse competitive ratio, but the loss is necessary for any monotone primal-dual algorithm [BN09b].

In Section 6.3 we deal with the various applications of our framework. The high-level idea in all of these is to suitably cast each application in the form of either (6.1) or (6.2). The applications in Section 6.3.1 and Section 6.3.2 are for the convex covering problem (6.1). We

comment that for applications to combinatorial problems we have to define the convex relaxations with some care in order to avoid bad integrality gaps. Moreover, some of our convex relaxations are motivated by the particular constraints we want to enforce when subsequently rounding. In Section 6.3.3 we consider the problem of profit maximization with production costs, which after some simplifications can be cast as a convex packing program as in (6.2). We want allocations to be nondecreasing over time, so we use our second general primal-dual algorithm, which maintains monotone solutions. We also show how this problem can be solved efficiently for some special classes of valuation functions: supermodular functions and matroid rank functions. This convex program can also be randomly rounded online to get integral allocations with the same multiplicative competitive ratio, but with an extra additive term. The additive term depends only on the number $m$ of items and the cost function $g$; in particular, it does not depend on $n$, the number of buyers. We also show that such an additive loss is necessary (1) for our approach, due to an integrality gap of the convex relaxation, and (2) for any randomized online algorithm, even in a special case of the problem.

Related Work: This study adds to the body of work on online primal-dual algorithms, which have been applied successfully to a large class of online problems, such as set cover [AAA+09], graph connectivity and cuts [AAA+06], caching [BBN12a], auctions [HK15], scheduling [DH14], etc. Below we discuss in more detail only work that is directly relevant to us. Online packing and covering linear programs were first considered by Buchbinder and Naor [BN09b], who obtained an $O(\ln n)$-competitive algorithm for covering and an $O\left(\ln\left(n \frac{a_{\max}}{a_{\min}}\right)\right)$-competitive algorithm for packing. The competitive ratio for covering linear programs was improved to $O(\ln d)$ by Gupta and Nagarajan [GN14], where $d \le n$ is the maximum number of non-zero entries in any row.
Azar, Bhaskar, Fleischer, and Panigrahi [ABFP13] gave the first algorithm for online mixed packing and covering LPs, where the packing constraints are given upfront and covering constraints arrive online; the objective is to minimize the maximum violation of the packing constraints. Their algorithm had a competitive ratio of $O(\ln K \cdot \ln(d\kappa\gamma))$, where $K$ is the number of packing constraints and $\gamma$ (resp. $\kappa$) denotes the maximum-to-minimum ratio of entries in the covering (resp. packing) constraints. Using our framework, this bound can be improved to $O(\ln K \ln d)$. This is also best possible, as shown in [ABFP13]. [ABFP13] also introduced the capacity constrained facility location problem (CCFL) and gave an $O(\ln m \cdot \ln(mn))$-competitive algorithm. Following directly from our general framework, one can also obtain a result for this specific setting (possibly worse by a log factor). Moreover, our approach can be naturally extended to other problems, such as the capacitated multicast problem, which is a generalization of CCFL to multi-level facility costs. The online multicast problem without capacities was considered by Alon et al. [AAA+06], who obtained an $O(\ln m \ln n)$-competitive randomized algorithm. The class of online maximization problems with production costs was introduced by Blum, Gupta, Mansour, and Sharma [BGMS11] and extended by Huang and Kim [HK15]. The key

differences from our setting are: (i) these papers deal with an auction setting where the seller is not aware of the valuations of the buyers, whereas our setting is not strategic, and (ii) in these papers each item $j$ has a separate production cost function $g_j(\mu_j)$, and $g(\mu) := \sum_j g_j(\mu_j)$. We call this the separable case. Our techniques allow the production cost to be nonseparable over items (e.g., we can handle $g(\mu) = \left(\sum_{j=1}^m \mu_j\right)^2$). Overall, methods based on Fenchel duality are widely used in convex optimization; for more details see [BL10]. In particular, Fenchel duality is a powerful algorithmic and analytical tool in online learning [KSST12, SS11, GLS01], with strong connections to the regularization approach [CBL06, Rak09, ACV13]. On the other hand, nonlinear optimization and Fenchel duality were not as widely studied in the competitive analysis domain. Recently, the use of Fenchel duality has emerged in the context of online scheduling [BCP09, BPS09, GKP10, AGK12, DH14] and welfare maximization [BGMS11, HK15], but as already mentioned these works are restricted to special convex objectives, mainly separable functions.

6.2 The General Framework

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a nonnegative nondecreasing convex function. We assume that the function $f$ is continuous and differentiable, and satisfies the following monotonicity condition:
$$\forall\, x \ge x' \in \mathbb{R}^n: \quad f(x) \ge f(x'). \tag{6.6}$$
Here, $x \ge x'$ means $x_i \ge x'_i$ for all $i \in [n]$. We consider the online fractional covering problem (6.1), where the constraints of $A$ arrive online. Our algorithm is a primal-dual algorithm, which works with the following pair of convex programs:
$$(P): \ \min\ f(x) \ \ \text{s.t.}\ Ax \ge \mathbf{1},\ x \ge 0; \qquad (D): \ \max\ \sum_{i=1}^m y_i - f^*(\mu) \ \ \text{s.t.}\ A^\top y \le \mu,\ y \ge 0.$$
Here $f^*$ is the convex conjugate of $f$, which is defined as
$$f^*(\mu) = \sup_z \{\mu \cdot z - f(z)\}. \tag{6.7}$$
Observe that by scaling the rows of $A$ appropriately, we can transform any covering LP of the form $Ax \ge b$ into the form above. The following duality is standard.

Lemma (Weak duality). Let $x$ and $(y, \mu)$ be feasible primal and dual solutions to $(P)$ and $(D)$, respectively.
Then,
$$\text{Primal objective} = f(x) \ \ge\ \sum_{i=1}^m y_i - f^*(\mu) = \text{Dual objective}. \tag{6.8}$$
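Before the formal proof, a quick numeric sanity check of (6.8), assuming the concrete choice $f(x) = \frac{1}{2}\|x\|^2$ (which is nondecreasing on the nonnegative orthant and whose conjugate is $f^*(\mu) = \frac{1}{2}\|\mu\|^2$); the instance data below is mine:

```python
import numpy as np

# Numeric sanity check (instance mine) of weak duality (6.8) for
# f(x) = ||x||^2 / 2, whose convex conjugate is f*(mu) = ||mu||^2 / 2.

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(4, 3))          # nonnegative covering matrix

checked = 0
for _ in range(1000):
    x = rng.uniform(0, 5, size=3)
    if not np.all(A @ x >= 1):                  # need primal feasibility Ax >= 1
        continue
    y = rng.uniform(0, 2, size=4)
    mu = A.T @ y + rng.uniform(0, 1, size=3)    # dual feasibility: A^T y <= mu
    primal = 0.5 * x @ x                        # f(x)
    dual = y.sum() - 0.5 * mu @ mu              # sum_i y_i - f*(mu)
    assert primal >= dual - 1e-9
    checked += 1
print("weak duality held on", checked, "sampled feasible pairs")
```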

Proof.
$$\sum_{i=1}^m y_i = y \cdot \mathbf{1} \ \le\ y \cdot Ax \ \le\ \mu \cdot x \ =\ \mu \cdot x - f(x) + f(x) \ \le\ f^*(\mu) + f(x).$$
Rearranging, we get the desired (6.8).

6.2.1 The Algorithm

The algorithm maintains a feasible primal solution $x$ and a feasible dual solution $(y, \mu)$ at each time.

Fractional Algorithm: At round $t$, let $\tau$ be a continuous variable denoting the current time. While the new constraint is unsatisfied, i.e., $\sum_{j=1}^n a_{tj} x_j < 1$, increase $\tau$ at rate $1$ and:

Change of primal variables: For each $j$ with $a_{tj} > 0$, increase $x_j$ at rate
$$\frac{\partial x_j}{\partial \tau} = \frac{a_{tj} x_j + \frac{1}{d_t}}{\nabla_j f(x)}. \tag{6.9}$$
Here $d_t$ is the row sparsity of the constraint matrix so far, and $\nabla_j f(x)$ is the $j$th coordinate of the gradient $\nabla f(x)$.

Change of dual variables: Set $\mu = \nabla f(\delta x)$, where $\delta > 0$ is determined later. Increase $y_t$ at rate
$$r = \frac{1}{\ln(1+2d_t)} \min_{l=1}^n \left\{ \frac{\nabla_l f(\delta x)}{\nabla_l f(x)} \right\}.$$
If the dual constraint of variable $x_j$ is tight, that is, $\sum_{i=1}^t a_{ij} y_i = \mu_j$, then let $m(j) = \arg\max_{i \le t} \{a_{ij} \mid y_i > 0\}$, and decrease $y_{m(j)}$ at rate $\frac{a_{tj}}{a_{m(j)j}}\, r$. (Note that this change occurs only if $a_{tj}$ is strictly positive.)

We emphasize that the primal algorithm does not depend on the value $\delta$. The last step in the algorithm decreases certain dual variables; all other steps only increase primal and dual variables. For the analysis, we denote by $x^\tau, y^\tau, \mu^\tau, r^\tau$ the values of $x, y, \mu, r$ at time $\tau$, respectively. In addition, we denote by $d^\tau$ the value of $d_t$ if at time $\tau$ constraint $t$ is being handled.

Observation 6.1. For any $\delta > 0$, the following are maintained.

The algorithm maintains a feasible, monotonically nondecreasing primal solution.
The algorithm maintains a feasible dual solution with nondecreasing $\mu_j$.

Proof. The first property follows by construction, since we only increase $x$ until reaching a feasible solution. For the second property, we observe that the dual variables $\mu$ are nondecreasing since $\nabla f(x)$ is nondecreasing. We prove that $(y, \mu)$ is feasible by induction over the execution of the algorithm. While processing constraint $t$, if $\sum_{i=1}^t a_{ij} y_i^\tau < \mu_j^\tau$ for column $j$, the constraint is trivially satisfied. Suppose that during the processing of constraint $t$ we have $\sum_{i=1}^t a_{ij} y_i^\tau = \mu_j^\tau$ for some dual constraint $j$ and time $\tau$. Now the dual-decrease part of the algorithm kicks in, and the rate of change of the left-hand side of the dual constraint is:
$$\frac{\partial}{\partial \tau} \sum_{i=1}^t a_{ij} y_i^\tau = a_{tj} r^\tau - a_{m(j)j} \cdot \frac{a_{tj}}{a_{m(j)j}} r^\tau = 0.$$

Before analyzing the competitive factor, let us first prove the following claim.

Claim. For a variable $x_j$, let $T_j = \{i \mid a_{ij} > 0\}$ and let $S_j$ be any subset of $T_j$. Then,
$$x_j^\tau \ \ge\ \frac{1}{\max_{i \in S_j}\{a_{ij}\}\, d^\tau} \left[ \exp\left( \frac{\ln(1+2d^\tau)}{\mu_j^\tau} \sum_{i \in S_j} a_{ij} y_i^\tau \right) - 1 \right]. \tag{6.10}$$

Proof. The proof is by induction on the size of $S_j$. Let $\tau(i)$ denote the value of $\tau$ at the arrival of the $i$th primal constraint. We first note that the increase of the primal variables at any time $\tau(i) \le \tau \le \tau(i)+1$ can be alternatively formulated by the following differential equation:
$$\frac{\partial x_j}{\partial y_i} = \frac{\ln(1+2d_i)}{\min_{l=1}^n \left\{ \frac{\nabla_l f(\delta x)}{\nabla_l f(x)} \right\}} \cdot \frac{a_{ij} x_j + \frac{1}{d_i}}{\nabla_j f(x)}. \tag{6.11}$$
By solving the latter equation we get, for any $\tau(i) \le \tau \le \tau(i)+1$,
$$x_j^\tau + \frac{1}{a_{ij} d_i} \ \ge\ \left( x_j^{\tau(i)} + \frac{1}{a_{ij} d_i} \right) \exp\left( \frac{\ln(1+2d_i)}{\nabla_j f(\delta x^\tau)}\, a_{ij} y_i^\tau \right), \tag{6.12}$$
where we use the fact that $\nabla_j f(\delta x)$ is monotonically nondecreasing. Note that Inequality (6.12) is satisfied even when no decrease is performed on the dual variables, as such a decrease only affects the right-hand side of the inequality.
Now, if $S_j = \{i\}$, then we have
$$x_j^\tau \ \ge\ x_j^{\tau(i)+1} \ \ge\ \left( x_j^{\tau(i)} + \frac{1}{a_{ij} d_i} \right) \exp\left( \frac{\ln(1+2d_i)}{\nabla_j f(\delta x^{\tau(i)+1})}\, a_{ij} y_i^\tau \right) - \frac{1}{a_{ij} d_i} \tag{6.13}$$
$$\ge\ \frac{1}{\max_{i \in S_j}\{a_{ij}\}\, d_i} \left[ \exp\left( \frac{\ln(1+2d_i)}{\mu_j^\tau} \sum_{i \in S_j} a_{ij} y_i^\tau \right) - 1 \right], \tag{6.14}$$

where Inequality (6.13) follows immediately from (6.12), and Inequality (6.14) follows as $x_j^{\tau(i)} \ge 0$, $\mu_j^\tau = \nabla_j f(\delta x^\tau)$, and the value of $\nabla_j f(\delta x)$ is monotonically nondecreasing in time. Next, we use the crucial observation [BG13] that the expression
$$H(d, L) = \frac{1}{d} \left[ \exp\left( L \ln(1+2d) \right) - 1 \right]$$
is monotonically decreasing in $d$, for any $0 \le L \le 1$, to deduce
$$x_j^\tau \ \ge\ \frac{1}{\max_{i \in S_j}\{a_{ij}\}\, d^\tau} \left[ \exp\left( \frac{\ln(1+2d^\tau)}{\mu_j^\tau} \sum_{i \in S_j} a_{ij} y_i^\tau \right) - 1 \right].$$
To prove the observation, we show that the derivative of $H(d, L)$ with respect to $d$ is nonpositive, i.e.,
$$\frac{1}{d} \cdot \frac{2L}{2d+1} \exp\left( L \ln(1+2d) \right) - \frac{1}{d^2} \left[ \exp\left( L \ln(1+2d) \right) - 1 \right] \ \le\ 0.$$
Simplifying, we get
$$(2d+1-2dL) \exp\left( L \ln(1+2d) \right) \ \le\ 2d+1,$$
which is equivalent to
$$\ln(2d+1-2dL) + L \ln(1+2d) \ \le\ \ln(2d+1).$$
It is easy to check that the latter expression holds with equality for $L = 0$ and $L = 1$. Furthermore, the left-hand side is a concave function of $L$ (its second derivative equals $-\left(\frac{2d}{2d+1-2dL}\right)^2$), and therefore the inequality holds for any $0 \le L \le 1$.

Next, assume that the claim is true for any subset with $|S_j| < s$. Then, given $S_j = \{i_1, \dots, i_s\}$, we have
$$x_j^\tau \ \ge\ x_j^{\tau(i_s)+1} \ \ge\ \left( x_j^{\tau(i_s)} + \frac{1}{a_{i_s j} d_{i_s}} \right) \exp\left( \frac{\ln(1+2d_{i_s})}{\mu_j^\tau}\, a_{i_s j} y_{i_s}^\tau \right) - \frac{1}{a_{i_s j} d_{i_s}} \tag{6.17}$$
$$\ge\ \left( x_j^{\tau(i_{s-1})+1} + \frac{1}{\max_{i \in S_j}\{a_{ij}\} d_{i_s}} \right) \exp\left( \frac{\ln(1+2d_{i_s})}{\mu_j^\tau}\, a_{i_s j} y_{i_s}^\tau \right) - \frac{1}{\max_{i \in S_j}\{a_{ij}\} d_{i_s}} \tag{6.18}$$
$$\ge\ \frac{1}{\max_{i \in S_j}\{a_{ij}\} d_{i_s}} \exp\left( \frac{\ln(1+2d_{i_s})}{\mu_j^\tau} \sum_{i \in S_j \setminus \{i_s\}} a_{ij} y_i^\tau \right) \exp\left( \frac{\ln(1+2d_{i_s})}{\mu_j^\tau}\, a_{i_s j} y_{i_s}^\tau \right) - \frac{1}{\max_{i \in S_j}\{a_{ij}\} d_{i_s}} \tag{6.19}$$
$$=\ \frac{1}{\max_{i \in S_j}\{a_{ij}\} d_{i_s}} \left[ \exp\left( \frac{\ln(1+2d_{i_s})}{\mu_j^\tau} \sum_{i \in S_j} a_{ij} y_i^\tau \right) - 1 \right]$$
$$\ge\ \frac{1}{\max_{i \in S_j}\{a_{ij}\} d^\tau} \left[ \exp\left( \frac{\ln(1+2d^\tau)}{\mu_j^\tau} \sum_{i \in S_j} a_{ij} y_i^\tau \right) - 1 \right], \tag{6.20}$$
where Inequality (6.17) follows from (6.12), and Inequality (6.18) follows as $e^x - 1 \ge 0$ and the fact that

$\nabla_j f(\delta x)$ is monotonically nondecreasing. Inequality (6.19) follows since
$$x_j^{\tau(i_{s-1})+1} \ \ge\ \frac{1}{\max_{i \in S_j \setminus \{i_s\}}\{a_{ij}\}\, d_{i_s}} \left[ \exp\left( \frac{\ln(1+2d_{i_s})}{\mu_j^\tau} \sum_{i \in S_j \setminus \{i_s\}} a_{ij} y_i^\tau \right) - 1 \right]$$
by the induction hypothesis and the monotonicity of $H(d, L)$ in $d$. Inequality (6.20) follows from the monotonicity of $H(d, L)$ as well.

Theorem 6.2. The competitive ratio of the algorithm is:
$$\frac{\text{Dual}}{\text{Primal}} \ \ge\ \min_z \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(\delta z)}{\nabla_l f(z)} \right\}}{4 \ln(1+2d)} \ -\ \max_z \left\{ \frac{\delta z \cdot \nabla f(\delta z) - f(\delta z)}{f(z)} \right\}, \tag{6.21}$$
where $\delta > 0$ is the parameter chosen in the algorithm.

Proof. Consider the update when primal constraint $t$ arrives and $\tau$ is the current time. Let $U(\tau)$ denote the set of tight dual constraints at time $\tau$; that is, for every $j \in U(\tau)$ we have $a_{tj} > 0$ and $\sum_{i=1}^t a_{ij} y_i^\tau = \mu_j^\tau$. Moreover, let us define, for every $j \in U(\tau)$, $S_j = \{i \mid a_{ij} > 0,\ y_i^\tau > 0\}$. Clearly $\sum_{i \in S_j} a_{ij} y_i^\tau = \sum_{i=1}^t a_{ij} y_i^\tau = \mu_j^\tau$, hence by the above claim and the fact that $\sum_j a_{tj} x_j^\tau < 1$ we get
$$1 \ >\ \sum_{j \in U(\tau)} a_{tj} x_j^\tau \ \ge\ \sum_{j \in U(\tau)} \frac{a_{tj}}{\max_{i \in S_j}\{a_{ij}\}\, d_t} \left[ \exp\left( \ln(1+2d_t) \right) - 1 \right],$$
and after rearranging we get $\sum_{j \in U(\tau)} \frac{a_{tj}}{\max_{i \in S_j}\{a_{ij}\}} \le \frac{1}{2}$. As a result, we can bound the rate of change of the dual expression $\sum_{i=1}^t y_i$ at any time $\tau$:
$$\frac{\partial}{\partial \tau} \sum_{i=1}^t y_i^\tau \ \ge\ r^\tau - \sum_{j \in U(\tau)} \frac{a_{tj}}{a_{m(j)j}}\, r^\tau \ \ge\ \frac{1}{2} r^\tau. \tag{6.22}$$
On the other hand, when processing constraint $t$ during the execution of the algorithm, the rate of increase of the primal objective $f$ is:
$$\frac{\partial f(x^\tau)}{\partial \tau} = \sum_j \nabla_j f(x^\tau) \frac{\partial x_j^\tau}{\partial \tau} = \sum_{j:\, a_{tj}>0} \nabla_j f(x^\tau) \cdot \frac{a_{tj} x_j^\tau + \frac{1}{d_t}}{\nabla_j f(x^\tau)} = \sum_{j:\, a_{tj}>0} \left( a_{tj} x_j^\tau + \frac{1}{d_t} \right) \ \le\ 2. \tag{6.23}$$
The final inequality uses the fact that the covering constraint is unsatisfied, and that $d_t$ is at least the number of non-zeroes in the vector $a_t$. From (6.22) and (6.23) we can now bound the following primal-dual ratio:
$$\frac{\partial \left( \sum_{i=1}^t y_i^\tau \right)/\partial \tau}{\partial f(x^\tau)/\partial \tau} \ \ge\ \frac{r^\tau}{4} \ =\ \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(\delta x^\tau)}{\nabla_l f(x^\tau)} \right\}}{4 \ln(1+2d_t)} \ \ge\ \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(\delta x^\tau)}{\nabla_l f(x^\tau)} \right\}}{4 \ln(1+2d)}. \tag{6.24}$$

Thus, if $x$ and $y$ are the final primal and dual solutions, we get
$$\sum_{i=1}^m y_i \ \ge\ \min_x \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(\delta x)}{\nabla_l f(x)} \right\}}{4 \ln(1+2d)} \cdot f(x). \tag{6.25}$$
To complete the proof of Theorem 6.2, we use the following standard claim.

Claim. For any $a \in \mathbb{R}^n$, we have $f^*(\nabla f(a)) = a \cdot \nabla f(a) - f(a)$.

Proof. By definition, $f^*(\nabla f(a)) = \sup_x \{x \cdot \nabla f(a) - f(x)\}$. Note that $x \cdot \nabla f(a) - f(x)$ is concave as a function of $x$, so a necessary and sufficient condition for optimality is $\nabla_i f(x) = \nabla_i f(a)$ for all $i \in [n]$. Thus, setting $x = a$, we have $f^*(\nabla f(a)) = a \cdot \nabla f(a) - f(a)$.

Finally, we can obtain the competitive ratio by a simple application of the claim and Inequality (6.25) in the definition of the dual. Indeed,
$$\text{Dual} = \sum_{i=1}^m y_i - f^*(\mu) \ \ge\ \min_x \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(\delta x)}{\nabla_l f(x)} \right\}}{4 \ln(1+2d)} \cdot f(x) - f^*(\nabla f(\delta x))$$
by Inequality (6.25), and using the claim with $a = \delta x$ we get
$$= \min_x \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(\delta x)}{\nabla_l f(x)} \right\}}{4 \ln(1+2d)} \cdot f(x) - \left( \delta x \cdot \nabla f(\delta x) - f(\delta x) \right) \ \ge\ \left[ \min_z \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(\delta z)}{\nabla_l f(z)} \right\}}{4 \ln(1+2d)} - \max_z \left\{ \frac{\delta z \cdot \nabla f(\delta z) - f(\delta z)}{f(z)} \right\} \right] \cdot \text{Primal}.$$
Hence the proof.

How to choose the value of $\delta$? If we set $c = 1/\delta$ and optimize over $c$, the competitive ratio is:
$$\frac{\text{Dual}}{\text{Primal}} \ \ge\ \max_{c>0} \left[ \min_z \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(z)}{\nabla_l f(cz)} \right\}}{4 \ln(1+2d)} - \max_z \left\{ \frac{z \cdot \nabla f(z) - f(z)}{f(cz)} \right\} \right]. \tag{6.26}$$
This expression looks quite formidable; however, it simply captures how sharply the function $f$ changes locally. For special cases it gives us very simple expressions; e.g., for linear cost functions $f(x) = c \cdot x$ it gives $\text{Dual} \ge \text{Primal}/O(\ln d)$. See Section 6.3 for several examples of applications using this framework.
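To make the continuous-time primal update concrete, here is a small simulation sketch. It is my own Euler discretization of the rate (6.9), not the thesis's pseudocode, run on a toy linear objective; it only checks that every arrived covering constraint is (and stays) satisfied:

```python
import numpy as np

# Sketch (mine): discretized primal update of (6.9).  When constraint
# a_t . x >= 1 arrives, each x_j with a_tj > 0 grows at rate
# (a_tj x_j + 1/d_t) / grad_j f(x) until the constraint holds.
# Here f(x) = c . x, so grad f(x) = c.

def online_cover(rows, c, eps=1e-4):
    n = len(c)
    x = np.zeros(n)
    d = 0                                   # row sparsity seen so far
    for a in rows:
        d = max(d, int(np.count_nonzero(a)))
        while a @ x < 1:                    # raise x until feasible
            rate = np.where(a > 0, (a * x + 1.0 / d) / c, 0.0)
            x += eps * rate                 # Euler step of (6.9)
        yield x.copy()

c = np.array([1.0, 2.0, 4.0])
rows = [np.array([1.0, 1.0, 0.0]),
        np.array([0.0, 1.0, 1.0]),
        np.array([1.0, 0.0, 1.0])]
for a, x in zip(rows, online_cover(rows, c)):
    assert a @ x >= 1                       # x is monotone, so all past
print("final cost:", float(c @ x))          # constraints stay satisfied
```

Since $x$ only increases, constraints satisfied at their arrival remain satisfied for the rest of the run, matching Observation 6.1.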

Online Minimization

In the general framework above, we maintained both the primal and dual solutions simultaneously. If our goal is to solve (6.1) online, i.e., to minimize the convex function $f(x)$ subject to covering constraints arriving online, then the dual values can be determined in hindsight, once the final value of the primal variables $x$ has been computed. In particular, we set $\mu = \nabla f(\delta x)$ once and for all, and increase $y$ at a constant rate
$$r = \frac{1}{\ln(1+2d)} \min_{l=1}^n \left\{ \frac{\nabla_l f(\delta x)}{\nabla_l f(x)} \right\}.$$
These modifications can be easily plugged into the analysis above, allowing us to obtain a constant lower bound in (6.24), and thus to omit the minimization over $x$ in the competitive ratio. Observe that the update of the primal variables remains the same.

Corollary 6.3. For online minimization, the competitive ratio of the algorithm is:
$$\max_{c>0} \min_z \left[ \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(z)}{\nabla_l f(cz)} \right\}}{4 \ln(1+2d)} - \frac{z \cdot \nabla f(z) - f(z)}{f(cz)} \right]. \tag{6.27}$$

Monotone Online Maximization

If our goal is to solve (6.2) and maximize a dual objective function subject to packing constraints, then the above framework indeed increases the dual variables $\mu$; however, the dual variables $y$ can both increase and decrease. Moreover, this potential decrease is essential for the competitive ratio to be independent of the magnitude of entries in the matrix $A$ [GN14]. In settings where decreasing dual variables is not allowed, we need to slightly modify and simplify the online dual update in the algorithm by setting
$$r = \frac{1}{\ln(1+d_t \rho)} \min_{l=1}^n \left\{ \frac{\nabla_l f(\delta x)}{\nabla_l f(x)} \right\},$$
where $\rho$ is an upper bound on $\frac{\max_t \{a_{tj}\}}{\min_{t:\, a_{tj}>0} \{a_{tj}\}}$ for all $1 \le j \le n$, and we skip the last step, which decreases duals. Here, an application of the claim above at any round $t$ and time $\tau(t) \le \tau \le \tau(t)+1$ yields
$$1 \ \ge\ a_{tj} x_j^\tau \ \ge\ \frac{a_{tj}}{\max_{i=1}^t \{a_{ij}\}\, d} \left[ \exp\left( \frac{\ln(1+d\rho)}{\mu_j^\tau} \sum_{i=1}^t a_{ij} y_i \right) - 1 \right], \tag{6.28}$$
which implies $\sum_{i=1}^t a_{ij} y_i \le \frac{\ln\left(1 + d \cdot \frac{\max_{i=1}^t \{a_{ij}\}}{a_{tj}}\right)}{\ln(1+d\rho)} \cdot \mu_j^\tau$, and thus guarantees $\sum_{i=1}^t a_{ij} y_i \le \mu_j^\tau$.

Corollary 6.4. For online maximization, when decreasing dual variables is not allowed, the adjusted algorithm obtains the following competitive ratio:
$$\max_{c>0} \left[ \min_z \frac{\min_{l=1}^n \left\{ \frac{\nabla_l f(z)}{\nabla_l f(cz)} \right\}}{2 \ln(1+\rho d)} - \max_z \left\{ \frac{z \cdot \nabla f(z) - f(z)}{f(cz)} \right\} \right]. \tag{6.29}$$
This results in a worse competitive ratio, but having monotone duals is useful for two reasons: (a) in some settings we need monotone duals, as in the profit maximization application in Section 6.3.3, and (b) we get a simpler algorithm, since we skip the third step of the online dual update involving the dual decrease.

6.3 Applications

We demonstrate how the general framework above can be used to give algorithms for several previously studied as well as new problems. In contrast to previous papers, where a primal-dual algorithm had to be tailored to each of these problems, we use the framework above to solve the underlying convex program, and then apply a suitable rounding algorithm to the fractional solution.

6.3.1 $\ell_p$-norm of Packing Constraints

We consider the problem of solving a mixed packing-covering linear program online, as defined by Azar et al. [ABFP13]. The covering constraints $Ax \ge \mathbf{1}$ arrive online, as in the above setting. There are also $K$ packing constraints $\sum_{j=1}^n b_{kj} x_j \le \lambda_k$ for $k \in [K]$ that are given up-front. The right-hand sides $\lambda_k$ of these packing constraints are themselves variables, and the objective is to minimize $\sum_{k=1}^K \lambda_k^p$, or alternatively $\|\lambda\|_p = \left(\sum_{k=1}^K \lambda_k^p\right)^{1/p}$. All the entries in the constraint matrices $A = (a_{ij})$ and $B = (b_{kj})$ are nonnegative.

Theorem 6.5. There is an $O(p \ln d)$-competitive online algorithm for fractional covering with the objective of minimizing the $\ell_p$-norm of multiple packing constraints.

Proof. In order to apply our framework to this problem, we seek to minimize the convex function
$$f(x) = \frac{1}{p} \|Bx\|_p^p = \frac{1}{p} \sum_{k=1}^K (B_k \cdot x)^p = \frac{1}{p} \sum_{k=1}^K \left( \sum_{j=1}^n b_{kj} x_j \right)^p.$$
This is, up to the factor $1/p$, the $p$th power of the original objective; above, $B_k = (b_{k1}, \dots, b_{kn})$ is the $k$th packing constraint. To obtain the competitive ratio, observe that $\nabla_j f(x) = \sum_{k=1}^K b_{kj} (B_k \cdot x)^{p-1}$. Thus, we have
Thus, we have 79

for all $c > 0$, $z \in \mathbb{R}_+^n$ and $1 \le j \le n$:
$$\frac{f(z)}{f(cz)} = (1/c)^p, \qquad \frac{\nabla_j f(z)}{\nabla_j f(cz)} = (1/c)^{p-1}, \qquad z \cdot \nabla f(z) = \sum_{j=1}^n z_j \sum_{k=1}^K b_{kj} (B_k \cdot z)^{p-1} = p f(z),$$
and therefore
$$\frac{z \cdot \nabla f(z) - f(z)}{f(cz)} = (p-1) \frac{f(z)}{f(cz)} = (p-1)(1/c)^p.$$
Substituting $\delta = 1/c$ and plugging into (6.26), we get:
$$\frac{\text{Dual}}{\text{Primal}} \ \ge\ \frac{\delta^{p-1}}{4 \ln(1+2d)} - (p-1)\delta^p. \tag{6.30}$$
So the primal-dual ratio as a function of $\delta$ is at least $\delta^{p-1}/L - (p-1)\delta^p$, where $L = 4 \ln(1+2d)$. This quantity is maximized when $\delta = \frac{1}{pL}$, leading to a primal-dual ratio of $\left(\frac{1}{pL}\right)^p$. Taking the $p$th root of this quantity gives us that the $\ell_p$-norm of the primal is at most $pL = O(p \ln d)$ times the optimum.

When $p = \Theta(\ln m)$, so that the $\ell_p$ and $\ell_\infty$ norms are within constant factors of each other, we obtain the online mixed packing-covering LP (OMPC) problem studied by Azar et al. [ABFP13]. For this setting, this gives an improved $O(\ln d \ln m)$-competitive ratio, where $d$ is the row sparsity of the matrix $A$, and $m$ is the number of packing constraints. This competitive ratio is known to be tight [ABFP13, Theorem 1.2].

Remark 6.6. The above result also holds if the function $f$ is a sum of distinct powers of linear functions, i.e., $f(x) = \sum_{k=1}^K (B_k \cdot x)^{p_k}$, where $p_1, \dots, p_K \ge 1$ may be non-uniform. In this case, we obtain an $O(p \ln d)$-competitive algorithm, where $p = \max_{k=1}^K p_k$.

6.3.2 Online Set Cover with Multiple Costs

Consider the online set-cover problem [AAA+09] with $n$ sets $\{S_j\}_{j=1}^n$ over some ground set $U$. Apart from the set system, we are also given $K$ linear cost functions $B_k : [n] \to \mathbb{R}_+$ for $k \in [K]$. Elements from $U$ arrive online and must be covered by some set upon arrival; the decision to select a set into the solution is irrevocable. The goal is to maintain a set cover that minimizes the $\ell_p$-norm of the $K$ cost functions. We use Theorem 6.5, along with a rounding scheme similar to [GKP12], to obtain:

Theorem 6.7. There is an $O(p^3 \ln p \cdot \ln d \ln r)$-competitive randomized online algorithm for set cover minimizing the $\ell_p$-norm of multiple cost functions.
Here d is the maximum number of sets containing any element, and r = U is the number of elements. 80
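The homogeneity identities from the proof of Theorem 6.5 above, and the optimization of $\delta \mapsto \delta^{p-1}/L - (p-1)\delta^p$, are easy to sanity-check numerically. The sketch below uses made-up data ($B$, $z$, $c$, and the sparsity $d$ are illustrative, not from the text):

```python
# Numeric sanity check of the identities used in the proof of Theorem 6.5:
#   f(x) = (1/p) * sum_k (B_k . x)^p  satisfies
#   f(cz) = c^p f(z),  d_j f(cz) = c^{p-1} d_j f(z),  sum_j z_j d_j f(z) = p f(z),
# and delta -> delta^{p-1}/L - (p-1) delta^p peaks at delta = 1/(pL)
# with value (1/(pL))^p.  All instance data below is illustrative.
import math

p = 3
B = [[1.0, 2.0, 0.5],
     [0.3, 0.0, 4.0]]          # K = 2 packing rows, n = 3 variables
z = [0.7, 1.1, 0.4]
c = 2.5

def f(x):
    return sum(sum(bk[j] * x[j] for j in range(len(x))) ** p for bk in B) / p

def grad_f(x):
    rows = [sum(bk[j] * x[j] for j in range(len(x))) for bk in B]
    return [sum(B[k][j] * rows[k] ** (p - 1) for k in range(len(B)))
            for j in range(len(x))]

cz = [c * zj for zj in z]
assert abs(f(cz) - c ** p * f(z)) < 1e-9                  # f(cz) = c^p f(z)
g, gc = grad_f(z), grad_f(cz)
assert all(abs(gc[j] - c ** (p - 1) * g[j]) < 1e-9 for j in range(3))
euler = sum(z[j] * g[j] for j in range(3))
assert abs(euler - p * f(z)) < 1e-9                       # Euler's identity

L = 4 * math.log(1 + 2 * 5)    # L = 4 ln(1 + 2d) with d = 5 (illustrative)
ratio = lambda d_: d_ ** (p - 1) / L - (p - 1) * d_ ** p
d_star = 1 / (p * L)
assert abs(ratio(d_star) - d_star ** p) < 1e-12           # optimum value (1/pL)^p
for d_ in (0.5 * d_star, 2 * d_star):
    assert ratio(d_) < ratio(d_star)                      # d_star is the maximizer
```

The final loop only spot-checks two nearby points; the closed-form optimum follows from the first-order condition $(p-1)\delta^{p-2}(1/L - p\delta) = 0$.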

Proof. We use the following convex relaxation. There is a variable $x_j$ for each set $j \in [n]$, which denotes whether this set is chosen.
$$\min\; g(x) = \sum_{k=1}^{K}\Big(\sum_{j=1}^{n} b_{kj}x_j\Big)^p + \sum_{j=1}^{n}\sum_{k=1}^{K} b_{kj}^p\, x_j$$
$$\text{s.t.}\quad \sum_{j:\,e\in S_j} x_j \ge 1 \quad \forall e \in U, \qquad x \ge 0.$$
We can use our framework to solve this fractional convex covering problem online. Although the objective has a linear term in addition to the $p$-th powers, we obtain an $O(p\ln d)^p$-competitive algorithm, as noted in Remark 6.6. Let $C^*$ denote the $p$-th power of the optimal objective of the given set cover instance. Then it is clear that the optimal objective of the above fractional relaxation is at most $2C^*$. Thus the objective of our fractional online solution satisfies $g(x) = O(p\ln d)^p\,C^*$.

To get an integer solution, we use a simple online randomized rounding algorithm. For each set $j \in [n]$, define $X_j$ to be a $\{0,1\}$ random variable with $\Pr[X_j = 1] = \min\{4p\ln r\cdot x_j,\,1\}$. This can easily be implemented online. It is easy to see by a Chernoff bound that each element $e$ is left uncovered with probability at most $\frac{1}{r^{2p}}$. If an element $e$ is not covered by this rounding, we choose the set minimizing $\min_{j=1}^{n}\{\sum_{k=1}^{K} b_{kj}^p : e \in S_j\}$; let $\bar{e} \in [n]$ index this set, and let $C_{\bar{e}} = \sum_{k=1}^{K} b_{k\bar{e}}^p$. Observe that $C_{\bar{e}} \le C^*$ for all $e \in U$.

To bound the $\ell_p$-norm of the cost, let $C_k = \sum_{j=1}^{n} b_{kj}X_j$ be the cost of the randomly rounded solution under the $k$-th cost function, and let $\bar{C} := \sum_{k=1}^{K} C_k^p$. Also, for each element $e \in U$, define:
- $D_{ek} = b_{k\bar{e}}$ for all $k \in [K]$ and $D_e = C_{\bar{e}}$, if $e$ is not covered by the rounding;
- $D_{ek} = 0$ for all $k \in [K]$ and $D_e = 0$, otherwise.

Note that $D_e = \sum_{k=1}^{K} D_{ek}^p$. The $p$-th power of the cost paid by the algorithm is:
$$\hat{C} = \sum_{k=1}^{K}\Big(C_k + \sum_{e\in U} D_{ek}\Big)^p \le 2^p\sum_{k=1}^{K} C_k^p + 2^p\sum_{k=1}^{K}\Big(\sum_{e\in U} D_{ek}\Big)^p \le 2^p\bar{C} + 2^p r^{p-1}\sum_{k=1}^{K}\sum_{e\in U} D_{ek}^p \le 2^p\bar{C} + (2r)^p\sum_{e\in U} D_e. \tag{6.31}$$

We now bound $E[\bar{C}]$. Observe that $E[C_k] \le 4p\ln r\sum_{j=1}^{n} b_{kj}x_j$.
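Before continuing the analysis, here is how the rounding just described might look in code. The instance data is a toy example; coins are flipped lazily, one per set, and uncovered elements fall back to the cheapest backup set:

```python
import math, random

# Online rounding from the proof of Theorem 6.7 (sketch; toy data).
# sets_of[e] lists the sets containing element e; b[k][j] are the K cost
# functions; x[j] is the fractional solution maintained by the online
# framework (frozen here, for illustration only).
rng = random.Random(0)
K = 2
b = [[1.0, 2.0, 1.0, 3.0],
     [2.0, 1.0, 4.0, 1.0]]
sets_of = {'e1': [0, 1], 'e2': [1, 2], 'e3': [2, 3]}
x = {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5}          # fractional cover (toy)
p = 2
r = len(sets_of)

chosen, coin = set(), {}
def rounded(j):
    # flip set j's coin once: include with prob min{4p ln r * x_j, 1}
    if j not in coin:
        coin[j] = rng.random() < min(4 * p * math.log(r) * x[j], 1.0)
    return coin[j]

for e, js in sets_of.items():                  # elements arrive online
    winners = [j for j in js if rounded(j)]
    if winners:
        chosen.update(winners)
    else:                                      # backup: cheapest set for e
        ebar = min(js, key=lambda j: sum(b[k][j] ** p for k in range(K)))
        chosen.add(ebar)                       # cost C_ebar = sum_k b_k,ebar^p

assert all(any(j in chosen for j in js) for js in sets_of.values())
cost_lp = sum(sum(b[k][j] for j in chosen) ** p for k in range(K)) ** (1 / p)
```

With these toy values the inclusion probability saturates at 1, so every set is picked; the backup branch is exercised only when the coin flips fail for all sets of an element.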
Since each $C_k$ is the sum of independent nonnegative random variables, we can bound $E[C_k^p]$ using a concentration inequality involving $p$-th moments [Lat97]:
$$E[C_k^p] \le K_p\Big(E[C_k]^p + \sum_{j=1}^{n} E[b_{kj}^p X_j^p]\Big) \le K_p\Big((4p\ln r)^p\Big(\sum_{j=1}^{n} b_{kj}x_j\Big)^p + 4p\ln r\sum_{j=1}^{n} b_{kj}^p x_j\Big).$$

Above, $K_p = O(p/\ln p)^p$. By linearity of expectation,
$$E[\bar{C}] = \sum_{k=1}^{K} E[C_k^p] \le K_p(4p\ln r)^p\Big(\sum_{k=1}^{K}\Big(\sum_{j=1}^{n} b_{kj}x_j\Big)^p + \sum_{j=1}^{n}\sum_{k=1}^{K} b_{kj}^p x_j\Big) = K_p(4p\ln r)^p\, g(x).$$
Thus we have $E[\bar{C}] = O\big(\frac{p^3}{\ln p}\ln d\ln r\big)^p C^*$. Observe that
$$E\Big[\sum_{e\in U} D_e\Big] = \sum_{e\in U}\Pr[e\text{ uncovered}]\cdot C_{\bar{e}} \le r\cdot r^{-2p}\, C^* = r^{1-2p}\, C^*.$$
Hence, using these bounds in (6.31), we have
$$E[\hat{C}] \le 2^p E[\bar{C}] + (2r)^p\sum_{e\in U} E[D_e] = O\Big(\frac{p^3}{\ln p}\ln d\ln r\Big)^p C^*.$$
Taking the $p$-th root gives the claimed competitive ratio.

6.3.3 Profit Maximization with Nonseparable Production Costs

We consider a profit maximization problem, called PMPC, for a single seller with production costs for items. There are $m$ items that the seller can produce and sell. The production levels are given by a vector $\mu \in \mathbb{R}^m_+$; the total cost incurred by the seller to produce $\mu_j$ units of every item $j \in [m]$ is $g(\mu)$, for some production cost function $g : \mathbb{R}^m_+ \to \mathbb{R}_+$. In this work we allow functions $g$ which are convex and monotone in a certain sense.³ There are $n$ buyers who arrive online. Each buyer $i \in [n]$ is interested in certain subsets of items (a.k.a. bundles) which belong to some set family $S_i \subseteq 2^{[m]}$. The extent of interest of buyer $i$ in subset $S \in S_i$ is given by $v_i(S)$, where $v_i : S_i \to \mathbb{R}_+$ is her valuation function. If buyer $i$ is allocated a subset $T \in S_i$ of items, she pays the seller her valuation $v_i(T)$. Consider the optimization problem for the seller: he must produce some items and allocate bundles to buyers so as to maximize the profit $\sum_{i=1}^{n} v_i(T_i) - g(\mu)$, where $T_i \in S_i$ denotes the bundle allocated to buyer $i$ and $\mu = \sum_{i=1}^{n}\chi_{T_i} \in \mathbb{R}^m$ is the total quantity of all items produced. Here $\chi_S \in \{0,1\}^m$ is the characteristic vector of the set $S$. Observe that in this paper we consider a non-strategic setting, where the valuation of each buyer is known to the seller; this differs from an auction setting, where the seller has to allocate items to buyers without knowledge of the true valuations, and the buyers may have an incentive to misreport their true valuations.
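A minimal data model makes the PMPC objective concrete. Everything below (the quadratic cost, the bundles, the valuations) is illustrative, not from the text:

```python
# Profit of an integral allocation in PMPC:  sum_i v_i(T_i) - g(mu),
# where mu = sum_i chi_{T_i}.  Toy instance: m = 3 items, 2 buyers,
# and a convex production cost with a nonseparable cross term.
m = 3

def g(mu):                       # illustrative convex production cost
    return sum(u * u for u in mu) + mu[0] * mu[1]

buyers = [                       # (bundle family S_i, valuation v_i)
    ({frozenset({0}), frozenset({0, 1})}, {frozenset({0}): 2.0,
                                           frozenset({0, 1}): 3.5}),
    ({frozenset({2})}, {frozenset({2}): 4.0}),
]

def profit(allocation):          # allocation[i] in S_i, or None = skip buyer
    mu = [0] * m
    value = 0.0
    for (S_i, v_i), T in zip(buyers, allocation):
        if T is None:
            continue
        assert T in S_i          # only feasible bundles may be allocated
        value += v_i[T]
        for j in T:
            mu[j] += 1           # mu = sum_i chi_{T_i}
    return value - g(mu)

best = max(
    profit([t0, t1])
    for t0 in [None, frozenset({0}), frozenset({0, 1})]
    for t1 in [None, frozenset({2})]
)                                # brute force over all integral allocations
```

Brute force is only viable for such tiny instances; the point of the section is to obtain the allocation online via the primal-dual framework instead.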
This class of maximization problems with production costs was introduced by Blum et al. [BGMS11] and more recently studied by Huang and Kim [HK15]. Both of these works dealt with the online auction setting, but both considered a special case where the production costs are separable over items, i.e., where $g(\mu) = \sum_j g_j(\mu_j)$ for some convex functions $g_j$. In contrast, we can handle general production costs $g$, but we do not consider the auction setting.

³The formal conditions on $g$ appear in Assumption 6.3.1.

Our main result is for the fractional version of the problem, where the allocation to each buyer $i$ is allowed to be any point in the convex hull of $S_i$. In particular, we want to solve the following convex program in an online fashion:
$$\text{(D)}\qquad \text{maximize}\quad \sum_{i=1}^{n}\sum_{T\in S_i} v_i(T)\,y_{iT} - g(\mu)$$
$$\text{s.t.}\quad \sum_{T\in S_i} y_{iT} \le 1 \quad \forall i\in[n], \qquad \sum_{i=1}^{n}\sum_{T\in S_i:\, j\in T} y_{iT} - \mu_j \le 0 \quad \forall j\in[m], \qquad y,\mu \ge 0. \tag{6.33}$$
Note that this problem looks like the dual of the covering problems we have been studying in previous sections, and hence is suggestively called (D). Consider the following dual program, which gives an upper bound on the value of (D):
$$\text{(P)}\qquad \text{minimize}\quad \sum_{i=1}^{n} u_i + g^*(x)$$
$$\text{s.t.}\quad u_i + \sum_{j\in T} x_j \ge v_i(T) \quad \forall i\in[n],\, T\in S_i, \qquad u, x \ge 0. \tag{6.34}$$
Again, to be consistent with our general framework, we refer to this minimization covering problem as the primal (P). Observe that this primal-dual pair falls into the general framework of Section 6.2 if we set
$$f(u,x) := \sum_{i=1}^{n} u_i + g^*(x).$$
Indeed, if we were to construct the Fenchel dual of (P) as in Section 6.2, we would again arrive at (D) after some simplification, using the fact that $g^{**} = g$ for any convex function $g$ with subgradients⁴ [Roc70].

⁴A subgradient of $g : \mathbb{R}^m \to \mathbb{R}$ at $u$ is a vector $V_u \in \mathbb{R}^m$ such that $g(w) \ge g(u) + V_u^{\top}(w-u)$ for all $w \in \mathbb{R}^m$.

In order to apply our framework, we now assume that $f$ is continuous, differentiable, and satisfies $\nabla f(z) \ge \nabla f(z')$ for all $z \ge z'$. This translates into the following assumptions on the production function $g$:

Assumption 6.3.1. The function $g^* : \mathbb{R}^m_+ \to \mathbb{R}_+$ (recall $g^*(x) = \sup_{\mu}\{x^{\top}\mu - g(\mu)\}$) is monotone, convex, continuous, differentiable, and has $\nabla g^*(x) \ge \nabla g^*(x')$ for all $x \ge x'$.

Since we require irrevocable allocations, we cannot use the primal-dual algorithm from

Section 6.2.1, since that algorithm could decrease the dual variables $y_{iT}$. Instead, we use the monotone-dual version of the algorithm, which ensures that both primal and dual variables are monotonically raised. We can now use the competitive ratio from (6.29): when $g^*(0) = 0$, this ratio is at least
$$\max_{c>0}\left\{\min_{z}\frac{\min_{l=1}^{n}\,\partial_l g^*(z)/\partial_l g^*(cz)}{2\ln(1+\rho d)}\;-\;\max_{z}\frac{z^{\top}\nabla g^*(z)-g^*(z)}{g^*(cz)}\right\}. \tag{6.35}$$
In this expression, recall that $d$ is the row-sparsity of the covering constraints in (P), i.e., $d = 1 + \max_{i,\,T\in S_i}|T|$. The term $\rho$ is the ratio between the maximum and minimum non-zero valuations any player $i$ has for any set in $S_i$. In other words,
$$\rho \le R := \frac{\max\{v_i(T) : T\in S_i,\, i\in[n]\}}{\min\{v_i(T) : T\in S_i,\, v_i(T) > 0,\, i\in[n]\}}. \tag{6.36}$$

An Efficient Algorithm for (D)

To solve the primal-dual convex programs using our general framework in polynomial time, we need access to the following oracle:

Oracle: Given vectors $u, x$ and an index $i$, find a set $T \in S_i$ such that
$$u_i + \sum_{j\in T} x_j - v_i(T) < 0, \tag{6.37}$$
or else report that no such set exists.

Given such an oracle, we maintain $(u, x)$ such that $(2u, 2x)$ is feasible for (P), as follows. When a new buyer $i$ arrives, we use the oracle on $(2u, 2x)$. While it returns a set $T \in S_i$, we update $(u, x)$ to satisfy this constraint. Otherwise, we know that $(2u, 2x)$ is a feasible solution for (P). This scaling by a factor of $2$ allows us to bound the number of iterations as follows: when buyer $i$ arrives, define $Q_i = \min\{u_i, V^i_{\max}\} + \sum_{j=1}^{m}\min\{x_j, V^i_{\max}\}$, where $V^i_{\max} = \max\{v_i(T) \mid T \in S_i\}$. Note that $Q_i \le (m+1)V^i_{\max}$, and $Q_i$ increases by at least $V^i_{\min}/2$ in each iteration, where $V^i_{\min} = \min\{v_i(T) \mid T\in S_i,\, v_i(T) > 0\}$. So the number of iterations is at most $O(mR)$, where $R$ is defined in (6.36). This gives us a polynomial-time online algorithm if $R$ is polynomially bounded.

What properties do we need from the collections $S_i$ and valuation functions $v_i$ so that we can implement the oracle efficiently? Here are some cases where this is possible.

Small $S_i$. If each $S_i$ is polynomially bounded, then we can solve (6.37) just by enumeration. An example is when each buyer is single-minded, i.e., she wants exactly one bundle.

Supermodular valuations. Here, buyer $i$ has $S_i = 2^{[m]}$ and $v_i : 2^{[m]} \to \mathbb{R}_+$ is supermodular, i.e., $v_i(T_1) + v_i(T_2) \le v_i(T_1\cup T_2) + v_i(T_1\cap T_2)$ for all $T_1, T_2 \subseteq [m]$. In this case, we can solve (6.37) using polynomial-time algorithms for submodular minimization [Sch03], since the expression inside the minimum is a linear function minus a supermodular function.
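For the small-$S_i$ case, oracle (6.37) and the factor-2 feasibility loop can be sketched directly. Note that the additive step rule inside the loop is a placeholder, not the framework's actual primal-dual update, and all instance data is hypothetical:

```python
# Oracle (6.37) by enumeration, plus the scaling-by-2 feasibility loop.
# The while-loop's update step is a stand-in for the framework's real
# primal-dual update; bundles and valuations are toy data.
m = 3
S_i = [frozenset({0}), frozenset({1, 2})]
v_i = {frozenset({0}): 1.0, frozenset({1, 2}): 3.0}

def oracle(u_i, x, i_sets=S_i, i_val=v_i):
    """Return a T in S_i violating u_i + sum_{j in T} x_j >= v_i(T), else None."""
    for T in i_sets:                        # |S_i| is small: enumerate
        if u_i + sum(x[j] for j in T) - i_val[T] < 0:
            return T
    return None

u_i, x = 0.0, [0.0] * m
iters = 0
while True:                                 # maintain (2u, 2x) feasible for (P)
    T = oracle(2 * u_i, [2 * xj for xj in x])
    if T is None:
        break                               # (2u, 2x) is now feasible
    step = v_i[T] / (2 * (len(T) + 1))      # placeholder additive update:
    u_i += step                             # raises 2u + 2*sum_{j in T} x_j
    for j in T:                             # by exactly v_i(T), fixing T
        x[j] += step
    iters += 1

# (2u, 2x) satisfies every constraint of buyer i
assert all(2 * u_i + 2 * sum(x[j] for j in T) >= v_i[T] for T in S_i)
```

Each placeholder update raises the violated constraint's left-hand side (on the scaled point) by exactly $v_i(T)$, so a returned set never repeats; the real algorithm's multiplicative update yields the $O(mR)$ iteration bound via the potential $Q_i$.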

Matroid constrained valuations. In this setting, each buyer $i$ has some value $v_{ij}$ for each item $j \in [m]$, and the feasible bundles $S_i$ are the independent sets of some matroid.⁵ Here we can solve (6.37) by maximizing a linear function over a matroid, because the minimization
$$\min_{T\in S_i}\Big\{\sum_{j\in T} x_j - v_i(T)\Big\} = \min_{T\in S_i}\sum_{j\in T}(x_j - v_{ij}) = -\max_{T\in S_i}\sum_{j\in T}(v_{ij} - x_j)$$
can be done in polynomial time [Sch03].

⁵An alternative description of such valuation functions is to have $S_i = 2^{[m]}$ and $v_i(T)$ equal to the maximum weight of an independent subset of $T$, where each item $j$ has weight $v_{ij}$. Viewed this way, the buyer's valuation is a weighted matroid rank function, which is a special submodular function.

Online Rounding

We now have a deterministic online algorithm for (D) with competitive ratio as given in (6.35). Moreover, this algorithm runs in polynomial time for many special cases. Here we show how the fractional online solution can be rounded to give integral allocations. We make the following additional assumption on the production costs.

Assumption 6.3.2. There is a constant $\beta > 1$ such that $g(a\mu) \le a^{\beta} g(\mu)$ for all $0 < a < 1$, $\mu \in \mathbb{R}^m_+$.

Theorem 6.8. For any $\epsilon \in [\frac12, 1)$, there is a randomized online algorithm for PMPC under Assumptions 6.3.1 and 6.3.2 that achieves expected profit at least
$$\left(1-\frac{1}{m}\right)\frac{(1+\epsilon)^{-2\beta/(\beta-1)}}{\alpha}\cdot\mathrm{opt} - g(L\cdot\mathbf{1}),$$
where opt is the offline optimal profit, $\alpha$ is the fractional competitive ratio, and $L = O\big(\frac{\ln m}{\epsilon^2}\big)$.

Note that the additive error term $g(L\cdot\mathbf{1})$ is independent of the number $n$ of buyers: it depends only on the number $m$ of items and the production function $g$. We also give an example below which shows that any rounding algorithm for (D) must incur some such additive error.

We now describe the rounding algorithm. Let $\epsilon \in [\frac12, 1)$ be any value; set $a = (1+\epsilon)^{-2\beta/(\beta-1)}$. The rounding algorithm scales the fractional allocation $y$ by the factor $a < 1$ and performs randomized rounding. Let $M \in \mathbb{Z}^m_+$ denote the integral quantities of the different items produced at any point in the online rounding. Upon arrival of buyer $i$, the algorithm does the following.
1. Update the fractional solution $(y,\mu)$ according to the fractional online algorithm.
2. If $M_j > (1+\epsilon)a\mu_j + \frac{6}{\epsilon}\ln m$ for any $j \in [m]$, then skip buyer $i$.
3. Else, allocate set $T \in S_i$ to buyer $i$ with probability $a\cdot y_{iT}$.

Claim 6.3.3. $\Pr[M_j > (1+\epsilon)a\mu_j + \frac{6}{\epsilon}\ln m] \le \frac{1}{m^2}$ for all items $j \in [m]$ and $\epsilon \in [\frac12, 1)$.

Proof. Fix $j \in [m]$ and $\epsilon \in [\frac12, 1)$. Note that $M_j$ is the sum of independent 0-1 random variables with $E[M_j] \le a\mu_j$. The claim now follows by the Chernoff bound $\Pr[M_j > (1+\delta)E[M_j]] \le \exp\big(-\frac{\delta^2}{2+\delta}E[M_j]\big)$, when setting $\delta = \epsilon + \frac{6\ln m}{\epsilon\,E[M_j]}$.
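Claim 6.3.3 can also be checked empirically: draw $M_j$ as a sum of independent Bernoullis with mean at most $a\mu_j$ and count threshold violations. All parameters below are illustrative, and with a fixed seed the estimate is deterministic:

```python
import math, random

# Empirical check of Claim 6.3.3:
#   Pr[M_j > (1+eps)*a*mu_j + (6/eps)*ln m] <= 1/m^2.
# Toy parameters; M_j is a sum of independent Bernoulli(a*y) variables,
# so its mean is a*mu_j.
rng = random.Random(1)
m, eps, beta = 50, 0.5, 2.0
a = (1 + eps) ** (-2 * beta / (beta - 1))    # a = (1+eps)^{-2*beta/(beta-1)}
y = [0.25] * 40                              # fractional allocations hitting item j
mu_j = sum(y)                                # mu_j = 10, so E[M_j] = a*mu_j
threshold = (1 + eps) * a * mu_j + (6 / eps) * math.log(m)

trials, bad = 5000, 0
for _ in range(trials):
    M_j = sum(rng.random() < a * yt for yt in y)
    bad += M_j > threshold
assert bad / trials <= 1 / m ** 2 + 0.01     # small slack for sampling noise
```

For these parameters the additive $\frac{6}{\epsilon}\ln m$ term alone already exceeds the maximum possible $M_j$, so no violations occur at all; the bound only becomes tight for much larger $\mu_j$.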

Below, $l := \frac{6}{\epsilon}\ln m + 1$ and $L := \frac{1+\epsilon}{\epsilon}\,l = O\big(\frac{\ln m}{\epsilon^2}\big)$.

Lemma 6.3.4. The expected objective of the integral allocation is at least
$$a\left(1-\frac{1}{m}\right)\left(\sum_{i=1}^{n}\sum_{T\in S_i} v_i(T)\,y_{iT} - g(\mu)\right) - g(L\cdot\mathbf{1}).$$

Proof. For convenience, assume that $m \ge 3$.⁶ Note that the algorithm ensures, in step 2 above, that $M \le (1+\epsilon)a\mu + l\cdot\mathbf{1}$. So with probability one, the production cost is at most:
$$g\big((1+\epsilon)a\mu + l\cdot\mathbf{1}\big) = g\Big(\tfrac{1}{1+\epsilon}\cdot(1+\epsilon)^2 a\mu + \tfrac{\epsilon}{1+\epsilon}\cdot\tfrac{1+\epsilon}{\epsilon}\,l\cdot\mathbf{1}\Big) \le \tfrac{1}{1+\epsilon}\,g\big((1+\epsilon)^2 a\mu\big) + \tfrac{\epsilon}{1+\epsilon}\,g\big(\tfrac{1+\epsilon}{\epsilon}\,l\cdot\mathbf{1}\big)$$
$$\le \tfrac{1}{1+\epsilon}\big((1+\epsilon)^2 a\big)^{\beta}g(\mu) + g(L\cdot\mathbf{1}) = \tfrac{1}{1+\epsilon}\,a\,g(\mu) + g(L\cdot\mathbf{1}) \le a\left(1-\tfrac{1}{m}\right)g(\mu) + g(L\cdot\mathbf{1}).$$
The first inequality holds by convexity of $g$. The second inequality uses Assumption 6.3.2 and $(1+\epsilon)^2 a < 1$. The next equality is by the definition of $a$. The last inequality follows as $\epsilon \ge \frac12$ and $m \ge 3$.

By Claim 6.3.3 (and a union bound over the $m$ items), the probability that we skip some buyer $i$ is at most $\frac{1}{m}$. Thus the expected total value is at least $\big(1-\frac{1}{m}\big)a\sum_{i=1}^{n}\sum_{T\in S_i} v_i(T)\,y_{iT}$. Subtracting the upper bound on the cost from the expected value, we obtain the lemma.

This completes the proof of Theorem 6.8.

Integrality Gap. We note that the additive error term is necessary for any algorithm based on the convex relaxation (D). Consider a single buyer with $S_1 = 2^{[m]}$ and $v_1(T) = |T|$. Let $g(\mu) = \sum_{j=1}^{m}\mu_j^2$. The optimal integral allocation clearly has profit zero. However, the fractional optimum is $\Omega(m)$, due to the feasible solution with $y_{1T} = 2^{-m}$ for all $T \subseteq [m]$ and $\mu_j = \frac12$ for all $j \in [m]$. Thus any algorithm using this relaxation incurs an additive error depending on $m$.

Examples of Production Costs

Here we give two examples of production costs satisfying Assumptions 6.3.1 and 6.3.2 to which our results apply. In each case, we first show the competitive ratio obtained for the fractional convex program, and then use the rounding algorithm to obtain an integral solution.

⁶If there are only one or two items we can always add a dummy item, without affecting the competitive ratio by more than a constant factor.
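The integrality-gap instance can be verified exactly for a small $m$ by enumerating all $2^m$ bundles. The code mirrors the construction ($v_1(T) = |T|$, $g(\mu) = \sum_j \mu_j^2$, $y_{1T} = 2^{-m}$, $\mu_j = \frac12$):

```python
from itertools import combinations

# Integrality gap for relaxation (D): single buyer, v_1(T) = |T|,
# g(mu) = sum_j mu_j^2.  Every integral allocation has profit 0, while
# the fractional solution y_{1T} = 2^{-m}, mu_j = 1/2 has profit m/4.
m = 8
items = range(m)

def g(mu):
    return sum(u * u for u in mu)

# integral: allocating T yields |T| - g(chi_T) = |T| - |T| = 0
best_int = max(
    len(T) - g([1.0 if j in T else 0.0 for j in items])
    for size in range(m + 1)
    for T in combinations(items, size)
)
assert best_int == 0

# fractional: uniform weight 2^{-m} on every bundle
frac_value = sum(size * len(list(combinations(items, size)))
                 for size in range(m + 1)) / 2 ** m       # = m/2
frac_cost = g([0.5] * m)                                  # = m/4
frac_profit = frac_value - frac_cost                      # = m/4
assert abs(frac_profit - m / 4) < 1e-9
```

The average bundle size under the uniform distribution is $m/2$, which is exactly where the $\Omega(m)$ fractional profit comes from.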

Example 1. Consider a seller who can produce items in $K$ different factories, where the $k$-th factory produces, in one hour of work, $p_{kj}$ units of item $j$. The production cost is the sum of the $q$-th powers of the work hours of the $K$ factories (specifically, we get a linear production cost for $q = 1$, and the $q$-th power of the makespan when $q \ge \ln K$). This corresponds to the following function:
$$g(\mu) = \min_{z}\Big\{\frac{1}{q}\sum_{k=1}^{K} z_k^q \;:\; \sum_{k=1}^{K} p_{kj}z_k \ge \mu_j \;\;\forall j\in[m],\;\; z \ge 0\Big\}. \tag{6.38}$$
We scale the objective by $1/q$ to get a more convenient form. The dual function is:
$$g^*(x) = \frac{1}{p}\sum_{k=1}^{K}\Big(\sum_{j=1}^{m} p_{kj}x_j\Big)^p, \qquad \text{where } \frac1p + \frac1q = 1.$$
Applying our framework (Assumption 6.3.1 is satisfied) as in Section 6.3.1, we obtain an $\alpha = O(p\ln(\rho d))$-competitive fractional online algorithm, where $\rho \le R$ (the maximum-to-minimum ratio of valuations) and the row-sparsity is $d \le m+1$. We note that our framework obtains a monotone feasible solution $(y,\mu)$, but does not explicitly assign a value to $z$. To obtain a monotone feasible $z$, we use the fact that, by the algorithm's construction,
$$\mu_j = \partial_j g^*(\delta x) = \delta^{p-1}\sum_{k=1}^{K} p_{kj}\Big(\sum_{j'=1}^{m} p_{kj'}x_{j'}\Big)^{p-1}.$$
Hence we can define the assignment
$$z_k = \delta^{p-1}\Big(\sum_{j=1}^{m} p_{kj}x_j\Big)^{p-1},$$
which is monotonically increasing (as $x$ is monotonically increasing), and consistent with the dual feasible solution $(y,\mu)$. Furthermore, the ratio between the values of the primal and dual programs matches exactly the competitive ratio stated in Theorem 6.2. Therefore, combined with Theorem 6.8 (note that Assumption 6.3.2 is satisfied with $\beta = q$), setting $\epsilon = \frac12$, we obtain:

Corollary 6.9. There is a randomized online algorithm for PMPC with cost function (6.38), for $q > 1$, that achieves expected profit at least
$$\left(1-\frac{1}{m}\right)\frac{\mathrm{opt}}{O(p\ln(Rd))} - g(O(\ln m)\cdot\mathbf{1}).$$
Note that $g(O(\ln m)\cdot\mathbf{1}) \le K\big(\frac{O(\ln m)}{p_{\min}}\big)^q$, where $p_{\min} > 0$ is the minimum positive entry among the $p_{kj}$'s.

Example 2. This deals with the dual of the above production cost. Suppose there are $K$ different linear cost functions: for $k \in [K]$, the $k$-th cost function is given by $(c_{k1},\ldots,c_{km})$, where $c_{kj}$ is the cost per unit of item $j \in [m]$. The production cost $g$ is defined to be the scaled sum of the $p$-th powers of these $K$ costs:
$$g(\mu) = \frac{1}{p}\sum_{k=1}^{K}\Big(\sum_{j=1}^{m} c_{kj}\mu_j\Big)^p. \tag{6.39}$$
This has dual:
$$g^*(x) = \min_{z}\Big\{\frac{1}{q}\sum_{k=1}^{K} z_k^q \;:\; \sum_{k=1}^{K} c_{kj}z_k \ge x_j \;\;\forall j\in[m],\;\; z \ge 0\Big\}, \qquad \text{where } \frac1p+\frac1q=1.$$
The primal program (P$'$), obtained from (P) after eliminating the variables $\{x_j\}_{j=1}^{m}$, is given below with its dual (D$'$):
$$\text{(P}'\text{)}\quad \text{minimize } \sum_{i=1}^{n} u_i + \frac{1}{q}\sum_{k=1}^{K} z_k^q \quad \text{s.t.}\quad u_i + \sum_{j\in T}\sum_{k=1}^{K} c_{kj}z_k \ge v_i(T) \;\;\forall i\in[n],\,T\in S_i; \qquad u, z \ge 0.$$
$$\text{(D}'\text{)}\quad \text{maximize } \sum_{i=1}^{n}\sum_{T\in S_i} v_i(T)\,y_{iT} - \frac{1}{p}\sum_{k=1}^{K}\lambda_k^p \quad \text{s.t.}\quad \sum_{T\in S_i} y_{iT} \le 1 \;\;\forall i\in[n]; \quad \sum_{i=1}^{n}\sum_{T\in S_i}\sum_{j\in T} c_{kj}\,y_{iT} - \lambda_k \le 0 \;\;\forall k\in[K]; \qquad y,\lambda \ge 0.$$
Note that the row-sparsity in (P$'$) is $d' = K+1$, which is incomparable to $m$. We obtain a solution to (P) by setting $x = Cz$ from any solution $(u,z)$ to (P$'$), where $C \in \mathbb{R}^{m\times K}$ has $k$-th column $(c_{k1},\ldots,c_{km})$. We can apply our algorithm to the convex covering problem (P$'$), as $\bar{g}(\lambda) = \frac{1}{p}\sum_{k=1}^{K}\lambda_k^p$ satisfies Assumption 6.3.1. This algorithm maintains monotone feasible solutions $(u,z)$ to (P$'$) and $(y,\lambda)$ to (D$'$). However, to solve (D) online we need to maintain the variables $(y,\mu)$, which differ from the variables $(y,\lambda)$ in (D$'$). We keep $y$ in (D) the same as in (D$'$), and set the production quantities $\mu_j = \sum_{i=1}^{n}\sum_{T\in S_i}\mathbf{1}[j\in T]\,y_{iT}$, so that all constraints in (D) are satisfied. Note that the dual variables $y$ (allocations) and $\mu$ (production quantities) are monotonically increasing, so this is a valid online algorithm. In order to bound the objective in (D), we use the feasible solution $(y,\lambda)$ of (D$'$). Note that for all $k \in [K]$:
$$c_k^{\top}\mu = \sum_{j=1}^{m} c_{kj}\mu_j = \sum_{j=1}^{m} c_{kj}\sum_{i=1}^{n}\sum_{T\in S_i}\mathbf{1}[j\in T]\,y_{iT} = \sum_{i=1}^{n}\sum_{T\in S_i} y_{iT}\sum_{j\in T} c_{kj} \le \lambda_k.$$
So the objective of $(y,\mu)$ in (D) is at least that of $(y,\lambda)$ in (D$'$). Our general framework then implies a competitive ratio for the fractional problem of $\alpha = O(q\ln(\rho' d')) = O(q\ln\rho')$, where
$$\rho' \le R\cdot K\cdot\frac{\max\{c_{kj} : k\in[K],\,j\in[m]\}}{\min\{c_{kj} : k\in[K],\,j\in[m]\}}.$$
Above, $R$ is the maximum-to-minimum ratio of valuations, and recall that $d' \le K+1$. Combined with Theorem 6.8 (with $\epsilon = \frac12$), we obtain:

Corollary 6.10. There is a randomized online algorithm for PMPC with cost function (6.39), for $p > 1$, that achieves expected profit at least
$$\left(1-\frac{1}{m}\right)\frac{\mathrm{opt}}{O(q\ln\rho')} - g(O(\ln m)\cdot\mathbf{1}).$$
Here $g(O(\ln m)\cdot\mathbf{1}) \le K\big(O(m\ln m)\,c_{\max}\big)^p$, where $c_{\max}$ is the maximum entry among the $c_{kj}$'s.
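The reduction in Example 2 hinges on the mapping $\mu_j = \sum_i\sum_{T\ni j} y_{iT}$ keeping $(y,\mu)$ feasible for (D); the computation $c_k^{\top}\mu \le \lambda_k$ can be checked mechanically on toy data (all numbers below are illustrative):

```python
# Example 2 reduction check: given (y, lambda) feasible for (D'),
# setting mu_j = sum_i sum_{T in S_i: j in T} y_iT gives
# c_k . mu <= lambda_k for every k, so (y, mu) is feasible for (D).
m, K = 3, 2
c = [[1.0, 2.0, 1.0],            # c[k][j]: unit cost of item j under cost k
     [3.0, 0.5, 2.0]]

# y[i] maps bundle -> fractional allocation (toy; sum_T y_iT <= 1)
y = [{frozenset({0, 1}): 0.4, frozenset({2}): 0.3},
     {frozenset({1, 2}): 0.6}]

# lambda_k from (D')'s packing constraint, taken tight here
lam = [sum(yT * sum(c[k][j] for j in T) for yi in y for T, yT in yi.items())
       for k in range(K)]

mu = [sum(yT for yi in y for T, yT in yi.items() if j in T) for j in range(m)]

for k in range(K):
    ck_mu = sum(c[k][j] * mu[j] for j in range(m))
    assert ck_mu <= lam[k] + 1e-9          # c_k . mu <= lambda_k

# hence the D-objective of (y, mu) is at least that of (y, lambda) in (D')
p = 2
g_mu = sum(sum(c[k][j] * mu[j] for j in range(m)) ** p for k in range(K)) / p
g_lam = sum(l ** p for l in lam) / p
assert g_mu <= g_lam + 1e-9
```

Since $\lambda$ was taken tight, the inequality holds with equality here; in general the online algorithm may leave slack in the packing constraints of (D$'$), which only helps (D)'s objective.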

Bibliography

[AAA+03] N. Alon, B. Awerbuch, Y. Azar, N. Buchbinder, and J. Naor. The online set cover problem. In Proceedings of the Thirty-fifth Annual ACM Symposium on Theory of Computing, STOC, pages , [AAA+06] N. Alon, B. Awerbuch, Y. Azar, N. Buchbinder, and J. Naor. A general approach to online network optimization problems. ACM Transactions on Algorithms, 2(4): , [AAA+09] N. Alon, B. Awerbuch, Y. Azar, N. Buchbinder, and J. Naor. The online set cover problem. SIAM Journal on Computing, 39(2): , [ABBS10] J. Abernethy, P. L. Bartlett, N. Buchbinder, and I. Stanton. A regularization approach to metrical task systems. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, ALT, pages , [ABFP13] Y. Azar, U. Bhaskar, L. K. Fleischer, and D. Panigrahi. Online mixed packing and covering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages , [ABL+13] L. Andrew, S. Barman, K. Ligett, M. Lin, A. Meyerson, A. Roytman, and A. Wierman. A tale of two metrics: simultaneous bounds on competitiveness and regret. In SIGMETRICS, pages , [ACER12] A. Adamaszek, A. Czumaj, M. Englert, and H. Räcke. An O(log k)-competitive algorithm for generalized caching. In Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages , [ACN00] D. Achlioptas, M. Chrobak, and J. Noga. Competitive analysis of randomized paging algorithms. Theoretical Computer Science, 234: , [ACP14] Y. Azar, I. R. Cohen, and D. Panigrahi. Online covering with convex objectives and applications. CoRR, abs/ ,

[ACV13] J. Abernethy, Y. Chen, and J. Wortman Vaughan. Efficient market making via convex optimization, and a connection to online learning. ACM Transactions on Economics and Computation, 1(2):12:1-12:39, [AGK12] S. Anand, N. Garg, and A. Kumar. Resource augmentation for weighted flow-time explained by dual fitting. In Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages . SIAM, [BB97] A. Blum and C. Burch. On-line learning and the metrical task system problem. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, COLT, pages 45-53, [BB00] A. Blum and C. Burch. On-line learning and the metrical task system problem. Machine Learning, 39(1):35-58, [BBK99] A. Blum, C. Burch, and A. T. Kalai. Finely-competitive paging. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS, pages , [BBMN11] N. Bansal, N. Buchbinder, A. Madry, and J. Naor. A polylogarithmic-competitive algorithm for the k-server problem. In Proceedings of the 52nd Annual Symposium on Foundations of Computer Science, FOCS, pages , [BBN10] N. Bansal, N. Buchbinder, and J. Naor. Towards the randomized k-server conjecture: a primal-dual approach. In Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 40-55, [BBN12a] N. Bansal, N. Buchbinder, and J. Naor. A primal-dual randomized algorithm for weighted paging. Journal of the ACM, 59(4):19:1-19:24, [BBN12b] N. Bansal, N. Buchbinder, and J. Naor. Randomized competitive algorithms for generalized caching. SIAM Journal on Computing, 41(2): , [BCK02] A. Blum, S. Chawla, and A. T. Kalai. Static optimality and dynamic search-optimality in lists and trees. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1-8, [BCP09] N. Bansal, Ho-Leung Chan, and K. Pruhs. Speed scaling with an arbitrary power function.
In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages . Society for Industrial and Applied Mathematics,

[Bel66] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78-101, [Ben13] M. Bensimhoun. A note on the mediant inequality. [BETW01] M. Brehob, R. Enbody, E. Torng, and S. Wagner. On-line restricted caching. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages , [BEWT04] M. Brehob, R. Enbody, S. Wagner, and E. Torng. Optimal replacement is NP-hard for nonstandard caches. IEEE Transactions on Computers, 53(1):73-76, [BEY98] A. Borodin and R. El-Yaniv. Online computation and competitive analysis. Cambridge University Press, [BG13] N. Buchbinder and R. Gonen. Incentive compatible multi-unit combinatorial auctions: a primal dual approach. Algorithmica, pages 1-24, [BGMS11] A. Blum, A. Gupta, Y. Mansour, and A. Sharma. Welfare and profit maximization with production costs. In Proceedings of the 52nd Annual Symposium on Foundations of Computer Science, FOCS, pages 77-86, [BKRS00] A. Blum, H. J. Karloff, Y. Rabani, and M. E. Saks. A decomposition theorem for task systems and bounds for randomized server problems. SIAM Journal on Computing, 30(5): , [BL10] J. M. Borwein and A. S. Lewis. Convex analysis and nonlinear optimization: theory and examples, volume 3. Springer Science & Business Media, [BLS92] A. Borodin, N. Linial, and M. E. Saks. An optimal on-line algorithm for metrical task system. Journal of the ACM, 39(4): , [BN05] N. Buchbinder and J. Naor. Online primal-dual algorithms for covering and packing problems. In Proceedings of the 13th Annual European Conference on Algorithms, ESA, pages , [BN09a] N. Buchbinder and J. Naor. The design of competitive online algorithms via a primal-dual approach. Foundations and Trends in Theoretical Computer Science, 3(2-3):93-263, [BN09b] N. Buchbinder and J. Naor. Online primal-dual algorithms for covering and packing. Mathematics of Operations Research, 34(2): ,

[BNS13] N. Buchbinder, J. Naor, and R. Schwartz. Simplex partitioning via exponential clocks and the multiway cut problem. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC, pages , [BPS09] N. Bansal, K. Pruhs, and C. Stein. Speed scaling for weighted flow time. SIAM Journal on Computing, 39(4): , [BV04] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, [CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, [CCPV11] G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6): , [CFLP00] R. D. Carr, L. K. Fleischer, V. J. Leung, and C. A. Phillips. Strengthening integrality gaps for capacitated network design and covering problems. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages , [CMEDV10] K. Crammer, Y. Mansour, E. Even-Dar, and J. Wortman Vaughan. Regret minimization with concept drift. In Conference on Computational Learning Theory (COLT), pages , [CT91] T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience, [Cun84] W. H. Cunningham. Testing membership in matroid polyhedra. Journal of Combinatorial Theory, Series B, 36(2): , [DH14] N. R. Devanur and Z. Huang. Primal dual gives almost optimal energy efficient online algorithms. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages , [EDKMM09] E. Even-Dar, R. Kleinberg, S. Mannor, and Y. Mansour. Online learning for global cost functions. In Conference on Computational Learning Theory (COLT), [Edm70] J. Edmonds. Submodular functions, matroids, and certain polyhedra. Combinatorial Structures and Their Applications, pages 69-87,

[Edm71] J. Edmonds. Matroids and the greedy algorithm. Mathematical Programming, 1(1): , [EvS07] L. Epstein and R. van Stee. Calculating lower bounds for caching problems. Computing, 80(3): , [FKL+91] A. Fiat, R. M. Karp, M. Luby, L. A. McGeoch, D. D. Sleator, and N. E. Young. Competitive paging algorithms. Journal of Algorithms, 12(4): , [FKM05] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages , [FL93] J. Friedman and N. Linial. On convex body chasing. Discrete & Computational Geometry, 9: , [FMS02] A. Fiat, M. Mendel, and S. S. Seiden. Online companion caching. In Proc. of the 10th Annual European Symposium on Algorithms, ESA, pages , [FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): , [GKP10] A. Gupta, R. Krishnaswamy, and K. Pruhs. Scalably scheduling power-heterogeneous processors. In Automata, Languages and Programming, pages . Springer, [GKP12] A. Gupta, R. Krishnaswamy, and K. Pruhs. Online primal-dual for nonlinear optimization with applications to speed scaling. In Workshop on Approximation and Online Algorithms, WAOA, pages , [GLS01] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3): , [GN14] A. Gupta and V. Nagarajan. Approximating sparse covering integer programs online. Mathematics of Operations Research, 39(4): , [Gor99] G. J. Gordon. Regret bounds for prediction problems. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT, pages 29-40,

[GTW14] A. Gupta, K. Talwar, and U. Wieder. Changing bases: Multistage optimization for matroids and matchings. In Automata, Languages, and Programming - 41st International Colloquium, ICALP, pages , [Gut11] S. B. Guthery. A motif of mathematics. Docent Press, [Haz09] E. Hazan. A survey: The convex optimization approach to regret minimization. [HK15] Z. Huang and A. Kim. Welfare maximization with production costs: A primal dual approach. pages 59-72, [HKKA06] E. Hazan, A. T. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Learning Theory, pages . Springer Berlin / Heidelberg, [HS09] E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML, pages , [HW69] F. Harary and D. Welsh. Matroids versus graphs. In The Many Facets of Graph Theory, volume 110 of Lecture Notes in Mathematics, pages . [HW98] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2): , [Kor05] S. Korman. On the use of randomness in the online set cover problem. M.Sc. thesis, Weizmann Institute of Science, [KSST12] S. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. The Journal of Machine Learning Research, 13(1): , [KV05] A. T. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3): , [KW97] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, [KWK10] W. M. Koolen, M. K. Warmuth, and J. Kivinen. Hedging structured concepts. In Conference on Computational Learning Theory (COLT), pages ,

[Lat97] R. Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3): , [Law76] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Dover Books on Mathematics Series. Dover Publications, [LW94] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2): , [MS91] L. A. McGeoch and D. D. Sleator. A strongly competitive randomized paging algorithm. Algorithmica, 6(1-6): , [NN94] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Methods in Convex Programming. Studies in Applied and Numerical Mathematics. Society for Industrial and Applied Mathematics, [Pes03] E. Peserico. Online paging with arbitrary associativity. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages , [Rak09] A. Rakhlin. Lecture notes on online learning (draft). [Roc70] R. T. Rockafellar. Convex analysis. Number 28. Princeton University Press, [RST11] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: beyond regret. In Conference on Computational Learning Theory (COLT), pages , [Sch03] A. Schrijver. Combinatorial optimization: polyhedra and efficiency, volume 24 of Algorithms and Combinatorics. Springer-Verlag, Berlin, [SS11] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, [ST85] D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2): , [Vaz01] V. V. Vazirani. Approximation Algorithms. Springer-Verlag New York, Inc., [VCZ11] J. Vondrák, C. Chekuri, and R. Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC, pages ,

[Whi35] H. Whitney. On the abstract properties of linear dependence. American Journal of Mathematics, 57(3): , [Zin03] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, ICML, pages ,


108 פרק 6 עוסק בבעיות כיסוי מקוונות הנדרשות למזער פונקציית מטרה קמורה, שבהן אילוצי הכיסוי מגיעים לאורך הזמן, וכן בבעיה הדואלית המתאימה המורכבת מאילוצי אריזה ומפונקציית מטרה קעורה. אנו מציגים טכניקה פרימאלית דואלית לפתרון שני סוגי הבעיות, עם יחס תחרותיות התלוי בתכונות מונוטוניות ו חלקות מסוימות של פונקציית המטרה, אשר משווה ואף משפר את החסמים המוכרים למשפחות ספציפיות של פונקציות מטרה. הטכניקה החדשה הינה הרחבה של הטכניקה הפרימאלית דואלית לתוכניות ליניאריות שפותחה בתחום ניתוח התחרותיות, תוך שימוש בדואליות ובפונקציות צמודות כלים אנליטיים שנחקרו רבות בתחום הלמידה והאופטימיזציה המקוונת. iii

109 מצב, עולם ניתוח התחרותיות מניח קיום של עלות תנועה למעבר בין מצבים. בעולם הלמידה המקוונת, הדבר יהיה שקול להגדרת עלות נוספת על כל החלפה של פעולה בין סיבובים עלות שעל פי רוב אינה מוגדרת. הבדל נוסף הוא שבניתוח תחרותיות אנו מניחים כי מקבל ההחלטה ראשית נחשף לעלויות של הסיבוב הנוכחי, ורק לאחר מכן בוחר כיצד לפעול. לעומת זאת, בלמידה מקוונת מקבל ההחלטה נחשף לעלויות רק לאחר שבחר פעולה. בעבודה זו, אנו מנסים לחבר בין שני העולמות: למידה מקוונת וניתוח תחרותיות. ניתן לסווג ניסיון זה לשני קווי פעולה. קו הפעולה הראשון שואף לנצל את הדמיון בין שני התחומים על מנת להציג גישה אלגוריתמית מאוחדת, אשר משיגה גם חרטה אופטימלית וגם יחס תחרותיות אופטימלי, עבור מחלקה גדולה של בעיות בשני התחומים. קו הפעולה השני מטרתו לגשר על הפער האנליטי השורר בין שני התחומים. כלומר, בעוד ששתי קהילות המחקר פעלו בנפרד, שיטות אלגוריתמיות וכלים אנליטיים שונים התפתחו בתחום אחד ללא התייחסות מרובה מצד התחום השני. במחקר זה אנו מנסים לשאול טכניקות שונות, במיוחד מעולם הלמידה, על מנת להשיג תוצאות חדשות בתחום ניתוח התחרותיות. מרבית התיזה עוסקת בעניין זה. תרומה וחלוקה לפרקים בפרק 3 אנחנו מתארים גישה חדשה לפיתוח אלגוריתמים תחרותיים המתבססת על רגולריזציה טכניקה נפוצה בתחום הלמידה, ובפרט בתחום האופטימיזציה המקוונת. באמצעות גישה זו אנו מציגים אלגוריתם דטרמיניסטי כללי למציאת פתרון תחרותי לא בהכרח בשלמים אשר מספק אוסף אילוצי כיסוי אשר משתנה עם הזמן. שיטה זו מאפשרת לנו להגדיר ולפתור מגוון בעיות הכוללות הן עלויות לביצוע פעולות והן עלויות תנועה, עבור מכלול של יישומים. בנוסף לכך אנו מציגים אלגוריתם תחרותי אקראי לבעיית הכיסוי בקבוצות עם מחירי שירות, גירסא מקוונת של בעיית הכיסוי בקבוצות [03 + [AAA Online Set Cover, שמאפשרת גם הוספה וגם הסרה של קבוצות מתוך הפתרון לאורך שלבי הביצוע. בפרק 4, אנו מאמצים את גישת הרגולריזציה לפיתוח אלגוריתם מאוחד, אשר באמצעות כיול משתנים מאפשר להשיג חרטה אופטימלית עבור בעיות למידה מקוונת מצד אחד ויחס תחרותיות אופטימלי עבור בעיות ניתוח תחרותיות מאידך. האלגוריתם מאפשר לנו גם להשיג חסמי חרטה חדשים אל מול מומחים משתנים drifting,experts תוצאה שעשויה לעורר עניין באופן בלתי תלוי. 
Moreover, the new approach allows us to extend these results to problems in which the set of actions in each round is more complex and has certain properties that can be represented as a matroid. In Chapter 5 we use the tools developed so far to solve the restricted caching problem. Caching is a classical problem in the field of online algorithms: we are given a cache that can hold a relatively small number of memory pages, requests for different pages arrive one after another, and whenever a request arrives for a page that is not in the cache the algorithm pays for bringing that page in. The goal is to minimize the total cost the algorithm pays for bringing pages into the cache. Restricted caching extends this problem by associating with each page a subset of cache locations in which it may be stored. To the best of our knowledge, we present the first randomized competitive algorithm achieving a polylogarithmic competitive ratio for this problem. We also match and even improve existing results for special cache architectures studied in the past.
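To make the cost model concrete, here is a toy sketch of restricted caching (illustrative only: the function name and the naive LRU-style eviction rule are ours, not the thesis's randomized algorithm):

```python
# Toy cost model for restricted caching: each page may only occupy a
# given subset of cache slots, and every miss costs one fetch.

def restricted_caching_cost(requests, allowed, num_slots):
    """Serve `requests` with a naive policy: on a miss, place the page in
    a free allowed slot if one exists, otherwise evict the least recently
    used page among its allowed slots. Returns the total fetch cost."""
    slots = [None] * num_slots          # slots[i] = page currently stored
    last_used = {}                      # page -> time of last request
    cost = 0
    for t, page in enumerate(requests):
        last_used[page] = t
        if page in slots:
            continue                    # hit: no cost
        cost += 1                       # miss: pay to fetch the page
        candidates = allowed[page]      # slots this page may occupy
        free = [i for i in candidates if slots[i] is None]
        if free:
            slots[free[0]] = page
        else:
            # evict the least recently used occupant of an allowed slot
            victim = min(candidates, key=lambda i: last_used[slots[i]])
            slots[victim] = page
    return cost
```

For example, with two slots where pages a and b may only use slot 0 and page c only slot 1, the sequence a, b, a, c incurs four fetches, whereas with unrestricted placement the same sequence needs only three; the slot restrictions are exactly what makes the problem harder than classical caching.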

Abstract

The field of online learning, within the theory of decision making, captures the difficulty faced by a decision maker who must repeatedly make decisions without knowing what the future holds. In each iteration, the decision maker chooses an action from a set of actions and incurs some cost resulting from the chosen action. The costs are not known in advance, and may even be chosen by an adversary aware of the decision-making strategy. The quality of the decision making is usually measured in terms of regret, namely the difference between the total cost incurred and the cost of following the best fixed strategy from a known set of strategies. Nontrivial algorithms are known today that, in most cases, achieve regret that is sublinear in the number of rounds.

While online learning is a powerful and significant field, with deep connections to statistical learning, it also has drawbacks. In particular, it is known that in many cases regret with respect to a fixed strategy is too weak a measure, especially when the environment changes over time and therefore no single strategy is always good. This observation has led to a line of works, e.g., [HW98, HS09, CMEDV10, RST11], that study stronger notions of regret, such as adaptive regret or tracking the best expert. In this context, another drawback of online learning is that it does not model well problems with states, in which the costs depend also on the configuration the decision maker is in and on the history of its decisions. Consider, for example, the problem of assigning jobs to servers online. Clearly, the time required to complete a job depends heavily on the state of the system, such as the load on each server, as determined by the decisions made in previous rounds. Regret is not a good measure for such a problem, since it is defined with respect to a fixed strategy, under the assumption that at every step the cost of each action is fixed and independent of past decisions. Consequently, we may wish to design algorithms for a broader world of problems, in which one competes against changing optimal strategies, including the optimal strategy that sees the future costs unknown to the decision maker, and in which states can be defined. Problems of this type have been studied in depth in the field called competitive analysis (for an extensive survey, see [BEY98]). In this world of problems, achieving sublinear regret in the general case is impossible.
Instead, the main measure used is the competitive ratio, which bounds, over every possible input, the ratio between the total cost of the decision maker and the total cost of the best changing strategy that knows the future. The competitive ratio usually guarantees weaker performance than regret, but it is measured against a stricter notion of optimality. Although the problems studied in the two worlds described above are often similar, there has been little research on general connections between them. The main reason for this, beyond sociological aspects stemming from the research being carried out in two separate communities, lies in crucial differences in the underlying assumptions made during modeling and formulation. For example, in order to model
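The regret measure just described can be made concrete with a small experiment. The sketch below (the function name and step size are illustrative, not taken from the thesis) runs the classical Hedge / multiplicative-weights rule, whose expected cost trails the best fixed action by an amount that does not grow linearly with the number of rounds:

```python
import math

def hedge_regret(costs, eta=0.5):
    """costs[t][i] is the cost of action i in round t, in [0, 1].
    Plays the Hedge distribution each round and returns the pair
    (algorithm's total expected cost, total cost of the best fixed action)."""
    n = len(costs[0])
    weights = [1.0] * n
    alg_cost = 0.0
    for round_costs in costs:
        total = sum(weights)
        probs = [w / total for w in weights]
        # expected cost of sampling an action from the current distribution
        alg_cost += sum(p * c for p, c in zip(probs, round_costs))
        # multiplicative update: actions with high cost lose weight
        weights = [w * math.exp(-eta * c) for w, c in zip(weights, round_costs)]
    best_fixed = min(sum(c[i] for c in costs) for i in range(n))
    return alg_cost, best_fixed
```

On a sequence where one action always costs 0 and the other always costs 1, the best fixed action pays nothing, while Hedge's expected cost stays bounded by a small constant (roughly 1.63 over ten rounds with eta = 0.5) rather than growing with the number of rounds; the difference between the two totals is exactly the regret.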


The research was carried out under the supervision of Prof. Seffi Naor and Dr. Niv Buchbinder, in the Department of Computer Science.

Some of the results in this thesis have been published as articles by the author and research collaborators in conferences and journals during the course of the author's doctoral research period, the most up-to-date versions of which are:

Niv Buchbinder, Shahar Chen, Anupam Gupta, Viswanath Nagarajan, and Joseph Naor. Online packing and covering framework with convex objectives. CoRR, abs/.

Niv Buchbinder, Shahar Chen, and Joseph Naor. Competitive algorithms for restricted caching and matroid caching. In Algorithms - ESA, Annual European Symposium, Wroclaw, Poland, September 8-10, Proceedings.

Niv Buchbinder, Shahar Chen, and Joseph Naor. Competitive analysis via regularization. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014.

Niv Buchbinder, Shahar Chen, Joseph Naor, and Ohad Shamir. Unified algorithms for online learning and competitive analysis. In COLT, The Twenty-fifth Annual Conference on Learning Theory.

Acknowledgements

I would like to thank my advisors, Seffi Naor and Niv Buchbinder, who led and guided me wisely and patiently throughout the years of my work. It has been a fascinating and enjoyable period, and I am grateful for the time and support they devoted to working with me. I learned a great deal from them, and I appreciate the good fortune I had to work with them. I would of course also like to thank my parents, Ofra and Dani, who supported me with love and pride and stood by me at all times. Last but not least, I would like to thank my wife, Inbal, for her support, understanding, and love throughout the years. I am very lucky to have you in my life.

I thank Joan and Irwin Jacobs, the Zeff Fellowship, and the Technion for their generous financial support of my studies.


Online Learning and Competitive Analysis: a Unified Approach

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Shahar Chen

Submitted to the Senate of the Technion, Israel Institute of Technology
Iyar 5775, Haifa, April 2015


Online Learning and Competitive Analysis: a Unified Approach

Shahar Chen


More information

The Trip Scheduling Problem

The Trip Scheduling Problem The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems

More information

Linear Programming. Widget Factory Example. Linear Programming: Standard Form. Widget Factory Example: Continued.

Linear Programming. Widget Factory Example. Linear Programming: Standard Form. Widget Factory Example: Continued. Linear Programming Widget Factory Example Learning Goals. Introduce Linear Programming Problems. Widget Example, Graphical Solution. Basic Theory:, Vertices, Existence of Solutions. Equivalent formulations.

More information

SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH

SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH 31 Kragujevac J. Math. 25 (2003) 31 49. SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH Kinkar Ch. Das Department of Mathematics, Indian Institute of Technology, Kharagpur 721302, W.B.,

More information

A Network Flow Approach in Cloud Computing

A Network Flow Approach in Cloud Computing 1 A Network Flow Approach in Cloud Computing Soheil Feizi, Amy Zhang, Muriel Médard RLE at MIT Abstract In this paper, by using network flow principles, we propose algorithms to address various challenges

More information

ON GALOIS REALIZATIONS OF THE 2-COVERABLE SYMMETRIC AND ALTERNATING GROUPS

ON GALOIS REALIZATIONS OF THE 2-COVERABLE SYMMETRIC AND ALTERNATING GROUPS ON GALOIS REALIZATIONS OF THE 2-COVERABLE SYMMETRIC AND ALTERNATING GROUPS DANIEL RABAYEV AND JACK SONN Abstract. Let f(x) be a monic polynomial in Z[x] with no rational roots but with roots in Q p for

More information

On the representability of the bi-uniform matroid

On the representability of the bi-uniform matroid On the representability of the bi-uniform matroid Simeon Ball, Carles Padró, Zsuzsa Weiner and Chaoping Xing August 3, 2012 Abstract Every bi-uniform matroid is representable over all sufficiently large

More information