# Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.


## Transcription

**Online Passive-Aggressive Algorithms**

Koby Crammer, Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
School of Computer Science & Engineering, The Hebrew University, Jerusalem 91904, Israel

**Abstract**

We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss.

**1 Introduction**

In this paper we describe and analyze several learning tasks through the same algorithmic prism. Specifically, we discuss online classification, online regression, and online uniclass prediction. In all three settings we receive instances in a sequential manner. For concreteness we assume that these instances are vectors in $\mathbb{R}^n$ and denote the instance received on round $t$ by $x_t$. In the classification problem our goal is to find a mapping from the instance space into the set of labels, $\{-1, +1\}$. In the regression problem the mapping is into $\mathbb{R}$. Our goal in the uniclass problem is to find a center-point in $\mathbb{R}^n$ with a small Euclidean distance to all of the instances.

We first describe the classification and regression problems. For classification and regression we restrict ourselves to mappings based on a weight vector $w \in \mathbb{R}^n$, namely the mapping $f : \mathbb{R}^n \to \mathbb{R}$ takes the form $f(x) = w \cdot x$. After receiving $x_t$ we extend a prediction $\hat{y}_t$ using $f$. For regression the prediction is simply $\hat{y}_t = f(x_t)$ while for classification $\hat{y}_t = \mathrm{sign}(f(x_t))$. After extending the prediction $\hat{y}_t$, we receive the true outcome $y_t$. We then suffer an instantaneous loss based on the discrepancy between $y_t$ and $f(x_t)$. The goal of the online learning algorithm is to minimize the cumulative loss.

The losses we discuss in this paper depend on a pre-defined insensitivity parameter $\epsilon$ and are denoted $\ell_\epsilon(w; (x, y))$. For regression the $\epsilon$-insensitive loss is

$$\ell_\epsilon(w; (x, y)) = \begin{cases} 0 & |y - w \cdot x| \le \epsilon \\ |y - w \cdot x| - \epsilon & \text{otherwise,} \end{cases} \tag{1}$$

while for classification the $\epsilon$-insensitive loss is defined to be

$$\ell_\epsilon(w; (x, y)) = \begin{cases} 0 & y (w \cdot x) \ge \epsilon \\ \epsilon - y (w \cdot x) & \text{otherwise.} \end{cases} \tag{2}$$

As in other online algorithms the weight vector $w$ is updated after receiving the feedback $y_t$. Therefore, we denote by $w_t$ the vector used for prediction on round $t$. We leave the details on the form this update takes to later sections.
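As a small illustration of Eqs. (1) and (2), the two losses can be written as plain functions. This is a minimal sketch, not code from the paper: the names `dot`, `regression_loss`, and `classification_loss` are chosen here for illustration, and vectors are assumed to be Python lists of floats.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def regression_loss(w, x, y, eps):
    """Eq. (1): zero inside the eps-tube around y, linear outside it."""
    residual = abs(y - dot(w, x))
    return max(0.0, residual - eps)


def classification_loss(w, x, y, eps):
    """Eq. (2): zero once the margin y * (w . x) reaches eps."""
    margin = y * dot(w, x)
    return max(0.0, eps - margin)
```

With `eps = 1`, `classification_loss` is the hinge loss discussed later in the text; with `eps = 0`, `regression_loss` reduces to the absolute residual.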

[...] view $\delta(w; z_t)$ as a convex function of $w$. Let $[a]_+$ be the function that equals $a$ whenever $a > 0$ and otherwise equals zero. Using the discrepancies defined above, the three different losses given in Eqs. (1)-(3) can all be written as $\ell_\epsilon(w; z) = [\delta(w; z) - \epsilon]_+$, where for classification we set $\epsilon \leftarrow -\epsilon$ since the discrepancy is defined as the negative of the margin. While this construction might seem a bit odd for classification, it is very useful in unifying the three problems. To conclude, the loss in all three problems can be derived by applying the same hinge loss to different (problem dependent) discrepancies.

**3 An Additive Algorithm for the Realizable Case**

Equipped with the simple unified notion of loss we describe in this section a single online algorithm that is applicable to all three problems. The algorithm and the analysis we present in this section assume that there exist a weight vector $w^\star$ and an insensitivity parameter $\epsilon^\star$ for which the data is perfectly realizable. Namely, we assume that $\ell_{\epsilon^\star}(w^\star; z_t) = 0$ for all $t$, which implies that

$$y_t (w^\star \cdot x_t) \ge \epsilon^\star \;\text{(Class.)} \qquad |y_t - w^\star \cdot x_t| \le \epsilon^\star \;\text{(Reg.)} \qquad \|x_t - w^\star\| \le \epsilon^\star \;\text{(Unic.)}. \tag{4}$$

A modification of the algorithm for the unrealizable case is given in Sec. 5. The general method we use for deriving our on-line update rule is to define the new weight vector $w_{t+1}$ as the solution to the following projection problem

$$w_{t+1} = \operatorname*{argmin}_{w} \tfrac{1}{2} \|w - w_t\|^2 \quad \text{s.t.} \quad \ell_\epsilon(w; z_t) = 0, \tag{5}$$

namely, $w_{t+1}$ is set to be the projection of $w_t$ onto the set of all weight vectors that attain a loss of zero. We denote this set by $C$. For the case of classification, $C$ is a half space, $C = \{w : y_t (w \cdot x_t) \ge \epsilon\}$. For regression $C$ is an $\epsilon$-hyper-slab, $C = \{w : |w \cdot x_t - y_t| \le \epsilon\}$, and for uniclass it is a ball of radius $\epsilon$ centered at $x_t$, $C = \{w : \|w - x_t\| \le \epsilon\}$. In Fig. 2 we illustrate the projection for the three cases. This optimization problem attempts to keep $w_{t+1}$ as close to $w_t$ as possible, while forcing $w_{t+1}$ to achieve a zero loss on the most recent example.

The resulting algorithm is passive whenever the loss is zero, that is, $w_{t+1} = w_t$ whenever $\ell_\epsilon(w_t; z_t) = 0$. In contrast, on rounds for which $\ell_\epsilon(w_t; z_t) > 0$ we aggressively force $w_{t+1}$ to satisfy the constraint $\ell_\epsilon(w_{t+1}; z_t) = 0$. Therefore we name the algorithm passive-aggressive or PA for short. In the following we show that for the three problems described above the solution to the optimization problem in Eq. (5) yields the following update rule,

$$w_{t+1} = w_t + \tau_t v_t, \tag{6}$$

where $v_t$ is minus the gradient of the discrepancy and $\tau_t = \ell_\epsilon(w_t; z_t) / \|v_t\|^2$. (Note that although the discrepancy might not be differentiable everywhere, its gradient exists whenever the loss is greater than zero.)

Figure 1: The additive PA algorithm.

- Parameter: insensitivity $\epsilon$
- Initialize: set $w_1 = 0$ (regression and classification); $w_1 = x_0$ (uniclass)
- For $t = 1, 2, \ldots$
  - Get a new instance: $z_t \in \mathbb{R}^n$
  - Suffer loss: $\ell_\epsilon(w_t; z_t)$
  - If $\ell_\epsilon(w_t; z_t) > 0$:
    1. Set $v_t$ (see Table 1)
    2. Set $\tau_t = \ell_\epsilon(w_t; z_t) / \|v_t\|^2$
    3. Update: $w_{t+1} = w_t + \tau_t v_t$

To see that the update from Eq. (6) is the solution to the problem defined by Eq. (5), first note that the equality constraint $\ell_\epsilon(w; z_t) = 0$ is equivalent to the inequality constraint $\delta(w; z_t) \le \epsilon$. The Lagrangian of the optimization problem is

$$\mathcal{L}(w, \tau) = \tfrac{1}{2} \|w - w_t\|^2 + \tau \left( \delta(w; z_t) - \epsilon \right), \tag{7}$$

where $\tau \ge 0$ is a Lagrange multiplier. To find a saddle point of $\mathcal{L}$ we first differentiate $\mathcal{L}$ with respect to $w$ and use the fact that $v_t$ is minus the gradient of the discrepancy to get

$$\nabla_w \mathcal{L} = w - w_t + \tau \nabla_w \delta = 0 \quad \Longrightarrow \quad w = w_t + \tau v_t.$$

To find the value of $\tau$ we use the KKT conditions. Hence, whenever $\tau$ is positive (as in the case of non-zero loss), the inequality constraint, $\delta(w; z_t) \le \epsilon$, becomes an equality. Simple algebraic manipulations yield that the value of $\tau$ for which $\delta(w; z_t) = \epsilon$ for all three problems is equal to $\tau_t = \ell_\epsilon(w_t; z_t) / \|v_t\|^2$. A summary of the discrepancy functions and their respective updates is given in Table 1. The pseudo-code of the additive algorithm for all three settings is given in Fig. 1.

Figure 2: An illustration of the update: $w_{t+1}$ is found by projecting the current vector $w_t$ onto the set of vectors attaining a zero loss on $z_t$. This set is a stripe in the case of regression, a half-space for classification, and a ball for uniclass.

We now discuss the initialization of $w_1$. For classification and regression a reasonable choice for $w_1$ is the zero vector. However, in the case of uniclass initializing $w_1$ to be the zero vector might incur large losses if, for instance, all the instances are located far away from the origin. A more sensible choice for uniclass is to initialize $w_1$ to be one of the examples. For simplicity of the description we assume that we are provided with an example $x_0$ prior to the run of the algorithm and initialize $w_1 = x_0$. To conclude this section we note that for all three cases the weight vector $w_t$ is a linear combination of the instances. This representation enables us to employ kernels [15].

**4 Analysis**

The following theorem provides a unified loss bound for all three settings. After proving the theorem we discuss a few of its implications.

**Theorem 1** Let $z_1, z_2, \ldots, z_t, \ldots$ be a sequence of examples for one of the problems described in Table 1. Assume that there exist $w^\star$ and $\epsilon^\star$ such that $\ell_{\epsilon^\star}(w^\star; z_t) = 0$ for all $t$. Then if the additive PA algorithm is run with $\epsilon \ge \epsilon^\star$, the following bound holds for any $T \ge 1$

$$\sum_{t=1}^{T} \left( \ell_\epsilon(w_t; z_t) \right)^2 + 2 (\epsilon - \epsilon^\star) \sum_{t=1}^{T} \ell_\epsilon(w_t; z_t) \le B \, \|w^\star - w_1\|^2, \tag{8}$$

where for classification and regression $B$ is a bound on the squared norm of the instances ($\forall t : B \ge \|x_t\|_2^2$) and $B = 1$ for uniclass.
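As a concrete companion to Fig. 1, one round of the additive update can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the directions $v_t$ are derived here from the discrepancies implied by the sets $C$ in the text (negative margin, absolute residual, Euclidean distance) rather than copied from Table 1, the function names are hypothetical, and instances are assumed to be non-zero lists of floats.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def discrepancy_and_direction(task, w, x, y=None):
    """Return (delta, v): the discrepancy at w and minus its gradient."""
    if task == "classification":   # delta = -y (w . x), so v = y x
        return -y * dot(w, x), [y * xi for xi in x]
    if task == "regression":       # delta = |w . x - y|, so v = sign(y - w . x) x
        r = y - dot(w, x)
        s = 1.0 if r >= 0 else -1.0
        return abs(r), [s * xi for xi in x]
    if task == "uniclass":         # delta = ||w - x||, so v = (x - w) / ||x - w||
        diff = [xi - wi for xi, wi in zip(x, w)]
        norm = math.sqrt(dot(diff, diff))
        v = [d / norm for d in diff] if norm > 0 else [0.0] * len(w)
        return norm, v
    raise ValueError(f"unknown task: {task}")


def pa_step(task, w, x, y, eps):
    """One round of the additive PA update; returns (w_next, loss)."""
    delta, v = discrepancy_and_direction(task, w, x, y)
    # Classification enters the unified hinge [delta - eps]_+ with eps negated.
    threshold = -eps if task == "classification" else eps
    loss = max(0.0, delta - threshold)
    if loss == 0.0:
        return list(w), 0.0                      # passive round
    tau = loss / dot(v, v)                       # assumes v is non-zero
    return [wi + tau * vi for wi, vi in zip(w, v)], loss
```

After an aggressive round the new vector sits on the boundary of $C$, so calling `pa_step` again on the same example returns a loss of zero up to floating-point error.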

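**Proof:** Define $\Delta_t = \|w_t - w^\star\|^2 - \|w_{t+1} - w^\star\|^2$. We prove the theorem by bounding $\sum_{t=1}^{T} \Delta_t$ from above and below. First note that $\sum_{t=1}^{T} \Delta_t$ is a telescopic sum and therefore

$$\sum_{t=1}^{T} \Delta_t = \|w_1 - w^\star\|^2 - \|w_{T+1} - w^\star\|^2 \le \|w_1 - w^\star\|^2. \tag{9}$$

This provides an upper bound on $\sum_t \Delta_t$. In the following we prove the lower bound

$$\Delta_t \ge \frac{\ell_\epsilon(w_t; z_t)}{B} \left( \ell_\epsilon(w_t; z_t) + 2 (\epsilon - \epsilon^\star) \right). \tag{10}$$

First note that we do not modify $w_t$ if $\ell_\epsilon(w_t; z_t) = 0$. Therefore, this inequality trivially holds when $\ell_\epsilon(w_t; z_t) = 0$ and thus we can restrict ourselves to rounds on which the discrepancy is larger than $\epsilon$, which implies that $\ell_\epsilon(w_t; z_t) = \delta(w_t; z_t) - \epsilon$. Let $t$ be such a round. Then by rewriting $w_{t+1}$ as $w_t + \tau_t v_t$ we get

$$\begin{aligned} \Delta_t &= \|w_t - w^\star\|^2 - \|w_{t+1} - w^\star\|^2 \\ &= \|w_t - w^\star\|^2 - \|w_t + \tau_t v_t - w^\star\|^2 \\ &= \|w_t - w^\star\|^2 - \left( \tau_t^2 \|v_t\|^2 + 2 \tau_t \, v_t \cdot (w_t - w^\star) + \|w_t - w^\star\|^2 \right) \\ &= -\tau_t^2 \|v_t\|^2 + 2 \tau_t \, v_t \cdot (w^\star - w_t). \end{aligned} \tag{11}$$

Using the fact that $-v_t$ is the gradient of the convex function $\delta(w; z_t)$ at $w_t$ we have

$$\delta(w^\star; z_t) - \delta(w_t; z_t) \ge (-v_t) \cdot (w^\star - w_t). \tag{12}$$

Adding and subtracting $\epsilon$ from the left-hand side of Eq. (12) and rearranging we get

$$v_t \cdot (w^\star - w_t) \ge \delta(w_t; z_t) - \epsilon + \epsilon - \delta(w^\star; z_t). \tag{13}$$

Recall that $\delta(w_t; z_t) - \epsilon = \ell_\epsilon(w_t; z_t)$ and that $\epsilon^\star \ge \delta(w^\star; z_t)$. Therefore,

$$(\delta(w_t; z_t) - \epsilon) + (\epsilon - \delta(w^\star; z_t)) \ge \ell_\epsilon(w_t; z_t) + (\epsilon - \epsilon^\star). \tag{14}$$

Combining Eq. (11) with Eqs. (13)-(14) we get

$$\Delta_t \ge -\tau_t^2 \|v_t\|^2 + 2 \tau_t \left( \ell_\epsilon(w_t; z_t) + (\epsilon - \epsilon^\star) \right) = \tau_t \left( -\tau_t \|v_t\|^2 + 2 \ell_\epsilon(w_t; z_t) + 2 (\epsilon - \epsilon^\star) \right). \tag{15}$$

Plugging $\tau_t = \ell_\epsilon(w_t; z_t) / \|v_t\|^2$ into Eq. (15) we get

$$\Delta_t \ge \frac{\ell_\epsilon(w_t; z_t)}{\|v_t\|^2} \left( \ell_\epsilon(w_t; z_t) + 2 (\epsilon - \epsilon^\star) \right).$$

For uniclass $\|v_t\|^2$ is always equal to 1 by construction and for classification and regression we have $\|v_t\|^2 = \|x_t\|^2 \le B$, which gives

$$\Delta_t \ge \frac{\ell_\epsilon(w_t; z_t)}{B} \left( \ell_\epsilon(w_t; z_t) + 2 (\epsilon - \epsilon^\star) \right).$$

Comparing the above lower bound with the upper bound in Eq. (9) we get

$$\sum_{t=1}^{T} \left( \ell_\epsilon(w_t; z_t) \right)^2 + 2 (\epsilon - \epsilon^\star) \sum_{t=1}^{T} \ell_\epsilon(w_t; z_t) \le B \, \|w^\star - w_1\|^2.$$

This concludes the proof.

Let us now discuss the implications of Thm. 1. We first focus on the classification case.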
Due to the realizability assumption, there exist $w^\star$ and $\epsilon^\star$ such that for all $t$, $\ell_{\epsilon^\star}(w^\star; z_t) = 0$, which implies that $y_t (w^\star \cdot x_t) \ge \epsilon^\star$. Dividing $w^\star$ by its norm we can rewrite the latter as $y_t (\hat{w}^\star \cdot x_t) \ge \hat{\epsilon}^\star$ where $\hat{w}^\star = w^\star / \|w^\star\|$ and $\hat{\epsilon}^\star = \epsilon^\star / \|w^\star\|$. The parameter $\hat{\epsilon}^\star$ is often referred to as the margin of a unit-norm separating hyperplane. Now, setting $\epsilon = 1$ we get that $\ell_\epsilon(w; z) = [1 - y (w \cdot x)]_+$, the hinge loss for classification.

We now use Thm. 1 to obtain two loss bounds for the hinge loss in a classification setting. First, note that by also setting $w^\star = \hat{w}^\star / \hat{\epsilon}^\star$ and thus $\epsilon^\star = 1$ we get that the second term on the left-hand side of Eq. (8) vanishes as $\epsilon = \epsilon^\star = 1$ and thus, with $w_1 = 0$,

$$\sum_{t=1}^{T} \left( [1 - y_t (w_t \cdot x_t)]_+ \right)^2 \le B \, \|w^\star\|^2 = \frac{B}{(\hat{\epsilon}^\star)^2}. \tag{17}$$

We thus have obtained a bound on the squared hinge loss. The same bound was also derived by Herbster [8]. We can immediately use this bound to derive a mistake bound for the PA algorithm. Note that the algorithm makes a prediction mistake iff $y_t (w_t \cdot x_t) \le 0$. In this case, $[1 - y_t (w_t \cdot x_t)]_+ \ge 1$ and therefore the number of prediction mistakes is bounded by $B / (\hat{\epsilon}^\star)^2$. This bound is common to online algorithms for classification such as ROMMA [14].

We can also manipulate the result of Thm. 1 to obtain a direct bound on the hinge loss. Using again $\epsilon = 1$ and omitting the first term on the left-hand side of Eq. (8) we get (recalling that for classification $\epsilon$ and $\epsilon^\star$ enter the unified loss with their signs flipped)

$$2 (\epsilon^\star - 1) \sum_{t=1}^{T} [1 - y_t (w_t \cdot x_t)]_+ \le B \, \|w^\star\|^2.$$

By setting $w^\star = 2 \hat{w}^\star / \hat{\epsilon}^\star$, which implies that $\epsilon^\star = 2$, we can further simplify the above to get a bound on the cumulative hinge loss,

$$\sum_{t=1}^{T} [1 - y_t (w_t \cdot x_t)]_+ \le \frac{2 B}{(\hat{\epsilon}^\star)^2}.$$

To conclude this section, we would like to point out that the PA online algorithm can also be used as a building block for a batch algorithm. Concretely, let $S = \{z_1, \ldots, z_m\}$ be a fixed training set and let $\beta \in \mathbb{R}$ be a small positive number. We start with an initial weight vector $w_1$ and then invoke the PA algorithm as follows. We choose an example $z \in S$ such that $\ell_\epsilon(w_1; z)^2 > \beta$ and present $z$ to the PA algorithm. We repeat this process and obtain $w_2, w_3, \ldots$ until the $T$-th iteration on which for all $z \in S$, $\ell_\epsilon(w_T; z)^2 \le \beta$. The output of the batch algorithm is $w_T$. Due to the bound of Thm. 1, $T$ is at most $B \|w^\star - w_1\|^2 / \beta$ and by construction the loss of $w_T$ on any $z \in S$ is at most $\sqrt{\beta}$. Moreover, in the following lemma we show that the norm of $w_T$ cannot be too large. Since $w_T$ achieves a small empirical loss and its norm is small, it can be shown using classical techniques (cf. [15]) that the loss of $w_T$ on unseen data is small as well.

**Lemma 2** Under the same conditions of Thm. 1, the following bound holds for any $T \ge 1$

$$\|w_T - w_1\| \le 2 \, \|w^\star - w_1\|.$$

**Proof:** First note that the inequality trivially holds for $T = 1$ and thus we focus on the case $T > 1$. We use the definition of $\Delta_t$ from the proof of Thm. 1. Eq. (10) implies that $\Delta_t$ is non-negative for all $t$. Therefore, we get from Eq. (9) that

$$0 \le \sum_{t=1}^{T-1} \Delta_t = \|w_1 - w^\star\|^2 - \|w_T - w^\star\|^2. \tag{18}$$

Rearranging the terms in Eq. (18) we get that $\|w_T - w^\star\| \le \|w^\star - w_1\|$. Finally, we use the triangle inequality to get the bound

$$\|w_T - w_1\| = \|(w_T - w^\star) + (w^\star - w_1)\| \le \|w_T - w^\star\| + \|w^\star - w_1\| \le 2 \, \|w^\star - w_1\|.$$

This concludes the proof.
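The batch conversion described in this section can be sketched for the classification case as follows. This is an illustrative sketch rather than the paper's procedure in full: the text does not say which violating example to pick, so taking the first one is an assumption, as are the names `hinge` and `batch_pa`, the `max_rounds` safety cap, and the choice $w_1 = 0$.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def hinge(w, x, y, eps=1.0):
    """Classification loss of Eq. (2); eps = 1 gives the hinge loss."""
    return max(0.0, eps - y * dot(w, x))


def batch_pa(examples, beta, eps=1.0, max_rounds=10_000):
    """Feed PA one violating example at a time until none remains.

    examples: list of (x, y) pairs with y in {-1, +1} and x non-zero.
    Returns (w, rounds), where rounds counts the PA updates performed.
    """
    w = [0.0] * len(examples[0][0])              # w_1 = 0 (assumed start)
    rounds = 0
    while rounds < max_rounds:
        violators = [(x, y) for x, y in examples
                     if hinge(w, x, y, eps) ** 2 > beta]
        if not violators:
            break                                # every squared loss is <= beta
        x, y = violators[0]                      # selection rule is an assumption
        tau = hinge(w, x, y, eps) / dot(x, x)
        w = [wi + tau * y * xi for wi, xi in zip(w, x)]
        rounds += 1
    return w, rounds
```

On realizable data the loop stops on its own within the number of rounds allowed by Thm. 1, so `max_rounds` only guards against data that is not separable with the requested margin.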

**5 A Modification for the Unrealizable Case**

We now briefly describe an algorithm for the unrealizable case. This algorithm applies only to regression and classification problems. The case of uniclass is more involved and will be discussed in detail elsewhere. The algorithm employs two parameters. The first is the insensitivity parameter $\epsilon$ which defines the loss function as in the realizable case. However, in this case we do not assume that there exists $w^\star$ that achieves zero loss over the sequence. We instead measure the loss of the online algorithm relative to the loss of any vector $w^\star$. The second parameter, $\gamma > 0$, is a relaxation parameter. Before describing the effect of this parameter we define the update step for the unrealizable case. As in the realizable case, the algorithm is conservative. That is, if the loss on example $z_t$ is zero then $w_{t+1} = w_t$. In case the loss is positive the update rule is $w_{t+1} = w_t + \tau_t v_t$ where $v_t$ is the same as in the realizable case. However, the scaling factor $\tau_t$ is modified and is set to

$$\tau_t = \frac{\ell_\epsilon(w_t; z_t)}{\|v_t\|^2 + \gamma}.$$

The following theorem provides a loss bound for the online algorithm relative to the loss of any fixed weight vector $w^\star$.

**Theorem 3** Let $z_1 = (x_1, y_1), z_2 = (x_2, y_2), \ldots, z_t = (x_t, y_t), \ldots$ be a sequence of classification or regression examples. Let $w^\star$ be any vector in $\mathbb{R}^n$. Then if the PA algorithm for the unrealizable case is run with $\epsilon$, and with $\gamma > 0$, the following bound holds for any $T \ge 1$ and a constant $B$ satisfying $B \ge \|x_t\|^2$,

$$\sum_{t=1}^{T} \left( \ell_\epsilon(w_t; z_t) \right)^2 \le (\gamma + B) \, \|w^\star - w_1\|^2 + \left( 1 + \frac{B}{\gamma} \right) \sum_{t=1}^{T} \left( \ell_\epsilon(w^\star; z_t) \right)^2. \tag{19}$$

The proof of the theorem is based on a reduction to the realizable case (cf. [4, 13, 14]) and is omitted due to the lack of space.

**6 Extensions**

There are numerous potential extensions to our approach. For instance, if all the components of the instances are non-negative we can derive a multiplicative version of the PA algorithm.
The multiplicative PA algorithm maintains a weight vector $w_t \in \mathbb{P}^n$ where $\mathbb{P}^n = \{x : x \in \mathbb{R}^n_+, \sum_{j=1}^{n} x_j = 1\}$. The multiplicative update of $w_t$ is

$$w_{t+1,j} = \frac{1}{Z_t} \, w_{t,j} \, e^{\tau_t v_{t,j}},$$

where $v_t$ is the same as the one used in the additive algorithm (Table 1), $\tau_t$ now becomes $4 \ell_\epsilon(w_t; z_t) / \|v_t\|^2$ for regression and classification and $\ell_\epsilon(w_t; z_t) / (8 \|v_t\|^2)$ for uniclass, and $Z_t = \sum_{j=1}^{n} w_{t,j} \, e^{\tau_t v_{t,j}}$ is a normalization factor. For the multiplicative PA we can prove the following loss bound.

**Theorem 4** Let $z_1, z_2, \ldots, z_t = (x_t, y_t), \ldots$ be a sequence of examples such that $x_{t,j} \ge 0$ for all $t$. Let $D_{RE}(w \| w') = \sum_j w_j \log(w_j / w'_j)$ denote the relative entropy between $w$ and $w'$. Assume that there exist $w^\star$ and $\epsilon^\star$ such that $\ell_{\epsilon^\star}(w^\star; z_t) = 0$ for all $t$. Then when the multiplicative version of the PA algorithm is run with $\epsilon > \epsilon^\star$, the following bound holds for any $T \ge 1$,

$$\sum_{t=1}^{T} \left( \ell_\epsilon(w_t; z_t) \right)^2 + 2 (\epsilon - \epsilon^\star) \sum_{t=1}^{T} \ell_\epsilon(w_t; z_t) \le \tfrac{1}{2} B \, D_{RE}(w^\star \| w_1),$$

where for classification and regression $B$ is a bound on the square of the infinity norm of the instances ($\forall t : B \ge \|x_t\|_\infty^2$) and $B = 16$ for uniclass.
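The two modified updates of Secs. 5 and 6 differ from the additive rule only in how the step is scaled and applied, which the following minimal sketch tries to make concrete. The function names are hypothetical, and `multiplicative_update` takes $\tau_t$ as an argument instead of computing it, since the text gives problem-specific values for it.

```python
import math


def relaxed_tau(loss, v, gamma):
    """Sec. 5 step size: loss / (||v||^2 + gamma), with gamma > 0."""
    return loss / (sum(vi * vi for vi in v) + gamma)


def multiplicative_update(w, v, tau):
    """Sec. 6 update: w_j * exp(tau * v_j), renormalised by Z_t.

    w is assumed to lie on the probability simplex (non-negative, sums to 1).
    """
    unnormalised = [wj * math.exp(tau * vj) for wj, vj in zip(w, v)]
    z = sum(unnormalised)                        # the normalisation factor Z_t
    return [u / z for u in unnormalised]
```

Because of the division by $Z_t$, the result of `multiplicative_update` stays on the simplex, and a larger `gamma` in `relaxed_tau` shrinks the step, which makes the additive update less aggressive on noisy examples.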

The proof of the theorem is rather technical and uses the proof technique of Thm. 1 in conjunction with inequalities on the logarithm of $Z_t$ (see for instance [7, 11, 9]).

An interesting question is whether the unified view of classification, regression, and uniclass can be exported and used with other algorithms for classification such as ROMMA [14] and ALMA [5]. Another, rather general direction for possible extension surfaces when replacing the Euclidean distance between $w_{t+1}$ and $w_t$ with other distances and divergences such as the Bregman divergence. The resulting optimization problem may be solved via Bregman projections. In this case it might be possible to derive general loss bounds, see for example [12]. We are currently exploring generalizations of our framework to other decision tasks such as distance-learning [16] and online convex programming [17].

**References**

[1] H. H. Bauschke and J. M. Borwein. On projection algorithms for solving convex feasibility problems. SIAM Review.

[2] Y. Censor and S. A. Zenios. Parallel Optimization. Oxford University Press.

[3] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3.

[4] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3).

[5] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2.

[6] C. Gentile and M. Warmuth. Linear hinge loss and average margin. In NIPS 98.

[7] D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth. A comparison of new and old algorithms for a mixture estimation problem. In COLT 95.

[8] M. Herbster. Learning additive models online with fast evaluating kernels. In COLT 01.

[9] J. Kivinen, D. P. Helmbold, and M. Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10(6).

[10] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. In NIPS 02.

[11] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-64, January.

[12] J. Kivinen and M. K. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3), July.

[13] N. Klasner and H. U. Simon. From noise-free to noise-tolerant and from on-line to batch learning. In COLT 95.

[14] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3).

[15] V. N. Vapnik. Statistical Learning Theory. Wiley.

[16] E. Xing, A. Y. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS 03.

[17] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML 03.

### Online Passive-Aggressive Algorithms

Journal of Machine Learning Research 7 (2006) 551-585 Submitted 5/05; Published 3/06 Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Joseph Keshet Shai Shalev-Shwartz Yoram Singer School of

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@cs.berkeley.edu Elad Hazan IBM Almaden Research Center 650

### Online Convex Optimization

E0 370 Statistical Learning Theory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting

### Online Classification on a Budget

Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk

### DUOL: A Double Updating Approach for Online Learning

DUOL: A Double Updating Approach for Online Learning Peilin Zhao School of Comp. Eng. Nanyang Tech. University Singapore 69798 zhao6@ntu.edu.sg Steven C.H. Hoi School of Comp. Eng. Nanyang Tech. University

### 1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09

1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch

### Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

### The p-norm generalization of the LMS algorithm for adaptive filtering

The p-norm generalization of the LMS algorithm for adaptive filtering Jyrki Kivinen University of Helsinki Manfred Warmuth University of California, Santa Cruz Babak Hassibi California Institute of Technology

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kotłowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### Online Learning meets Optimization in the Dual

Online Learning meets Optimization in the Dual Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., The Hebrew University, Jerusalem 91904, Israel 2 Google Inc., 1600 Amphitheater

### Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

### 10. Proximal point method

L. Vandenberghe EE236C (Spring 2013-14) 10. Proximal point method proximal point method augmented Lagrangian method Moreau-Yosida smoothing 10-1 Proximal point method a conceptual algorithm for minimizing

### Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

### Mind the Duality Gap: Logarithmic regret algorithms for online optimization

Mind the Duality Gap: Logarithmic regret algorithms for online optimization Sham M. Kakade Toyota Technological Institute at Chicago sham@tti-c.org Shai Shalev-Shwartz Toyota Technological Institute at

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Online Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes

Advanced Course in Machine Learning Spring 2011 Online Learning Lecturer: Shai Shalev-Shwartz Scribe: Shai Shalev-Shwartz In this lecture we describe a different model of learning which is called online

### Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

Amit Daniely Alon Gonen Shai Shalev-Shwartz The Hebrew University AMIT.DANIELY@MAIL.HUJI.AC.IL ALONGNN@CS.HUJI.AC.IL SHAIS@CS.HUJI.AC.IL Abstract Strongly adaptive algorithms are algorithms whose performance

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Efficient Projections onto the l 1 -Ball for Learning in High Dimensions

Efficient Projections onto the l 1 -Ball for Learning in High Dimensions John Duchi Google, Mountain View, CA 94043 Shai Shalev-Shwartz Toyota Technological Institute, Chicago, IL, 60637 Yoram Singer Tushar

### A Potential-based Framework for Online Multi-class Learning with Partial Feedback

A Potential-based Framework for Online Multi-class Learning with Partial Feedback Shijun Wang Rong Jin Hamed Valizadegan Radiology and Imaging Sciences Computer Science and Engineering Computer Science

CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.

### max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### 2.3 Scheduling jobs on identical parallel machines

2.3 Scheduling jobs on identical parallel machines There are jobs to be processed, and there are identical machines (running in parallel) to which each job may be assigned Each job = 1,,, must be processed

### Online Learning: Theory, Algorithms, and Applications. Thesis submitted for the degree of Doctor of Philosophy by Shai Shalev-Shwartz

Online Learning: Theory, Algorithms, and Applications Thesis submitted for the degree of Doctor of Philosophy by Shai Shalev-Shwartz Submitted to the Senate of the Hebrew University July 2007 This work

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Date: April 12, 2001. Contents

2 Lagrange Multipliers Date: April 12, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 12 2.3. Informative Lagrange Multipliers...........

### 4.1 Learning algorithms for neural networks

4 Perceptron Learning 4.1 Learning algorithms for neural networks In the two preceding chapters we discussed two closely related models, McCulloch Pitts units and perceptrons, but the question of how to

### Minimize subject to. x S R

Chapter 12 Lagrangian Relaxation This chapter is mostly inspired by Chapter 16 of [1]. In the previous chapters, we have succeeded to find efficient algorithms to solve several important problems such

### 2.3 Convex Constrained Optimization Problems

42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

### Notes from Week 1: Algorithms for sequential prediction

CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

### Some Notes on Taylor Polynomials and Taylor Series

Some Notes on Taylor Polynomials and Taylor Series Mark MacLean October 3, 27 UBC s courses MATH /8 and MATH introduce students to the ideas of Taylor polynomials and Taylor series in a fairly limited

### The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

### Online Learning of Optimal Strategies in Unknown Environments

1 Online Learning of Optimal Strategies in Unknown Environments Santiago Paternain and Alejandro Ribeiro Abstract Define an environment as a set of convex constraint functions that vary arbitrarily over

### On Adaboost and Optimal Betting Strategies

On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: pm@dcs.qmul.ac.uk Fabrizio Smeraldi School of

Online Learning with Switching Costs and Other Adaptive Adversaries Nicolò Cesa-Bianchi Università degli Studi di Milano Italy Ofer Dekel Microsoft Research USA Ohad Shamir Microsoft Research and the Weizmann

### Metric Spaces. Chapter 7. 7.1. Metrics

Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some

### Week 1: Introduction to Online Learning

Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / -21-8418-9 Cesa-Bianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider

### Convex Optimization SVM s and Kernel Machines

Convex Optimization SVM s and Kernel Machines S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola and Stéphane Canu S.V.N.

### Adapting Codes and Embeddings for Polychotomies

Adapting Codes and Embeddings for Polychotomies Gunnar Rätsch, Alexander J. Smola RSISE, CSL, Machine Learning Group The Australian National University Canberra, 2 ACT, Australia Gunnar.Raetsch, Alex.Smola

### Karthik Sridharan. 424 Gates Hall Ithaca, E-mail: sridharan@cs.cornell.edu http://www.cs.cornell.edu/ sridharan/ Contact Information

Karthik Sridharan Contact Information 424 Gates Hall Ithaca, NY 14853-7501 USA E-mail: sridharan@cs.cornell.edu http://www.cs.cornell.edu/ sridharan/ Research Interests Machine Learning, Statistical Learning

### Section 1.1. Introduction to R n

The Calculus of Functions of Several Variables Section 1.1. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

### Tangent and normal lines to conics

4.B. Tangent and normal lines to conics Apollonius work on conics includes a study of tangent and normal lines to these curves. The purpose of this document is to relate his approaches to the modern viewpoints

### Optimal Strategies and Minimax Lower Bounds for Online Convex Games

Optimal Strategies and Minimax Lower Bounds for Online Convex Games Jacob Abernethy UC Berkeley jake@cs.berkeley.edu Alexander Rakhlin UC Berkeley rakhlin@cs.berkeley.edu Abstract A number of learning problems

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Trading regret rate for computational efficiency in online learning with limited feedback

Trading regret rate for computational efficiency in online learning with limited feedback Shai Shalev-Shwartz TTI-C Hebrew University On-line Learning with Limited Feedback Workshop, 2009 June 2009 Shai

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### Sparse Online Learning via Truncated Gradient

Sparse Online Learning via Truncated Gradient John Langford Yahoo! Research jl@yahoo-inc.com Lihong Li Department of Computer Science Rutgers University lihong@cs.rutgers.edu Tong Zhang Department of Statistics

### Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 1997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

### 4.1 Introduction - Online Learning Model

Computational Learning Foundations Fall semester, 2010 Lecture 4: November 7, 2010 Lecturer: Yishay Mansour Scribes: Elad Liebman, Yuval Rochman & Allon Wagner 1 4.1 Introduction - Online Learning Model

### KERNEL methods have proven to be successful in many

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 52, NO. 8, AUGUST 2004 2165 Online Learning with Kernels Jyrki Kivinen, Alexander J. Smola, Robert C. Williamson, Member, IEEE Abstract Kernel-based algorithms

### BDUOL: Double Updating Online Learning on a Fixed Budget

BDUOL: Double Updating Online Learning on a Fixed Budget Peilin Zhao and Steven C.H. Hoi School of Computer Engineering, Nanyang Technological University, Singapore E-mail: {zhao0106,chhoi}@ntu.edu.sg

### MACHINE LEARNING. Introduction. Alessandro Moschitti

MACHINE LEARNING Introduction Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it Course Schedule Lectures Tuesday, 14:00-16:00

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Large Scale Learning to Rank

Large Scale Learning to Rank D. Sculley Google, Inc. dsculley@google.com Abstract Pairwise learning to rank methods such as RankSVM give good performance, but suffer from the computational burden of optimizing

### L1 vs. L2 Regularization and feature selection.

L1 vs. L2 Regularization and feature selection. Paper by Andrew Ng (2004) Presentation by Afshin Rostami Main Topics Covering Numbers Definition Convergence Bounds L1 regularized logistic regression L1

### Sequences and Convergence in Metric Spaces

Sequences and Convergence in Metric Spaces Definition: A sequence in a set X (a sequence of elements of X) is a function s : N → X. We usually denote s(n) by s_n, called the n-th term of s, and write {s

### The Alternating Series Test

The Alternating Series Test So far we have considered mostly series all of whose terms are positive. If the signs of the terms alternate, then testing convergence is a much simpler matter. On the other

### Online Algorithms: Learning & Optimization with No Regret.

Online Algorithms: Learning & Optimization with No Regret. Daniel Golovin 1 The Setup Optimization: Model the problem (objective, constraints) Pick best decision from a feasible set. Learning: Model the

### List of Publications by Claudio Gentile

List of Publications by Claudio Gentile Claudio Gentile DiSTA, University of Insubria, Italy claudio.gentile@uninsubria.it November 6, 2013 Abstract Contains the list of publications by Claudio Gentile,

### Analysis of Representations for Domain Adaptation

Analysis of Representations for Domain Adaptation Shai Ben-David School of Computer Science University of Waterloo shai@cs.uwaterloo.ca John Blitzer, Koby Crammer, and Fernando Pereira Department of Computer

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### GI01/M055 Supervised Learning Proximal Methods

GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators

### OPRE 6201 : 2. Simplex Method

OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2

### 1. Periodic Fourier series. The Fourier expansion of a 2π-periodic function f is:

CONVERGENCE OF FOURIER SERIES 1. Periodic Fourier series. The Fourier expansion of a 2π-periodic function f is: $f(x) \sim \frac{a_0}{2} + \sum_{n \ge 1} \big( a_n \cos(nx) + b_n \sin(nx) \big)$, with coefficients given by: $a_n = \frac{1}{\pi} \int f(x) \cos(nx)\,dx$

### Support Vector Machines

Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

### princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora Scribe: One of the running themes in this course is the notion of

### Load balancing of temporary tasks in the l p norm

Load balancing of temporary tasks in the l p norm Yossi Azar a,1, Amir Epstein a,2, Leah Epstein b,3 a School of Computer Science, Tel Aviv University, Tel Aviv, Israel. b School of Computer Science, The

### Distributed Machine Learning and Big Data

Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

### Math 317 HW #5 Solutions

Math 317 HW #5 Solutions 1. Exercise 2.4.2. (a) Prove that the sequence defined by x_1 = 3 and x_{n+1} = 1/(4 − x_n) converges. Proof. I intend to use the Monotone Convergence Theorem, so my goal is to show

### The Multiplicative Weights Update method

Chapter 2 The Multiplicative Weights Update method The Multiplicative Weights method is a simple idea which has been repeatedly discovered in fields as diverse as Machine Learning, Optimization, and Game

### LINEAR PROGRAMMING WITH ONLINE LEARNING

LINEAR PROGRAMMING WITH ONLINE LEARNING TATSIANA LEVINA, YURI LEVIN, JEFF MCGILL, AND MIKHAIL NEDIAK SCHOOL OF BUSINESS, QUEEN S UNIVERSITY, 143 UNION ST., KINGSTON, ON, K7L 3N6, CANADA E-MAIL:{TLEVIN,YLEVIN,JMCGILL,MNEDIAK}@BUSINESS.QUEENSU.CA

### By W.E. Diewert. July, Linear programming problems are important for a number of reasons:

APPLIED ECONOMICS By W.E. Diewert. July, 3. Chapter : Linear Programming. Introduction The theory of linear programming provides a good introduction to the study of constrained maximization (and minimization)

### Increasing for all. Convex for all. ( ) Increasing for all (remember that the log function is only defined for ). ( ) Concave for all.

1. Differentiation The first derivative of a function measures by how much changes in reaction to an infinitesimal shift in its argument. The largest the derivative (in absolute value), the faster is evolving.

### 1 Introduction. Linear Programming. Questions. A general optimization problem is of the form: choose x to. max f(x) subject to x S. where.

Introduction Linear Programming Neil Laws TT 00 A general optimization problem is of the form: choose x to maximise f(x) subject to x ∈ S where x = (x_1, ..., x_n)^T, f : R^n → R is the objective function, S

### The Expectation Maximization Algorithm A short tutorial

The Expectation Maximization Algorithm A short tutorial Sean Borman Comments and corrections to: em-tut at seanborman dot com July 8 2004 Last updated January 09, 2009 Revision history 2009-01-09 Corrected

### What is Linear Programming?

Chapter 1 What is Linear Programming? An optimization problem usually has three essential ingredients: a variable vector x consisting of a set of unknowns to be determined, an objective function of x to

### Geometry of Linear Programming

Chapter 2 Geometry of Linear Programming The intent of this chapter is to provide a geometric interpretation of linear programming problems. To conceive fundamental concepts and validity of different algorithms

### Breaking The Code. Ryan Lowe. Ryan Lowe is currently a Ball State senior with a double major in Computer Science and Mathematics and

Breaking The Code Ryan Lowe Ryan Lowe is currently a Ball State senior with a double major in Computer Science and Mathematics and a minor in Applied Physics. As a sophomore, he took an independent study

### Lecture 2: The SVM classifier

Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

### Notes on Support Vector Machines

Notes on Support Vector Machines Fernando Mira da Silva Fernando.Silva@inesc.pt Neural Network Group I N E S C November 1998 Abstract This report describes an empirical study of Support Vector Machines

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics rsalakhu@utstat.toronto.edu http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Definition of a Linear Program

Definition of a Linear Program Definition: A function f(x_1, x_2, ..., x_n) of x_1, x_2, ..., x_n is a linear function if and only if for some set of constants c_1, c_2, ..., c_n, f(x_1, x_2, ..., x_n) = c_1 x_1

### Online Learning, Stability, and Stochastic Gradient Descent

Online Learning, Stability, and Stochastic Gradient Descent arxiv:1105.4701v3 [cs.lg] 8 Sep 2011 September 9, 2011 Tomaso Poggio, Stephen Voinea, Lorenzo Rosasco CBCL, McGovern Institute, CSAIL, Brain

### Math 4310 Handout - Quotient Vector Spaces

Math 4310 Handout - Quotient Vector Spaces Dan Collins The textbook defines a subspace of a vector space in Chapter 4, but it avoids ever discussing the notion of a quotient space. This is understandable

### Course 221: Analysis Academic year , First Semester

Course 221: Analysis Academic year 2007-08, First Semester David R. Wilkins Copyright c David R. Wilkins 1989 2007 Contents 1 Basic Theorems of Real Analysis 1 1.1 The Least Upper Bound Principle................

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

### We shall turn our attention to solving linear systems of equations. Ax = b

Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A ∈ R^{m×n}, x ∈ R^n, and b ∈ R^m. We already saw examples of methods that required the solution of a linear system

### CHAPTER 2. Inequalities

CHAPTER 2 Inequalities In this section we add the axioms that describe the behavior of inequalities (the order axioms) to the list of axioms begun in Chapter 1. A thorough mastery of this section is essential

### 1. R In this and the next section we are going to study the properties of sequences of real numbers.

+a 1. R In this and the next section we are going to study the properties of sequences of real numbers. Definition 1.1. (Sequence) A sequence is a function with domain N. Example 1.2. A sequence of real

### RN-Codings: New Insights and Some Applications

RN-Codings: New Insights and Some Applications Abstract During any composite computation there is a constant need for rounding intermediate results before they can participate in further processing. Recently

### Online Lazy Updates for Portfolio Selection with Transaction Costs

Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence Online Lazy Updates for Portfolio Selection with Transaction Costs Puja Das, Nicholas Johnson, and Arindam Banerjee Department