# Efficient Projections onto the l 1 -Ball for Learning in High Dimensions

Save this PDF as:

Size: px
Start display at page:

Download "Efficient Projections onto the l 1 -Ball for Learning in High Dimensions"

## Transcription

2 the dimension are large. For instance, Shalev-Shwartz et al. (2007) give recent state-of-the-art methods for solving large scale support vector machines. Adapting these recent results to projection methods onto the l 1 ball poses algorithmic challenges. While projections onto l 2 balls are straightforward to implement in linear time with the appropriate data structures, projection onto an l 1 ball is a more involved task. The main contribution of this paper is the derivation of gradient projections with l 1 domain constraints that can be performed almost as fast as gradient projection with l 2 constraints. Our starting point is an efficient method for projection onto the probabilistic simplex. The basic idea is to show that, after sorting the vector we need to project, it is possible to calculate the projection exactly in linear time. This idea was rediscovered multiple times. It was first described in an abstract and somewhat opaque form in the work of Gafni and Bertsekas (1984) and Bertsekas (1999). Crammer and Singer (2002) rediscovered a similar projection algorithm as a tool for solving the dual of multiclass SVM. Hazan (2006) essentially reuses the same algorithm in the context of online convex programming. Our starting point is another derivation of Euclidean projection onto the simplex that paves the way to a few generalizations. First we show that the same technique can also be used for projecting onto the l 1 -ball. This algorithm is based on sorting the components of the vector to be projected and thus requires O(n log(n)) time. We next present an improvement of the algorithm that replaces sorting with a procedure resembling median-search whose expected time complexity is O(n). In many applications, however, the dimension of the feature space is very high yet the number of features which attain non-zero values for an example may be very small. For instance, in our experiments with text classification in Sec. 7, the dimension is two million (the bigram dictionary size) while each example has on average one-thousand non-zero features (the number of unique tokens in a document). Applications where the dimensionality is high yet the number of on features in each example is small render our second algorithm useless in some cases. We therefore shift gears and describe a more complex algorithm that employs redblack trees to obtain a linear dependence on the number of non-zero features in an example and only logarithmic dependence on the full dimension. The key to our construction lies in the fact that we project vectors that are the sum of a vector in the l 1 -ball and a sparse vector they are almost in the l 1 -ball. In conclusion to the paper we present experimental results that demonstrate the merits of our algorithms. We compare our algorithms with several specialized interior point (IP) methods as well as general methods from the literature for solving l 1 -penalized problems on both synthetic and real data (the MNIST handwritten digit dataset and the Reuters RCV1 corpus) for batch and online learning. Our projection based methods outperform competing algorithms in terms of sparsity, and they exhibit faster convergence and lower regret than previous methods. 2. Notation and Problem Setting We start by establishing the notation used throughout the paper. The set of integers 1 through n is denoted by [n]. Scalars are denoted by lower case letters and vectors by lower case bold face letters. We use the notation w b to designate that all of the components of w are greater than b. We use as a shorthand for the Euclidean norm 2. The other norm we use throughout the paper is the 1- norm of the vector, v 1 = n v i. Lastly, we consider order statistics and sorting vectors frequently throughout this paper. To that end, we let v (i) denote the i th order statistic of v, that is, v (1) v (2)... v (n) for v R n. In the setting considered in this paper we are provided with a convex function L : R n R. Our goal is to find the minimum of L(w) subject to an l 1 -norm constraint on w. Formally, the problem we need to solve is minimize L(w) s.t. w w 1 z. (1) Our focus is on variants of the projected subgradient method for convex optimization (Bertsekas, 1999). Projected subgradient methods minimize a function L(w) subject to the constraint that w X, for X convex, by generating the sequence {w (t) } via w (t+1) = Π X (w (t) η t (t)) (2) where (t) is (an unbiased estimate of) the (sub)gradient of L at w (t) and Π X (x) = argmin y { x y y X} is Euclidean projection of x onto X. In the rest of the paper, the main algorithmic focus is on the projection step (computing an unbiased estimate of the gradient of L(w) is straightforward in the applications considered in this paper, as is the modification of w (t) by (t) ). 3. Euclidean Projection onto the Simplex For clarity, we begin with the task of performing Euclidean projection onto the positive simplex; our derivation naturally builds to the more efficient algorithms. As such, the most basic projection task we consider can be formally described as the following optimization problem, minimize w 1 2 w v 2 2 s.t. n w i = z, w i 0. (3)

3 When z = 1 the above is projection onto the probabilistic simplex. The Lagrangian of the problem in Eq. (3) is ( L(w,ζ) = 1 n ) 2 w v 2 + θ w i z ζ w, where θ R is a Lagrange multiplier and ζ R n + is a vector of non-negative Lagrange multipliers. Differentiating with respect to w i and comparing to zero gives the dl optimality condition, dw i = w i v i + θ ζ i = 0. The complementary slackness KKT condition implies that whenever w i > 0 we must have that ζ i = 0. Thus, if w i > 0 we get that w i = v i θ + ζ i = v i θ. (4) All the non-negative elements of the vector w are tied via a single variable, so knowing the indices of these elements gives a much simpler problem. Upon first inspection, finding these indices seems difficult, but the following lemma (Shalev-Shwartz & Singer, 2006) provides a key tool in deriving our procedure for identifying non-zero elements. Lemma 1. Let w be the optimal solution to the minimization problem in Eq. (3). Let s and j be two indices such that v s > v j. If w s = 0 then w j must be zero as well. Denoting by I the set of indices of the non-zero components of the sorted optimal solution, I = {i [n] : v (i) > 0}, we see that Lemma 1 implies that I = [ρ] for some 1 ρ n. Had we known ρ we could have simply used Eq. (4) to obtain that n n ρ ρ ( w i = w (i) = w (i) = v(i) θ ) = z and therefore ( ρ ) θ = 1 v (i) z ρ. (5) Given θ we can characterize the optimal solution for w as w i = max {v i θ, 0}. (6) We are left with the problem of finding the optimal ρ, and the following lemma (Shalev-Shwartz & Singer, 2006) provides a simple solution once we sort v in descending order. Lemma 2. Let w be the optimal solution to the minimization problem given in Eq. (3). Let µ denote the vector obtained by sorting v in a descending order. Then, the number of strictly positive elements in w is { ( j ) } ρ(z,µ) = max j [n] : µ j 1 µ r z > 0 j r=1 The pseudo-code describing the O(n log n) procedure for solving Eq. (3) is given in Fig. 1.. INPUT: A vector v R n and a scalar z > 0 Sort v into µ : µ 1 µ 2... µ { ( p j ) } Find ρ = max j [n] : µ j 1 j µ r z > 0 ( ρ ) r=1 Define θ = 1 ρ µ i z OUTPUT: w s.t. w i = max {v i θ, 0} Figure 1. Algorithm for projection onto the simplex. 4. Euclidean Projection onto the l 1 -Ball We next modify the algorithm to handle the more general l 1 -norm constraint, which gives the minimization problem minimize w R n w v 2 2 s.t. w 1 z. (7) We do so by presenting a reduction to the problem of projecting onto the simplex given in Eq. (3). First, we note that if v 1 z then the solution of Eq. (7) is w = v. Therefore, from now on we assume that v 1 > z. In this case, the optimal solution must be on the boundary of the constraint set and thus we can replace the inequality constraint w 1 z with an equality constraint w 1 = z. Having done so, the sole difference between the problem in Eq. (7) and the one in Eq. (3) is that in the latter we have an additional set of constraints, w 0. The following lemma indicates that each non-zero component of the optimal solution w shares the sign of its counterpart in v. Lemma 3. Let w be an optimal solution of Eq. (7). Then, for all i, w i v i 0. Proof. Assume by contradiction that the claim does not hold. Thus, there exists i for which w i v i < 0. Let ŵ be a vector such that ŵ i = 0 and for all j i we have ŵ j = w j. Therefore, ŵ 1 = w 1 w i z and hence ŵ is a feasible solution. In addition, w v 2 2 ŵ v 2 2 = (w i v i ) 2 (0 v i ) 2 = w 2 i 2w i v i > w 2 i > 0. We thus constructed a feasible solution ŵ which attains an objective value smaller than that of w. This leads us to the desired contradiction. Based on the above lemma and the symmetry of the objective, we are ready to present our reduction. Let u be a vector obtained by taking the absolute value of each component of v, u i = v i. We now replace Eq. (7) with minimize β R n β u 2 2 s.t. β 1 z and β 0. (8) Once we obtain the solution for the problem above we construct the optimal of Eq. (7) by setting w i = sign(v i )β i.

5 INPUT A balanced tree T and a scalar z > 0 INITIALIZE v =, ρ = n + 1, s = z CALL PIVOTSEARCH(root(T ), 0, 0) PROCEDURE PIVOTSEARCH(v, ρ, s) COMPUTE ˆρ = ρ + r(v) ; ŝ = s + σ(v) IF ŝ < vˆρ + z // v pivot IF v > v v = v ; ρ = ˆρ ; s = ŝ IF leaf T (v) RETURN θ = (s z)/ρ CALL PIVOTSEARCH(left T (v), ˆρ,ŝ) ELSE // v < pivot IF leaf T (v) RETURN θ = (s z)/ρ ( CALL PIVOTSEARCH rightt (v),ρ,s ) ENDPROCEDURE Figure 3. Efficient search of pivot value for sparse feature spaces. by iterating w (t+1) = Π W ( w (t) + g (t)) where g (t) = η t (t), W = {w w 1 z} and Π W is projection onto this set. In many applications the dimension of the feature space is very high yet the number of features which attain a non-zero value for each example is very small (see for instance our experiments on text documents in Sec. 7). It is straightforward to implement the gradient-related updates in time which is proportional to the number of non-zero features, but the time complexity of the projection algorithm described in the previous section is linear in the dimension. Therefore, using the algorithm verbatim could be prohibitively expensive in applications where the dimension is high yet the number of features which are on in each example is small. In this section we describe a projection algorithm that updates the vector w (t) with g (t) and scales linearly in the number of non-zero entries of g (t) and only logarithmically in the total number of features (i.e. non-zeros in w (t) ). The first step in facilitating an efficient projection for sparse feature spaces is to represent the projected vector as a raw vector v by incorporating a global shift that is applied to each non-zero component. Specifically, each projection step amounts to deducting θ from each component of v and thresholding the result at zero. Let us denote by θ t the shift value used on the t th iteration of the algorithm and by Θ t the cumulative sum of the shift values, Θ t = s t θ s. The representation we employ enables us to perform the step in which we deduct θ t from all the elements of the vector implicitly, adhering to the goal of performing a sublinear number of operations. As before, we assume that the goal is to project onto the simplex. Equipped with these variables, the j th component of the projected vector after t projected gradient steps can be written as max{v j Θ t,0}. The second substantial modification to the core algorithm is to keep only the non-zero components of the weight vector in a red-black tree (Cormen et al., 2001). The red-black tree facilitates an efficient search for the pivot element (v (ρ) ) in time which is logarithmic in the dimension, as we describe in the sequel. Once the pivot element is found we implicitly deduct θ t from all the non-zero elements in our weight vector by updating Θ t. We then remove all the components that are less than v (ρ) (i.e. less than Θ t ); this removal is efficient and requires only logarithmic time (Tarjan, 1983). The course of the algorithm is as follows. After t projected gradient iterations we have a vector v (t) whose non-zero elements are stored in a red-black tree T and a global deduction value Θ t which is applied to each non-zero component just-in-time, i.e. when needed. Therefore, each non-zero weight is accessed as v j Θ t while T does not contain the zero elements of the vector. When updating v with a gradient, we modify the vector v (t) by adding to it the gradientbased vector g (t) with k non-zero components. This update is done using k deletions (removing v i from T such that g (t) i 0) followed by k re-insertions of v i = (v i + g (t) i ) into T, which takes O(k log(n)) time. Next we find in O(log(n)) time the value of θ t. Fig. 3 contains the algorithm for this step; it is explained in the sequel. The last step removes all elements of the new raw vector v (t) +g (t) which become zero due to the projection. This step is discussed at the end of this section. In contrast to standard tree-based search procedure, to find θ t we need to find a pair of consecutive values in v that correspond to v (ρ) and v (ρ+1). We do so by keeping track of the smallest element that satisfies the left hand side of Eq. (9) while searching based on the condition given on the right hand side of the same equation. T is keyed on the values of the un-shifted vector v t. Thus, all the children in the left (right) sub-tree of a node v represent values in v t which are smaller (larger) than v. In order to efficiently find θ t we keep at each node the following information: (a) The value of the component, simply denoted as v. (b) The number of elements in the right sub-tree rooted at v, denoted r(v), including the node v. (c) The sum of the elements in the right sub-tree rooted at v, denoted σ(v), including the value v itself. Our goal is to identify the pivot element v (ρ) and its index ρ. In the previous section we described a simple condition for checking whether an element in v is greater or smaller than the pivot value. We now rewrite this expression yet one more time. A component with value v is not

6 smaller than the pivot iff the following holds: v j > {j : v j v} v + z. (11) j:v j v The variables in the red-black tree form the infrastructure for performing efficient recursive computation of Eq. (11). Note also that the condition expressed in Eq. (11) still holds when we do not deduct Θ t from all the elements in v. The search algorithm maintains recursively the number ρ and the sum s of the elements that have been shown to be greater or equal to the pivot. We start the search with the root node of T, and thus initially ρ = 0 and s = 0. Upon entering a new node v, the algorithm checks whether the condition given by Eq. (11) holds for v. Since ρ and s were computed for the parent of v, we need to incorporate the number and the sum of the elements that are larger than v itself. By construction, these variables are r(v) and σ(v), which we store at the node v itself. We let ˆρ = ρ + r(v) and ŝ = s + σ(v), and with these variables handy, Eq. (11) distills to the expression ŝ < vˆρ+z. If the inequality holds, we know that v is either larger than the pivot or it may be the pivot itself. We thus update our current hypothesis for µ ρ and ρ (designated as v and ρ in Fig. 3). We continue searching the left sub-tree (left T (v)) which includes all elements smaller than v. If inequality ŝ < vˆρ + z does not hold, we know that v < µ ρ, and we thus search the right subtree (right T (v)) and keep ρ and s intact. The process naturally terminates once we reach a leaf, where we can also calculate the correct value of θ using Eq. (10). Once we find θ t (if θ t 0) we update the global shift, Θ t+1 = Θ t + θ t. We need to discard all the elements in T smaller than Θ t+1, which we do using Tarjan s (1983) algorithm for splitting a red-black tree. This step is logarithmic in the total number of non-zero elements of v t. Thus, as the additional variables in the tree can be updated in constant time as a function of a node s child nodes in T, each of the operations previously described can be performed in logarthmic time (Cormen et al., 2001), giving us a total update time of O(k log(n)). 7. Experiments We now present experimental results demonstrating the effectiveness of the projection algorithms. We first report results for experiments with synthetic data and then move to experiments with high dimensional natural datasets. In our experiment with synthetic data, we compared variants of the projected subgradient algorithm (Eq. (2)) for l 1 -regularized least squares and l 1 -regularized logistic regression. We compared our methods to a specialized coordinate-descent solver for the least squares problem due to Friedman et al. (2007) and to very fast interior point Coordinate L1 Line L1 Batch IP Approximate Flops x Coordinate L1 Batch IP Approximate Flops x 10 9 Figure 4. Comparison of methods on l 1-regularized least squares. The left has dimension n = 800, the right n = 4000 methods for both least squares and logistic regression (Koh et al., 2007; Kim et al., 2007). The algorithms we use are batch projected gradient, stochastic projected subgradient, and batch projected gradient augmented with a backtracking line search (Koh et al., 2007). The IP and coordinatewise methods both solve regularized loss functions of the form f(w) = L(w) + λ w 1 rather than having an l 1 - domain constraint, so our objectives are not directly comparable. To surmount this difficulty, we first minimize L(w)+λ w 1 and use the 1-norm of the resulting solution w as the constraint for our methods. To generate the data for the least squares problem setting, we chose a w with entries distributed normally with 0 mean and unit variance and randomly zeroed 50% of the vector. The data matrix X R m n was random with entries also normally distributed. To generate target values for the least squares problem, we set y = Xw + ν, where the components of ν were also distributed normally at random. In the case of logistic regression, we generated data X and the vector w identically, but the targets y i were set to be sign(w x i ) with probability 90% and to sign(w x i ) otherwise. We ran two sets of experiments, one each for n = 800 and n = We also set the number of examples m to be equal to n. For the subgradient methods in these experiments and throughout the remainder, we set η t = η 0 / t, choosing η 0 to give reasonable performance. (η 0 too large will mean that the initial steps of the gradient method are not descent directions; the noise will quickly disappear because the step sizes are proportional to 1/ t). Fig. 4 and Fig. 5 contain the results of these experiments and plot f(w) f(w ) as a function of the number of floating point operations. From the figures, we see that the projected subgradient methods are generally very fast at the outset, getting us to an accuracy of f(w) f(w ) 10 2 quickly, but their rate of convergence slows over time. The fast projection algorithms we have developed, however, allow projected-subgradient methods to be very competitive with specialized methods, even on these relatively small problem sizes. On higher-dimension data sets interior point methods are infeasible or very slow. The rightmost graphs in Fig. 4 and Fig. 5 plot f(w) f(w ) as functions of floating point operations for least squares and logistic re-

### Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.

Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

### Efficient Learning using Forward-Backward Splitting

Efficient Learning using Forward-Backward Splitting John Duchi University of California Berkeley jduchi@cs.berkeley.edu Yoram Singer Google singer@google.com Abstract We describe, analyze, and experiment

### Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

### The p-norm generalization of the LMS algorithm for adaptive filtering

The p-norm generalization of the LMS algorithm for adaptive filtering Jyrki Kivinen University of Helsinki Manfred Warmuth University of California, Santa Cruz Babak Hassibi California Institute of Technology

### Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

### Online Classification on a Budget

Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Sparse Online Learning via Truncated Gradient

Sparse Online Learning via Truncated Gradient John Langford Yahoo! Research jl@yahoo-inc.com Lihong Li Department of Computer Science Rutgers University lihong@cs.rutgers.edu Tong Zhang Department of Statistics

### 1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09

1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Distributed Machine Learning and Big Data

Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya

### A Potential-based Framework for Online Multi-class Learning with Partial Feedback

A Potential-based Framework for Online Multi-class Learning with Partial Feedback Shijun Wang Rong Jin Hamed Valizadegan Radiology and Imaging Sciences Computer Science and Engineering Computer Science

### Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### Support Vector Machines

Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

### Online Convex Optimization

E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting

### 2.3 Convex Constrained Optimization Problems

42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

### Statistical machine learning, high dimension and big data

Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### Numerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen

(für Informatiker) M. Grepl J. Berger & J.T. Frings Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2010/11 Problem Statement Unconstrained Optimality Conditions Constrained

### Linear Programming. March 14, 2014

Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1

### Mind the Duality Gap: Logarithmic regret algorithms for online optimization

Mind the Duality Gap: Logarithmic regret algorithms for online optimization Sham M. Kakade Toyota Technological Institute at Chicago sham@tti-c.org Shai Shalev-Shartz Toyota Technological Institute at

### Support Vector Machine (SVM)

Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Online Lazy Updates for Portfolio Selection with Transaction Costs

Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence Online Lazy Updates for Portfolio Selection with Transaction Costs Puja Das, Nicholas Johnson, and Arindam Banerjee Department

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

### A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

### The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R

Journal of Machine Learning Research 15 (2014) 489-493 Submitted 3/13; Revised 8/13; Published 2/14 The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R Haotian

### 24. The Branch and Bound Method

24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NP-complete. Then one can conclude according to the present state of science that no

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### Supporting Online Material for

www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence

### Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

### Linear Programming for Optimization. Mark A. Schulze, Ph.D. Perceptive Scientific Instruments, Inc.

1. Introduction Linear Programming for Optimization Mark A. Schulze, Ph.D. Perceptive Scientific Instruments, Inc. 1.1 Definition Linear programming is the name of a branch of applied mathematics that

### Federated Optimization: Distributed Optimization Beyond the Datacenter

Federated Optimization: Distributed Optimization Beyond the Datacenter Jakub Konečný School of Mathematics University of Edinburgh J.Konecny@sms.ed.ac.uk H. Brendan McMahan Google, Inc. Seattle, WA 98103

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Large Scale Learning to Rank

Large Scale Learning to Rank D. Sculley Google, Inc. dsculley@google.com Abstract Pairwise learning to rank methods such as RankSVM give good performance, but suffer from the computational burden of optimizing

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725

Duality in General Programs Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: duality in linear programs Given c R n, A R m n, b R m, G R r n, h R r : min x R n c T x max u R m, v R r b T

### Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used

### Overview of Math Standards

Algebra 2 Welcome to math curriculum design maps for Manhattan- Ogden USD 383, striving to produce learners who are: Effective Communicators who clearly express ideas and effectively communicate with diverse

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Fast Sequential Summation Algorithms Using Augmented Data Structures

Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer

### Solving polynomial least squares problems via semidefinite programming relaxations

Solving polynomial least squares problems via semidefinite programming relaxations Sunyoung Kim and Masakazu Kojima August 2007, revised in November, 2007 Abstract. A polynomial optimization problem whose

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Tianbao Yang, Qihang Lin, Rong Jin Tutorial@SIGKDD 2015 Sydney, Australia Department of Computer Science, The University of Iowa, IA, USA Department of

### 10. Proximal point method

L. Vandenberghe EE236C Spring 2013-14) 10. Proximal point method proximal point method augmented Lagrangian method Moreau-Yosida smoothing 10-1 Proximal point method a conceptual algorithm for minimizing

### OPRE 6201 : 2. Simplex Method

OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2

### Online Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes

Advanced Course in Machine Learning Spring 2011 Online Learning Lecturer: Shai Shalev-Shwartz Scribe: Shai Shalev-Shwartz In this lecture we describe a different model of learning which is called online

### Definition of a Linear Program

Definition of a Linear Program Definition: A function f(x 1, x,..., x n ) of x 1, x,..., x n is a linear function if and only if for some set of constants c 1, c,..., c n, f(x 1, x,..., x n ) = c 1 x 1

### 1 Polyhedra and Linear Programming

CS 598CSC: Combinatorial Optimization Lecture date: January 21, 2009 Instructor: Chandra Chekuri Scribe: Sungjin Im 1 Polyhedra and Linear Programming In this lecture, we will cover some basic material

### Linear Programming. April 12, 2005

Linear Programming April 1, 005 Parts of this were adapted from Chapter 9 of i Introduction to Algorithms (Second Edition) /i by Cormen, Leiserson, Rivest and Stein. 1 What is linear programming? The first

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Advanced Algebra 2. I. Equations and Inequalities

Advanced Algebra 2 I. Equations and Inequalities A. Real Numbers and Number Operations 6.A.5, 6.B.5, 7.C.5 1) Graph numbers on a number line 2) Order real numbers 3) Identify properties of real numbers

### DUOL: A Double Updating Approach for Online Learning

: A Double Updating Approach for Online Learning Peilin Zhao School of Comp. Eng. Nanyang Tech. University Singapore 69798 zhao6@ntu.edu.sg Steven C.H. Hoi School of Comp. Eng. Nanyang Tech. University

### L1 vs. L2 Regularization and feature selection.

L1 vs. L2 Regularization and feature selection. Paper by Andrew Ng (2004) Presentation by Afshin Rostami Main Topics Covering Numbers Definition Convergence Bounds L1 regularized logistic regression L1

### Introduction to Machine Learning NPFL 054

Introduction to Machine Learning NPFL 054 http://ufal.mff.cuni.cz/course/npfl054 Barbora Hladká hladka@ufal.mff.cuni.cz Martin Holub holub@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and

### Lecture 6: Logistic Regression

Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

### ME128 Computer-Aided Mechanical Design Course Notes Introduction to Design Optimization

ME128 Computer-ided Mechanical Design Course Notes Introduction to Design Optimization 2. OPTIMIZTION Design optimization is rooted as a basic problem for design engineers. It is, of course, a rare situation

### Several Views of Support Vector Machines

Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

### Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### Approximation Algorithms

Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

### Accelerated Parallel Optimization Methods for Large Scale Machine Learning

Accelerated Parallel Optimization Methods for Large Scale Machine Learning Haipeng Luo Princeton University haipengl@cs.princeton.edu Patrick Haffner and Jean-François Paiement AT&T Labs - Research {haffner,jpaiement}@research.att.com

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### Online Learning meets Optimization in the Dual

Online Learning meets Optimization in the Dual Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., The Hebrew University, Jerusalem 91904, Israel 2 Google Inc., 1600 Amphitheater

### Lecture 10: Regression Trees

Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

### Chap 4 The Simplex Method

The Essence of the Simplex Method Recall the Wyndor problem Max Z = 3x 1 + 5x 2 S.T. x 1 4 2x 2 12 3x 1 + 2x 2 18 x 1, x 2 0 Chap 4 The Simplex Method 8 corner point solutions. 5 out of them are CPF solutions.

CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.

### LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

### Lecture 2: The SVM classifier

Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

### Linear Systems COS 323

Linear Systems COS 323 Last time: Constrained Optimization Linear constrained optimization Linear programming (LP) Simplex method for LP General optimization With equality constraints: Lagrange multipliers

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### Algorithms and Data Structures

Algorithm Analysis Page 1 BFH-TI: Softwareschule Schweiz Algorithm Analysis Dr. CAS SD01 Algorithm Analysis Page 2 Outline Course and Textbook Overview Analysis of Algorithm Pseudo-Code and Primitive Operations

### Sub-class Error-Correcting Output Codes

Sub-class Error-Correcting Output Codes Sergio Escalera, Oriol Pujol and Petia Radeva Computer Vision Center, Campus UAB, Edifici O, 08193, Bellaterra, Spain. Dept. Matemàtica Aplicada i Anàlisi, Universitat

### An Empirical Study of Two MIS Algorithms

An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. tushar.bisht@research.iiit.ac.in,

### Chapter 10: Network Flow Programming

Chapter 10: Network Flow Programming Linear programming, that amazingly useful technique, is about to resurface: many network problems are actually just special forms of linear programs! This includes,

### CS/COE 1501 http://cs.pitt.edu/~bill/1501/

CS/COE 1501 http://cs.pitt.edu/~bill/1501/ Lecture 01 Course Introduction Meta-notes These notes are intended for use by students in CS1501 at the University of Pittsburgh. They are provided free of charge

### APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

### Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

### LINEAR PROGRAMMING P V Ram B. Sc., ACA, ACMA Hyderabad

LINEAR PROGRAMMING P V Ram B. Sc., ACA, ACMA 98481 85073 Hyderabad Page 1 of 19 Question: Explain LPP. Answer: Linear programming is a mathematical technique for determining the optimal allocation of resources

### Solving Mixed Integer Linear Programs Using Branch and Cut Algorithm

1 Solving Mixed Integer Linear Programs Using Branch and Cut Algorithm by Shon Albert A Project Submitted to the Graduate Faculty of North Carolina State University in Partial Fulfillment of the Requirements

### 1 Solving LPs: The Simplex Algorithm of George Dantzig

Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

### ! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three