Accelerated Parallel Optimization Methods for Large Scale Machine Learning
Haipeng Luo, Princeton University
Patrick Haffner and Jean-François Paiement, AT&T Labs - Research, {haffner,jpaiement}@research.att.com

Abstract

The growing amount of high-dimensional data in machine learning applications requires more efficient and scalable optimization algorithms. In this work, we combine two techniques, parallelism and Nesterov's acceleration, to design faster algorithms for $L_1$-regularized loss. We first simplify BOOM [11], a variant of gradient descent, and study it in a unified framework, which allows us not only to propose a refined measurement of sparsity that improves BOOM, but also to show that BOOM is provably slower than FISTA [1]. Moving on to parallel coordinate descent methods, we then propose an efficient accelerated version of Shotgun [3], improving the convergence rate from $O(1/t)$ to $O(1/t^2)$. Our algorithm enjoys a more concise form and analysis than previous work, and also allows one to study several related works in a unified way.

1 Introduction

Many machine learning problems boil down to optimizing specific objective functions. In this paper, we consider the following generic optimization problem associated with $L_1$-regularized loss:

$$\min_{w \in \mathbb{R}^d} F(w) = \min_{w \in \mathbb{R}^d} \sum_{i=1}^n \ell(x_i^T w, y_i) + \lambda \|w\|_1,$$

where $(x_i, y_i)_{i=1,\ldots,n}$ represent $n$ training examples of the task, each with feature vector $x_i \in \mathbb{R}^d$ and label/response $y_i$, and $\ell$ is a smooth convex loss function with respect to its first argument. This objective function is at the heart of several important machine learning problems, including Lasso [21], where $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$, and sparse logistic regression [14], where $\ell(\hat{y}, y) = \ln(1 + \exp(-y\hat{y}))$.
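As a concrete reading of the objective above, the following minimal sketch evaluates $F(w)$ for the two losses mentioned. The function names and the toy data are illustrative assumptions, not part of the paper; labels in {-1, +1} are assumed for the logistic loss, and plain Python lists are used so that the sketch has no dependencies.

```python
import math

def squared_loss(y_hat, y):
    # Lasso loss: 0.5 * (y_hat - y)^2
    return 0.5 * (y_hat - y) ** 2

def logistic_loss(y_hat, y):
    # Logistic loss ln(1 + exp(-y * y_hat)); labels assumed to be -1 or +1
    return math.log1p(math.exp(-y * y_hat))

def objective(X, y, w, lam, loss):
    # F(w) = sum_i loss(x_i^T w, y_i) + lam * ||w||_1
    total = 0.0
    for x_i, y_i in zip(X, y):
        pred = sum(x_ij * w_j for x_ij, w_j in zip(x_i, w))
        total += loss(pred, y_i)
    return total + lam * sum(abs(w_j) for w_j in w)

# Toy data (hypothetical): two examples, two features
X = [[1.0, 0.0], [0.0, 2.0]]
y = [1.0, -1.0]
w = [0.5, -0.5]
print(objective(X, y, w, 0.1, squared_loss))
print(objective(X, y, w, 0.1, logistic_loss))
```

Passing the loss as a function keeps the regularizer and the data term separate, which mirrors the "smooth part plus simple separable non-smooth part" structure that the rest of the paper relies on.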
Much effort has been put into developing optimization methods for this model, ranging from coordinate minimization [5], randomized coordinate descent [18], stochastic gradient descent [2, 17], and dual coordinate ascent [19] to higher-order methods such as interior point methods [6] and L-BFGS [15], to name a few. However, the need for faster and more scalable algorithms is still growing due to the emergence of applications that make use of massive amounts of high-dimensional data (e.g. [20]).

One direction for designing faster algorithms is to utilize parallel computation on shared-memory multiprocessors or on clusters. Some methods parallelize over examples [7, 10, 22], while others parallelize over features [3, 16]. As the references for the latter approach argue, it is sometimes preferable to parallelize over features for $L_1$-regularized loss, which will thus be the focus of this work.

Another direction for designing more efficient algorithms is to make use of the curvature of the objective function to obtain a faster theoretical convergence rate. It is well known that for smooth objective functions, vanilla gradient descent only converges at a suboptimal rate $O(1/t)$, while Nesterov's acceleration technique allows the optimal rate $O(1/t^2)$ [13]. Recent progress along this line includes generalizing Nesterov's acceleration to randomized coordinate descent [12, 8]. Even if
the objective is not smooth, algorithms that still enjoy the same fast convergence rate have been proposed for functions that are the sum of a smooth part and a simple separable non-smooth part, such as the $L_1$-regularized loss we consider here (see for instance the FISTA algorithm [1]).

Algorithm 1: Parallel Accelerated Gradient Descent
  Input: normalized data matrix $X$, smoothness parameter $\beta$, regularization parameter $\lambda$.
  Set step size $\eta = 1/(c\beta)$, where
    Option 1 (FISTA): $c = \rho$, where $\rho$ is the spectral radius of $X^T X$.
    Option 2 (BOOM): $c = \kappa$, where $\kappa = \max_i \kappa_i$ and $\kappa_i = |\{j : X_{ij} \neq 0\}|$.
    Option 3 (Our Method): $c = \bar{\kappa}$, where $\bar{\kappa} = \max_j \sum_{i=1}^n \kappa_i X_{ij}^2$.
  Initialize $w_1 = u_1$.
  for $t = 1, 2, \ldots$ do (in parallel for each coordinate)
    $w_{t+1} = P_{\lambda\eta}(u_t - \eta \nabla f(u_t))$.
    $u_{t+1} = (1 - \gamma_t) w_{t+1} + \gamma_t w_t$.

In this work we aim to combine both of the techniques mentioned above, that is, to design parallelizable accelerated optimization algorithms. Similar work includes [11, 4].

We start by revisiting and improving BOOM [11], a parallelizable variant of accelerated gradient descent that tries to utilize the sparsity and elliptical geometry of the data. We first give a simplified form of BOOM which allows one to clearly see the connection between BOOM and FISTA, and also to study these algorithms in a unified framework. Surprisingly, we show that BOOM is actually provably slower than FISTA when the data is normalized, which is an equivalent way of utilizing the elliptical geometry. Moreover, we also propose a refined measurement of sparsity that improves the one used in BOOM.

Moving on to parallel coordinate descent algorithms, we then propose an accelerated version of the Shotgun algorithm [3]. Shotgun converges as fast as vanilla gradient descent while only updating a small subset of coordinates per iteration (in parallel). Our accelerated version further improves the convergence rate from $O(1/t)$ to $O(1/t^2)$, that is, the same convergence rate as FISTA, while updating far fewer coordinates per iteration.
Our algorithm is a unified framework of accelerated single coordinate descent [12, 8], multiple coordinate descent and full gradient descent. However, instead of directly generalizing [12, 8] or the very recent work [4], we take a different route to present Nesterov's acceleration technique, so that our method enjoys a simpler form that makes use of only one auxiliary sequence, and our analysis is also much more concise. We discuss how these algorithms are connected and which one is optimal under different circumstances. We finally mention several computational tricks that allow a highly efficient implementation of our algorithm.

2 Gradient Descent: Improving BOOM

In this section, we investigate algorithms that make use of a full gradient at each round. Specifically, we revisit, simplify and improve the BOOM algorithm [11]. There are essentially two forms of Nesterov's acceleration technique: one follows the original presentation of Nesterov and uses two auxiliary sequences of points, and the other follows the presentation of FISTA and uses only one auxiliary sequence. BOOM falls into the first category. To make the algorithm clearer and the connection to other algorithms more explicit, we will first translate BOOM into the second form.

Before doing so, to make things even more concise we assume that the data is normalized. Specifically, let $X$ be an $n$ by $d$ matrix such that the $i$-th row is $x_i^T$. We assume each column of $X$ is normalized such that $\sum_{i=1}^n X_{ij}^2 = 1$ for all $j \in \{1, \ldots, d\}$. This is without loss of generality (see further discussion at the end of this section).

Now we are ready to present our simplified version of BOOM (see Algorithm 1, Option 2). Here, $\beta$ is the smoothness parameter of the loss function, such that its second derivative $\ell''(\hat{y}, y)$ (with respect to the first argument) is upper bounded by $\beta$ for all $\hat{y}$ and $y$. For instance, $\beta = 1$ for the square loss used in Lasso and $\beta = 1/4$ for logistic loss.
$\gamma_t$ is the usual coefficient for Nesterov's technique, which satisfies $\gamma_t = (1 - \theta_t)/\theta_{t+1}$, with $\theta_0 = 0$ and $\theta_{t+1} = \big(1 + \sqrt{1 + 4\theta_t^2}\big)/2$. Finally, the shrinkage function $P$ is defined as $P_a(w)_j = \mathrm{sign}(w_j) \max\{|w_j| - a, 0\}$. One can see that BOOM explicitly makes use of the sparsity of the data, and uses $\kappa$, a measurement of sparsity, to scale the gradient.
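The pieces described above (step size $1/(c\beta)$, shrinkage, and the $\theta_t$ recursion) can be put together in a short sketch. This is an illustrative implementation under stated assumptions rather than the authors' code: it takes $f$ to be the smooth loss part of $F$, uses the squared loss (so $\beta = 1$), runs sequentially instead of in parallel, and all function names and the toy matrix are hypothetical.

```python
import math

def soft_threshold(v, a):
    # Shrinkage P_a(v)_j = sign(v_j) * max(|v_j| - a, 0)
    return [math.copysign(max(abs(vj) - a, 0.0), vj) for vj in v]

def sparsity_constants(X):
    # kappa_i = number of nonzeros in row i
    # kappa = max_i kappa_i (Option 2), kappa_bar = max_j sum_i kappa_i * X_ij^2 (Option 3)
    row_nnz = [sum(1 for v in row if v != 0) for row in X]
    d = len(X[0])
    kappa_bar = max(sum(k * row[j] ** 2 for k, row in zip(row_nnz, X))
                    for j in range(d))
    return max(row_nnz), kappa_bar

def grad_squared_loss(X, y, u):
    # Gradient of f(u) = sum_i 0.5 * (x_i^T u - y_i)^2 (assumed smooth part)
    g = [0.0] * len(u)
    for row, yi in zip(X, y):
        r = sum(a * b for a, b in zip(row, u)) - yi
        for j, a in enumerate(row):
            g[j] += r * a
    return g

def accelerated_gradient(X, y, lam, c, beta=1.0, iters=200):
    # Algorithm 1 with a caller-supplied constant c (rho, kappa or kappa_bar)
    d = len(X[0])
    eta = 1.0 / (c * beta)
    w = [0.0] * d
    u = list(w)
    theta = 1.0  # theta_1, which follows from theta_0 = 0
    for _ in range(iters):
        g = grad_squared_loss(X, y, u)
        w_next = soft_threshold([uj - eta * gj for uj, gj in zip(u, g)], lam * eta)
        theta_next = (1.0 + math.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        gamma = (1.0 - theta) / theta_next
        u = [(1.0 - gamma) * wn + gamma * wo for wn, wo in zip(w_next, w)]
        w, theta = w_next, theta_next
    return w

# Toy matrix (hypothetical) whose columns have unit Euclidean norm
X = [[0.6, 0.8], [0.8, 0.0], [0.0, 0.6]]
w_true = [1.0, -2.0]
y = [sum(a * b for a, b in zip(row, w_true)) for row in X]
kappa, kappa_bar = sparsity_constants(X)
w_hat = accelerated_gradient(X, y, lam=0.0, c=kappa_bar, iters=500)
print(kappa, kappa_bar, w_hat)
```

For this toy matrix, $X^T X$ has eigenvalues 0.52 and 1.48, so $\rho = 1.48$, while $\bar{\kappa} = 1.64$ and $\kappa = 2$. The three options therefore give step sizes in the order FISTA > Option 3 > BOOM.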
When written in this form, it is clear that BOOM is very close to FISTA (see Algorithm 1, Option 1). The only difference is that BOOM replaces $\rho$ with the sparsity $\kappa$. The questions are: does this slight modification result in a faster algorithm? Does BOOM converge faster by utilizing the sparsity of the data? The following lemma and theorem answer these questions in the negative.

Lemma 1. Let $\rho$, $\kappa$ and $\bar{\kappa}$ be defined as in Algorithm 1. Then we have $\rho \le \bar{\kappa} \le \kappa$.

Proof. Let $z$ be a unit eigenvector of $X^T X$ with respect to the eigenvalue $\rho$. One has

$$\rho = z^T(\rho z) = z^T(X^T X z) = \|Xz\|^2 = \sum_{i=1}^n \Big(\sum_{j: X_{ij} \neq 0} X_{ij} z_j\Big)^2.$$

By the Cauchy-Schwarz inequality, we continue

$$\rho \le \sum_{i=1}^n \Big(\sum_{j: X_{ij} \neq 0} 1\Big)\Big(\sum_{j: X_{ij} \neq 0} X_{ij}^2 z_j^2\Big) = \sum_{i=1}^n \kappa_i \sum_{j: X_{ij} \neq 0} X_{ij}^2 z_j^2 = \sum_{j=1}^d z_j^2 \sum_{i=1}^n \kappa_i X_{ij}^2 \le \bar{\kappa} \le \kappa,$$

where the last two inequalities make use of the facts that $\sum_j z_j^2 = 1$ and $\sum_i X_{ij}^2 = 1$ respectively.

Theorem 1. Suppose $F(w)$ admits a minimizer $w^*$. As long as $c \ge \rho$, Algorithm 1 ensures

$$F(w_t) - F(w^*) \le \frac{2c\beta \|w_1 - w^*\|^2}{t^2}$$

after $t$ iterations. In other words, to reach $\epsilon$ accuracy, Algorithm 1 needs $T_\epsilon = O(\sqrt{c\beta/\epsilon})$ iterations.

We omit the proof of Theorem 1 since it is a direct generalization of the original proof for FISTA. Together with Lemma 1, this theorem suggests that we should always choose FISTA over BOOM if we know $\rho$, since it uses a larger step size and converges faster. In practice, computing $\rho$ might be time consuming. However, a useful by-product of Lemma 1 is a refined measurement of sparsity $\bar{\kappa}$, which is as easy to compute as $\kappa$ and can be obtained at the same time. We include this improved variant as Option 3 of Algorithm 1. Note that for any fixed $j$, $X_{1j}^2, \ldots, X_{nj}^2$ form a distribution, and thus $\bar{\kappa}$ should be interpreted as the maximum weighted average sparsity of the data.

Generality of Feature Normalizing. One might doubt that our results hold merely because of the normalization assumption on the data.
Indeed, the authors of [11] emphasize that one of the advantages of BOOM is that it utilizes the elliptical geometry of the feature space, which no longer exists if each feature is normalized. However, one can easily verify that the outputs of the following two methods are completely identical: 1) apply BOOM directly to the original data; 2) scale each feature first so that the data is normalized, apply BOOM, and at the end scale back each coordinate of the output accordingly. Therefore, feature normalization does not affect the behavior of BOOM at all. Put differently, feature normalization is an equivalent way to utilize the elliptical geometry of the data. Note that the same argument does not hold for FISTA. Indeed, while our results show that FISTA is provably faster than BOOM, experiments in [11] show the opposite on unnormalized data. This suggests that we should always normalize the data before applying FISTA.

3 Coordinate Descent: Accelerated Shotgun

Updating all coordinates in parallel is not realistic, either because of the limited number of cores in multiprocessors or because of communication bottlenecks in clusters. Therefore in this section, we shift gears and consider algorithms that only update a subset of the coordinates at each iteration. To simplify presentation, we assume there is no regularization (i.e. $\lambda = 0$) and again that the data is normalized.

We propose a generalized and accelerated version of the Shotgun algorithm¹ [3] (see Algorithm 2). What Shotgun does is perform coordinate descent on several randomly selected coordinates (in parallel), with the same step size as in the usual (single) coordinate descent. If the number of updates per iteration $P$ is well tuned, Shotgun converges as fast as full gradient descent, even though it updates far fewer coordinates. The main difference between Shotgun and our accelerated version is that, as in usual accelerated methods, our algorithm maintains an auxiliary sequence $u_t$ at which we compute gradients.
However, $u_{t+1}$ is not just $(1 - \gamma_t) w_{t+1} + \gamma_t w_t$ as in Algorithm 1. Instead, a small step back has to be taken, which is reflected in the extra term $c_t(u_t - w_{t+1})$, where the constant $c_t$ will be specified later in Theorem 2. Intuitively, this step back reduces the momentum to account for the fact that we are not updating all coordinates.

¹ Note that Shotgun has already been successfully accelerated with the Coordinate Descent Newton method [3], but no analysis of the convergence rate is provided for this heuristic method.

Another important generalization of Shotgun is that we
introduce an extra step size coefficient $\eta$, which allows us to unify several algorithms (see further discussion below). Note that $\eta$ is fixed to 1 in the original Shotgun.

Algorithm 2: Accelerated Shotgun
  Input: number of parallel updates $P$, step size coefficient $\eta$, smoothness parameter $\beta$.
  Initialize $w_1 = u_1$.
  for $t = 1, 2, \ldots$ do
    pick a random subset $S_t$ of $\{1, \ldots, d\}$ such that $|S_t| = P$.
    for $j = 1$ to $d$ do
      $w_{t+1,j} = u_{t,j} - \frac{\eta}{\beta} \nabla f(u_t)_j$ if $j \in S_t$; $w_{t+1,j} = u_{t,j}$ otherwise.
    $u_{t+1} = (1 - \gamma_t) w_{t+1} + \gamma_t w_t + c_t(u_t - w_{t+1})$.

We now state the convergence rate of Algorithm 2 in the following theorem.

Theorem 2. Suppose $F(w)$ admits a minimizer $w^*$. If $P$ and $\eta > 0$ are such that $\frac{\eta}{2}(1 + \sigma) < 1$, where $\sigma = \frac{(P-1)(\rho-1)}{d-1}$ (recall $\rho$ is the spectral radius of $X^T X$), the constant $\gamma_t$ is defined as in Section 2, and

$$c_t = \frac{\theta_t}{\theta_{t+1}} \Big(1 - \frac{2P}{d}\Big(1 - \frac{\eta}{2}(1 + \sigma)\Big)\Big),$$

then the following holds for any $t > 1$:

$$\mathbb{E}_{S_{1:t-1}}[F(w_t)] - F(w^*) \le \frac{\beta d^2 \|w_1 - w^*\|^2}{(t-1)^2 P^2 \eta \big(1 - \frac{\eta}{2}(1 + \sigma)\big)},$$

where the expectation is with respect to the random choices of subsets $S_1, \ldots, S_{t-1}$.

The complete proof of Theorem 2 can be found in the long version of this work [9]. Here we focus on explaining how to interpret this result and how to choose the parameters $P$ and $\eta$. First, note that to reach $\epsilon$ accuracy (i.e. $\mathbb{E}[F(w_t)] - F(w^*) \le \epsilon$), Algorithm 2 requires

$$T_\epsilon = O\Big(\frac{d}{P}\sqrt{\frac{\beta}{\epsilon \eta \big(1 - \frac{\eta}{2}(1 + \sigma)\big)}}\Big)$$

iterations. For any fixed $P$, one can verify that $\eta = 1/(1 + \sigma)$ is the optimal choice of $\eta$ to minimize $T_\epsilon$. Plugging this $\eta$ back into $T_\epsilon$ and minimizing over $P$, one can check that $P = d$ is the best choice. In this case, since $\sigma = \rho - 1$, $\eta = 1/\rho$ and $c_t = 0$, Algorithm 2 actually degenerates to FISTA and $T_\epsilon = O(\sqrt{\rho\beta/\epsilon})$, recovering the results in Theorem 1 exactly.

Of course, in the end it is not the number of iterations but the total computational complexity that we care about most. Suppose we implement the algorithm without using any parallel computation.
Then, as we will discuss later, the time complexity of each iteration would be $O(nP)$ (recall $n$ is the number of examples), and the total complexity $O(T_\epsilon nP)$ is minimized when $\eta = 1$ and $P = 1$, leading to $O(nd\sqrt{\beta/\epsilon})$. In this case, our algorithm degenerates to an accelerated version of randomized (single) coordinate descent. Note that this essentially recovers the algorithms and results in [12, 8], but in a much simpler form and with a much simpler analysis.

Finally, we consider implementing the algorithm using parallel computation. Note that $\eta = 1/(1 + \sigma)$ is always at most 1. So we fix $\eta$ to be 1 (as in the original Shotgun algorithm), leading to the largest possible step size $1/\beta$ and a potentially small $P$. We assume in this case that updating $P$ coordinates in parallel costs approximately the same time no matter what $P$ is. In other words, we are again only interested in minimizing $T_\epsilon$. One can verify that the optimal choice of $P$ here is $\frac{2}{3}\big(\frac{d-1}{\rho-1} + 1\big)$, giving $T_\epsilon = O(\rho\sqrt{\beta/\epsilon})$. This improves the dependence on the accuracy from $O(1/\epsilon)$ to $O(\sqrt{1/\epsilon})$ compared to Shotgun, and is almost as good as FISTA (with much less computation per iteration).

Efficient Implementation. We discuss how to efficiently implement Algorithm 2. (1) At first glance it seems that computing the vector $u$ requires going over all $d$ coordinates. However, one can easily generalize the trick introduced in [8] to update only $P$ coordinates and compute $w$ and $u$ implicitly. (2) Another widely used trick is to maintain the inner products between examples and weight vectors (i.e. $Xw$ and $Xu$), so that computing a single coordinate of the gradient can be done in $O(n)$, or even faster in the case of sparse data. (3) Instead of choosing $S_t$ uniformly at random, one can also do the following: arbitrarily divide the set $\{1, \ldots, d\}$ into $P$ disjoint subsets of equal size in advance (assuming $d$ is a multiple of $P$ for simplicity), then at each iteration, select exactly one element from each of the $P$ subsets uniformly at random to form $S_t$.
This would only lead to a minor change in the convergence results (indeed, one only needs to redefine $\sigma$ to be $(\rho - 1)P/d$). The advantage of this approach is that we can split the data matrix $X$ by columns and store these subsets on $P$ machines separately, which naturally allows parallel updates at each iteration.
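To make the update rule of Algorithm 2 concrete, here is a minimal sequential sketch. It is not the authors' implementation: it assumes squared loss ($\beta = 1$), $\lambda = 0$ and unit-norm columns, simulates the parallel coordinate updates in a loop, recomputes residuals naively instead of using the implicit-update tricks above, and takes $\rho$ as an input. The function name and toy data are hypothetical, and $c_t$ is computed from the expression in Theorem 2.

```python
import math
import random

def accelerated_shotgun(X, y, P, rho, eta=1.0, beta=1.0, iters=2000, seed=0):
    """Sketch of Algorithm 2 for squared loss with lambda = 0."""
    rng = random.Random(seed)
    d = len(X[0])
    sigma = (P - 1) * (rho - 1) / (d - 1) if d > 1 else 0.0
    r = 1.0 - 0.5 * eta * (1.0 + sigma)
    assert r > 0, "Theorem 2 requires (eta / 2) * (1 + sigma) < 1"
    w = [0.0] * d
    u = list(w)
    theta = 1.0  # theta_1, which follows from theta_0 = 0
    for _ in range(iters):
        S = rng.sample(range(d), P)  # uniformly random subset of size P
        # Residuals X u - y, shared by every coordinate updated this round
        resid = [sum(a * b for a, b in zip(row, u)) - yi
                 for row, yi in zip(X, y)]
        w_next = list(u)
        for j in S:
            grad_j = sum(res * row[j] for res, row in zip(resid, X))
            w_next[j] = u[j] - (eta / beta) * grad_j
        theta_next = (1.0 + math.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        gamma = (1.0 - theta) / theta_next
        # Step-back coefficient c_t; it vanishes when P = d and eta = 1 / rho
        c = (theta / theta_next) * (1.0 - (2.0 * P / d) * r)
        u = [(1.0 - gamma) * wn + gamma * wo + c * (uo - wn)
             for wn, wo, uo in zip(w_next, w, u)]
        w, theta = w_next, theta_next
    return w

# Toy matrix (hypothetical) with unit-norm columns; X^T X has spectral radius 1.48
X = [[0.6, 0.8], [0.8, 0.0], [0.0, 0.6]]
w_true = [1.0, -2.0]
y = [sum(a * b for a, b in zip(row, w_true)) for row in X]

w_single = accelerated_shotgun(X, y, P=1, rho=1.48, iters=5000)
w_full = accelerated_shotgun(X, y, P=2, rho=1.48, eta=1.0 / 1.48, iters=500)
print(w_single, w_full)
```

With `P=1` and `eta=1` the sketch behaves as accelerated single-coordinate descent, and with `P=d` and `eta=1/rho` the step-back term disappears and the iteration coincides with the FISTA-style update of Algorithm 1, matching the two degenerate cases discussed in the text.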
References

[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[2] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, 2008.
[3] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[4] Olivier Fercoq and Peter Richtárik. Accelerated, parallel and proximal coordinate descent. arXiv preprint, 2014.
[5] Wenjiang J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3).
[6] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky. An interior-point method for large-scale L1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4), 2007.
[7] John Langford, Alexander Smola, and Martin Zinkevich. Slow learners are fast. In Advances in Neural Information Processing Systems, 2009.
[8] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In 54th Annual Symposium on Foundations of Computer Science, 2013.
[9] Haipeng Luo, Patrick Haffner, and Jean-François Paiement. Accelerated parallel optimization methods for large scale machine learning. arXiv preprint, 2014.
[10] Gideon Mann, Ryan McDonald, Mehryar Mohri, Nathan Silberman, and Daniel D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, 2009.
[11] Indraneel Mukherjee, Kevin Canini, Rafael Frongillo, and Yoram Singer. Parallel boosting with momentum. In ECML PKDD 2013, Part III, LNAI 8190, pages 17-32, 2013.
[12] Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.
SIAM Journal on Optimization, 22(2):341-362, 2012.
[13] Yu. E. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27.
[14] Andrew Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
[15] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edition, 2006.
[16] Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv preprint, 2012.
[17] Shai Shalev-Shwartz and Nathan Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[18] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for L1-regularized loss minimization. The Journal of Machine Learning Research, 12, 2011.
[19] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[20] Krysta M. Svore and C. J. Burges. Large-scale learning to rank using boosted decision trees. Scaling Up Machine Learning: Parallel and Distributed Approaches, 2011.
[21] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288.
[22] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23.
Adaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions
Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richtárik School of Mathematics The University of Edinburgh Joint work with Martin Takáč (Edinburgh)
MapReduce/Bigtable for Distributed Optimization
MapReduce/Bigtable for Distributed Optimization Keith B. Hall Google Inc. [email protected] Scott Gilpin Google Inc. [email protected] Gideon Mann Google Inc. [email protected] Abstract With large data
Federated Optimization: Distributed Optimization Beyond the Datacenter
Federated Optimization: Distributed Optimization Beyond the Datacenter Jakub Konečný School of Mathematics University of Edinburgh [email protected] H. Brendan McMahan Google, Inc. Seattle, WA 98103
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
Big Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang, Qihang Lin, Rong Jin Tutorial@SIGKDD 2015 Sydney, Australia Department of Computer Science, The University of Iowa, IA, USA Department of
GI01/M055 Supervised Learning Proximal Methods
GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators
Online Convex Optimization
E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting
Large-Scale Machine Learning with Stochastic Gradient Descent
Large-Scale Machine Learning with Stochastic Gradient Descent Léon Bottou NEC Labs America, Princeton NJ 08542, USA [email protected] Abstract. During the last decade, the data sizes have grown faster than
Efficient Projections onto the l 1 -Ball for Learning in High Dimensions
Efficient Projections onto the l 1 -Ball for Learning in High Dimensions John Duchi Google, Mountain View, CA 94043 Shai Shalev-Shwartz Toyota Technological Institute, Chicago, IL, 60637 Yoram Singer Tushar
Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research [email protected]
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research [email protected] Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
Beyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - ECML-PKDD,
Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014
Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview
Stochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - April 2013 Context Machine
Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
Big Data - Lecture 1 Optimization reminders
Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics
Randomized Robust Linear Regression for big data applications
Randomized Robust Linear Regression for big data applications Yannis Kopsinis 1 Dept. of Informatics & Telecommunications, UoA Thursday, Apr 16, 2015 In collaboration with S. Chouvardas, Harris Georgiou,
Big learning: challenges and opportunities
Big learning: challenges and opportunities Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure December 2013 Omnipresent digital media Scientific context Big data Multimedia, sensors, indicators,
(Quasi-)Newton methods
(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting
1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09
1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
Stochastic Optimization for Big Data Analytics: Algorithms and Libraries
Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yang SDM 2014, Philadelphia, Pennsylvania collaborators: Rong Jin, Shenghuo Zhu NEC Laboratories America, Michigan State
Machine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
Large Scale Learning to Rank
Large Scale Learning to Rank D. Sculley Google, Inc. [email protected] Abstract Pairwise learning to rank methods such as RankSVM give good performance, but suffer from the computational burden of optimizing
Sketch As a Tool for Numerical Linear Algebra
Sketching as a Tool for Numerical Linear Algebra (Graph Sparsification) David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania April, 2015 Sepehr Assadi (Penn)
An Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia [email protected] Tata Institute, Pune,
Introduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
Large-Scale Similarity and Distance Metric Learning
Large-Scale Similarity and Distance Metric Learning Aurélien Bellet Télécom ParisTech Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech) Séminaire Criteo March
Statistical machine learning, high dimension and big data
Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,
Linear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
Bilinear Prediction Using Low-Rank Models
Bilinear Prediction Using Low-Rank Models Inderjit S. Dhillon Dept of Computer Science UT Austin 26th International Conference on Algorithmic Learning Theory Banff, Canada Oct 6, 2015 Joint work with C-J.
Stochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - April 2013 Context Machine
Simple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
Karthik Sridharan. 424 Gates Hall Ithaca, E-mail: [email protected] http://www.cs.cornell.edu/ sridharan/ Contact Information
Karthik Sridharan Contact Information 424 Gates Hall Ithaca, NY 14853-7501 USA E-mail: [email protected] http://www.cs.cornell.edu/ sridharan/ Research Interests Machine Learning, Statistical Learning
Online Lazy Updates for Portfolio Selection with Transaction Costs
Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence Online Lazy Updates for Portfolio Selection with Transaction Costs Puja Das, Nicholas Johnson, and Arindam Banerjee Department
LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014
LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
Scalable Machine Learning - or what to do with all that Big Data infrastructure
- or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection
Integer Factorization using the Quadratic Sieve
Integer Factorization using the Quadratic Sieve Chad Seibert* Division of Science and Mathematics University of Minnesota, Morris Morris, MN 56567 [email protected] March 16, 2011 Abstract We give
Parallel Selective Algorithms for Nonconvex Big Data Optimization
1874 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 63, NO. 7, APRIL 1, 2015 Parallel Selective Algorithms for Nonconvex Big Data Optimization Francisco Facchinei, Gesualdo Scutari, Senior Member, IEEE,
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
An Overview Of Software For Convex Optimization. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.
An Overview Of Software For Convex Optimization Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 [email protected] In fact, the great watershed in optimization isn t between linearity
Machine learning and optimization for massive data
Machine learning and optimization for massive data Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Eric Moulines - IHES, May 2015 Big data revolution?
Jubatus: An Open Source Platform for Distributed Online Machine Learning
Jubatus: An Open Source Platform for Distributed Online Machine Learning Shohei Hido Seiya Tokui Preferred Infrastructure Inc. Tokyo, Japan {hido, tokui}@preferred.jp Satoshi Oda NTT Software Innovation
Statistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in $\mathbb{R}^d$: simple idea: separate the classes
Sparse Prediction with the k-support Norm
Sparse Prediction with the k-Support Norm Andreas Argyriou École Centrale Paris [email protected] Rina Foygel Department of Statistics, Stanford University [email protected] Nathan Srebro Toyota Technological
On Adaboost and Optimal Betting Strategies
On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: [email protected] Fabrizio Smeraldi School of
A Simple Introduction to Support Vector Machines
A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear
The p-norm generalization of the LMS algorithm for adaptive filtering
The p-norm generalization of the LMS algorithm for adaptive filtering Jyrki Kivinen University of Helsinki Manfred Warmuth University of California, Santa Cruz Babak Hassibi California Institute of Technology
Machine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
Making Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research
4.1 Learning algorithms for neural networks
4 Perceptron Learning 4.1 Learning algorithms for neural networks In the two preceding chapters we discussed two closely related models, McCulloch-Pitts units and perceptrons, but the question of how to
Nonlinear Optimization: Algorithms 3: Interior-point methods
Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris [email protected] Nonlinear optimization © 2006 Jean-Philippe Vert,
Lecture 4 Online and streaming algorithms for clustering
CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line
$t := \max \gamma^{\nu}$ subject to $\nu \in \{0, 1, 2, \dots\}$ and $f(x_c + \gamma^{\nu} d) \le f(x_c) + c\,\gamma^{\nu} f'(x_c; d)$.
1. Line Search Methods Let $f : \mathbb{R}^n \to \mathbb{R}$ be given and suppose that $x_c$ is our current best estimate of a solution to $\mathcal{P}: \min_{x \in \mathbb{R}^n} f(x)$. A standard method for improving the estimate $x_c$ is to choose a direction
Large Linear Classification When Data Cannot Fit In Memory
Large Linear Classification When Data Cannot Fit In Memory ABSTRACT Hsiang-Fu Yu Dept. of Computer Science National Taiwan University Taipei 106, Taiwan [email protected] Kai-Wei Chang Dept. of Computer
2.1 Complexity Classes
15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look
Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014
Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don't have to worry about
15.062 Data Mining: Algorithms and Applications Matrix Math Review
15.062 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
The Union-Find Problem Kruskal's algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,
The Union-Find Problem Kruskal's algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Going For Large Scale 1
10. Proximal point method
L. Vandenberghe EE236C (Spring 2013-14) 10. Proximal point method proximal point method augmented Lagrangian method Moreau-Yosida smoothing 10-1 Proximal point method a conceptual algorithm for minimizing
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics
Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University
Applied Algorithm Design Lecture 5
Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design
Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery
Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery Yudong Chen Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 7872 Constantine Caramanis Department of Electrical
SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI. (B.Sc.(Hons.), BUAA)
SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI (B.Sc.(Hons.), BUAA) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT OF MATHEMATICS NATIONAL
CSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
Lecture 3: Finding integer solutions to systems of linear equations
Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture
Approximation Algorithms
Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms
Solving polynomial least squares problems via semidefinite programming relaxations
Solving polynomial least squares problems via semidefinite programming relaxations Sunyoung Kim and Masakazu Kojima August 2007, revised in November, 2007 Abstract. A polynomial optimization problem whose
Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.
Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski. Fabienne Comte, Celine Duval, Valentine Genon-Catalot To cite this version: Fabienne
The LCA Problem Revisited
The LCA Problem Revisited Michael A. Bender Martín Farach-Colton SUNY Stony Brook Rutgers University May 16, 2000 Abstract We present a very simple algorithm for the Least Common Ancestor problem. We thus
The Advantages and Disadvantages of Online Linear Optimization
LINEAR PROGRAMMING WITH ONLINE LEARNING TATSIANA LEVINA, YURI LEVIN, JEFF MCGILL, AND MIKHAIL NEDIAK SCHOOL OF BUSINESS, QUEEN'S UNIVERSITY, 143 UNION ST., KINGSTON, ON, K7L 3N6, CANADA E-MAIL:{TLEVIN,YLEVIN,JMCGILL,MNEDIAK}@BUSINESS.QUEENSU.CA
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE-FREE NETWORKS AND SMALL-WORLD NETWORKS
USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE-FREE NETWORKS AND SMALL-WORLD NETWORKS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA [email protected]
Distributed Machine Learning and Big Data
Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya
Influences in low-degree polynomials
Influences in low-degree polynomials Artūrs Bačkurs December 12, 2012 1 Introduction In [3] it is conjectured that every bounded real polynomial has a highly influential variable. The conjecture is known
DUOL: A Double Updating Approach for Online Learning
DUOL: A Double Updating Approach for Online Learning Peilin Zhao School of Comp. Eng. Nanyang Tech. University Singapore 69798 [email protected] Steven C.H. Hoi School of Comp. Eng. Nanyang Tech. University
The Multiplicative Weights Update method
Chapter 2 The Multiplicative Weights Update method The Multiplicative Weights method is a simple idea which has been repeatedly discovered in fields as diverse as Machine Learning, Optimization, and Game
A semi-supervised Spam mail detector
A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach
A Negative Result Concerning Explicit Matrices With The Restricted Isometry Property
A Negative Result Concerning Explicit Matrices With The Restricted Isometry Property Venkat Chandar March 1, 2008 Abstract In this note, we prove that matrices whose entries are all 0 or 1 cannot achieve
The Open University's repository of research publications and other research outputs
Open Research Online The Open University's repository of research publications and other research outputs The degree-diameter problem for circulant graphs of degree 8 and 9 Journal Article How to cite:
TD(0) Leads to Better Policies than Approximate Value Iteration
TD(0) Leads to Better Policies than Approximate Value Iteration Benjamin Van Roy Management Science and Engineering and Electrical Engineering Stanford University Stanford, CA 94305 [email protected] Abstract
7 Gaussian Elimination and LU Factorization
7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method
17.3.1 Follow the Perturbed Leader
CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.
Support Vector Machine. Tutorial. (and Statistical Learning Theory)
Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. [email protected] 1 Support Vector Machines: history SVMs introduced
Equilibrium computation: Part 1
Equilibrium computation: Part 1 Nicola Gatti 1 Troels Bjerre Sorensen 2 1 Politecnico di Milano, Italy 2 Duke University, USA Nicola Gatti and Troels Bjerre Sørensen ( Politecnico di Milano, Italy, Equilibrium
