Big Data Analytics: Optimization and Randomization




Big Data Analytics: Optimization and Randomization
Tianbao Yang, Qihang Lin, Rong Jin
Tutorial @ SIGKDD 2015, Sydney, Australia
Department of Computer Science, The University of Iowa, IA, USA
Department of Management Sciences, The University of Iowa, IA, USA
Department of Computer Science and Engineering, Michigan State University, MI, USA
Institute of Data Science and Technologies at Alibaba Group, Seattle, USA
August 10, 2015
Yang, Lin, Jin Tutorial for KDD 15 August 10, 2015 1 / 234

URL: http://www.cs.uiowa.edu/~tyng/kdd15-tutorial.pdf

Some Claims:
This tutorial is not an exhaustive literature survey.
It is not a survey of different machine learning/data mining algorithms.
It is about how to efficiently solve machine learning/data mining problems (formulated as optimization problems) for big data.

Outline: Part I: Basics; Part II: Optimization; Part III: Randomization

Big Data Analytics: Optimization and Randomization — Part I: Basics

Outline: 1. Basics — Introduction; Notations and Definitions

Three Steps for Machine Learning: Data → Model → Optimization
[Plot: distance to optimal objective vs. iterations, comparing convergence rates 0.5^T, 1/T^2, and 1/T.]

Big Data Challenge: Big Data

Big Data Challenge: Big Model (e.g., 60 million parameters)

Learning as Optimization — Ridge Regression Problem:

    min_{w∈ℝ^d} (1/n) Σ_{i=1}^n (y_i − wᵀx_i)^2 + (λ/2)‖w‖_2^2

The first term is the empirical loss; the second is the regularization.
x_i ∈ ℝ^d: d-dimensional feature vector
y_i ∈ ℝ: target variable
w ∈ ℝ^d: model parameters
n: number of data points
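Since the ridge objective above is a smooth quadratic, its minimizer can be computed in closed form by setting the gradient to zero. A minimal numpy sketch (variable names and data are illustrative, not from the tutorial); it is handy later as a ground truth when checking iterative solvers:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/n)*sum_i (y_i - w^T x_i)^2 + (lam/2)*||w||_2^2.

    Setting the gradient (2/n) X^T (X w - y) + lam * w to zero gives
    w = ((2/n) X^T X + lam I)^{-1} (2/n) X^T y.
    """
    n, d = X.shape
    A = (2.0 / n) * X.T @ X + lam * np.eye(d)
    b = (2.0 / n) * X.T @ y
    return np.linalg.solve(A, b)

# Sanity check: with a tiny lam this recovers the noiseless generating model.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_hat = ridge_closed_form(X, y, lam=1e-8)
```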

Learning as Optimization — Classification Problems:

    min_{w∈ℝ^d} (1/n) Σ_{i=1}^n ℓ(y_i wᵀx_i) + (λ/2)‖w‖_2^2

y_i ∈ {+1, −1}: label. Loss function ℓ(z), with z = y wᵀx:
1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1 − z)^p, where p = 1, 2
2. Logistic regression: ℓ(z) = log(1 + exp(−z))

Learning as Optimization — Feature Selection:

    min_{w∈ℝ^d} (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + λ‖w‖_1

ℓ_1 regularization: ‖w‖_1 = Σ_{i=1}^d |w_i|
λ controls the sparsity level

Learning as Optimization — Feature Selection using the Elastic Net:

    min_{w∈ℝ^d} (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + λ(‖w‖_1 + γ‖w‖_2^2)

The elastic net regularizer is more robust than the ℓ_1 regularizer.

Learning as Optimization — Multi-class/Multi-task Learning:

    min_{W∈ℝ^{K×d}} (1/n) Σ_{i=1}^n ℓ(W x_i, y_i) + λ r(W)

r(W) = ‖W‖_F^2 = Σ_{k=1}^K Σ_{j=1}^d W_{kj}^2: squared Frobenius norm
r(W) = ‖W‖_* = Σ_i σ_i: nuclear norm (sum of singular values)
r(W) = ‖W‖_{1,∞} = Σ_{j=1}^d ‖W_{:j}‖_∞: ℓ_{1,∞} mixed norm

Learning as Optimization — Regularized Empirical Loss Minimization:

    min_{w∈ℝ^d} (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + R(w)

Both ℓ and R are convex functions.
Extensions to matrix cases are possible (sometimes straightforward).
Extensions to kernel methods can be combined with randomized approaches.
Extensions to non-convex problems (e.g., deep learning) are in progress.

Data Matrices and Machine Learning — The Instance-Feature Matrix: X ∈ ℝ^{n×d}, with rows x_1ᵀ, x_2ᵀ, …, x_nᵀ.

Data Matrices and Machine Learning — The output vector: y = (y_1, y_2, …, y_n)ᵀ ∈ ℝ^n
Continuous y_i ∈ ℝ: regression (e.g., house price)
Discrete, e.g., y_i ∈ {1, 2, 3}: classification (e.g., species of iris)

Data Matrices and Machine Learning — The Instance-Instance Matrix: K ∈ ℝ^{n×n} (similarity matrix, kernel matrix)

Some machine learning tasks are formulated on the kernel matrix: clustering, kernel methods.

Data Matrices and Machine Learning — The Feature-Feature Matrix: C ∈ ℝ^{d×d} (covariance matrix, distance metric matrix)

Some machine learning tasks require the covariance matrix: principal component analysis, i.e., the top-k singular value (eigenvalue) decomposition of the covariance matrix.

Why is Learning from Big Data Challenging?
High per-iteration cost; high memory cost; high communication cost; large iteration complexity.

Outline: 1. Basics — Introduction; Notations and Definitions

Norms — Vector x ∈ ℝ^d:
Euclidean vector norm: ‖x‖_2 = √(xᵀx) = √(Σ_{i=1}^d x_i^2)
ℓ_p-norm: ‖x‖_p = (Σ_{i=1}^d |x_i|^p)^{1/p}, where p ≥ 1
1. ℓ_2 norm: ‖x‖_2 = √(Σ_{i=1}^d x_i^2)
2. ℓ_1 norm: ‖x‖_1 = Σ_{i=1}^d |x_i|
3. ℓ_∞ norm: ‖x‖_∞ = max_i |x_i|
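The vector norms above map directly onto numpy operations; a quick sketch (the example vector is illustrative):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l2 = np.sqrt(np.sum(x ** 2))   # Euclidean norm: sqrt(sum_i x_i^2)
l1 = np.sum(np.abs(x))         # l_1 norm: sum_i |x_i|
linf = np.max(np.abs(x))       # l_inf norm: max_i |x_i|

# Cross-check against numpy's built-in norm with p = 2, 1, inf.
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
```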

Matrix Factorization — Matrix X ∈ ℝ^{n×d}:
Singular value decomposition: X = UΣVᵀ
1. U ∈ ℝ^{n×r}: orthonormal columns (UᵀU = I), spanning the column space
2. Σ ∈ ℝ^{r×r}: diagonal matrix, Σ_{ii} = σ_i > 0, σ_1 ≥ σ_2 ≥ … ≥ σ_r
3. V ∈ ℝ^{d×r}: orthonormal columns (VᵀV = I), spanning the row space
4. r ≤ min(n, d): the largest value such that σ_r > 0, i.e., the rank of X
5. U_k Σ_k V_kᵀ: top-k approximation
Pseudo-inverse: X^† = V Σ^{-1} Uᵀ
QR factorization: X = QR (n ≥ d), with Q ∈ ℝ^{n×d} (orthonormal columns) and R ∈ ℝ^{d×d} (upper triangular)
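These factorizations are available directly in numpy; a small sketch of the thin SVD, the top-k approximation, and the pseudo-inverse (random data, assuming full column rank):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))   # n = 8, d = 5, rank r = 5 almost surely

# Thin SVD: X = U diag(s) V^T, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Top-k approximation U_k Sigma_k V_k^T (a rank-k matrix).
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Pseudo-inverse X^+ = V Sigma^{-1} U^T (valid here since all sigma_i > 0).
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
```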

Norms — Matrix X ∈ ℝ^{n×d}:
Frobenius norm: ‖X‖_F = √(tr(XᵀX)) = √(Σ_{i=1}^n Σ_{j=1}^d X_{ij}^2)
Spectral (induced) norm: ‖X‖_2 = max_{‖u‖_2=1} ‖Xu‖_2 = σ_1 (maximum singular value)

Convex Optimization: min_{x∈X} f(x)
X is a convex domain: for any x, y ∈ X, every convex combination αx + (1−α)y ∈ X (α ∈ [0, 1])
f(x) is a convex function

Characterization of a Convex Function:
f(αx + (1−α)y) ≤ αf(x) + (1−α)f(y), ∀ x, y ∈ X, α ∈ [0, 1]
f(x) ≥ f(y) + ∇f(y)ᵀ(x − y), ∀ x, y ∈ X
A local optimum is a global optimum.

Convex vs. Strongly Convex:
Convex function: f(x) ≥ f(y) + ∇f(y)ᵀ(x − y), ∀ x, y ∈ X
Strongly convex function: f(x) ≥ f(y) + ∇f(y)ᵀ(x − y) + (λ/2)‖x − y‖_2^2, ∀ x, y ∈ X, where λ > 0 is the strong convexity constant
The global optimum is unique.

Non-smooth Function vs. Smooth Function:
Non-smooth: Lipschitz continuous with constant G, e.g., the absolute loss f(x) = |x|: |f(x) − f(y)| ≤ G‖x − y‖_2. Subgradient ∂f(y): f(x) ≥ f(y) + ∂f(y)ᵀ(x − y). [Plot: f(x) = |x| with a sub-gradient line at the kink.]
Smooth: gradient Lipschitz with smoothness constant L, e.g., the logistic loss f(x) = log(1 + exp(−x)): ‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2. [Plot: logistic loss bounded above by a quadratic function and below by the tangent f(y) + f′(y)(x − y).]
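As a small numerical illustration (not from the slides): the logistic loss has a Lipschitz gradient with constant L = 1/4, while for |x| every g ∈ [−1, 1] is a valid subgradient at the kink x = 0.

```python
import numpy as np

def logistic_grad(x):
    # d/dx log(1 + exp(-x)) = -1 / (1 + exp(x))
    return -1.0 / (1.0 + np.exp(x))

# Smoothness: |f'(x) - f'(y)| <= L |x - y| with L = 1/4, since |f''| <= 1/4.
xs = np.linspace(-5.0, 5.0, 200)
ys = xs + 0.1
ratios = np.abs(logistic_grad(xs) - logistic_grad(ys)) / 0.1
assert ratios.max() <= 0.25 + 1e-8

# Non-smoothness: f(x) = |x| satisfies the first-order lower bound
# |x| >= |0| + g * (x - 0) for any subgradient g in [-1, 1] at 0.
for g in (-1.0, -0.3, 0.0, 0.7, 1.0):
    assert all(abs(x) >= g * x for x in (-2.0, -0.5, 0.3, 4.0))
```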

Next… Part II: Optimization

    min_{w∈ℝ^d} (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + R(w)

Stochastic optimization; distributed optimization.
Reduce iteration complexity by utilizing properties of the functions.

Next… Part III: Randomization
Classification, regression; SVD, k-means, kernel methods.
Reduce data size by utilizing properties of the data. Please stay tuned!

Big Data Analytics: Optimization and Randomization — Part II: Optimization

Outline: 2. Optimization — (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data: Stochastic Optimization, Distributed Optimization

Learning as Optimization — Regularized Empirical Loss Minimization:

    min_{w∈ℝ^d} F(w), where F(w) = (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + R(w)

Convergence Measure:
Most optimization algorithms are iterative: w_{t+1} = w_t + Δw_t
Iteration complexity: the number of iterations T(ε) needed to have F(ŵ_T) − min_w F(w) ≤ ε (ε ≪ 1)
Convergence rate: after T iterations, how good is the solution: F(ŵ_T) − min_w F(w) ≤ ε(T)
Total runtime = per-iteration cost × iteration complexity
[Plot: objective vs. iterations, reaching the target accuracy ε at iteration T.]

More on Convergence Measure:
Big-O notation: explicit dependence on T or ε.
Convergence rate ↔ iteration complexity:
linear: O(μ^T) (μ < 1) ↔ O(log(1/ε))
sub-linear: O(1/T^α) (α > 0) ↔ O(1/ε^{1/α})
Why are we interested in bounds?

More on Convergence Measure:
linear: O(μ^T) (μ < 1) ↔ O(log(1/ε)); sub-linear: O(1/T^α) (α > 0) ↔ O(1/ε^{1/α})
[Plot: distance to optimum vs. iterations for 0.5^T (seconds), 1/T (minutes), and 1/T^{0.5} (hours).]
Theoretically, we consider O(μ^T) faster than O(1/T^2) faster than O(1/T) faster than O(1/√T), i.e., iteration complexities log(1/ε) ≪ 1/√ε ≪ 1/ε ≪ 1/ε^2.

Non-smooth vs. Smooth Losses:
Non-smooth ℓ(z):
  hinge loss: ℓ(wᵀx, y) = max(0, 1 − y wᵀx)
  absolute loss: ℓ(wᵀx, y) = |wᵀx − y|
Smooth ℓ(z):
  squared hinge loss: ℓ(wᵀx, y) = max(0, 1 − y wᵀx)^2
  logistic loss: ℓ(wᵀx, y) = log(1 + exp(−y wᵀx))
  square loss: ℓ(wᵀx, y) = (wᵀx − y)^2

Strongly Convex vs. Non-strongly Convex:
λ-strongly convex R(w):
  ℓ_2 regularizer: (λ/2)‖w‖_2^2
  elastic net regularizer: τ‖w‖_1 + (λ/2)‖w‖_2^2
Non-strongly convex R(w):
  unregularized problem: R(w) ≡ 0
  ℓ_1 regularizer: τ‖w‖_1

Gradient Method in Machine Learning:

    F(w) = (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + (λ/2)‖w‖_2^2

Suppose ℓ(z) is smooth.
Full gradient: ∇F(w) = (1/n) Σ_{i=1}^n ∇ℓ(wᵀx_i, y_i) + λw. Per-iteration cost: O(nd).
Gradient descent: w_t = w_{t−1} − γ_t ∇F(w_{t−1}), where γ_t is the step size.

Gradient Method:
If λ = 0 (R(w) non-strongly convex): iteration complexity O(1/ε)
If λ > 0 (R(w) λ-strongly convex): iteration complexity O((1/λ) log(1/ε))
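A minimal gradient-descent sketch for the ridge objective above (numpy; step size and data are illustrative, and the exact minimizer from the normal equations serves as the reference):

```python
import numpy as np

def ridge_gd(X, y, lam, step, iters):
    """Gradient descent on F(w) = (1/n)||y - Xw||^2 + (lam/2)||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / n) * X.T @ (X @ w - y) + lam * w  # full gradient, O(nd)
        w = w - step * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0])
lam = 0.1
w_gd = ridge_gd(X, y, lam, step=0.1, iters=500)

# Exact minimizer for comparison: ((2/n) X^T X + lam I)^{-1} (2/n) X^T y.
w_star = np.linalg.solve((2.0 / 200) * X.T @ X + lam * np.eye(4),
                         (2.0 / 200) * X.T @ y)
```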

Accelerated Gradient Method:
w_t = v_{t−1} − γ_t ∇F(v_{t−1})
v_t = w_t + η_t (w_t − w_{t−1})   (momentum step)
w_t is the output and v_t is an auxiliary sequence.

Accelerated Gradient Method:
If λ = 0: iteration complexity O(1/√ε), better than O(1/ε)
If λ > 0: iteration complexity O((1/√λ) log(1/ε)), better than O((1/λ) log(1/ε)) for small λ
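The momentum step can be sketched by modifying the plain gradient loop; the schedule η_t = (t − 1)/(t + 2) below is a common choice for the non-strongly-convex analysis and is an assumption made here, not a constant taken from the slides.

```python
import numpy as np

def ridge_agd(X, y, lam, step, iters):
    """Accelerated gradient descent: gradient step at v, then momentum."""
    n, d = X.shape
    w_prev = np.zeros(d)
    w = np.zeros(d)
    v = np.zeros(d)
    for t in range(1, iters + 1):
        grad = (2.0 / n) * X.T @ (X @ v - y) + lam * v  # gradient at auxiliary v
        w_prev, w = w, v - step * grad                  # w_t = v_{t-1} - step*grad
        v = w + (t - 1.0) / (t + 2.0) * (w - w_prev)    # momentum step
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0])
lam = 0.1
w_agd = ridge_agd(X, y, lam, step=0.1, iters=500)

w_star = np.linalg.solve((2.0 / 200) * X.T @ X + lam * np.eye(4),
                         (2.0 / 200) * X.T @ y)

def objective(w):
    return np.mean((y - X @ w) ** 2) + 0.5 * lam * w @ w

gap = objective(w_agd) - objective(w_star)  # optimality gap, should be tiny
```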

Dealing with the ℓ_1 Regularizer — consider a more general case:

    min_{w∈ℝ^d} F(w) = (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + R̃(w) + τ‖w‖_1

where R(w) = R̃(w) + τ‖w‖_1, and R̃(w) is λ-strongly convex and smooth.

Accelerated Proximal Gradient Descent:
w_t = argmin_{w∈ℝ^d} ∇f(v_{t−1})ᵀw + (1/(2γ_t))‖w − v_{t−1}‖_2^2 + τ‖w‖_1   (proximal mapping, with f the smooth part of F)
v_t = w_t + η_t (w_t − w_{t−1})
The proximal mapping has a closed-form solution: soft-thresholding.
Iteration complexity and runtime remain unchanged.
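The soft-thresholding operator mentioned above has the well-known closed form sign(u) · max(|u| − c, 0), applied coordinate-wise; a sketch:

```python
import numpy as np

def soft_threshold(u, c):
    """Closed-form solution of min_w 0.5*||w - u||_2^2 + c*||w||_1 (c >= 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - c, 0.0)

u = np.array([3.0, -0.5, 0.2, -2.0])
w = soft_threshold(u, 1.0)
# Entries with |u_i| <= c are set exactly to zero (this is what produces
# sparsity under the l_1 regularizer); the rest shrink toward zero by c.
```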

Sub-Gradient Method in Machine Learning:

    F(w) = (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + (λ/2)‖w‖_2^2

Suppose ℓ(z) is non-smooth.
Full sub-gradient: ∂F(w) = (1/n) Σ_{i=1}^n ∂ℓ(wᵀx_i, y_i) + λw
Sub-gradient descent: w_t = w_{t−1} − γ_t ∂F(w_{t−1})

Sub-Gradient Method:
If λ = 0: iteration complexity O(1/ε^2)
If λ > 0: iteration complexity O(1/(λε))
No efficient acceleration scheme in general.

Problem Classes and Iteration Complexity for min_{w∈ℝ^d} (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + R(w):

                          Non-smooth ℓ     Smooth ℓ
  Non-strongly convex R   O(1/ε^2)         O(1/√ε)
  λ-strongly convex R     O(1/(λε))        O((1/√λ) log(1/ε))

Per-iteration cost: O(nd), too high if n or d is large.

Outline: 2. Optimization — (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data: Stochastic Optimization, Distributed Optimization

Stochastic First-Order Methods by Data Sampling:
Stochastic Gradient Descent (SGD)
Stochastic Variance Reduced Gradient (SVRG)
Stochastic Average Gradient Algorithm (SAGA)
Stochastic Dual Coordinate Ascent (SDCA)
Accelerated Proximal Coordinate Gradient (APCG)
Assumption: ‖x_i‖_2 ≤ 1 for all i

Basic SGD (Nemirovski & Yudin, 1978):

    F(w) = (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + (λ/2)‖w‖_2^2

Full sub-gradient: ∂F(w) = (1/n) Σ_{i=1}^n ∂ℓ(wᵀx_i, y_i) + λw
Randomly sample i ∈ {1, …, n}; stochastic sub-gradient: ∂ℓ(wᵀx_i, y_i) + λw
Unbiasedness: E_i[∂ℓ(wᵀx_i, y_i) + λw] = ∂F(w)

Basic SGD (Nemirovski & Yudin, 1978) — applicable in all settings!
sample: i_t ∈ {1, …, n}
update: w_t = w_{t−1} − γ_t (∂ℓ(w_{t−1}ᵀx_{i_t}, y_{i_t}) + λw_{t−1})
output: ŵ_T = (1/T) Σ_{t=1}^T w_t

Basic SGD:
If λ = 0: iteration complexity O(1/ε^2)
If λ > 0: iteration complexity O(1/(λε))
Exactly the same as sub-gradient descent!
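The SGD loop above can be sketched for the ridge objective. The shifted step size γ_t = 1/(λ(t + t0)) is an assumption made here to tame the first few iterates; the analysis on the slides corresponds to γ_t ∝ 1/(λt).

```python
import numpy as np

def ridge_sgd(X, y, lam, iters, t0=100, seed=0):
    """SGD on (1/n) sum_i (y_i - w^T x_i)^2 + (lam/2)||w||^2, O(d) per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, iters + 1):
        i = rng.integers(n)                           # sample i_t uniformly
        g = 2.0 * (X[i] @ w - y[i]) * X[i] + lam * w  # stochastic gradient
        w = w - g / (lam * (t + t0))                  # step size 1/(lam*(t+t0))
        w_sum += w
    return w_sum / iters                              # averaged output w_hat_T

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -1.0, 0.5])
lam = 0.5
w_sgd = ridge_sgd(X, y, lam, iters=20000)

w_star = np.linalg.solve((2.0 / 500) * X.T @ X + lam * np.eye(3),
                         (2.0 / 500) * X.T @ y)
```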

Total Runtime:
Per-iteration cost: O(d), much lower than the full gradient method.
e.g., hinge loss (SVM): the stochastic gradient is ∂ℓ(wᵀx_{i_t}, y_{i_t}) = −y_{i_t} x_{i_t} if 1 − y_{i_t} wᵀx_{i_t} > 0, and 0 otherwise.

Total Runtime — iteration complexity:

                          Non-smooth       Smooth
  Non-strongly convex     O(1/ε^2)         O(1/ε^2)
  λ-strongly convex       O(1/(λε))        O(1/(λε))

For SGD, only strong convexity helps; smoothness does not make any difference!

Full Gradient vs. Stochastic Gradient:
The full gradient method needs fewer iterations; the stochastic gradient method has lower cost per iteration.
For small ε, use the full gradient; if satisfied with a large ε, use the stochastic gradient.
The full gradient can be accelerated; the stochastic gradient cannot.
The full gradient's iteration complexity depends on smoothness and strong convexity; the stochastic gradient's iteration complexity depends only on strong convexity.

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014):

    min_{w∈ℝ^d} F(w) = (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + (λ/2)‖w‖_2^2

Applicable when ℓ(z) is smooth and R(w) is λ-strongly convex.
Stochastic gradient: g_{i_t}(w) = ∇ℓ(wᵀx_{i_t}, y_{i_t}) + λw
E_{i_t}[g_{i_t}(w)] = ∇F(w), but Var[g_{i_t}(w)] does not vanish even as w → w_*

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Compute the full gradient at a reference point w̃:

∇F(w̃) = (1/n) ∑_{i=1}^n ∇l(w̃^T x_i, y_i) + λw̃

Stochastic variance reduced gradient:

g̃_{i_t}(w) = ∇F(w̃) − g_{i_t}(w̃) + g_{i_t}(w)

E_{i_t}[g̃_{i_t}(w)] = ∇F(w), and Var[g̃_{i_t}(w)] → 0 as w, w̃ → w*

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

At the optimal solution w*: ∇F(w*) = 0. This does not mean g_{i_t}(w) → 0 as w → w*. However, we do have

g̃_{i_t}(w) = ∇F(w̃) − g_{i_t}(w̃) + g_{i_t}(w) → 0 as w, w̃ → w*

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Iterate s = 1, ..., T:
    Let w_0 = w̃_s and compute ∇F(w̃_s)
    Iterate t = 1, ..., K:
        g̃_{i_t}(w_{t−1}) = ∇F(w̃_s) − g_{i_t}(w̃_s) + g_{i_t}(w_{t−1})
        w_t = w_{t−1} − γ_t g̃_{i_t}(w_{t−1})
    w̃_{s+1} = (1/K) ∑_{t=1}^K w_t
Output: w̃_T

K = O(1/λ)
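The SVRG loop above can be sketched in NumPy (not from the slides — square loss stands in for any smooth loss, and the toy data and constants are illustrative assumptions):

```python
import numpy as np

def svrg(X, y, lam=1.0, stages=30, K=100, gamma=0.05, seed=0):
    """SVRG sketch for F(w) = (1/n) sum_i (w^T x_i - y_i)^2 / 2 + (lam/2)||w||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_tilde = np.zeros(d)

    def grad_i(w, i):
        return (X[i] @ w - y[i]) * X[i] + lam * w

    for s in range(stages):
        # Full gradient at the reference point w_tilde (O(nd) per stage).
        full = X.T @ (X @ w_tilde - y) / n + lam * w_tilde
        w = w_tilde.copy()
        iterates = []
        for t in range(K):
            i = rng.integers(n)
            # Variance-reduced gradient: unbiased, and its variance
            # vanishes as both w and w_tilde approach the optimum.
            g = full - grad_i(w_tilde, i) + grad_i(w, i)
            w -= gamma * g
            iterates.append(w.copy())
        w_tilde = np.mean(iterates, axis=0)   # average of the inner iterates
    return w_tilde

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
lam = 1.0
w = svrg(X, y, lam=lam)
# Compare against the exact ridge solution of the same objective.
w_star = np.linalg.solve(X.T @ X / 100 + lam * np.eye(3), X.T @ y / 100)
err = np.linalg.norm(w - w_star)
```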

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Per-iteration cost: O(d(n + 1/λ))

Iteration complexity:

R(w) \ l(z, y)       | Non-smooth | Smooth
Non-strongly convex  | N.A.       | N.A.
λ-strongly convex    | N.A.       | O(log(1/ε))

Total runtime: O(d(n + 1/λ) log(1/ε))

Use the proximal mapping for an l1 regularizer.


SAGA (Defazio et al., 2014)

min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n l(w^T x_i, y_i) + (λ/2)||w||₂²

- A new version of SAG (Roux et al., 2012)
- Applicable when l(z) is smooth
- Strong convexity is not required

SAGA (Defazio et al., 2014)

SAGA also reduces the variance of the stochastic gradient, but with a different technique.

SVRG uses gradients at a single reference point w̃:

g̃_{i_t}(w) = ∇F(w̃) − g_{i_t}(w̃) + g_{i_t}(w),  where ∇F(w̃) = (1/n) ∑_{i=1}^n ∇l(w̃^T x_i, y_i) + λw̃

SAGA uses gradients at different points {w̃_1, w̃_2, ..., w̃_n} and their average Ḡ:

g̃_{i_t}(w) = Ḡ − g_{i_t}(w̃_{i_t}) + g_{i_t}(w),  where Ḡ = (1/n) ∑_{i=1}^n [∇l(w̃_i^T x_i, y_i) + λw̃_i]


SAGA (Defazio et al., 2014)

Initialize the average gradient Ḡ_0:

Ḡ_0 = (1/n) ∑_{i=1}^n ḡ_i,  ḡ_i = ∇l(w_0^T x_i, y_i) + λw_0

Stochastic variance reduced gradient:

g̃_{i_t}(w_{t−1}) = Ḡ_{t−1} − ḡ_{i_t} + (∇l(w_{t−1}^T x_{i_t}, y_{i_t}) + λw_{t−1})
w_t = w_{t−1} − γ_t g̃_{i_t}(w_{t−1})

Update the average gradient:

Ḡ_t = (1/n) ∑_{i=1}^n ḡ_i,  with ḡ_{i_t} ← ∇l(w_{t−1}^T x_{i_t}, y_{i_t}) + λw_{t−1}

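A minimal SAGA sketch (not from the slides — square loss stands in for any smooth loss; data and constants are illustrative assumptions):

```python
import numpy as np

def saga(X, y, lam=1.0, steps=3000, gamma=0.05, seed=0):
    """SAGA sketch for ridge-regularized square loss. Keeps a table with the
    last gradient seen for every sample i: O(nd) extra memory, O(d) work
    per iteration."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    # g_table[i] stores the last stochastic gradient computed at sample i.
    g_table = np.stack([(X[i] @ w - y[i]) * X[i] + lam * w for i in range(n)])
    G = g_table.mean(axis=0)              # running average gradient
    for _ in range(steps):
        i = rng.integers(n)
        g_new = (X[i] @ w - y[i]) * X[i] + lam * w
        # Variance-reduced gradient: average - old g_i + new g_i.
        w -= gamma * (G - g_table[i] + g_new)
        # O(d) update of the average, then overwrite the table entry.
        G += (g_new - g_table[i]) / n
        g_table[i] = g_new
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
lam = 1.0
w = saga(X, y, lam=lam)
w_star = np.linalg.solve(X.T @ X / 100 + lam * np.eye(3), X.T @ y / 100)
err = np.linalg.norm(w - w_star)
```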

SAGA: efficient update of the averaged gradient

Ḡ_t and Ḡ_{t−1} differ only in ḡ_i for i = i_t. Before we update ḡ_{i_t}, we update

Ḡ_t = Ḡ_{t−1} − (1/n) ḡ_{i_t} + (1/n) (∇l(w_{t−1}^T x_{i_t}, y_{i_t}) + λw_{t−1})

Computation cost: O(d).

To implement SAGA, we have to store and update all of ḡ_1, ḡ_2, ..., ḡ_n, which requires O(nd) extra memory.


SAGA (Defazio et al., 2014)

Per-iteration cost: O(d)

Iteration complexity:

R(w) \ l(z, y)       | Non-smooth | Smooth
Non-strongly convex  | N.A.       | O(n/ε)
λ-strongly convex    | N.A.       | O((n + 1/λ) log(1/ε))

Total runtime (strongly convex): O(d(n + 1/λ) log(1/ε)) — same as SVRG!

Use the proximal mapping for an l1 regularizer.


Comparing the Runtime of SGD and SVRG/SAGA

Smooth but non-strongly convex: SGD O(d/ε²); SAGA O(dn/ε)

Smooth and strongly convex: SGD O(d/(λε)); SVRG/SAGA O(d(n + 1/λ) log(1/ε))

For small ε, use SVRG/SAGA; if a large ε suffices, use SGD.

Conjugate Duality

Define l_i(z) ≜ l(z, y_i). Conjugate function l_i*(α) of l_i(z):

l_i(z) = max_{α∈R} [αz − l_i*(α)],  l_i*(α) = max_{z∈R} [αz − l_i(z)]

E.g., hinge loss: l_i(z) = max(0, 1 − y_i z)

l_i*(α) = αy_i if −1 ≤ αy_i ≤ 0; +∞ otherwise

E.g., squared hinge loss: l_i(z) = max(0, 1 − y_i z)²

l_i*(α) = α²/4 + αy_i if αy_i ≤ 0; +∞ otherwise

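The biconjugation identity can be checked numerically (not from the slides — the grid and tolerance are illustrative assumptions): maximizing αz − l_i*(α) over the feasible α recovers the hinge loss.

```python
import numpy as np

def hinge(z, y):
    # l_i(z) = max(0, 1 - y*z)
    return max(0.0, 1.0 - y * z)

def hinge_conj(alpha, y):
    # Conjugate from the slide: l_i*(alpha) = alpha*y if -1 <= alpha*y <= 0, else +inf.
    return alpha * y if -1.0 <= alpha * y <= 0.0 else np.inf

# Recover the loss from its conjugate: l_i(z) = max_alpha [alpha*z - l_i*(alpha)].
y = 1.0
errs = []
for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    alphas = np.linspace(-1.0, 0.0, 2001)    # feasible set alpha*y in [-1, 0] for y = 1
    recovered = max(a * z - hinge_conj(a, y) for a in alphas)
    errs.append(abs(recovered - hinge(z, y)))
max_err = max(errs)
```

The maximizer sits at an endpoint of the feasible interval (α = −1 when z < 1, α = 0 otherwise), which is why a coarse grid already recovers the loss exactly.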

SDCA (Shalev-Shwartz & Zhang (2013))

Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008))

Applicable when R(w) is λ-strongly convex; smoothness is not required.

From the primal problem to the dual problem:

min_w (1/n) ∑_{i=1}^n l(w^T x_i, y_i) + (λ/2)||w||₂²
= min_w (1/n) ∑_{i=1}^n max_{α_i∈R} [α_i (w^T x_i) − l_i*(α_i)] + (λ/2)||w||₂²
= max_{α∈R^n} (1/n) ∑_{i=1}^n −l_i*(α_i) − (λ/2) ||(1/(λn)) ∑_{i=1}^n α_i x_i||₂²


SDCA (Shalev-Shwartz & Zhang (2013))

Solve the dual problem:

max_{α∈R^n} (1/n) ∑_{i=1}^n −l_i*(α_i) − (λ/2) ||(1/(λn)) ∑_{i=1}^n α_i x_i||₂²

Sample i_t ∈ {1, ..., n}; optimize α_{i_t} while fixing the others.

SDCA (Shalev-Shwartz & Zhang (2013))

Maintain a primal solution: w_t = (1/(λn)) ∑_{i=1}^n α_i^t x_i.

Change variable α_i → α_i^t + Δα_i:

max_{Δα∈R^n} (1/n) ∑_{i=1}^n −l_i*(α_i^t + Δα_i) − (λ/2) ||(1/(λn)) ∑_{i=1}^n α_i^t x_i + (1/(λn)) ∑_{i=1}^n Δα_i x_i||₂²
= max_{Δα∈R^n} (1/n) ∑_{i=1}^n −l_i*(α_i^t + Δα_i) − (λ/2) ||w_t + (1/(λn)) ∑_{i=1}^n Δα_i x_i||₂²

SDCA (Shalev-Shwartz & Zhang (2013))

Dual coordinate update:

Δα_{i_t} = argmax_{Δα∈R} −(1/n) l_{i_t}*(α_{i_t}^t + Δα) − (λ/2) ||w_t + (1/(λn)) Δα x_{i_t}||₂²
α_{i_t}^{t+1} = α_{i_t}^t + Δα_{i_t}
w_{t+1} = w_t + (1/(λn)) Δα_{i_t} x_{i_t}

SDCA updates

Closed-form solution for Δα_i: hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013)). E.g., square loss:

Δα_i = (y_i − w_t^T x_i − α_i^t) / (1 + ||x_i||₂² / (λn))

Per-iteration cost: O(d)

Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013))
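Not from the slides — a minimal SDCA sketch using the closed-form square-loss update above; the toy data and constants are illustrative assumptions:

```python
import numpy as np

def sdca_ridge(X, y, lam=1.0, epochs=50, seed=0):
    """SDCA sketch for square loss with R(w) = (lam/2)||w||^2, using the
    closed-form coordinate update from the slide."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                  # primal iterate w = (1/(lam*n)) sum_i alpha_i x_i
    sq_norms = np.einsum('ij,ij->i', X, X)
    for _ in range(epochs * n):
        i = rng.integers(n)
        # Closed-form maximizer of the one-variable dual subproblem.
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)    # keep primal solution in sync, O(d)
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
lam = 1.0
w = sdca_ridge(X, y, lam=lam)
# The fixed point satisfies (X^T X / n + lam I) w = X^T y / n.
w_star = np.linalg.solve(X.T @ X / 100 + lam * np.eye(3), X.T @ y / 100)
err = np.linalg.norm(w - w_star)
```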

SDCA

Iteration complexity:

R(w) \ l(z, y)       | Non-smooth      | Smooth
Non-strongly convex  | N.A.            | N.A.
λ-strongly convex    | O(n + 1/(λε))   | O((n + 1/λ) log(1/ε))

Total runtime (smooth loss): O(d(n + 1/λ) log(1/ε)) — the same as SVRG and SAGA!


SVRG vs. SDCA vs. SGD

[Figure: training objective curves for l2-regularized logistic regression with λ = 10^{-4} on MNIST data; from Johnson & Zhang (2013).]

APCG (Lin et al. (2014))

Recall the acceleration scheme for the full gradient method: an auxiliary sequence (β_t) and a momentum step.

Maintain a primal solution: w_t = (1/(λn)) ∑_{i=1}^n β_i^t x_i

Dual coordinate update:

Δβ_{i_t} = argmax_{Δβ∈R} −(1/n) l_{i_t}*(β_{i_t}^t + Δβ) − (λ/2) ||w_t + (1/(λn)) Δβ x_{i_t}||₂²

Momentum step:

α_{i_t}^{t+1} = β_{i_t}^t + Δβ_{i_t}
β^{t+1} = α^{t+1} + η_t (α^{t+1} − α^t)


APCG (Lin et al. (2014))

Per-iteration cost: O(d)

Iteration complexity:

R(w) \ l(z, y)       | Non-smooth          | Smooth
Non-strongly convex  | N.A.                | N.A.
λ-strongly convex    | O(n + √(n/(λε)))    | O((n + √(n/λ)) log(1/ε))

Compared to SDCA, APCG has a shorter runtime when λ is very small.

APCG vs. SDCA

Squared hinge loss SVM on real datasets:

datasets | number of samples n | number of features d | sparsity
rcv1     | 20,242              | 47,236               | 0.16%
covtype  | 581,012             | 54                   | 22%
news20   | 19,996              | 1,355,191            | 0.04%

Plots: F(w_t) − F(w*) vs. the number of passes over the data.

APCG vs. SDCA

[Figure: convergence plots from Lin et al. (2014).]

For general R(w)

Dual problem:

max_{α∈R^n} (1/n) ∑_{i=1}^n −l_i*(α_i) − R*((1/(λn)) ∑_{i=1}^n α_i x_i)

where R* is the conjugate of R.

Sample i_t ∈ {1, ..., n}; optimize α_{i_t} while fixing the others.

The update can still be done in O(d) in many cases (Shalev-Shwartz & Zhang (2013)).

The iteration complexity and runtime of SDCA and APCG remain unchanged.

APCG for the primal problem

min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n l(w^T x_i, y_i) + (λ/2)||w||₂² + τ||w||₁

Suppose d ≫ n; then the per-iteration cost O(d) is too high. Apply APCG to the primal instead of the dual problem: sample over features instead of data points, and the per-iteration cost becomes O(n).

APCG for the primal problem

min_{w∈R^d} F(w) = (1/2)||Xw − y||₂² + (λ/2)||w||₂² + τ||w||₁,  X = [x_1, x_2, ..., x_n]

Full gradient: ∇F(w) = X^T(Xw − y) + λw
Partial gradient: ∇_i F(w) = x_i^T(Xw − y) + λw_i

Proximal Coordinate Gradient (PCG) (Nesterov (2012)):

w_i^t = argmin_{w_i∈R} ∇_i F(w^{t−1}) w_i + (1/(2γ_t))(w_i − w_i^{t−1})² + τ|w_i|  (a proximal mapping) if i = i_t; w_i^{t−1} otherwise

∇_i F(w^t) can be updated in O(n).

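Not from the slides — a minimal randomized proximal coordinate gradient sketch for the elastic-net objective above; the soft-threshold step size, toy data, and names are illustrative assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal mapping of t*|.| applied to a scalar.
    return np.sign(v) * max(abs(v) - t, 0.0)

def pcg(X, y, lam=0.1, tau=0.1, epochs=50, seed=0):
    """Proximal coordinate gradient sketch for
    F(w) = (1/2)||Xw - y||^2 + (lam/2)||w||^2 + tau*||w||_1.
    Maintaining the residual r = Xw - y makes each update O(n)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    r = -y.copy()                              # residual Xw - y at w = 0
    L = np.einsum('ij,ij->j', X, X) + lam      # per-coordinate curvature
    for _ in range(epochs * d):
        i = rng.integers(d)
        g_i = X[:, i] @ r + lam * w[i]         # partial gradient of the smooth part
        w_new = soft_threshold(w[i] - g_i / L[i], tau / L[i])   # prox step
        r += (w_new - w[i]) * X[:, i]          # O(n) residual update
        w[i] = w_new
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true
w = pcg(X, y)
err = np.linalg.norm(w - w_true)
```

With step size 1/L_i the prox-gradient step is the exact coordinate minimizer here, since the smooth part is quadratic along each coordinate.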

APCG for the primal problem

APCG accelerates PCG using an auxiliary sequence (v_t) and a momentum step:

w_i^t = argmin_{w_i∈R} ∇_i F(v^{t−1}) w_i + (1/(2γ_t))(w_i − v_i^{t−1})² + τ|w_i| if i = i_t; w_i^{t−1} otherwise
v^t = w^t + η_t (w^t − w^{t−1})

APCG for the primal problem

Per-iteration cost: O(n)

Iteration complexity:

R(w) \ l(z, y)       | Non-smooth | Smooth
Non-strongly convex  | N.A.       | O(d/√ε)
λ-strongly convex    | N.A.       | O((d/√λ) log(1/ε))

n ≫ d: apply APCG to the dual problem. d ≫ n: apply APCG to the primal problem.

Which Algorithm to Use?

If a large ε suffices: SGD. For small ε:

R(w) \ l(z, y)       | Non-smooth | Smooth
Non-strongly convex  | SGD        | SAGA
λ-strongly convex    | APCG       | APCG

Summary

Table: iteration complexity, per-iteration cost O(d):

smooth | str-cvx | SGD      | SAGA               | SDCA               | APCG
No     | No      | 1/ε²     | N.A.               | N.A.               | N.A.
Yes    | No      | 1/ε²     | n/ε                | N.A.               | N.A.
No     | Yes     | 1/(λε)   | N.A.               | n + 1/(λε)         | n + √(n/(λε))
Yes    | Yes     | 1/(λε)   | (n + 1/λ) log(1/ε) | (n + 1/λ) log(1/ε) | (n + √(n/λ)) log(1/ε)

Table: iteration complexity, per-iteration cost O(d(n + 1/λ)):

smooth | str-cvx | SVRG
No     | No      | N.A.
Yes    | No      | N.A.
No     | Yes     | N.A.
Yes    | Yes     | log(1/ε)

Yang, Lin, Jin Tutorial for KDD 15 August 10, 2015 93 / 234


Summary

           | SGD  | SVRG   | SAGA  | SDCA | APCG
Memory     | O(d) | O(d)   | O(dn) | O(d) | O(d)
Parameters | γ_t  | γ_t, K | γ_t   | None | η_t

Also compared: supported losses (absolute, hinge, square, squared hinge, logistic), regularization (λ > 0 vs. λ = 0), and primal vs. dual formulation.


Outline

2. Optimization
   - (Sub)Gradient Methods
   - Stochastic Optimization Algorithms for Big Data
   - Stochastic Optimization
   - Distributed Optimization

Big Data and Distributed Optimization

Distributed optimization:
- data are distributed over a cluster of multiple machines
- moving all data to a single machine suffers from low network bandwidth and limited disk or memory

Communication vs. computation:
- RAM access: ~100 nanoseconds
- standard network connection: ~250,000 nanoseconds

Distributed Data

N data points are partitioned and distributed to m machines:

[x_1, x_2, ..., x_N] = S_1 ∪ S_2 ∪ ... ∪ S_m

Machine j only has access to S_j. W.l.o.g.: |S_j| = n = N/m.

A Simple Solution: Average the Solutions

Global problem:

w* = argmin_{w∈R^d} { F(w) = (1/N) ∑_{i=1}^N l(w^T x_i, y_i) + R(w) }

Machine j solves a local problem:

w_j* = argmin_{w∈R^d} f_j(w) = (1/n) ∑_{i∈S_j} l(w^T x_i, y_i) + R(w)

The center computes: ŵ = (1/m) ∑_{j=1}^m w_j*

Issue: ŵ will not converge to w*.


Mini-Batch SGD: Average the Stochastic Gradients

Machine j samples i_t ∈ S_j and constructs a stochastic gradient:

g_j(w_{t−1}) = ∇l(w_{t−1}^T x_{i_t}, y_{i_t}) + ∇R(w_{t−1})

The center computes the mini-batch stochastic gradient step:

w_t = w_{t−1} − γ_t (1/m) ∑_{j=1}^m g_j(w_{t−1})
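Not from the slides — a single-process simulation of the mini-batch SGD step above, where each "machine" is just a data partition; the toy data, partitioning scheme, and step sizes are illustrative assumptions:

```python
import numpy as np

def minibatch_sgd(partitions, lam=1.0, steps=2000, seed=0):
    """Each 'machine' j samples one point from its own partition S_j;
    the center averages the m stochastic gradients and takes one step.
    Square loss + (lam/2)||w||^2 stands in for the slide's l + R."""
    rng = np.random.default_rng(seed)
    d = partitions[0][0].shape[1]
    w = np.zeros(d)
    for t in range(1, steps + 1):
        grads = []
        for Xj, yj in partitions:            # one stochastic gradient per machine
            i = rng.integers(len(yj))
            grads.append((Xj[i] @ w - yj[i]) * Xj[i] + lam * w)
        w -= np.mean(grads, axis=0) / (lam * t)   # averaged mini-batch step
    return w

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5])
m = 4
parts = [(X[j::m], y[j::m]) for j in range(m)]    # partition data across m machines
w = minibatch_sgd(parts)
w_star = np.linalg.solve(X.T @ X / 120 + np.eye(3), X.T @ y / 120)
err = np.linalg.norm(w - w_star)
```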

Total Runtime

Single machine:

Total runtime = per-iteration cost × iteration complexity

Distributed optimization:

Total runtime = (communication time per round + local runtime per round) × rounds of communication

Mini-Batch SGD

Applicable in all settings!

- Communication time per round: increases with m in a complicated way.
- Local runtime per round: O(1).
- If R(w) is λ-strongly convex, rounds of communication: O(1/(mλε)).
- If R(w) is non-strongly convex, rounds of communication: O(1/(mε²)).

More machines reduce the rounds of communication but increase the communication time per round.


Distributed SDCA (Yang, 2013; Ma et al., 2015)

Only works when R(w) = (λ/2)||w||₂² with λ > 0 (no l1).

Global dual problem, with α = [α_{S_1}, α_{S_2}, ..., α_{S_m}]:

max_{α∈R^N} (1/N) ∑_{i=1}^N −l_i*(α_i) − (λ/2) ||(1/(λN)) ∑_{i=1}^N α_i x_i||₂²

Machine j solves a local dual problem only over α_{S_j}:

max_{α_{S_j}∈R^n} (1/N) ∑_{i∈S_j} −l_i*(α_i) − (λ/2) ||(1/(λN)) ∑_{i=1}^N α_i x_i||₂²


DisDCA (Yang, 2013), CoCoA+ (Ma et al., 2015)

The center maintains a primal solution: w_t = (1/(λN)) ∑_{i=1}^N α_i^t x_i.

Change variable α_i → α_i^t + Δα_i:

max_{Δα_{S_j}∈R^n} (1/N) ∑_{i∈S_j} −l_i*(α_i^t + Δα_i) − (λ/2) ||(1/(λN)) ∑_{i=1}^N α_i^t x_i + (1/(λN)) ∑_{i∈S_j} Δα_i x_i||₂²
= max_{Δα_{S_j}∈R^n} (1/N) ∑_{i∈S_j} −l_i*(α_i^t + Δα_i) − (λ/2) ||w_t + (1/(λN)) ∑_{i∈S_j} Δα_i x_i||₂²

DisDCA (Yang, 2013), CoCoA+ (Ma et al., 2015)

Machine j approximately solves

Δα_{S_j}^t ≈ argmax_{Δα_{S_j}∈R^n} (1/N) ∑_{i∈S_j} −l_i*(α_i^t + Δα_i) − (λ/2) ||w_t + (1/(λN)) ∑_{i∈S_j} Δα_i x_i||₂²

α_{S_j}^{t+1} = α_{S_j}^t + Δα_{S_j}^t,  Δw_j^t = (1/(λN)) ∑_{i∈S_j} Δα_i^t x_i

The center computes: w_{t+1} = w_t + ∑_{j=1}^m Δw_j^t


CoCoA+ (Ma et al., 2015)

Local objective value:

G_j(Δα_{S_j}, w_t) = (1/N) ∑_{i∈S_j} −l_i*(α_i^t + Δα_i) − (λ/2) ||w_t + (1/(λN)) ∑_{i∈S_j} Δα_i x_i||₂²

Solve for Δα_{S_j}^t by any local solver, as long as

max_{Δα_{S_j}} G_j(Δα_{S_j}, w_t) − G_j(Δα_{S_j}^t, w_t) ≤ Θ (max_{Δα_{S_j}} G_j(Δα_{S_j}, w_t) − G_j(0, w_t)),  0 < Θ < 1

CoCoA+ (Ma et al., 2015)

Suppose l(z) is smooth, R(w) is λ-strongly convex and SDCA is the local solver:
- Local runtime per round: O((1/λ + N/m) log(1/Θ))
- Rounds of communication: O((1/(1−Θ)) (1/λ) log(1/ε))

Suppose l(z) is non-smooth, R(w) is λ-strongly convex and SDCA is the local solver:
- Local runtime per round: O((1/λ + N/m) (1/Θ))
- Rounds of communication: O((1/(1−Θ)) (1/(λε)))


Distributed SDCA in Practice

Choice of Θ (how long to run the local solver) and of m (how many machines to use):
- Fast machines but a slow network: use small Θ and small m.
- Fast network but slow machines: use large Θ and large m.


DiSCO (Zhang & Xiao, 2015)

The number of rounds of communication of distributed SDCA does not depend on m.

DiSCO: a distributed second-order method (Zhang & Xiao, 2015) whose number of rounds of communication does depend on m.

DiSCO (Zhang & Xiao, 2015)

Global problem:

min_{w∈R^d} { F(w) = (1/N) ∑_{i=1}^N l(w^T x_i, y_i) + R(w) }

Local problem:

min_{w∈R^d} f_j(w) = (1/n) ∑_{i∈S_j} l(w^T x_i, y_i) + R(w)

The global problem can be written as min_{w∈R^d} F(w) = (1/m) ∑_{j=1}^m f_j(w).

Applicable when f_j(w) is smooth, λ-strongly convex and self-concordant.

Newton Direction

At the optimal solution w*: ∇F(w*) = 0. We hope moving w_t along v_t leads to ∇F(w_t − v_t) = 0.

Taylor expansion: ∇F(w_t − v_t) ≈ ∇F(w_t) − ∇²F(w_t) v_t = 0

Such a v_t is called a Newton direction.

Newton Method

Find a Newton direction v_t by solving

∇²F(w_t) v_t = ∇F(w_t)

Then update w_{t+1} = w_t − γ_t v_t.

This requires solving a d × d linear system — costly when d > 1000.

DiSCO (Zhang & Xiao, 2015)

Inexact Newton Method: find an inexact Newton direction v_t using Preconditioned Conjugate Gradient (PCG) (Golub & Ye, 1997):

||∇²F(w_t) v_t − ∇F(w_t)||₂ ≤ ε_t

Then update w_{t+1} = w_t − γ_t v_t.

DiSCO (Zhang & Xiao, 2015)

PCG: keep computing v ← P^{−1} ∇²F(w_t) v iteratively until v becomes an inexact Newton direction.

Preconditioner: P = ∇²f_1(w_t) + μI

μ: a tuning parameter such that ||∇²f_1(w_t) − ∇²F(w_t)||₂ ≤ μ

f_1 is similar to F, so P is a good local preconditioner. μ = O(1/√n) = O(√(m/N))


DiSCO (Zhang & Xiao, 2015)

DiSCO computes ∇²F(w_t) v distributedly:

∇²F(w_t) v = (1/m) [ ∇²f_1(w_t) v (machine 1) + ∇²f_2(w_t) v (machine 2) + ... + ∇²f_m(w_t) v (machine m) ]

Then P^{−1} ∇²F(w_t) v is computed only on machine 1.
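Not from the slides — a single-process simulation of the distributed Hessian-vector product above, fed into plain conjugate gradient to get a Newton direction (DiSCO's preconditioner on machine 1 is omitted; ridge-regularized square loss, toy data, and names are illustrative assumptions):

```python
import numpy as np

def hv_distributed(parts, lam, v):
    """Hessian-vector product for F(w) = (1/m) sum_j f_j(w), computed as the
    average of the per-machine products (square loss + ridge as local f_j)."""
    hvs = [Xj.T @ (Xj @ v) / len(yj) + lam * v for Xj, yj in parts]
    return np.mean(hvs, axis=0)

def newton_direction(parts, lam, grad, iters=50):
    """Conjugate gradient on grad = H v; only Hessian-vector products are
    needed, so machines exchange nothing but d-dimensional vectors."""
    v = np.zeros_like(grad)
    r = grad - hv_distributed(parts, lam, v)
    p = r.copy()
    for _ in range(iters):
        Hp = hv_distributed(parts, lam, p)
        a = (r @ r) / (p @ Hp)
        v += a * p
        r_new = r - a * Hp
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + (r_new @ r_new) / (r @ r) * p
        r = r_new
    return v

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5])
lam, m = 1.0, 4
parts = [(X[j::m], y[j::m]) for j in range(m)]
w = np.zeros(3)
grad = X.T @ (X @ w - y) / 120 + lam * w
v = newton_direction(parts, lam, grad)
w1 = w - v   # for a quadratic F, one exact Newton step reaches the optimum
w_star = np.linalg.solve(X.T @ X / 120 + lam * np.eye(3), X.T @ y / 120)
err = np.linalg.norm(w1 - w_star)
```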

DiSCO (Zhang & Xiao, 2015)

[Figure slides: illustrations of the DiSCO communication scheme, from Zhang & Xiao (2015).]


DiSCO (Zhang & Xiao, 2015)

For high-dimensional data (e.g., d ≥ 1000), computing P^{−1} ∇²F(w_t) v directly is too costly. Instead, use SDCA on machine 1 to solve

P^{−1} ∇²F(w_t) v ≈ argmin_{u∈R^d} (1/2) u^T P u − u^T ∇²F(w_t) v

Local runtime: O(N/m + (1+μ)/(λ+μ))

DiSCO (Zhang & Xiao, 2015)

Suppose SDCA is the local solver:
- Local runtime per round: O(N/m + (1+μ)/(λ+μ))
- Rounds of communication: O(√(μ/λ) log(1/ε))

Choice of m (how many machines to use), recalling μ = O(√(m/N)):
- Fast machines but a slow network: use small m.
- Fast network but slow machines: use large m.


DSVRG (Lee et al., 2015)

- A distributed version of SVRG using a round-robin scheme.
- Assumes the user can control the distribution of the data before the algorithm starts.
- Applicable when f_j(w) is smooth and λ-strongly convex.

DSVRG (Lee et al., 2015)

Easy to distribute — the SVRG outer loop:

Iterate s = 1, ..., T:
    Let w_0 = w̃_s and compute ∇F(w̃_s)
    Iterate t = 1, ..., K:
        g̃_{i_t}(w_{t−1}) = ∇F(w̃_s) − g_{i_t}(w̃_s) + g_{i_t}(w_{t−1})
        w_t = w_{t−1} − γ_t g̃_{i_t}(w_{t−1})
    w̃_{s+1} = (1/K) ∑_{t=1}^K w_t
Output: w̃_T

Hard to distribute — the stochastic gradient steps: each machine can only sample from its own data, and therefore

E_{i_t∈S_j}[g̃_{i_t}(w_{t−1})] ≠ ∇F(w_{t−1})


DSVRG (Lee et al., 2015)
Solution: store a second set of data R_j on machine j, sampled with replacement from {x_1, x_2, ..., x_n} before the algorithm starts.
Construct the stochastic gradient ĝ_{i_t}(w_{t-1}) by sampling i_t from R_j, removing i_t from R_j afterwards. Then
  E_{i_t ∼ R_j}[ĝ_{i_t}(w_{t-1})] = ∇F(w_{t-1})
When R_j = ∅, pass w_t to the next machine.
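The pre-sampled pool idea can be illustrated on a toy least-squares objective. This is a rough single-machine sketch, not the authors' code; the function and variable names are invented, and `pool` plays the role of one machine's replacement set R_j:

```python
import numpy as np

def svrg_epoch_with_pool(X, y, w_tilde, pool, step):
    """One DSVRG-style inner loop for f_i(w) = 0.5*(x_i.w - y_i)^2.

    `pool` stands in for R_j: indices pre-sampled WITH replacement from
    the full dataset. Consuming each index exactly once keeps the
    variance-reduced gradient unbiased, E[g_hat] = grad F(w), even
    though the pool physically lives on a single machine.
    """
    n = X.shape[0]
    full_grad = X.T @ (X @ w_tilde - y) / n        # grad F(w_tilde)
    w = w_tilde.copy()
    iterates = []
    for i in pool:                                  # i_t is removed after use
        gi_tilde = X[i] * (X[i] @ w_tilde - y[i])   # grad f_i at w_tilde
        gi = X[i] * (X[i] @ w - y[i])               # grad f_i at w_{t-1}
        w = w - step * (full_grad - gi_tilde + gi)  # variance-reduced step
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)                # w_tilde_{s+1}

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
w_true = rng.standard_normal(5)
y = X @ w_true
w = np.zeros(5)
for s in range(30):
    pool = rng.integers(0, 50, size=20)             # fresh R_j each round
    w = svrg_epoch_with_pool(X, y, w, pool, step=0.05)
err = np.linalg.norm(w - w_true)
```

When a pool is exhausted (R_j = ∅), the distributed algorithm would hand the current iterate to the next machine; here the refilled pool plays that role.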

[Figure slides: illustrations of the full gradient step and the stochastic gradient step in DSVRG's round-robin scheme (slides 124-128).]

DSVRG (Lee et al., 2015)
Suppose |R_j| = r for all j.
Local runtime per round: O(N/m + r)
Rounds of communication: O((1/(rλ)) · log(1/ɛ))

DSVRG (Lee et al., 2015)
Choice of m (how many machines to use?):
- Fast machines but slow network: use small m
- Fast network but slow machines: use large m
Choice of r (how many data points to pre-sample into R_j?):
- The larger, the better
- Required memory per machine: |S_j| + |R_j| = N/m + r

Other Distributed Optimization Methods
ADMM (Boyd et al., 2011; Ozdaglar, 2015)
- Rounds of communication: O((network graph dependency term) · (1/√λ) · log(1/ɛ))
DANE (Shamir et al., 2014)
- Approximates the Newton direction, with a different approach from DiSCO

Summary: l(z) is smooth and R(w) is λ-strongly convex.

alg.        Runtime per round   Rounds of communication
Mini-SGD    O(1)                O(1/(λmɛ))
Dist-SDCA   O(1/λ + N/m)        O((1/λ) · log(1/ɛ))
DiSCO       O(N/m)              O((m^{1/4}/(λ^{1/2} N^{1/4})) · log(1/ɛ))

Table: assumes µ = O(√(m/N)).

alg.                   Runtime per round   Rounds of communication
DSVRG (full grad.)     O(N/m)              O(log(1/ɛ))
DSVRG (stoch. grad.)   O(r)                O((1/(rλ)) · log(1/ɛ))

Summary: l(z) is non-smooth and R(w) is λ-strongly convex.

alg.        Runtime per round   Rounds of communication
Mini-SGD    O(1)                O(1/(λmɛ))
Dist-SDCA   O(1/λ + N/m)        O(1/(λɛ))
DiSCO       N.A.                N.A.
DSVRG       N.A.                N.A.

Summary: R(w) is non-strongly convex.

alg.        Runtime per round   Rounds of communication
Mini-SGD    O(1)                O(1/(mɛ^2))
Dist-SDCA   N.A.                N.A.
DiSCO       N.A.                N.A.
DSVRG       N.A.                N.A.

Which Algorithm to Use

R(w) \ l(z)               Non-smooth    Smooth
Non-strongly convex       Mini-SGD      DSVRG
λ-strongly convex         Dist-SDCA     DiSCO / DSVRG

Between DiSCO and DSVRG:
- Use DiSCO for small λ, e.g., λ < 10^{-5}
- Use DSVRG for large λ, e.g., λ > 10^{-5}

Distributed Machine Learning Systems and Libraries
- Petuum: http://petuum.github.io
- Apache Spark: http://spark.apache.org/
- Parameter Server: http://parameterserver.org/
- Birds: http://cs.uiowa.edu/~tyng/software.html

Thank You! Questions?

Randomized Dimension Reduction

Big Data Analytics: Optimization and Randomization
Part III: Randomization

Outline
1 Basics
2 Optimization
3 Randomized Dimension Reduction
4 Randomized Algorithms
5 Concluding Remarks

Random Sketch
Approximate a large data matrix by a much smaller sketch.

[Figure slides: the framework of randomized algorithms (slides 141-144).]

Why randomized dimension reduction?
- Efficient
- Robust (e.g., dropout)
- Formal guarantees
- Amenable to parallel algorithms

Randomized Dimension Reduction
- Johnson-Lindenstrauss (JL) transforms
- Subspace embeddings
- Column sampling

JL Lemma (Johnson & Lindenstrauss, 1984)
For any 0 < ɛ, δ < 1/2, there exists a probability distribution over m × d real matrices A and a small universal constant c > 0 such that for any fixed x ∈ R^d, with probability at least 1 - δ,
  | ||Ax||_2^2 - ||x||_2^2 | ≤ c · √(log(1/δ)/m) · ||x||_2^2
Equivalently, for m = Θ(ɛ^{-2} log(1/δ)), with probability at least 1 - δ,
  | ||Ax||_2^2 - ||x||_2^2 | ≤ ɛ ||x||_2^2
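The lemma is easy to check numerically. Below is an illustrative numpy sketch (dimensions chosen arbitrarily) using the Gaussian construction introduced on a later slide:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10_000, 2_000                      # ambient and sketch dimensions
x = rng.standard_normal(d)

# Gaussian JL map: i.i.d. N(0, 1/m) entries give E||Ax||^2 = ||x||^2,
# with relative deviation on the order of sqrt(log(1/delta)/m).
A = rng.standard_normal((m, d)) / np.sqrt(m)
ratio = np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2
```

For this m the ratio lands very close to 1, matching the ɛ ≈ √(log(1/δ)/m) scaling.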

Embedding a set of points into a low-dimensional space
Given a set of points x_1, ..., x_n ∈ R^d, we can embed them into a low-dimensional space as Ax_1, ..., Ax_n ∈ R^m such that the pairwise distance between any two points is well preserved:
  ||Ax_i - Ax_j||_2^2 = ||A(x_i - x_j)||_2^2 ≤ (1 + ɛ) ||x_i - x_j||_2^2
  ||Ax_i - Ax_j||_2^2 = ||A(x_i - x_j)||_2^2 ≥ (1 - ɛ) ||x_i - x_j||_2^2
In other words, to preserve all pairwise Euclidean distances up to 1 ± ɛ, only m = Θ(ɛ^{-2} log(n^2/δ)) dimensions are needed (a union bound over the O(n^2) pairs).

JL transforms: Gaussian random projection
Gaussian random projection (Dasgupta & Gupta, 2003): A ∈ R^{m×d}
- A_ij ∼ N(0, 1/m)
- m = Θ(ɛ^{-2} log(1/δ))
Computational cost of AX, where X ∈ R^{d×n}:
- mnd for dense matrices
- nnz(X) · m for sparse matrices
The computational cost is very high (could be as high as solving many of the downstream problems).

Accelerating JL transforms: using discrete distributions
Using discrete distributions (Achlioptas, 2003):
- Pr(A_ij = ±1/√m) = 1/2, or
- Pr(A_ij = ±√(3/m)) = 1/6, Pr(A_ij = 0) = 2/3
Database friendly: replaces multiplications by additions and subtractions.
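A small sketch of the second (sparse) variant, with illustrative numpy code not taken from the slides; two thirds of the entries are zero, so two thirds of the multiplications disappear:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5_000, 1_000
x = rng.standard_normal(d)

# Achlioptas's map: entries are 0 with prob 2/3 and +-sqrt(3/m) with
# prob 1/6 each; in a real implementation the remaining work reduces
# to additions and subtractions of scaled inputs.
s = np.sqrt(3.0 / m)
A = rng.choice([s, 0.0, -s], size=(m, d), p=[1 / 6, 2 / 3, 1 / 6])
ratio = np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2
```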

Accelerating JL transforms: using the Hadamard transform (I)
Fast JL transform based on the randomized Hadamard transform.
Motivation: can we simply use a random sampling matrix P ∈ R^{m×d} that randomly selects m coordinates out of d (scaled by √(d/m))?
Unfortunately not: by the Chernoff bound,
  | ||Px||_2^2 - ||x||_2^2 | ≤ (√d ||x||_∞ / ||x||_2) · √(3 log(2/δ)/m) · ||x||_2^2
so unless √d ||x||_∞ / ||x||_2 ≤ c, random sampling does not work.
The remedy is given by the randomized Hadamard transform.

The randomized Hadamard transform
Hadamard transform H ∈ R^{d×d}: H = (1/√d) H_{2^k}, where
  H_1 = [1],  H_2 = [[1, 1], [1, -1]],  H_{2^k} = [[H_{2^{k-1}}, H_{2^{k-1}}], [H_{2^{k-1}}, -H_{2^{k-1}}]]
- ||Hx||_2 = ||x||_2 and H is orthogonal
- Computational cost of Hx: d log(d)
Randomized Hadamard transform HD:
- D ∈ R^{d×d} is a diagonal matrix with Pr(D_ii = ±1) = 1/2
- HD is orthogonal and ||HDx||_2 = ||x||_2
- Key property: √d ||HDx||_∞ / ||HDx||_2 ≤ O(√(log(d/δ))) with high probability 1 - δ

Accelerating JL transforms: using the Hadamard transform (II)
Fast JL transform based on the randomized Hadamard transform (Tropp, 2011): A = √(d/m) · PHD yields
  | ||Ax||_2^2 - ||x||_2^2 | ≤ √(3 log(2/δ) log(d/δ)/m) · ||x||_2^2
- m = Θ(ɛ^{-2} log(1/δ) log(d/δ)) suffices for 1 ± ɛ
- The additional factor log(d/δ) can be removed
- Computational cost of AX: O(nd log(m))
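The d log(d) cost comes from the butterfly structure of H. A minimal sketch of the subsampled randomized Hadamard transform, assuming d is a power of two (function names here are illustrative, not the authors' code):

```python
import numpy as np

def fwht(v):
    """Fast Walsh-Hadamard transform in O(d log d); len(v) a power of 2."""
    v = v.astype(float).copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b            # butterfly: sums ...
            v[i + h:i + 2 * h] = a - b    # ... and differences
        h *= 2
    return v / np.sqrt(len(v))            # normalized so ||Hv|| = ||v||

rng = np.random.default_rng(2)
d, m = 1024, 256
x = rng.standard_normal(d)

D = rng.choice([-1.0, 1.0], size=d)       # random sign flips (the D matrix)
hdx = fwht(D * x)                         # HDx: flattens the coordinates
idx = rng.choice(d, size=m, replace=False)
ax = np.sqrt(d / m) * hdx[idx]            # Ax = sqrt(d/m) * P H D x
ratio = np.linalg.norm(ax) ** 2 / np.linalg.norm(x) ** 2
```

Because HDx is flat, uniform coordinate sampling now preserves the norm, which is exactly the failure mode of plain sampling described on the previous slide.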

Accelerating JL transforms: using a sparse matrix (I)
Random hashing (Dasgupta et al., 2010): A = HD, where D ∈ R^{d×d} and H ∈ R^{m×d}
- Random hash function h(j): {1, ..., d} → {1, ..., m}
- H_ij = 1 if h(j) = i: a sparse matrix (each column has only one non-zero entry)
- D ∈ R^{d×d} is a diagonal matrix with Pr(D_ii = ±1) = 1/2
- [Ax]_j = Σ_{i: h(i)=j} x_i D_ii
Technically speaking, random hashing does not satisfy the JL lemma.

Accelerating JL transforms: using a sparse matrix (I)
Key properties: E[⟨HDx_1, HDx_2⟩] = ⟨x_1, x_2⟩, and the norm is preserved,
  | ||HDx||_2^2 - ||x||_2^2 | ≤ ɛ ||x||_2^2,
only when ||x||_∞ / ||x||_2 ≤ 1/√c.
Remedy: apply a randomized Hadamard transform P first; Θ(c log(c/δ)) blocks of randomized Hadamard transforms guarantee ||Px||_∞ / ||Px||_2 ≤ 1/√c.
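Random hashing in action, as an illustrative numpy sketch; `np.add.at` performs the bucketed sum, so applying A touches each non-zero of x exactly once:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 5_000, 512
x = rng.standard_normal(d)

# Each coordinate j is sent to a single random bucket h(j) with a
# random sign D_jj; this is the one-nonzero-per-column H D structure.
h = rng.integers(0, m, size=d)            # hash function h(j)
D = rng.choice([-1.0, 1.0], size=d)       # diagonal sign matrix
Ax = np.zeros(m)
np.add.at(Ax, h, D * x)                   # [Ax]_b = sum_{j: h(j)=b} D_jj x_j

ratio = np.linalg.norm(Ax) ** 2 / np.linalg.norm(x) ** 2
```

For a well-spread x like this one the squared norm concentrates; a vector with one dominant coordinate would not enjoy the same guarantee, which is why the Hadamard preconditioning step above is needed in general.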

Accelerating JL transforms: using a sparse matrix (II)
Sparse JL transform based on block random hashing (Kane & Nelson, 2014):
  A = [(1/√s) Q_1; ...; (1/√s) Q_s]
- Each Q_i ∈ R^{v×d} is an independent random hashing (HD) matrix
- Set v = Θ(ɛ^{-1}) and s = Θ(ɛ^{-1} log(1/δ))
- Computational cost of AX: O((nnz(X)/ɛ) · log(1/δ))

Randomized Dimension Reduction
- Johnson-Lindenstrauss (JL) transforms
- Subspace embeddings
- Column sampling

Subspace Embeddings
Definition: a subspace embedding, given parameters 0 < ɛ, δ < 1 and k ≤ d, is a distribution D over matrices A ∈ R^{m×d} such that for any fixed linear subspace W ⊆ R^d with dim(W) = k,
  Pr_{A∼D}( ∀x ∈ W: ||Ax||_2 = (1 ± ɛ) ||x||_2 ) ≥ 1 - δ
Implications: if U ∈ R^{d×k} is an orthogonal matrix (its columns are orthonormal bases of W), then
- AU ∈ R^{m×k} has full column rank
- σ_i(AU) = 1 ± ɛ and (1 - ɛ)^2 ≤ σ_i(U^T A^T A U) ≤ (1 + ɛ)^2
These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification).

Subspace Embeddings
From a JL transform to a subspace embedding (Sarlós, 2006): let A ∈ R^{m×d} be a JL transform. If
  m = O(k log(k/(δɛ)) / ɛ^2)
then with high probability 1 - δ, A is a subspace embedding w.r.t. a k-dimensional subspace of R^d.
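This implication is easy to verify numerically: for a Gaussian JL map with m large relative to k, all singular values of AU concentrate around 1 simultaneously. An illustrative sketch (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(8)
d, k, m = 2_000, 10, 600

# An orthonormal basis U of a random k-dimensional subspace of R^d.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Gaussian JL map: with m on the order of k/eps^2, every singular value
# of AU lands in [1 - eps, 1 + eps], so A preserves norms on the whole
# subspace at once, not just on a single fixed vector.
A = rng.standard_normal((m, d)) / np.sqrt(m)
s = np.linalg.svd(A @ U, compute_uv=False)
```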

Subspace Embeddings
Making block random hashing a subspace embedding (Nelson & Nguyen, 2013):
  A = [(1/√s) Q_1; ...; (1/√s) Q_s]
- Each Q_i ∈ R^{v×d} is an independent random hashing (HD) matrix
- Set v = Θ(k ɛ^{-1} log^5(k/δ)) and s = Θ(ɛ^{-1} log^3(k/δ))
- With high probability 1 - δ, A ∈ R^{m×d} with m = Θ(k log^8(k/δ) / ɛ^2) is a subspace embedding w.r.t. a k-dimensional subspace of R^d
- Computational cost of AX: O((nnz(X)/ɛ) · log^3(k/δ))

Sparse Subspace Embedding (SSE)
Random hashing is an SSE with a constant probability (Nelson & Nguyen, 2013): A = HD, where D ∈ R^{d×d} and H ∈ R^{m×d}
- m = Ω(k^2/ɛ^2) suffices for a subspace embedding with probability 2/3
- Computational cost of AX: O(nnz(X))

Randomized Dimension Reduction
- Johnson-Lindenstrauss (JL) transforms
- Subspace embeddings
- Column (row) sampling

Column Sampling
Column subset selection (feature selection): more interpretable.
- Uniform sampling usually does not work (it is not a JL transform)
- Non-oblivious (data-dependent) sampling: leverage-score sampling

Leverage-score sampling (Drineas et al., 2006)
Let X ∈ R^{d×n} be a rank-k matrix with X = UΣV^T, U ∈ R^{d×k}, Σ ∈ R^{k×k}.
- Leverage scores: ||U_i||_2^2, i = 1, ..., d (squared row norms of U)
- Sampling probabilities: p_i = ||U_i||_2^2 / Σ_{i'=1}^{d} ||U_{i'}||_2^2, i = 1, ..., d
- Let i_1, ..., i_m ∈ {1, ..., d} denote m indices sampled according to p
- Let A ∈ R^{m×d} be the sampling-and-rescaling matrix with one non-zero per row: A_{t, i_t} = 1/√(m p_{i_t}), and all other entries 0
- AX ∈ R^{m×n} is a small sketch of X
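The whole pipeline fits in a few lines of numpy; this is an illustrative sketch with made-up dimensions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, k, m = 2_000, 50, 20, 800

# A rank-k matrix X in R^{d x n}; its rows play the role of features.
X = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))

# Leverage scores are the squared row norms of the left singular factor.
U, _, _ = np.linalg.svd(X, full_matrices=False)
U = U[:, :k]
lev = np.sum(U ** 2, axis=1)              # ||U_i||^2; these sum to k
p = lev / lev.sum()

# Sampling-and-rescaling: row t of the sketch is X[i_t] / sqrt(m p_{i_t}).
idx = rng.choice(d, size=m, p=p)
scale = 1.0 / np.sqrt(m * p[idx])
AX = X[idx] * scale[:, None]

# AU should be nearly an isometry: singular values close to 1.
sv = np.linalg.svd(U[idx] * scale[:, None], compute_uv=False)
```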

Properties of leverage-score sampling
When m = Θ((k/ɛ^2) log(2k/δ)), with high probability 1 - δ:
- AU ∈ R^{m×k} has full column rank
- 1 - ɛ ≤ σ_i(AU) ≤ 1 + ɛ and (1 - ɛ)^2 ≤ σ_i^2(AU) ≤ (1 + ɛ)^2
Leverage-score sampling performs like a subspace embedding (but only for U, the top singular vector matrix of X).
Computational cost: computing the top-k SVD of X is expensive; randomized algorithms exist to compute approximate leverage scores.

When does uniform sampling make sense?
Coherence measure: µ_k = (d/k) max_{1≤i≤d} ||U_i||_2^2
- Valid when the coherence measure is small (some real data-mining datasets have small coherence measures)
- The Nyström method usually uses uniform sampling (Gittens, 2011)

Randomized Algorithms: Randomized Classification (Regression)

Outline
4 Randomized Algorithms
  - Randomized Classification (Regression)
  - Randomized Least-Squares Regression
  - Randomized K-means Clustering
  - Randomized Kernel Methods
  - Randomized Low-rank Matrix Approximation

Classification
Classification problems:
  min_{w ∈ R^d} (1/n) Σ_{i=1}^{n} l(y_i w^T x_i) + (λ/2) ||w||_2^2
- y_i ∈ {+1, -1}: label
- Loss function l(z), with z = y w^T x:
  1. SVMs: (squared) hinge loss l(z) = max(0, 1 - z)^p, where p = 1, 2
  2. Logistic regression: l(z) = log(1 + exp(-z))

Randomized Classification
For large-scale, high-dimensional problems, the computational cost of optimization is O((nd + dκ) log(1/ɛ)).
Using a random reduction A ∈ R^{d×m} (m ≪ d), e.g., a JL transform or a sparse subspace embedding, we reduce X ∈ R^{n×d} to X̂ = XA ∈ R^{n×m}. Then solve
  min_{u ∈ R^m} (1/n) Σ_{i=1}^{n} l(y_i u^T x̂_i) + (λ/2) ||u||_2^2

Randomized Classification
Two questions:
1. Is there any performance guarantee?
   - The margin is preserved if the data is linearly separable (Balcan et al., 2006), as long as m ≥ (12/ɛ) log(6m^2/δ)
   - Generalization performance is preserved if the data matrix is of low rank and m = Ω(k · poly(log(k/(δɛ))) / ɛ^2) (Paul et al., 2013)
2. How do we recover an accurate model in the original high-dimensional space?
   - Dual recovery (Zhang et al., 2014) and dual sparse recovery (Yang et al., 2015)

The Dual Problem
Using the Fenchel conjugate l_i^*(α_i) = max_z (α_i z - l(z, y_i)):
Primal:
  w_* = arg min_{w ∈ R^d} (1/n) Σ_{i=1}^{n} l(w^T x_i, y_i) + (λ/2) ||w||_2^2
Dual:
  α_* = arg max_{α ∈ R^n} -(1/n) Σ_{i=1}^{n} l_i^*(α_i) - (1/(2λn^2)) α^T X X^T α
From dual to primal:
  w_* = (1/(λn)) X^T α_*

Dual Recovery for Randomized Reduction
From the dual formulation: w_* lies in the row space of the data matrix X ∈ R^{n×d}.
Dual recovery: ŵ = (1/(λn)) X^T α̂, where
  α̂ = arg max_{α ∈ R^n} -(1/n) Σ_{i=1}^{n} l_i^*(α_i) - (1/(2λn^2)) α^T X̂ X̂^T α
and X̂ = XA ∈ R^{n×m}.
- Subspace embedding A with m = Θ(r log(r/δ) ɛ^{-2})
- Guarantee: under a low-rank assumption on the data matrix (e.g., rank(X) = r), with high probability 1 - δ,
  ||ŵ - w_*||_2 ≤ (ɛ/(1 - ɛ)) ||w_*||_2
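For ridge regression (squared loss) the primal and dual both have closed forms, so the recovery step can be checked directly. This is an illustrative numpy sketch under a low-rank data assumption, not the authors' experiments:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, r, m, lam = 200, 2_000, 10, 500, 0.1

# Low-rank data: the rows of X lie in an r-dimensional subspace.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
y = rng.standard_normal(n)

# Ridge regression: w* = X^T (XX^T + lam*n*I)^{-1} y = X^T alpha*/(lam*n).
G = X @ X.T
w_star = X.T @ np.linalg.solve(G + lam * n * np.eye(n), y)

# Reduced problem: sketch the FEATURES (X_hat = XA), solve the dual
# there, then recover a d-dimensional model using the ORIGINAL X.
A = rng.standard_normal((d, m)) / np.sqrt(m)
Xh = X @ A
alpha_hat = lam * n * np.linalg.solve(Xh @ Xh.T + lam * n * np.eye(n), y)
w_rec = X.T @ alpha_hat / (lam * n)

rel_err = np.linalg.norm(w_rec - w_star) / np.linalg.norm(w_star)
```

The key point is that α̂ is computed entirely in the m-dimensional sketch, yet the recovered model lives in the full d-dimensional space.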

Dual Sparse Recovery for Randomized Reduction
Assume the optimal dual solution α_* is sparse (i.e., the number of support vectors is small).
Dual sparse recovery: ŵ = (1/(λn)) X^T α̂, where
  α̂ = arg max_{α ∈ R^n} -(1/n) Σ_{i=1}^{n} l_i^*(α_i) - (1/(2λn^2)) α^T X̂ X̂^T α - (τ/n) ||α||_1
and X̂ = XA ∈ R^{n×m}.
- JL transform A with m = Θ(s log(n/δ) ɛ^{-2})
- Guarantee: if α_* is s-sparse, with high probability 1 - δ,
  ||ŵ - w_*||_2 ≤ ɛ ||w_*||_2

Dual Sparse Recovery
RCV1 text data, n = 677,399 and d = 47,236.
[Figure: relative dual error and relative primal error (L2 norm) vs. τ, for λ = 0.001 and sketch sizes m = 1024, 2048, 4096, 8192.]

Randomized Algorithms: Randomized Least-Squares Regression

Least-Squares Regression
Let X ∈ R^{n×d} with d ≪ n and b ∈ R^n. The least-squares regression problem is to find
  w_* = arg min_{w ∈ R^d} ||Xw - b||_2
Computational cost: O(nd^2). Goal of randomized algorithms: o(nd^2).

Randomized Least-Squares Regression
Let A ∈ R^{m×n} be a random reduction matrix. Solve
  ŵ = arg min_{w ∈ R^d} ||A(Xw - b)||_2 = arg min_{w ∈ R^d} ||AXw - Ab||_2
Computational cost: O(md^2) + reduction time.

Randomized Least-Squares Regression
Theoretical guarantees (Sarlós, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):
  ||Xŵ - b||_2 ≤ (1 + ɛ) ||Xw_* - b||_2
Total time: O(nnz(X) + d^3 log(d/ɛ) ɛ^{-2})
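The sketch-and-solve scheme takes only a few lines; here is an illustrative numpy version with a Gaussian reduction (a JL/subspace-embedding choice; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, m = 10_000, 20, 1_000
X = rng.standard_normal((n, d))
b = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Exact solution, O(n d^2).
w_star, *_ = np.linalg.lstsq(X, b, rcond=None)
res_star = np.linalg.norm(X @ w_star - b)

# Sketch-and-solve: compress the n rows to m rows, then solve in O(m d^2).
A = rng.standard_normal((m, n)) / np.sqrt(m)
w_hat, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)
res_hat = np.linalg.norm(X @ w_hat - b)
```

By optimality of w_*, res_hat can never beat res_star; the guarantee above says it also cannot exceed it by more than a (1 + ɛ) factor.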

Randomized Algorithms: Randomized K-means Clustering

K-means Clustering
Let x_1, ..., x_n ∈ R^d be a set of data points. K-means clustering aims to solve
  min_{C_1, ..., C_k} Σ_{j=1}^{k} Σ_{x_i ∈ C_j} ||x_i - µ_j||_2^2
Computational cost: O(ndkt), where t is the number of iterations.

Randomized Algorithms for K-means Clustering
Let X = (x_1, ..., x_n)^T ∈ R^{n×d} be the data matrix. For high-dimensional data:
- Random sketch: X̂ = XA ∈ R^{n×m}, m ≪ d
- Approximate K-means:
  min_{C_1, ..., C_k} Σ_{j=1}^{k} Σ_{x̂_i ∈ C_j} ||x̂_i - µ̂_j||_2^2

Randomized Algorithms for K-means Clustering
For the random sketch, JL transforms and sparse subspace embeddings all work:
- JL transform: m = O(k log(k/(ɛδ)) / ɛ^2)
- Sparse subspace embedding: m = O(k^2 / (ɛ^2 δ))
- ɛ relates to the approximation accuracy
The analysis of the approximation error for K-means can be formulated as constrained low-rank approximation (Cohen et al., 2015):
  min_{Q: Q^T Q = I} ||X - Q Q^T X||_F^2
where Q is orthonormal.

Outline
4 Randomized Algorithms
  Randomized Classification (Regression)
  Randomized Least-Squares Regression
  Randomized K-means Clustering
  Randomized Kernel Methods
  Randomized Low-rank Matrix Approximation

Kernel Methods

Kernel function κ(·, ·); a set of examples x₁, ..., x_n
Kernel matrix: K ∈ R^{n×n} with K_ij = κ(x_i, x_j); K is a PSD matrix
Computational and memory costs: Ω(n²)
Approximation methods:
  The Nyström method
  Random Fourier features

The Nyström Method

Let A ∈ R^{n×l} be a uniform sampling matrix.
B = KA ∈ R^{n×l}, C = AᵀB = AᵀKA
The Nyström approximation (Drineas & Mahoney, 2005): K̃ = BC†Bᵀ
Computational cost: O(l³ + nl²)
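A compact NumPy sketch of the Nyström approximation on a synthetic RBF kernel (the bandwidth and the regularization threshold are demo choices, not part of the method), including the feature map X̃ = BC^{−1/2} used on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 300, 60
X = rng.standard_normal((n, 5))

# A smooth RBF kernel matrix (bandwidth chosen so K is close to low rank).
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 50.0)

# Uniformly sample l columns: B = K A, C = A^T K A, K~ = B C^+ B^T.
idx = rng.choice(n, size=l, replace=False)
B = K[:, idx]
C = B[idx, :]
K_ny = B @ np.linalg.pinv(C, rcond=1e-8) @ B.T

rel_err = np.linalg.norm(K - K_ny) / np.linalg.norm(K)

# Nystrom feature map X~ = B C^{-1/2}, so that X~ X~^T = K~.
s, V = np.linalg.eigh(C)
keep = s > 1e-8 * s.max()
feat = B @ V[:, keep] / np.sqrt(s[keep])
```

Only l = 60 kernel columns are evaluated, instead of all n² entries, and `feat` can be fed to any linear solver.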

The Nyström-Based Kernel Machine

The dual problem:
max_{α ∈ R^n} (1/n) Σ_{i=1}^n −ℓ_i*(α_i) − (1/(2λn²)) αᵀ BC†Bᵀ α
Solve it like a linear method: with X̃ = BC^{−1/2} ∈ R^{n×l},
max_{α ∈ R^n} (1/n) Σ_{i=1}^n −ℓ_i*(α_i) − (1/(2λn²)) αᵀ X̃X̃ᵀ α

The Nyström-Based Kernel Machine
(figure)

Random Fourier Features (RFF)

Bochner's theorem: a shift-invariant kernel κ(x, y) = κ(x − y) is a valid kernel if and only if κ(δ) is the Fourier transform of a non-negative measure, i.e.,
κ(x − y) = ∫ p(ω) e^{jωᵀ(x−y)} dω
RFF (Rahimi & Recht, 2008): generate ω₁, ..., ω_m ∈ R^d following p(ω). For an example x ∈ R^d, construct
x̃ = (cos(ω₁ᵀx), sin(ω₁ᵀx), ..., cos(ω_mᵀx), sin(ω_mᵀx)) ∈ R^{2m}
RBF kernel exp(−‖x − y‖₂² / (2γ²)): p(ω) = N(0, γ⁻²I)
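A short NumPy sketch of RFF for the RBF kernel on synthetic data. Note the 1/√m normalization, which the slide omits but is needed so that the inner product of features estimates the kernel value:

```python
import numpy as np

def rff_features(X, m, gamma, rng):
    """Random Fourier features for exp(-||x - y||^2 / (2 gamma^2)).
    Spectral density: omega ~ N(0, gamma^{-2} I)."""
    d = X.shape[1]
    W = rng.standard_normal((d, m)) / gamma
    proj = X @ W
    # 1/sqrt(m) scaling so that F @ F.T averages cos(omega^T (x - y)) over m draws
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
gamma = 2.0
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / (2 * gamma ** 2))          # exact RBF kernel matrix

F = rff_features(X, m=4000, gamma=gamma, rng=rng)
err = np.abs(F @ F.T - K).max()             # entrywise approximation error
```

The identity cos(ωᵀx)cos(ωᵀy) + sin(ωᵀx)sin(ωᵀy) = cos(ωᵀ(x − y)) makes F Fᵀ an unbiased Monte Carlo estimate of K.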

The Nyström Method vs RFF

(Yang et al., 2012): a functional approximation framework
The Nyström method: data-dependent basis functions
RFF: data-independent basis functions
In certain cases (e.g., a large eigen-gap or a skewed eigenvalue distribution), the generalization performance of the Nyström method is better than that of RFF.

The Nyström Method vs RFF
(figure)

Outline
4 Randomized Algorithms
  Randomized Classification (Regression)
  Randomized Least-Squares Regression
  Randomized K-means Clustering
  Randomized Kernel Methods
  Randomized Low-rank Matrix Approximation

Randomized Low-Rank Matrix Approximation

Let X ∈ R^{n×d}. The goal is to obtain ÛΣ̂V̂ᵀ ≈ X, where
  Û ∈ R^{n×k}, V̂ ∈ R^{d×k} have orthonormal columns
  Σ̂ ∈ R^{k×k} is a diagonal matrix with nonnegative entries
  k is the target rank
The best rank-k approximation: X_k = U_kΣ_kV_kᵀ
Approximation error: ‖ÛΣ̂V̂ᵀ − X‖_ξ ≤ (1 + ɛ)‖U_kΣ_kV_kᵀ − X‖_ξ, where ξ = F or ξ = 2

Why Low-Rank Approximation?

Applications in data mining and machine learning:
  PCA
  Spectral clustering

Why Randomized Algorithms?

Deterministic algorithms:
  Truncated SVD: O(nd min(n, d))
  Rank-revealing QR factorization: O(ndk)
  Krylov subspace methods (e.g., the Lanczos algorithm): O(kT_mult + (n + d)k²), where T_mult denotes the cost of a matrix-vector product
Randomized algorithms:
  Can be faster (e.g., O(nd log(k)))
  More robust output (e.g., Lanczos requires sophisticated modifications)
  Can be pass-efficient
  Can exploit parallelism

Randomized Algorithms for Low-Rank Matrix Approximation

The basic randomized algorithm for approximating X ∈ R^{n×d} (Halko et al., 2011):
1 Obtain a small sketch Y = XA ∈ R^{n×m}
2 Compute Q ∈ R^{n×m} whose columns form an orthonormal basis of col(Y)
3 Compute the SVD of QᵀX = UΣVᵀ
4 Approximate X ≈ ŨΣVᵀ, where Ũ = QU
Explanation: if col(XA) captures the top-k column space of X well, i.e., ‖X − QQᵀX‖ ≤ ε, then ‖X − ŨΣVᵀ‖ ≤ ε
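The four steps above can be sketched in a few lines of NumPy. This is a minimal rendering of the basic scheme from Halko et al. (2011), tested here on a synthetic low-rank-plus-noise matrix; the Gaussian test matrix stands in for the faster structured transforms mentioned later:

```python
import numpy as np

def randomized_svd(X, k, p=10, rng=None):
    """Basic randomized SVD: sketch, orthonormalize, then SVD the small matrix."""
    rng = rng or np.random.default_rng(0)
    m = k + p                                   # oversampled sketch size
    Y = X @ rng.standard_normal((X.shape[1], m))            # 1. Y = X A
    Q, _ = np.linalg.qr(Y)                                  # 2. basis of col(Y)
    U_small, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)  # 3. SVD(Q^T X)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]              # 4. U~, Sigma, V^T

rng = np.random.default_rng(0)
n, d, k = 500, 200, 10
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))       # rank-k signal plus small noise

U, s, Vt = randomized_svd(X, k, rng=rng)
err = np.linalg.norm(X - (U * s) @ Vt) / np.linalg.norm(X)
```

The only interactions with X are a handful of matrix-matrix products and an SVD of a (k + p) × d matrix, which is where the speedup comes from.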

Randomized Algorithms for Low-Rank Matrix Approximation

Three questions:
1 What is the value of m? m = k + p, where p is the oversampling parameter. In practice p = 5 or 10 gives superb results.
2 What is the computational cost? With the subsampled randomized Hadamard transform, it can be as fast as O(nd log(k) + k²(n + d)).
3 What is the quality? Theoretically guaranteed; practically, very accurate.

Randomized Algorithms for Low-Rank Matrix Approximation
(figure)

Randomized Algorithms for Low-Rank Matrix Approximation

Other things:
  Use power iteration to reduce the error: use (XXᵀ)^q X
  Can use sparse JL transform / subspace embedding matrices (Frobenius-norm guarantee only)

Outline
1 Basics
2 Optimization
3 Randomized Dimension Reduction
4 Randomized Algorithms
5 Concluding Remarks

How to Address the Big Data Challenge?

Optimization perspective: improve convergence rates by exploiting properties of the functions
  stochastic optimization (e.g., SDCA, SVRG, SAGA)
  distributed optimization (e.g., DisDCA)
Randomization perspective: reduce the data size by exploiting properties of the data
  randomized feature reduction (reduce the number of features)
  randomized instance reduction (reduce the number of instances)

How Can We Address the Big Data Challenge?

Optimization perspective: improve convergence rates by exploiting properties of the functions
  Pro: can obtain the optimal solution
  Con: high computational/communication costs
Randomization perspective: reduce the data size by exploiting properties of the data
  Pro: fast
  Con: a recovery error remains
Can we combine the benefits of the two techniques?

Combine Randomization and Optimization (Yang et al., 2015)

Use randomization (Dual Sparse Recovery) to obtain a good initial solution
Initialize distributed optimization (DisDCA) with it to reduce the cost of computation/communication
Observation: 1 or 2 epochs of computation (1 or 2 communications) suffice to match the performance of pure optimization

Big Data Experiments

KDDcup data: n = 8,407,752, d = 29,890,095, 10 machines, m = 1024
(figures: testing error (%) and running time (s) on kdd for DSRR, DSRR-Rec, DSRR-DisDCA 1, DSRR-DisDCA 2, and DisDCA)

Thank You! Questions?

References

Achlioptas, Dimitris. Database-friendly random projections: Johnson–Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
Balcan, Maria-Florina, Blum, Avrim, and Vempala, Santosh. Kernels as features: on kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79–94, 2006.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
Cohen, Michael B., Elder, Sam, Musco, Cameron, Musco, Christopher, and Persu, Madalina. Dimensionality reduction for k-means clustering and low rank approximation. In STOC, pp. 163–172, 2015.
Dasgupta, Anirban, Kumar, Ravi, and Sarlós, Tamás. A sparse Johnson–Lindenstrauss transform. In STOC, pp. 341–350, 2010.
Dasgupta, Sanjoy and Gupta, Anupam. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.
Drineas, Petros and Mahoney, Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2005.
Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Sampling algorithms for l2 regression and applications. In SODA, pp. 1127–1136, 2006.
Drineas, Petros, Mahoney, Michael W., Muthukrishnan, S., and Sarlós, Tamás. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
Gittens, Alex. The spectral norm error of the naive Nyström extension. CoRR, 2011.
Golub, Gene H. and Ye, Qiang. Inexact preconditioned conjugate gradient method with inner-outer iteration. SIAM J. Sci. Comput., 21:1305–1320, 1997.
Halko, Nathan, Martinsson, Per-Gunnar, and Tropp, Joel A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, pp. 408–415, 2008.
Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.
Johnson, William and Lindenstrauss, Joram. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (New Haven, Conn., 1982), volume 26, pp. 189–206, 1984.
Kane, Daniel M. and Nelson, Jelani. Sparser Johnson–Lindenstrauss transforms. Journal of the ACM, 61:4:1–4:23, 2014.
Lee, Jason, Ma, Tengyu, and Lin, Qihang. Distributed stochastic variance reduced gradient methods. Technical report, UC Berkeley, 2015.
Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. In NIPS, 2014.
Ma, Chenxin, Smith, Virginia, Jaggi, Martin, Jordan, Michael I., Richtárik, Peter, and Takáč, Martin. Adding vs. averaging in distributed primal-dual optimization. In ICML, 2015.
Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. CoRR, abs/1211.1002, 2012.
Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In FOCS, pp. 117–126, 2013.
Nemirovski, A. and Yudin, D. On Cesari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Math. Dokl., 19, 1978.
Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22:341–362, 2012.
Ozdaglar, Asu. Distributed multiagent optimization: linear convergence rate of ADMM. Technical report, MIT, 2015.
Paul, Saurabh, Boutsidis, Christos, Magdon-Ismail, Malik, and Drineas, Petros. Random projections for support vector machines. In AISTATS, pp. 498–506, 2013.
Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In NIPS 20, pp. 1177–1184, 2008.
Recht, Benjamin. A simpler approach to matrix completion. Journal of Machine Learning Research, pp. 3413–3430, 2011.
Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. CoRR, 2012.
Sarlós, Tamás. Improved approximation algorithms for large matrices via random projections. In FOCS, pp. 143–152, 2006.
Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14:567–599, 2013.
Shamir, Ohad, Srebro, Nathan, and Zhang, Tong. Communication-efficient distributed optimization using an approximate Newton-type method. In ICML, 2014.
Tropp, Joel A. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2):115–126, 2011.
Tropp, Joel A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, 2012.
Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
Yang, Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS, 2013.
Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pp. 485–493, 2012.
Yang, Tianbao, Zhang, Lijun, Jin, Rong, and Zhu, Shenghuo. Theory of dual-sparse regularized randomized reduction. In ICML, pp. 305–314, 2015.
Zhang, Lijun, Mahdavi, Mehrdad, and Jin, Rong. Linear convergence with condition number independent access of full gradients. In NIPS, pp. 980–988, 2013.
Zhang, Lijun, Mahdavi, Mehrdad, Jin, Rong, Yang, Tianbao, and Zhu, Shenghuo. Random projections for classification: A recovery approach. IEEE Transactions on Information Theory, 60(11):7300–7316, 2014.
Zhang, Yuchen and Xiao, Lin. Communication-efficient distributed optimization of self-concordant empirical loss. In ICML, 2015.

Appendix: Examples of Convex Functions

ax + b, ‖Ax + b‖
‖x‖₂, ‖x‖₂²
exp(ax), exp(wᵀx)
log(1 + exp(ax)), log(1 + exp(wᵀx))
x log(x), Σ_i x_i log(x_i)
‖x‖_p (p ≥ 1), ‖x‖_p²
max_i(x_i)

Operations That Preserve Convexity

Nonnegative scaling: a·f(x), where a ≥ 0
Sum: f(x) + g(x)
Composition with an affine function: f(Ax + b)
Pointwise maximum: max_i f_i(x)
Examples:
  Least-squares regression: ‖Ax − b‖²
  SVM: (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + (λ/2)‖w‖₂²

Smooth Convex Function

Smooth with smoothness constant L > 0:
‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂
The second-order derivative is upper bounded: ‖∇²f(x)‖₂ ≤ L
Example: the logistic loss f(x) = log(1 + exp(−x))
(figure: the logistic loss with its tangent f(y) + f′(y)(x − y) and a quadratic upper bound at y)

Strongly Convex Function

Strongly convex with strong-convexity constant λ > 0:
‖∇f(x) − ∇f(y)‖₂ ≥ λ‖x − y‖₂
The second-order derivative is lower bounded: ∇²f(x) ⪰ λI
Example: the squared Euclidean norm f(x) = ½‖x‖₂²
(figure: f(x) = x² with a linear lower bound)

Smooth and Strongly Convex Function

Smooth and strongly convex, e.g., the quadratic f(z) = ½(z − 1)²:
λ‖x − y‖₂ ≤ ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, with L ≥ λ > 0

Chernoff Bound

Let X₁, ..., X_n be independent random variables with 0 ≤ X_i ≤ 1. Let X = X₁ + ... + X_n and µ = E[X]. Then
Pr(X ≥ (1 + ɛ)µ) ≤ exp(−ɛ²µ/(2 + ɛ))
Pr(X ≤ (1 − ɛ)µ) ≤ exp(−ɛ²µ/2)
or
Pr(|X − µ| ≥ ɛµ) ≤ 2 exp(−ɛ²µ/3)
where the last inequality holds for 0 < ɛ ≤ 1.
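The two-sided bound is easy to sanity-check by Monte Carlo. A minimal sketch with Bernoulli(0.5) variables (the sample sizes and ɛ here are arbitrary demo choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 200, 0.3, 20000
mu = 0.5 * n                                     # E[X] for Bernoulli(0.5) summands

# X = X_1 + ... + X_n with X_i ~ Bernoulli(0.5), repeated over many trials.
X = rng.binomial(1, 0.5, size=(trials, n)).sum(axis=1)

empirical = np.mean(np.abs(X - mu) >= eps * mu)  # observed two-sided tail frequency
bound = 2 * np.exp(-eps ** 2 * mu / 3)           # Chernoff: 2 exp(-eps^2 mu / 3)
```

As expected, the empirical tail probability sits well below the Chernoff bound (the bound is loose but exponentially small in µ).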

Theoretical Guarantee of RA for Low-Rank Approximation

X = U [Σ₁ 0; 0 Σ₂] [V₁ᵀ; V₂ᵀ]
X ∈ R^{m×n}: the target matrix
Σ₁ ∈ R^{k×k}, V₁ ∈ R^{n×k}
A ∈ R^{n×l}: random reduction matrix
Y = XA ∈ R^{m×l}: the small sketch
Key inequality (with Ω₁ = V₁ᵀA, Ω₂ = V₂ᵀA):
‖(I − P_Y)X‖² ≤ ‖Σ₂‖² + ‖Σ₂Ω₂Ω₁†‖²

Gaussian Matrices

G is a standard Gaussian matrix. If U and V are orthonormal matrices, then UᵀGV follows the standard Gaussian distribution.
E[‖SGT‖²_F] = ‖S‖²_F ‖T‖²_F
E[‖SGT‖] ≤ ‖S‖‖T‖_F + ‖S‖_F‖T‖
Concentration for a function of a Gaussian matrix: suppose h is a Lipschitz function on matrices, |h(X) − h(Y)| ≤ L‖X − Y‖_F. Then
Pr(h(G) ≥ E[h(G)] + Lt) ≤ e^{−t²/2}

Analysis for Randomized Least-Squares Regression

Let X = UΣVᵀ and w∗ = arg min_{w ∈ R^d} ‖Xw − b‖₂.
Let Z = ‖Xw∗ − b‖₂², ω = b − Xw∗, and Xw∗ = Uα.
ŵ = arg min_{w ∈ R^d} ‖A(Xw − b)‖₂
Since b − Xw∗ = b − X(XᵀX)†Xᵀb = (I − UUᵀ)b, we have Xŵ − Xw∗ = Uβ. Then
‖Xŵ − b‖₂² = ‖Xw∗ − b‖₂² + ‖Xŵ − Xw∗‖₂² = Z + ‖β‖₂²

Analysis for Randomized Least-Squares Regression

AU(α + β) = AXŵ = AX(AX)†Ab = P_{AX}(Ab) = P_{AU}(Ab)
P_{AU}(Ab) = P_{AU}(A(ω + Uα)) = AUα + P_{AU}(Aω)
Hence
UᵀAᵀAUβ = (AU)ᵀAU((AU)ᵀAU)⁻¹(AU)ᵀAω = UᵀAᵀAω
where we use the fact that AU has full column rank. Then
‖β‖₂²/2 ≤ ‖UᵀAᵀAUβ‖₂² = ‖UᵀAᵀAω‖₂² ≤ ɛ′² ‖U‖²_F ‖ω‖₂²
where the last inequality uses the matrix-product approximation shown on the next slide. Since ‖U‖²_F ≤ d, setting ɛ′ = √(ɛ/d) suffices.

Approximate Matrix Products

Given X ∈ R^{n×d} and Y ∈ R^{d×p}, let A ∈ R^{m×d} be one of the following matrices:
  a JL transform matrix with m = Θ(ɛ⁻² log((n + p)/δ))
  the sparse subspace embedding with m = Θ(ɛ⁻²)
  a leverage-score sampling matrix based on p_i ≥ ‖X_i‖₂² / (2‖X‖²_F), with m = Θ(ɛ⁻²)
Then w.h.p. 1 − δ,
‖XAᵀAY − XY‖_F ≤ ɛ‖X‖_F‖Y‖_F
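The product-approximation guarantee can be demonstrated numerically. A minimal sketch using a Gaussian JL matrix on synthetic data (the sparse embedding and leverage-score variants obey the same Frobenius-norm bound):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p, m = 50, 2000, 40, 500
X = rng.standard_normal((n, d)) / np.sqrt(d)   # rows roughly unit norm
Y = rng.standard_normal((d, p)) / np.sqrt(d)

# Gaussian JL sketch A in R^{m x d}; X A^T A Y approximates X Y.
A = rng.standard_normal((m, d)) / np.sqrt(m)
approx = (X @ A.T) @ (A @ Y)

err = np.linalg.norm(approx - X @ Y) / (np.linalg.norm(X) * np.linalg.norm(Y))
```

The sketched product touches only m = 500 of the d = 2000 inner dimensions, and the relative Frobenius error scales like 1/√m.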

Analysis for Randomized Least-Squares Regression

A ∈ R^{m×n} must provide:
1. a subspace embedding: AU has full column rank
2. matrix-product approximation at accuracy √(ɛ/d)
Order of m:
  JL transforms: 1. O(d log(d)); 2. O(d log(d) ɛ⁻¹) — overall O(d log(d) ɛ⁻¹)
  Sparse subspace embedding: 1. O(d²); 2. O(d ɛ⁻¹) — overall O(d² ɛ⁻¹)
If we use an SSE A₁ ∈ R^{m₁×n} followed by a JL transform A₂ ∈ R^{m₂×m₁}:
‖A₂A₁(Xw₂ − b)‖₂ ≤ (1 + ɛ)‖A₁(Xw₁ − b)‖₂ ≤ (1 + ɛ)‖A₁(Xw∗ − b)‖₂ ≤ (1 + ɛ)²‖Xw∗ − b‖₂
with m₁ = O(d²ɛ⁻²) and m₂ = O(d log(d) ɛ⁻¹), where w₂ is the optimal solution using A₂A₁, w₁ is the optimum using A₁, and w∗ is the original optimal solution.

Randomized Least-Squares Regression

Theoretical guarantees (Sarlós, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):
‖Xŵ − b‖₂ ≤ (1 + ɛ)‖Xw∗ − b‖₂
If A is a fast JL transform with m = Θ(ɛ⁻¹ d log(d)): total time O(nd log(m) + d³ log(d) ɛ⁻¹)
If A is a sparse subspace embedding with m = Θ(d² ɛ⁻¹): total time O(nnz(X) + d⁴ ɛ⁻¹)
If A = A₁A₂ combines a fast JL transform (m₁ = Θ(ɛ⁻¹ d log(d))) and an SSE (m₂ = Θ(d² ɛ⁻²)): total time O(nnz(X) + d³ log(d/ɛ) ɛ⁻²)

Matrix Chernoff Bound

Lemma (Matrix Chernoff (Tropp, 2012)). Let X be a finite set of PSD matrices with dimension k, and suppose that max_{X ∈ X} λ_max(X) ≤ B. Sample {X₁, ..., X_l} independently from X. Compute
µ_max = l·λ_max(E[X₁]), µ_min = l·λ_min(E[X₁])
Then
Pr{λ_max(Σ_{i=1}^l X_i) ≥ (1 + δ)µ_max} ≤ k [e^δ / (1 + δ)^{1+δ}]^{µ_max/B}
Pr{λ_min(Σ_{i=1}^l X_i) ≤ (1 − δ)µ_min} ≤ k [e^{−δ} / (1 − δ)^{1−δ}]^{µ_min/B}

To simplify the usage of the matrix Chernoff bound, we note that
[e^{−δ} / (1 − δ)^{1−δ}]^µ ≤ exp(−δ²µ/2)
[e^δ / (1 + δ)^{1+δ}]^µ ≤ exp(−µδ²/3), for δ ≤ 1
[e^δ / (1 + δ)^{1+δ}]^µ ≤ exp(−µδ log(δ)/2), for δ > 1

Noncommutative Bernstein Inequality

Lemma (Noncommutative Bernstein Inequality (Recht, 2011)). Let Z₁, ..., Z_L be independent zero-mean random matrices of dimension d₁ × d₂. Suppose τ_j² = max{‖E[Z_jZ_jᵀ]‖₂, ‖E[Z_jᵀZ_j]‖₂} and ‖Z_j‖₂ ≤ M almost surely for all j. Then, for any ɛ > 0,
Pr[‖Σ_{j=1}^L Z_j‖₂ > ɛ] ≤ (d₁ + d₂) exp(−(ɛ²/2) / (Σ_{j=1}^L τ_j² + Mɛ/3))

Randomized Algorithms for K-means Clustering

K-means:
Σ_{j=1}^k Σ_{x_i ∈ C_j} ‖x_i − μ_j‖₂² = ‖X − CCᵀX‖²_F
where C ∈ R^{n×k} is the scaled cluster indicator matrix with CᵀC = I.
Constrained low-rank approximation (Cohen et al., 2015):
min_{P ∈ S} ‖X − PX‖²_F
where S is any set of rank-k orthogonal projection matrices P = QQᵀ with orthonormal Q ∈ R^{n×k}.
Low-rank approximation: S is the set of all rank-k orthogonal projection matrices; the optimum is P = U_kU_kᵀ.

Randomized Algorithms for K-means Clustering

Guarantees on the approximation. Define
P̃∗ = arg min_{P ∈ S} ‖X̃ − PX̃‖²_F, P∗ = arg min_{P ∈ S} ‖X − PX‖²_F
Then
‖X − P̃∗X‖²_F ≤ ((1 + ɛ)/(1 − ɛ)) ‖X − P∗X‖²_F

Properties of Leverage-Score Sampling

We prove the properties using the matrix Chernoff bound. Let Ω = AU. Then
ΩᵀΩ = (AU)ᵀ(AU) = Σ_{j=1}^m (1/(m p_{i_j})) u_{i_j}u_{i_j}ᵀ
Let X_i = (1/(m p_i)) u_iu_iᵀ. Then E[X_i] = (1/m) I_k, so λ_max(E[X_i]) = λ_min(E[X_i]) = 1/m.
Moreover, λ_max(X_i) ≤ max_i ‖u_i‖₂²/(m p_i) = k/m (for leverage-score sampling, p_i = ‖u_i‖₂²/k).
Applying the matrix Chernoff bound to the minimum and maximum eigenvalues, we have
Pr(λ_min(ΩᵀΩ) ≤ 1 − ɛ) ≤ k exp(−mɛ²/(2k)) ≤ k exp(−mɛ²/(3k))
Pr(λ_max(ΩᵀΩ) ≥ 1 + ɛ) ≤ k exp(−mɛ²/(3k))
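The construction above can be checked numerically: compute leverage scores from the thin SVD, sample with the scaling 1/√(m p_i), and verify that ΩᵀΩ concentrates around I_k. A minimal sketch on a synthetic rank-k matrix (sizes are demo choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 2000, 5, 400
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, k))  # rank-k matrix

# Leverage scores: squared row norms of U from the thin SVD; p_i = ||u_i||^2 / k.
U, _, _ = np.linalg.svd(X, full_matrices=False)
p = (U ** 2).sum(axis=1) / k
assert np.isclose(p.sum(), 1.0)                 # probabilities sum to 1

# Sample m rows with probabilities p, rescaling each by 1/sqrt(m p_i):
# Omega = A U should satisfy Omega^T Omega ~ I_k (matrix Chernoff).
idx = rng.choice(n, size=m, p=p)
Omega = U[idx] / np.sqrt(m * p[idx])[:, None]
dev = np.linalg.norm(Omega.T @ Omega - np.eye(k), 2)
```

With m = 400 samples and k = 5, the spectral deviation from the identity is small, in line with the exp(−mɛ²/k)-type tail.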

When Does Uniform Sampling Make Sense?

Coherence measure: µ_k = (d/k) max_{1≤i≤d} ‖U_i‖₂²
When µ_k ≤ τ and m = Θ(kτ log(2k/δ) / ɛ²), for A formed by uniform sampling (and scaling), w.h.p. 1 − δ:
  AU ∈ R^{m×k} has full column rank
  (1 − ɛ) ≤ σ_i²(AU) ≤ (1 + ɛ)
Valid when the coherence measure is small (some real data-mining datasets have small coherence measures).
The Nyström method usually uses uniform sampling (Gittens, 2011).
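Coherence is cheap to compute from the thin SVD and makes the condition concrete. A small sketch contrasting an incoherent synthetic matrix with a coherent one (writing n for the number of rows; the row-scaling trick used to create coherence is an illustration, not from the slide):

```python
import numpy as np

def coherence(X, k):
    """mu_k = (n/k) * max_i ||U_i||_2^2 over the n rows of the top-k left factor."""
    n = X.shape[0]
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return (n / k) * (U[:, :k] ** 2).sum(axis=1).max()

rng = np.random.default_rng(0)
n, k = 1000, 5
base = rng.standard_normal((n, k)) @ rng.standard_normal((k, 20))  # rank-k

coherent = base.copy()
coherent[:3] *= 100.0        # three rows dominate the top subspace

mu_incoherent = coherence(base, k)   # mass spread over all rows: small mu_k
mu_coherent = coherence(coherent, k) # a few heavy rows: mu_k approaches n/k
```

Uniform sampling is safe in the first case; in the second, a uniform sample will almost surely miss the three dominant rows, which is exactly what leverage-score sampling corrects.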