Distributed Machine Learning and Big Data
Sourangshu Bhattacharya
Dept. of Computer Science and Engineering, IIT Kharagpur
http://cse.iitkgp.ac.in/~sourangshu/
August 21, 2015
Sourangshu Bhattacharya (IITKGP), Distributed ML, August 21, 2015
Outline
1. Machine Learning and Big Data
   - Support Vector Machines
   - Stochastic Sub-gradient Descent
2. Distributed Optimization
   - ADMM
   - Convergence
   - Distributed Loss Minimization
   - Results
   - Development of ADMM
3. Applications and Extensions
   - Weighted Parameter Averaging
   - Fully-distributed SVM
What is Big Data?
- 6 billion web queries per day.
- 10 billion display advertisements per day.
- 30 billion text ads per day.
- 150 million credit card transactions per day.
- 100 billion emails per day.
Machine Learning on Big Data
- Classification: spam / not spam (100B emails).
- Multi-label classification: image tagging (14M images, 10K tags).
- Regression: CTR estimation (10B ad views).
- Ranking: web search (6B queries).
- Recommendation: online shopping (1.7B views in the US).
Classification Example
- Email spam classification.
- Features ($u_i$): vector of counts of all words.
- No. of features ($d$): words in the vocabulary (~100,000).
- No. of non-zero features per email: ~100.
- No. of emails per day: 100M.
- Size of training set using 30 days of data: 6 TB (assuming 20 B per non-zero feature).
- Time taken to read the data once: 41.67 hrs (at 20 MB per second).
- Solution: use multiple computers.
The Big Data Paradigm
- 3Vs: Volume, Variety, Velocity.
- Distributed systems. Chance of failure (assuming each computer fails independently with probability 0.01 in an hour):
  Computers:                      1      10     100
  Chance of a failure in an hour: 0.01   0.09   0.63
- Communication efficiency: data locality.
- Many systems: Hadoop, Spark, GraphLab, etc.
- Goal: implement machine learning algorithms on big data systems.
Binary Classification Problem
- A set of labeled datapoints: $S = \{(u_i, v_i),\ i = 1, \dots, n\}$, with $u_i \in \mathbb{R}^d$ and $v_i \in \{+1, -1\}$.
- Linear predictor function: $v = \mathrm{sign}(x^T u)$.
- Error function: $E = \sum_{i=1}^n \mathbf{1}(v_i x^T u_i \le 0)$.
Logistic Regression
- Probability of $v$: $P(v \mid u, x) = \sigma(v x^T u) = \frac{1}{1 + e^{-v x^T u}}$.
- Learning problem: given dataset $S$, estimate $x$.
- Maximizing the regularized log-likelihood (equivalently, minimizing the regularized negative log-likelihood):
  $x^* = \mathrm{argmin}_x \sum_{i=1}^n \log(1 + e^{-v_i x^T u_i}) + \frac{\lambda}{2} x^T x$
Convex Function
$f$ is a convex function if, for all $t \in [0, 1]$:
  $f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2)$
Convex Optimization
A convex optimization problem:
  minimize$_x$ $f(x)$
  subject to: $g_i(x) \le 0$, $i = 1, \dots, k$
where $f$ and the $g_i$ are convex functions.
For convex optimization problems, local optima are also global optima.
Optimization Algorithm: Gradient Descent
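The gradient descent iteration on this slide can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the one-dimensional quadratic objective is an assumed example with a known minimizer.

```python
def gradient_descent(grad, x0, alpha=0.1, iters=100):
    """Plain gradient descent: x_{k+1} = x_k - alpha * grad(x_k)."""
    x = x0
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3);
# the minimizer is x = 3.
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With a fixed step size and a smooth strongly convex objective, the iterates contract toward the minimizer geometrically.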
Classification Problem
SVM
- Separating hyperplane: $x^T u = 0$.
- Parallel hyperplanes (delimiting the margin): $x^T u = \pm 1$.
- Margin (perpendicular distance between the parallel hyperplanes): $\frac{2}{\|x\|}$.
- Correct classification of training datapoints: $v_i x^T u_i \ge 1,\ \forall i$.
- Allowing error (slack) $\xi_i$: $v_i x^T u_i \ge 1 - \xi_i,\ \forall i$.
- Max-margin formulation:
  $\min_{x, \xi} \frac{1}{2}\|x\|^2 + C \sum_{i=1}^n \xi_i$
  subject to: $v_i x^T u_i \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \dots, n$
SVM: Dual
- Lagrangian:
  $L = \frac{1}{2} x^T x + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i (1 - \xi_i - v_i x^T u_i) - \sum_{i=1}^n \mu_i \xi_i$
- Dual problem: $(x^*, \alpha^*, \mu^*) = \max_{\alpha, \mu \ge 0} \min_{x, \xi} L(x, \xi, \alpha, \mu)$
- For a strictly convex problem, the primal and dual optimal values coincide (strong duality).
- KKT conditions:
  $x = \sum_{i=1}^n \alpha_i v_i u_i$
  $C = \alpha_i + \mu_i$
SVM: Dual
The dual problem:
  $\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j v_i v_j u_i^T u_j$
  subject to: $0 \le \alpha_i \le C,\ \forall i$
- The dual is a quadratic programming problem in $n$ variables.
- Can be solved even if only the kernel values $k(u_i, u_j) = u_i^T u_j$ are given; dimension agnostic.
- Many efficient algorithms exist for solving it, e.g. SMO (Platt, 1999).
- Worst-case complexity is $O(n^3)$, usually $O(n^2)$.
SVM
A more compact form:
  $\min_x \sum_{i=1}^n \max(0, 1 - v_i x^T u_i) + \lambda \|x\|_2^2$
Or:
  $\min_x \sum_{i=1}^n l(x, u_i, v_i) + \lambda \Omega(x)$
Multi-class Classification
- There are $m$ classes: $v_i \in \{1, \dots, m\}$.
- Most popular scheme: $v_i = \mathrm{argmax}_{v \in \{1, \dots, m\}} x_v^T u_i$.
- Given example $(u_i, v_i)$, we want $x_{v_i}^T u_i \ge x_j^T u_i,\ \forall j \in \{1, \dots, m\}$.
- Using a margin of at least 1, the loss is:
  $l(u_i, v_i) = \max_{j \in \{1, \dots, v_i - 1, v_i + 1, \dots, m\}} \{0,\ 1 - (x_{v_i}^T u_i - x_j^T u_i)\}$
- Given dataset $D$, solve:
  $\min_{x_1, \dots, x_m} \sum_{i \in D} l(u_i, v_i) + \lambda \sum_{j=1}^m \|x_j\|^2$
- This can be extended to many settings, e.g. sequence labeling, learning to rank, etc.
General Learning Problems
- Support Vector Machines: $\min_x \sum_{i=1}^n \max\{0, 1 - v_i x^T u_i\} + \lambda \|x\|_2^2$
- Logistic Regression: $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$
- General form: $\min_x \sum_{i=1}^n l(x, u_i, v_i) + \lambda \Omega(x)$
  where $l$ is the loss function and $\Omega$ the regularizer.
Sub-gradient Descent
A sub-gradient of a (possibly non-differentiable) convex function $f$ at a point $x_0$ is a vector $g$ such that, for all $x$:
  $f(x) - f(x_0) \ge g^T (x - x_0)$
Sub-gradient Descent
- Randomly initialize $x_0$.
- Iterate: $x_k = x_{k-1} - t_k\, g(x_{k-1})$, $k = 1, 2, 3, \dots$, where $g$ is a sub-gradient of $f$ and $t_k = 1/k$.
- Track the best iterate: $f(x_{best}^{(k)}) = \min_{i=1, \dots, k} f(x_i)$.
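The sub-gradient method above can be sketched directly in Python. This is a minimal illustration under an assumed one-dimensional objective $f(x) = |x - 2|$ (my example, not from the slides), which is convex but non-differentiable at its minimizer.

```python
def subgradient_descent(f, subgrad, x0, iters=500):
    """Sub-gradient method with step size t_k = 1/k, tracking the best iterate,
    since the objective value need not decrease monotonically."""
    x, x_best = x0, x0
    for k in range(1, iters + 1):
        x = x - (1.0 / k) * subgrad(x)
        if f(x) < f(x_best):
            x_best = x
    return x_best

# f(x) = |x - 2| has sub-gradient sign(x - 2) away from 2, and 0 at 2.
f = lambda x: abs(x - 2.0)
g = lambda x: 1.0 if x > 2.0 else (-1.0 if x < 2.0 else 0.0)
x_best = subgradient_descent(f, g, x0=0.0)
```

Note the iterates oscillate around the minimizer with shrinking amplitude, which is why the best iterate, not the last one, is reported.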
Stochastic Sub-gradient Descent
- Convergence rate is $O(1/\sqrt{k})$.
- Each iteration takes $O(n)$ time.
- Reduce per-iteration time by computing the sub-gradient on a subset of examples: the stochastic sub-gradient.
- Inherently serial.
- Typical $O(1/\epsilon^2)$ behaviour (iterations to reach tolerance $\epsilon$).
Distributed Gradient Descent
- Divide the dataset into $m$ parts; each part is processed on one computer (total $m$ computers).
- There is one central computer; all computers can communicate with it via the network.
- Define $loss(x) = \sum_{j=1}^m \sum_{i \in C_j} l_i(x) + \lambda \Omega(x)$, where $l_i(x) = l(x, u_i, v_i)$.
- The gradient (for differentiable loss):
  $\nabla loss(x) = \sum_{j=1}^m \left( \sum_{i \in C_j} \nabla l_i(x) \right) + \lambda \nabla \Omega(x)$
- Compute $\nabla l^j(x) = \sum_{i \in C_j} \nabla l_i(x)$ on the $j$-th computer and communicate it to the central computer.
Distributed Gradient Descent
- Compute $\nabla loss(x) = \sum_{j=1}^m \nabla l^j(x) + \lambda \nabla \Omega(x)$ at the central computer.
- The gradient descent update: $x_{k+1} = x_k - \alpha \nabla loss(x_k)$, with $\alpha$ chosen by a (distributed) line search algorithm.
- For non-differentiable loss functions, we can use a distributed sub-gradient descent algorithm.
- Slow for most practical problems. For achieving $\epsilon$ tolerance:
  - Gradient descent (logistic regression): $O(1/\epsilon)$ iterations.
  - Sub-gradient descent: $O(1/\epsilon^2)$ iterations.
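The map-reduce structure of distributed gradient descent can be sketched as follows. This is an illustrative simulation, not from the slides: the "workers" are simulated in-process, a regularized least-squares loss stands in for the generic loss $l$, and the fixed step size replaces the line search.

```python
import numpy as np

def distributed_gradient_descent(parts, lam=0.1, alpha=0.01, iters=2000):
    """Each 'worker' j holds one shard (A_j, b_j) and returns its partial
    gradient A_j^T (A_j x - b_j); the central node sums the partial gradients,
    adds the regularizer gradient, and takes a descent step."""
    d = parts[0][0].shape[1]
    x = np.zeros(d)
    for _ in range(iters):
        partial = sum(A.T @ (A @ x - b) for A, b in parts)  # map + reduce
        x = x - alpha * (partial + lam * x)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 3))
b = A @ np.array([1.0, -2.0, 0.5])
shards = [(A[i::3], b[i::3]) for i in range(3)]   # 3 simulated workers
x = distributed_gradient_descent(shards)
# Summing shard gradients gives exactly the full-data gradient, so the
# result matches the centralized ridge-regression solution.
x_central = np.linalg.solve(A.T @ A + 0.1 * np.eye(3), A.T @ b)
```

The point of the sketch is that the update is mathematically identical to centralized gradient descent; only the gradient computation is distributed.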
Alternating Direction Method of Multipliers
Problem:
  minimize$_{x,z}$ $f(x) + g(z)$
  subject to: $Ax + Bz = c$
Algorithm (scaled form): iterate till convergence:
  $x^{k+1} = \mathrm{argmin}_x\ f(x) + \frac{\rho}{2} \|Ax + Bz^k - c + u^k\|_2^2$
  $z^{k+1} = \mathrm{argmin}_z\ g(z) + \frac{\rho}{2} \|Ax^{k+1} + Bz - c + u^k\|_2^2$
  $u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$
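The generic iteration above can be instantiated concretely. A standard example (following the Boyd et al. treatment the later slides cite) is the lasso, $\min \frac{1}{2}\|Ax - b\|^2 + \lambda\|z\|_1$ s.t. $x - z = 0$, where the $x$-update is a linear solve and the $z$-update is soft-thresholding. The data and parameter values here are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of the l1 norm: shrink each entry toward 0 by kappa."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    """ADMM for min (1/2)||Ax-b||^2 + lam*||z||_1  s.t.  x - z = 0.
    f is the quadratic (closed-form x-update), g is the l1 term."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    P = np.linalg.inv(A.T @ A + rho * np.eye(n))   # factor reused every iteration
    q = A.T @ b
    for _ in range(iters):
        x = P @ (q + rho * (z - u))                # x-update
        z = soft_threshold(x + u, lam / rho)       # z-update
        u = u + x - z                              # scaled dual update
    return z

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
b = A @ np.array([2.0, 0.0, 0.0, -1.5, 0.0]) + 0.01 * rng.standard_normal(50)
z = admm_lasso(A, b)
```

Both sub-problems are cheap here, which is the typical reason for splitting $f$ and $g$ in the first place.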
Stopping Criteria
Stop when the primal and dual residuals are small:
  $\|r^k\|_2 \le \epsilon^{pri}$ and $\|s^k\|_2 \le \epsilon^{dual}$
This is sensible since $\|r^k\|_2 \to 0$ and $\|s^k\|_2 \to 0$ as $k \to \infty$.
Observations
- The $x$-update requires solving an optimization problem of the form
  $\min_x f(x) + \frac{\rho}{2} \|Ax - v\|_2^2$, with $v = c - Bz^k - u^k$.
- Similarly for the $z$-update. Sometimes this has a closed form.
- ADMM is a meta optimization algorithm.
Convergence of ADMM
- Assumption 1: the functions $f: \mathbb{R}^n \to \mathbb{R}$ and $g: \mathbb{R}^m \to \mathbb{R}$ are closed, proper and convex.
  Same as assuming $\mathrm{epi}\, f = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} \mid f(x) \le t\}$ is closed and convex.
- Assumption 2: the unaugmented Lagrangian $L_0(x, z, y)$ has a saddle point $(x^*, z^*, y^*)$:
  $L_0(x^*, z^*, y) \le L_0(x^*, z^*, y^*) \le L_0(x, z, y^*)$ for all $x, z, y$.
Convergence of ADMM
- Primal residual: $r = Ax + Bz - c$
- Optimal objective: $p^* = \inf_{x,z} \{f(x) + g(z) \mid Ax + Bz = c\}$
- Convergence results:
  - Primal residual convergence: $r^k \to 0$ as $k \to \infty$.
  - Dual residual convergence: $s^k \to 0$ as $k \to \infty$.
  - Objective convergence: $f(x^k) + g(z^k) \to p^*$ as $k \to \infty$.
  - Dual variable convergence: $y^k \to y^*$ as $k \to \infty$.
Decomposition
- If $f$ is separable: $f(x) = f_1(x_1) + \dots + f_N(x_N)$, with $x = (x_1, \dots, x_N)$,
- and $A$ is conformably block separable, i.e. $A^T A$ is block diagonal,
- then the $x$-update splits into $N$ parallel updates of the $x_i$.
Consensus Optimization
Problem: $\min_x f(x) = \sum_{i=1}^N f_i(x)$
ADMM form:
  $\min_{x_i, z} \sum_{i=1}^N f_i(x_i)$
  s.t. $x_i - z = 0$, $i = 1, \dots, N$
Augmented Lagrangian:
  $L_\rho(x_1, \dots, x_N, z, y) = \sum_{i=1}^N \left( f_i(x_i) + y_i^T (x_i - z) + \frac{\rho}{2} \|x_i - z\|_2^2 \right)$
Consensus Optimization
ADMM algorithm:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT} (x_i - z^k) + \frac{\rho}{2} \|x_i - z^k\|_2^2 \right)$
  $z^{k+1} = \frac{1}{N} \sum_{i=1}^N \left( x_i^{k+1} + \frac{1}{\rho} y_i^k \right)$
  $y_i^{k+1} = y_i^k + \rho (x_i^{k+1} - z^{k+1})$
Final solution is $z^k$.
Consensus Optimization
- The $z$-update can be written as: $z^{k+1} = \bar{x}^{k+1} + \frac{1}{\rho} \bar{y}^k$
- Averaging the $y$-updates: $\bar{y}^{k+1} = \bar{y}^k + \rho (\bar{x}^{k+1} - z^{k+1})$
- Substituting the first into the second gives $\bar{y}^{k+1} = 0$; hence $z^k = \bar{x}^k$ after the first iteration.
- Revised algorithm:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT} (x_i - \bar{x}^k) + \frac{\rho}{2} \|x_i - \bar{x}^k\|_2^2 \right)$
  $y_i^{k+1} = y_i^k + \rho (x_i^{k+1} - \bar{x}^{k+1})$
- Final solution is $\bar{x}^k$.
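The revised consensus algorithm can be sketched for the simplest non-trivial choice of local objectives, $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ (an assumed example, chosen because the $x_i$-update then has a closed form and the consensus optimum is known to be the mean of the $c_i$).

```python
import numpy as np

def consensus_admm(cs, rho=1.0, iters=100):
    """Revised consensus ADMM for f_i(x) = (1/2)||x - c_i||^2.
    Setting the gradient of the x_i-subproblem to zero gives the closed form
    x_i = (c_i + rho*xbar - y_i) / (1 + rho)."""
    N, d = len(cs), cs[0].shape[0]
    C = np.array(cs)
    ys = np.zeros((N, d))
    xbar = np.zeros(d)
    for _ in range(iters):
        xs = (C + rho * xbar - ys) / (1.0 + rho)   # parallel x_i-updates
        xbar = xs.mean(axis=0)                     # gather + average
        ys = ys + rho * (xs - xbar)                # dual updates
    return xbar

cs = [np.array([1.0, 0.0]), np.array([3.0, 2.0]), np.array([2.0, 4.0])]
xbar = consensus_admm(cs)
# The minimizer of sum_i (1/2)||x - c_i||^2 is the mean of the c_i.
```

Each node only needs $\bar{x}$, so one vector is broadcast and one is gathered per iteration, matching the communication pattern the slides describe.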
Distributed Loss Minimization
Problem: $\min_x l(Ax - b) + r(x)$
Partition $A$ and $b$ by rows:
  $A = [A_1; \dots; A_N]$, $b = [b_1; \dots; b_N]$
where $A_i \in \mathbb{R}^{m_i \times n}$ and $b_i \in \mathbb{R}^{m_i}$.
ADMM formulation:
  $\min_{x_i, z} \sum_{i=1}^N l_i(A_i x_i - b_i) + r(z)$
  s.t.: $x_i - z = 0$, $i = 1, \dots, N$
Distributed Loss Minimization
ADMM solution:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} \left( l_i(A_i x_i - b_i) + \frac{\rho}{2} \|x_i - z^k + u_i^k\|_2^2 \right)$
  $z^{k+1} = \mathrm{argmin}_z \left( r(z) + \frac{N\rho}{2} \|z - \bar{x}^{k+1} - \bar{u}^k\|_2^2 \right)$
  $u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}$
ADMM Results
Logistic regression using the loss minimization formulation (Boyd et al.):
  $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$
Other Machine Learning Problems
- Ridge regression.
- Lasso.
- Multi-class SVM.
- Ranking.
- Structured output prediction.
ADMM Results
Lasso results (Boyd et al.).
ADMM Results
SVM primal residual.
ADMM Results
SVM accuracy.
Results
Risk and hyperplane.
Dual Ascent
Convex equality-constrained problem:
  $\min_x f(x)$ subject to: $Ax = b$
- Lagrangian: $L(x, y) = f(x) + y^T (Ax - b)$
- Dual function: $g(y) = \inf_x L(x, y)$
- Dual problem: $\max_y g(y)$
- Final solution: $x^* = \mathrm{argmin}_x L(x, y^*)$
Dual Ascent
Gradient ascent for the dual problem: $y^{k+1} = y^k + \alpha_k \nabla g(y^k)$,
where $\nabla g(y^k) = A\hat{x} - b$ and $\hat{x} = \mathrm{argmin}_x L(x, y^k)$.
Dual ascent algorithm:
  $x^{k+1} = \mathrm{argmin}_x L(x, y^k)$
  $y^{k+1} = y^k + \alpha_k (Ax^{k+1} - b)$
Assumptions:
- $L(x, y^k)$ is strictly convex in $x$; else the first step can have multiple solutions.
- $L(x, y^k)$ is bounded below.
Dual Decomposition
Suppose $f$ is separable: $f(x) = f_1(x_1) + \dots + f_N(x_N)$, $x = (x_1, \dots, x_N)$.
Then $L$ is separable in $x$:
  $L(x, y) = L_1(x_1, y) + \dots + L_N(x_N, y) - y^T b$, where $L_i(x_i, y) = f_i(x_i) + y^T A_i x_i$.
The $x$-minimization splits into $N$ separate problems:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} L_i(x_i, y^k)$
Dual Decomposition
Dual decomposition:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} L_i(x_i, y^k)$, $i = 1, \dots, N$
  $y^{k+1} = y^k + \alpha_k \left( \sum_{i=1}^N A_i x_i^{k+1} - b \right)$
Distributed solution:
- Scatter $y^k$ to the individual nodes.
- Compute $x_i$ in the $i$-th node (distributed step).
- Gather $A_i x_i$ from the $i$-th node.
All drawbacks of dual ascent remain.
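The scatter/compute/gather loop of dual decomposition can be sketched on an assumed toy instance: separable quadratics $f_i(x_i) = \frac{1}{2} a_i x_i^2$ coupled only by the constraint $\sum_i x_i = b$ (so each $A_i = 1$ and the local subproblem has the closed form $x_i = -y/a_i$). The data and step size are my illustrative choices.

```python
def dual_decomposition(a, b, alpha=0.2, iters=200):
    """Dual decomposition for min sum_i (1/2) a_i x_i^2  s.t.  sum_i x_i = b.
    Each x_i-update is solved independently (the 'scatter' step);
    the dual update only needs the gathered sum of the x_i."""
    y = 0.0
    for _ in range(iters):
        xs = [-y / ai for ai in a]         # local minimizers of L_i(x_i, y)
        y = y + alpha * (sum(xs) - b)      # dual gradient ascent step
    return xs

# With a = [1, 2] and b = 3, the optimum is x = (2, 1) (dual optimum y* = -2).
xs = dual_decomposition(a=[1.0, 2.0], b=3.0)
```

The dual variable acts as a "price" that the coordinator adjusts until the locally optimal responses happen to satisfy the coupling constraint.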
Method of Multipliers
- Make dual ascent work under more general conditions.
- Use the augmented Lagrangian:
  $L_\rho(x, y) = f(x) + y^T (Ax - b) + \frac{\rho}{2} \|Ax - b\|_2^2$
- Method of multipliers:
  $x^{k+1} = \mathrm{argmin}_x L_\rho(x, y^k)$
  $y^{k+1} = y^k + \rho (Ax^{k+1} - b)$
Method of Multipliers
Optimality conditions (for differentiable $f$):
- Primal feasibility: $Ax^* - b = 0$
- Dual feasibility: $\nabla f(x^*) + A^T y^* = 0$
Since $x^{k+1}$ minimizes $L_\rho(x, y^k)$:
  $0 = \nabla_x L_\rho(x^{k+1}, y^k) = \nabla f(x^{k+1}) + A^T (y^k + \rho (Ax^{k+1} - b)) = \nabla f(x^{k+1}) + A^T y^{k+1}$
So the dual update $y^{k+1} = y^k + \rho (Ax^{k+1} - b)$ makes $(x^{k+1}, y^{k+1})$ dual feasible.
Primal feasibility is achieved in the limit: $(Ax^{k+1} - b) \to 0$.
Alternating Direction Method of Multipliers
Problem with applying the standard method of multipliers for distributed optimization: there is no problem decomposition even if $f$ is separable, due to the squared term $\frac{\rho}{2} \|Ax - b\|_2^2$.
Alternating Direction Method of Multipliers
ADMM problem:
  $\min_{x,z} f(x) + g(z)$ subject to: $Ax + Bz = c$
Augmented Lagrangian:
  $L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \frac{\rho}{2} \|Ax + Bz - c\|_2^2$
ADMM:
  $x^{k+1} = \mathrm{argmin}_x L_\rho(x, z^k, y^k)$
  $z^{k+1} = \mathrm{argmin}_z L_\rho(x^{k+1}, z, y^k)$
  $y^{k+1} = y^k + \rho (Ax^{k+1} + Bz^{k+1} - c)$
Alternating Direction Method of Multipliers
- ADMM reduces to the method of multipliers if we minimize jointly over $x$ and $z$.
- Since we split the joint $(x, z)$ minimization step, the problem can be decomposed when $f$ and $g$ are separable.
ADMM Optimality Conditions
Optimality conditions (differentiable case):
- Primal feasibility: $Ax + Bz - c = 0$
- Dual feasibility: $\nabla f(x) + A^T y = 0$ and $\nabla g(z) + B^T y = 0$
Since $z^{k+1}$ minimizes $L_\rho(x^{k+1}, z, y^k)$:
  $0 = \nabla g(z^{k+1}) + B^T y^k + \rho B^T (Ax^{k+1} + Bz^{k+1} - c) = \nabla g(z^{k+1}) + B^T y^{k+1}$
So the dual variable update satisfies the second dual feasibility condition.
Primal feasibility and the first dual feasibility condition are satisfied asymptotically.
ADMM Optimality Conditions
Primal residual: $r^k = Ax^k + Bz^k - c$
Since $x^{k+1}$ minimizes $L_\rho(x, z^k, y^k)$:
  $0 = \nabla f(x^{k+1}) + A^T y^k + \rho A^T (Ax^{k+1} + Bz^k - c)$
  $\phantom{0} = \nabla f(x^{k+1}) + A^T (y^k + \rho r^{k+1} + \rho B (z^k - z^{k+1}))$
  $\phantom{0} = \nabla f(x^{k+1}) + A^T y^{k+1} + \rho A^T B (z^k - z^{k+1})$
or, $\rho A^T B (z^k - z^{k+1}) = -\left( \nabla f(x^{k+1}) + A^T y^{k+1} \right)$.
Hence $s^{k+1} = \rho A^T B (z^k - z^{k+1})$ can be thought of as a dual residual.
ADMM with Scaled Dual Variables
Combine the linear and quadratic terms: with residual $r = Ax + Bz - c$ and scaled dual variable $u = \frac{1}{\rho} y$,
  $y^T r + \frac{\rho}{2} \|r\|_2^2 = \frac{\rho}{2} \|r + u\|_2^2 - \frac{\rho}{2} \|u\|_2^2$
This yields the scaled-form updates:
  $x^{k+1} = \mathrm{argmin}_x\ f(x) + \frac{\rho}{2} \|Ax + Bz^k - c + u^k\|_2^2$
  $z^{k+1} = \mathrm{argmin}_z\ g(z) + \frac{\rho}{2} \|Ax^{k+1} + Bz - c + u^k\|_2^2$
  $u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$
Distributed Support Vector Machines
- Training dataset partitioned into $M$ partitions $S_m$, $m = 1, \dots, M$.
- Each partition has $L$ datapoints: $S_m = \{(x_{ml}, y_{ml})\}$, $l = 1, \dots, L$.
- Each partition can be processed locally on a single computer.
- Distributed SVM training problem:
  $\min_{w_m, z} \sum_{m=1}^M \sum_{l=1}^L loss(w_m; (x_{ml}, y_{ml})) + r(z)$
  s.t. $w_m - z = 0$, $m = 1, \dots, M$
Parameter Averaging
- Parameter averaging, also called mixture weights, was proposed for logistic regression. The results hold true for SVMs with a suitable sub-derivative.
- Locally learn an SVM on $S_m$:
  $\hat{w}_m = \mathrm{argmin}_w \frac{1}{L} \sum_{l=1}^L loss(w; x_{ml}, y_{ml}) + \lambda \|w\|^2$, $m = 1, \dots, M$
- The final SVM parameter is: $w_{PA} = \frac{1}{M} \sum_{m=1}^M \hat{w}_m$
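The parameter averaging scheme can be sketched end to end. This is an illustrative simulation: a regularized least-squares classifier stands in for the local hinge-loss SVM solver (my simplification, to keep the local fit in closed form), and the data is synthetic and linearly separable.

```python
import numpy as np

def local_fit(X, y, lam=0.1):
    """Stand-in local learner: regularized least-squares classifier,
    used here instead of a hinge-loss solver to keep the sketch short."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def parameter_averaging(parts, lam=0.1):
    """w_PA = (1/M) * sum_m w_m, each w_m trained only on its own partition."""
    ws = [local_fit(X, y, lam) for X, y in parts]
    return np.mean(ws, axis=0)

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 4))
w_true = np.array([1.0, -1.0, 2.0, 0.5])
y = np.sign(X @ w_true)
parts = [(X[m::3], y[m::3]) for m in range(3)]   # M = 3 partitions
w_pa = parameter_averaging(parts)
acc = np.mean(np.sign(X @ w_pa) == y)
```

Averaging requires a single round of communication, which is its appeal; the following slides show where uniform weights fall short as the number of partitions grows.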
Problem with Parameter Averaging
PA with varying number of partitions (toy dataset).
Weighted Parameter Averaging
- The final hypothesis is a weighted sum of the parameters $\hat{w}_m$:
  $w = \sum_{m=1}^M \beta_m \hat{w}_m$
- How to get the $\beta_m$?
- Notation: $\beta = [\beta_1, \dots, \beta_M]^T$, $\hat{W} = [\hat{w}_1, \dots, \hat{w}_M]$, so $w = \hat{W} \beta$.
Weighted Parameter Averaging
Find the optimal set of weights $\beta$ which attains the lowest regularized hinge loss:
  $\min_{\beta, \xi} \lambda \|\hat{W} \beta\|^2 + \frac{1}{ML} \sum_{m=1}^M \sum_{i=1}^L \xi_{mi}$
  subject to: $y_{mi} (\beta^T \hat{W}^T x_{mi}) \ge 1 - \xi_{mi}$, $\xi_{mi} \ge 0$, $m = 1, \dots, M$, $i = 1, \dots, L$
$\hat{W}$ is a pre-computed parameter.
Distributed Weighted Parameter Averaging
Distributed version of primal weighted parameter averaging:
  $\min_{\gamma_m, \beta} \frac{1}{ML} \sum_{m=1}^M \sum_{l=1}^L loss(\hat{W} \gamma_m; x_{ml}, y_{ml}) + r(\beta)$
  s.t. $\gamma_m - \beta = 0$, $m = 1, \dots, M$
where $r(\beta) = \lambda \|\hat{W} \beta\|^2$, $\gamma_m$ are the weights on the $m$-th computer, and $\beta$ is the consensus weight.
Distributed Weighted Parameter Averaging
Distributed algorithm using ADMM:
  $\gamma_m^{k+1} := \mathrm{argmin}_{\gamma_m} \left( \sum_{l=1}^L loss(\hat{W} \gamma_m; x_{ml}, y_{ml}) + \frac{\rho}{2} \|\gamma_m - \beta^k + u_m^k\|_2^2 \right)$
  $\beta^{k+1} := \mathrm{argmin}_\beta \left( r(\beta) + \frac{M\rho}{2} \|\beta - \bar{\gamma}^{k+1} - \bar{u}^k\|_2^2 \right)$
  $u_m^{k+1} = u_m^k + \gamma_m^{k+1} - \beta^{k+1}$
Here the $u_m$ are the scaled Lagrange multipliers, $\bar{\gamma} = \frac{1}{M} \sum_{m=1}^M \gamma_m$ and $\bar{u} = \frac{1}{M} \sum_{m=1}^M u_m$.
Toy Dataset: PA and WPA
PA (left) and WPA (right) with varying number of partitions (toy dataset).
Toy Dataset: PA and WPA
Accuracy of PA and WPA with varying number of partitions (toy dataset).
Real World Datasets
Epsilon (2000 features, 6000 datapoints): test set accuracy with varying number of partitions.
Real World Datasets
Gisette (5000 features, 6000 datapoints): test set accuracy with varying number of partitions.
Real World Datasets
Real-sim (20000 features, 3000 datapoints): test set accuracy with varying number of partitions.
Real World Datasets
Convergence of test accuracy with iterations (200 partitions).
Real World Datasets
Convergence of primal residual with iterations (200 partitions).
Distributed SVM on an Arbitrary Network
Motivations:
- Sensor networks.
- Corporate networks.
- Privacy.
Assumptions:
- Data is available at the nodes of the network.
- Communication is possible only along the edges of the network.
Distributed SVM on an Arbitrary Network
SVM optimization problem:
  $\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
  s.t.: $y_{jn} (w^T x_{jn} + b) \ge 1 - \xi_{jn}$, $\forall j \in J$, $n = 1, \dots, N_j$
        $\xi_{jn} \ge 0$, $\forall j \in J$, $n = 1, \dots, N_j$
Node $j$ has its own copy $w_j, b_j$. Distributed formulation:
  $\min_{\{w_j, b_j, \xi_{jn}\}} \frac{1}{2} \sum_{j=1}^J \|w_j\|^2 + JC \sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
  s.t.: $y_{jn} (w_j^T x_{jn} + b_j) \ge 1 - \xi_{jn}$, $\forall j \in J$, $n = 1, \dots, N_j$
        $\xi_{jn} \ge 0$, $\forall j \in J$, $n = 1, \dots, N_j$
        $w_j = w_i$, $b_j = b_i$, $\forall j$, $i \in B_j$
Algorithm
Using $v_j = [w_j^T\ b_j]^T$, $X_j = [[x_{j1}, \dots, x_{jN_j}]^T\ \mathbf{1}_j]$ and $Y_j = \mathrm{diag}([y_{j1}, \dots, y_{jN_j}])$:
  $\min_{\{v_j, \xi_{jn}, \omega_{ji}\}} \frac{1}{2} \sum_{j=1}^J r(v_j) + JC \sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
  s.t.: $Y_j X_j v_j \ge \mathbf{1} - \xi_j$, $\forall j \in J$
        $\xi_j \ge 0$, $\forall j \in J$
        $v_j = \omega_{ji}$, $\omega_{ji} = v_i$, $\forall j$, $i \in B_j$
Surrogate augmented Lagrangian:
  $L(\{v_j\}, \{\xi_j\}, \{\omega_{ji}\}, \{\alpha_{jik}\}) = \frac{1}{2} \sum_{j=1}^J r(v_j) + JC \sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
  $+ \sum_{j=1}^J \sum_{i \in B_j} \left( \alpha_{ji1}^T (v_j - \omega_{ji}) + \alpha_{ji2}^T (\omega_{ji} - v_i) \right) + \frac{\eta}{2} \sum_{j=1}^J \sum_{i \in B_j} \left( \|v_j - \omega_{ji}\|^2 + \|\omega_{ji} - v_i\|^2 \right)$
Algorithm
ADMM-based algorithm:
  $\{v_j^{t+1}, \xi_{jn}^{t+1}\} = \mathrm{argmin}_{\{v_j, \xi_j\} \in W} L(\{v_j\}, \{\xi_j\}, \{\omega_{ji}^t\}, \{\alpha_{jik}^t\})$
  $\{\omega_{ji}^{t+1}\} = \mathrm{argmin}_{\omega_{ji}} L(\{v_j^{t+1}\}, \{\xi_j^{t+1}\}, \{\omega_{ji}\}, \{\alpha_{jik}^t\})$
  $\alpha_{ji1}^{t+1} = \alpha_{ji1}^t + \eta (v_j^{t+1} - \omega_{ji}^{t+1})$
  $\alpha_{ji2}^{t+1} = \alpha_{ji2}^t + \eta (\omega_{ji}^{t+1} - v_i^{t+1})$
From the second equation:
  $\omega_{ji}^{t+1} = \frac{1}{2\eta} (\alpha_{ji1}^t - \alpha_{ji2}^t) + \frac{1}{2} (v_j^{t+1} + v_i^{t+1})$
Algorithm
Hence:
  $\alpha_{ji1}^{t+1} = \frac{1}{2} (\alpha_{ji1}^t + \alpha_{ji2}^t) + \frac{\eta}{2} (v_j^{t+1} - v_i^{t+1})$
  $\alpha_{ji2}^{t+1} = \frac{1}{2} (\alpha_{ji1}^t + \alpha_{ji2}^t) + \frac{\eta}{2} (v_j^{t+1} - v_i^{t+1})$
so the two multipliers coincide after one iteration, and thereafter $\omega_{ji}^{t+1} = \frac{1}{2} (v_j^{t+1} + v_i^{t+1})$.
Substituting this into the surrogate augmented Lagrangian, the linear term becomes:
  $\sum_{j=1}^J \sum_{i \in B_j} \alpha_{ji1}^T (v_j - v_i) = \sum_{j=1}^J v_j^T \sum_{i \in B_j} (\alpha_{ji1}^t - \alpha_{ij1}^t)$
Substitute $\alpha_j^t = \sum_{i \in B_j} (\alpha_{ji1}^t - \alpha_{ij1}^t)$.
Algorithm
The final algorithm:
  $\{v_j^{t+1}, \xi_{jn}^{t+1}\} = \mathrm{argmin}_{\{v_j, \xi_j\} \in W} L(\{v_j\}, \{\xi_j\}, \{\alpha_j^t\})$
  $\alpha_j^{t+1} = \alpha_j^t + \frac{\eta}{2} \sum_{i \in B_j} (v_j^{t+1} - v_i^{t+1})$
Thank you! Questions?