Distributed Machine Learning and Big Data
Sourangshu Bhattacharya
Dept. of Computer Science and Engineering, IIT Kharagpur
http://cse.iitkgp.ac.in/~sourangshu/
August 21, 2015
Sourangshu Bhattacharya (IITKGP), Distributed ML, August 21, 2015
Outline
1. Machine Learning and Big Data
   - Support Vector Machines
   - Stochastic Sub-gradient Descent
2. Distributed Optimization
   - ADMM
   - Convergence
   - Distributed Loss Minimization
   - Results
   - Development of ADMM
3. Applications and Extensions
   - Weighted Parameter Averaging
   - Fully-distributed SVM
What is Big Data?
- 6 billion web queries per day.
- 10 billion display advertisements per day.
- 30 billion text ads per day.
- 150 million credit card transactions per day.
- 100 billion emails per day.
Machine Learning on Big Data
- Classification: spam / not spam (100B emails).
- Multi-label classification: image tagging (14M images, 10K tags).
- Regression: CTR estimation (10B ad views).
- Ranking: web search (6B queries).
- Recommendation: online shopping (1.7B views in the US).
Classification Example
- Email spam classification.
- Features ($u_i$): vector of counts of all words.
- No. of features ($d$): words in the vocabulary (~100,000).
- No. of non-zero features per email: ~100.
- No. of emails per day: 100M.
- Size of training set using 30 days of data: 6 TB (assuming 20 B per non-zero feature).
- Time taken to read the data once: 41.67 hrs (at 20 MB per second).
- Solution: use multiple computers.
The Big Data Paradigm
- 3Vs: Volume, Variety, Velocity.
- Distributed systems. Chance of failure (assuming each computer fails independently with probability 0.01 in an hour):
  Computers:                      1      10     100
  Chance of a failure in an hour: 0.01   0.09   0.63
- Communication efficiency: data locality.
- Many systems: Hadoop, Spark, GraphLab, etc.
- Goal: implement machine learning algorithms on big data systems.
Binary Classification Problem
- A set of labeled datapoints: $S = \{(u_i, v_i),\ i = 1, \dots, n\}$, with $u_i \in \mathbb{R}^d$ and $v_i \in \{+1, -1\}$.
- Linear predictor function: $v = \mathrm{sign}(x^T u)$.
- Error function: $E = \sum_{i=1}^n \mathbf{1}(v_i x^T u_i \le 0)$.
Logistic Regression
- Probability of $v$: $P(v \mid u, x) = \sigma(v x^T u) = \frac{1}{1 + e^{-v x^T u}}$.
- Learning problem: given dataset $S$, estimate $x$.
- Maximizing the regularized log-likelihood (equivalently, minimizing the regularized negative log-likelihood):
  $x^* = \mathrm{argmin}_x \sum_{i=1}^n \log(1 + e^{-v_i x^T u_i}) + \frac{\lambda}{2} x^T x$
Convex Function
$f$ is a convex function if, for all $t \in [0, 1]$:
  $f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2)$
Convex Optimization
A convex optimization problem:
  minimize$_x$ $f(x)$
  subject to: $g_i(x) \le 0$, $i = 1, \dots, k$
where $f$ and the $g_i$ are convex functions.
For convex optimization problems, local optima are also global optima.
Optimization Algorithm: Gradient Descent
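The gradient descent iteration on this slide can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the one-dimensional quadratic objective is an assumed example with a known minimizer.

```python
def gradient_descent(grad, x0, alpha=0.1, iters=100):
    """Plain gradient descent: x_{k+1} = x_k - alpha * grad(x_k)."""
    x = x0
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3);
# the minimizer is x = 3.
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With a fixed step size and a smooth strongly convex objective, the iterates contract toward the minimizer geometrically.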
Classification Problem
SVM
- Separating hyperplane: $x^T u = 0$.
- Parallel hyperplanes (delimiting the margin): $x^T u = \pm 1$.
- Margin (perpendicular distance between the parallel hyperplanes): $\frac{2}{\|x\|}$.
- Correct classification of training datapoints: $v_i x^T u_i \ge 1,\ \forall i$.
- Allowing error (slack) $\xi_i$: $v_i x^T u_i \ge 1 - \xi_i,\ \forall i$.
- Max-margin formulation:
  $\min_{x, \xi} \frac{1}{2}\|x\|^2 + C \sum_{i=1}^n \xi_i$
  subject to: $v_i x^T u_i \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \dots, n$
SVM: Dual
- Lagrangian:
  $L = \frac{1}{2} x^T x + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i (1 - \xi_i - v_i x^T u_i) - \sum_{i=1}^n \mu_i \xi_i$
- Dual problem: $(x^*, \alpha^*, \mu^*) = \max_{\alpha, \mu \ge 0} \min_{x, \xi} L(x, \xi, \alpha, \mu)$
- For a strictly convex problem, the primal and dual optimal values coincide (strong duality).
- KKT conditions:
  $x = \sum_{i=1}^n \alpha_i v_i u_i$
  $C = \alpha_i + \mu_i$
SVM: Dual
The dual problem:
  $\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j v_i v_j u_i^T u_j$
  subject to: $0 \le \alpha_i \le C,\ \forall i$
- The dual is a quadratic programming problem in $n$ variables.
- Can be solved even if only the kernel values $k(u_i, u_j) = u_i^T u_j$ are given; dimension agnostic.
- Many efficient algorithms exist for solving it, e.g. SMO (Platt, 1999).
- Worst-case complexity is $O(n^3)$, usually $O(n^2)$.
SVM
A more compact form:
  $\min_x \sum_{i=1}^n \max(0, 1 - v_i x^T u_i) + \lambda \|x\|_2^2$
Or:
  $\min_x \sum_{i=1}^n l(x, u_i, v_i) + \lambda \Omega(x)$
Multi-class Classification
- There are $m$ classes: $v_i \in \{1, \dots, m\}$.
- Most popular scheme: $v_i = \mathrm{argmax}_{v \in \{1, \dots, m\}} x_v^T u_i$.
- Given example $(u_i, v_i)$, we want $x_{v_i}^T u_i \ge x_j^T u_i,\ \forall j \in \{1, \dots, m\}$.
- Using a margin of at least 1, the loss is:
  $l(u_i, v_i) = \max_{j \in \{1, \dots, v_i - 1, v_i + 1, \dots, m\}} \{0,\ 1 - (x_{v_i}^T u_i - x_j^T u_i)\}$
- Given dataset $D$, solve:
  $\min_{x_1, \dots, x_m} \sum_{i \in D} l(u_i, v_i) + \lambda \sum_{j=1}^m \|x_j\|^2$
- This can be extended to many settings, e.g. sequence labeling, learning to rank, etc.
General Learning Problems
- Support Vector Machines: $\min_x \sum_{i=1}^n \max\{0, 1 - v_i x^T u_i\} + \lambda \|x\|_2^2$
- Logistic Regression: $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$
- General form: $\min_x \sum_{i=1}^n l(x, u_i, v_i) + \lambda \Omega(x)$
  where $l$ is the loss function and $\Omega$ the regularizer.
Sub-gradient Descent
A sub-gradient of a (possibly non-differentiable) convex function $f$ at a point $x_0$ is a vector $g$ such that, for all $x$:
  $f(x) - f(x_0) \ge g^T (x - x_0)$
Sub-gradient Descent
- Randomly initialize $x_0$.
- Iterate: $x_k = x_{k-1} - t_k\, g(x_{k-1})$, $k = 1, 2, 3, \dots$, where $g$ is a sub-gradient of $f$ and $t_k = 1/k$.
- Track the best iterate: $f(x_{best}^{(k)}) = \min_{i=1, \dots, k} f(x_i)$.
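The sub-gradient method above can be sketched directly in Python. This is a minimal illustration under an assumed one-dimensional objective $f(x) = |x - 2|$ (my example, not from the slides), which is convex but non-differentiable at its minimizer.

```python
def subgradient_descent(f, subgrad, x0, iters=500):
    """Sub-gradient method with step size t_k = 1/k, tracking the best iterate,
    since the objective value need not decrease monotonically."""
    x, x_best = x0, x0
    for k in range(1, iters + 1):
        x = x - (1.0 / k) * subgrad(x)
        if f(x) < f(x_best):
            x_best = x
    return x_best

# f(x) = |x - 2| has sub-gradient sign(x - 2) away from 2, and 0 at 2.
f = lambda x: abs(x - 2.0)
g = lambda x: 1.0 if x > 2.0 else (-1.0 if x < 2.0 else 0.0)
x_best = subgradient_descent(f, g, x0=0.0)
```

Note the iterates oscillate around the minimizer with shrinking amplitude, which is why the best iterate, not the last one, is reported.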
Stochastic Sub-gradient Descent
- Convergence rate is $O(1/\sqrt{k})$.
- Each iteration takes $O(n)$ time.
- Reduce per-iteration time by computing the sub-gradient on a subset of examples: the stochastic sub-gradient.
- Inherently serial.
- Typical $O(1/\epsilon^2)$ behaviour (iterations to reach tolerance $\epsilon$).
Distributed Gradient Descent
- Divide the dataset into $m$ parts; each part is processed on one computer (total $m$ computers).
- There is one central computer; all computers can communicate with it via the network.
- Define $loss(x) = \sum_{j=1}^m \sum_{i \in C_j} l_i(x) + \lambda \Omega(x)$, where $l_i(x) = l(x, u_i, v_i)$.
- The gradient (for differentiable loss):
  $\nabla loss(x) = \sum_{j=1}^m \left( \sum_{i \in C_j} \nabla l_i(x) \right) + \lambda \nabla \Omega(x)$
- Compute $\nabla l^j(x) = \sum_{i \in C_j} \nabla l_i(x)$ on the $j$-th computer and communicate it to the central computer.
Distributed Gradient Descent
- Compute $\nabla loss(x) = \sum_{j=1}^m \nabla l^j(x) + \lambda \nabla \Omega(x)$ at the central computer.
- The gradient descent update: $x_{k+1} = x_k - \alpha \nabla loss(x_k)$, with $\alpha$ chosen by a (distributed) line search algorithm.
- For non-differentiable loss functions, we can use a distributed sub-gradient descent algorithm.
- Slow for most practical problems. For achieving $\epsilon$ tolerance:
  - Gradient descent (logistic regression): $O(1/\epsilon)$ iterations.
  - Sub-gradient descent: $O(1/\epsilon^2)$ iterations.
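The map-reduce structure of distributed gradient descent can be sketched as follows. This is an illustrative simulation, not from the slides: the "workers" are simulated in-process, a regularized least-squares loss stands in for the generic loss $l$, and the fixed step size replaces the line search.

```python
import numpy as np

def distributed_gradient_descent(parts, lam=0.1, alpha=0.01, iters=2000):
    """Each 'worker' j holds one shard (A_j, b_j) and returns its partial
    gradient A_j^T (A_j x - b_j); the central node sums the partial gradients,
    adds the regularizer gradient, and takes a descent step."""
    d = parts[0][0].shape[1]
    x = np.zeros(d)
    for _ in range(iters):
        partial = sum(A.T @ (A @ x - b) for A, b in parts)  # map + reduce
        x = x - alpha * (partial + lam * x)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 3))
b = A @ np.array([1.0, -2.0, 0.5])
shards = [(A[i::3], b[i::3]) for i in range(3)]   # 3 simulated workers
x = distributed_gradient_descent(shards)
# Summing shard gradients gives exactly the full-data gradient, so the
# result matches the centralized ridge-regression solution.
x_central = np.linalg.solve(A.T @ A + 0.1 * np.eye(3), A.T @ b)
```

The point of the sketch is that the update is mathematically identical to centralized gradient descent; only the gradient computation is distributed.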
Alternating Direction Method of Multipliers
Problem:
  minimize$_{x,z}$ $f(x) + g(z)$
  subject to: $Ax + Bz = c$
Algorithm (scaled form): iterate till convergence:
  $x^{k+1} = \mathrm{argmin}_x\ f(x) + \frac{\rho}{2} \|Ax + Bz^k - c + u^k\|_2^2$
  $z^{k+1} = \mathrm{argmin}_z\ g(z) + \frac{\rho}{2} \|Ax^{k+1} + Bz - c + u^k\|_2^2$
  $u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$
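The generic iteration above can be instantiated concretely. A standard example (following the Boyd et al. treatment the later slides cite) is the lasso, $\min \frac{1}{2}\|Ax - b\|^2 + \lambda\|z\|_1$ s.t. $x - z = 0$, where the $x$-update is a linear solve and the $z$-update is soft-thresholding. The data and parameter values here are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of the l1 norm: shrink each entry toward 0 by kappa."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    """ADMM for min (1/2)||Ax-b||^2 + lam*||z||_1  s.t.  x - z = 0.
    f is the quadratic (closed-form x-update), g is the l1 term."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    P = np.linalg.inv(A.T @ A + rho * np.eye(n))   # factor reused every iteration
    q = A.T @ b
    for _ in range(iters):
        x = P @ (q + rho * (z - u))                # x-update
        z = soft_threshold(x + u, lam / rho)       # z-update
        u = u + x - z                              # scaled dual update
    return z

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
b = A @ np.array([2.0, 0.0, 0.0, -1.5, 0.0]) + 0.01 * rng.standard_normal(50)
z = admm_lasso(A, b)
```

Both sub-problems are cheap here, which is the typical reason for splitting $f$ and $g$ in the first place.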
Stopping Criteria
Stop when the primal and dual residuals are small:
  $\|r^k\|_2 \le \epsilon^{pri}$ and $\|s^k\|_2 \le \epsilon^{dual}$
This is sensible since $\|r^k\|_2 \to 0$ and $\|s^k\|_2 \to 0$ as $k \to \infty$.
Observations
- The $x$-update requires solving an optimization problem of the form
  $\min_x f(x) + \frac{\rho}{2} \|Ax - v\|_2^2$, with $v = c - Bz^k - u^k$.
- Similarly for the $z$-update. Sometimes this has a closed form.
- ADMM is a meta optimization algorithm.
Convergence of ADMM
- Assumption 1: the functions $f: \mathbb{R}^n \to \mathbb{R}$ and $g: \mathbb{R}^m \to \mathbb{R}$ are closed, proper and convex.
  Same as assuming $\mathrm{epi}\, f = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} \mid f(x) \le t\}$ is closed and convex.
- Assumption 2: the unaugmented Lagrangian $L_0(x, z, y)$ has a saddle point $(x^*, z^*, y^*)$:
  $L_0(x^*, z^*, y) \le L_0(x^*, z^*, y^*) \le L_0(x, z, y^*)$ for all $x, z, y$.
Convergence of ADMM
- Primal residual: $r = Ax + Bz - c$
- Optimal objective: $p^* = \inf_{x,z} \{f(x) + g(z) \mid Ax + Bz = c\}$
- Convergence results:
  - Primal residual convergence: $r^k \to 0$ as $k \to \infty$.
  - Dual residual convergence: $s^k \to 0$ as $k \to \infty$.
  - Objective convergence: $f(x^k) + g(z^k) \to p^*$ as $k \to \infty$.
  - Dual variable convergence: $y^k \to y^*$ as $k \to \infty$.
Decomposition
- If $f$ is separable: $f(x) = f_1(x_1) + \dots + f_N(x_N)$, with $x = (x_1, \dots, x_N)$,
- and $A$ is conformably block separable, i.e. $A^T A$ is block diagonal,
- then the $x$-update splits into $N$ parallel updates of the $x_i$.
Consensus Optimization
Problem: $\min_x f(x) = \sum_{i=1}^N f_i(x)$
ADMM form:
  $\min_{x_i, z} \sum_{i=1}^N f_i(x_i)$
  s.t. $x_i - z = 0$, $i = 1, \dots, N$
Augmented Lagrangian:
  $L_\rho(x_1, \dots, x_N, z, y) = \sum_{i=1}^N \left( f_i(x_i) + y_i^T (x_i - z) + \frac{\rho}{2} \|x_i - z\|_2^2 \right)$
Consensus Optimization
ADMM algorithm:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT} (x_i - z^k) + \frac{\rho}{2} \|x_i - z^k\|_2^2 \right)$
  $z^{k+1} = \frac{1}{N} \sum_{i=1}^N \left( x_i^{k+1} + \frac{1}{\rho} y_i^k \right)$
  $y_i^{k+1} = y_i^k + \rho (x_i^{k+1} - z^{k+1})$
Final solution is $z^k$.
Consensus Optimization
- The $z$-update can be written as: $z^{k+1} = \bar{x}^{k+1} + \frac{1}{\rho} \bar{y}^k$
- Averaging the $y$-updates: $\bar{y}^{k+1} = \bar{y}^k + \rho (\bar{x}^{k+1} - z^{k+1})$
- Substituting the first into the second gives $\bar{y}^{k+1} = 0$; hence $z^k = \bar{x}^k$ after the first iteration.
- Revised algorithm:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT} (x_i - \bar{x}^k) + \frac{\rho}{2} \|x_i - \bar{x}^k\|_2^2 \right)$
  $y_i^{k+1} = y_i^k + \rho (x_i^{k+1} - \bar{x}^{k+1})$
- Final solution is $\bar{x}^k$.
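The revised consensus algorithm can be sketched for the simplest non-trivial choice of local objectives, $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ (an assumed example, chosen because the $x_i$-update then has a closed form and the consensus optimum is known to be the mean of the $c_i$).

```python
import numpy as np

def consensus_admm(cs, rho=1.0, iters=100):
    """Revised consensus ADMM for f_i(x) = (1/2)||x - c_i||^2.
    Setting the gradient of the x_i-subproblem to zero gives the closed form
    x_i = (c_i + rho*xbar - y_i) / (1 + rho)."""
    N, d = len(cs), cs[0].shape[0]
    C = np.array(cs)
    ys = np.zeros((N, d))
    xbar = np.zeros(d)
    for _ in range(iters):
        xs = (C + rho * xbar - ys) / (1.0 + rho)   # parallel x_i-updates
        xbar = xs.mean(axis=0)                     # gather + average
        ys = ys + rho * (xs - xbar)                # dual updates
    return xbar

cs = [np.array([1.0, 0.0]), np.array([3.0, 2.0]), np.array([2.0, 4.0])]
xbar = consensus_admm(cs)
# The minimizer of sum_i (1/2)||x - c_i||^2 is the mean of the c_i.
```

Each node only needs $\bar{x}$, so one vector is broadcast and one is gathered per iteration, matching the communication pattern the slides describe.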
Distributed Loss Minimization
Problem: $\min_x l(Ax - b) + r(x)$
Partition $A$ and $b$ by rows:
  $A = [A_1; \dots; A_N]$, $b = [b_1; \dots; b_N]$
where $A_i \in \mathbb{R}^{m_i \times n}$ and $b_i \in \mathbb{R}^{m_i}$.
ADMM formulation:
  $\min_{x_i, z} \sum_{i=1}^N l_i(A_i x_i - b_i) + r(z)$
  s.t.: $x_i - z = 0$, $i = 1, \dots, N$
Distributed Loss Minimization
ADMM solution:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} \left( l_i(A_i x_i - b_i) + \frac{\rho}{2} \|x_i - z^k + u_i^k\|_2^2 \right)$
  $z^{k+1} = \mathrm{argmin}_z \left( r(z) + \frac{N\rho}{2} \|z - \bar{x}^{k+1} - \bar{u}^k\|_2^2 \right)$
  $u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}$
ADMM Results
Logistic regression using the loss minimization formulation (Boyd et al.):
  $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$
Other Machine Learning Problems
- Ridge regression.
- Lasso.
- Multi-class SVM.
- Ranking.
- Structured output prediction.
ADMM Results
Lasso results (Boyd et al.).
ADMM Results
SVM primal residual.
ADMM Results
SVM accuracy.
Results
Risk and hyperplane.
Dual Ascent
Convex equality-constrained problem:
  $\min_x f(x)$ subject to: $Ax = b$
- Lagrangian: $L(x, y) = f(x) + y^T (Ax - b)$
- Dual function: $g(y) = \inf_x L(x, y)$
- Dual problem: $\max_y g(y)$
- Final solution: $x^* = \mathrm{argmin}_x L(x, y^*)$
Dual Ascent
Gradient ascent for the dual problem: $y^{k+1} = y^k + \alpha_k \nabla g(y^k)$,
where $\nabla g(y^k) = A\hat{x} - b$ and $\hat{x} = \mathrm{argmin}_x L(x, y^k)$.
Dual ascent algorithm:
  $x^{k+1} = \mathrm{argmin}_x L(x, y^k)$
  $y^{k+1} = y^k + \alpha_k (Ax^{k+1} - b)$
Assumptions:
- $L(x, y^k)$ is strictly convex in $x$; else the first step can have multiple solutions.
- $L(x, y^k)$ is bounded below.
Dual Decomposition
Suppose $f$ is separable: $f(x) = f_1(x_1) + \dots + f_N(x_N)$, $x = (x_1, \dots, x_N)$.
Then $L$ is separable in $x$:
  $L(x, y) = L_1(x_1, y) + \dots + L_N(x_N, y) - y^T b$, where $L_i(x_i, y) = f_i(x_i) + y^T A_i x_i$.
The $x$-minimization splits into $N$ separate problems:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} L_i(x_i, y^k)$
Dual Decomposition
Dual decomposition:
  $x_i^{k+1} = \mathrm{argmin}_{x_i} L_i(x_i, y^k)$, $i = 1, \dots, N$
  $y^{k+1} = y^k + \alpha_k \left( \sum_{i=1}^N A_i x_i^{k+1} - b \right)$
Distributed solution:
- Scatter $y^k$ to the individual nodes.
- Compute $x_i$ in the $i$-th node (distributed step).
- Gather $A_i x_i$ from the $i$-th node.
All drawbacks of dual ascent remain.
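The scatter/compute/gather loop of dual decomposition can be sketched on an assumed toy instance: separable quadratics $f_i(x_i) = \frac{1}{2} a_i x_i^2$ coupled only by the constraint $\sum_i x_i = b$ (so each $A_i = 1$ and the local subproblem has the closed form $x_i = -y/a_i$). The data and step size are my illustrative choices.

```python
def dual_decomposition(a, b, alpha=0.2, iters=200):
    """Dual decomposition for min sum_i (1/2) a_i x_i^2  s.t.  sum_i x_i = b.
    Each x_i-update is solved independently (the 'scatter' step);
    the dual update only needs the gathered sum of the x_i."""
    y = 0.0
    for _ in range(iters):
        xs = [-y / ai for ai in a]         # local minimizers of L_i(x_i, y)
        y = y + alpha * (sum(xs) - b)      # dual gradient ascent step
    return xs

# With a = [1, 2] and b = 3, the optimum is x = (2, 1) (dual optimum y* = -2).
xs = dual_decomposition(a=[1.0, 2.0], b=3.0)
```

The dual variable acts as a "price" that the coordinator adjusts until the locally optimal responses happen to satisfy the coupling constraint.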
Method of Multipliers
- Make dual ascent work under more general conditions.
- Use the augmented Lagrangian:
  $L_\rho(x, y) = f(x) + y^T (Ax - b) + \frac{\rho}{2} \|Ax - b\|_2^2$
- Method of multipliers:
  $x^{k+1} = \mathrm{argmin}_x L_\rho(x, y^k)$
  $y^{k+1} = y^k + \rho (Ax^{k+1} - b)$
Method of Multipliers
Optimality conditions (for differentiable $f$):
- Primal feasibility: $Ax^* - b = 0$
- Dual feasibility: $\nabla f(x^*) + A^T y^* = 0$
Since $x^{k+1}$ minimizes $L_\rho(x, y^k)$:
  $0 = \nabla_x L_\rho(x^{k+1}, y^k) = \nabla f(x^{k+1}) + A^T (y^k + \rho (Ax^{k+1} - b)) = \nabla f(x^{k+1}) + A^T y^{k+1}$
So the dual update $y^{k+1} = y^k + \rho (Ax^{k+1} - b)$ makes $(x^{k+1}, y^{k+1})$ dual feasible.
Primal feasibility is achieved in the limit: $(Ax^{k+1} - b) \to 0$.
Alternating Direction Method of Multipliers
Problem with applying the standard method of multipliers for distributed optimization: there is no problem decomposition even if $f$ is separable, due to the squared term $\frac{\rho}{2} \|Ax - b\|_2^2$.
Alternating Direction Method of Multipliers
ADMM problem:
  $\min_{x,z} f(x) + g(z)$ subject to: $Ax + Bz = c$
Augmented Lagrangian:
  $L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \frac{\rho}{2} \|Ax + Bz - c\|_2^2$
ADMM:
  $x^{k+1} = \mathrm{argmin}_x L_\rho(x, z^k, y^k)$
  $z^{k+1} = \mathrm{argmin}_z L_\rho(x^{k+1}, z, y^k)$
  $y^{k+1} = y^k + \rho (Ax^{k+1} + Bz^{k+1} - c)$
Alternating Direction Method of Multipliers
- ADMM reduces to the method of multipliers if we minimize jointly over $x$ and $z$.
- Since we split the joint $(x, z)$ minimization step, the problem can be decomposed when $f$ and $g$ are separable.
ADMM Optimality Conditions
Optimality conditions (differentiable case):
- Primal feasibility: $Ax + Bz - c = 0$
- Dual feasibility: $\nabla f(x) + A^T y = 0$ and $\nabla g(z) + B^T y = 0$
Since $z^{k+1}$ minimizes $L_\rho(x^{k+1}, z, y^k)$:
  $0 = \nabla g(z^{k+1}) + B^T y^k + \rho B^T (Ax^{k+1} + Bz^{k+1} - c) = \nabla g(z^{k+1}) + B^T y^{k+1}$
So the dual variable update satisfies the second dual feasibility condition.
Primal feasibility and the first dual feasibility condition are satisfied asymptotically.
ADMM Optimality Conditions
Primal residual: $r^k = Ax^k + Bz^k - c$
Since $x^{k+1}$ minimizes $L_\rho(x, z^k, y^k)$:
  $0 = \nabla f(x^{k+1}) + A^T y^k + \rho A^T (Ax^{k+1} + Bz^k - c)$
  $\phantom{0} = \nabla f(x^{k+1}) + A^T (y^k + \rho r^{k+1} + \rho B (z^k - z^{k+1}))$
  $\phantom{0} = \nabla f(x^{k+1}) + A^T y^{k+1} + \rho A^T B (z^k - z^{k+1})$
or, $\rho A^T B (z^k - z^{k+1}) = -\left( \nabla f(x^{k+1}) + A^T y^{k+1} \right)$.
Hence $s^{k+1} = \rho A^T B (z^k - z^{k+1})$ can be thought of as a dual residual.
ADMM with Scaled Dual Variables
Combine the linear and quadratic terms: with residual $r = Ax + Bz - c$ and scaled dual variable $u = \frac{1}{\rho} y$,
  $y^T r + \frac{\rho}{2} \|r\|_2^2 = \frac{\rho}{2} \|r + u\|_2^2 - \frac{\rho}{2} \|u\|_2^2$
This yields the scaled-form updates:
  $x^{k+1} = \mathrm{argmin}_x\ f(x) + \frac{\rho}{2} \|Ax + Bz^k - c + u^k\|_2^2$
  $z^{k+1} = \mathrm{argmin}_z\ g(z) + \frac{\rho}{2} \|Ax^{k+1} + Bz - c + u^k\|_2^2$
  $u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$
Distributed Support Vector Machines
- Training dataset partitioned into $M$ partitions $S_m$, $m = 1, \dots, M$.
- Each partition has $L$ datapoints: $S_m = \{(x_{ml}, y_{ml})\}$, $l = 1, \dots, L$.
- Each partition can be processed locally on a single computer.
- Distributed SVM training problem:
  $\min_{w_m, z} \sum_{m=1}^M \sum_{l=1}^L loss(w_m; (x_{ml}, y_{ml})) + r(z)$
  s.t. $w_m - z = 0$, $m = 1, \dots, M$
Parameter Averaging
- Parameter averaging, also called mixture weights, was proposed for logistic regression. The results hold true for SVMs with a suitable sub-derivative.
- Locally learn an SVM on $S_m$:
  $\hat{w}_m = \mathrm{argmin}_w \frac{1}{L} \sum_{l=1}^L loss(w; x_{ml}, y_{ml}) + \lambda \|w\|^2$, $m = 1, \dots, M$
- The final SVM parameter is: $w_{PA} = \frac{1}{M} \sum_{m=1}^M \hat{w}_m$
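The parameter averaging scheme can be sketched end to end. This is an illustrative simulation: a regularized least-squares classifier stands in for the local hinge-loss SVM solver (my simplification, to keep the local fit in closed form), and the data is synthetic and linearly separable.

```python
import numpy as np

def local_fit(X, y, lam=0.1):
    """Stand-in local learner: regularized least-squares classifier,
    used here instead of a hinge-loss solver to keep the sketch short."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def parameter_averaging(parts, lam=0.1):
    """w_PA = (1/M) * sum_m w_m, each w_m trained only on its own partition."""
    ws = [local_fit(X, y, lam) for X, y in parts]
    return np.mean(ws, axis=0)

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 4))
w_true = np.array([1.0, -1.0, 2.0, 0.5])
y = np.sign(X @ w_true)
parts = [(X[m::3], y[m::3]) for m in range(3)]   # M = 3 partitions
w_pa = parameter_averaging(parts)
acc = np.mean(np.sign(X @ w_pa) == y)
```

Averaging requires a single round of communication, which is its appeal; the following slides show where uniform weights fall short as the number of partitions grows.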
Problem with Parameter Averaging
PA with varying number of partitions (toy dataset).
Weighted Parameter Averaging
- The final hypothesis is a weighted sum of the parameters $\hat{w}_m$:
  $w = \sum_{m=1}^M \beta_m \hat{w}_m$
- How to get the $\beta_m$?
- Notation: $\beta = [\beta_1, \dots, \beta_M]^T$, $\hat{W} = [\hat{w}_1, \dots, \hat{w}_M]$, so $w = \hat{W} \beta$.
Weighted Parameter Averaging
Find the optimal set of weights $\beta$ which attains the lowest regularized hinge loss:
  $\min_{\beta, \xi} \lambda \|\hat{W} \beta\|^2 + \frac{1}{ML} \sum_{m=1}^M \sum_{i=1}^L \xi_{mi}$
  subject to: $y_{mi} (\beta^T \hat{W}^T x_{mi}) \ge 1 - \xi_{mi}$, $\xi_{mi} \ge 0$, $m = 1, \dots, M$, $i = 1, \dots, L$
$\hat{W}$ is a pre-computed parameter.
Distributed Weighted Parameter Averaging
Distributed version of primal weighted parameter averaging:
  $\min_{\gamma_m, \beta} \frac{1}{ML} \sum_{m=1}^M \sum_{l=1}^L loss(\hat{W} \gamma_m; x_{ml}, y_{ml}) + r(\beta)$
  s.t. $\gamma_m - \beta = 0$, $m = 1, \dots, M$
where $r(\beta) = \lambda \|\hat{W} \beta\|^2$, $\gamma_m$ are the weights on the $m$-th computer, and $\beta$ is the consensus weight.
Distributed Weighted Parameter Averaging
Distributed algorithm using ADMM:
  $\gamma_m^{k+1} := \mathrm{argmin}_{\gamma_m} \left( \sum_{l=1}^L loss(\hat{W} \gamma_m; x_{ml}, y_{ml}) + \frac{\rho}{2} \|\gamma_m - \beta^k + u_m^k\|_2^2 \right)$
  $\beta^{k+1} := \mathrm{argmin}_\beta \left( r(\beta) + \frac{M\rho}{2} \|\beta - \bar{\gamma}^{k+1} - \bar{u}^k\|_2^2 \right)$
  $u_m^{k+1} = u_m^k + \gamma_m^{k+1} - \beta^{k+1}$
Here the $u_m$ are the scaled Lagrange multipliers, $\bar{\gamma} = \frac{1}{M} \sum_{m=1}^M \gamma_m$ and $\bar{u} = \frac{1}{M} \sum_{m=1}^M u_m$.
Toy Dataset: PA and WPA
PA (left) and WPA (right) with varying number of partitions (toy dataset).
Toy Dataset: PA and WPA
Accuracy of PA and WPA with varying number of partitions (toy dataset).
Real World Datasets
Epsilon (2000 features, 6000 datapoints): test set accuracy with varying number of partitions.
Real World Datasets
Gisette (5000 features, 6000 datapoints): test set accuracy with varying number of partitions.
Real World Datasets
Real-sim (20000 features, 3000 datapoints): test set accuracy with varying number of partitions.
Real World Datasets
Convergence of test accuracy with iterations (200 partitions).
Real World Datasets
Convergence of primal residual with iterations (200 partitions).
Distributed SVM on an Arbitrary Network
Motivations:
- Sensor networks.
- Corporate networks.
- Privacy.
Assumptions:
- Data is available at the nodes of the network.
- Communication is possible only along the edges of the network.
Distributed SVM on an Arbitrary Network
SVM optimization problem:
  $\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
  s.t.: $y_{jn} (w^T x_{jn} + b) \ge 1 - \xi_{jn}$, $\forall j \in J$, $n = 1, \dots, N_j$
        $\xi_{jn} \ge 0$, $\forall j \in J$, $n = 1, \dots, N_j$
Node $j$ has its own copy $w_j, b_j$. Distributed formulation:
  $\min_{\{w_j, b_j, \xi_{jn}\}} \frac{1}{2} \sum_{j=1}^J \|w_j\|^2 + JC \sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
  s.t.: $y_{jn} (w_j^T x_{jn} + b_j) \ge 1 - \xi_{jn}$, $\forall j \in J$, $n = 1, \dots, N_j$
        $\xi_{jn} \ge 0$, $\forall j \in J$, $n = 1, \dots, N_j$
        $w_j = w_i$, $b_j = b_i$, $\forall j$, $i \in B_j$
Algorithm
Using $v_j = [w_j^T\ b_j]^T$, $X_j = [[x_{j1}, \dots, x_{jN_j}]^T\ \mathbf{1}_j]$ and $Y_j = \mathrm{diag}([y_{j1}, \dots, y_{jN_j}])$:
  $\min_{\{v_j, \xi_{jn}, \omega_{ji}\}} \frac{1}{2} \sum_{j=1}^J r(v_j) + JC \sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
  s.t.: $Y_j X_j v_j \ge \mathbf{1} - \xi_j$, $\forall j \in J$
        $\xi_j \ge 0$, $\forall j \in J$
        $v_j = \omega_{ji}$, $\omega_{ji} = v_i$, $\forall j$, $i \in B_j$
Surrogate augmented Lagrangian:
  $L(\{v_j\}, \{\xi_j\}, \{\omega_{ji}\}, \{\alpha_{jik}\}) = \frac{1}{2} \sum_{j=1}^J r(v_j) + JC \sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
  $+ \sum_{j=1}^J \sum_{i \in B_j} \left( \alpha_{ji1}^T (v_j - \omega_{ji}) + \alpha_{ji2}^T (\omega_{ji} - v_i) \right) + \frac{\eta}{2} \sum_{j=1}^J \sum_{i \in B_j} \left( \|v_j - \omega_{ji}\|^2 + \|\omega_{ji} - v_i\|^2 \right)$
Algorithm
ADMM-based algorithm:
  $\{v_j^{t+1}, \xi_{jn}^{t+1}\} = \mathrm{argmin}_{\{v_j, \xi_j\} \in W} L(\{v_j\}, \{\xi_j\}, \{\omega_{ji}^t\}, \{\alpha_{jik}^t\})$
  $\{\omega_{ji}^{t+1}\} = \mathrm{argmin}_{\omega_{ji}} L(\{v_j^{t+1}\}, \{\xi_j^{t+1}\}, \{\omega_{ji}\}, \{\alpha_{jik}^t\})$
  $\alpha_{ji1}^{t+1} = \alpha_{ji1}^t + \eta (v_j^{t+1} - \omega_{ji}^{t+1})$
  $\alpha_{ji2}^{t+1} = \alpha_{ji2}^t + \eta (\omega_{ji}^{t+1} - v_i^{t+1})$
From the second equation:
  $\omega_{ji}^{t+1} = \frac{1}{2\eta} (\alpha_{ji1}^t - \alpha_{ji2}^t) + \frac{1}{2} (v_j^{t+1} + v_i^{t+1})$
Algorithm
Hence:
  $\alpha_{ji1}^{t+1} = \frac{1}{2} (\alpha_{ji1}^t + \alpha_{ji2}^t) + \frac{\eta}{2} (v_j^{t+1} - v_i^{t+1})$
  $\alpha_{ji2}^{t+1} = \frac{1}{2} (\alpha_{ji1}^t + \alpha_{ji2}^t) + \frac{\eta}{2} (v_j^{t+1} - v_i^{t+1})$
so the two multipliers coincide after one iteration, and thereafter $\omega_{ji}^{t+1} = \frac{1}{2} (v_j^{t+1} + v_i^{t+1})$.
Substituting this into the surrogate augmented Lagrangian, the linear term becomes:
  $\sum_{j=1}^J \sum_{i \in B_j} \alpha_{ji1}^T (v_j - v_i) = \sum_{j=1}^J v_j^T \sum_{i \in B_j} (\alpha_{ji1}^t - \alpha_{ij1}^t)$
Substitute $\alpha_j^t = \sum_{i \in B_j} (\alpha_{ji1}^t - \alpha_{ij1}^t)$.
Algorithm
The final algorithm:
  $\{v_j^{t+1}, \xi_{jn}^{t+1}\} = \mathrm{argmin}_{\{v_j, \xi_j\} \in W} L(\{v_j\}, \{\xi_j\}, \{\alpha_j^t\})$
  $\alpha_j^{t+1} = \alpha_j^t + \frac{\eta}{2} \sum_{i \in B_j} (v_j^{t+1} - v_i^{t+1})$
Thank you! Questions?