Distributed Machine Learning and Big Data


1 Distributed Machine Learning and Big Data. Sourangshu Bhattacharya, Dept. of Computer Science and Engineering, IIT Kharagpur. August 21, 2015.

2 Outline
1. Machine Learning and Big Data: Support Vector Machines; Stochastic Sub-gradient Descent.
2. Distributed Optimization: ADMM; Convergence; Distributed Loss Minimization; Results; Development of ADMM.
3. Applications and extensions: Weighted Parameter Averaging; Fully-distributed SVM.

3 Machine Learning and Big Data: What is Big Data?
6 billion web queries per day. 10 billion display advertisements per day. 30 billion text ads per day. 150 million credit card transactions per day. 100 billion emails per day.

4 Machine Learning and Big Data: Machine Learning on Big Data
Classification: spam / no spam, 100B emails. Multi-label classification: image tagging, 14M images, 10K tags. Regression: CTR estimation, 10B ad views. Ranking: web search, 6B queries. Recommendation: online shopping, 1.7B views in the US.

5 Machine Learning and Big Data: Classification Example
Email spam classification. Features ($u_i$): vector of counts of all words. No. of features ($d$): words in vocabulary (about 100,000). No. of non-zero features: about 100. No. of emails per day: 100 M. Size of training set using 30 days of data: 6 TB (assuming 20 B per entry). Time taken to read the data once: roughly 83 hrs (at 20 MB per second). Solution: use multiple computers.

6 Machine Learning and Big Data: Big Data Paradigm
The 3 V's: Volume, Variety, Velocity. Distributed system: as the number of computers grows, so does the chance of a failure within any given hour. Communication efficiency: data locality. Many systems: Hadoop, Spark, GraphLab, etc. Goal: implement machine learning algorithms on Big Data systems.

7 Machine Learning and Big Data: Binary Classification Problem
A set of labeled datapoints: $S = \{(u_i, v_i), i = 1, \ldots, n\}$, with $u_i \in \mathbb{R}^d$ and $v_i \in \{+1, -1\}$.
Linear predictor function: $v = \mathrm{sign}(x^T u)$.
Error function: $E = \sum_{i=1}^n \mathbf{1}(v_i x^T u_i \le 0)$.
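
To make the notation concrete, here is a minimal numpy sketch of the linear predictor and the 0-1 error count above; the array names (U for the stacked $u_i$, v for the labels) and the random data are illustrative, not from the slides.

```python
import numpy as np

def predict(x, U):
    """Linear predictor: sign(x^T u) for each row u of U."""
    return np.sign(U @ x)

def zero_one_error(x, U, v):
    """E = sum_i 1(v_i * x^T u_i <= 0)."""
    return int(np.sum(v * (U @ x) <= 0))

rng = np.random.default_rng(0)
U = rng.normal(size=(100, 5))        # n = 100 datapoints, d = 5 features
x_true = rng.normal(size=5)
v = np.sign(U @ x_true)              # labels from a separating hyperplane
print(zero_one_error(x_true, U, v))  # 0 on this separable toy data
```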

8 Machine Learning and Big Data: Logistic Regression
Probability of $v$: $P(v \mid u, x) = \sigma(v x^T u) = \frac{1}{1 + e^{-v x^T u}}$.
Learning problem: given dataset $S$, estimate $x$ by minimizing the regularized negative log likelihood:
$x^* = \arg\min_x \sum_{i=1}^n \log(1 + e^{-v_i x^T u_i}) + \frac{\lambda}{2} x^T x$

9 Machine Learning and Big Data: Convex Function
$f$ is a convex function if $f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)$ for all $t \in [0, 1]$.

10 Machine Learning and Big Data: Convex Optimization
Convex optimization problem: minimize $f(x)$ over $x$ subject to $g_i(x) \le 0$, $i = 1, \ldots, k$, where $f$ and the $g_i$ are convex functions. For convex optimization problems, local optima are also global optima.

11 Machine Learning and Big Data: Optimization Algorithm: Gradient Descent.
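
As an illustration of gradient descent applied to the regularized logistic-regression objective defined two slides earlier, here is a minimal numpy sketch; the step size, iteration count and synthetic data are arbitrary choices of mine, not from the slides.

```python
import numpy as np

def grad(x, U, v, lam):
    """Gradient of sum_i log(1 + exp(-v_i x^T u_i)) + (lam/2) x^T x."""
    m = v * (U @ x)                          # margins v_i x^T u_i
    return -(U.T @ (v / (1.0 + np.exp(m)))) + lam * x

def gradient_descent(U, v, lam=1.0, alpha=0.01, iters=500):
    x = np.zeros(U.shape[1])
    for _ in range(iters):
        x -= alpha * grad(x, U, v, lam)      # x <- x - alpha * gradient
    return x

rng = np.random.default_rng(0)
U = rng.normal(size=(200, 10))
v = np.sign(U @ rng.normal(size=10))
x = gradient_descent(U, v)
print(np.mean(np.sign(U @ x) == v))          # training accuracy
```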

12 Machine Learning and Big Data, Support Vector Machines: Classification Problem.

13 Machine Learning and Big Data, Support Vector Machines: SVM
Separating hyperplane: $x^T u = 0$. Parallel hyperplanes (defining the margin): $x^T u = \pm 1$. Margin (perpendicular distance between the parallel hyperplanes): $\frac{2}{\|x\|}$.
Correct classification of training datapoints: $v_i x^T u_i \ge 1$, $\forall i$. Allowing error (slack) $\xi_i$: $v_i x^T u_i \ge 1 - \xi_i$, $\forall i$.
Max-margin formulation:
$\min_{x, \xi} \frac{1}{2}\|x\|^2 + C \sum_{i=1}^n \xi_i$ subject to $v_i x^T u_i \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \ldots, n$.

14 Machine Learning and Big Data, Support Vector Machines: SVM Dual
Lagrangian: $L = \frac{1}{2} x^T x + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i (1 - \xi_i - v_i x^T u_i) - \sum_{i=1}^n \mu_i \xi_i$
Dual problem: $(x^*, \alpha^*, \mu^*) = \max_{\alpha, \mu} \min_x L(x, \alpha, \mu)$
For a strictly convex problem, the primal and dual solutions coincide (strong duality).
KKT conditions: $x = \sum_{i=1}^n \alpha_i v_i u_i$ and $C = \alpha_i + \mu_i$.

15 Machine Learning and Big Data, Support Vector Machines: SVM Dual
The dual problem: $\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j v_i v_j u_i^T u_j$ subject to $0 \le \alpha_i \le C$, $\forall i$.
The dual is a quadratic programming problem in $n$ variables. It can be solved given only the kernel values $k(u_i, u_j) = u_i^T u_j$, so it is dimension agnostic. Many efficient algorithms exist for solving it, e.g. SMO (Platt99). Worst-case complexity is $O(n^3)$, usually $O(n^2)$.

16 Machine Learning and Big Data, Support Vector Machines: SVM
A more compact form: $\min_x \sum_{i=1}^n \max(0, 1 - v_i x^T u_i) + \lambda \|x\|_2^2$
Or: $\min_x \sum_{i=1}^n l(x, u_i, v_i) + \lambda \Omega(x)$

17 Machine Learning and Big Data, Support Vector Machines: Multi-class Classification
There are $m$ classes: $v_i \in \{1, \ldots, m\}$. Most popular scheme: $v = \arg\max_{v \in \{1, \ldots, m\}} x_v^T u_i$, with one parameter vector $x_v$ per class.
Given example $(u_i, v_i)$, we want $x_{v_i}^T u_i \ge x_j^T u_i$ for all $j \in \{1, \ldots, m\}$.
Using a margin of at least 1, the loss is $l(u_i, v_i) = \max_{j \in \{1, \ldots, v_i - 1, v_i + 1, \ldots, m\}} \{0, 1 - (x_{v_i}^T u_i - x_j^T u_i)\}$.
Given dataset $D$, solve $\min_{x_1, \ldots, x_m} \sum_{i \in D} l(u_i, v_i) + \lambda \sum_{j=1}^m \|x_j\|^2$.
This can be extended to many settings, e.g. sequence labeling, learning to rank, etc.
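
A small numpy sketch of the multi-class hinge loss above, storing one parameter vector $x_j$ per class as a row of a matrix; the matrix name X and the toy numbers are mine.

```python
import numpy as np

def multiclass_hinge_loss(X, u, v):
    """l(u, v) = max_{j != v} max(0, 1 - (x_v^T u - x_j^T u)).

    X: (m, d) array with one parameter vector per class (rows).
    u: (d,) feature vector; v: true class index in {0, ..., m-1}.
    """
    scores = X @ u                          # x_j^T u for every class j
    margins = 1.0 - (scores[v] - scores)    # 1 - (x_v^T u - x_j^T u)
    margins[v] = 0.0                        # exclude j = v
    return float(max(0.0, margins.max()))

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])      # m = 3 classes, d = 2
print(multiclass_hinge_loss(X, np.array([2.0, 0.0]), 0))  # 0.0: correct class wins with margin
print(multiclass_hinge_loss(X, np.array([2.0, 0.0]), 1))  # 3.0: another class scores higher
```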

18 Machine Learning and Big Data, Support Vector Machines: General Learning Problems
Support Vector Machines: $\min_x \sum_{i=1}^n \max\{0, 1 - v_i x^T u_i\} + \lambda \|x\|_2^2$
Logistic Regression: $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$
General form: $\min_x \sum_{i=1}^n l(x, u_i, v_i) + \lambda \Omega(x)$, where $l$ is the loss function and $\Omega$ the regularizer.
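
The general form above translates directly into code: a regularized empirical-risk objective parameterized by a loss function. A minimal sketch (the function names and toy data are mine):

```python
import numpy as np

def hinge(x, u, v):
    return max(0.0, 1.0 - v * (x @ u))

def logistic(x, u, v):
    return np.log1p(np.exp(-v * (x @ u)))

def objective(x, U, V, loss, lam):
    """General form: sum_i loss(x, u_i, v_i) + lam * ||x||_2^2."""
    return sum(loss(x, u, v) for u, v in zip(U, V)) + lam * float(x @ x)

rng = np.random.default_rng(0)
U = rng.normal(size=(50, 4))
V = np.sign(U @ rng.normal(size=4))
x = np.zeros(4)
print(objective(x, U, V, hinge, lam=0.1))     # SVM objective at x = 0
print(objective(x, U, V, logistic, lam=0.1))  # logistic objective at x = 0
```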

19 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Sub-gradient Descent
A sub-gradient of a non-differentiable convex function $f$ at a point $x_0$ is a vector $v$ such that $f(x) - f(x_0) \ge v^T (x - x_0)$ for all $x$.

20 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Sub-gradient Descent
Randomly initialize $x^0$. Iterate $x^k = x^{k-1} - t_k\, g(x^{k-1})$, $k = 1, 2, 3, \ldots$, where $g$ is a sub-gradient of $f$ and $t_k = \frac{1}{k}$.
Keep the best iterate: $f(x_{best}^{(k)}) = \min_{i = 1, \ldots, k} f(x^i)$.
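
A minimal numpy sketch of the sub-gradient descent loop above, applied to an SVM objective written as the averaged hinge loss $\frac{1}{n}\sum_i \max(0, 1 - v_i x^T u_i) + \lambda\|x\|^2$ (a rescaled version of the compact SVM form from earlier, so that the $t_k = 1/k$ steps are well scaled). The data, $\lambda$ and iteration count are placeholder choices.

```python
import numpy as np

def svm_objective(x, U, v, lam):
    return float(np.mean(np.maximum(0.0, 1.0 - v * (U @ x))) + lam * (x @ x))

def svm_subgradient(x, U, v, lam):
    n = U.shape[0]
    active = (v * (U @ x)) < 1.0                 # examples violating the margin
    return -(U[active].T @ v[active]) / n + 2.0 * lam * x

def subgradient_descent(U, v, lam=0.1, iters=200):
    x = np.zeros(U.shape[1])                     # x^0
    x_best, f_best = x.copy(), svm_objective(x, U, v, lam)
    for k in range(1, iters + 1):
        x = x - (1.0 / k) * svm_subgradient(x, U, v, lam)   # t_k = 1/k
        f = svm_objective(x, U, v, lam)
        if f < f_best:                           # f(x_best^(k)) = min_{i<=k} f(x^i)
            x_best, f_best = x.copy(), f
    return x_best, f_best

rng = np.random.default_rng(0)
U = rng.normal(size=(200, 5))
v = np.sign(U @ rng.normal(size=5))
print(subgradient_descent(U, v)[1])              # best objective value found
```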

21 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Sub-gradient Descent.

22 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Stochastic Sub-gradient Descent
Convergence rate is $O(1/\sqrt{k})$, and each iteration takes $O(n)$ time. Reduce the per-iteration time by calculating the sub-gradient using a subset of examples: the stochastic sub-gradient. Inherently serial. Typical $O(\frac{1}{\epsilon^2})$ behaviour.
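
A mini-batch variant of the previous sketch: each step uses a sub-gradient computed on a random subset of examples, as the slide describes; the batch size and the decaying step size are illustrative choices.

```python
import numpy as np

def stochastic_subgradient_svm(U, v, lam=0.1, batch=16, iters=2000, seed=0):
    """Minimize (1/n) sum_i hinge(x; u_i, v_i) + lam*||x||^2 with mini-batch sub-gradients."""
    rng = np.random.default_rng(seed)
    n, d = U.shape
    x = np.zeros(d)
    for k in range(1, iters + 1):
        idx = rng.choice(n, size=batch, replace=False)   # random subset of examples
        Ub, vb = U[idx], v[idx]
        active = (vb * (Ub @ x)) < 1.0
        g = -(Ub[active].T @ vb[active]) / batch + 2.0 * lam * x
        x -= (1.0 / k) * g                               # decaying step size t_k = 1/k
    return x

rng = np.random.default_rng(1)
U = rng.normal(size=(500, 10))
v = np.sign(U @ rng.normal(size=10))
x = stochastic_subgradient_svm(U, v)
print(np.mean(np.sign(U @ x) == v))                      # training accuracy
```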

23 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Stochastic Sub-gradient Descent.

24 Distributed Optimization: Distributed Gradient Descent
Divide the dataset into $m$ parts; each part is processed on one computer, $m$ computers in total. There is one central computer, and all computers can communicate with the central computer via the network.
Define $loss(x) = \sum_{j=1}^m \sum_{i \in C_j} l_i(x) + \lambda \Omega(x)$, where $l_i(x) = l(x, u_i, v_i)$.
The gradient (for differentiable loss): $\nabla loss(x) = \sum_{j=1}^m \left( \sum_{i \in C_j} \nabla l_i(x) \right) + \lambda \nabla \Omega(x)$.
Compute $\nabla l^j(x) = \sum_{i \in C_j} \nabla l_i(x)$ on the $j$-th computer and communicate it to the central computer.

25 Distributed Optimization: Distributed Gradient Descent
Compute $\nabla loss(x) = \sum_{j=1}^m \nabla l^j(x) + \lambda \nabla \Omega(x)$ at the central computer.
The gradient descent update: $x^{k+1} = x^k - \alpha \nabla loss(x^k)$, with $\alpha$ chosen by a (distributed) line search algorithm.
For non-differentiable loss functions, we can use a distributed sub-gradient descent algorithm.
Slow for most practical problems. To achieve $\epsilon$ tolerance: gradient descent (logistic regression) needs $O(1/\epsilon)$ iterations; sub-gradient descent (stochastic sub-gradient descent) needs $O(\frac{1}{\epsilon^2})$ iterations.
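
A simulated sketch of the scheme above: the data is split into m parts and each "worker" computes its partial gradient $\sum_{i \in C_j} \nabla l_i(x)$ on its own shard (here just a Python loop; in a real system this would be a map/reduce or MPI step). The logistic loss, fixed step size and synthetic data are illustrative.

```python
import numpy as np

def partial_logistic_grad(x, Uj, vj):
    """Worker j: sum_{i in C_j} grad l_i(x) for the logistic loss."""
    m = vj * (Uj @ x)
    return -(Uj.T @ (vj / (1.0 + np.exp(m))))

def distributed_gradient_descent(parts, lam=1.0, alpha=0.005, iters=500):
    d = parts[0][0].shape[1]
    x = np.zeros(d)
    for _ in range(iters):
        # "Map": each worker computes its partial gradient on its shard C_j.
        partials = [partial_logistic_grad(x, Uj, vj) for Uj, vj in parts]
        # "Reduce": the central computer sums the partials and adds the regularizer gradient.
        g = np.sum(partials, axis=0) + lam * x
        x -= alpha * g                         # x^{k+1} = x^k - alpha * grad loss(x^k)
    return x

rng = np.random.default_rng(0)
U = rng.normal(size=(600, 8))
v = np.sign(U @ rng.normal(size=8))
parts = list(zip(np.array_split(U, 4), np.array_split(v, 4)))   # m = 4 workers
x = distributed_gradient_descent(parts)
print(np.mean(np.sign(U @ x) == v))
```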

26 Distributed Optimization, ADMM: Alternating Direction Method of Multipliers
Problem: minimize $f(x) + g(z)$ over $x, z$ subject to $Ax + Bz = c$.
Algorithm, iterate till convergence:
$x^{k+1} = \arg\min_x f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2$
$z^{k+1} = \arg\min_z g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2$
$u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$
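
The iterations above are easiest to see on a problem where both sub-problems have closed forms. Below is a numpy sketch for the lasso, with $f(x) = \frac{1}{2}\|Dx - b\|^2$, $g(z) = \lambda\|z\|_1$ and the constraint $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$ in the template above); this particular instance and the data matrix name D are my choices for illustration, not taken from the slides.

```python
import numpy as np

def soft_threshold(a, kappa):
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def lasso_admm(D, b, lam=1.0, rho=1.0, iters=200):
    n, d = D.shape
    x, z, u = np.zeros(d), np.zeros(d), np.zeros(d)
    M = D.T @ D + rho * np.eye(d)       # x-update solves (D^T D + rho I) x = D^T b + rho (z - u)
    Dtb = D.T @ b
    for _ in range(iters):
        x = np.linalg.solve(M, Dtb + rho * (z - u))   # x-update (closed form)
        z = soft_threshold(x + u, lam / rho)          # z-update (soft thresholding)
        u = u + x - z                                 # dual update u^{k+1} = u^k + x^{k+1} - z^{k+1}
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
b = D @ x_true + 0.01 * rng.normal(size=100)
print(np.round(lasso_admm(D, b), 2)[:5])              # first few recovered coefficients
```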

27 Distributed Optimization, ADMM: Stopping Criteria
Stop when the primal and dual residuals are small: $\|r^k\|_2 \le \epsilon^{pri}$ and $\|s^k\|_2 \le \epsilon^{dual}$. This is guaranteed to happen eventually, since $\|r^k\|_2 \to 0$ and $\|s^k\|_2 \to 0$ as $k \to \infty$.

28 Distributed Optimization, ADMM: Observations
The x-update requires solving an optimization problem of the form $\min_x f(x) + \frac{\rho}{2}\|Ax - v\|_2^2$, with $v = -Bz^k + c - u^k$. Similarly for the z-update. Sometimes this has a closed form. ADMM is a meta optimization algorithm.

29 Distributed Optimization, Convergence: Convergence of ADMM
Assumption 1: the functions $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and $g: \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$ are closed, proper and convex. This is the same as assuming $\mathrm{epi}\, f = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} \mid f(x) \le t\}$ is closed and convex (and similarly for $g$).
Assumption 2: the unaugmented Lagrangian $L_0(x, z, y)$ has a saddle point $(x^*, z^*, y^*)$:
$L_0(x^*, z^*, y) \le L_0(x^*, z^*, y^*) \le L_0(x, z, y^*)$ for all $x, z, y$.

30 Distributed Optimization, Convergence: Convergence of ADMM
Primal residual: $r = Ax + Bz - c$. Optimal objective: $p^* = \inf_{x,z} \{f(x) + g(z) \mid Ax + Bz = c\}$.
Convergence results: primal residual convergence, $r^k \to 0$ as $k \to \infty$; dual residual convergence, $s^k \to 0$ as $k \to \infty$; objective convergence, $f(x^k) + g(z^k) \to p^*$ as $k \to \infty$; dual variable convergence, $y^k \to y^*$ as $k \to \infty$.

31 Distributed Optimization, Distributed Loss Minimization: Decomposition
If $f$ is separable, $f(x) = f_1(x_1) + \cdots + f_N(x_N)$ with $x = (x_1, \ldots, x_N)$, and $A$ is conformably block separable (i.e. $A^T A$ is block diagonal), then the x-update splits into $N$ parallel updates of the $x_i$.

32 Distributed Optimization, Distributed Loss Minimization: Consensus Optimization
Problem: $\min_x f(x) = \sum_{i=1}^N f_i(x)$.
ADMM form: $\min_{x_i, z} \sum_{i=1}^N f_i(x_i)$ s.t. $x_i - z = 0$, $i = 1, \ldots, N$.
Augmented Lagrangian: $L_\rho(x_1, \ldots, x_N, z, y) = \sum_{i=1}^N \left( f_i(x_i) + y_i^T (x_i - z) + \frac{\rho}{2}\|x_i - z\|_2^2 \right)$

33 Distributed Optimization, Distributed Loss Minimization: Consensus Optimization
ADMM algorithm:
$x_i^{k+1} = \arg\min_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - z^k) + \frac{\rho}{2}\|x_i - z^k\|_2^2 \right)$
$z^{k+1} = \frac{1}{N} \sum_{i=1}^N \left( x_i^{k+1} + \frac{1}{\rho} y_i^k \right)$
$y_i^{k+1} = y_i^k + \rho (x_i^{k+1} - z^{k+1})$
The final solution is $z^k$.

34 Distributed Optimization, Distributed Loss Minimization: Consensus Optimization
The z-update can be written as $z^{k+1} = \bar{x}^{k+1} + \frac{1}{\rho}\bar{y}^k$. Averaging the y-updates gives $\bar{y}^{k+1} = \bar{y}^k + \rho(\bar{x}^{k+1} - z^{k+1})$. Substituting the first into the second gives $\bar{y}^{k+1} = 0$, and hence $z^k = \bar{x}^k$.
Revised algorithm:
$x_i^{k+1} = \arg\min_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - \bar{x}^k) + \frac{\rho}{2}\|x_i - \bar{x}^k\|_2^2 \right)$
$y_i^{k+1} = y_i^k + \rho (x_i^{k+1} - \bar{x}^{k+1})$
The final solution is $\bar{x}^k$ (equal to $z^k$).
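
A numpy sketch of the revised consensus iteration above. To keep the x-update in closed form I use simple quadratic local objectives $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ (my choice, not from the slides); the consensus solution is then just the average of the $c_i$, which makes the result easy to verify.

```python
import numpy as np

def consensus_admm(C, rho=1.0, iters=100):
    """Minimize sum_i 0.5*||x - c_i||^2 by consensus ADMM; row i of C is c_i."""
    N, d = C.shape
    X = np.zeros((N, d))              # local variables x_i
    Y = np.zeros((N, d))              # dual variables y_i
    xbar = np.zeros(d)
    for _ in range(iters):
        # x_i-update, closed form for f_i(x) = 0.5*||x - c_i||^2:
        # minimize 0.5*||x - c_i||^2 + y_i^T (x - xbar) + (rho/2)*||x - xbar||^2
        X = (C - Y + rho * xbar) / (1.0 + rho)
        xbar = X.mean(axis=0)
        Y = Y + rho * (X - xbar)      # y_i-update
    return xbar

rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))
print(np.allclose(consensus_admm(C), C.mean(axis=0)))   # True: consensus at the average
```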

35 Distributed Optimization, Distributed Loss Minimization: Distributed Loss Minimization
Problem: $\min_x l(Ax - b) + r(x)$.
Partition $A$ and $b$ by rows: $A = [A_1; \ldots; A_N]$, $b = [b_1; \ldots; b_N]$, where $A_i \in \mathbb{R}^{m_i \times n}$ and $b_i \in \mathbb{R}^{m_i}$.
ADMM formulation: $\min_{x_i, z} \sum_{i=1}^N l_i(A_i x_i - b_i) + r(z)$ s.t. $x_i - z = 0$, $i = 1, \ldots, N$.

36 Distributed Optimization, Distributed Loss Minimization: Distributed Loss Minimization
ADMM solution:
$x_i^{k+1} = \arg\min_{x_i} \left( l_i(A_i x_i - b_i) + \frac{\rho}{2}\|x_i - z^k + u_i^k\|_2^2 \right)$
$z^{k+1} = \arg\min_z \left( r(z) + \frac{N\rho}{2}\|z - \bar{x}^{k+1} - \bar{u}^k\|_2^2 \right)$
$u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}$
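
To keep both sub-problems in closed form, here is a sketch of the scheme above for the special case of distributed ridge regression, taking $l_i(A_i x_i - b_i) = \frac{1}{2}\|A_i x_i - b_i\|^2$ and $r(z) = \lambda\|z\|^2$ (my choice of loss; a hinge loss would need an iterative inner solver for the $x_i$-update).

```python
import numpy as np

def distributed_ridge_admm(A_parts, b_parts, lam=1.0, rho=1.0, iters=100):
    N, d = len(A_parts), A_parts[0].shape[1]
    X, U, z = np.zeros((N, d)), np.zeros((N, d)), np.zeros(d)
    Ms = [Ai.T @ Ai + rho * np.eye(d) for Ai in A_parts]       # local factors for the x_i-updates
    Atbs = [Ai.T @ bi for Ai, bi in zip(A_parts, b_parts)]
    for _ in range(iters):
        for i in range(N):                                     # done in parallel in practice
            X[i] = np.linalg.solve(Ms[i], Atbs[i] + rho * (z - U[i]))   # x_i-update
        # z-update: argmin_z lam*||z||^2 + (N*rho/2)*||z - xbar - ubar||^2
        z = (N * rho / (2.0 * lam + N * rho)) * (X.mean(axis=0) + U.mean(axis=0))
        U = U + X - z                                          # u_i-update
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(400, 6))
b = A @ rng.normal(size=6) + 0.05 * rng.normal(size=400)
z = distributed_ridge_admm(np.array_split(A, 4), np.array_split(b, 4))
print(np.round(z, 3))                                          # consensus ridge solution
```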

37 Distributed Optimization, Results: ADMM Results
Logistic regression using the loss minimization formulation (Boyd et al.): $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$

38 Distributed Optimization, Results: ADMM Results
Logistic regression using the loss minimization formulation (Boyd et al.): $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$

39 Distributed Optimization, Results: Other Machine Learning Problems
Ridge regression. Lasso. Multi-class SVM. Ranking. Structured output prediction.

40 Distributed Optimization, Results: ADMM Results. Lasso results (Boyd et al.).

41 Distributed Optimization, Results: ADMM Results. SVM primal residual.

42 Distributed Optimization, Results: ADMM Results. SVM accuracy.

43 Distributed Optimization, Results: Risk and hyperplane.

44 Distributed Optimization, Development of ADMM: Dual Ascent
Convex equality constrained problem: $\min_x f(x)$ subject to $Ax = b$.
Lagrangian: $L(x, y) = f(x) + y^T(Ax - b)$. Dual function: $g(y) = \inf_x L(x, y)$. Dual problem: $\max_y g(y)$.
Final solution: $x^* = \arg\min_x L(x, y^*)$.

45 Distributed Optimization, Development of ADMM: Dual Ascent
Gradient ascent for the dual problem: $y^{k+1} = y^k + \alpha_k \nabla_y g(y^k)$, where $\nabla_y g(y^k) = A\tilde{x} - b$ and $\tilde{x} = \arg\min_x L(x, y^k)$.
Dual ascent algorithm:
$x^{k+1} = \arg\min_x L(x, y^k)$
$y^{k+1} = y^k + \alpha_k (Ax^{k+1} - b)$
Assumptions: $L(x, y^k)$ is strictly convex in $x$ (otherwise the first step can have multiple solutions) and bounded below.

46 Distributed Optimization, Development of ADMM: Dual Decomposition
Suppose $f$ is separable: $f(x) = f_1(x_1) + \cdots + f_N(x_N)$, $x = (x_1, \ldots, x_N)$.
Then $L$ is separable in $x$: $L(x, y) = L_1(x_1, y) + \cdots + L_N(x_N, y) - y^T b$, where $L_i(x_i, y) = f_i(x_i) + y^T A_i x_i$.
The x-minimization splits into $N$ separate problems: $x_i^{k+1} = \arg\min_{x_i} L_i(x_i, y^k)$.

47 Distributed Optimization, Development of ADMM: Dual Decomposition
Dual decomposition:
$x_i^{k+1} = \arg\min_{x_i} L_i(x_i, y^k)$, $i = 1, \ldots, N$
$y^{k+1} = y^k + \alpha_k \left( \sum_{i=1}^N A_i x_i^{k+1} - b \right)$
Distributed solution: scatter $y^k$ to the individual nodes; compute $x_i$ at the $i$-th node (distributed step); gather $A_i x_i^{k+1}$ from the $i$-th node. All drawbacks of dual ascent remain.
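
A toy numpy sketch of dual decomposition with quadratic blocks $f_i(x_i) = \frac{1}{2}\|x_i - c_i\|^2$, for which the $x_i$-minimization has the closed form $x_i = c_i - A_i^T y$; the step size $\alpha$, block sizes and data are placeholder choices.

```python
import numpy as np

def dual_decomposition(A_blocks, c_blocks, b, alpha=0.02, iters=3000):
    y = np.zeros(b.shape[0])
    for _ in range(iters):
        # Each node i solves argmin_{x_i} 0.5*||x_i - c_i||^2 + y^T A_i x_i (closed form).
        xs = [c - A.T @ y for A, c in zip(A_blocks, c_blocks)]
        # Central step: gradient ascent on the dual; gradient = sum_i A_i x_i - b.
        y = y + alpha * (sum(A @ x for A, x in zip(A_blocks, xs)) - b)
    return xs, y

rng = np.random.default_rng(0)
A_blocks = [rng.normal(size=(3, 4)) for _ in range(3)]   # N = 3 blocks, 3 coupling constraints
c_blocks = [rng.normal(size=4) for _ in range(3)]
b = rng.normal(size=3)
xs, y = dual_decomposition(A_blocks, c_blocks, b)
print(np.linalg.norm(sum(A @ x for A, x in zip(A_blocks, xs)) - b))  # ~0: constraint satisfied
```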

48 Distributed Optimization, Development of ADMM: Method of Multipliers
Make dual ascent work under more general conditions by using the augmented Lagrangian: $L_\rho(x, y) = f(x) + y^T(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2$
Method of multipliers:
$x^{k+1} = \arg\min_x L_\rho(x, y^k)$
$y^{k+1} = y^k + \rho(Ax^{k+1} - b)$

49 Distributed Optimization, Development of ADMM: Method of Multipliers
Optimality conditions (for differentiable $f$): primal feasibility, $Ax^* - b = 0$; dual feasibility, $\nabla f(x^*) + A^T y^* = 0$.
Since $x^{k+1}$ minimizes $L_\rho(x, y^k)$:
$0 = \nabla_x L_\rho(x^{k+1}, y^k) = \nabla_x f(x^{k+1}) + A^T(y^k + \rho(Ax^{k+1} - b)) = \nabla_x f(x^{k+1}) + A^T y^{k+1}$
The dual update $y^{k+1} = y^k + \rho(Ax^{k+1} - b)$ makes $(x^{k+1}, y^{k+1})$ dual feasible. Primal feasibility is achieved in the limit: $(Ax^{k+1} - b) \to 0$.

50 Distributed Optimization, Development of ADMM: Alternating Direction Method of Multipliers
Problem with applying the standard method of multipliers to distributed optimization: there is no problem decomposition even if $f$ is separable, due to the quadratic term $\frac{\rho}{2}\|Ax - b\|_2^2$.

51 Distributed Optimization, Development of ADMM: Alternating Direction Method of Multipliers
ADMM problem: $\min_{x,z} f(x) + g(z)$ subject to $Ax + Bz = c$.
Augmented Lagrangian: $L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$
ADMM:
$x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k)$
$z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k)$
$y^{k+1} = y^k + \rho(Ax^{k+1} + Bz^{k+1} - c)$

52 Distributed Optimization, Development of ADMM: Alternating Direction Method of Multipliers
Recall the problem with applying the standard method of multipliers to distributed optimization: there is no problem decomposition even if $f$ is separable, due to the quadratic term $\frac{\rho}{2}\|Ax - b\|_2^2$. The above technique reduces to the method of multipliers if we do a joint minimization over $x$ and $z$; since we split the joint $x, z$ minimization step, the problem can be decomposed.

53 Distributed Optimization, Development of ADMM: ADMM Optimality Conditions
Optimality conditions (differentiable case): primal feasibility, $Ax + Bz - c = 0$; dual feasibility, $\nabla f(x) + A^T y = 0$ and $\nabla g(z) + B^T y = 0$.
Since $z^{k+1}$ minimizes $L_\rho(x^{k+1}, z, y^k)$:
$0 = \nabla g(z^{k+1}) + B^T y^k + \rho B^T(Ax^{k+1} + Bz^{k+1} - c) = \nabla g(z^{k+1}) + B^T y^{k+1}$
So the dual variable update satisfies the second dual feasibility condition. Primal feasibility and the first dual feasibility condition are satisfied asymptotically.

54 Distributed Optimization, Development of ADMM: ADMM Optimality Conditions
Primal residual: $r^k = Ax^k + Bz^k - c$.
Since $x^{k+1}$ minimizes $L_\rho(x, z^k, y^k)$:
$0 = \nabla f(x^{k+1}) + A^T y^k + \rho A^T(Ax^{k+1} + Bz^k - c)$
$= \nabla f(x^{k+1}) + A^T(y^k + \rho r^{k+1} + \rho B(z^k - z^{k+1}))$
$= \nabla f(x^{k+1}) + A^T y^{k+1} + \rho A^T B(z^k - z^{k+1})$
or $\rho A^T B(z^{k+1} - z^k) = \nabla f(x^{k+1}) + A^T y^{k+1}$.
Hence $s^{k+1} = \rho A^T B(z^{k+1} - z^k)$ can be thought of as a dual residual.

55 Distributed Optimization, Development of ADMM: ADMM with Scaled Dual Variables
Combine the linear and quadratic terms: with the scaled dual variable $u = y/\rho$ and residual $r = Ax + Bz - c$, we have $y^T r + \frac{\rho}{2}\|r\|_2^2 = \frac{\rho}{2}\|r + u\|_2^2 - \frac{\rho}{2}\|u\|_2^2$, which gives the scaled-form updates used earlier (slide 26).

56 Applications and extensions, Weighted Parameter Averaging: Distributed Support Vector Machines
Training dataset partitioned into $M$ partitions ($S_m$, $m = 1, \ldots, M$); each partition has $L$ datapoints, $S_m = \{(x_{ml}, y_{ml})\}$, $l = 1, \ldots, L$, and can be processed locally on a single computer.
Distributed SVM training problem [?]:
$\min_{w_m, z} \sum_{m=1}^M \sum_{l=1}^L loss(w_m; (x_{ml}, y_{ml})) + r(z)$ s.t. $w_m - z = 0$, $m = 1, \ldots, M$.

57 Applications and extensions, Weighted Parameter Averaging: Parameter Averaging
Parameter averaging, also called mixture weights, was proposed in [?] for logistic regression; the results carry over to SVMs with a suitable sub-derivative.
Locally learn an SVM on $S_m$: $\hat{w}_m = \arg\min_w \frac{1}{L}\sum_{l=1}^L loss(w; x_{ml}, y_{ml}) + \lambda \|w\|^2$, $m = 1, \ldots, M$.
The final SVM parameter is $w^{PA} = \frac{1}{M}\sum_{m=1}^M \hat{w}_m$.
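
A sketch of parameter averaging: each partition trains a local linear SVM independently (here with a crude sub-gradient solver; any local SVM solver would do) and the parameters are simply averaged. The data, $\lambda$ and solver settings are illustrative.

```python
import numpy as np

def local_svm(U, v, lam=0.01, iters=500):
    """Rough local solver: sub-gradient descent on (1/L)*sum_l hinge + lam*||w||^2."""
    L, d = U.shape
    w = np.zeros(d)
    for k in range(1, iters + 1):
        active = (v * (U @ w)) < 1.0
        g = -(U[active].T @ v[active]) / L + 2.0 * lam * w
        w -= (1.0 / k) * g
    return w

def parameter_averaging(partitions, lam=0.01):
    ws = [local_svm(U, v, lam) for U, v in partitions]   # hat{w}_m, trained independently (in parallel in practice)
    return np.mean(ws, axis=0)                           # w_PA = (1/M) sum_m hat{w}_m

rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 10))
v = np.sign(U @ rng.normal(size=10))
partitions = list(zip(np.array_split(U, 10), np.array_split(v, 10)))   # M = 10 partitions
w_pa = parameter_averaging(partitions)
print(np.mean(np.sign(U @ w_pa) == v))                   # training accuracy of the averaged model
```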

58 Applications and extensions, Weighted Parameter Averaging: Problem with Parameter Averaging
PA with varying number of partitions, toy dataset.

59 Applications and extensions, Weighted Parameter Averaging: Weighted Parameter Averaging
The final hypothesis is a weighted sum of the parameters $\hat{w}_m$ (also proposed in [?]): $w = \sum_{m=1}^M \beta_m \hat{w}_m$. How to get the $\beta_m$?
Notation: $\beta = [\beta_1, \ldots, \beta_M]^T$, $\hat{W} = [\hat{w}_1, \ldots, \hat{w}_M]$, so that $w = \hat{W}\beta$.

60 Applications and extensions, Weighted Parameter Averaging: Weighted Parameter Averaging
Find the optimal set of weights $\beta$ which attains the lowest regularized hinge loss:
$\min_{\beta, \xi} \lambda \|\hat{W}\beta\|^2 + \frac{1}{ML}\sum_{m=1}^M \sum_{i=1}^L \xi_{mi}$
subject to $y_{mi}(\beta^T \hat{W}^T x_{mi}) \ge 1 - \xi_{mi}$, $\xi_{mi} \ge 0$, $m = 1, \ldots, M$, $i = 1, \ldots, L$.
$\hat{W}$ is a pre-computed parameter.

61 Applications and extensions, Weighted Parameter Averaging: Distributed Weighted Parameter Averaging
Distributed version of primal weighted parameter averaging:
$\min_{\gamma_m, \beta} \frac{1}{ML}\sum_{m=1}^M \sum_{l=1}^L loss(\hat{W}\gamma_m; x_{ml}, y_{ml}) + r(\beta)$ s.t. $\gamma_m - \beta = 0$, $m = 1, \ldots, M$,
where $r(\beta) = \lambda \|\hat{W}\beta\|^2$, $\gamma_m$ are the weights at the $m$-th computer, and $\beta$ is the consensus weight.

62 Applications and extensions, Weighted Parameter Averaging: Distributed Weighted Parameter Averaging
Distributed algorithm using ADMM:
$\gamma_m^{k+1} := \arg\min_\gamma \left( loss(A_m \gamma) + \frac{\rho}{2}\|\gamma - \beta^k + u_m^k\|_2^2 \right)$
$\beta^{k+1} := \arg\min_\beta \left( r(\beta) + \frac{M\rho}{2}\|\beta - \bar{\gamma}^{k+1} - \bar{u}^k\|_2^2 \right)$
$u_m^{k+1} = u_m^k + \gamma_m^{k+1} - \beta^{k+1}$
Here the $u_m$ are the scaled Lagrange multipliers, $\bar{\gamma} = \frac{1}{M}\sum_{m=1}^M \gamma_m$ and $\bar{u} = \frac{1}{M}\sum_{m=1}^M u_m$.

63 Applications and extensions, Weighted Parameter Averaging: Toy Dataset, PA and WPA
PA (left) and WPA (right) with varying number of partitions, toy dataset.

64 Applications and extensions, Weighted Parameter Averaging: Toy Dataset, PA and WPA
Accuracy of PA and WPA with varying number of partitions, toy dataset.

65 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Epsilon (2000 features, 6000 datapoints): test set accuracy with varying number of partitions.

66 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Gisette (5000 features, 6000 datapoints): test set accuracy with varying number of partitions.

67 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Real-sim (20000 features, 3000 datapoints): test set accuracy with varying number of partitions.

68 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Convergence of test accuracy with iterations (200 partitions).

69 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Convergence of primal residual with iterations (200 partitions).

70 Applications and extensions, Fully-distributed SVM: Distributed SVM on an Arbitrary Network
Motivations: sensor networks, corporate networks, privacy.
Assumptions: data is available at the nodes of the network; communication is possible only along the edges of the network.

71 Applications and extensions, Fully-distributed SVM: Distributed SVM on an Arbitrary Network
SVM optimization problem:
$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
s.t. $y_{jn}(w^T x_{jn} + b) \ge 1 - \xi_{jn}$ and $\xi_{jn} \ge 0$, $\forall j \in J$, $n = 1, \ldots, N_j$.
Node $j$ has a copy of $w_j, b_j$. Distributed formulation:
$\min_{\{w_j, b_j, \xi_{jn}\}} \frac{1}{2}\sum_{j=1}^J \|w_j\|^2 + JC\sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
s.t. $y_{jn}(w_j^T x_{jn} + b_j) \ge 1 - \xi_{jn}$ and $\xi_{jn} \ge 0$, $\forall j \in J$, $n = 1, \ldots, N_j$; $w_j = w_i$, $\forall j, i \in B_j$.

72 Applications and extensions, Fully-distributed SVM: Algorithm
Using $v_j = [w_j^T\ b_j]^T$, $X_j = [[x_{j1}, \ldots, x_{jN_j}]^T\ \mathbf{1}_j]$ and $Y_j = \mathrm{diag}([y_{j1}, \ldots, y_{jN_j}])$:
$\min_{\{v_j, \xi_{jn}, \omega_{ji}\}} \frac{1}{2}\sum_{j=1}^J r(v_j) + JC\sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
s.t. $Y_j X_j v_j \ge \mathbf{1} - \xi_j$, $\xi_j \ge 0$, $\forall j \in J$; $v_j = \omega_{ji}$, $\omega_{ji} = v_i$, $\forall j, i \in B_j$.
Surrogate augmented Lagrangian:
$L(\{v_j\}, \{\xi_j\}, \{\omega_{ji}\}, \{\alpha_{jik}\}) = \frac{1}{2}\sum_{j=1}^J r(v_j) + JC\sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn} + \sum_{j=1}^J \sum_{i \in B_j} \left( \alpha_{ji1}^T(v_j - \omega_{ji}) + \alpha_{ji2}^T(\omega_{ji} - v_i) \right) + \frac{\eta}{2}\sum_{j=1}^J \sum_{i \in B_j} \left( \|v_j - \omega_{ji}\|^2 + \|v_i - \omega_{ji}\|^2 \right)$

73 Applications and extensions, Fully-distributed SVM: Algorithm
ADMM-based algorithm:
$\{v_j^{t+1}, \xi_{jn}^{t+1}\} = \arg\min_{\{v_j, \xi_j\} \in W} L(\{v_j\}, \{\xi_j\}, \{\omega_{ji}^t\}, \{\alpha_{jik}^t\})$
$\{\omega_{ji}^{t+1}\} = \arg\min_{\omega_{ji}} L(\{v_j^{t+1}\}, \{\xi_j^{t+1}\}, \{\omega_{ji}\}, \{\alpha_{jik}^t\})$
$\alpha_{ji1}^{t+1} = \alpha_{ji1}^t + \eta(v_j^{t+1} - \omega_{ji}^{t+1})$
$\alpha_{ji2}^{t+1} = \alpha_{ji2}^t + \eta(\omega_{ji}^{t+1} - v_i^{t+1})$
From the second equation: $\omega_{ji}^{t+1} = \frac{1}{2\eta}(\alpha_{ji1}^t - \alpha_{ji2}^t) + \frac{1}{2}(v_j^{t+1} + v_i^{t+1})$

74 Applications and extensions, Fully-distributed SVM: Algorithm
Hence:
$\alpha_{ji1}^{t+1} = \frac{1}{2}(\alpha_{ji1}^t + \alpha_{ji2}^t) + \frac{\eta}{2}(v_j^{t+1} - v_i^{t+1})$
$\alpha_{ji2}^{t+1} = \frac{1}{2}(\alpha_{ji1}^t + \alpha_{ji2}^t) + \frac{\eta}{2}(v_j^{t+1} - v_i^{t+1})$
Substituting $\omega_{ji}^{t+1} = \frac{1}{2}(v_j^{t+1} + v_i^{t+1})$ into the surrogate augmented Lagrangian, the third term becomes
$\sum_{j=1}^J \sum_{i \in B_j} \alpha_{ji1}^T (v_j - v_i) = \sum_{j=1}^J v_j^T \sum_{i \in B_j} (\alpha_{ji1}^t - \alpha_{ij1}^t)$
Substitute $\alpha_j^t = \sum_{i \in B_j} (\alpha_{ji1}^t - \alpha_{ij1}^t)$.

75 Applications and extensions, Fully-distributed SVM: Algorithm
The final algorithm:
$\{v_j^{t+1}, \xi_{jn}^{t+1}\} = \arg\min_{\{v_j, \xi_j\} \in W} L(\{v_j\}, \{\xi_j\}, \{\alpha_j^t\})$
$\alpha_j^{t+1} = \alpha_j^t + \frac{\eta}{2}\sum_{i \in B_j}(v_j^{t+1} - v_i^{t+1})$

76 Applications and extensions, Fully-distributed SVM: Thank you! Questions?
