Distributed Machine Learning and Big Data


1 Distributed Machine Learning and Big Data. Sourangshu Bhattacharya, Dept. of Computer Science and Engineering, IIT Kharagpur. August 21, 2015.

2 Outline
1. Machine Learning and Big Data: Support Vector Machines; Stochastic Sub-gradient Descent.
2. Distributed Optimization: ADMM; Convergence; Distributed Loss Minimization; Results; Development of ADMM.
3. Applications and extensions: Weighted Parameter Averaging; Fully-distributed SVM.

3 Machine Learning and Big Data: What is Big Data?
6 billion web queries per day. 10 billion display advertisements per day. 30 billion text ads per day. 150 million credit card transactions per day. 100 billion emails per day.

4 Machine Learning and Big Data: Machine Learning on Big Data
Classification: spam / no spam, 100B emails. Multi-label classification: image tagging, 14M images, 10K tags. Regression: CTR estimation, 10B ad views. Ranking: web search, 6B queries. Recommendation: online shopping, 1.7B views in the US.

5 Machine Learning and Big Data: Classification Example
Email spam classification. Features ($u_i$): vector of counts of all words. No. of features ($d$): words in vocabulary (about 100,000). No. of non-zero features: about 100. No. of emails per day: 100 M. Size of training set using 30 days of data: 6 TB (assuming 20 B per entry). Time taken to read the data once: roughly 83 hrs (at 20 MB per second). Solution: use multiple computers.

6 Machine Learning and Big Data: Big Data Paradigm
The 3 V's: Volume, Variety, Velocity. Distributed system: as the number of computers grows, so does the chance of a failure within any given hour. Communication efficiency: data locality. Many systems: Hadoop, Spark, GraphLab, etc. Goal: implement machine learning algorithms on Big Data systems.

7 Machine Learning and Big Data: Binary Classification Problem
A set of labeled datapoints: $S = \{(u_i, v_i), i = 1, \ldots, n\}$, with $u_i \in \mathbb{R}^d$ and $v_i \in \{+1, -1\}$.
Linear predictor function: $v = \mathrm{sign}(x^T u)$.
Error function: $E = \sum_{i=1}^n \mathbf{1}(v_i x^T u_i \le 0)$.
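
To make the notation concrete, here is a minimal numpy sketch of the linear predictor and the 0-1 error count above; the array names (U for the stacked $u_i$, v for the labels) and the random data are illustrative, not from the slides.

```python
import numpy as np

def predict(x, U):
    """Linear predictor: sign(x^T u) for each row u of U."""
    return np.sign(U @ x)

def zero_one_error(x, U, v):
    """E = sum_i 1(v_i * x^T u_i <= 0)."""
    return int(np.sum(v * (U @ x) <= 0))

rng = np.random.default_rng(0)
U = rng.normal(size=(100, 5))        # n = 100 datapoints, d = 5 features
x_true = rng.normal(size=5)
v = np.sign(U @ x_true)              # labels from a separating hyperplane
print(zero_one_error(x_true, U, v))  # 0 on this separable toy data
```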

8 Machine Learning and Big Data: Logistic Regression
Probability of $v$: $P(v \mid u, x) = \sigma(v x^T u) = \frac{1}{1 + e^{-v x^T u}}$.
Learning problem: given dataset $S$, estimate $x$ by minimizing the regularized negative log likelihood:
$x^* = \arg\min_x \sum_{i=1}^n \log(1 + e^{-v_i x^T u_i}) + \frac{\lambda}{2} x^T x$

9 Machine Learning and Big Data: Convex Function
$f$ is a convex function if $f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)$ for all $t \in [0, 1]$.

10 Machine Learning and Big Data: Convex Optimization
Convex optimization problem: minimize $f(x)$ over $x$ subject to $g_i(x) \le 0$, $i = 1, \ldots, k$, where $f$ and the $g_i$ are convex functions. For convex optimization problems, local optima are also global optima.

11 Machine Learning and Big Data: Optimization Algorithm: Gradient Descent.
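
As an illustration of gradient descent applied to the regularized logistic-regression objective defined two slides earlier, here is a minimal numpy sketch; the step size, iteration count and synthetic data are arbitrary choices of mine, not from the slides.

```python
import numpy as np

def grad(x, U, v, lam):
    """Gradient of sum_i log(1 + exp(-v_i x^T u_i)) + (lam/2) x^T x."""
    m = v * (U @ x)                          # margins v_i x^T u_i
    return -(U.T @ (v / (1.0 + np.exp(m)))) + lam * x

def gradient_descent(U, v, lam=1.0, alpha=0.01, iters=500):
    x = np.zeros(U.shape[1])
    for _ in range(iters):
        x -= alpha * grad(x, U, v, lam)      # x <- x - alpha * gradient
    return x

rng = np.random.default_rng(0)
U = rng.normal(size=(200, 10))
v = np.sign(U @ rng.normal(size=10))
x = gradient_descent(U, v)
print(np.mean(np.sign(U @ x) == v))          # training accuracy
```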

12 Machine Learning and Big Data, Support Vector Machines: Classification Problem.

13 Machine Learning and Big Data, Support Vector Machines: SVM
Separating hyperplane: $x^T u = 0$. Parallel hyperplanes (defining the margin): $x^T u = \pm 1$. Margin (perpendicular distance between the parallel hyperplanes): $\frac{2}{\|x\|}$.
Correct classification of training datapoints: $v_i x^T u_i \ge 1$, $\forall i$. Allowing error (slack) $\xi_i$: $v_i x^T u_i \ge 1 - \xi_i$, $\forall i$.
Max-margin formulation:
$\min_{x, \xi} \frac{1}{2}\|x\|^2 + C \sum_{i=1}^n \xi_i$ subject to $v_i x^T u_i \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \ldots, n$.

14 Machine Learning and Big Data, Support Vector Machines: SVM Dual
Lagrangian: $L = \frac{1}{2} x^T x + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i (1 - \xi_i - v_i x^T u_i) - \sum_{i=1}^n \mu_i \xi_i$
Dual problem: $(x^*, \alpha^*, \mu^*) = \max_{\alpha, \mu} \min_x L(x, \alpha, \mu)$
For a strictly convex problem, the primal and dual solutions coincide (strong duality).
KKT conditions: $x = \sum_{i=1}^n \alpha_i v_i u_i$ and $C = \alpha_i + \mu_i$.

15 Machine Learning and Big Data, Support Vector Machines: SVM Dual
The dual problem: $\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j v_i v_j u_i^T u_j$ subject to $0 \le \alpha_i \le C$, $\forall i$.
The dual is a quadratic programming problem in $n$ variables. It can be solved given only the kernel values $k(u_i, u_j) = u_i^T u_j$, so it is dimension agnostic. Many efficient algorithms exist for solving it, e.g. SMO (Platt99). Worst-case complexity is $O(n^3)$, usually $O(n^2)$.

16 Machine Learning and Big Data, Support Vector Machines: SVM
A more compact form: $\min_x \sum_{i=1}^n \max(0, 1 - v_i x^T u_i) + \lambda \|x\|_2^2$
Or: $\min_x \sum_{i=1}^n l(x, u_i, v_i) + \lambda \Omega(x)$

17 Machine Learning and Big Data, Support Vector Machines: Multi-class Classification
There are $m$ classes: $v_i \in \{1, \ldots, m\}$. Most popular scheme: $v = \arg\max_{v \in \{1, \ldots, m\}} x_v^T u_i$, with one parameter vector $x_v$ per class.
Given example $(u_i, v_i)$, we want $x_{v_i}^T u_i \ge x_j^T u_i$ for all $j \in \{1, \ldots, m\}$.
Using a margin of at least 1, the loss is $l(u_i, v_i) = \max_{j \in \{1, \ldots, v_i - 1, v_i + 1, \ldots, m\}} \{0, 1 - (x_{v_i}^T u_i - x_j^T u_i)\}$.
Given dataset $D$, solve $\min_{x_1, \ldots, x_m} \sum_{i \in D} l(u_i, v_i) + \lambda \sum_{j=1}^m \|x_j\|^2$.
This can be extended to many settings, e.g. sequence labeling, learning to rank, etc.
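
A small numpy sketch of the multi-class hinge loss above, storing one parameter vector $x_j$ per class as a row of a matrix; the matrix name X and the toy numbers are mine.

```python
import numpy as np

def multiclass_hinge_loss(X, u, v):
    """l(u, v) = max_{j != v} max(0, 1 - (x_v^T u - x_j^T u)).

    X: (m, d) array with one parameter vector per class (rows).
    u: (d,) feature vector; v: true class index in {0, ..., m-1}.
    """
    scores = X @ u                          # x_j^T u for every class j
    margins = 1.0 - (scores[v] - scores)    # 1 - (x_v^T u - x_j^T u)
    margins[v] = 0.0                        # exclude j = v
    return float(max(0.0, margins.max()))

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])      # m = 3 classes, d = 2
print(multiclass_hinge_loss(X, np.array([2.0, 0.0]), 0))  # 0.0: correct class wins with margin
print(multiclass_hinge_loss(X, np.array([2.0, 0.0]), 1))  # 3.0: another class scores higher
```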

18 Machine Learning and Big Data, Support Vector Machines: General Learning Problems
Support Vector Machines: $\min_x \sum_{i=1}^n \max\{0, 1 - v_i x^T u_i\} + \lambda \|x\|_2^2$
Logistic Regression: $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$
General form: $\min_x \sum_{i=1}^n l(x, u_i, v_i) + \lambda \Omega(x)$, where $l$ is the loss function and $\Omega$ the regularizer.
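
The general form above translates directly into code: a regularized empirical-risk objective parameterized by a loss function. A minimal sketch (the function names and toy data are mine):

```python
import numpy as np

def hinge(x, u, v):
    return max(0.0, 1.0 - v * (x @ u))

def logistic(x, u, v):
    return np.log1p(np.exp(-v * (x @ u)))

def objective(x, U, V, loss, lam):
    """General form: sum_i loss(x, u_i, v_i) + lam * ||x||_2^2."""
    return sum(loss(x, u, v) for u, v in zip(U, V)) + lam * float(x @ x)

rng = np.random.default_rng(0)
U = rng.normal(size=(50, 4))
V = np.sign(U @ rng.normal(size=4))
x = np.zeros(4)
print(objective(x, U, V, hinge, lam=0.1))     # SVM objective at x = 0
print(objective(x, U, V, logistic, lam=0.1))  # logistic objective at x = 0
```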

19 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Sub-gradient Descent
A sub-gradient of a non-differentiable convex function $f$ at a point $x_0$ is a vector $v$ such that $f(x) - f(x_0) \ge v^T (x - x_0)$ for all $x$.

20 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Sub-gradient Descent
Randomly initialize $x^0$. Iterate $x^k = x^{k-1} - t_k\, g(x^{k-1})$, $k = 1, 2, 3, \ldots$, where $g$ is a sub-gradient of $f$ and $t_k = \frac{1}{k}$.
Keep the best iterate: $f(x_{best}^{(k)}) = \min_{i = 1, \ldots, k} f(x^i)$.
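
A minimal numpy sketch of the sub-gradient descent loop above, applied to an SVM objective written as the averaged hinge loss $\frac{1}{n}\sum_i \max(0, 1 - v_i x^T u_i) + \lambda\|x\|^2$ (a rescaled version of the compact SVM form from earlier, so that the $t_k = 1/k$ steps are well scaled). The data, $\lambda$ and iteration count are placeholder choices.

```python
import numpy as np

def svm_objective(x, U, v, lam):
    return float(np.mean(np.maximum(0.0, 1.0 - v * (U @ x))) + lam * (x @ x))

def svm_subgradient(x, U, v, lam):
    n = U.shape[0]
    active = (v * (U @ x)) < 1.0                 # examples violating the margin
    return -(U[active].T @ v[active]) / n + 2.0 * lam * x

def subgradient_descent(U, v, lam=0.1, iters=200):
    x = np.zeros(U.shape[1])                     # x^0
    x_best, f_best = x.copy(), svm_objective(x, U, v, lam)
    for k in range(1, iters + 1):
        x = x - (1.0 / k) * svm_subgradient(x, U, v, lam)   # t_k = 1/k
        f = svm_objective(x, U, v, lam)
        if f < f_best:                           # f(x_best^(k)) = min_{i<=k} f(x^i)
            x_best, f_best = x.copy(), f
    return x_best, f_best

rng = np.random.default_rng(0)
U = rng.normal(size=(200, 5))
v = np.sign(U @ rng.normal(size=5))
print(subgradient_descent(U, v)[1])              # best objective value found
```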

21 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Sub-gradient Descent.

22 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Stochastic Sub-gradient Descent
Convergence rate is $O(1/\sqrt{k})$, and each iteration takes $O(n)$ time. Reduce the per-iteration time by calculating the sub-gradient using a subset of examples: the stochastic sub-gradient. Inherently serial. Typical $O(\frac{1}{\epsilon^2})$ behaviour.
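
A mini-batch variant of the previous sketch: each step uses a sub-gradient computed on a random subset of examples, as the slide describes; the batch size and the decaying step size are illustrative choices.

```python
import numpy as np

def stochastic_subgradient_svm(U, v, lam=0.1, batch=16, iters=2000, seed=0):
    """Minimize (1/n) sum_i hinge(x; u_i, v_i) + lam*||x||^2 with mini-batch sub-gradients."""
    rng = np.random.default_rng(seed)
    n, d = U.shape
    x = np.zeros(d)
    for k in range(1, iters + 1):
        idx = rng.choice(n, size=batch, replace=False)   # random subset of examples
        Ub, vb = U[idx], v[idx]
        active = (vb * (Ub @ x)) < 1.0
        g = -(Ub[active].T @ vb[active]) / batch + 2.0 * lam * x
        x -= (1.0 / k) * g                               # decaying step size t_k = 1/k
    return x

rng = np.random.default_rng(1)
U = rng.normal(size=(500, 10))
v = np.sign(U @ rng.normal(size=10))
x = stochastic_subgradient_svm(U, v)
print(np.mean(np.sign(U @ x) == v))                      # training accuracy
```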

23 Machine Learning and Big Data, Stochastic Sub-gradient Descent: Stochastic Sub-gradient Descent.

24 Distributed Optimization: Distributed Gradient Descent
Divide the dataset into $m$ parts; each part is processed on one computer, $m$ computers in total. There is one central computer, and all computers can communicate with the central computer via the network.
Define $loss(x) = \sum_{j=1}^m \sum_{i \in C_j} l_i(x) + \lambda \Omega(x)$, where $l_i(x) = l(x, u_i, v_i)$.
The gradient (for differentiable loss): $\nabla loss(x) = \sum_{j=1}^m \left( \sum_{i \in C_j} \nabla l_i(x) \right) + \lambda \nabla \Omega(x)$.
Compute $\nabla l^j(x) = \sum_{i \in C_j} \nabla l_i(x)$ on the $j$-th computer and communicate it to the central computer.

25 Distributed Optimization: Distributed Gradient Descent
Compute $\nabla loss(x) = \sum_{j=1}^m \nabla l^j(x) + \lambda \nabla \Omega(x)$ at the central computer.
The gradient descent update: $x^{k+1} = x^k - \alpha \nabla loss(x^k)$, with $\alpha$ chosen by a (distributed) line search algorithm.
For non-differentiable loss functions, we can use a distributed sub-gradient descent algorithm.
Slow for most practical problems. To achieve $\epsilon$ tolerance: gradient descent (logistic regression) needs $O(1/\epsilon)$ iterations; sub-gradient descent (stochastic sub-gradient descent) needs $O(\frac{1}{\epsilon^2})$ iterations.
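
A simulated sketch of the scheme above: the data is split into m parts and each "worker" computes its partial gradient $\sum_{i \in C_j} \nabla l_i(x)$ on its own shard (here just a Python loop; in a real system this would be a map/reduce or MPI step). The logistic loss, fixed step size and synthetic data are illustrative.

```python
import numpy as np

def partial_logistic_grad(x, Uj, vj):
    """Worker j: sum_{i in C_j} grad l_i(x) for the logistic loss."""
    m = vj * (Uj @ x)
    return -(Uj.T @ (vj / (1.0 + np.exp(m))))

def distributed_gradient_descent(parts, lam=1.0, alpha=0.005, iters=500):
    d = parts[0][0].shape[1]
    x = np.zeros(d)
    for _ in range(iters):
        # "Map": each worker computes its partial gradient on its shard C_j.
        partials = [partial_logistic_grad(x, Uj, vj) for Uj, vj in parts]
        # "Reduce": the central computer sums the partials and adds the regularizer gradient.
        g = np.sum(partials, axis=0) + lam * x
        x -= alpha * g                         # x^{k+1} = x^k - alpha * grad loss(x^k)
    return x

rng = np.random.default_rng(0)
U = rng.normal(size=(600, 8))
v = np.sign(U @ rng.normal(size=8))
parts = list(zip(np.array_split(U, 4), np.array_split(v, 4)))   # m = 4 workers
x = distributed_gradient_descent(parts)
print(np.mean(np.sign(U @ x) == v))
```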

26 Distributed Optimization, ADMM: Alternating Direction Method of Multipliers
Problem: minimize $f(x) + g(z)$ over $x, z$ subject to $Ax + Bz = c$.
Algorithm, iterate till convergence:
$x^{k+1} = \arg\min_x f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2$
$z^{k+1} = \arg\min_z g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2$
$u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$
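
The iterations above are easiest to see on a problem where both sub-problems have closed forms. Below is a numpy sketch for the lasso, with $f(x) = \frac{1}{2}\|Dx - b\|^2$, $g(z) = \lambda\|z\|_1$ and the constraint $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$ in the template above); this particular instance and the data matrix name D are my choices for illustration, not taken from the slides.

```python
import numpy as np

def soft_threshold(a, kappa):
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def lasso_admm(D, b, lam=1.0, rho=1.0, iters=200):
    n, d = D.shape
    x, z, u = np.zeros(d), np.zeros(d), np.zeros(d)
    M = D.T @ D + rho * np.eye(d)       # x-update solves (D^T D + rho I) x = D^T b + rho (z - u)
    Dtb = D.T @ b
    for _ in range(iters):
        x = np.linalg.solve(M, Dtb + rho * (z - u))   # x-update (closed form)
        z = soft_threshold(x + u, lam / rho)          # z-update (soft thresholding)
        u = u + x - z                                 # dual update u^{k+1} = u^k + x^{k+1} - z^{k+1}
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
b = D @ x_true + 0.01 * rng.normal(size=100)
print(np.round(lasso_admm(D, b), 2)[:5])              # first few recovered coefficients
```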

27 Distributed Optimization, ADMM: Stopping Criteria
Stop when the primal and dual residuals are small: $\|r^k\|_2 \le \epsilon^{pri}$ and $\|s^k\|_2 \le \epsilon^{dual}$. This is guaranteed to happen eventually, since $\|r^k\|_2 \to 0$ and $\|s^k\|_2 \to 0$ as $k \to \infty$.

28 Distributed Optimization, ADMM: Observations
The x-update requires solving an optimization problem of the form $\min_x f(x) + \frac{\rho}{2}\|Ax - v\|_2^2$, with $v = -Bz^k + c - u^k$. Similarly for the z-update. Sometimes this has a closed form. ADMM is a meta optimization algorithm.

29 Distributed Optimization, Convergence: Convergence of ADMM
Assumption 1: the functions $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and $g: \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$ are closed, proper and convex. This is the same as assuming $\mathrm{epi}\, f = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} \mid f(x) \le t\}$ is closed and convex (and similarly for $g$).
Assumption 2: the unaugmented Lagrangian $L_0(x, z, y)$ has a saddle point $(x^*, z^*, y^*)$:
$L_0(x^*, z^*, y) \le L_0(x^*, z^*, y^*) \le L_0(x, z, y^*)$ for all $x, z, y$.

30 Distributed Optimization, Convergence: Convergence of ADMM
Primal residual: $r = Ax + Bz - c$. Optimal objective: $p^* = \inf_{x,z} \{f(x) + g(z) \mid Ax + Bz = c\}$.
Convergence results: primal residual convergence, $r^k \to 0$ as $k \to \infty$; dual residual convergence, $s^k \to 0$ as $k \to \infty$; objective convergence, $f(x^k) + g(z^k) \to p^*$ as $k \to \infty$; dual variable convergence, $y^k \to y^*$ as $k \to \infty$.

31 Distributed Optimization, Distributed Loss Minimization: Decomposition
If $f$ is separable, $f(x) = f_1(x_1) + \cdots + f_N(x_N)$ with $x = (x_1, \ldots, x_N)$, and $A$ is conformably block separable (i.e. $A^T A$ is block diagonal), then the x-update splits into $N$ parallel updates of the $x_i$.

32 Distributed Optimization, Distributed Loss Minimization: Consensus Optimization
Problem: $\min_x f(x) = \sum_{i=1}^N f_i(x)$.
ADMM form: $\min_{x_i, z} \sum_{i=1}^N f_i(x_i)$ s.t. $x_i - z = 0$, $i = 1, \ldots, N$.
Augmented Lagrangian: $L_\rho(x_1, \ldots, x_N, z, y) = \sum_{i=1}^N \left( f_i(x_i) + y_i^T (x_i - z) + \frac{\rho}{2}\|x_i - z\|_2^2 \right)$

33 Distributed Optimization, Distributed Loss Minimization: Consensus Optimization
ADMM algorithm:
$x_i^{k+1} = \arg\min_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - z^k) + \frac{\rho}{2}\|x_i - z^k\|_2^2 \right)$
$z^{k+1} = \frac{1}{N} \sum_{i=1}^N \left( x_i^{k+1} + \frac{1}{\rho} y_i^k \right)$
$y_i^{k+1} = y_i^k + \rho (x_i^{k+1} - z^{k+1})$
The final solution is $z^k$.

34 Distributed Optimization, Distributed Loss Minimization: Consensus Optimization
The z-update can be written as $z^{k+1} = \bar{x}^{k+1} + \frac{1}{\rho}\bar{y}^k$. Averaging the y-updates gives $\bar{y}^{k+1} = \bar{y}^k + \rho(\bar{x}^{k+1} - z^{k+1})$. Substituting the first into the second gives $\bar{y}^{k+1} = 0$, and hence $z^k = \bar{x}^k$.
Revised algorithm:
$x_i^{k+1} = \arg\min_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - \bar{x}^k) + \frac{\rho}{2}\|x_i - \bar{x}^k\|_2^2 \right)$
$y_i^{k+1} = y_i^k + \rho (x_i^{k+1} - \bar{x}^{k+1})$
The final solution is $\bar{x}^k$ (equal to $z^k$).
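
A numpy sketch of the revised consensus iteration above. To keep the x-update in closed form I use simple quadratic local objectives $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ (my choice, not from the slides); the consensus solution is then just the average of the $c_i$, which makes the result easy to verify.

```python
import numpy as np

def consensus_admm(C, rho=1.0, iters=100):
    """Minimize sum_i 0.5*||x - c_i||^2 by consensus ADMM; row i of C is c_i."""
    N, d = C.shape
    X = np.zeros((N, d))              # local variables x_i
    Y = np.zeros((N, d))              # dual variables y_i
    xbar = np.zeros(d)
    for _ in range(iters):
        # x_i-update, closed form for f_i(x) = 0.5*||x - c_i||^2:
        # minimize 0.5*||x - c_i||^2 + y_i^T (x - xbar) + (rho/2)*||x - xbar||^2
        X = (C - Y + rho * xbar) / (1.0 + rho)
        xbar = X.mean(axis=0)
        Y = Y + rho * (X - xbar)      # y_i-update
    return xbar

rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))
print(np.allclose(consensus_admm(C), C.mean(axis=0)))   # True: consensus at the average
```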

35 Distributed Optimization, Distributed Loss Minimization: Distributed Loss Minimization
Problem: $\min_x l(Ax - b) + r(x)$.
Partition $A$ and $b$ by rows: $A = [A_1; \ldots; A_N]$, $b = [b_1; \ldots; b_N]$, where $A_i \in \mathbb{R}^{m_i \times n}$ and $b_i \in \mathbb{R}^{m_i}$.
ADMM formulation: $\min_{x_i, z} \sum_{i=1}^N l_i(A_i x_i - b_i) + r(z)$ s.t. $x_i - z = 0$, $i = 1, \ldots, N$.

36 Distributed Optimization, Distributed Loss Minimization: Distributed Loss Minimization
ADMM solution:
$x_i^{k+1} = \arg\min_{x_i} \left( l_i(A_i x_i - b_i) + \frac{\rho}{2}\|x_i - z^k + u_i^k\|_2^2 \right)$
$z^{k+1} = \arg\min_z \left( r(z) + \frac{N\rho}{2}\|z - \bar{x}^{k+1} - \bar{u}^k\|_2^2 \right)$
$u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}$
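
To keep both sub-problems in closed form, here is a sketch of the scheme above for the special case of distributed ridge regression, taking $l_i(A_i x_i - b_i) = \frac{1}{2}\|A_i x_i - b_i\|^2$ and $r(z) = \lambda\|z\|^2$ (my choice of loss; a hinge loss would need an iterative inner solver for the $x_i$-update).

```python
import numpy as np

def distributed_ridge_admm(A_parts, b_parts, lam=1.0, rho=1.0, iters=100):
    N, d = len(A_parts), A_parts[0].shape[1]
    X, U, z = np.zeros((N, d)), np.zeros((N, d)), np.zeros(d)
    Ms = [Ai.T @ Ai + rho * np.eye(d) for Ai in A_parts]       # local factors for the x_i-updates
    Atbs = [Ai.T @ bi for Ai, bi in zip(A_parts, b_parts)]
    for _ in range(iters):
        for i in range(N):                                     # done in parallel in practice
            X[i] = np.linalg.solve(Ms[i], Atbs[i] + rho * (z - U[i]))   # x_i-update
        # z-update: argmin_z lam*||z||^2 + (N*rho/2)*||z - xbar - ubar||^2
        z = (N * rho / (2.0 * lam + N * rho)) * (X.mean(axis=0) + U.mean(axis=0))
        U = U + X - z                                          # u_i-update
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(400, 6))
b = A @ rng.normal(size=6) + 0.05 * rng.normal(size=400)
z = distributed_ridge_admm(np.array_split(A, 4), np.array_split(b, 4))
print(np.round(z, 3))                                          # consensus ridge solution
```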

37 Distributed Optimization, Results: ADMM Results
Logistic regression using the loss minimization formulation (Boyd et al.): $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$

38 Distributed Optimization, Results: ADMM Results
Logistic regression using the loss minimization formulation (Boyd et al.): $\min_x \sum_{i=1}^n \log(1 + \exp(-v_i x^T u_i)) + \lambda \|x\|_2^2$

39 Distributed Optimization, Results: Other Machine Learning Problems
Ridge regression. Lasso. Multi-class SVM. Ranking. Structured output prediction.

40 Distributed Optimization, Results: ADMM Results. Lasso results (Boyd et al.).

41 Distributed Optimization, Results: ADMM Results. SVM primal residual.

42 Distributed Optimization, Results: ADMM Results. SVM accuracy.

43 Distributed Optimization, Results: Risk and hyperplane.

44 Distributed Optimization, Development of ADMM: Dual Ascent
Convex equality constrained problem: $\min_x f(x)$ subject to $Ax = b$.
Lagrangian: $L(x, y) = f(x) + y^T(Ax - b)$. Dual function: $g(y) = \inf_x L(x, y)$. Dual problem: $\max_y g(y)$.
Final solution: $x^* = \arg\min_x L(x, y^*)$.

45 Distributed Optimization, Development of ADMM: Dual Ascent
Gradient ascent for the dual problem: $y^{k+1} = y^k + \alpha_k \nabla_y g(y^k)$, where $\nabla_y g(y^k) = A\tilde{x} - b$ and $\tilde{x} = \arg\min_x L(x, y^k)$.
Dual ascent algorithm:
$x^{k+1} = \arg\min_x L(x, y^k)$
$y^{k+1} = y^k + \alpha_k (Ax^{k+1} - b)$
Assumptions: $L(x, y^k)$ is strictly convex in $x$ (otherwise the first step can have multiple solutions) and bounded below.

46 Distributed Optimization, Development of ADMM: Dual Decomposition
Suppose $f$ is separable: $f(x) = f_1(x_1) + \cdots + f_N(x_N)$, $x = (x_1, \ldots, x_N)$.
Then $L$ is separable in $x$: $L(x, y) = L_1(x_1, y) + \cdots + L_N(x_N, y) - y^T b$, where $L_i(x_i, y) = f_i(x_i) + y^T A_i x_i$.
The x-minimization splits into $N$ separate problems: $x_i^{k+1} = \arg\min_{x_i} L_i(x_i, y^k)$.

47 Distributed Optimization, Development of ADMM: Dual Decomposition
Dual decomposition:
$x_i^{k+1} = \arg\min_{x_i} L_i(x_i, y^k)$, $i = 1, \ldots, N$
$y^{k+1} = y^k + \alpha_k \left( \sum_{i=1}^N A_i x_i^{k+1} - b \right)$
Distributed solution: scatter $y^k$ to the individual nodes; compute $x_i$ at the $i$-th node (distributed step); gather $A_i x_i^{k+1}$ from the $i$-th node. All drawbacks of dual ascent remain.
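
A toy numpy sketch of dual decomposition with quadratic blocks $f_i(x_i) = \frac{1}{2}\|x_i - c_i\|^2$, for which the $x_i$-minimization has the closed form $x_i = c_i - A_i^T y$; the step size $\alpha$, block sizes and data are placeholder choices.

```python
import numpy as np

def dual_decomposition(A_blocks, c_blocks, b, alpha=0.02, iters=3000):
    y = np.zeros(b.shape[0])
    for _ in range(iters):
        # Each node i solves argmin_{x_i} 0.5*||x_i - c_i||^2 + y^T A_i x_i (closed form).
        xs = [c - A.T @ y for A, c in zip(A_blocks, c_blocks)]
        # Central step: gradient ascent on the dual; gradient = sum_i A_i x_i - b.
        y = y + alpha * (sum(A @ x for A, x in zip(A_blocks, xs)) - b)
    return xs, y

rng = np.random.default_rng(0)
A_blocks = [rng.normal(size=(3, 4)) for _ in range(3)]   # N = 3 blocks, 3 coupling constraints
c_blocks = [rng.normal(size=4) for _ in range(3)]
b = rng.normal(size=3)
xs, y = dual_decomposition(A_blocks, c_blocks, b)
print(np.linalg.norm(sum(A @ x for A, x in zip(A_blocks, xs)) - b))  # ~0: constraint satisfied
```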

48 Distributed Optimization, Development of ADMM: Method of Multipliers
Make dual ascent work under more general conditions by using the augmented Lagrangian: $L_\rho(x, y) = f(x) + y^T(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2$
Method of multipliers:
$x^{k+1} = \arg\min_x L_\rho(x, y^k)$
$y^{k+1} = y^k + \rho(Ax^{k+1} - b)$

49 Distributed Optimization, Development of ADMM: Method of Multipliers
Optimality conditions (for differentiable $f$): primal feasibility, $Ax^* - b = 0$; dual feasibility, $\nabla f(x^*) + A^T y^* = 0$.
Since $x^{k+1}$ minimizes $L_\rho(x, y^k)$:
$0 = \nabla_x L_\rho(x^{k+1}, y^k) = \nabla_x f(x^{k+1}) + A^T(y^k + \rho(Ax^{k+1} - b)) = \nabla_x f(x^{k+1}) + A^T y^{k+1}$
The dual update $y^{k+1} = y^k + \rho(Ax^{k+1} - b)$ makes $(x^{k+1}, y^{k+1})$ dual feasible. Primal feasibility is achieved in the limit: $(Ax^{k+1} - b) \to 0$.

50 Distributed Optimization, Development of ADMM: Alternating Direction Method of Multipliers
Problem with applying the standard method of multipliers to distributed optimization: there is no problem decomposition even if $f$ is separable, due to the quadratic term $\frac{\rho}{2}\|Ax - b\|_2^2$.

51 Distributed Optimization, Development of ADMM: Alternating Direction Method of Multipliers
ADMM problem: $\min_{x,z} f(x) + g(z)$ subject to $Ax + Bz = c$.
Augmented Lagrangian: $L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$
ADMM:
$x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k)$
$z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k)$
$y^{k+1} = y^k + \rho(Ax^{k+1} + Bz^{k+1} - c)$

52 Distributed Optimization, Development of ADMM: Alternating Direction Method of Multipliers
Recall the problem with applying the standard method of multipliers to distributed optimization: there is no problem decomposition even if $f$ is separable, due to the quadratic term $\frac{\rho}{2}\|Ax - b\|_2^2$. The above technique reduces to the method of multipliers if we do a joint minimization over $x$ and $z$; since we split the joint $x, z$ minimization step, the problem can be decomposed.

53 Distributed Optimization, Development of ADMM: ADMM Optimality Conditions
Optimality conditions (differentiable case): primal feasibility, $Ax + Bz - c = 0$; dual feasibility, $\nabla f(x) + A^T y = 0$ and $\nabla g(z) + B^T y = 0$.
Since $z^{k+1}$ minimizes $L_\rho(x^{k+1}, z, y^k)$:
$0 = \nabla g(z^{k+1}) + B^T y^k + \rho B^T(Ax^{k+1} + Bz^{k+1} - c) = \nabla g(z^{k+1}) + B^T y^{k+1}$
So the dual variable update satisfies the second dual feasibility condition. Primal feasibility and the first dual feasibility condition are satisfied asymptotically.

54 Distributed Optimization, Development of ADMM: ADMM Optimality Conditions
Primal residual: $r^k = Ax^k + Bz^k - c$.
Since $x^{k+1}$ minimizes $L_\rho(x, z^k, y^k)$:
$0 = \nabla f(x^{k+1}) + A^T y^k + \rho A^T(Ax^{k+1} + Bz^k - c)$
$= \nabla f(x^{k+1}) + A^T(y^k + \rho r^{k+1} + \rho B(z^k - z^{k+1}))$
$= \nabla f(x^{k+1}) + A^T y^{k+1} + \rho A^T B(z^k - z^{k+1})$
or $\rho A^T B(z^{k+1} - z^k) = \nabla f(x^{k+1}) + A^T y^{k+1}$.
Hence $s^{k+1} = \rho A^T B(z^{k+1} - z^k)$ can be thought of as a dual residual.

55 Distributed Optimization, Development of ADMM: ADMM with Scaled Dual Variables
Combine the linear and quadratic terms: with the scaled dual variable $u = y/\rho$ and residual $r = Ax + Bz - c$, we have $y^T r + \frac{\rho}{2}\|r\|_2^2 = \frac{\rho}{2}\|r + u\|_2^2 - \frac{\rho}{2}\|u\|_2^2$, which gives the scaled-form updates used earlier (slide 26).

56 Applications and extensions, Weighted Parameter Averaging: Distributed Support Vector Machines
Training dataset partitioned into $M$ partitions ($S_m$, $m = 1, \ldots, M$); each partition has $L$ datapoints, $S_m = \{(x_{ml}, y_{ml})\}$, $l = 1, \ldots, L$, and can be processed locally on a single computer.
Distributed SVM training problem [?]:
$\min_{w_m, z} \sum_{m=1}^M \sum_{l=1}^L loss(w_m; (x_{ml}, y_{ml})) + r(z)$ s.t. $w_m - z = 0$, $m = 1, \ldots, M$.

57 Applications and extensions, Weighted Parameter Averaging: Parameter Averaging
Parameter averaging, also called mixture weights, was proposed in [?] for logistic regression; the results carry over to SVMs with a suitable sub-derivative.
Locally learn an SVM on $S_m$: $\hat{w}_m = \arg\min_w \frac{1}{L}\sum_{l=1}^L loss(w; x_{ml}, y_{ml}) + \lambda \|w\|^2$, $m = 1, \ldots, M$.
The final SVM parameter is $w^{PA} = \frac{1}{M}\sum_{m=1}^M \hat{w}_m$.
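
A sketch of parameter averaging: each partition trains a local linear SVM independently (here with a crude sub-gradient solver; any local SVM solver would do) and the parameters are simply averaged. The data, $\lambda$ and solver settings are illustrative.

```python
import numpy as np

def local_svm(U, v, lam=0.01, iters=500):
    """Rough local solver: sub-gradient descent on (1/L)*sum_l hinge + lam*||w||^2."""
    L, d = U.shape
    w = np.zeros(d)
    for k in range(1, iters + 1):
        active = (v * (U @ w)) < 1.0
        g = -(U[active].T @ v[active]) / L + 2.0 * lam * w
        w -= (1.0 / k) * g
    return w

def parameter_averaging(partitions, lam=0.01):
    ws = [local_svm(U, v, lam) for U, v in partitions]   # hat{w}_m, trained independently (in parallel in practice)
    return np.mean(ws, axis=0)                           # w_PA = (1/M) sum_m hat{w}_m

rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 10))
v = np.sign(U @ rng.normal(size=10))
partitions = list(zip(np.array_split(U, 10), np.array_split(v, 10)))   # M = 10 partitions
w_pa = parameter_averaging(partitions)
print(np.mean(np.sign(U @ w_pa) == v))                   # training accuracy of the averaged model
```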

58 Applications and extensions, Weighted Parameter Averaging: Problem with Parameter Averaging
PA with varying number of partitions, toy dataset.

59 Applications and extensions, Weighted Parameter Averaging: Weighted Parameter Averaging
The final hypothesis is a weighted sum of the parameters $\hat{w}_m$ (also proposed in [?]): $w = \sum_{m=1}^M \beta_m \hat{w}_m$. How to get the $\beta_m$?
Notation: $\beta = [\beta_1, \ldots, \beta_M]^T$, $\hat{W} = [\hat{w}_1, \ldots, \hat{w}_M]$, so that $w = \hat{W}\beta$.

60 Applications and extensions, Weighted Parameter Averaging: Weighted Parameter Averaging
Find the optimal set of weights $\beta$ which attains the lowest regularized hinge loss:
$\min_{\beta, \xi} \lambda \|\hat{W}\beta\|^2 + \frac{1}{ML}\sum_{m=1}^M \sum_{i=1}^L \xi_{mi}$
subject to $y_{mi}(\beta^T \hat{W}^T x_{mi}) \ge 1 - \xi_{mi}$, $\xi_{mi} \ge 0$, $m = 1, \ldots, M$, $i = 1, \ldots, L$.
$\hat{W}$ is a pre-computed parameter.

61 Applications and extensions, Weighted Parameter Averaging: Distributed Weighted Parameter Averaging
Distributed version of primal weighted parameter averaging:
$\min_{\gamma_m, \beta} \frac{1}{ML}\sum_{m=1}^M \sum_{l=1}^L loss(\hat{W}\gamma_m; x_{ml}, y_{ml}) + r(\beta)$ s.t. $\gamma_m - \beta = 0$, $m = 1, \ldots, M$,
where $r(\beta) = \lambda \|\hat{W}\beta\|^2$, $\gamma_m$ are the weights at the $m$-th computer, and $\beta$ is the consensus weight.

62 Applications and extensions, Weighted Parameter Averaging: Distributed Weighted Parameter Averaging
Distributed algorithm using ADMM:
$\gamma_m^{k+1} := \arg\min_\gamma \left( loss(A_m \gamma) + \frac{\rho}{2}\|\gamma - \beta^k + u_m^k\|_2^2 \right)$
$\beta^{k+1} := \arg\min_\beta \left( r(\beta) + \frac{M\rho}{2}\|\beta - \bar{\gamma}^{k+1} - \bar{u}^k\|_2^2 \right)$
$u_m^{k+1} = u_m^k + \gamma_m^{k+1} - \beta^{k+1}$
Here the $u_m$ are the scaled Lagrange multipliers, $\bar{\gamma} = \frac{1}{M}\sum_{m=1}^M \gamma_m$ and $\bar{u} = \frac{1}{M}\sum_{m=1}^M u_m$.

63 Applications and extensions, Weighted Parameter Averaging: Toy Dataset, PA and WPA
PA (left) and WPA (right) with varying number of partitions, toy dataset.

64 Applications and extensions, Weighted Parameter Averaging: Toy Dataset, PA and WPA
Accuracy of PA and WPA with varying number of partitions, toy dataset.

65 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Epsilon (2000 features, 6000 datapoints): test set accuracy with varying number of partitions.

66 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Gisette (5000 features, 6000 datapoints): test set accuracy with varying number of partitions.

67 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Real-sim (20000 features, 3000 datapoints): test set accuracy with varying number of partitions.

68 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Convergence of test accuracy with iterations (200 partitions).

69 Applications and extensions, Weighted Parameter Averaging: Real World Datasets
Convergence of primal residual with iterations (200 partitions).

70 Applications and extensions, Fully-distributed SVM: Distributed SVM on an Arbitrary Network
Motivations: sensor networks, corporate networks, privacy.
Assumptions: data is available at the nodes of the network; communication is possible only along the edges of the network.

71 Applications and extensions, Fully-distributed SVM: Distributed SVM on an Arbitrary Network
SVM optimization problem:
$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
s.t. $y_{jn}(w^T x_{jn} + b) \ge 1 - \xi_{jn}$ and $\xi_{jn} \ge 0$, $\forall j \in J$, $n = 1, \ldots, N_j$.
Node $j$ has a copy of $w_j, b_j$. Distributed formulation:
$\min_{\{w_j, b_j, \xi_{jn}\}} \frac{1}{2}\sum_{j=1}^J \|w_j\|^2 + JC\sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
s.t. $y_{jn}(w_j^T x_{jn} + b_j) \ge 1 - \xi_{jn}$ and $\xi_{jn} \ge 0$, $\forall j \in J$, $n = 1, \ldots, N_j$; $w_j = w_i$, $\forall j, i \in B_j$.

72 Applications and extensions, Fully-distributed SVM: Algorithm
Using $v_j = [w_j^T\ b_j]^T$, $X_j = [[x_{j1}, \ldots, x_{jN_j}]^T\ \mathbf{1}_j]$ and $Y_j = \mathrm{diag}([y_{j1}, \ldots, y_{jN_j}])$:
$\min_{\{v_j, \xi_{jn}, \omega_{ji}\}} \frac{1}{2}\sum_{j=1}^J r(v_j) + JC\sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn}$
s.t. $Y_j X_j v_j \ge \mathbf{1} - \xi_j$, $\xi_j \ge 0$, $\forall j \in J$; $v_j = \omega_{ji}$, $\omega_{ji} = v_i$, $\forall j, i \in B_j$.
Surrogate augmented Lagrangian:
$L(\{v_j\}, \{\xi_j\}, \{\omega_{ji}\}, \{\alpha_{jik}\}) = \frac{1}{2}\sum_{j=1}^J r(v_j) + JC\sum_{j=1}^J \sum_{n=1}^{N_j} \xi_{jn} + \sum_{j=1}^J \sum_{i \in B_j} \left( \alpha_{ji1}^T(v_j - \omega_{ji}) + \alpha_{ji2}^T(\omega_{ji} - v_i) \right) + \frac{\eta}{2}\sum_{j=1}^J \sum_{i \in B_j} \left( \|v_j - \omega_{ji}\|^2 + \|v_i - \omega_{ji}\|^2 \right)$

73 Applications and extensions, Fully-distributed SVM: Algorithm
ADMM-based algorithm:
$\{v_j^{t+1}, \xi_{jn}^{t+1}\} = \arg\min_{\{v_j, \xi_j\} \in W} L(\{v_j\}, \{\xi_j\}, \{\omega_{ji}^t\}, \{\alpha_{jik}^t\})$
$\{\omega_{ji}^{t+1}\} = \arg\min_{\omega_{ji}} L(\{v_j^{t+1}\}, \{\xi_j^{t+1}\}, \{\omega_{ji}\}, \{\alpha_{jik}^t\})$
$\alpha_{ji1}^{t+1} = \alpha_{ji1}^t + \eta(v_j^{t+1} - \omega_{ji}^{t+1})$
$\alpha_{ji2}^{t+1} = \alpha_{ji2}^t + \eta(\omega_{ji}^{t+1} - v_i^{t+1})$
From the second equation: $\omega_{ji}^{t+1} = \frac{1}{2\eta}(\alpha_{ji1}^t - \alpha_{ji2}^t) + \frac{1}{2}(v_j^{t+1} + v_i^{t+1})$

74 Applications and extensions, Fully-distributed SVM: Algorithm
Hence:
$\alpha_{ji1}^{t+1} = \frac{1}{2}(\alpha_{ji1}^t + \alpha_{ji2}^t) + \frac{\eta}{2}(v_j^{t+1} - v_i^{t+1})$
$\alpha_{ji2}^{t+1} = \frac{1}{2}(\alpha_{ji1}^t + \alpha_{ji2}^t) + \frac{\eta}{2}(v_j^{t+1} - v_i^{t+1})$
Substituting $\omega_{ji}^{t+1} = \frac{1}{2}(v_j^{t+1} + v_i^{t+1})$ into the surrogate augmented Lagrangian, the third term becomes
$\sum_{j=1}^J \sum_{i \in B_j} \alpha_{ji1}^T (v_j - v_i) = \sum_{j=1}^J v_j^T \sum_{i \in B_j} (\alpha_{ji1}^t - \alpha_{ij1}^t)$
Substitute $\alpha_j^t = \sum_{i \in B_j} (\alpha_{ji1}^t - \alpha_{ij1}^t)$.

75 Applications and extensions, Fully-distributed SVM: Algorithm
The final algorithm:
$\{v_j^{t+1}, \xi_{jn}^{t+1}\} = \arg\min_{\{v_j, \xi_j\} \in W} L(\{v_j\}, \{\xi_j\}, \{\alpha_j^t\})$
$\alpha_j^{t+1} = \alpha_j^t + \frac{\eta}{2}\sum_{i \in B_j}(v_j^{t+1} - v_i^{t+1})$

76 Applications and extensions, Fully-distributed SVM: Thank you! Questions?
