Distributed Machine Learning and Big Data
1 Distributed Machine Learning and Big Data
Sourangshu Bhattacharya, Dept. of Computer Science and Engineering, IIT Kharagpur. August 21, 2015.
2 Outline
1. Machine Learning and Big Data: Support Vector Machines; Stochastic Sub-gradient Descent.
2. Distributed Optimization: ADMM; Convergence; Distributed Loss Minimization; Results; Development of ADMM.
3. Applications and Extensions: Weighted Parameter Averaging; Fully-distributed SVM.
3 Machine Learning and Big Data: What is Big Data?
6 billion web queries per day. 10 billion display advertisements per day. 30 billion text ads per day. 150 million credit card transactions per day. 100 billion emails per day.
4 Machine Learning and Big Data: Machine Learning on Big Data
Classification: spam / not spam, 100B emails. Multi-label classification: image tagging, 14M images, 10K tags. Regression: CTR estimation, 10B ad views. Ranking: web search, 6B queries. Recommendation: online shopping, 1.7B views in the US.
5 Machine Learning and Big Data: Classification Example
Email spam classification. Features (u_i): vector of counts of all words. No. of features (d): words in vocabulary (~100,000). No. of non-zero features: ~100. No. of emails per day: 100M. Size of training set using 30 days of data: 6 TB (assuming 20 B per datum). Time taken to read the data once: ~83 hrs (at 20 MB per second). Solution: use multiple computers.
6 Machine Learning and Big Data: Big Data Paradigm
3 Vs: Volume, Variety, Velocity. Distributed systems: with many computers, the chance of some failure within an hour becomes significant. Communication efficiency: data locality. Many systems: Hadoop, Spark, GraphLab, etc. Goal: implement machine learning algorithms on big-data systems.
7 Machine Learning and Big Data: Binary Classification Problem
A set of labeled datapoints S = {(u_i, v_i), i = 1, ..., n}, with u_i in R^d and v_i in {+1, -1}.
Linear predictor function: v = sign(x^T u).
Error function: E = sum_{i=1}^n 1(v_i x^T u_i <= 0).
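The 0/1 error above can be evaluated directly; the data and predictor below are illustrative, not from the talk.

```python
import numpy as np

# Quick numeric check of the 0/1 error E = sum_i 1(v_i x^T u_i <= 0);
# the data and predictor below are illustrative.
def zero_one_error(x, U, v):
    # U: (n, d) datapoints u_i as rows, v: labels in {+1, -1}, x: parameters
    margins = v * (U @ x)             # v_i x^T u_i
    return int(np.sum(margins <= 0))

U = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 1.0]])
v = np.array([1, -1, 1])
x = np.array([1.0, 0.0])              # predictor sign(x^T u)
print(zero_one_error(x, U, v))        # second and third points are errors
```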
8 Machine Learning and Big Data: Logistic Regression
Probability of v is given by: P(v | u, x) = sigma(v x^T u) = 1 / (1 + e^{-v x^T u}).
Learning problem: given dataset S, estimate x by minimizing the regularized negative log-likelihood:
x* = argmin_x sum_{i=1}^n log(1 + e^{-v_i x^T u_i}) + (lambda/2) x^T x.
9 Machine Learning and Big Data: Convex Function
f is a convex function if: f(t x_1 + (1 - t) x_2) <= t f(x_1) + (1 - t) f(x_2), for all t in [0, 1].
10 Machine Learning and Big Data: Convex Optimization
Convex optimization problem: minimize_x f(x) subject to g_i(x) <= 0, i = 1, ..., k, where f and the g_i are convex functions. For convex optimization problems, local optima are also global optima.
11 Machine Learning and Big Data: Optimization Algorithm: Gradient Descent
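A minimal sketch of gradient descent on the regularized logistic objective from the previous slide; the dataset, step size and iteration count are illustrative choices, not from the talk.

```python
import numpy as np

# Gradient descent on sum_i log(1 + exp(-v_i x^T u_i)) + (lam/2) x^T x.
def grad(x, U, v, lam):
    z = v * (U @ x)                          # margins v_i x^T u_i
    s = 1.0 / (1.0 + np.exp(z))              # = sigma(-v_i x^T u_i)
    return -(U.T @ (v * s)) + lam * x        # gradient of loss + regularizer

rng = np.random.default_rng(0)
U = rng.normal(size=(50, 3))
v = np.sign(U @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50))
x = np.zeros(3)
for k in range(1000):
    x -= 0.02 * grad(x, U, v, lam=0.1)       # fixed step size (illustrative)

acc = np.mean(np.sign(U @ x) == v)
print(acc > 0.9)
```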
12 Support Vector Machines: Classification Problem
13 Support Vector Machines: SVM
Separating hyperplane: x^T u = 0. Parallel hyperplanes (defining the margin): x^T u = +/-1. Margin (perpendicular distance between the parallel hyperplanes): 2 / ||x||. Correct classification of training datapoints: v_i x^T u_i >= 1 for all i. Allowing error (slack) xi_i: v_i x^T u_i >= 1 - xi_i for all i.
Max-margin formulation:
min_{x, xi} (1/2) ||x||^2 + C sum_{i=1}^n xi_i
subject to: v_i x^T u_i >= 1 - xi_i, xi_i >= 0, i = 1, ..., n.
14 Support Vector Machines: SVM Dual
Lagrangian: L = (1/2) x^T x + C sum_{i=1}^n xi_i + sum_{i=1}^n alpha_i (1 - xi_i - v_i x^T u_i) - sum_{i=1}^n mu_i xi_i.
Dual problem: (x*, alpha*, mu*) = max_{alpha, mu} min_x L(x, alpha, mu).
For a strictly convex problem, the primal and dual solutions coincide (strong duality).
KKT conditions: x = sum_{i=1}^n alpha_i v_i u_i, and C = alpha_i + mu_i.
15 Support Vector Machines: SVM Dual
The dual problem:
max_alpha sum_{i=1}^n alpha_i - (1/2) sum_{i=1}^n sum_{j=1}^n alpha_i alpha_j v_i v_j u_i^T u_j
subject to: 0 <= alpha_i <= C, for all i.
The dual is a quadratic programming problem in n variables. It can be solved even if only kernel values k(u_i, u_j) = u_i^T u_j are given, and it is dimension agnostic. Many efficient algorithms exist for solving it, e.g. SMO (Platt, 1999). Worst-case complexity is O(n^3), usually O(n^2).
16 Support Vector Machines: SVM
A more compact form: min_x sum_{i=1}^n max(0, 1 - v_i x^T u_i) + lambda ||x||_2^2.
Or: min_x sum_{i=1}^n l(x, u_i, v_i) + lambda Omega(x).
17 Support Vector Machines: Multi-class Classification
There are m classes: v_i in {1, ..., m}. Most popular scheme: v_i = argmax_{v in {1,...,m}} x_v^T u_i. Given example (u_i, v_i), we want x_{v_i}^T u_i >= x_j^T u_i for all j in {1, ..., m}. Using a margin of at least 1, the loss is:
l(u_i, v_i) = max_{j in {1,...,v_i - 1, v_i + 1,...,m}} max{0, 1 - (x_{v_i}^T u_i - x_j^T u_i)}.
Given dataset D, solve:
min_{x_1, ..., x_m} sum_{i in D} l(u_i, v_i) + lambda sum_{j=1}^m ||x_j||^2.
This can be extended to many settings, e.g. sequence labeling, learning to rank, etc.
18 Support Vector Machines: General Learning Problems
Support Vector Machines: min_x sum_{i=1}^n max{0, 1 - v_i x^T u_i} + lambda ||x||_2^2.
Logistic Regression: min_x sum_{i=1}^n log(1 + exp(-v_i x^T u_i)) + lambda ||x||_2^2.
General form: min_x sum_{i=1}^n l(x, u_i, v_i) + lambda Omega(x), where l is a loss function and Omega a regularizer.
19 Stochastic Sub-gradient Descent: Sub-gradients
A sub-gradient of a (possibly non-differentiable) convex function f at a point x_0 is a vector g such that: f(x) - f(x_0) >= g^T (x - x_0) for all x.
20 Stochastic Sub-gradient Descent: Sub-gradient Descent
Randomly initialize x^0. Iterate: x^k = x^{k-1} - t_k g(x^{k-1}), k = 1, 2, 3, ..., where g is a sub-gradient of f and t_k = 1/k. Track the best iterate: f(x_best^{(k)}) = min_{i=1,...,k} f(x^i).
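The iteration above can be sketched on a small non-smooth problem; the function f(x) = |x - 3| and starting point are illustrative.

```python
import numpy as np

# Sub-gradient descent with diminishing steps t_k = 1/k on the non-smooth
# convex function f(x) = |x - 3|, tracking the best iterate seen so far.
def f(x):
    return abs(x - 3.0)

x = 10.0
best = f(x)
for k in range(1, 2001):
    g = np.sign(x - 3.0)          # a sub-gradient of f at x
    x = x - (1.0 / k) * g
    best = min(best, f(x))        # f(x_best^(k)) = min over iterates so far

print(best < 0.01)
```

Since t_k = 1/k is not summable, the iterates eventually cross the minimizer and the best value keeps improving, which is why the best iterate is tracked rather than the last.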
21 Stochastic Sub-gradient Descent: Sub-gradient Descent
22 Stochastic Sub-gradient Descent
Convergence rate of sub-gradient descent is O(1/sqrt(k)). Each iteration takes O(n) time. Reduce per-iteration time by computing the sub-gradient on a subset of examples: the stochastic sub-gradient. Inherently serial. Typical O(1/epsilon^2) iteration complexity.
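A stochastic sub-gradient sketch for an SVM-style objective, sampling one example per step; the data, regularization, and step schedule t_k = 1/(lambda k) are illustrative choices (in the spirit of Pegasos), not from the talk.

```python
import numpy as np

# Stochastic sub-gradient descent for
# (lam/2)||x||^2 + (1/n) sum_i max(0, 1 - v_i x^T u_i),
# using one random example per iteration.
rng = np.random.default_rng(1)
n, d, lam = 200, 2, 0.1
U = rng.normal(size=(n, d))
v = np.sign(U[:, 0] + 0.5 * U[:, 1])       # separable toy labels

x = np.zeros(d)
for k in range(1, 5001):
    i = rng.integers(n)                     # pick one random example
    g = lam * x                             # gradient of the regularizer
    if v[i] * (U[i] @ x) < 1:               # hinge active: add its sub-gradient
        g = g - v[i] * U[i]
    x = x - (1.0 / (lam * k)) * g           # diminishing step t_k = 1/(lam k)

acc = np.mean(np.sign(U @ x) == v)
print(acc > 0.9)
```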
23 Stochastic Sub-gradient Descent
24 Distributed Optimization: Distributed Gradient Descent
Divide the dataset into m parts; each part is processed on one computer (m in total). There is one central computer, and all computers can communicate with it over the network.
Define loss(x) = sum_{j=1}^m sum_{i in C_j} l_i(x) + lambda Omega(x), where l_i(x) = l(x, u_i, v_i).
The gradient (in the case of differentiable loss): grad loss(x) = sum_{j=1}^m sum_{i in C_j} grad l_i(x) + lambda grad Omega(x).
Compute grad l^j(x) = sum_{i in C_j} grad l_i(x) on the j-th computer and communicate it to the central computer.
25 Distributed Optimization: Distributed Gradient Descent
Compute grad loss(x) = sum_{j=1}^m grad l^j(x) + lambda grad Omega(x) at the central computer. The gradient descent update: x^{k+1} = x^k - alpha grad loss(x^k), with alpha chosen by a (distributed) line search algorithm. For non-differentiable losses, we can use a distributed sub-gradient descent algorithm. Slow for most practical problems: to achieve epsilon tolerance, gradient descent (logistic regression) needs O(1/epsilon) iterations, and (stochastic) sub-gradient descent needs O(1/epsilon^2) iterations.
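The scheme above can be simulated in a single process: each "worker" computes the gradient of its shard, and the central node sums the partial gradients and takes the step. Data, step size and shard count are illustrative.

```python
import numpy as np

# One machine per shard: each worker computes the logistic-loss gradient on
# its own partition C_j; the central node sums the partial gradients, adds
# the regularizer term, and performs the descent step.
def local_grad(x, U, v):
    z = v * (U @ x)
    return -(U.T @ (v / (1.0 + np.exp(z))))  # shard gradient, no regularizer

rng = np.random.default_rng(2)
U = rng.normal(size=(300, 4))
v = np.sign(U @ np.array([1.0, 1.0, -1.0, 0.5]))
shards = np.array_split(np.arange(300), 3)    # m = 3 workers

x, lam = np.zeros(4), 0.1
for k in range(500):
    partial = [local_grad(x, U[s], v[s]) for s in shards]  # done in parallel
    full = sum(partial) + lam * x                          # at the central node
    x = x - 0.005 * full                                   # fixed step (no line search)

acc = np.mean(np.sign(U @ x) == v)
print(acc > 0.9)
```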
26 Distributed Optimization: Alternating Direction Method of Multipliers (ADMM)
Problem: minimize_{x,z} f(x) + g(z) subject to Ax + Bz = c.
Algorithm; iterate until convergence:
x^{k+1} = argmin_x f(x) + (rho/2) ||Ax + Bz^k - c + u^k||_2^2
z^{k+1} = argmin_z g(z) + (rho/2) ||Ax^{k+1} + Bz - c + u^k||_2^2
u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c
27 Distributed Optimization: ADMM Stopping Criteria
Stop when the primal and dual residuals are small: ||r^k||_2 <= epsilon_pri and ||s^k||_2 <= epsilon_dual. Both ||r^k||_2 -> 0 and ||s^k||_2 -> 0 as k -> infinity.
28 Distributed Optimization: ADMM Observations
The x-update requires solving an optimization problem of the form min_x f(x) + (rho/2) ||Ax - v||_2^2, with v = c - Bz^k - u^k; similarly for the z-update. Sometimes these have closed forms. ADMM is a meta optimization algorithm.
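A concrete instance where both subproblems have closed forms is the lasso (an illustrative example, not one from the slides): the x-update is a linear solve and the z-update is elementwise soft-thresholding.

```python
import numpy as np

# Illustrative ADMM instance: lasso
# min_x (1/2)||Ax - b||_2^2 + t ||z||_1  s.t.  x - z = 0.
rng = np.random.default_rng(3)
A = rng.normal(size=(40, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.0, 1.5]                     # sparse ground truth
b = A @ x_true
t, rho = 0.1, 1.0

x, z, u = np.zeros(10), np.zeros(10), np.zeros(10)
Q = np.linalg.inv(A.T @ A + rho * np.eye(10))     # cache the x-update matrix
for k in range(200):
    x = Q @ (A.T @ b + rho * (z - u))             # closed-form x-update
    w = x + u
    z = np.sign(w) * np.maximum(np.abs(w) - t / rho, 0.0)  # soft-threshold
    u = u + x - z                                 # scaled dual update

print(np.linalg.norm(x - x_true) < 0.1)
```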
29 Distributed Optimization: Convergence of ADMM
Assumption 1: the functions f: R^n -> R and g: R^m -> R are closed, proper and convex. This is the same as assuming epi f = {(x, t) in R^n x R : f(x) <= t} is closed and convex.
Assumption 2: the unaugmented Lagrangian L_0(x, z, y) has a saddle point (x*, z*, y*):
L_0(x*, z*, y) <= L_0(x*, z*, y*) <= L_0(x, z, y*) for all x, z, y.
30 Distributed Optimization: Convergence of ADMM
Primal residual: r = Ax + Bz - c. Optimal objective: p* = inf_{x,z} {f(x) + g(z) : Ax + Bz = c}.
Convergence results: primal residual convergence, r^k -> 0 as k -> infinity; dual residual convergence, s^k -> 0; objective convergence, f(x^k) + g(z^k) -> p*; dual variable convergence, y^k -> y*.
31 Distributed Optimization: Decomposition
If f is separable: f(x) = f_1(x_1) + ... + f_N(x_N), with x = (x_1, ..., x_N), and A is conformably block separable (i.e. A^T A is block diagonal), then the x-update splits into N parallel updates of the x_i.
32 Distributed Optimization: Consensus Optimization
Problem: min_x f(x) = sum_{i=1}^N f_i(x).
ADMM form: min_{x_i, z} sum_{i=1}^N f_i(x_i) s.t. x_i - z = 0, i = 1, ..., N.
Augmented Lagrangian: L_rho(x_1, ..., x_N, z, y) = sum_{i=1}^N (f_i(x_i) + y_i^T (x_i - z) + (rho/2) ||x_i - z||_2^2).
33 Distributed Optimization: Consensus Optimization
ADMM algorithm:
x_i^{k+1} = argmin_{x_i} (f_i(x_i) + y_i^{k T} (x_i - z^k) + (rho/2) ||x_i - z^k||_2^2)
z^{k+1} = (1/N) sum_{i=1}^N (x_i^{k+1} + (1/rho) y_i^k)
y_i^{k+1} = y_i^k + rho (x_i^{k+1} - z^{k+1})
Final solution is z^k.
34 Distributed Optimization: Consensus Optimization
The z-update can be written as: z^{k+1} = x-bar^{k+1} + (1/rho) y-bar^k. Averaging the y-updates: y-bar^{k+1} = y-bar^k + rho (x-bar^{k+1} - z^{k+1}). Substituting the first into the second gives y-bar^{k+1} = 0, hence z^{k+1} = x-bar^{k+1}.
Revised algorithm:
x_i^{k+1} = argmin_{x_i} (f_i(x_i) + y_i^{k T} (x_i - x-bar^k) + (rho/2) ||x_i - x-bar^k||_2^2)
y_i^{k+1} = y_i^k + rho (x_i^{k+1} - x-bar^{k+1})
Final solution is x-bar^k.
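Consensus ADMM can be run on a toy separable objective whose answer is known in closed form; here f_i(x) = (1/2)(x - a_i)^2, so the consensus minimizer is the mean of the a_i. The problem is illustrative.

```python
import numpy as np

# Toy consensus-ADMM run: each worker holds f_i(x) = (1/2)(x - a_i)^2.
a = np.array([1.0, 4.0, 7.0])     # one term per worker
rho = 1.0

x = np.zeros(len(a))              # local copies x_i
y = np.zeros(len(a))              # dual variables y_i
z = 0.0
for k in range(100):
    # x_i-update has a closed form for this quadratic f_i
    x = (a - y + rho * z) / (1.0 + rho)
    z = np.mean(x + y / rho)      # z-update: average of x_i + y_i / rho
    y = y + rho * (x - z)         # dual update

print(abs(z - a.mean()) < 1e-6)
```

Consistent with the slide, the average of the y_i is driven to zero after the first iteration, after which z simply tracks the average of the local variables.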
35 Distributed Optimization: Distributed Loss Minimization
Problem: min_x l(Ax - b) + r(x). Partition A and b by rows: A = [A_1; ...; A_N], b = [b_1; ...; b_N], where A_i in R^{m_i x n} and b_i in R^{m_i}.
ADMM formulation: min_{x_i, z} sum_{i=1}^N l_i(A_i x_i - b_i) + r(z) s.t. x_i - z = 0, i = 1, ..., N.
36 Distributed Optimization: Distributed Loss Minimization
ADMM solution:
x_i^{k+1} = argmin_{x_i} (l_i(A_i x_i - b_i) + (rho/2) ||x_i - z^k + u_i^k||_2^2)
z^{k+1} = argmin_z (r(z) + (N rho / 2) ||z - x-bar^{k+1} - u-bar^k||_2^2)
u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}
37 Distributed Optimization: ADMM Results
Logistic regression using the loss-minimization formulation (Boyd et al.): min_x sum_{i=1}^n log(1 + exp(-v_i x^T u_i)) + lambda ||x||_2^2.
39 Distributed Optimization: Other Machine Learning Problems
Ridge regression. Lasso. Multi-class SVM. Ranking. Structured output prediction.
40 Distributed Optimization: ADMM Results. Lasso results (Boyd et al.).
41 Distributed Optimization: ADMM Results. SVM primal residual.
42 Distributed Optimization: ADMM Results. SVM accuracy.
43 Distributed Optimization: Results. Risk and hyperplane.
44 Development of ADMM: Dual Ascent
Convex equality-constrained problem: min_x f(x) subject to Ax = b.
Lagrangian: L(x, y) = f(x) + y^T (Ax - b). Dual function: g(y) = inf_x L(x, y). Dual problem: max_y g(y). Final solution: x* = argmin_x L(x, y*).
45 Development of ADMM: Dual Ascent
Gradient ascent for the dual problem: y^{k+1} = y^k + alpha_k grad g(y^k), where grad g(y^k) = A x-tilde - b with x-tilde = argmin_x L(x, y^k).
Dual ascent algorithm:
x^{k+1} = argmin_x L(x, y^k)
y^{k+1} = y^k + alpha_k (A x^{k+1} - b)
Assumptions: L(x, y^k) is strictly convex in x (else the first step can have multiple solutions) and bounded below.
46 Development of ADMM: Dual Decomposition
Suppose f is separable: f(x) = f_1(x_1) + ... + f_N(x_N), x = (x_1, ..., x_N). Then L is separable in x: L(x, y) = L_1(x_1, y) + ... + L_N(x_N, y) - y^T b, where L_i(x_i, y) = f_i(x_i) + y^T A_i x_i. The x-minimization splits into N separate problems: x_i^{k+1} = argmin_{x_i} L_i(x_i, y^k).
47 Development of ADMM: Dual Decomposition
Dual decomposition:
x_i^{k+1} = argmin_{x_i} L_i(x_i, y^k), i = 1, ..., N
y^{k+1} = y^k + alpha_k (sum_{i=1}^N A_i x_i^{k+1} - b)
Distributed solution: scatter y^k to the individual nodes; compute x_i on the i-th node (distributed step); gather A_i x_i from the i-th node. All drawbacks of dual ascent remain.
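The scatter/compute/gather loop above can be sketched on a tiny problem with a known answer; the objective and constraint below are illustrative.

```python
import numpy as np

# Dual decomposition sketch:
# min (1/2)(x1 - 3)^2 + (1/2)(x2 - 5)^2  s.t.  x1 + x2 = 6.
# Each x_i-update is a separate local solve given the shared dual variable y;
# here argmin_{x_i} f_i(x_i) + y * x_i has the closed form x_i = c_i - y.
c, b, alpha = np.array([3.0, 5.0]), 6.0, 0.5
y = 0.0
for k in range(200):
    x = c - y                      # local solves (could run on separate nodes)
    y = y + alpha * (x.sum() - b)  # gather the residual, dual gradient step

# Optimum: y* = 1, x* = (2, 4), which satisfies x1 + x2 = 6.
print(np.allclose(x, [2.0, 4.0]) and abs(y - 1.0) < 1e-6)
```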
48 Development of ADMM: Method of Multipliers
Make dual ascent work under more general conditions by using the augmented Lagrangian: L_rho(x, y) = f(x) + y^T (Ax - b) + (rho/2) ||Ax - b||_2^2.
Method of multipliers:
x^{k+1} = argmin_x L_rho(x, y^k)
y^{k+1} = y^k + rho (A x^{k+1} - b)
49 Development of ADMM: Method of Multipliers
Optimality conditions (for differentiable f): primal feasibility, Ax* - b = 0; dual feasibility, grad f(x*) + A^T y* = 0.
Since x^{k+1} minimizes L_rho(x, y^k):
0 = grad_x L_rho(x^{k+1}, y^k) = grad f(x^{k+1}) + A^T (y^k + rho (A x^{k+1} - b)) = grad f(x^{k+1}) + A^T y^{k+1}.
So the dual update y^{k+1} = y^k + rho (A x^{k+1} - b) makes (x^{k+1}, y^{k+1}) dual feasible. Primal feasibility is achieved in the limit: (A x^{k+1} - b) -> 0.
50 Development of ADMM: Alternating Direction Method of Multipliers
Problem with applying the standard method of multipliers for distributed optimization: there is no problem decomposition even if f is separable, due to the squared term (rho/2) ||Ax - b||_2^2.
51 Development of ADMM: Alternating Direction Method of Multipliers
ADMM problem: min_{x,z} f(x) + g(z) subject to Ax + Bz = c.
Augmented Lagrangian: L_rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (rho/2) ||Ax + Bz - c||_2^2.
ADMM:
x^{k+1} = argmin_x L_rho(x, z^k, y^k)
z^{k+1} = argmin_z L_rho(x^{k+1}, z, y^k)
y^{k+1} = y^k + rho (A x^{k+1} + B z^{k+1} - c)
52 Development of ADMM: Alternating Direction Method of Multipliers
This technique reduces to the method of multipliers if we jointly minimize over x and z. Since we split the joint (x, z) minimization step, the problem can be decomposed.
53 Development of ADMM: ADMM Optimality Conditions
Optimality conditions (differentiable case): primal feasibility, Ax + Bz - c = 0; dual feasibility, grad f(x) + A^T y = 0 and grad g(z) + B^T y = 0.
Since z^{k+1} minimizes L_rho(x^{k+1}, z, y^k):
0 = grad g(z^{k+1}) + B^T y^k + rho B^T (A x^{k+1} + B z^{k+1} - c) = grad g(z^{k+1}) + B^T y^{k+1}.
So the dual variable update satisfies the second dual feasibility condition. Primal feasibility and the first dual feasibility condition are satisfied asymptotically.
54 Development of ADMM: ADMM Optimality Conditions
Primal residual: r^k = A x^k + B z^k - c. Since x^{k+1} minimizes L_rho(x, z^k, y^k):
0 = grad f(x^{k+1}) + A^T y^k + rho A^T (A x^{k+1} + B z^k - c)
  = grad f(x^{k+1}) + A^T (y^k + rho r^{k+1} + rho B (z^k - z^{k+1}))
  = grad f(x^{k+1}) + A^T y^{k+1} + rho A^T B (z^k - z^{k+1}),
or rho A^T B (z^{k+1} - z^k) = grad f(x^{k+1}) + A^T y^{k+1}.
Hence s^{k+1} = rho A^T B (z^k - z^{k+1}) can be viewed as a dual residual.
55 Development of ADMM: ADMM with Scaled Dual Variables
Combine the linear and quadratic terms: with residual r = Ax + Bz - c and scaled dual variable u = (1/rho) y,
y^T r + (rho/2) ||r||_2^2 = (rho/2) ||r + u||_2^2 - (rho/2) ||u||_2^2.
This gives the scaled form of ADMM stated earlier:
x^{k+1} = argmin_x f(x) + (rho/2) ||Ax + Bz^k - c + u^k||_2^2
z^{k+1} = argmin_z g(z) + (rho/2) ||Ax^{k+1} + Bz - c + u^k||_2^2
u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c
56 Weighted Parameter Averaging: Distributed Support Vector Machines
Training dataset partitioned into M partitions (S_m, m = 1, ..., M). Each partition has L datapoints: S_m = {(x_ml, y_ml)}, l = 1, ..., L, and can be processed locally on a single computer.
Distributed SVM training problem [?]:
min_{w_m, z} sum_{m=1}^M sum_{l=1}^L loss(w_m; (x_ml, y_ml)) + r(z)
s.t. w_m - z = 0, m = 1, ..., M.
57 Weighted Parameter Averaging: Parameter Averaging
Parameter averaging, also called mixture weights, was proposed in [?] for logistic regression; the results hold for SVMs with a suitable sub-derivative.
Locally learn an SVM on S_m:
w-hat_m = argmin_w (1/L) sum_{l=1}^L loss(w; x_ml, y_ml) + lambda ||w||^2, m = 1, ..., M.
The final SVM parameter is: w_PA = (1/M) sum_{m=1}^M w-hat_m.
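Parameter averaging requires no communication during training, which a short simulation makes concrete. The local solver below is regularized least squares on the +/-1 labels, a stand-in for the local SVM solve on each partition; data and parameters are illustrative.

```python
import numpy as np

# Parameter-averaging sketch: learn a local linear model on each of M
# partitions independently, then average the M parameter vectors.
rng = np.random.default_rng(4)
U = rng.normal(size=(400, 5))
w_true = rng.normal(size=5)
v = np.sign(U @ w_true)
parts = np.array_split(np.arange(400), 4)        # M = 4 partitions

lam = 0.1
w_local = []
for p in parts:
    Up, vp = U[p], v[p]
    # local ridge solve (stand-in for the local SVM objective)
    w_hat = np.linalg.solve(Up.T @ Up + lam * np.eye(5), Up.T @ vp)
    w_local.append(w_hat)

w_pa = np.mean(w_local, axis=0)                  # w_PA = (1/M) sum_m w_hat_m
acc = np.mean(np.sign(U @ w_pa) == v)
print(acc > 0.9)
```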
58 Weighted Parameter Averaging: Problem with Parameter Averaging
PA with varying number of partitions (toy dataset).
59 Weighted Parameter Averaging
The final hypothesis is a weighted sum of the parameters w-hat_m, also proposed in [?]: w = sum_{m=1}^M beta_m w-hat_m. How to get the beta_m?
Notation: beta = [beta_1, ..., beta_M]^T, W-hat = [w-hat_1, ..., w-hat_M], so w = W-hat beta.
60 Weighted Parameter Averaging
Find the optimal set of weights beta that attains the lowest regularized hinge loss:
min_{beta, xi} lambda ||W-hat beta||^2 + (1/ML) sum_{m=1}^M sum_{i=1}^L xi_mi
subject to: y_mi (beta^T W-hat^T x_mi) >= 1 - xi_mi, xi_mi >= 0, for all i, m.
W-hat is a pre-computed parameter.
61 Weighted Parameter Averaging: Distributed Weighted Parameter Averaging
Distributed version of primal weighted parameter averaging:
min_{gamma_m, beta} (1/ML) sum_{m=1}^M sum_{l=1}^L loss(W-hat gamma_m; x_ml, y_ml) + r(beta)
s.t. gamma_m - beta = 0, m = 1, ..., M,
where r(beta) = lambda ||W-hat beta||^2, gamma_m are the weights on the m-th computer, and beta is the consensus weight.
62 Weighted Parameter Averaging: Distributed Weighted Parameter Averaging
Distributed algorithm using ADMM:
gamma_m^{k+1} := argmin_gamma (loss on partition m of W-hat gamma + (rho/2) ||gamma - beta^k + u_m^k||_2^2)
beta^{k+1} := argmin_beta (r(beta) + (M rho / 2) ||beta - gamma-bar^{k+1} - u-bar^k||_2^2)
u_m^{k+1} = u_m^k + gamma_m^{k+1} - beta^{k+1},
where the u_m are scaled Lagrange multipliers, gamma-bar = (1/M) sum_{m=1}^M gamma_m and u-bar = (1/M) sum_{m=1}^M u_m.
63 Weighted Parameter Averaging: Toy Dataset
PA (left) and WPA (right) with varying number of partitions (toy dataset).
64 Weighted Parameter Averaging: Toy Dataset
Accuracy of PA and WPA with varying number of partitions (toy dataset).
65 Weighted Parameter Averaging: Real-World Datasets
Epsilon (2000 features, 6000 datapoints): test set accuracy with varying number of partitions.
66 Gisette (5000 features, 6000 datapoints): test set accuracy with varying number of partitions.
67 Real-sim (20000 features, 3000 datapoints): test set accuracy with varying number of partitions.
68 Convergence of test accuracy with iterations (200 partitions).
69 Convergence of primal residual with iterations (200 partitions).
70 Fully-distributed SVM: Distributed SVM on an Arbitrary Network
Motivations: sensor networks; corporate networks; privacy. Assumptions: data is available at the nodes of the network; communication is possible only along the edges of the network.
71 Fully-distributed SVM: Distributed SVM on an Arbitrary Network
SVM optimization problem:
min_{w, b, xi} (1/2) ||w||^2 + C sum_{j=1}^J sum_{n=1}^{N_j} xi_jn
s.t. y_jn (w^T x_jn + b) >= 1 - xi_jn, xi_jn >= 0, for all j in J, n = 1, ..., N_j.
Node j has its own copy w_j, b_j. Distributed formulation:
min_{w_j, b_j, xi_jn} (1/2) sum_{j=1}^J ||w_j||^2 + JC sum_{j=1}^J sum_{n=1}^{N_j} xi_jn
s.t. y_jn (w_j^T x_jn + b_j) >= 1 - xi_jn, xi_jn >= 0, for all j in J, n = 1, ..., N_j,
w_j = w_i, for all j and all i in B_j (the neighbours of node j).
72 Fully-distributed SVM: Algorithm
Using v_j = [w_j^T b_j]^T, X_j = [[x_j1, ..., x_jN_j]^T 1_j] and Y_j = diag([y_j1, ..., y_jN_j]):
min_{v_j, xi_jn, omega_ji} (1/2) sum_{j=1}^J r(v_j) + JC sum_{j=1}^J sum_{n=1}^{N_j} xi_jn
s.t. Y_j X_j v_j >= 1 - xi_j, xi_j >= 0, for all j in J; v_j = omega_ji, v_i = omega_ji, for all j and i in B_j.
Surrogate augmented Lagrangian:
L({v_j}, {xi_j}, {omega_ji}, {alpha_jik}) = (1/2) sum_{j=1}^J r(v_j) + JC sum_{j=1}^J sum_{n=1}^{N_j} xi_jn
+ sum_{j=1}^J sum_{i in B_j} (alpha_ji1^T (v_j - omega_ji) + alpha_ji2^T (omega_ji - v_i))
+ (eta/2) sum_{j=1}^J sum_{i in B_j} (||v_j - omega_ji||^2 + ||v_i - omega_ji||^2)
73 Fully-distributed SVM: Algorithm
ADMM-based algorithm:
{v_j^{t+1}, xi_jn^{t+1}} = argmin_{{v_j, xi_j} in W} L({v_j}, {xi_j}, {omega_ji^t}, {alpha_jik^t})
{omega_ji^{t+1}} = argmin_{omega_ji} L({v_j^{t+1}}, {xi_j^{t+1}}, {omega_ji}, {alpha_jik^t})
alpha_ji1^{t+1} = alpha_ji1^t + eta (v_j^{t+1} - omega_ji^{t+1})
alpha_ji2^{t+1} = alpha_ji2^t + eta (omega_ji^{t+1} - v_i^{t+1})
From the second equation:
omega_ji^{t+1} = (1/(2 eta)) (alpha_ji1^t - alpha_ji2^t) + (1/2) (v_j^{t+1} + v_i^{t+1})
74 Fully-distributed SVM: Algorithm
Hence:
alpha_ji1^{t+1} = (1/2)(alpha_ji1^t + alpha_ji2^t) + (eta/2)(v_j^{t+1} - v_i^{t+1})
alpha_ji2^{t+1} = (1/2)(alpha_ji1^t + alpha_ji2^t) + (eta/2)(v_j^{t+1} - v_i^{t+1})
Substituting omega_ji^{t+1} = (1/2)(v_j^{t+1} + v_i^{t+1}) into the surrogate augmented Lagrangian, the multiplier term becomes:
sum_{j=1}^J sum_{i in B_j} alpha_ji1^T (v_j - v_i) = sum_{j=1}^J (sum_{i in B_j} (alpha_ji1^t - alpha_ij1^t))^T v_j.
Substitute alpha_j^t = sum_{i in B_j} (alpha_ji1^t - alpha_ij1^t).
75 Fully-distributed SVM: Algorithm
The final algorithm:
{v_j^{t+1}, xi_jn^{t+1}} = argmin_{{v_j, xi_j} in W} L({v_j}, {xi_j}, {alpha_j^t})
alpha_j^{t+1} = alpha_j^t + (eta/2) sum_{i in B_j} (v_j^{t+1} - v_i^{t+1})
76 Thank you! Questions?
More informationStatistical machine learning, high dimension and big data
Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationSupport Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano
More informationAdaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More informationANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING
ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of
More information2.3 Convex Constrained Optimization Problems
42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions
More informationIntroduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
More informationSolutions Of Some Non-Linear Programming Problems BIJAN KUMAR PATEL. Master of Science in Mathematics. Prof. ANIL KUMAR
Solutions Of Some Non-Linear Programming Problems A PROJECT REPORT submitted by BIJAN KUMAR PATEL for the partial fulfilment for the award of the degree of Master of Science in Mathematics under the supervision
More informationCOMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)
More informationBig Data Optimization for Modern Communication Networks
Big Data Optimization for Modern Communication Networks Lanchao Liu Advisor: Dr. Zhu Han Co-Advisor: Dr. Wei-Chuan Shih Wireless Networking, Signal Processing and Security Lab Electrical and Computer Engineering
More informationA Distributed Line Search for Network Optimization
01 American Control Conference Fairmont Queen Elizabeth, Montréal, Canada June 7-June 9, 01 A Distributed Line Search for Networ Optimization Michael Zargham, Alejandro Ribeiro, Ali Jadbabaie Abstract
More informationOnline Convex Optimization
E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting
More informationTable 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationFurther Study on Strong Lagrangian Duality Property for Invex Programs via Penalty Functions 1
Further Study on Strong Lagrangian Duality Property for Invex Programs via Penalty Functions 1 J. Zhang Institute of Applied Mathematics, Chongqing University of Posts and Telecommunications, Chongqing
More informationTwo-Stage Stochastic Linear Programs
Two-Stage Stochastic Linear Programs Operations Research Anthony Papavasiliou 1 / 27 Two-Stage Stochastic Linear Programs 1 Short Reviews Probability Spaces and Random Variables Convex Analysis 2 Deterministic
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More information1 Introduction to Matrices
1 Introduction to Matrices In this section, important definitions and results from matrix algebra that are useful in regression analysis are introduced. While all statements below regarding the columns
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationInterior-Point Algorithms for Quadratic Programming
Interior-Point Algorithms for Quadratic Programming Thomas Reslow Krüth Kongens Lyngby 2008 IMM-M.Sc-2008-19 Technical University of Denmark Informatics and Mathematical Modelling Building 321, DK-2800
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationAn Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationL3: Statistical Modeling with Hadoop
L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...
More informationOnline Algorithms: Learning & Optimization with No Regret.
Online Algorithms: Learning & Optimization with No Regret. Daniel Golovin 1 The Setup Optimization: Model the problem (objective, constraints) Pick best decision from a feasible set. Learning: Model the
More informationAIMS Big data. AIMS Big data. Outline. Outline. Lecture 5: Structured-output learning January 7, 2015 Andrea Vedaldi
AMS Big data AMS Big data Lecture 5: Structured-output learning January 7, 5 Andrea Vedaldi. Discriminative learning. Discriminative learning 3. Hashing and kernel maps 4. Learning representations 5. Structured-output
More informationRecovery of primal solutions from dual subgradient methods for mixed binary linear programming; a branch-and-bound approach
MASTER S THESIS Recovery of primal solutions from dual subgradient methods for mixed binary linear programming; a branch-and-bound approach PAULINE ALDENVIK MIRJAM SCHIERSCHER Department of Mathematical
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationInternational Doctoral School Algorithmic Decision Theory: MCDA and MOO
International Doctoral School Algorithmic Decision Theory: MCDA and MOO Lecture 2: Multiobjective Linear Programming Department of Engineering Science, The University of Auckland, New Zealand Laboratoire
More informationStochastic Optimization for Big Data Analytics: Algorithms and Libraries
Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yang SDM 2014, Philadelphia, Pennsylvania collaborators: Rong Jin, Shenghuo Zhu NEC Laboratories America, Michigan State
More informationApproximation Algorithms
Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms
More informationMulticlass Classification. 9.520 Class 06, 25 Feb 2008 Ryan Rifkin
Multiclass Classification 9.520 Class 06, 25 Feb 2008 Ryan Rifkin It is a tale Told by an idiot, full of sound and fury, Signifying nothing. Macbeth, Act V, Scene V What Is Multiclass Classification? Each
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
More informationNatural Language Processing. Today. Logistic Regression Models. Lecture 13 10/6/2015. Jim Martin. Multinomial Logistic Regression
Natural Language Processing Lecture 13 10/6/2015 Jim Martin Today Multinomial Logistic Regression Aka log-linear models or maximum entropy (maxent) Components of the model Learning the parameters 10/1/15
More informationOnline Learning of Optimal Strategies in Unknown Environments
1 Online Learning of Optimal Strategies in Unknown Environments Santiago Paternain and Alejandro Ribeiro Abstract Define an environment as a set of convex constraint functions that vary arbitrarily over
More informationNumerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen
(für Informatiker) M. Grepl J. Berger & J.T. Frings Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2010/11 Problem Statement Unconstrained Optimality Conditions Constrained
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More information(a) We have x = 3 + 2t, y = 2 t, z = 6 so solving for t we get the symmetric equations. x 3 2. = 2 y, z = 6. t 2 2t + 1 = 0,
Name: Solutions to Practice Final. Consider the line r(t) = 3 + t, t, 6. (a) Find symmetric equations for this line. (b) Find the point where the first line r(t) intersects the surface z = x + y. (a) We
More informationPerron vector Optimization applied to search engines
Perron vector Optimization applied to search engines Olivier Fercoq INRIA Saclay and CMAP Ecole Polytechnique May 18, 2011 Web page ranking The core of search engines Semantic rankings (keywords) Hyperlink
More informationA fast multi-class SVM learning method for huge databases
www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,
More informationProximal mapping via network optimization
L. Vandenberghe EE236C (Spring 23-4) Proximal mapping via network optimization minimum cut and maximum flow problems parametric minimum cut problem application to proximal mapping Introduction this lecture:
More informationLOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More informationLoss Functions for Preference Levels: Regression with Discrete Ordered Labels
Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,
More informationIntroduction to Logistic Regression
OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More information1 Introduction. Linear Programming. Questions. A general optimization problem is of the form: choose x to. max f(x) subject to x S. where.
Introduction Linear Programming Neil Laws TT 00 A general optimization problem is of the form: choose x to maximise f(x) subject to x S where x = (x,..., x n ) T, f : R n R is the objective function, S
More informationDistributed Structured Prediction for Big Data
Distributed Structured Prediction for Big Data A. G. Schwing ETH Zurich aschwing@inf.ethz.ch T. Hazan TTI Chicago M. Pollefeys ETH Zurich R. Urtasun TTI Chicago Abstract The biggest limitations of learning
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationLecture 8 February 4
ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt
More informationFactorization Theorems
Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization
More informationMachine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
More informationFactorization Machines
Factorization Machines Steffen Rendle Department of Reasoning for Intelligence The Institute of Scientific and Industrial Research Osaka University, Japan rendle@ar.sanken.osaka-u.ac.jp Abstract In this
More informationIntroduction to Machine Learning Using Python. Vikram Kamath
Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression
More informationQuestion 2 Naïve Bayes (16 points)
Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the
More informationDistributed Coordinate Descent Method for Learning with Big Data
Peter Richtárik Martin Takáč University of Edinburgh, King s Buildings, EH9 3JZ Edinburgh, United Kingdom PETER.RICHTARIK@ED.AC.UK MARTIN.TAKI@GMAIL.COM Abstract In this paper we develop and analyze Hydra:
More informationDuality of linear conic problems
Duality of linear conic problems Alexander Shapiro and Arkadi Nemirovski Abstract It is well known that the optimal values of a linear programming problem and its dual are equal to each other if at least
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationSemi-Supervised Support Vector Machines and Application to Spam Filtering
Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationGI01/M055 Supervised Learning Proximal Methods
GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators
More informationDirect Methods for Solving Linear Systems. Matrix Factorization
Direct Methods for Solving Linear Systems Matrix Factorization Numerical Analysis (9th Edition) R L Burden & J D Faires Beamer Presentation Slides prepared by John Carroll Dublin City University c 2011
More informationA Study on SMO-type Decomposition Methods for Support Vector Machines
1 A Study on SMO-type Decomposition Methods for Support Vector Machines Pai-Hsuen Chen, Rong-En Fan, and Chih-Jen Lin Department of Computer Science, National Taiwan University, Taipei 106, Taiwan cjlin@csie.ntu.edu.tw
More informationBilinear Prediction Using Low-Rank Models
Bilinear Prediction Using Low-Rank Models Inderjit S. Dhillon Dept of Computer Science UT Austin 26th International Conference on Algorithmic Learning Theory Banff, Canada Oct 6, 2015 Joint work with C-J.
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
More informationCS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
More informationParallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014
Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview
More informationSECOND DERIVATIVE TEST FOR CONSTRAINED EXTREMA
SECOND DERIVATIVE TEST FOR CONSTRAINED EXTREMA This handout presents the second derivative test for a local extrema of a Lagrange multiplier problem. The Section 1 presents a geometric motivation for the
More information