Big Data Analytics: Going For Large Scale
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science, University of Hildesheim, Germany
Outline
0. Review
1. Learning from Huge Scale Data
2. Does more data help?
0. Review
Prediction Problem: Formally

Let X be any set (the predictor space) and Y be any set (the target space).

Task. Given:
- some training data D^train ⊆ X × Y
- a loss function ℓ : Y × Y → R that measures how bad a prediction ŷ is if the true value is y

compute a prediction function ŷ : X → Y such that the empirical risk is minimal:

\text{risk}(\hat{y}, D^{\text{test}}) := \frac{1}{|D^{\text{test}}|} \sum_{(x,y) \in D^{\text{test}}} \ell(y, \hat{y}(x))
Solving Prediction Tasks

Estimating the parameters β is done by solving the following optimization task:

\arg\min_{\beta} \sum_{(x,y) \in D^{\text{train}}} \ell(y, \hat{y}(x; \beta)) + \lambda R(\beta)

When solving a prediction task we need to define the following components:
- a prediction function ŷ(x; β)
- a loss function ℓ(y, ŷ(x; β))
- a regularization function R(β)
- a learning algorithm to solve the optimization task above
Descent Methods

procedure DescentMethod(f)
    get an initial point β^(0) ∈ R^n
    t ← 0
    repeat
        get an update direction ∆β^(t)
        get a step size µ
        β^(t+1) ← β^(t) + µ ∆β^(t)
        t ← t + 1
    until convergence
    return β, f(β)
end procedure

The next point is generated using a step size µ and a direction ∆β^(t) such that f(β^(t) + µ ∆β^(t)) < f(β^(t)).
Descent Methods

Gradient Descent:

\Delta\beta^{GD} = -\nabla f(\beta)

Stochastic Gradient Descent: for

f(\beta) = \sum_{i=1}^{m} f_i(\beta), \quad f_i(\beta) = \ell(y_i, \hat{y}(x_i; \beta)) + \frac{\lambda}{m} R(\beta)

the update rule is

\Delta\beta^{SGD} = -\nabla f_i(\beta)

Newton's Method:

\Delta\beta^{Newton} = -\left(\nabla^2 f(\beta)\right)^{-1} \nabla f(\beta)
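To make these update rules concrete, here is a minimal NumPy sketch of stochastic gradient descent for L2-regularized logistic regression (the model used in the experiments below). The synthetic data, the fixed step size µ, and all function names are assumptions for illustration, not part of the slides.

```python
import numpy as np

def predict(x, beta):
    """Prediction function ŷ(x; β) = sigmoid(βᵀx) of logistic regression."""
    return 1.0 / (1.0 + np.exp(-(x @ beta)))

def grad_f_i(x_i, y_i, beta, lam, m):
    """Gradient of f_i(β) = ℓ(y_i, ŷ(x_i; β)) + (λ/m) R(β) for log loss and R(β) = ‖β‖²."""
    return (predict(x_i, beta) - y_i) * x_i + (2.0 * lam / m) * beta

def sgd(X, y, lam=0.1, mu=0.05, epochs=20, seed=0):
    """Stochastic gradient descent: β ← β − µ ∇f_i(β), one random instance at a time."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):
            beta -= mu * grad_f_i(X[i], y[i], beta, lam, m)
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
    beta = sgd(X, y)
    print("training accuracy:", np.mean((predict(X, beta) > 0.5) == y))
```

Full gradient descent would instead sum the per-instance gradients before each single update.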
1. Learning from Huge Scale Data
Huge Scale Data

- The computational complexity of learning algorithms depends (among other things) on the amount of data.
- The larger D^train, the longer it takes to compute the gradients.
- Parallelization is not trivial: each update depends on the result of the previous one.
- The whole dataset may not reside on a single machine.
Huge Scale Datasets

[Figure: a huge-scale dataset, shown as a predictor matrix X and target vector Y.]
Instance Distribution

[Figure: the dataset split by instances (rows of X and Y) across machines.]
Feature Distribution

[Figure: the dataset split by features (columns of X) across machines.]
Learning from Huge Scale Datasets

What to do?
- Use only a portion of the data (subsampling)
- Learn different models on different portions of the data and combine them (ensembling)
- Use algorithms that are parallelizable by nature (distributed optimization)
- Exploit distributed computational models
2. Does more data help?
Subsampling

Do we really need so much data? Let us do an experiment:
1. Uniformly sample a subset of the training set D^samp ⊆ D^train such that |D^samp| = S.
2. Train a model on D^samp and evaluate it on D^test.
3. Increase S and repeat the experiment.
Set-up

Model: logistic regression with L2 regularization
Dataset: A9A
1. Binary classification task: determine whether a person makes over 50K a year.
2. Total number of training instances: 32,561
3. Number of features: 123
4. Number of test instances: 16,281

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a
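A possible implementation of the subsampling experiment with scikit-learn is sketched below (an assumption; the slides do not say how the experiment was implemented). It expects the A9A files a9a and a9a.t to have been downloaded from the URL above; the subset sizes and the regularization strength C are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the A9A training and test splits (LIBSVM format, downloaded beforehand).
X_train, y_train = load_svmlight_file("a9a", n_features=123)
X_test, y_test = load_svmlight_file("a9a.t", n_features=123)

rng = np.random.default_rng(0)
for S in [500, 1000, 5000, 10000, 20000, 30000]:
    # 1. Uniformly sample a subset D_samp with |D_samp| = S.
    idx = rng.choice(X_train.shape[0], size=S, replace=False)
    # 2. Train L2-regularized logistic regression on D_samp and evaluate on D_test.
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    model.fit(X_train[idx], y_train[idx])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"S = {S:6d}  accuracy = {acc:.4f}")
```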
Experiment on the A9A dataset

[Plot: accuracy on A9A (y-axis: Accuracy, 70-100%) versus number of training instances (x-axis: 0-30,000).]
Set-up

Model: logistic regression with L2 regularization
Dataset: KDD Cup 2010 (student performance prediction)
1. Binary classification task: predict whether a student will answer an exercise correctly.
2. Total number of training instances: 8,407,752
3. Number of features: 20,216,830
4. Number of test instances: 510,302

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010(algebra)
Experiment on the KDD Challenge 2010 dataset

[Plot: accuracy on the KDD Challenge 2010 dataset (y-axis: Accuracy, 50-100%) versus number of training instances (x-axis: 0-100,000).]
Considerations

- Using a sample of the data can lead to good results with simple models.
- The more model parameters, the more data we need to learn them.
- More complex models will require more data.
- For more complex models we need more efficient algorithms.
Distributed Learning

Assume our data is partitioned across N machines. Let us denote the data on machine i as D_i^train, such that:

\bigcup_{i=1}^{N} D_i^{train} = D^{train}

One solution is to learn a set of local parameters β^(i) on each machine and make sure they all agree with a global variable α:

minimize  \sum_{i=1}^{N} \sum_{(x,y) \in D_i^{train}} \ell(y, \hat{y}(x; \beta^{(i)})) + \lambda R(\alpha)
subject to  \beta^{(i)} = \alpha, \quad i = 1, \ldots, N
Solving Problems in the Dual

Given an equality-constrained problem:

minimize  f(\beta)
subject to  A\beta = b

the Lagrangian is given by:

L(\beta, \nu) = f(\beta) + \nu^T (A\beta - b)

and the dual function by:

g(\nu) = \inf_{\beta} L(\beta, \nu)

We solve the problem by
1. \nu^* := \arg\max_{\nu} g(\nu)
2. \beta^* := \arg\min_{\beta} L(\beta, \nu^*)
Dual Ascent

We can apply the gradient method to maximize the dual:

\nu^{t+1} = \nu^t + \mu^t \nabla g(\nu^t)

where, given that \beta^* = \arg\min_{\beta} L(\beta, \nu^t):

\nabla g(\nu^t) = A\beta^* - b

This gives us the following method for solving the dual:

\beta^{t+1} \leftarrow \arg\min_{\beta} L(\beta, \nu^t)
\nu^{t+1} \leftarrow \nu^t + \mu^t (A\beta^{t+1} - b)
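A small numerical sketch of dual ascent (not from the slides): it assumes the toy objective f(β) = ½‖β‖² subject to Aβ = b, for which the inner minimization of the Lagrangian has the closed form β = −Aᵀν; the step size and problem sizes are arbitrary illustrative choices.

```python
import numpy as np

# Dual ascent for: minimize ½‖β‖²  subject to  Aβ = b  (assumed toy problem).
# The inner step argmin_β L(β, ν) = ½‖β‖² + νᵀ(Aβ − b) has the closed form β = −Aᵀν.

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
b = rng.normal(size=3)

nu = np.zeros(3)      # dual variable ν
mu = 0.05             # step size µ
for t in range(2000):
    beta = -A.T @ nu                   # β^{t+1} ← argmin_β L(β, ν^t)
    nu = nu + mu * (A @ beta - b)      # ν^{t+1} ← ν^t + µ (Aβ^{t+1} − b)

print("constraint residual ‖Aβ − b‖:", np.linalg.norm(A @ beta - b))
```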
Dual Decomposition

Now suppose f can be written as a separable sum:

f(\beta) = f_1(\beta_1) + f_2(\beta_2) + \ldots + f_N(\beta_N), \quad \beta = (\beta_1, \beta_2, \ldots, \beta_N)

Partitioning A = [A_1 \cdots A_N] so that \sum_{i=1}^{N} A_i \beta_i = A\beta, we can write the Lagrangian as:

L(\beta, \nu) = f(\beta) + \nu^T (A\beta - b)
             = f_1(\beta_1) + \ldots + f_N(\beta_N) + \nu^T (A_1\beta_1 + \ldots + A_N\beta_N - b)
             = f_1(\beta_1) + \nu^T A_1\beta_1 + \ldots + f_N(\beta_N) + \nu^T A_N\beta_N - \nu^T b
             = L_1(\beta_1, \nu) + \ldots + L_N(\beta_N, \nu) - \nu^T b

where L_i(\beta_i, \nu) = f_i(\beta_i) + \nu^T A_i \beta_i.
Dual Decomposition

With the Lagrangian

L(\beta, \nu) = \sum_{i=1}^{N} L_i(\beta_i, \nu) - \nu^T b, \quad L_i(\beta_i, \nu) = f_i(\beta_i) + \nu^T A_i \beta_i

the minimization step

\beta^{t+1} := \arg\min_{\beta} L(\beta, \nu^t)

can be split into N minimization steps that are carried out in parallel:

\beta_i^{t+1} := \arg\min_{\beta_i} L_i(\beta_i, \nu^t)
Dual Decomposition

The dual decomposition method:

\beta_i^{t+1} \leftarrow \arg\min_{\beta_i} L_i(\beta_i, \nu^t)
\nu^{t+1} \leftarrow \nu^t + \mu^t \left( \sum_{i=1}^{N} A_i \beta_i^{t+1} - b \right)

At each step:
- ν^t has to be broadcast to all machines
- the β_i^{t+1} are updated in parallel
- the A_i β_i^{t+1} are gathered to compute the sum \sum_{i=1}^{N} A_i \beta_i^{t+1}

Works if the assumptions hold, but is often slow!
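On the same assumed toy problem, now with a block-separable objective, a sketch of dual decomposition is given below. The blocks f_i(β_i) = ½‖β_i‖² and all sizes are illustrative assumptions; the loop over blocks is written sequentially, but each block update needs only ν and its own A_i, which is exactly what would run on separate machines.

```python
import numpy as np

# Dual decomposition for: minimize Σ_i f_i(β_i)  subject to  Σ_i A_i β_i = b,
# with assumed blocks f_i(β_i) = ½‖β_i‖², so each local step is β_i = −A_iᵀν.

rng = np.random.default_rng(1)
N = 4
A_blocks = [rng.normal(size=(3, 2)) for _ in range(N)]   # A = [A_1 ... A_N]
b = rng.normal(size=3)

nu = np.zeros(3)
mu = 0.02
for t in range(5000):
    # Local, parallelizable updates: β_i^{t+1} ← argmin_{β_i} L_i(β_i, ν^t)
    betas = [-A_i.T @ nu for A_i in A_blocks]
    # Gather the A_i β_i^{t+1} and update the broadcast dual variable ν.
    residual = sum(A_i @ b_i for A_i, b_i in zip(A_blocks, betas)) - b
    nu = nu + mu * residual

print("constraint residual:", np.linalg.norm(residual))
```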
Method of Multipliers

The method of multipliers uses the augmented Lagrangian, with s > 0:

L_s(\beta, \nu) = f(\beta) + \nu^T (A\beta - b) + \frac{s}{2} \|A\beta - b\|_2^2

and solves the dual problem through the following steps:

\beta^{t+1} \leftarrow \arg\min_{\beta} L_s(\beta, \nu^t)
\nu^{t+1} \leftarrow \nu^t + s (A\beta^{t+1} - b)

- Converges under more relaxed assumptions
- The augmented Lagrangian is not separable because of the quadratic penalty term (no parallelization)
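A sketch of the method of multipliers on the same assumed toy problem; because f is quadratic, the β-step reduces to solving a linear system.

```python
import numpy as np

# Method of multipliers for: minimize ½‖β‖²  subject to  Aβ = b  (assumed toy problem).
# The β-step minimizes L_s(β, ν) = ½‖β‖² + νᵀ(Aβ − b) + (s/2)‖Aβ − b‖²,
# which reduces to the linear system (I + s AᵀA) β = Aᵀ(s b − ν).

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
b = rng.normal(size=3)

s = 1.0                    # penalty parameter, also used as the dual step size
nu = np.zeros(3)
I = np.eye(5)
for t in range(100):
    beta = np.linalg.solve(I + s * A.T @ A, A.T @ (s * b - nu))   # β-step
    nu = nu + s * (A @ beta - b)                                   # dual update

print("constraint residual:", np.linalg.norm(A @ beta - b))
```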
Alternating Direction Method of Multipliers (ADMM)

ADMM assumes the problem can be written in the form:

minimize  f(\beta) + h(\alpha)
subject to  A\beta + B\alpha = c

which has the following augmented Lagrangian:

L_s(\beta, \alpha, \nu) = f(\beta) + h(\alpha) + \nu^T (A\beta + B\alpha - c) + \frac{s}{2} \|A\beta + B\alpha - c\|_2^2

and solves the dual problem through the following steps:

\beta^{t+1} \leftarrow \arg\min_{\beta} L_s(\beta, \alpha^t, \nu^t)
\alpha^{t+1} \leftarrow \arg\min_{\alpha} L_s(\beta^{t+1}, \alpha, \nu^t)
\nu^{t+1} \leftarrow \nu^t + s (A\beta^{t+1} + B\alpha^{t+1} - c)
Alternating Direction Method of Multipliers (ADMM)

\beta^{t+1} \leftarrow \arg\min_{\beta} L_s(\beta, \alpha^t, \nu^t)
\alpha^{t+1} \leftarrow \arg\min_{\alpha} L_s(\beta^{t+1}, \alpha, \nu^t)
\nu^{t+1} \leftarrow \nu^t + s (A\beta^{t+1} + B\alpha^{t+1} - c)

- At first sight ADMM seems very similar to the method of multipliers.
- It reduces to the method of multipliers if β and α are optimized jointly.
- If f or h is separable, we can now perform the updates of β (or α) in parallel.
ADMM: scaled form

We can rewrite the ADMM algorithm in a more convenient form. From the augmented Lagrangian

L_s(\beta, \alpha, \nu) = f(\beta) + h(\alpha) + \nu^T (A\beta + B\alpha - c) + \frac{s}{2} \|A\beta + B\alpha - c\|_2^2

we define the residual r = A\beta + B\alpha - c, so that:

\nu^T r + \frac{s}{2} \|r\|_2^2 = \frac{s}{2} \left\|r + \frac{1}{s}\nu\right\|_2^2 - \frac{1}{2s} \|\nu\|_2^2 = \frac{s}{2} \|r + u\|_2^2 - \frac{s}{2} \|u\|_2^2

where u = \frac{1}{s}\nu.
ADMM: scaled form

Substituting r = A\beta + B\alpha - c and u = \frac{1}{s}\nu into the augmented Lagrangian gives the scaled-form updates:

\beta^{t+1} \leftarrow \arg\min_{\beta} f(\beta) + \frac{s}{2} \|A\beta + B\alpha^t - c + u^t\|_2^2
\alpha^{t+1} \leftarrow \arg\min_{\alpha} h(\alpha) + \frac{s}{2} \|A\beta^{t+1} + B\alpha - c + u^t\|_2^2
u^{t+1} \leftarrow u^t + A\beta^{t+1} + B\alpha^{t+1} - c
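A sketch of the scaled-form updates for a concrete splitting, minimize ‖Xβ − y‖² + λ‖α‖² subject to β − α = 0 (i.e. A = I, B = −I, c = 0). The data, s, and λ are assumed values for illustration.

```python
import numpy as np

# Scaled-form ADMM for the assumed splitting:
#   minimize ‖Xβ − y‖² + λ‖α‖²   subject to   β − α = 0,
# so r = β − α and u = ν / s.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

lam, s = 1.0, 1.0
n = X.shape[1]
alpha, u = np.zeros(n), np.zeros(n)
XtX, Xty = X.T @ X, X.T @ y

for t in range(200):
    # β-step: argmin_β ‖Xβ − y‖² + (s/2)‖β − α + u‖²  →  (2XᵀX + sI)β = 2Xᵀy + s(α − u)
    beta = np.linalg.solve(2 * XtX + s * np.eye(n), 2 * Xty + s * (alpha - u))
    # α-step: argmin_α λ‖α‖² + (s/2)‖β − α + u‖²  →  α = s(β + u) / (2λ + s)
    alpha = s * (beta + u) / (2 * lam + s)
    # Scaled dual update: u ← u + β − α
    u = u + beta - alpha

print("primal residual ‖β − α‖:", np.linalg.norm(beta - alpha))
```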
Applying ADMM to predictive problems

The data is represented as:

X = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \ldots & x_{1,n} \\ 1 & x_{2,1} & x_{2,2} & \ldots & x_{2,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & x_{m,2} & \ldots & x_{m,n} \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}

We want to learn a model that predicts y from X through parameters β:

\hat{y}_i = \beta^T x_i = \beta_0 \cdot 1 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \ldots + \beta_n x_{i,n}
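A tiny sketch (assumed example values) of this representation: a design matrix with a leading column of ones for the bias β₀, and the linear prediction ŷ = Xβ.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
features = rng.normal(size=(m, n))
X = np.hstack([np.ones((m, 1)), features])   # prepend the constant-1 column for β₀
beta = rng.normal(size=n + 1)                # β = (β₀, β₁, ..., βₙ)

y_hat = X @ beta                             # ŷᵢ = βᵀxᵢ for every row xᵢ of X
print(y_hat)
```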
Applying ADMM to predictive problems

Our loss function

L(D^{train}, \beta) = \sum_{(x,y) \in D^{train}} \ell(y, \hat{y}(x; \beta))

can easily be decomposed into losses on each subset of the data:

minimize  \sum_{i=1}^{N} L(D_i^{train}, \beta) + \lambda R(\beta)

which can also be written as:

minimize  \sum_{i=1}^{N} \sum_{(x,y) \in D_i^{train}} \ell(y, \hat{y}(x; \beta)) + \lambda R(\beta)
Example: Ridge Regression

General form:

minimize  \sum_{i=1}^{N} \sum_{(x,y) \in D_i^{train}} \ell(y, \hat{y}(x; \beta)) + \lambda R(\beta)

Example: Ridge (linear) regression:

minimize  \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2 = \sum_{i=1}^{N} \sum_{(x,y) \in D_i^{train}} (\hat{y}(x; \beta) - y)^2 + \lambda \sum_{j=1}^{n} \beta_j^2
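For later comparison, a single-machine baseline (an assumption, not shown on the slides): this ridge problem has the standard closed-form solution β = (XᵀX + λI)⁻¹Xᵀy.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve minimize ‖Xβ − y‖² + λ‖β‖² via its normal equations."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))
    y = X @ np.array([1.0, -1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=50)
    print(ridge_closed_form(X, y, lam=1.0))
```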
Applying ADMM to predictive problems

Now we can rewrite our problem

minimize  L(D^{train}, \beta) + R(\beta)

as

minimize  L(D^{train}, \beta) + R(\alpha)
subject to  \beta - \alpha = 0

and solve it through ADMM:

\beta^{t+1} \leftarrow \arg\min_{\beta} L(D^{train}, \beta) + {\nu^t}^T (\beta - \alpha^t) + \frac{s}{2} \|\beta - \alpha^t\|_2^2
\alpha^{t+1} \leftarrow \arg\min_{\alpha} R(\alpha) + {\nu^t}^T (\beta^{t+1} - \alpha) + \frac{s}{2} \|\beta^{t+1} - \alpha\|_2^2
\nu^{t+1} \leftarrow \nu^t + s (\beta^{t+1} - \alpha^{t+1})
Applying ADMM to predictive problems

Given that the loss function is separable:

minimize  \sum_{i=1}^{N} \sum_{(x,y) \in D_i^{train}} \ell(y, \hat{y}(x; \beta^{(i)})) + R(\alpha)
subject to  \beta^{(i)} - \alpha = 0, \quad i = 1, \ldots, N

we can rewrite the algorithm as:

\beta^{(i)\,t+1} \leftarrow \arg\min_{\beta^{(i)}} \sum_{(x,y) \in D_i^{train}} \ell(y, \hat{y}(x; \beta^{(i)})) + {\nu^{(i)\,t}}^T (\beta^{(i)} - \alpha^t) + \frac{s}{2} \|\beta^{(i)} - \alpha^t\|_2^2
\alpha^{t+1} \leftarrow \arg\min_{\alpha} R(\alpha) + \sum_{i=1}^{N} \left( {\nu^{(i)\,t}}^T (\beta^{(i)\,t+1} - \alpha) + \frac{s}{2} \|\beta^{(i)\,t+1} - \alpha\|_2^2 \right)
\nu^{(i)\,t+1} \leftarrow \nu^{(i)\,t} + s (\beta^{(i)\,t+1} - \alpha^{t+1})

and solve for the different β^(i) in parallel.
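Putting the pieces together, here is a sketch of this consensus ADMM for distributed ridge regression on data partitioned by instances. All names and sizes are assumptions; at convergence the global variable α should approach the single-machine ridge solution used as a baseline above.

```python
import numpy as np

# Consensus ADMM (assumed example) for:
#   minimize  Σ_i ‖X_i β⁽ⁱ⁾ − y_i‖² + λ‖α‖²   subject to   β⁽ⁱ⁾ = α.
# Each β⁽ⁱ⁾-update touches only the local partition (X_i, y_i), so the loop
# over partitions would run on N different machines in a real system.

rng = np.random.default_rng(0)
m, n, N, lam, s = 400, 5, 4, 1.0, 1.0
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

X_parts = np.array_split(X, N)        # partition the data by instances
y_parts = np.array_split(y, N)

alpha = np.zeros(n)
betas = [np.zeros(n) for _ in range(N)]
nus = [np.zeros(n) for _ in range(N)]

for t in range(200):
    # Local, parallelizable β⁽ⁱ⁾-steps: (2X_iᵀX_i + sI) β⁽ⁱ⁾ = 2X_iᵀy_i − ν⁽ⁱ⁾ + sα
    for i in range(N):
        Xi, yi = X_parts[i], y_parts[i]
        betas[i] = np.linalg.solve(2 * Xi.T @ Xi + s * np.eye(n),
                                   2 * Xi.T @ yi - nus[i] + s * alpha)
    # Global α-step: gathers all local β⁽ⁱ⁾ and ν⁽ⁱ⁾.
    alpha = sum(nus[i] + s * betas[i] for i in range(N)) / (2 * lam + N * s)
    # Local dual updates.
    for i in range(N):
        nus[i] = nus[i] + s * (betas[i] - alpha)

# Compare to the single-machine ridge solution on the full data.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
print("‖α − β_ridge‖ =", np.linalg.norm(alpha - beta_ridge))
```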