Big Data Analytics. Lucas Rego Drumond. Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science, University of Hildesheim, Germany. Going For Large Scale. Going For Large Scale 1 / 31

Outline 0. Review 1. Learning from Huge Scale Data 2. Does more data help? Going For Large Scale 1 / 31

0. Review Outline 0. Review 1. Learning from Huge Scale Data 2. Does more data help? Going For Large Scale 1 / 31

0. Review Prediction Problem: Formally. Let $\mathcal{X}$ be any set (the predictor space) and $\mathcal{Y}$ be any set (the target space). Task: given some training data $D^{\text{train}} \subseteq \mathcal{X} \times \mathcal{Y}$ and a loss function $l: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ that measures how bad a prediction $\hat{y}$ is if the true value is $y$, compute a prediction function $\hat{y}: \mathcal{X} \to \mathcal{Y}$ such that the empirical risk is minimal: $\operatorname{risk}(\hat{y}, D^{\text{test}}) := \frac{1}{|D^{\text{test}}|} \sum_{(x,y) \in D^{\text{test}}} l(y, \hat{y}(x))$ Going For Large Scale 1 / 31

0. Review Solving Prediction Tasks. Estimating the parameters $\beta$ is done by solving the following optimization task: $\arg\min_{\beta} \sum_{(x,y) \in D^{\text{train}}} l(y, \hat{y}(x; \beta)) + \lambda R(\beta)$. When solving a prediction task we need to define the following components: a prediction function $\hat{y}(x; \beta)$, a loss function $l(y, \hat{y}(x; \beta))$, a regularization function $R(\beta)$, and a learning algorithm to solve the optimization task above. Going For Large Scale 2 / 31
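To make these components concrete, here is a minimal sketch for L2-regularized logistic regression, one possible choice that matches the model used in the experiments later; the function names and the toy data are illustrative, not part of the slides.

```python
import numpy as np

# Prediction function: logistic regression, y_hat(x; beta) = sigmoid(beta^T x)
def y_hat(X, beta):
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

# Loss function: logistic (log) loss l(y, p), with labels y in {0, 1}
def loss(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Regularization function: R(beta) = ||beta||_2^2
def reg(beta):
    return float(beta @ beta)

# Regularized empirical risk over a training set (X, y)
def objective(beta, X, y, lam):
    return loss(y, y_hat(X, beta)).sum() + lam * reg(beta)

# Tiny usage example with random data
X = np.random.default_rng(0).standard_normal((5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(objective(np.zeros(3), X, y, lam=0.1))   # 5 * log(2) at beta = 0
```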

0. Review Descent Methods
1: procedure DescentMethod(input: $f$)
2:   Get initial point $\beta^{(0)} \in \mathbb{R}^n$
3:   $t \leftarrow 0$
4:   repeat
5:     Get update direction $\Delta\beta^{(t)}$
6:     Get step size $\mu$
7:     $\beta^{(t+1)} \leftarrow \beta^{(t)} + \mu \Delta\beta^{(t)}$
8:     $t \leftarrow t + 1$
9:   until convergence
10:   return $\beta$, $f(\beta)$
11: end procedure
The next point is generated using a step size $\mu$ and a direction $\Delta\beta^{(t)}$ such that $f(\beta^{(t)} + \mu \Delta\beta^{(t)}) < f(\beta^{(t)})$. Going For Large Scale 3 / 31

0. Review Descent Methods. Gradient Descent: $\Delta\beta^{\text{GD}} = -\nabla f(\beta)$. Stochastic Gradient Descent: for $f(\beta) = \sum_{i=1}^{m} f_i(\beta)$ with $f_i(\beta) = l(y_i, \hat{y}(x_i; \beta)) + \frac{\lambda}{m} R(\beta)$, the update rule is $\Delta\beta^{\text{SGD}} = -\nabla f_i(\beta)$. Newton's Method: $\Delta\beta^{\text{Newton}} = -\nabla^2 f(\beta)^{-1} \nabla f(\beta)$. Going For Large Scale 4 / 31
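A minimal sketch of the gradient descent and stochastic gradient descent updates for L2-regularized logistic regression (labels in {0, 1}); step sizes, epoch counts, and the exact regularizer scaling are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def grad_f(beta, X, y, lam):
    """Gradient of sum_i l(y_i, y_hat(x_i; beta)) + (lam/2) ||beta||^2 (logistic loss)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y) + lam * beta

def gradient_descent(X, y, lam=0.1, mu=0.01, steps=500):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        beta = beta - mu * grad_f(beta, X, y, lam)        # Delta beta^GD = -grad f(beta)
    return beta

def sgd(X, y, lam=0.1, mu=0.01, epochs=20, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    beta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):                      # one training instance per update
            p = 1.0 / (1.0 + np.exp(-(X[i] @ beta)))
            g_i = (p - y[i]) * X[i] + (lam / m) * beta    # grad f_i, regularizer split as lam/m
            beta = beta - mu * g_i                        # Delta beta^SGD = -grad f_i(beta)
    return beta
```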

1. Learning from Huge Scale Data Outline 0. Review 1. Learning from Huge Scale Data 2. Does more data help? Going For Large Scale 5 / 31

1. Learning from Huge Scale Data Huge Scale Data. The computational complexity of learning algorithms depends (among other things) on the amount of data: the larger $|D^{\text{train}}|$, the longer it takes to compute the gradients. Parallelization is not trivial, since each update depends on the results of the previous one, and the whole dataset may not reside on the same machine. Going For Large Scale 5 / 31

1. Learning from Huge Scale Data Huge Scale datasets. [Figure: a huge data matrix of predictors X and targets Y] Going For Large Scale 6 / 31

1. Learning from Huge Scale Data Instance distribution. [Figure: the data (X, Y) partitioned by instances across machines] Going For Large Scale 7 / 31

1. Learning from Huge Scale Data Feature distribution. [Figure: the data (X, Y) partitioned by features across machines] Going For Large Scale 8 / 31

1. Learning from Huge Scale Data Learning from Huge Scale datasets. What to do? Use only a portion of the data (subsampling); learn different models on different portions of the data and combine them (ensembling); use algorithms that are by nature parallelizable (distributed optimization); exploit distributed computational models. Going For Large Scale 9 / 31

2. Does more data help? Outline 0. Review 1. Learning from Huge Scale Data 2. Does more data help? Going For Large Scale 10 / 31

2. Does more data help? Subsampling. Do we really need so much data? Let us do an experiment: 1. Uniformly sample a subset of the training set $D^{\text{samp}} \subseteq D^{\text{train}}$ such that $|D^{\text{samp}}| = S$. 2. Train a model on $D^{\text{samp}}$ and evaluate it on $D^{\text{test}}$. 3. Increase $S$ and repeat the experiment. Going For Large Scale 10 / 31

2. Does more data help? Set up. Model: logistic regression with L2 regularization. Dataset: A9A. 1. Binary classification dataset: determine whether a person makes over 50K a year. 2. Total number of training instances: 32,561. 3. Number of features: 123. 4. Number of test instances: 16,281. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a Going For Large Scale 11 / 31
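A minimal sketch of the subsampling experiment using scikit-learn; it assumes the a9a training and test files have been downloaded from the URL above under the standard LIBSVM names, and the sample sizes and regularization strength are illustrative.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, y_train = load_svmlight_file("a9a")          # 32,561 x 123 sparse matrix
X_test, y_test = load_svmlight_file("a9a.t", n_features=X_train.shape[1])

rng = np.random.default_rng(0)
for S in [100, 500, 1000, 5000, 10000, 30000]:
    idx = rng.choice(X_train.shape[0], size=S, replace=False)   # uniform subsample D_samp
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    model.fit(X_train[idx], y_train[idx])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"S = {S:6d}  test accuracy = {acc:.3f}")
```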

2. Does more data help? Experiment on the A9A dataset. [Plot: Accuracy on A9A; y-axis: accuracy (70 to 100), x-axis: number of training instances (0 to 30,000)] Going For Large Scale 12 / 31

2. Does more data help? Set up. Model: logistic regression with L2 regularization. Dataset: KDD Cup 2010, student performance prediction. 1. Binary classification dataset: predict whether a student will answer an exercise correctly. 2. Total number of training instances: 8,407,752. 3. Number of features: 20,216,830. 4. Number of test instances: 510,302. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010(algebra) Going For Large Scale 13 / 31

2. Does more data help? Experiment on the KDD Challenge 2010 dataset. [Plot: Accuracy on the KDD Challenge dataset; y-axis: accuracy (50 to 100), x-axis: number of training instances (0 to 100,000)] Going For Large Scale 14 / 31

2. Does more data help? Considerations Using a sample of the data can lead to good results with simple models The more model parameters, the more data we need to learn them More complex models will require more data For more complex models we need more efficient algorithms Going For Large Scale 15 / 31

Outline 0. Review 1. Learning from Huge Scale Data 2. Does more data help? Going For Large Scale 16 / 31

Distributed Learning. Assume our data is partitioned across $N$ machines. Let us denote the data on machine $i$ as $D^{\text{train}}_i$, such that $\bigcup_{i=1}^{N} D^{\text{train}}_i = D^{\text{train}}$. One solution is to learn one set of local parameters $\beta^{(i)}$ on each machine and make sure they all agree with a global variable $\alpha$: minimize $\sum_{i=1}^{N} \sum_{(x,y) \in D^{\text{train}}_i} l(y, \hat{y}(x; \beta^{(i)})) + \lambda R(\alpha)$ subject to $\beta^{(i)} = \alpha$ for $i = 1, \dots, N$. Going For Large Scale 16 / 31

Solving Problems in the Dual. Given an equality-constrained problem: minimize $f(\beta)$ subject to $A\beta = b$, the Lagrangian is given by $L(\beta, \nu) = f(\beta) + \nu^T(A\beta - b)$ and the dual function by $g(\nu) = \inf_\beta L(\beta, \nu)$. We solve it by 1. $\nu^* := \arg\max_\nu g(\nu)$, 2. $\beta^* := \arg\min_\beta L(\beta, \nu^*)$. Going For Large Scale 17 / 31

Dual Ascent. We can apply the gradient method for maximizing the dual: $\nu^{t+1} = \nu^t + \mu^t \nabla g(\nu^t)$ where, given that $\beta^* = \arg\min_\beta L(\beta, \nu^t)$: $\nabla g(\nu^t) = A\beta^* - b$. This gives us the following method for solving the dual: $\beta^{t+1} \leftarrow \arg\min_\beta L(\beta, \nu^t)$, $\nu^{t+1} \leftarrow \nu^t + \mu^t (A\beta^{t+1} - b)$. Going For Large Scale 18 / 31
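A minimal numerical sketch of dual ascent, assuming the simple objective $f(\beta) = \frac{1}{2}\|\beta\|_2^2$ so that the inner minimization has the closed form $\beta = -A^T \nu$; the problem sizes, step size, and iteration count are arbitrary illustrative choices.

```python
# Dual ascent for: minimize (1/2)||beta||^2  subject to  A beta = b.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 10))       # 3 equality constraints, 10 variables
b = rng.standard_normal(3)

nu = np.zeros(3)                       # dual variable
mu = 0.01                              # step size
for t in range(2000):
    beta = -A.T @ nu                   # beta^{t+1} = argmin_beta L(beta, nu^t)
    nu = nu + mu * (A @ beta - b)      # nu^{t+1} = nu^t + mu * (A beta^{t+1} - b)

print("constraint residual:", np.linalg.norm(A @ beta - b))
```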

Dual Decomposition. Now suppose $f$ can be rewritten like this: $f(\beta) = f_1(\beta_1) + f_2(\beta_2) + \dots + f_N(\beta_N)$, with $\beta = (\beta_1, \beta_2, \dots, \beta_N)$. Partitioning $A = [A_1 \cdots A_N]$ so that $\sum_{i=1}^{N} A_i \beta_i = A\beta$, we can write the Lagrangian as: $L(\beta, \nu) = f(\beta) + \nu^T(A\beta - b) = f_1(\beta_1) + \dots + f_N(\beta_N) + \nu^T(A_1\beta_1 + \dots + A_N\beta_N - b) = \big(f_1(\beta_1) + \nu^T A_1 \beta_1\big) + \dots + \big(f_N(\beta_N) + \nu^T A_N \beta_N\big) - \nu^T b = L_1(\beta_1, \nu) + \dots + L_N(\beta_N, \nu) - \nu^T b$, where $L_i(\beta_i, \nu) = f_i(\beta_i) + \nu^T A_i \beta_i$. Going For Large Scale 19 / 31

Dual Decomposition. The problem $\beta^{t+1} := \arg\min_\beta L(\beta, \nu^t)$, with the Lagrangian $L(\beta, \nu) = \sum_{i=1}^{N} L_i(\beta_i, \nu) - \nu^T b$ where $L_i(\beta_i, \nu) = f_i(\beta_i) + \nu^T A_i \beta_i$, can be solved by $N$ minimization steps carried out in parallel: $\beta_i^{t+1} := \arg\min_{\beta_i} L_i(\beta_i, \nu^t)$. Going For Large Scale 20 / 31

Dual Decomposition. Dual decomposition method: $\beta_i^{t+1} \leftarrow \arg\min_{\beta_i} L_i(\beta_i, \nu^t)$, $\nu^{t+1} \leftarrow \nu^t + \mu^t \left( \sum_{i=1}^{N} A_i \beta_i^{t+1} - b \right)$. At each step: $\nu^t$ has to be broadcast, the $\beta_i^{t+1}$ are updated in parallel, and the $A_i \beta_i^{t+1}$ are gathered to compute the sum $\sum_{i=1}^{N} A_i \beta_i^{t+1}$. Works if the assumptions hold, but it is often slow! Going For Large Scale 21 / 31
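The same scheme with a block-separable objective, as a minimal sketch: with $f_i(\beta_i) = \frac{1}{2}\|\beta_i\|_2^2$ each block minimization has the closed form $\beta_i = -A_i^T \nu$, and the per-block updates are independent, so they could run on different machines (the loop below only simulates that). Block sizes, step size, and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, d = 4, 3, 5                              # blocks, constraints, variables per block
A = [rng.standard_normal((p, d)) for _ in range(N)]
b = rng.standard_normal(p)

nu = np.zeros(p)
mu = 0.01
for t in range(2000):
    betas = [-Ai.T @ nu for Ai in A]                          # parallel block updates
    Ab = sum(Ai @ beta_i for Ai, beta_i in zip(A, betas))     # gather A_i beta_i
    nu = nu + mu * (Ab - b)                                   # broadcast updated nu

print("constraint residual:", np.linalg.norm(Ab - b))
```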

Method of Multipliers. The method of multipliers uses the augmented Lagrangian with $s > 0$: $L_s(\beta, \nu) = f(\beta) + \nu^T(A\beta - b) + \frac{s}{2} \|A\beta - b\|_2^2$, and solves the dual problem through the following steps: $\beta^{t+1} \leftarrow \arg\min_\beta L_s(\beta, \nu^t)$, $\nu^{t+1} \leftarrow \nu^t + s (A\beta^{t+1} - b)$. It converges under more relaxed assumptions, but the augmented Lagrangian is not separable because of the additional quadratic term (no parallelization). Going For Large Scale 22 / 31

Alternating Direction Method of Multipliers (ADMM). ADMM assumes the problem can take the form: minimize $f(\beta) + h(\alpha)$ subject to $A\beta + B\alpha = c$, which has the following augmented Lagrangian: $L_s(\beta, \alpha, \nu) = f(\beta) + h(\alpha) + \nu^T(A\beta + B\alpha - c) + \frac{s}{2} \|A\beta + B\alpha - c\|_2^2$, and solves the dual problem through the following steps: $\beta^{t+1} \leftarrow \arg\min_\beta L_s(\beta, \alpha^t, \nu^t)$, $\alpha^{t+1} \leftarrow \arg\min_\alpha L_s(\beta^{t+1}, \alpha, \nu^t)$, $\nu^{t+1} \leftarrow \nu^t + s (A\beta^{t+1} + B\alpha^{t+1} - c)$. Going For Large Scale 23 / 31

Alternating Direction Method of Multipliers (ADMM). $\beta^{t+1} \leftarrow \arg\min_\beta L_s(\beta, \alpha^t, \nu^t)$, $\alpha^{t+1} \leftarrow \arg\min_\alpha L_s(\beta^{t+1}, \alpha, \nu^t)$, $\nu^{t+1} \leftarrow \nu^t + s (A\beta^{t+1} + B\alpha^{t+1} - c)$. At first ADMM seems very similar to the method of multipliers; it reduces to the method of multipliers if $\beta$ and $\alpha$ are optimized jointly. If $f$ or $h$ is separable, we can now perform the updates of $\beta$ (or $\alpha$) in parallel. Going For Large Scale 24 / 31

ADMM: scaled form. We can rewrite the ADMM algorithm in a more convenient form. From the augmented Lagrangian $L_s(\beta, \alpha, \nu) = f(\beta) + h(\alpha) + \nu^T(A\beta + B\alpha - c) + \frac{s}{2} \|A\beta + B\alpha - c\|_2^2$ we can define the residual $r = A\beta + B\alpha - c$, so that $\nu^T r + \frac{s}{2} \|r\|_2^2 = \frac{s}{2} \|r + \frac{1}{s}\nu\|_2^2 - \frac{1}{2s} \|\nu\|_2^2 = \frac{s}{2} \|r + u\|_2^2 - \frac{s}{2} \|u\|_2^2$, where $u = \frac{1}{s}\nu$. Going For Large Scale 25 / 31
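The completion-of-squares identity above is easy to verify numerically; a tiny check with arbitrary random vectors (the dimension and the value of s are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
r, nu, s = rng.standard_normal(7), rng.standard_normal(7), 2.5
u = nu / s                                                    # u = (1/s) nu

lhs = nu @ r + (s / 2) * np.sum(r ** 2)
rhs = (s / 2) * np.sum((r + u) ** 2) - (s / 2) * np.sum(u ** 2)
print(np.isclose(lhs, rhs))                                   # True
```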

ADMM: scaled form. From the augmented Lagrangian $L_s(\beta, \alpha, \nu) = f(\beta) + h(\alpha) + \nu^T(A\beta + B\alpha - c) + \frac{s}{2} \|A\beta + B\alpha - c\|_2^2$ and $r = A\beta + B\alpha - c$ with $\nu^T r + \frac{s}{2} \|r\|_2^2 = \frac{s}{2} \|r + u\|_2^2 - \frac{s}{2} \|u\|_2^2$, where $u = \frac{1}{s}\nu$, we have: $\beta^{t+1} \leftarrow \arg\min_\beta f(\beta) + \frac{s}{2} \|A\beta + B\alpha^t - c + u^t\|_2^2$, $\alpha^{t+1} \leftarrow \arg\min_\alpha h(\alpha) + \frac{s}{2} \|A\beta^{t+1} + B\alpha - c + u^t\|_2^2$, $u^{t+1} \leftarrow u^t + A\beta^{t+1} + B\alpha^{t+1} - c$. Going For Large Scale 26 / 31

Applying ADMM to predictive problems. The data is represented as: $X_{m,n} = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$. We want to learn a model to predict $y$ from $X$ through parameters $\beta$: $\hat{y}_i = \beta^T x_i = \beta_0 \cdot 1 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_n x_{i,n}$. Going For Large Scale 27 / 31

Applying ADMM to predictive problems. Our loss function $L(D^{\text{train}}, \beta) = \sum_{(x,y) \in D^{\text{train}}} l(y, \hat{y}(x; \beta))$ can easily be decomposed into losses on each subset of the data: minimize $\sum_{i=1}^{N} L(D^{\text{train}}_i, \beta) + \lambda R(\beta)$, which can also be written as: minimize $\sum_{i=1}^{N} \sum_{(x,y) \in D^{\text{train}}_i} l(y, \hat{y}(x; \beta)) + \lambda R(\beta)$. Going For Large Scale 28 / 31

Example: Ridge Regression. General form: minimize $\sum_{i=1}^{N} \sum_{(x,y) \in D^{\text{train}}_i} l(y, \hat{y}(x; \beta)) + \lambda R(\beta)$. Example: ridge (linear) regression: minimize $\|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2 = \sum_{i=1}^{N} \sum_{(x,y) \in D^{\text{train}}_i} (\hat{y}(x; \beta) - y)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$. Going For Large Scale 29 / 31
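Before distributing it, the single-machine ridge solution can be computed in closed form, $\beta = (X^T X + \lambda I)^{-1} X^T y$; the sketch below sets up synthetic data (sizes and $\lambda$ are illustrative) and serves as the reference that the distributed ADMM version at the end of the section should reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])   # leading column of 1s
beta_true = rng.standard_normal(n + 1)
y = X @ beta_true + 0.1 * rng.standard_normal(m)

lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n + 1), X.T @ y)
print(np.round(beta_ridge[:3], 3))
```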

Applying ADMM to predictive problems. Now we can rewrite our problem: minimize $L(D^{\text{train}}, \beta) + R(\beta)$ becomes minimize $L(D^{\text{train}}, \beta) + R(\alpha)$ subject to $\beta - \alpha = 0$, and solve it through ADMM: $\beta^{t+1} \leftarrow \arg\min_\beta L(D^{\text{train}}, \beta) + \nu^{t\,T}(\beta - \alpha^t) + \frac{s}{2} \|\beta - \alpha^t\|_2^2$, $\alpha^{t+1} \leftarrow \arg\min_\alpha R(\alpha) + \nu^{t\,T}(\beta^{t+1} - \alpha) + \frac{s}{2} \|\beta^{t+1} - \alpha\|_2^2$, $\nu^{t+1} \leftarrow \nu^t + s (\beta^{t+1} - \alpha^{t+1})$. Going For Large Scale 30 / 31

Applying ADMM to predictive problems. Given that the loss function is separable: minimize $\sum_{i=1}^{N} \sum_{(x,y) \in D^{\text{train}}_i} l(y, \hat{y}(x; \beta^{(i)})) + R(\alpha)$ subject to $\beta^{(i)} - \alpha = 0$, we can rewrite the algorithm as: $\beta^{(i),t+1} \leftarrow \arg\min_{\beta^{(i)}} \sum_{(x,y) \in D^{\text{train}}_i} l(y, \hat{y}(x; \beta^{(i)})) + \nu^{(i),t\,T}(\beta^{(i)} - \alpha^t) + \frac{s}{2} \|\beta^{(i)} - \alpha^t\|_2^2$, $\alpha^{t+1} \leftarrow \arg\min_\alpha R(\alpha) + \sum_{i=1}^{N} \left[ \nu^{(i),t\,T}(\beta^{(i),t+1} - \alpha) + \frac{s}{2} \|\beta^{(i),t+1} - \alpha\|_2^2 \right]$, $\nu^{(i),t+1} \leftarrow \nu^{(i),t} + s (\beta^{(i),t+1} - \alpha^{t+1})$, and solve for the different $\beta^{(i)}$ in parallel. Going For Large Scale 31 / 31
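As an end-to-end illustration, here is a minimal consensus-ADMM sketch (in scaled form) for the ridge regression example, with the data split by instances across N simulated workers; the local $\beta^{(i)}$ updates are independent and could run on separate machines. All sizes, the penalty $s$, and $\lambda$ are illustrative, and the closed-form local and global updates below are specific to the squared loss and ridge penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 200, 10, 4
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])
y = X @ rng.standard_normal(n + 1) + 0.1 * rng.standard_normal(m)
lam, s, d = 1.0, 1.0, n + 1

Xs = np.array_split(X, N)                   # instance-wise partition D_i^train
ys = np.array_split(y, N)
betas = [np.zeros(d) for _ in range(N)]     # local parameters beta^(i)
us = [np.zeros(d) for _ in range(N)]        # scaled duals u^(i) = nu^(i) / s
alpha = np.zeros(d)                         # global consensus variable

for t in range(200):
    # Local updates (parallelizable): minimize ||X_i b - y_i||^2 + (s/2)||b - alpha + u_i||^2
    for i in range(N):
        betas[i] = np.linalg.solve(2 * Xs[i].T @ Xs[i] + s * np.eye(d),
                                   2 * Xs[i].T @ ys[i] + s * (alpha - us[i]))
    # Global update: minimize lam ||alpha||^2 + sum_i (s/2)||beta_i - alpha + u_i||^2
    alpha = s * sum(b_i + u_i for b_i, u_i in zip(betas, us)) / (2 * lam + N * s)
    # Dual updates
    for i in range(N):
        us[i] = us[i] + betas[i] - alpha

beta_central = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print("distance to centralized ridge solution:", np.linalg.norm(alpha - beta_central))
```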