The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia

1 The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia

2 Who am I? Senior Business Analyst @ Expedia. Working within the competitive intelligence unit. Responsible for: an algorithm that scores new hotels; an algorithm that predicts room nights sold on existing Expedia hotels; scraping competitor sites; other stuff.

3 Structure Big data: promises and challenges. Classic algorithms: logistic regression. How to use logistic regression in a big data world. Outlook.

4 The Promise of Big Data Real-time data. Data-driven decisions. Granularity. More accurate and robust models.

5 Big Data Challenges How do we train algorithms on data sets that do not fit into memory? Speed at which to use data: how fast should we update algorithms? Data processing: not going to talk about this.

6 Big Data Challenges Taken from:

7 Classification - Logistic Regression One classic task in machine learning / statistics is to classify objects, events, or decisions correctly. Examples are: customer churn, click behavior, purchase behavior. One of the most popular algorithms for these tasks is logistic regression.

8 What is logistic regression? Logistic regression attaches probabilities to individual outcomes, showing how likely they are to belong to one class or the other: $\Pr(y = 1 \mid x) = \frac{1}{1 + e^{-x\beta}}$. The challenge is to choose the optimal beta(s). To do that we minimize a cost function.
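The formula and the cost function can be sketched in a few lines of plain Python. This is an illustration only: the slide does not name its cost function, so the average negative log-likelihood (log loss) used here is an assumption, and the function names `predict_proba` and `log_loss` as well as the toy data are made up for this example.

```python
import math

def predict_proba(x, beta):
    """Pr(y = 1 | x) = 1 / (1 + exp(-x.beta)) for one feature vector x."""
    z = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(X, y, beta):
    """Average negative log-likelihood (assumed cost function)."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(xi, beta), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

# Toy data (made up): the first column is a constant 1 for the intercept.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]
y = [0, 1, 0]

print(predict_proba([1.0, 0.0], [0.0, 0.0]))  # 0.5 when all betas are zero
print(log_loss(X, y, [0.0, 0.0]))             # log(2), about 0.693
```

With all betas at zero every outcome gets probability 0.5, which is why the cost starts at log(2); training moves the betas so that this number falls.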

9 Why Use Logistic Regression? It is a simple and well-understood algorithm. It outputs probabilities. There are tried and tested models to estimate the parameters. It is flexible: it can handle a number of different inputs and feature transformations.

10 Usual Approaches Batch training (offline approach): get all the data and train the algorithm in one go. Disadvantages when data is big: requires all data to be loaded into memory; periodic retraining is necessary; very time consuming with big data!
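A minimal sketch of what batch training looks like, assuming full-batch gradient descent on the log loss (the slides do not say which batch optimizer was used). The function name `batch_train`, the step size, the iteration count and the toy data are all illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_train(X, y, alpha=0.5, n_iter=2000):
    """Full-batch gradient descent: every update scans the whole data set."""
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):  # needs all rows available in memory
            err = sigmoid(sum(a * b for a, b in zip(xi, beta))) - yi
            for j, xij in enumerate(xi):
                grad[j] += err * xij
        beta = [b - alpha * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Toy data (made up): intercept column plus one feature, not perfectly separable.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]
beta = batch_train(X, y)
print(beta)
```

Each of the `n_iter` updates touches every row, so the cost of training grows with both the data size and the number of iterations, and new data means running the whole loop again.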

11 Batch Training

12 Examples of Logistic Regression in Industry Settings Real-Time Bidding (RTB). RTB algorithms are usually based on logistic regression. Whether or not to bid on a user is determined by the probability that the user will click on an ad. Each day billions of bids are processed. Each bid has to be processed within 80 milliseconds.

13 Examples of Logistic Regression in Industry Settings Fraud Detection: detecting fraudulent credit card transactions. The probability that a transaction is using a stolen credit card is typically estimated with logistic regression. Billions of transactions are analyzed each day.

14 How Slow is the Batch Version of Logistic Regression? One target variable and two feature vectors. All randomly generated.

15 Big Data Friendly Approaches Online training: pass each data point sequentially through the algorithm; only requires one data point at a time in memory; allows for on-the-fly training of the algorithm. MapReduce: split the estimation into several chunks, run the algorithm on each chunk, combine the outputs; runs in parallel, fast for training models.

16 MapReduce Framework (Diagram: Data, Master, Mappers, Reducer, Output.) The master distributes the data across the mappers. The mappers carry out the calculations and return key-value pairs to the master. The master sends the key-value pairs to the reducer, which aggregates the calculations. The master returns the aggregated output.

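17 MapReduce Framework (Diagram: Data, Master, Mappers, Reducer, Output.) The master distributes the data and initializes θ (a vector of coefficients). Each mapper computes $\sum_{\text{subgroup}} \left(y - h_\theta(x)\right) x_i$ and $\sum_{\text{subgroup}} h_\theta(x)\left(h_\theta(x) - 1\right) x_i x_j$, where $i$ and $j$ index the features. The reducer sums up the calculations. The master returns the updated θ.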
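The flow on this slide can be imitated in a single Python process. The sketch below assumes that the two mapper sums are the gradient and Hessian of the log-likelihood and that the master applies a Newton-Raphson update; that is an interpretation of the slide, not something it states. The mappers run one after another here rather than on a Hadoop cluster, and the names `mapper`, `reducer`, `solve` and `newton_step` are hypothetical.

```python
import math

def h(theta, x):
    """Logistic hypothesis h_theta(x)."""
    return 1.0 / (1.0 + math.exp(-sum(t * v for t, v in zip(theta, x))))

def mapper(chunk, theta):
    """Partial gradient and Hessian of the log-likelihood for one chunk."""
    n = len(theta)
    grad = [0.0] * n
    hess = [[0.0] * n for _ in range(n)]
    for x, y in chunk:
        p = h(theta, x)
        for i in range(n):
            grad[i] += (y - p) * x[i]
            for j in range(n):
                hess[i][j] += p * (p - 1.0) * x[i] * x[j]
    return grad, hess

def reducer(parts):
    """Sum the per-chunk gradients and Hessians element-wise."""
    n = len(parts[0][0])
    grad = [sum(g[i] for g, _ in parts) for i in range(n)]
    hess = [[sum(H[i][j] for _, H in parts) for j in range(n)] for i in range(n)]
    return grad, hess

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def newton_step(chunks, theta):
    """One 'MapReduce job': map over the chunks, reduce, update theta."""
    grad, hess = reducer([mapper(c, theta) for c in chunks])
    step = solve(hess, grad)
    return [t - s for t, s in zip(theta, step)]

# Toy data (made up), split into two chunks as if held by two mappers.
rows = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, -0.5], 1),
        ([1.0, 0.5], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
chunks = [rows[:3], rows[3:]]
theta = [0.0, 0.0]
for _ in range(10):
    theta = newton_step(chunks, theta)
print(theta)
```

Because the gradient and Hessian are plain sums over rows, splitting the rows across mappers and adding the pieces gives the same update as a single pass over all the data, which is what makes this estimation fit the MapReduce pattern.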

18 MapReduce Framework Approaches the maximum of the function in a direct manner, where each step is a MapReduce job

19 Online Learning We want to learn a vector of weights. Initialize all weights. Begin loop: 1. Get a training example. 2. Make a prediction for the target variable. 3. Learn the true value of the target. 4. Update the weights and go to 1.

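20 Online Learning Initialise all weights. Begin loop: Repeat { For i = 1 to m { $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} \text{cost}(\theta, (x_i, y_i))$ } }. Here $\alpha$ is the step size (how large a step we take along the gradient), $\frac{\partial}{\partial \theta_j}$ is the partial derivative of the cost function, and $\text{cost}(\theta, (x_i, y_i))$ is the cost function given theta and row i, i.e. how wrong are we?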
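The update rule can be written as a short online loop. This sketch assumes the per-row cost is the log loss, whose partial derivative with respect to θ_j is (h_θ(x) − y)·x_j; the slide does not spell out the cost. The generator `stream`, the function `sgd_update`, the step size and the toy rows are hypothetical stand-ins for a real data feed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(theta, x, y, alpha):
    """One online step: theta_j := theta_j - alpha * d cost / d theta_j."""
    err = sigmoid(sum(t * v for t, v in zip(theta, x))) - y  # prediction minus truth
    return [t - alpha * err * v for t, v in zip(theta, x)]

def stream():
    """Hypothetical data source yielding one (x, y) example at a time."""
    rows = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, -0.5], 1),
            ([1.0, 0.5], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
    for _ in range(500):
        yield from rows

theta = [0.0, 0.0]
for x, y in stream():
    # Only the current example and theta are held in memory.
    theta = sgd_update(theta, x, y, alpha=0.1)
print(theta)
```

With a fixed step size the weights keep moving after every example, so they hover around the best values instead of converging exactly.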

21 Online Learning Approaches the maximum of the function in a jumpy manner and never actually settles on the maximum.

22 Batch vs. MapReduce vs. Online Learning One target variable and two feature vectors. All randomly generated. (Figure panels: Batch, MapReduce, Online Learning.)

23 Online Learning vs. MapReduce Online Learning: when we have a continuous stream of data; when it is important to update the algorithm in real time; can hit a moving target; parameters are jumpy around the optimal values; can be tricky to implement into a system. MapReduce: when data is updated in batches; when updating the algorithm does not need to happen in real time; parameters settle on optimal values; if a company already uses Hadoop, it is easy to implement.
