The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia
Who am I? Senior Business Analyst @ Expedia, working within the competitive intelligence unit. Responsible for: an algorithm that scores new hotels, an algorithm that predicts room nights sold for existing Expedia hotels, scraping competitor sites, and other things.
Structure Big data: promises and challenges Classic algorithms: logistic regression How to use logistic regression in a big data world Outlook
The Promise of Big Data Real-time data Data-driven decisions Granularity More accurate and robust models
Big Data Challenges How do we train algorithms on data sets that do not fit into memory? At what speed should we use the data, i.e. how fast should we update the algorithms? Data processing (not going to talk about this).
Big Data Challenges Taken from: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Classification - Logistic Regression One classic task in machine learning / statistics is to classify objects, events, or decisions correctly. Examples: customer churn, click behavior, purchase behavior. One of the most popular algorithms for these tasks is logistic regression.
What is logistic regression? Logistic regression attaches probabilities to individual outcomes, showing how likely they are to belong to one class or the other: Pr(y = 1 | x) = 1 / (1 + e^(-xβ)). The challenge is to choose the optimal beta(s). To do that we minimize a cost function.
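A minimal sketch of this formula in plain Python (the function and variable names, and the example numbers, are illustrative, not from the talk):

```python
import math

def sigmoid(z):
    # The logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta):
    # Pr(y = 1 | x): dot product of features x and coefficients beta,
    # squashed through the logistic function
    z = sum(xi * bi for xi, bi in zip(x, beta))
    return sigmoid(z)

# illustrative feature vector and coefficients
p = predict_proba([1.0, 2.0, 0.5], [0.1, -0.4, 0.8])
```

Whatever the inputs, the output is always a valid probability strictly between 0 and 1, which is one reason the algorithm is so widely used.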
Why Use Logistic Regression? It is a simple and well-understood algorithm. It outputs probabilities. There are tried and tested methods to estimate the parameters. It is flexible: it can handle a number of different inputs and feature transformations.
Usual Approaches Batch training (offline approach): get all the data and train the algorithm in one go. Disadvantages when data is big: requires all data to be loaded into memory; periodic retraining is necessary; very time consuming with big data!
Batch Training
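As a sketch of what batch training looks like, here is a small pure-Python batch gradient descent for logistic regression (toy data and all names are illustrative; note that every single update touches the full data set, which is exactly what becomes expensive at scale):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_train(X, y, alpha=0.5, epochs=1000):
    # Batch training: every update uses ALL rows,
    # so the whole data set must sit in memory at once.
    beta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):          # full pass over the data...
            err = sigmoid(sum(b * v for b, v in zip(beta, xi))) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v
        beta = [b - alpha * g / n for b, g in zip(beta, grad)]  # ...per update
    return beta

# toy data: label is 1 when x1 + x2 > 1; first column is an intercept
random.seed(0)
X = [[1.0, random.random(), random.random()] for _ in range(200)]
y = [1.0 if x1 + x2 > 1.0 else 0.0 for _, x1, x2 in X]
beta = batch_train(X, y)
```

With 200 rows this finishes instantly; with billions of rows, each of the 1000 epochs becomes a full scan of a data set that may not even fit in memory.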
Examples of Logistic Regression in Industry Settings Real Time Bidding (RTB) RTB algorithms are usually based on logistic regression. Whether or not to bid on a user is determined by the probability that the user will click on an ad. Each day billions of bids are processed, and each bid has to be processed within 80 milliseconds.
Examples of Logistic Regression in Industry Settings Fraud Detection Detecting fraudulent credit card transactions: the probability that a transaction uses a stolen credit card is typically estimated with logistic regression. Billions of transactions are analyzed each day.
How Slow is the Batch Version of Logistic Regression? One target variable and two feature vectors. All randomly generated.
Big Data Friendly Approaches Online Training Pass each data point sequentially through the algorithm Only requires one data point at a time in memory Allows for on-the-fly training of the algorithm MapReduce Split the estimation into several chunks, run the algorithm on each chunk, combine the outputs Runs in parallel, fast for training models
MapReduce Framework 1. The master distributes the data across the mappers. 2. The mappers carry out the calculations and return key-value pairs to the master. 3. The master sends the key-value pairs to the reducer, which aggregates the calculations. 4. The master returns the aggregated output.
MapReduce Framework (logistic regression) 1. The master distributes the data and initializes θ (a vector of coefficients). 2. Each mapper computes, over its subgroup, the gradient term Σ (y - h_θ(x)) x_i and the Hessian term Σ h_θ(x) (1 - h_θ(x)) x_i x_j. 3. The reducer sums up the calculations. 4. The master returns the updated θ.
MapReduce Framework Approaches the maximum of the function in a direct manner, where each step is a MapReduce job
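The steps above can be sketched on a single machine in plain Python (names are illustrative; for brevity this sketch sums only the gradient term and takes a simple gradient step rather than the full Newton update, and a real deployment would run the mappers on separate Hadoop nodes):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mapper(chunk, beta):
    # Each mapper sees only its own chunk of (x, y) rows and returns
    # its partial sum of the gradient term: sum over the chunk of
    # (y - h_theta(x)) * x
    grad = [0.0] * len(beta)
    for x, y in chunk:
        err = y - sigmoid(sum(b * v for b, v in zip(beta, x)))
        for j, v in enumerate(x):
            grad[j] += err * v
    return grad

def reducer(partials):
    # Sum the mappers' partial gradients element-wise
    return [sum(vals) for vals in zip(*partials)]

def mapreduce_step(chunks, beta, alpha=0.1):
    # One iteration of the algorithm = one MapReduce job
    partials = [mapper(c, beta) for c in chunks]  # would run in parallel
    grad = reducer(partials)
    # ascent step on the log-likelihood (err = y - h, hence the plus)
    return [b + alpha * g for b, g in zip(beta, grad)]
```

Because the partial sums add up exactly to the full-data gradient, one distributed step produces the same update a single machine would, only faster.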
Online Learning We want to learn a vector of weights. Initialize all weights. Begin loop: 1. Get a training example 2. Make a prediction for the target variable 3. Learn the true value of the target 4. Update the weights and go to 1
Online Learning Initialise all weights. Begin loop: Repeat { For i = 1 to m { θ_j := θ_j - α · ∂/∂θ_j cost(θ, (x_i, y_i)) } } where α is the step size (how fast we should move along the gradient), ∂/∂θ_j is the partial derivative of the cost function, and cost(θ, (x_i, y_i)) is the cost given theta and row i, i.e. how wrong are we?
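A minimal pure-Python sketch of this update rule (the data stream and all names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_update(beta, x, y, alpha=0.1):
    # Steps 2-4 of the loop: predict, observe the true label,
    # then nudge each weight against its gradient
    pred = sigmoid(sum(b * v for b, v in zip(beta, x)))
    return [b - alpha * (pred - y) * v for b, v in zip(beta, x)]

# stream the examples through one at a time; only the current
# (x, y) pair ever needs to be in memory
beta = [0.0, 0.0]
stream = [([1.0, 2.0], 1.0), ([1.0, -1.0], 0.0), ([1.0, 3.0], 1.0)]
for x, y in stream:
    beta = online_update(beta, x, y)
```

Each example updates the weights immediately and can then be discarded, which is what makes the approach suitable for continuous data streams.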
Online Learning Approaches the maximum of the function in a jumpy manner and never actually settles on the maximum.
Batch vs. MapReduce vs. Online Learning Timing comparison: one target variable and two feature vectors, all randomly generated.
Online Learning Vs. MapReduce Online Learning: when we have a continuous stream of data; when it is important to update the algorithm in real time; can hit a moving target; parameters jump around the optimal values; can be tricky to implement in a production system. MapReduce: when data is updated in batches; when updating the algorithm does not need to happen in real time; parameters settle on optimal values; if a company already uses Hadoop, it is easy to implement.