The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia



Transcription:


Who am I?
- Senior Business Analyst @ Expedia
- Working within the competitive intelligence unit
- Responsible for:
  - An algorithm that scores new hotels
  - An algorithm that predicts room nights sold on existing Expedia hotels
  - Scraping competitor sites
  - Other stuff

Structure
- Big data: promises and challenges
- Classic algorithms: logistic regression
- How to use logistic regression in a big data world
- Outlook

The Promise of Big Data
- Real-time data
- Data-driven decisions
- Granularity
- More accurate and robust models

Big Data Challenges
- How do we train algorithms on data sets that do not fit into memory?
- Speed at which to use data: how fast should we update algorithms?
- Data processing: not covered in this talk.

Big Data Challenges
[Figure: the data science Venn diagram. Taken from: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram]

Classification - Logistic Regression
- One classic task in machine learning / statistics is to classify objects, events, or decisions correctly.
- Examples are:
  - Customer churn
  - Click behavior
  - Purchase behavior
- One of the most popular algorithms for these tasks is logistic regression.

What is logistic regression?
- Logistic regression attaches probabilities to individual outcomes, showing how likely they are to belong to one class or the other:
  $\Pr(y = 1 \mid x) = \frac{1}{1 + e^{-x\beta}}$
- The challenge is to choose the optimal beta(s).
- To do that we minimize a cost function.
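The slide does not say which cost function is minimized. A minimal sketch, assuming the usual choice for logistic regression (the average negative log-likelihood, or log loss), could look like this. The function names and the toy data are illustrative, not from the talk.

```python
import math

def predict_proba(x, beta):
    """Pr(y = 1 | x) = 1 / (1 + exp(-x.beta)) for one feature vector x."""
    z = sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(X, y, beta):
    """Average negative log-likelihood: the cost assumed in this sketch."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, beta)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

# Toy data (made up): first column is a constant 1 acting as the intercept.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]

print(log_loss(X, y, [0.0, 0.0]))  # cost with all-zero betas
print(log_loss(X, y, [0.0, 1.0]))  # a better beta gives a lower cost
```

Choosing the optimal beta(s) then means searching for the beta vector with the lowest cost on the training data.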

Why Use Logistic Regression?
- It is a simple and well-understood algorithm
- It outputs probabilities
- There are tried and tested methods to estimate the parameters
- It is flexible: it can handle a number of different inputs and feature transformations

Usual Approaches
- Batch training (offline approach): get all the data and train the algorithm in one go
- Disadvantages when data is big:
  - Requires all data to be loaded into memory
  - Periodic retraining is necessary
  - Very time consuming with big data!
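To make the batch disadvantages concrete, here is a minimal sketch of batch training by plain gradient descent. The talk does not say which batch optimizer was used, so gradient descent, the step size, the iteration count and the synthetic data generator are all assumptions made for illustration. Note that every iteration sweeps over the full data set, which therefore has to sit in memory.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gradient_descent(X, y, alpha=0.5, n_iter=200):
    """Fit logistic regression; each iteration needs ALL rows of X."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(n_iter):
        grad = [0.0] * d
        for xi, yi in zip(X, y):  # full pass over the data set
            err = sigmoid(sum(b * v for b, v in zip(beta, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j]
        # Step against the average gradient of the log loss.
        beta = [b - alpha * g / n for b, g in zip(beta, grad)]
    return beta

def make_data(n, true_beta, seed=0):
    """Hypothetical generator: intercept plus two random features."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        xi = [1.0, rng.gauss(0, 1), rng.gauss(0, 1)]
        p = sigmoid(sum(b * v for b, v in zip(true_beta, xi)))
        X.append(xi)
        y.append(1 if rng.random() < p else 0)
    return X, y

X, y = make_data(500, [-0.5, 2.0, -1.0])
beta = batch_gradient_descent(X, y)
print(beta)  # estimates should land near the generating coefficients
```

The cost of one iteration grows linearly with the number of rows, and new data can only be used by rerunning the whole fit.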

Batch Training

Examples of Logistic Regression in Industry Settings: Real-Time Bidding (RTB)
- RTB algorithms are usually based on logistic regression
- Whether or not to bid on a user is determined by the probability that the user will click on an ad
- Each day billions of bids are processed
- Each bid has to be processed within 80 milliseconds

Examples of Logistic Regression in Industry Settings: Fraud Detection
- Detecting fraudulent credit card transactions
- The probability that a transaction is using a stolen credit card is typically estimated with logistic regression
- Billions of transactions are analyzed each day

How Slow is the Batch Version of Logistic Regression? One target variable and two feature vectors. All randomly generated.

Big Data Friendly Approaches
- Online training
  - Pass each data point sequentially through the algorithm
  - Only requires one data point at a time in memory
  - Allows for on-the-fly training of the algorithm
- MapReduce
  - Split the estimation into several chunks, run the algorithm on each chunk, combine the outputs
  - Runs in parallel, fast for training models

MapReduce Framework
[Diagram with nodes: Data, Master, Mapper (x3), Reducer, Output]
1. The master distributes the data across the mappers.
2. The mappers carry out the calculations and return key-value pairs to the master.
3. The master sends the key-value pairs to the reducer, which aggregates the calculations.
4. The master returns the aggregated output.

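MapReduce Framework
[Diagram with nodes: Data, Master, Mapper (x3), Reducer, Output]
1. The master distributes the data and initializes θ (a vector of coefficients).
2. Each mapper computes, over its subgroup of the data:
   $\sum_{\text{subgroup}} (y - h_\theta(x))\, x_i$
   and
   $\sum_{\text{subgroup}} h_\theta(x)\,(h_\theta(x) - 1)\, x_i x_j$
3. The reducer sums up the calculations.
4. The master returns the updated θ.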
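The two per-mapper sums have the form of the gradient and the Hessian of the log-likelihood, which suggests a Newton-Raphson update. The slides do not name the update rule, so the sketch below assumes Newton-Raphson and simulates the mappers and the reducer inside a single Python process; no Hadoop API is used. The function names, the small linear solver and the synthetic data are illustrative.

```python
import math
import random
from functools import reduce

def h(theta, x):
    """Logistic prediction h_theta(x)."""
    return 1.0 / (1.0 + math.exp(-sum(t * v for t, v in zip(theta, x))))

def mapper(chunk, theta):
    """Partial gradient and Hessian of the log-likelihood on one chunk."""
    d = len(theta)
    grad = [0.0] * d
    hess = [[0.0] * d for _ in range(d)]
    for x, y in chunk:
        p = h(theta, x)
        for i in range(d):
            grad[i] += (y - p) * x[i]
            for j in range(d):
                hess[i][j] += p * (p - 1.0) * x[i] * x[j]
    return grad, hess

def reducer(a, b):
    """Sum two (gradient, Hessian) pairs element-wise."""
    (ga, ha), (gb, hb) = a, b
    grad = [u + v for u, v in zip(ga, gb)]
    hess = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(ha, hb)]
    return grad, hess

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def newton_mapreduce(chunks, d, n_iter=8):
    theta = [0.0] * d                                   # master initializes theta
    for _ in range(n_iter):                             # one simulated job per step
        partials = [mapper(c, theta) for c in chunks]   # map phase
        grad, hess = reduce(reducer, partials)          # reduce phase
        step = solve(hess, grad)
        theta = [t - s for t, s in zip(theta, step)]    # theta - H^{-1} g
    return theta

# Synthetic data (made up): two features, no intercept.
rng = random.Random(0)
rows = []
for _ in range(600):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    rows.append((x, 1 if rng.random() < h([1.5, -1.0], x) else 0))

chunks = [rows[i::3] for i in range(3)]  # three "mappers"
theta = newton_mapreduce(chunks, d=2)
print(theta)
```

Because the reducer only adds up partial sums, splitting the data across more mappers changes the result only by floating-point rounding, which is what makes this estimation easy to parallelize.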

MapReduce Framework Approaches the maximum of the function in a direct manner, where each step is a MapReduce job

Online Learning
We want to learn a vector of weights.
Initialize all weights, then loop:
1. Get a training example
2. Make a prediction for the target variable
3. Learn the true value of the target
4. Update the weights and go to 1

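Online Learning
Initialise all weights, then loop:
Repeat {
  For i = 1 to m {
    $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i))$
  }
}
where:
- $\alpha$ is the step size: how fast we should climb the gradient
- $\frac{\partial}{\partial \theta_j}$ is the partial derivative of the cost function
- $\mathrm{cost}(\theta, (x_i, y_i))$ is the cost function given theta and row i, i.e. how wrong are we?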
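A minimal sketch of this loop for logistic regression, assuming the log loss as the cost function, so that its partial derivative with respect to θ_j for one row is (h_θ(x) − y)·x_j. The step size, the simulated data stream and the function names are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(theta, x, y, alpha):
    """One online step: theta_j = theta_j - alpha * d cost / d theta_j."""
    err = sigmoid(sum(t * v for t, v in zip(theta, x))) - y
    return [t - alpha * err * xj for t, xj in zip(theta, x)]

def stream(n, true_theta, seed=0):
    """Hypothetical data stream yielding one (x, y) pair at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        p = sigmoid(sum(t * v for t, v in zip(true_theta, x)))
        yield x, (1 if rng.random() < p else 0)

theta = [0.0, 0.0]                      # initialise all weights
for x, y in stream(5000, [1.5, -1.0]):  # only one row in memory at a time
    theta = sgd_update(theta, x, y, alpha=0.05)

print(theta)  # hovers around the generating weights without settling exactly
```

With a constant step size the weights keep moving with every new row, so the estimate tracks the data but never stops jittering.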

Online Learning Approaches the maximum of the function in a jumpy manner and never actually settles on the maximum.

Batch vs. MapReduce vs. Online Learning
One target variable and two feature vectors. All randomly generated.
[Figure comparing three methods: Batch, MapReduce, Online Learning]

Online Learning vs. MapReduce
Online Learning
- When we have a continuous stream of data
- When it is important to update the algorithm in real time: it can hit a moving target
- Parameters are jumpy around the optimal values
- Can be tricky to implement into a system
MapReduce
- When data is updated in batches
- When updating the algorithm does not need to happen in real time
- Parameters settle on optimal values
- If a company already uses Hadoop, it is easy to implement