Scalable Distributed Decision Trees in Spark MLlib


Scalable Distributed Decision Trees in Spark MLlib — Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Labs), Evan Sparks (UC Berkeley), Ameet Talwalkar (UC Berkeley)

Who is this guy? Ph.D. in ECE from UC San Diego; currently doing data science at Origami Logic, a search-based marketing intelligence platform. We work with global brands on large, unstructured marketing datasets, using Spark for analytics.

Overview: Decision Tree 101, Distributed Decision Trees in MLlib, Experiments, Ensembles, Future Work

Supervised Learning — Train on rows of (features, label); then, given the features of a new row, predict its label.

Classification / Regression — Classification: labels denote classes. Regression: labels denote real numbers.

Car mileage from 1971!

horsepower  weight  mileage
95          low     low
90          low     low
70          low     high
86          low     high
76          high    low
88          high    low

Learn a model to predict the mileage (binary classification!)

Let's Learn: Rule 1 — From the table above: if weight is high, mileage is low.

Find Best Split

Training data labels: {high: 2, low: 4}
Split candidates: (weight, {low, high}); (hp, {70, 76, 86, 88, 90, 95})

Weight == High?   labels: {high: 2, low: 4}
  Yes → labels: {high: 0, low: 2}   (pure: no further information gain possible → leaf: Low Mileage)
  No  → labels: {high: 2, low: 2}   (still have work to do here...)

Choose the split that causes the maximum reduction in label variability, i.e., the split that maximizes information gain.

Let's Learn: Rule 2

horsepower  weight  mileage
95          low     low
90          low     low
70          low     high
86          low     high

horsepower axis: 70 — 86 — 90 — 95
If horsepower <= 86, mileage is high. Else, it's low.

Mileage Classification Tree

Weight == High?               labels: {high: 2, low: 4}
  Yes → Low Mileage           labels: {high: 0, low: 2}
  No  → Horsepower <= 86?     labels: {high: 2, low: 2}
          Yes → High Mileage  labels: {high: 2, low: 0}
          No  → Low Mileage   labels: {high: 0, low: 2}
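Written out as code, the learned tree is just two nested rules. A minimal hand-written sketch (illustrative names, not MLlib output):

```scala
// Hand-written sketch of the learned tree; not MLlib output.
def predictMileage(horsepower: Int, weightIsHigh: Boolean): String =
  if (weightIsHigh) "low"            // Rule 1: high weight => low mileage
  else if (horsepower <= 86) "high"  // Rule 2: low horsepower => high mileage
  else "low"
```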

Let's predict

horsepower  weight  prediction  outcome
90          high    low         Correct!
80          low     low         Correct!
70          high    high        Wrong!

Complex in Practice — [figure: a real classification tree on 569 samples, rooted at X[0] <= 15.045 (error 0.66, value [357, 212]) and branching through splits such as X[1] <= 19.61 and X[1] <= 16.395 into many leaves]

Why Decision Trees? Easy to interpret; handle categorical variables; support (multi-class) classification and regression; need no feature scaling; capture non-linearities and feature interactions; handle missing values; and their ensembles are top performers.

Overview: Decision Tree 101, Distributed Decision Trees in MLlib, Experiments, Ensembles, Future Work

Dataset: Single Machine — Typically, the dataset is loaded in memory as a matrix or dataframe, and the algorithm performs multiple passes over the data (R, scikit-learn, ...).

Distributed Dataset — Learn multiple models on the partitions and combine them?

partition 1             partition 2             partition 3
hp  weight  mileage     hp  weight  mileage     hp  weight  mileage
95  low     low         70  low     high        76  high    low
90  low     low         86  low     high        88  high    low

This does not work well for all data partitionings, and it still needs inter-machine communication to combine the models.

Distributed Dataset

Hadoop MapReduce: no implementations when we started; currently RHadoop, Oryx, 0xdata, ...
PLANET: decision trees using MapReduce; not open source; we extend it with several optimizations.
Spark: built for iterative machine learning, but no tree support in the initial versions.

Split Candidates for a Distributed Implementation

For continuous features, finding all unique feature values is costly, sorted splits are desirable for fast computation, and a high cardinality of splits leads to significant computation and communication overhead. Instead, use approximate quantiles (percentiles by default) as split candidates.
[figure: the median (2-quantile) of the horsepower values 70, 86, 88, 90, 95]
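A minimal sketch of quantile-based split candidates for one continuous feature (the sampling fraction and `numSplits` are illustrative assumptions; MLlib derives its candidates from maxBins):

```scala
import org.apache.spark.rdd.RDD

// Sketch: pick split thresholds for one continuous feature at
// (approximate) quantile boundaries instead of at every unique value.
def quantileSplits(featureValues: RDD[Double], numSplits: Int): Array[Double] = {
  // Sample instead of sorting the full column; 1% fraction is illustrative
  // and assumes the sample is non-empty.
  val sample = featureValues.sample(withReplacement = false, 0.01).collect().sorted
  (1 until numSplits).map { i =>
    sample((i * sample.length) / numSplits)   // i-th quantile boundary
  }.distinct.toArray
}
```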

Typical MapReduce Implementation: Algorithm
flatMap — input: an instance; output: a list of (split, label) pairs.
reduceByKey — input: (split, list of labels); output: (split, label histogram).

Typical MapReduce Implementation: Example

Take the instance (hp: 76, weight: high, mileage: low). flatMap emits one (split, label) pair for each split candidate the instance satisfies:
(weight, high) → low; (hp, 76) → low; (hp, 86) → low; (hp, 88) → low; (hp, 90) → low; (hp, 95) → low
reduceByKey then groups the labels per split, e.g. (weight, high) → [low, low], and turns them into label histograms:
(weight, high) → {low: 2, high: 0}
(weight, !high) → {low: 2, high: 2}
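A sketch of this naive approach in Spark, using simplified stand-in types for the toy dataset (not MLlib's internals):

```scala
import org.apache.spark.rdd.RDD

// Simplified stand-in type for the toy dataset.
case class Row(hp: Int, weightHigh: Boolean, mileage: String)

// Naive approach: every instance emits one (split, label) pair per
// split candidate it satisfies, then reduceByKey builds per-split
// label histograms.
def naiveHistograms(data: RDD[Row], hpThresholds: Seq[Int])
    : RDD[((String, String), Map[String, Int])] = {
  data.flatMap { row =>
    val weightSplit = (("weight", row.weightHigh.toString), row.mileage)
    // An instance with hp = h satisfies every split "hp <= t" with t >= h.
    val hpSplits = hpThresholds.filter(_ >= row.hp)
      .map(t => (("hp", s"<=$t"), row.mileage))
    weightSplit +: hpSplits
  }
  .map { case (split, label) => (split, Map(label -> 1)) }
  .reduceByKey { (a, b) =>
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap
  }
}
```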

Typical MapReduce Implementation: Issues — For k features, m splits per feature, and n instances, the map operation emits O(k·m·n) values per best-split computation at a node: a significant communication overhead. Can we do better?

Avoiding Map in MapReduce — The map operation is essential when the keys are not known in advance (e.g., the words in word count). Here the splits are known in advance, so skipping the map avoids both object-creation overhead and the communication overhead of the shuffle.

Optimization 1: Aggregate (Distributed Fold)

Starting from an empty statistics array, each RDD partition (1..N) folds its instances into a partition-local aggregate (Agg 1, ..., Agg N) of partial statistics. A second fold (reduce) then merges the partial aggregates into the sufficient statistics Agg.

Sufficient Statistics — Left and right child-node statistics for each split. Classification: label counts. Regression: count, sum, and sum of squares.
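A minimal sketch of the aggregate pattern with a flat label-count array, assuming the horsepower bins introduced in the next optimization; the (bin, label) layout is illustrative, not MLlib's actual buffer format:

```scala
import org.apache.spark.rdd.RDD

// Sketch: each partition folds its rows into a local statistics array;
// the partials are then merged. Layout (bin, label) is illustrative.
val numBins = 4
val numLabels = 2                        // 0 = low mileage, 1 = high mileage
val thresholds = Seq(70, 86, 90)         // bin boundaries for horsepower

def binOf(hp: Int): Int = thresholds.count(_ < hp)
def labelOf(mileage: String): Int = if (mileage == "high") 1 else 0

// data: (horsepower, mileage) pairs
def sufficientStats(data: RDD[(Int, String)]): Array[Long] =
  data.aggregate(new Array[Long](numBins * numLabels))(
    // seqOp: fold one instance into the partition-local array
    (agg, row) => { agg(binOf(row._1) * numLabels + labelOf(row._2)) += 1; agg },
    // combOp: merge partial statistics from two partitions
    (a, b) => { var i = 0; while (i < a.length) { a(i) += b(i); i += 1 }; a }
  )
// For regression, each cell would hold (count, sum, sum of squares)
// instead of a label count.
```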

Optimization 2: Binning

hp  mileage
95  low
90  low
70  high
86  high

The thresholds 70, 86, 90 partition horsepower into four bins. Each instance updates the label histogram of exactly one bin:

bin         hp <= 70          70 < hp <= 86     86 < hp <= 90     hp > 90
histogram   {low: 0, high: 1} {low: 0, high: 1} {low: 1, high: 0} {low: 1, high: 0}

Cumulative sums of the bin histograms then yield the left and right label histograms for every split in one sweep:
left of hp <= 70: {low: 0, high: 1};  left of hp <= 86: {low: 0, high: 2};  left of hp <= 90: {low: 1, high: 2}
right of hp > 70: {low: 2, high: 1};  right of hp > 86: {low: 2, high: 0};  right of hp > 90: {low: 1, high: 0}

Bin-based Info Gain — With m splits per feature: binning via binary search costs log(m) instead of m per instance, and updating one bin histogram instead of m split histograms saves another factor of m. Significant savings in computation.
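A sketch of both savings, assuming the bins above: binary-search bin lookup, plus a single cumulative sweep over bin histograms that scores every split (entropy impurity; all layouts illustrative):

```scala
import java.util.Arrays

// Binary-search version of binOf above: O(log m) per instance, not O(m).
val binEdges = Array(70.0, 86.0, 90.0)
def binIndex(hp: Double): Int = {
  val i = Arrays.binarySearch(binEdges, hp)
  if (i >= 0) i else -i - 1      // insertion point = bin for in-between values
}

// Entropy of a label-count histogram.
def entropy(h: Array[Long]): Double = {
  val n = h.sum.toDouble
  if (n == 0) 0.0
  else h.filter(_ > 0).map { c => val p = c / n; -p * math.log(p) }.sum
}

// One sweep over cumulative bin histograms scores every split:
// split s = bins [0..s] on the left vs. bins [s+1..] on the right.
// binHists is laid out as bins x labels.
def bestSplitGain(binHists: Array[Array[Long]]): (Int, Double) = {
  val totals = binHists.transpose.map(_.sum)
  val n = totals.sum.toDouble
  val parent = entropy(totals)
  val left = new Array[Long](totals.length)
  (0 until binHists.length - 1).map { s =>
    for (l <- left.indices) left(l) += binHists(s)(l)   // cumulative left histogram
    val right = totals.zip(left).map { case (t, l) => t - l }
    (s, parent - (left.sum / n) * entropy(left) - (right.sum / n) * entropy(right))
  }.maxBy(_._2)
}
```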

Optimization 3: Level-wise Training

Node 1 is trained with one pass over the cached dataset, using the precomputed split candidates. Its children, Node 2 and Node 3, are then trained together in a single shared pass rather than one pass each. Perform level-wise training of nodes.

Level-wise training — L passes over the data instead of 2^L - 1 for a full tree. Depth 4: 4 passes instead of 15. Depth 10: 10 passes instead of 1023.
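A schematic of the level-wise driver loop, reusing `sufficientStats` and `bestSplitGain` from the sketches above. For brevity every frontier node here shares one global statistics array, whereas a real implementation routes each instance to its frontier node and aggregates per-node statistics within the same pass:

```scala
import org.apache.spark.rdd.RDD

// Minimal stand-in node; MLlib's internal Node class is richer.
class TreeNode { var splitBin: Int = -1; var left: TreeNode = _; var right: TreeNode = _ }

def trainLevelWise(data: RDD[(Int, String)], maxDepth: Int): TreeNode = {
  data.cache()                            // reuse the same cached RDD at every level
  val root = new TreeNode
  var frontier: Seq[TreeNode] = Seq(root)
  for (_ <- 0 until maxDepth if frontier.nonEmpty) {
    // One distributed pass for the whole level. (A real implementation
    // aggregates a separate statistics array per frontier node inside
    // this same pass; here all nodes share the global statistics.)
    val stats = sufficientStats(data)
    val (bin, gain) = bestSplitGain(stats.grouped(numLabels).toArray)
    frontier = frontier.flatMap { node =>
      if (gain > 0) {
        node.splitBin = bin               // split; children go to the next level
        node.left = new TreeNode; node.right = new TreeNode
        Seq(node.left, node.right)
      } else Seq.empty                    // no gain: the node becomes a leaf
    }
  }
  root
}
```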

MLlib decision tree features — Binary classification and regression (1.0); categorical variable support (1.0); arbitrarily deep trees (1.1); multiclass classification (under review for 1.1); sample weights (under review for 1.1).
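For reference, a minimal usage sketch of the RDD-based API (shown with `trainClassifier`, which appeared around the 1.1 timeframe; the path and parameters are placeholders, and an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Illustrative usage; the input path is a placeholder.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val model = DecisionTree.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),  // all features continuous
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)
val prediction = model.predict(data.first().features)
```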

Overview: Decision Tree 101, Distributed Decision Trees in MLlib, Experiments, Ensembles, Future Work

Strong Scaling Experiment — Baseline: 20M samples with 20 features on 2 workers.

machines   2     4      8     16
speedup    1     1.84   3.7   7.22    (ideal: 1, 2, 4, 8)

Strong Scaling Results — Synthetic datasets: 10 to 50 million instances, 10 to 50 features, 2 to 16 machines, 700 MB to 18 GB of data. The average speedup from 2 to 16 machines was 6.6x.

Large-scale experiment — 0.5 billion instances, 20 features, 90 GB dataset.

[figure: training time (100 to 10,000 s, log scale) vs. number of machines (30 to 600) for tree depths 3, 5, and 10]

Takeaways: the implementation works on large datasets; deeper trees require more computation; and adding machines trades computation against communication.

Overview: Decision Tree 101, Distributed Decision Trees in MLlib, Experiments, Ensembles, Future Work

Tree Ensembles — Decision trees are the building blocks. Boosting: sequential training driven by sample weights. Random Forests: parallel construction, extending level-wise training to multiple trees.

AdaBoost wrapper — [the slide's code did not survive transcription]
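Since the slide's code is lost, here is a hedged sketch of what an AdaBoost-style wrapper over MLlib trees could look like. Because per-sample weights were still under review, it emulates weighting by replicating points; this is an illustration, not the authors' implementation:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Hedged sketch of a binary (0/1 labels) AdaBoost wrapper around MLlib
// trees. Weighting is emulated by replicating points, since per-sample
// weights were not yet available. NOT the talk's implementation.
def adaBoost(data: RDD[LabeledPoint], rounds: Int): Seq[(Double, DecisionTreeModel)] = {
  val n = data.count()
  var weighted = data.map(p => (p, 1.0 / n))
  (0 until rounds).map { _ =>
    // Replicate each point roughly in proportion to its weight.
    val resampled = weighted.flatMap { case (p, w) =>
      Seq.fill(math.max(1, math.round(w * n).toInt))(p)
    }
    val stump = DecisionTree.trainClassifier(
      resampled, 2, Map[Int, Int](), "gini", maxDepth = 1, maxBins = 32)
    val err = weighted.map { case (p, w) =>
      if (stump.predict(p.features) != p.label) w else 0.0
    }.sum()
    val alpha = 0.5 * math.log((1.0 - err) / math.max(err, 1e-10))
    // Up-weight misclassified points, then renormalize.
    val unnorm = weighted.map { case (p, w) =>
      val miss = stump.predict(p.features) != p.label
      (p, w * math.exp(if (miss) alpha else -alpha))
    }.cache()
    val z = unnorm.map(_._2).sum()
    weighted = unnorm.map { case (p, w) => (p, w / z) }
    (alpha, stump)
  }
}
// Final prediction: sign of the sum over rounds of alpha * (2 * h(x) - 1).
```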

Overview: Decision Tree 101, Distributed Decision Trees in MLlib, Experiments, Ensembles, Future Work

Future Work — Ensembles (stretch goal for 1.1); feature importances; decision tree visualizations; testing over a variety of user datasets.

Requests — [the slides' request text did not survive transcription; the figure repeated the earlier in-memory data table]

Thanks