Sequoia Forest. A Scalable Random Forest Implementation on Spark. Sung Hwan Chung, Alpine Data Labs


Random Forest Overview It's a technique that reduces the variance of single-decision-tree predictions by averaging the predictions of many decorrelated trees. The decorrelation of the trees is achieved through bagging and/or randomly selecting features per tree node.

Bagging Training Data Bootstrap Samples Bagged Tree Ensemble

Random Features Per Node Only examine a smaller random set of features. E.g., if there are 10000 features, randomly select 100 features for each node and determine the optimal split based on these. This decorrelates trees further and often works better than bagging.

Bagging vs Random Features MNIST data set Red : Bagging, No Random Features Blue : No Bagging, Random Features

Random Forest in Spark The main challenge is to train multiple individual decision trees over distributed data. Additionally, in Random Forest, decision trees are typically fully grown, so they can get large, particularly for large data sets.

Random Forest in Spark Contd. MLlib already has a decision tree implementation, but it doesn't support random features and is not optimized for training a large number of nodes. We implemented our own to fit our needs.

Distributed Tree Training The big idea is similar to Google's PLANET implementation (also adopted by MLlib). Individual trees are built node by node from the driver. At each iteration, individual executors collect the partition statistics required to determine node splits and predictions.

Preprocessing - Feature Discretization For efficiency, features are discretized first.
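The slides don't say how bin edges are chosen; the following is a minimal sketch assuming equal-frequency binning, with illustrative names throughout:

```python
import bisect

def discretize_feature(values, num_bins=32):
    """Map continuous feature values to small bin indices.

    A sketch of one common scheme (equal-frequency binning); the
    slide does not specify which scheme Sequoia Forest uses.
    """
    # Candidate bin edges: evenly spaced order statistics.
    ordered = sorted(values)
    step = max(1, len(ordered) // num_bins)
    candidates = [ordered[i] for i in range(step, len(ordered), step)]
    edges = sorted(set(candidates))[: num_bins - 1]
    # Tag each row with the index of the bin it falls into.
    bins = [bisect.bisect_right(edges, v) for v in values]
    return edges, bins
```

Collecting split statistics per bin, instead of per distinct feature value, is what keeps the per-node statistics small.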

Preprocessing - Bagging Approximate random sampling, with or without replacement: executors tag rows with per-tree sample counts, one column per tree. Bernoulli sampling is used for without-replacement, Poisson sampling for with-replacement. (Slide diagram: each row tagged with its sample counts, e.g. 2, 1, 0, 2, 1, 2.)
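A sketch of the sample-count tagging described above; `sample_counts` and its parameters are hypothetical names, not Sequoia Forest's API, and the Poisson draw uses Knuth's method for simplicity:

```python
import math
import random

def sample_counts(num_rows, num_trees, rate=1.0,
                  with_replacement=True, seed=0):
    """Tag each row with a per-tree sample count, as in the slide:
    Poisson(rate) for with-replacement bagging, Bernoulli(rate) for
    without-replacement. Returns one column (list of counts) per tree.
    """
    rng = random.Random(seed)

    def draw():
        if with_replacement:
            # Poisson draw via Knuth's method (fine for small rates).
            L, k, p = math.exp(-rate), 0, 1.0
            while True:
                p *= rng.random()
                if p <= L:
                    return k
                k += 1
        # Bernoulli: row appears 0 or 1 times.
        return 1 if rng.random() < rate else 0

    return [[draw() for _ in range(num_rows)] for _ in range(num_trees)]
```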

Mutable Node ID Tag Remember the last node that each row belonged to - again, one column per tree. (Slide diagram: rows tagged with sample counts and last node IDs.) This speeds up the process considerably, since we don't have to pass large filters/trees to executors.
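The node-ID column can be advanced in place after each round of splits, so executors only ever need the current frontier's split decisions rather than whole trees. A sketch with illustrative names (`splits` maps a node ID to its chosen split and child IDs):

```python
def update_node_ids(node_ids, feature_bins, splits):
    """Advance each row's 'last node ID' after a round of splits.

    `node_ids` is the per-tree column from the slide; `feature_bins[f][r]`
    is row r's discretized bin for feature f; `splits` maps a node ID to
    (feature, threshold_bin, left_child_id, right_child_id).
    A sketch, not Sequoia Forest's code.
    """
    for r, node in enumerate(node_ids):
        if node in splits:
            f, thresh, left, right = splits[node]
            node_ids[r] = left if feature_bins[f][r] <= thresh else right
    return node_ids
```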

Training/Splitting Nodes Driver: asks executors to compute node/bin statistics (map); collects/merges partition statistics (reduce); selects optimal features/splits using the merged partition statistics at bin boundaries; repeats on child nodes. Executor: loops through the node-split filters sent by the driver and selects the rows that match; collects the statistics required for splitting at bin granularity; sends the aggregated partition statistics back to the driver.
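The executor and driver steps above can be sketched as follows for classification with information gain. This is an illustrative reimplementation with made-up names, not Sequoia Forest's actual code:

```python
import math
from collections import defaultdict

def collect_bin_stats(rows, labels, node_ids, target_nodes):
    """Executor side: count (bin, label) occurrences per (node, feature)
    for the nodes currently being split. `rows[r][f]` is row r's
    discretized bin for feature f.
    """
    stats = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for r, node in enumerate(node_ids):
        if node in target_nodes:
            for f, b in enumerate(rows[r]):
                stats[(node, f)][b][labels[r]] += 1
    return stats

def entropy(label_counts):
    total = sum(label_counts.values())
    return -sum(c / total * math.log2(c / total)
                for c in label_counts.values() if c)

def best_split(bin_label_counts):
    """Driver side: given merged counts for one (node, feature) --
    bin_label_counts[b][label] -- scan candidate splits at bin
    boundaries and return (threshold_bin, information_gain).
    """
    bins = sorted(bin_label_counts)
    total = defaultdict(int)
    for b in bins:
        for lab, c in bin_label_counts[b].items():
            total[lab] += c
    n = sum(total.values())
    parent = entropy(total)
    best_bin, best_gain = None, 0.0
    left = defaultdict(int)
    for b in bins[:-1]:                      # candidate split: "bin <= b"
        for lab, c in bin_label_counts[b].items():
            left[lab] += c
        right = {lab: total[lab] - left[lab] for lab in total}
        n_left = sum(left.values())
        gain = (parent
                - n_left / n * entropy(left)
                - (n - n_left) / n * entropy(right))
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain
```

Because splits are only evaluated at bin boundaries, the statistics each executor ships are proportional to the number of bins, not the number of rows.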

Partition Statistics Aggregation Each partition computes its own bin/label counts, and the driver sums them elementwise. (Slide diagram: partitions 1-3 report counts (2, 0, 4, 1), (1, 3, 7, 0) and (0, 5, 2, 1), which the driver aggregates to (3, 8, 13, 2).)
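The merge itself is a plain elementwise sum, sketched here with the counts from the slide (the dict-based shape is an assumption for illustration):

```python
def merge_stats(partition_stats):
    """Driver side: reduce per-partition bin/label counts into one
    table by elementwise addition. Keys identify a (bin, label) cell;
    missing cells count as zero.
    """
    merged = {}
    for stats in partition_stats:
        for key, count in stats.items():
            merged[key] = merged.get(key, 0) + count
    return merged
```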

Random Features per Node Unlike a single decision tree, we need to randomly select a smaller number of features per node. To synchronize among different executors, the driver passes per-tree random seeds to the executors. The executors add the node IDs to the seeds and use them to randomly select features for different tree nodes.
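A sketch of the seed trick: seeding a generator with tree_seed + node_id lets every executor derive the same feature subset independently, with no extra communication. The exact combining function and names here are assumptions:

```python
import random

def features_for_node(tree_seed, node_id, num_features, num_random_features):
    """Pick the random feature subset for one tree node deterministically
    from (tree seed, node ID), as the slide describes. Every executor
    calling this with the same arguments gets the same subset.
    """
    rng = random.Random(tree_seed + node_id)
    return sorted(rng.sample(range(num_features), num_random_features))
```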

Local Sub-tree Training Although the first few levels of a tree see a lot of training examples, descendant nodes see fewer and fewer. Once the number of training samples gets small enough, we train small sub-trees locally by shuffling their training samples to matching executors.

Distributed Sub-tree Training (Slide diagram: the root and upper levels are trained by aggregating partition statistics; the sub-trees beneath them are trained locally by shuffling training data.)

The Overall Recipe 1. Train the nodes in a breadth-first manner. 2. Aggregate partition statistics and find optimal splits. 3. If a child of a split has a size smaller than local_threshold, shuffle matching data to an executor and train locally.
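The three-step recipe could be wired together roughly like this; the callback names are placeholders for the distributed pieces, not actual Sequoia Forest APIs:

```python
def train_tree(frontier_nodes, local_threshold,
               split_distributed, node_size, train_locally):
    """High-level driver loop for the recipe above. `split_distributed`
    aggregates partition statistics and returns a node's children,
    `node_size` gives a child's sample count, and `train_locally`
    shuffles a small node's rows to one executor and finishes it there.
    """
    while frontier_nodes:                        # 1. breadth-first
        next_frontier = []
        for node in frontier_nodes:
            for child in split_distributed(node):    # 2. aggregate + split
                if node_size(child) < local_threshold:
                    train_locally(child)             # 3. small enough: local
                else:
                    next_frontier.append(child)
        frontier_nodes = next_frontier
```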

Performance factors 1. Time it takes to perform distributed node splits. 2. Time it takes to locally train all the sub-trees.

Performance dependencies The performance of partition statistics aggregation is determined by two factors. CPU-bound statistics collection at individual executors; this should scale linearly with the number of executors. The size of the statistics and the message-passing/network/shuffle speed; the size is proportional to #_of_nodes * #_of_features * #_of_bins * statistics_size.

Performance dependencies For information gain, the statistics size is equal to the number of target classes. For regression, the statistics size is equal to 3: sum, square_sum and count.
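Those three regression statistics are enough both to merge across partitions and to compute the variance used for split scoring; a worked sketch (function names are illustrative):

```python
def stats_of(values):
    """Build the (sum, square_sum, count) triple for a set of targets."""
    return (sum(values), sum(v * v for v in values), len(values))

def merge(s1, s2):
    """Regression statistics merge across partitions by addition."""
    return (s1[0] + s2[0], s1[1] + s2[1], s1[2] + s2[2])

def variance(stats):
    """Variance of a node's targets from its merged triple:
    E[x^2] - E[x]^2. Minimizing the weighted child variance is the
    usual regression split criterion.
    """
    s, sq, n = stats
    mean = s / n
    return sq / n - mean * mean
```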

Performance dependencies Local sub-tree training is largely embarrassingly parallel - performance scales almost linearly with the number of executors. The training-data shuffle performance depends on disk I/O and network bandwidth.

Performance scalability MNIST 8 million samples (784 features, 10 classes). 100 Trees No limit on tree depth or size. # of random features 28 (sqrt(784)) 100% sampling without replacement. 9 machines in AWS, each with 32 cores. Measure the time to train vs the number of executors.

Performance scalability Blue : Training Time Red : Parse/Preprocess * The number of machines is fixed at 9, so this doesn't tell the whole story. E.g., we noticed that adding more executors within the same machine can actually slow down the algorithm due to a disk-IO bottleneck during groupBy/shuffle.

Benefit of local training MNIST 8 million 10 trees 50 executors The local subtree training improves performance drastically.

Large scale data testing 200 million samples, 1000 features, 10 classes. Affine-transformed MNIST data with additional features. No tree size limit. 100 trees # of random features 33 100% sampling without replacement. 50 Executors

Large scale data performance # of nodes per tree goes up linearly with training duration.

Performance comparisons Blue : Train Red : Parse Orange : Total MNIST 8m 10 Trees 50 Executors * The Wise.IO and H2O benchmarks are taken from blog entries and were not run on the same machines as Sequoia Forest.

Random Forest vs Pruning Typically, Random Forest trees are not pruned. The following graphs show how the performance can be affected when leaf sizes get larger.

Accuracy vs Leaf Sizes 100 trees; trained on MNIST 60000, tested on MNIST 10000.

Pruning for large scale data? For billions of rows, individual unpruned trees could become too large, even reaching hundreds of millions of nodes per tree. To save space and/or training time, we need to investigate whether some pruning could be beneficial from an accuracy-efficiency tradeoff point of view.

Shameless Alpine Ad Alpine is one of the very first companies to receive an official Spark certification. We're looking for brilliant machine learning/platform/QA engineers.

Conclusion Spark enables what was previously impossible: training many fully grown, humongous trees on big data. Performance scales well with the number of executors. Further studies are needed to understand accuracy and tree-size trade-offs.