Fast Analytics on Big Data with H2O



Fast Analytics on Big Data with H2O 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj

Team

About H2O and 0xdata. H2O is a platform for distributed in-memory predictive analytics and machine learning. Pure Java, Apache v2 open source. Easy deployment with a single jar, automatic cloud discovery. https://github.com/0xdata/h2o https://github.com/0xdata/h2o-dev Google group: h2ostream. ~15000 commits over two years, very active developers.

Overview: H2O Architecture; GLM on H2O (demo); Random Forest.

H2O Architecture

Practical Data Science. Data scientists are not necessarily trained as computer scientists; a typical data science team is about 20% CS, working mostly on UI and visualization tools. An example is Netflix: statisticians prototype in R, and when done, developers reimplement the code in Java and Hadoop.

What we want from a modern machine learning platform. Requirements and the matching solution: Fast & Interactive -> In-Memory; Big Data (no sampling) -> Distributed; Flexibility -> Open Source; Extensibility -> API/SDK (Java, REST/JSON); Portability -> Cloud or On-Premise; Infrastructure -> Hadoop or Private Cluster.

H2O Architecture. Frontends: REST API, R, Python, Web Interface. Algorithms: GBM, Random Forest, GLM, PCA, K-Means, Deep Learning. Core: distributed Map/Reduce tasks, distributed in-memory K/V store, column-compressed data, memory-managed. Data sources: HDFS, S3, NFS, Web Upload.

Distributed Data Taxonomy Vector

Distributed Data Taxonomy: Vector. The vector may be very large (~billions of rows): stored compressed (often 2-4x); accessed as Java primitives with on-the-fly decompression; supports fast random access; modifiable with Java memory semantics.

Distributed Data Taxonomy: Vector. Large vectors must be distributed over multiple JVMs: the vector is split into chunks; a chunk is the unit of parallel access; each chunk holds ~1000 elements; compression is per chunk; each chunk is homed to a single node; chunks can be spilled to disk; GC is very cheap.

Distributed Data Taxonomy: distributed data frame (columns: ID, age, sex, zip). A row is always stored in a single JVM. Similar to an R frame; adding and removing columns is cheap; row-wise access.

Distributed Data Taxonomy: Elem, a Java double; Chunk, a collection of thousands to millions of Elems; Vec, a collection of Chunks; Frame, a collection of Vecs; Row i, the i-th elements of all the Vecs in a Frame.
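The taxonomy above implies a simple global-index-to-(chunk, offset) addressing scheme. A minimal single-JVM sketch, with hypothetical class names and no compression or distribution, looks like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Chunk/Vec taxonomy described above --
// not H2O's real classes, just the access pattern they imply.
public class VecSketch {
    static final int CHUNK_LEN = 1000;          // ~1000 elements per chunk

    // A Chunk is the unit of parallel access, homed to a single node.
    static class Chunk {
        final double[] elems = new double[CHUNK_LEN];
    }

    // A Vec is a collection of Chunks; a global row index maps to (chunk, offset).
    static class Vec {
        final List<Chunk> chunks = new ArrayList<>();
        Vec(long nrows) {
            for (long i = 0; i < nrows; i += CHUNK_LEN) chunks.add(new Chunk());
        }
        double at(long i) {
            return chunks.get((int) (i / CHUNK_LEN)).elems[(int) (i % CHUNK_LEN)];
        }
        void set(long i, double v) {
            chunks.get((int) (i / CHUNK_LEN)).elems[(int) (i % CHUNK_LEN)] = v;
        }
    }

    public static void main(String[] args) {
        Vec v = new Vec(2500);                   // spans three chunks
        v.set(2499, 42.0);                       // lands in the third chunk
        System.out.println(v.chunks.size());     // 3
        System.out.println(v.at(2499));          // 42.0
    }
}
```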

Distributed Fork/Join. The task is distributed across JVMs in a tree pattern: results are reduced at each inner node, and a single result is returned when all subtasks are done. On each node the task is parallelized over home chunks using Fork/Join; no thread ever blocks, thanks to continuation-passing style.

Distributed Code. Simple tasks are executed on a single remote node. Map/Reduce consists of two operations, map(x) -> y and reduce(y, y) -> y, automatically distributed amongst the cluster and the worker threads inside each node.

Distributed Code double sumy2 = new MRTask2(){ double map(double x){ return x*x; } double reduce(double x, double y){ return x + y; } }.doall(vec);

Demo GLM

CTR Prediction Contest. Kaggle contest: click-through rate prediction. Data: 11 days' worth of click-through data from Avazu; ~8GB, ~44 million rows; mostly categoricals. Large number of features (predictors), a good fit for linear models.

Linear Regression Least Squares Fit

Logistic Regression Least Squares Fit

Logistic Regression GLM Fit

Generalized Linear Modelling. Solved by iteratively reweighted least squares. The computation has two parts: compute X^T X (distributed), then invert X^T X via Cholesky decomposition (single node). Assumption: number of rows >> number of columns (strong rules are used to filter out inactive columns). Complexity: O(Nrows * Ncols^2 / (N * P) + Ncols^3 / P), for N nodes with P threads each.
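The two-part computation can be sketched in plain Java: X^T X and X^T y accumulate row by row (the part that distributes as a map/reduce over rows), and a single node then solves the normal equations by Cholesky decomposition. This is a hypothetical sketch of one plain least-squares step; IRLS would wrap it in a loop with per-row weights.

```java
// Sketch of the GLM computation described above: distributed Gram
// accumulation, then a single-node Cholesky solve of (X'X) b = X'y.
public class NormalEquations {
    // X'X: one pass over the rows, each contributing an outer product.
    static double[][] xtx(double[][] X) {
        int p = X[0].length;
        double[][] G = new double[p][p];
        for (double[] row : X)
            for (int i = 0; i < p; i++)
                for (int j = 0; j < p; j++) G[i][j] += row[i] * row[j];
        return G;
    }

    static double[] xty(double[][] X, double[] y) {
        int p = X[0].length;
        double[] v = new double[p];
        for (int r = 0; r < X.length; r++)
            for (int i = 0; i < p; i++) v[i] += X[r][i] * y[r];
        return v;
    }

    // Cholesky: G = L L', then solve L z = v and L' b = z. O(p^3), single node.
    static double[] choleskySolve(double[][] G, double[] v) {
        int p = G.length;
        double[][] L = new double[p][p];
        for (int i = 0; i < p; i++)
            for (int j = 0; j <= i; j++) {
                double s = G[i][j];
                for (int k = 0; k < j; k++) s -= L[i][k] * L[j][k];
                L[i][j] = (i == j) ? Math.sqrt(s) : s / L[j][j];
            }
        double[] z = new double[p], b = new double[p];
        for (int i = 0; i < p; i++) {            // forward substitution
            double s = v[i];
            for (int k = 0; k < i; k++) s -= L[i][k] * z[k];
            z[i] = s / L[i][i];
        }
        for (int i = p - 1; i >= 0; i--) {       // backward substitution
            double s = z[i];
            for (int k = i + 1; k < p; k++) s -= L[k][i] * b[k];
            b[i] = s / L[i][i];
        }
        return b;
    }

    public static void main(String[] args) {
        // y = 1 + 2x, fit exactly; columns are [intercept, x]
        double[][] X = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 3, 5, 7};
        double[] b = choleskySolve(xtx(X), xty(X, y));
        System.out.printf("%.3f %.3f%n", b[0], b[1]);   // 1.000 2.000
    }
}
```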

Random Forest

How Big is Big? Data set size is relative. Does the data fit in one machine's RAM? On one machine's disk? In several machines' RAM? On several machines' disks?

Why so Random? Introducing Random Forest: bagging, out-of-bag error estimate, confusion matrix. Leo Breiman: Random Forests. Machine Learning, 2001.

Classification Trees. Consider a supervised learning problem with a simple data set with two classes and two features, x in [1,4] and y in [5,8]. We can build a classification tree to predict the class of new observations.

Classification Trees Classification trees often overfit the data

Random Forest. Overfitting is avoided by building multiple randomized and far less precise (partial) trees. All these trees in fact underfit. The result is obtained by a vote over the ensemble of decision trees; different voting strategies are possible.

Random Forest Each tree sees a different part of the training set and captures the information it contains

Random Forest. Each tree sees a different random selection of the training set (without replacement): bagging. At each split, a random subset of features is selected, over which the decision should maximize gain: Gini impurity or information gain.
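As a sketch of the split criterion just named, Gini impurity and the impurity decrease ("gain") of a candidate split can be computed directly from class counts. Plain illustrative Java with made-up counts, not H2O code:

```java
// Sketch of the Gini-based split criterion: impurity of a label
// multiset, and the impurity decrease of a candidate split.
public class Gini {
    // Gini impurity: 1 - sum of squared class proportions.
    static double impurity(int[] counts) {
        int n = 0;
        for (int c : counts) n += c;
        if (n == 0) return 0.0;
        double g = 1.0;
        for (int c : counts) g -= ((double) c / n) * ((double) c / n);
        return g;
    }

    // Gain = parent impurity - weighted child impurity; the split
    // search maximizes this over a random subset of features.
    static double gain(int[] parent, int[] left, int[] right) {
        int n = 0, nl = 0, nr = 0;
        for (int c : parent) n += c;
        for (int c : left) nl += c;
        for (int c : right) nr += c;
        return impurity(parent)
             - ((double) nl / n) * impurity(left)
             - ((double) nr / n) * impurity(right);
    }

    public static void main(String[] args) {
        int[] parent = {10, 10};                 // perfectly mixed: impurity 0.5
        int[] left = {10, 0}, right = {0, 10};   // a perfect split
        System.out.println(impurity(parent));    // 0.5
        System.out.println(gain(parent, left, right)); // 0.5
    }
}
```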

Validating the trees. We can exploit the fact that each tree sees only a subset of the training data: each tree in the forest is validated on the training data it has never seen. (Figure: the original training data is split into data used to construct the tree and data used to validate it; the resulting errors are the out-of-bag error.)

Validating the Forest. A confusion matrix is built for the forest on the training data. During a vote, trees trained on the current row are ignored.

    actual \ assigned    Red   Green   Error
    Red                   15       5     33%
    Green                  1      10     10%
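The error column next to such a confusion matrix is the off-diagonal share of each row (the actual class). A minimal sketch with made-up counts:

```java
// Sketch of the per-class error next to a confusion matrix.
// Row = actual class, column = assigned class; the class error is
// the off-diagonal share of that row. Counts here are made up.
public class ConfusionMatrix {
    static double classError(int[] row, int actual) {
        int total = 0;
        for (int c : row) total += c;
        return total == 0 ? 0.0 : (double) (total - row[actual]) / total;
    }

    public static void main(String[] args) {
        int[][] cm = { {8, 2},    // actual Red:   8 right, 2 wrong
                       {3, 9} };  // actual Green: 3 wrong, 9 right
        System.out.printf("%.2f%n", classError(cm[0], 0)); // 0.20
        System.out.printf("%.2f%n", classError(cm[1], 1)); // 0.25
    }
}
```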

Distributing and Parallelizing. How do we sample? How do we select splits? How do we estimate OOBE? When the random data sample fits in memory, RF building parallelizes extremely well: parallel tree building is trivial. Validation, however, requires trees to be collocated with data, which means moving trees to the data (large training datasets can produce huge trees!).

Random Forest in H2O. Trees must be built in parallel over randomized data samples: H2O reads the data and distributes it over the nodes, and each node builds trees in parallel on a sample of the data that fits locally. To calculate gains, feature sets must be sorted at each split.

Random Forest in H2O. To calculate gains, feature sets must be sorted at each split. Instead of sorting, the values are discretized: features are represented as arrays of their ranks, so { (2, red), (3.4, red), (5, green), (6.1, green) } becomes { (1, red), (2, red), (3, green), (4, green) }. But trees can be very large (~100k splits), so the ranks are further reduced by binning.
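The discretization step can be sketched as replacing each raw value by its rank among the distinct sorted values. An O(n^2) toy version for illustration, not H2O's binning code:

```java
import java.util.Arrays;

// Sketch of the discretization step above: replace each raw feature
// value by its 1-based rank among the distinct sorted values, so split
// search works on small integers instead of re-sorting doubles.
public class Discretize {
    static int[] toRanks(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int[] ranks = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int r = 1;
            // Count distinct sorted values that are <= this value.
            for (int j = 1; j < sorted.length; j++)
                if (sorted[j] != sorted[j - 1] && sorted[j] <= values[i]) r++;
            ranks[i] = r;
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] feature = {2.0, 3.4, 5.0, 6.1};   // the slide's example
        System.out.println(Arrays.toString(toRanks(feature))); // [1, 2, 3, 4]
    }
}
```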

Lessons Learned. Java Random is not really random: small seeds give very bad random sequences, resulting in poor RF performance (and we of course started with a deterministic seed of 42 :), but determinism is important for debugging). The Linux kernel drops TCP connections silently when under stress: the sender opens a connection, sends, and closes without exceptions, but the receiver never sees the data; we needed to recycle TCP connections and use a TCP reliable delayer. Good diagnostics to detect hardware issues are needed: we saw specific UDP packet drops happening with 100% probability.

Demo Continued

Q & A Thank you