Scalable Data Science and Deep Learning with H2O. Arno Candel, PhD Chief Architect, H2O.ai



Similar documents
Fast Analytics on Big Data with H20

BIG DATA What it is and how to use?

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Deep Learning with H2O. Arno Candel & Viraj Parmar

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Azure Machine Learning, SQL Data Mining and R

Spark and the Big Data Library

April 2016 JPoint Moscow, Russia. How to Apply Big Data Analytics and Machine Learning to Real Time Processing. Kai Wähner.

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Spark: Cluster Computing with Working Sets

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Journée Thématique Big Data 13/03/2015

The Data Mining Process

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

SEIZE THE DATA SEIZE THE DATA. 2015

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

How Companies are! Using Spark

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

SEIZE THE DATA SEIZE THE DATA. 2015

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Scalable Machine Learning - or what to do with all that Big Data infrastructure

1. The orange button 2. Audio Type 3. Close apps 4. Enlarge my screen 5. Headphones 6. Questions Pane. SparkR 2

Moving From Hadoop to Spark

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Spark. Fast, Interactive, Language- Integrated Cluster Computing

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Challenges for Data Driven Systems

Programming Exercise 3: Multi-class Classification and Neural Networks

Unified Big Data Processing with Apache Spark. Matei

How To Handle Big Data With A Data Scientist

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Distributed Computing and Big Data: Hadoop and MapReduce

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Apache Flink Next-gen data analysis. Kostas

Bayesian networks - Time-series models - Apache Spark & Scala

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Scaling Out With Apache Spark. DTL Meeting Slides based on

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

The Use of Open Source Is Growing. So Why Do Organizations Still Turn to SAS?

Machine Learning Capacity and Performance Analysis and R

How To Make Sense Of Data With Altilia

RevoScaleR Speed and Scalability

R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution

How To Make A Credit Risk Model For A Bank Account

Big Data Analytics Hadoop and Spark

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Testing the In-Memory Column Store for in-database physics analysis. Dr. Maaike Limper

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Hadoop Ecosystem B Y R A H I M A.

How To Create A Data Visualization With Apache Spark And Zeppelin

Advanced Big Data Analytics with R and Hadoop

MLlib: Scalable Machine Learning on Spark

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Application of Predictive Analytics for Better Alignment of Business and IT

Reference Architecture, Requirements, Gaps, Roles

High Performance Predictive Analytics in R and Hadoop:

Apache Hama Design Document v0.6

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Brave New World: Hadoop vs. Spark

Unlocking the True Value of Hadoop with Open Data Science

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

How to make BIG DATA work for you. Faster results with Microsoft SQL Server PDW

Sentimental Analysis using Hadoop Phase 2: Week 2

<Insert Picture Here> Oracle and/or Hadoop And what you need to know

Maximize Revenues on your Customer Loyalty Program using Predictive Analytics

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Advanced In-Database Analytics

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Spark: Making Big Data Interactive & Real-Time

Analytics on Spark &

Automated Machine Learning For Autonomic Computing

HiBench Introduction. Carson Wang Software & Services Group

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Big Data Analytics with Cassandra, Spark & MLLib

CSE-E5430 Scalable Cloud Computing Lecture 11

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Microsoft Azure Machine learning Algorithms

the missing log collector Treasure Data, Inc. Muga Nishizawa

Big-Data Computing with Smart Clouds and IoT Sensing

The Internet of Things and Big Data: Intro

Integrating Apache Spark with an Enterprise Data Warehouse

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Streamdrill: Analyzing Big Data Streams in Realtime

Transcription:

Scalable Data Science and Deep Learning with H2O Arno Candel, PhD Chief Architect, H2O.ai

Who Am I? Arno Candel Chief Architect, Physicist & Hacker at H2O.ai PhD Physics, ETH Zurich 2005 10+ yrs Supercomputing (HPC) 6 yrs at SLAC (Stanford Lin. Accel.) 3.5 yrs Machine Learning 1.5 yrs at H2O.ai Fortune Magazine Big Data All Star 2014 Follow me @ArnoCandel 2

Outline Introduction H2O Deep Learning Architecture Live Demos: Flow GUI - Airline Logistic Regression Scoring Engine - Million Songs Classification R - MNIST Unsupervised Anomaly Detection Flow GUI - Higgs Boson Classification Sparkling Water - Chicago Crime Prediction ipython - CitiBike Demand Prediction Outlook 3

H2O - Product Overview In-Memory ML Distributed Open Source APIs Memory-Efficient Data Structures Cutting-Edge Algorithms Use all your Data (No Sampling) Accuracy with Speed and Scale Ownership of Methods - Apache V2 Easy to Deploy: Bare, Hadoop, Spark, etc. Java, Scala, R, Python, JavaScript, JSON NanoFast Scoring Engine (POJO) 4

Team@H2O.ai 25,000 commits / 3yrs H2O World Conference 2014 5

Community & Install Base ML is the new SQL Prediction is the new Search Active Users Mar 2014 July 2014 Mar 2015 150+ Companies Users 103 634 2789 463 2,887 13,237 6

Accuracy with Speed and Scale HDFS Fast Modeling Engine S3 Distributed In-Memory Map Reduce/Fork Join Columnar Compression Matrix Factorization Feature Engineering Munging Supervised Unsupervised Clustering Classification Regression GLM, Deep Learning K-Means, PCA, NB, Cox Random Forest / GBM Ensembles SQL all in-house code NoSQL Streaming Nano Fast Java Scoring Engines (POJO code generation) 7

Actual Customer Use Cases Real-time marketing (H2O is 10x faster than anything else) Ad Optimization (200% CPA Lift with H2O) P2B Model Factory (60k models, 15x faster with H2O than before) Fraud Detection (11% higher accuracy with H2O Deep Learning - saves millions) and many large insurance and financial services companies! 8

H2O - Easy Installation h2o.ai/download Try it out! Today s live demos will be run on yesterday s nightly build of the next-gen product (alpha, in QA)! 9

Demo: GLM (Elastic Net) on Airlines 8 nodes on EC2: all cores active. Trains on 116M rows in 17 seconds! Eastern Airlines: Always on time (No schedule!) 10

POJO Scoring Engine Standalone Java scoring code is auto-generated! Example: First GBM tree Note: no heap allocation, pure decision-making Fast and easy path to production (batch or real-time)! 11

H2O Deep Learning Multi-layer feed-forward Neural Network Trained with back-propagation (SGD, AdaDelta) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async fork/join worker threads operate at FORTRAN speeds) + smart algorithms for fast & accurate results (automatic standardization, one-hot encoding of categoricals, missing value imputation, weight & bias initialization, adaptive learning rate, momentum, dropout/l1/l2 regularization, grid search, N-fold cross-validation, checkpointing, load balancing, auto-tuning, model averaging, etc.) = powerful tool for (un)supervised machine learning on real-world data all 320 cores maxed out 12

H2O Deep Learning Architecture H2O in-memory nonblocking hash map: K-V store Query & display the model via JSON, WWW i K-V HTTPD nodes/jvms: sync threads: async communication K-V HTTPD initial model: weights and biases w w 1 w w 1 2 w w w w 1 2 4 3 w1 w3 w2 1 3 2 w2+w4 w1+w3 1 2 w * updated model: w * w4 4 = (w1+w2+w3+w4)/4 1 Main Loop: map: each node trains a copy of the weights and biases with (some * or all of) its local data with asynchronous F/J threads * auto-tuned (default) or userspecified number of rows per MapReduce iteration reduce: model averaging: average weights and biases from all nodes, speedup is at least #nodes/ log(#rows) http://arxiv.org/abs/1209.4129 Keep iterating over the data ( epochs ), score at user-given times 13

H2O Deep Learning beats MNIST Handwritten digits: 28^2=784 gray-scale pixel values Standard 60k/10k data No distortions No convolutions No unsupervised training No ensemble 10 hours on 10 16-core servers World-record! 0.83% test set error A Classic Benchmark! 14

Unsupervised Anomaly Detection Learns Compressed Identity Function The good The bad The ugly Try it yourself! 15

Higgs Boson - Classification Problem Large Hadron Collider: Largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1PB/day, etc. Higgs boson discovery (July 12) led to 2013 Nobel prize! Higgs vs Background Images courtesy CERN / LHC HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features (physics formulae) Train: 10M rows, Valid: 500k, Test: 500k rows 16

Deep Learning for Higgs Detection???? Former baseline for AUC: 0.733 and 0.816 H2O Algorithm low-level H2O AUC all features H2O AUC Generalized Linear Model 0.596 0.684 Random Forest 0.764 add derived features 0.840 Gradient Boosted Trees 0.753 0.839 Neural Net 1 hidden layer 0.760 0.830 H2O Deep Learning? 17

H2O Deep Learning Higgs Demo Deep DL model on 21 low-level features (no physics formula based extra features) 10 nodes: AUC = 0.84 after 40 mins AUC = 0.87+ after 8 hours valid 500k rows test 500k rows train 10M rows Deep Learning learns Physics! 18

H2O Deep Learning in the News http://www.datanami.com/2015/05/07/what-police-can-learn-from-deep-learning/ H http://www.slideshare.net/0xdata/crime-deeplearningkey 2 Alex, Michal, et al. 19

Weather + Census + Crime Data 20

Integration with Spark Ecosystem Spark RDD and H2O Frames share same JVM 21

Sparkling Water Demo Instructions at h2o.ai/download 22

Parse & Munge with H2O, Convert to RDD H2O Parser: Robust & Fast Simple Column Selection 23

Parse & Munge with H2O, Convert to RDD Munging: Date Manipulations Conversion to SchemaRDD 24

Join RDDs with SQL, Convert to H2O Spark SQL Query Execution Convert back to H2OFrame Split into Train 80% / Test 20% 25

Build H2O Deep Learning Model Train a H2O Deep Learning Model on Data obtained by Spark SQL Query Predict whether Arrest will be made with AUC of 0.90+ 26

Visualize Results with Flow Using Flow to interactively plot Arrest Rate (blue) vs Relative Occurrence (red) per crime type. 27

R s data.table now in H2O! 28

ipython Notebook CitiBike Demo Cliff et al. 29

Group-By Aggregation ipython Notebook Demo 30

Model Building And Scoring ipython Notebook Demo 91% AUC baseline 31

Joining Bikes-Per-Day with Weather Data Compression Summary: bitset 0/1 signed byte -128..127 unsigned byte 0..255 1 byte floating point (e.g., 0.493..0.684) short integers (2 bytes) sparse doubles dense double 32

Improved Models with Weather Data 93% AUC after joining bike and weather data 33

More Info in H2O Booklets https://leanpub.com/u/h2oai http://learn.h2o.ai 34

Competitive Data Science fingers crossed! Mark Landry (joined H2O!) will hold a master class on May 19! http://www.meetup.com/silicon-valley-big-data-science/events/222303884/ 35

Past Kaggle Starter Scripts still ongoing! 36

Hyper-Parameter Tuning 93 numerical features 9 output classes 62k training set rows 144k test set rows 37

Outlook - Algorithm Roadmap Ensembles (Erin LeDell et al.) Automatic Hyper-Parameter Tuning Convolutional Layers for Deep Learning Natural Language Processing: tf-idf, Word2Vec Generalized Low Rank Models PCA, SVD, K-Means, Matrix Factorization Recommender Systems And many more! Public JIRAs - Join H2O! 38

Key Take-Aways H2O is an open source predictive analytics platform for data scientists and business analysts who need scalable, fast and accurate machine learning. H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data! https://github.com/h2oai H2O Google Group http://h2o.ai @h2oai Thank You! 39

Questions? Please remember to evaluate via the GOTO Guide App