H2O in KNIME: Integrating High Performance Machine Learning

Similar documents
Fast Analytics on Big Data with H20

KNIME Big Data Workshop

What s Cooking in KNIME

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Data Science in Action

Machine Learning with MATLAB David Willingham Application Engineer

Data Mining. Nonlinear Classification

Azure Machine Learning, SQL Data Mining and R

Predictive Data modeling for health care: Comparative performance study of different prediction models

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Predictive Modeling Techniques in Insurance

Spark and the Big Data Library

DATA ANALYTICS USING R

SEIZE THE DATA SEIZE THE DATA. 2015

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

High Performance Predictive Analytics in R and Hadoop:

Scalable Data Science and Deep Learning with H2O. Arno Candel, PhD Chief Architect, H2O.ai

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Advanced analytics at your hands

Bayesian networks - Time-series models - Apache Spark & Scala

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Scalable Data Science with Hadoop, Spark and R. Mario Inchiosa, PhD Principal Software Engineer Microsoft Data Group DSC 2016 July 2, 2016

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

The Internet of Things and Big Data: Intro

April 2016 JPoint Moscow, Russia. How to Apply Big Data Analytics and Machine Learning to Real Time Processing. Kai Wähner.

R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution

What s new in SAS Enterprise Miner 13.1 and Beyond SAS Talks February 20, 2014

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Oracle Data Miner (Extension of SQL Developer 4.0)

Advanced Data Science on Spark

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Car Insurance. Prvák, Tomi, Havri

Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis

How To Make A Credit Risk Model For A Bank Account

Understanding the Benefits of IBM SPSS Statistics Server

Analytics on Big Data

The Data Mining Process

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

Learning outcomes. Knowledge and understanding. Competence and skills

Microsoft Azure Machine learning Algorithms

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

MACHINE LEARNING AN INTRODUCTION

Knowledge Discovery and Data Mining

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Scalable Developments for Big Data Analytics in Remote Sensing

Advanced In-Database Analytics

Data Mining: Overview. What is Data Mining?

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

MHI3000 Big Data Analytics for Health Care Final Project Report

HT2015: SC4 Statistical Data Mining and Machine Learning

ANALYTICS CENTER LEARNING PROGRAM

Classification of Bad Accounts in Credit Card Industry

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

IST 557 Final Project

Learning outcomes. Knowledge and understanding. Ability and Competences. Evaluation capability and scientific approach

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

MACHINE LEARNING BRETT WUJEK, SAS INSTITUTE INC.

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

Journée Thématique Big Data 13/03/2015

Data Mining Applications in Higher Education

Unified Big Data Processing with Apache Spark. Matei

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet

The Use of Open Source Is Growing. So Why Do Organizations Still Turn to SAS?

SEIZE THE DATA SEIZE THE DATA. 2015

extreme Datamining mit Oracle R Enterprise

Data Mining is the process of knowledge discovery involving finding

Role Description. Position of a Data Scientist Machine Learning at Fractal Analytics

Big Data and Data Science: Behind the Buzz Words

Big Data and Marketing

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

BIG DATA What it is and how to use?

Processing of Big Data. Nelson L. S. da Fonseca IEEE ComSoc Summer Scool Trento, July 9 th, 2015

Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Machine learning for algo trading

Data Mining mit der JMSL Numerical Library for Java Applications

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Model Deployment. Dr. Saed Sayad. University of Toronto

Challenges for Data Driven Systems

Qlik Sense scalability

Predict Influencers in the Social Network

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Using multiple models: Bagging, Boosting, Ensembles, Forests

What is Data Science? Data, Databases, and the Extraction of Knowledge Renée November 2014

Transcription:

H2O in KNIME: Integrating High Performance Machine Learning Jo-Fai Chow (H2O.ai), Marten Pfannenschmidt (KNIME), Christian Dietz (KNIME) 2018 KNIME AG. All Rights Reserved.

H2O for High Performance Machine Learning 2018 KNIME AG. All Rights Reserved. 2

3

Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms https://www.kdnuggets.com/2018/02/gartner-2018-mq-data-science-machine-learning-changes.html 4

High Level Architecture HDFS H 2 O Compute Engine S3 NFS Load Data Distributed In-Memory Loss-less Compression Exploratory & Descriptive Analysis Feature Engineering & Selection Supervised & Unsupervised ing Evaluation & Selection Predict Data & Storage Local SQL Data Prep Export: Plain Old Java Object Export: Plain Old Java Object Production Scoring Environment Your Imagination 5

High Level Architecture Multiple Interfaces HDFS H 2 O Compute Engine S3 NFS Load Data Distributed In-Memory Loss-less Compression Exploratory & Descriptive Analysis Feature Engineering & Selection Supervised & Unsupervised ing Evaluation & Selection Predict Data & Storage Local SQL Data Prep Export: Plain Old Java Object Export: Plain Old Java Object Production Scoring Environment Your Imagination 6

Algorithms Overview Supervised Learning Unsupervised Learning Statistical Analysis Generalized Linear s: Binomial, Gaussian, Gamma, Poisson and Tweedie Naïve Bayes Clustering K-means: Partitions observations into k clusters/groups of the same spatial size. Automatically detect optimal k Ensembles Distributed Random Forest: Classification or regression models Gradient Boosting Machine: Produces an ensemble of decision trees with increasing refined approximations Dimensionality Reduction Principal Component Analysis: Linearly transforms correlated variables to independent components Generalized Low Rank s: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data Deep Neural Networks Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations Anomaly Detection Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning 7

Scientific Advisory Council 8

H 2 O on a Single Machine Building CPU H2O

H 2 O on a Multi-Node Cluster Firewall or Cloud Building SQL NFS H2O Distributed In-Memory S3 CPU CPU CPU YARN

Foundation for Distributed Algorithms Distributed Algorithms Parallel Parse into Distributed Rows Fine Grain Map Reduce Illustration: Scalable Distributed Histogram Calculation for GBM Advantageous Foundation Foundation for In-Memory Distributed Algorithm Calculation - Distributed Data Frames and columnar compression All algorithms are distributed in H 2 O: GBM, GLM, DRF, Deep Learning and more. Fine-grained map-reduce iterations. Only enterprise-grade, open-source distributed algorithms in the market User Benefits Out-of-box functionalities for all algorithms (NO MORE SCRIPTING) and uniform interface across all languages: R, Python, Java Designed for all sizes of data sets, especially large data Highly optimized Java code for model exports In-house expertise for all algorithms 11

Performance Benchmark (GBM) by Szilard Pafka (https://github.com/szilard/benchm-ml) Time (s) H 2 O Gradient Boosting Machine (GBM) is the fastest algorithm in the test of 10M Samples. Area Under Curve (AUC) Comparable with xgboost Better than R/Spark MLLib AUC Benchmark results for other algorithms (Generalized Linear, Random Forest etc.) are available in Szilard s GitHub repository. X = million of samples 12

H2O in KNIME 2018 KNIME AG. All Rights Reserved. 13

H2O in KNIME Offer our users high-performance machine learning algorithms from H2O in KNIME Allow to mix & match with other KNIME functionality Data wrangling KNIME Analytics Platform functionality KNIME Big-Data Connectors Text Mining, Image Processing, Cheminformatics, and more! 2018 KNIME AG. All Rights Reserved. 14

H2O in KNIME Live Demo 2018 KNIME AG. All Rights Reserved. 15

Customer prediction with H2O in KNIME 2018 KNIME AG. All Rights Reserved. 16

Kaggle.com Get the data Build a model Make a submission 2018 KNIME AG. All Rights Reserved. 17

The competition Date Store ID Visitors Provided data: Number of visitors Reservations Store information Calendar date info 2016-01-01 ba937bf13d40fb24 28 2017-04-22 324f7c39a8410e7c 216 Date Store ID Visitors 2017-04-23 e8ed9335d0c38333? 2017-05-31 8f13ef0f5e8c64dd? 2018 KNIME AG. All Rights Reserved. 18

Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 19

Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 20

Data preparation with KNIME Nodes 2018 KNIME AG. All Rights Reserved. 21

The blended dataset Data used for model training Date Store ID Visitors Mean visitors Area Reservations Holiday Genre 2016-01-01 ba937bf13d40fb24 28 23.8 Area1 15 False Bar 2017-04-22 324f7c39a8410e7c 216 45.3 Area2 22 True Karaoke Data used for Kaggle prediction Date Store ID Visitors Mean visitors Area Reservations Holiday Genre 2017-04-23 e8ed9335d0c38333? 12.5 Area1 10 False Italian 2017-05-31 8f13ef0f5e8c64dd? 54.1 Area7 20 False Izakaya 2018 KNIME AG. All Rights Reserved. 22

Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 23

Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 24

ing with the new H2O nodes 2018 KNIME AG. All Rights Reserved. 25

ing with the new H2O nodes 2018 KNIME AG. All Rights Reserved. 26

ing with the new H2O nodes 2018 KNIME AG. All Rights Reserved. 27

ing with the new H2O nodes 2018 KNIME AG. All Rights Reserved. 28

Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 29

Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment. 2018 KNIME AG. All Rights Reserved. 30

What s Cooking? 2018 KNIME AG. All Rights Reserved. 31

What s Cooking: Scoring with H2O MOJOs on Spark Scoring using MOJOs 2018 KNIME AG. All Rights Reserved. 32

What s Cooking: Scoring with H2O MOJOs on Spark New: Scoring using MOJOs on Spark 2018 KNIME AG. All Rights Reserved. 33

What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 34

What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 35

What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 36

What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 37

What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 38

2018 KNIME AG. All Rights Reserved. The KNIME trademark and logo and OPEN FOR INNOVATION trademark are used by KNIME AG under license from KNIME GmbH, and are registered in the United States. KNIME is also registered in Germany. 39