H2O in KNIME: Integrating High Performance Machine Learning Jo-Fai Chow (H2O.ai), Marten Pfannenschmidt (KNIME), Christian Dietz (KNIME) 2018 KNIME AG. All Rights Reserved.
H2O for High Performance Machine Learning 2018 KNIME AG. All Rights Reserved. 2
3
Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms https://www.kdnuggets.com/2018/02/gartner-2018-mq-data-science-machine-learning-changes.html 4
High Level Architecture HDFS H 2 O Compute Engine S3 NFS Load Data Distributed In-Memory Loss-less Compression Exploratory & Descriptive Analysis Feature Engineering & Selection Supervised & Unsupervised ing Evaluation & Selection Predict Data & Storage Local SQL Data Prep Export: Plain Old Java Object Export: Plain Old Java Object Production Scoring Environment Your Imagination 5
High Level Architecture Multiple Interfaces HDFS H 2 O Compute Engine S3 NFS Load Data Distributed In-Memory Loss-less Compression Exploratory & Descriptive Analysis Feature Engineering & Selection Supervised & Unsupervised ing Evaluation & Selection Predict Data & Storage Local SQL Data Prep Export: Plain Old Java Object Export: Plain Old Java Object Production Scoring Environment Your Imagination 6
Algorithms Overview Supervised Learning Unsupervised Learning Statistical Analysis Generalized Linear s: Binomial, Gaussian, Gamma, Poisson and Tweedie Naïve Bayes Clustering K-means: Partitions observations into k clusters/groups of the same spatial size. Automatically detect optimal k Ensembles Distributed Random Forest: Classification or regression models Gradient Boosting Machine: Produces an ensemble of decision trees with increasing refined approximations Dimensionality Reduction Principal Component Analysis: Linearly transforms correlated variables to independent components Generalized Low Rank s: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data Deep Neural Networks Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations Anomaly Detection Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning 7
Scientific Advisory Council 8
H 2 O on a Single Machine Building CPU H2O
H 2 O on a Multi-Node Cluster Firewall or Cloud Building SQL NFS H2O Distributed In-Memory S3 CPU CPU CPU YARN
Foundation for Distributed Algorithms Distributed Algorithms Parallel Parse into Distributed Rows Fine Grain Map Reduce Illustration: Scalable Distributed Histogram Calculation for GBM Advantageous Foundation Foundation for In-Memory Distributed Algorithm Calculation - Distributed Data Frames and columnar compression All algorithms are distributed in H 2 O: GBM, GLM, DRF, Deep Learning and more. Fine-grained map-reduce iterations. Only enterprise-grade, open-source distributed algorithms in the market User Benefits Out-of-box functionalities for all algorithms (NO MORE SCRIPTING) and uniform interface across all languages: R, Python, Java Designed for all sizes of data sets, especially large data Highly optimized Java code for model exports In-house expertise for all algorithms 11
Performance Benchmark (GBM) by Szilard Pafka (https://github.com/szilard/benchm-ml) Time (s) H 2 O Gradient Boosting Machine (GBM) is the fastest algorithm in the test of 10M Samples. Area Under Curve (AUC) Comparable with xgboost Better than R/Spark MLLib AUC Benchmark results for other algorithms (Generalized Linear, Random Forest etc.) are available in Szilard s GitHub repository. X = million of samples 12
H2O in KNIME 2018 KNIME AG. All Rights Reserved. 13
H2O in KNIME Offer our users high-performance machine learning algorithms from H2O in KNIME Allow to mix & match with other KNIME functionality Data wrangling KNIME Analytics Platform functionality KNIME Big-Data Connectors Text Mining, Image Processing, Cheminformatics, and more! 2018 KNIME AG. All Rights Reserved. 14
H2O in KNIME Live Demo 2018 KNIME AG. All Rights Reserved. 15
Customer prediction with H2O in KNIME 2018 KNIME AG. All Rights Reserved. 16
Kaggle.com Get the data Build a model Make a submission 2018 KNIME AG. All Rights Reserved. 17
The competition Date Store ID Visitors Provided data: Number of visitors Reservations Store information Calendar date info 2016-01-01 ba937bf13d40fb24 28 2017-04-22 324f7c39a8410e7c 216 Date Store ID Visitors 2017-04-23 e8ed9335d0c38333? 2017-05-31 8f13ef0f5e8c64dd? 2018 KNIME AG. All Rights Reserved. 18
Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 19
Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 20
Data preparation with KNIME Nodes 2018 KNIME AG. All Rights Reserved. 21
The blended dataset Data used for model training Date Store ID Visitors Mean visitors Area Reservations Holiday Genre 2016-01-01 ba937bf13d40fb24 28 23.8 Area1 15 False Bar 2017-04-22 324f7c39a8410e7c 216 45.3 Area2 22 True Karaoke Data used for Kaggle prediction Date Store ID Visitors Mean visitors Area Reservations Holiday Genre 2017-04-23 e8ed9335d0c38333? 12.5 Area1 10 False Italian 2017-05-31 8f13ef0f5e8c64dd? 54.1 Area7 20 False Izakaya 2018 KNIME AG. All Rights Reserved. 22
Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 23
Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 24
ing with the new H2O nodes 2018 KNIME AG. All Rights Reserved. 25
ing with the new H2O nodes 2018 KNIME AG. All Rights Reserved. 26
ing with the new H2O nodes 2018 KNIME AG. All Rights Reserved. 27
ing with the new H2O nodes 2018 KNIME AG. All Rights Reserved. 28
Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment 2018 KNIME AG. All Rights Reserved. 29
Solving a Kaggle competition with KNIME and H2O Data preparation training optimization evaluation Deployment. 2018 KNIME AG. All Rights Reserved. 30
What s Cooking? 2018 KNIME AG. All Rights Reserved. 31
What s Cooking: Scoring with H2O MOJOs on Spark Scoring using MOJOs 2018 KNIME AG. All Rights Reserved. 32
What s Cooking: Scoring with H2O MOJOs on Spark New: Scoring using MOJOs on Spark 2018 KNIME AG. All Rights Reserved. 33
What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 34
What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 35
What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 36
What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 37
What s Cooking: H2O Sparkling Water in KNIME 2018 KNIME AG. All Rights Reserved. 38
2018 KNIME AG. All Rights Reserved. The KNIME trademark and logo and OPEN FOR INNOVATION trademark are used by KNIME AG under license from KNIME GmbH, and are registered in the United States. KNIME is also registered in Germany. 39