KNIME Big Data Workshop Tobias Kötter and Björn Lohrmann KNIME 2016 KNIME.com AG. All Rights Reserved.
Variety, Volume, Velocity Variety: integrating heterogeneous data.. and tools Volume: from small files......to distributed data repositories (Hadoop) bring the computation to the data Velocity: from distributing computationally heavy computations......to real time scoring of millions of records/sec. 2016 KNIME.com AG. All Rights Reserved. 2 2
Variety 2016 KNIME.com AG. All Rights Reserved. 3
The KNIME Analytics Platform: Open for Every Data, Tool, and User Data Scientist Business Analyst External Data Connectors KNIME Analytics Platform Native Data Access, Analysis, Visualization, and Reporting External and Legacy Tools Distributed / Cloud Execution 2016 KNIME.com AG. All Rights Reserved. 4
Data Integration 2016 KNIME.com AG. All Rights Reserved. 5
Integrating R and Python 2016 KNIME.com AG. All Rights Reserved. 6
Modular Integrations 2016 KNIME.com AG. All Rights Reserved. 7
Other Programming/Scripting Integrations 2016 KNIME.com AG. All Rights Reserved. 8
Volume 2016 KNIME.com AG. All Rights Reserved. 9
Database Extension Visually assemble complex SQL statements Connect to almost all JDBC-compliant databases Harness the power of your database within KNIME 2016 KNIME.com AG. All Rights Reserved. 10
In-Database Processing Operations are performed within the database 2016 KNIME.com AG. All Rights Reserved. 11
Tip SQL statements are logged in KNIME log file 2016 KNIME.com AG. All Rights Reserved. 12
Database Port Types Database Connection Port (dark red) Connection information SQL statement Database JDBC Connection Port (light red) Connection information Database Connection Ports can be connected to Database JDBC Connection Ports but not vice versa 2016 KNIME.com AG. All Rights Reserved. 13
Database Connectors Nodes to connect to specific Databases Bundling necessary JDBC drivers Easy to use DB specific behavior/capability Hive and Impala connector part of the commercial KNIME Big Data Connectors extension General Database Connector Can connect to any JDBC source Register new JDBC driver via preferences page 2016 KNIME.com AG. All Rights Reserved. 14
Register JDBC Driver Register new JDBC drivers Increase connection timeout for long running database operations Open KNIME and go to File -> Preferences 2016 KNIME.com AG. All Rights Reserved. 15
Query Nodes Filter rows and columns Join tables/queries Extract samples Bin numeric columns Sort your data Write your own query Aggregate your data 2016 KNIME.com AG. All Rights Reserved. 16
Database GroupBy Manual Aggregation Returns number of rows per group 2016 KNIME.com AG. All Rights Reserved. 17
Database GroupBy Pattern Based Aggregation Tick this option if the search pattern is a regular expression otherwise it is treated as string with wildcards ('*' and '?') 2016 KNIME.com AG. All Rights Reserved. 18
Database GroupBy Type Based Aggregation Matches all columns Matches all numeric columns 2016 KNIME.com AG. All Rights Reserved. 19
Database GroupBy DB Specific Aggregation Methods SQLite 7 aggregation functions PostgreSQL 25 aggregation functions 2016 KNIME.com AG. All Rights Reserved. 20
Database GroupBy Custom Aggregation Function 2016 KNIME.com AG. All Rights Reserved. 21
Database Writing Nodes Create table as select Insert/append data Update values in table Delete rows from table 2016 KNIME.com AG. All Rights Reserved. 22
Performance Tip Increase batch size in database manipulation nodes Increase batch size for better performance 2016 KNIME.com AG. All Rights Reserved. 23
KNIME Big Data Connectors Package required drivers/libraries for HDFS, Hive, Impala access Preconfigured connectors Hive Cloudera Impala Extends the open source database integration 2016 KNIME.com AG. All Rights Reserved. 24
Hive/Impala Loader Batch upload a KNIME data table to Hive/Impala 2016 KNIME.com AG. All Rights Reserved. 25
HDFS File Handling New nodes HDFS Connection HDFS File Permission Utilize the existing remote file handling nodes Upload/download files Create/list directories Delete files 2016 KNIME.com AG. All Rights Reserved. 26
HDFS File Handling 2016 KNIME.com AG. All Rights Reserved. 27
KNIME Spark Executor Based on Spark MLlib Scalable machine learning library Runs on Hadoop Algorithms for Classification (decision tree, naïve bayes, ) Regression (logistic regression, linear regression, ) Clustering (k-means) Collaborative filtering (ALS) Dimensionality reduction (SVD, PCA) 2016 KNIME.com AG. All Rights Reserved. 28
Familiar Usage Model Usage model and dialogs similar to existing nodes No coding required 2016 KNIME.com AG. All Rights Reserved. 29
MLlib Integration MLlib model ports for model transfer Native MLlib model learning and prediction Spark nodes start and manage Spark jobs Including Spark job cancelation Native MLlib model 2016 KNIME.com AG. All Rights Reserved. 30
Data Stays Within Your Cluster Spark RDDs as input/output format Data stays within your cluster No unnecessary data movements Several input/output nodes e.g. Hive, hdfs files, 2016 KNIME.com AG. All Rights Reserved. 31
Machine Learning Unsupervised Learning Example 2016 KNIME.com AG. All Rights Reserved. 32
Machine Learning Supervised Learning Example 2016 KNIME.com AG. All Rights Reserved. 33
Mass Learning Fast Event Prediction Convert supported MLlib models to PMML 2016 KNIME.com AG. All Rights Reserved. 34
Sophisticated Learning - Mass Prediction Supports KNIME models and pre-processing steps 2016 KNIME.com AG. All Rights Reserved. 35
Closing the Loop Apply model on demand Learn model at scale PMML model MLlib model Sophisticated model learning Apply model at scale 2016 KNIME.com AG. All Rights Reserved. 36
Mix and Match Combine with existing KNIME nodes such as loops 2016 KNIME.com AG. All Rights Reserved. 37
Modularize and Execute Your Own Spark Code 2016 KNIME.com AG. All Rights Reserved. 38
Lazy Evaluation in Spark Transformations are lazy Actions trigger evaluation 2016 KNIME.com AG. All Rights Reserved. 39
Spark Node Overview 2016 KNIME.com AG. All Rights Reserved. 40
Spark Job Server * Hiveserver 2 Impala KNIME Big Data Architecture Scheduled execution and RESTful workflow submission Submit Impala queries via JDBC Hadoop Cluster Build Spark workflows graphically KNIME Server with extensions: KNIME Big Data Connectors KNIME Big Data Executor for Spark Workflow Upload via HTTP(S) Submit Hive queries via JDBC Submit Spark jobs via HTTP(S) KNIME Analytics Platform with extensions: KNIME Big Data Connectors KNIME Big Data Executor for Spark *Software provided by KNIME, based on https://github.com/spark-jobserver/spark-jobserver 2016 KNIME.com AG. All Rights Reserved. 41
Velocity 2016 KNIME.com AG. All Rights Reserved. 42
Velocity Computationally Heavy Analytics: Distributed Execution of one workflow branch Parallel Execution of workflow branches Parallel Execution of workflows 2016 KNIME.com AG. All Rights Reserved. 43
KNIME Cluster Executor: Distributed Data 2016 KNIME.com AG. All Rights Reserved. 44
KNIME Cluster Execution: Distributed Analytics 2016 KNIME.com AG. All Rights Reserved. 45
KNIME Batch Executors on Hadoop (via Oozie) Hadoop Master Node Hadoop Worker Node YARN Container REST Client Submit workflow parameters via REST Oozie Server Executes KNIME Workflow Designer Designs and uploads workflow YARN HDFS 2016 KNIME.com AG. All Rights Reserved. 46
KNIME Executors on Hadoop (via YARN) KNIME Workflow Users REST Client KNIME Workflow Designer Hadoop Worker Node YARN Container Hadoop Worker Node YARN Container Execute workflow Executes Executes KNIME Server YARN 2016 KNIME.com AG. All Rights Reserved. 47
Velocity Computationally Heavy Analytics: Distributed Execution of one workflow branch Parallel Execution of workflow branches Parallel Execution of workflows High Demand Scoring/Prediction: High Performance Scoring using generic Workflows High Performance Scoring of Predictive Models 2016 KNIME.com AG. All Rights Reserved. 48
High Performance Scoring via Workflows Record (or small batch) based processing Exposed as RESTful web service 2016 KNIME.com AG. All Rights Reserved. 49
High Performance Scoring using Models KNIME PMML Scoring via compiled PMML Deployed on KNIME Server Exposed as RESTful web service Partnership with Zementis ADAPA Real Time Scoring UPPI Big Data Scoring Engine 2016 KNIME.com AG. All Rights Reserved. 50
Velocity Computationally Heavy Analytics: Distributed Execution of one workflow branch Parallel Execution of workflow branches Parallel Execution of workflows High Demand Scoring/Prediction: High Performance Scoring using generic Workflows High Performance Scoring of Predictive Models Continuous Data Streams Streaming in KNIME Spark Streaming 2016 KNIME.com AG. All Rights Reserved. 51
Streaming in KNIME 2016 KNIME.com AG. All Rights Reserved. 52
Spark Streaming 2016 KNIME.com AG. All Rights Reserved. 53
Big Data, IoT, and the three V Variety: KNIME inherently well-suited: open platform broad data source/type support extensive tool integration Volume: Bring the computation to the data Big Data Extensions cover ETL and model learning Velocity: Distributed Execution of heavy workflows High Performance Scoring of predictive models Streaming execution 2016 KNIME.com AG. All Rights Reserved. 54
Demo 2016 KNIME.com AG. All Rights Reserved. 55
Virtual Machines Hortonworks: http://hortonworks.com/products/hortonworks-sandbox/ Cloudera: http://www.cloudera.com/content/cloudera/en/downloa ds/quickstart_vms.html KNIME VM based on Hortonworks sandbox with preinstalled Spark Job Server is coming soon 2016 KNIME.com AG. All Rights Reserved. 56
Resources SQL Syntax and Examples (www.w3schools.com) Apache Spark MLlib (http://spark.apache.org/mllib/) The KNIME Website (www.knime.org) Database Documentation (https://tech.knime.org/databasedocumentation) Big Data Extensions (https://www.knime.org/knime-big-dataextensions) Forum (tech.knime.org/forum) LEARNING HUB under RESOURCES (www.knime.org/learning-hub) Blog for news, tips and tricks (www.knime.org/blog) KNIME TV channel on KNIME on @KNIME 2016 KNIME.com AG. All Rights Reserved. 57
The KNIME trademark and logo and OPEN FOR INNOVATION trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States. KNIME is also registered in Germany. 2016 KNIME.com AG. All Rights Reserved. 58