KNIME Big Data Workshop

KNIME Big Data Workshop
Tobias Kötter and Björn Lohrmann, KNIME
2016 KNIME.com AG. All Rights Reserved.

Variety, Volume, Velocity
- Variety: integrating heterogeneous data ... and tools
- Volume: from small files ... to distributed data repositories (Hadoop); bring the computation to the data
- Velocity: from distributing computationally heavy computations ... to real-time scoring of millions of records/sec

Variety

The KNIME Analytics Platform: Open for Every Data, Tool, and User
- Users: Data Scientist, Business Analyst
- External Data Connectors
- KNIME Analytics Platform: Native Data Access, Analysis, Visualization, and Reporting
- External and Legacy Tools
- Distributed / Cloud Execution

Data Integration

Integrating R and Python

Modular Integrations

Other Programming/Scripting Integrations

Volume

Database Extension
- Visually assemble complex SQL statements
- Connect to almost all JDBC-compliant databases
- Harness the power of your database within KNIME

In-Database Processing
- Operations are performed within the database
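
As a rough illustration of what "performed within the database" means, the sketch below uses Python's standard sqlite3 module rather than KNIME itself. The `sales` table and its columns are hypothetical. The first function lets the database do the grouping so only one row per group is transferred; the second pulls every row to the client and aggregates there.

```python
import sqlite3

def rows_per_region_in_db(conn):
    """Aggregate inside the database; only one row per group leaves it."""
    cur = conn.execute(
        "SELECT region, COUNT(*) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()

def rows_per_region_client_side(conn):
    """Contrast: pull every row to the client, then aggregate locally."""
    counts = {}
    for (region,) in conn.execute("SELECT region FROM sales"):
        counts[region] = counts.get(region, 0) + 1
    return sorted(counts.items())

# Hypothetical in-memory demo table (not from the workshop).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 7.5)],
)

in_db = rows_per_region_in_db(conn)
client = rows_per_region_client_side(conn)
```

Both functions return the same counts; the difference is how much data crosses the connection, which matters once the table no longer fits comfortably on the client.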

Tip
- SQL statements are logged in the KNIME log file

Database Port Types
- Database Connection Port (dark red): connection information and SQL statement
- Database JDBC Connection Port (light red): connection information
- Database Connection Ports can be connected to Database JDBC Connection Ports, but not vice versa

Database Connectors
- Nodes to connect to specific databases
  - Bundle the necessary JDBC drivers
  - Easy to use
  - DB-specific behavior/capability
- Hive and Impala connectors are part of the commercial KNIME Big Data Connectors extension
- General Database Connector
  - Can connect to any JDBC source
  - Register a new JDBC driver via the preferences page

Register JDBC Driver
- Register new JDBC drivers
- Increase the connection timeout for long-running database operations
- Open KNIME and go to File -> Preferences

Query Nodes
- Filter rows and columns
- Join tables/queries
- Extract samples
- Bin numeric columns
- Sort your data
- Write your own query
- Aggregate your data

Database GroupBy: Manual Aggregation
- Returns number of rows per group

Database GroupBy: Pattern-Based Aggregation
- Tick this option if the search pattern is a regular expression; otherwise it is treated as a string with wildcards ('*' and '?')
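
The difference between the two pattern modes can be sketched in plain Python. This is not KNIME's implementation: the function names are hypothetical, and matching against the full column name is an assumption. In wildcard mode '*' stands for any run of characters and '?' for exactly one character; in regex mode the pattern is used as written.

```python
import re

def compile_column_pattern(pattern, is_regex):
    """Return a compiled matcher for column names (hypothetical helper)."""
    if is_regex:
        return re.compile(pattern)
    # Wildcard mode: '*' matches any run of characters, '?' exactly one.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def matching_columns(columns, pattern, is_regex=False):
    """Select the columns whose whole name matches the pattern."""
    matcher = compile_column_pattern(pattern, is_regex)
    return [c for c in columns if matcher.fullmatch(c)]

columns = ["price_eur", "price_usd", "qty", "q1", "q2"]
wild = matching_columns(columns, "price_*")
single = matching_columns(columns, "q?")
regex = matching_columns(columns, r"q\d", is_regex=True)
```

Here `price_*` selects both price columns, while `q?` and the regular expression `q\d` select `q1` and `q2` but not `qty`.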

Database GroupBy: Type-Based Aggregation
- Matches all columns
- Matches all numeric columns

Database GroupBy: DB-Specific Aggregation Methods
- SQLite: 7 aggregation functions
- PostgreSQL: 25 aggregation functions

Database GroupBy: Custom Aggregation Function

Database Writing Nodes
- Create table as select
- Insert/append data
- Update values in table
- Delete rows from table

Performance Tip
- Increase the batch size in database manipulation nodes for better performance
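
Why the batch size matters can be seen in a small sketch with Python's sqlite3 module. The `measurements` table and the helper name are hypothetical, and this is not how KNIME's nodes are implemented. Each batch is one statement execution plus one commit; against a remote database that typically also means one network round trip, so fewer and larger batches reduce overhead.

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size):
    """Insert rows in chunks of batch_size; return the number of batches sent."""
    batches = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        conn.executemany("INSERT INTO measurements VALUES (?, ?)", chunk)
        conn.commit()
        batches += 1
    return batches

# Hypothetical in-memory demo table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, value REAL)")
rows = [(i, i * 0.5) for i in range(1000)]

small = insert_in_batches(conn, rows, batch_size=10)   # many small batches
large = insert_in_batches(conn, rows, batch_size=500)  # few large batches
total = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
```

An in-memory SQLite database has no network latency, so the sketch only counts batches; the timing benefit shows up with a real remote connection.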

KNIME Big Data Connectors
- Package required drivers/libraries for HDFS, Hive, and Impala access
- Preconfigured connectors
  - Hive
  - Cloudera Impala
- Extends the open source database integration

Hive/Impala Loader
- Batch upload a KNIME data table to Hive/Impala

HDFS File Handling
- New nodes
  - HDFS Connection
  - HDFS File Permission
- Utilize the existing remote file handling nodes
  - Upload/download files
  - Create/list directories
  - Delete files

HDFS File Handling

KNIME Spark Executor
- Based on Spark MLlib
- Scalable machine learning library
- Runs on Hadoop
- Algorithms for
  - Classification (decision tree, naïve Bayes, ...)
  - Regression (logistic regression, linear regression, ...)
  - Clustering (k-means)
  - Collaborative filtering (ALS)
  - Dimensionality reduction (SVD, PCA)

Familiar Usage Model
- Usage model and dialogs similar to existing nodes
- No coding required

MLlib Integration
- MLlib model ports for model transfer
- Native MLlib model learning and prediction
- Spark nodes start and manage Spark jobs, including Spark job cancelation
- Figure label: Native MLlib model

Data Stays Within Your Cluster
- Spark RDDs as input/output format
- Data stays within your cluster
- No unnecessary data movements
- Several input/output nodes, e.g. Hive, HDFS files, ...

Machine Learning: Unsupervised Learning Example

Machine Learning: Supervised Learning Example

Mass Learning - Fast Event Prediction
- Convert supported MLlib models to PMML

Sophisticated Learning - Mass Prediction
- Supports KNIME models and pre-processing steps

Closing the Loop
- Apply model on demand
- Learn model at scale
- PMML model
- MLlib model
- Sophisticated model learning
- Apply model at scale

Mix and Match
- Combine with existing KNIME nodes such as loops

Modularize and Execute Your Own Spark Code

Lazy Evaluation in Spark
- Transformations are lazy
- Actions trigger evaluation
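
The two bullets above can be illustrated with a toy analogy in plain Python. The `LazyDataset` class is hypothetical and is not the Spark API; it only mimics the idea that transformations (`map`, `filter`) record what should happen, while actions (`collect`, `count`) actually run the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: record the step and return a new dataset; no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: run the recorded steps and return a concrete result.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())

calls = []

def traced_square(x):
    calls.append(x)  # record every real evaluation
    return x * x

ds = LazyDataset(range(5)).map(traced_square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # still zero: nothing has been computed
result = ds.collect()             # the action triggers evaluation
```

Building `ds` calls `traced_square` zero times; only `collect()` walks the data. In Spark the same split lets the engine plan a whole chain of transformations before touching the cluster.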

Spark Node Overview

KNIME Big Data Architecture
- KNIME Analytics Platform with extensions: KNIME Big Data Connectors, KNIME Big Data Executor for Spark
- KNIME Server with extensions: KNIME Big Data Connectors, KNIME Big Data Executor for Spark
- Hadoop Cluster: HiveServer2, Impala, Spark Job Server*
- Build Spark workflows graphically
- Scheduled execution and RESTful workflow submission
- Workflow upload via HTTP(S)
- Submit Hive queries via JDBC
- Submit Impala queries via JDBC
- Submit Spark jobs via HTTP(S)
* Software provided by KNIME, based on https://github.com/spark-jobserver/spark-jobserver

Velocity

Velocity
- Computationally Heavy Analytics:
  - Distributed Execution of one workflow branch
  - Parallel Execution of workflow branches
  - Parallel Execution of workflows
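
Parallel execution of independent branches can be sketched with Python's concurrent.futures. This is only an analogy, not how KNIME schedules workflows: the branch functions and the helper name are hypothetical, and a thread pool stands in for whatever executors are actually available.

```python
from concurrent.futures import ThreadPoolExecutor

def branch_sum(values):
    """First hypothetical branch: total of all values."""
    return sum(values)

def branch_max(values):
    """Second hypothetical branch: largest value."""
    return max(values)

def run_branches_in_parallel(values, branches):
    """Run independent branches concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, values) for name, fn in branches.items()}
        return {name: fut.result() for name, fut in futures.items()}

data = [3, 1, 4, 1, 5, 9, 2, 6]
results = run_branches_in_parallel(data, {"sum": branch_sum, "max": branch_max})
```

The branches share the same input but do not depend on each other, which is the property that makes it safe to run them side by side.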

KNIME Cluster Executor: Distributed Data

KNIME Cluster Execution: Distributed Analytics

KNIME Batch Executors on Hadoop (via Oozie)
- Hadoop Master Node
- Hadoop Worker Node with YARN Container
- REST Client: submit workflow parameters via REST
- Oozie Server: executes
- KNIME Workflow Designer: designs and uploads workflow
- YARN, HDFS

KNIME Executors on Hadoop (via YARN)
- KNIME Workflow Users, REST Client
- KNIME Workflow Designer
- Hadoop Worker Nodes, each with a YARN Container
- KNIME Server
- YARN
- Arrow labels: Execute workflow, Executes, Executes

Velocity
- Computationally Heavy Analytics:
  - Distributed Execution of one workflow branch
  - Parallel Execution of workflow branches
  - Parallel Execution of workflows
- High Demand Scoring/Prediction:
  - High Performance Scoring using generic Workflows
  - High Performance Scoring of Predictive Models

High Performance Scoring via Workflows
- Record (or small batch) based processing
- Exposed as RESTful web service
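
Record or small-batch processing can be sketched as follows. Everything here is hypothetical: the linear scorer, its weights, and the helper names are stand-ins for whatever the deployed workflow computes, and the REST layer is left out entirely.

```python
from itertools import islice

def small_batches(records, batch_size):
    """Yield consecutive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def score_batch(batch, weights, bias):
    """Hypothetical linear scorer applied to every record in one batch."""
    return [sum(w * x for w, x in zip(weights, rec)) + bias for rec in batch]

records = [(1.0, 2.0), (0.0, 1.0), (3.0, 0.5), (2.0, 2.0), (1.5, 1.5)]
scores = []
batch_sizes = []
for batch in small_batches(records, batch_size=2):
    batch_sizes.append(len(batch))
    scores.extend(score_batch(batch, weights=(2.0, 1.0), bias=0.5))
```

Small batches keep per-request latency low while still amortizing per-call overhead across a few records; a batch size of one gives pure record-by-record scoring.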

High Performance Scoring using Models
- KNIME PMML Scoring via compiled PMML
  - Deployed on KNIME Server
  - Exposed as RESTful web service
- Partnership with Zementis
  - ADAPA Real Time Scoring
  - UPPI Big Data Scoring Engine

Velocity
- Computationally Heavy Analytics:
  - Distributed Execution of one workflow branch
  - Parallel Execution of workflow branches
  - Parallel Execution of workflows
- High Demand Scoring/Prediction:
  - High Performance Scoring using generic Workflows
  - High Performance Scoring of Predictive Models
- Continuous Data Streams
  - Streaming in KNIME
  - Spark Streaming

Streaming in KNIME

Spark Streaming

Big Data, IoT, and the Three Vs
- Variety: KNIME is inherently well-suited
  - Open platform
  - Broad data source/type support
  - Extensive tool integration
- Volume:
  - Bring the computation to the data
  - Big Data Extensions cover ETL and model learning
- Velocity:
  - Distributed Execution of heavy workflows
  - High Performance Scoring of predictive models
  - Streaming execution

Demo

Virtual Machines
- Hortonworks: http://hortonworks.com/products/hortonworks-sandbox/
- Cloudera: http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms.html
- A KNIME VM based on the Hortonworks sandbox with preinstalled Spark Job Server is coming soon

Resources
- SQL Syntax and Examples (www.w3schools.com)
- Apache Spark MLlib (http://spark.apache.org/mllib/)
- The KNIME Website (www.knime.org)
  - Database Documentation (https://tech.knime.org/databasedocumentation)
  - Big Data Extensions (https://www.knime.org/knime-big-dataextensions)
  - Forum (tech.knime.org/forum)
  - LEARNING HUB under RESOURCES (www.knime.org/learning-hub)
  - Blog for news, tips and tricks (www.knime.org/blog)
- KNIME TV channel
- @KNIME

The KNIME trademark and logo and OPEN FOR INNOVATION trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States. KNIME is also registered in Germany.