Harnessing Big Data with KNIME



Similar documents
KNIME opens the Doors to Big Data. A Practical example of Integrating any Big Data Platform into KNIME

What s Cooking in KNIME

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Hadoop & SAS Data Loader for Hadoop

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

How To Create A Data Visualization With Apache Spark And Zeppelin

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Hadoop Ecosystem B Y R A H I M A.

Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering


The Internet of Things and Big Data: Intro

Data processing goes big

Constructing a Data Lake: Hadoop and Oracle Database United!

HPE Vertica & Hadoop. Tapping Innovation to Turbocharge Your Big Data. #SeizeTheData

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Lofan Abrams Data Services for Big Data Session # 2987

Big Data and Market Surveillance. April 28, 2014

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Data and Machine Architecture for the Data Science Lab Workflow Development, Testing, and Production for Model Training, Evaluation, and Deployment

SEIZE THE DATA SEIZE THE DATA. 2015

Best Practices for Hadoop Data Analysis with Tableau

Hadoop Job Oriented Training Agenda

Creating a universe on Hive with Hortonworks HDP 2.0

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Predictive Analytics at the Speed of Business

What's New in SAS Data Management

Unified Big Data Analytics Pipeline. 连 城

Workshop on Hadoop with Big Data

1. The orange button 2. Audio Type 3. Close apps 4. Enlarge my screen 5. Headphones 6. Questions Pane. SparkR 2

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Moving From Hadoop to Spark

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Hadoop & Spark Using Amazon EMR

How Companies are! Using Spark

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Bringing the Power of SAS to Hadoop. White Paper

Ensembles and PMML in KNIME

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Oracle Big Data SQL Technical Update

#TalendSandbox for Big Data

Predictive Analytics: Seeing the Whole Picture

Why Big Data in the Cloud?

Connecting Hadoop with Oracle Database

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

Next Generation Data Mining. Data Mining Automation & Realtime-Scoring "on-the-cloud.

Supported Platforms. HP Vertica Analytic Database. Software Version: 7.0.x

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Big Data Too Big To Ignore

Trafodion Operational SQL-on-Hadoop

Native Connectivity to Big Data Sources in MSTR 10

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

HADOOP IN ENTERPRISE FUTURE-PROOF YOUR BIG DATA INVESTMENTS WITH CASCADING. Supreet Oberoi Nov. 4-6, 2014 Big Data Expo Santa Clara

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Cloudera Certified Developer for Apache Hadoop

Big Data Analytics with Cassandra, Spark & MLLib

Agenda. ! Strengths of PostgreSQL. ! Strengths of Hadoop. ! Hadoop Community. ! Use Cases

Integrating Apache Spark with an Enterprise Data Warehouse

Actian Vortex Express 3.0

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Actian Analytics Platform Express Hadoop SQL Edition 2.0

Fast Innovation requires Fast IT

Unified Big Data Processing with Apache Spark. Matei

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Big Data Course Highlights

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Real Time Big Data Processing

Integrate Master Data with Big Data using Oracle Table Access for Hadoop

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

HADOOP. Revised 10/19/2015

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

The Inside Scoop on Hadoop

Plug-In for Informatica Guide

Creating Big Data Applications with Spring XD

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

project collects data from national events, both natural and manmade, to be stored and evaluated by

and Hadoop Technology

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Easy Execution of Data Mining Models through PMML

Information Architecture

How To Scale Out Of A Nosql Database

From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten

APPROACHABLE ANALYTICS MAKING SENSE OF DATA

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Oracle Big Data Essentials

Cloudera Manager Training: Hands-On Exercises

Tungsten Replicator, more open than ever!

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

RUN BETTER SAP AG. All rights reserved. 1

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

BIG DATA SOLUTION DATA SHEET

Using Normalized Status Change Events Data in Business Intelligence. Mark C. Cooke, Ph.D. Tax Management Associates, Inc.

Transcription:

Harnessing Big Data with KNIME Tobias Kötter KNIME.com

Agenda The three V s of Big Data Big Data Extension and Databases Nodes Demo 2

Variety, Volume, Velocity Variety: integrating heterogeneous data (and tools) Volume: from small files......to distributed data repositories (Hadoop) bring the tools to the data Velocity: from distributing computationally heavy computations......to real time scoring of millions of records/sec. 3

Variety 4

Variety Data Integration Small (Ascii) Proprietary (XLS, SAS...) Medium (Databases) Large (Hive, Impala, ParStream, HP Vertica...) Diverse (Numbers, Texts, Images, Networks, Sequences...) Tool Integration Native Legacy, Inhouse R, Python, Matlab,...

Volume 6

Every Minute 7

IoT 8

Big Data Support KNIME Database Nodes in database processing preconfigured connectors KNIME Big Data Extension package required drivers/libraries for specific HDFS, Hive, Impala access Spark MLlib integration (coming soon)

Velocity 10

Velocity Computationally Heavy Analytics: Distributed Execution of one workflow branch Parallel Execution of workflow branches Hosted Analytics/Prediction Web service Deployment of Workflows High Demand Scoring/Prediction: Continuous Execution of Workflow parts High Performance Scoring using generic Workflows High Performance Scoring of Predictive Models

KNIME Cluster Execution: Distributed Data

KNIME Cluster Execution: Distributed Analytics

Deployed Workflows Application Access Custom API WSDL/SOAP based

Continuous Scoring using Workflows Exposes workflow fragment as RESTful web service Deployed on KNIME Server (v4.0 1H2015)

High Performance Scoring via Workflows Streaming Executor Deployed via KNIME Server (v4.1 2H2015/2016) Record (or small batch) based processing Exposed as RESTful web service

High Performance Scoring using Models Deployed on KNIME Server (v4.0 1H2015) KNIME PMML Scoring via compiled PMML Exposed as RESTful web service Partnership with Zementis ADAPA Real Time Scoring UPPI Big Data Scoring Engine

Big Data, IoT, and the three V Variety: KNIME inherently well-suited: open platform broad data source/type support extensive tool integration Volume: Big Data Extensions cover Hadoop based data integration and aggregation Big Data Executors will address model building and streaming execution Velocity: Distributed Execution of heavy workflows to... High Performance Scoring of predictive models.

Big Data Extension and Database Nodes 19

Database Port Types Database Connection Port (dark red) Connection information SQL statement Database JDBC Connection Port (light red) Connection information Database Connection Ports can be connected to Database JDBC Connection Ports but not vice versa 20

Database JDBC Connection Port View 21

Database Connection Port View Copy SQL statement 22

Database Connectors Nodes to connect to specific Databases Bundling necessary JDBC drivers Easy to use DB specific behavior/capability Hive and Impala connector part of the commercial Big Data Extension General Database Connector Can connect to any JDBC source Register new JDBC driver via preferences page 23

Register JDBC Driver Increase connection timeout for long running database operations Open KNIME and go to File -> Preferences 24

Reader/Writer Table selection Load data into KNIME Create table as select Insert/append data Delete rows from table Update values in table 25

Hive/Impala Loader Upload a KNIME data table to Hive/Impala Part of the commercial Big Data Extension 26

Hive/Impala Loader Partitioning influences performance 27

Manipulation Filter rows and columns Join tables/queries Sort your data Write your own query Aggregate your data 28

Database GroupBy DB Specific Aggregation Methods SQLite 7 aggregation functions PostgreSQL 25 aggregation functions 29

Database GroupBy Aggregation Method Description 30

Database GroupBy Manual Aggregation Returns number of rows per group 31

Database GroupBy Pattern Based Aggregation Tick this option if the search pattern is a regular expression otherwise it is treated as string with wildcards ('*' and '?') 32

Database GroupBy Type Based Aggregation Matches all cells Matches all numeric cells 33

Database GroupBy Custom Aggregation Function 34

Utility Drop table missing table handling cascade option Execute any SQL statement e.g. DDL Manipulate existing queries Executes several queries separated by ; and new line 35

In-Database Processing Loads your pre-processed data into KNIME 36

HDFS File Handling KNIME & Extensions -> KNIME File Handling Nodes HDFS Connection and HDFS File Permission nodes part of the commercial Big Data Extension 37

HDFS File Handling 38

Virtual Machines Hortonworks: http://hortonworks.com/products/hortonworks-sandbox/ Cloudera: http://www.cloudera.com/content/cloudera/en/downloads/quickstart_v ms.html Virtual Box https://www.virtualbox.org/ VMWare Player https://www.virtualbox.org/ 39

Demo 40

Resources KNIME (www.knime.org) BLOG for news, tips and tricks(www.knime.org/blog) FORUM for questions and answers (tech.knime.org/forum) EXAMPLE SERVER for example workflows LEARNING HUB (www.knime.org/learning-hub) KNIME TV channel on KNIME on @KNIME KNIME on https://www.facebook.com/knimeanalytics 41