Crazy NoSQL Data Integration with Pentaho




NoSQL Matters, Cologne, Germany, May 30th, 2012
Matt Casters

About Matt
- Chief of Data Integration at Pentaho
- Lead development
- Project manager
- Community contact
- Kettle project founder
- Author of "Pentaho Kettle Solutions", published by Wiley, 650 pages

Pentaho Mission
Delivering the future of analytics today: a modern, unified data integration and business intelligence platform
- Full business analytics & data integration
- Easy to incorporate big and diverse data
- Embeddable, cloud-ready analytics
- Open source development model enables fast and broad innovation
One download every 30 seconds

A modern, unified, embeddable platform built for the future of analytics, including big data and cloud-ready analytics
[Architecture diagram with four columns]
- ACCESS all enterprise data sources: relational data sources; ERP / CRM / enterprise apps (e.g. SAP, Oracle); cloud (e.g. Salesforce, Amazon, Dell); Hadoop & NoSQL data; unstructured & semi-structured data (XML, Excel, files, etc.)
- STREAMLINE information delivery: direct access; integrate, cleanse & enrich data (graphical ETL designer, enterprise scalability, Hadoop clustering); metadata layer (data integration, relational, OLAP cubes, in-memory caching, high performance)
- VISUALIZE & report information in any style: reporting (interactive, operational, enterprise); analysis (ad hoc exploration, multi-dimensional); dashboards (interactive metrics, rich visualizations); data mining (advanced & predictive analytics)
- DELIVER when & where users need it: standalone (web); embedded (mobile, e-mail, print, ISV & packaged applications, SaaS / cloud applications)
Central administration, auditing & monitoring

Better Together: Purpose Built for Business Analytics
Dramatically reduces the time spent getting to useful insights, from data source to dashboard
- Graphical ETL designer: integrate your data sources; auto-generate data models and data visualizations with just a right-click
- ETL metadata modeling: the multi-dimensional OLAP model is auto-generated dynamically; tweak and automatically visualize results
- Interactive data analysis & reporting: auto-generated dynamically; slice and dice until the data is fit for purpose
- Iterate by changing the model and integration flows; visualize results immediately
Integrated ETL, OLAP & visualization

Pentaho in the Big Data Fabric
[Layer diagram. Labels shown: Pentaho Business Analytics; 3rd Party Tools (R, 3rd Party BI Tools); Applications; Analytics; Data Integration (Job Orchestration, Workflow, Scheduling, High Performance, Visual IDE); Hadoop (Java MapReduce, Pig, Pentaho MapReduce); NoSQL Databases; Analytic Databases; Big Data Mgmt]

Data Sources
Using the right tool for the job...
In what is the world's data stored?

Relational Data Sources
Relational database examples: Apache Derby, AS/400, Borland Interbase, Calpont InfiniDB, dBase III-IV-5, ExtenDB, Firebird SQL, Generic database, Greenplum, Gupta SQL Base, H2, Hypersonic, IBM DB2, Infobright, Informix, Ingres, Ingres VectorWise, Intersystems Cache, KingbaseES, LucidDB, MaxDB (SAP DB), MonetDB, MS Access, MS SQL Server, MySQL, Native Mondrian, Neoview, Netezza, Oracle, Oracle RDB, PostgreSQL, Remedy Action Request System, SAP ERP System, SQLite, Sybase, SybaseIQ, Teradata, UniVerse database, Vertica
Split up into different categories:
- ISAM, like xBase variants, SQLite
- Relational, like MySQL, PostgreSQL, Oracle, SQL Server
- Columnar, like InfiniDB, Infobright, LucidDB, VectorWise, Sybase IQ
All differ in their support for the SQL standard
All aim at various use cases
All have specific advantages and disadvantages

Text File Data Sources
- Comma-separated values
- Fixed width
- Varied width (COBOL redefines)
Complicating matters:
- (Binary) delimiters, line separators, enclosures
- Single-, double-, triple- and quadruple-byte encodings
- Various compression formats (.z, .gz, .zip, .rar, ...)
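The complications listed on this slide tend to show up together. As a minimal sketch, using only the Python standard library rather than anything Kettle-specific, the following reads a gzip-compressed text file with a non-default delimiter, an enclosure character, and a non-UTF-8 encoding. The function name read_delimited and the sample data are made up for illustration.

```python
import csv
import gzip
import io


def read_delimited(raw_bytes, delimiter=";", enclosure='"', encoding="latin-1"):
    """Decompress gzip bytes and parse them as delimited text rows (hypothetical helper)."""
    # newline="" lets the csv module handle line separators inside enclosures itself.
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding=encoding, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter, quotechar=enclosure)
        return list(reader)


# Illustrative sample: semicolon delimiter, a delimiter inside an enclosure,
# Windows line separators, Latin-1 bytes, gzip compression.
sample_text = 'id;name;city\r\n1;"Müller; Anna";Köln\r\n2;"O\'Brien";Gent\r\n'
sample_bytes = gzip.compress(sample_text.encode("latin-1"))

rows = read_delimited(sample_bytes)
for row in rows:
    print(row["id"], row["name"], row["city"])
```

Every one of the four parameters (compression, encoding, delimiter, enclosure) has to be right before a single row comes out correctly, which is why file input steps in integration tools expose them all as settings.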

Other File Data Sources
- XML files: any kind of content
- INI files: key-value pairs
- Properties files: key-value pairs
- xBase files: rows of data
- JSON
- SOAP/WSDL
- MS Access: binary ISAM
- Spreadsheets: xls, xlsx, ods
- LDAP / LDIF
- SAS
- YAML

Non-relational Data Sources
- SAP R/3 application server
- LDAP
- Message queues (JMS, MQSeries, ...)
- HL7
- EDIFACT
- Windows Messaging
- SNMP
- ...

Cloud Data Sources
- SalesForce, SugarCRM, ...
- Amazon S3
- Google Analytics
- Twitter
- ...
All differ in format, with no common standard
Various web services and authentication methods

NoSQL Data Sources
Large and diverse group of different types of data sources
- Document stores: Apache CouchDB, Jackrabbit, MongoDB, SimpleDB, ...
- Graph databases: Neo4J, HyperGraphDB
- Key-value stores (cont.)
http://en.wikipedia.org/wiki/wikipedia:citation_needed

NoSQL Data Sources (more confusion)
Key-value stores:
- Eventually consistent: Apache Cassandra, Dynamo, Project Voldemort
- Hierarchical: GT.M
- Hosted: Freebase
- RAM cached: memcached, Velocity, Citrusleaf
- Disk stored: BigTable, MemcacheDB, Citrusleaf, MongoDB
- Ordered: Berkeley DB, Informix C-ISAM, MemcacheDB, NDBM
Multivalue: Intersystems Caché, ESE/NT, OpenQM
Object databases: Caché, JADE, ObjectDB, ObjectStore, db4o
Tabular: BigTable, Apache Hadoop, Apache HBase, Hypertable, Mnesia
Tuple store: Apache River
Source: http://en.wikipedia.org/wiki/nosql

Other Encountered Data Sources
- Home-brew data formats...
- Legacy stuff
- Because there just aren't enough options out there!
- Non-structured data formats: PDF, images, video, web content

But...
Now the integration goes crazy!

Adding complexity by mixing data sources
- Reading from and writing to...
- Performing lookups
- Joining data
- Calculating
- Scripting
- ...
EXPLODING COMPLEXITY
- X systems, Y architectures, Z operations
- Trashing it all when new technology surfaces!

Classical Solutions...
- Standardize on software stacks
- Limit the number of talking systems
Good advice, but...
- Perceived as limiting, stifling and preventing innovation

Consequences!
- Sooner or later you're going to be using any of these data sources...
- ... to read from or to write to...
- ... with still more data sources coming after them
- Get ready to learn more and more APIs
- The rate of change seems to be increasing

Solving it the Kettle way... with abstraction layers
[Diagram with three columns: Task design, Save work, Execution. Labels shown: Graphical, Batch, <XML/>, Local, Remote, Java API, XML generation, Repositories, Java API, Reporting, M/R, Embedded]
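The idea of an abstraction layer can be sketched in a few lines. The example below is not Kettle's API. It is a hypothetical Python illustration in which every source yields rows as plain dictionaries, every transformation step consumes and produces such rows, and every sink accepts them, so the same task design runs unchanged against different targets. All function names and the sample data are assumptions made for this sketch, and SQLite stands in for a relational target such as MySQL.

```python
import sqlite3
from functools import partial


def uppercase_country(rows):
    """Transformation step: normalise the country code of every row."""
    for row in rows:
        yield {**row, "country": row["country"].upper()}


def only_country(rows, code):
    """Filter step: keep only the rows for one country code."""
    for row in rows:
        if row["country"] == code:
            yield row


def collect(rows):
    """Sink that simply keeps the rows in memory."""
    return list(rows)


def write_to_sqlite(rows, connection):
    """Sink that writes rows to a relational table and returns the row count."""
    materialised = list(rows)
    connection.execute("CREATE TABLE IF NOT EXISTS customer (name TEXT, country TEXT)")
    connection.executemany(
        "INSERT INTO customer (name, country) VALUES (:name, :country)", materialised
    )
    connection.commit()
    return len(materialised)


def run_pipeline(source, steps, sink):
    """Chain the steps lazily over the source and hand the result to the sink."""
    stream = iter(source)
    for step in steps:
        stream = step(stream)
    return sink(stream)


source = [
    {"name": "Ada", "country": "de"},
    {"name": "Bob", "country": "be"},
    {"name": "Cleo", "country": "De"},
]
steps = [uppercase_country, partial(only_country, code="DE")]

# The same design, executed against two different targets.
in_memory = run_pipeline(source, steps, collect)
connection = sqlite3.connect(":memory:")
written = run_pipeline(source, steps, partial(write_to_sqlite, connection=connection))
print(in_memory, written)
```

Because the steps only know about rows, swapping the sink (or the source) does not touch the transformation logic, which is the point the slide makes about separating task design from execution.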

Demo
- The design environment
- Load some data into MySQL

What about NoSQL & Big Data?
- Big Data lakes
- Danger of creating isolated data silos
- Connecting with classical tech is important
- Similar abstraction solutions are needed

Pentaho for Hadoop Value Proposition
- Lowers technical barriers through a graphical design environment for creating and managing MapReduce jobs
- Automatic ETL scalability through deployment across the Hadoop cluster
- Easily integrate external reference data with data from Hadoop
- Easily spin off high-performance data marts for interactive analysis
- Provides an end-to-end BI solution for Big Data, including reporting, ad hoc query and interactive analysis
[Diagram labels: Interactive Analysis, Data Marts, Agile BI, Hadoop, Batch Reporting and Ad Hoc Query, Pentaho Data Integration Jobs, PDI, Log Files, DBs and other sources]

Pentaho for NoSQL Value Proposition
- Lowers technical barriers through a graphical design environment for creating and managing NoSQL load and extraction jobs
- Easily integrate external reference data with data from NoSQL
- Easily spin off high-performance data marts for interactive analysis
- Provides an end-to-end BI solution for Big Data, including reporting, ad hoc query and interactive analysis
[Diagram labels: Interactive Analysis, Data Marts, Agile BI, PDI, Batch Reporting and Ad Hoc Query, PDI]
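What an extraction job that combines NoSQL data with external reference data has to do can be shown with a small, hedged sketch. It does not use PDI or a real document store. The documents are plain Python dictionaries standing in for records read from a store such as MongoDB or CouchDB, SQLite stands in for the relational data mart, and the flatten helper, table names, and sample values are all assumptions made for illustration.

```python
import json
import sqlite3


def flatten(document, prefix=""):
    """Turn a nested document into one flat row (hypothetical helper).

    Nested objects become prefixed columns; lists are kept as JSON text.
    """
    flat = {}
    for key, value in document.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


# Stand-ins for documents extracted from a document store.
documents = [
    {"id": "c1", "name": "Ada", "address": {"city": "Cologne", "country": "DE"}, "tags": ["vip", "beta"]},
    {"id": "c2", "name": "Bob", "address": {"city": "Ghent", "country": "BE"}, "tags": []},
]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customer (id TEXT, name TEXT, city TEXT, country TEXT, tags TEXT)")
# External reference data that lives in the relational world.
connection.execute("CREATE TABLE country (code TEXT, label TEXT)")
connection.executemany("INSERT INTO country VALUES (?, ?)", [("DE", "Germany"), ("BE", "Belgium")])

flat_rows = [flatten(document) for document in documents]
connection.executemany(
    "INSERT INTO customer VALUES (:id, :name, :address_city, :address_country, :tags)",
    flat_rows,
)

# Once flattened, the document data joins against the reference data with plain SQL.
joined = connection.execute(
    "SELECT c.name, k.label FROM customer c JOIN country k ON k.code = c.country ORDER BY c.id"
).fetchall()
print(joined)
```

The flattening policy (prefixing nested keys, serialising lists) is one choice among several; a real job would pick the mapping that suits the target schema.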

Demo
- Execution of a Hadoop Map/Reduce job
- Retrieval of wordcount results, copy to MySQL
- Parallel execution of partitioned MySQL table copy to HDFS
- Insert MySQL data into MongoDB and CouchDB
- Read data from MongoDB & CouchDB and bulk load into MySQL
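For readers who have not seen the wordcount job, the first two demo steps can be approximated in plain Python. This is a single-process sketch of the map, shuffle, and reduce phases, not Hadoop or Pentaho MapReduce code, and SQLite stands in for the MySQL target used in the demo. The function names, table name, and input lines are illustrative assumptions.

```python
import sqlite3
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


lines = ["to be or not to be", "to see or not to see"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(word, values) for word, values in shuffle(pairs).items())

# Copy the results into a relational table (SQLite here, MySQL in the demo).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE wordcount (word TEXT PRIMARY KEY, n INTEGER)")
connection.executemany("INSERT INTO wordcount VALUES (?, ?)", sorted(counts.items()))
connection.commit()

top = connection.execute(
    "SELECT word, n FROM wordcount ORDER BY n DESC, word LIMIT 1"
).fetchone()
print(top)
```

On a cluster the map and reduce functions run in parallel on different nodes and the shuffle happens over the network; the logic per record is the same as in this sketch.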

Open Source Software
- Complete stack of BI and data integration software
- Software under open source license(s)
- Kettle 4.3 is Apache licensed
- Free to download and use by each and everyone
- Projects can be integrated or used separately
Don't re-invent the wheel! Spend time on the fun stuff!

Pentaho Kettle for Big Data Benefits
- Visual tools deliver a 10x boost in productivity for developers
- Makes big data platforms usable for a huge breadth of developers
- Enables easy visual orchestration of big data tasks
- Leverages the full capabilities of each big data platform
- Provides an easy on-ramp to Pentaho Business Analytics

Thank You
Join the conversation. You can find us on:
- http://blog.pentaho.com
- @Pentaho
- http://www.ibridge.be
- @mattcasters
- Facebook.com/Pentaho
Pentaho Business Analytics