Cloud Computing. With MySQL and Pentaho Data Integration. Matt Casters Chief Data Integration at Pentaho Kettle project founder



Similar documents
Pentaho Data Integration 4 and MySQL. Matt Casters: Pentaho's Chief Data Integration Kettle Project Founder

GeoKettle: A powerful open source spatial ETL tool

Sisense. Product Highlights.

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam

Oracle Database Cloud Service Rick Greenwald, Director, Product Management, Database Cloud

Database Scalability {Patterns} / Robert Treat

Application Development

Using Pentaho Data Integration (PDI) with Oracle Nabil Juwale Al Lopez. RMOUG Training Days February 11-13, 2013

Transforming Data Integration from "Create" to "Connect"

Cloud Computing: Making the right choices

Best Practices for Using MySQL in the Cloud

Crazy NoSQL Data Integration with Pentaho

Pentaho Kettle Solutions. Building Open Source ETL Solutions with Pentaho Data Integration

SQL Server PDW. Artur Vieira Premier Field Engineer

CLOUD STORAGE USING HADOOP AND PLAY

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

Oracle WebLogic Foundation of Oracle Fusion Middleware. Lawrence Manickam Toyork Systems Inc

Quick start. A project with SpagoBI 3.x

WEBAPP PATTERN FOR APACHE TOMCAT - USER GUIDE

Automate Your BI Administration to Save Millions with Command Manager and System Manager

Top 10 Oracle SQL Developer Tips and Tricks

JBoss & Infinispan open source data grids for the cloud era

Building a BI Solution in the Cloud

Data processing goes big

MySQL Enterprise Monitor

Using Attunity Replicate with Greenplum Database Using Attunity Replicate for data migration and Change Data Capture to the Greenplum Database

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Hadoop and MySQL for Big Data

SaaS, PaaS & TaaS. By: Raza Usmani

MONGODB - THE NOSQL DATABASE

Oracle Database Cloud

MySQL. Leveraging. Features for Availability & Scalability ABSTRACT: By Srinivasa Krishna Mamillapalli

Jitterbit Technical Overview : Salesforce

How, What, and Where of Data Warehouses for MySQL

Editions Comparison Chart

Performance and Scalability Overview

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

AWS Schema Conversion Tool. User Guide Version 1.0

Search and Real-Time Analytics on Big Data

SQL Server Training Course Content

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Upgrading From PDI 4.0 to 4.1.0

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

Cloud Computing Summary and Preparation for Examination

Sugar Professional. Approvals Competitor tracking Territory management Third-party sales methodologies

LoadRunner and Performance Center v11.52 Technical Awareness Webinar Training

Alexander Rubin Principle Architect, Percona April 18, Using Hadoop Together with MySQL for Data Analysis

Upgrading From PDI 4.1.x to 4.1.3

DATA MINING USING PENTAHO / WEKA

Getting Started with Attunity CloudBeam for Azure SQL Data Warehouse BYOL

Cloud computing - Architecting in the cloud

Oracle Warehouse Builder 10g

Relational Databases for the Business Analyst

Big Data Technologies Compared June 2014

Getting Started Hacking on OpenNebula

Products and Solutions

Data Integration Checklist

PCCW Solutions Ltd. I. Service Category (A) Productivity Apps

Oracle Architecture, Concepts & Facilities

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

The OpenNebula Cloud Platform for Data Center Virtualization

ITG Software Engineering

Tushar Joshi Turtle Networks Ltd

Pentaho Data Integration Tool

SaaS-Based Employee Benefits Enrollment System

ORACLE APPLICATION EXPRESS 5.0

Introduction to ovirt

How to Implement Multi-way Active/Active Replication SIMPLY

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition

What it is and why you might use it

Integrating Ingres in the Information System: An Open Source Approach

Integration in the cloud - IPaaS with Fuse technology. Charles Moulliard Apache Committer

RDS Migration Tool Customer FAQ Updated 7/23/2015

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

SkySQL Data Suite. A New Open Source Approach to MySQL Distributed Systems. Serge Frezefond V

A Tool for Evaluation and Optimization of Web Application Performance

Liferay Performance Tuning

Adding Indirection Enhances Functionality

LearnFromGuru Polish your knowledge

Amazon Elastic Beanstalk

Cloud Computing Submitted By : Fahim Ilyas ( ) Submitted To : Martin Johnson Submitted On: 31 st May, 2009

Building Out Your Cloud-Ready Solutions. Clark D. Richey, Jr., Principal Technologist, DoD

Scaling Analysis Services in the Cloud

WHITE PAPER. Domo Advanced Architecture

High Availability Solutions for the MariaDB and MySQL Database

Sugar Professional. Approvals Competitor tracking Territory management Third-party sales methodologies

Performance and Scalability Overview

Alfresco Enterprise on Azure: Reference Architecture. September 2014

Real-time High Volume Data Replication White Paper

Cloud Computing Technology

Media Upload and Sharing Website using HBASE

Oracle Application Express MS Access on Steroids

Salesforce Integration

Transcription:

Cloud Computing With MySQL and Pentaho Data Integration Matt Casters Chief Data Integration at Pentaho Kettle project founder

1-2 Agenda Introduction to Kettle Introduction Use-cases + load demo Performance / Scalability Kettle Slave Servers What & How Kettle Cluster Schemas What & How The Cloud Cloud Examples Q & A

Pentaho Data Integration - Kettle 1-3 PDI is the product associated with the KETTLE open source project KETTLE provides open source software PDI is a whole product Member of the Pentaho BI Suite Kettle = PDI CE PDI EE Management Services Console Knowledge Base Documentation Portal Support License Indemnification...

Introduction Kettle : Kettle 1-4 Kettle Extraction Transportation Transformation Loading Environment

Introduction - Kettle : Extraction 1-5 Extract data from : 35+ database types MySQL, PostgreSQL, SQLite,... Oracle, SQL Server, etc Text files XML files XLS files Xbase files (dbase, Foxpro, etc) File systems information Generated data MS Access files LDAP Geo-data...

Introduction - Kettle : Transportation 1-6 Transportation of data Engine based data transfer (no code generator) Very flexible pathways: splitting partitioning merging joining duplicating clustering (MPP)

Introduction - Kettle : Transformation 1-7 Flexibly transform data Looking up data databases files memory... Calculating Scripting JavaScript, SQL, RegExp Splitting Mapping Selecting Filtering Pivotting...

Introduction - Kettle : Loading 1-8 Load data into a target format Database loads Data warehouse population Partitioned loading Bulk loading Parallel loading Clustering

Introduction - Kettle : Environment 1-9 Full GUI called Spoon to edit every option in Kettle Drag & Drop Debugger Rich GUI Command line tools execute jobs execute transformations Web server clustering remote execution Programming API for Java Plugin eco-system...

Introduction Kettle : Conceptual model 1-10

1-11 Introduction Kettle : Conceptual model

Introduction Kettle : User community 1-12 Paying Pentaho customers Large and small corporations All possible sectors Lone rangers & Hobbiests All regions on Earth Meet on our Forum : +30,000 posts in 3 years Use our JIRA case tracking systems Download more than 10,000 copies of Kettle per month http://www.ohloh.net/projects/3624?p=kettle http://www.softpedia.com/progclean/kettle-clean-80094.html

Typical use-cases 1-13 Load data from text files and store it into a database [demo] Export data from database to text-file or more other databases Data migration between database applications Exploration of data in existing databases (tables, views, etc.) Information improvement using lookups Data cleaning Application integration Data warehouse population Application integration Report data generation...

1-14 Agenda Introduction to Kettle Introduction Use-cases + load demo Performance / Scalability Kettle Slave Servers What & How Kettle Cluster Schemas What & How The Cloud Cloud Examples Q & A

1-15 Party time!!!! You're preparing for a big event You bought plenty of food: 500 hot-dogs 500 burgers Distaster strikes: invitees are not showing up! What do you do with all that food?

1-16 Call Joey Chestnut!! Fastest eater in the world ** Local boy from Vallejo 103 burgers in 8 minutes 66 hot dogs in 12 minutes ** Ranked #1 by the International Federation of Competitive Eating

1-17 Problems with the chosen solution You would need at least 10 Joey Chestnuts to get the job done Getting Joeys to show up costs money $ It's more fun with a crowd!

1-18 Performance / Scalability Demo: what about MySQL limits? [loading data continued]

1-19 Performance / Scalability MySQL bulk loading limitations... Memory backed B-tree grows and need to swap out to disk Kills performance MySQL Performance Blog (Percona) Predicting how long data load would take Predicting performance improvements from memory increase

1-20 Performance / Scalability Limits Single threaded limits Multi-threaded limits The weakest link Solutions? Optimizing Tweaking Prodding Removing bottlenecks... Scaling out Scaling out with Pentaho Data Integration Clustering Partitioning Database sharding/partitioning

1-21 Performance / Scalability The Ideal architecture has linear scalability N systems 2006-2007 Pentaho Corporation. All Rights Reserved N times faster

1-22 Scalability The answer from PDI to this challenge is multi-layered: Clustering : multiple servers running in parallel Partitioning : directing data Database sharding : scaling the database Sales Sales table Sales table 2003 2004 2005 Sales 2003 2004 2005 2006 Sales DB1 Year 2003 Partition 2006 DB2 2003 Year 2004 Partition Sales 2004 Year 2005 Partition Year 2006 Partition 2003 2004 2005 2006 DB3 2005 2006-2007 Pentaho Corporation. All Rights Reserved 2006 DB4

1-23 Agenda Introduction to Kettle Introduction Use-cases + load demo Scalability Kettle Slave Servers What & How Kettle Cluster Schemas What & How The Cloud Cloud Examples Q & A

1-24 Slave Servers Building block of the PDI clustering offering Small embedded webserver (Jetty) Controlled over HTTP Spits out XML or HTTP [demo] Easy to start / configure / use Available HTTP services: Start / Stop transformation or Job Pauze transformation Adding (posting) transformation or job Get status of server, transformation or job Cleanup of transformation Allocate a socket port Register slave Get list of slaves

1-25 Slave Server Configuration Simple configuration : Hostname & HTTP Control port sh carte.sh localhost 8080 XML configuration: Optionally look at network interface to grab address Optionally report to a central server Etc.

1-26 Agenda Introduction to Kettle Introduction Use-cases + load demo Scalability Kettle Slave Servers What & How Kettle Cluster Schemas What & How The Cloud Cloud Examples Q & A

1-27 Clustering Schema Consists of a collection of one or more slave servers Is made up of Master : at least one per cluster Slaves : parallel worker nodes Simple local sample [demo] Master Slave Slave Slave Slave Slave Slave Cluster 2006-2007 Pentaho Corporation. All Rights Reserved

1-28 Agenda Introduction to Kettle Introduction Use-cases + load demo Scalability Kettle Slave Servers What & How Kettle Cluster Schemas What & How The Cloud Cloud Examples Q & A

1-29 The Cloud Wikipedia: Cloud computing is a style of computing in which dynamically scalable and often virtualised resources are provided as a service over the Internet

1-30 The Cloud : types Infrastructure as a Service (IaaS) : Amazon EC2, Eucalyptus, GoGrid, Nymbus,... Platform as a Service (PaaS) : Amazon Web Services, AppJet (ex Google guys), Azure Services Platform (MS), Force.com (SalesForce),... Software as a Service (SaaS) : SalesForce, Google, Lucidera and many others

1-31 The Cloud : white paper Bayon Technologies (Pentaho Partner) http://www.bayontechnologies.com Sorting 600M line-item rows from TCP-H

1-32 The Cloud : Demo time Start a dynamic cluster on EC2 Job: Execute DDL on all slave servers Design a first transformation : Load a file on the cloud in parallel in 10 MySQL dabases in 10 tables, split the rows

1-33 The Kettle Cloud : links My blog : http://www.ibridge.be http://wiki.pentaho.com/display/eai/dynamic+clusters

Q & A 1-34 Our homepage: http://kettle.pentaho.org Our Forum: http://forums.pentaho.org/forumdisplay.php?f=69 Our case tracker: http://jira.pentaho.org/browse/pdi Our wiki : http://wiki.pentaho.org/ http://wiki.pentaho.com/display/eai Our IRC Channel: ##pentaho (on Freenode) Developers mailing list: http://groups.google.com/group/kettle-developers My humble blog: http://www.ibridge.be My coordinates: mcasters@pentaho.org