
White Paper

PENTAHO DATA INTEGRATION WITH GREENPLUM LOADER

The interoperability between Pentaho Data Integration and the Greenplum Database with Greenplum Loader

Abstract

This white paper explains how Pentaho Data Integration (Kettle) can be configured and used with the Greenplum Database through Greenplum Loader (GPLOAD), boosting the connectivity and interoperability of Pentaho Data Integration with the Greenplum Database.

February 2012

Copyright 2012 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. The information in this publication is provided "as is". EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. VMware is a registered trademark of VMware, Inc. All other trademarks used herein are the property of their respective owners. Part Number h8309

Table of Contents

Executive summary
Audience
Organization of this paper
Overview of Pentaho Data Integration
Overview of Greenplum Database
Integration of Pentaho PDI and Greenplum Database
    Using JDBC drivers for Greenplum database connections
    Installation of new driver
Greenplum Loader: Greenplum's Scatter/Gather Streaming Technology
    Parallel Loading
    External Tables
    Greenplum Parallel File Distribution Server (gpfdist)
    How does gpfdist work?
    Using gpload to invoke gpfdist
    1) Single ETL Server, Multiple NICs
    2) Multiple ETL Servers
Usage: How to use Greenplum Loader in Pentaho Data Integration
    Setup
Future expansion and interoperability
Conclusion
References

Executive summary

Greenplum Database is a popular analytical database that works with open source data integration products such as Pentaho Data Integration (PDI), also known as Kettle. Pentaho Kettle is part of the Pentaho Business Intelligence suite. Greenplum Database is capable of managing, storing, and analyzing large amounts of data.

One of Pentaho's latest enhancements for expanded OLAP support is a native bulk-loader integration with EMC Greenplum that improves data loading performance. Pentaho offers a native adapter for the Greenplum GPLoad (bulk loader) capability, which enables joint customers to use PDI's data integration capabilities to quickly capture, transform, and load massive amounts of data into Greenplum Databases. Currently, Pentaho Data Integration connects to Greenplum through JDBC (Java Database Connectivity) drivers, and Greenplum Database can serve as both the source and the target of Pentaho ETL transformations.

Audience

This white paper is intended for EMC field-facing employees such as sales, technical consultants, and support staff, as well as customers who will be using the Pentaho Data Integration tool for their ETL work. It is neither an installation guide nor introductory material on Pentaho. It documents Pentaho's connectivity and operational capabilities with Greenplum Loader, and shows how Pentaho PDI can be used in conjunction with the Greenplum database to retrieve, transform, and present data to users. Readers are not expected to have extensive Pentaho knowledge, but a basic understanding of Pentaho data integration concepts and ETL tools will make this document easier to follow.

Organization of this paper

This paper covers the following topics:

- Executive summary
- Organization of this paper
- Overview of Pentaho Data Integration (PDI)
- Overview of Greenplum Database
- Integration of Pentaho PDI and Greenplum Database
- Using JDBC drivers for Greenplum database connections
- Greenplum Loader: Greenplum's Scatter/Gather Streaming Technology
- Usage: How to use Greenplum Loader in Pentaho Data Integration
- Future expansion and interoperability
- Conclusion

Overview of Pentaho Data Integration

Pentaho Data Integration (PDI) delivers comprehensive Extraction, Transformation and Loading (ETL) capabilities using a metadata-driven approach. It is commonly used for building data warehouses, designing business intelligence applications, migrating data, and integrating data models. It consists of several components:

- Spoon - Main GUI; graphical Jobs/Transformations designer
- Carte - HTTP server for remote execution of Jobs/Transformations
- Pan - Command-line execution of Transformations
- Kitchen - Command-line execution of Jobs
- Encr - Command-line tool for encrypting strings for storage
- Enterprise Edition (EE) Data Integration Server - Data Integration Engine; security integration with LDAP/Active Directory; monitor/scheduler; content management

Pentaho is capable of loading big data sets, on the order of terabytes or petabytes, into the Greenplum Database, taking full advantage of the massively parallel processing environment provided by the Greenplum product family.

Overview of Greenplum Database

Greenplum Database is built on an MPP (Massively Parallel Processing) shared-nothing architecture that facilitates business intelligence, data integration, and big data analytics. Data is distributed and replicated across the multiple nodes of this parallel architecture. Greenplum's MPP architecture allows for greater scalability than traditional databases and leverages parallelism to deliver orders-of-magnitude improvements in query performance. A shared-nothing architecture is optimal for fast queries and loads because each processor is placed as close as possible to its data, enabling faster operations with the maximum possible degree of parallelism.

Highlights of the Greenplum Database:

- Dynamic Query Prioritization - Provides continuous, real-time balancing of resources across queries.
- Self-Healing Fault Tolerance - Provides intelligent fault detection and fast online differential recovery.
- Polymorphic Data Storage - MultiStorage/SSD Support - Includes tunable compression and support for both row- and column-oriented storage.
- Analytics Support - Supports analytical functions for advanced in-database analytics.
- Health Monitoring and Alerting - Provides the integrated Greenplum Command Center for advanced support capabilities.

Integration of Pentaho PDI and Greenplum Database

The following diagram shows the basic interoperability between Pentaho Data Integration and the Greenplum Database.

Using JDBC drivers for Greenplum database connections

Pentaho Kettle ships with many different JDBC drivers, each packaged as a single Java archive (.jar) file in the libext/jdbc directory. By default, Pentaho PDI ships with a PostgreSQL JDBC .jar file, which is used for Greenplum connections (including those made when loading through Greenplum Loader, i.e. gpload/gpfdist) when you define your database connection and choose Native (JDBC) as the access method. Java JDK 1.6 is required for the installation. A startup script adds all of these .jar files to the environment.
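Because Greenplum is wire-compatible with PostgreSQL, the JDBC URL that PDI builds from these connection settings follows the standard PostgreSQL form. A minimal sketch, with placeholder host, port, and database names (the mdw-1 and ops values are borrowed from the gpload example later in this paper):

jdbc:postgresql://<master_hostname>:<master_port>/<database_name>

For example: jdbc:postgresql://mdw-1:5432/ops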

Installation of new driver

To add a new driver, simply copy the .jar file containing the driver into the JDBC directory of the appropriate component:

- Data Integration Server: <Pentaho_installed_directory>/server/data-integration-server/tomcat/lib/
- Data Integration client: <Pentaho_installed_directory>/design-tools/data-integration/libext/JDBC/
- BI Server: <Pentaho_installed_directory>/server/biserver-ee/tomcat/lib/
- Enterprise Console: <Pentaho_installed_directory>/server/enterprise-console/jdbc/

If you install a new JDBC driver for Greenplum on the BI Server or DI Server, you must restart all affected servers to load the newly installed database driver. In addition, if you want to establish a Greenplum data source in the Pentaho Enterprise Console, you must install the JDBC driver in both the Enterprise Console and the BI Server for it to take effect. In brief, to update the driver, update the .jar file in /data-integration/libext/jdbc/.

Assuming there is a Greenplum Database (GPDB) installed and ready to use, you can define Greenplum database connections in the Database Connection dialog: give the connection a name, choose Greenplum as the Connection Type, choose Native (JDBC) in the Access field, and supply the Host Name, Database Name, Port Number, User Name, and Password in the Settings section.

Special attention may be required to set up the host files and configuration files in the Greenplum database as well as on the hosts where Pentaho is installed. For instance, in the Greenplum database, the user may need to configure pg_hba.conf with the IP address of the Pentaho host. In addition, the user may need to add the hostnames and corresponding IP addresses on both systems (i.e. the Pentaho PDI server and the Greenplum Database) to ensure the two machines can communicate.
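As an illustrative sketch of these entries (the IP addresses and the pentaho-etl hostname below are assumptions, not values from this paper):

# pg_hba.conf on the Greenplum master: allow the Pentaho host to connect
# (illustrative address; reload the configuration with "gpstop -u" after editing)
host    all    gpadmin    192.168.1.50/32    md5

# /etc/hosts on both the Pentaho server and the Greenplum master
192.168.1.50    pentaho-etl
192.168.1.10    mdw-1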

Greenplum Loader: Greenplum's Scatter/Gather Streaming Technology

Parallel Loading

Greenplum's Scatter/Gather Streaming (SGS) technology, typically referred to as gpfdist, eliminates the bottlenecks associated with data loading, enabling ETL applications to stream data into the Greenplum database quickly. The technology is intended for loading the huge data sets normally used in large-scale analytics and data warehousing, and it manages the flow of data into all nodes of the database.

Figure 1 shows how Greenplum utilizes a "parallel everywhere" approach to loading, in which data flows from one or more source systems to every node of the database without any sequential bottlenecks.

[Figure 1]

Greenplum's SGS technology ensures parallelism by scattering data from source systems across hundreds or thousands of parallel streams that simultaneously flow to all nodes of the Greenplum Database. Performance scales with the number of Greenplum Database nodes, and the technology supports both large batch and continuous near-real-time loading patterns with negligible impact on concurrent database operations. Figure 2 shows how the final gathering and storage of data to disk takes place on all nodes simultaneously, with data automatically partitioned across nodes and optionally compressed. This technology is exposed via a flexible and programmable external table interface (explained below) and a traditional command-line loading interface.

[Figure 2]

External Tables

External tables enable users to access data in external sources as if it were in a table in the database. In the Greenplum database there are two types of external data sources, external tables and Web tables, and they have different access methods. External tables contain static data that can be scanned multiple times; the data does not change during queries. Web tables provide access to dynamic data sources as if those sources were regular database tables; they cannot be scanned multiple times, and the data can change during the course of a query.

Greenplum Parallel File Distribution Server (gpfdist)

gpfdist is Greenplum's parallel file distribution server utility. It is used with read-only external tables for fast, parallel loading of text, CSV, and XML files into a Greenplum database. The benefit of using gpfdist is that users can take advantage of maximum parallelism while reading from or writing to external tables, giving the best performance as well as easier administration of external tables.

gpfdist can be thought of as a networking protocol, much like HTTP, and running gpfdist is similar to running an HTTP server: it exposes the files in a local directory via TCP/IP. The files are usually delimited or CSV files, although gpfdist can also read tar and gzipped files (in which case the PATH must contain the location of the tar and gzip utilities). For uploading data into a Greenplum database, you can generate the flat files from an operational or transactional database using export, COPY, dump, or user-written software, depending on the business requirements. This process can be automated to run periodically.
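To make the external table interface concrete, the following is a minimal sketch of a read-only external table served by a running gpfdist instance; the table, column, host, and file names here are illustrative, not taken from this paper:

CREATE EXTERNAL TABLE ext_expenses
    (name text, amount float4, category text, description text, trans_date date)
LOCATION ('gpfdist://etl1-1:8081/*.dat')
FORMAT 'TEXT' (DELIMITER '|');

-- Query it like an ordinary table; each scan streams the files in parallel
SELECT category, sum(amount) FROM ext_expenses GROUP BY category;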

How does gpfdist work?

gpfdist runs in a client-server model. To start the gpfdist process, you indicate the directory where the source files are dropped/copied. Optionally, you may also designate the TCP port number to be used and a log file. A simple startup of the gpfdist server uses the following command syntax:

gpfdist -d <flat_files_directory> -p <port_number> -l <log_file> &

For example:

# gpfdist -d /etl-data -p 8887 -l gpfdist_8887.log &
[1] 28519
# Serving HTTP on port 8887, directory /etl-data

In the above example, gpfdist is set up to run on the Greenplum DIA server, anticipating data loading from flat files stored in the directory /etl-data. Port 8887 is opened and listens for data requests, and a log file, gpfdist_8887.log, is created in the working directory (/home/gpadmin in this example).

Using gpload to invoke gpfdist

Pentaho leverages the parallel bulk-loading capabilities of GPDB through the Greenplum data loading utility, gpload. gpload acts as an interface to the Greenplum Database's external table parallel loading feature. The Greenplum EXTERNAL TABLE feature allows us to define network data sources as tables that we can query to speed up the data loading process. Using a load specification defined in a YAML-formatted control file, gpload executes a load by invoking gpfdist (Greenplum's parallel file distribution program), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE, or MERGE operation to load the source data into the target table in the database.

The gpload program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another. The use of white space is significant: white space should not be used simply for formatting purposes, and tabs should not be used at all.

The basic structure of a load control file:

---
VERSION: 1.0.0.1
DATABASE: db_name
USER: db_username
HOST: master_hostname
PORT: master_port
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - hostname_or_ip
         PORT: http_port
         PORT_RANGE: [start_port_range, end_port_range]
         FILE:
           - /path/to/input_file
    - COLUMNS:
           - field_name: data_type
    - FORMAT: text | csv
    - DELIMITER: 'delimiter_character'
    - ESCAPE: 'escape_character' | 'OFF'
    - NULL_AS: 'null_string'
    - FORCE_NOT_NULL: true | false
    - QUOTE: 'csv_quote_character'
    - HEADER: true | false
    - ENCODING: database_encoding
    - ERROR_LIMIT: integer
    - ERROR_TABLE: schema.table_name
   OUTPUT:
    - TABLE: schema.table_name
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
           - target_column_name
    - UPDATE_COLUMNS:
           - target_column_name
    - UPDATE_CONDITION: 'boolean_condition'
    - MAPPING:
            target_column_name: source_column_name | 'expression'
   PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
   SQL:
    - BEFORE: "sql_command"
    - AFTER: "sql_command"

The above shows the syntax reference for a GPLOAD YAML control file. It is divided into sections for easy reference; a vertical bar (|) separates alternative values, and neither the bars nor any section-divider lines belong in an actual YAML file.

For example, users can run a load job as defined in my_load.yml using gpload:

gpload -f my_load.yml

It is recommended that you confirm gpload runs successfully before building jobs around it, to reduce the chance of future errors. As a first step, run gpload at the system (command) prompt against a small representative copy of a source file and a control (YAML) file. If the gpload.py script does not execute successfully, confirm the following settings:

- Check that the correct version is installed by checking the gpload readme.
- Check that the PATH, GPHOME_LOADERS, and PYTHONPATH environment variables are correctly set.
- Check that these path environment variables point to, or include, the correct paths.

Example of the load control file, my_load.yml:

---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3
           - etl1-4
         PORT: 8081
         FILE:
           - /var/load/data/*
    - COLUMNS:
           - name: text
           - amount: float4
           - category: text
           - desc: text
           - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
   SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"

Note: A YAML file is not free-format; field names and most of the content need to follow the expected structure. When you use Pentaho, you do not need to write your own YAML file: there are pre-built steps inside the Bulk loading folder of the Design window in Spoon. The customized Greenplum step is called Greenplum Load, and it generates the YAML file once all the necessary details are provided. The Greenplum Load step wraps the Greenplum GPLoad data loading utility we just discussed, which performs massively parallel data loading using Greenplum's external table parallel loading feature.

As you can see in the above example, four ETL servers are used to feed data into Greenplum through GPLOAD. GPLoad can be deployed on either a single Pentaho ETL server or multiple Pentaho ETL servers. The following diagrams show the typical deployment scenarios for performing parallel loading to the Greenplum Database:

1) Single ETL Server, Multiple NICs

2) Multiple ETL Servers

Usage: How to use Greenplum Loader in Pentaho Data Integration

Setup

Here are the steps to set up a simple transformation to test out the Greenplum Loader:

1. Create the Text File Input step by defining a source file (e.g. a CSV or delimited file). Choose the Text File Input component under the Design tab, inside the Input folder. Double-click on the Text File Input step and choose the correct input delimited file.

2. Click on the next tab, Content, to define how to parse the file.

3. Go to the next tab, Fields, and click on Get Fields to define all the fields.

A sample source file, lineitem.csv/lineitem.dat, should look like this (pipe-delimited):

1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|lineitem 1 comments
2|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|lineitem 2 comments
...
100|61336|8855|1|31|40217.23|0.09|0.04|A|F|1993-10-29|1993-12-19|1993-11-08|COLLECT COD|TRUCK|lineitem 100 comments

4. Create a target table called lineitem:

CREATE TABLE lineitem
(
    l_orderkey      integer,
    l_partkey       integer,
    l_suppkey       integer,
    l_linenumber    integer,
    l_quantity      numeric(15,2),
    l_extendedprice numeric(15,2),
    l_discount      numeric(15,2),
    l_tax           numeric(15,2),
    l_returnflag    character(1),
    l_linestatus    character(1),
    l_shipdate      date,
    l_commitdate    date,
    l_receiptdate   date,
    l_shipinstruct  character(25),
    l_shipmode      character(10),
    l_comment       character varying(44)
)
WITH (OIDS=FALSE)
DISTRIBUTED BY (l_orderkey);

ALTER TABLE lineitem OWNER TO gpadmin;

Next, you will need to create the Greenplum Load step:

The details of the Greenplum Load step need to be defined as follows. First, choose the correct connection and target table. Then click the Get fields button to generate all the target table fields. After that, click the Edit Mapping button to define the mappings from source fields to target fields.

Next, go to the GP Configuration tab to define the correct GPLOAD path, control file, and data file locations. Once you complete the definitions, click OK to save. A sample transformation can then be created by adding a hop between the Text File Input and Greenplum Load steps.
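Under the covers, the Greenplum Load step writes a control file equivalent to the hand-written YAML shown earlier and invokes gpload with it. A minimal sketch of what the generated control file for this lineitem example might contain; the host name, port, schema, and file path here are assumptions for illustration:

---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         PORT: 8081
         FILE:
           - /var/load/data/lineitem.dat    # assumed staging location
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
   OUTPUT:
    - TABLE: public.lineitem
    - MODE: insert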

When everything is defined and saved, you can execute the transformation by clicking the green arrow in the top left corner. Once the execution has finished, check the Logging and Step Metrics sections to see whether the transformation executed successfully. You can also verify that the data gpload loaded actually arrived in the target Greenplum table, lineitem, for example with a simple count query against the table. The above transformation is just a sample; users can add different components to it or incorporate it into a well-developed job for transforming the data.

Future expansion and interoperability

Both Greenplum and Pentaho are rapidly innovating and extending their capabilities to satisfy the requirements of the big data industry. To meet the challenges of fast data loading, the EMC Data Integration Accelerator (DIA) is purpose-built for batch and micro-batch loading, and it leverages a growing number of data integration applications such as Pentaho. The two companies are working together to expand their interoperability to meet these constantly growing demands.

Conclusion

This white paper has discussed how to use the Greenplum Load step (GPLOAD) to enhance the loading capability and performance of Pentaho Data Integration. It covered the basic interoperability between Pentaho PDI and the Greenplum database for data integration and business intelligence projects, using the Scatter/Gather Streaming technology embedded in Greenplum Loader.

References

1) Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration (ISBN-10: 0470635177 / ISBN-13: 978-0470635179)
2) Getting Started with Pentaho Data Integration guide, from www.pentaho.com
3) Greenplum Database 4.1 Load Tools for UNIX guide
4) Greenplum Database 4.1 Load Tools for Windows guide
5) Pentaho Community - Greenplum Load