Using Attunity Replicate with Greenplum Database
Using Attunity Replicate for data migration and Change Data Capture to the Greenplum Database



White Paper

Abstract: This white paper explores the technology and architecture of EMC's Greenplum database and explains how Attunity Replicate can leverage the MPP architecture to perform high-speed data loading.

September 2012

Copyright © 2012 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. Part Number h11024

Table of Contents

Executive Summary
    Audience
    Scope
Big Data
    The data explosion
Overview of Greenplum Database
    EMC Greenplum Data Computing Appliance
    Greenplum's Scatter/Gather Streaming Technology
        Parallel Loading
        External Tables
Overview of Attunity Replicate
    Attunity Replicate Architecture
    Requirements for Running Attunity Replicate
Testing Attunity Replicate
    Testing with Single-node GP databases
    Testing with full rack DCA
    Summary of Attunity Replicate testing
Conclusion
References

Executive Summary

Companies have been mining their data warehouses since the 1990s for business intelligence and analytics. More and more, they depend on their data infrastructures for competitive advantage and everyday business decisions. As a result, businesses are increasingly looking for quicker access to all their data, from business databases and transaction processing databases to social media, and they rely on data replication software to keep their data warehouses as current as possible. Solutions architects and data administrators typically look for the following capabilities when performing data replication:

- Schema translation
- Bulk data load for data migration
- Continuous Change Data Capture (CDC)

Attunity Replicate is a software solution that handles these requirements easily and effectively, and it applies them efficiently to Greenplum databases. It reads the source schema and automatically creates the destination database tables. For the initial data load, it makes use of Greenplum database's parallel loading capabilities to import the data into the target Greenplum databases, and once this is completed, Change Data Capture is automatically configured.

Audience

This white paper is intended for EMC field-facing employees such as sales, technical consultants, and support, as well as customers who are considering using the EMC DCA with Attunity Replicate to accomplish data migration and data replication to their Greenplum Database.

Scope

This document is not intended to be an Attunity Replicate installation guide or to supplant training material on Attunity Replicate. It simply illustrates the basic functionality and interoperability of Attunity Replicate with Greenplum Database, and presents some example use cases based on Attunity Replicate.

Big Data

The data explosion

All over the world, corporations are dealing with a massive data explosion. It is often said that 90 percent of the data in existence today has been created in the last two years alone. Data comes from everywhere: business transactions, automated data collection sensors, RFID data, cell phone data, posts to social and media sites, online purchases and transactions, electric meter readings, and so on. The volume of corporate data keeps growing, on the order of terabytes, exabytes, and even zettabytes. This is Big Data.

Working with Big Data allows us to spot business trends with high accuracy. It offers an opportunity to find new insights and trends in our businesses, and enables us to answer questions that would have been beyond reach in the past. In so doing, it allows us to be more flexible and agile in meeting our business requirements. In today's competitive business arena, it is not a matter of whether one should adopt Big Data, but when, and sooner rather than later.

Customers have found that with Big Data, traditional relational databases and desktop analysis tools and statistical packages, such as SPSS and Microsoft Excel, are no longer adequate: they cannot efficiently store, search, share, analyze, and visualize such large amounts of data. Instead, customers now require massively parallel software running on tens, hundreds, or even thousands of servers, much as in the Grid Computing movement of the early 21st century.

Overview of Greenplum Database

Greenplum Database is built on a shared-nothing MPP (Massively Parallel Processing) architecture that supports business intelligence and analytical processing on commodity hardware. Data is distributed across multiple segment servers in the Greenplum Database so that there is no disk-level sharing. The segment servers process queries in parallel, providing a high degree of parallelism and scalability.
Highlights of the Greenplum Database include:

- Dynamic Query Prioritization - provides continuous, real-time balancing of resources across queries
- Self-Healing Fault Tolerance - provides intelligent fault detection and fast online differential recovery
- Polymorphic Data Storage - MultiStorage/SSD support; includes tunable compression and support for both row- and column-oriented storage
- Analytics and Language Support - supports analytical functions for advanced in-database analytics
- Health Monitoring and Alerting - provides integrated email and SNMP notification for advanced support capabilities

EMC Greenplum Data Computing Appliance

EMC's Greenplum Data Computing Appliance (DCA) is a purpose-built, massively parallel processing (MPP) data warehousing appliance that integrates storage, database, and networking into a single enterprise-class system based on the Greenplum Database. It is built to deliver the industry's fastest data loading speed, and can expand linearly to accommodate customers' storage requirements for Big Data. It takes advantage of large clusters of increasingly powerful commodity servers, storage, and network switches to minimize the customer's cost of ownership. The database software uses a shared-nothing architecture that is optimized for fast queries and data loading, operating with the maximum degree of parallelism possible.

To meet the challenges of fast data loading, the EMC Data Integration Accelerator (DIA) is purpose-built for batch and micro-batch loading, and supports a growing number of data integration applications, such as Informatica PowerCenter and Attunity Replicate, that are able to benefit from Greenplum's bulk-load capabilities.

Greenplum's Scatter/Gather Streaming Technology

Parallel Loading

Greenplum's Scatter/Gather Streaming (SGS) technology, typically referred to as gpfdist, eliminates the bottlenecks associated with data loading, enabling ETL applications to stream data into the Greenplum database very quickly. The technology is intended for loading the big data sets normally used in large-scale analytics and data warehousing, and it manages the flow of data into all nodes of the database. It requires no additional software or systems, taking advantage of the same parallel dataflow engine nodes in the Greenplum database. Figure 1 shows how Greenplum uses a parallel-everywhere approach to loading: data flows from one or more source systems to every node of the database without any sequential choke points.

Figure 1

Greenplum's SGS technology ensures parallelism by scattering data from source systems across hundreds or thousands of parallel streams that flow simultaneously to all nodes of the Greenplum Database. Performance scales with the number of Greenplum Database nodes, and the technology supports both large batch and continuous near-real-time loading patterns with negligible impact on concurrent database operations. Figure 2 shows how the final gathering and storage of data to disk takes place on all nodes simultaneously, with data automatically partitioned across nodes and optionally compressed. This technology is exposed through a flexible and programmable external table interface (explained below) as well as a traditional command-line loading interface.

Figure 2
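The scatter idea described above can be sketched in miniature: rows are hashed on a distribution key and fanned out across a number of independent streams, one per segment, so no single node becomes a sequential choke point. This is a simplified illustration of the concept only, not Greenplum's actual implementation; the stream count, hash function, and row shape below are arbitrary choices for the sketch.

```python
from collections import defaultdict

def scatter(rows, key, num_streams):
    """Fan rows out across parallel streams by hashing a distribution key.

    Each row is routed to exactly one of num_streams independent streams
    (one per segment server in a real system), so the load is spread
    rather than funneled through a single sequential point.
    """
    streams = defaultdict(list)
    for row in rows:
        streams[hash(row[key]) % num_streams].append(row)
    return streams

# Example: 1000 rows scattered across 8 streams.
rows = [{"id": i, "value": i * i} for i in range(1000)]
streams = scatter(rows, "id", 8)

# Every row lands in exactly one stream, and the fan-out is roughly even.
assert sum(len(s) for s in streams.values()) == 1000
```

In the real appliance the "gather" side happens in parallel as well: each stream is consumed and written to disk by its own segment, which is why load throughput scales with the number of nodes.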

External Tables

External tables enable users to access data in external sources as if it were a table in the database. The external table definition in Greenplum includes an access definition for the gpfdist server that provides the SGS streaming of data described above. This approach is very flexible: it allows one to query the gpfdist-provided data in its external location without moving it into the database, and it also allows high-speed loading into the database proper through standard SQL commands, where a single INSERT statement reads data from the external table and writes it into Greenplum-managed database tables.

Overview of Attunity Replicate

TechTarget defines database replication as the frequent electronic copying of data from a database in one computer to a database in another, so that all users share the same level of information; the result is a distributed database in which users can access data relevant to their tasks without interfering with the work of others. Attunity Replicate falls under this definition and implements replication by taking full advantage of the MPP characteristics of the Greenplum database. Organizations can use Attunity Replicate to:

- Load data into the target Greenplum database. Attunity Replicate supports many forms of input data, including generic files, SQL scripts, and Oracle, SQL Server/SQL Azure, DB2, and ODBC sources. For the full initial load of data into Greenplum databases, Attunity Replicate uses gpfdist, the Greenplum parallel file distribution program mentioned above, to load data quickly and efficiently. Note that Greenplum database participates only as a target, never as a source.
- Create copies of production databases. On the Greenplum side, only the database itself needs to be created; Attunity Replicate takes care of creating the schema and tables. The target Greenplum database lets organizations offload queries from their operational systems, reducing database traffic and increasing operational performance. It is also a good means of distributing an organization's data across multiple data centers.
- Facilitate zero-downtime migration and upgrades. For organizations looking to migrate to Greenplum, Attunity Replicate can perform the initial load; immediately afterwards, business operations can cut over to the Greenplum database. From there, the Change Data Capture (CDC) capability of Attunity Replicate keeps the source and target databases in sync without any downtime.
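To make the external-table load path concrete, the sketch below builds the pair of SQL statements involved: a CREATE EXTERNAL TABLE that points at a gpfdist URL, and the single INSERT ... SELECT that performs the parallel load. The table name, columns, host, and file pattern are hypothetical, and a real load would tune the FORMAT clause to the data; Replicate generates the equivalent statements for you.

```python
def external_table_load_sql(table, columns, gpfdist_url):
    """Build the two statements behind a gpfdist external-table load:
    an external table pointing at the gpfdist server, then a single
    INSERT ... SELECT that streams the data into the managed table.
    """
    cols = ", ".join(f"{name} {type_}" for name, type_ in columns)
    ddl = (
        f"CREATE EXTERNAL TABLE ext_{table} ({cols}) "
        f"LOCATION ('{gpfdist_url}') FORMAT 'TEXT' (DELIMITER '|')"
    )
    load = f"INSERT INTO {table} SELECT * FROM ext_{table}"
    return ddl, load

# Hypothetical example: load a 'sales' table from files served by gpfdist
# on an ETL host; the wildcard lets gpfdist serve many files in parallel.
ddl, load = external_table_load_sql(
    "sales",
    [("id", "int"), ("amount", "numeric")],
    "gpfdist://etl-host:8081/sales*.dat",
)
```

Because the INSERT ... SELECT reads through the external table, every segment pulls its share of the data directly from gpfdist, which is what makes the load parallel end to end.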

Attunity Replicate accomplishes all of these activities through a simplified GUI concept called Click-2-Replicate, which is used in the Attunity Replicate Console for designing and monitoring replication tasks. Many of the steps necessary to build a replication solution have been automated, making the system easy to learn. Attunity Replicate uses log-based capture and delivery of transaction data, and does not require agents to be installed on the source and target servers. This zero-footprint technology allows the operational databases to continue operating without any impact.

Attunity Replicate Architecture

The Attunity Replicate environment consists of three main components: the source database, the target (Greenplum) database, and the Attunity replication server (Figure 3). End users interact with these components through the Attunity Replicate Console, which allows them to design, run, and monitor replication jobs. This web console is an intuitive, easy-to-use interface built around the Attunity Click-2-Replicate designer concept. When you design a replication task, all the databases that have been pre-defined and are known to Attunity Replicate are displayed for users to click and drag onto the design console. Once a database has been selected, its properties are pre-loaded into the design console's drop-down lists for users to pick from. The whole experience is very intuitive, even for beginners.

Figure 3

Attunity Replicate supports full initial data loading and CDC replication. During a full load of a source database to a Greenplum database, the metadata required to create the target tables is automatically generated, the target tables are created, and the target tables are populated with data from the source database. During this process, Attunity Replicate makes full use of the MPP capability of the Greenplum database to load data into multiple segments simultaneously. This can be done while the source database is being accessed and subjected to update activity, and the full load process can be interrupted and will resume from the point where it stopped when the process is restarted. Once the full load is completed, the CDC process is automatically activated.

The CDC process reads from the transaction log and archive log files, and buffers all the changes for a given transaction into a single unit, which is sent to the target only when the transaction commits. At the target, the transaction updates the affected tables, and the CDC process continues with the next transaction. If for any reason the changes cannot be applied to the target database within a reasonable timeframe, they are buffered on the replication server until the target is available. This avoids having to go back to the transaction and archive logs of the source database later, further improving the performance and efficiency of the CDC process.

During the CDC process, if new source tables are added, new columns are added to an existing source table, or columns are deleted from a source table, Attunity Replicate automatically applies the metadata changes to the target Greenplum database, keeping source and target in step without the need for user intervention.

During both the full load and the CDC process, the user may apply filtering conditions on one or more source columns. Rows and columns that are not relevant are discarded before the filtered data is replicated to the target database.
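The transaction buffering just described can be illustrated with a small sketch: changes read from the log are held per transaction, a COMMIT releases the whole transaction as one unit for the target, and a ROLLBACK discards the buffered changes without ever touching the target. This is a toy model of the general technique, not Attunity's implementation; the change-tuple shape is invented for the example.

```python
class CdcBuffer:
    """Toy sketch of per-transaction change buffering for CDC."""

    def __init__(self):
        self._pending = {}   # transaction id -> list of buffered changes
        self.committed = []  # units ready to apply to the target, in order

    def on_change(self, txn_id, change):
        """Buffer one change read from the source's transaction log."""
        self._pending.setdefault(txn_id, []).append(change)

    def on_commit(self, txn_id):
        """Release the whole transaction as a single unit for the target."""
        self.committed.append(self._pending.pop(txn_id, []))

    def on_rollback(self, txn_id):
        """Discard the transaction; the target never sees it."""
        self._pending.pop(txn_id, None)

# Two interleaved transactions read from the log; only txn 1 commits.
buf = CdcBuffer()
buf.on_change(1, ("INSERT", "orders", {"id": 10}))
buf.on_change(2, ("UPDATE", "orders", {"id": 7}))
buf.on_change(1, ("UPDATE", "orders", {"id": 10}))
buf.on_commit(1)    # txn 1's two changes become one unit for the target
buf.on_rollback(2)  # txn 2 never reaches the target
```

Buffering by transaction is what lets the replication server keep absorbing log changes even while the target is temporarily unavailable, and apply each transaction atomically when it catches up.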
There are times when the source data is not exactly equivalent to the target data. Attunity Replicate allows users to define the target data types and then automatically applies those definitions to the tables. Even when no user-defined transformation is set, replication between heterogeneous databases may require conversion between the different database data types; Attunity Replicate automatically takes care of the required transformations and computations during data transfer.

The Attunity Replicate architecture is a zero-footprint design: the CDC processes run without any agents being placed on the source or target database servers. Using log-based capture and delivery of transaction data allows Attunity Replicate to function as a very low-impact application, with low overhead on the database servers.

Requirements for Running Attunity Replicate

The Attunity Replicate Console can be installed on Windows Server 2008 x64, Windows Server 2008 R2 x64, and Windows 7 x64. We have also tested the software on Windows 2003 x64.

The minimum hardware and software environment to run this software is as follows:

- 1.4 GHz processor or faster
- 2 GB RAM
- 5 GB free disk space
- .NET Framework 4
- One of the following internet browsers:
  - Microsoft Internet Explorer version 8 and above
  - Mozilla Firefox version 3.6 and above
  - Google Chrome
- Adobe Flash Player version 10.2.5 or later

Testing Attunity Replicate

Testing with Single-node GP databases

In our test environment, we loaded four Greenplum database instances (versions 4.1 and 4.2) on various VMware VM servers, and ran Attunity Replicate on various Windows platforms: Windows 2003, Windows 2008 R2, and Windows 7. The installation process consists of a single step: running a setup program that installs the program group and program icons. No special setup is necessary for these single-node databases.

Defining a source or target database in Attunity Replicate is very easy. In the Manage Databases screen, select Add Database and enter the necessary information. If the database host name and port number are correctly configured, the drop-down box for the database names is populated for users to select the required database. Next, select the Test Connection button; if this succeeds, the database is correctly defined and ready for use in a replication task.

Setting up a new replication task is a matter of dragging the defined databases to the pre-defined boxes in the design console (Figure 4). When New Task is selected, a designer screen shows two boxes connected by a link. On the left-hand side are the defined databases. Users click on the required source and target databases and drag them to the boxes labeled "Drop source database here" or "Drop target database here". It is as simple as that (see Figure 4).

Figure 4

The Settings tab in the designer screen gives users the opportunity to further define how the initial full load is processed. For example, the user may want to drop and re-create tables at the target, re-use the existing tables by appending data, or truncate the existing tables (see Figure 5). For the initial target table creation, users should make sure that the schema name is correctly entered (see Figure 6); otherwise, the default settings are usually adequate for most common tasks.

Figure 5

Figure 6

Testing with full rack DCA

For testing with a full rack EMC DCA, we need to ensure that the Attunity server has full network access to the segment servers in the DCA. In our tests, we used a physical server running the Windows Server 2008 R2 operating system. The server has dual-homed LAN networking: one network port is connected to the public LAN, while the second is connected to a free port in the DCA Administrative switch, giving it access to the segment servers on the internal Interconnect switch network. The second port is assigned a network address in the 172.28.8.x range so that it can access the segment servers.

Because we were testing with the DCA, we were able to run the full load of a much larger database and experience the fast load speed of the gpfdist program. While the initial data load was running, the Attunity Replicate console displayed the progress graphically, reporting the completion rate and the number of transactions buffered in the server. For DCA data loading, the buffer counts were kept to a minimum, at low single-digit numbers (Figure 7).

Figure 7

Summary of Attunity Replicate testing

Running Attunity Replicate tests on both the single-node Greenplum database and the DCA was a fun experience. The source and target database definitions were easy to create using the designer console, and the replication tasks were easy to set up and configure. The Replicate log file provides users with multiple levels of error messages, which were very useful in pinpointing setup and configuration errors.
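One common class of setup error in a dual-homed configuration like the one above is simple reachability: the replication server's second network port must actually be able to open connections to the segment servers on the interconnect network. A minimal pre-flight check can be scripted; the addresses and port below are hypothetical placeholders, to be replaced with your own segment hosts and the port your load path actually uses.

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical segment-server addresses on the DCA interconnect network;
# substitute your own hosts and the relevant port for your environment.
segment_hosts = ["172.28.8.1", "172.28.8.2"]
unreachable = [h for h in segment_hosts if not can_reach(h, 5432, timeout=0.5)]
```

If `unreachable` is non-empty, fix routing on the second network port before blaming the replication software; a connectivity gap here shows up later as opaque load failures.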

Once the tasks were running, it was fun to watch the console to see the progress and keep track of the number of records inserted, the number of DDL changes, and the other database activity that had occurred (Figure 7).

Attunity Replicate is very easy to set up and use with Greenplum databases. Whether for a single-node database or a full rack DCA, the setup process is the same.

Conclusion

The Greenplum Database, with its fast data loading and elastic scalability, is an ideal platform for business intelligence and analytical processing. Customers using databases such as Oracle, DB2, and SQL Server can exploit these advantages by migrating their databases to Greenplum, and this paper explored that migration process using Attunity Replicate. Attunity Replicate takes advantage of the MPP and Scatter/Gather parallel processing technologies in the Greenplum database. It has an easy-to-use, intuitive user interface, and its CDC process automatically captures not only changes in data records but changes in metadata as well. It is an attractive choice as a migration and data replication tool for the Greenplum database.

References

1. Attunity Replicate for Greenplum. http://www.attunity.com/products/attunity-replicate-for-greenplum
2. Technical whitepaper: Attunity Replicate for EMC Greenplum. http://resources.attunity.com/technical-whitepaper-attunity-replicate-for-emcgreenplum/
3. Greenplum Database 4.2 Administrator Guide, P/N 300-013-163, Rev A03. http://powerlink.emc.com/km/live1/en_us/offering_technical/technical_documentation/300-013-163.pdf