Move Data from Oracle to Hadoop and Gain New Business Insights




Written by Lenka Vanek, senior director of engineering, Dell Software

Abstract

Today, the majority of data for transaction processing resides inside relational database management systems such as Oracle, where it supports critical business applications from order entry to financials. Processing performance of these systems is of utmost importance, so organizations usually keep only limited time slices of their data in Oracle. Moving data from Oracle to Apache Hadoop enables organizations to analyze, access and reference their data across multi-year economic cycles. However, transferring data from Oracle to Hadoop is challenging, even with tools like Apache Sqoop. Replication offers a powerful alternative. SharePlex Connector for Hadoop replicates data from Oracle to the Hadoop Distributed File System (HDFS) in near real time and from Oracle to Apache HBase in real time, enabling you to gain new business insights. This technical brief explores the differences between Sqoop data transfer and SharePlex replication. In particular, organizations may want to explore the evolution of the committed transactions stored by Oracle, or change data capture (CDC) records. Hadoop and its cheap storage offer an ideal repository for these change trails, and, as we will see, SharePlex Connector for Hadoop supports capturing CDC records of Oracle tables to HDFS and Hive.

Introduction

The benefits of moving data from Oracle to Hadoop

Many Hadoop analytic tasks depend on data traditionally maintained primarily in a relational database such as Oracle. For instance, an analysis of web logs might need to access the CUSTOMER table in Oracle to obtain a customer's geographic information, or it might need to access a FACT table in an Oracle data warehouse to correlate historical sales into a predictive model. Rather than use cross-database joins, which add additional load to the Oracle database, it can be advantageous to move these key tables from Oracle to Hadoop.
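To make the pattern concrete, here is a minimal sketch in plain Python (not Sqoop or SharePlex code; the table name, columns and values are invented for illustration): once a copy of the CUSTOMER dimension lives in Hadoop alongside the web logs, each log record can be enriched with a local lookup instead of a cross-database join back to Oracle.

```python
# Hedged illustration only: a map-side join against a local copy of an
# Oracle dimension table, instead of a cross-database join per record.
# Table and column names are invented for the example.

# A snapshot of the Oracle CUSTOMER table, replicated into Hadoop.
customer_dim = {
    101: {"name": "Acme Corp", "region": "EMEA"},
    102: {"name": "Globex", "region": "APAC"},
}

# Web-log records already stored in HDFS.
web_logs = [
    {"customer_id": 101, "url": "/checkout"},
    {"customer_id": 102, "url": "/search"},
    {"customer_id": 101, "url": "/home"},
]

def enrich(log, dim):
    """Join one log record with the replicated dimension (map-side join)."""
    cust = dim.get(log["customer_id"], {})
    return {**log, "region": cust.get("region", "UNKNOWN")}

enriched = [enrich(rec, customer_dim) for rec in web_logs]
by_region = {}
for rec in enriched:
    by_region[rec["region"]] = by_region.get(rec["region"], 0) + 1

print(by_region)  # {'EMEA': 2, 'APAC': 1}
```

Every lookup is local to the Hadoop job; the Oracle system is touched only when the dimension copy is refreshed, which is exactly the load reduction the offload pattern is after.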

The benefits and limitations of Sqoop, and a better alternative

Transferring data from an Oracle relational database to Hadoop, however, is challenging. Apache Sqoop was created to perform bi-directional data transfer between Hadoop and Oracle (or any other external structured datastore). Since 2010, Dell Software has provided an adaptor for Sqoop that accelerates data transfers between Hadoop and Oracle. This open-source adaptor, sometimes called OraOop and distributed under the official name of Data Connector for Oracle and Hadoop, is part of core Sqoop as of version 1.4.5. Dell remains committed to improving the performance of Sqoop. However, even as we strive to make Sqoop more powerful, we are well aware that Sqoop is not the right solution for all scenarios. In particular, while Sqoop provides a way to copy data from a relational database to Hadoop, running Sqoop every time this information is needed places undue load on the database, increases the run time for Hadoop processing and increases management complexity. A replication solution like SharePlex Connector for Hadoop, which keeps Oracle tables up to date in Hadoop, solves all of these problems.

The key differences between Sqoop and SharePlex Connector for Hadoop

The appendix at the end of this paper details the differences between Apache Sqoop and SharePlex Connector for Hadoop, but let's summarize the key differences here. First, these two technologies solve different business problems: Apache Sqoop is ideal for occasional batch data transfer to create a data snapshot, while SharePlex Connector for Hadoop performs near real-time or real-time data replication.
Second, SharePlex Connector for Hadoop minimizes the impact on the source Oracle database by reading data from the Oracle logs, while Apache Sqoop accesses the Oracle system directly, as a user.

Third, Sqoop offers the ability to do incremental imports, but the onus is on you and your Oracle application to maintain tables with specific columns that reliably identify new inserts or updates since the last imported rows, so that Sqoop can start where it left off. You also need to determine the precise timing of these incremental jobs to ensure a consistent copy. Sqoop does not support the capture of deletes, and it was not intended to capture Oracle change data records. SharePlex Connector for Hadoop, on the other hand, puts no prerequisites on your source Oracle database or application. Instead, it seamlessly, automatically and with minimal performance overhead maintains a real-time or near real-time copy of the source Oracle tables by capturing changes from the Oracle logs and posting those changes to Hadoop.

Tapping the potential of change data records

Let's explore the value of those change data records. At any given point in time, a relational database shows only the committed state of business data. However, sometimes organizations need to know not only the final state of transactions but also the history of the changes that led to it. They need to know the sequence of these transactional changes and be able to audit all insert, update and delete activity. The sequence of logged changes to a table in an Oracle database is called change data capture (CDC) records; it is also sometimes referred to as "digital exhaust." Oracle stores change data records in the redo logs. The sheer volume of this data, without easy access, prevented analytic mining of these records in the past. Decreases in the cost of both storage and compute power have made it feasible to collect these

digital log sequences and include this potentially very valuable data in business intelligence analysis.

SharePlex Connector for Hadoop enables your organization to capture every insert, update and delete action on your Oracle tables and post this data to HDFS or HBase, along with the metadata needed to audit these changes. During CDC replication, SharePlex Connector for Hadoop adds additional columns to the target table to store metadata describing the nature of the changes. You can configure which operations to post into Hadoop, and you can also choose to post a before image of update operations, if desired. Change data written to HDFS for a particular Oracle table can be accessed in two ways: via Hive SQL or via custom MapReduce programs.

Option 1: Access via Hive SQL

SharePlex Connector for Hadoop creates external tables in Hive for the CDC feature. External tables are organized in Hive databases in the same way that Oracle tables are organized by owner and schema. Each external Hive table points to the data stored for that table in the CDC HDFS destination directory. SharePlex Connector for Hadoop maps Oracle data types to the appropriate data types in the external Hive table, as shown in Table 1. This allows the user to make use of the built-in Hive UDFs available for the specific Hive data types.

Table 1. How SharePlex Connector for Hadoop maps Oracle data types to Hive data types

Oracle data type                 Hive data type
CHAR                             STRING
VARCHAR2                         STRING
NCHAR                            STRING
NVARCHAR2                        STRING
NUMBER                           DOUBLE
FLOAT                            DOUBLE
DATE                             TIMESTAMP
TIMESTAMP                        TIMESTAMP
TIMESTAMP WITH TIME ZONE         STRING
TIMESTAMP WITH LOCAL TIME ZONE   STRING
INTERVAL YEAR TO MONTH           STRING
INTERVAL DAY TO SECOND           STRING
ROWID                            STRING

Option 2: Access via custom MapReduce programs

To support access via custom MapReduce programs, SharePlex Connector for Hadoop generates

change data records (Java utility classes) per CDC table. These utility classes are generated under the lib directory. SharePlex Connector for Hadoop also enables you to write data to HDFS files in text, sequence or Apache Avro format. Avro format improves bandwidth usage and CPU processing time for real-time analytics at scale; it is designed to support data-intensive applications and is supported in a variety of programming languages. By supporting this data format, SharePlex Connector for Hadoop enables organizations not only to access CDC data but also to use that data in their predictive analytics.

Conclusion

Today, information is power. Apache Hadoop enables you to store years of data to gain new business insights, and SharePlex Connector for Hadoop makes synchronization of your Oracle and Hadoop data systems seamless, enabling the long-term and granular business analytics your organization needs to succeed. We invite you to visit our website to learn more about SharePlex and SharePlex Connector for Hadoop.

About the author

Lenka Vanek is senior director of engineering at Dell Software. She leads development of the Toad and SharePlex product lines, which millions of users worldwide trust to work with their relational databases and new technologies such as Hadoop. Her key areas of interest are Oracle and Hadoop. She has presented at several technical conferences, including ASUG, NoCOUG and SAP TechEd.

Appendix A: When to use Apache Sqoop and when to use SharePlex Connector for Hadoop

Business need: Minimize impact on Oracle OLTP system performance and availability
Sqoop import: Impacts performance, because it needs to fetch data from Oracle to perform the transfer to Hadoop.
SharePlex Connector for Hadoop: The connector performs log-based replication of the source table in Oracle, providing a real-time or near real-time copy in the Hadoop environment and minimizing ongoing impact on the availability and performance of the Oracle system.

Business need: Continuously update the imported copy of the source Oracle table without user intervention
Sqoop import: User intervention is needed to run the series of incremental Sqoop imports when required.
SharePlex Connector for Hadoop: The connector automatically detects changes taking place on source tables and updates the Hadoop copy accordingly, maintaining an updated replica of the Oracle table on Hadoop without user intervention.

Business need: Import new rows added to the source table in Oracle (inserts only)
Sqoop import: Sqoop incremental import with append mode allows users to retrieve only rows newer than the previously imported set of rows. For Sqoop to detect new rows, the Oracle table must have a column with an incrementing integer value; this column cannot have a non-integer data type, and the user needs to know the value of the last imported row.
SharePlex Connector for Hadoop: The connector can seamlessly append newly inserted rows to the previously imported data on HDFS or HBase, eliminating the need to modify the application to maintain a special column to detect changes.

Business need: Import changes to existing rows in Oracle (updates)
Sqoop import: Sqoop incremental import with lastmodified mode provides this functionality with limited support. Lastmodified mode requires a column holding a date value (suitable types are date, time, datetime and timestamp) specifying when each row was last updated; Sqoop will import only those rows that were updated after the last import. The onus is on your application to reliably update this column on every row change, or else results may be unpredictable. Further, after each incremental import, the user needs to run a Sqoop merge job to combine the old and newly imported data sets on HDFS. For HBase, a table with updated rows will be available, as HBase maintains versions for each row change.
SharePlex Connector for Hadoop: The connector detects changes taking place on rows in the source Oracle table for all data types and updates the files stored on HDFS and the rows present in the HBase table accordingly, eliminating the need to modify the application to maintain a special column to detect changes.

Business need: Delete operations
Sqoop import: Not supported.
SharePlex Connector for Hadoop: The connector removes rows from HBase tables or HDFS files when they are deleted from the source Oracle table.
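The append-mode limitations in the rows above can be illustrated with a small, hedged Python simulation (no Sqoop involved; the data and the check column are invented): an import that fetches only rows beyond the last imported key sees new inserts but silently misses updates and deletes.

```python
# Hedged, Sqoop-free illustration of append-mode incremental import:
# only rows with id greater than the last imported id are fetched,
# so deletes (and updates that do not touch the check column) are missed.

source = {1: "alice", 2: "bob", 3: "carol"}        # Oracle table: id -> value
replica = dict(source)                             # initial full import
last_value = max(replica)                          # Sqoop's --last-value

# Changes on the source after the first import:
source[4] = "dave"          # insert  -> visible to append mode
source[2] = "robert"        # update  -> id unchanged, NOT visible
del source[1]               # delete  -> NOT visible

def incremental_append(src, last):
    """Fetch only rows whose check column exceeds the last imported value."""
    return {k: v for k, v in src.items() if k > last}

replica.update(incremental_append(source, last_value))

print(replica[4])            # 'dave' - the new insert arrived
print(replica[2])            # 'bob'  - stale: the update was missed
print(1 in replica)          # True   - stale: the delete was missed
print(1 in source)           # False  - the source row is gone
```

A log-based replicator avoids this trap because it reads the change stream itself rather than re-querying the table by a check column.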

Business need: Import changes to the primary key value in the source Oracle table
Sqoop import: A Sqoop merge job makes use of the primary key to merge a newly imported data set with the old one. This operation will fetch data incorrectly (duplicate records) if the primary key is modified. Similarly, HBase incremental import might fail if the source table primary key (acting as the row key in HBase) is modified.
SharePlex Connector for Hadoop: The connector can handle primary key updates (and composite key updates as well) without compromising data integrity.

Business need: Change the schema of the source table in Oracle
Sqoop import: No support available. The user needs to import the entire table every time the schema changes.
SharePlex Connector for Hadoop: The connector detects the change in schema and suggests that the user take a snapshot of the table.

Business need: Create data partitions on HDFS
Sqoop import: Sqoop has limited support for creating partitions while importing data on HDFS. HCatalog import with dynamic partitioning moves the data within the Hive warehouse. Its shortcomings include: Sqoop does not support column name mapping; Sqoop does not support creating range partitions on HDFS data; and the Hadoop cluster must have HCatalog installed and running.
SharePlex Connector for Hadoop: The connector can partition data replicated on HDFS. In fact, it provides a wide range of partitioning features to support different requirements, including custom partitions, range partitions, multi-level partitions, and combinations of custom and range partitions. The connector maintains its own copy of Sqoop, which has been improved to add support for custom and range partitions without HCatalog integration.

Business need: Replicate multiple tables concurrently
Sqoop import: Users can run multiple Sqoop jobs simultaneously, but this considerably degrades the performance of the RDBMS. It is also painful for the user to constantly keep track of the changes taking place on different tables and update them accordingly on Hadoop using incremental imports.
SharePlex Connector for Hadoop: The connector needs to take a snapshot of each table just once; it then replicates changes as they occur, concurrently for all the tables added for replication.

Business need: Support for Hive external tables over HDFS
Sqoop import: The Hive import feature gives users access to the data, but this data resides within the Hive warehouse, and deleting a table from Hive will eventually delete the data.
SharePlex Connector for Hadoop: The connector replicates the data on HDFS and creates external tables in Hive to provide easy access to the HDFS data. When an external table is dropped, the data in the table is not deleted from the file system. Because the data is independent of Hive, it can be used by other tools like Pig or even custom MapReduce jobs.

Business need: Create external tables in Hive over HBase imported data
Sqoop import: Sqoop does not support creating external tables in Hive over HBase data.
SharePlex Connector for Hadoop: The connector creates an external table in Hive that can point to the data replicated on HBase, providing access to the HBase data with SQL queries.

Business need: Track changes taking place on a source table in Oracle (CDC)
Sqoop import: Sqoop imports are useful for taking a snapshot (replicating the current state) of a table in an RDBMS like Oracle. However, Sqoop does not replicate the logs representing the changes that took place on the table.
SharePlex Connector for Hadoop: The connector captures the change data log, which represents state-change information for individual rows, enabling you to analyze those changes or perform analytics using them.
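The change-data-log idea in the last row can be sketched in plain Python (the record layout, operation codes and column names here are invented for illustration, not the connector's actual format): replaying an ordered CDC stream reproduces the table's final state while keeping the full, auditable history.

```python
# Hedged sketch of CDC records with audit metadata. The metadata fields
# (op, scn, before_amount) are invented; SharePlex's actual CDC column
# names may differ.
cdc_records = [
    {"op": "INS", "scn": 1001, "id": 7, "amount": 50.0, "before_amount": None},
    {"op": "UPD", "scn": 1002, "id": 7, "amount": 75.0, "before_amount": 50.0},
    {"op": "DEL", "scn": 1003, "id": 7, "amount": None, "before_amount": 75.0},
]

def replay(records):
    """Apply an ordered CDC stream to rebuild the table's final state."""
    table = {}
    for rec in sorted(records, key=lambda r: r["scn"]):
        if rec["op"] in ("INS", "UPD"):
            table[rec["id"]] = rec["amount"]
        elif rec["op"] == "DEL":
            table.pop(rec["id"], None)
    return table

# The final state is empty (row 7 was deleted), yet the full history of
# the row - including the before image of the update - stays auditable.
print(replay(cdc_records))  # {}
history = [(r["scn"], r["op"]) for r in cdc_records]
print(history)  # [(1001, 'INS'), (1002, 'UPD'), (1003, 'DEL')]
```

A plain snapshot would show only the empty final state; the change log is what makes the intermediate states available for audit and analytics.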

Business need: Ensure data integrity
Sqoop import: A Sqoop incremental import does not perform any validations to ensure data integrity. As a result, any minor mistake during an incremental import may cause Sqoop to fetch wrong or redundant data without the user being aware of it.
SharePlex Connector for Hadoop: The connector performs validations before replicating the data to Hadoop so as to maintain data integrity. It reports any inconsistencies found during replication and also makes an effort to rectify them.

Business need: Be able to modify a source Oracle table while replication is in progress
Sqoop import: Sqoop does not guarantee its behavior if an Oracle table is modified while a Sqoop import is in progress: it may import none, some or all of the rows added or modified concurrently. As a result, the user will need to run an incremental import to ensure that the imported data is in sync with the Oracle table.
SharePlex Connector for Hadoop: The connector reliably replicates changes taking place while replication or an initial snapshot is in progress. The replicated data will be consistent with the source table, and the user is not required to re-snapshot the table.

Business need: Access sequence-type data on HDFS using Hive tables
Sqoop import: Not supported.
SharePlex Connector for Hadoop: Users can run Hive queries to access the data replicated in sequence file format on HDFS.
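As a hedged illustration of the data-integrity row above, here is one generic way a consistency check between a source table and its replica could work, using per-row digests. This is plain Python and a common general technique, not SharePlex Connector for Hadoop's actual validation mechanism.

```python
# Hedged sketch: compare a source table and its replica via per-row
# digests to flag missing or diverged rows. Generic approach only; the
# connector's real validation logic is not documented here.
import hashlib

def row_digest(row):
    """Stable digest of a row's column values."""
    joined = "|".join(str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def find_inconsistencies(source, replica):
    """Return keys whose rows are missing or differ between the copies."""
    bad = []
    for key in source.keys() | replica.keys():
        if key not in source or key not in replica:
            bad.append(key)
        elif row_digest(source[key]) != row_digest(replica[key]):
            bad.append(key)
    return sorted(bad)

source = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
replica = {1: ("alice", 10), 2: ("bob", 99)}   # row 2 diverged, row 3 missing

print(find_inconsistencies(source, replica))   # [2, 3]
```

Comparing digests rather than full rows keeps the check cheap to ship between systems, which matters when the replica lives on a separate Hadoop cluster.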

For More Information

© 2014 Dell, Inc. ALL RIGHTS RESERVED. This document contains proprietary information protected by copyright. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording for any purpose, without the written permission of Dell, Inc. ("Dell"). Dell, Dell Software, the Dell Software logo and products as identified in this document are registered trademarks of Dell, Inc. in the U.S.A. and/or other countries. All other trademarks and registered trademarks are property of their respective owners. The information in this document is provided in connection with Dell products. No license, express or implied, by estoppel or otherwise, to any intellectual property right is granted by this document or in connection with the sale of Dell products. EXCEPT AS SET FORTH IN DELL'S TERMS AND CONDITIONS AS SPECIFIED IN THE LICENSE AGREEMENT FOR THIS PRODUCT, DELL ASSUMES NO LIABILITY WHATSOEVER AND DISCLAIMS ANY EXPRESS, IMPLIED OR STATUTORY WARRANTY RELATING TO ITS PRODUCTS INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. IN NO EVENT SHALL DELL BE LIABLE FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE, SPECIAL OR INCIDENTAL DAMAGES (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION OR LOSS OF INFORMATION) ARISING OUT OF THE USE OR INABILITY TO USE THIS DOCUMENT, EVEN IF DELL HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Dell makes no representations or warranties with respect to the accuracy or completeness of the contents of this document and reserves the right to make changes to specifications and product descriptions at any time without notice. Dell does not make any commitment to update the information contained in this document.
About Dell Software

Dell Software helps customers unlock greater potential through the power of technology, delivering scalable, affordable and simple-to-use solutions that simplify IT and mitigate risk. The Dell Software portfolio addresses five key areas of customer needs: data center and cloud management, information management, mobile workforce management, security and data protection. This software, when combined with Dell hardware and services, drives unmatched efficiency and productivity to accelerate business results. www.dellsoftware.com

If you have any questions regarding your potential use of this material, contact:

Dell Software
5 Polaris Way
Aliso Viejo, CA 92656
www.dellsoftware.com

Refer to our website for regional and international office information.

TechBrief-MoveDataOracletoHadoop-US-VG-25229