Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload

Drive operational efficiency and lower data transformation costs with a Reference Architecture for an end-to-end optimization and offload solution.

A Dell Big Data White Paper by Armando Acosta, SME, Product Manager, Dell Big Data Hadoop Solutions

Data transformation costs are on the rise

Today's enterprises are struggling to ingest, store, process, transform and analyze data to build insights that turn into business value. Many Dell customers have turned to Hadoop to help solve these data challenges. At Dell, we recognize the need to help our customers better define Hadoop use case architectures to cut cost and gain operational efficiency. With those objectives in mind, we worked with our partners Intel, Cloudera and Syncsort to introduce the use case-based Reference Architecture for Data Warehouse Optimization for ETL Offload.

ETL (Extract, Transform, Load) is the process by which raw data is moved from source systems, manipulated into a consumable format, and loaded into a target system for advanced analytics, analysis and reporting. Shifting this job into Hadoop can help your organization lower cost and increase efficiency by shortening batch windows and delivering fresher data that can be queried faster, because the EDW is not bogged down in data transformation jobs.

Traditional ETL tools have not been able to handle the data growth of the past decade, forcing organizations to shift the transformation work into the enterprise data warehouse (EDW). This has caused significant pain for customers, resulting in 70 percent of all data warehouses being performance and capacity constrained.¹ EDWs are now unable to keep up with the most important demands: business reporting and analysis. Additionally, data transformation jobs are very expensive to run in an EDW, given larger data sets and the growing number of data sources, and it is cost prohibitive to scale EDW environments.

¹ Source: Gartner.

Sidebar: Build Your Hadoop
Dell Reference Architectures
- 2011: CDH 3 v1.4
- 2012: CDH 3 v1.5, 1.6
- 2012: CDH 4, 4.1
- 2013: CDH 4.2, 4.5
- 2014: CDH 5, 5.1
- 2015: CDH 5.3, 5.4
Dell PowerEdge Cloudera Certified
- 2011: PowerEdge C2100
- 2012: PowerEdge R720/R720XD
- 2014: PowerEdge R730/R730XD

Augment the EDW with Hadoop

The first use case in the big data journey typically begins with a goal to increase operational efficiency. Dell customers understand that they can use Hadoop to cut costs, yet they have asked us to make it simple. They want defined architectures that provide end-to-end solutions validated and engineered to work together.

The Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Reference Architecture (RA) provides a blueprint to help your organization build an environment to augment your EDW. The RA provides the architecture, beginning from bare-metal hardware, for running ETL jobs in Cloudera Enterprise with Syncsort DMX-h software. Dell provides the cluster architecture, including configuration sizing for the edge nodes that ingest data and for the data nodes that do the data transformation work. Network configuration and setup are included in the RA to enable a ready-to-use Hadoop cluster.

Many of our customers have a skill-set gap when it comes to utilizing Hadoop for ETL in their environments. They don't have time to build up expertise in Hadoop. The software components of the Reference Architecture help you address this challenge. They make it easy, even for non-data-scientists, to build and deploy ETL jobs in Hadoop. The Syncsort software closes the skills gap between Hadoop and enterprise ETL, turning Hadoop into a more robust and feature-rich ETL solution. Syncsort's high-performance ETL software enables your users to maximize the benefits of MapReduce without compromising on the capabilities and ease of use of conventional ETL tools. With Syncsort Hadoop ETL solutions, your organization can unleash Hadoop's full potential, leveraging the only architecture that runs ETL processes natively within Hadoop. Syncsort software enables faster time to value by reducing the need to develop expertise on Pig, Hive and Sqoop, technologies that are essential for creating ETL jobs in MapReduce.

How did we get here?

In the 1990s there was a vision of the enterprise data warehouse: a single, consistent version of the truth for all corporate data. At the core of the vision was a process through which organizations could take data from multiple transactional applications, transform it into a format suitable for analysis with operations such as sorting, aggregating and joining, and then load it into the data warehouse. The continued growth of data warehousing and the rise of relational databases led to the development of ETL tools purpose-built for managing the increasing complexity and variety of applications and sources involved in data warehouses. These tools usually run on dedicated systems as a back-end part of the overall data warehouse environment.
However, users got addicted to data, and early success resulted in greater demands for information:
- Data sources multiplied in number
- Data volumes grew exponentially
- Businesses demanded fresher data
- Mobile technologies, cloud computing and social media opened the doors for new types of users who demanded different, readily available views of the data

To cope with this demand, users were forced to push transformations down to the data warehouse, in many cases resorting back to hand coding. This shift turned the data warehouse architecture into a very different reality: something that looks like a spaghetti architecture, with data transformations all over the place, because ETL tools couldn't cope with core operations such as sort, join and aggregation on increasing data volumes. This has caused a major performance and capacity problem for organizations. The agility and costs of the data warehouse have been impacted by:
- An increasing number of data sources
- New, unstructured data sources
- Exponential growth in data volumes
- Demands for fresher data
- The need for increased processing capacity

The scalability and low storage cost of Hadoop are attractive to many data warehouse installations. Hadoop can be used as a complement to data warehousing activities, including batch processing, data archiving and the handling of unstructured data sources. When organizations consider Hadoop, offloading ETL workloads is one of the common starting points. Shifting ETL processing from the EDW to Hadoop and its supporting infrastructure offers three key benefits. It helps you:
- Achieve significant improvements in business agility
- Save money and defer unsustainable costs (particularly costly EDW upgrades just to keep the lights on)
- Free up EDW capacity for faster queries and other workloads more suitable for the EDW

The Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Reference Architecture is engineered to help our customers take the first step in the big data journey. It provides a validated architecture to help you build a data warehouse optimized for what it was meant to do. Additionally, the Dell solutions deliver faster time to value with Hadoop. Dell understands that Hadoop is not easy, and without the right tools, designing, developing and maintaining a Hadoop cluster can drain lots of time, resources and money. Hadoop requires new skills that are in high demand (and expensive). Offloading heavy ETL processes to Hadoop provides high ROI and delivers operational savings, while allowing your organization to build the required skills to manage and maintain your enterprise data hub (EDH). The Dell Cloudera Syncsort solution is built to meet all these needs.
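To make the offload pattern concrete, the minimal sketch below uses Spark on the Hadoop cluster to extract source data over JDBC, run the heavy transformations outside the EDW and land the result in HDFS. It is illustrative only; it is not Syncsort DMX-h syntax, and the connection string, table names and paths are hypothetical.

```python
# Illustrative ETL-offload sketch using PySpark as the compute framework on
# the Hadoop cluster. Not DMX-h syntax; connection details, table and path
# names are hypothetical, and a suitable JDBC driver is assumed to be present.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("edw-etl-offload").getOrCreate()

# Extract: pull raw transactions from the source system over JDBC.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//source-db:1521/ORCL")
          .option("dbtable", "SALES.ORDERS")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# Transform: the sort/join/aggregate work that previously ran inside the EDW.
daily_revenue = (orders
                 .filter(F.col("status") == "SHIPPED")
                 .groupBy("order_date", "region")
                 .agg(F.sum("amount").alias("revenue"),
                      F.count(F.lit(1)).alias("order_count")))

# Load: land the conformed result in HDFS, ready to publish to the EDW or
# query directly on the cluster.
daily_revenue.write.mode("overwrite").parquet("hdfs:///warehouse/conformed/daily_revenue")
```

The batch window shrinks because the transformation runs on the scale-out cluster, and the EDW only receives conformed results instead of doing the heavy lifting itself.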

Faster time to value

The Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Reference Architecture provides a blueprint to help you build an environment to augment your EDW. This Reference Architecture can help you reduce Hadoop deployment to weeks, develop Hadoop ETL jobs within hours and become fully productive within days. Dell, together with Cloudera, Syncsort and Intel, takes the hard work out of building, deploying, tuning, configuring and optimizing Hadoop environments.

The solution is based on Dell PowerEdge R730 and R730xd servers, Dell's latest 13th-generation two-socket, 2U rack servers that are designed to run complex workloads using highly scalable memory, I/O capacity and flexible network options. Both systems feature the Intel Xeon processor E5-2600 v3 product family (Haswell-EP), up to 24 DIMMs, PCI Express (PCIe) 3.0-enabled expansion slots and a choice of network interface technologies. The PowerEdge R730 is a purpose-built Hadoop platform that is flexible enough to run balanced, CPU-intensive or memory-intensive Hadoop workloads.

The solution is built with Cloudera Enterprise Data Hub. The Cloudera Distribution of Hadoop (CDH) delivers the core elements of Hadoop (scalable storage and distributed computing) as well as all of the necessary enterprise capabilities, such as security, high availability and integration with the large set of ecosystem tools. CDH also includes Cloudera Manager, the best-in-class holistic interface that provides end-to-end system management and key enterprise features to deliver granular visibility into, and control over, every part of an enterprise data hub. For tighter integration and ease of management, Syncsort has a dedicated tab in Cloudera Manager to monitor DMX-h.

A key piece of the architecture is the Syncsort DMX-h software. Syncsort DMX-h is designed from the ground up to remove barriers to mainstream Hadoop adoption and deliver the best end-to-end approach for shifting heavy workloads into Hadoop. DMX-h provides all the connectivity you need to build your enterprise data hub. An intelligent execution layer allows you to design sophisticated data transformations, focusing solely on business rules, not on the underlying platform or execution framework. This unique architecture future-proofs the process of collecting, blending, transforming and distributing data, providing a consistent user experience while still taking advantage of the powerful native performance of the evolving compute frameworks that run on Hadoop.

Syncsort has also developed a unique utility, SILQ, which takes a SQL script as input and produces a detailed flow chart of the entire data flow. Using an intuitive web-based interface, you can easily drill down to get detailed information about each step within the data flow, including tables and data transformations. SILQ even offers hints and best practices for developing equivalent transformations using Syncsort DMX-h, a unique solution for Hadoop ETL that eliminates the need for custom code, delivers smarter connectivity to all your data and improves Hadoop's processing efficiency. One of the biggest barriers to offloading from the data warehouse into Hadoop has been a legacy of thousands of scripts built and extended over time. Understanding and documenting massive amounts of SQL code, and then mastering the advanced programming skills needed to offload these transformations, has left many organizations reluctant to move. SILQ removes this roadblock, eliminating the complexity and risk.

Dell Services can provide additional velocity to the solution through implementation services for ETL offload, or Hadoop Administration Services designed to support your needs from inception to steady state.

The Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload solution

At the foundation of the solution is the Hadoop cluster powered by Cloudera Enterprise. The Hadoop cluster is divided into infrastructure and data nodes. The infrastructure nodes are the hardware required for the core operations of the cluster. The administration node provides deployment, configuration management and monitoring of the cluster, while the name nodes provide Hadoop Distributed File System (HDFS) directory and MapReduce job tracking services.

Figure: Hadoop Cluster Architecture

The edge node acts as a gateway to the cluster, and runs the Cloudera Manager server and various Hadoop client tools. In the RA, the edge nodes are also used for data ingest, so it may be necessary to account for additional disk space for data staging or intermediate files. The data nodes are the workhorses of the cluster, and make up the bulk of the nodes in a typical cluster. The Syncsort DMX-h software runs on each data node. DMX-h has been optimized to use up to 75 percent less CPU and memory and up to 90 percent less storage, so the data nodes do not need any increased processing capacity or memory performance.
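As a small illustration of the ingest role played by the edge nodes, the sketch below stages a local extract file and pushes it into HDFS with the standard hdfs dfs client, one of the Hadoop client tools mentioned above. The staging directory, landing path and file pattern are hypothetical.

```python
# Illustrative only: staging source extracts on an edge node and pushing them
# into HDFS with the standard `hdfs dfs` client. Paths are hypothetical.
import subprocess
from pathlib import Path

STAGING_DIR = Path("/data/staging")      # local disk on the edge node
HDFS_LANDING = "/landing/sales/orders"   # HDFS landing zone for raw data

def ingest(local_file: Path) -> None:
    """Copy one staged extract file from the edge node into HDFS."""
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_LANDING], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", str(local_file), HDFS_LANDING],
                   check=True)

if __name__ == "__main__":
    for extract in sorted(STAGING_DIR.glob("orders_*.csv")):
        ingest(extract)
```

Because this staging happens on the edge nodes, their local disk capacity is the sizing consideration called out above; the data nodes only see the data once it lands in HDFS.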

The DMX-h client-server architecture enables your organization to cost-effectively solve enterprise-class data integration problems, irrespective of data volume, complexity or velocity. The key to this framework, which is optimized for a wide variety of data integration requirements, is a single processing engine that has continually evolved since its inception. It is important to note that DMX-h has a very small-footprint architecture with no dependency on third-party applications, such as a relational database, compiler or application server, for design or runtime. DMX-h can be deployed virtually anywhere on premises, on Linux, Unix and Windows, or even within a Hadoop cluster.

There are two major components of the DMX-h client-server platform:
- Client: A graphical user interface that allows users to design, execute and control data integration jobs.
- Server: A combination of repository and engine:
  - File-Based Metadata Repository: Using the standard file system enables seamless design-time and runtime version control integration with source code control systems. It also provides high availability simply by inheriting the characteristics of the underlying file system between nodes.
  - Engine: A high-performance, linearly scalable, small-footprint engine that includes a unique dynamic ETL Optimizer, which helps ensure maximum throughput at all times.

With traditional ETL tools, a majority of the large library of components is devoted to manually tuning performance and scalability. This forces you to make design decisions that can dramatically impact overall throughput. Moreover, it means that performance is heavily dependent on an individual developer's knowledge of the tool. In essence, the developer must not only code to meet the functional requirements, but also design for performance.

DMX-h is different because the dynamic ETL Optimizer handles the performance aspects of any job or task. The designer only has to learn a core set of five stages/transforms: copy, sort, merge, join and aggregate. These simple tasks are combined to meet all functional requirements (a short conceptual sketch appears after the product list below). This is what makes DMX-h so unique. The designer doesn't need to worry about performance, because the Optimizer automatically delivers it to every job and task regardless of the environment. As a result, jobs have far fewer components and are easier to maintain and govern. With DMX-h, users design for functionality, and they simply inherit performance.

Take your big data journey with Dell

You can also look to Dell for the rest of the pieces of a complete big data solution, including unique software products for data analytics, data integration and data management. Dell offers all the tools you need to:
- Seamlessly join structured and unstructured data. Dell Statistica Big Data Analytics delivers integrated information modeling and visualization in a big data search and analytics platform. It seamlessly combines large-scale structured data with a variety of unstructured data, such as text, imagery and biometrics.
- Simplify Oracle-to-Hadoop data integration. Dell SharePlex Connector for Hadoop enables you to load and continuously replicate changes from an Oracle database to a Hadoop cluster. This toolset maintains near-real-time copies of source tables without impacting system performance or Oracle online transaction processing applications.
- Synchronize data between critical applications. Dell Boomi enables you to synchronize data between mission-critical applications on-premises and in the cloud, without the costs of procuring appliances, maintaining software or writing custom code.
- Easily access and merge data types. Dell Toad Data Point can join data from relational and non-relational data sources, enabling you to easily share and view queries, files, objects and data sets.
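Returning to the DMX-h design model described above, the conceptual sketch promised earlier expresses a job purely in terms of the five stage types named in the text: copy, sort, merge, join and aggregate. The helpers are hypothetical plain-Python stand-ins, not DMX-h syntax; the point is that the designer composes functionality from a small set of stages and leaves execution tuning to the engine.

```python
# Conceptual sketch only: a job expressed solely with the five core stage
# types named in the text (copy, sort, merge, join, aggregate). These helpers
# are hypothetical plain-Python stand-ins, not DMX-h syntax.
from itertools import groupby
from operator import itemgetter

def copy(records):                  # pass records through unchanged
    return list(records)

def sort(records, key):             # order records by a field
    return sorted(records, key=itemgetter(key))

def merge(left, right, key):        # combine two streams into one ordered stream
    return sort(list(left) + list(right), key)

def join(left, right, key):         # match records on a shared field
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def aggregate(records, key, field): # sum a field per group
    out = []
    for k, group in groupby(sort(records, key), key=itemgetter(key)):
        out.append({key: k, field: sum(r[field] for r in group)})
    return out

# The "design": compose functionality; performance is the engine's problem.
orders = [{"region": "east", "amount": 10}, {"region": "west", "amount": 7}]
backlog = [{"region": "east", "amount": 5}]
regions = [{"region": "east", "manager": "A"}, {"region": "west", "manager": "B"}]

combined = merge(copy(orders), copy(backlog), "region")
enriched = join(combined, regions, "region")
print(aggregate(enriched, "region", "amount"))
# [{'region': 'east', 'amount': 15}, {'region': 'west', 'amount': 7}]
```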

Dell Big Data and Analytics Solutions

To learn more, visit:
Dell.com/Hadoop
Dell.com/BigData
Software.Dell.com/Solutions

© 2015 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge and PowerEdge are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. June 2015, Version 1.0