A CHECKLIST FOR HIGH-PERFORMANCE ETL

Table of Contents

INTRODUCTION
THE EVOLUTION OF ETL TOOLS
A CHECKLIST FOR HIGH-PERFORMANCE ETL:
    DEVELOPMENT PRODUCTIVITY
    DYNAMIC ETL OPTIMIZATION
    PERVASIVE CONNECTIVITY
    HIGH-SPEED COMPRESSION
    SCALABLE ARCHITECTURE
HIGH-PERFORMANCE ETL IN ACTION AT COMSCORE
CONCLUSION

INTRODUCTION

Do a Google search on Big Data and you'll get nearly 2 billion results. Clearly the term is top of mind, as well it should be. Nearly any organization of any size stands to gain from the enhanced services, better products, or operational efficiencies that greater data insights enable. But only a small fraction of organizations is maximizing these benefits today.

According to IDC, the digital universe measures in the trillions of gigabytes and will continue to double every two years. While not all that data is valuable, less than 0.5% of it is currently being analyzed.* This is not because organizations don't recognize the potential value to be gained, but because they either lack the tools to do so or their conventional approaches to data integration can't keep pace with the Three V's (Volume, Velocity, and Variety).

Whether you're dealing with petabytes of data or just a few gigabytes, having the right tools and integration architecture in place will help you quickly and effectively transform data, Big or otherwise, into competitive insights that enable you to identify new revenue opportunities, save costs, increase operational efficiencies, improve products and services, and remain competitive.

The Impact of Data Integration Done Right

OPERATIONAL
- Process more data in less time with less effort
- Less hardware & storage to maintain, manage, & replace
- Install & deploy with no worries about data volumes
- Quickly & easily respond to business-user requests

FINANCIAL
- Reduce data integration TCO by up to 65%
- Defer or eliminate additional infrastructure purchases
- Support future initiatives without increasing budgets
- Increase ROI of existing IT investments

BUSINESS
- Maximize agility with quicker access to more data
- Uncover new revenue opportunities
- Reduce business risk & ensure compliance
- Align IT with strategic business objectives

* The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. IDC, December 2012.

THE EVOLUTION OF ETL TOOLS

Organizations have struggled to make sense of data for decades, using one-off point solutions and custom coding to try to extract meaningful information. But in the late 1990s a new way of thinking emerged. Instead of relying on skilled developers to write complex, manual structured query language (SQL) scripts to prepare and transform data, ETL (Extract, Transform, Load) and Data Integration (DI) tools were introduced to simplify the process. In a time when data transformation was relatively straightforward, their engines and metadata-driven design enabled more users to build and deploy data integration flows.

However, as data volumes and sources quickly grew, these solutions were unable to keep up. Even organizations that didn't have huge volumes of data, but needed more complex data transformations, faced growing costs and performance issues. Reluctant to abandon their sizable investments in these tools, many IT departments tried to overcome performance and scalability challenges by returning to hand-coding SQL and by pushing transformations down to the data warehouse. But this kludged approach created unnecessary complexity, consumed significant resources, and piled on more costs, creating an unsustainable model going forward. In fact, data integration now consumes up to 80% of database capacity.

[Diagram: data sources (Oracle, files/XML, ERP, mainframe, real-time, Hadoop/Big Data) feeding conventional DI solutions, with ETL into a data warehouse and further ETL into multiple data marts]

That's why today, while organizations strive to harness the power of data for competitive advantage, the reality is that the high total cost of ownership, ongoing tuning and maintenance effort, and performance limitations of current approaches stand in the way. This situation has prompted many organizations to step back and ask: the world has changed dramatically in the last 20 years, so what does that mean for my approach to data integration, and how can I adapt quickly enough to ensure a clear path forward?

Whether you already have a set of ETL and DI tools or not, what follows is a checklist designed to help you evaluate high-performance ETL to ensure your next move will reduce the costs and complexity of your data integration initiatives as well as complement and optimize existing DI platforms for faster performance and lower resource utilization.

A CHECKLIST FOR HIGH-PERFORMANCE ETL

Tapping into previously unused sources of information is changing the way business is done. For organizations it can uncover new revenue streams and operational efficiencies. For consumers it can literally change the way we live and work, with products and services not only tailored to our individual needs but even anticipating them.

As you can imagine, achieving this level of sophistication and speed requires performance at scale. And performance, for any software system, rests on the performance triangle: efficiency and speed require balancing CPU, memory, and I/O.

The Performance Triangle: The performance triangle reflects the delicate balance between these three resources; overuse of one has an immediate impact on the others. For example, executing a join that exceeds physical memory will require additional disk space and CPU time. Most conventional ETL tools are CPU- and memory-bound but, ultimately, all I/O dependent. As a result, to increase performance you need an approach that minimizes the impact on every aspect of the triangle, which is no easy task.
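
To make the triangle concrete, here is a minimal back-of-envelope sketch in Python (written for this checklist as an illustration, not taken from any product) that estimates how much of a join's build side would spill past a given memory budget; everything that spills turns into extra disk I/O plus the CPU cost of writing and re-reading it.

    def join_spill_estimate(build_rows, avg_row_bytes, mem_budget_bytes, overhead=1.5):
        """Estimate how much of a join's build side spills past a memory budget.

        overhead is an assumed factor for per-row hash-table and bookkeeping cost.
        """
        bytes_needed = int(build_rows * avg_row_bytes * overhead)
        bytes_spilled = max(0, bytes_needed - mem_budget_bytes)
        return bytes_needed, bytes_spilled

    # Example: a 50-million-row build side of ~200-byte rows against a 4 GiB budget.
    needed, spilled = join_spill_estimate(50_000_000, 200, 4 * 1024**3)
    print(f"needs ~{needed / 1024**3:.1f} GiB, spills ~{spilled / 1024**3:.1f} GiB to disk")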

Checklist: KEY CAPABILITIES HIGH-PERFORMANCE ETL MUST DELIVER TO ADDRESS THE PERFORMANCE TRIANGLE

- Development productivity: Shifting the burden of handling common and repetitive tasks, as well as performance tuning, from the individual to the technology
- Dynamic optimization: Leveraging algorithms, optimizations, and smart technology to intelligently accelerate performance on-the-fly
- Pervasive connectivity: Enabling connectivity with a wide variety of sources and targets and incorporating innovations like Direct I/O to enable a more efficient transfer of larger blocks of data
- High-speed compression: Taking compression to a new level by incorporating algorithms and technologies to address the entire transformation process
- Scalable architecture: Designed for today's dynamic business requirements and environments, with efficient processing methods dynamically executed as needed

By checking all the boxes, you can be sure you've identified a way to cost-effectively solve your enterprise-class data integration challenges regardless of data volume, complexity, or velocity. Let's take an up-close look at each.

Development Productivity

The initial promise of ETL and DI tools was user productivity: existing IT teams with a broader set of skills and no specialized knowledge would be able to quickly build, deploy, and re-use highly scalable data integration flows. But when the demands of robust data integration set in and IT departments reverted to SQL in an attempt to meet them, productivity slowed to a crawl. As a result, developers are bogged down writing, maintaining, and extending thousands of lines of complex code to cope with changing business requirements.

Conventional ETL tools also put the burden of tuning for performance and scalability on the developer. Not only must the developer's code meet functional requirements, it must also be designed for performance, a rare combination of skills gained only after years of experience and finely honed expertise with a specific tool. A lack of metadata puts even greater challenges on organizations with hybrid development environments: dispersed on- and off-shore teams face significant complications sharing, testing, and propagating jobs across dispersed production environments.

High-performance ETL shifts the burden of handling common and repetitive tasks, as well as performance tuning, from individuals to software.

CHECKLIST QUESTIONS
When determining if a solution will support development productivity, ask these questions:

WHAT PERCENTAGE OF MY DEVELOPERS' TIME WILL BE SPENT WRITING CODE?
Reusable tasks are self-contained and unit-testable, and can be assembled to create jobs and accelerate updates, minimizing the risk of errors or delays that typically occur when manually writing and re-writing hundreds of lines of code.

DO MY DEVELOPERS NEED ANY SPECIFIC SKILLS TO ENSURE PERFORMANCE OPTIMIZATION?
Built-in optimization capabilities seamlessly handle the performance issues of any job or task, enabling users to design for functionality and inherit performance.
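
As a rough, hypothetical illustration of what reusable, self-contained, unit-testable tasks can look like, the plain-Python sketch below (not any vendor's API; all names are invented) composes small transform functions into a job, so each step can be tested in isolation and reused across jobs.

    from typing import Callable, Dict, Iterable

    Record = Dict[str, str]
    Task = Callable[[Iterable[Record]], Iterable[Record]]

    def filter_country(country: str) -> Task:
        """Reusable task: keep only records for one country."""
        def run(records):
            return (r for r in records if r.get("country") == country)
        return run

    def uppercase_field(field: str) -> Task:
        """Reusable task: normalize a single field to upper case."""
        def run(records):
            return ({**r, field: r.get(field, "").upper()} for r in records)
        return run

    def build_job(*tasks: Task) -> Task:
        """Assemble independent, individually testable tasks into one job."""
        def run(records):
            for task in tasks:
                records = task(records)
            return records
        return run

    # Each task can be unit-tested on its own; the job is just their composition.
    job = build_job(filter_country("US"), uppercase_field("city"))
    rows = [{"country": "US", "city": "austin"}, {"country": "DE", "city": "berlin"}]
    print(list(job(rows)))  # [{'country': 'US', 'city': 'AUSTIN'}]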

Dynamic ETL Optimization

Achieving the highest levels of throughput with minimum resource utilization becomes increasingly difficult as the demand for critical information and data volumes rise. As much as 80 percent of all ETL processing is spent sorting records. Joins, aggregations, rankings, database loads, and the like all depend on sorting to complete their processing. Even the final step of loading data into a target database can be more efficient, using less CPU and elapsed time, if the data is sorted first. Yet sorting records with conventional tools is typically the most inefficient step in the ETL process, and as business requirements increase, most organizations need to invest in more hardware.

Adding to the complexity, balancing the performance triangle between memory, CPU, and disk space is a moving target. As business requirements change, so do the number and type of data sources, the type of transformations, and the volumes of data; all of this happens in an environment where a variety of applications (ETL, relational databases, etc.) continuously compete for priority. Therefore, the level of tuning that must be achieved and maintained to ensure maximum performance at runtime simply isn't possible using manual methods or a static, one-size-fits-all approach.

High-performance ETL leverages algorithms, optimizations, and smart technology to intelligently accelerate performance on-the-fly.

CHECKLIST QUESTIONS
When determining if a solution can dynamically self-optimize, ask these questions:

HOW CAN THE SOLUTION HELP ENSURE I'M ACHIEVING MAXIMUM RUNTIME PERFORMANCE AND MINIMUM RESOURCE UTILIZATION?
Look for solutions that include a full library of algorithms and optimizations (covering sorting, joins, merges, aggregations, transformations, copies, memory management, and compression), as well as technology to handle the complexities of optimization by dynamically selecting and even switching algorithms midstream. Removing users from the process via automation and using highly targeted algorithms will ensure you aren't leaving optimization up to chance.

HOW HAS THE SOLUTION PERFORMED IN ENVIRONMENTS SIMILAR TO MINE?
Third-party validation, including patents, customer examples, and benchmarks, as well as the opportunity to conduct proofs of concept with no manual tuning allowed, will quickly verify whether you can achieve faster performance with fewer resources on existing hardware.
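
The products discussed here do not publish their algorithm libraries, but the general idea of selecting a strategy at runtime can be sketched in Python: sort in memory while the data fits a budget, and switch to an external merge sort with spilled runs when it does not. This is a simplified, assumed illustration of the technique, not any vendor's implementation.

    import heapq
    import os
    import pickle
    import sys
    import tempfile

    def adaptive_sort(records, mem_budget_bytes=64 * 1024 * 1024):
        """Sort records, spilling sorted runs to disk only when the budget is exceeded."""
        run_files, buffer, used = [], [], 0
        for rec in records:
            buffer.append(rec)
            used += sys.getsizeof(rec)            # rough per-record size estimate
            if used >= mem_budget_bytes:          # switch strategy: spill a sorted run
                run_files.append(_spill(sorted(buffer)))
                buffer, used = [], 0
        if not run_files:                         # everything fit: plain in-memory sort
            return iter(sorted(buffer))
        if buffer:
            run_files.append(_spill(sorted(buffer)))
        return heapq.merge(*[_read(f) for f in run_files])   # k-way merge of the runs

    def _spill(run):
        """Write one sorted run to a temporary file and return its path."""
        f = tempfile.NamedTemporaryFile(delete=False)
        for rec in run:
            pickle.dump(rec, f)
        f.close()
        return f.name

    def _read(path):
        """Stream a spilled run back in order, then remove the temporary file."""
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
        os.remove(path)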

Pervasive Connectivity

Data often comes from a long list of sources and targets, including relational databases, files, mainframes, CRM systems, web logs, HDFS, social media, and more. Unlocking the insights from this data quickly and easily is at the heart of the value to be gained from big data. Without it, business agility suffers, as organizations can't react quickly to market dynamics, changes in customer behavior, and new competitive forces. Any cost-effective approach to data integration must be capable of seamlessly plugging into a range of file and storage systems as well as other DI solutions.

But connectivity alone isn't enough. Every ETL process is ultimately I/O bound, especially at the end points of a job: extracting the data from the source and loading it into the target. The transformation phase can also quickly become disk bound when carrying out an operation that exceeds physical memory. Since disk is generally the slowest resource in most computing environments, its misuse can have the most dramatic impact on performance.

High-performance ETL enables connectivity with a wide variety of sources and targets and incorporates innovations like Direct I/O to enable a more efficient transfer of larger blocks of data.

CHECKLIST QUESTIONS
When determining if a solution has the pervasive connectivity your organization requires, ask these questions:

HOW CAN I CONNECT WITH ALL THE DATA SOURCES AND TARGETS IN MY ENVIRONMENT?
It's fair to expect native connectivity for a range of sources and targets, including files, relational database management systems (RDBMSs), real-time ERP, appliances, cloud, JSON, XML, mainframe, and legacy systems.

WHAT TECHNIQUES ARE USED TO ELIMINATE I/O BOTTLENECKS?
Solutions that fully leverage Direct I/O bypass the OS buffer cache, enabling a more efficient transfer of larger blocks of data. By avoiding an extra memory copy, less CPU is used. Automatic sort optimizations for larger sources, as well as built-in direct read and direct load optimizations (for example, reading directly from Oracle data files and bypassing Oracle's OCI client interface), will deliver further performance improvements of as much as 30%.
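
Direct I/O itself is an operating-system capability rather than anything product-specific. As a hedged illustration of the underlying mechanism (Linux only, and subject to the filesystem's alignment rules), the sketch below opens a file with O_DIRECT so that reads bypass the page cache and skip the extra kernel-to-user copy; the buffer comes from an anonymous mmap because O_DIRECT requires aligned transfers.

    import mmap
    import os

    BLOCK = 1024 * 1024  # large, alignment-friendly 1 MiB reads

    def read_direct(path):
        """Stream a file with O_DIRECT (Linux), bypassing the OS buffer cache."""
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
        buf = mmap.mmap(-1, BLOCK)            # anonymous mmap is page-aligned
        try:
            while True:
                n = os.readv(fd, [buf])       # fill the aligned buffer directly
                if n == 0:
                    break
                yield bytes(buf[:n])          # the final block may be short
        finally:
            buf.close()
            os.close(fd)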

High-Speed Compression

Given the increasing diversity of data sources and targets, including those residing in the cloud, data management costs can quickly become unsustainable. Large data volumes increase not only storage costs but also disk read/write access and network I/O, resulting in a negative impact on performance. Compression technology can help solve the storage and performance challenges, prompting the leading database and appliance vendors to make considerable investments in these technologies. For data integration, compression can be applied to minimize storage requirements and accelerate overall elapsed time by decreasing the amount of I/O, saving terabytes of storage and doubling performance compared to conventional DI approaches.

High-performance ETL takes compression to a new level, incorporating algorithms and technologies to address the entire transformation process.

[Diagram: Syncsort DMX applying compression across data sources, targets, and temporary workspace]

CHECKLIST QUESTIONS
When determining if a solution fully enables high-speed compression, ask these questions:

HOW IS COMPRESSION APPLIED TO DELIVER I/O SAVINGS?
Solutions that optimize compression for reading and writing data files incorporate high-speed compression algorithms, allowing the tool to support compression at all critical stages, including sources, targets, temporary workspace storage, and on-the-fly compression.

HOW IS COMPRESSION FOR DISK SPACE HANDLED?
Applying compression to temporary workspaces enables significant storage savings for large data volumes. Depending on data compression ratios and system specifications, such as the number and speed of CPUs and the I/O rate, high-performance compression can deliver over 2x faster elapsed time and storage savings of up to 90%, even for simple tasks.
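
As a generic illustration of trading a little CPU for much less temporary-workspace I/O, the sketch below uses Python's standard-library gzip at a low compression level to write and re-read spilled intermediate runs; a production engine would presumably use a faster codec, but the shape of the trade-off is the same.

    import gzip
    import pickle

    def spill_compressed(records, path, level=1):
        """Write an intermediate run to temporary workspace, compressed on the fly.

        level=1 favors speed over ratio, the usual choice for scratch data.
        """
        with gzip.open(path, "wb", compresslevel=level) as f:
            for rec in records:
                pickle.dump(rec, f)

    def read_compressed(path):
        """Stream a compressed run back, decompressing on the fly."""
        with gzip.open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break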

Scalable Architecture

Conventional ETL tools were designed in a different time, for a different time. Although functionality has been added, most of these tools need to push transformations down to the database to overcome performance and scalability challenges, an architectural decision that has proven costly and complex. High-performance ETL incorporates dynamic ETL optimization, Direct I/O, and compression to perform heavy transformations on-the-fly without the need for staging areas. Transformations are processed in memory on commodity hardware and use temporary staging, on commodity disks, only when memory is not enough, dramatically accelerating performance and reducing costs.

Another design challenge with conventional ETL tools is the use of heavy architectures that make inefficient use of resources. Moreover, conventional ETL tools and hand coding typically require a deployment or compile step, creating rigid ETL flows with little runtime flexibility to adapt to changing conditions. These approaches also tend to have very poor thread and process management, often constrained by overwhelming thread- and process-spawning requests that swamp the operating system. Their performance is hampered by their very design, making them unsuitable for high performance at scale.

High-performance ETL is designed for today's dynamic business environments, with efficient processing methods dynamically executed as needed.

CHECKLIST QUESTIONS
When determining if a solution is based on a scalable architecture, ask these questions:

HOW DOES THE SOLUTION HANDLE PROCESSES AND THREADS?
Solutions with hybrid multiprocess and multi-threaded architectures offer the full benefits of a master orchestration process, with threads that are dynamically spawned and killed based on demand and processing.

HOW DOES THE ARCHITECTURE OPTIMIZE PERFORMANCE WHILE SCALING?
A truly scalable architecture automatically controls the processing method and conserves resources by allocating them to steps only as needed, maximizing performance as data flows through the integration job and supporting more jobs. The ability to dynamically process scripts at runtime delivers faster start-up and runtime performance while allowing greater flexibility, especially when passing dynamic variables and parameters.
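
One generic way to picture a hybrid multiprocess and multi-thread design (a sketch of the general pattern under assumed names and paths, not the product's internals) is a master process that fans CPU-heavy transform steps out to a process pool while handing I/O-bound load steps to a small thread pool, sizing each from the work at hand instead of spawning a thread per request.

    import os
    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

    def transform(chunk):
        """CPU-bound step: run in worker processes."""
        return [row.upper() for row in chunk]

    def load(chunk, part):
        """I/O-bound step: run on lightweight threads, one output file per chunk."""
        with open(f"/tmp/out-{part}.txt", "w") as f:
            f.writelines(line + "\n" for line in chunk)

    def run_job(chunks):
        """Master orchestration: allocate workers to each step only as needed."""
        cpu_workers = min(len(chunks), os.cpu_count() or 2)
        with ProcessPoolExecutor(max_workers=cpu_workers) as cpu_pool, \
             ThreadPoolExecutor(max_workers=4) as io_pool:
            for part, transformed in enumerate(cpu_pool.map(transform, chunks)):
                io_pool.submit(load, transformed, part)

    if __name__ == "__main__":    # guard required for process pools on spawn platforms
        run_job([["ny", "sf"], ["ldn", "ber"]])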

A Real-World Example: HIGH-PERFORMANCE ETL IN ACTION AT COMSCORE

A leading internet technology company, comscore measures what people do as they navigate the digital world and turns that information into insights and actions that help 1,800 organizations around the globe maximize the value of their digital investments. Data integration is a critical business process for comscore; its success depends on its ability to monitor, collect, transform, and analyze data from a panel of 2 million internet users and an extensive network of sites participating in its Unified Digital Measurement (UDM) program. comscore collects information 24x7, from browsing to what people read, buy, and subscribe to, and then sorts and aggregates that data.

Within the company's first year, data volumes grew dramatically. It deployed Syncsort in 2000 and gained a 5-10x improvement in data processing speed. In 2009, comscore unveiled UDM, and data volumes and complexity skyrocketed. To support this innovation, the company decided to also leverage Hadoop, but to rely on Syncsort to sort, partition, and compress the data before loading it into Hadoop and to optimize its Hadoop environment.

"The performance and ease of use of Syncsort DMX positively impacts our bottom line; DMX technology is able to convert raw click-stream data into valuable granular information at lightning speed." MIKE BROWN, CTO

CONCLUSION

Tapping into previously unused sources of information is changing the way business is done and is key to remaining relevant and competitive. You can't afford to stand on the sidelines, missing out on valuable insights that will help you identify new revenue opportunities, save costs, increase operational efficiencies, improve products and services, and remain competitive. Whether you already have a set of ETL and DI tools or not, this checklist was designed to help you assess high-performance ETL solutions for data integration that stand up to today's challenges.

- Development productivity: Shifting the burden of handling common and repetitive tasks, as well as performance tuning, from the individual to the technology
- Dynamic optimization: Leveraging algorithms, optimizations, and smart technology to intelligently accelerate performance on-the-fly
- Pervasive connectivity: Enabling connectivity with a wide variety of sources and targets and incorporating innovations like Direct I/O to enable a more efficient transfer of larger blocks of data
- High-speed compression: Taking compression to a new level by incorporating algorithms and technologies to address the entire transformation process
- Scalable architecture: Designed for today's dynamic business requirements and environments, with efficient processing methods dynamically executed as needed

By checking all the boxes, you can ensure that the high-performance ETL solution you select will reduce the costs and complexity of your data integration initiatives, as well as complement and optimize existing DI platforms for faster performance and lower resource utilization.

ABOUT US

Syncsort provides data-intensive organizations across the big data continuum with a smarter way to collect and process the ever-expanding data avalanche. With thousands of deployments across all major platforms, including mainframe, Syncsort helps customers around the world overcome the architectural limits of today's ETL and Hadoop environments, empowering their organizations to drive better business outcomes in less time, with fewer resources and lower TCO. For more information, visit www.syncsort.com.

© 2014 Syncsort Incorporated. All rights reserved. DMExpress is a trademark of Syncsort Incorporated. All other company and product names used herein may be the trademarks of their respective companies. DMX-EB-001-0114US