High performance ETL Benchmark




Author: Dhananjay Patil
Organization: Evaltech, Inc. Evaltech Research Group, Data Warehousing Practice
Date: 07/02/04
Email: erg@evaltech.com

Abstract: The IBM iSeries server is a highly scalable platform that can easily handle very large scale data warehousing and data integration applications, and the RODIN Data Asset Management software fully leverages and enhances the advanced technologies of this hardware and database platform. The combination of this hardware and software platform delivers high performance levels in a real-world environment.

Intellectual Property / Copyright Material: All text and graphics found in this article are the property of Evaltech, Inc. and cannot be used or duplicated without the express written permission of the corporation through the Office of Evaltech, Inc. Evaltech, Inc. Copyright 2004.

Summary

The IBM iSeries server is a highly scalable platform that can easily handle very large scale data warehousing and data integration applications, and the RODIN Data Asset Management software fully leverages and enhances the advanced technologies of this hardware and database platform. The combination of this hardware and software platform delivers high performance levels in a real-world environment. Customers with a mixed technology environment should consider the iSeries and RODIN if they are looking for scalability, ease of use and a low cost of ownership (leading to improved ROI) when deciding on their DW / BI / CRM platform.

This white paper includes the following topics:

Introduction
Hardware Configuration
RODIN ETL Architecture
Test and Result Scenarios
Measurement Methodology
Distributed Database Environments
Conclusions

Introduction

As data warehouses grow to tens or hundreds of terabytes, it is clear that both hardware and software need to scale similarly. While the hardware, database and disk subsystems need to manage these huge amounts of data once loaded, the ETL (Extract, Transformation, and Load) process must be capable of loading many gigabytes of data on a daily basis, and to do this it must take advantage of all hardware processing resources, which is not an easy task.

Speed of processing and scalability are just as important for smaller companies as they are for very large ones. Knowing the software is fully optimized for the available hardware resources ensures throughput is maximized and processing times are minimized on all iSeries models, whatever the size. This delivers the best possible TCO and ROI figures at all levels within the entire iSeries range. Real-world benchmarks such as this one are the proving ground for very large-scale iSeries data warehouse applications as well as for the small to medium size applications currently being implemented by the majority of organizations today. It is very important and comforting to know that a data warehousing implementation can grow as and when the need arises without any costly and time-consuming hardware platform or software technology replacements.

The IBM iSeries i890 and RODIN Data Asset Management software are in the same performance class as high-end Unix, mainframe and Teradata data warehousing implementations. With the added benefits of greatly improved ease of use and proven low TCO, this solution provides a real alternative to the currently more prevalent data warehousing platforms.

Hardware Configuration

Platform

RODIN runs natively on the IBM iSeries platform. The i890 system features the eighth-generation 64-bit PowerPC processor, which utilizes IBM's copper and silicon-on-insulator technologies.

Server Configuration

Model: IBM iSeries i890
CPU: 32 x POWER4 1.3GHz 64-bit RISC microprocessors
Memory: 256GB total: 240GB
Disk: 15.9TB (704 x 17GB drives plus 172 x 36GB drives; 876 drives/disk arms in total), RAID 5 protected
Operating System: OS/400 V5R2
Database: DB2 UDB for iSeries (integrated database)
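As a quick cross-check of the disk configuration listed above, the drive count and raw capacity can be derived directly from the drive mix. The short sketch below is illustrative only; the difference between the raw figure it computes and the 15.9TB quoted is assumed here to reflect RAID 5 parity and formatting overhead, which the paper does not spell out.

# Cross-check of the benchmark server's disk subsystem, using the drive mix
# listed in the configuration above. The RAID 5 / formatting explanation for
# the gap between raw and quoted capacity is an assumption, not a stated fact.
drive_mix = {17: 704, 36: 172}   # drive size in GB -> number of drives

total_drives = sum(drive_mix.values())
raw_gb = sum(size_gb * count for size_gb, count in drive_mix.items())

print(f"drives/disk arms: {total_drives}")                            # 876, as listed above
print(f"raw capacity:     {raw_gb:,} GB (~{raw_gb / 1000:.1f} TB)")   # ~18.2 TB raw
print("quoted capacity:   15.9 TB (presumably after RAID 5 parity and formatting)")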

RODIN ETL Architecture

RODIN is designed to take advantage of all applicable functionality in the OS/400 operating system and the integrated DB2 database. Automatically generated ILE RPG programs perform both the extract from the source tables and the load into the target tables. The following figure illustrates the two-stage design of a RODIN ETL process: two separate batch jobs concurrently perform the extract and the load, which greatly enhances throughput in a multi-processor environment. For large loads on multiple-CPU (n-way) systems, RODIN's unique parallel processing technology can also easily split the source data into n job streams to fully utilize the resources of all CPUs. (A generic, illustrative sketch of this two-stage, multi-stream pattern follows the test descriptions below.)

[Figure: Source Data -> Extract Program -> Staging Table -> Load Program -> Target Data]

Test and Result Scenarios

Test 1: Complex load of a detail level table (all inserts)

The same source data was used in each test:

a) This test represents the scenario of a small fact table in a star schema data warehouse. It inserts 200 million rows into a target table with no index and a record length of 100 bytes. The table contained 12 columns of 3 different data types: 1 date column, 5 character columns and 6 packed decimal columns.

b) This test is representative of the load of a typical large table in a relational data warehouse. It inserts 200 million rows into a target table with no index and a record length of 500 bytes. The table contained 28 columns of 3 different data types: 2 date columns, 16 character columns and 10 packed decimal columns.
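The following sketch is the generic illustration of the two-stage, multi-stream pattern referred to above. It is not RODIN code and uses none of RODIN's APIs (RODIN generates ILE RPG programs and stages data in DB2 tables); it is a minimal Python analogue in which, for each of n streams, an extract job and a load job run concurrently through a bounded staging buffer, so that extract/transform work and load work overlap in time while the n streams spread the work across the available processors.

# Minimal, generic sketch of the two-stage, multi-stream ETL pattern described
# above. For each stream, an extract job and a load job run concurrently,
# connected by a bounded staging buffer, and the source data is split n ways.
# This is an illustrative analogue only: RODIN generates ILE RPG programs and
# uses DB2 staging tables rather than in-memory queues.
import queue
import threading

SENTINEL = object()  # marks the end of one stream's data


def extract_job(source_rows, batch_size, staging):
    """Read one source partition in batches, transform, and stage the batches."""
    for i in range(0, len(source_rows), batch_size):
        batch = source_rows[i:i + batch_size]
        staging.put([row * 10 for row in batch])   # stand-in for joins/validation
    staging.put(SENTINEL)


def load_job(staging, target, lock):
    """Drain the staging buffer and write each batch into the shared target."""
    while (batch := staging.get()) is not SENTINEL:
        with lock:
            target.extend(batch)                   # stand-in for bulk insert/update


def run_parallel_load(source_rows, n_streams=4, batch_size=1000):
    target, lock, threads = [], threading.Lock(), []
    partitions = [source_rows[i::n_streams] for i in range(n_streams)]  # n-way split
    for part in partitions:
        staging = queue.Queue(maxsize=8)           # bounded "staging table"
        threads.append(threading.Thread(target=extract_job, args=(part, batch_size, staging)))
        threads.append(threading.Thread(target=load_job, args=(staging, target, lock)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return target


if __name__ == "__main__":
    loaded = run_parallel_load(list(range(100_000)), n_streams=4)
    print(f"loaded {len(loaded):,} rows")          # loaded 100,000 rows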

Result

a) Load of 100-byte table

  Elapsed Time (seconds):      14746    1534    1385    1425
  Rows/Hour (in millions):      48.8   469.4   519.9   505.3
  GB/Hour:                       4.2    40.7    45.1    43.8

b) Load of 500-byte table

  Elapsed Time (seconds):      14746  2494.0  1970.0  2239.0
  Rows/Hour (in millions):      48.8   288.7   365.5   321.6
  GB/Hour:                      23.1   136.3   172.6   151.8

Analysis

The maximum throughput was achieved using a 24-way parallel load; it had been expected that a 32-way split would achieve the best results. This behavior is attributed to the imbalance in capacity and performance between the 32 processors and the IO subsystem on this particular server configuration. The IO subsystem becomes saturated when servicing the 48 concurrent jobs, each driving IO at an extremely high rate. Adding further parallel jobs simply increases the requests to the IO subsystem, with a detrimental effect on overall performance. This indicates that the system is already making efficient use of the available resources, negating the need to split the job into more parallel tasks.

The 100-byte load achieved the highest rows/hour, whereas the 500-byte table load achieved a significantly higher GB/hour rate at the expense of rows/hour.

Test 2: Complex load of a summary level table (both inserts and updates)

A 100-byte target table, identical to the table from Test 1, was used; however, this time the 200 million source rows were aggregated to 48 million rows in the target table. A unique primary index existed on the target table, and this index was maintained during the load. The same referential integrity, business rules and transformations were applied.

Result

Load of 200 million rows into a summary table (inserts and updates)

  Elapsed Time (seconds):      14853    1589    1469    1503
  Rows/Hour (in millions):      48.5   453.1   490.1   479.0
  GB/Hour:                       4.2    39.3    42.5    41.5
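As a simple sanity check on the result tables above, the rows/hour figures follow directly from the elapsed times, since each run processes the same 200 million source rows. The short calculation below, using the best (24-way) elapsed times, reproduces the published figures.

# Sanity check: rows/hour = source rows / elapsed seconds * 3600.
# Elapsed times are the best (24-way) runs from the result tables above.
SOURCE_ROWS = 200_000_000

best_runs_s = {
    "Test 1a (100-byte detail load)": 1385,
    "Test 1b (500-byte detail load)": 1970,
    "Test 2  (summary load)":         1469,
}

for name, elapsed_s in best_runs_s.items():
    millions_per_hour = SOURCE_ROWS / elapsed_s * 3600 / 1e6
    print(f"{name}: {millions_per_hour:.1f} million rows/hour")

# Output: 519.9, 365.5 and 490.1 million rows/hour, matching the tables above.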

Analysis

GB/hour is not measured in this test, as it would be a confusing metric; the detail loads demonstrate the GB/hour performance. The rows/hour measurement is based on the number of source rows processed; the 200 million source rows were aggregated to 48 million target rows. Aggregation is achieved by updating existing rows in the target, rather than by performing the aggregation in memory and writing the final result to the table. This approach allows full restart recovery in the event of a system failure (unlike memory-based processes), as well as other unique RODIN capabilities.

These results were noticeably better than expected, being within 4% of the equivalent detail table load with no index. This is attributed to the same factor noted in the detail loads of Test 1, the physical IO limit of this server configuration; the system easily manages to keep the access path updated with minimal overhead.

Test 3: Complex load of both detail and summary level tables concurrently

In this scenario, the 100-byte detail (non-indexed) target table and the 100-byte indexed summary table were loaded concurrently.

Result

Load of 200 million rows into both detail and summary tables

  Elapsed Time (seconds):                    23188    2575    2304    2551
  Rows/Hour, source level (in millions):      31.1   279.6   312.5   282.2
  Rows/Hour, target level (in millions):      62.2   559.2   625.0   564.5

Analysis

The detail loads demonstrate the GB/hour performance. Since two inserts/updates occur for each source row, the rows/hour metric is calculated at both source and target level. This test demonstrates the value of RODIN's ability to load multiple target tables concurrently, with support for both inserts and updates.

Measurement Methodology

RODIN provides significant added value in the ETL process by automatically producing audit statistics and error reporting. Producing these audit and error reports at the end of the extract, together with the similar overhead of initiating the ETL jobs, is a fixed overhead, and these smaller test sets are negatively skewed by it. This is not a significant factor in real terms; for example, these stages took less than 2 minutes on a 24-way parallel load. However, because the load rates are so high and the data set is not that large, the impact on the overall rates was noticeable.
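To put the fixed-overhead point above in rough numbers, the sketch below assumes that the "less than 2 minutes" of job initiation plus audit and error reporting amounts to about 120 seconds, and relates it to the best 100-byte detail run. The exact overhead is not broken out in the paper, so this is only an order-of-magnitude illustration of how a fixed cost skews short, fast runs.

# Illustration of how a fixed per-run overhead skews the measured rates of a
# short, fast run. The 120-second figure is an assumed upper bound taken from
# the "less than 2 minutes" statement above; the paper gives no exact split.
SOURCE_ROWS = 200_000_000
elapsed_s = 1385     # best 100-byte detail load (24-way parallel), from Test 1
overhead_s = 120     # assumed fixed cost: job initiation + audit/error reporting

measured = SOURCE_ROWS / elapsed_s * 3600 / 1e6
sustained = SOURCE_ROWS / (elapsed_s - overhead_s) * 3600 / 1e6

print(f"overhead share of elapsed time: {overhead_s / elapsed_s:.1%}")   # ~8.7%
print(f"measured rate:  {measured:.1f} million rows/hour")               # ~519.9
print(f"excl. overhead: {sustained:.1f} million rows/hour")              # ~569.2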

Distributed Database Environments

Although the high-end iSeries servers offer tremendous processing power and can manage up to 72TB of DASD, it is possible that very large organizations could need even greater processing power or storage capacity. In this event, up to 32 iSeries servers (each with up to 32 processors) can be linked to create a distributed database in an MPP (Massively Parallel Processing) environment. With the 32-way i890 servers, this allows for 1,024 processors and over 2 petabytes of disk to act as a single distributed database image. Realistically, it is unlikely that any organization will ever need to create distributed databases using more than 2 or 3 nodes. RODIN has been proven in a 3-node environment, and there is no technical reason why a 32-node environment would not be just as successful.

Conclusions

RODIN's unique parallel processing architecture, its native implementation on the iSeries, its superior error management capabilities and its extensive functionality combine to make it the premier tool for building and managing iSeries data warehouse and data mart infrastructures.

It is important to recognize that this benchmark was not a straight data load exercise in which source columns are simply mapped to target columns. A complex 8-way join was performed for each row, validation rules were applied to many of the source columns, numeric date columns were verified and converted to date format, and so on. These activities are always necessary in ETL situations and were therefore included in these benchmark scenarios to ensure they are representative of real-life situations.

Organizations that are planning to implement a large data warehouse, or that have already implemented one and are facing rapid data growth, can be assured that the iSeries continues to raise the bar, and that with RODIN there is an industrial-strength native ETL tool that can utilize all of the processing capacity of the hardware to load and manage enterprise data warehouse and data mart environments. Load rates of over 500 million rows/hour, or 172GB per hour, on a single server are unparalleled on any other hardware platform. These metrics exceed the requirements of most organizations, but for those with even greater needs today, RODIN and the iSeries distributed database technology can provide the solution. If your need is not quite so immediate, just wait for the next generation of iSeries servers - they are sure to raise the bar even further.