High Performance ETL Benchmark

Author: Dhananjay Patil
Organization: Evaltech, Inc., Evaltech Research Group, Data Warehousing Practice
Date: 07/02/04
Email: erg@evaltech.com

Abstract: The IBM iSeries server is a highly scalable platform that can easily handle very large scale data warehousing and data integration applications, and the RODIN Data Asset Management software fully leverages and enhances the advanced technologies of this hardware and database platform. The combination of this hardware and software platform delivers high performance in a real-world environment.

Intellectual Property / Copyright Material
All text and graphics found in this article are the property of Evaltech, Inc. and cannot be used or duplicated without the express written permission of the corporation through the Office of Evaltech, Inc.
Evaltech, Inc. Copyright 2004
Summary
The IBM iSeries server is a highly scalable platform that can easily handle very large scale data warehousing and data integration applications, and the RODIN Data Asset Management software fully leverages and enhances the advanced technologies of this hardware and database platform. The combination of this hardware and software platform delivers high performance in a real-world environment. Customers with a mixed technology environment should consider using the iSeries and RODIN if they are looking for scalability, ease of use and a low cost of ownership (leading to improved ROI) when deciding on their DW / BI / CRM platform.

This white paper includes the following topics:
Introduction
Hardware & Software Configuration
RODIN ETL Architecture
Test and Result Scenario
Measurement Methodology
Distributed Database Environments
Conclusions

Introduction
As data warehouses grow to tens or hundreds of terabytes, it is clear that both hardware and software need to scale similarly. While the hardware, database and disk subsystems need to manage these huge amounts of data once they are loaded, the ETL (Extract, Transform and Load) process must be capable of loading many gigabytes of data on a daily basis, and to do this it must take advantage of all available hardware processing resources, which is not an easy task.
Speed of processing and scalability are just as important for smaller companies as they are for very large ones. Knowing that the software is fully optimized for the available hardware resources ensures that throughput is maximized and processing times are minimized on all iSeries models, whatever the size. This delivers the best possible TCO and ROI figures at all levels within the entire iSeries range. Real-world benchmarks such as this one are the proving ground for very large-scale iSeries data warehouse applications as well as for the small to medium size applications currently being implemented by the majority of organizations. It is very important and comforting to know that a data warehousing implementation can grow as and when the need arises, without any costly and time-consuming hardware platform or software technology replacements.
The IBM iSeries i890 and RODIN Data Asset Management software are in the same performance class as high-end Unix, mainframe and Teradata data warehousing implementations. With the added benefits of greatly improved ease of use and proven low TCO, this solution provides a real alternative to the currently more prevalent data warehousing platforms.
Hardware & Software Configuration

Platform
RODIN runs natively on the IBM iSeries platform. The i890 system features the eighth-generation, 64-bit PowerPC processor, which utilizes IBM's copper and silicon-on-insulator technologies.

Server Configuration
Model: IBM iSeries i890
CPU: 32 x POWER4 1.3GHz 64-bit RISC microprocessors
Memory: 256GB total: 240GB
Disk: 15.9TB (704 x 17GB drives and 172 x 36GB drives; 876 drives/disk arms in total), RAID-5 protected
Operating System: OS/400 V5R2
Database: DB2 UDB for iSeries (integrated database)
RODIN ETL Architecture
RODIN is designed to take advantage of all applicable functionality in the OS/400 operating system and the integrated DB2 database. Automatically generated ILE RPG programs perform both the extract from the source tables and the load into the target tables. The following figure illustrates the two-stage design of a RODIN ETL process: two separate batch jobs concurrently perform the extract and the load, which greatly enhances throughput in a multi-processor environment. For large loads on multiple-CPU (n-way) systems, RODIN's unique parallel processing technology can also easily split the source data into n job streams to fully utilize the resources of all CPUs.

Figure: Source Data -> Extract Program -> Staging Table -> Load Program -> Target Data
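To make the two-stage, n-way parallel pattern more concrete, here is a minimal sketch in Python on a deliberately tiny data set. It is illustrative only: RODIN generates native ILE RPG programs, a real implementation stages rows in DB2 staging tables rather than in-memory queues, and the stream count and row volumes below are arbitrary assumptions.

    # Minimal sketch of a two-stage, n-way parallel ETL: each of the n job
    # streams runs an extract job and a load job concurrently, with a queue
    # standing in for the staging table.
    import multiprocessing as mp

    N_STREAMS = 4                 # the "n-way" split
    ROWS_PER_STREAM = 50_000      # scaled-down stand-in for a large source

    def extract(stream_id, staging):
        """Extract job: read one slice of the source and write rows to staging."""
        for i in range(ROWS_PER_STREAM):
            row = {"key": stream_id * ROWS_PER_STREAM + i, "amount": i % 1000}
            staging.put(row)          # stage the (already transformed) row
        staging.put(None)             # end-of-stream marker for the load job

    def load(stream_id, staging, counter):
        """Load job: read staged rows and insert them into the target table."""
        loaded = 0
        while True:
            row = staging.get()
            if row is None:
                break
            loaded += 1               # a real load job would INSERT into DB2
        with counter.get_lock():
            counter.value += loaded

    if __name__ == "__main__":
        counter = mp.Value("q", 0)
        jobs = []
        for s in range(N_STREAMS):
            staging = mp.Queue(maxsize=10_000)     # per-stream staging area
            jobs += [mp.Process(target=extract, args=(s, staging)),
                     mp.Process(target=load, args=(s, staging, counter))]
        for j in jobs:
            j.start()                 # 2 concurrent jobs per stream
        for j in jobs:
            j.join()
        print(f"{counter.value} rows loaded across {N_STREAMS} streams")

Splitting a 24-way load in this fashion is what gives the 2 x 24 = 48 concurrent jobs referred to in the analysis of test 1 below.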
Test and Result Scenario

Test 1: Complex load of detail-level table (all inserts)
The same source data was used in each test.
a) This test represents the scenario of a small fact table in a star schema data warehouse. It inserts 200 million rows into a target table with no index and a record length of 100 bytes. The table contained 12 columns of 3 different data types:
1 Date column
5 Character columns
6 Packed Decimal columns
b) This test is representative of the load of a typical large table in a relational data warehouse. It inserts 200 million rows into a target table with no index and a record length of 500 bytes. The table contained 28 columns of 3 different data types:
2 Date columns
16 Character columns
10 Packed Decimal columns

Result
a) Load of 100-byte table
Elapsed Time (seconds):    14746    1534    1385    1425
Rows/Hour (in millions):    48.8   469.4   519.9   505.3
GB/Hour:                     4.2    40.7    45.1    43.8

b) Load of 500-byte table
Elapsed Time (seconds):    14746  2494.0  1970.0  2239.0
Rows/Hour (in millions):    48.8   288.7   365.5   321.6
GB/Hour:                    23.1   136.3   172.6   151.8

Analysis
The maximum throughput was achieved using a 24-way parallel load; it had been expected that a 32-way split would achieve the best results. This behavior is attributed to the imbalance in capacity and performance between the 32 processors and the IO subsystem on this particular server configuration. The IO subsystem becomes saturated when servicing the 48 concurrent jobs (24 extract jobs and 24 load jobs), each driving extremely high IO rates. Adding further parallel jobs simply increases the requests to the IO subsystem, with a detrimental effect on overall performance. This indicates that at 24 streams the system is already making the most efficient use of the available resources, negating the need to split the job into more parallel tasks. The 100-byte load achieved the highest rows/hour figure, whereas the 500-byte table load achieved a significantly higher GB/hour rate at the expense of rows/hour.

Test 2: Complex load of summary-level table (both inserts and updates)
A 100-byte target table, identical to the table from test 1, was used; however, this time the 200 million source rows were aggregated into 48 million rows in the target table. A unique primary index existed on the target table, and this index was maintained during the load. The same referential integrity, business rules and transformations were applied.

Result
Load of 200 million rows into the summary table (inserts and updates)
Elapsed Time (seconds):    14853    1589    1469    1503
Rows/Hour (in millions):    48.5   453.1   490.1     479
GB/Hour:                     4.2    39.3    42.5    41.5

Analysis
GB/hour is not measured in this test, as it would be a confusing metric; the detail loads demonstrate the GB/hour performance. The rows/hour measurement is based on the number of source rows processed: the 200 million source rows were aggregated into 48 million target rows. Aggregation is achieved by updating existing rows in the target, rather than by performing the aggregation in memory and writing the final result to the table. This approach allows full restart recovery in the event of a system failure (unlike memory-based processes), as well as other unique RODIN capabilities. These results were noticeably better than expected, being within 4% of the equivalent detail table load with no index. This is attributed to the same factors noted in the detail loads in test 1, namely the physical IO limit of this server configuration, and to the system easily managing to keep the access path updated with minimal overhead.
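The update-based aggregation described above can be pictured with a minimal sketch. This is not RODIN's generated code: SQLite stands in for DB2 UDB, and the table and column names are purely illustrative. The point is that the running totals live in the indexed target table rather than in memory, which is what makes a restart after a failure possible.

    # Aggregation by updating the target table directly: each source row either
    # updates an existing summary row or inserts a new one, so restart recovery
    # can resume from the last committed source row.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE summary (cust_key INTEGER PRIMARY KEY, total NUMERIC)")

    def load_summary(source_rows):
        """Apply each (cust_key, amount) source row as an update or an insert."""
        for cust_key, amount in source_rows:
            updated = con.execute(
                "UPDATE summary SET total = total + ? WHERE cust_key = ?",
                (amount, cust_key),
            ).rowcount
            if updated == 0:          # no summary row yet for this key: insert
                con.execute(
                    "INSERT INTO summary (cust_key, total) VALUES (?, ?)",
                    (cust_key, amount),
                )
        con.commit()                  # in practice, commit in restartable batches

    # Toy source: 10 detail rows aggregate to 3 summary rows
    load_summary([(k % 3, 10) for k in range(10)])
    print(con.execute("SELECT cust_key, total FROM summary ORDER BY cust_key").fetchall())
    # [(0, 40), (1, 30), (2, 30)]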
Test 3: Complex load of both detail and summary level tables concurrently
In this scenario, the 100-byte detail (non-indexed) target table and the 100-byte indexed summary table were loaded concurrently.

Result
Load of 200 million rows into both the detail and summary tables
Elapsed Time (seconds):                     23188   2575   2304   2551
Rows/Hour at source level (in millions):     31.1  279.6  312.5  282.2
Rows/Hour at target level (in millions):     62.2  559.2    625  564.5

Analysis
The detail loads in test 1 demonstrate the GB/hour performance. Since two inserts/updates occur for each source row, the rows/hour metric is calculated at both the source and the target level. This test demonstrates the value of RODIN's ability to load multiple target tables concurrently, with support for both inserts and updates.

Measurement Methodology
RODIN provides significant added value in the ETL process by automatically producing audit statistics and error reporting. Producing these audit and error reports at the end of the extract, together with the similar overhead of initiating the ETL jobs, is a fixed overhead, and these relatively small test sets are negatively skewed by it. This is not a significant factor in real terms: these stages took less than 2 minutes on a 24-way parallel load, for example. However, because the load rates are so high and the data set is not that large, the impact on the overall rates was noticeable.
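As a cross-check on the measurement approach, the rows/hour figures in the result tables can be reproduced directly from the elapsed times and the 200 million source rows per run, and the skewing effect of the fixed overhead can be illustrated with a simple calculation. The 120-second overhead used below is an assumption for illustration only; the text above states just that these stages took less than 2 minutes.

    # Reproduce the headline rows/hour figures and illustrate the fixed-overhead
    # effect. The 120-second overhead is an assumed, illustrative figure.
    SOURCE_ROWS = 200_000_000

    def mrows_per_hour(elapsed_seconds):
        """Millions of source rows processed per hour."""
        return SOURCE_ROWS / elapsed_seconds * 3600 / 1e6

    print(f"{mrows_per_hour(1385):.1f}")   # test 1a, best run: 519.9
    print(f"{mrows_per_hour(1469):.1f}")   # test 2, best run:  490.1
    print(f"{mrows_per_hour(2304):.1f}")   # test 3, best run:  312.5 (x2 at target level)

    # Excluding ~120s of job-initiation and audit-reporting overhead from the
    # best 100-byte detail load suggests the underlying rate is higher still.
    print(f"{mrows_per_hour(1385 - 120):.1f}")   # about 569.2 million rows/hour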
Distributed Database Environments
Although the high-end iSeries servers offer tremendous processing power and can manage up to 72TB of DASD, it is possible that very large organizations could need even greater processing power or storage capacity. In this event, up to 32 iSeries servers (each with up to 32 processors) can be linked to create a distributed database in an MPP (Massively Parallel Processing) environment. With 32-way i890 servers, this allows for 1,024 processors and over 2 petabytes of disk (32 nodes x 72TB) to act as a single distributed database image. Realistically, it is unlikely that any organization will ever need to create distributed databases using more than 2 or 3 nodes. RODIN has been proven in a 3-node environment, and there is no technical reason why a 32-node environment would not be just as successful.

Conclusions
RODIN's unique parallel processing architecture, native implementation on the iSeries, superior error management capabilities and extensive functionality combine to make it the premier tool for building and managing iSeries data warehouse and data mart infrastructures.
It is important to recognize that this benchmark was not a straight data load exercise in which source columns are simply mapped to target columns. A complex 8-way join was performed for each row, validation rules were applied to many of the source columns, numeric date columns were verified and converted to date format, and so on. These activities are always necessary in ETL situations and were therefore included in these benchmark scenarios, to ensure they are representative of real-life situations.
Organizations that are planning to implement a large data warehouse, or that have already implemented one and are facing incredible data growth, can be assured that the iSeries continues to raise the bar, and that with RODIN there is an industrial-strength native ETL tool that can utilize all of the processing capacity of the hardware to load and manage enterprise data warehouse and data mart environments. Load rates of over 500 million rows/hour, or 172GB per hour, on a single server are unparalleled on any other hardware platform. These metrics exceed the requirements of most organizations, but for those with even greater needs today, RODIN and the iSeries distributed database technology can provide the solution. If your need is not quite so immediate, just wait for the next generation of iSeries servers: they are sure to raise the bar even further.