ETL - EXTRACT, TRANSFORM AND LOAD TESTING

Rajesh Popli
Manager (Quality), Nagarro Software Pvt. Ltd., Gurgaon, INDIA
rajesh.popli@nagarro.com

ABSTRACT
Data is the most important asset in any organization. Data is the backbone of every report, and reports are the baseline on which all vital management decisions are taken. Most companies construct a data warehouse to store and monitor real-time as well as historical data. Many organizations have distributed departments with different applications running on distributed technology. The ETL process came into the picture to enable flawless integration between data sources from different departments. An ETL tool works as an integrator: it extracts data from different sources, transforms it into the preferred format based on the business transformation rules, and loads it into a cohesive database known as the data warehouse. This paper deals with ETL testing and its architecture.

1. INTRODUCTION
Extract, Transform and Load (ETL) is a process to extract data, mostly from different types of systems, transform it into a structure that is more appropriate for reporting and analysis, and finally load it into the database.

Extract from source - In this step we extract data from different internal and external sources, structured and/or unstructured. Plain queries are sent to the source systems, using native connections, message queuing, ODBC or OLE-DB middleware. The data is put into a Staging Area (SA), usually with the same structure as the source.

Transform the data - Once the data is available in the Staging Area, it is all on one platform and in one database. So we can easily join and union tables, filter and sort the data using specific attributes, pivot to another structure and make business calculations. In this step of the ETL process we can check data quality and cleanse the data if necessary. After having all the data prepared, we can choose to implement slowly changing dimensions.
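The extract and transform steps can be sketched as follows. This is a minimal illustration only: the table names (src_customers, stg_customers), the validation rules, and the use of SQLite in place of real source and staging databases are all assumptions for the example.

```python
import sqlite3

# SQLite stands in for the real source and staging databases;
# src_customers and stg_customers are hypothetical table names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_customers (id INTEGER, name TEXT, region TEXT)")
conn.executemany("INSERT INTO src_customers VALUES (?, ?, ?)",
                 [(1, "Alice", "North"), (2, "Bob", None), (3, None, "South")])

# Extract: copy the source rows into a staging table with the same structure.
conn.execute("CREATE TABLE stg_customers AS SELECT * FROM src_customers")

# Transform: apply validation rules before loading, e.g. reject rows
# with NULL key attributes and substitute a default for missing values.
rows = conn.execute("SELECT id, name, region FROM stg_customers").fetchall()
clean, rejected = [], []
for id_, name, region in rows:
    if id_ is None or name is None:                     # NULL-value rule
        rejected.append((id_, name, region))
    else:
        clean.append((id_, name, region or "UNKNOWN"))  # default value

print(f"{len(clean)} rows clean, {len(rejected)} rows rejected")
```

Only the rows that pass validation would proceed to the load step; the rejected rows would typically be written to an error log for review.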
In that case we want our analysis and reports to keep track of how attributes change over time, for example when a customer moves from one region to another. We also need to set validations at transformation time, such as avoiding NULL values in a table or validating data types (for example, using a tinyint where an integer is expected).

Load into the data warehouse - Finally, the data is loaded into a data warehouse, usually into fact and dimension tables. From there the data can be combined, aggregated and loaded into data marts or cubes and stored in an OLAP server; any report generation and analysis is done on these. Cubes hold pre-aggregated values, which makes report generation faster. In data warehousing, multidimensional data is logically represented by cubes: the dimensions and the data are represented by the edges and the body of the cube, respectively. A cube typically includes the aggregations that are needed for business intelligence queries.

2. ETL ARCHITECTURE
ETL also covers data profiling, data quality control, data monitoring and cleansing, real-time and on-demand data integration in a service-oriented architecture (SOA), and metadata management.

Data profiling and data quality control: Profiling the data gives direct insight into the data quality of the source systems. It can show how many rows have missing or invalid values, or what the distribution of the values in a specific column is. Based on this knowledge, one can specify business rules to cleanse the data, or to keep really bad data out of the data warehouse. By doing data profiling before designing your ETL process, you will be able to design a system that is robust and has a clear structure.

Data cleansing: This is the process of correcting inconsistencies and inaccuracies: removing unwanted data and transforming inconsistent data into a consistent format.

Metadata management and ETL: Metadata is defined as data that describes other data. Metadata can be divided into two main types: structural and descriptive.
Structural metadata describes the design of data structures and their specifications; this type of metadata describes the containers of data within a database. Descriptive metadata describes instances of application data; this is the type of metadata that is traditionally spoken of and described as "data about the data".
A third type, administrative metadata, is sometimes identified. Administrative metadata provides information that helps to manage other information, such as when and how a resource was created, its file type, and other technical details.

Figure 1: ETL Architecture

Metadata makes it easier to retrieve, use, or manage information resources by providing users with information that adds context to the data they are working with. Metadata can describe information at any level of aggregation: collections, single resources, or component parts of a single resource. Metadata can be embedded in a digital object or stored separately; web pages, for example, contain metadata in the form of metatags. In the ETL context, metadata is information about how data is extracted and how it may be transformed; it is also about indexing and creating pointers into data. Database design is essentially the definition of metadata schemas.

Metadata can be stored either internally, in the same file as the data, or externally, in a separate area. If it is stored internally, the metadata sits together with the data, making it more easily accessible to view or change, but this method creates high redundancy. If metadata is stored externally, searches can become more efficient and there is no redundancy, but getting to the metadata may be a little more technical. All the metadata is stored in a data dictionary or a system catalog. The data dictionary is most typically an external, spreadsheet-type document that stores the conceptual design ideas for the database schema. The data dictionary also specifies the general format that the data, and in effect the metadata, should follow.

Metadata is an essential aspect of database design: it allows for increased processing power, because it can help create pointers and indexes. Information about all the data that is processed, from sources to targets through transformations, is often put into a metadata repository: a database containing all the metadata. The entire ETL process can then be managed with metadata management. For example, one can query how a specific target attribute is built up in the ETL process (called data lineage), or determine the impact of a change, for example if the size of the order identifier (id) is changed, in which ETL steps that attribute plays a role.

3. CHALLENGES IN ETL PROCESS
ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems. The range of data values or the data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis can identify the data conditions that must be managed by the transformation rule specifications; this leads to amendments of the validation rules implemented explicitly and implicitly in the ETL process.

Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL is a key process for bringing all the data together in a standard, homogeneous environment. Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This includes understanding the volumes of data that must be processed within service-level agreements. The time available to extract from source systems may change, which may mean the same amount of data has to be processed in less time.
Some ETL systems have to scale to process terabytes of data to update data warehouses holding tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch, to multiple-day micro-batch, to integration with message queues or real-time change-data capture for continuous transformation and update.

4. PERFORMANCE
ETL vendors benchmark their record systems at multiple terabytes per hour (roughly 1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit network connections, and large amounts of memory. The fastest ETL record is currently held by Syncsort, Vertica and HP at 5.4 TB in under an hour, more than twice as fast as the earlier record held by Microsoft and Unisys. In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may perform slowly because they have to take care of concurrency, integrity maintenance, and indices. Thus, for better performance, it may make sense to employ:
- a direct-path extract or bulk unload whenever possible (instead of querying the database), to reduce the load on the source system while achieving high-speed extraction;
- most of the transformation processing outside of the database;
- bulk load operations whenever possible.

5. ETL TESTING
ETL testing is used to test the entire ETL process using a tool, or at the table level with the help of test cases and a rules-mapping document. ETL or data warehouse testing falls into four different engagement types, irrespective of the technology or ETL tools used:

New data warehouse testing - A new DW is built and verified from scratch. Data input is taken from customer requirements and different data sources, and the new data warehouse is built and verified with the help of ETL tools.

Migration testing - In this type of project the customer has an existing DW and ETL doing the job, but is looking to adopt a new tool in order to improve efficiency.
Change request - In this type of project new data is added to an existing DW from different sources. There may also be cases where the customer needs to change an existing business rule or integrate a new one.

Report testing - Reports are the end result of any data warehouse and the basic purpose for which a DW is built. Reports must be tested by validating their layout, the data in the report, and the calculations.
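Across all of these engagement types, the core verification is the same: the data loaded into the target must match the source. A minimal sketch of such a source-to-target reconciliation, with hypothetical table names (src_sales, dw_sales) and SQLite standing in for the real source system and data warehouse:

```python
import sqlite3

# Hypothetical source and target tables; the duplicate row with id 3
# is seeded deliberately so the checks have something to catch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_sales (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE dw_sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO src_sales VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0)])
conn.executemany("INSERT INTO dw_sales VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0), (3, 30.0)])

def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Check 1: row counts in source and target must match.
counts_match = count("src_sales") == count("dw_sales")

# Check 2: no duplicate keys were loaded into the target.
dupes = conn.execute(
    "SELECT id, COUNT(*) FROM dw_sales GROUP BY id HAVING COUNT(*) > 1"
).fetchall()

# Check 3: every source row is present in the target (data matching).
missing = conn.execute(
    "SELECT * FROM src_sales EXCEPT SELECT * FROM dw_sales"
).fetchall()

print(f"counts match: {counts_match}, duplicates: {dupes}, missing: {missing}")
```

In practice such checks are run per table against the rules-mapping document, with the transformed columns compared against independently computed expected values rather than a straight copy.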
6. ETL TESTING TECHNIQUES
- Verify that data is transformed correctly according to the various business requirements and rules.
- Make sure that all projected data is loaded into the data warehouse without any data loss or truncation.
- Make sure that the ETL application appropriately rejects invalid data, replaces it with default values, and reports it.
- Make sure that data is loaded into the data warehouse within the prescribed and expected time frames, to confirm performance and scalability.
- Verify that proper error logs, with full details, are generated for rejected data.
- Verify NULL-value fields.
- Verify that duplicate data is not loaded.
- Verify data integrity.
- Verify that the counts in the source and target match.
- Verify that the data matches between source and target.
- Verify that the transformed data is as expected.
- Verify that the data is updated incrementally.
- Verify that the foreign-key/primary-key relations are preserved during the ETL.
- Verify whether there are any duplicates in the loaded data.

7. ETL TESTING CHALLENGES
- Incompatible and duplicate data.
- Loss of data during the ETL process.
- Unavailability of an inclusive test bed.
- Testers have no privileges to execute ETL jobs on their own.
- The volume and complexity of the data is very large.
- Faults in business processes and procedures.
- Trouble acquiring and building test data.
- Missing business-flow information.

8. ETL BUGS
User interface/cosmetic bugs - Related to the GUI of the application: navigation, spelling mistakes, font style, font size, colors, alignment.
BVA-related bugs - Minimum and maximum (boundary) values.
ECP-related bugs - Valid and invalid equivalence classes.
Input/output bugs - Valid values not accepted; invalid values accepted.
Calculation bugs - Mathematical errors; the final output is wrong.
Load-condition bugs - The system does not allow multiple users, or does not support the customer's expected load.
Race-condition bugs - The system crashes or hangs, or cannot run on client platforms.
Version-control bugs - No logo match, or no version information available; these usually surface in regression testing.
H/W bugs - A device does not respond to the application.
Source bugs - Mistakes in help documents.

9. CONCLUSION
ETL stands for Extract, Transform and Load: it collects data from heterogeneous source systems (databases), transforms it, and loads it into the data warehouse (target). The methodology primarily aims at identifying and mitigating data defects and general errors before the data is processed for analytics and reporting.

REFERENCES
1. R. Kimball, The Data Warehouse Lifecycle Toolkit, p. 332.
2. Golfarelli and Rizzi, Data Warehouse Design, p. 291.
3. www.informatica.com
4. R. Kimball and J. Caserta, The Data Warehouse ETL Toolkit, Wiley, 2004.
5. Pedersen, Aalborg University, http://www.cs.aau.dk/~tbp/Teaching/DWML06/DWML06.html