Infosys Labs Briefings, Vol. 11, No. 1, 2013

Big Data: Testing Approach to Overcome Quality Challenges

By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja

Validate data quality by employing a structured testing technique.

Testing Big data is one of the biggest challenges faced by organizations because of the lack of knowledge about what to test and how much data to test. Organizations have been struggling to define test strategies for structured and unstructured data validation, to set up an optimal test environment, to work with non-relational databases and to perform non-functional testing. These challenges result in poor quality of data in production, delayed implementation and increased cost. A robust testing approach needs to be defined for validating structured and unstructured data, and testing needs to start early so that defects are identified early in the implementation life cycle, reducing the overall cost and time to market.

Different testing types, both functional and non-functional, are required, along with strong test data and test environment management, to ensure that data from varied sources is processed error free and is of good enough quality for analysis. Functional testing activities such as validation of the MapReduce process, structured and unstructured data validation, and data storage validation are important to ensure that the data is correct and of good quality. Apart from these functional validations, non-functional testing such as performance and failover testing plays a key role in ensuring that the whole process is scalable and completes within the specified SLA.

A Big data implementation deals with writing complex Pig and Hive programs and running these jobs using the Hadoop MapReduce framework on huge volumes of data across different nodes. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. Hadoop uses Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Hadoop utilizes its own distributed file system, HDFS, which makes data available to multiple computing nodes.
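
To make the map and reduce roles concrete, the sketch below shows the two steps in the style of Hadoop Streaming, using Python purely for illustration; the article itself describes jobs written with Pig, Hive and the Java MapReduce framework. The hypothetical input is a whitespace-delimited web log whose first field is a URL and whose last field is a numeric response size; those field positions, the file names and the aggregation (hits and bytes per URL) are assumptions, not part of the original text.

    #!/usr/bin/env python3
    # mr_sketch.py -- map and reduce steps in the style of Hadoop Streaming (illustrative only).
    import sys

    def mapper():
        # Emit "url <TAB> bytes" for each record; assumes the URL is the first field
        # and the response size is the last field of a whitespace-delimited line.
        for line in sys.stdin:
            parts = line.split()
            if len(parts) < 2:
                continue                      # skip malformed records
            url, size = parts[0], parts[-1]
            if size.isdigit():
                print(f"{url}\t{size}")       # key-value pair handed to the shuffle/sort phase

    def reducer():
        # Consume mapper output sorted by key and emit "url <TAB> hits <TAB> total_bytes".
        current, hits, total = None, 0, 0
        for line in sys.stdin:
            url, size = line.rstrip("\n").split("\t")
            if url != current:
                if current is not None:
                    print(f"{current}\t{hits}\t{total}")
                current, hits, total = url, 0, 0
            hits += 1
            total += int(size)
        if current is not None:
            print(f"{current}\t{hits}\t{total}")

    if __name__ == "__main__":
        # Local dry run: cat weblog.txt | python3 mr_sketch.py map | sort | python3 mr_sketch.py reduce
        mapper() if sys.argv[1:] == ["map"] else reducer()

Running the same logic locally in a pipe, as in the comment above, is also a convenient way to validate the business logic on a standalone node before it is executed across the cluster.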

Figure 1 shows the step-by-step process of how Big data is processed using the Hadoop ecosystem.

[Figure 1: Big Data Testing Focus Areas: 1. Load source data files into HDFS; 2. Perform MapReduce operations; 3. Extract the output results from HDFS. Source: Infosys Research]

The first step, loading source data into HDFS, involves extracting the data from different source systems and loading it into HDFS. Data is extracted using crawl jobs for web data and tools such as Sqoop for transactional data, and is then loaded into HDFS by splitting it into multiple files. Once this step is completed, the second step, performing MapReduce operations, involves processing the input files and applying map and reduce operations to get the desired output. The last step, extracting the output results from HDFS, involves taking the data output generated in the second step and loading it into downstream systems, which can be an enterprise data warehouse (EDW) for generating analytical reports, or any of the transactional systems for further processing.
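
As a small illustration of the third step, the sketch below reads MapReduce output part files that have already been copied out of HDFS (for example with hdfs dfs -get) and loads the rows into a downstream staging table. The tab-separated (url, hits, bytes) layout matches the earlier streaming sketch, and the table name and the use of sqlite3 as a stand-in for the actual warehouse loader are assumptions for illustration only.

    #!/usr/bin/env python3
    # load_output_to_edw.py -- load MapReduce output part files into a downstream staging table.
    # Assumes the part files were first copied out of HDFS as tab-separated text.
    import glob
    import sqlite3

    def load(part_file_dir="output", db_path="edw.db"):
        rows = []
        for path in sorted(glob.glob(f"{part_file_dir}/part-*")):
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    url, hits, total_bytes = line.rstrip("\n").split("\t")
                    rows.append((url, int(hits), int(total_bytes)))
        with sqlite3.connect(db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS url_traffic_staging (url TEXT, hits INTEGER, bytes INTEGER)"
            )
            conn.executemany("INSERT INTO url_traffic_staging VALUES (?, ?, ?)", rows)
        print(f"loaded {len(rows)} rows into url_traffic_staging")

    if __name__ == "__main__":
        load()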

BIG DATA TESTING APPROACH

As we are dealing with huge volumes of data and executing across multiple nodes, there is a high chance of bad data and data quality issues at each stage of the process. Functional data testing is performed to identify data issues caused by coding errors or node configuration errors. Testing should be performed at each of the three phases of Big data processing to ensure that data is processed without errors. Functional testing includes (i) validation of pre-Hadoop processing; (ii) validation of the Hadoop MapReduce process data output; and (iii) validation of the data extract and load into the EDW. Apart from these functional validations, non-functional testing, including performance testing and failover testing, needs to be performed. Figure 2 shows a typical Big data architecture diagram and highlights the areas where testing should be focused.

[Figure 2: Big Data architecture with testing focus areas highlighted: pre-Hadoop process validation, Map-Reduce process validation, ETL process validation, reports testing, and non-functional testing (performance, failover). The architecture shows web logs, streaming data, social data and transactional data (RDBMS) loaded via Sqoop into HDFS, MapReduce job execution with Pig, Hive and HBase (NoSQL DB), an ETL process into the enterprise data warehouse, and reporting using BI tools. Source: Infosys Research]

Validation of Pre-Hadoop Processing

Data from various sources such as weblogs, social network sites, call logs and transactional data is extracted based on the requirements and loaded into HDFS before it is processed further.

Issues: Some of the issues faced while moving data from the source systems to Hadoop are incorrect data captured from the source systems, incorrect storage of data, and incomplete or incorrect replication.

Validations: Some high-level scenarios that need to be validated during this phase include (a reconciliation sketch follows this list):

1. Comparing the input data files against source system data to ensure the data is extracted correctly
2. Validating the data requirements and ensuring the right data is extracted
3. Validating that the files are loaded into HDFS correctly
4. Validating that the input files are split, moved and replicated across the different data nodes.
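
As a concrete illustration of validation 1 above, the sketch below reconciles record counts and per-record checksums between a source extract and the corresponding file pulled back from HDFS. It is a minimal example rather than the authors' tooling: the pipe-free, line-per-record layout and the file paths are assumptions, and the HDFS side is assumed to have been staged locally first (for example with hdfs dfs -get) or read through a suitable client library.

    #!/usr/bin/env python3
    # reconcile.py -- compare a source extract with the copy loaded into HDFS.
    # Assumes both sides have been staged as local, line-per-record text files.
    import hashlib
    import sys

    def fingerprints(path):
        """Return (record_count, set of per-record MD5 digests)."""
        count, digests = 0, set()
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                count += 1
                digests.add(hashlib.md5(line.rstrip("\n").encode("utf-8")).hexdigest())
        return count, digests

    def main(source_path, hdfs_copy_path):
        src_count, src_digests = fingerprints(source_path)
        tgt_count, tgt_digests = fingerprints(hdfs_copy_path)
        print(f"source records: {src_count}, loaded records: {tgt_count}")
        missing = src_digests - tgt_digests      # records lost or altered during the load
        extra = tgt_digests - src_digests        # records that appear only after the load
        print(f"missing in HDFS copy: {len(missing)}, unexpected in HDFS copy: {len(extra)}")
        return 0 if src_count == tgt_count and not missing and not extra else 1

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1], sys.argv[2]))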

Validation of Hadoop MapReduce Process

Once the data is loaded into HDFS, the Hadoop MapReduce process is run to process the data coming from the different sources.

Issues: Some issues faced during this phase of data processing are coding issues in the MapReduce jobs, jobs that work correctly on a standalone node but incorrectly on multiple nodes, incorrect aggregations, incorrect node configurations, and incorrect output formats.

Validations: Some high-level scenarios that need to be validated during this phase include:

1. Validating that data processing is completed and the output file is generated
2. Validating the business logic on a standalone node and then validating it again after running against multiple nodes
3. Validating the MapReduce process to verify that key-value pairs are generated correctly
4. Validating the aggregation and consolidation of data after the reduce process
5. Validating the output data against the source files and ensuring the data processing completed correctly
6. Validating the output data file format and ensuring that the format is as per the requirement.

Validation of Data Extract and Load into EDW

Once the MapReduce process is completed and the data output files are generated, this processed data is moved to the enterprise data warehouse or any other transactional system, depending on the requirement.

Issues: Some issues faced during this phase include incorrectly applied transformation rules, incorrect loads of HDFS files into the EDW, and incomplete data extracts from HDFS.

Validations: Some high-level scenarios that need to be validated during this phase include:

1. Validating that transformation rules are applied correctly
2. Validating that there is no data corruption by comparing target table data against the HDFS file data
3. Validating the data load in the target system
4. Validating the aggregation of data
5. Validating the data integrity in the target system.

Validation of Reports

Analytical reports are generated using reporting tools, either by fetching the data from the EDW or by running queries on Hive.

Issues: Some of the issues faced while generating reports are report definitions not set as per the requirement, report data issues, and layout and format issues.

Validations: Some high-level validations performed during this phase include (a query-reconciliation sketch follows):

Reports Validation: Reports are tested after the ETL/transformation workflows are executed for all the source systems and the data is loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of the data available for report authoring. Checks are performed by writing queries to verify whether the views are getting the exact data needed for the generation of the reports.

Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.

Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing involves ensuring all objects are rendered properly and that the resources on the webpage are current. The data fetched from the various web parts is validated against the databases.
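
The sketch below illustrates the kind of query-based check described above: it recomputes an aggregate directly from a warehouse table and compares it with the figures exported from the report. It is only an outline; the sales_fact table, the region and revenue columns and the CSV export format are assumptions, and sqlite3 stands in here for whatever database driver the actual EDW would require.

    #!/usr/bin/env python3
    # report_check.py -- compare a report's exported totals with an aggregate recomputed from the warehouse.
    import csv
    import sqlite3

    def warehouse_totals(db_path):
        """Recompute revenue per region straight from the warehouse fact table (assumed schema)."""
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(
                "SELECT region, ROUND(SUM(revenue), 2) FROM sales_fact GROUP BY region"
            )
            return {region: total for region, total in rows}

    def report_totals(csv_path):
        """Read the totals as exported from the BI report (assumed columns: region, revenue)."""
        with open(csv_path, newline="", encoding="utf-8") as handle:
            return {row["region"]: round(float(row["revenue"]), 2) for row in csv.DictReader(handle)}

    def compare(db_path, csv_path):
        expected, actual = warehouse_totals(db_path), report_totals(csv_path)
        mismatches = {
            region: (expected.get(region), actual.get(region))
            for region in expected.keys() | actual.keys()
            if expected.get(region) != actual.get(region)
        }
        return mismatches  # an empty dict means the report matches the warehouse

    if __name__ == "__main__":
        print(compare("edw.db", "report_export.csv"))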

VOLUME, VARIETY AND VELOCITY: HOW TO TEST?

The earlier sections gave step-by-step details of what needs to be tested at each phase of Big data processing. During these phases, the three dimensions or characteristics of Big data, i.e., volume, variety and velocity, are validated to ensure there are no data quality defects and no performance issues.

Volume: The amount of data created both inside and outside corporations via the web, mobile devices, IT infrastructure, and other sources is increasing exponentially each year [3]. Huge volumes of data flow from multiple systems and need to be processed and analyzed. When it comes to validation, it is a big challenge to ensure that the whole data set processed is correct. Manually validating the whole data set is a tedious task, so comparison scripts should be used to validate the data. As data in HDFS is stored as files, scripts can be written to compare two files and extract the differences using compare tools [4]. Even with compare tools, a 100 percent data comparison takes a lot of time. To reduce the execution time, we can either run all the comparison scripts in parallel on multiple nodes, just as data is processed by the Hadoop MapReduce process, or sample the data while ensuring maximum scenario coverage. Figure 3 shows the approach for comparing voluminous amounts of data: data is converted into the expected-result format and then compared against the actual data using compare tools (a file-comparison sketch in this spirit follows below). This is a faster approach but involves initial scripting time, and it reduces subsequent regression testing cycle time. When there is not enough time to validate the complete data set, sampling can be done for validation.

[Figure 3: Approach for High Volume Data Validation. Structured data is converted by scripts from raw data into the expected-results format; unstructured data is first converted into structured data by custom scripts; MapReduce jobs are run in the test environment to generate the output data files (actual results); a compare tool then performs a file-by-file comparison of expected and actual results and produces a discrepancy report. Source: Infosys Research]
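
Following the file-by-file comparison idea in Figure 3, here is a minimal sketch of a compare script that diffs an expected-results file against the actual output and writes a discrepancy report. Keying the comparison on the first field of a pipe-delimited record is an assumption for illustration; real record layouts and compare tools will differ.

    #!/usr/bin/env python3
    # compare_files.py -- file-by-file comparison of expected vs. actual results with a discrepancy report.
    import sys

    def load(path, key_index=0, delimiter="|"):
        """Index each record by its key field (assumed to be the first column)."""
        records = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                fields = line.rstrip("\n").split(delimiter)
                records[fields[key_index]] = fields
        return records

    def compare(expected_path, actual_path, report_path="discrepancy_report.txt"):
        expected, actual = load(expected_path), load(actual_path)
        with open(report_path, "w", encoding="utf-8") as report:
            for key in sorted(expected.keys() | actual.keys()):
                if key not in actual:
                    report.write(f"MISSING {key}: present in expected only\n")
                elif key not in expected:
                    report.write(f"EXTRA   {key}: present in actual only\n")
                elif expected[key] != actual[key]:
                    report.write(f"DIFF    {key}: expected={expected[key]} actual={actual[key]}\n")
        print(f"Discrepancy report written to {report_path}")

    if __name__ == "__main__":
        compare(sys.argv[1], sys.argv[2])

For truly large volumes, several instances of such a script would be run in parallel over the split output files, mirroring how the data itself is processed.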

Variety: The variety of data types is increasing, notably unstructured text-based data and semi-structured data such as social media data, location-based data, and log-file data.

Structured data is data in a defined format, coming from RDBMS tables or from structured files. Data of a transactional nature can be handled in files or tables for validation purposes.

Semi-structured data does not have a defined format, but a structure can be derived from the recurring patterns in the data; an example is data extracted by crawling different websites for analysis purposes. For validation, the data first needs to be transformed into a structured format using custom-built scripts: the pattern is identified, a copy book or pattern outline is prepared, and that copy book is then used in scripts to convert the incoming data into a structured format, after which validations are performed using compare tools (a pattern-based conversion sketch follows below).

Unstructured data is data that does not have any format and is stored in documents, web content, etc. Testing unstructured data is very complex and time consuming. Automation can be achieved to some extent by converting the unstructured data into structured data using scripting, such as Pig scripts, as shown in Figure 3. But overall automation coverage will be low because of the unpredictable behavior of the data; the input can arrive in any form and changes every time a new test is performed. A business-scenario validation strategy is therefore needed for unstructured data: identify the different scenarios that can occur in day-to-day unstructured data analysis, then set up test data based on those scenarios and execute them.
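
To illustrate the copy-book idea, the sketch below converts semi-structured web-server log lines into pipe-delimited structured records using a regular-expression pattern. The log layout shown (a common combined-log style) and the output columns are assumptions for illustration; in practice such conversions are often written as Pig or similar scripts running against data in HDFS, as the article notes.

    #!/usr/bin/env python3
    # logs_to_structured.py -- convert semi-structured web log lines into structured, delimited records.
    import re
    import sys

    # "Copy book" for the assumed log layout: client IP, timestamp, request line, status, bytes.
    LOG_PATTERN = re.compile(
        r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    )

    def convert(line):
        """Return a pipe-delimited record, or None if the line does not match the pattern."""
        match = LOG_PATTERN.match(line)
        if not match:
            return None
        fields = match.groupdict()
        fields["bytes"] = "0" if fields["bytes"] == "-" else fields["bytes"]
        return "|".join(fields[name] for name in ("ip", "timestamp", "method", "url", "status", "bytes"))

    if __name__ == "__main__":
        rejected = 0
        for raw in sys.stdin:
            record = convert(raw.rstrip("\n"))
            if record is None:
                rejected += 1          # unconvertible lines are counted for the data quality report
            else:
                print(record)
        print(f"rejected lines: {rejected}", file=sys.stderr)

Once converted, the structured output can be validated with the same compare-tool approach used for structured data.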

Velocity: The speed at which new data is created, and the need for real-time analytics to derive business value from it, is increasing thanks to the digitization of transactions, mobile computing and the sheer number of internet and mobile device users. Data speed needs to be considered when implementing any Big data appliance in order to avoid performance problems. Performance testing plays an important role in identifying any performance bottleneck in the system and in confirming that the system can handle high-velocity streaming data.

NON-FUNCTIONAL TESTING

The earlier sections showed how functional testing is performed at each phase of Big data processing; those tests identify functional coding issues and requirements issues. Performance testing and failover testing need to be performed to identify performance bottlenecks and to validate the non-functional requirements.

Performance Testing: Any Big data project involves processing huge volumes of structured and unstructured data across multiple nodes to complete the job in less time. At times, because of bad architecture and poorly designed code, performance degrades. If the performance does not meet the SLA, the purpose of setting up Hadoop and other Big data technologies is lost; hence performance testing plays a key role in any Big data project, given the huge volume of data and the complex architecture. Some of the areas where performance issues can occur are imbalanced input splits, redundant shuffles and sorts, and aggregation computations placed in the reduce process that could have been done in the map process [5]. These performance issues can be eliminated by carefully designing the system architecture and by running performance tests to identify the bottlenecks. Performance testing is conducted by setting up a huge volume of data and an infrastructure similar to production. Utilities like Hadoop performance monitoring tools can be used to capture the performance metrics and identify issues. Performance metrics such as job completion time and throughput, and system-level metrics such as memory utilization, are captured as part of performance testing.

Failover Testing: The Hadoop architecture consists of a name node and hundreds of data nodes hosted on several server machines, all of which are connected. There are chances of node failure in which some of the HDFS components become non-functional; failures can include name node failure, data node failure and network failure. The HDFS architecture is designed to detect these failures and automatically recover so that processing can proceed. Failover testing is an important focus area in Big data implementations, with the objective of validating the recovery process and ensuring that data processing continues seamlessly when switched to other data nodes. Some validations to be performed during failover testing are: validating that checkpoints of the name node's edit logs and FsImage happen at the defined intervals, recovery of the name node's edit logs and FsImage files, no data corruption because of a name node failure, data recovery when a data node fails, and validating that replication is initiated when a data node fails or data becomes corrupted. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics are captured during failover testing (a small measurement sketch follows).
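
As a minimal illustration of capturing those failover metrics, the sketch below compares the measured recovery time and data-loss window from a failover drill against assumed RTO and RPO targets; the timestamps and thresholds are placeholders, not values from the article.

    #!/usr/bin/env python3
    # failover_metrics.py -- compare measured recovery figures from a failover drill against RTO/RPO targets.
    from datetime import datetime, timedelta

    RTO_TARGET = timedelta(minutes=15)   # assumed maximum acceptable recovery time
    RPO_TARGET = timedelta(minutes=5)    # assumed maximum acceptable data-loss window

    def evaluate(failure_at, recovered_at, last_good_checkpoint_at):
        """Return measured recovery time and data-loss window plus pass/fail flags."""
        recovery_time = recovered_at - failure_at
        data_loss_window = failure_at - last_good_checkpoint_at
        return {
            "recovery_time": recovery_time,
            "recovery_within_rto": recovery_time <= RTO_TARGET,
            "data_loss_window": data_loss_window,
            "loss_within_rpo": data_loss_window <= RPO_TARGET,
        }

    if __name__ == "__main__":
        # Example timestamps from a hypothetical drill log.
        result = evaluate(
            failure_at=datetime(2013, 1, 10, 10, 0, 0),
            recovered_at=datetime(2013, 1, 10, 10, 12, 30),
            last_good_checkpoint_at=datetime(2013, 1, 10, 9, 57, 0),
        )
        for name, value in result.items():
            print(f"{name}: {value}")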

TEST ENVIRONMENT SETUP

As Big data involves handling huge volumes and processing across multiple nodes, setting up a test environment is the biggest challenge. Setting up the environment on the cloud gives the flexibility to set it up and maintain it during test execution. Hosting the environment on the cloud also helps in optimizing the infrastructure and achieving faster time to market. The key steps involved in setting up the environment on the cloud are [6]:

A. Big data test infrastructure requirement assessment

1. Assess the Big data processing requirements
2. Evaluate the number of data nodes required in the QA environment
3. Understand the data privacy requirements to decide between a private or public cloud
4. Evaluate the software inventory required to be set up on the cloud environment (Hadoop, the file system to be used, NoSQL databases, etc.).

B. Big data test infrastructure design

1. Document the high-level cloud test infrastructure design (disk space, RAM required for each node, etc.)
2. Identify the cloud infrastructure service provider
3. Document the SLAs, communication plan, maintenance plan and environment refresh plan
4. Document the data security plan
5. Document the high-level test strategy, testing release cycles, testing types, volume of data processed by Hadoop, and third-party tools required.

C. Big data test infrastructure implementation and maintenance

1. Create a cloud instance of the Big data test environment
2. Install Hadoop, HDFS, MapReduce and other software as per the infrastructure design
3. Perform a smoke test on the environment by processing a sample of MapReduce and Pig/Hive jobs
4. Deploy the code to perform testing.

BEST PRACTICES

Data Quality: It is very important to establish the data quality requirements for the different forms of data, such as traditional data sources, data from social media, and data from sensors. If the data quality is ascertained, the transformation logic alone can be tested by executing tests against all possible data sets.

Data Sampling: Data sampling gains significance in a Big data implementation, and it becomes the tester's job to identify suitable sampling techniques that cover all critical business scenarios and the right test data set.

Automation: Automate the test suites as much as possible. The Big data regression test suite will be used multiple times as the database is periodically updated, so an automated regression test suite should be built and reused after each release. This will save a lot of time during Big data validations.

CONCLUSION

Data quality challenges can be addressed by deploying a structured testing approach for both functional and non-functional requirements. Applying the right test strategies and following best practices will improve testing quality, which helps in identifying defects early and reduces the overall cost of the implementation. Organizations need to invest in building skill sets in both development and testing. Big data testing will be a specialized stream, and the testing team should be built with a diverse skill set including coding, white-box testing skills and data analysis skills, so that they can do a better job of identifying quality issues in the data.

REFERENCES

1. Big data overview, Wikipedia.org. Available at http://en.wikipedia.org/wiki/big_data.
2. White, T. (2010), Hadoop: The Definitive Guide, 2nd Edition, O'Reilly Media.
3. Kelly, J. (2012), Big Data: Hadoop, Business Analytics and Beyond, A Big Data Manifesto from the Wikibon Community, March 2012. Available at http://wikibon.org/wiki/v/big_data:_Hadoop,_Business_Analytics_and_Beyond.
4. Informatica Enterprise Data Integration (1998), Data verification using File and Table compare utility for HDFS and Hive tool. Available at https://community.informatica.com/solutions/1998.
5. Bhandarkar, M. (2009), Practical Problem Solving with Hadoop, USENIX '09 Annual Technical Conference, June 2009. Available at http://static.usenix.org/event/usenix09/training/tutonefile.html.
6. Naganathan, V. (2012), Increase Business Value with Cloud-based QA Environments. Available at http://www.infosys.com/it-services/independentvalidation-testing-services/pages/cloud-based-qa-environments.aspx.

AUTHORS' PROFILES

MAHESH GUDIPATI is a Project Manager with the FSI business unit of Infosys. He can be reached at mahesh_gudipati@infosys.com.

SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be contacted at Shanthi_Rao@infosys.com.

NAJU D MOHAN is a Delivery Manager with the RCL business unit of Infosys. She can be contacted at naju_mohan@infosys.com.

NAVEEN KUMAR GAJJA is a Technical Architect with the FSI business unit of Infosys. He can be contacted at Naveen_Gajja@infosys.com.

For information on obtaining additional copies, reprinting or translating articles, and all other correspondence, please contact: InfosyslabsBriefings@infosys.com.

Infosys Limited, 2013

Infosys acknowledges the proprietary rights of the trademarks and product names of the other companies mentioned in this issue of Infosys Labs Briefings. The information provided in this document is intended for the sole use of the recipient and for educational purposes only. Infosys makes no express or implied warranties relating to the information contained in this document or to any derived results obtained by the recipient from the use of the information in the document. Infosys further does not guarantee the sequence, timeliness, accuracy or completeness of the information and will not be liable in any way to the recipient for any delays, inaccuracies, errors in, or omissions of, any of the information or in the transmission thereof, or for any damages arising therefrom. Opinions and forecasts constitute our judgment at the time of release and are subject to change without notice. This document does not contain information provided to us in confidence by our clients.