Testing 3Vs (Volume, Variety and Velocity) of Big Data
A lot happens in the Digital World in 60 seconds
What is Big Data?
Big Data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data is a generic term used to describe voluminous amounts of unstructured, structured and semi-structured data.
Big Data Characteristics
3 key characteristics of big data:
- Volume: High volume of data created both inside and outside corporations via the web, mobile devices, IT infrastructure, and other sources
- Variety: Data is in structured, semi-structured and unstructured formats
- Velocity: Data is generated at high speed; high volumes of data need to be processed within seconds
Big Data Processing using the Hadoop Framework
❶ Load source data files into HDFS
❷ Perform Map/Reduce operations
❸ Extract the output results from HDFS
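Steps ❶ and ❸ are plain file transfers between the local file system and HDFS. A minimal sketch using the Hadoop FileSystem API is shown below; the file paths are illustrative placeholders, not from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoadExtract {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // resolves fs.defaultFS from the cluster config

        // Step 1: load a local source data file into HDFS (illustrative paths)
        fs.copyFromLocalFile(new Path("/local/source/data.csv"),
                             new Path("/bigdata/input/data.csv"));

        // Step 3: extract the Map/Reduce output from HDFS to the local file system
        fs.copyToLocalFile(new Path("/bigdata/output/part-r-00000"),
                           new Path("/local/results/part-r-00000"));
        fs.close();
    }
}
```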
Hadoop Map/Reduce Processing Overview
Map/Reduce is a distributed computing and parallel processing framework with the advantage of pushing the computation to the data.
- Distributed computing
- Parallel computing
- Based on Map and Reduce tasks
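To make the two task types concrete, here is the canonical word-count example (not from the slides): map tasks run in parallel on the nodes holding each input split, so the computation is pushed to the data, and reduce tasks aggregate the mappers' intermediate output.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: emits (word, 1) for every token in its input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce task: sums the counts emitted for each word across all mappers.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```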
Hadoop Eco-System
- HDFS: Hadoop Distributed File System
- HBase: NoSQL data store (non-relational distributed database)
- Map/Reduce: Distributed computing framework
- Sqoop: SQL-to-Hadoop database import and export tool
- Hive: Hadoop data warehouse
- Pig: Platform for creating Map/Reduce programs for analyzing large data sets
Unique Testing Opportunities in Big Data Implementations
Testing Opportunities for Independent Testing
- Early Validation of the Requirements
- Preparation of Big Test Data
- Early Validation of the Design
- Configuration Testing
- Incremental Load Testing
- Functional Testing
Early Validation of the Requirements
Applies to:
- Enterprise Data Warehouses integrated with Big Data
- Business Intelligence Systems integrated with Big Data
Questions to ask:
- Are the requirements mapped to the right data sources?
- Are there any data sources that were not considered? If so, why?
Early Validation of the Design
- Is the structured data stored in the right place for analytics?
- Is the unstructured data stored in the right place for analytics?
- Is the data duplicated in multiple storage systems? If so, why?
- Are the data synchronization needs adequately identified and addressed?
Preparation of Big Test Data
- Replicate data intelligently with tools
- How big should the data files be to ensure near-real volumes of data?
- Create data with an incorrect schema
- Create erroneous data (see the generator sketch below)
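A minimal sketch of such a generator is below; the CSV schema, file name, and fault rates are illustrative assumptions, not from the slides. It replicates well-formed rows at volume and deliberately injects schema violations and erroneous values so the pipeline's error handling can be exercised.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class TestDataGenerator {
    public static void main(String[] args) throws IOException {
        Random rnd = new Random(42); // fixed seed so test runs are reproducible
        try (PrintWriter out = new PrintWriter("test_data.csv")) {
            out.println("id,name,amount,timestamp"); // expected 4-column schema
            for (int i = 0; i < 1_000_000; i++) {
                int fault = rnd.nextInt(100);
                if (fault == 0) {
                    // Schema error: row with a missing column
                    out.printf("%d,user%d,%.2f%n", i, i, rnd.nextDouble() * 100);
                } else if (fault == 1) {
                    // Erroneous data: wrong type in the amount field
                    out.printf("%d,user%d,not-a-number,2012-01-01T00:00:00%n", i, i);
                } else {
                    // Well-formed row
                    out.printf("%d,user%d,%.2f,2012-01-01T00:00:00%n", i, i,
                               rnd.nextDouble() * 100);
                }
            }
        }
    }
}
```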
Cluster Setup Testing
- Is the system behaving as expected when a node is removed from the cluster?
- Is the system behaving as expected when a node is added to the cluster?
Big Data Testing
Volume Testing: Challenges
- Terabytes and petabytes of data
- Data is stored in HDFS in file formats
- Data files are split and stored across multiple data nodes
- 100% coverage cannot be achieved
- Data consolidation issues
Volume Testing: Approach
- Use a data sampling strategy; sampling should be based on the data requirements
- Convert raw data into the expected result format to compare with the actual output data
- Prepare compare scripts to compare against the data present in HDFS file storage (see the sketch below)
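A minimal sketch of such a compare script follows, assuming the expected results were prepared as a local line-oriented file and the actual output sits in an HDFS part file (both paths are illustrative). It reports lines that are missing from, or unexpected in, the actual output.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCompare {
    public static void main(String[] args) throws Exception {
        // Load the expected results prepared from the sampled raw data
        Set<String> expected = new HashSet<>();
        try (BufferedReader r = new BufferedReader(new FileReader("expected_sample.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                expected.add(line);
            }
        }

        // Stream the actual output from HDFS and diff it against the expected set
        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                fs.open(new Path("/bigdata/output/part-r-00000"))))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (!expected.remove(line)) {
                    System.out.println("UNEXPECTED: " + line);
                }
            }
        }
        expected.forEach(line -> System.out.println("MISSING: " + line));
    }
}
```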
Variety Testing: Challenges
- Manually validating semi-structured and unstructured data
- Unstructured data validation issues because there is no defined format
- A lot of scripting is required to process semi-structured and unstructured data
- Sampling unstructured data is challenging
Variety Testing: Approach
- Structured data: Compare data using compare tools and identify the discrepancies
- Semi-structured data: Convert semi-structured data into a structured format; format the converted raw data into expected results; compare the expected result data with the actual results
- Unstructured data: Parse unstructured text data into data blocks and aggregate the computed data blocks; validate the aggregated data against the data output (see the sketch below)
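For the unstructured case, a minimal sketch follows, assuming free-form application log lines as the input (the log file name and severity-level format are illustrative): it parses each line into a (level, count) data block, aggregates the blocks, and prints the totals for validation against the system's output.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogParserValidator {
    // Illustrative assumption: each log line carries one of these severity levels
    private static final Pattern LEVEL = Pattern.compile("\\b(INFO|WARN|ERROR)\\b");

    public static void main(String[] args) throws Exception {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader("app.log"))) {
            String line;
            while ((line = r.readLine()) != null) {
                Matcher m = LEVEL.matcher(line);
                if (m.find()) {
                    counts.merge(m.group(1), 1L, Long::sum); // aggregate parsed data blocks
                }
            }
        }
        // These aggregates are then validated against the actual data output
        counts.forEach((level, n) -> System.out.println(level + "\t" + n));
    }
}
```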
Velocity Testing: Challenges
- Setting up a production-like environment for performance testing
- Simulating production job runs
- Streaming high-velocity, high-volume data
- Test data setup
- Simulating node failures
Velocity Testing: Approach
Validation points:
- Performance of Pig/Hive jobs: capture job completion time and validate it against the benchmark
- Throughput of the jobs
- Impact of background processes on the performance of the system
- Memory and CPU details of the task tracker
- Availability of the name node and data nodes
Metrics captured (see the sketch below):
- Job completion time
- Throughput
- Memory utilization
- Number of spills and spilled records
- Job failure rate
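A minimal sketch of capturing two of these metrics, job completion time and spilled records, follows. The job name is illustrative, and the mapper/reducer are left at Hadoop's identity defaults; in a real test the job classes under test would be configured here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobMetricsCapture {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "velocity-test-job");
        // Mapper/Reducer default to identity; set the real job classes under test here
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        long start = System.currentTimeMillis();
        boolean ok = job.waitForCompletion(true);
        long elapsedMs = System.currentTimeMillis() - start;
        System.out.println("Job succeeded: " + ok);
        System.out.println("Completion time (ms): " + elapsedMs);

        // Job counters expose spill behavior for the metrics above
        Counters counters = job.getCounters();
        long spilled = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
        System.out.println("Spilled records: " + spilled);
    }
}
```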
Questions?
THANK YOU
www.infosys.com
© 2012 Infosys Limited. All rights reserved.