BIG DATA: BIG CHALLENGE FOR SOFTWARE TESTERS

Megha Joshi
Assistant Professor, ASM's Institute of Computer Studies, Pune, India

Abstract: Industry is struggling to handle voluminous, complex, unstructured and vast data sets. Testing big data is one of the major challenges that test engineers will soon encounter. Much of today's data is in unstructured format; for example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search: transforming such content into a structured format for later analysis is a major challenge. This paper details the challenges that software testing will have to deal with as Big Test Data grows. The focus is on two specialized testing techniques used to explain the concepts and trends.

Keywords: Big data, big test data, big test management, data warehouse test, testing, voluminous.

INTRODUCTION

Big data has generated a lot of buzz in recent times. Enormous, voluminous, vast, complex and heterogeneous are some of the terms commonly perceived when Big Data is thought of. Big Data is the continuous explosion of large volumes of data that are generated, processed, stored and accessed instantaneously by applications that handle several concurrent data transactions. The transition from structured relational data to voluminous, unstructured, non-semantic, yet essential and highly complex data remains a great challenge for the data managers, data workers and data analysts who must hold and organize such Big Data. Social networking sites, patenting websites, geographical and spatial data processing applications, remote sensing and meteorological systems have come to collect data within fractions of a second, and the veracity of all of this data must be considered. Though system architects and designers are researching better ways to master Big Data, test architects and test engineers are also not far from facing Big Test Data.
Whether static or dynamic, Big Data has four dimensions: volume, velocity, variety, and veracity.
Fig. 1. Dimensions of Big Data

Volume is the tremendous amount of data. Enterprises are flooded with ever-growing data of all types, easily accumulating terabytes or even petabytes of information. Variety is the heterogeneity of data: big data can be any type of data, whether structured, semi-structured or unstructured, such as text, sensor data, audio, video, click streams, log files and more. New insights are found when these data types are analyzed together. Velocity is the rate at which data comes in, flows within and goes out; time is an important factor to be considered here. Veracity is the correctness or quality of the data or information. Establishing trust in big data presents a huge challenge as the variety and number of sources grow.

1. OVERVIEW OF BIG DATA ARCHITECTURE

As has always been the case, big is complex. No generalized architecture or framework can be designed for Big Data, as the challenges that lie ahead are even bigger. Tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search: transforming such content into a structured format for later analysis is a major challenge. In addition, smileys, icons, long URLs and chunks of historic data are entirely too complex to process in a single framework. Irrespective of the nature of the data, the underlying storage structure is usually preferred to be a file storage system, over which Hadoop's distributions, mappings and MapReduce configurations are programmed to access the BIG DATA. Implementations based on Apache Hadoop face no hard storage limitations, as they are capable of storing the data in multiple clusters. The storage is provided by HDFS (Hadoop Distributed File System), a reliable shared storage system whose contents can be analyzed using the MapReduce technology. Programming languages play an important role in extracting and cleansing the acquired data and making it representable.
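The map and reduce phases described above can be illustrated with a minimal, single-process Python sketch of a word count; real Hadoop jobs apply the same phases across a cluster, and the input records here are invented for illustration.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in a line of unstructured text.
    for word in record.lower().split():
        yield (word, 1)

def reduce_phase(key, values):
    # Aggregate all counts emitted for one key.
    return (key, sum(values))

def map_reduce(records):
    # Map each record, shuffle the emitted pairs by key,
    # then reduce each group to a single result.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = map_reduce(["big data is big", "testing big data"])
print(counts["big"])   # prints 3
```

In Hadoop the shuffle and the distribution of mappers and reducers over the cluster are handled by the framework; only the two phase functions are written by the programmer.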
NoSQL queries are customized to the type of data that is to be analyzed and fetched. A big picture of how BIG DATA functions is given below:
Fig. 2. Functional Model of Big Data

Looking at this big picture of BIG DATA, the following aspects of testing should be given due consideration:
- Gathering test requirements
- Collecting big test data
- Availability of test data for the environment
- Veracity of the patterns of usage, in the case of usability testing
- Security of the data
- Stress caused on the system by the workload and volume of data
- The rate at which the data storage media must scale up
- Performance issues when an application receives such unanticipated volumes of data from a variety of sources

The following sections throw light on some of these challenges.

2. SPECIALIZED TESTING WITH BIG TEST DATA

Test architects and test engineers are not far from handling Big Test Data (BTD). Though at present we handle clean, structured, frozen design and code and successfully complete our test cycles, it is not going to stay that way in the near future. Here are two specialized testing areas where BTD will be a real challenge for us too.

2.1 Big Test in Data Warehouse Test

A data warehouse is by itself a heterogeneous collection of relational tables, and is considered complex in most cases. A data warehouse with 27 billion rows is really big, and is tested for completeness, quality, scalability, integration, acceptance, etc. [2]. All these traits are tested in a controlled environment in order to ensure that data mining and analysis of the data in
the DW happens properly. But how is this possible with an uncontrollable environment and uncontrollable data? Here are a few differences.

Fig. 3. Comparison of Data Warehouse Test and Big Data Test

Data Warehouse Test                                | Big Data Test
Clean data                                         | Unclean data
Simplified, structured, semantic data              | Complex, unstructured, non-semantic data
Structured database schema                         | Customized, instantly generated schema
Data from relational databases, SQL-queried        | Data from non-relational flat-file storage in different formats, NoSQL-queried
Specific business, transformation and design rules | No specific business rules are applied
Changes in code and data are known and defined     | Changes are unanticipated and occur with high velocity

Data warehouse analytics and testing are BI processes that will demand higher-level testing strategies, processes and tools when Big Data comes into the picture. How big the data is, is a comparative perception. Big data has existed since the late 1980s, and with every explosion of technological advancement it has grown bigger. So the perception of how test strategies are viewed, and how processes are redefined, must evolve. There are three things to be considered:

a) Make big things simple: Instead of talking about and discussing big solutions for big test data, it is better to organize the big warehouse into simpler units that are easily testable. The tests for completeness and quality begin here. Once the big data warehouse is logically or conceptually compartmentalized, the testing power increases [3]. As BIG DATA normally prefers distributed and parallel computing, this strategy of divide and test will improve our testing processes.

b) Normalize design & tests: Though NoSQL operates well with non-relational schemas, normalization of the customized schema structures is mandatory for successful DWT testing with BTD.
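As a hypothetical sketch of such normalization, heterogeneous records can be projected onto one fixed, design-level schema before normalized test data is generated; the field names and defaults below are invented for illustration.

```python
# Target schema agreed at design level: every test record must
# carry exactly these fields, with defaults for missing values.
TARGET_SCHEMA = {"id": None, "text": "", "source": "unknown"}

def normalize(record):
    # Project each heterogeneous record onto the target schema,
    # dropping unknown fields and filling missing ones with defaults.
    return {field: record.get(field, default)
            for field, default in TARGET_SCHEMA.items()}

raw = [
    {"id": 1, "text": "a tweet", "likes": 42},   # has an extra field
    {"id": 2, "source": "blog"},                 # has a missing field
]
normalized = [normalize(r) for r in raw]
print(normalized[0])   # {'id': 1, 'text': 'a tweet', 'source': 'unknown'}
```

Records with extra fields are trimmed and records with missing fields are filled with defaults, so every generated test record conforms to one schema.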
This begins with Hadoop's programming platforms, in which the BIG DATA is tailored to the business functionality and requirements. When the dynamic schemas are normalized at design level, this paves a better way for generating normalized Big Test Data.

c) Measure the 4 Vs: The last and most important aspect of BIG DATA warehouse testing is to measure and monitor the 4 Vs. Veracity is ensured by the cleaning and normalization of the data; the variety and volume of the BIG DATA are examined through scalability testing; and velocity is the measure of the rate of change in the BIG DATA warehouse. The data warehouse test environments are designed to handle these four Vs with utmost priority.

2.2 Big Test in Performance Testing

Performance testing has been an essential and integral part of system testing; it deals with volumes, workload (users and transactions), real-time scenarios and the navigation/behavioural patterns of end users. The performance of a system depends on various factors, such as web servers, database servers, hosting servers, networks,
hardware, the number of peak loads, prolonged workloads, etc. How can these factors be addressed for Big Test Data from the perspective of a system's performance?

Fig. 4. Big Test Data for Performance Testing
(Diagram: virtual users in the cloud running parallel Big Data executions in groups; distributed tester machines with controllers and VuGen; web/application servers; a Big Data storage layer.)

Three things need consideration:

1. Distributed and parallel: Workload distribution testing should be conducted in parallel in a distributed environment. Since this is the basic concept of BIG DATA storage and computation, testing it in BIG terms is supposed to follow the same strategy. Since users' transaction pattern sets are also large and voluminous, and proportional to the velocity of data flow-in, the recorded scripts are likewise distributed among the controllers to simulate a real-time environment.

2. The performance test strategies for load, stress and endurance tests depend highly on the scenario set for the controller. The spreadsheets and backend databases that hold our test data today are comparatively less capable of holding unstructured BIG DATA. To execute these scenarios, the controller should have an interface to integrate with the already distributed BTD. Hence the challenge.
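The distributed, parallel execution of recorded transactions can be approximated with a small Python sketch; the worker pool here stands in for the distributed controllers, and the transaction body is only a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

def run_transaction(user_id):
    # Placeholder for one recorded user transaction script.
    return f"user-{user_id}: ok"

def run_workload(num_users, num_workers=4):
    # Fan the simulated users out over a pool of parallel workers,
    # mirroring how scripts are distributed among controllers.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(run_transaction, range(num_users)))

results = run_workload(100)
print(len(results), results[0])   # 100 user-0: ok
```

In a real setup each worker would be a controller machine driving many virtual users against the servers, rather than an in-process thread.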
3. The distributed VuGen instances, controllers and monitors execute the test scenarios, interacting with the Big Test Data in the storage layer and with the virtual users distributed in the cloud. The test executions by the virtual users run in parallel, which offers a better way to handle the big test execution. When the execution of these big scenarios with Big Test Data succeeds, the test results, in terms of reports, charts and graphs, are again too big. The real challenge lies in interpreting the results, identifying the bottlenecks and finding the areas that require performance tuning. Whether the performance testing tools we have with us now can meet the challenges of Big Test Data for performance testing is quite uncertain. As tool vendors and Big Data non-functional test teams are still in the infant stages, a lot more R&D needs to be done in these areas. However, a practical rollout model for a Big Test environment for non-functional testing is not too far away.

3. BIG TEST DATA MANAGEMENT

With these two types of tests for Big Data, DWT and PT, one can understand how complicated it is to manage Big Test Data (BTD) in a dynamic, data-flooded environment. At this infant stage, having real-time BTD at hand for DWT or PT is impossible due to the significant sensitivity of BIG DATA. But how can BTD be obtained and managed during automated testing processes? How can the acquisition and management of BTD be foreseen in the initial phases and during the execution phases of the test life cycle? A few thoughts on these:

a) Planning & design: When dealing with BTD, planning and designing a test environment and strategy have to be prioritized. Automated tests conventionally involve recording and playback. However, refining and customizing the recorded scripts requires technical expertise, and the biggest bottleneck of using scripts is that they cannot be scaled up to test big data [4].
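The earlier challenge of interpreting voluminous performance results can be sketched as a simple aggregation in Python: group raw response-time samples by transaction and flag the slowest ones as candidate bottlenecks. The sample data and threshold below are invented for illustration.

```python
from statistics import mean

def find_bottlenecks(samples, threshold_ms):
    # Group response-time samples (milliseconds) by transaction name.
    by_txn = {}
    for txn, ms in samples:
        by_txn.setdefault(txn, []).append(ms)
    # Flag transactions whose average time exceeds the threshold.
    return sorted(t for t, times in by_txn.items()
                  if mean(times) > threshold_ms)

samples = [("login", 120), ("search", 950), ("login", 180), ("search", 1100)]
print(find_bottlenecks(samples, threshold_ms=500))   # ['search']
```

At Big Test Data scale the grouping step itself would be a distributed job, but the interpretation logic stays the same: reduce raw measurements to a short list of tuning candidates.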
Scaling up Big Test Data sets without proper planning and design will lead to delayed response times, which might result in timed-out test executions. To resolve this scaling-up issue with BTD, action-based testing (ABT) has been proposed [4]. In ABT, tests are treated as actions in a test module. Each action maps to a keyword along with the parameters required for executing the test. Ensuring that the test modules are unambiguous and unique keeps the actions well managed and non-redundant. This approach is still at an infant level, and proofs of concept need to be done in a BTD environment.

b) Infrastructure setup: This is unique to each project and company. However, what is needed is a generalized yet tailorable infrastructure framework. Since test automation consumes large resources in generating workloads, dedicated servers and machines are allotted to individual test cycles. Virtual parallelism supports the parallel execution of test scenarios for each test cycle on different virtual machines [4]. In this way, the generation of the higher workloads needed for performance testing of BIG DATA can be handled effectively. However, investing in such servers is quite costly for not-so-big companies that deal with big data. One universal solution for renting infrastructure is IaaS offered through the Cloud. Requesting an allocation of big test data and running large test
executions in the Cloud is an optimum way of making BIG DATA testing effective and efficient.

c) Manage: Effective test automation and efficient test execution are two equally significant facets of Big Test Data management. Due to the extreme dynamism and heterogeneity of Big Data, BTD setup is a challenging task with respect to test coverage, accuracy and the types of big test data. The roles of test architects and leads will be crucial in setting up an environment for effective test automation. Tool selection, monitor installation, metrics collection and report generation are factors that support test execution. With the infrastructure setup discussed above, carrying out the test execution with an appropriate tool has to be the focus of the Big Data test strategy. Metrics collected during execution are then reported in the form of charts and voluminous test results. Interpreting these results is again going to be a challenge for the testing team. Managing the entire Big Data test life cycle is more challenging still, and involves research into yet unexplored areas.

4. NOT A DISTANT CHIMERA - CONCLUSION

Big Data has the potential to revolutionize not just research, but also implementation, practice and learning. Testing teams are in no way excluded from handling big data. Though there are frameworks like Hadoop, NoSQL stores and new programming platforms to handle Big Data in development, testers are having a hard time finding optimized solutions, tools and frameworks to test the BIG DATA. Testing processes, the customized test frameworks followed and the testing tools used in various specialized kinds of testing will need a major revision when dealing with BIG DATA, and that day is not a distant hallucination. The challenges and discussions presented in this paper are limited to current literature studies; even wider potential risks and challenges will appear as we start working with Big Test Data. What was once called Garbage Data is today termed Big Data.
Nothing is wasted; nothing is deleted or removed. Everything is important for the business, for decision making and for the future of the organization. The future is not far; it is tomorrow.

References
1. Anne & Peter et al., "Understanding System and Architecture of Big Data", IBM Research Technical Report, April 2012.
2. "SDSS-III: Massive Spectroscopic Surveys of the Distant Universe, the Milky Way Galaxy, and Extra-Solar Planetary Systems", Jan. 2008. Available at http://www.sdss3.org/collaboration/description.pdf
3. Ravishankar Krishnan, "QA Strategy for Large Data Warehouse", Report, Intellisys Technology.
4. "Making Big Testing a Big Success", LogiGear Magazine. Available at http://www.logigear.com/magazine/test-process-improvement/making-big-testing-a-bigsuccess/