Assurance White Paper

Intelligent Analytics Assurance for Big Data Systems: A framework for real-time Big Data Testing
About the Authors

Debasish Das
With more than a decade in the IT industry, Debasish Das currently heads the Assurance Cluster Center of Excellence (CoE) for the Energy, Utilities and Media verticals. Debasish has been instrumental in setting up Global Test Labs and testing CoEs for many customers. He brings rich hands-on expertise in test automation and performance testing of enterprise systems, including Big Data.

Debasis Rout
Debasis Rout has worked extensively for three years on complex testing projects and has been instrumental in creating unique automation frameworks in the Assurance arena. An avid researcher, Debasis has successfully completed numerous proofs-of-concept using unique tools and techniques in the automation of analytics and Big Data systems. Debasis is currently part of the Big Data Testing initiative within TCS' Assurance Services Unit.

Padmalaya Pradhani
Padmalaya Pradhani, who has been with TCS for the past six years, specializes in Java development for open-source ecommerce- and SOA-based applications. Padmalaya has also worked extensively as a Business Intelligence expert and has patents filed in India and the US. She currently works with the Cluster CoE team of the Assurance Services Unit (ASU) and leads the unit's Big Data Testing initiative for multiple customers.
Abstract

Enterprises determined to retain their leadership positions in tomorrow's world are moving marketing analytics from the traditional to the Big Data paradigm. But Big Data systems with poorly integrated business intelligence frameworks, weak data-scrubbing rules, or non-real-time data processing are unlikely to be fail-safe, with cost-prohibitive consequences. Big Data errors can also surface repeatedly unless Assurance teams implement foolproof data validation. This paper proposes an approach for Big Data frameworks that avoids these pitfalls: one that delivers ROI on analytics, prevents technological obsolescence, minimizes legacy maintenance, and enables swift ramp-ups.
Contents

Introduction
Why Do We Need Big Data Analytics?
The Pitfalls of Existing Big Data Solutions
Features of a Holistic Enterprise Assurance Framework
Conclusion
Introduction

Credit card companies routinely maintain every byte of their customers' data, including the details of each card swipe. This is Big Data: 51 percent of it is structured, while 49 percent is unstructured or semi-structured.¹ It does not lend itself to traditional data analytics, which is effective only on structured data. The behavioral analytics such enterprises carry out on this data helps them understand their customers' spending and credit patterns and provide improved customer service. Thus, thanks to the power of Big Data analytics, you can expect to receive promotional offers of, say, tickets for an unreleased movie whose trailer you recently liked or shared on social media.

Big Data has ushered in a veritable universe of data entropy. Big Data analytics enables marketers to draw intelligent inferences from analyses of digital data that often manifest little correlation to the untrained eye. Experts use Big Data analytics to directly translate information into insight, improving decision-making and business performance. Big Data analytics has significant, measurable positive business impact and benefits.

Slip-ups in Big Data analytics can also happen, with expensive consequences. Consider the following example. One morning, a credit card company's client executive calls one of its customers to verify a recent card transaction - the purchase of a smartphone - which, according to the customer's prior spending behavior, is not a usual pattern. Nevertheless, the customer confirms her purchase, and all is well. The company's intelligence, powered by Big Data, helped identify the unusual shopping pattern. However, the executive also notices other recent transactions that do not seem to follow the customer's standard purchase history. A few minutes into the call, the customer is shocked to learn that multiple gambling transactions were made in Las Vegas several days before the smartphone purchase.
So while Big Data helped to identify one unusual shopping pattern, it failed to notice another. It was a case of Big Data testing gone wrong.

Why Do We Need Big Data Analytics?

Handling Big Data volumes has been a key challenge, as multiple zettabytes of data, if not more, reside in today's digital universe. Enterprises have already invested heavily in business intelligence (BI) frameworks and are attempting to build Big Data solutions over them. But in developing such frameworks, they are baffled by the increasing demand for faster data processing, which is essential for achieving meaningful analytics outcomes.

However, computer scientists are making breakthroughs in high-speed, high-volume data processing. A leading search engine defines algorithms that break search problems into sub-problems, distribute the corresponding computational workloads across thousands of nodes, and process the results further using SQL or SQL-based BI tools. The biggest challenge at this juncture is achieving real-time data processing that correlates data structures within the existing framework to provide meaningful analyses. Thus, while businesses agree that they must process high-volume transaction data-sets intelligently, they have numerous questions regarding how they should navigate that endless ocean.

¹ Tata Consultancy Services, The Emerging Returns on Big Data: A TCS 2013 Global Trend Study, 2013 (accessed May 7, 2015).
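The divide-and-conquer pattern described above - splitting a problem into sub-problems, farming each partition out to a node, and merging the partial results - can be sketched in miniature. The following is an illustrative Python sketch, not any vendor's actual implementation; in a real cluster each map step would run on a separate node, whereas here the partitions are processed sequentially:

```python
from collections import Counter

def map_count(chunk):
    """Map step: count term occurrences within one partition (one 'node')."""
    return Counter(term for record in chunk for term in record.split())

def reduce_counts(partials):
    """Reduce step: merge the per-partition counts into one global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

records = [
    "credit card swipe",
    "card swipe declined",
    "credit limit query",
]
# Split the workload into sub-problems; a real cluster would run each
# map_count call on a different node.
chunks = [records[0:2], records[2:3]]
totals = reduce_counts(map_count(c) for c in chunks)
print(totals["card"])  # partial counts from each chunk merge into 2
```

The merged `totals` can then feed a SQL or BI layer, which is exactly the integration point the paper identifies as hard to assure.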
The Pitfalls of Existing Big Data Solutions

The three major challenge areas associated with testing Big Data solutions are:

a) Seamless integration. Integrating an existing BI framework with Big Data components is a challenge because the search algorithm in use may not be in a format that data warehousing standards can validate. This leaves the Big Data system running a major risk: the BI ecosystems concerned may integrate poorly with the results the algorithm has processed. Consequently, testing professionals cannot assure overall system quality, because the BI and Big Data systems remain operationally disparate even though they are theoretically interdependent.

b) Data standards and data clean-up. Big Data systems lack precise rules that govern the data standards for filtering out or scrubbing bad data. The data for Big Data analytics comes from hitherto unconventional sources - devices such as sensors, and social media inputs such as tweets. However, the frameworks currently available in the market are not as agnostic as they should be to accommodate these data points for analysis.

c) Real-time data processing. Big Data solutions also need accuracy in predicting high-density data volumes and making them BI-friendly. But Big Data environments lack compatible adapters that enable real-time data processing in clusters or nodes; the frameworks that do enable Big Data-set processing in smaller clusters or nodes work in batches, not in real time.

Features of a Holistic Enterprise Assurance Framework

Given these challenges, adopting an appropriate quality assurance (QA) strategy ensures that the Big Data system's predictions, based on customer behavioral analysis, are valid. This, in turn, demands that a holistic QA strategy become a quintessential aspect of the total solution.
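The data clean-up pitfall (b) can be made concrete: a scrubbing rule set is, at its simplest, a list of per-field predicates that every incoming record must satisfy. The sketch below is illustrative only - the field names (`meter_id`, `reading_kwh`) are hypothetical smart-meter attributes, not a standard schema:

```python
def is_valid(record):
    """Apply hypothetical scrubbing rules to one smart-meter record."""
    rules = [
        record.get("meter_id") not in (None, ""),              # identity present
        isinstance(record.get("reading_kwh"), (int, float)),   # numeric reading
        record.get("reading_kwh", -1) >= 0,                    # no negative usage
    ]
    return all(rules)

def scrub(records):
    """Partition incoming records into clean data and a reject pile."""
    clean = [r for r in records if is_valid(r)]
    rejected = [r for r in records if not is_valid(r)]
    return clean, rejected

batch = [
    {"meter_id": "M-001", "reading_kwh": 12.4},
    {"meter_id": "", "reading_kwh": 3.1},      # missing identity -> reject
    {"meter_id": "M-002", "reading_kwh": -5},  # negative usage -> reject
]
clean, rejected = scrub(batch)
print(len(clean), len(rejected))  # 1 2
```

Keeping the reject pile, rather than silently dropping bad records, is what lets an Assurance team audit exactly which rule each record failed.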
The following seven features should be considered for an ideal end-to-end Big Data Assurance framework (Fig. 1), as described below.

[Fig. 1: A Proposed Big Data Assurance Framework - a cycle of seven features: (1) data validation, (2) initial data processing, (3) processing logic validation, (4) data processing speed testing, (5) distributed file system validation, (6) job profiling validation, and (7) field-to-field map testing]
1) Data validation. Big Data enterprise ecosystems should have the capability to read data from any source, of any size, and at any speed. The framework must validate the data drawn from a wide, unconstrained range of sources: structured data such as spreadsheets and text files, and unstructured data such as audio, images, video, Web content, and GIS data, among others. For instance, the utilities industry generates data through wattmeters and thermostats. This data is unlike that which computers or mobile phones generate; it creates a universe of data from different sources. Both the data and the devices vary. Thus, the data does not conform to the data standards that apply to traditional data sources.

2) Initial data processing. Big Data systems can process data in chunks and store it in nodes. The framework should therefore validate the initial unstructured-data processing stage and use an appropriate processing script, since the processed chunks of large data clusters are stored in nodes and used for data analysis and communication. In contrast to traditional data warehousing systems (which batch-process data and manifest latency in reflecting the latest processed data), the proposed framework would help analyze data clusters in real time.

3) Processing logic validation. The processing logic defined in the query language should be able to tidy up and organize the mammoth unstructured data into clusters. The enterprise data warehouse (EDW) should then ingest the resulting data and validate it, using the traditional Extract-Transform-Load testing approach together with the business logic of the key performance indicators used in report analysis. The framework must validate the incoming records against the outgoing data, inspecting both the number and the hygiene of the records transferred. This aspect becomes especially crucial in banking transactions, where every single transaction yields data that, if ignored, can lead to disastrous results.
Hence, it is imperative to keep count of incoming and processed records so that one may establish the integrity of the data and prevent data loss.

4) Data processing speed testing. To automate testing and identify bad data, the Big Data Assurance framework should integrate seamlessly with tools like ETL Validator and QuerySurge, among others. Such tools validate data from source files and other databases, compare it with the target data, and swiftly generate reports after analysis, pinpointing differences, if any. The framework should also enable high-speed bulk data processing, because speed is crucial in real-time digital data analysis. Several open-source tools, such as Munin and Nagios, integrate well with Big Data frameworks for real-time performance monitoring of nodes and clusters. For example, in the oil and gas industry, real-time oil well health monitoring is critical to optimizing operations and avoiding hazardous incidents. Sensors continuously collect data on hundreds of well attributes - viscosity, pressure, temperature, and contamination levels, to name a few - along with seismic information on each well, which is fed to control stations for real-time processing. In such scenarios, the speed of testing is vital in assuring the smooth operation and safety compliance of the wells; any delay in the alarms that real-time well health data processing triggers could have catastrophic results.

5) Distributed file system validation. For Big Data-based frameworks, which process data in distributed clusters, validation is key. Moreover, because the failure of a single node in a cluster prevents full functional testing, performance testing is imperative in such frameworks: it enables setting benchmarks for the data that a cluster environment supports.
This validation feature can help the Big Data Assurance framework determine data flow parameters such as velocity, volume of data processed per node, and variety of data amalgamated in each node.
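The record-count and hygiene checks called for in features 3 and 4 - comparing source extracts with target data and pinpointing differences - reduce to a reconciliation report. The following is a minimal Python sketch of that idea, not the behavior of any specific tool such as ETL Validator or QuerySurge; the `txn_id` key field is an illustrative assumption:

```python
def reconcile(source_rows, target_rows, key):
    """Compare source and target extracts: counts must match, and no
    record may be lost or altered in flight."""
    report = {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "missing": [],     # in source but absent from target
        "mismatched": [],  # present in both, but with differing fields
    }
    target_by_key = {row[key]: row for row in target_rows}
    for row in source_rows:
        match = target_by_key.get(row[key])
        if match is None:
            report["missing"].append(row[key])
        elif match != row:
            report["mismatched"].append(row[key])
    return report

source = [{"txn_id": 1, "amount": 120.0}, {"txn_id": 2, "amount": 75.5}]
target = [{"txn_id": 1, "amount": 120.0}]
report = reconcile(source, target, key="txn_id")
print(report["missing"])  # [2] -> one transaction lost in processing
```

A non-empty `missing` or `mismatched` list is precisely the data-loss signal that, in the banking example above, must halt the pipeline rather than pass silently.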
6) Job profiling validation. In Big Data frameworks, job profiling is essential before processing unstructured data with the production algorithms, because missing even one algorithm may cause massive adverse business impact. The framework must therefore allow the use of open-source tools like Jumbune, which support optimization through fine-grained node-level analytics (to identify bottlenecks) and cluster-level resource consumption dashboards. For instance, a MapReduce job may be too large for the Hadoop cluster to process, and errors in the MapReduce algorithm raise the chances of job failure; a faulty job can also consume excessive memory, destabilizing the Hadoop cluster. Validating the MapReduce algorithm first on a small data chunk - job profiling - therefore becomes imperative.

7) Field-to-field map testing. Migrating from legacy to Big Data databases involves massive architectural makeovers. These call for solution-specific integration testing, of which validating field-to-field mappings and schemas is an integral part. Done well, such mapping ensures no or minimal impact on critical data during migration.

Conclusion

For businesses to take full advantage of the potential of Big Data, formulating an effective validation strategy is key. The success of a business is directly proportional to the effectiveness with which an independent testing team can validate structured and unstructured data of large volume, wide variety and high velocity. The assurance framework we propose will help organizations predict ROI on their Big Data analytics. Securing their investments in this manner will enable enterprises to stay up to date on their Big Data architecture, while maintaining their traditional information architecture.
The solution not only helps organizations build a validation framework over their existing data warehouse but also empowers them to ramp up and adopt new Big Data approaches for faster real-time - and, thus, more meaningful - data analysis.
About TCS' Assurance Services Unit

With one of the most comprehensive portfolios of independent test capabilities on offer, TCS addresses both business and quality challenges for its global clients. We empower organizations across domains to optimize overheads, realize first-mover advantage and improve customer satisfaction. TCS offers assurance services across the testing value cycle, including test consulting and advisory, test services implementation, and managed services for test environment and test data management. We continually redefine testing and QA paradigms to help our clients stay ahead of the curve. Our library of domain-based reusable business functions and a proven engagement model founded on the twin pillars of product and process quality enable us to deliver certainty to our clients. Over 28,000 testing consultants, strategic alliances and partnerships with key product vendors, more than 60 dedicated test centers of excellence and our innovation labs power our tailor-made solutions, testing assets and accelerators. With specialized test environments and labs, TCS drives the delivery of assurance in a non-disruptive, agile, and automated manner, making the entire development lifecycle more efficient.

Contact

For more information about TCS' Assurance Services Unit, visit:
Blog: #ThinkAssurance

About Tata Consultancy Services (TCS)

Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India's largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India.
For more information, visit us at

IT Services | Business Solutions | Consulting

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content / information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties. Copyright © 2015 Tata Consultancy Services Limited