Intelligent Analytics Assurance for Big Data Systems




Assurance White Paper
A Framework for Real-Time Big Data Testing

About the Authors

Debasish Das
With more than a decade in the IT industry, Debasish Das currently heads the Assurance Cluster Center of Excellence (CoE) for the Energy, Utilities and Media verticals. Debasish has been instrumental in setting up Global Test Labs and testing CoEs for many customers. He brings rich hands-on expertise in test automation and performance testing of enterprise systems, including Big Data.

Debasis Rout
Debasis Rout has worked extensively for three years on complex testing projects and has been instrumental in creating unique automation frameworks in the Assurance arena. An avid researcher, Debasis has successfully completed numerous proofs-of-concept using unique tools and techniques in the automation of analytics and Big Data systems. Debasis is currently part of the Big Data Testing initiative within TCS' Assurance Services Unit.

Padmalaya Pradhani
Padmalaya Pradhani, who has been with TCS for the past six years, specializes in Java development for open-source e-commerce and SOA-based applications. Padmalaya has also worked extensively as a Business Intelligence expert. She has patents filed in India and the US. Padmalaya currently works with the Cluster CoE team of the Assurance Services Unit (ASU) and leads the unit's Big Data Testing initiative for multiple customers.

Abstract
Enterprises determined to retain their leadership positions in tomorrow's world are moving marketing analytics from the traditional to the Big Data paradigm. But Big Data systems with poorly integrated business intelligence frameworks, weak data-scrubbing rules, or non-real-time data processing are unlikely to be fail-safe, and their failures carry cost-prohibitive consequences. Big Data errors can also surface repeatedly unless Assurance teams implement foolproof data validation. This paper proposes an assurance approach for Big Data frameworks that avoids these pitfalls: it delivers ROI on analytics, prevents technological obsolescence, minimizes legacy maintenance, and enables swift ramp-ups.

Contents
Introduction
Why Do We Need Big Data Analytics?
The Pitfalls of Existing Big Data Solutions
Features of a Holistic Enterprise Assurance Framework
Conclusion

Introduction
Credit card companies routinely maintain every byte of their customers' data, including the details of each card swipe. This is Big Data: 51 percent of it is structured, while 49 percent is unstructured or semi-structured.¹ It does not lend itself to traditional data analytics, which is effective only on structured data. The behavioral analytics such enterprises carry out on this data helps them understand their customers' spending and credit patterns and provide improved customer service. Thus, thanks to the power of Big Data analytics, you can expect to receive promotional offers of, say, tickets for an unreleased movie whose trailer you recently liked or shared on social media.

Big Data has ushered in a veritable universe of data entropy. Big Data analytics enables marketers to draw intelligent inferences from analyses of digital data that often manifest little correlation to the untrained eye. Experts use Big Data analytics to translate information directly into insight, improving decision-making and business performance. Big Data analytics has a significant, measurable, positive business impact.

Slip-ups in Big Data analytics can also happen, with expensive consequences. Consider the following example. One morning, a credit card company's client executive calls one of its customers to verify a recent card transaction, the purchase of a smartphone, which does not fit the customer's prior spending behavior. The customer confirms her purchase, and all is well: the company's intelligence, powered by Big Data, identified the unusual shopping pattern. However, the executive also notices other recent transactions that do not follow the customer's standard purchase history. A few minutes into the call, the customer is shocked to discover multiple gambling transactions made in Las Vegas several days before the smartphone purchase. So while Big Data helped identify one unusual shopping pattern, it failed to notice another. It was a case of Big Data testing gone wrong.

Why Do We Need Big Data Analytics?
Handling Big Data volumes has been a key challenge, as multiple zettabytes of data, if not more, reside in today's digital universe. Enterprises have already invested heavily in business intelligence (BI) frameworks and are attempting to build Big Data solutions on top of them. But they are baffled by the increasing demand for faster data processing, which is essential for achieving meaningful analytics outcomes with such frameworks. Computer scientists, however, are making breakthroughs in high-speed, high-volume data processing. A leading search engine defines algorithms that break search problems into sub-problems, distribute the corresponding computational workloads across thousands of nodes, and process the results further using SQL or SQL-based BI tools. The biggest challenge at this juncture is achieving real-time data processing that correlates data structures within the existing framework to provide meaningful analyses. Thus, while businesses agree that they must process high-volume transaction data-sets intelligently, they have numerous questions about how to navigate that endless ocean.

[1] Tata Consultancy Services, The Emerging Returns on Big Data: A TCS 2013 Global Trend Study, 2013, accessed May 7, 2015, http://sites.tcs.com/big-data-study/big-data-study-key-findings/
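To make the divide-and-distribute pattern described above concrete, the following minimal Python sketch splits a term-counting problem into sub-problems, processes each in a separate worker, and merges the partial results. It is our own illustration under stated assumptions, not any search vendor's actual algorithm; all function names are ours, and a local process pool stands in for cluster nodes.

# A minimal sketch of the divide-and-distribute pattern: split a large
# aggregation problem into sub-problems, process them independently,
# and merge the partial results.
from collections import Counter
from multiprocessing import Pool

def count_terms(chunk):
    """Map step: count search terms in one sub-problem (a chunk of records)."""
    return Counter(term for record in chunk for term in record.split())

def split_into_chunks(records, n_chunks):
    """Divide the workload into roughly equal sub-problems."""
    size = max(1, len(records) // n_chunks)
    return [records[i:i + size] for i in range(0, len(records), size)]

if __name__ == "__main__":
    records = ["big data analytics", "data testing", "big data testing"] * 1000
    with Pool(processes=4) as pool:          # stand-in for cluster nodes
        partials = pool.map(count_terms, split_into_chunks(records, 4))
    totals = sum(partials, Counter())        # reduce step: merge partial counts
    print(totals.most_common(3))

MapReduce-style frameworks execute this same split-process-merge shape at cluster scale, with the added concerns of data locality and fault tolerance.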

The Pitfalls of Existing Big Data Solutions
The three major challenge areas associated with testing Big Data solutions are:

a) Seamless integration. Integrating an existing BI framework with Big Data components is a challenge because the search algorithm in use may not produce output in a format that data warehousing standards can validate. The Big Data system therefore runs a major risk: the BI ecosystems concerned may integrate poorly with the results the algorithm has processed. Consequently, testing professionals cannot assure overall system quality, because the BI and Big Data systems remain operationally disparate even though they are theoretically interdependent.

b) Data standards and data clean-up. Big Data systems lack precise rules governing the data standards for filtering out or scrubbing bad data. The data for Big Data analytics comes from sources hitherto unheard of, including unconventional ones such as sensor-equipped devices and social media inputs like tweets. However, the frameworks currently available in the market are not as source-agnostic as they should be to accommodate these data points for analysis. (A minimal sketch of such explicit scrubbing rules follows this section.)

c) Real-time data processing. Big Data solutions also need accuracy in predicting high-density data volumes and making them BI-friendly. But Big Data environments lack compatible adapters that enable real-time data processing in clusters or nodes. The frameworks that do enable Big Data-set processing in smaller clusters or nodes do so in batches, not in real time.

Features of a Holistic Enterprise Assurance Framework
Given these challenges, adopting an appropriate quality assurance (QA) strategy ensures that the Big Data system's predictions, based on customer behavioral analysis, are valid. This, in turn, demands that a holistic QA strategy become a quintessential aspect of the total solution. An ideal end-to-end Big Data Assurance Framework should provide the following seven features (Fig. 1), described below.

Fig. 1: A Proposed Big Data Assurance Framework, comprising seven features: (1) data validation, (2) initial data processing, (3) processing logic validation, (4) data processing speed testing, (5) distributed file system validation, (6) job profiling validation, and (7) field-to-field map testing.
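Pitfall (b) above turns on the absence of precise, explicit scrubbing rules. The following minimal Python sketch shows one way such rules could be made explicit and testable; the rule names and record fields are illustrative assumptions, not drawn from any specific framework. Records failing any rule are quarantined for inspection rather than silently flowing into analytics.

# A minimal sketch of explicit, named data-scrubbing rules: each rule is a
# predicate, and a record failing any rule is quarantined with the list of
# rules it violated.
SCRUB_RULES = {
    "has_customer_id": lambda r: bool(r.get("customer_id")),
    "amount_is_numeric": lambda r: isinstance(r.get("amount"), (int, float)),
    "amount_in_range": lambda r: isinstance(r.get("amount"), (int, float))
                                 and 0 <= r["amount"] < 1_000_000,
}

def scrub(records):
    clean, quarantined = [], []
    for record in records:
        failed = [name for name, rule in SCRUB_RULES.items() if not rule(record)]
        if failed:
            quarantined.append((record, failed))   # keep the audit trail
        else:
            clean.append(record)
    return clean, quarantined

records = [
    {"customer_id": "C1", "amount": 42.5},
    {"customer_id": "", "amount": "n/a"},          # fails all three rules
]
clean, bad = scrub(records)
print(len(clean), "clean,", len(bad), "quarantined:", bad[0][1])

Making the rules a named, enumerable table is the point: an assurance team can test each rule in isolation and audit exactly why any record was rejected.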

1) Data validation. Big Data enterprise ecosystems should be able to read data from any source, of any size, at any speed. The framework must validate data drawn from a wide, unconstrained range of sources: structured data such as spreadsheets and text files, and unstructured data in the form of audio, images, video, Web content, and GIS feeds, among others. For instance, the utilities industry generates data through wattmeters and thermostats. This data is unlike that generated by computers or mobile phones; it creates a universe of data from different sources. Both the data and the devices vary, so the data does not conform to the standards that apply to traditional data sources.

2) Initial data processing. Big Data systems can process data in chunks and store it in nodes. The framework should therefore validate the initial unstructured-data processing stage and use an appropriate processing script, since each processed chunk of a large data cluster is stored in nodes and used for data analysis and communication. In contrast to traditional data warehousing systems (which batch-process data and exhibit latency in reflecting the latest processed data), the proposed framework would help analyze data clusters in real time.

3) Processing logic validation. The processing logic defined in the query language should be able to tidy up and organize the mammoth unstructured data into clusters. The enterprise data warehouse (EDW) should then ingest the resulting data and validate it, using the traditional Extract-Transform-Load testing approach along with the business logic of the key performance indicators used in report analysis. The framework must reconcile incoming records with outgoing data, inspecting both the number and the hygiene of the records transferred (see the reconciliation sketch after this list). This becomes crucial in banking, where every single transaction yields data that, if ignored, can lead to disastrous results. Hence, it is imperative to keep count of incoming and processed records in order to establish data integrity and prevent data loss.

4) Data processing speed testing. To automate testing and identify bad data, the Big Data Assurance framework should integrate seamlessly with tools such as ETL Validator and QuerySurge. Such tools validate data from source files and other databases, compare it with the target data, and swiftly generate reports pinpointing any differences. The framework should also enable high-speed bulk data processing, because speed is crucial in real-time digital data analysis. Several open-source tools, such as Munin and Nagios, integrate well with Big Data frameworks for real-time performance monitoring of nodes and clusters. For example, in the oil and gas industry, real-time monitoring of oil well health is critical to optimizing operations and avoiding hazardous incidents. Sensors continuously collect data on hundreds of well attributes (viscosity, pressure, temperature, and contamination levels, to name a few) as well as seismic information on each well, which is fed to control stations for real-time processing. In such scenarios, the speed of testing is vital to assuring the wells' smooth operation and safety compliance; any delay in the alarms triggered by real-time well-health data processing could have catastrophic results.

5) Distributed file system validation. For Big Data frameworks, which process data in distributed clusters, validation is key. Moreover, because the failure of a single node in a cluster prevents full functional testing, performance testing is imperative in such frameworks: it enables setting benchmarks for the data a cluster environment supports. This validation feature can help the Big Data Assurance framework determine data flow parameters such as velocity, volume of data processed per node, and the variety of data amalgamated in each node.
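The following minimal Python sketch illustrates the incoming-versus-processed reconciliation described in feature 3. The field names and checksum scheme are illustrative assumptions, not a specific tool's API: record counts and an order-independent fingerprint of key fields are compared between source and target to detect loss or corruption.

# A minimal sketch of source-to-target record reconciliation: compare
# record counts and an order-independent content checksum.
import hashlib

def record_fingerprint(record, key_fields=("txn_id", "amount")):
    payload = "|".join(str(record[f]) for f in key_fields)
    return int(hashlib.sha256(payload.encode()).hexdigest(), 16)

def reconcile(source_records, target_records):
    count_ok = len(source_records) == len(target_records)
    # XOR of per-record hashes is order-independent, so resorting in the
    # target system does not raise a false alarm.
    src_sum = 0
    for r in source_records:
        src_sum ^= record_fingerprint(r)
    tgt_sum = 0
    for r in target_records:
        tgt_sum ^= record_fingerprint(r)
    return {"count_match": count_ok, "content_match": src_sum == tgt_sum}

source = [{"txn_id": 1, "amount": 100}, {"txn_id": 2, "amount": 250}]
target = [{"txn_id": 2, "amount": 250}, {"txn_id": 1, "amount": 100}]
print(reconcile(source, target))   # {'count_match': True, 'content_match': True}

XOR-combining the per-record hashes means a target that stores records in a different order passes cleanly, while any dropped, duplicated, or altered record trips the check.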

6) Job profiling validation. In Big Data frameworks, job profiling is essential before running algorithms over unstructured data, because missing even one algorithm error can cause massive adverse business impact. The framework must therefore allow the use of open-source tools like Jumbune, which provide fine-grained node-level analysis (to identify bottlenecks) and cluster-level resource consumption dashboards. For instance, a MapReduce job may be too large for the Hadoop cluster to process efficiently, and errors in MapReduce algorithms raise the chances of job failure; the memory consumed by the Hadoop system may then spike, destabilizing the cluster. Validating the MapReduce algorithm first on a small data chunk, which is what job profiling means here, therefore becomes imperative (a minimal sketch appears after the Conclusion).

7) Field-to-field map testing. Migrating from legacy databases to Big Data stores involves massive architectural makeovers. These call for solution-specific integration testing, of which validating field-to-field mappings and schemas is an integral part. Done well, such mapping ensures minimal or no impact on critical data during migration.

Conclusion
For businesses to take full advantage of the potential of Big Data, formulating an effective validation strategy is key. The success of a business is directly proportional to the effectiveness with which an independent testing team can validate structured and unstructured data of large volume, wide variety, and high velocity. The assurance framework we propose will help organizations predict the ROI on their Big Data analytics. Securing their investments in this manner will enable enterprises to keep their Big Data architecture up to date while maintaining their traditional information architecture. The solution not only helps organizations build a validation framework over their existing data warehouse, but also empowers them to ramp up and adopt new Big Data approaches for faster, real-time, and thus more meaningful, data analysis.
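To make the job-profiling step of feature 6 concrete, here is a minimal, framework-free Python sketch; it is our own illustration, not Jumbune's or Hadoop's API. The map and reduce functions are first exercised on a small sample chunk, and the timing and failure statistics from that dry run gate the full-scale job.

# A minimal sketch of job profiling: dry-run a MapReduce-style job on a
# small sample chunk before committing the full data set to the cluster.
import random
import time
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}

def profile_job(records, map_fn, reduce_fn, sample_size=100):
    """Dry-run the job on a small chunk and report whether it looks safe."""
    sample = random.sample(records, min(sample_size, len(records)))
    start = time.perf_counter()
    try:
        run_mapreduce(sample, map_fn, reduce_fn)
    except Exception as exc:                 # algorithm errors caught early
        return {"ok": False, "error": repr(exc)}
    elapsed = time.perf_counter() - start
    projected = elapsed * (len(records) / max(1, len(sample)))
    return {"ok": True, "sample_seconds": elapsed, "projected_seconds": projected}

records = [{"meter": f"M{i % 50}", "kwh": i % 7} for i in range(10_000)]
report = profile_job(records,
                     map_fn=lambda r: [(r["meter"], r["kwh"])],
                     reduce_fn=sum)
print(report)                                # submit the full job only if ok

Only if the dry run succeeds, and its projected runtime and resource cost are acceptable, would the full job be submitted to the cluster.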

About TCS' Assurance Services Unit
With one of the most comprehensive portfolios of independent test capabilities on offer, TCS addresses both business and quality challenges for its global clients. We empower organizations across domains to optimize overheads, realize first-mover advantage, and improve customer satisfaction. TCS offers assurance services across the testing value cycle, including test consulting and advisory, test services implementation, and managed services for test environment and test data management. We continually redefine testing and QA paradigms to help our clients stay ahead of the curve. Our library of domain-based reusable business functions and a proven engagement model founded on the twin pillars of product and process quality enable us to deliver certainty to our clients. Over 28,000 testing consultants, strategic alliances and partnerships with key product vendors, more than 60 dedicated test centers of excellence, and our innovation labs power our tailor-made solutions, testing assets, and accelerators. With specialized test environments and labs, TCS drives the delivery of assurance in a non-disruptive, agile, and automated manner, making the entire development lifecycle more efficient.

Contact
For more information about TCS' Assurance Services Unit, visit: http://www.tcs.com/assurance
Email: global.assurance@tcs.com
Blog: #ThinkAssurance

About Tata Consultancy Services (TCS)
Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India's largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India. For more information, visit us at www.tcs.com

IT Services | Business Solutions | Consulting

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content / information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties. Copyright © 2015 Tata Consultancy Services Limited