
Whitepaper: The Emerging Big Data System

Presented by: Digital Assurance Practice
Author: Nagarajan K R
Email Id: nagarajankr@hexaware.com

Table of Contents

Abstract
Typical Data Warehouse Architecture
The Emergence of Big Data
Hadoop Architecture & Components
The Emerging Data Warehouse
Testing Hadoop System
Hadoop System - Data Flow
Conclusion

Abstract

The purpose of this paper is to delve into the evolution of the data paradigm in this information age, where data has truly exploded into Terabytes and Petabytes. From an enterprise perspective, this is both a challenge and an opportunity to tap into new data sources, collectively referred to as Big Data. Existing data warehouse technologies cannot handle this data glut and will require adaptation; data warehousing and Big Data technologies therefore complement each other to derive maximum use from the hitherto untapped unstructured and semi-structured data generated endlessly by the Internet, sensor technologies and the like. We will look at the basic Big Data architecture around Hadoop, as well as testing approaches for a Hadoop system.

Typical Data Warehouse Architecture

In a typical DW architecture, data from diverse and disparate sources (PS, SAP, ERP, OLTP, MF, XLS) arrives at a central staging area for processing into a data warehouse. This data is cleansed and scrubbed, transformations are applied, and it is finally pushed into the data warehouse. The data warehouse then represents the single version of truth for the enterprise, including the wealth of historical data accumulated over several years. This data can be analyzed, sliced and diced across multiple dimensions for decision support, helping senior management take informed decisions.

Major ETL tools in the market: Informatica, Ab Initio, DataStage, Talend
Major BI tools in the market: Hyperion, Cognos, OBIEE, Business Objects

Though the data comes from different sources, it is mostly structured data produced by organized systems and platforms. The ETL tool's major function is to convert the data into a consistent format so that reports at the data warehouse end are consistent and accurate; for example, dates may arrive in different formats from different sources and must be normalized. The analytical tool's major function is to summarize and aggregate data across various dimensions. A pharmaceutical company, for instance, will want to know how its various products have performed across locations, compared with the competition, over a specific time period, and senior management may want to base certain R&D decisions on this; BI reports can be very helpful here, letting management drill into the details for more clarity on specific aspects of the business.

The data warehouse discussed here contains mostly transactional data in a structured format. Beyond a point, the historical data becomes a drag and hurts data warehouse performance, so it needs to be maintained for the warehouse to keep performing well.

Fig 1: Basic data warehouse architecture (Data Sources → Staging → Data Warehouse → Cubes → Analytical Tools → Reports, grouped into Data Processing and Decision Support)
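To make the date-normalization example concrete, below is a minimal Python sketch of the kind of format unification an ETL tool performs. The source and target formats here are assumptions chosen for illustration, not the configuration of any particular tool.

```python
from datetime import datetime

# Illustrative source formats; a real ETL tool maps each source system
# to its known format rather than guessing (these are assumptions).
SOURCE_FORMATS = ["%d/%m/%Y", "%m-%d-%Y", "%Y%m%d"]
TARGET_FORMAT = "%Y-%m-%d"  # the single consistent warehouse format

def normalize_date(raw: str) -> str:
    """Convert a date string from any known source format to the warehouse format."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue  # not this format; try the next known one
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_date("31/12/2013"))  # 2013-12-31
print(normalize_date("20131231"))    # 2013-12-31
```

In practice the format would be fixed per source rather than guessed, since a string like 01-02-2013 is ambiguous across formats.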

The Emergence of Big Data

With the explosion of the Internet, followed by the crash in prices of storage devices, it has become imperative for enterprises to tap into additional sources of data to make better and more focused decisions. But the challenges are multi-fold, in terms of the volume, variety and velocity of the data, referred to as the 3Vs. Welcome to the world of Big Data!

Nowadays IBM can get quick feedback on specific products from its vast portfolio directly from customers through Facebook and Twitter. Advancements in sensor technology have triggered a new wave of data that can help industries such as retail and logistics. Web site and system-generated log files from various applications in an enterprise can be analyzed for error patterns, user preferences, traffic and conversion ratios (how many people finally bought something from the web site versus how many just browsed). Credit card and insurance companies can use Big Data to identify fraud: for example, if a credit card is swiped in two places far apart within a short time, the system can flag the transaction as suspect and alert the stakeholders for action. An enterprise can analyze email patterns for possible abuse, or to gauge how efficiently its sales personnel close deals.

But these data formats are unstructured and huge in terms of storage, running into Terabytes (TB). So the complexity lies in the data format, its size and the speed at which it arrives. Making sense of this data also presents a huge opportunity, given the wealth of information it carries. This calls for newer data-processing technologies, as traditional systems cannot accommodate these formats. Some unstructured / semi-structured data sources: log files, clickstreams, sensor data, documents, emails and location data.

Big Data Technologies: Hadoop Architecture & Components

Apache Hadoop is a scalable, fault-tolerant system for data storage and processing. It is economical and reliable, which makes it well suited to running data-intensive applications on commodity hardware. It contains the following components:

HDFS: The Hadoop Distributed File System (HDFS) is the storage system, responsible for replicating and distributing data throughout the computing cluster
MapReduce: The software framework developers use to write applications that process the data stored in HDFS (a sketch of the programming model follows below)
Hue: Sits on top of Hadoop and provides an interface to the various Hadoop components
ZooKeeper: Responsible for coordinating configuration data and process synchronization
Database / data warehouse for Hadoop: HBase is the database and Hive the data warehouse commonly used with Hadoop
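To illustrate the MapReduce model named above, here is a minimal word-count job written in the Hadoop Streaming style, where the mapper and reducer are ordinary scripts that read stdin and emit tab-separated key-value pairs. This is a generic sketch of the programming model, not code from the paper; the script name is a placeholder.

```python
#!/usr/bin/env python3
# wordcount.py -- invoke as "wordcount.py map" or "wordcount.py reduce".
# Hadoop Streaming pipes HDFS input through the mapper, sorts the
# emitted pairs by key, then pipes them through the reducer; the
# reducer's output is written back into HDFS.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")  # emit a (word, 1) key-value pair

def reducer():
    current, count = None, 0
    for line in sys.stdin:  # lines arrive sorted by key
        word, n = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print(f"{current}\t{count}")  # flush the finished key
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")  # flush the last key

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

A job like this is typically launched with the hadoop-streaming jar, passing the script as both the -mapper and -reducer arguments along with HDFS input and output paths.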

The Emerging Data Warehouse

With unstructured as well as semi-structured data joining the data sources, current DW technologies will have to accommodate Big Data technologies as well. In fact, the two complement each other, so that the transformed and processed unstructured and semi-structured data find their way into the existing data warehouse. The data warehouse thus continues to be the single version of truth from which BI reports are generated. The semi-structured and unstructured data is pushed into HDFS (the Hadoop Distributed File System) using specific APIs, and the MapReduce (MR) framework in Hadoop takes HDFS as its input, does the processing on commodity hardware and finally writes back into HDFS.

Fig 2: Hadoop in a traditional DW system (structured sources such as PS, SAP, ERP, OLTP, MF and XLS flow through ETL and staging into the data warehouse, cubes, analytical tools and reports; log files, clickstreams, sensor data, documents, emails and location data arrive via Hadoop)

Testing Hadoop System

The testing steps broadly involve the following:

1. Testing the input system (HDFS)
- Validate that HDFS holds the data in the correct format
- Validate that all required source data has been moved into HDFS

2. Testing the output of the MR process
- Validate that the MR process generates the correct output in terms of key-value pairs
- Validate that the output is in sync with the HDFS source in both data and format
- Verify that the summarized and aggregated data in the output (HDFS/Hive) tallies with the input (HDFS)
- Validate that specific data transformations meet the requirements
- Make sure the output remains the same whether Hadoop runs on a single node or on multiple nodes (a comparison sketch follows this section)
- When the HDFS output is finally moved into the data warehouse, verify that the HDFS aggregations and summarizations tally with the data moved into the data warehouse. HDFS can be queried using Pig, while the data warehouse can be queried using SQL (a reconciliation sketch also follows)

3. Testing the BI reports
- Validate the reports for layout/format and data

4. ETL validations
- The ETL validation strategy remains the same as for a typical data warehouse
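The single-node versus multi-node check above can be scripted. A multi-node run may split its output across many part files and order records differently, so a fair comparison merges and sorts both outputs before diffing them. A minimal sketch, with the output directories as placeholder paths:

```python
import glob

def canonical_output(pattern: str) -> list[str]:
    """Merge every part file matching the pattern and sort the lines,
    so that partitioning and record-ordering differences are ignored."""
    lines: list[str] = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            lines.extend(f.read().splitlines())
    return sorted(lines)

# Placeholder paths for outputs copied out of HDFS from the two runs.
single = canonical_output("single_node_out/part-*")
multi = canonical_output("multi_node_out/part-*")
print("outputs match" if single == multi else "outputs differ")
```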

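Likewise, the HDFS-to-warehouse reconciliation can be automated by computing the same aggregate on both sides and comparing. The sketch below assumes the Hadoop output has been exported as tab-separated key-value files and that the warehouse is reachable through a standard Python DB-API connection; the table and column names are hypothetical.

```python
import csv
import sqlite3  # stand-in: any DB-API driver for the warehouse works the same way

def hdfs_aggregates(path: str) -> dict[str, int]:
    """Sum counts per key from MR output exported as a tab-separated file."""
    totals: dict[str, int] = {}
    with open(path, newline="") as f:
        for key, value in csv.reader(f, delimiter="\t"):
            totals[key] = totals.get(key, 0) + int(value)
    return totals

def warehouse_aggregates(conn) -> dict[str, int]:
    """The same aggregate queried with SQL (hypothetical table and columns)."""
    cur = conn.execute("SELECT product_id, SUM(sales) FROM fact_sales GROUP BY product_id")
    return {str(k): int(v) for k, v in cur.fetchall()}

def reconcile(hdfs: dict, dw: dict) -> dict:
    """Return every key whose totals disagree or that exists on only one side."""
    return {k: (hdfs.get(k), dw.get(k))
            for k in set(hdfs) | set(dw) if hdfs.get(k) != dw.get(k)}

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # placeholder warehouse connection
    diff = reconcile(hdfs_aggregates("part-00000"), warehouse_aggregates(conn))
    print("PASS" if not diff else f"FAIL: {diff}")
```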
Hadoop System - Data Flow

The data flow follows the architecture described above: semi-structured and unstructured sources are loaded into HDFS, the MR framework (with Pig, HBase and Hive layered on top) processes them, and the results are moved into the data warehouse.

Fig 3: Hadoop System - Data Flow (Source: social media data, log files, sensor data files, emails and XML → Process: HDFS and MR, with Pig / HBase / Hive → Target: Data Warehouse)

Conclusion

Success in a data testing project calls for a clear test strategy. This does not change in a Big Data environment, where complexity has increased considerably with the arrival of semi-structured and unstructured data. Automation of the testing process can give a decisive edge in this complex data environment.

About the Author

Nagarajan K R manages the BI Testing Practice at Hexaware Technologies. He has over 15 years of IT experience in web and BI technologies. His current interests include testing Big Data applications, building utility tools for testing them and, as a consequence, the changing data warehouse architecture.

To learn more, visit http://

1095 Cranbury South River Road, Suite 10, Jamesburg, NJ 08831. Main: 609-409-6950 Fax: 609-409-6910

Disclaimer: Contents of this whitepaper are the exclusive property of Hexaware Technologies and may not be reproduced in any form without the prior written consent of Hexaware Technologies.

Hexaware Technologies. All rights reserved.