Testing 3Vs (Volume, Variety and Velocity) of Big Data





A lot happens in the Digital World in 60 seconds

What is Big Data?
Big Data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data is a generic term for voluminous amounts of structured, semi-structured and unstructured data.

Big Data Characteristics
Three key characteristics of big data:
Volume: high volumes of data are created both inside corporations and outside them, via the web, mobile devices, IT infrastructure, and other sources.
Variety: data comes in structured, semi-structured and unstructured formats.
Velocity: data is generated at high speed; high volumes of data must be processed within seconds.

Big Data Processing using the Hadoop Framework
1. Load source data files into HDFS
2. Perform MapReduce operations
3. Extract the output results from HDFS

Hadoop MapReduce Processing Overview
MapReduce is a distributed computing and parallel processing framework with the advantage of pushing the computation to the data:
Distributed computing
Parallel computing
Based on Map and Reduce tasks
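The Map and Reduce tasks mentioned above can be sketched in plain Python. This is a minimal in-process illustration of the programming model, not Hadoop itself: a mapper emits (key, value) pairs, the framework sorts and groups them by key (the shuffle), and a reducer aggregates each key's values. The word-count example is the conventional one for this model.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Aggregate all counts emitted for one key
    return (word, sum(counts))

def map_reduce(lines):
    # Map phase: run the mapper over every input record
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort phase: bring all pairs with the same key together
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key
    return dict(reducer(key, (v for _, v in group))
                for key, group in groupby(pairs, key=itemgetter(0)))

print(map_reduce(["big data", "big clusters"]))  # {'big': 2, 'clusters': 1, 'data': 1}
```

In real Hadoop the map and reduce tasks run on many nodes in parallel and the shuffle moves data across the network; the structure of the computation, however, is the same.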

Hadoop Ecosystem
HDFS: Hadoop Distributed File System
HBase: NoSQL data store (non-relational distributed database)
MapReduce: distributed computing framework
Sqoop: SQL-to-Hadoop database import and export tool
Hive: Hadoop data warehouse
Pig: platform for creating MapReduce programs for analyzing large data sets

Unique Testing Opportunities in Big Data Implementations

Testing Opportunities for Independent Testing
Early validation of the requirements
Preparation of big test data
Early validation of the design
Configuration testing
Incremental load testing
Functional testing

Early Validation of the Requirements
Enterprise Data Warehouses integrated with Big Data
Business Intelligence systems integrated with Big Data
Are the requirements mapped to the right data sources?
Are any data sources not considered? If so, why?

Early Validation of the Design
Is the unstructured data stored in the right place for analytics?
Is the structured data stored in the right place for analytics?
Is the data duplicated in multiple storage systems? If so, why?
Are the data synchronization needs adequately identified and addressed?

Preparation of Big Test Data
Replicate data intelligently with tools
How big should the data files be to ensure near-real volumes of data?
Create data with an incorrect schema
Create erroneous data
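A sketch of the test-data preparation idea above: replicate records programmatically, deliberately mixing in erroneous values and wrong-schema variants so the pipeline's error handling is exercised. The record fields (`id`, `amount`, `region`) are illustrative assumptions, not a format from the original deck.

```python
import json
import random

def make_test_records(n, seed=42):
    # Generate n JSON-lines records: a mix of valid, erroneous,
    # and wrong-schema variants (hypothetical field names).
    rng = random.Random(seed)
    records = []
    for i in range(n):
        kind = rng.choice(["valid", "erroneous", "bad_schema"])
        if kind == "valid":
            rec = {"id": i, "amount": round(rng.uniform(1, 100), 2), "region": "EU"}
        elif kind == "erroneous":
            # Correct schema, invalid values (negative amount, unknown region)
            rec = {"id": i, "amount": -1.0, "region": "??"}
        else:
            # Incorrect schema: misspelled field, missing "amount", extra field
            rec = {"id": i, "regionn": "EU", "extra": True}
        records.append(json.dumps(rec))
    return records

sample = make_test_records(5)
```

With a fixed seed the generated mix is reproducible, which matters when the same test data must be reloaded into HDFS across runs.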

Cluster Setup Testing
Is the system behaving as expected when a node is removed from the cluster?
Is the system behaving as expected when a node is added to the cluster?

Big Data Testing

Volume Testing: Challenges
Terabytes and petabytes of data
Data stored in HDFS in file formats
Data files are split and stored across multiple data nodes
100% coverage cannot be achieved
Data consolidation issues

Volume Testing: Approach
Use a data sampling strategy
Sampling to be done based on data requirements
Convert raw data into the expected result format to compare with actual output data
Prepare compare scripts to compare the data present in HDFS file storage
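The sampling-and-compare approach above can be sketched as follows. This is a hedged illustration, not the deck's actual compare script: it samples row keys (since 100% coverage is impractical at terabyte scale) and diffs expected values against actual output values. Here the rows are in-memory dictionaries; in practice they would be read from source extracts and HDFS output files.

```python
import random

def sample_and_compare(expected, actual, sample_size, seed=0):
    # Sample row keys from the expected data set and report every key
    # whose actual value is missing or differs from the expected one.
    rng = random.Random(seed)
    keys = rng.sample(sorted(expected), min(sample_size, len(expected)))
    return [k for k in keys if expected[k] != actual.get(k)]

expected = {f"row{i}": i * 2 for i in range(1000)}
actual = dict(expected)
actual["row7"] = -1  # inject one discrepancy

print(sample_and_compare(expected, actual, 1000))  # ['row7']
```

The sampling rate is the key tuning knob: it trades validation coverage against the time and compute the compare run itself consumes.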

Variety Testing: Challenges
Manually validating semi-structured and unstructured data
Unstructured data validation issues because of the lack of a defined format
A lot of scripting is required to process semi-structured and unstructured data
Sampling unstructured data is a challenge

Variety Testing: Approach
Structured data: compare data using compare tools and identify discrepancies
Semi-structured data: convert semi-structured data into a structured format; format the converted raw data into expected results; compare expected result data with actual results
Unstructured data: parse unstructured text data into data blocks and aggregate the computed data blocks; validate aggregated data against the data output
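The semi-structured branch above can be sketched in a few lines. This is an illustrative assumption of the conversion step, not the deck's tooling: JSON-lines records are flattened into fixed-order tuples so that ordinary row-by-row compare tools can be applied, and missing fields surface as None so schema drift shows up in the diff.

```python
import json

def to_structured(json_lines, fields):
    # Flatten each semi-structured JSON record into a tuple with a fixed
    # column order; absent fields become None rather than raising.
    rows = []
    for line in json_lines:
        rec = json.loads(line)
        rows.append(tuple(rec.get(f) for f in fields))
    return rows

raw = ['{"id": 1, "city": "Pune"}', '{"id": 2}']
expected_rows = to_structured(raw, ["id", "city"])
print(expected_rows)  # [(1, 'Pune'), (2, None)]
```

Once both expected and actual data are in this tabular form, the same compare scripts used for structured data can validate them.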

Velocity Testing: Challenges
Setting up a production-like environment for performance testing
Simulating production job runs
High-velocity, high-volume streaming
Test data setup
Simulating node failures

Velocity Testing: Approach
Validation points:
Performance of Pig/Hive jobs: capture job completion time and validate it against the benchmark
Throughput of the jobs
Impact of background processes on the performance of the system
Memory and CPU details of the task tracker
Availability of the name node and data nodes
Metrics captured:
Job completion time
Throughput
Memory utilization
Number of spills and spilled records
Job failure rate
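Two of the metrics listed above, job completion time and throughput, can be derived from job start/end timestamps and the record count. A minimal sketch, assuming hypothetical timestamp inputs rather than any specific Hadoop counter API:

```python
def job_metrics(start_ts, end_ts, records_processed):
    # Derive velocity-testing metrics from a job's timing data.
    elapsed = end_ts - start_ts                  # job completion time (seconds)
    throughput = records_processed / elapsed     # records per second
    return {"completion_time_s": elapsed, "throughput_rps": throughput}

m = job_metrics(start_ts=100.0, end_ts=160.0, records_processed=3_000_000)
print(m)  # {'completion_time_s': 60.0, 'throughput_rps': 50000.0}
```

Comparing these numbers across runs, and against the agreed benchmark, is what turns the captured metrics into pass/fail validation points.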

Questions?


THANK YOU
www.infosys.com
The contents of this document are proprietary and confidential to Infosys Limited and may not be disclosed in whole or in part at any time, to any third party without the prior written consent of Infosys Limited. 2012 Infosys Limited. All rights reserved. Copyright in the whole and any part of this document belongs to Infosys Limited. This work may not be used, sold, transferred, adapted, abridged, copied or reproduced in whole or in part, in any manner or form, or in any media, without the prior written consent of Infosys Limited.