STeP-IN SUMMIT 2014. June 2014 at Bangalore, Hyderabad, Pune - INDIA. Performance testing Hadoop based big data analytics solutions



Similar documents
Performance Testing of Big Data Applications

STeP-IN SUMMIT June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case)

Big Data Analytics - Accelerated. stream-horizon.com

Testing 3Vs (Volume, Variety and Velocity) of Big Data

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Getting Started with SandStorm NoSQL Benchmark

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

The Inside Scoop on Hadoop

Open source software for building a private cloud

Building & Optimizing Enterprise-class Hadoop with Open Architectures Prem Jain NetApp

Peers Techno log ies Pv t. L td. HADOOP

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Applied Storage Performance For Big Analytics. PRESENTATION TITLE GOES HERE Hubbert Smith LSI

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop Big Data for Processing Data and Performing Workload

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

Savanna Hadoop on. OpenStack. Savanna Technical Lead

Comparing Scalable NOSQL Databases

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Open source Google-style large scale data analysis with Hadoop

PEPPERDATA OVERVIEW AND DIFFERENTIATORS

Dell Reference Configuration for Hortonworks Data Platform

Benchmarking Hadoop & HBase on Violin

Big Data & Cloud. 4 th European Summit on the Future Internet. António Miguel Ferreira, CEO, Lunacloud. Aveiro, 13 to 14th June 2013

Ankush Cluster Manager - Hadoop2 Technology User Guide

Search and Real-Time Analytics on Big Data

STeP-IN SUMMIT June 2014 at Bangalore, Hyderabad, Pune - INDIA. Mobile Performance Testing

HDMQ :Towards In-Order and Exactly-Once Delivery using Hierarchical Distributed Message Queues. Dharmit Patel Faraj Khasib Shiva Srivastava

Scalable Architecture on Amazon AWS Cloud

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

Hadoop Size does Hadoop Summit 2013

HADOOP PERFORMANCE TUNING

Hadoop Ecosystem B Y R A H I M A.

Hadoop & Spark Using Amazon EMR

Performance Analysis in Bigdata

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

How To Create A Data Visualization With Apache Spark And Zeppelin

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Qsoft Inc

Open source large scale distributed data management with Google s MapReduce and Bigtable

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Can the Elephants Handle the NoSQL Onslaught?

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Dominik Wagenknecht Accenture

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Using RDBMS, NoSQL or Hadoop?

xpaaerns on Spark, Shark, Tachyon and Mesos

THE HADOOP DISTRIBUTED FILE SYSTEM

In Memory Accelerator for MongoDB

Virtualizing Apache Hadoop. June, 2012

Proact whitepaper on Big Data

Benchmarking Top NoSQL Databases Apache Cassandra, Couchbase, HBase, and MongoDB Originally Published: April 13, 2015 Revised: May 27, 2015

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Deploying Hadoop with Manager

H2O on Hadoop. September 30,

Hadoop: Embracing future hardware

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

STeP-IN SUMMIT June 2014 at Bangalore, Hyderabad, Pune - INDIA. Mobile Application Performance: Test Strategies & Enhancement through WPO

NoSQL and Hadoop Technologies On Oracle Cloud

Extending Hadoop beyond MapReduce

CA Big Data Management: It s here, but what can it do for your business?

Enabling High performance Big Data platform with RDMA

Big Data Analytics - Accelerated. stream-horizon.com

Upcoming Announcements

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Unified Big Data Processing with Apache Spark. Matei

CDH AND BUSINESS CONTINUITY:

Benchmarking Cassandra on Violin

OnX Big Data Reference Architecture

Ubuntu and Hadoop: the perfect match

Big Data and Data Science: Behind the Buzz Words

Performance Testing Percy Pari Salas

Application Development. A Paradigm Shift

How To Scale Out Of A Nosql Database

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes

STeP-IN SUMMIT June 18 21, 2013 at Bangalore, INDIA. Enhancing Performance Test Strategy for Mobile Applications

The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

Manifest for Big Data Pig, Hive & Jaql

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

<Insert Picture Here> Big Data

Big Data. Value, use cases and architectures. Petar Torre Lead Architect Service Provider Group. Dubrovnik, Croatia, South East Europe May, 2013

Accelerating and Simplifying Apache

Communicating with the Elephant in the Data Center

CSE-E5430 Scalable Cloud Computing Lecture 2

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Apache Hadoop new way for the company to store and analyze big data

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Apache Hadoop. Alexandru Costan

NoSQL Data Base Basics

Transcription:

11 th International Conference on Software Testing June 2014 at Bangalore, Hyderabad, Pune - INDIA Performance testing Hadoop based big data analytics solutions by Mustufa Batterywala, Performance Architect, and Shirish Bhale, Director of Engineering, Impetus Infotech Copyright: STeP-IN Forum Published with permission for restricted use in STeP-IN SUMMIT 2014 in agreement with full copyrights from owner(s) / author(s) of material. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without the prior consent of the owner(s) / author(s). This edition is manufactured in India and is authorized for distribution only during STeP-IN SUMMIT 2014 as per the applicable conditions. Practices Experience Knowledge Automation Produced By Hosted By www.stepinforum.org www.qsitglobal.com

1 Key Big Data Technologies Map Reduce 2 Apache Hadoop, Cloudera, Hortonworks, MapR etc. NoSQL Cassandra, Mongo DB, Oracle NoSQL, Neo4j etc. Messaging queues Kafka, ActiveMQ, RabbitMQ, ZeroMQ etc. Search Lucene, Elastic Search, Solr 1

Agenda Big Data Performance Testing Focus Areas Challenges Performance Testing Approach Solutions 3 Big Data Performance Test Focus Areas 4 2

Performance Testing Challenges Diverse technologies Unavailability of tools Test scripting Test environment Limited monitoring solutions Lack of diagnostic solutions 5 Performance Testing Approach 6 3

Performance Test Environment Leverage cloud Automate environment creation Single node deployment Optimize, Tune Set up test cluster Replication Fail over configuration 7 Design Workload Messaging Servers Message/sec Bytes/sec Hadoop Number of jobs in parallel NoSQL Operations/sec Read/Write/Update ratio 8 4

Performance Testing Solutions Performance Test Tools YCSB (Yahoo Cloud Serving Benchmark), SandStorm, JMeter, Cassandra stress test Monitoring Tools Nagios, Zabbix, Ganglia, JMX utilities Diagnostic Tools (APM) visualvm, AppDynamics, Compuware 9 Benchmarks Hadoop TestDFSIO: Distributed i/o benchmark mrbench: A map/reduce benchmark that can create many small jobs nnbench: A benchmark that stresses the namenode NoSQL Workload I: 90% read, 10% insert Workload II: 50% read, 50% update 10 5

Critical Performance Parameters NoSQL Data Storage Commit Logs Concurrency Caching JVM parameters 11 Critical Performance Parameters mapreduce configurations Number of mappers & reducers HDFS chunk size Io.sort.mb, Io.sort.factor Memory Message queue configurations Sync v/s async Timeouts 12 6

Real World Experience About the application Online Social networking website with almost 100 million registered users across the globe Hundreds and thousands of users are online at any given time The challenges Develop a near real time analytics solution to analyze the user feedback and interactions Solution uses Kafka, HBase and Hive as major technologies SLA to support 50K messages per minute Optimize and tune the Kafka clusters for maximum throughout Real time monitoring of test environment for bottleneck identification 13 Real World Experience Impetus contributions Proposed and implemented performance test strategy for analytics solution Identified key performance components namely Kafka and Hbase for focus testing Proposed SandStorm as performance testing tool based on project requirements Prepared Kafka clients to simulate expected data ingestion volumes Optimized and tune single Kafka cluster in EC2 on medium instance Executed tests with varying message rate and reached to the max throughput of 50k messages per minute using 3 server cluster Real time monitoring using SandStorm to identify performance bottlenecks 14 Benefits Realized Optimum hardware utilization for complete solution Zero performance issues on Go live Maximum throughput of Kafka servers 7

Q&A 15 Thank You For more info mbatterywala@impetus.co.in, sbhale@impetus.co.in 16 8