11 th International Conference on Software Testing June 2014 at Bangalore, Hyderabad, Pune - INDIA Performance testing Hadoop based big data analytics solutions by Mustufa Batterywala, Performance Architect, and Shirish Bhale, Director of Engineering, Impetus Infotech Copyright: STeP-IN Forum Published with permission for restricted use in STeP-IN SUMMIT 2014 in agreement with full copyrights from owner(s) / author(s) of material. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without the prior consent of the owner(s) / author(s). This edition is manufactured in India and is authorized for distribution only during STeP-IN SUMMIT 2014 as per the applicable conditions. Practices Experience Knowledge Automation Produced By Hosted By www.stepinforum.org www.qsitglobal.com
1 Key Big Data Technologies Map Reduce 2 Apache Hadoop, Cloudera, Hortonworks, MapR etc. NoSQL Cassandra, Mongo DB, Oracle NoSQL, Neo4j etc. Messaging queues Kafka, ActiveMQ, RabbitMQ, ZeroMQ etc. Search Lucene, Elastic Search, Solr 1
Agenda Big Data Performance Testing Focus Areas Challenges Performance Testing Approach Solutions 3 Big Data Performance Test Focus Areas 4 2
Performance Testing Challenges Diverse technologies Unavailability of tools Test scripting Test environment Limited monitoring solutions Lack of diagnostic solutions 5 Performance Testing Approach 6 3
Performance Test Environment Leverage cloud Automate environment creation Single node deployment Optimize, Tune Set up test cluster Replication Fail over configuration 7 Design Workload Messaging Servers Message/sec Bytes/sec Hadoop Number of jobs in parallel NoSQL Operations/sec Read/Write/Update ratio 8 4
Performance Testing Solutions Performance Test Tools YCSB (Yahoo Cloud Serving Benchmark), SandStorm, JMeter, Cassandra stress test Monitoring Tools Nagios, Zabbix, Ganglia, JMX utilities Diagnostic Tools (APM) visualvm, AppDynamics, Compuware 9 Benchmarks Hadoop TestDFSIO: Distributed i/o benchmark mrbench: A map/reduce benchmark that can create many small jobs nnbench: A benchmark that stresses the namenode NoSQL Workload I: 90% read, 10% insert Workload II: 50% read, 50% update 10 5
Critical Performance Parameters NoSQL Data Storage Commit Logs Concurrency Caching JVM parameters 11 Critical Performance Parameters mapreduce configurations Number of mappers & reducers HDFS chunk size Io.sort.mb, Io.sort.factor Memory Message queue configurations Sync v/s async Timeouts 12 6
Real World Experience About the application Online Social networking website with almost 100 million registered users across the globe Hundreds and thousands of users are online at any given time The challenges Develop a near real time analytics solution to analyze the user feedback and interactions Solution uses Kafka, HBase and Hive as major technologies SLA to support 50K messages per minute Optimize and tune the Kafka clusters for maximum throughout Real time monitoring of test environment for bottleneck identification 13 Real World Experience Impetus contributions Proposed and implemented performance test strategy for analytics solution Identified key performance components namely Kafka and Hbase for focus testing Proposed SandStorm as performance testing tool based on project requirements Prepared Kafka clients to simulate expected data ingestion volumes Optimized and tune single Kafka cluster in EC2 on medium instance Executed tests with varying message rate and reached to the max throughput of 50k messages per minute using 3 server cluster Real time monitoring using SandStorm to identify performance bottlenecks 14 Benefits Realized Optimum hardware utilization for complete solution Zero performance issues on Go live Maximum throughput of Kafka servers 7
Q&A 15 Thank You For more info mbatterywala@impetus.co.in, sbhale@impetus.co.in 16 8