BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB




Planet Size Data!? Gartner's 10 key IT trends for 2012: unstructured data will grow some 80% over the course of the next five years.

More (old) numbers! Facebook: 1 billion active users; 1 in 3 Internet users have a Facebook account. More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month. Holds 30 PB of data for analysis and adds 12 TB of compressed data daily. Twitter: 200 million users, 200 million daily tweets; 1.6 billion search queries a day; 7 TB of data for analysis generated daily. 90 percent of the data in the world today has been created in the last two years alone. Traditional data storage, techniques & analysis tools just do not work at these scales! Source: http://www.rabidgremlin.com/data20/

Size matters, but so do: performance, geographic distribution of users, availability, scalability.

So many of them! (Image: http://the-opt.com/?p=66)

One way: benchmark. On what parameters? Performance, scalability, availability/fault tolerance.

Get to know the giants: HBase over Hadoop, and Cassandra.

Cloud DB Concepts DFS: a Distributed File System is a file system implemented on top of the underlying operating system's file system which allows files to be created and accessed from multiple hosts across a network. A DFS creates multiple replicas of data blocks and distributes them on nodes throughout a cluster to enable reliable, scalable and extremely rapid computations.

Cloud DB Concepts Reliability is augmented because the same data is replicated on multiple nodes. A DFS allows addition and removal of nodes in the cluster, ensuring smooth scale-up and scale-down with minimal or no manual intervention (easy scalability). Computation speed gets a boost since any node can theoretically access any data, and we can make use of data locality for computation.
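The replication idea above can be sketched in a few lines of plain Python (a toy model; the node and block names are invented for illustration, and real DFS implementations use rack-aware placement policies rather than simple round-robin):

```python
import itertools

def place_replicas(blocks, nodes, replication=3):
    """Toy placement: assign each block to `replication` distinct nodes,
    round-robin, so losing any single node leaves copies elsewhere."""
    node_cycle = itertools.cycle(nodes)
    placement = {}
    for block in blocks:
        chosen = []
        while len(chosen) < replication:
            node = next(node_cycle)
            if node not in chosen:
                chosen.append(node)
        placement[block] = chosen
    return placement

placement = place_replicas(["blk_1", "blk_2", "blk_3"],
                           ["node-a", "node-b", "node-c", "node-d"])
```

With three replicas per block, any single node can fail and at least two copies of every block survive elsewhere in the cluster.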

Map Reduce A programming model for expressing distributed computations on massive amounts of data, and an execution framework for large-scale data processing on clusters of commodity servers. Instead of transferring a huge amount of data to a central location and then processing it sequentially, send small chunks of code to the nodes (which ideally already hold the data needed for the computation), run the computation there to exploit data locality (mapping), and send the results back to a reducer which coalesces them.

Map Reduce No degradation of performance due to communication latency. Many algorithms cannot be easily expressed as a single MapReduce job, but theoretically a lot of processes can be broken down into a sequence of map and reduce tasks, which can then run in parallel on multiple nodes of the cluster, reducing the time taken by roughly a factor of the number of nodes involved.
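The map/shuffle/reduce flow described above can be illustrated with a minimal word-count sketch in plain Python (a single-process simulation, not a real Hadoop job; in a cluster, each map task would run on the node that holds its chunk):

```python
from collections import defaultdict

def map_phase(chunk):
    # Mapper: emit a (word, 1) pair for every word in the local chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(key, values):
    # Reducer: coalesce all counts emitted for one key.
    return key, sum(values)

chunks = ["big data big clusters", "big data everywhere"]  # data on two nodes
# Shuffle: group mapper output by key before reducing.
grouped = defaultdict(list)
for chunk in chunks:                 # each map task runs where its data lives
    for key, value in map_phase(chunk):
        grouped[key].append(value)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
# counts == {"big": 3, "data": 2, "clusters": 1, "everywhere": 1}
```

In a real cluster the mappers run in parallel on the data nodes and only the small (word, count) pairs travel over the network, which is what avoids the latency penalty of shipping raw data.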

Data Model Both HBase and Cassandra follow a columnar data model approach. No to normalization theories; yes to replication. "Arrange data for queries" is the guiding rule.

Consistency Cassandra [default]: eventual consistency. Expected behavior: very fast writes, slower reads. HBase: strict consistency. Expected behavior: very fast reads, near-optimal (comparatively slower) writes.
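Cassandra's consistency is in fact tunable per operation (ONE, QUORUM, ALL), and the usual Dynamo-style rule of thumb behind the eventual-vs-strict trade-off can be expressed as a tiny check. This is a conceptual sketch, not Cassandra's actual implementation:

```python
def is_strongly_consistent(n_replicas, write_acks, read_acks):
    """Dynamo-style rule: if W + R > N, every read quorum overlaps
    every write quorum, so a read must see the latest acknowledged write."""
    return write_acks + read_acks > n_replicas

# Lower ack counts trade consistency for speed:
assert is_strongly_consistent(3, 2, 2)      # QUORUM/QUORUM: overlapping sets
assert not is_strongly_consistent(3, 1, 1)  # ONE/ONE: eventual only
```

The default-style fast path (acknowledging a write after one replica) is exactly why Cassandra gets very fast writes at the cost of possibly stale reads.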

Architecture HBase follows a master-slave model to ensure optimised resource and task allocation; the master is a potential performance bottleneck and a potential single point of failure.

Architecture Cassandra follows a decentralized model and focuses more on availability and fault tolerance than on ensuring strict consistency of data.

Metrics Used for Comparison Performance: number of transactions done per second (basic). This might not give a clear picture for real-world applications, where we also have to factor in latency. Hence we measure performance not simply as the throughput offered by the two services; we compare the trade-off of latency vs. throughput. I.e., a service/system with better performance will achieve the desired latency and throughput with the same number of servers.

Scalability A system with good scale-up is one in which latency remains constant, or decreases, as the number of servers and the offered throughput scale upwards proportionally.
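That scale-up criterion can be written as a small check (illustrative only; the 10% tolerance is an assumed threshold, not part of the study):

```python
def scales_well(latencies_by_servers, tolerance=1.10):
    """Scale-up check: as servers (and offered load) grow proportionally,
    latency should stay flat or drop; flag runs where it climbs > 10%."""
    baseline = latencies_by_servers[0][1]
    return all(lat <= baseline * tolerance for _, lat in latencies_by_servers)

assert scales_well([(2, 10.0), (4, 10.2), (6, 9.8)])  # flat curve: good scale-up
assert not scales_well([(2, 10.0), (4, 14.0)])        # latency climbs: poor
```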

Experimental Setup 7 server-class machines, plus one extra machine for the clients. Cassandra version 1.7.0; Hadoop version 0.20.203.0; HBase version 0.92.1. No replication (other than defaults). Updates are forced to disk, except for HBase, which primarily commits to memory (default YCSB behaviour).

Hardware Configuration Compute node configuration:
vendor_id: AuthenticAMD
cpu family: 15
model: 5
model name: AMD Opteron(tm) Processor 246
stepping: 8
cpu MHz: 2004.296
cache size: 1024 KB
fpu: yes
cache_alignment: 64
address sizes: 40 bits physical, 48 bits virtual

Cassandra Cluster All seven nodes were used as Cassandra nodes. We used RandomPartitioner to let Cassandra distribute rows evenly across the cluster and reduce any extra overhead. We also allocated 3 GB of RAM to each Cassandra node.

HBase/Hadoop Cluster HADOOP CLUSTER: 1 name node (compute-0-11) and 3 data nodes (compute-0-11, compute-0-7, and compute-0-8). HBASE CLUSTER: 1 master node (compute-0-3) and 4 region servers (compute-0-2, compute-0-3, compute-0-4, and compute-0-5). In total we dedicated seven servers to the HBase/Hadoop cluster. We allocated 1 GB of RAM to Hadoop and 3 GB of RAM to HBase.

YCSB YCSB is a Java program for generating the data to be loaded into the database and for generating the operations which make up the workload.

YCSB The workload executor drives multiple client threads. Each thread executes a sequential series of operations to load the database (the load phase) and to execute the workload (the transaction phase).

YCSB At the end of the experiment, the statistics module aggregates the measurements and reports average, 95th and 99th percentile latencies, and either a histogram or a time series of the latencies.
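The kind of aggregation YCSB's statistics module performs can be sketched as follows (a simplified stand-in using the nearest-rank percentile method on dummy samples; YCSB's own histogram-based bookkeeping differs in detail):

```python
def latency_report(samples_ms):
    """Aggregate raw latency samples the way a benchmark report does:
    average plus 95th/99th percentile (nearest-rank method)."""
    ordered = sorted(samples_ms)
    def pct(p):
        rank = max(0, int(round(p / 100 * len(ordered))) - 1)
        return ordered[rank]
    return {
        "avg": sum(ordered) / len(ordered),
        "p95": pct(95),
        "p99": pct(99),
    }

report = latency_report(list(range(1, 101)))  # 1..100 ms dummy samples
# report == {"avg": 50.5, "p95": 95, "p99": 99}
```

Tail percentiles matter here because the latency-vs-throughput trade-off is judged on worst-case user experience, not just the average.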

Results: Workload A 50% READ / 50% UPDATE

Results: Workload B (read heavy) 95% READ / 5% UPDATE

Results: Workload C 100% READ. Hence, in a system with high reads and low updates, Cassandra is slightly better than HBase over Hadoop (if some data inconsistency is not much of an issue).

Results: Workload D 5% INSERT / 95% READ. Models social networking sites, where a few inserts are followed by heavy read operations on those inserts.


Results: Workload E 100% INSERT. In a write-heavy workload we see that at higher throughputs there is very little difference between HBase and Cassandra.

Workloads and Target Applications
Workload | Type | Operations | Target application
A | Update heavy | Read 50% / Update 50% | Session store recording recent actions in a user session
B | Read heavy | Read 95% / Update 5% | Photo tagging; adding a tag is an update, but most operations are reads
C | Read only | Read 100% | User profile cache, where profiles are constructed elsewhere
F | Inserts only | Insert 100% | Logging applications, data transformation applications

Scalability

Conclusion/Confusion HBase has good write latency and somewhat higher read latency. Cassandra achieves higher throughput and lower latency comparably across reads, writes, and updates. With respect to scalability, for a small number of servers (< 4) Cassandra scales better than HBase.

What next? Different cluster set-ups. Newer versions of both DBs. YCSB + MapReduce!?

Right tool for the Job

Thank you