LARGE-SCALE DATA STORAGE APPLICATIONS

Similar documents
Benchmarking Failover Characteristics of Large-Scale Data Storage Applications: Cassandra and Voldemort

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

PostgreSQL Performance Characteristics on Joyent and Amazon EC2

Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Yahoo! Cloud Serving Benchmark

Cassandra A Decentralized, Structured Storage System

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

Benchmarking the Availability and Fault Tolerance of Cassandra

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Benchmarking and Analysis of NoSQL Technologies

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

Getting Started with SandStorm NoSQL Benchmark

Introduction to Apache Cassandra

Can the Elephants Handle the NoSQL Onslaught?

NoSQL Data Base Basics

Distributed Systems. Tutorial 12 Cassandra

HDB++: HIGH AVAILABILITY WITH. l TANGO Meeting l 20 May 2015 l Reynald Bourtembourg

Cassandra A Decentralized Structured Storage System

Benchmarking Replication in NoSQL Data Stores

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Benchmarking Cassandra on Violin

Scalable Architecture on Amazon AWS Cloud

Evaluation of NoSQL databases for large-scale decentralized microblogging

The NoSQL Ecosystem, Relaxed Consistency, and Snoop Dogg. Adam Marcus MIT CSAIL

Introduction to Big Data Training

Practical Cassandra. Vitalii

RDBMS in the Cloud: Oracle Database on AWS

Configuration Manual Yahoo Cloud System Benchmark (YCSB) 24-Mar-14 SEECS-NUST Faria Mehak

Study and Comparison of Elastic Cloud Databases : Myth or Reality?

Cloud Computing For Bioinformatics

Structured Data Storage

Highly available, scalable and secure data with Cassandra and DataStax Enterprise. GOTO Berlin 27 th February 2014

MASTER PROJECT. Resource Provisioning for NoSQL Datastores

Implementing Microsoft Windows Server Failover Clustering (WSFC) and SQL Server 2012 AlwaysOn Availability Groups in the AWS Cloud

Case study: CASSANDRA

How To Scale Out Of A Nosql Database

Accelerating Cassandra Workloads using SanDisk Solid State Drives

Performance Evaluation of NoSQL Systems Using YCSB in a resource Austere Environment

Benchmarking Top NoSQL Databases Apache Cassandra, Couchbase, HBase, and MongoDB Originally Published: April 13, 2015 Revised: May 27, 2015

The Total Cost of (Non) Ownership of a NoSQL Database Cloud Service

Distributed Storage Systems

Apache Hadoop. Alexandru Costan

High Throughput Computing on P2P Networks. Carlos Pérez Miguel

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

THE DEFINITIVE GUIDE FOR AWS CLOUD EC2 FAMILIES

So What s the Big Deal?

NoSQL: Going Beyond Structured Data and RDBMS

BRAC. Investigating Cloud Data Storage UNIVERSITY SCHOOL OF ENGINEERING. SUPERVISOR: Dr. Mumit Khan DEPARTMENT OF COMPUTER SCIENCE AND ENGEENIRING

Building a Private Cloud with Eucalyptus

Choosing the right NoSQL database for the job: a quality attribute evaluation

Apache Cassandra for Big Data Applications

Design and Evolution of the Apache Hadoop File System(HDFS)

Cloud Storage Solution for WSN in Internet Innovation Union

Cloud Spectator Comparative Performance Report July 2014

CS435 Introduction to Big Data

Alfresco Enterprise on AWS: Reference Architecture

High Availability for Database Systems in Cloud Computing Environments. Ashraf Aboulnaga University of Waterloo

A Survey of Distributed Database Management Systems

Benchmarking Cloud Serving Systems with YCSB

Benchmarking Cloud Serving Systems with YCSB

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Distributed Data Stores

Accelerating Big Data: Using SanDisk SSDs for MongoDB Workloads

NoSQL Databases. Nikos Parlavantzas

Time series IoT data ingestion into Cassandra using Kaa

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

Data Management in the Cloud

Comparing Scalable NOSQL Databases

How To Choose Between A Relational Database Service From Aws.Com

Cassandra. Jonathan Ellis

Using Object Database db4o as Storage Provider in Voldemort

Deploying Splunk on Amazon Web Services

Online data processing with S4 and Omid*

Comparing NoSQL Solutions In a Real-World Scenario: Aerospike, Cassandra Open Source, Cassandra DataStax, Couchbase and Redis Labs

Accelerating and Simplifying Apache

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

MongoDB in the NoSQL and SQL world. Horst Rechner Berlin,

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

MakeMyTrip CUSTOMER SUCCESS STORY

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

nosql and Non Relational Databases

C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection

Introduction to NOSQL

Performance test report

Cloud Storage Solution for WSN Based on Internet Innovation Union

No-SQL Databases for High Volume Data

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)

NoSQL Database in the Cloud: Couchbase Server 2.0 on AWS July 2013

Transcription:

BENCHMARKING AVAILABILITY AND FAILOVER PERFORMANCE OF LARGE-SCALE DATA STORAGE APPLICATIONS Wei Sun and Alexander Pokluda December 2, 2013

Outline Goal and Motivation Overview of Cassandra and Voldemort Design Benchmark Setup and Methodology Preliminary Results and Status Report Conclusion

Goal and Motivation Goal: To understand failover characteristics of largescale data storage applications Few benchmarks of failover characteristics have been done We have chosen to study the following: Cassandra Voldemort HBase (if time permits) Cassandra and Voldemort were chosen because both emphasize availability and performance HBase, which emphasizes consistency and performance, was chosen to cover a wider range of architectures

OVERVIEW OF CASSANDRA AND VOLDEMORT DESIGN

Cassandra Architecture Mixture of Dynamo and BigTable Consistent hashing Order preserving hash function Various replication options Rack unaware, rack aware, datacenter shard Fault tolerance Accrual Failure Detection Node Failure Down and up: Zookeeper Down entirely: replacement

Voldemort

Voldemort Consistent hashing Zone aware replication User-defined per zone replication factor Consistency Read-repair & quorum Vector-clock versioning Node failure hinted handoff

Cassandra vs. Voldemort Cassandra Voldemort Data Model Replication Partitioning Consistency Model Data Storage Disk Developed column database multi dimensional synchronous/asynchronous chosen by application: Rack unaware, rack aware, across datacenter consistent hashing (order preserving hash function) tunable all the way from "writes never fail" to "block for all replicas to be readable", with the quorum level in the middle by Facebook in Java key-value datastore hash table Zone aware replication consistent hashing tunable Quorum, read-repair, hinted handoff Pluggable Storage Engines: BDB-JE, MySQL, Read-Only by LinkedIn in Java

BENCHMARK SETUP AND METHODOLOGY

Methodology Using Yahoo! Cloud Serving Systems Benchmark (YCSB) for load generation and reporting Extensible and generic framework for evaluation of key-value stores that has become an industry standard Can generate synthetic workloads that consist of a configurable distribution of CRUD operations Measure latency for a variety of throughputs Measure throughput vs time and error count for blue-sky and failure scenarios Node failures simulated using kill -9 Network failures simulated using tcpkill and firewalls Nodes run on AWS

Cluster Configuration Experimental setup consists of 5 nodes on AWS Database Servers 4x m1.xlarge Spot Instances 4 vcpu (8 ECU), 15 GiB RAM 4x420 GB disk as RAID 1+0 YCSB Load Generator 1x m3.2xlarge Spot Instance 8 vcpu (26 ECU), 30 GiB RAM Network throughput between database and YCSB servers consistently > 960 Mbit/s NFS Server 1x m1.small On-Demand Instance 1 vcpu (1 ECU), 1.7 GiB RAM 10 GB persistent EBS volume mounted at /home on all servers

Workloads and Parameters Workload A: Update heavy workload 50% Reads, 50% Writes Example: session store recording recent actions Workload B: Read mostly workload 95% Reads, 5% Writes Example: photo tagging Workload C: Read only 100% Read Example: user profile cache

Workloads and Parameters Workload D: Read latest workload New records inserted, reads mostly on latest inserted Example: User status updates Workload E: Short ranges Short ranges queried Example: Threaded conversations (clustered by thread ID) Workload F: Read-modify-write Records read, modified and written back Example: User database

PRELIMINARY RESULTS AND STATUS REPORT

Cassandra: Latency vs Throughput Cassandra 2.0.2 No failures

Cassandra: Throughput vs Time Left: No failures Right: 1 of 4 nodes killed at 20000 ms

Challenges The YCSB github repository is out of date and the documentation is incomplete Only Cassandra 0.5, 0.6, and 0.7 are supported A patch was submitted for Cassandra 1.0.6 but not documented Only Voldemort 0.81 is supported (but this is not documented) After a long search I found someone's personal fork of YCSB with support for Cassandra 2.0 (CQL) and an updated patch for Voldemort 0.96 in one of the Voldemort contributor s github repository

Status Report Need to re-run Cassandra tests with correct threadcount parameter and fork of YCSB that supports Cassandra 2.0 ObsoleteVersionExceptions are preventing Voldemort benchmark from progressing Contacted Voldemort developers through issue tracker: they said some ObsoleteVersionExceptions are normal I m working on patching the code based on stotch s recommendations Need to test failure scenarios Need to run Hbase tests (time permitting)

CONCLUSION

Summary Our goal is to understand failover characteristics of large-scale data storage applications Few benchmarks of failover characteristics have been done Presented an overview of the design of Cassandra and Voldemort Presented preliminary benchmark results using YCSB on a small cluster of nodes running in AWS

Lessons Learned Learned how to use AWS EC2 and VPC Learned differences between EBS, Instance Store and S3 and how to create AMIs Learned about instance types, placement groups and on-demand vs spot instances Learned about Regions and availability zones and how AWS is designed Designed to isolate failures Learned about the design and implementation and how to install, configure and tune several NoSQL Systems

QUESTIONS?