Comparing Scalable NOSQL Databases

Functionalities and Measurements

Dory Thibault, UCL
Contact: thibault.dory@student.uclouvain.be
Sponsor: Euranova
Website: nosqlbenchmarking.com
February 15, 2011

Clarifications

Since many people who read these slides did not get the oral explanations that MUST go with them, here are a few words of warning:
- All the databases were used with default configurations; I will post them soon on nosqlbenchmarking.com.
- No index was set manually; doing so could have a big impact on performance.
- Don't jump to conclusions: it would be WRONG to say that Cassandra is very good and that HBase sucks.
- The Cassandra implementation of MapReduce seems to be buggy and does not scale.
- There must be something wrong with my HBase configuration; HBase is known to run gigantic clusters without problems.

Clarifications

Also keep in mind that a benchmark is always biased by the chosen methodology, so:
- The way I store data in each database could have an impact on performance.
- The summary of the results should not be taken in an absolute way, especially the first one. When I say Good or Bad, it is in THIS particular case.
- Moreover, raw results are not the most important thing; scalability is very important too. So good performance for Cassandra MapReduce without scalability is NOT good.
- The data set is too small, so I am testing cache performance (but this is the same for all of the databases).
- I will soon add a written analysis and a self-critique of these results on www.nosqlbenchmarking.com.

Motivation

YCSB
Yahoo! Cloud Serving Benchmark is the best-known NoSQL benchmarking application, so why make another one?
- YCSB uses data generated from statistical distributions instead of real data
- YCSB only focuses on read/write/update/scan performance
- YCSB results for elasticity are not conclusive

Idea
- Data and use case inspired by a concrete case: Wikipedia
- Test read/update performance
- Test MapReduce performance by computing an inverted search index

Cassandra 0.6.10

Overview
Cassandra is a fully distributed column-oriented data store that provides a MapReduce implementation using Hadoop.
- All the nodes in the cluster play the same role
- The data (existing and new) are sharded automatically among the nodes
- The developer can choose the consistency level for each request

HBase 0.20.6

Overview
HBase is a column-oriented database that aims to provide low-latency requests on top of Hadoop HDFS.
- An HBase cluster uses several kinds of servers:
  - HDFS needs at least one namenode and several datanodes
  - HBase needs a ZooKeeper cluster, a master and several regionservers
- The requests must be made to the master(s)
- On the HDFS level, existing data are not sharded automatically but new data are
- On the HBase level, the data are divided into regions that are sharded automatically across regionservers

MongoDB 1.6.5

Overview
MongoDB is a document-oriented database that stores JSON dictionaries. It provides auto-sharding and a MapReduce implementation.
- A MongoDB cluster is made of several kinds of servers:
  - The shard servers that store data
  - The configuration servers that store the configuration
  - The router servers that receive and route the requests
- Existing and new data are sharded automatically
- MapReduce can only use one thread per server

Riak 0.14

Overview
Riak is a fully distributed key/bucket store with an implementation of MapReduce.
- Buckets can store the data directly or be a link to another bucket
- All the nodes in the cluster play the same role
- The data (existing and new) are sharded automatically among the nodes
- The developer can choose the consistency level for each request

The data

Wikipedia export
- 20,000 pages downloaded from Wikipedia
- Every document is in XML format
- All documents sum up to 620 MB
- Each document is associated with a single integer ID

Insertions
- Each document is inserted only once during the whole benchmark

The client

Overview
- Fully random requests
- Acts as a perfect load balancer
- The proportion of updates can be specified
- Specific parts: read/write/update and MapReduce

Updates
- The updates simply concatenate the string "1" at the end of the article.
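The client behaviour described above can be sketched roughly as follows. This is only an illustration, not the benchmark's actual code: `InMemoryStore`, `run_requests` and the `seed` parameter are hypothetical names, the store is a plain dictionary standing in for a real database driver, and the load-balancing across cluster nodes is left out.

```python
import random


class InMemoryStore:
    """Hypothetical stand-in for a database client (not a real driver)."""

    def __init__(self, documents):
        self.documents = dict(documents)

    def read(self, doc_id):
        return self.documents[doc_id]

    def update(self, doc_id, content):
        self.documents[doc_id] = content


def run_requests(store, doc_ids, total_requests, read_percentage, seed=None):
    """Issue fully random reads and updates; return (reads, updates)."""
    rng = random.Random(seed)
    reads = updates = 0
    for _ in range(total_requests):
        # Every request targets a document ID picked uniformly at random.
        doc_id = rng.choice(doc_ids)
        if rng.random() * 100 < read_percentage:
            store.read(doc_id)
            reads += 1
        else:
            # An update reads the article and appends the string "1" to it.
            article = store.read(doc_id)
            store.update(doc_id, article + "1")
            updates += 1
    return reads, updates
```

A call such as `run_requests(store, list(store.documents), 1000, 80)` would then issue about 80% reads and 20% updates.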

MapReduce

Overview
MapReduce is used to build a reverse index for a given keyword. The reverse index is a list of pairs, kept only for articles where Count ≠ 0, made of:
- ID: the ID of the article
- Count: the number of occurrences of the keyword in this article

Justification
This kind of computation implies that all the documents are crawled, and it takes advantage of the specifications of MapReduce.
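As a rough illustration of the reverse index computation, here is a minimal single-process sketch in Python. It is not the code run on any of the four databases: the function names are hypothetical, and counting with `str.count` (case-sensitive, non-overlapping substring matches) is an assumption about what "occurrences" means.

```python
from collections import defaultdict


def map_article(article_id, text, keyword):
    """Map phase: emit (article_id, count) only when the keyword occurs."""
    count = text.count(keyword)  # assumption: plain substring matching
    if count != 0:
        yield article_id, count


def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each article ID."""
    totals = defaultdict(int)
    for article_id, count in pairs:
        totals[article_id] += count
    return sorted(totals.items())


def build_reverse_index(articles, keyword):
    """Run the map phase over every article, then reduce the emitted pairs."""
    emitted = []
    for article_id, text in articles.items():
        emitted.extend(map_article(article_id, text, keyword))
    return reduce_counts(emitted)
```

Because every article must be scanned, the map phase is the part a real cluster would spread across its nodes.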

The methodology

1. Start up a clean cluster of size 3 and insert all the documents
2. Choose a total number of requests and a read percentage, then start the benchmark
3. Wait one minute and start the benchmark again
4. Wait five minutes and start the benchmark again
5. Start the MapReduce benchmark
6. Add a new node to the cluster and wait for it to be ready, then immediately restart the benchmark with the new node's IP in the list
7. Jump to 3 until there are no more computers to add to the cluster
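The procedure above can be written down as an explicit schedule of actions. The sketch below only generates that schedule; it does not drive any cluster. The function name, the action labels and the assumption that nodes are added one at a time up to a known `max_size` are all hypothetical.

```python
def benchmark_schedule(initial_size, max_size):
    """Return the ordered list of (action, argument) pairs for one campaign."""
    if max_size < initial_size:
        raise ValueError("max_size must be at least initial_size")
    size = initial_size
    steps = [
        ("start_cluster", size),      # step 1
        ("insert_documents", None),   # step 1
        ("run_benchmark", size),      # step 2
    ]
    while True:
        steps.append(("wait_seconds", 60))    # step 3
        steps.append(("run_benchmark", size))
        steps.append(("wait_seconds", 300))   # step 4
        steps.append(("run_benchmark", size))
        steps.append(("run_mapreduce", size))  # step 5
        if size == max_size:
            break  # step 7: no more computers to add
        size += 1
        steps.append(("add_node", size))       # step 6
        steps.append(("run_benchmark", size))  # immediate restart
    return steps
```

For the cluster sizes used in these slides, the schedule would be `benchmark_schedule(3, 8)`.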

Read/update results (figure)

Read/update results without HBase (figure)

MapReduce performance (figure)

The HBase case

Verifications made:
- Checked the logs: nothing seemed problematic
- HDFS level: running the balancer with a very low threshold distributed the blocks evenly, but without any impact on performance
- HBase level: the regions were always nearly evenly distributed across the regionservers
- The number of rows did not change and the content of each row was correct

Summary of raw performances

DB          Read/update performance    MapReduce performance
Cassandra   Good                       Very Good
HBase       Bad / N.A.                 Average / N.A.
MongoDB     Good                       Poor but scalable
Riak        Poor / unstable            Average but scalable

Summary of scalability

Going from 3 to 8 servers brings capacity to roughly 266% of its initial value (8/3 ≈ 2.67). Here are the observed increases in performance:

DB                   Read/update    MapReduce
Cassandra            153%           112%
HBase                11%            43%
MongoDB              145%           211%
Riak                 74%            189%
Riak (7 nodes max)   155%           168%
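To put the figures above side by side with the added capacity, one can divide each observed percentage by the capacity percentage. The sketch below does that under a loud assumption: the slides do not say how the observed percentages were computed, so it simply treats them as being on the same basis as the 266% capacity figure. The function names are hypothetical, and the "Riak 7 nodes max" row is left out because its basis is unclear.

```python
def capacity_percent(initial_nodes, final_nodes):
    """Capacity after scaling, as a percentage of the initial capacity."""
    return 100.0 * final_nodes / initial_nodes


def scaling_ratio(observed_percent, initial_nodes=3, final_nodes=8):
    """Observed figure divided by the capacity figure (1.0 would be linear)."""
    return observed_percent / capacity_percent(initial_nodes, final_nodes)


# Figures copied from the scalability summary: (read/update, MapReduce).
OBSERVED = {
    "Cassandra": (153, 112),
    "HBase": (11, 43),
    "MongoDB": (145, 211),
    "Riak": (74, 189),
}

if __name__ == "__main__":
    for name, (read_update, mapreduce) in OBSERVED.items():
        print(name, round(scaling_ratio(read_update), 2),
              round(scaling_ratio(mapreduce), 2))
```

Under that assumption, none of the ratios reaches 1.0, which matches the conclusion that the elastic gain is not linear.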

Conclusion and future work

Conclusion
- The elastic gain seems more apparent than with YCSB, but it is not linear either
- It is worth testing MapReduce performance, as the results vary a lot between databases for both raw performance and scalability

Future work
This is still a work in progress:
- Applying this benchmark to other databases (Terrastore, Voldemort, Scalaris...)
- Trying with a growing/bigger data set

Questions and remarks

Any questions or remarks?