MongoDB and Couchbase


Benchmarking MongoDB and Couchbase No-SQL Databases Alex Voss Chris Choi University of St Andrews

TOP 2 Questions: Should a social scientist buy MORE computers or UPGRADE existing ones? Which DATABASE(s)?

Document Oriented Databases

The central concept of a document-oriented database is the notion of a Document

Both MongoDB and Couchbase use JSON to store these Documents

Example Documents:
{ FirstName:"Bob", Address:"5 Oak St.", Hobby:"sailing" }
{ FirstName:"Jonathan", Address:"15 Wanamassa Point Road", Children:[ {Name:"Michael",Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5}, {Name:"Elena", Age:2} ] }
There are no empty 'fields', and a document does not need to state explicitly that other pieces of information are left out
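As a rough illustration of the point about missing fields, the two example documents can be written as Python dictionaries. This is only a sketch: the dictionaries stand in for the JSON the databases store, and the helper name `hobby_of` is hypothetical.

```python
import json

# The two example documents from the slide, as plain Python dictionaries.
bob = {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"}
jonathan = {
    "FirstName": "Jonathan",
    "Address": "15 Wanamassa Point Road",
    "Children": [
        {"Name": "Michael", "Age": 10},
        {"Name": "Jennifer", "Age": 8},
        {"Name": "Samantha", "Age": 5},
        {"Name": "Elena", "Age": 2},
    ],
}


def hobby_of(doc):
    """Return the Hobby field, or None when the document simply omits it."""
    return doc.get("Hobby")


# Serialising to JSON keeps only the fields each document actually has.
bob_json = json.dumps(bob)
jonathan_json = json.dumps(jonathan)
```

Neither document carries a placeholder for the fields it lacks: `jonathan` has no `Hobby` key at all, and `bob` has no `Children` key.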

The Database Tests

Database Test Aggregate/Count: Typical questions
- Most User Mentioned
- Most Hashed Tags
- Most Shared URLs

MongoDB has two query approaches: Map Reduce & Aggregation Framework

Map Reduce For Aggregation Uses JavaScript

Aggregation Framework Similar to Map Reduce But Uses C++ (Faster!)
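To make the map and reduce steps concrete, here is a minimal sketch of a "Most Hashed Tags" count in plain Python. It is not MongoDB's JavaScript `mapReduce` or its Aggregation Framework, only an illustration of the same idea; the document layout (a `hashtags` list per tweet) and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_hashtags(tweet):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one document."""
    for tag in tweet.get("hashtags", []):
        yield tag.lower(), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def most_hashed_tags(tweets, top=3):
    """Run map then reduce, and return the most frequent hashtags first."""
    pairs = (pair for tweet in tweets for pair in map_hashtags(tweet))
    totals = reduce_counts(pairs)
    # Sort by descending count, then alphabetically to keep ties stable.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical sample data; the third document has no hashtags field at all.
SAMPLE_TWEETS = [
    {"user": "a", "hashtags": ["NoSQL", "MongoDB"]},
    {"user": "b", "hashtags": ["nosql"]},
    {"user": "c"},
    {"user": "d", "hashtags": ["Couchbase", "NoSQL"]},
]
```

Calling `most_hashed_tags(SAMPLE_TWEETS)` ranks `nosql` first with a count of 3. The same shape of query would serve for user mentions or shared URLs by changing the field that the map step reads.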

Chart: Single Node, SSD vs HDD. Two panels: Map Reduce (Most User Mentioned, Most Hashed-Tags, Most Shared URLs) and Aggregation Framework (Most User Mentioned, Most Hashed-Tags), each comparing HDD with SSD.

Single Node Aggregation Framework is roughly 2 times faster than Map Reduce

When we have more than 1 node ~ Scaling Generally...

Scaling Vertically: a BIGGER and FATTER machine! Scaling Horizontally: MORE machines!

Replica-Set: Master-Slave Replication. Replicates data from master to slave. Only one node is in charge (the Master), and data only travels in one direction (to the Slave)

Chart: Replica-Set, Single Node vs Replica Set, Aggregation Framework. Time (minutes) for User Mentioned, Hashed-Tags and Shared URLs.

Chart: Replica-Set, Single Node vs Replica Set, Map Reduce. Time (minutes) for User Mentioned, Hashed-Tags and Shared URLs.

Replicating Data did not improve query performance (MapReduce and Aggregation Framework)... so we tried Sharding

Sharding: One BIG stack is split into two or more smaller stacks
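The idea of splitting one big stack into smaller ones can be sketched with generic hash partitioning. This is an illustration only, not a description of how MongoDB places data on shards; the function names and the choice of `_id` as the shard key are assumptions.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard index for a key using a stable hash of its text form."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def split_into_shards(docs, shard_key, num_shards):
    """Distribute documents into num_shards lists according to their shard key."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc[shard_key], num_shards)].append(doc)
    return shards


# One "big stack" of 100 hypothetical documents, split into 4 smaller stacks.
big_stack = [{"_id": i, "text": "tweet %d" % i} for i in range(100)]
small_stacks = split_into_shards(big_stack, "_id", 4)
```

Every document lands in exactly one shard, and the same key always maps to the same shard, so a query that counts per shard and then adds the partial counts sees each document once.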

Master-Slave Replication uses a democratic hierarchical scaling approach

But Deployment is an issue with MongoDB... If you want to shard...

MongoDB's documentation suggests a minimum of 11 nodes to build a fail-safe sharded cluster:
2 Shards: 6 nodes (3 each)
2 mongos (load balancer): 2 nodes
3 mongod (config servers): 3 nodes
-------------------------------------------
Total: 11 nodes
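The node count above is simple arithmetic, and a small sketch shows how quickly it grows with the number of shards. The function and parameter names are hypothetical; the defaults are the figures quoted on the slide (3 nodes per shard, 2 mongos, 3 config servers).

```python
def sharded_cluster_nodes(shards, nodes_per_shard=3, mongos=2, config_servers=3):
    """Total machines for a fail-safe sharded cluster, one process per node."""
    return shards * nodes_per_shard + mongos + config_servers


# The slide's minimum configuration: 2 shards.
minimum_nodes = sharded_cluster_nodes(2)

# Each additional shard adds another 3 nodes under the same assumptions.
four_shard_nodes = sharded_cluster_nodes(4)
```

With 2 shards this gives 6 + 2 + 3 = 11 nodes, matching the total quoted from the documentation.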

The investment to go from single node --> sharded cluster is very HIGH!

By omitting Fail-Safe features. For our tests, we kind of cheated and used (Note: X is variable):
X Shards (instead of 2): 2 nodes (instead of 6)
1 mongos (load balancer) (instead of 2): 0.5 node (instead of 2)
1 mongod (config server) (instead of 3): 0.5 node (instead of 3)
-------------------------------------------
Total: 3 nodes (instead of 11)
NOT RECOMMENDED IN PRODUCTION

Which looks something like this: 4 Shard configuration

Or this: 8 Shard configuration

Another problem is with querying in a sharded environment...

We couldn't perform queries using the Aggregation Framework in the shards, because it was limited by:
- Output at every stage of the aggregation can only contain 16MB
- >10% RAM usage
MongoDB would fail if the limits were reached

Chart: Sharded Cluster, Map Reduce, on 3 modest machines (4 Cores, 8GB RAM). Time (minutes) against Number of Shards (2, 4, 8) for Most hash tags, Most user mentions and Most shared urls, with the optimal configuration marked.

Chart: Sharded Cluster, Map Reduce, on 3 modest machines (4 Cores, 8GB RAM). Two panels side by side: Single Node (HDD vs SSD) and Three Nodes (2, 4, 8 shards). Time (minutes) for Most Hashed-Tags, Most User Mentioned and Most Shared URLs.

Performance of a Single Node (SSD) is very comparable to 3 nodes (HDD) across different numbers of shards! (Single Node with SSD is 2-3 minutes slower than 3 nodes... [40 GB corpus]) Perhaps it's worth purchasing a good SSD over investing in more nodes to improve aggregate performance...

What happens when we scale vertically with ONE BIG BOX? we get --->

Chart: Sharded Cluster, Map Reduce, on a Big Box (32 Cores, 128GB RAM). Time (minutes) against Number of Shards (2, 4, 8, 16, 24, 28, 30) for Most hash tags, Most user mentions and Most shared urls, with the optimal configuration marked.

Couchbase uses a communistic scaling approach: Master-Master Replication

Chart: Single Node. Time (seconds) by Hard Drive Type (HDD, SSD) for Most user mentioned, Most hashed tags and Most shared urls.

Chart: Single Node VS Map Reduce. Two panels, each comparing HDD with SSD for Most user mentioned, Most hashed tags and Most shared urls; the first panel is labelled Time (seconds).

Chart: Single Node VS Aggregation Framework. Two panels, each comparing HDD with SSD: the first (Time (seconds)) for Most user mentioned, Most hashed tags and Most shared urls, the second for Most User Mentioned and Most Hashed-Tags.

On a SINGLE NODE, while CouchBase is FASTER than MongoDB's MapReduce, MongoDB's Aggregation Framework is FASTER than CouchBase

That said, adding more Couchbase nodes was disappointing...

In short...

Chart: 1 Node VS 2 Nodes VS 3 Nodes. Bars for One Node, Two Nodes and Three Nodes across Most user mentioned, Most hashed tags and Most shared urls.

Adding more nodes did not improve query performance...

Other users have found similar issues. Couchbase Support says: "One known issue with the current developer preview of Couchbase is that it slows down on smaller clusters. The views are optimal on very large clusters." [1] But they did not specify what is very large.
[1] Couchbase performance - real tests - it's horrible - or am I doing somethin wrong?

That was November 2011...

Which one?

The obvious answer is MongoDB, from a performance standpoint.

Simple flat file aggregation with Java took approx 60 mins. On the same machine, MongoDB took approx 30 mins (MapReduce) and approx 10 mins (Aggregation Framework)

Chart: Deployment Options. Total Execution Time (mins) and Approximate Total Cost (GBP) for 1 Node HDD MR, 1 Node SSD MR, 3 Nodes HDD MR, Big Box HDD MR and 1 Node #2 Fast SSD AF (MR = Map Reduce, AF = Aggregation Framework).

But! MongoDB has problems with:
- Full Text Search
- Social Network Graphs
- Aggregation

MongoDB's Full Text Search is slow: 1 simple query on a single node, 44GB corpus, took approx 10 mins!

Solution: A Fast Full Text Search Engine. The same search took only a fraction of a second

Creating a Social Network Graph with MongoDB can be complex, since you query multiple times to obtain relationships

Solution: A Popular Graph Database, where you can obtain a network graph with VERY FEW queries

Example:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof
The above query fetches all nodes that are friends of friends of John
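For contrast with the single graph query, the following sketch shows why the same friends-of-friends question takes several round trips when each person is a separate document. It uses an in-memory dictionary as a stand-in for a document store; the data, the `friends` field and the function names are all hypothetical, and each call to `find_person` represents one database query.

```python
# Hypothetical person documents keyed by name; "friends" holds references.
PEOPLE = {
    "John": {"friends": ["Mary", "Omar"]},
    "Mary": {"friends": ["Lena"]},
    "Omar": {"friends": ["Lena", "Raj"]},
    "Lena": {"friends": []},
    "Raj": {"friends": []},
}


def find_person(name, counter):
    """Stand-in for one database query; counter records how many were issued."""
    counter["queries"] += 1
    return PEOPLE[name]


def friends_of_friends(name):
    """Return (friends of friends, number of lookups needed to find them)."""
    counter = {"queries": 0}
    found = []
    for friend in find_person(name, counter)["friends"]:
        # One extra lookup per direct friend to read that friend's own list.
        for candidate in find_person(friend, counter)["friends"]:
            if candidate not in found:
                found.append(candidate)
    return found, counter["queries"]
```

For John this needs three lookups (John, then Mary and Omar), and the count grows with the number of direct friends, which is the cost that a graph query avoids.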

With Aggregation Is perhaps not the best, we have yet to play with

Or even An SQL language to build MapReduce queries With

Moral of the story: Use the right tools for the right job

Future work We aim to experiment with: With possibility in integrating all with