Michael Sharp
Big Data CS401r
Lab 3

For this paper I decided to research MongoDB, Cassandra, and DynamoDB. I chose these three NoSQL databases because I wanted to see the two sides of the CAP theorem that relational databases are not on, and because I wanted to pick one database from each of several different types of NoSQL database, so that I could better understand the differences between them and when each type should be used. I did not choose a graph database because it is the type I am least likely to use. I also chose these three because they are fairly large players in the realm of NoSQL databases.

MongoDB was created by Kevin P. Ryan and Dwight Merriman, the founders of DoubleClick (Chodorow). After leaving DoubleClick together they founded several new startups, and each time they ran into the same problem: they could not find an effective way to store their data in an easily scalable manner. In the fall of 2007 they founded a company called 10gen, where they created two new products: an app engine and MongoDB.

"MongoDB stores data in the form of documents, which are JSON-like field and value pairs" (MongoDB). These documents are very similar to the data structures in programming languages that associate a key with a value, such as a map or dictionary. The key-value pairs are stored as BSON, a binary representation of JSON that carries some additional type information. Because values are serialized this way, a field can hold anything from a simple value to another document or even an array of documents. This allows easy storage of unstructured data that varies from record to record.

All of these documents are collected and stored in what is called a collection. A collection is just a group of related documents that have some sort of shared index. Essentially, a collection can be thought of as a table in a relational database. Within a collection you can perform the usual operations you would expect from a relational database: queries, updates, deletes, and creations. One downside of storing documents in collections is that each operation can interact with only one collection at a time. This means that if you need to query across collections, you have to run multiple queries and hold on to the intermediate results yourself. The workaround is to store as much of the data as you can within the same collection, but this is feasible only if the data is truly connected; putting unrelated data together in the same collection can cause many problems.

MongoDB can store its data on the backend in two different ways, MMAPv1 and WiredTiger (MongoDB). MMAPv1 is the default, though as WiredTiger continues to improve it may become the new default. MMAPv1 supports database-level locking from release 2.2 onward and collection-level locking from version 3.0 onward, but it does not support document-level locking. This means that when someone needs to write to a collection, the whole collection is locked, not just the individual document. While MongoDB allows multiple readers at a time, it allows only one writer, who also blocks all other writers and all readers, so this can become a bottleneck. WiredTiger fixes some of these issues by fully supporting document-level locking, but it does so by storing the data in a B-tree, so lookup becomes O(log n) instead of O(1).
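
The difference between collection-level and document-level locking can be sketched with a toy model. The Python below is only an illustration under stated assumptions: `CollectionLockStore`, `DocumentLockStore`, and `second_writer_can_proceed` are hypothetical names, and neither class reflects how MMAPv1 or WiredTiger is actually implemented. It shows only that a single shared lock makes a second writer wait even when it targets a different document, while per-document locks do not.

```python
import threading


class CollectionLockStore:
    """Toy store with one lock for the whole collection (hypothetical model)."""

    def __init__(self, documents):
        self.documents = dict(documents)
        self._lock = threading.Lock()

    def try_lock(self, doc_id):
        # Every write needs the single collection-wide lock, whatever the document.
        return self._lock.acquire(blocking=False)

    def unlock(self, doc_id):
        self._lock.release()


class DocumentLockStore:
    """Toy store with one lock per document (hypothetical model)."""

    def __init__(self, documents):
        self.documents = dict(documents)
        self._locks = {doc_id: threading.Lock() for doc_id in self.documents}

    def try_lock(self, doc_id):
        # Only the lock belonging to this one document is taken.
        return self._locks[doc_id].acquire(blocking=False)

    def unlock(self, doc_id):
        self._locks[doc_id].release()


def second_writer_can_proceed(store):
    """Writer 1 holds document "a"; report whether writer 2 can lock document "b"."""
    assert store.try_lock("a")
    try:
        got_b = store.try_lock("b")
        if got_b:
            store.unlock("b")
        return got_b
    finally:
        store.unlock("a")


docs = {"a": {"name": "Ada"}, "b": {"name": "Bob"}}
print(second_writer_can_proceed(CollectionLockStore(docs)))  # False
print(second_writer_can_proceed(DocumentLockStore(docs)))    # True
```

With the collection-wide lock the second writer is turned away even though it wants a different document, which is the bottleneck described above.
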
Because of that lookup difference, MMAPv1 is better for reading, while WiredTiger is better for writing, since you can lock an individual document and still leave the rest of the collection open for others to use. One of the main downsides of WiredTiger that MMAPv1 doesn't have is the possibility of losing the last sixty seconds of data if something shuts down the database or the journal that logs everything (Peacock).

While MongoDB does not support transactions, it does guarantee consistency at the document level, and it is fully ACID compliant, but ONLY at the document level: not at the database level, not at the collection level, only for individual documents. The amount of data that can be lost varies with the storage engine. MMAPv1 writes all changes to the journal first, so that even if the database is shut down, MongoDB can go back and fill in the lost changes from the journal, and no data is ever permanently lost. WiredTiger, on the other hand, can lose the last sixty seconds of writes if it is shut down.

MongoDB scales fairly well for two main reasons. First, it has replication built in. The manual covers this in some depth, describing how replication is used for data redundancy and fault tolerance, and how reading from several nodes at the same time can increase read speed considerably. Second, it supports sharding. "Sharding is a method for storing data across multiple machines" (MongoDB). It splits the data into smaller data sets so that they can be hosted on multiple smaller servers. Combined with replication, this essentially allows for unlimited scaling.

Next on my list is Cassandra. Cassandra was developed by Avinash Lakshman and Prashant Malik at Facebook in 2008. They decided that they needed a more powerful database to power Facebook's inbox search feature, and so the idea of Cassandra was conceived. Cassandra as we know it today was officially released to the public on April 12, 2010.

Cassandra stores its data in what are called column families. A column family, also known as a table, resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp (Datab.US). While this may sound exactly like a traditional relational database, one of the main features that sets Cassandra apart is how it actually deals with these tables. Unlike tables in a normal relational database, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time (Datab.US).

The similarities to a relational database go beyond the table layout. Even the syntax for manipulating data is very similar. An example insert statement is:

INSERT INTO MyColumns (id, Last, First) VALUES ('1', 'Doe', 'John');

As you can see, it looks basically identical to what you would write for an RDBMS, and the same is true of most other operations. The exception is that Cassandra does not support joins or subqueries, so if these are required they must be done in multiple individual steps.

Cassandra is deployed across multiple servers, with no real maximum on the number of nodes. In fact, Apple revealed at the Cassandra Summit San Francisco 2015 that it runs over 100,000 Cassandra nodes. Because additional nodes are relatively easy to add whenever they are wanted, Cassandra scales extremely well. In a 2012 study, University of Toronto researchers said that "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments" (Rabl, Sadoghi and Jacobsen).

This scaling does come at a cost, though. Of the ACID properties, only atomicity, isolation, and durability are fully supported.
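
As a brief aside on the column family model described earlier in this section, the following Python sketch mimics its two distinguishing features: every column carries a name, value, and timestamp, and rows in the same column family need not share the same columns. It is a simplified illustration, not Cassandra's API or storage format; `Column`, `ColumnFamily`, and the sample values other than the 'Doe'/'John' row are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    value: str
    timestamp: float  # when this column was last written


class ColumnFamily:
    """Toy column family: row key -> {column name -> Column}."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_key, **values):
        # Create the row if needed, then write each column with a timestamp.
        row = self.rows.setdefault(row_key, {})
        now = time.time()
        for name, value in values.items():
            row[name] = Column(name, value, now)

    def column_names(self, row_key):
        return sorted(self.rows[row_key])


my_columns = ColumnFamily()
my_columns.insert("1", Last="Doe", First="John")
my_columns.insert("2", Last="Roe", First="Jane", Email="jane@example.com")
my_columns.insert("1", Phone="555-0100")  # add a column to one existing row only

print(my_columns.column_names("1"))  # ['First', 'Last', 'Phone']
print(my_columns.column_names("2"))  # ['Email', 'First', 'Last']
```

The two rows end up with different sets of columns, which a fixed relational schema would not allow without altering the table.
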
Cassandra does also support consistency, but it is only eventually consistent, and thus does not fully support that property. The Toronto study also notes that "[the scaling] comes at the price of high write and read latencies." It is possible, and in fact probable, that people reading from different data centers at the same time will get different results even for the same query, and this must be considered when thinking about adopting Cassandra.

Finally we have DynamoDB. DynamoDB was announced by Amazon and released as a beta on January 18, 2012 (Amazon). Amazon launched it to add value to its Amazon Web Services platform, and it is one of the few services that Amazon lets you purchase based on throughput rather than storage amount.

Amazon's DynamoDB is a key-value NoSQL database that provides fast and predictable performance with seamless scalability (Amazon). An easy way to think of a key-value store is as a map-like structure in programming. One of the great things about the key-value model is that the value can be anything we want it to be. Since every key is unique and maps to only one value, there is no need to impose a limit on what can be stored. DynamoDB does not allow most of the operations you find in a traditional relational database; it only lets you look up values by their key and then modify them. There are no joins and no complex queries, just simple key lookups to retrieve or modify a value. As I said earlier, imagine you are dealing with a complex map and you will have the right picture of a key-value database.

DynamoDB is stored across multiple servers. Since none of the values are connected to anything else, you don't have to worry about which key-value pairs go where; in fact, there are many algorithms that automatically figure out where each pair needs to go and place it there. The same algorithms allow for easy retrieval when you look up a key: the operation is just reversed and the key's location is found. This allows for very quick reads and writes, as it usually takes only constant time to find a key-value pair.

DynamoDB scales very well. All you have to do is add another node to the cluster, and the database itself figures out the details of what existing data to move to the new node and what new data will go there. Because of this easy-to-scale nature, DynamoDB loses out on the consistency side of the ACID properties. It does not support transactions, but because of the nature of the key-value model and the automatic replication it offers, it is fairly fault tolerant and will not lose data.

All of these databases have their pros and cons, and none of them is a silver bullet. MongoDB is great if you want to store data that may not have a strict schema yet still has things that bind it together. Cassandra gives you the same rows and columns you are used to seeing in an RDBMS, with scalability that an RDBMS cannot give you. DynamoDB is great when you have simple data that needs very high read and write speeds. Each database is different, and your choice of one over another will greatly affect how you design your schema as well as the performance of your application. As you decide which type of database to use, make sure you know the primary purpose of the data and the format it will arrive in. This will allow you to make an educated decision about which type of database is right for you.
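
As an illustration of the automatic key placement described in the DynamoDB discussion above, here is a minimal Python sketch of one well-known placement algorithm, consistent hashing. The paper does not say which algorithm DynamoDB uses, so this is a stand-in chosen for illustration, not a description of DynamoDB's internals; `HashRing`, `ring_position`, and the node and key names are hypothetical.

```python
import bisect
import hashlib


def ring_position(text):
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring without virtual nodes or replication."""

    def __init__(self, nodes):
        self._ring = sorted((ring_position(n), n) for n in nodes)

    def add_node(self, node):
        bisect.insort(self._ring, (ring_position(node), node))

    def node_for(self, key):
        # The owner is the first node at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        positions = [pos for pos, _ in self._ring]
        index = bisect.bisect_left(positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
print(all(after[k] == "node-d" for k in moved))  # True
```

Placement and lookup use the same function, so finding a key is the placement step run again, and adding a node relocates only the keys that the new node takes over.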

Works Cited

Amazon. Amazon DynamoDB Developers Guide. 2015. Website. 20 October 2015. <http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/introduction.html>.
Amazon. Amazon DynamoDB Document History. 2015. Website. 20 October 2015. <http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/documenthistory.html>.
Chodorow, Kristina. History of MongoDB. 23 August 2010. Website. 20 October 2015. <http://www.kchodorow.com/blog/2010/08/23/history-of-mongodb/>.
Datab.US. Apache Cassandra. 2015. Website. 20 October 2015. <http://datab.us/i/apache%20cassandra>.
MongoDB. MongoDB Crud Introduction. 2015. Website. 20 October 2015. <https://docs.mongodb.org/manual/core/crud-introduction/>.
MongoDB. MongoDB FAQ Storage. 2015. Website. 20 October 2015. <https://docs.mongodb.org/manual/faq/storage/>.
Peacock, Simon. MongoDB Storage Engines. 2 April 2015. Website. 20 October 2015. <https://simonlearningsqlserver.wordpress.com/tag/mmap-v1/>.
Rabl, Tilmann, et al. Solving Big Data Challenges for Enterprise Application Performance Management. Toronto: University of Toronto, 2012. PDF.