Michael Sharp
Big Data CS401r
Lab 3

For this paper I decided to research MongoDB, Cassandra, and DynamoDB. I chose these three NoSQL databases because I wanted to see the two sides of the CAP theorem that relational databases are not on, and because I wanted each one to come from a different type of NoSQL database, so that I could better understand the differences between the types and when I should use each one. I did not choose a graph database because it is the type that I am least likely to use. I also chose these three because they are fairly large players in the realm of NoSQL databases.

MongoDB was created at 10gen, a company founded in fall of 2007 by Kevin P. Ryan and Dwight Merriman, both formerly of DoubleClick (Chodorow). After leaving DoubleClick they founded multiple new startups, and at each one they kept running into the same problem: they could not find an effective way to store their data in an easily scalable manner. At 10gen they created two new products: one was an app engine, and the other was MongoDB.

"MongoDB stores data in the form of documents, which are JSON-like field and value pairs" (MongoDB). These documents are very similar to data structures in programming languages that associate a key with a value, like a map or dictionary. The key-value pairs are stored in BSON, a binary representation of JSON with some additional type information. A value can be anything from a simple number or string to another document or even an array of documents. This allows for easy storage of non-structured data that can vary from record to record. All of these documents
are collected and stored in what is called a collection. A collection is a group of related documents that share a set of common indexes. Essentially, a collection can be thought of as a table in a relational database. On a collection, one can perform the operations one would normally perform in a relational database: queries, updates, deletes, and inserts. One downside of storing documents in collections is that each operation can only interact with one collection at a time. This means that a cross-collection query requires running multiple queries and storing the intermediate results in between. The workaround is to store as much related data as possible within the same collection. This is only feasible if the data is truly connected; putting unrelated data together in the same collection can cause many problems.

MongoDB can store its data on the backend with two different storage engines, MMAPv1 and WiredTiger (MongoDB). MMAPv1 is the default, though as WiredTiger continues to improve, it may become the new default. MMAPv1 supports database-level locking from release 2.2 onward and collection-level locking from version 3.0 onward, but it does not support document-level locking. This means that when someone writes to a collection, the whole collection is locked, not just the individual document. While MongoDB with MMAPv1 supports multiple readers at a time, it allows only one writer, who blocks all other writers and all readers, so this can become a bottleneck. WiredTiger fixes some of these issues by fully supporting document-level locking, but it does so by storing the data in a B-tree, so lookup becomes O(log n) instead of O(1).
Hence, MMAPv1 is better for reading, as lookup is faster, while WiredTiger is better for writing, as you can lock an individual document and still leave the rest of the collection open for others to use. One of the main
downsides to WiredTiger that MMAPv1 does not have is the possibility of losing the last sixty seconds of data if something shuts down the database or the journal that logs everything (Peacock).

MongoDB does not support transactions. It guarantees consistency and is fully ACID compliant, but only at the level of the individual document, not at the collection level or the database level. The amount of data that can be lost varies with the storage engine used. MMAPv1 writes all changes to the journal first, so that even if the database is shut down, MongoDB can go back and replay the lost changes from the journal, and no data is permanently lost. WiredTiger, as noted above, can lose the last sixty seconds of writes if it is shut down.

MongoDB is able to scale fairly well for two main reasons. First, it has replication built in. The manual covers this at length and describes how replication provides data redundancy and fault tolerance, as well as the ability to read from several nodes at the same time, which increases read throughput considerably. Second, it supports sharding. "Sharding is a method for storing data across multiple machines" (MongoDB). It splits the data into smaller data sets that can be hosted on multiple smaller servers. Combined with replication, this essentially allows for unlimited scaling.

Next on my list is Cassandra. Cassandra was developed by Avinash Lakshman and Prashant Malik at Facebook in 2008. They decided that they needed a more powerful database to power their inbox search feature, and so the idea of Cassandra was conceived. Cassandra, as we know it today, was officially released to the public on April 12, 2010.
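To make the sharding idea from the MongoDB discussion above more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not MongoDB's actual implementation: the three "servers" are plain in-memory dictionaries, the `user_id` shard key is made up, and the helper names `shard_for`, `insert_document`, and `find_document` are hypothetical. It also assumes that a hash of the key decides which shard owns a document, which is just one possible placement strategy.

```python
import hashlib

# Three hypothetical "servers", each modeled as an in-memory dict of documents.
NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a shard key to a shard index using a stable hash (an assumption)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def insert_document(doc, key_field="user_id"):
    """Store the document on whichever shard its key hashes to."""
    index = shard_for(doc[key_field])
    shards[index][doc[key_field]] = doc
    return index


def find_document(key):
    """Only the one shard that owns the key needs to be consulted."""
    return shards[shard_for(key)].get(key)


if __name__ == "__main__":
    for user_id in range(10):
        insert_document({"user_id": user_id, "name": f"user{user_id}"})
    # Show how many documents ended up on each hypothetical server.
    print([len(s) for s in shards])
    print(find_document(4))
```

The point of the sketch is that each document lives on exactly one shard, so both the write and the later lookup touch a single server, and adding servers spreads the data set across more machines.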
Cassandra stores its data in what are called column families. A column family, also known as a table, resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, a value, and a timestamp (Datab.US). While this may sound exactly like a traditional relational database, one of the main features that sets Cassandra apart is how it deals with these tables. Unlike tables in a normal relational database, "different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time" (Datab.US).

Despite this difference, the similarities extend to the query language. Even the syntax to manipulate the data is very similar to that of a relational database. An example insert statement is:

    INSERT INTO MyColumns (id, Last, First) VALUES ('1', 'Doe', 'John');

As you can see, it looks basically identical to the equivalent statement in an RDBMS. The same is true for the majority of Cassandra's other operations. The exception is that Cassandra does not support joins or subqueries, so if these are required they must be done in multiple individual steps.

Cassandra runs as a multi-server distributed cluster, and the number of nodes has no real maximum. In fact, Apple revealed at Cassandra Summit San Francisco 2015 that it runs over 100,000 Cassandra nodes. Because of the relative ease of adding nodes whenever they are wanted, Cassandra has excellent scaling performance. In a 2012 study, University of Toronto researchers said that "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments" (Rabl, Sadoghi and Jacobsen). This scaling does come at a cost, though. Of the ACID properties, only atomic, isolated, and durable transactions are fully supported.
Cassandra does also support consistency,
but it is only eventually consistent, and thus does not fully support that property. The Toronto study likewise states that "[the scaling] comes at the price of high write and read latencies." It is possible, and in fact probable, that people reading from different data centers at the same time will get different results even with the same queries, and this must be considered when thinking about implementing Cassandra.

Finally we have DynamoDB. DynamoDB was announced by Amazon and released as a beta version on January 18, 2012 (Amazon). Amazon launched it to add value to its Amazon Web Services platform, and it is one of the few services that Amazon lets you purchase based on throughput rather than storage amount.

Amazon's DynamoDB is a key-value NoSQL database that "provides fast and predictable performance with seamless scalability" (Amazon). An easy way to think of a key-value store is as a map-like structure in programming. One of the great things about a key-value database is that the value can be anything we want it to be. Since every key is unique and maps to only one value, there is no need to limit what can be stored. DynamoDB does not allow most of the operations found in a traditional relational database; it only allows you to look up values by their key and then modify them. There are no joins and no complex queries, just simple key lookups to retrieve or modify the value. As I said earlier, imagine you are dealing with a complex map and you will have the image of a key-value database in your mind.

DynamoDB is stored across multiple servers. Since none of the values are connected to anything else, you don't have to worry about which key-value pairs go where; in fact, there are many algorithms that will automatically figure out where they need to go and place them there. These same algorithms allow for easy retrieval when you are looking for the key as
well: the operation is simply reversed and the key's location is found. This allows for very quick reads and writes, as it usually takes only constant time to find a key-value pair.

DynamoDB scales very well. All you have to do is add another node to the cluster, and the database itself will figure out the details of what data to move to the new node as well as what new data will go there. Because of this easy-to-scale nature, DynamoDB does give up some of the consistency among the ACID properties. It does not support transactions, but because of the nature of the key-value system and the automatic replication it offers, it is fairly fault tolerant and will not lose data.

All of these databases have their pros and cons, and none of them is a silver bullet. MongoDB is great if you want to store data that may not have a strict schema, yet still has things that bind the data together. Cassandra gives you the same rows and columns you are used to seeing in an RDBMS, yet gives you scalability that an RDBMS cannot. DynamoDB is great when you have simple data that needs very high read and write speeds. Each database is different, and your choice of one over another will greatly affect how you design your schema, as well as the performance of your application. As you decide which database type to use, make sure you know the primary purpose of the data, as well as the format it will arrive in. This will allow you to make an educated decision about which type of database is right for you.
Works Cited

Amazon. Amazon DynamoDB Developers Guide. 2015. Website. 20 October 2015. <http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/introduction.html>.

Amazon. Amazon DynamoDB Document History. 2015. Website. 20 October 2015. <http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/documenthistory.html>.

Chodorow, Kristina. History of MongoDB. 23 August 2010. Website. 20 October 2015. <http://www.kchodorow.com/blog/2010/08/23/history-of-mongodb/>.

Datab.US. Apache Cassandra. 2015. Website. 20 October 2015. <http://datab.us/i/apache%20cassandra>.

MongoDB. MongoDB Crud Introduction. 2015. Website. 20 October 2015. <https://docs.mongodb.org/manual/core/crud-introduction/>.

MongoDB. MongoDB FAQ Storage. 2015. Website. 20 October 2015. <https://docs.mongodb.org/manual/faq/storage/>.

Peacock, Simon. MongoDB Storage Engines. 2 April 2015. Website. 20 October 2015. <https://simonlearningsqlserver.wordpress.com/tag/mmap-v1/>.

Rabl, Tilmann, et al. Solving Big Data Challenges for Enterprise Application Performance Management. Toronto: University of Toronto, 2012. PDF.