NoSQL Database Options

Introduction

For this report, I chose to look at MongoDB, Cassandra, and Riak. I chose MongoDB because it is widely used in industry. I chose Cassandra because it has the most promise to benefit me in my endeavors as a game developer. I chose Riak because I want to learn more about key-value stores, and neither of my other two choices is one. For my analysis, I will compare the systems as potential storage options, writing for a (potentially non-technical) direct supervisor.

Analysis: MongoDB

MongoDB was created in 2007 by a company called 10gen, which changed its name to MongoDB Inc. following the document store's wild success [1]. The product went open source in 2009 and has since gained great popularity.

MongoDB's data model is document-oriented and highly flexible. For example, we can retrieve data using regular expressions, which means we can match a pattern instead of knowing exactly what we're looking for. In addition, any field can be indexed in MongoDB, which gives us lookup behavior similar to our existing relational databases, so you shouldn't worry about losing that functionality.

MongoDB stores data on disk and scales horizontally through a process called sharding [2]. In practice, sharding spreads our data across different servers, and each shard can be replicated so that if one server dies, no data is lost. We can also add new servers to the database without shutting it down, so we can add more space without suspending our services.

In the CAP model, MongoDB's position depends on how it is configured. By default, reads and writes go to the primary copy of the data, which favors consistency and partition tolerance (the system keeps working when the network between servers fails). When reads are allowed from secondary copies to improve availability, MongoDB behaves as eventually consistent: changes are propagated to the other servers, but before those changes spread, it is possible to retrieve stale data from the database. This means that MongoDB can serve web services that value availability over consistency, but the possibility of stale reads is a shortcoming worth keeping in mind. MongoDB has also traditionally been ACID compliant only per document, not per multi-document transaction as we would think of it in relational terms.

As previously mentioned, MongoDB uses sharding to handle scaling. Its wide adoption in the web industry suggests that it scales well. Having said that, some companies, including Netflix, chose Cassandra over MongoDB precisely because of scalability. So MongoDB may not be the best option for scaling, but it scales out far more easily than a typical relational database.

Analysis: Cassandra

Cassandra started off as a column family store created by Facebook employees to drive their Inbox Search functionality. Facebook open sourced the project in 2008, and the Apache Software Foundation picked it up and carried it forward. Today, Cassandra has evolved into a partitioned row store. Cassandra uses the Cassandra Query Language (CQL), which looks and feels like SQL and is therefore quite helpful for existing relational database programmers. Following this parallel, a column family in Cassandra is similar to a table in a relational database. Cassandra has also supported MapReduce since version 0.6.

Cassandra stores data on disk across multiple nodes in a multi-server environment. The disk storage is organized into tables that are distributed, multi-dimensional maps indexed by key. By default, Cassandra settles for eventual consistency, focusing on availability and partition tolerance. However, Cassandra also supports tunable consistency [3]. This means that each read or write can trade consistency against availability by telling the coordinating node (Cassandra has no master node) how many replica nodes need to acknowledge the operation before it is considered successful.
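In addition, although Cassandra is not fully ACID compliant, it has supported lightweight transactions (compare-and-set operations) since Cassandra 2.0, and individual writes are atomic and isolated at the row level.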
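
The tradeoff behind tunable consistency comes down to a little arithmetic. The following is a minimal Python sketch, not Cassandra's API: the names `ConsistencySetting`, `quorum`, and `reads_see_latest_write` are hypothetical. It assumes the usual replica-overlap rule, under which a read is guaranteed to touch at least one replica holding the latest acknowledged write whenever the replicas read plus the replicas written exceed the total number of replicas.

```python
from dataclasses import dataclass


def quorum(replicas: int) -> int:
    """Smallest majority of `replicas` copies (hypothetical helper)."""
    return replicas // 2 + 1


@dataclass(frozen=True)
class ConsistencySetting:
    """Illustrative model of one read/write consistency choice."""
    replicas: int    # N: copies of each row kept in the cluster
    write_acks: int  # W: replicas that must acknowledge a write
    read_acks: int   # R: replicas that must answer a read

    def __post_init__(self) -> None:
        # Reject settings that ask for fewer than one or more than N replicas.
        for name in ("write_acks", "read_acks"):
            value = getattr(self, name)
            if not 1 <= value <= self.replicas:
                raise ValueError(f"{name} must be between 1 and {self.replicas}")

    def reads_see_latest_write(self) -> bool:
        # If R + W > N, every read set overlaps every write set in at
        # least one replica, so the newest acknowledged value is visible.
        return self.read_acks + self.write_acks > self.replicas


# Fast but only eventually consistent: one replica per read and write.
fast = ConsistencySetting(replicas=3, write_acks=1, read_acks=1)

# Slower but consistent: a majority of replicas per read and write.
safe = ConsistencySetting(replicas=3, write_acks=quorum(3), read_acks=quorum(3))
```

With three replicas, the `fast` setting tolerates two unavailable replicas but may return stale data, while the `safe` setting tolerates only one unavailable replica but always overlaps with the latest acknowledged write. This is the tradeoff between consistency and availability that tuning adjusts.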
When it comes to scaling, Cassandra seems to be the name of the game. Netflix reportedly chose Cassandra over the other two databases analyzed here because of scalability. Cassandra supports horizontal scaling (adding new machines) while the database is running. FamilySearch is migrating to it from Oracle relational databases to handle live scaling, as it has weekly spikes in traffic that greatly multiply its live accesses.

Analysis: Riak

Riak is a key-value storage system developed by people who moved from Akamai to Basho Technologies. Its first release was in August 2009. Their goal was to create a web product that happened to use their own custom datastore on the backend. When the datastore generated more interest than the web product, they decided to center their efforts on it, and it became Riak. Since 2009, Riak has matured to offer adaptive CAP approaches in which either eventual or immediate consistency can be supported. Riak exposes a RESTful HTTP API for its basic operations, using methods such as PUT, GET, DELETE, and POST. It also supports MapReduce.

In Riak, data is stored as key-value pairs that can be kept in memory, on disk, or both. As with the other options discussed here, data is stored across multiple nodes on a network. Keys are located in near-constant time by hashing them. Riak offers tunable consistency, similar to Cassandra, but per bucket of key-value pairs, which allows either eventual or immediate consistency. As far as ACID goes, Riak does not support atomic transactions and is therefore not ACID compliant.

Riak, like many other NoSQL databases, was intended to be used across a network with many nodes for redundancy. While the free version stops there, Riak Enterprise can replicate data across multiple data centers, not just across multiple servers in one center. This puts it on the same scale as Cassandra, but informal comparisons by business leaders suggest that Riak doesn't scale quite as well.
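
To make the idea of locating keys by hashing more concrete, here is a minimal Python sketch of a consistent-hash ring, the general technique that distributed key-value stores use to decide which node owns a key. It is a simplified illustration and not Riak's actual implementation: `HashRing`, `ring_position`, the node names, and the choice of 64 ring points per node are all assumptions made for the example. Lookup cost here depends on the number of ring points, not on the number of stored keys.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a position on the ring using SHA-1."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class HashRing:
    """Toy consistent-hash ring (hypothetical, for illustration only)."""

    def __init__(self, nodes, points_per_node=64):
        self._points_per_node = points_per_node
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node claims several points so that load spreads evenly.
        for i in range(self._points_per_node):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring has no nodes")
        # The owner is the first ring point at or after the key's
        # position, wrapping around to the start of the ring.
        index = bisect.bisect_right(self._ring, (ring_position(key), ""))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"player:{n}" for n in range(1000)]
before = {key: ring.node_for(key) for key in keys}

# Add a server while the ring is "running" and see which keys change owner.
ring.add_node("node-d")
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Every key in `moved` now belongs to `node-d`, and all other keys stay where they were. That property is what lets a cluster add servers without reshuffling most of its data.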
Difference Comparison

When comparing these three NoSQL databases, the first factor that comes up is scalability. All three offer scaling across a network of servers, but Cassandra seems to outshine the other two in the opinion of large companies that have compared these databases for large-scale use. In terms of data mining, all three options support MapReduce. In addition, MongoDB supports regex lookup, which neither of the other options explicitly supports.

Conclusion

In conclusion, I believe that for general business use, Cassandra would be the best option. It has the best scalability, which lowers the chances of needing a massive database overhaul or migration in the future. It also offers a SQL-like language that will make retraining existing SQL developers much easier while still allowing complex queries without MapReduce. Finally, it seems to offer the most adjustable consistency, at least compared with MongoDB if not Riak, which could come in handy as business needs change.
REFERENCES:

[1] Harris, Derrick. 10gen embraces what it created, becomes MongoDB Inc. Gigaom Research, August 27, 2013. https://gigaom.com/2013/08/27/10gen-embraces-what-it-created-becomes-mongodbinc/

[2] MongoDB, Inc. Sharding and MongoDB. https://docs.mongodb.org/manual/sharding/

[3] Configuring Data Consistency. Datastax Documentation, October 12, 2015. http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html