Benchmarking and Analysis of NoSQL Technologies




Suman Kashyap, Shruti Zamwar, Tanvi Bhavsar, Snigdha Singh
Cummins College of Engineering for Women, Karvenagar, Pune 411052

Abstract: The social web generates terabytes of unstructured, user-generated data, spread across thousands of commodity servers. The changed face of web-based applications has forced the invention of new approaches to data management. NoSQL databases were created as a means to offer high performance (both in terms of speed and size) and high availability, and they share the goal of massive scaling on demand (elasticity). Hence NoSQL is well suited to the big data needs posed by evolving web applications. With newer NoSQL technologies appearing every day, developers face a serious problem in selecting the datastore best suited to their application, as each of them places very different demands. NoSQL databases have rarely been tested against their claims. In this paper we discuss benchmarking results for four of the most popular NoSQL technologies, namely Cassandra, HBase, MongoDB and CouchDB, and we analyse the results based on the performance of each datastore on various parameters. We use the YCSB (Yahoo! Cloud Serving Benchmark) tool to compare Cassandra vs. HBase (column-oriented stores) and MongoDB vs. CouchDB (document-oriented databases). We benchmark the NoSQL datastores on the tiers of Performance, Scalability and Availability.

Keywords: YCSB, Elasticity, Availability, Scale

I. INTRODUCTION

Relational database technology, invented in the 1970s and still in widespread use today, was optimized for the applications, users and infrastructure of that era. In some regards, it is the last domino to fall in the inevitable march towards fully distributed software architecture. While a number of band-aids have extended the useful life of the technology (horizontal and vertical sharding, distributed caching and data de-normalization), these tactics nullify key benefits of the relational model while increasing total system cost and complexity. In response to the lack of commercially available alternatives, organizations such as Google and Amazon developed their own kind of schema-less stores: NoSQL data stores. These NoSQL, or non-relational, database technologies are a better match for the needs of modern interactive software systems. But not every company can or should develop, maintain and support its own database technology.

Our goal is to test the four technologies mentioned above using YCSB. The YCSB test tool comes with a set of core workloads and a set of parameter files; the parameter files can be changed per database instance. A workload represents the load that a given application will put on the database system. For benchmarking purposes, we define workloads that are relatively simple compared to real applications, so that we can better reason about the benchmarking results we get. However, a workload should be detailed enough that once we measure the database's performance, we know what kinds of applications might experience similar performance. Thus we write our own workloads by loading our own data into the different datastores and running different operations on them. We implement aggregation as the complex operation on the loaded data, since aggregation operations are among the most relevant and common. A Node.js script automates data generation according to the format specified in a file; up to 100 GB of data has been produced with this script.
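Custom workloads of this kind are handed to YCSB as parameter files. The snippet below is a minimal sketch of such a file: the property names are those used by YCSB's bundled core workloads, but the values are illustrative only and are not the exact workloads or record formats used in this study.

```properties
# Illustrative YCSB workload parameter file. Property names follow YCSB's
# bundled core workloads; the values are examples, not this paper's workloads.
workload=com.yahoo.ycsb.workloads.CoreWorkload

# Size of the load phase and of the transaction phase
recordcount=1000000
operationcount=1000000

# Read-heavy mix; raise updateproportion for a write-heavy workload
readproportion=0.95
updateproportion=0.05
scanproportion=0
insertproportion=0

# Popular records are requested more often than unpopular ones
requestdistribution=zipfian
```

The same file drives both phases: the load phase inserts recordcount records, and the transaction phase then issues operationcount operations drawn from the specified mix.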
The tier of scalability can be tested by adding new nodes and data, or removing existing ones, and evaluating the behaviour of each datastore.

II. PROBLEM CONTEXT

We set the context of our work by briefly discussing the presence and need of NoSQL and the architecture of YCSB.

A. Presence and Need of NoSQL

Just at the time when the database market seemed to many to be almost completely mature, a group of non-relational data stores, collectively categorized as NoSQL databases, has attracted significant attention. These databases are often employed in public, massively scaled web-site scenarios, where traditional database features matter less and fast fetching of relatively simple data sets matters most, e.g. Facebook with its 500,000,000 users, or Twitter, which accumulates terabytes of data every single day. Many of these databases employ parallelized query mechanisms and horizontal partitioning, and allow storage of heterogeneous, loosely schematized data records.

In a NoSQL database, there is no fixed schema and there are no joins. An RDBMS "scales up" by using faster hardware and adding memory. NoSQL, on the other hand, can take advantage of "scaling out": spreading the load over many commodity systems. This is the aspect of NoSQL that makes it an inexpensive solution for large datasets.

1) Categories of NoSQL: NoSQL technologies can be roughly classified into four categories: key-value stores, column-oriented stores, document-oriented stores and graph-based databases. Each of these has its own idiosyncrasies and data model; however, each of them uses map-reduce for parallelized query and computation. The two categories that we consider are column-oriented stores (HBase and Cassandra) and document-oriented stores (MongoDB and CouchDB).

B. YCSB

YCSB (Yahoo! Cloud Serving Benchmark) is both a benchmarking tool and a standard, developed at Yahoo! to test their PNUTS datastore and later open sourced in 2010. A key design goal of the tool is extensibility, so that it can be used to benchmark new cloud database systems and new workloads can be developed. YCSB differs from standard OLTP benchmarks such as TPC-C: TPC-C contains several diverse types of queries meant to mimic a company's warehouse environment, and some of its queries execute transactions over multiple tables. In contrast, the web applications we are benchmarking tend to run a huge number of extremely simple queries.

1) Architecture of YCSB: The YCSB Client is a Java program for generating the data to be loaded into the database and for generating the operations which make up the workload. The architecture of the client is shown in Fig. 1. The basic operation is that the Workload Executor drives multiple client threads. Each thread executes a sequential series of operations by making calls to the Database Interface Layer, both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we may directly control the offered load against the database. The threads also measure the latency and achieved throughput of their operations, and report these measurements to the statistics module. At the end of the experiment, the statistics module aggregates the measurements and reports average, 95th and 99th percentile latencies, and either a histogram or a time series of the latencies. The Workload Executor contains the code to execute both the load and transaction phases of the workload. The Database Interface Layer translates simple requests (such as read()) from the client threads into calls against the database.

[Fig. 1: Architecture of the YCSB Client]
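To make the role of the Database Interface Layer concrete, the sketch below shows a minimal, self-contained stand-in backed by an in-memory map. The method names mirror the operations YCSB issues (insert, read, update, delete), but the signatures are simplified here; this is not the actual YCSB DB class or any of the drivers used in this study.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal stand-in for a YCSB-style database interface layer.
// Each record is a map of field name -> value, keyed by a record key.
// A real binding would translate these calls into Cassandra/HBase/
// MongoDB/CouchDB client API calls instead of touching a local map.
public class InMemoryDbLayer {

    private final Map<String, Map<String, String>> store = new ConcurrentHashMap<>();

    // Load phase and insert operations both end up here.
    public boolean insert(String table, String key, Map<String, String> values) {
        store.put(table + "/" + key, new HashMap<>(values));
        return true;
    }

    // Transaction-phase read: fetch all fields of one record.
    public Map<String, String> read(String table, String key) {
        return store.get(table + "/" + key);
    }

    // Transaction-phase update: overwrite selected fields of one record.
    // (Not safe for concurrent updates to the same record; fine as a sketch.)
    public boolean update(String table, String key, Map<String, String> values) {
        Map<String, String> record = store.get(table + "/" + key);
        if (record == null) {
            return false;
        }
        record.putAll(values);
        return true;
    }

    public boolean delete(String table, String key) {
        return store.remove(table + "/" + key) != null;
    }
}
```

During both phases, the client threads in Fig. 1 drive a layer of this shape in a tight loop; a real binding replaces the map operations with driver calls for the datastore under test.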
C. Benchmarking Tiers

Cloud serving systems make different trade-offs to suit particular application scenarios. There are many design decisions to make when building a NoSQL system, and those decisions have a huge impact on how the system performs for different workloads (e.g., read-heavy vs. write-heavy), how it scales, how it handles failures, how easy it is to operate and tune, and so on. Hence no single cloud serving system is a panacea for every problem related to handling huge data sets, and it becomes vital to specify the criteria for evaluating the datastores. In this section we briefly discuss the tiers on which we report on the NoSQL data stores.

1) Tier 1 - Performance: The performance tier measures the latency of requests when the database is under load. This is the most basic yet crucial benchmarking tier, as the first question before any deployment is how much load the datastore will take per server. Latency of requests is strongly related to the throughput of the system: as the throughput is pushed higher, latency grows gradually. Eventually, as the system reaches saturation, latency jumps dramatically, and the throughput can be pushed no higher.
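This load/latency relationship is what the YCSB client threads measure: each thread throttles itself to a target request rate and records per-operation latency. The loop below is a minimal sketch of that idea, not YCSB's actual implementation; InMemoryDbLayer is the hypothetical in-memory stand-in sketched earlier.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative client loop: throttle to a target throughput, record
// per-operation latency, then report average and tail latencies.
public class ThrottledClient {

    public static void main(String[] args) throws InterruptedException {
        InMemoryDbLayer db = new InMemoryDbLayer();          // hypothetical stand-in
        int targetOpsPerSec = 1000;
        int operations = 10_000;
        long interval = 1_000_000_000L / targetOpsPerSec;    // ns between operations

        List<Long> latencies = new ArrayList<>();
        long start = System.nanoTime();

        for (int i = 0; i < operations; i++) {
            // Throttle: wait until this operation's scheduled start time.
            long scheduled = start + i * interval;
            long now = System.nanoTime();
            if (scheduled > now) {
                Thread.sleep((scheduled - now) / 1_000_000L);
            }

            long opStart = System.nanoTime();
            Map<String, String> values = new HashMap<>();
            values.put("field0", "value" + i);
            db.insert("usertable", "user" + i, values);       // the measured operation
            latencies.add(System.nanoTime() - opStart);
        }

        Collections.sort(latencies);
        long avg = latencies.stream().mapToLong(Long::longValue).sum() / latencies.size();
        System.out.printf("avg=%d us, 95th=%d us, 99th=%d us%n",
                avg / 1000,
                latencies.get((int) (latencies.size() * 0.95) - 1) / 1000,
                latencies.get((int) (latencies.size() * 0.99) - 1) / 1000);
    }
}
```

Raising targetOpsPerSec pushes the offered load higher; the reported tail latencies are what Tier 1 tracks as the system approaches saturation.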

2) Tier 2 - Scalability: The big selling point of cloud serving systems is their ability to scale out by adding more servers. If the system offers low latency at small scale, then as we proportionally increase the workload size and the number of servers, latency should remain constant. If this is not the case, it is a hint that the system has bottlenecks that surface at scale. Elasticity is the ability of a cloud serving system to continue handling data while more nodes and servers are added to the running system.

3) Tier 3 - Availability: The availability tier tests the capability of the datastore to serve data to the user despite node failures. In real-life scenarios, failure of commodity hardware or servers is very common; such failures can be caused by network faults or other reasons. Usually the data stored in a cluster is replicated over multiple servers. We test availability by killing nodes of a cluster over SSH and checking whether the data remains available.

III. DESIGN AND IMPLEMENTATION

We now present the specifics of our work. For our experiments we used machines with 32-bit 2.5 GHz Intel dual-core processors and 4 GB of RAM to run each system. We also tested Cassandra 0.8 and HBase 0.2.0 on multiple nodes (up to five nodes were created on a single machine). The YCSB Client was run with up to 100 threads, i.e. client instances. We also used Amazon Elastic Compute Cloud (Amazon EC2), a web service that provides resizable compute capacity in the cloud, to test the datastores with varying processing and data capacities.

IV. EXPERIMENTAL EVALUATION

We present benchmarking results for the systems we tested. The results are based on workload specifications: the workloads are designed according to real-life scenarios, specifying parameters such as the read/write mix, record count and request distribution. The differences in the results (95th and 99th percentile latencies, minimum, maximum and average latencies, and achieved throughput) reflect the trade-offs made by each cloud serving system.

While Cassandra and HBase share a similar data model, their underlying implementations are quite different. In our tests we ran the workloads we designed, both to measure performance (benchmark tier 1) and to measure scalability and elasticity (benchmark tier 2). Cassandra has a lower read latency than HBase on a read-heavy workload, whereas HBase emerges as the winner on a write-heavy workload, with an update latency lower than Cassandra's. Cassandra scaled well as the number of nodes increased.

As depicted in Fig. 3.1, HBase completed only 7 operations within the 0-second time span, whereas MongoDB displayed a spectacular performance, completing 4223 operations in the same time span. The maximum and minimum latencies observed for the two datastores, HBase and MongoDB, are shown in Fig. 3.2. The graphs illustrate the performance measured in terms of percentile latencies and throughput for HBase vs. MongoDB.

[Fig. 3.1: HBase vs. MongoDB performance]
[Fig. 3.2: Maximum and minimum latencies for HBase and MongoDB]

It should be noted that the results reported here are for particular versions of datastore systems that are undergoing continuous development, and performance may change and improve in the future.
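As an illustration of the aggregation operations mentioned in Section I, the sketch below counts records grouped by one field, written as an explicit map step and reduce step of the kind discussed in the next section. It is a minimal sketch, not the actual workload code, and the field name "region" is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative aggregation: count records per value of a chosen field,
// written as a map step (emit (fieldValue, 1) per record) followed by a
// reduce step (sum the emitted counts per distinct field value).
public class AggregationSketch {

    public static Map<String, Integer> countByField(List<Map<String, String>> records,
                                                    String field) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> record : records) {
            String key = record.getOrDefault(field, "unknown"); // map step
            counts.merge(key, 1, Integer::sum);                 // reduce step
        }
        return counts;
    }

    public static void main(String[] args) {
        // "region" is a hypothetical field used only for this illustration.
        List<Map<String, String>> records = List.of(
                Map.of("region", "apac"),
                Map.of("region", "emea"),
                Map.of("region", "apac"));
        System.out.println(countByField(records, "region")); // apac=2, emea=1 (order may vary)
    }
}
```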

V. DISCUSSION AND FUTURE WORK

One aspect of the NoSQL movement has been a move away from trying to maintain perfect consistency across distributed servers, which is achieved using the ACID (Atomicity, Consistency, Isolation, Durability) properties. To reduce the burden this places on databases, particularly in distributed systems, NoSQL databases have adopted BASE (Basically Available, Soft state, Eventually consistent) properties and follow the CAP theorem. The now-famous CAP theorem states that of consistency (all nodes see the same data at the same time), availability (a guarantee that every request receives a response about whether it succeeded or failed) and partition tolerance (the system continues to operate despite arbitrary message loss), only two can be guaranteed at any time. Traditional relational databases have kept strict transactional semantics to preserve consistency, but many NoSQL databases are moving towards a more scalable architecture that relaxes consistency. This relaxed consistency is often called eventual consistency; it permits much more scalable distributed storage systems in which writes can occur without two-phase commits or system-wide locks.

MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. It is a programming model for processing large data sets in a massively parallel manner, and it is used by almost all NoSQL technologies in one way or another. In the "map" step, the master node takes the input, partitions it into smaller sub-problems and distributes them to worker nodes; a worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. In the "reduce" step, the master node collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.

In addition to performance comparisons, it is important to examine other aspects of cloud serving systems. Replication can be considered as one more benchmarking tier for these NoSQL technologies: cloud systems use replication for both availability and performance, and replicas can be used for failover and to spread out read load.

VI. CONCLUSION

We have presented Yahoo! Cloud Serving Benchmark results for the four technologies. This benchmark is designed to provide tools for an apples-to-apples comparison of different serving data stores. One contribution of the benchmark is an extensible workload generator, the YCSB Client, which can be used to load datasets and execute workloads across a variety of data serving systems. Another contribution is the definition of five core workloads, which begin to fill out the space of performance trade-offs made by these systems. New workloads can easily be created, including generalized workloads to examine system fundamentals as well as more domain-specific workloads to model particular applications. As an open-source package, the YCSB Client is available for developers to use and extend in order to effectively evaluate cloud systems. We have used this tool to benchmark the performance of four cloud serving systems, and observed that there are clear trade-offs between read and write performance that result from each system's architectural decisions.