WHITE PAPER

Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data

951 SanDisk Drive, Milpitas, CA 95035
www.sandisk.com
Table of Contents

Abstract
What Is Big Data?
About Oracle NoSQL Database
Big Data: The Problem With Conventional Technology
The Flash Memory Solution
About The Tests
Test Results
Interpreting The Results
Oracle NoSQL Database And SanDisk: A Winning Combination
Abstract

This paper describes the benefits of storing Oracle NoSQL Database data on SanDisk's Fusion ioMemory products. Oracle and SanDisk partnered to test, validate, and deliver extreme-performance big data solutions for real-time applications. The superior performance of Fusion ioMemory devices complements the scalability, reliability, and simplicity of Oracle NoSQL Database, dramatically improving throughput and response times for serving key-value data. The combination of Oracle NoSQL Database and Fusion ioMemory products provides a compelling and cost-effective solution in a variety of scenarios.

Testing showed that using an ioDrive2 device for data storage delivered nearly 30 times more operations per second than a 300GB 10k SAS disk on a 95% read and 5% update workload, and about ten times more operations per second on a 50% read and 50% update workload. Equally impressive, the ioDrive2 device reduced average insert latency by a factor of more than seven (over 700%) on the 95% read and 5% update workload, and average read latency by a factor of more than 58 (over 5800%) on the 50% read and 50% update workload.

What Is Big Data?

Big Data is an informal term that encompasses all sorts of data, including Web logs, sensor data, tweets, blogs, user reviews, and SMS messages. It is characterized by volume, on the order of hundreds of terabytes or more; by wide variety, with no inherent structure (one row looks very different from another); and by high velocity, on the order of hundreds of thousands of operations per second.

Often, big data is processed using purpose-built software designed to address a specific data processing requirement. This category of big data processing solutions is generally referred to as NoSQL ("not SQL" or "Not Only SQL"). Although it is possible to process big data using traditional SQL-based products and solutions, NoSQL databases provide a more cost-effective and horizontally scalable alternative.
NoSQL databases complement SQL-based solutions, providing significant new business advantages to the enterprise.

Recently, there has been a huge surge of interest in big data processing solutions. As enterprises have embraced big data processing for business benefit, open source and commercial vendors have responded by providing a variety of solutions aimed at addressing specific big data processing needs. In October 2011, Oracle announced a suite of complementary products and technologies that provide a complete and comprehensive solution to address the big data processing needs of the market.

Big data processing falls into two major categories: interactive processing and batch processing. In most big data processing applications, both kinds of data processing are required. Oracle NoSQL Database (NoSQL DB for short), also released in October 2011, is a scalable, highly available key-value store that can be used to acquire and manage vast amounts of interactive information.

About Oracle NoSQL Database

Oracle NoSQL Database is a highly available, linearly scalable, high-performance key-value database server. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Oracle NoSQL Database provides a very simple data model to the application developer. Each row is identified by a unique key and has a value of arbitrary length, which is interpreted by the application. The application can manipulate (insert, delete, update, read) a single row in a transaction. The application can also perform an iterative, non-transactional scan of all the rows in the database. The simplicity of this data model and access provides tremendous flexibility and performance benefits over an SQL-based solution for big data processing.
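The data model described above can be illustrated with a minimal in-memory sketch. This is not Oracle NoSQL Database's API: the class `KeyValueStore` and its method names are hypothetical, chosen only to mirror the four single-row operations and the non-transactional scan named in the text.

```python
import json
from typing import Dict, Iterator, Optional, Tuple


class KeyValueStore:
    """Toy in-memory key-value model (hypothetical, not Oracle's API)."""

    def __init__(self) -> None:
        # Each row is a unique key mapped to an opaque value of any length.
        self._rows: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        """Insert a new row or update an existing one."""
        self._rows[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        """Read a single row; returns None if the key is absent."""
        return self._rows.get(key)

    def delete(self, key: bytes) -> bool:
        """Delete a single row; returns True if the row existed."""
        return self._rows.pop(key, None) is not None

    def scan(self) -> Iterator[Tuple[bytes, bytes]]:
        """Iterate over a snapshot of all rows, with no transactional guarantee."""
        yield from list(self._rows.items())


# The store never inspects values, so rows can be structurally unrelated.
store = KeyValueStore()
store.put(b"user/42", json.dumps({"name": "Ada", "cart": [7, 9]}).encode())
store.put(b"sensor/a1", b"2014-01-01T00:00:00Z,36.6")
for key, value in store.scan():
    print(key, len(value))
```

Because the value is an opaque byte string, interpreting it (JSON in one row, a comma-separated reading in another) is left entirely to the application, which is the flexibility the text refers to.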
As mentioned earlier, big data is characterized by variety, volume, and velocity. The key-value paradigm permits the application to manage any kind of data: one row can be structurally very different from another row. The volume of data managed might change dramatically from one day to the next. For example, if e-commerce transactions are being managed in Oracle NoSQL Database, the volume of transactions and data can increase more than ten-fold during a busy shopping season, such as the weeks before the Christmas holiday. The data management system needs to scale easily to handle the change in workload without compromising performance. Similarly, high throughput and low response time are critical in many big data processing applications such as e-commerce, targeted advertising, and any application that provides interactive access to the customer.

NoSQL DB is a sharded system: each shard manages a subset of the data. Typically, a shard is composed of three independent nodes to provide high availability. One of the nodes in the shard is designated as the master, meaning it can serve read as well as write requests. Changes to data on the master node are continually propagated to the other nodes (the replicas) in the shard in order to keep the replicas up to date. Replicas can serve read requests; if the master node fails, one of the surviving replicas is elected as the new master, and processing continues without any interruption in database activity. Figure 1 illustrates the architecture of a typical NoSQL Database configuration with two clients. Note that the number of clients can vary, depending on application requirements.

Figure 1: NoSQL Database system architecture

Each node (master or replica) uses Berkeley DB Java Edition HA as the underlying data manager. Berkeley DB Java Edition uses a log-structured storage format to store the records and indices in the database.
Log-structured storage is naturally optimized for write performance and can deliver extremely high write throughput. Through a combination of clever optimizations and effective use of memory, Berkeley DB Java Edition delivers excellent read performance as well.
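To make the idea of log-structured storage concrete, the following is a minimal sketch of an append-only store with an in-memory index. It is an illustration of the general technique only and does not reflect Berkeley DB Java Edition's actual file format or API; the class name `LogStructuredStore` and the record layout (two 4-byte lengths followed by key and value) are assumptions made for this example.

```python
import io
import struct
from typing import Dict, Optional


class LogStructuredStore:
    """Toy append-only store (illustrative; not Berkeley DB's format)."""

    # Assumed record layout: key length, value length, key bytes, value bytes.
    _HEADER = struct.Struct(">II")

    def __init__(self) -> None:
        self._log = io.BytesIO()            # stands in for the on-disk log file
        self._index: Dict[bytes, int] = {}  # key -> offset of its newest record

    def put(self, key: bytes, value: bytes) -> None:
        """Writes never overwrite in place: they always append at the tail."""
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(self._HEADER.pack(len(key), len(value)) + key + value)
        self._index[key] = offset

    def get(self, key: bytes) -> Optional[bytes]:
        """Reads jump to wherever the newest record for the key happens to be."""
        offset = self._index.get(key)
        if offset is None:
            return None
        self._log.seek(offset)
        key_len, value_len = self._HEADER.unpack(self._log.read(self._HEADER.size))
        self._log.seek(key_len, io.SEEK_CUR)  # skip over the stored key
        return self._log.read(value_len)


store = LogStructuredStore()
store.put(b"k1", b"old")
store.put(b"k2", b"other")
store.put(b"k1", b"new!")   # the update is appended; the old record stays behind
print(store.get(b"k1"))
```

In this sketch every write lands at the end of the log, which is a sequential access pattern, while a read may have to seek to any offset in the log. A production log-structured store also needs a cleaner to reclaim superseded records, which is omitted here.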
Big Data: The Problem With Conventional Technology

Transactional semantics, high availability, scalable throughput, and predictable latency are must-have requirements for the interactive (or "real-time") big data processing for which Oracle NoSQL Database is designed. For example, a retail e-commerce application must respond to user requests in under one or two seconds to ensure high user retention. Similarly, an in-home health care application must be able to capture and monitor data from multiple sensors, while processing and responding to critical medical events reliably and predictably without data loss.

A common technique to ensure high throughput and low latency is to store all the information in memory. Due to the high and unpredictable volumes of data, however, an in-memory solution is not cost-effective for big data processing. Typically, big data solutions store the vast majority of the information on disk, and use memory for caching the most frequently accessed subsets of data.

The performance of storing and retrieving data from disk often limits the throughput and response time achievable by the system. In particular, the number of input-output operations per second (IOPS) that a disk can deliver dictates the performance characteristics of the system. Modern spinning disks deliver fast sequential access, but poor sustained random performance of approximately 100 IOPS. Most often, the requirements of a NoSQL database application far exceed the capacity of a single disk. Consequently, high-performance solutions often use multiple disks per machine in order to get additional I/O bandwidth. This can work adequately for smaller data sets, but as the volume of data to be processed increases, applications require external arrays, and the cost of the hardware and maintenance needed to scale such systems quickly becomes impractical.
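The scaling problem can be made concrete with a back-of-the-envelope calculation. The sketch below uses the approximate figure of 100 random IOPS per spinning disk given above; the function name `disks_needed`, the cache-hit parameter, and the simplifying assumption that every cache miss costs exactly one random disk I/O are illustrative assumptions, not measurements from this paper.

```python
import math

HDD_RANDOM_IOPS = 100  # approximate sustained random IOPS per spinning disk (from the text)


def disks_needed(target_ops_per_sec: int,
                 cache_hit_ratio: float = 0.0,
                 iops_per_disk: int = HDD_RANDOM_IOPS) -> int:
    """Estimate the spindles required to sustain a target operation rate.

    Assumes each operation that misses the in-memory cache costs exactly
    one random disk I/O, which is a deliberate simplification.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    disk_ios_per_sec = target_ops_per_sec * (1.0 - cache_hit_ratio)
    return math.ceil(disk_ios_per_sec / iops_per_disk)


# 100,000 operations per second with no cache hits needs about 1,000 spindles.
print(disks_needed(100_000))
# Even if half of all operations are served from memory, about 500 remain.
print(disks_needed(100_000, cache_hit_ratio=0.5))
```

Under these assumptions, a workload in the range of hundreds of thousands of operations per second requires hundreds to thousands of disks, which is why scaling with spindles alone quickly becomes impractical.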
The Flash Memory Solution

SanDisk's Fusion ioMemory platform delivers the microsecond-latency access that interactive big data applications need to maintain real-time response times across tens of terabytes of capacity, something that in-memory databases cannot practically do. It provides persistent storage and a level of I/O performance that disk arrays cannot achieve without racks of equipment and high-bandwidth network infrastructure. Oracle and SanDisk have partnered to test the benefits of the Fusion ioMemory solution for Oracle NoSQL Database.
About The Tests

The tests were run on a single shard consisting of three nodes. Each node was a Sun Fire X4170 M2 configured with two Intel 2.93GHz 6-core Xeon X5670 processors, 72GB of DRAM, a 300GB 10k SAS hard disk, and a 1.2TB Fusion ioMemory ioDrive2 card. The machines were configured with Oracle Linux Server release 5.7 and a pre-release version of NoSQL Database 2.0. The test driver consisted of a single Yahoo! Cloud Serving Benchmark (YCSB) client. The YCSB software was modified to use a larger key space for better distribution of keys when scaling up to large data sets. Tests were conducted on an Oracle system that was not tuned for flash.

There were three sets of tests:

1. Pure insert: Insert 100 million records, with an average key size of 13 bytes and an average value size of 1108 bytes.
2. 50/50 R/W: Ten million operations consisting of a 50% read and 50% update mix, using the 100 million record store created by the insert test.
3. 95/5 R/W: Ten million operations consisting of a 95% read and 5% update mix, again using the 100 million record store created by the insert test.

Each test was run with the data stored on the SAS hard disk and again with the data stored on the ioDrive2 card. Throughput and latency were measured by the YCSB client during these tests and are summarized in the tables below.

Test Results

                                  300GB SAS Disk    ioDrive2    Improvement
Throughput (operations/sec)               23,308      24,150          3.60%
Average insert latency (msec)¹              5.07        4.96          2.20%
Average read latency (msec)¹                 N/A         N/A            N/A

Table 1. Pure insert test: insert 100 million 1108-byte records

                                  300GB SAS Disk    ioDrive2    Improvement
Throughput (operations/sec)                3,342      33,693           908%
Average insert latency (msec)¹             36.88        6.42           574%
Average read latency (msec)¹                35.6        0.61          5836%

Table 2. 50/50 read/update mix, 100 million 1108-byte records in the database

                                  300GB SAS Disk    ioDrive2    Improvement
Throughput (operations/sec)                3,583     106,616          2975%
Average insert latency (msec)¹             34.57        4.79           721%
Average read latency (msec)¹               33.16        0.91          3643%

Table 3. 95/5 read/update mix, 100 million 1108-byte records in the database

¹ Latency results include Java application overhead. Raw ioMemory access latency is typically in the microsecond range. More efficient applications will see even faster response times.
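As a cross-check on the read/update results, the sketch below computes plain speedup factors from the raw disk and ioDrive2 measurements in Tables 2 and 3. The dictionary layout and the helper name `speedup` are assumptions made for this example; only the measured values themselves come from the tables.

```python
from typing import Dict, Tuple

# (SAS disk, ioDrive2) measurements copied from Tables 2 and 3.
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "50/50 read/update": {
        "throughput": (3342, 33693),
        "insert_latency": (36.88, 6.42),
        "read_latency": (35.6, 0.61),
    },
    "95/5 read/update": {
        "throughput": (3583, 106616),
        "insert_latency": (34.57, 4.79),
        "read_latency": (33.16, 0.91),
    },
}


def speedup(metric: str, disk: float, flash: float) -> float:
    """Return how many times better flash is than disk for one metric.

    Higher is better for throughput, lower is better for latency,
    so the ratio is inverted for the latency metrics.
    """
    return flash / disk if metric == "throughput" else disk / flash


for workload, metrics in RESULTS.items():
    for metric, (disk, flash) in metrics.items():
        print(f"{workload:18s} {metric:15s} {speedup(metric, disk, flash):6.1f}x")
```

Expressed this way, the 95/5 workload shows a throughput speedup of roughly 30 times and the 50/50 workload a read-latency speedup of roughly 58 times.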
Interpreting The Results

For the pure insert scenario, the performance of the disk and the ioDrive2 device is similar. This similarity is not surprising, since the underlying log-structured storage architecture of Oracle NoSQL Database is optimized for write operations on hard disks. However, we see a dramatic difference in the read/update mix tests. Read operations require random I/Os (seeks) on conventional disks; consequently, both throughput and latency suffer. In the case of the ioDrive2, however, the cost of random I/O and sequential I/O is almost identical. In other words, any I/O operation in a sequence of operations is equally fast! On the 95/5 read/update mix, the throughput improvement factor is nearly 30 times (nearly 3,000%). Notice that the throughput improvement grows as the ratio of reads to writes increases. This happens because the benefits of log-structured storage have less of an impact when the proportion of writes relative to reads is small, so performance is dominated by random reads.

Oracle NoSQL Database And SanDisk: A Winning Combination

From these performance tests, it is clear that the ioDrive2 card provides dramatic improvements in performance for interactive big data applications. Disk drives simply cannot achieve the number of IOPS that an ioDrive2 device can. The superior performance of Oracle NoSQL Database using an ioDrive2 card is essential for many mission-critical applications such as e-retail, online advertising, home health care monitoring, financial services, and security and surveillance.

Though the initial capital cost of flash storage-based technology is higher, a system using disk-based storage that delivers comparable performance will need a large number of disk spindles to deliver the required throughput, and may not be able to deliver the required latency at all. Further, the operational costs of flash-based technology, including the amount of hardware required, power consumption, and cooling, are much lower than those of comparable disk-based solutions.
Finally, there are intangible benefits of deploying a super-high performance, low-latency, and reliable NoSQL application, including customer and user loyalty and trust, competitive advantage, and lower operational costs.

Oracle NoSQL Database with Fusion ioMemory ioDrive2 technology provides an enterprise-grade, highly reliable, highly scalable, high-performance, and low-latency solution for the most demanding big data applications today.

FOR MORE INFORMATION

Contact a SanDisk representative, 1-800-578-6007 or fusion-sales@sandisk.com

The performance results discussed herein are based on testing and use of the above described products. Results and performance may vary according to configurations and systems, including drive capacity, system architecture and applications.

© 2014 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. Fusion ioMemory, ioDrive and others are trademarks of SanDisk Enterprise IP LLC. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).