Benchmarking Cassandra on Violin




Technical Report

Benchmarking Cassandra on Violin
Accelerating Cassandra Performance and Reducing Read Latency with Violin Memory Flash-based Storage Arrays
Version 1.0

Abstract: This technical report presents the results of benchmark testing performed by Locomatix, Inc. on the Apache Cassandra open-source distributed database management system utilizing a Violin Memory 6000 Series flash-based memory array for primary storage.

Contents

1 Introduction
2 Cassandra
3 Benchmark Goals
4 NoSQL Workload
5 Performance Analysis
5.1 Data Ingestion Results
5.2 Observations
6 Data Analysis Workload
6.1 Observations
7 Yahoo Cloud Serving Benchmark
7.1 Insert Only
7.2 Update Heavy: 50% Reads and 50% Updates
7.3 Read Heavy: 95% Reads and 5% Updates
7.4 Read Only: 100% Reads
8 Conclusion

1 Introduction

Recently, Locomatix, Inc. performed benchmark testing on the Apache Cassandra open-source distributed database management system. The benchmarks consisted of a series of comparisons between a hard disk drive (HDD) based platform and one utilizing a single Violin flash Memory Array. The testing confirmed that Violin flash Memory Arrays offer significant performance gains in Cassandra environments. Most noteworthy was the 10-40x latency reduction achieved during mixed read/write workloads.

The Cassandra benchmarks demonstrate the potential of storage at the speed of memory, and also the limitations of both mechanical disks and current software architectures designed around HDD behavior. When data blocks were small, bandwidth requirements relatively low, and traffic patterns sequential, the Cassandra performance differences between the Violin flash Memory Array and the HDD platform were predictably small. But as bandwidth requirements increased, and especially as the input/output (I/O) patterns grew more random, disk performance suffered and soon stalled, while the Violin flash Memory Arrays continued to deliver.

2 Cassandra

Cassandra is a popular NoSQL distributed key-value store designed to scale to a large number of nodes (hundreds to thousands) with no single point of failure. It was initially developed at Facebook to handle terabytes of inbox system data. From a technology perspective, Cassandra uses three key concepts to achieve high performance: column-oriented data layout, data de-normalization, and distributed hashing. Collectively, these allow it to perform O(1) lookups on simple key-value operations, scale near-linearly on single-pass analytic algorithms, and achieve high speeds when appending new data. The fast data append capability distinguishes Cassandra from similar systems and makes it especially interesting as a data management solution for analytic applications that need to ingest large amounts of data very quickly.
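The three concepts above can be illustrated with a toy sketch in plain Python (not Cassandra code; all names here are hypothetical): rows are partitioned across nodes by hashing the row key, and a data item is then found with a constant number of map lookups, which is what makes the simple key-value path O(1) regardless of cluster size.

```python
import hashlib

# Toy model of key-partitioned storage. Each node holds column families
# that map a unique row key to its named columns.
NODES = ["node1", "node2", "node3", "node4"]

# node -> keyspace contents (column family -> row key -> columns)
stores = {n: {"events": {}} for n in NODES}

def node_for_key(key):
    """Hash the row key to pick an owning node. Cassandra's early
    RandomPartitioner similarly derived tokens from an MD5 hash."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def put(cf, key, columns):
    stores[node_for_key(key)][cf][key] = columns

def get(cf, key):
    # One hash plus two dict lookups: O(1) in the number of rows and nodes.
    return stores[node_for_key(key)][cf][key]

put("events", "row-001", {"ts": "2013-01-01T00:00:00", "src": "10.0.0.1"})
assert get("events", "row-001")["src"] == "10.0.0.1"
```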
In Cassandra, data is organized as column families. A unique key identifies each data item in a column family, and several column families can co-exist in a single keyspace. A column family can be loosely compared to a table in a relational database management system (RDBMS), and a keyspace to a database in an RDBMS, as illustrated in Figure 1.

Figure 1: Cassandra Data Model

3 Benchmark Goals

The main goal of the benchmark effort was to compare and analyze the performance of Cassandra on two platforms: a system with traditional HDDs, and a more modern system built around a single Violin flash Memory Array.

The first platform was a disk-based system with four computing nodes; this configuration, shown in Figure 2, is referred to as "disk." Each node had two Intel Xeon quad-core processors running at 2.53 GHz with hyper-threading enabled, 24 GB of DRAM, and a total of eight 1 TB SATA disks. One disk at each node was allocated exclusively to the Cassandra commit log so that writes to the log were sequential; the other seven disks at each node were allocated for data. All the computing nodes on the HDD platform ran the Debian 6.0 (Squeeze) Linux distribution. Nodes communicated with each other over a LAN switch operating at 10 Gb/s.

Figure 2: Disk-based System Used for Benchmark

The second platform was another four-node system attached to a Violin 6616 Series flash Memory Array. The Violin 6616 is based on single-level cell (SLC) flash memory and optimized for high I/O operations per second (IOPS) and low latency, while still providing robust RAID protection, ultra-low response times, high transaction rates, and real-time queries of large datasets. Each of the four computing nodes had two Intel Xeon quad-core processors running at 2.53 GHz with hyper-threading and 36 GB of DRAM. All four nodes were connected directly to the same Violin 6616 Series Array via PCI Express, and each ran the Ubuntu 10.04 (Lucid) Linux distribution. A 10 Gb/s LAN switch provided communication between nodes. This platform, shown in Figure 3, is referred to as "flash."

Figure 3: Flash-based System Used for Benchmark
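The dedicated commit-log disk described above corresponds to the standard Cassandra split between the commit log directory and the data directories. A minimal illustrative cassandra.yaml fragment might look like the following (the mount paths are hypothetical, not taken from the benchmark systems):

```yaml
# Commit log on its own spindle so log writes stay purely sequential
commitlog_directory: /mnt/disk0/commitlog

# Remaining disks hold the data files (SSTables)
data_file_directories:
    - /mnt/disk1/data
    - /mnt/disk2/data
    - /mnt/disk3/data
    - /mnt/disk4/data
    - /mnt/disk5/data
    - /mnt/disk6/data
    - /mnt/disk7/data
```

Separating the commit log from data this way is what keeps the log path sequential on the HDD platform; on a flash array the distinction matters far less, since random and sequential writes cost roughly the same.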

Table 1 summarizes the key system parameter values used in the benchmark.

Table 1: Cassandra Benchmark Environment

4 NoSQL Workload

The benchmark consisted of two simple workloads. The first, the data ingestion workload, inserted data into the system and measured the rate at which data could be ingested. This workload is representative of real-time systems with extremely fast data update rates, such as ingesting network activity log data for network situational-awareness applications, or ingesting social media data in real time to sense emerging social trends. The second, the data analysis workload, consisted of queries over the ingested data in which the amount of data retrieved by each query was varied. This workload represents analysis of data in real-time environments, such as identifying the cause of network congestion by scanning a large log of network activity data, or scanning the last X minutes of social media data in a specific geographical area to identify the root causes behind an emerging trend or pattern.

5 Performance Analysis

5.1 Data Ingestion Results

In these experiments, data was ingested continuously, one record at a time, with record sizes varying from a few tens of bytes to a few kilobytes. The data was pumped into the server nodes from five client nodes, each running a load program that spawned 100 threads, simulating 500 clients simultaneously pumping data into a server backend that has to ingest the data. Each of the five client nodes had two Intel Xeon quad-core processors running at 2.53 GHz and 24 GB of DRAM, and ran Debian 6.0 (Squeeze). This configuration is shown in Figure 4.

Figure 4: Benchmark Environment with Data Nodes
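The shape of the load program described above can be sketched as follows. This is a hypothetical stand-in, not Locomatix's actual harness: the insert function is a stub where a real client would issue a Cassandra write, and the record counts are scaled down for illustration.

```python
import threading

NUM_THREADS = 100          # threads per client node, as in the benchmark
RECORDS_PER_THREAD = 50    # scaled down from the benchmark's volumes

counter_lock = threading.Lock()
inserted = 0

def insert(record):
    """Stub for a single-record Cassandra insert over the network."""
    global inserted
    with counter_lock:
        inserted += 1

def worker(thread_id):
    # Each thread inserts its records one at a time, as fast as it can.
    for i in range(RECORDS_PER_THREAD):
        insert({"key": f"{thread_id}-{i}", "payload": "x" * 100})

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All records from all threads were ingested
assert inserted == NUM_THREADS * RECORDS_PER_THREAD
```

Running one such program on each of five client nodes yields the 500 concurrent writers used in the benchmark.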

For these experiments, the data was replicated only once and compaction was disabled. Several Cassandra parameters were tuned, as recommended by Cassandra's documentation (and based on our experience), to extract maximum performance. These parameters are also shown in Table 1 above. The experiments were repeated for records of size 50 bytes, 100 bytes, 1000 bytes, 5000 bytes, and 11000 bytes. The throughput was measured and is plotted in Figure 5.

Figure 5: Data Ingestion Throughput

5.2 Observations

The Violin flash Memory Array enabled data ingest performance gains of 30% to over 280% compared to the disk platform. In terms of performance trends, as the record size increased, the disk system's throughput decreased predictably. The throughput of the flash system, however, actually increased at first, even as the record size grew, before the trend reversed and inserts per second began to decrease. To help understand these trends, the CPU utilization per node and storage bandwidth for each record size are shown in Table 2.

With smaller record sizes, one might have expected the disk and flash systems to perform at the same level, because the write bandwidth is below 100 MB/s, the typical maximum sequential bandwidth of a SATA disk. This is not the case, however, because of commit log processing in Cassandra. A copy of all the data must pass through the commit log before being periodically flushed to storage. With disks, throughput is limited by how fast the commit log thread can compute the Cyclic Redundancy Check (CRC) and flush the commit log. With flash, more commit logs can be flushed, because solid state storage supports more write bandwidth.
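A toy bottleneck model captures this: every record passes through a single commit-log thread, so sustained ingest is capped by the slower of the storage's sequential write bandwidth and the rate at which one core can checksum the data. The 100 MB/s and 1.2 GB/s figures come from the text; the CRC rate below is an illustrative assumption, not a measured value.

```python
# Bandwidths in bytes/second
DISK_SEQ_BW = 100e6    # ~100 MB/s typical SATA sequential write (from text)
FLASH_BW = 1.2e9       # up to 1.2 GB/s per node, Violin 6616 (from text)
CRC_RATE = 400e6       # bytes/s one core can CRC -- hypothetical value

def max_inserts_per_sec(record_size, storage_bw, crc_rate=CRC_RATE):
    # The commit-log thread can push no faster than its slowest stage.
    effective_bw = min(storage_bw, crc_rate)
    return effective_bw / record_size

# For 1000-byte records: on disk the storage bandwidth is the bottleneck,
# while on flash the same workload is capped by the CRC computation.
disk_rate = max_inserts_per_sec(1000, DISK_SEQ_BW)    # -> 100000.0 inserts/s
flash_rate = max_inserts_per_sec(1000, FLASH_BW)      # -> 400000.0 inserts/s
```

Under this model, adding storage bandwidth beyond the CRC rate buys nothing, which is exactly the single-thread ceiling the report observes on flash.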
Table 2: Data Ingestion Write IOPS and CPU Utilization

Because of its greater performance capabilities, one might have expected the flash system's throughput to be much higher still, since a direct-attached Violin 6616 flash Memory Array can provide as much as 1.2 GB/s per node. However, because all the data must pass through a single commit log thread, the CRC computation saturates the CPU core running that thread, thereby limiting

throughput. Nonetheless, even with Cassandra's architectural limitations, the Violin flash Memory Array achieved higher write IOPS and CPU utilization while nearly doubling, and in many cases more than doubling, the actual record insert performance.

6 Data Analysis Workload

In these experiments, Cassandra performance was evaluated when executing queries, with a focus on point queries. Point queries retrieve the records that satisfy a predicate; the predicate checks whether the value of an attribute equals a constant specified in the query. Logically, such queries must scan the data for all records that match the specified predicate. The number of records retrieved depends on the selectivity of the attribute: if the selectivity is high, the final result contains a small number of output data items, and vice versa. Experiments were run with different selectivity factors.

The physical evaluation of these queries is carried out either by scanning all the records in the table or by using an index on the attribute to retrieve the qualifying records. A user-level index was created by introducing a separate column family that used the distinct values of the attribute as keys and the entire record as the value. In data processing parlance, this is a de-normalized schema with a materialized view on the index attribute. This scheme is the most efficient way to answer these queries using an index (though at the cost of a larger storage footprint). To evaluate these queries, data was loaded into a single column family. Some of the attributes in these records were chosen to have repeating values to ensure a certain selectivity. For the data analysis workload experiments, 175 million rows matching this schema were ingested, and point queries were then run against them. The SQL equivalent of such a query is:

SELECT * FROM data WHERE attr_0001 = 1

For each experiment, 100,000 queries were run, presented to the cluster from a query node running a client program.
The client program can be configured to vary the number of threads; the number of threads controls the number of concurrent queries presented to the system at any given instant. Figure 6 shows the benchmark environment for presenting queries to the Cassandra cluster.

Figure 6: Benchmark Environment with Query Node

To eliminate operating system caching effects, the file system buffers were cleared before each experiment, and each query within an experiment was given a distinct value for attr_0001 that was never repeated. The results are plotted in Figures 7 and 8 for two different selectivity factors, 0.0001% and 0.001% (labeled 0.0001 and 0.001, respectively). A selectivity factor of 0.0001% means that the query selects one record for every one million input records in the database.
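The selectivity arithmetic works out as follows, using the 175 million loaded rows from the text:

```python
TOTAL_ROWS = 175_000_000

def records_selected(total_rows, selectivity_pct):
    """Number of records a point query returns for a given
    selectivity factor expressed as a percentage."""
    return round(total_rows * selectivity_pct / 100)

low = records_selected(TOTAL_ROWS, 0.0001)   # -> 175 records per query
high = records_selected(TOTAL_ROWS, 0.001)   # -> 1750 records per query
```

So the two plotted selectivity factors correspond to result sets of roughly 175 and 1,750 records per query, small enough that the dominant cost is locating and reading the scattered index rows rather than transferring results.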

6.1 Observations

For the disk system, as the number of concurrent queries increased, the time to complete a query increased sharply. The main reason was the growing number of random seeks: the index rows for a particular distinct value are scattered across all the index column family files on disk, and each file must be read to retrieve its portion of the result. For flash, the increase was much more linear up to 400 concurrent queries, after which the response time increased gradually. Because flash offers more IOPS, with I/O latencies on the order of microseconds, it satisfies the I/O requests for the queries quickly, leading to reduced latency. After 400 concurrent queries, however, CPU usage began to climb, and at 800 queries some of the queries were consuming 95% of the CPU.

On average, flash utilizes more CPU than disk; this is a function of how fast the storage can deliver data, i.e., read latency. Aggregate CPU utilization for disk stalled at around 35%, while flash continued up to as much as 60%. Ultimately, it was CPU saturation that limited Cassandra performance on the flash platform, not the capabilities of the storage itself.

Figure 7: Concurrent Queries vs. Latency (0.0001)
Figure 8: Concurrent Queries vs. Latency (0.001)

7 Yahoo Cloud Serving Benchmark

To provide more benchmark breadth, and to inject more realistic testing environments into the mix, the Yahoo Cloud Serving Benchmark (YCSB) was added. YCSB is a benchmark for web serving workloads characterized by the traffic patterns of Web 2.0 applications such as social networking and gaming. YCSB has been used in the past to compare the performance of various NoSQL systems, as well as SQL systems that target these newer applications. It provides a set of workloads, each representing a particular mix of

read/write operations, data sizes, request distributions, and so on. Performance was measured for four workloads: insert only, update heavy, read heavy, and read only.

7.1 Insert Only

For this experiment, 175 million records were inserted from five data nodes, each using 100 threads to insert 35 million records. The record size was varied from 50 bytes to 10,000 bytes, and latency and throughput were measured. The results are shown in Figure 9.

Figure 9: YCSB Insert Only: Record Size vs. Latency and Throughput

As the record size increased, latency rose and throughput fell, as expected: a larger record size increases the amount of data passing through Cassandra and saved to storage. At smaller record sizes, flash and disk experienced similar latency and achieved similar throughput, because Cassandra writes are sequential and the disks had not yet reached their peak sequential performance of 100 MB/s. At larger record sizes, disk performance degraded and eventually stalled altogether, because all the records had to pass through the single commit log thread, which on the disk platform was limited by sequential write bandwidth. On the flash platform, the commit log thread was not limited by bandwidth; it was instead limited by the CRC computation, because flash provides far more bandwidth and IOPS capability than HDD.

For the remainder of the experiments, 175 million records of 1000 bytes each were loaded. Each record consisted of 10 fields of equal size. An operation could either read a field of a record or update a record by replacing the value of a field. Before each experiment, the file system caches were flushed to eliminate reads from cache. Each workload was started immediately after inserting the data, which meant that Cassandra would perform compaction, the process of merging multiple insert files into fewer ones to improve read performance.
Compaction does increase CPU cost for merging multiple files, and it consumes I/O for sequential reads and writes.

7.2 Update Heavy: 50% Reads and 50% Updates

In this experiment, the workload was equally divided between reads and updates. A total of 500,000 operations were executed for each experiment, which was repeated at different target throughputs. The extraordinary results are shown in Figure 10, where the x-axis represents a gradual increase in the applied demand (target throughput) and the y-axis represents the observed performance.
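A YCSB workload is defined in large part by its operation mix. A deterministic toy generator for such a mix might look like the following (a hypothetical sketch, not YCSB's implementation, which also draws keys from a configurable request distribution):

```python
def operation_mix(num_ops, read_fraction):
    """Emit a read/update sequence whose read proportion tracks
    read_fraction; updates fill the remainder."""
    ops = []
    reads_emitted = 0
    for i in range(num_ops):
        # Emit a read whenever we are behind the target read fraction.
        if reads_emitted < read_fraction * (i + 1):
            ops.append("read")
            reads_emitted += 1
        else:
            ops.append("update")
    return ops

# The 50/50 update-heavy mix used in this section
mix = operation_mix(1000, 0.5)
assert mix.count("read") == 500 and mix.count("update") == 500
```

Setting read_fraction to 0.95 yields the read-heavy mix of Section 7.3, and 1.0 the read-only mix of Section 7.4.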

As can be seen in the figure, at low throughput the flash read latency is 10-30x lower than the disk read latency, even in this mixed workload. As the throughput increased, the disk platform was unable to push beyond a threshold of 1000 operations/second. Even when the user requested a target throughput above 1000 ops/sec, Cassandra would eventually process the workload, but the actual reported throughput never rose beyond 1000 ops/sec; therefore, no disk numbers are included beyond that ceiling. Flash, on the other hand, continued to outperform disk and maxed out at a throughput of 10,000 ops/sec.

Figure 10: YCSB Update Heavy: Read and Write Latency vs. Throughput

At a target throughput of 1000 ops/sec, the read IOPS for flash and disk are roughly equal. Why, then, is the disk read latency more than 10x higher than flash? Because of the high cost of random seeks. At this target throughput, each operation issues around four random read I/Os. Since the cost of a random seek can be as high as 30 ms, a total of 120 ms was required to retrieve data from disk, which dominated the overall read latency. Furthermore, at this target throughput the disks were saturated, attaining their highest levels of performance at around 100-120 IOPS. With flash, as the throughput increased, the read IOPS kept pace, and the Violin 6616 served the data requests very quickly, because it can handle more than one million IOPS.

For write operations, flash latency was still 2-3x lower than disk. This is because all writes in Cassandra are sequential, and sequential writes on the disk platform cost about one tenth as much as random seeks. Because Cassandra batches several writes into 32 MB memtables, the cost of a write is amortized over several write operations, so the write latency on the disk system remains reasonable.
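The seek-cost arithmetic above can be written out as a tiny model; the four I/Os per operation and the ~30 ms worst-case seek come directly from the text:

```python
def disk_read_latency_ms(seeks_per_op, seek_ms=30):
    """Lower bound on per-operation read latency when every I/O
    pays a full random seek."""
    return seeks_per_op * seek_ms

# Four random read I/Os at up to 30 ms each dominate the read path.
assert disk_read_latency_ms(4) == 120  # the 120 ms figure in the text
```

On flash, the same four I/Os each complete in microseconds, so the seek term effectively vanishes, which is where the 10-30x read latency gap comes from.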
7.3 Read Heavy: 95% Reads and 5% Updates

This experiment evaluated a workload in which read operations were very frequent and updates occurred only occasionally: 95% reads and 5% updates. The results are shown in Figure 11. The read latency of flash was 30-40x lower than that of disk. Again, the poor disk performance comes down to the high cost of random seeks: the read rate for disk at a target throughput of 1000 ops/sec was approximately 4000 IOPS, so each query cost four random seeks, which pegs the minimum read latency at about 120 ms.

The write latency for disk, on the other hand, was still 5x higher than the flash write latency, and it did not vary much as the target throughput increased. Since the number of write operations at a given target throughput is not very large, most writes are buffered in memory and periodic flushes of the memory buffers are triggered. Because the periodic flush occurs every second and the amount of data written varies only between 50 KB and 1 MB, the write latency shows little variation. The difference in latency between disk and flash is essentially the cost of the actual writes, which on the disk system is a few milliseconds for sequential I/O and on the flash system is less than one millisecond.

Figure 11: YCSB Read Heavy: Read and Write Latency vs. Throughput

7.4 Read Only: 100% Reads

This experiment presented a workload consisting only of read operations. Each read operation read either a single field or all the fields of a record. Before each experiment, file cache buffers were flushed. The read latency and read IOPS are shown in Figure 12.

Figure 12: YCSB Read Only: Read Latency and IOPS vs. Throughput

The observations here are similar to those for the read heavy workload: flash provides sustained performance at higher target throughputs, while disk latency suffers from the cost of random seeks, as is evident from the number of read IOPS.

8 Conclusion

Benchmark results for Cassandra on a traditional HDD-based platform compared to a platform anchored by a single Violin flash Memory Array demonstrate that flash achieves significant performance improvements across all workload environments. These studies show that flash performs 10-40x faster than disk for queries of varying selectivity, and that for data loading, flash performs anywhere from 30% to nearly 300% better than disk. More interesting still, the Cassandra benchmark studies suggest that as software architectures evolve to better exploit the much higher bandwidth and much lower latency of solid state storage devices such as Violin flash Memory Arrays, even greater overall performance gains can be expected.

About Violin Memory

Violin Memory is pioneering a new class of high-performance flash-based storage systems designed to bring storage performance in line with high-speed applications, servers, and networks. Violin Flash Memory Arrays are specifically designed at each level of the system architecture, starting with memory and optimized through the array, to leverage the inherent capabilities of flash memory and to meet the sustained high-performance requirements of business-critical applications, virtualized environments, and Big Data solutions in enterprise data centers. Specifically designed for sustained performance with high reliability, Violin's Flash Memory Arrays can scale to hundreds of terabytes and millions of IOPS with low, predictable latency. Founded in 2005, Violin Memory is headquartered in Mountain View, California.

© 2013 Violin Memory. All rights reserved. All other trademarks and copyrights are property of their respective owners. Information provided in this paper may be subject to change. vmem-13q1-tr-casandra-r1-uslet-en