Benchmarking Cassandra on Violin




Technical Report

Benchmarking Cassandra on Violin
Accelerating Cassandra Performance and Reducing Read Latency with Violin Memory Flash-based Storage Arrays
Version 1.0

Abstract: This technical report presents the results of benchmark testing performed by Locomatix, Inc. on the Apache Cassandra open-source distributed database management system utilizing a Violin Memory 6000 Series flash-based memory array for primary storage.

Contents

1 Introduction
2 Cassandra
3 Benchmark Goals
4 NoSQL Workload
5 Performance Analysis
5.1 Data Ingestion Results
5.2 Observations
6 Data Analysis Workload
6.1 Observations
7 Yahoo Cloud Serving Benchmark
7.1 Insert Only
7.2 Update Heavy: 50% Reads and 50% Updates
7.3 Read Heavy: 95% Reads and 5% Updates
7.4 Read Only: 100% Reads
8 Conclusion

1 Introduction

Recently, Locomatix, Inc. performed benchmark testing on the Apache Cassandra open-source distributed database management system. The benchmarks consisted of a series of comparisons between a hard disk drive (HDD) based platform and one utilizing a single Violin flash Memory Array. The testing confirmed that Violin flash Memory Arrays offer significant performance gains in Cassandra environments. Most noteworthy was the 10-40x latency reduction achieved during mixed read/write workloads.

The Cassandra benchmarks demonstrate the potential of storage at the speed of memory, and also the limitations of both mechanical disks and current software architectures designed around HDD behavior. When data blocks were small, bandwidth requirements relatively low, and traffic patterns sequential, the Cassandra performance differences between the Violin flash Memory Array and the HDD platform were predictably small. But as bandwidth requirements increased, and especially as the input/output (I/O) patterns grew more random, disk performance suffered and soon stalled, while the Violin flash Memory Arrays continued to deliver.

2 Cassandra

Cassandra is a popular NoSQL distributed key-value store designed to scale to a large number of nodes (hundreds to thousands) with no single point of failure. It was initially developed at Facebook to handle terabytes of inbox system data. From a technology perspective, Cassandra uses three key concepts to achieve high performance: column-oriented data layout, data de-normalization, and distributed hashing. Collectively, these allow it to perform O(1) lookups on simple key-value operations, scale near-linearly on single-pass analytic algorithms, and achieve high speeds when appending new data. The fast data append capability distinguishes Cassandra from similar systems and makes it especially interesting as a data management solution for analytic applications that need to ingest large amounts of data very quickly.
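The three concepts above can be illustrated with a toy sketch in plain Python (not Cassandra code; all names here are hypothetical): rows are partitioned across nodes by hashing the row key, and a data item is then found with a constant number of map lookups, which is what makes the simple key-value path O(1) regardless of cluster size.

```python
import hashlib

# Toy model of key-partitioned storage. Each node holds column families
# that map a unique row key to its named columns.
NODES = ["node1", "node2", "node3", "node4"]

# node -> keyspace contents (column family -> row key -> columns)
stores = {n: {"events": {}} for n in NODES}

def node_for_key(key):
    """Hash the row key to pick an owning node. Cassandra's early
    RandomPartitioner similarly derived tokens from an MD5 hash."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def put(cf, key, columns):
    stores[node_for_key(key)][cf][key] = columns

def get(cf, key):
    # One hash plus two dict lookups: O(1) in the number of rows and nodes.
    return stores[node_for_key(key)][cf][key]

put("events", "row-001", {"ts": "2013-01-01T00:00:00", "src": "10.0.0.1"})
assert get("events", "row-001")["src"] == "10.0.0.1"
```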
In Cassandra, data is organized as column families. A unique key identifies each data item in a column family, and several column families can co-exist in a single keyspace. A column family can be loosely compared to a table in a relational database management system (RDBMS), and a keyspace to a database in an RDBMS, as illustrated in Figure 1.

Figure 1: Cassandra Data Model

3 Benchmark Goals

The main goal of the benchmark effort was to compare and analyze the performance of Cassandra on two platforms: a system with traditional HDDs, and a more modern system built around a single Violin flash Memory Array.

The first platform was a disk-based system with four computing nodes; this configuration, shown in Figure 2, is referred to as "disk." Each node had two Intel Xeon quad-core processors running at 2.53 GHz with hyper-threading enabled, 24 GB of DRAM, and a total of eight 1 TB SATA disks. One disk at each node was allocated exclusively to the Cassandra commit log so that writes to the log were sequential; the other seven disks at each node were allocated for data. All the computing nodes on the HDD platform ran the Debian 6.0 (Squeeze) Linux distribution. Nodes communicated with each other over a LAN switch operating at 10 Gb/s.

Figure 2: Disk-based System Used for Benchmark

The second platform was another four-node system attached to a Violin 6616 Series flash Memory Array. The Violin 6616 is based on single-level cell (SLC) flash memory and optimized for high I/O operations per second (IOPS) and low latency, while still providing robust RAID protection, ultra-low response times, high transaction rates, and real-time queries of large datasets. Each of the four computing nodes had two Intel Xeon quad-core processors running at 2.53 GHz with hyper-threading and 36 GB of DRAM. All four nodes were connected directly to the same Violin 6616 Series Array via PCI Express, and each ran the Ubuntu 10.04 (Lucid) Linux distribution. A 10 Gb/s LAN switch provided communication between nodes. This platform, shown in Figure 3, is referred to as "flash."

Figure 3: Flash-based System Used for Benchmark
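The dedicated commit-log disk described above corresponds to the standard Cassandra split between the commit log directory and the data directories. A minimal illustrative cassandra.yaml fragment might look like the following (the mount paths are hypothetical, not taken from the benchmark systems):

```yaml
# Commit log on its own spindle so log writes stay purely sequential
commitlog_directory: /mnt/disk0/commitlog

# Remaining disks hold the data files (SSTables)
data_file_directories:
    - /mnt/disk1/data
    - /mnt/disk2/data
    - /mnt/disk3/data
    - /mnt/disk4/data
    - /mnt/disk5/data
    - /mnt/disk6/data
    - /mnt/disk7/data
```

Separating the commit log from data this way is what keeps the log path sequential on the HDD platform; on a flash array the distinction matters far less, since random and sequential writes cost roughly the same.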

Table 1 summarizes the key system parameter values used in the benchmark.

Table 1: Cassandra Benchmark Environment

4 NoSQL Workload

The benchmark consisted of two simple workloads. The first, the data ingestion workload, inserted data into the system and measured the rate at which data could be ingested. This workload is representative of real-time systems with extremely fast data update rates, such as ingesting network activity log data for network situational-awareness applications, or ingesting social media data in real time to sense emerging social trends. The second, the data analysis workload, consisted of queries over the ingested data in which the amount of data retrieved by each query was varied. This workload represents analysis of data in real-time environments, such as identifying the cause of network congestion by scanning a large log of network activity data, or scanning the last X minutes of social media data in a specific geographical area to identify the root causes behind an emerging trend or pattern.

5 Performance Analysis

5.1 Data Ingestion Results

In these experiments, data was ingested continuously, one record at a time, with record sizes varying from a few tens of bytes to a few kilobytes. The data was pumped into the server nodes from five client nodes, each running a load program that spawned 100 threads, simulating 500 clients simultaneously pumping data into a server backend that has to ingest the data. Each of the five client nodes had two Intel Xeon quad-core processors running at 2.53 GHz and 24 GB of DRAM, and ran Debian 6.0 (Squeeze). This configuration is shown in Figure 4.

Figure 4: Benchmark Environment with Data Nodes
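The shape of the load program described above can be sketched as follows. This is a hypothetical stand-in, not Locomatix's actual harness: the insert function is a stub where a real client would issue a Cassandra write, and the record counts are scaled down for illustration.

```python
import threading

NUM_THREADS = 100          # threads per client node, as in the benchmark
RECORDS_PER_THREAD = 50    # scaled down from the benchmark's volumes

counter_lock = threading.Lock()
inserted = 0

def insert(record):
    """Stub for a single-record Cassandra insert over the network."""
    global inserted
    with counter_lock:
        inserted += 1

def worker(thread_id):
    # Each thread inserts its records one at a time, as fast as it can.
    for i in range(RECORDS_PER_THREAD):
        insert({"key": f"{thread_id}-{i}", "payload": "x" * 100})

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All records from all threads were ingested
assert inserted == NUM_THREADS * RECORDS_PER_THREAD
```

Running one such program on each of five client nodes yields the 500 concurrent writers used in the benchmark.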

For these experiments, the data was replicated only once and compaction was disabled. Several Cassandra parameters were tuned, as recommended by Cassandra's documentation (and based on our experience), to extract maximum performance. These parameters are also shown in Table 1 above. The experiments were repeated for records of size 50 bytes, 100 bytes, 1000 bytes, 5000 bytes, and 11000 bytes. The throughput was measured and is plotted in Figure 5.

Figure 5: Data Ingestion Throughput

5.2 Observations

The Violin flash Memory Array enabled data ingest performance gains of 30% to over 280% compared to the disk platform. In terms of performance trends, as the record size increased, the disk system's throughput decreased predictably. The throughput of the flash system, however, actually increased at first, even as the record size grew, before the trend reversed and inserts per second began to decrease. To help understand these trends, the CPU utilization per node and storage bandwidth for each record size are shown in Table 2.

With smaller record sizes, one might have expected the disk and flash systems to perform at the same level, because the write bandwidth is below 100 MB/s, the typical maximum sequential bandwidth of a SATA disk. This is not the case, however, because of commit log processing in Cassandra. A copy of all the data must pass through the commit log before being periodically flushed to storage. With disks, throughput is limited by how fast the commit log thread can compute the Cyclic Redundancy Check (CRC) and flush the commit log. With flash, more commit logs can be flushed, because solid state storage supports more write bandwidth.
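A toy bottleneck model captures this: every record passes through a single commit-log thread, so sustained ingest is capped by the slower of the storage's sequential write bandwidth and the rate at which one core can checksum the data. The 100 MB/s and 1.2 GB/s figures come from the text; the CRC rate below is an illustrative assumption, not a measured value.

```python
# Bandwidths in bytes/second
DISK_SEQ_BW = 100e6    # ~100 MB/s typical SATA sequential write (from text)
FLASH_BW = 1.2e9       # up to 1.2 GB/s per node, Violin 6616 (from text)
CRC_RATE = 400e6       # bytes/s one core can CRC -- hypothetical value

def max_inserts_per_sec(record_size, storage_bw, crc_rate=CRC_RATE):
    # The commit-log thread can push no faster than its slowest stage.
    effective_bw = min(storage_bw, crc_rate)
    return effective_bw / record_size

# For 1000-byte records: on disk the storage bandwidth is the bottleneck,
# while on flash the same workload is capped by the CRC computation.
disk_rate = max_inserts_per_sec(1000, DISK_SEQ_BW)    # -> 100000.0 inserts/s
flash_rate = max_inserts_per_sec(1000, FLASH_BW)      # -> 400000.0 inserts/s
```

Under this model, adding storage bandwidth beyond the CRC rate buys nothing, which is exactly the single-thread ceiling the report observes on flash.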
Table 2: Data Ingestion Write IOPS and CPU Utilization

Because of its greater performance capabilities, one might have expected the flash system's throughput to be much higher still, since a direct-attached Violin 6616 flash Memory Array can provide as much as 1.2 GB/s per node. However, because all the data must pass through a single commit log thread, the CRC computation saturates the CPU core running that thread, thereby limiting

throughput. Nonetheless, even with Cassandra's architectural limitations, the Violin flash Memory Array achieved higher write IOPS and CPU utilization while nearly doubling, and in many cases more than doubling, the actual record insert performance.

6 Data Analysis Workload

In these experiments, Cassandra performance was evaluated when executing queries, with a focus on point queries. Point queries retrieve the records that satisfy a predicate; the predicate checks whether the value of an attribute equals a constant specified in the query. Logically, such queries must scan the data for all records that match the specified predicate. The number of records retrieved depends on the selectivity of the attribute: if the selectivity is high, the final result contains a small number of output data items, and vice versa. Experiments were run with different selectivity factors.

The physical evaluation of these queries is carried out either by scanning all the records in the table or by using an index on the attribute to retrieve the qualifying records. A user-level index was created by introducing a separate column family that used the distinct values of the attribute as keys and the entire record as the value. In data processing parlance, this is a de-normalized schema with a materialized view on the index attribute. This scheme is the most efficient way to answer these queries using an index (though at the cost of a larger storage footprint). To evaluate these queries, data was loaded into a single column family. Some of the attributes in these records were chosen to have repeating values to ensure a certain selectivity. For the data analysis workload experiments, 175 million rows matching this schema were ingested, and point queries were then run against them. The SQL equivalent of such a query is:

SELECT * FROM data WHERE attr_0001 = 1

For each experiment, 100,000 queries were run, presented to the cluster from a query node running a client program.
The client program can be configured to vary the number of threads; the number of threads controls the number of concurrent queries presented to the system at any given instant. Figure 6 shows the benchmark environment for presenting queries to the Cassandra cluster.

Figure 6: Benchmark Environment with Query Node

To eliminate operating system caching effects, the file system buffers were cleared before each experiment, and each query within an experiment was given a distinct value for attr_0001 that was never repeated. The results are plotted in Figures 7 and 8 for two different selectivity factors, 0.0001% and 0.001% (labeled 0.0001 and 0.001, respectively). A selectivity factor of 0.0001% means that the query selects one record for every one million input records in the database.
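The selectivity arithmetic works out as follows, using the 175 million loaded rows from the text:

```python
TOTAL_ROWS = 175_000_000

def records_selected(total_rows, selectivity_pct):
    """Number of records a point query returns for a given
    selectivity factor expressed as a percentage."""
    return round(total_rows * selectivity_pct / 100)

low = records_selected(TOTAL_ROWS, 0.0001)   # -> 175 records per query
high = records_selected(TOTAL_ROWS, 0.001)   # -> 1750 records per query
```

So the two plotted selectivity factors correspond to result sets of roughly 175 and 1,750 records per query, small enough that the dominant cost is locating and reading the scattered index rows rather than transferring results.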

6.1 Observations

For the disk system, as the number of concurrent queries increased, the time to complete a query increased sharply. The main reason was the growing number of random seeks: the index rows for a particular distinct value are scattered across all the index column family files on disk, and each file must be read to retrieve its portion of the result. For flash, the increase was much more linear up to 400 concurrent queries, after which the response time increased gradually. Because flash offers more IOPS, with I/O latencies on the order of microseconds, it satisfies the I/O requests for the queries quickly, leading to reduced latency. After 400 concurrent queries, however, CPU usage began to climb, and at 800 queries some of the queries were consuming 95% of the CPU.

On average, flash utilizes more CPU than disk; this is a function of how fast the storage can deliver data, i.e., read latency. Aggregate CPU utilization for disk stalled at around 35%, while flash continued up to as much as 60%. Ultimately, it was CPU saturation that limited Cassandra performance on the flash platform, not the capabilities of the storage itself.

Figure 7: Concurrent Queries vs. Latency (0.0001)
Figure 8: Concurrent Queries vs. Latency (0.001)

7 Yahoo Cloud Serving Benchmark

To provide more benchmark breadth, and to inject more realistic testing environments into the mix, the Yahoo Cloud Serving Benchmark (YCSB) was added. YCSB is a benchmark for web serving workloads characterized by the traffic patterns of Web 2.0 applications such as social networking and gaming. YCSB has been used in the past to compare the performance of various NoSQL systems, as well as SQL systems that target these newer applications. It provides a set of workloads, each representing a particular mix of

read/write operations, data sizes, request distributions, and so on. Performance was measured for four workloads: insert only, update heavy, read heavy, and read only.

7.1 Insert Only

For this experiment, 175 million records were inserted from five data nodes, each using 100 threads to insert 35 million records. The record size was varied from 50 bytes to 10,000 bytes, and latency and throughput were measured. The results are shown in Figure 9.

Figure 9: YCSB Insert Only: Record Size vs. Latency and Throughput

As the record size increased, latency rose and throughput fell, as expected: a larger record size increases the amount of data passing through Cassandra and saved to storage. At smaller record sizes, flash and disk experienced similar latency and achieved similar throughput, because Cassandra writes are sequential and the disks had not yet reached their peak sequential performance of 100 MB/s. At larger record sizes, disk performance degraded and eventually stalled altogether, because all the records had to pass through the single commit log thread, which on the disk platform was limited by sequential write bandwidth. On the flash platform, the commit log thread was not limited by bandwidth; it was instead limited by the CRC computation, because flash provides far more bandwidth and IOPS capability than HDD.

For the remainder of the experiments, 175 million records of 1000 bytes each were loaded. Each record consisted of 10 fields of equal size. An operation could either read a field of a record or update a record by replacing the value of a field. Before each experiment, the file system caches were flushed to eliminate reads from cache. Each workload was started immediately after inserting the data, which meant that Cassandra would perform compaction, the process of merging multiple insert files into fewer ones to improve read performance.
Compaction does increase CPU cost for merging multiple files, and it consumes I/O for sequential reads and writes.

7.2 Update Heavy: 50% Reads and 50% Updates

In this experiment, the workload was equally divided between reads and updates. A total of 500,000 operations were executed for each experiment, which was repeated at different target throughputs. The extraordinary results are shown in Figure 10, where the x-axis represents a gradual increase in the applied demand (target throughput) and the y-axis represents the observed performance.
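A YCSB workload is defined in large part by its operation mix. A deterministic toy generator for such a mix might look like the following (a hypothetical sketch, not YCSB's implementation, which also draws keys from a configurable request distribution):

```python
def operation_mix(num_ops, read_fraction):
    """Emit a read/update sequence whose read proportion tracks
    read_fraction; updates fill the remainder."""
    ops = []
    reads_emitted = 0
    for i in range(num_ops):
        # Emit a read whenever we are behind the target read fraction.
        if reads_emitted < read_fraction * (i + 1):
            ops.append("read")
            reads_emitted += 1
        else:
            ops.append("update")
    return ops

# The 50/50 update-heavy mix used in this section
mix = operation_mix(1000, 0.5)
assert mix.count("read") == 500 and mix.count("update") == 500
```

Setting read_fraction to 0.95 yields the read-heavy mix of Section 7.3, and 1.0 the read-only mix of Section 7.4.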

As can be seen in the figure, at low throughput the flash read latency is 10-30x lower than the disk read latency, even in this mixed workload. As the throughput increased, the disk platform was unable to push beyond a threshold of 1000 operations/second. Even when the user requested a target throughput above 1000 ops/sec, Cassandra would eventually process the workload, but the actual reported throughput never rose beyond 1000 ops/sec; therefore, no disk numbers are included beyond that ceiling. Flash, on the other hand, continued to outperform disk and maxed out at a throughput of 10,000 ops/sec.

Figure 10: YCSB Update Heavy: Read and Write Latency vs. Throughput

At a target throughput of 1000 ops/sec, the read IOPS for flash and disk are roughly equal. Why, then, is the disk read latency more than 10x higher than flash? Because of the high cost of random seeks. At this target throughput, each operation issues around four random read I/Os. Since the cost of a random seek can be as high as 30 ms, a total of 120 ms was required to retrieve data from disk, which dominated the overall read latency. Furthermore, at this target throughput the disks were saturated, attaining their highest levels of performance at around 100-120 IOPS. With flash, as the throughput increased, the read IOPS kept pace, and the Violin 6616 served the data requests very quickly, because it can handle more than one million IOPS.

For write operations, flash latency was still 2-3x lower than disk. This is because all writes in Cassandra are sequential, and sequential writes on the disk platform cost about one tenth as much as random seeks. Because Cassandra batches several writes into 32 MB memtables, the cost of a write is amortized over several write operations, so the write latency on the disk system remains reasonable.
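The seek-cost arithmetic above can be written out as a tiny model; the four I/Os per operation and the ~30 ms worst-case seek come directly from the text:

```python
def disk_read_latency_ms(seeks_per_op, seek_ms=30):
    """Lower bound on per-operation read latency when every I/O
    pays a full random seek."""
    return seeks_per_op * seek_ms

# Four random read I/Os at up to 30 ms each dominate the read path.
assert disk_read_latency_ms(4) == 120  # the 120 ms figure in the text
```

On flash, the same four I/Os each complete in microseconds, so the seek term effectively vanishes, which is where the 10-30x read latency gap comes from.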
7.3 Read Heavy: 95% Reads and 5% Updates

This experiment evaluated a workload in which read operations were very frequent and updates occurred only occasionally: 95% reads and 5% updates. The results are shown in Figure 11. The read latency of flash was 30-40x lower than that of disk. Again, the poor disk performance comes down to the high cost of random seeks: the read rate for disk at a target throughput of 1000 ops/sec was approximately 4000 IOPS, so each query cost four random seeks, which pegs the minimum read latency at about 120 ms.

The write latency for disk, on the other hand, was still 5x higher than the flash write latency, and it did not vary much as the target throughput increased. Since the number of write operations at a given target throughput is not very large, most writes are buffered in memory and periodic flushes of the memory buffers are triggered. Because the periodic flush occurs every second and the amount of data written varies only between 50 KB and 1 MB, the write latency shows little variation. The difference in latency between disk and flash is essentially the cost of the actual writes, which on the disk system is a few milliseconds for sequential I/O and on the flash system is less than one millisecond.

Figure 11: YCSB Read Heavy: Read and Write Latency vs. Throughput

7.4 Read Only: 100% Reads

This experiment presented a workload consisting only of read operations. Each read operation read either a single field or all the fields of a record. Before each experiment, file cache buffers were flushed. The read latency and read IOPS are shown in Figure 12.

Figure 12: YCSB Read Only: Read Latency and IOPS vs. Throughput

The observations here are similar to those for the read heavy workload: flash provides sustained performance at higher target throughputs, while disk latency suffers from the cost of random seeks, as is evident from the number of read IOPS.

8 Conclusion

Benchmark results for Cassandra on a traditional HDD-based platform compared to a platform anchored by a single Violin flash Memory Array demonstrate that flash achieves significant performance improvements across all workload environments. These studies show that flash performs 10-40x faster than disk for queries of varying selectivity, and that for data loading, flash performs anywhere from 30% to nearly 300% better than disk. More interesting still, the Cassandra benchmark studies suggest that as software architectures evolve to better exploit the much higher bandwidth and much lower latency of solid state storage devices such as Violin flash Memory Arrays, even greater overall performance gains can be expected.

About Violin Memory

Violin Memory is pioneering a new class of high-performance flash-based storage systems designed to bring storage performance in line with high-speed applications, servers, and networks. Violin Flash Memory Arrays are specifically designed at each level of the system architecture, starting with memory and optimized through the array, to leverage the inherent capabilities of flash memory and to meet the sustained high-performance requirements of business-critical applications, virtualized environments, and Big Data solutions in enterprise data centers. Specifically designed for sustained performance with high reliability, Violin's Flash Memory Arrays can scale to hundreds of terabytes and millions of IOPS with low, predictable latency. Founded in 2005, Violin Memory is headquartered in Mountain View, California.

© 2013 Violin Memory. All rights reserved. All other trademarks and copyrights are property of their respective owners. Information provided in this paper may be subject to change. vmem-13q1-tr-casandra-r1-uslet-en