DataStax Enterprise, powered by Apache Cassandra (TM)




PerfAccel (TM) Performance Benchmark on Amazon: DataStax Enterprise, powered by Apache Cassandra (TM)

Disclaimer: All of the documentation provided in this document is copyright Datagres Technologies Inc. Datagres PerfAccel is a patent-pending technology from Datagres Technologies Inc. Information in this document is provided in connection with Datagres products. No license, express or implied, by estoppel or otherwise, to any Datagres intellectual property rights is granted by this document, except as provided in Datagres's Terms and Conditions of Sale for such products. Datagres and PerfAccel are trademarks or registered trademarks of Datagres Technologies Inc or its subsidiaries in the United States and other countries. Copyright 2015, Datagres Technologies Inc. All Rights Reserved. Datagres may make changes to specifications and product descriptions at any time, without notice.

EXECUTIVE SUMMARY

NoSQL databases, cloud deployments and SSDs are some of the buzzwords that dominate current IT infrastructure conversations, and for good reason. More and more deployments are moving to the cloud, and a large portion of them include the current breed of non-relational databases, commonly referred to as NoSQL. Solid state devices (SSDs) are gaining popularity too, given the low latency and high throughput they offer. All cloud providers now offer SSDs, both as direct-attached devices and as network-attached devices. The direct-attached devices are fast but available only in limited capacity on each instance; larger direct-attached SSDs come only with larger and more expensive instance types.

This paper presents how PerfAccel can be used with a cloud deployment of Cassandra to provide improved performance at lower cost and to make better use of cloud instance storage.

NoSQL databases run better with low-latency SSD systems
PerfAccel provides the best combination of price and performance

NOSQL DATABASES AND I/O PERFORMANCE ISSUES

NoSQL, or "Not Only SQL" as it is referred to in some contexts, is a method of data management that differs from the more traditional relational model of data storage. NoSQL evolved out of the need for simplicity in design, on-demand horizontal scaling, and finer control over consistency, availability and partition tolerance, which are some of the key requirements of modern applications. Adoption of NoSQL databases has grown as datasets have grown (so-called Big Data) and as applications try to store data in a way that makes application design and development simpler. It is a good alternative to a normalized relational data model, which causes impedance mismatch, slows application design and development, and makes feature expansion difficult.

Traditional databases typically scale vertically. NoSQL databases, however, scale out and on demand. This is critical for most cloud-based deployments, which require rapid scaling at times of high load.

NoSQL databases scale horizontally, presenting different I/O challenges
The Cassandra compaction process creates a high IOPS requirement
Running all-SSD systems is expensive and can limit scalability

The horizontal, or scale-out, model used by NoSQL databases adds a new dimension to I/O performance handling. Database I/O optimization techniques that worked well with the vertical scaling model are not completely applicable to the horizontal scaling approach. Hence, different techniques are needed to achieve better throughput and latency in cloud deployments. NoSQL databases are particularly prone to the performance cliff, which occurs when the working set of the application exceeds the system RAM. Because most data access patterns in NoSQL databases and the applications that use them are random, I/O performance issues crop up regularly, even with careful design and a well-chosen primary key to drive a working data set.

Cassandra is one of the leading NoSQL databases. It is preferred for its high availability and high scalability without compromise on performance. While the architecture of Cassandra provides many great features, it comes at a cost in I/O performance unless used with high-speed disks (SSDs). This is due to the way Cassandra manages updates to existing keys. Cassandra follows the log-structured update model, where updates are written sequentially to a new immutable file and the older entries are marked with tombstones. The compaction algorithm takes care of removing the stale entries. The additional task of removing stale entries and compressing the tables comes at a cost of additional reads and writes on an IOPS-constrained data disk.

This, coupled with the limited network bandwidth on cloud instances, is one reason why best practice suggests using local SSD-based instance storage for the Cassandra dataset. However, running an all-SSD system substantially increases the cost of deployment.
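The log-structured update model and the extra I/O caused by compaction can be illustrated with a toy model. This is a minimal sketch, not Cassandra's implementation: the class `TinyLogStore`, its method names, and the use of `None` as a tombstone are all assumptions made for illustration, and real Cassandra keeps tombstones for a grace period rather than dropping them at the first compaction.

```python
from typing import Dict, List, Optional

TOMBSTONE = None  # illustrative delete marker, not Cassandra's on-disk format


class TinyLogStore:
    """Toy log-structured store: every flush creates a new immutable segment."""

    def __init__(self) -> None:
        self.segments: List[Dict[str, Optional[str]]] = []  # oldest first

    def flush(self, writes: Dict[str, Optional[str]]) -> None:
        # Updates never modify older segments; they land in a new one.
        self.segments.append(dict(writes))

    def read(self, key: str) -> Optional[str]:
        # The newest segment wins, so a read may probe several segments.
        for segment in reversed(self.segments):
            if key in segment:
                return segment[key]
        return None

    def compact(self) -> Dict[str, int]:
        # Compaction re-reads every entry and rewrites only the live ones.
        entries_read = sum(len(s) for s in self.segments)
        merged: Dict[str, Optional[str]] = {}
        for segment in self.segments:  # later segments overwrite earlier ones
            merged.update(segment)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.segments = [live]
        return {"entries_read": entries_read, "entries_written": len(live)}


store = TinyLogStore()
store.flush({"a": "1", "b": "1"})
store.flush({"a": "2"})        # update to an existing key
store.flush({"b": TOMBSTONE})  # delete
stats = store.compact()
print(stats)
```

In this toy run, compaction reads four entries and writes one in order to keep a single live key. That background read and write traffic is what competes with foreground requests on an IOPS-constrained disk.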

CLOUD INSTANCES AND STORAGE TYPES

Cloud instances come in different forms and compositions. They vary in RAM size, number of CPUs, and the type of storage devices connected. Instances are typically optimized for memory, CPU or storage. The choice depends on the use case and application behaviour.

Figure 1 - Instance type categories for Amazon EC2

Two basic storage choices are available within cloud instances: storage that is directly connected to the hosted virtual machine instance, and remote storage served over the network. Remote storage can be served from magnetic media or from a bank of SSDs. Similarly, storage directly attached to the virtual machine can be SSD or magnetic disk. SSD storage is faster than magnetic storage, especially when the SSD is locally attached. We describe the storage choices as Amazon EC2 provides them; other cloud providers have similar offerings.

Figure 2 - Storage types available on Amazon EC2

Storage-optimized cloud instances offer better storage options and a higher capacity of local SSD storage. These optimizations come at a cost. As can be seen from the table above, higher-capacity local storage is much more expensive, as it is only available on instances with a higher overall configuration.

USAGE OF INSTANCE STORE DISKS / LOCALLY ATTACHED DISKS

The instance store disks, or locally attached disks, which are typically SSD devices in a storage-optimized instance, are very different from the rest of the storage devices. These disks are ephemeral and can lose data on a hard reset or if the base machine restarts. This makes them challenging to use. Since the data cannot survive hard resets and restarts, the data on these devices needs to be backed up at all times on a more reliable device type. Because of this difficulty, the instance store disks are either not used at all or used only as temporary storage. The only other alternative is when the application using the disk can seamlessly handle the loss of data that might ensue.

Instance store SSDs are ephemeral and can lose data
Limitations create problems for getting the most out of the system

PERFACCEL SOLUTION

PerfAccel presents a unique solution that provides deep analytics to observe I/O behavior, helping determine better data placement and improve the performance of NoSQL database deployments. In addition, using its intelligent caching capabilities, PerfAccel can deliver much higher performance. The result is a significant reduction in infrastructure costs while providing rich analytics and much higher performance.

PerfAccel supports acceleration of all I/O across multiple platforms. It supports NAS, SAN and DAS to provide a seamless performance benefit to all types of applications. Configurable caching policies ensure that the right working set resides in the cache for maximum performance benefit. It is extremely easy to deploy and manage, and the in-depth analytics help users understand the application's I/O pattern and I/O footprint in order to optimize workloads. A GUI-based management console helps in managing large grid deployments, with a centralized data repository for analytics.

PerfAccel provides:
Storage visibility through deep file-level analytics
Intelligent caching and deterministic placement of hot files
Higher performance using fewer SSDs used optimally
Increased scale by leveraging spinning disks

PerfAccel can take advantage of instance store SSDs in a manner that is beneficial in two ways. First, there is no need for the application or any tool to ensure that data lost at reboot is placed back on the faster device before the application can start using it. Second, PerfAccel can use the device much more efficiently by ensuring that the hot data resides on the faster storage, providing much better performance from the same instance type.

PerfAccel uses the faster device as a cache and ensures optimal placement of frequently used hot data. The application benefits directly, since all reads served from this device are much faster, increasing performance and reducing latency. In addition, since these read operations are offloaded by the cache, the backend storage device that holds the entire dataset is more responsive, as it has to serve fewer IOPS. Thus the PerfAccel cache not only improves read performance, it also implicitly improves the write performance of the application.
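The read-offloading effect of keeping hot data on a faster device can be sketched with a small read-through cache. This is an illustration only: the paper does not describe PerfAccel's placement policy, so least-recently-used eviction is an assumed stand-in here, and `ReadCache`, `slow_backend` and the block names are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Dict


class ReadCache:
    """Toy read-through LRU cache in front of a slower backend (assumed policy)."""

    def __init__(self, capacity: int, backend_read: Callable[[str], bytes]) -> None:
        self.capacity = capacity
        self.backend_read = backend_read
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id: str) -> bytes:
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.backend_read(block_id)  # only misses reach the backend
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data


backend_calls: Dict[str, int] = {}


def slow_backend(block_id: str) -> bytes:
    # Stand-in for the network-attached volume holding the full dataset.
    backend_calls[block_id] = backend_calls.get(block_id, 0) + 1
    return block_id.encode()


cache = ReadCache(capacity=2, backend_read=slow_backend)
for block in ["hot", "hot", "cold1", "hot", "cold2", "hot"]:
    cache.read(block)
print(cache.hits, cache.misses, sum(backend_calls.values()))
```

Of the six reads in this example, three are served from the cache and only three reach the backend, which leaves the backend with more of its IOPS budget for writes and cache misses.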

TEST AND BENCHMARK CONFIGURATION

The following tests and benchmarks were performed on Amazon EC2:

1. Run the workload with the entire dataset residing on a provisioned IOPS, SSD-backed EBS (optimized) volume.
2. Run the workload with the entire dataset residing on a locally attached SSD.
3. Run the workload with the entire dataset on a provisioned IOPS, SSD-backed EBS (optimized) volume, which is cached on a locally attached SSD using PerfAccel.

Benchmark

DataStax's Cassandra was used as the NoSQL database, and cassandra-stress as the tool to generate load on the database.

Figure 3 - Test/Benchmark Configuration

Chosen Instance Types

The chosen instance types are the same in all respects except for the size of the instance store SSD attached to them. The r3.2xlarge instance has a large cost advantage, as it is more than 50% cheaper than i2.2xlarge. The r3.2xlarge instances were used to run the workload with PerfAccel and with EBS. For running the workload with all the data on SSD, i2.2xlarge instances were used. While running the workload with EBS and with PerfAccel, the dataset was stored on an optimized EBS volume (general purpose SSD backed).

Figure 4 - Chosen Instance Types from Amazon EC2

TEST RESULTS

Results - Throughput

With PerfAccel, read-heavy throughput improves significantly compared to EBS and comes within 10-20% of direct SSD throughput. Write-heavy throughput is almost the same in all cases, as the commit log resides on SSD for all the test cases.

Figure 5 - Throughput (Read Heavy)
Figure 6 - Throughput (Write Heavy)

Results - Latency for 95% of ops

Latency numbers follow the same pattern as throughput. With the PerfAccel cache, the majority of data is served from the SSD, which keeps latency low and performance good. Once again, for the write-heavy workload the difference is small.

Figure 7 - Latency for 95% of ops (Read Heavy)
Figure 8 - Latency for 95% of ops (Write Heavy)

Results - Latency for 99% of ops

These are important latency numbers, showing that the majority of operations benefit from the low-latency I/O of the SSD cache. The I/O performance follows the same pattern as throughput.

Figure 9 - Latency for 99% of ops (Read Heavy)
Figure 10 - Latency for 99% of ops (Write Heavy)
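To make "latency for 95% of ops" and "latency for 99% of ops" concrete, the sketch below computes a percentile with the nearest-rank definition. It is an assumption that this matches how the load tool derives its figures; the function name `percentile_latency` and the sample values are invented for illustration and are not measurements from this benchmark.

```python
import math
from typing import List


def percentile_latency(latencies_ms: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up samples: most operations are fast, a few are slow outliers.
samples = [1.0] * 95 + [5.0] * 4 + [50.0]
p95 = percentile_latency(samples, 95)
p99 = percentile_latency(samples, 99)
mean = sum(samples) / len(samples)
print(p95, p99, mean)
```

With these samples the 95th percentile is 1.0 ms and the 99th is 5.0 ms, while the single 50 ms outlier pulls the mean up to 1.65 ms. Reporting both percentiles shows how much of the slow tail a cache actually removes, which an average alone would hide.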

PERFACCEL STATISTICS

The following PerfAccel statistics were collected at the end of the test run with PerfAccel.

Summary
Figure 11 - PerfAccel Summary Stats

Top Files in the Cache by Read Hits
Figure 12 - Top files in the cache by read hits

Top Files in the Cache by Size of Cache Used
Figure 13 - Top files in the cache by size of cache used

PERFACCEL ANALYTICS

Summary Graph

Figure 14 - PerfAccel advanced analytics summary graph

The graph shows high read misses and low read hits in the earlier part of the run. At about the 20-minute mark, the read misses go down and the read hits start going up. A few cache cleanups are seen in the later part of the run. There are many write misses but no write hits, which means no data in the cache is being updated.

Inode Read Hits Graph

Figure 15 - PerfAccel advanced analytics Inode Read hits graph

For the file with the most read hits, we look at the pattern of read hits over 60-second intervals. In the initial part of the run there are very few hits, as most of the file is not yet in the cache. The end of the profile is mostly writes, hence the low read hits at the end.
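The per-interval view described above can be approximated by bucketing read events into 60-second windows and computing a hit ratio for each window. This is a sketch under stated assumptions: the event format `(timestamp_seconds, was_hit)` and the function `hit_ratio_per_interval` are hypothetical and do not represent PerfAccel's data model or output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def hit_ratio_per_interval(
    events: Iterable[Tuple[float, bool]], interval_s: int = 60
) -> List[Tuple[int, float]]:
    """Return (interval_start_seconds, read_hit_ratio) for each interval that saw reads."""
    hits: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for timestamp_s, was_hit in events:
        bucket = int(timestamp_s // interval_s) * interval_s
        total[bucket] += 1
        if was_hit:
            hits[bucket] += 1
    return [(b, hits[b] / total[b]) for b in sorted(total)]


# Made-up trace: mostly misses while the cache warms up, then mostly hits.
events = [
    (5, False), (20, False), (40, False), (59, True),
    (61, True), (119, False),
    (125, True), (170, True),
]
ratios = hit_ratio_per_interval(events)
print(ratios)
```

A rising ratio across successive intervals corresponds to the warm-up behavior in the summary graph, where misses dominate early and hits take over once the hot files are cached.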

Inode Read Distribution Graph

Figure 16 - PerfAccel advanced analytics File read activity graph

The inode read distribution graph shows the read pattern for the file. The pattern is clearly random, and the overall shape matches the previous graph, with more hits in the middle part of the run than at the start or end.

RETURN ON INVESTMENT

As seen above, by using an r3.2xlarge instance, which costs less than half as much as the full-SSD i2.2xlarge instance, one can reduce infrastructure costs by more than 50%. For each i2.2xlarge node replaced with an r3.2xlarge instance, the cost savings are close to $9,000 on an annualized basis. PerfAccel can enable significant cost savings by reducing the instance size and the number of instances: its storage intelligence makes use of fast SSD devices and enables systems to handle much more load than they would otherwise be able to.

SUMMARY

Significant cost reduction
Increased scalability
Higher performance
Better visibility

PerfAccel cache with instance store SSDs on smaller EC2 instances is a winning combination of performance and cost. For use cases where the entire dataset cannot reside on instance store, PerfAccel presents an excellent solution for making effective use of faster I/O media. PerfAccel is very easy to deploy and provides valuable, insightful analytics. Detailed analytics and configurable caching policies can further improve performance through optimal use of cache space.
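The return-on-investment arithmetic above can be reproduced with a few lines. The hourly rates below are assumptions chosen for illustration and are not taken from this paper; substitute the on-demand prices that apply to your region and date. The sketch also ignores EBS volume charges and software licensing, so it only covers the instance-price difference.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming an always-on node


def annual_savings(full_ssd_hourly: float, smaller_hourly: float, nodes: int = 1) -> float:
    """Annualized instance-cost difference between two always-on instance types."""
    return (full_ssd_hourly - smaller_hourly) * HOURS_PER_YEAR * nodes


# ASSUMED on-demand hourly rates in USD, for illustration only.
I2_2XLARGE_HOURLY = 1.705
R3_2XLARGE_HOURLY = 0.700

savings = annual_savings(I2_2XLARGE_HOURLY, R3_2XLARGE_HOURLY)
reduction = 1 - R3_2XLARGE_HOURLY / I2_2XLARGE_HOURLY
print(f"annual savings per node: ${savings:,.0f} ({reduction:.0%} lower instance cost)")
```

Under these assumed rates, the difference is roughly $8,800 per node per year, which is in line with the "close to $9,000" figure and the "more than 50%" reduction quoted above.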

About DataStax: DataStax delivers Apache Cassandra, the leading distributed database technology, to the enterprise. Apache Cassandra is built to be agile, always-on, and predictably scalable to any size. With more than 400 customers in over 50 countries, DataStax is the database technology and transactional backbone of choice for the world's most innovative companies such as Netflix, Adobe, Intuit, and eBay. Based in Santa Clara, Calif., DataStax is backed by industry-leading investors including Comcast Ventures, Crosslink Capital, Lightspeed Venture Partners, Kleiner Perkins Caufield & Byers, Meritech Capital, Premji Invest and Scale Venture Partners. For more information, visit www.datastax.com or follow us @DataStax.

About Datagres: Datagres provides software that helps companies visualize, control and accelerate their application performance using deep storage intelligence. Datagres' flagship product, PerfAccel, is a powerful analytics-driven software solution that operates at the file level and can show the exact I/O pattern of an application's data access, especially in a scale-out grid environment. As a result, it provides an effective way of controlling I/O and of accelerating applications for higher throughput and lower latencies using high-performance SSD devices. The company is headquartered in Palo Alto, California and is venture-backed by Nexus Venture Partners. For more information, visit www.datagres.com.

Datagres Technologies Inc
2600 El Camino Real, Palo Alto, CA 94306
Phone: 510-402-4365
www.datagres.com