Ceph Optimization on All Flash Storage

Transcription:

Ceph Optimization on All Flash Storage
Somnath Roy, Lead Developer, SanDisk Corporation

Forward-Looking Statements
During our meeting today we may make forward-looking statements. Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to market position, market growth, product sales, industry trends, supply chain, future memory technology, production capacity, production costs, technology transitions and future products. This presentation contains information from third parties, which reflect their projections as of the date of issuance. Actual results may differ materially from those expressed in these forward-looking statements due to factors detailed under the caption "Risk Factors" and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports. We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof.

What is Ceph?
- A distributed/clustered storage system
- File, object and block interfaces on top of a common storage substrate
- Ceph is an LGPL open-source project
- Part of the Linux kernel and most distros since 2.6.x
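
As a quick illustration of the three interfaces, here is a minimal sketch using the standard Ceph command-line tools; the pool name, image name and monitor address are assumptions, not part of the original deck:

    # Object interface: store/fetch an object with the rados CLI
    rados -p data put greeting ./hello.txt
    rados -p data get greeting ./hello.out

    # Block interface: create and map an RBD image (appears as /dev/rbd0)
    rbd create data/vol0 --size 10240      # size given in MB
    rbd map data/vol0

    # File interface: mount CephFS with the kernel client
    mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret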

Flash Readiness?
- Started with Ceph Emperor (~2 years ago)
- Not much scaling with an increasing number of clients
- Minimal scaling when the number of SSDs is increased
- No resource is saturated
- CPU core usage per IOPS is very high
- The double write due to the Ceph journal increases write amplification (WA) by a minimum of 2x
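
The last point can be made concrete with a simple accounting of device writes under the FileStore journal (a sketch, assuming each client write is journaled once and then written once to the data partition on each of the R replicas):

    \text{device writes per client write} \;=\; R \times \big(\underbrace{1}_{\text{journal}} + \underbrace{1}_{\text{data}}\big) \;=\; 2R

With the 2x replication used later in this deck, that is at least four flash writes for every client write, before any filesystem or SSD-internal write amplification is counted.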

SanDisk on Ceph
- We saw a lot of potential in Ceph
- Decided to dig deep and optimize the code base for flash
- The Ceph read path was comparatively easier to tackle, so we started there first
- Also, a lot of our potential customers' workloads were largely WORM (write once, read many)

Bottlenecks Identified and Fixed
- Optimized a lot of CPU-intensive code paths
- Found that context-switching overhead is significant when the backend is very fast
- A lot of lock-contention overhead showed up; sharding helped a lot
- Fine-grained locking helped achieve more parallelism
- New/delete overhead on the IO path becomes significant
- Efficient caching of indexes (placement groups) is beneficial
- Efficient buffering while reading from the socket
- Need to disable Nagle's algorithm while scaling out
- Needed to tune tcmalloc for object sizes < 32k
(An illustrative configuration sketch follows this list.)
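
Several of these fixes surface as ordinary configuration knobs. An illustrative ceph.conf fragment of the kind of tuning involved, using Giant/Hammer-era option names; the values are examples, not the exact settings used by SanDisk:

    [osd]
    # sharded op queue reduces lock contention on fast backends
    osd_op_num_shards = 10
    osd_op_num_threads_per_shard = 2

    [global]
    # disable Nagle's algorithm on messenger sockets
    ms_tcp_nodelay = true

The tcmalloc thread-cache size is usually raised through the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable in the OSD's startup environment rather than in ceph.conf.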

Time for a Benchmark of... the InfiniFlash(TM) System IF500
- Ceph with all the optimizations mentioned above, running on top of the InfiniFlash IF100
- Plus proper filesystem/kernel tuning optimized for the InfiniFlash IF100 (see the illustrative example below)
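
An illustrative example of the kind of filesystem/kernel tuning meant here, for an all-flash SAS backend; device names, scheduler choice and values are assumptions, not the published IF100 settings:

    # Per-device block-layer settings (skip the OS disk)
    for dev in /sys/block/sd[b-z]; do
        echo noop > $dev/queue/scheduler       # no elevator reordering needed on flash
        echo 4096 > $dev/queue/read_ahead_kb
        echo 1024 > $dev/queue/nr_requests
    done

    # XFS for the OSD data partitions, mounted without atime updates
    mkfs.xfs -f -i size=2048 /dev/sdb1
    mount -o noatime,nodiratime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0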

InfiniFlash IF500 Topology on a Single 512TB InfiniFlash IF100
- IF100 bandwidth is ~8.5 GB/s (with 6Gb SAS; 12Gb SAS is coming by end of year) and ~1.5M 4K random-read IOPS
- We saw that Ceph is very resource hungry, so at least 2 physical nodes are needed on top of the IF100
- All 8 ports of an HBA need to be connected to saturate the IF100 at larger block sizes

2-Node Cluster (32 drives shared to each OSD node) - InfiniFlash System IF500 Performance Config

Setup details:
- OSD nodes: 2 servers (Dell R620), each with 2 x E5-2680 12C 2.8GHz, 4 x 16GB RDIMM dual rank x4 (64 GB), 1 x Mellanox X3 dual-port 40GbE, 1 x LSI 9207 HBA card
- RBD clients: 4 servers (Dell R620), each with 1 x E5-2680 10C 2.8GHz, 2 x 16GB RDIMM dual rank x4 (32 GB), 1 x Mellanox X3 dual-port 40GbE
- Storage: InfiniFlash IF100 connected with 64 x 1YX2 cards in A2 topology; total storage 64 x 8TB = 512TB
- Network: 40G switch
- OS: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32
- LSI card/driver: SAS2308 (9207), mpt2sas
- Mellanox 40Gbps NIC/driver: MT27500 [ConnectX-3], mlx4_en 2.2-1 (Feb 2014)

Cluster configuration:
- Ceph version: sndk-ifos-1.0.0.04 (0.86.rc.eap2)
- Replication (default): 2, host-level replication
- Pools, PGs and RBDs: 4 pools, 2048 PGs per pool, 2 RBDs from each pool (see the sketch below)
- RBD size: 2TB
- Monitors: 1
- OSD nodes: 2, with 32 OSDs per node (64 OSDs total)
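
For reference, the pool and RBD layout above can be reproduced with standard Ceph/RBD commands. A minimal sketch, assuming hypothetical pool names (pool1..pool4) and the defaults listed in the configuration:

    # 4 pools, 2048 PGs each, 2x replication, two 2TB RBD images per pool
    for p in pool1 pool2 pool3 pool4; do
        ceph osd pool create $p 2048 2048
        ceph osd pool set $p size 2
        rbd create $p/rbd1 --size 2097152     # 2 TB, size given in MB
        rbd create $p/rbd2 --size 2097152
    done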

8K Random IO Block Size
[Chart: total operations and latency (ms) vs. queue depth (1/4/16) at read percentages of 0/25/50/75/100, comparing stock Ceph with the optimized IF500 build; an illustrative load definition for this IO mix follows]
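
The IO mixes in these charts (block size, read percentage, queue depth) map naturally onto fio's rbd engine. The deck does not state which load generator was used, so this is only a sketch of one cell of the matrix, 8K blocks at 75% reads and queue depth 16, with assumed pool/image names:

    [global]
    ioengine=rbd
    clientname=admin
    pool=pool1
    rbdname=rbd1
    runtime=300
    time_based=1

    [8k-randrw-qd16]
    bs=8k
    rw=randrw
    rwmixread=75
    iodepth=16

Sweeping rwmixread over 0/25/50/75/100 and iodepth over 1/4/16 would reproduce the full matrix shown in the charts.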

64K Random IO Block Size
[Chart: total operations and latency (ms) vs. queue depth (1/4/16) at read percentages of 0/25/50/75/100, comparing stock Ceph with the optimized IF500 build]
We are saturating the ~8.5 GB/s IF100 bandwidth here.

InfiniFlash IF500 Scale-Out Topology

InfiniFlash IF500 HW Set Up - 6-Node Cluster (8 drives connected to each OSD node)

Setup details:
- OSD nodes: 6 servers (Dell R620), each with 2 x E5-2680 v2 2.8GHz (25M cache), 8 x 16GB RDIMM dual rank (128 GB), 1 x Mellanox X3 dual-port 40GbE, 1 x LSI 9207 HBA card
- RBD clients: 5 servers (Dell R620), each with 1 x E5-2680 v2 2.8GHz (25M cache), 4 x 16GB RDIMM dual rank (64 GB), 1 x Mellanox X3 dual-port 40GbE
- Network: 40G switch
- OS: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32
- LSI card/driver: SAS2308 (9207) / mpt2sas 16.100.00.00
- Mellanox 40Gbps NIC/driver: MT27500 [ConnectX-3] / mlx4_en 2.2-1

Cluster configuration:
- Ceph version: sndk-ifos-1.0.0.07 (0.87-IFOS-1.0.0.7.beta)
- Replication (default): 3, chassis-level replication (an illustrative CRUSH setup follows)
- Pools, PGs and RBDs: 5 pools, 2048 PGs per pool, 2 RBDs from each pool
- RBD size: 3TB; total data set size = 30TB
- Monitors: 1
- OSD nodes: 6, with 8 OSDs per node (48 OSDs total)
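
Chassis-level replication is expressed through the CRUSH map. A minimal sketch of how such a layout can be set up, assuming hypothetical bucket and rule names (chassis1, node1, repl_chassis); the actual CRUSH map used in these tests is not shown in the deck:

    # Group OSD hosts under chassis buckets, then replicate across chassis
    ceph osd crush add-bucket chassis1 chassis
    ceph osd crush move chassis1 root=default
    ceph osd crush move node1 chassis=chassis1      # repeat for each OSD host/chassis

    # Rule that places each replica in a different chassis
    ceph osd crush rule create-simple repl_chassis default chassis

    # 3x replicated pool using that rule (2048 PGs, as above)
    ceph osd pool create pool1 2048 2048 replicated repl_chassis
    ceph osd pool set pool1 size 3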

4K Random IOPS Performance
[Chart: IOPS and latency (ms) vs. queue depth and read percentage]
With tuning, a maximum cumulative performance of ~700K IOPS was measured for 4KB blocks, saturating the node CPUs at that point.

64K Random IOPS Performance
[Chart: IOPS and latency (ms) vs. queue depth and read percentage]
A maximum bandwidth of 12.8 GB/s was measured for 64KB blocks.

90-10 Random Read Performance
[Chart: performance vs. queue depth and block size]

Client IO with Recovery after 1 OSD Down/Out
[Chart: client IO vs. recovery time in seconds]
Notes:
a. Recovery/rebalancing time is around 90 minutes when one OSD goes down, under an IO load of 64K random read at QD 64 (user data 10 TB; with 3-way replication total data on the cluster is 30 TB, and each OSD holds around 1.2 TB).
b. Recovery parameters were set to prioritize client IO (see the illustrative settings below).
c. Average performance degradation is around 15% while recovery is happening.
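
Note (b) refers to throttling recovery so that client IO keeps priority. A minimal ceph.conf sketch of the kind of settings involved; the values are illustrative, not the exact ones used in these tests:

    [osd]
    osd_max_backfills = 1
    osd_recovery_max_active = 1
    osd_recovery_op_priority = 1
    osd_client_op_priority = 63
    osd_recovery_threads = 1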

Client IO with Recovery after the Down OSD Comes Back In
[Chart: client IO vs. recovery time in seconds]
Notes:
a. Recovery/rebalancing time is around 60 minutes when the OSD comes back in, under an IO load of 64K random read at QD 64.
b. Average performance degradation is around 5% while recovery is happening.
c. Recovery parameters were set to prioritize client IO.

Client IO with Recovery after an Entire Chassis (128 TB) Goes Down/Out
[Chart: client IO vs. recovery time in seconds]
Notes:
a. Recovery/rebalancing time is around 7 hours when one chassis goes down, under an IO load of 64K random read.
b. User data is 10 TB; with 3-way replication total data on the cluster is 30 TB, and each chassis holds around 10 TB.
c. Recovery parameters were set to prioritize client IO.

Client IO with Recovery after the Down Chassis (128 TB) Comes Back In
[Chart: client IO vs. recovery time in seconds]
Note: Recovery/rebalancing time is around 2 hours after adding back the removed chassis, under an IO load of 64K random read at QD 64 (user data 10 TB; with 3-way replication total data on the cluster is 30 TB, and each chassis holds around 10 TB).

InfiniFlash IF500 OpenStack Topology

Performance Trends
[Chart: aggregate IOPS vs. number of compute nodes (1 to 4) for four workloads: 4K 100% random read at QD 16, 4K 100% random read at QD 64, 16K 90% random read at QD 16, and 16K 90% random read at QD 64; throughput scales with compute nodes, peaking at roughly 218K IOPS]
VM configuration: each compute node hosts 4 VMs, and each VM is mapped to four 1TB RBDs. The VMs' root disks also come from the same Ceph cluster. Each VM has 8 virtual CPUs and 16GB of RAM.
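
In such a setup, the data RBDs and the VM root disks are typically wired up through Cinder's RBD driver and Nova's libvirt RBD image backend. A minimal sketch with hypothetical pool and user names; the deck does not show the actual OpenStack configuration:

    # cinder.conf -- RBD-backed volumes (add "enabled_backends = rbd-ceph" under [DEFAULT])
    [rbd-ceph]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_user = cinder
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_secret_uuid = <libvirt secret uuid>

    # nova.conf -- VM root disks on the same cluster
    [libvirt]
    images_type = rbd
    images_rbd_pool = vms
    images_rbd_ceph_conf = /etc/ceph/ceph.conf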

InfiniFlash IF500 Object Storage Topology
Note: Clients and load balancers are not shown in the topology above. COSBench is used to generate the workload. Also, each gateway server runs multiple RGW/web-server combinations.

Object Throughput with EC

Erasure-coded pool configuration:
- Host-level erasure coding, 4:2 (k:m), using the jerasure plugin with the Reed-Solomon technique
- .rgw.buckets is placed in the erasure-coded pool; the other pools are replicated
(An illustrative profile and pool definition follows this list.)

Results:
- Bandwidth of ~14 GB/s achieved for 4MB object reads from 48 drives
- For writes, ~3 GB/s throughput achieved with the host-based EC pool configuration
- The network saturates at around ~3.8 GB/s from a single storage node for reads
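
A minimal sketch of an erasure-code profile and pool matching the 4:2 jerasure/Reed-Solomon layout above; the profile name and PG count are assumptions, and the option names follow the Giant-era syntax (failure domain set via ruleset-failure-domain):

    ceph osd erasure-code-profile set ec42 \
        k=4 m=2 plugin=jerasure technique=reed_sol_van \
        ruleset-failure-domain=host
    ceph osd pool create .rgw.buckets 2048 2048 erasure ec42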

Write Performance with Scaling OSDs

Write object performance with the EC pool, scaling from 48 drives to 96 drives (4M_Write profile, 4 RGW servers, 16 RGW instances, 4 clients, 512 workers):
- 48 OSD drives: 2.86 GB/s, 90th-percentile response time 1870 ms, avg. CPU usage 40% (OSD nodes), 10% (RGW nodes), 15% (COSBench)
- 96 OSD drives: 4.9 GB/s, 90th-percentile response time 420 ms, avg. CPU usage 65-70% (OSD nodes), 30-35% (RGW nodes), 40-45% (COSBench)

With a 2x increase in OSD drives, write performance improves by around 65-70%.

Optimal EC Performance: Cauchy-good

With 4 RGW servers, 16 RGW instances, 4 clients and 512 workers (20 TB data set):
- 4M_Write: 4.1 GB/s, 90th-percentile response time 820 ms, avg. CPU usage 40-45% (OSD nodes), 20-25% (RGW nodes)
- 4M_Read: 14.33 GB/s, 90th-percentile response time 220 ms, avg. CPU usage 50-55% (OSD nodes), 50-52% (RGW nodes)

Cauchy-good optimal write performance is at least 33% better than Reed-Solomon; some runs showed around a 45% improvement. Cauchy-good optimal read performance is about 10% better, though there was not much scope for improvement here since we were already saturating the network bandwidth of the RGW nodes. (An illustrative Cauchy-good profile follows.)
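
The Cauchy-good variant is selected the same way, just with a different jerasure technique; the profile name is again an assumption:

    ceph osd erasure-code-profile set ec42_cauchy \
        k=4 m=2 plugin=jerasure technique=cauchy_good \
        ruleset-failure-domain=host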

Degraded EC Read Performance: Reed-Solomon vs. Cauchy-good

All with the 4M_Read profile:
- Cauchy-good, degraded (1 node down): 12.9 GB/s, 90th-percentile response time 230 ms, avg. OSD-node CPU 71%
- Cauchy-good, degraded (2 nodes down): 9.83 GB/s, 90th-percentile response time 350 ms, avg. OSD-node CPU 75-80%
- Reed-Solomon, degraded (1 node down): 8.26 GB/s, 90th-percentile response time 250 ms, avg. OSD-node CPU 60%
- Reed-Solomon, degraded (2 nodes down): 3.88 GB/s, 90th-percentile response time 740 ms, avg. OSD-node CPU 50-55%

Cauchy-good does far better than the Reed-Solomon technique for both optimal and recovery IO performance; relative to its write-performance advantage the gap here is even larger, reaching almost double the throughput in the degraded-read cases.

InfiniFlash IF500 Future Goals
- Improved write performance
- Improved write amplification
- Improved mixed read/write performance
- Convenient end-to-end management/monitoring UI
- Easy deployment
- RDMA support
- EC with the block interface

Thank you! Questions?
somnath.roy@sandisk.com
Visit SanDisk in Booth #207
@BigDataFlash #bigdataflash
ITblog.sandisk.com
http://bigdataflash.sandisk.com
2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash is a trademark of SanDisk Corporation. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).