Flash-Optimized Software-Defined Storage
Nikolas Ioannou, Ioannis Koltsidas, Roman Pletka, Sasa Tomic, Thomas Weigold
IBM Research Zurich
New Market Category of Big Data Flash
- Many workloads do not really need the write performance and endurance of good Flash
- In certain environments data is actually immutable
- What matters is high density, low cost, and good read performance; current Flash architectures are not a good fit
- ebay: "We could live with 1/3rd the number of writes that normal flash supports, as long as we could get it for 1/4th the price."
- IDC just introduced a new market category of Big Data Flash (March 2015)
  - Target workloads: content repositories, media and streaming services, Big Data and analytics, NoSQL, object storage, web infrastructure
- "At <$1/GB for raw Flash, total acquisition cost becomes the same as an HDD-based solution, with much lower TCO." - IDC
Low-cost Flash technology (c-MLC, TLC)
- Can't we just use low-cost SSDs?
- Low-cost Flash suffers from high write latency and low endurance
  - E.g., TLC, 3D-NAND, c-MLC
- Low-cost SSDs have limited resources and simple controllers to keep the cost as low as possible (~$0.4/GB!)
- Therefore, they only employ simple Flash management
  - Sufficiently good read performance
  - But limited write endurance and terrible write performance
- Raw low-cost SSDs are practically unusable in a real datacenter
[Figure: Read performance (4kB random reads), latency (usec) vs. throughput (kIOPS): >50k IOPS @ 300usec]
[Figure: Write latency (4kB random writes): MLC PCIe card ~53usec; SATA TLC SSD ~1800usec, almost as slow as a 15k RPM HDD at ~2200usec]
The characteristics of write performance
[Figure: Write bandwidth (MB/s) with varying block sizes - 256MB sequential, then random I/O at 4kB, 1MB, 64MB, 256MB, 512MB, 1024MB, and 1536MB - comparing sequential I/O on a newly formatted drive against multiple overwrites of the drive with random I/O]
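Why random overwrites hurt so much is garbage collection inside the drive: the more live data a reclaimed Flash block still holds, the more pages must be relocated before it can be erased. The following is a minimal, self-contained sketch (hypothetical parameters, not measurements from these slides) that estimates write amplification for a greedy garbage collector under uniform random overwrites:

```python
import random

def simulate_wa(num_blocks=128, pages_per_block=64,
                op_fraction=0.125, host_writes=500_000, seed=42):
    """Toy FTL model: uniform random overwrites + greedy garbage collection.

    Returns physical page writes / host page writes (write amplification).
    op_fraction is the overprovisioned share of raw capacity; only
    (1 - op_fraction) of the pages are visible to the host.
    """
    rng = random.Random(seed)
    total = num_blocks * pages_per_block
    logical = int(total * (1 - op_fraction))

    resident = [set() for _ in range(num_blocks)]  # live logical pages per block
    block_of = {}                                  # logical page -> flash block
    free = list(range(num_blocks))
    cur, fill = free.pop(), 0
    phys = 0

    def append(lp):
        nonlocal cur, fill, phys
        if fill == pages_per_block:                # current block full: open a new one
            cur, fill = free.pop(), 0
        block_of[lp] = cur
        resident[cur].add(lp)
        fill += 1
        phys += 1

    def invalidate(lp):
        b = block_of.pop(lp, None)
        if b is not None:
            resident[b].discard(lp)

    for _ in range(host_writes):
        while len(free) <= 1:                      # greedy GC keeps blocks free
            victim = min((b for b in range(num_blocks)
                          if b != cur and b not in free),
                         key=lambda b: len(resident[b]))
            for lp in list(resident[victim]):      # relocate still-live pages
                invalidate(lp)
                append(lp)
            free.append(victim)                    # erase the reclaimed block
        lp = rng.randrange(logical)
        invalidate(lp)                             # the old copy becomes garbage
        append(lp)

    return phys / host_writes

print(f"write amplification at 12.5% OP ~ {simulate_wa():.2f}")
```

Shrinking op_fraction drives the estimate up sharply, which is one reason consumer drives with little spare capacity write so slowly once they have been filled and overwritten.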
SALSA: SoftwAre Log-Structured Array
- What? A Flash-optimized I/O stack that elevates the performance and endurance of consumer-level SSDs to enterprise standards.
- Why? To offer cost-effective all-Flash storage in public and private clouds, mainly for read-dominated workloads, complementing our high-end FlashSystem offerings.
- How? Squeeze the most capacity out of Flash:
  1. Use high-density, low-cost, off-the-shelf Flash SSDs
  2. Move complexity from hardware to software to reduce cost
  3. Optimize end-to-end for low Write Amplification
  4. Employ aggressive Data Reduction
  5. Natively support Object Storage
- SALSA implements state-of-the-art Flash management in software
- SALSA runs on Linux and exposes standard interfaces; file systems and applications run unmodified on top of it
- SALSA is ideal for cost-optimized scale-out storage systems like GPFS and CEPH, enabling SDS on low-cost SSDs with high performance and endurance
SALSA Overview
- Interfaces: Block, Object, I/O memory (RDMA)
- Logical Layer (software stack):
  - Workload isolation, thin provisioning, storage virtualization, Quality of Service
  - Data reduction: compression, de-duplication, recurring pattern detection
- Physical Layer (Log-Structured Array):
  - Log-structured organization, Flash-friendly access patterns
  - State-of-the-art garbage collection, zero read-modify-writes
  - RAID5-equivalent protection, small footprint
  - Capacity management, traffic shaping, load balancing, I/O handling
- Targets low-cost, high-density consumer SSDs (SATA TLC, 3D NAND, c-MLC)
  - Limited resources (FPGA, CPU, RAM); light Flash management, simple GC
- Runs on Linux, Intel x86 and POWER8
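Of the data-reduction techniques above, recurring pattern detection is the simplest to illustrate: a page that consists of one short repeating byte sequence (all zeros being the common case) need not occupy a Flash page at all, only a marker in the logical-to-physical map. A hypothetical sketch of such a detector, not SALSA's actual implementation:

```python
def detect_recurring_pattern(page: bytes, max_period: int = 16):
    """Return the shortest repeating byte pattern that fills the page,
    or None if the page is not a simple repetition.

    A page equal to pattern * k can be stored as (pattern, length) in the
    logical-to-physical map instead of consuming a physical Flash page.
    """
    for period in range(1, max_period + 1):
        if len(page) % period == 0 and page == page[:period] * (len(page) // period):
            return page[:period]
    return None

# A 4 KiB page of zeros collapses to a 1-byte pattern:
assert detect_recurring_pattern(b"\x00" * 4096) == b"\x00"
# A longer repeating sequence is found if max_period allows it:
assert detect_recurring_pattern(bytes(range(256)) * 16, max_period=256) == bytes(range(256))
# Arbitrary data is left alone and stored normally:
assert detect_recurring_pattern(b"\x01\x02\x03" + b"\x00" * 4093) is None
```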
SALSA Stack
- Interfaces: Block, I/O Memory, Object
- Logical Layer: Volume 1, Volume 2, Volume 3; thin-provisioned space; de-duplication; recurring pattern detection
- Physical Layer:
  - Write destage buffer; segments and grains; parallel writes to SSDs
  - Global garbage collection; parity generation
  - Globally shared overprovisioning space
  - Heat segregation; write stream separation
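Heat segregation and write-stream separation both try to co-locate data with a similar update frequency, so that segments end up either mostly hot (reclaimed nearly empty) or mostly cold (rarely relocated), which lowers garbage-collection cost. A minimal, hypothetical sketch of such a routing decision (the decay factor and thresholds are illustrative, not SALSA's):

```python
from collections import defaultdict

class StreamSeparator:
    """Route each logical write to one of several open segments by 'heat'.

    Heat here is simply an exponentially decayed write count per logical
    block; real systems use more elaborate estimators.
    """
    def __init__(self, num_streams=3, decay=0.9):
        self.num_streams = num_streams
        self.decay = decay
        self.heat = defaultdict(float)

    def record_write(self, lba: int) -> int:
        # Decay the old heat and add the new write, then pick a stream.
        self.heat[lba] = self.heat[lba] * self.decay + 1.0
        return self.stream_for(lba)

    def stream_for(self, lba: int) -> int:
        # Hotter blocks go to higher-numbered streams (open segments).
        h = self.heat[lba]
        if h < 1.5:
            return 0                      # cold: written once in a long while
        if h < 4.0:
            return 1                      # warm
        return self.num_streams - 1       # hot: overwritten frequently

sep = StreamSeparator()
for _ in range(10):
    s = sep.record_write(lba=42)          # block 42 becomes hot
print("block 42 ->", s, "| block 7 ->", sep.record_write(lba=7))
```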
SALSA Stack in Linux
- Logical Layer: Linux Device Mapper devices /dev/mapper/vol0, /dev/mapper/vol1, /dev/mapper/vol2, accessed through the Linux block layer
- Physical Layer: a Linux Device Mapper device, /dev/mapper/array0
- Kernel components: Device Mapper frontend, garbage collection, I/O handling, and configuration & RAS, sitting between the Linux block layer and the SSD device driver
- User space: configuration & tooling
- Hardware: the SSDs themselves
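Conceptually, each Device Mapper volume is a translation layer: a logical block on /dev/mapper/volN maps to a physical location in the shared array, allocated on first write. A toy model of such a thin-provisioned mapping (hypothetical, far simpler than a real Device Mapper target):

```python
import itertools

class ThinVolume:
    """Toy thin-provisioned volume: physical space in the shared array
    is allocated only when a logical block is first written."""
    def __init__(self, array_allocator):
        self.allocate = array_allocator   # callable returning a physical block
        self.map = {}                     # logical block -> physical block

    def write(self, lblock: int) -> int:
        if lblock not in self.map:        # first write: allocate on demand
            self.map[lblock] = self.allocate()
        return self.map[lblock]

    def read(self, lblock: int):
        return self.map.get(lblock)       # None: never written (reads as zeros)

next_free = itertools.count()             # the shared array hands out blocks in order
vol0 = ThinVolume(lambda: next(next_free))
vol1 = ThinVolume(lambda: next(next_free))
print(vol0.write(100), vol1.write(0), vol0.write(100))   # -> 0 1 0
```

The point of the shared allocator is that all volumes draw physical space, and therefore overprovisioning, from one global pool rather than from fixed per-volume partitions.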
Experiments: Block Storage
- Using SALSA in a commodity Linux server to create an array out of 5 SSDs, with RAID5-equivalent parity protection
- Comparing against RAID0 and RAID5 on the same SSDs
[Figure: Random 100/0 R/W - read latency (msec) vs. read throughput (kIOPS, up to ~350) for SALSA, RAID0, and RAID5]
[Figure: Random 80/20 R/W - latency (msec) vs. total throughput (kIOPS): SALSA reaches roughly 41x and 13x the throughput of RAID5 and RAID0, respectively]
- SALSA dramatically improves performance in the presence of writes
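The gap under writes is largely RAID5's small-write penalty: an in-place update must read the old data and old parity before writing both back, while a log-structured array buffers writes and emits only full stripes, computing parity once with no reads. A hedged sketch of the two parity paths:

```python
def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks (RAID5-style parity)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def raid5_small_write(old_data, old_parity, new_data):
    """In-place RAID5 update: 2 reads + 2 writes per user write.

    new_parity = old_parity XOR old_data XOR new_data
    """
    return xor_blocks([old_parity, old_data, new_data])

def log_structured_stripe_write(data_blocks):
    """Log-structured full-stripe write: parity is computed once over
    freshly buffered data, so no old data or parity is read back."""
    return data_blocks + [xor_blocks(data_blocks)]

stripe = [bytes([i]) * 8 for i in range(4)]          # 4 data blocks
full = log_structured_stripe_write(stripe)           # 4 data + 1 parity, 0 reads
patched = raid5_small_write(stripe[0], full[-1], b"\xff" * 8)
print(len(full), "device writes for a full stripe; patched parity:", patched.hex())
```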
CEPH on a 3-node x86 cluster
- 10 Gbit Ethernet network
- 2 x 1TB TLC SSDs per node, XFS on each node
- Replication factor of 3
- Mixed read/write random I/O
[Figure: Throughput (KB/s) over time (seconds, 0-3600): baseline (CEPH on raw SSDs) vs. CEPH on SALSA - 39x higher throughput at steady state]
- SALSA can enable CEPH on Flash with high performance at a low cost!
Performance: Virtualized TPC-E
- TPC-E OLTP benchmark that simulates the workload of a brokerage firm
- Running against DB2 in a KVM Linux guest; 90% reads / 10% writes
- SALSA vs. RAID5 using 5 x 1TB TLC SSDs
[Figure: TPC-E transactional throughput (tps) over time (0-20000 sec): SALSA delivers 3.8x higher transactional throughput than RAID5]
[Figure: TPC-E transactional latency (msec) over time: SALSA delivers 6.4x lower transactional latency than RAID5]
Endurance
- Test using an off-the-shelf low-cost SSD ($0.4/GB)
- We measured the wear of the device, as reported by vendor-specific S.M.A.R.T. attributes
- Comparing the wear incurred by SALSA to the wear incurred using the raw device
[Figure: Device wear vs. full device writes (0-10): after the same number of full device writes, SALSA incurs 4.6x less wear than the raw device]
- SALSA prolongs the device lifetime by 4.6 times!
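How one might read such a wear indicator in practice: the sketch below shells out to smartctl (from smartmontools) and picks one S.M.A.R.T. attribute out of its table. Attribute 177 (Wear_Leveling_Count) is used by some SATA SSD vendors, but the ID and its semantics are vendor-specific, so treat both as illustrative assumptions.

```python
import subprocess

def read_wear_attribute(device: str, attr_id: int = 177):
    """Read one S.M.A.R.T. attribute via `smartctl -A <device>`.

    Attribute 177 (Wear_Leveling_Count) is a common but vendor-specific
    wear indicator. Requires smartmontools and root privileges.
    """
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Attribute-table rows start with the numeric attribute ID.
        if fields and fields[0].isdigit() and int(fields[0]) == attr_id:
            # Columns: ID# NAME FLAG VALUE WORST THRESH ... RAW_VALUE
            return {"name": fields[1],
                    "normalized": int(fields[3]),
                    "raw": fields[-1]}
    return None

# Example (run as root; device name is illustrative):
# print(read_wear_attribute("/dev/sda"))
```

Sampling the attribute before and after a fixed number of full device writes, once on the raw device and once through the storage stack, yields wear curves like the ones plotted above.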
Conclusion
- Low-cost Flash is in high demand; many workloads could benefit tremendously from capacity-optimized Flash
- SALSA is a Flash-optimized storage virtualization stack for Linux
  - Shifts the complexity of the FTL to software
  - Transforms user access patterns to be as Flash-friendly as possible
  - Elevates the performance and endurance of low-cost SSDs to enterprise standards
- File systems and applications do not need to be modified
- The result: lower write amplification, higher performance, more capacity, longer device lifetime
Questions?
www.research.ibm.com/labs/zurich/cci/