Methods to Achieve Low Latency and Consistent Performance
Alan Wu, Architect, Memblaze
zhongjie.wu@memblaze.com
2015/8/13
Software Defined Flash Storage System
- Memblaze provides memos, a software-defined flash storage system.
Design Challenges
- Low-latency challenges
  - Serving write requests with low latency
  - Interaction between read and write requests
  - Balancing bandwidth against latency
- Consistent-performance challenges
  - The Linux OS makes performance inconsistent
  - Multi-core / NUMA effects hurt performance consistency
  - Keeping latency low while sustaining high IOPS
Where Is the Bottleneck of a Flash System?
Traditional Write Path Analysis (diagram)
- (1) NIC/IB/FC: the initiator and export interface are a performance bottleneck
- (2) Linux task scheduler: introduces IO latency and jitter
- Write-back cache: caching can reduce IO latency
- (3) Linux IO scheduler: introduces IO latency and jitter
- (4) HDD: GC and random writes affect performance
Traditional Read Path Analysis (diagram)
- (1) NIC/FC/IB: the interface is a performance bottleneck
- (2) Linux task scheduler: introduces IO latency and jitter
- Linux IO scheduler: introduces IO latency and jitter
- (3) HDD: GC and concurrent writes affect read performance
- (4) Interrupts affect performance
New Approach: RISL Software Architecture
- SSD characteristics
  - Random writes generate a lot of mapping information and keep GC busy
  - Sequential writes let the FTL work in its best condition
  - High random read performance
  - Write/erase operations affect read performance
- Memblaze answer: RISL, Random Input Stream Layout (patent filed by Memblaze)
  - Whatever the input IO pattern, the data layout on the SSD is always sequential
- RISL includes:
  - A non-volatile write cache that converts any write pattern into a sequential one
  - Separation of read and write requests into different containers (storage objects)
  - A combined pipeline and run-to-complete IO model for write requests
  - A run-to-complete IO model for read requests
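The core RISL idea above can be sketched in a few lines. This is a toy illustration, not Memblaze code: all names and sizes are invented. Writes to arbitrary logical addresses are appended sequentially into an active container, a mapping table records where each logical block landed, and a full container is sealed and becomes read-only.

```python
CONTAINER_SIZE = 4  # blocks per container; tiny for illustration

class RislLayout:
    """Toy append-only layout: random input, sequential on-device order."""

    def __init__(self):
        self.mapping = {}   # logical block -> (container_id, slot)
        self.sealed = []    # full, read-only containers
        self.active = []    # container currently receiving the write stream
        self.active_id = 0

    def write(self, lba, data):
        # Append sequentially regardless of the logical address.
        slot = len(self.active)
        self.active.append(data)
        self.mapping[lba] = (self.active_id, slot)
        if len(self.active) == CONTAINER_SIZE:
            self.sealed.append(self.active)  # seal the full container
            self.active = []
            self.active_id += 1

    def read(self, lba):
        cid, slot = self.mapping[lba]
        container = self.active if cid == self.active_id else self.sealed[cid]
        return container[slot]

layout = RislLayout()
for lba in [42, 7, 99, 3, 58]:   # a random write pattern
    layout.write(lba, f"data@{lba}")

# Physically, the data landed in sequential append order:
assert layout.sealed[0] == ["data@42", "data@7", "data@99", "data@3"]
assert layout.read(58) == "data@58"
```

Sealing containers also gives the read/write separation the slide mentions: reads mostly hit sealed containers while the active one absorbs the write stream.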
RISL Architecture (diagram)
- memfs: mapping cache, fingerprint cache, write cache (NVDIMM)
- memos: pipeline IO model for the write stream; run-to-complete IO model for read requests
- memraid: containers on SSD, with one active container receiving the write stream and sealed containers serving reads
Introduce NVDIMM to Reduce Write Latency
- NVDIMM vs. SSD
  - NVDIMM has higher IOPS and lower latency: 10 ~ 100 ns
  - SSD has higher capacity but higher latency: 10 ~ 100 us
- Benefits from NVDIMM
  - Avoids updating metadata on the SSD frequently
  - Serves as a write cache, reducing latency for write requests
  - Converts any request pattern into a sequential one
  - Converts an IOPS problem into a bandwidth problem
  - Enables a pipeline IO handling model for write requests
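"Converts an IOPS problem into a bandwidth problem" is back-of-envelope arithmetic: small writes acknowledged from the NVDIMM cache are coalesced into large sequential flush units for the SSDs behind it. The 4 KB IO size matches the evaluation slides; the 1 MB flush unit and the round IOPS figure are assumed values for illustration.

```python
IO_SIZE = 4 * 1024            # 4 KB random writes, as in the evaluation
FLUSH_UNIT = 1 * 1024 * 1024  # hypothetical 1 MB sequential flush unit
iops = 800_000                # roughly the IOPS level shown later

# The random-write workload becomes a sequential stream of this rate:
stream_bandwidth = iops * IO_SIZE        # bytes/s
flushes_per_sec = stream_bandwidth / FLUSH_UNIT

print(stream_bandwidth / 2**30)  # ~3.05 GiB/s sequential write stream
print(flushes_per_sec)           # 3125 large flushes instead of 800k small IOs
```

The SSDs then only need to sustain a few thousand large sequential writes per second, which is exactly the pattern the FTL handles best.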
IO Handling Model in RISL
- Design conflict: bandwidth vs. latency
- IO handling models
  - Pipeline: aggregates bandwidth but introduces latency
  - Run-to-complete: reduces latency but limits bandwidth
- Combine pipeline and run-to-complete
  - Separate the write and read handling paths
  - Writes use both the run-to-complete and pipeline models, with NVDIMM to reduce latency
  - Reads use the run-to-complete model; adding CPUs increases bandwidth
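The contrast between the two models can be made concrete with a toy sketch (hypothetical code, not Memblaze's): run-to-complete drives one request through every stage on one CPU with no hand-offs, while a pipeline pushes batches through the stages, which amortizes cost across CPUs but adds queueing latency between stages for each individual request.

```python
# Stand-in stage functions for the sketch:
def parse(r):  return ("parsed", r)
def dedupe(d): return ("deduped", d)
def commit(d): return ("committed", d)

def run_to_complete(request):
    """One handler takes the request start to finish on one CPU:
    no hand-off between stages, so no inter-stage queueing latency."""
    data = parse(request)
    data = dedupe(data)
    return commit(data)

def pipeline(requests):
    """Each stage processes a whole batch before handing it on: stages
    can run on different CPUs for bandwidth, but every hand-off adds
    queueing latency to each individual request."""
    batch = [parse(r) for r in requests]
    batch = [dedupe(d) for d in batch]
    return [commit(d) for d in batch]

# Both models produce the same result; they differ in latency/bandwidth.
assert run_to_complete("w1") == pipeline(["w1"])[0]
```

RISL's combination amounts to using the left path where latency matters (the front of the write path and all reads) and the right path where throughput matters (streaming sealed data to the SSDs).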
Write Data Path with RISL (diagram)
- Random requests enter per-CPU write handlers (run-to-complete IO model) and complete via a write callback
- Container write handlers and a data-dedupe handler feed a task scheduler in memos (pipeline IO model)
- memos streams the data to the SSD; completions return via interrupt/callback
Read Data Path with RISL (diagram)
- Read requests are served by per-CPU read handlers in memos (run-to-complete IO model)
- The handler (1) issues the read to the SSD and (2) completes the read callback via interrupt/callback
Write Latency Evaluation with RISL
- Write latency is about 160 us (8 NVMe SSDs, RAID6, 4 GB NVDIMM)
- [Charts: Random Write IOPS and Random Write Latency over time, 4 KB, queue_depth=32]
Read Latency Evaluation with RISL
- At 820,000 IOPS, read latency is about 230 us (8 NVMe SSDs)
- [Charts: Random Read IOPS and Random Read Latency over time, 4 KB, queue_depth=32]
How to Make Performance Consistent?
- Mixed IO request types
  - Separate read and write handling
  - Write requests are dispatched to active containers; read requests are distributed over sealed containers
- Linux OS affects performance consistency
  - Linux task scheduler: use cgroups to partition CPU resources
  - Interrupts: set interrupt affinity and balance interrupts across cores on a multi-core platform
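The two Linux knobs named above can be sketched as follows. The cgroup-v1 cpuset path and the `/proc/irq/<N>/smp_affinity` hex-bitmask format are the standard kernel interfaces; the cgroup name `io`, the IRQ number, and the CPU lists are made-up examples, and writing these files requires root, so the sketch only builds the values and shows where they would go.

```python
def cpu_mask(cpus):
    """Bitmask accepted by /proc/irq/<N>/smp_affinity, as hex text."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def apply_isolation(io_cpus, irq, irq_cpus):
    # Pin the IO-handling processes to dedicated cores via a cpuset
    # (cgroup name "io" is hypothetical):
    with open("/sys/fs/cgroup/cpuset/io/cpuset.cpus", "w") as f:
        f.write(",".join(str(c) for c in io_cpus))
    # Steer the device's interrupts onto chosen cores, away from the
    # cores running the latency-sensitive handlers:
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_mask(irq_cpus))

assert cpu_mask([0, 1]) == "3"   # CPUs 0 and 1 -> bits 0b11
assert cpu_mask([4]) == "10"     # CPU 4 -> bit 0b10000
```

Keeping scheduler-managed tasks and device interrupts off the IO-handling cores is what removes the jitter the earlier path-analysis slides attribute to the task scheduler and to interrupts.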
Isolate CPUs to Make Performance Consistent
- cgroups make performance more consistent
- [Chart: latency comparison over time for mixed read/write, cgroup vs. no_cgroup]
Sustained Latency Evaluation
- Cumulative distribution function of latency (8 NVMe SSDs, RAID6)
- [Chart: random read & write latency CDF for 6K IOPS (Q1, 4 KB), 60K IOPS (Q1, 4 KB), 770K IOPS (Q32, 4 KB), and 820K IOPS (Q32, 4 KB)]
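A latency CDF like the one plotted here is straightforward to compute from raw per-IO latency samples; the sample values below are invented for illustration.

```python
def latency_cdf(samples_us):
    """Return sorted (latency, cumulative probability) points."""
    xs = sorted(samples_us)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

points = latency_cdf([120, 450, 160, 230, 160, 900, 230, 230])

# "What fraction of IOs completed within 500 us?" is then a lookup:
within_500 = max(p for x, p in points if x <= 500)
assert within_500 == 0.875   # 7 of the 8 samples
```

Plotting such curves for several load points (as the slide does for Q1 and Q32) shows not just the average latency but how long the tail grows under load, which is what "consistent performance" really measures.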
Conclusion
- The RISL (Random Input Stream Layout) architecture ensures low latency and consistent performance
  - Uses NVDIMM as a write cache
  - Separates read and write requests
  - Combines the pipeline and run-to-complete IO handling models
  - Converts any IO pattern into a sequential stream on the SSD
  - Optimizes the data layout on the SSD
- Linux is tuned for consistent performance: cgroups, interrupt affinity, request affinity
Thank You!
http://www.memblaze.com