PBLCACHE: A client-side persistent block cache for the data center. Vault Boston 2015 - Luis Pabón - Red Hat
ABOUT ME LUIS PABÓN Principal Software Engineer, Red Hat Storage IRC, GitHub: lpabon
QUESTIONS: What are the benefits of client-side persistent caching? How can the SSD be used effectively? [Diagram: compute node with a local SSD in front of the storage backend]
MERCURY* Use in-memory data structures to handle cache misses as quickly as possible. Write sequentially to the SSD. Increase storage backend availability by reducing read requests. The cache must be persistent, since warming can be time-consuming. * S. Byan et al., Mercury: Host-side flash caching for the data center
MERCURY QEMU INTEGRATION
PBLCACHE
PBLCACHE Persistent BLock Cache: a persistent, block-based, look-aside cache for QEMU. User-space library/application. Based on ideas described in the Mercury paper. Requires exclusive access to mutable objects.
GOAL: QEMU SHARED CACHE
PBLCACHE ARCHITECTURE [Diagram: PBL Application → Cache Map → Log → SSD]
PBL APPLICATION Sets up the cache map and log. Decides how to use the cache (write-through, read-miss). Inserts, retrieves, or invalidates blocks from the cache. [Diagram: PBL App → Msg Queue → Cache Map and Log]
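To make the look-aside read-miss flow concrete, here is a minimal sketch in Go (the language pblcache is written in). The Cache interface, readBlock helper, and backend callback are illustrative names for this sketch only, not the actual pblcache API.

package pblsketch

// Cache models, for illustration only, the operations a PBL application
// performs against the cache; these names are not the real pblcache API.
type Cache interface {
	Get(block uint64, buf []byte) bool // true on a cache hit
	Put(block uint64, buf []byte)      // insert after a miss or a write
	Invalidate(block uint64)           // drop a block that became stale
}

// readBlock shows a read-miss (look-aside) policy: try the cache first,
// fall back to the storage backend on a miss, then insert the block so the
// next read of it is served from the local SSD.
func readBlock(c Cache, backend func(block uint64) []byte, block uint64) []byte {
	buf := make([]byte, 4096) // 4 KiB block size assumed for this sketch
	if c.Get(block, buf) {
		return buf // cache hit: served from the local SSD
	}
	data := backend(block) // cache miss: read from the storage backend
	c.Put(block, data)     // warm the cache for future reads
	return data
}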
CACHE MAP Composed of two data structures, the Address Map and the Block Descriptor Array, which together maintain all block metadata.
ADDRESS MAP Implemented as a hash table. Translates object blocks to Block Descriptor Array (BDA) indices. Cache misses are detected extremely quickly.
BLOCK DESCRIPTOR ARRAY Contains metadata for the blocks stored in the log. Its length equals the maximum number of blocks the log can hold. Handles CLOCK evictions. Invalidations are extremely fast. Insertions always append.
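A minimal sketch of the two cache-map structures, continuing the hypothetical pblsketch package above; the type and field names are illustrative, not the actual pblcache definitions.

package pblsketch

// blockKey identifies one block of a cached object (for example, a block
// of a VM image).
type blockKey struct {
	Object uint16 // which object or volume the block belongs to
	Block  uint64 // block offset within that object
}

// bdaEntry is the metadata kept for one block stored in the log.
type bdaEntry struct {
	Key   blockKey // reverse mapping back to the object block, used by eviction
	Clock bool     // CLOCK reference bit, set on every cache hit
	Used  bool     // false once the slot has been freed or never filled
}

// cacheMap ties the two structures together: the address map answers
// "is this block cached, and where in the log?", while the BDA answers
// "what is stored at this log location?".
type cacheMap struct {
	addressMap map[blockKey]uint64 // object block -> BDA index
	bda        []bdaEntry          // one entry per block the log can hold
	clockHand  uint64              // next BDA index inspected by CLOCK eviction
}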
CACHE MAP I/O FLOW [Diagram: Block Descriptor Array]
CACHE MAP I/O FLOW (Get): In address map? No → Miss. Yes → Hit: set CLOCK bit in BDA, read from log.
CACHE MAP I/O FLOW (Invalidate): delete from the address map and free the BDA index.
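Continuing the sketch above, these are hypothetical get and invalidate methods matching the two flows; the real pblcache code paths (driven through the message queue shown earlier) may differ.

package pblsketch

// get looks the block up in the address map. A hit sets the CLOCK reference
// bit so the entry survives the next eviction pass; the caller then reads
// the block from the log at the returned BDA index.
func (c *cacheMap) get(key blockKey) (bdaIndex uint64, hit bool) {
	idx, ok := c.addressMap[key]
	if !ok {
		return 0, false // miss: the caller reads from the storage backend
	}
	c.bda[idx].Clock = true // hit: set the CLOCK bit in the BDA
	return idx, true
}

// invalidate removes a block whose backend copy has changed: delete it
// from the address map and free its BDA slot.
func (c *cacheMap) invalidate(key blockKey) {
	if idx, ok := c.addressMap[key]; ok {
		delete(c.addressMap, key)
		c.bda[idx] = bdaEntry{} // free the BDA index
	}
}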
LOG Block location is determined by the BDA index. CLOCK is optimized with segment read-ahead. Segment pool with buffered writes. Contiguous block support. [Diagram: segments laid out on the SSD]
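As a rough sketch of the log layout, assuming fixed-size blocks and segments (the sizes below are assumptions, not pblcache's actual configuration), the BDA index alone determines where a block lives on the SSD.

package pblsketch

// Assumed layout constants for this sketch only.
const (
	blockSize        = 4096 // bytes per cached block
	blocksPerSegment = 256  // blocks per log segment
	segmentSize      = blockSize * blocksPerSegment
)

// logOffset returns the byte offset of a block in the log: the BDA index
// alone determines the block's location, so no extra on-disk index is needed.
func logOffset(bdaIndex uint64) int64 {
	return int64(bdaIndex) * blockSize
}

// segmentOf returns which segment a BDA index falls into; whole segments
// are buffered in RAM and written to the SSD sequentially.
func segmentOf(bdaIndex uint64) uint64 {
	return bdaIndex / blocksPerSegment
}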
LOG SEGMENT STATE MACHINE
LOG READ I/O FLOW (Read): Is the block in a buffered segment? Yes → read from the segment buffer. No → read from the SSD.
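A sketch of this read path, again with illustrative names: it assumes the log keeps the segment it is currently filling in a RAM buffer and falls back to the SSD otherwise.

package pblsketch

import "os"

// logFile is a hypothetical view of the log: the SSD file plus the segment
// currently being filled, which is still buffered in RAM.
type logFile struct {
	fp         *os.File // the SSD log device or file
	bufSegment uint64    // which segment the RAM buffer holds
	buffered   bool      // whether the buffer currently holds valid data
	buffer     []byte    // segmentSize bytes
}

// readBlockFromLog serves the read from the buffered segment when possible,
// otherwise it falls back to a direct read from the SSD.
func (l *logFile) readBlockFromLog(bdaIndex uint64, out []byte) error {
	if l.buffered && segmentOf(bdaIndex) == l.bufSegment {
		off := (bdaIndex % blocksPerSegment) * blockSize
		copy(out, l.buffer[off:off+blockSize])
		return nil // served from the in-memory segment buffer
	}
	_, err := l.fp.ReadAt(out, logOffset(bdaIndex)) // read from the SSD
	return err
}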
PERSISTENT METADATA Save the address map to a file on application shutdown so the cache is warm on application restart. Not designed to be durable: after a system crash the metadata file is not created and the cache starts cold.
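One way this could look, as a sketch only (pblcache's actual on-disk metadata format is not shown here): serialize the address map and BDA with encoding/gob on a clean shutdown and read them back on startup.

package pblsketch

import (
	"encoding/gob"
	"os"
)

// saveMetadata writes the address map and the BDA to a file on a clean
// shutdown. If the process crashes before this runs, the file is simply
// missing and the cache starts cold, matching the "not durable" design.
func (c *cacheMap) saveMetadata(path string) error {
	fp, err := os.Create(path)
	if err != nil {
		return err
	}
	defer fp.Close()
	enc := gob.NewEncoder(fp)
	if err := enc.Encode(c.addressMap); err != nil {
		return err
	}
	return enc.Encode(c.bda)
}

// loadMetadata restores the cache map on startup so the cache is warm;
// any error simply means the application starts with a cold cache.
func (c *cacheMap) loadMetadata(path string) error {
	fp, err := os.Open(path)
	if err != nil {
		return err
	}
	defer fp.Close()
	dec := gob.NewDecoder(fp)
	if err := dec.Decode(&c.addressMap); err != nil {
		return err
	}
	return dec.Decode(&c.bda)
}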
PBLIO BENCHMARK PBL APPLICATION
PBLIO Benchmark tool. Uses an enterprise workload generator from NetApp*. The cache is set up as write-through. Can be used with or without pblcache. Documentation: https://github.com/pblcache/pblcache/wiki/pblio * S. Daniel et al., A portable, open-source implementation of the SPC-1 workload * https://github.com/lpabon/goioworkload
ENTERPRISE WORKLOAD Synthetic OLTP enterprise workload generator. Tests for the maximum number of IOPS before latency exceeds 30 ms. Divides the storage system into three logical storage units:
ASU1 - Data Store - 45% of total storage - read/write
ASU2 - User Store - 45% of total storage - read/write
ASU3 - Log - 10% of total storage - write only
BSU - Business Scaling Units: 1 BSU = 50 IOPS (e.g. 31 BSUs = 1,550 IOPS)
SIMPLE EXAMPLE
$ fallocate -l 45MiB file1
$ fallocate -l 45MiB file2
$ fallocate -l 10MiB file3
$ ./pblio -asu1=file1 \
    -asu2=file2 \
    -asu3=file3 \
    -runlen=30 -bsu=2
----- pblio -----
Cache   : None
ASU1    : 0.04 GB
ASU2    : 0.04 GB
ASU3    : 0.01 GB
BSUs    : 2
Contexts: 1
Run time: 30 s
-----
Avg IOPS: 98.63  Avg Latency: 0.2895 ms
RAW DEVICES EXAMPLE
$ ./pblio -asu1=/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde \
    -asu2=/dev/sdf,/dev/sdg,/dev/sdh,/dev/sdi \
    -asu3=/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm \
    -runlen=30 -bsu=2
CACHE EXAMPLE
$ fallocate -l 10MiB mycache
$ ./pblio -asu1=file1 -asu2=file2 -asu3=file3 \
    -runlen=30 -bsu=2 -cache=mycache
----- pblio -----
Cache   : mycache (New)
C Size  : 0.01 GB
ASU1    : 0.04 GB
ASU2    : 0.04 GB
ASU3    : 0.01 GB
BSUs    : 2
Contexts: 1
Run time: 30 s
-----
Avg IOPS: 98.63  Avg Latency: 0.2573 ms
Read Hit Rate: 0.4457
Invalidate Hit Rate: 0.6764
Read hits: 1120
Invalidate hits: 347
Reads: 2513
Insertions: 1906
Evictions: 0
Invalidations: 513
== Log Information ==
Ram Hit Rate: 1.0000
Ram Hits: 1120
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 0
Wraps: 1
Segments Skipped: 0
Mean Read Latency: 0.00 usec
Mean Segment Read Latency: 4396.77 usec
Mean Write Latency: 1162.58 usec
LATENCY OVER 30 MS
----- pblio -----
Cache   : /dev/sdg (Loaded)
C Size  : 185.75 GB
ASU1    : 673.83 GB
ASU2    : 673.83 GB
ASU3    : 149.74 GB
BSUs    : 32
Contexts: 1
Run time: 600 s
-----
Avg IOPS: 1514.92  Avg Latency: 112.1096 ms
Read Hit Rate: 0.7004
Invalidate Hit Rate: 0.7905
Read hits: 528539
Invalidate hits: 120189
Reads: 754593
Insertions: 378093
Evictions: 303616
Invalidations: 152039
== Log Information ==
Ram Hit Rate: 0.0002
Ram Hits: 75
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 445638
Wraps: 0
Segments Skipped: 0
Mean Read Latency: 850.89 usec
Mean Segment Read Latency: 2856.16 usec
Mean Write Latency: 6472.74 usec
EVALUATION
TEST SETUP
Client using a 180 GB SAS SSD (about 10% of the workload size)
GlusterFS 6x2 cluster
100 files for each ASU
pblio v0.1 compiled with go1.4.1
Each system has:
  Fedora 20
  6 Intel Xeon E5-2620 @ 2 GHz
  64 GB RAM
  5x 300 GB SAS drives
  10 Gbit network
CACHE WARMUP IS TIME-CONSUMING: 16 hours
INCREASED RESPONSE TIME 73% Increase
STORAGE BACKEND IOPS REDUCTION At BSU = 31 (1,550 IOPS): ~75% reduction in IOPS sent to the storage backend
CURRENT STATUS
MILESTONES 1. Create Cache Map - COMPLETED 2. Create Log - COMPLETED 3. Create Benchmark application - COMPLETED 4. Design pblcached architecture - IN PROGRESS
NEXT: QEMU SHARED CACHE Work with the community to bring this technology to QEMU. Possible architecture: [diagram]. Some conditions to think about: VM migration, volume deletion, VM crash.
FUTURE Hyperconvergence, peer-cache, writeback, shared cache, QoS using mClock*, possible integrations with Ceph and GlusterFS backends. * A. Gulati et al., mClock: Handling Throughput Variability for Hypervisor IO Scheduling
JOIN! Github: https://github.com/pblcache/pblcache IRC Freenode: #pblcache Google Group: https://groups.google.com/forum/#!forum/pblcache Mail list: pblcache@googlegroups.com
FROM THIS...
TO THIS