idedup: latency-aware inline deduplication for primary workloads
Kiran Srinivasan, Tim Bisson, Garth Goodson, Kaladhar Voruganti
Advanced Technology Group, NetApp
idedup: overview/context
[Diagram: storage clients -> primary storage over NFS/CIFS/iSCSI; primary storage -> secondary storage over NDMP/other]
- Secondary storage: dedupe is already exploited effectively here => 90+% savings
- Primary storage: performance & reliability are key features
  - RPC-based protocols => latency sensitive
  - Only offline dedupe techniques have been developed so far
- idedup: an inline/foreground dedupe technique for primary storage
  - Minimal impact on latency-sensitive workloads
Dedupe techniques: offline vs. inline
- Offline dedupe
  - The first copy on stable storage is not deduped
  - Dedupe is a post-processing/background activity
- Inline dedupe
  - Dedupe before the first copy reaches stable storage
  - Primary => latency should not be affected!
  - Secondary => must dedupe at the ingest rate (IOPS)!
Why inline dedupe for primary?
- Provisioning/planning is easier
  - Dedupe savings are seen right away
  - Planning is simpler because capacity values are accurate
- No post-processing activities
  - No scheduling of background processes
  - No interference => front-end workloads are not affected
  - Key for storage system users with limited maintenance windows
- Efficient use of resources
  - Efficient use of I/O bandwidth (offline incurs both reads + writes)
  - The file system's buffer cache is more efficient (holds deduped blocks)
- Performance challenges have been the key obstacle
  - Overheads (CPU + I/Os) on both reads and writes hurt latency
idedup: key features
- Minimizes inline dedupe performance overheads
  - Leverages workload characteristics
  - Eliminates almost all extra I/Os due to dedupe processing
  - CPU overheads are minimal
- Tunable tradeoff: dedupe savings vs. performance
  - Selective dedupe => some loss in dedupe capacity savings
- idedup can be combined with offline techniques
  - Maintains the same on-disk data structures as the normal file system
  - Offline dedupe can optionally be run on top
Related work (workload vs. method)
- Primary, offline: NetApp ASIS, EMC Celerra, IBM StorageTank
- Primary, inline: idedup, zfs*
- Secondary, offline: (no motivation for systems in this category)
- Secondary, inline: EMC DDFS, EMC Cluster DeepStore, NEC HydraStor, Venti, SiLo, Sparse Indexing, ChunkStash, Foundation, Symantec, EMC Centera
Outline
- Inline dedupe challenges
- Our approach
- Design/implementation details
- Evaluation results
- Summary
Inline dedupe: read path challenges
- Inherently, dedupe causes disk-level fragmentation!
  - Sequential reads turn random => more seeks => more latency
  - RPC-based protocols (CIFS/NFS/iSCSI) are latency sensitive
- Dataset/workload property: primary workloads are typically read-intensive
  - Usually the read/write ratio is ~70/30
- Inline dedupe must not affect read performance!
[Figure: fragmentation with random seeks]
Inline dedupe: write path challenges
- Extra CPU overheads in the critical write path
  - Dedupe requires computing the fingerprint (hash) of each block
  - The dedupe algorithm requires extra cycles
- Extra random I/Os in the write path due to the dedupe algorithm
  - Dedupe metadata (fingerprint DB) lookups and updates
  - Updating the reference counts of blocks on disk => random seeks!!
[Diagram: client write -> write logging (NVRAM) -> de-stage phase: compute hash -> dedupe algorithm (consults dedupe metadata) -> write allocation -> disk I/O]
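To make these costs concrete, here is a minimal sketch of a de-stage phase with a naive, disk-resident fingerprint DB. All names (destage, fpdb, refcount) are hypothetical placeholders rather than WAFL interfaces, and plain dicts and lists stand in for on-disk structures; the comments mark where each extra cost lands in the critical path.

```python
import hashlib

def destage(write_log, fpdb, refcount, disk):
    """Sketch of an inline-dedupe de-stage phase (hypothetical, simplified).

    write_log: list of (block_id, bytes) pairs staged in NVRAM.
    fpdb:      dict fingerprint -> disk address (stands in for the on-disk FPDB).
    refcount:  dict disk address -> reference count.
    disk:      list used as a trivial block store.
    Returns a block map: block_id -> disk address.
    """
    block_map = {}
    for block_id, data in write_log:
        fp = hashlib.sha256(data).digest()   # CPU cost: hash every block
        addr = fpdb.get(fp)                  # on a disk-resident FPDB: a random read I/O
        if addr is not None:
            refcount[addr] += 1              # refcount update: another random I/O
        else:
            addr = len(disk)                 # normal write allocation for a new block
            disk.append(data)
            fpdb[fp] = addr                  # FPDB insert: another random I/O
            refcount[addr] = 1
        block_map[block_id] = addr
    return block_map
```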
Our Approach
idedup: intuition
- Is there a good tradeoff between capacity savings and latency performance?
idedup: solution to read path issues
- Insight 1: dedupe only sequences of disk blocks
  - Solves fragmentation => seeks are amortized during reads
  - Selective dedupe that leverages spatial locality
  - Configurable minimum sequence length (see the seek-count sketch below)
[Figure: fragmentation with random seeks vs. sequences with amortized seeks]
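As an illustration of why sequences matter, the sketch below counts seeks for a sequential read under three made-up block layouts; the addresses and the one-seek-per-discontinuity model are assumptions for illustration, not measurements from the talk.

```python
def seeks_for_sequential_read(disk_addrs):
    """Count disk seeks needed to read a file's blocks in logical order:
    a seek is charged whenever the next block is not physically
    contiguous with the previous one (a simple model of fragmentation)."""
    seeks, prev = 0, None
    for addr in disk_addrs:
        if prev is None or addr != prev + 1:
            seeks += 1
        prev = addr
    return seeks

# Illustrative layouts for an 8-block sequential file (made-up addresses):
no_dedupe   = [100, 101, 102, 103, 104, 105, 106, 107]  # fully contiguous
full_dedupe = [100, 7, 101, 55, 102, 9, 103, 42]        # every block deduped anywhere
seq_dedupe  = [7, 8, 9, 10, 104, 105, 106, 107]         # dedupe only a 4-block sequence

for name, layout in [("baseline", no_dedupe),
                     ("dedupe every block", full_dedupe),
                     ("dedupe sequences >= 4", seq_dedupe)]:
    print(f"{name}: {seeks_for_sequential_read(layout)} seeks")
# baseline: 1 seek; per-block dedupe: 8 seeks; sequence dedupe: 2 seeks
```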
idedup: write path issues
- How can we reduce dedupe metadata I/Os?
- Flash? Read I/Os are cheap, but frequent updates are expensive
idedup: solution to write path issues
- Insight 2: keep a smaller dedupe metadata table as an in-memory cache (sketch below)
  - No extra I/Os
  - Leverages temporal locality characteristics in duplication
  - Some loss in dedupe (only a subset of blocks is tracked)
[Figure: blocks written over time vs. cached fingerprints]
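A minimal sketch of such a cache follows, assuming an LRU eviction policy (one plausible choice for exploiting temporal locality; the slide does not specify the policy):

```python
from collections import OrderedDict

class FingerprintCache:
    """Sketch of an in-memory fingerprint table with LRU eviction.

    Misses return None instead of going to disk, so dedupe opportunities
    outside the cached working set are simply lost -- the capacity-vs-
    performance tradeoff idedup makes.
    """
    def __init__(self, capacity):
        self.capacity = capacity          # knob: bigger cache => more dedupe,
        self.table = OrderedDict()        # but less room for the buffer cache

    def lookup(self, fp):
        addr = self.table.get(fp)
        if addr is not None:
            self.table.move_to_end(fp)    # refresh LRU position on a hit
        return addr                       # miss => None; never a disk I/O

    def insert(self, fp, addr):
        if fp in self.table:
            self.table.move_to_end(fp)
        self.table[fp] = addr
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the least-recently-used entry
```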
idedup: viability
- Is this loss in dedupe savings OK?
- Both spatial and temporal locality are dataset/workload properties!
- Viable for some important primary workloads
[Figure: dedupe ratio — original vs. with spatial locality vs. with spatial + temporal locality]
Design and Implementation
idedup: architecture
[Diagram: I/Os (reads + writes) -> NVRAM (write log blocks) -> de-stage -> idedup algorithm, which consults the dedupe metadata (FPDB) -> file system (WAFL) -> disk]
idedup: two key tunable parameters
- Minimum sequence length threshold
  - Minimum number of sequential duplicate blocks on disk
  - A dataset property => ideally set to the expected fragmentation
  - Different from using a larger block size (sequences are variable-length, blocks are fixed)
  - Knob between performance (fragmentation) and dedupe
- Dedupe metadata (fingerprint DB) cache size
  - A workload's working-set property
  - Increase in cache size => decrease in buffer cache
  - Knob between performance (cache hit ratio) and dedupe
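As a back-of-the-envelope illustration of the cache-size knob, the snippet below estimates how many fingerprints the cache sizes used in the evaluation could hold; the 32-byte entry size and 4 KB block size are assumptions for illustration, not figures from the talk.

```python
# Illustrative sizing sketch: each cache entry holds a fingerprint digest
# plus a disk address (sizes assumed, not taken from the paper).
ENTRY_BYTES = 32          # assumed: ~24B fingerprint + 8B block address
BLOCK_BYTES = 4096        # typical file system block size

for cache_gb in (0.25, 0.5, 1.0):
    entries = int(cache_gb * 2**30) // ENTRY_BYTES
    tracked_gb = entries * BLOCK_BYTES / 2**30
    print(f"{cache_gb:4} GB cache -> {entries:,} entries "
          f"(~{tracked_gb:.0f} GB of unique blocks tracked)")
```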
idedup: algorithm
The idedup algorithm works in four phases for every file (see the sketch below):
- Phase 1 (per file): identify blocks for idedup
  - Only full, pure data blocks are processed
  - Metadata blocks and special files are ignored
- Phase 2 (per file): sequence processing
  - Uses the dedupe metadata cache
  - Keeps track of multiple sequences
- Phase 3 (per sequence): sequence pruning
  - Eliminates short sequences below the threshold
  - Picks among overlapping sequences via a heuristic
- Phase 4 (per sequence): deduplication of the sequence
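Putting the pieces together, here is a hedged sketch of how the four phases might compose per file. share_sequence and write_normally are hypothetical callbacks into the file system (not WAFL interfaces), fp_cache is an in-memory fingerprint cache like the one sketched earlier, and phase 3's overlap-resolution heuristic is simplified to plain threshold pruning.

```python
import hashlib

def idedup_file(file_blocks, fp_cache, threshold, share_sequence, write_normally):
    """Simplified sketch of the per-file, four-phase idedup flow.

    file_blocks: list of (block_bytes, is_pure_data) tuples in file order.
    """
    # Phase 1 (per file): keep only full, pure data blocks.
    candidates = [blk for blk, is_pure_data in file_blocks if is_pure_data]

    # Phase 2 (per file): fingerprint each block and probe the in-memory
    # cache; a miss returns None and costs no disk I/O.
    fps = [hashlib.sha256(blk).digest() for blk in candidates]
    addrs = [fp_cache.lookup(fp) for fp in fps]

    # Phase 3 (per sequence): find runs of duplicates that are contiguous
    # on disk and prune runs shorter than the threshold.
    deduped, i = set(), 0
    while i < len(addrs):
        j = i
        while (j < len(addrs) and addrs[j] is not None
               and (j == i or addrs[j] == addrs[j - 1] + 1)):
            j += 1
        if j - i >= threshold:
            share_sequence(i, j - i, addrs[i])   # Phase 4: dedupe the whole run
            deduped.update(range(i, j))
        i = j if j > i else i + 1

    # Blocks not covered by a surviving sequence are written normally, and
    # their fingerprints are cached for future temporal-locality hits.
    for k, (blk, fp) in enumerate(zip(candidates, fps)):
        if k not in deduped:
            fp_cache.insert(fp, write_normally(blk))
```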
Evaluation
Evaluation setup
- NetApp FAS 3070, 8 GB RAM, 3-disk RAID-0
- Evaluated by replaying real-world CIFS traces
  - Corporate filer traces, NetApp data center (2007): 204 GB read data (69%), 93 GB write data
  - Engineering filer traces, NetApp data center (2007): 192 GB read data (67%), 92 GB write data
- Comparison points
  - Baseline: system with no idedup
  - Threshold-1: system with full dedupe (minimum sequence length of 1 block)
  - Dedupe metadata cache sizes: 0.25, 0.5, and 1 GB
Results: deduplication ratio vs. threshold
[Plot: deduplication ratio (%) vs. threshold (1, 2, 4, 8) for 0.25, 0.5, and 1 GB caches (Corp)]
- Less-than-linear decrease in dedupe savings => spatial locality in dedupe
- Ideal threshold = the biggest threshold with the least decrease in dedupe savings
- Threshold-4 retains ~60% of the maximum dedupe savings
Results: disk fragmentation (request sizes)
[Plot: CDF of block request sizes (Engg, 1 GB cache); mean request sizes in blocks — Baseline 15.8, Threshold-1 12.5, Threshold-2 14.8, Threshold-4 14.9, Threshold-8 15.4]
- Threshold-1 (full dedupe) shows the most fragmentation; Baseline the least
- Fragmentation for the other thresholds falls between Baseline and Threshold-1 => fragmentation is tunable
Results: CPU utilization
[Plot: CDF of CPU utilization samples (Corp, 1 GB cache); means — Baseline 13.2%, Threshold-1 15.0%, Threshold-4 16.6%, Threshold-8 17.1%]
- Larger variance (longer tail) compared to the baseline
- But the difference in means is less than 4%
Results: latency impact
[Plot: CDF of client response time (Corp, 1 GB cache) — Baseline, Threshold-1, Threshold-8]
- Latency impact appears only for longer response times (> 2 ms)
- Threshold-1 mean latency is ~13% higher than the baseline
- The difference between Threshold-8 and the baseline is < 4%!
Summary
- Inline dedupe has significant performance challenges
  - Reads: fragmentation; writes: CPU + extra I/Os
- idedup creates a tunable tradeoff between savings and performance
  - Leverages dedupe locality properties
  - Avoids fragmentation: dedupe only sequences
  - Avoids extra I/Os: keep dedupe metadata in memory
- Experiments with latency-sensitive primary workloads
  - Low CPU impact: < 5% on average
  - ~60% of max dedupe with only ~4% impact on latency
- Future work: dynamic threshold, more traces
- Our traces are available for research purposes
Acknowledgements
- NetApp WAFL team: Blake Lewis, Ling Zheng, Craig Johnston, Subbu PVS, Praveen K, Sriram Venketaraman
- NetApp ATG team: Scott Dawkins, Jeff Heller
- Shepherd: John Bent
- Our intern (from UCSC): Stephanie Jones