idedup: Latency-aware inline deduplication for primary workloads. Kiran Srinivasan, Tim Bisson, Garth Goodson, Kaladhar Voruganti.



Slide 1. idedup: Latency-aware inline deduplication for primary workloads. Kiran Srinivasan, Tim Bisson, Garth Goodson, Kaladhar Voruganti. Advanced Technology Group, NetApp.

idedup overview/context
- Storage clients talk to primary storage over NFS/CIFS/iSCSI; primary storage feeds secondary storage via NDMP or other protocols.
- Dedupe is already exploited effectively on secondary storage (90+% savings).
- Primary storage: performance and reliability are key features; RPC-based protocols make it latency sensitive; until now, only offline dedupe techniques had been developed for it.
- idedup is an inline/foreground dedupe technique for primary storage with minimal impact on latency-sensitive workloads.

Dedupe techniques (offline vs. inline)
- Offline dedupe: the first copy on stable storage is not deduped; dedupe is a post-processing/background activity.
- Inline dedupe: blocks are deduped before the first copy reaches stable storage.
  - Primary: latency must not be affected!
  - Secondary: must dedupe at the ingest rate (IOPS)!

Why inline dedupe for primary?
- Provisioning/planning is easier: dedupe savings are seen right away, and planning is simpler because capacity values are accurate.
- No post-processing activities: no scheduling of background processes, no interference, so front-end workloads are not affected. Key for storage system users with limited maintenance windows.
- Efficient use of resources: efficient use of I/O bandwidth (offline dedupe incurs both reads and writes), and the file system's buffer cache is more efficient (it holds deduped blocks).
- Performance challenges have been the key obstacle: overheads (CPU + I/Os) on both reads and writes hurt latency.

idedup: Key features
- Minimizes inline dedupe performance overheads by leveraging workload characteristics: eliminates almost all extra I/Os due to dedupe processing, with minimal CPU overhead.
- Tunable tradeoff between dedupe savings and performance: selective dedupe means some loss in capacity savings.
- Can be combined with offline techniques: idedup maintains the same on-disk data structures as the normal file system, so offline dedupe can optionally be run as well.

Related Work (workload vs. method)
- Primary, offline: NetApp ASIS, EMC Celerra, IBM StorageTank
- Primary, inline: idedup, zfs*
- Secondary, offline: no motivation for systems in this category
- Secondary, inline: EMC DDFS, EMC Cluster DeepStore, NEC HydraStor, Venti, SiLo, Sparse Indexing, ChunkStash, Foundation, Symantec, EMC Centera

Outline
- Inline dedupe challenges
- Our approach
- Design/implementation details
- Evaluation results
- Summary

Inline Dedupe: Read Path Challenges
- Inherently, dedupe causes disk-level fragmentation! Sequential reads turn random => more seeks => more latency.
- RPC-based protocols (CIFS/NFS/iSCSI) are latency sensitive.
- Dataset/workload property: primary workloads are typically read-intensive; the read/write ratio is usually ~70/30.
- Inline dedupe must not affect read performance!
(Figure: fragmentation with random seeks.)

Inline Dedupe: Write Path Challenges
- Extra CPU overheads and random I/Os in the critical write path due to the dedupe algorithm:
  - Dedupe metadata (fingerprint DB) lookups and updates
  - Computing the fingerprint (hash) of each block
  - Updating the reference counts of blocks on disk
  - Extra cycles for the dedupe algorithm itself
(Figure: client write -> write logging -> compute hash -> dedupe algorithm -> write allocation -> disk I/O in the de-stage phase; dedupe metadata accesses cause random seeks.)
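The per-block fingerprint step above can be illustrated with a minimal sketch. This is not NetApp's implementation; the 4KB block size and SHA-256 are assumptions for illustration. The only property that matters is that identical block contents always map to the same key:

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only

def fingerprint(block: bytes) -> bytes:
    """Content hash used as the dedupe key: identical blocks
    always produce identical fingerprints."""
    return hashlib.sha256(block).digest()

a = b"x" * BLOCK_SIZE
b = b"x" * BLOCK_SIZE
c = b"y" * BLOCK_SIZE
assert fingerprint(a) == fingerprint(b)   # duplicate content, same key
assert fingerprint(a) != fingerprint(c)   # different content, different key
```

Computing this hash for every written block is one of the CPU costs listed above; the lookups and reference-count updates it triggers are the extra I/Os.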

Our Approach

idedup Intuition: Is there a good tradeoff between capacity savings and latency performance?

idedup: Solution to Read Path Issues
Insight 1: Dedupe only sequences of disk blocks.
- Solves fragmentation: seeks are amortized during reads.
- Selective dedupe, leveraging spatial locality.
- Configurable minimum sequence length.
(Figure: fragmentation with random seeks vs. sequences with amortized seeks.)

idedup: Write Path Issues
- How can we reduce dedupe metadata I/Os?
- Flash? Read I/Os are cheap, but frequent updates are expensive.

idedup: Solution to Write Path Issues
Insight 2: Keep smaller dedupe metadata as an in-memory cache.
- No extra I/Os.
- Leverages temporal locality characteristics in duplication.
- Some loss in dedupe (only a subset of blocks is tracked).
(Figure: cached fingerprints of blocks over time.)
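Insight 2 can be sketched as a fixed-capacity cache mapping fingerprint to on-disk block number. This is a minimal sketch under assumptions: the class and method names are hypothetical, and LRU is one plausible eviction policy, not necessarily the one the FPDB uses. The essential behavior is that a miss is harmless (a duplicate simply goes undetected), which is exactly the "some loss in dedupe" tradeoff:

```python
from collections import OrderedDict

class FingerprintCache:
    """Fixed-size in-memory dedupe metadata: fingerprint -> disk block no.
    A miss is not fatal; it only means a duplicate goes undetected."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, fp: bytes):
        """Return the block number for fp, or None; refresh recency on a hit."""
        if fp in self.entries:
            self.entries.move_to_end(fp)
            return self.entries[fp]
        return None

    def insert(self, fp: bytes, blockno: int):
        self.entries[fp] = blockno
        self.entries.move_to_end(fp)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = FingerprintCache(capacity=2)
cache.insert(b"fp1", 100)
cache.insert(b"fp2", 200)
cache.lookup(b"fp1")        # hit: fp1 becomes most recently used
cache.insert(b"fp3", 300)   # evicts fp2, the least recently used entry
assert cache.lookup(b"fp2") is None   # lost entry => missed dedupe, no I/O
assert cache.lookup(b"fp1") == 100
```

Because the cache lives entirely in memory, lookups and evictions never touch disk, eliminating the metadata I/Os described on the previous slide.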

idedup: Viability
- Is this loss in dedupe savings OK?
- Both spatial and temporal locality are dataset/workload properties!
- Viable for some important primary workloads.
(Figure: dedupe ratio for the original data vs. with spatial locality only vs. with spatial + temporal locality.)

Design and Implementation

idedup Architecture
(Figure: I/Os (reads + writes) land in NVRAM (write log blocks); during de-stage, the idedup algorithm consults the dedupe metadata (FPDB) before the file system (WAFL) writes to disk.)

idedup: Two Key Tunable Parameters
- Minimum sequence length (threshold): the minimum number of sequential duplicate blocks on disk. A dataset property, ideally set to the expected fragmentation. Different from simply using a larger block size (variable vs. fixed). A knob between performance (fragmentation) and dedupe.
- Dedupe metadata (fingerprint DB) cache size: a property of the workload's working set. Increasing this cache shrinks the buffer cache. A knob between performance (cache hit ratio) and dedupe.

idedup Algorithm
The idedup algorithm works in four phases for every file:
- Phase 1 (per file): identify blocks for idedup. Only full, pure data blocks are processed; metadata blocks and special files are ignored.
- Phase 2 (per file): sequence processing. Uses the dedupe metadata cache and keeps track of multiple sequences.
- Phase 3 (per sequence): sequence pruning. Eliminates short sequences below the threshold and picks among overlapping sequences via a heuristic.
- Phase 4 (per sequence): deduplication of the sequence.
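The heart of phases 2 and 3 can be sketched as follows: look up each block's fingerprint in the metadata cache, grow a run while the matched on-disk block numbers stay contiguous, and discard runs shorter than the threshold. This is a simplified single-sequence sketch under assumptions (the function name and tuple format are hypothetical; the real algorithm tracks multiple candidate sequences and uses a heuristic for overlaps, which this omits):

```python
def find_dedupable_sequences(file_blocks, fp_cache, threshold):
    """file_blocks: list of fingerprints in file order.
    fp_cache: dict of fingerprint -> on-disk block number (metadata cache).
    Returns [(start_index, length, start_disk_block)] for runs of
    duplicates that are contiguous on disk and at least `threshold` long."""
    sequences = []
    run_start, run_disk, run_len = None, None, 0

    def flush():
        # Phase 3 (simplified): keep only runs meeting the threshold.
        nonlocal run_start, run_disk, run_len
        if run_len >= threshold:
            sequences.append((run_start, run_len, run_disk))
        run_start, run_disk, run_len = None, None, 0

    for i, fp in enumerate(file_blocks):
        disk = fp_cache.get(fp)
        if disk is None:
            flush()                              # cache miss ends the run
        elif run_len and disk == run_disk + run_len:
            run_len += 1                         # extends the on-disk run
        else:
            flush()                              # duplicate, but not contiguous
            run_start, run_disk, run_len = i, disk, 1
    flush()
    return sequences

# File blocks 1-3 duplicate contiguous on-disk blocks 10-12; block 0 is unique.
cache = {b"a": 10, b"b": 11, b"c": 12}
blocks = [b"z", b"a", b"b", b"c"]
assert find_dedupable_sequences(blocks, cache, threshold=2) == [(1, 3, 10)]
assert find_dedupable_sequences(blocks, cache, threshold=4) == []
```

The second assertion shows the threshold knob at work: the same three-block duplicate run is deduped at threshold 2 but skipped at threshold 4, trading savings for less fragmentation.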

Evaluation

Evaluation Setup
- NetApp FAS 3070, 8GB RAM, 3-disk RAID-0.
- Evaluated by replaying real-world CIFS traces:
  - Corporate filer traces from a NetApp data center (2007): 204GB read (69%), 93GB written.
  - Engineering filer traces from a NetApp data center (2007): 192GB read (67%), 92GB written.
- Comparison points:
  - Baseline: system with no idedup.
  - Threshold-1: system with full dedupe (1-block threshold).
  - Dedupe metadata cache sizes: 0.25, 0.5, and 1GB.

Results: Deduplication Ratio vs. Threshold
(Figure: dedupe ratio (%) vs. threshold (1, 2, 4, 8) for 0.25GB, 0.5GB, and 1GB cache sizes, corporate trace.)
- Less than linear decrease in dedupe savings as the threshold grows: evidence of spatial locality in dedupe.
- Ideal threshold: the biggest threshold with the least decrease in dedupe savings.
- Threshold-4 retains ~60% of the maximum savings.

Results: Disk Fragmentation (Request Sizes)
(Figure: CDF of block request sizes, engineering trace, 1GB cache. Mean request size in blocks: Baseline 15.8, Threshold-1 12.5, Threshold-2 14.8, Threshold-4 14.9, Threshold-8 15.4.)
- Threshold-1 shows the most fragmentation and the baseline the least; the other thresholds fall in between.
- Fragmentation is tunable via the threshold.

Results: CPU Utilization
(Figure: CDF of CPU utilization samples, corporate trace, 1GB cache. Mean: Baseline 13.2%, Threshold-1 15.0%, Threshold-4 16.6%, Threshold-8 17.1%.)
- Larger variance (longer tail) compared to the baseline.
- But the difference in means is less than 4%.

Results: Latency Impact
(Figure: CDF of client response times, corporate trace, 1GB cache, for Baseline, Threshold-1, and Threshold-8.)
- The latency impact appears only for longer response times (> 2ms).
- Threshold-1 mean latency is ~13% worse than the baseline.
- The difference between Threshold-8 and the baseline is under 4%!

Summary
- Inline dedupe has significant performance challenges: reads suffer fragmentation; writes suffer CPU and extra I/O overheads.
- idedup creates a tradeoff between savings and performance by leveraging dedupe locality properties:
  - Avoid fragmentation: dedupe only sequences.
  - Avoid extra I/Os: keep dedupe metadata in memory.
- Experiments on latency-sensitive primary workloads: low CPU impact (< 5% on average); ~60% of max dedupe with ~4% impact on latency.
- Future work: dynamic threshold, more traces.
- Our traces are available for research purposes.

Acknowledgements
- NetApp WAFL team: Blake Lewis, Ling Zheng, Craig Johnston, Subbu PVS, Praveen K, Sriram Venketaraman
- NetApp ATG team: Scott Dawkins, Jeff Heller
- Shepherd: John Bent
- Our intern (from UCSC): Stephanie Jones
