Theoretical Aspects of Storage Systems Autumn 2009



Theoretical Aspects of Storage Systems Autumn 2009 Chapter 3: Data Deduplication André Brinkmann


Outline: Data Deduplication; Compare-by-hash strategies; Delta-encoding based strategies; Measurements

Motivation: Keeping 26 full backups requires 26 times the backup capacity, even though there are few changes and high redundancy between the individual backups. Similar behavior can be seen for virtual machine images, home directories, and network file systems (LBFS). Data deduplication removes redundant data and tries to ensure that each piece of information is stored only once.

Different Approaches: Fingerprinting uses hashing schemes to characterize the content of a data block. Delta encoding searches for near-duplicates and stores only the delta between blocks.

Fingerprinting: Fingerprinting is based on four stages: (1) Chunking divides the data stream into chunks of fixed or variable size; (2) Fingerprinting calculates a hash function for each chunk; (3) Duplicate detection compares the hash result with the already stored index; (4) Indexes are updated and new data is stored.
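
The four stages map directly onto a few lines of code. The following is a minimal sketch, not the implementation of any particular system: the names deduplicate, chunk_index, and chunk_store are illustrative assumptions, chunking is assumed to be done by the caller, and SHA-1 is used as the fingerprint as on the later slides.

```python
import hashlib

def deduplicate(stream_chunks, chunk_index, chunk_store):
    """chunk_index: dict mapping fingerprint -> position in chunk_store;
    chunk_store: list holding each unique chunk exactly once."""
    recipe = []                                  # fingerprints needed to rebuild the stream
    for chunk in stream_chunks:                  # stage 1: chunking (done by the caller here)
        fp = hashlib.sha1(chunk).digest()        # stage 2: fingerprinting
        if fp not in chunk_index:                # stage 3: duplicate detection
            chunk_index[fp] = len(chunk_store)   # stage 4: update index ...
            chunk_store.append(chunk)            # ... and store the new chunk
        recipe.append(fp)
    return recipe
```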

Chunking: The process of chunking divides the input (a data stream, a file, etc.) into smaller, non-overlapping blocks (chunks). Different approaches: static chunking, content-defined chunking, and file-based chunking.

Static Chunking: Each chunk has a fixed size. This is a very fast approach, but it is vulnerable to shifts inside the data stream and therefore seldom used; its main applications are virtual machine deduplication and deduplication of block storage. Example (chunks of five digits): 14159 26535 89793 23846 26433 83279 50288 41971 37510; after inserting a single character 'A' behind the first chunk, every following chunk boundary shifts: 14159 A2653 58979 32384 62643 38327 95028 84197 13751 0.

Content-defined Chunking: Chunk boundaries are determined by the content itself. A fingerprint (hash) is calculated for each substring of size w, and a chunk ends at a position where the fingerprint f satisfies f mod n = c for some constant 0 <= c < n. This yields chunks of variable length with an expected length of n. U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, 1994.

Content-defined Chunking: Each change only impacts its direct neighbors. Example (digits of pi, content-defined boundaries): inserting 'A' into the first chunk only changes the chunks around the insertion point, e.g. 14159 26535 89793 23846 ... becomes 14159 A2653589793 23846 ..., while all later chunks (26433 83279 50288 41971 69399 37510) keep their boundaries.

Special Cases: Very small chunks (e.g. caused by an unfortunate repetition within the 48-byte window) require more memory for the fingerprint than for the actual data. Very big chunks (e.g. long runs of zeros) cause high memory demand during processing. Solution: define a minimum and maximum chunk length, typically between 2 KBytes and 64 KBytes.
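
A minimal content-defined chunking sketch under the assumptions of the previous slides: the boundary rule f mod n = c, a window of w = 48 bytes, and min/max chunk lengths of 2 KB and 64 KB. The window fingerprint is a plain polynomial hash recomputed per position for clarity; a real chunker would use the rolling hash introduced next. All parameter values are illustrative.

```python
W = 48                            # window size in bytes
N, C = 8192, 0                    # boundary rule: fingerprint mod N == C
MIN_LEN, MAX_LEN = 2 * 1024, 64 * 1024

def window_fingerprint(window: bytes) -> int:
    # Plain polynomial hash over the window, recomputed per position for clarity.
    f = 0
    for b in window:
        f = (f * 257 + b) % (1 << 32)
    return f

def cdc_chunks(data: bytes):
    start = 0
    for i in range(1, len(data) + 1):
        length = i - start
        boundary = (length >= MIN_LEN and
                    window_fingerprint(data[i - W:i]) % N == C)
        if boundary or length >= MAX_LEN:
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]        # trailing chunk without a boundary
```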

Processing Overhead: Content-defined chunking requires the calculation of one fingerprint for each substring of length w, i.e. one fingerprint per window position. The processing overhead of a fingerprint typically depends on the string length: a small window length gives good performance but bad chunking properties, while a large window length gives good chunking properties but a huge performance impact. Way out: use rolling hash functions, which allow the new fingerprint to be calculated from the previous fingerprint in constant time.

Rolling Hash: A rolling hash is a hash function whose input is hashed in a window that moves through the input. A few hash functions allow a rolling hash to be computed very quickly: the new hash value is calculated given only the old hash value, the value removed from the window, and the value added to the window. Applications besides data deduplication: the Rabin-Karp string search algorithm and rsync. (Wikipedia: Rolling hash)

Rabin Fingerprints: Rabin fingerprints require only multiplications and additions: F = c_0 a^(k-1) + c_1 a^(k-2) + c_2 a^(k-3) + ... + c_(k-1) a^0. Typically, all operations are performed modulo n. Of course, the choices of a and n are critical for good hashing properties. Calculating a new fingerprint from the previous one requires just one addition, one subtraction, and one multiplication by a. Performance result: we have measured 102 MByte/s per core on a 2 GHz processor. M. O. Rabin. Fingerprinting by random polynomials. Technical report, Center for Research in Computing Technology, 1981.
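
A sketch of the constant-time rolling update for a polynomial fingerprint of the form F = c_0 a^(k-1) + ... + c_(k-1) a^0 mod n. The concrete values of a, n, and k are illustrative assumptions; genuine Rabin fingerprints work with irreducible polynomials over GF(2), which this integer version only approximates.

```python
A, MOD, K = 263, (1 << 31) - 1, 48          # base a, modulus n, window length k (assumed values)
A_K1 = pow(A, K - 1, MOD)                   # a^(k-1) mod n, precomputed once

def initial_fp(window: bytes) -> int:
    f = 0
    for c in window:
        f = (f * A + c) % MOD
    return f

def roll(f: int, c_out: int, c_in: int) -> int:
    """Slide the window by one byte: drop c_out on the left, append c_in on the right."""
    return ((f - c_out * A_K1) * A + c_in) % MOD

# Consistency check: rolling the hash equals recomputing it from scratch.
data = b"theoretical aspects of storage systems: data deduplication" * 4
f = initial_fp(data[:K])
for i in range(1, len(data) - K + 1):
    f = roll(f, data[i - 1], data[i + K - 1])
    assert f == initial_fp(data[i:i + K])
```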

Duplicate Detection: The system has to check for every chunk whether it is a duplicate or not. Compare-by-hash: calculate one fingerprint for every chunk (typically SHA-1) and check whether this fingerprint is already known to the system. SHA-1 is still very costly: 73.1 MByte/s throughput on each 2 GHz core.

Compare-by-Hash: The applicability of the approach is (at best) debated (see e.g. [Hen03]): "Use of compare-by-hash is justified by mathematical calculations based on assumptions that range from unproven to demonstrably wrong. The short lifetime and fast transition into obsolescence of cryptographic hashes makes them unsuitable for use in long-lived systems. When hash collisions do occur, they cause silent errors and bugs that are difficult to repair." V. Henson. An analysis of compare-by-hash. In HOTOS '03: Proceedings of the 9th Conference on Hot Topics in Operating Systems.

Compare-by-Hash: Data loss caused by accidental collisions (birthday paradox): it is sufficient for one block to have the same hash value as an arbitrary other block to produce silent data corruption. Assuming n data blocks and a hash length of b bits, this probability can be bounded by p <= n(n-1)/2 * 2^(-b). Assume 1 Exabyte of data (2^60 bytes), a 4 KByte chunk size (2^12 bytes), and a 160-bit SHA-1 fingerprint: p < 10^-19. But attacks can become successful as soon as SHA-1 gets broken. J. Black. Compare-by-hash: a reasoned analysis. In ATEC '06: Proceedings of the Annual Conference on USENIX '06 Annual Technical Conference.
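
The bound can be checked numerically with the slide's parameters (pure arithmetic, no further assumptions):

```python
# p <= n(n-1)/2 * 2^-b for 1 EB of data, 4 KB chunks, 160-bit SHA-1 fingerprints.
n = 2**60 // 2**12          # number of chunks: 2^48
b = 160                     # fingerprint length in bits
p_bound = n * (n - 1) / 2 / 2**b
print(f"{p_bound:.1e}")     # about 2.7e-20, i.e. below the 10^-19 quoted on the slide
```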

Internal Redundancy: Redundancy inside an AFS file system at the University of Paderborn. Legend: CDC = content-defined chunking, SC = static chunking, Datei = file-based chunking. D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009.

Internal Redundancy based on Data Type D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009

Internal Redundancy based on File Size: chart comparing CDC-8, CDC-16, SC-8, SC-16, and file-based chunking (Datei) over file-size classes from < 4 KB up to > 2 GB; the y-axis shows the redundancy from 0% to 50%. D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009.

Temporal Redundancy: Redundancy when previous backup runs are taken into account. D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009.

Bottleneck Index: One fingerprint is stored for each chunk in the chunk index, an index of all previously seen chunks. Assuming a chunk size of 8 KBytes, there are 2^40 / 2^13 = 2^27 chunks for each TByte of data; with 20-byte SHA-1 fingerprints this results in 2.5 GBytes of index for each TByte of data. The index cannot be stored in main memory for large-scale storage systems, but storing it on disk results in no locality in index accesses (except for archiving), i.e. random I/O accesses on disk: 100 MByte/s of throughput requires up to 24,000 index lookups per second, so the disk becomes the bottleneck.
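
The index-size estimate can be reproduced with a short calculation using only the slide's numbers:

```python
# Back-of-the-envelope index size: 8 KB chunks, 20-byte SHA-1 fingerprints, per TB of data.
chunks_per_tb = 2**40 // 2**13            # 2^27 = 134,217,728 chunks per TByte
index_bytes = chunks_per_tb * 20          # one 20-byte fingerprint per chunk
print(index_bytes / 2**30)                # 2.5 -> 2.5 GBytes of index per TByte
```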

What can we do? Use more disks: with 200 IO/s per disk, 10 disks lead to a deduplication throughput of 6 MB/s. Use SSDs: an Intel X25-E delivers 13,000 IO/s, so one SSD achieves 60 MB/s deduplication throughput. The target throughput, however, is >> 200 MB/s.

Approach 1: Bloom Filter (1): A probabilistic data structure similar to a set, with Insert(key) and Lookup(key). Lookup(key) = false: the item is guaranteed not to be in the set. Lookup(key) = true: the item is probably (!) in the set. Bloom filter for fingerprints: Lookup(fp) = false means no chunk index lookup is necessary; Lookup(fp) = true means a lookup is still required. B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In 6th USENIX Conference on File and Storage Technologies, 2008.

Bloom Filter (2): Data structure: a bitmap b of length m and k independent hash functions h_i. Insert(i): set bit positions h_1(i), ..., h_k(i) to 1. Lookup(i): if any of the bits h_1(i), ..., h_k(i) is 0, then the item is definitely not in the set. Probability of a false positive: e.g. 2% for k = 4 and one byte of bitmap per fingerprint. Does this really help? B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In 6th USENIX Conference on File and Storage Technologies, 2008.
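
A minimal Bloom filter sketch along the lines of the slide. Deriving the k positions from salted SHA-1 hashes is an implementation convenience assumed here, not the construction used by Zhu et al.; the constructor parameters are illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)      # bitmap b of m bits

    def _positions(self, key: bytes):
        # k "independent" hash functions, derived here by salting SHA-1 (assumed scheme).
        for salt in range(self.k):
            h = hashlib.sha1(bytes([salt]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def lookup(self, key: bytes) -> bool:
        # False: definitely never inserted. True: probably inserted (false positives possible).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

With roughly one byte of bitmap per inserted fingerprint (m about 8n) and k = 4, the standard estimate (1 - e^(-kn/m))^k gives a false-positive rate of about 2.4%, consistent with the 2% figure on the slide.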

Approach 2: Locality-Preserving Caching: A container is a sequence of new chunks, 4 MB to 10 MB in size; metadata (fingerprints) and data are stored together on disk. Chunk lookup: read and cache the fingerprints of the complete container, then look up each subsequent chunk in the container cache. Idea: long runs of chunks map to the same container, so one I/O fetches the metadata of a complete container. B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In 6th USENIX Conference on File and Storage Technologies, 2008.
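
A rough sketch of the container cache. The helper names full_index (an on-disk fingerprint-to-container index) and load_container_metadata (returns all fingerprints of a container) are assumptions for illustration, not the Data Domain interface; eviction is plain LRU over whole containers.

```python
from collections import OrderedDict

class ContainerCache:
    def __init__(self, capacity: int, full_index, load_container_metadata):
        self.cache = OrderedDict()              # container id -> set of fingerprints (RAM)
        self.capacity = capacity
        self.full_index = full_index            # fingerprint -> container id (on disk)
        self.load = load_container_metadata     # container id -> set of fingerprints (one I/O)

    def is_duplicate(self, fp: bytes) -> bool:
        for cid, fps in self.cache.items():     # hit in a cached container: no disk I/O
            if fp in fps:
                self.cache.move_to_end(cid)     # LRU touch
                return True
        cid = self.full_index.get(fp)           # miss: expensive on-disk index lookup
        if cid is None:
            return False                        # new chunk
        self.cache[cid] = self.load(cid)        # one I/O caches the whole container's metadata
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used container
        return True
```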

Approach 3: Sparse Indexing: Do not keep the complete chunk index. Divide the incoming stream into segments (sequences of chunks of about 10 MB) and choose k champions for each segment. The champion index (in RAM) maps each champion to its segment; incoming chunks are looked up in the champion index. Idea: long runs of chunks come from the same segment, so one (successful) champion lookup delivers the fingerprints for many chunks. Applied in the HP D2D2500 and D2D4000. M. Lillibridge et al.: Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST '09).
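
A rough sketch of champion sampling and candidate-segment selection, assuming fingerprints are byte strings. Sampling fingerprints whose low bits are zero is a common rule and only an assumption here; the paper's exact sampling and segment-selection policy may differ.

```python
def choose_champions(segment_fps, sample_bits=6, k=8):
    """Return up to k sampled 'champion' fingerprints of a segment."""
    mask = (1 << sample_bits) - 1
    sampled = [fp for fp in segment_fps
               if int.from_bytes(fp[:4], "big") & mask == 0]   # sampling rule (assumed)
    return sampled[:k]

def find_candidate_segments(segment_fps, champion_index):
    """champion_index (RAM): champion fingerprint -> id of the segment that stores it.
    Returns the ids of stored segments the incoming segment should be deduplicated against."""
    return {champion_index[fp] for fp in choose_champions(segment_fps)
            if fp in champion_index}
```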

Delta-Encoding: The delta-encoding pipeline consists of chunking, near-duplicate detection, and delta encoding.

Near-Duplicate Detection via Shingling: Calculate the hash value of all w-windows of a chunk c and choose the k biggest hash values S (shingles, features). Then search for a chunk c' with the maximum number of common features; with high probability, c' is similar, but not necessarily identical, to c. This is a standard technique in information retrieval. C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, May 2008.

Resemblance Detection: The resemblance of two chunks A and B is defined as r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| <= 1. Calculating r(A, B) exactly is too computationally intensive, so only a clearly defined subset of shingles is used, e.g. the biggest values (see [Man 1994]). Broder suggests using the k minimal fingerprints and shows that the resulting function is an unbiased estimator. A. Z. Broder. Identifying and filtering near-duplicate documents. In COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching.
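
A sketch of Broder's minimum-value variant: keep the k smallest window hashes of each chunk and estimate r(A, B) from the two sketches. Window size w = 48 and k = 8 are illustrative assumptions, and SHA-1 over each window is used only for simplicity (a rolling hash would be used in practice).

```python
import hashlib

def sketch(chunk: bytes, w: int = 48, k: int = 8):
    # Hash every w-byte window of the chunk and keep the k smallest values.
    hashes = {int.from_bytes(hashlib.sha1(chunk[i:i + w]).digest()[:8], "big")
              for i in range(len(chunk) - w + 1)}
    return set(sorted(hashes)[:k])

def estimated_resemblance(a: bytes, b: bytes, w: int = 48, k: int = 8) -> float:
    sa, sb = sketch(a, w, k), sketch(b, w, k)
    union_min_k = set(sorted(sa | sb)[:k])      # k smallest shingles of the union
    return len(union_min_k & sa & sb) / k       # unbiased estimate of |S(A)∩S(B)| / |S(A)∪S(B)|
```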

Delta Encoding: Chunk c can be compressed relative to a reference chunk c'. Example: c = QWIJKLMNOBCDEFGHZDEFGHIJKL, c' = ABCDEFGHIJKLMNOP. Delta commands (offsets into c' are 0-based): (Insert; 2; QW) -> QW; (Copy; 7; 8) -> IJKLMNO; (Copy; 7; 1) -> BCDEFGH; (Insert; 1; Z) -> Z; (Copy; 9; 3) -> DEFGHIJKL.
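
A tiny sketch that replays the slide's delta commands against the reference chunk c'; commands are modeled as (operation, length, literal-or-offset) tuples with 0-based offsets, which is an assumed encoding for illustration.

```python
def apply_delta(reference: str, commands) -> str:
    out = []
    for op, length, arg in commands:
        if op == "Insert":                          # (Insert; length; literal data)
            out.append(arg)
        else:                                       # (Copy; length; offset into the reference)
            out.append(reference[arg:arg + length])
    return "".join(out)

c_ref = "ABCDEFGHIJKLMNOP"
delta = [("Insert", 2, "QW"), ("Copy", 7, 8), ("Copy", 7, 1),
         ("Insert", 1, "Z"), ("Copy", 9, 3)]
assert apply_delta(c_ref, delta) == "QWIJKLMNOBCDEFGHZDEFGHIJKL"
```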

Delta Encoding: Douglis and Iyengar claim that "Delta-encoding itself has been made extremely efficient, and it should not usually be a bottleneck except in extremely high-bandwidth environments. [...] The inclusion of the Ajtai delta-encoding work in a commercial backup system also supports the argument that DERD will not be limited by the delta-encoding bandwidth." F. Douglis and A. Iyengar. Application-specific delta-encoding via resemblance detection. In Proceedings of the 2003 USENIX Annual Technical Conference.

Diligent Delta-Encoding: Memory Overhead: Chunks have a size of 32 MBytes; resemblance detection is based on shingles over 4 KByte windows, and the k = 8 maximum-value shingles per chunk are used as features. The feature index for 1 TByte of data has size (2^40 / m) * k * f bytes, where m is the chunk size, k is the number of shingles per chunk, and f is the size of a fingerprint. For Diligent, the feature index therefore has size (2^40 / 2^25) * 8 * 16 bytes = 4 MBytes. The approach is optimized for backup applications. Diligent Technologies. HyperFactor: a breakthrough in data reduction technology. Diligent White Paper.

Discussion: Delta-Encoding: Delta encoding can help to overcome the disk bottleneck in today's deduplication systems, but the reconstruction of a chunk can trigger the reconstruction of additional chunks, slowing down both reading and writing of data. The problem becomes worse as the system gets older, because the depth of the dedup tree increases. A clean-up process can help to limit this problem; Diligent has, to our knowledge, restricted the maximum tree depth.