Theoretical Aspects of Storage Systems Autumn 2009 Chapter 3: Data Deduplication André Brinkmann
Outline Data Deduplication Compare-by-hash strategies Delta-encoding based strategies Measurements
Motivation Backups: 26 full backups require 26 times the backup capacity, although there are only few changes and high redundancy between the different backups. Similar behavior can be seen for virtual machine images, home directories, and network file systems (LBFS). Data deduplication removes redundant data and tries to ensure that information is only stored once.
Different Approaches Fingerprinting: use hashing schemes to characterize the content of a data block. Delta encoding: search for near-duplicates and store just the delta between blocks.
Fingerprinting Fingerprinting is based on four stages: Chunking (divide the data stream into chunks of fixed or variable size), Fingerprinting (calculate a hash function for each chunk), Duplicate Detection (compare the hash result with the already stored index), and finally updating the indexes and storing the data.
Chunking The process of chunking divides the data stream or file into smaller, non-overlapping blocks (chunks). Different approaches: static chunking, content-defined chunking, and file-based chunking.
Static Chunking Each chunk has a fixed size. Very fast approach, but vulnerable to shifts inside the data stream: inserting a single character into the example digit stream shifts all following chunk boundaries, so every subsequent chunk changes. Seldom used; examples are virtual machine deduplication and deduplication of block storage.
Content-defined Chunking Chunk boundaries are determined by the content itself: a fingerprint (hash) is calculated for each substring (window) of size w, and a chunk ends at a position where the fingerprint f satisfies f mod n = c for some constant 0 <= c < n. This influences the chunk size: chunks have variable length, with an expected length of n. U. Manber. Finding Similar Files in a Large File System. In Proceedings of the USENIX Winter 1994 Technical Conference, 1994.
Content-defined Chunking Each change only impacts its direct neighbors: in the example digit stream, inserting a character changes only the chunk containing the insertion, while the following chunk boundaries re-synchronize with the unmodified stream.
Special Cases Very small chunks (e.g., caused by an unfortunate repetition inside the 48-byte window) require more memory for the fingerprint than for the actual data. Very big chunks (e.g., long runs of zeros) cause high memory demand during processing. Solution: define a minimum and maximum chunk length, typically between 2 KByte and 64 KByte.
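A minimal Python sketch of content-defined chunking with min/max chunk lengths; the MD5-based window fingerprint is only a stand-in for a rolling hash, and the values for the window size, divisor n, and constant c are illustrative:

```python
import hashlib
import os

W = 48            # window size in bytes (the 48-byte window from the slide)
N = 8192          # a boundary occurs with probability 1/N per position
C = 0             # the constant c with 0 <= c < n
MIN_LEN = 2 * 1024
MAX_LEN = 64 * 1024

def window_fingerprint(window: bytes) -> int:
    # Stand-in for a rolling hash, recomputed from scratch for clarity.
    return int.from_bytes(hashlib.md5(window).digest()[:8], "big")

def cdc_chunks(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start = 0
    for pos in range(len(data)):
        length = pos - start + 1
        at_boundary = False
        if length >= MIN_LEN and pos + 1 >= W:
            fp = window_fingerprint(data[pos + 1 - W:pos + 1])
            at_boundary = (fp % N == C)
        if at_boundary or length >= MAX_LEN:
            yield (start, pos + 1)
            start = pos + 1
    if start < len(data):
        yield (start, len(data))

# Random example data: boundaries depend only on window content, so an
# insertion shifts at most the chunk that contains it.
print(list(cdc_chunks(os.urandom(256 * 1024)))[:5])
```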
Processing Overhead Content-defined chunking requires the calculation of one fingerprint for each substring of length w, i.e. one fingerprint per window position in the data stream. The processing overhead per fingerprint typically depends on the string length: a small window length gives good performance but bad chunking properties, while a large window length gives good chunking properties but a huge performance impact. Way out: use rolling hash functions, which allow the new fingerprint to be calculated from the previous fingerprint in constant time.
Rolling Hash A rolling hash is a hash function where the input is hashed in a window that moves through the input. Some hash functions allow a rolling hash to be computed very quickly: the new hash value is calculated rapidly given only the old hash value, the value removed from the window, and the value added to the window. Applications besides data deduplication: the Rabin-Karp string search algorithm and rsync. (Source: Wikipedia, Rolling hash)
Rabin Fingerprints Rabin fingerprints only require multiplications and additions: F = c_0 a^(k-1) + c_1 a^(k-2) + c_2 a^(k-3) + ... + c_(k-1) a^0. Typically, all operations are performed modulo n; of course, the choices of a and n are critical for good hashing properties. The calculation of a new fingerprint from the old one requires just one addition, one subtraction, and one multiplication by a. Performance results: we have measured 102 MByte/s for each processor core of a 2 GHz processor. M. O. Rabin. Fingerprinting by Random Polynomials. Technical report, Center for Research in Computing Technology, 1981.
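A sketch of the rolling update, assuming illustrative choices for the base a and the modulus n; it checks that the constant-time update matches recomputing the fingerprint from scratch:

```python
# F = c_0*a^(k-1) + c_1*a^(k-2) + ... + c_(k-1)*a^0 (mod n); sliding the
# window needs one subtraction, one multiplication by a, and one addition.
A = 263                   # base a (illustrative choice)
N = (1 << 61) - 1         # modulus n (illustrative choice)

def full_fingerprint(window: bytes) -> int:
    f = 0
    for byte in window:
        f = (f * A + byte) % N
    return f

def roll(f: int, out_byte: int, in_byte: int, a_pow: int) -> int:
    """Update f when out_byte leaves and in_byte enters the window;
    a_pow must equal A**(k-1) % N for window length k."""
    f = (f - out_byte * a_pow) % N
    return (f * A + in_byte) % N

data = b"theoretical aspects of storage systems"
k = 8
a_pow = pow(A, k - 1, N)
f = full_fingerprint(data[:k])
for i in range(1, len(data) - k + 1):
    f = roll(f, data[i - 1], data[i + k - 1], a_pow)
    assert f == full_fingerprint(data[i:i + k])
print("rolling update matches recomputation")
```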
Duplicate Detection The system has to check for every chunk whether it is a duplicate or not. Compare-by-hash: calculate one fingerprint for every chunk (typically SHA-1) and check whether this fingerprint is already known to the system. SHA-1 is still very costly: 73.1 MByte/s throughput on each 2 GHz core.
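A minimal compare-by-hash sketch, with a Python dict standing in for the chunk index:

```python
import hashlib

chunk_index = {}   # fingerprint -> storage location (here just a counter)

def store_chunk(chunk: bytes) -> str:
    fp = hashlib.sha1(chunk).hexdigest()
    if fp not in chunk_index:            # new chunk: store data and index it
        chunk_index[fp] = len(chunk_index)
    return fp                            # duplicates only store a reference

refs = [store_chunk(c) for c in (b"aaaa", b"bbbb", b"aaaa")]
print(refs[0] == refs[2], "unique chunks:", len(chunk_index))
```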
Compare-by-Hash The applicability of the approach is (at best) disputed (see e.g. [Hen03]): "Use of compare-by-hash is justified by mathematical calculations based on assumptions that range from unproven to demonstrably wrong. The short lifetime and fast transition into obsolescence of cryptographic hashes makes them unsuitable for use in long-lived systems. When hash collisions do occur, they cause silent errors and bugs that are difficult to repair." V. Henson. An Analysis of Compare-by-hash. In HOTOS '03: Proceedings of the 9th Conference on Hot Topics in Operating Systems.
Compare-by-Hash Data loss based on accidental collisions: birthday paradox: it is sufficient for one block to have the same hash value as an arbitrary other block to produce silent data corruption. Assuming n data blocks and a hash length of b bits, this probability can be bounded by p <= (n(n-1)/2) * 2^(-b). Assume 1 Exabyte of data (2^60 Bytes), 4 KByte chunk size (2^12 Bytes), and a 160 bit SHA-1 fingerprint: p < 10^-19. But attacks can become successful as soon as SHA-1 gets broken. J. Black. Compare-by-hash: A Reasoned Analysis. In ATEC '06: Proceedings of the 2006 USENIX Annual Technical Conference.
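Plugging the slide's numbers into the bound p <= (n(n-1)/2) * 2^(-b), as a quick check with exact integer arithmetic:

```python
from fractions import Fraction

n = 2**60 // 2**12                 # number of 4 KByte chunks in 1 Exabyte
b = 160                            # SHA-1 fingerprint length in bits
p = Fraction(n * (n - 1), 2) / 2**b
print(float(p))                    # about 2.7e-20, i.e. below 10^-19
```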
Internal Redundancy Redundancy inside an AFS file system at the University of Paderborn. Legend: CDC = content-defined chunking, SC = static chunking, "Datei" (German for file) = file-based chunking. D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009.
Internal Redundancy based on Data Type D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009
Internal Redundancy based on File Size (Figure: deduplication ratio, 0% to 50%, per file-size class from < 4K up to > 2G, for CDC-8, CDC-16, SC-8, SC-16, and file-based chunking.) D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009.
Temporal Redundancy Redundancy considering previous backup runs D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009
Bottleneck Index One fingerprint for each chunk; the chunk index is an index of all previously accessed chunks. Assume a chunk size of 8 KByte: 2^40 / 2^13 = 2^27 chunks for each TByte of data. Using 20 Byte SHA-1 fingerprints, this gives a 2.5 GByte index for each TByte of data. The index cannot be stored in main memory for large-scale storage systems, but storing it on disk results in no locality in index access (except for archiving), i.e., random I/O accesses on disk: 100 MByte/s throughput requires up to 24,000 index lookups per second, so the disk becomes the bottleneck.
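The index-size figures can be reproduced directly (assuming 1 TByte = 2^40 bytes and 8 KByte = 2^13 bytes):

```python
data_size = 2**40           # 1 TByte
chunk_size = 2**13          # 8 KByte
fp_size = 20                # bytes per SHA-1 fingerprint

chunks = data_size // chunk_size      # 2^27 chunks per TByte
index_size = chunks * fp_size         # index bytes per TByte

print(chunks, index_size / 2**30)     # 134217728 chunks, 2.5 (GByte)
```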
What can we do? Use more disks: at 200 IO/s per disk, 10 disks lead to a deduplication throughput of 6 MB/s. Use SSDs: an Intel X25-E delivers 13,000 IO/s, so one SSD achieves 60 MB/s deduplication throughput. The target throughput, however, is >> 200 MB/s.
Approach 1: Bloom Filter (1) A probabilistic data structure similar to a set, with Insert(key) and Lookup(key). Lookup(key) = false: the item is guaranteed not to be in the set. Lookup(key) = true: the item is probably (!) in the set. Bloom filter for fingerprints: Lookup(fp) = false means no chunk index lookup is necessary; Lookup(fp) = true means a lookup is still required. B. Zhu, K. Li, and H. Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In 6th USENIX Conference on File and Storage Technologies, 2008.
Bloom Filter (2) Data structure: a bitmap b of length m and k independent hash functions h_i. Insert(i): set the bit positions h_1(i), ..., h_k(i) to 1. Lookup(i): if any of the bits h_1(i), ..., h_k(i) is 0, then the item is definitely not inside the set. There is a probability of false positives, e.g. 2% for k = 4 and 1 Byte of bitmap per fingerprint. Does this really help? B. Zhu, K. Li, and H. Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In 6th USENIX Conference on File and Storage Technologies, 2008.
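A minimal Bloom filter sketch; deriving the k hash functions from SHA-1 with different one-byte prefixes is an illustrative construction, not necessarily the one used by Data Domain:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k: int):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: bytes):
        # k hash functions derived from SHA-1 with different prefixes.
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def lookup(self, key: bytes) -> bool:
        # False -> definitely not in the set; True -> probably in the set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Sized for roughly 1,000 fingerprints at 1 byte of bitmap each; with k = 4
# this corresponds to the roughly 2% false positive rate quoted above.
bf = BloomFilter(m_bits=8 * 1000, k=4)
bf.insert(b"fingerprint-1")
print(bf.lookup(b"fingerprint-1"), bf.lookup(b"fingerprint-2"))
```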
Approach 2: Locality-Preserving Caching A container is a sequence of new chunks (4 MB to 10 MB); metadata (fingerprints) and data are stored together on disk. Chunk lookup: read and cache the fingerprints of the complete container, then look up each sequential chunk in the container cache. Idea: long runs of chunks map to the same container, so one IO fetches the metadata of a complete container. B. Zhu, K. Li, and H. Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In 6th USENIX Conference on File and Storage Technologies, 2008.
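A sketch of the caching idea, with dicts standing in for the on-disk chunk index and container store; all names are illustrative:

```python
# A cache miss that hits the on-disk chunk index loads the fingerprints of
# the whole container into the cache, so a long run of chunks from the same
# container costs a single container read.

container_of_fp = {}     # on-disk chunk index: fingerprint -> container id
fps_in_container = {}    # on-disk container metadata: container id -> set of fps
cache = set()            # in-memory fingerprint cache

def is_duplicate(fp) -> bool:
    if fp in cache:                            # cheap in-memory hit
        return True
    container = container_of_fp.get(fp)        # expensive on-disk lookup
    if container is None:
        return False                           # new chunk
    cache.update(fps_in_container[container])  # one IO, many future cache hits
    return True

container_of_fp.update({"fp-1": 7, "fp-2": 7, "fp-3": 7})
fps_in_container[7] = {"fp-1", "fp-2", "fp-3"}
print(is_duplicate("fp-1"), is_duplicate("fp-2"))   # second lookup hits the cache
```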
Approach 3: Sparse Indexing Do not keep the complete chunk index. Divide the index into segments (sequences of chunks of about 10 MB) and choose k champions for each segment. The champion index (in RAM) maps champion to segment; lookups are performed inside the champion index. Idea: long runs of chunks come from the same segment, so one (successful) champion lookup delivers the fingerprints for many chunks. Applied inside HP D2D2500 and D2D4000. M. Lillibridge et al.: Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST '09).
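A sketch of the sampling idea, assuming champions are simply the fingerprints whose low bits are zero; the paper's sampling rule is more refined (e.g. it caps the number of champions per segment):

```python
SAMPLE_MASK = (1 << 6) - 1        # keep roughly 1 out of every 64 fingerprints

champion_index = {}               # champion fingerprint -> segment id

def index_segment(segment_id, segment_fps):
    for fp in segment_fps:
        if fp & SAMPLE_MASK == 0:             # fp is a champion
            champion_index[fp] = segment_id

def candidate_segments(incoming_fps):
    # Segments sharing a champion with the incoming segment; their full
    # fingerprint lists are then fetched and used for deduplication.
    return {champion_index[fp] for fp in incoming_fps if fp in champion_index}

index_segment(1, range(0, 1000))
print(candidate_segments({64, 65, 4711}))     # {1}: champion 64 matches
```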
Delta Encoding Pipeline: Chunking, Near-Duplicate Detection, Delta-Encoding.
Shingling (Near-Duplicate Detection) Calculate the hash values of all w-windows of a chunk c and choose the k biggest hash values as its set S of shingles (features). Then seek a stored chunk c' with the maximum number of common features: with high probability, c' is similar, but not necessarily identical, to c. Shingling is a standard technique in information retrieval. C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, May 2008.
Resemblance Detection The resemblance of two chunks A and B is defined as r(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| <= 1. Calculating r(A,B) exactly is too computationally intensive, so only a clearly defined subset of the shingles is used, e.g. the biggest hash values (see [Man 1994]). Broder suggests using the k minimal fingerprints instead and shows that the resulting function is an unbiased estimator of the resemblance. A. Z. Broder. Identifying and Filtering Near-Duplicate Documents. In COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching.
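A simplified sketch of shingling-based resemblance estimation: keep the k smallest window hashes of each chunk as its feature set and compare the feature sets. Window size and k are illustrative, and Broder's unbiased estimator compares the k smallest values of the union, so this is only an approximation:

```python
import hashlib
import heapq

W = 16     # shingle (window) size in bytes
K = 8      # number of features kept per chunk

def features(chunk: bytes) -> set:
    hashes = {int.from_bytes(hashlib.md5(chunk[i:i + W]).digest()[:8], "big")
              for i in range(len(chunk) - W + 1)}
    return set(heapq.nsmallest(K, hashes))

def estimated_resemblance(a: bytes, b: bytes) -> float:
    fa, fb = features(a), features(b)
    return len(fa & fb) / len(fa | fb)

x = b"the quick brown fox jumps over the lazy dog " * 50
y = x.replace(b"lazy", b"sleepy", 1)     # a near-duplicate: one small edit
z = bytes(range(256)) * 10               # unrelated data
print(estimated_resemblance(x, y))       # clearly higher for the near-duplicate
print(estimated_resemblance(x, z))       # essentially 0 for unrelated data
```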
Delta Encoding Chunk c can be compressed based on a similar chunk c', using (Copy; length; offset in c') and (Insert; length; literal data) operations:
c = QWIJKLMNOBCDEFGHZDEFGHIJKL
c' = ABCDEFGHIJKLMNOP
(Insert; 2; QW) -> QW
(Copy; 7; 8) -> IJKLMNO
(Copy; 7; 1) -> BCDEFGH
(Insert; 1; Z) -> Z
(Copy; 9; 3) -> DEFGHIJKL
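A sketch of applying such a delta, assuming the operation format (Copy; length; offset in c') and (Insert; length; literal data) read off the example above:

```python
def apply_delta(reference: bytes, delta) -> bytes:
    """Reconstruct a chunk from a reference chunk and a list of delta ops."""
    out = bytearray()
    for op, length, arg in delta:
        if op == "Copy":                      # copy `length` bytes from c'
            out += reference[arg:arg + length]
        elif op == "Insert":                  # emit literal data
            out += arg
    return bytes(out)

c_ref = b"ABCDEFGHIJKLMNOP"
delta = [("Insert", 2, b"QW"), ("Copy", 7, 8), ("Copy", 7, 1),
         ("Insert", 1, b"Z"), ("Copy", 9, 3)]
print(apply_delta(c_ref, delta))   # b'QWIJKLMNOBCDEFGHZDEFGHIJKL'
```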
Delta Encoding Douglis and Iyengar claim that "delta-encoding itself has been made extremely efficient, and it should not usually be a bottleneck except in extremely high-bandwidth environments. [...] The inclusion of the Ajtai delta-encoding work in a commercial backup system also supports the argument that DERD will not be limited by the delta-encoding bandwidth." F. Douglis and A. Iyengar. Application-specific Delta-encoding via Resemblance Detection. In Proceedings of the 2003 USENIX Annual Technical Conference.
Diligent Delta-Encoding: Memory Overhead Chunks have a size of 32 MByte; resemblance detection is based on shingles of 4 KByte each, of which the k = 8 maximum-value shingles are used. The feature index for 1 TByte of data has size (2^40 / m) * k * f Bytes, where m is the chunk size, k is the number of shingles per chunk, and f is the size of a fingerprint. For Diligent, the feature index has size (2^40 / 2^25) * 8 * 16 Bytes = 4 MByte. The approach is optimized for backup applications. Diligent Technologies. HyperFactor -- A Breakthrough in Data Reduction Technology. Diligent White Paper.
Discussion: Delta Encoding Delta encoding can help to overcome the disk bottleneck in today's deduplication systems, but the reconstruction of a chunk can trigger the reconstruction of additional chunks, which slows down both reading and writing of data. The problem becomes worse as the system gets older, because the depth of the dedup tree increases. A clean-up process can help to limit this problem; Diligent has, to our knowledge, restricted the maximum tree depth for this reason.