Data Deduplication. Hao Wen


What Data Deduplication is

What Data Deduplication is: Dedup vs. Compression
- Compression: identifies redundancy within a single file. High processor overhead, low memory requirement.
- Deduplication: the comparison range spans all files (in fact, segments of those files) in the environment. More memory intensive, so index data ends up on disk, with caching techniques to move it in and out of DRAM. Access patterns are traditionally not cache friendly (not FIFO).

Evaluation
- Dedup ratio.
- Throughput: the rate at which data can be transferred into and out of the system. High throughput is particularly important because it enables fast backups, minimizing the backup window.
- Scalability: the ability to support large amounts of raw storage with consistent performance.

Classification: Dedup Location
- Source: when deduplication occurs close to where the data is created, it is often referred to as "source deduplication".
- Target: when it occurs near where the data is stored, it is commonly called "target deduplication".

Classification: When to Dedup
- Post-process: new data is first stored on the storage device, and a later process analyzes the data looking for duplicates.
- Inline: deduplication hash calculations are performed on the target device as the data enters it in real time. If the device spots a block it has already stored, it does not store the new block, just a reference to the existing one.

Scenarios
- Backup dedup: deduplication in backup and archival systems was introduced by Microsoft SIS in 2000 and Venti in 2002. Data is write-once.
- Primary dedup: performance sensitive; data is not write-once and modifications are expected. Copy-on-write prevents updates on aliased data, and chunk references change quickly.

Scenario

Methods: Hash Based
Each chunk of data is assigned an identifier calculated by the software. The assumption is that if the identifiers are identical, the data is identical.

Methods: Hash Based, Fixed-Size Chunking
- Pros: fast, simple, minimal CPU.
- Cons: a small modification shifts all subsequent chunk boundaries:
  original: aaa aaa aaa
  new:      aaa baa aaa a
  After one inserted byte, only the first chunk still matches (see the sketch below).
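To see the failure mode concretely, here is a tiny Python sketch of fixed-size chunking run on the slide's example (the chunk size of 3 and the byte strings simply mirror the slide):

    def fixed_chunks(data: bytes, size: int = 3):
        """Split data into fixed-size chunks; the last one may be short."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    print(fixed_chunks(b"aaaaaaaaa"))               # [b'aaa', b'aaa', b'aaa']
    print(fixed_chunks(b"aaa" + b"b" + b"aaaaaa"))  # [b'aaa', b'baa', b'aaa', b'a']
    # One inserted byte shifts every later boundary, so none of the following
    # chunks match their originals and nothing past the edit deduplicates.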

Methods: Hash Based, Variable-Size Chunking
Content-defined chunking (CDC).

Methods: Hash Based, CDC Parameters
- exp_chunk_size: 8 KB
- chunk_mask: 0x1fff (the last 13 bits)
- magic_value: 0x12
Since 2^13 = 8K, testing the low 13 bits of the rolling hash against the magic value produces an expected chunk size of 8 KB.
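Below is a minimal sketch of how these parameters drive content-defined chunking. It is illustrative only: a toy polynomial rolling hash stands in for a real Rabin fingerprint, and WINDOW and BASE are invented constants; CHUNK_MASK and MAGIC follow the slide.

    import hashlib

    WINDOW = 48          # sliding-window width in bytes (invented)
    BASE = 257           # base of the toy polynomial rolling hash (invented)
    CHUNK_MASK = 0x1fff  # low 13 bits: expected chunk size 2^13 = 8 KB
    MAGIC = 0x12         # boundary when (hash & CHUNK_MASK) == MAGIC
    MOD = 1 << 64        # keep the rolling hash in 64 bits

    def cdc_chunks(data: bytes):
        """Yield (sha1_hex, chunk_bytes) pairs for content-defined chunks."""
        pow_out = pow(BASE, WINDOW - 1, MOD)
        start, h = 0, 0
        for i, byte in enumerate(data):
            if i >= WINDOW:                       # drop the byte leaving the window
                h = (h - data[i - WINDOW] * pow_out) % MOD
            h = (h * BASE + byte) % MOD           # bring in the new byte
            if i - start + 1 >= WINDOW and (h & CHUNK_MASK) == MAGIC:
                chunk = data[start:i + 1]         # boundary: close the chunk here
                yield hashlib.sha1(chunk).hexdigest(), chunk
                start = i + 1
        if start < len(data):                     # whatever remains is the last chunk
            chunk = data[start:]
            yield hashlib.sha1(chunk).hexdigest(), chunk

Because a boundary depends only on the bytes inside the window, an insertion early in the stream perturbs only the chunks around the edit; later boundaries resynchronize, which is exactly what fixed-size chunking cannot do.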

Methods: Content Aware

Overview of CBC Approach: Sliding Window
Byte stream: abcdegaacgdgyrgfdhchjsdfhjrchcsaaabgdvcasdfegggasdvghhyuufhjjfg
At the stream beginning, the window holds: abcdega

Overview of CBC Approach
Compute the Rabin fingerprint of the window: H("abcdega") = 726. Test the boundary condition: 726 mod 128 ≠ 0, so this is not a chunk boundary.

Overview of CBC Approach: Move Forward
Slide the window one byte at a time, recomputing the fingerprint at each position:
- H("bcdegaa") = 4693; 4693 mod 128 ≠ 0, not a boundary
- H("cdegaac") = 8359; 8359 mod 128 ≠ 0, not a boundary
- ... (intermediate positions omitted) ...
- H("acgdgyr") = 40960; 40960 mod 128 = 0, a boundary

Overview of CBC Approach: Set Boundary
The boundary condition holds at window "acgdgyr", so the chunk ends here. Its chunk ID is the 160-bit SHA-1 of its contents: SHA1("abcdegaacgdgyr") = 0x323...324

Overview of CBC Approach: Index Lookup and Chunk Store
Look up SHA1 0x323...324 in the chunk index table (fields: chk-id, chk-freq, chk-ptr). The table is initially empty, so 0x323...324 is not in it: this is a new chunk.
- Transmit the chunk "abcdegaacgdgyr" over the network to the chunk container.
- Update the index table: insert the entry (chk-id = 0x323...324, chk-freq = 1).
- Insert the chunk's address in the container (0xf3...ea1) into the entry's chk-ptr field.
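The lookup-and-store loop on these slides fits in a few lines. A minimal sketch, assuming an in-memory dict for the index table and a Python list for the chunk container (so chk-ptr becomes a list index rather than an on-disk address):

    import hashlib

    index_table = {}   # chk-id (SHA-1 hex) -> {"freq": ..., "ptr": ...}
    container = []     # chunk container; chk-ptr is an index into this list

    def store_chunk(chunk: bytes) -> str:
        """Dedup write path: store a chunk only if its SHA-1 is new."""
        chk_id = hashlib.sha1(chunk).hexdigest()
        entry = index_table.get(chk_id)
        if entry is not None:
            entry["freq"] += 1       # duplicate: bump the count, store nothing
        else:                        # new chunk: transmit it to the container
            container.append(chunk)
            index_table[chk_id] = {"freq": 1, "ptr": len(container) - 1}
        return chk_id                # callers record chunk IDs per file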

Overview of CBC Approach: Start a New Chunk
Move forward and start a new chunk after the boundary:
- H("cgdgyrg") = 51734; 51734 mod 128 ≠ 0, not a boundary
- ... (the repeated fingerprint checks and chunk generations are omitted) ...
- H("casdfeg") = 65536; 65536 mod 128 = 0, a boundary

Overview of CBC Approach: Set Boundary
The second chunk ends at window "casdfeg": SHA1("gfdhchjsdfhjrchcsaaabgdvcasdfeg") = 0x100...333 (160 bits)

Overview of CBC Approach: Duplicate Chunk
Look up SHA1 0x100...333 in the index table. An entry already exists (chk-ptr 0xff...ab1), so the chunk is a duplicate: there is no need to store it. Continue to move forward.

Overview of CBC Approach: Move Forward
H("asdfegg") = 75346; 75346 mod 128 ≠ 0, not a boundary. Continue this process until the end of the stream.

Overview of CBC Approach: File Retrieval
To retrieve foo.txt:
- Step I: get the file's index file from the chunk container; it lists the file's chunk IDs in order (foo.txt = id x, id y, id z).
- Step II: look up each chunk ID in the chunk index table to find the corresponding chunk address (chk-ptr).
- Step III: send the addresses of the requested chunks to the chunk container, then concatenate the received chunks to reconstruct foo.txt.
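Continuing the sketch from the write path above, retrieval reverses it. The per-file index file is modeled as an ordered list of chunk IDs, an assumption that mirrors the slides:

    file_index = {}   # file name -> ordered list of chunk IDs (its index file)

    def store_file(name: str, data: bytes) -> None:
        """Chunk a file (e.g., with cdc_chunks above) and record its recipe."""
        file_index[name] = [store_chunk(c) for _, c in cdc_chunks(data)]

    def retrieve_file(name: str) -> bytes:
        out = bytearray()
        for chk_id in file_index[name]:          # Step I: read the index file
            entry = index_table[chk_id]          # Step II: index-table lookup
            out.extend(container[entry["ptr"]])  # Step III: fetch and concatenate
        return bytes(out)

A round trip (store_file followed by retrieve_file) returning the original bytes is an easy self-test for any chunking scheme.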

Overview of CBC Approach: Properties
- It chooses anchors (boundaries) uniformly at random over the content. The number of anchors vs. anchor frequency follows a Zipf-like distribution, meaning the majority of anchors have low frequency.
- It mitigates the boundary-shifting problem.
- Its performance degrades significantly when changes are sprinkled throughout the data stream.
- A large sampling rate (smaller modulo N) results in smaller chunks and larger metadata overhead, so the dedup benefit may not be maximized.

Performance Issue: Dedup Ratio vs. Performance
Fine-grained chunking leads to big index tables:
a) large overhead to look up the table;
b) the table cannot fit in RAM and is partially stored on disk, adding disk overhead and requiring a cache for indexes.

Performance Issue: Dedup Ratio vs. Performance
Avoiding the Disk Bottleneck in the Data Domain Deduplication File System:
- An in-memory Bloom filter and cached index fragments reduce the number of times the system goes to disk to look for a duplicate segment only to find that none exists.
- In backup applications, chunks tend to reappear in the same or very similar sequences with other chunks, so the system dedicates containers to hold the chunks of a single stream in their logical order.
- Caching: if a chunk is a duplicate, its base chunk is very likely cached already. Descriptors of all chunks in a container are added to or removed from the chunk cache at once.
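The key idea behind the Bloom filter is that it can say "definitely not present" from memory, skipping the disk lookup for new chunks. A minimal sketch with illustrative sizes (not Data Domain's actual parameters), deriving the hash slots from a single SHA-1 digest:

    import hashlib

    class BloomFilter:
        """In-memory summary of the chunk index: a negative answer is
        definite, so a miss skips the on-disk index lookup entirely."""

        def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key: bytes):
            # Carve independent 4-byte hashes out of one SHA-1 digest.
            digest = hashlib.sha1(key).digest()
            for i in range(self.num_hashes):
                val = int.from_bytes(digest[4 * i:4 * i + 4], "big")
                yield val % self.num_bits

        def add(self, key: bytes) -> None:
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key: bytes) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

On the write path the filter is consulted first: if might_contain returns False, the chunk is certainly new and the expensive on-disk index lookup is skipped.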

Performance Issue: Dedup Ratio vs. Performance
Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality:
- If two pieces of backup streams share any chunks, they are likely to share many chunks.
- Based on segments, where a segment is a sequence of chunks; two segments are similar if they share a number of chunks.
- Identify, among all the segments in the store, those most similar to the incoming segment, then deduplicate against those segments by finding the chunks they share with the incoming segment.
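The sampling-plus-locality idea can be sketched compactly: sample a few chunk IDs per segment as hooks, and index stored segments by their hooks. The sampling rate and champion count here are invented for illustration:

    SAMPLE_MASK = 0xff   # sample roughly 1/256 of chunk IDs as hooks

    sparse_index = {}    # hook chunk ID -> list of stored segment IDs holding it

    def hooks(chunk_ids):
        """Sampling: a chunk ID is a hook if its low byte is zero."""
        return [cid for cid in chunk_ids if (int(cid, 16) & SAMPLE_MASK) == 0]

    def champion_segments(incoming_chunk_ids, max_champions=3):
        """Rank stored segments by hooks shared with the incoming segment;
        deduplication is then performed only against these champions."""
        votes = {}
        for hook in hooks(incoming_chunk_ids):
            for seg in sparse_index.get(hook, ()):
                votes[seg] = votes.get(seg, 0) + 1
        return sorted(votes, key=votes.get, reverse=True)[:max_champions]

Only the sampled hooks live in RAM, which is what keeps the index sparse; locality makes it likely that the champions cover most of the incoming segment's duplicates.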

Performance Issue: Inline vs. Post-Process
Deduplication should be performed as soon as data enters the storage system to maximize its benefits, but inline dedup consumes CPU and memory and impacts latency. If dedup runs in the background instead, up to 100% additional storage space is needed in the worst case.

Performance Issue: Dedup Fragmentation
Improving Restore Speed for Backup Systems that Use Inline Chunk-Based Deduplication proposes two techniques:
- the forward assembly area
- container capping

Scalability Issue: Dedup in Large-Scale Storage Systems
As the scale increases, it becomes harder to find matches. A centralized index is likely to become very large, and manipulating it a bottleneck for deduplication throughput. Mitigations:
(1) isolated nodes that exploit data locality (routing similar files to the same nodes);
(2) a distributed hash table (DHT) as the index.
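As a toy illustration of mitigation (2), the chunk index can be partitioned by hashing chunk IDs onto nodes; real clusters use consistent hashing and similarity-aware routing rather than this plain modulo placement:

    NUM_NODES = 16   # illustrative cluster size

    def index_node(chk_id: str) -> int:
        """Pick the node responsible for a chunk's index entry, so that
        no single node holds the whole index or serves every lookup."""
        return int(chk_id, 16) % NUM_NODES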

Reliability and Security Issues
- Reliability: deduplication vs. redundancy for reliability, for both metadata and data.
- Security: deduplication vs. encryption. Convergent encryption enables dedup of encrypted data, but it has (1) brute-force attack exposure and (2) large key-space overheads and single points of failure.

References
- SNIA. Understanding Data Deduplication Ratios. http://www.snia.org/sites/default/files/understanding_data_deduplication_ratios-20080718.pdf
- Kaczmarczyk, M., Barczynski, M., Kilian, W., et al. "Reducing Impact of Data Fragmentation Caused by In-line Deduplication." Proceedings of the 5th Annual International Systems and Storage Conference. ACM, 2012: 15.
- Guo, Fanglu, and Petros Efstathopoulos. "Building a High-Performance Deduplication System." USENIX Annual Technical Conference, 2011.
- Lu, Guanlin, Yu Jin, and David H.C. Du. "Frequency Based Chunking for Data De-duplication." IEEE MASCOTS, 2010.
- Bobbarjung, Deepak R., Suresh Jagannathan, and Cezary Dubnicki. "Improving Duplicate Elimination in Storage Systems." ACM Transactions on Storage 2.4 (2006): 424-448.
- Babette H., Alessio B., Michael B., Rik F., Abbe W. Guide to Data De-duplication: The IBM System Storage TS7650G ProtecTIER Deduplication Gateway. https://www.e-techservices.com/redbooks/ts7650gprotectierde-duplicationgateway.pdf
- Meyer, Dutch T., and William J. Bolosky. "A Study of Practical Deduplication." ACM Transactions on Storage 7.4 (2012): 14.
- Pibytes. "Deduplication Internals: Hash Based Deduplication, Part 2." https://pibytes.wordpress.com/2013/02/09/deduplicationinternals-hash-based-part-2/
- Paulo, J., and J. Pereira. "A Survey and Classification of Storage Deduplication Systems." ACM Computing Surveys 47.1 (2014): 11.
- Lillibridge, Mark, Kave Eshghi, and Deepavali Bhagwat. "Improving Restore Speed for Backup Systems that Use Inline Chunk-Based Deduplication." FAST, 2013.
- Zhu, Benjamin, Kai Li, and R. Hugo Patterson. "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System." FAST, 2008.
- Lillibridge, Mark, et al. "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality." FAST, 2009.

Prepared for the DISC meeting, 02/24/2011
Deduplication Research Update: Read Performance & Reliability
Young Jin Nam, Guanlin Lu, Nohhyun Park, Weijun Xiao, David Du

Talk Outline
1. Overview of Dedupe Storage Designs
2. Dedupe Read Performance Problem
3. Dedupe Reliability Problem (briefly)

Data Deduplication Process
- Divide data (an object) into small (variable- or fixed-sized) chunks and compute a hash (SHA-1) for each chunk.
- For each chunk hash, decide whether it has a copy by looking up an index (a set of hashes).
- If yes, store the address of the copy; otherwise, (optionally LZ-compress and) store the new unique chunk in the data store.

Data Dedupe Design Options [Dubnicki09]
- Granularity of dedupe: whole files, partial files, fixed- or variable-sized chunks
- Time to dedupe: inline during write, or post-process at the storage server
- Precision of duplicate identification: finding all duplicates, or approximate (for performance)
- Verification of equality between a duplicate and its copy: hash comparison or full data comparison
- Scope of the dedupe: global dedupe (the entire system) or local dedupe (limited to a specific node)

Data Dedupe Design Options
[Figure: post-process vs. in-line dedupe data paths]

Current Dedupe Design
It mainly highlights:
- maximizing duplicate detection and efficiency (chunking, index optimization/caching, Bloom filtering)
- improving write I/O performance (LZ-compression, log-structured + large-write containers)
[Efstathopoulos10] HotStorage '10; [Dong11] FAST '11, the EMC deduplication storage architecture

Current Dedupe Design
It didn't pay much attention to:
1) Read I/O performance (*)
- Data Domain [Zhu08] mentioned its importance even though data is read only occasionally from dedupe storage (200 MB/s -> 140 MB/s, a 30% drop).
- The more deduped chunks, the more read performance degradation, especially with different data streams (each successive version of a given backup stream).

Current Dedupe Design
It didn't pay much attention to:
2) Dedupe & reliability (*)
- The HPL [Li10] paper presented at HotMetrics '10: dedupe and reliability are considered separately.
- Dedupe raises the severity of data loss: dedupe, then make copies!

Current Dedupe Design
It didn't pay much attention to:
3) Scalability (clustering)
- EMC [Dong11] FAST '11, Symantec [Efst10] HotStorage '10: a single dedupe storage node (1.5 GB/s inline throughput) is not good enough.

Talk Outline
1. Overview of Dedupe Storage Designs
2. Dedupe Read Performance Problem
3. Dedupe Reliability Problem

Problem in Read Performance
As the deduped ratio increases, read performance decreases (200 MB/s -> 140 MB/s, 30%) [Zhu08]: the original write sequence (sequential writes) becomes fragmented by eliminating duplicate chunks.

Problem in Read Performance
[Figure from [Zhu08]: read throughput variation for a single backup stream vs. four backup streams (200 MB/s -> 140 MB/s, 30%). Note: synthetic workloads, each successive version of a given backup stream.]

Why Read Performance Matters
Traditionally, with secondary storage:
- Rebuild (restore) performance is critical [Zhu08]: the recovery window and system availability, with ever-growing data.
- Yet it is often argued that restores only happen occasionally!

Why Read Performance Matters
Recently, with secondary storage: long-term digital preservation requirements.
- SNIA DPCO (snia.org/forums/dpco), LTDP Reference Model (http://www.ltdprm.org)
- Motivation: a repository needs effective mechanisms to detect bit corruption or loss.
- One solution: run a back-end process that reconstitutes data on the fly for audit purposes (hash, verify, recover if the data tests bad).

Why Read Performance Matters
Dedupe and primary storage: as dedupe gets used for primary storage, read I/Os will be as numerous as write I/Os, e.g., when storing VM (virtual machine) images.

Dedupe vs. Primary Storage
VME in primary storage [Das10] ATC '10: migrating the VM image of an idle desktop onto (dedupe-enabled) network storage for energy saving.

Guanlin's comment on why read performance is also important:
The dedupe box's capacity is limited, and once in a while the backup data stored in it has to be staged to archive storage, say tape. This requires stream reconstruction, because tape operations are stream based. In fact, this staging happens remarkably more often than user-triggered data retrieval, hence read performance is also important.

Defining Our Problem: Dedupe Environment
Dedupe abstraction, data store:
- Multiple data streams, where a data stream is a series of (compressed or plain) files (for backup) or a memory/disk/process-status image (for a VM).
- After chunking and duplicate removal, a stream becomes a series of unique chunks.
- Chunks are stored into storage, a pool of unique chunks.
[Figure: streams A and B -> chunking (b1 a1 b0 a0) -> removing duplicates (a1 b0 a0) -> storing chunks (b0 a1 a0) into storage]

Defining Our Problem: Dedupe Environment
Dedupe abstraction, read: reading means reconstructing data streams. A data stream consists of a series of chunks, but its chunks are physically dispersed due to dedupe. Caching, buffering, prefetching?
[Figure: reading a data stream back from chunks dispersed across storage]

Defining Our Problem: Improving Read I/O Performance
Our research topics:
- How to effectively contain (place) chunk data?
- How to effectively read chunk (stream) data?

Our Research Topics: Effective Chunk Data Containing
- Understand chunk fragmentation under dedupe and how much read performance is degraded (**)
- How to effectively place chunks into storage initially? (read/reconstruction-aware chunk placement)
- How to adaptively replace or duplicate chunks to assure a demanded stream read performance?

Our Research Topics: Effective Chunk Data Reading
- How to effectively prefetch chunk data?
- How to effectively cache the (deduped) chunks?
- How to effectively handle concurrent reads of multiple chunk data streams? (prefetch & cache)

Our Research Topics: Effective Chunk Data Containing
Any existing solutions? None published yet, to the best of our knowledge. Data Domain said in [Zhu08] that they have investigated mechanisms to reduce fragmentation and sustain high write/read throughput, but nothing has been published.

Our Research Topics: Effective Chunk Data Containing
I/O Deduplication [Koller10]: increasing the number of duplicates of popular content on disk creates greater opportunities for read I/O optimization (a single-disk solution).
- Content-based cache: filters the I/O stream based on hits in a content-addressed cache.
- Dynamic replica retriever: optionally redirects requests to the on-disk replica with the best access latency.
- Selective duplicator: the kernel creates a candidate list of content for replication; user space populates the replicas in scratch space distributed across the entire disk.

Our Research Topics: Understanding Chunk Fragmentation
Very initially, all unique chunks of a stream are grouped into a container to preserve spatial locality (the read sequence is the same as the write sequence: sequential).
[Figure: data stream D_A = A0 A1 A2 A3; its unique chunks are logged into a fixed-size container (as large as a RAID stripe) with one LARGE write.]

Our Research Topics: Understanding Chunk Fragmentation
Chunk fragmentation: the chunks of a stream get distributed over a growing number of containers.
[Figure: initially D_A is read from ONE container; after A1 is deduped (pointing into Container 3), reading D_A requires TWO containers, Container 3 and Container 7.]

Our Research Topics: Understanding Chunk Fragmentation
A new metric for chunk fragmentation. Assumptions and notation:
- CS: the fixed container size (bytes)
- data stream DS = {c_i | 0 <= i <= n-1}, where c_i is the i-th chunk and s_i is the size of c_i
- Optimal chunk fragmentation: OCF = ceil( (s_0 + s_1 + ... + s_{n-1}) / CS )
- Current chunk fragmentation: CCF = the number of containers over which {c_i} are dispersed
- Chunk Fragmentation Level: CFL = OCF / CCF, the overuse ratio of containers w.r.t. OCF
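The metric translates directly into code; the per-chunk container assignment is assumed to be known:

    import math

    def cfl(chunk_sizes, chunk_containers, container_size):
        """Chunk Fragmentation Level of one data stream.

        chunk_sizes:      s_i, the size of each chunk in bytes
        chunk_containers: the ID of the container holding each chunk
        container_size:   CS, the fixed container size in bytes
        """
        ocf = math.ceil(sum(chunk_sizes) / container_size)  # densely packed optimum
        ccf = len(set(chunk_containers))                    # containers actually touched
        return ocf / ccf                                    # 1.0 means no fragmentation

    # Example: 4 chunks of 4 KB spread over containers 3 and 7 (as in the
    # figure above), with 16 KB containers: OCF = 1, CCF = 2, CFL = 0.5.
    print(cfl([4096] * 4, [3, 3, 7, 3], 16384))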

Our Research Topics: Understanding Chunk Fragmentation
CFL vs. read performance: CFL is a good indicator of read performance degradation for a deduped data stream.
- Under optimal conditions (CFL = 1, CCF = OCF), reading the entire data stream incurs approximately (OCF - 1) short seeks (between adjacent containers).
- Under non-optimal conditions (CFL < 1, CCF > OCF), it incurs (OCF - 1) short seeks plus (CCF - OCF) long seeks.
- The long seeks are what degrade read performance.

Our Research Topics: Understanding Chunk Fragmentation
Theoretical model with CFL: expected read performance (response time).
- Under optimal conditions (CFL = 1), the response time to read the entire data stream is RT_opt = (OCF - 1) * S, where S is a short-seek time.
- Under non-optimal conditions (CFL < 1), RT_non = (OCF - 1) * S + (CCF - OCF) * L, where L is a long-seek time.
- Hence RT_non / RT_opt = 1 + (1/CFL - 1) * alpha, where alpha = L/S (approximately, for large OCF).
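Plugging numbers into the model shows how quickly fragmentation hurts; alpha = 8 here is just an illustrative long-to-short seek ratio:

    def rt_ratio(cfl_value: float, alpha: float) -> float:
        """Expected slowdown RT_non / RT_opt = 1 + (1/CFL - 1) * alpha."""
        return 1 + (1 / cfl_value - 1) * alpha

    print(rt_ratio(1.0, 8))   # 1.0: CFL = 1, optimal response time
    print(rt_ratio(0.5, 8))   # 9.0: halving CFL makes the read ~9x slower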

Our Research Topics: Understanding Chunk Fragmentation
[Figure: variation of RT_non / RT_opt as CFL decreases from 1 to 0.1, plotted for alpha = 1, 2, 4, 8; the slowdown grows sharply as CFL drops and alpha grows.]

Our Research Topics: Understanding Chunk Fragmentation
Chunk fragmentation patterns: a duplicate chunk may point into its own stream or another stream, and into the current version (backup number or generation) or a previous version. Four cases:
- Case 1: the current version of its own stream
- Case 2: a previous version of its own stream
- Case 3: the current version of another stream
- Case 4: a previous version of another stream

Our Research Topics: Understanding Chunk Fragmentation
Cases 1 & 2: looks fine? Deduped from its own stream (self-dedupe), giving little impact on read performance (relatively short seek times?).
[Figure: D_A with one self-deduped chunk: a fully sequential read becomes a partially sequential read with a fragment, pointing between Container 3 and Container 7.]

Our Research Topics: Understanding Chunk Fragmentation
Cases 1 & 2: read performance remains constant in [Zhu08].
[Figure from [Zhu08]; note: synthetic workloads, each successive version of a given backup stream.]

Our Research Topics: Understanding Chunk Fragmentation
Cases 3 & 4: looks bad? Deduped from other streams, giving considerable impact on read performance: the chunks are more randomly distributed (long seek times?).
[Figure: a chunk of D_A deduped against stream D_B: reading D_A now alternates between Container 7 (A0, A2) and Container 3 (B0 B1 B2 B3).]

Our Research Topics: Understanding Chunk Fragmentation
Cases 3 & 4: read performance is degraded in [Zhu08] (200 MB/s -> 140 MB/s, 30%).
[Figure from [Zhu08]; note: synthetic workloads, each successive version of a given backup stream.]

Guanlin's Comments
How can we verify our guess that there are (CCF - OCF) long seeks and OCF short seeks? From the result on slide 34 we could conclude:
- Case (1): if the dedupe write process always packs chunks of one or multiple generations of a single stream into containers, and never puts chunks from different streams into a single container, then read performance is roughly the same for any given generation. (This may indicate that the access penalty for any generation is almost the same, e.g., reading generation [i] requires reading container [0] plus container [i] (i > 0), though containers [0] and [i] could be far apart.)
- Case (2): if the dedupe write process allows packing chunks of multiple streams into a single container (mix mode, which is true in real systems that must handle a large number of concurrent streams with a limited number of open containers in RAM), there are many more random seeks than in case (1).

Our Research Topics: Understanding Chunk Fragmentation
Simulations/experiments:
- CFL variation as the backup number increases (successive versions of streams)
- Read performance variation as CFL changes (should match the theoretical model)

Our Research Topics: Understanding Chunk Fragmentation
Paper work (HotStorage '11). Main contributions will be:
1. Address read performance degradation with dedupe
2. Introduce the CFL indicator for read performance degradation with dedupe and its theoretical read performance model
3. Examine CFL variation vs. backup number with multiple real traced workloads
4. Validate read performance vs. CFL (make sure it matches the theoretical model)

Future Work (1/2): Effective Chunk Data Containing
1. How to adaptively replace or duplicate chunks to assure a demanded stream read performance?
   - Idea 1: selective migration for cases 1, 2, and 4 whenever the index cache misses (don't remove the original while its container is not yet reclaimed)
   - Idea 2: replication whenever the index cache misses
2. How to effectively place chunks into storage initially? (read/reconstruction-aware chunk placement)

Future Work (2/2): Effective Chunk Data Reading
1. How to effectively prefetch chunk data?
2. How to effectively cache the unique (deduped) chunks?
3. How to effectively handle concurrent reads of multiple chunk data streams? (prefetch & cache)

Talk Outline
1. Overview of Dedupe Storage Designs
2. Dedupe Read Performance Problem
3. Dedupe Reliability Problem (briefly)

Data Reliability vs. Data Dedupe
Some data (files) requires a given level of reliability, mostly achieved by duplicating data (RAID-1) or using more storage space for parity (RAID-5/6, erasure codes). This conflicts with the direction data deduplication is pursuing!

Typical Steps to Provide Data Reliability for Deduped Storage
1. Perform the data deduplication process (producing unique chunks)
2. Aggregate chunks into a large fixed-size container
3. Reliably store the container over a fixed number of disks using erasure coding (or RAID) schemes (see the sketch below)
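Step 3 can be illustrated with the simplest erasure code, single parity (RAID-5-like); the two "chunks" and the three-disk layout are invented for the example:

    def xor_parity(blocks):
        """Parity block: byte-wise XOR of equal-sized data blocks."""
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

    # Step 3 in miniature: a 2-chunk container stored as (chunk, chunk, parity)
    # across three disks; any single disk loss is recoverable by XOR.
    c1, c2 = b"chunk-a1", b"chunk-a2"
    disks = [c1, c2, xor_parity([c1, c2])]
    assert xor_parity([disks[1], disks[2]]) == c1   # rebuild the first disk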

Example with Typical Steps
1. Store two data objects (A, B), each with 1-out-of-3 reliability.
2. Dedup yields 4 unique chunks; container size = 2 chunks.
[Figure: the chunks a1, a2, a3, b2 are packed into two containers and striped with parity across disks 1-6.]

Existing Reliability Metric
1. Data loss probability (probabilistic combinatorics)
2. Whether each chunk survives in the face of a disk failure: the data loss probability with dedupe is the same as without dedupe.
[Figure: the same six-disk layout as above, with D_A and D_B at 1-out-of-3 reliability.]

Severity of Data Loss
How much data is lost when one chunk is lost?
1) Without dedupe: only one stream (each keeps private copies: D_A = a1 a2 a3, D_B = a2 a3 b2).
2) With dedupe: more than one (the shared chunks a2 and a3 are stored once, so losing one of them damages both D_A and D_B).
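Loss severity is just a reference count over file recipes. A small sketch using the D_A/D_B example from this slide:

    def loss_severity(lost_chunk_id, file_recipes):
        """How many files are damaged if this single chunk is lost?

        file_recipes maps file name -> list of chunk IDs. Without dedupe
        each file keeps private copies, so the answer is at most 1; with
        dedupe a shared chunk can appear in many files' recipes.
        """
        return sum(lost_chunk_id in ids for ids in file_recipes.values())

    # D_A and D_B from the slide share chunks a2 and a3 after dedupe:
    recipes = {"D_A": ["a1", "a2", "a3"], "D_B": ["a2", "a3", "b2"]}
    print(loss_severity("a2", recipes))   # 2: one lost chunk damages both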

Our Research Topics: Data Reliability with Dedupe
Defining our problem:
- How to properly represent data loss severity?
- How to assure a given data loss probability and data loss severity in dedupe storage?
  - reliable chunk (container) placement
  - chunk (container) migration when a newly deduped chunk demands higher data reliability

Questions & Answers
Deduplication Research Update: Read Performance & Reliability
Young Jin Nam, Guanlin Lu, Nohhyun Park, Weijun Xiao, David Du


Related Papers
- "Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets," Proc. of IEEE MASCOTS 2012 (Youngjin Nam, Dongchul Park and David Du)
- "Chunk Fragmentation Level: An Effective Indicator for Read Performance Degradation in Deduplication Storage," IEEE International Symposium of Advances on High Performance Computing and Networking (HPCC/AHPCN), September 2011 (Youngjin Nam, Guanlin Lu, Nohhyun Park, Weijun Xiao and David Du)
- "Reliability-Aware Deduplication Storage: Assuring Chunk Reliability and Chunk Loss Severity," The First International Workshop on Energy Consumption and Reliability of Storage Systems (IGCC/ERSS), July 2011 (Youngjin Nam, Guanlin Lu and David Du)
- "ADMAD: Application-Driven Metadata Aware De-duplication Archival Storage System," Proc. of SNAPI'08 Workshop on Storage Network Architecture and Parallel I/Os, Oct. 2008, Baltimore, Maryland (with Chuanyi Liu, Yingping Lu, Guanlin Lu, Dong-Sheng Wang, and David Du)