Probabilistic Deduplication for Cluster-Based Storage Systems
Davide Frey, Anne-Marie Kermarrec, Konstantinos Kloudas
INRIA Rennes, France
Motivation
The volume of stored data increases exponentially, and the services provided are highly dependent on that data.
Motivation
Traditional solutions combine data in tarballs and store them on tape. Pros: cost efficient. Cons: low throughput.

                   Tape        Disk
Acquisition cost   $407,000    $1,620,000
Operational cost   $205,000    $573,000
Total cost         $612,000    $2,193,000
* Source: www.backupworks.com
Deduplication
Store data only once and replace duplicates with references.
Challenges
Single-node deduplication systems: compact indexing structures; efficient duplicate detection.
Cluster-based solutions: single-machine tradeoffs; deduplication vs. load balancing.
We focus on cluster-based deduplication systems.
Example: Deduplication vs. Load Balancing
A client wants to store a file and sends it to the Coordinator.
The Coordinator computes the overlap between the file's contents and those of each Storage Node: A 10%, B 30%, C 60%, D 0%.
To maximize DEDUPLICATION, the new file should go to node C (largest overlap).
To achieve LOAD BALANCING, the new file should go to node D.
Goal: Scalable Cluster Deduplication
- Good data deduplication. Maximize: ideally, match the deduplication of a single-node system.
- Load balancing. Minimize imbalance: ideally, the load ratio across nodes equals 1.
- Scalability. Minimize memory usage at the Coordinator.
- Good throughput. Minimize CPU/memory usage at the Coordinator.
PRODUCK architecture
Client: split the file into chunks of data; store and retrieve data.
Coordinator: assign chunks to nodes; keep the system load balanced.
Storage Nodes: store the chunks; provide directory services.
Client: chunking
Chunks: produced with content-based chunking techniques; the basic deduplication unit.
Super-chunks: groups of consecutive chunks; the basic routing and storage unit.
Client: chunking
Split the file into chunks, then organize the chunks into super-chunks.
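As an illustration of the chunking stage, here is a minimal Python sketch of content-based chunking and super-chunk grouping. The window size, boundary mask, and super-chunk size are illustrative values, not PRODUCK's actual parameters; a real chunker would use a rolling hash (e.g., Rabin fingerprints) instead of rehashing each window.

```python
import hashlib

WINDOW = 16            # bytes of context used to decide a boundary (illustrative)
MASK = (1 << 11) - 1   # boundary when the fingerprint has 11 low zero bits
SUPER_CHUNK_SIZE = 4   # chunks per super-chunk (illustrative)

def chunk(data: bytes) -> list:
    """Content-based chunking sketch: cut wherever the hash of the last
    WINDOW bytes matches a fixed bit pattern, so boundaries depend on the
    content itself and an insertion only shifts boundaries locally."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data)):
        fp = int.from_bytes(hashlib.sha1(data[i - WINDOW:i]).digest()[:4], "big")
        if fp & MASK == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # trailing chunk
    return chunks

def super_chunks(chunks: list, size: int = SUPER_CHUNK_SIZE) -> list:
    """Group consecutive chunks into super-chunks, the routing/storage unit."""
    return [chunks[i:i + size] for i in range(0, len(chunks), size)]
```

Because boundaries are chosen by content rather than by fixed offsets, two files sharing a long run of bytes produce mostly identical chunks even if the run appears at different offsets.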
Coordinator: goals
Estimate the overlap between a super-chunk and the chunks of a given node, to maximize deduplication.
Distribute the storage load equally among nodes, to keep the system load balanced.
Coordinator: our contributions
Novel chunk overlap estimation, based on probabilistic counting (PCSA) [Flajolet et al. 1985, Michel et al. 2006]; never used before in storage systems.
Novel load balancing mechanism, operating at chunk-level granularity and improving the co-localization of duplicate chunks.
Coordinator: Overlap Estimation
Main observation: we do not need the exact matches, only an estimate of the size of the overlap.
PCSA provides: compact set descriptors, accurate intersection estimation, and computational efficiency.
Coordinator: Overlap Estimation
Each chunk y in the original set is hashed, and ρ(y) = min{k : bit(hash(y), k) = 1}, the position of the least-significant 1-bit of the hash, is computed. For every chunk, bit ρ(y) is set in a BITMAP.
INTUITION: P(bitmap[0] = 1) = 1/2, P(bitmap[1] = 1) = 1/4, P(bitmap[2] = 1) = 1/8, ... so the position of the lowest unset bit grows with the number of distinct chunks hashed, and yields a cardinality estimate.
Coordinator: Overlap Estimation
Union cardinality: bitwise-ORing two bitmaps yields exactly the bitmap of the union of the two sets, e.g. BITMAP(A) = 11011000, BITMAP(B) = 01001100, BITMAP(A ∪ B) = 11011100.
Intersection cardinality then follows by inclusion–exclusion: |A ∩ B| = |A| + |B| − |A ∪ B|.
Coordinator: Overlap Estimation
PCSA set cardinality estimation. Set intersection estimation. Selection of the best storage node.
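The estimation pipeline above can be sketched as follows. For clarity this uses a single Flajolet–Martin bitmap; actual PCSA averages over many bitmaps (stochastic averaging) to reduce variance, so treat the parameters and function names here as illustrative.

```python
import hashlib

BITS = 32       # bitmap width (illustrative)
PHI = 0.77351   # Flajolet–Martin correction constant

def rho(x: int) -> int:
    """Position of the least-significant 1-bit of x (0-based)."""
    return (x & -x).bit_length() - 1 if x else BITS - 1

def sketch(chunks) -> int:
    """Build a bitmap for a set of chunks: set bit rho(hash(chunk)) for each."""
    bitmap = 0
    for c in chunks:
        h = int.from_bytes(hashlib.sha1(c).digest()[:4], "big")
        bitmap |= 1 << rho(h)
    return bitmap

def estimate(bitmap: int) -> float:
    """Cardinality ~ 2^R / PHI, where R is the index of the lowest 0-bit."""
    r = 0
    while bitmap >> r & 1:
        r += 1
    return 2 ** r / PHI

def estimate_intersection(a: int, b: int) -> float:
    """OR-ing two bitmaps gives exactly the bitmap of the set union, so
    inclusion-exclusion yields |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return estimate(a) + estimate(b) - estimate(a | b)

def best_node(sc_sketch: int, node_sketches: dict) -> str:
    """Select the node whose stored chunks overlap most with the super-chunk."""
    return max(node_sketches,
               key=lambda n: estimate_intersection(sc_sketch, node_sketches[n]))
```

Note that merging sketches is exact (a bitwise OR), even though the cardinality read off a sketch is only an estimate; this is what makes the per-node descriptors so compact.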
Load Balancing
Existing solution: choose Storage Nodes that do not exceed the average load by a percentage threshold.
Problem: too aggressive, especially when little data is stored in the system.
Load Balancing: our solution
Bucket-based storage quota management: measure storage space in fixed-size buckets; the Coordinator grants buckets to nodes one by one; no node may exceed the least-loaded node by more than a maximum allowed bucket difference.
Load Balancing: our solution
Walkthrough: when a node fills its current bucket, it asks the Coordinator, "Can I get a new bucket?" If the grant keeps the node within the maximum allowed bucket difference of the least-loaded node, the Coordinator answers "Yes, you can." Otherwise it answers "No, you cannot!", and the Coordinator searches for the node with the second-biggest overlap.
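The bucket-granting rule can be sketched as a small quota manager; the class and method names are hypothetical, not PRODUCK's API:

```python
class BucketQuotaManager:
    """Coordinator-side quota sketch: a node is granted a new fixed-size
    bucket only if that keeps it within `max_diff` buckets of the
    least-loaded node."""

    def __init__(self, nodes, max_diff: int = 1):
        self.buckets = {n: 0 for n in nodes}  # buckets granted per node
        self.max_diff = max_diff

    def can_grant(self, node) -> bool:
        least = min(self.buckets.values())
        return self.buckets[node] + 1 - least <= self.max_diff

    def grant(self, node) -> bool:
        """Answer a node's 'Can I get a new bucket?' request."""
        if not self.can_grant(node):
            return False  # caller falls back to the next-best overlap node
        self.buckets[node] += 1
        return True
```

When the node with the biggest overlap is denied a bucket, the Coordinator simply retries with the node with the next-biggest overlap, trading a little deduplication for a bounded load imbalance.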
Contribution Summary
Novel chunk overlap estimation, based on probabilistic counting (PCSA) [Flajolet et al. 1985, Michel et al. 2006]; never used before in storage systems.
Novel load balancing mechanism, operating at chunk-level granularity and improving the co-localization of duplicate chunks.
Evaluation: Datasets
2 real-world workloads containing binary data, thus covering many common cases. Table 1 presents more details on these datasets. The Deduplication Factor is the ratio of each dataset's original size to its size after being deduplicated with our chunking mechanism.

Table 1: Dataset Description
Dataset     Size (GB)   Deduplication Factor   Data Format
Wikipedia   522         1.96                   HTML
Images      142         4.27                   OS images

Evaluation: Competitors
Before evaluating the specific aspects of PRODUCK, we compare it against two state-of-the-art cluster-based deduplication storage systems presented in [Dong et al. 2011]: BloomFilter, a stateful strategy, and MinHash, a stateless one. The latter is used in a commercial product.
Evaluation: Competitors
MinHash: use the minimum chunk hash of a super-chunk as its fingerprint. Assign super-chunks to bins by taking the fingerprint modulo the number of bins. Initially assign bins to nodes at random, and reassign bins when the system becomes unbalanced.
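A minimal sketch of the MinHash routing rule described above; the hash function and bin count are illustrative assumptions:

```python
import hashlib

NUM_BINS = 64  # illustrative cluster configuration, not from the paper

def chunk_hash(chunk: bytes) -> int:
    return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")

def superchunk_bin(chunks) -> int:
    """MinHash routing: fingerprint a super-chunk with its minimum chunk
    hash, then map the fingerprint to a bin with mod(# bins)."""
    fingerprint = min(chunk_hash(c) for c in chunks)
    return fingerprint % NUM_BINS
```

Because the minimum hash depends only on the super-chunk's content, identical super-chunks always land in the same bin with no coordinator state, which is what makes the scheme stateless.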
Evaluation: Competitors
BloomFilter: the Coordinator keeps a Bloom filter for each of the Storage Nodes. A node that deviates by more than 5% from the average load is considered overloaded.
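A sketch of how a per-node Bloom filter lets the Coordinator estimate overlap; the filter parameters and the overlap function are assumptions for illustration, not the exact scheme of [Dong et al. 2011]:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter (sizes are illustrative)."""

    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: bytes):
        # Derive k probe positions by salting the hash with the probe index.
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits >> p & 1 for p in self._positions(item))

def overlap(super_chunk, node_filter: BloomFilter) -> float:
    """Estimated overlap: fraction of the super-chunk's chunks that are
    (probably) already stored on the node, per its Bloom filter."""
    return sum(c in node_filter for c in super_chunk) / len(super_chunk)
```

The statefulness is the cost: the Coordinator must keep every node's filter up to date, and filter size grows with the number of stored chunks.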
Evaluation: Metrics
Deduplication, load balancing, an overall metric, and throughput (formulas omitted). ED and TD are normalized to the performance of a single-node system to ease comparison.
Evaluation: Effective Deduplication
(Plots for Wikipedia and Images.) 32 nodes: Wikipedia 7%, Images 16%. 64 nodes: Wikipedia 16%, Images 21%.
Evaluation: Throughput
(Plots for Wikipedia and Images.) 32 nodes: Wikipedia 11×, Images 13×. 64 nodes: Wikipedia 16×, Images 21×.
Memory: 64 KB for PRODUCK, versus 9.6 bits/chunk, i.e., 168 GB for 140 TB/node.
Evaluation: Load Balancing
(Plots for Wikipedia and Images.)