Probabilistic Deduplication for Cluster-Based Storage Systems


1 Probabilistic Deduplication for Cluster-Based Storage Systems Davide Frey, Anne-Marie Kermarrec, Konstantinos Kloudas INRIA Rennes, France

2 Motivation Volume of data stored increases exponentially. Provided services are highly dependent on data. 2

3 Motivation Traditional solutions combine data in tarballs and store them on tape. Pros: cost efficient. Cons: low throughput.

                   Tape        Disk
Acquisition cost   $407,000    $1,620,000
Operational cost   $205,000    $573,000
Total cost         $612,000    $2,193,000

* Source: 3


4 Deduplication Store data only once and replace duplicates with references. 4



10 Challenges Single-node deduplication systems. Compact indexing structures. Efficient duplicate detection. Cluster-based solutions. Single-machine tradeoffs. Deduplication vs Load balancing. We focus on Cluster-based Deduplication Systems. 10


11 Example: Deduplication Vs Load Balancing A client wants to store a file. A B C D Clients Coordinator Storage Nodes 11

12 Example: Deduplication Vs Load Balancing The client sends the file to the Coordinator. A B C D Clients Coordinator Storage Nodes 12

13 Example: Deduplication Vs Load Balancing The Coordinator computes the overlap between the contents of the file and those of each Storage Node. A 10% B 30% C 60% D 0% Clients Coordinator Storage Nodes 13

14 Example: Deduplication Vs Load Balancing To maximize DEDUPLICATION, the new file should go to node C. A 10% B 30% C 60% D 0% Clients Coordinator Storage Nodes 14

15 Example: Deduplication Vs Load Balancing To achieve LOAD BALANCING, the new file should go to node D. A 10% B 30% C 60% D 0% Clients Coordinator Storage Nodes 15


18 Goal: Scalable Cluster Deduplication. Good Data Deduplication. Maximize: Ideally, deduplication of a single-node system. Load Balancing. Minimize: Ideally, equal to 1. Scalability. Minimize memory usage at Coordinator. Good Throughput. Minimize CPU/Memory usage at Coordinator. 18


19 PRODUCK architecture Client: Split the file into chunks of data. Store and retrieve data. Coordinator: Assign chunks to nodes. Keep the system load balanced. Storage Nodes: Store the chunks. Provide directory services. 19


20 Client: chunking Chunks: use content-based chunking techniques. basic deduplication unit. Super-chunks: group of consecutive chunks. basic routing and storage unit. 20
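The two units on this slide can be illustrated with a minimal sketch. It assumes a gear-style rolling hash for the content-based cut points and a fixed number of chunks per super-chunk; the slides do not say which chunking algorithm or sizes PRODUCK uses, so MASK, MIN_CHUNK, SUPER_CHUNK_SIZE and both function names are hypothetical.

```python
import random

# All constants are illustrative assumptions, not values from the slides.
MASK = (1 << 6) - 1      # cut when the low 6 bits of the rolling hash are zero
MIN_CHUNK = 16           # no chunk shorter than this, except possibly the last
SUPER_CHUNK_SIZE = 4     # consecutive chunks grouped into one super-chunk

# Fixed pseudo-random table for a gear-style rolling hash (deterministic seed).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_into_chunks(data: bytes) -> list:
    """Cut data where a hash of the most recent bytes matches a fixed pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the low bits, so cuts depend on content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if i + 1 - start >= MIN_CHUNK and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def group_into_super_chunks(chunks: list, size: int = SUPER_CHUNK_SIZE) -> list:
    """Group consecutive chunks; the super-chunk is the routing and storage unit."""
    return [chunks[i:i + size] for i in range(0, len(chunks), size)]
```

Because cut points depend on nearby bytes rather than on absolute offsets, an insertion near the start of a file tends to change only the chunks around it, which is what lets later chunks still deduplicate.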

21 Client: chunking Split the file in chunks 21

22 Client: chunking Organize the chunks in super-chunks 22



25 Coordinator: goals Estimate the overlap between a super-chunk and the chunks of a given node. Maximize deduplication. Equally distribute storage load among nodes. Guarantee a load balanced system. 25

26 Coordinator: our contributions Novel chunk overlap estimation. Based on probabilistic counting PCSA [Flajolet et al. 1985, Michel et al. 2006]. Never used before in storage systems. Novel load balancing mechanism. Operating at chunk-level granularity. Improving co-localization of duplicate chunks. 26

27 Coordinator: Overlap Estimation Main observation: exact matches are not needed; only an estimate of the size of the overlap is. PCSA provides: Compact set descriptors. Accurate intersection estimation. Computational efficiency. 27


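31 Coordinator: Overlap Estimation Original Set of Chunks: Chunk 1, Chunk 2, Chunk 3, Chunk 4, Chunk 5. Each chunk is passed through hash(), and ρ(y) = min{k ≥ 0 : bit(y, k) = 1}, the position of the least-significant 1-bit of the hash y, selects the bit to set in the BITMAP. INTUITION: for each hashed chunk, P(bitmap[0] = 1) = 1/2, P(bitmap[1] = 1) = 1/4, P(bitmap[2] = 1) = 1/8. 31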
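A minimal sketch of PCSA cardinality estimation along the lines of Flajolet and Martin may help make the bitmap intuition concrete. The number of bitmaps, the bitmap width, the use of SHA-1 and all function names are assumptions chosen for illustration; the slides do not give PRODUCK's parameters.

```python
import hashlib

NUM_BITMAPS = 256   # m: more bitmaps lower the estimation error (assumed value)
BITMAP_BITS = 32    # bits kept per bitmap (assumed value)
PHI = 0.77351       # correction constant from Flajolet and Martin's analysis


def rho(y: int) -> int:
    """Position of the least-significant 1-bit of y, capped at the bitmap width."""
    if y == 0:
        return BITMAP_BITS - 1
    return min((y & -y).bit_length() - 1, BITMAP_BITS - 1)


def pcsa_sketch(chunks) -> list:
    """Build the bitmaps: each chunk sets exactly one bit in one bitmap."""
    bitmaps = [0] * NUM_BITMAPS
    for chunk in chunks:
        h = int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")
        index, y = h % NUM_BITMAPS, h // NUM_BITMAPS
        bitmaps[index] |= 1 << rho(y)   # bit k is hit with probability 2^-(k+1)
    return bitmaps


def lowest_zero_bit(bitmap: int) -> int:
    """Index of the first bit that is still 0, scanning from bit 0 upward."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r


def pcsa_estimate(bitmaps) -> float:
    """Estimate the number of distinct items as (m / PHI) * 2^(mean lowest zero)."""
    mean_r = sum(lowest_zero_bit(b) for b in bitmaps) / len(bitmaps)
    return len(bitmaps) / PHI * 2 ** mean_r
```

Inserting the same chunk twice sets the same bit twice, so the descriptor counts distinct chunks only. The estimate is known to be unreliable when a set holds only a few items per bitmap.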

32 Coordinator: Overlap Estimation Intersection Cardinality Estimation? Union Cardinality Estimation? BITMAP(A) BitwiseOR BITMAP(B) = BITMAP(A ∪ B)

33 Coordinator: Overlap Estimation PCSA set cardinality estimation. Set intersection estimation. Selection of best storage node. 33
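The three steps on this slide can be sketched as follows. The union descriptor is the bitwise OR of two descriptors, as on the previous slide; the intersection is obtained here through the standard inclusion-exclusion identity |A ∩ B| = |A| + |B| − |A ∪ B|, which is an assumption about how the estimates are combined. The parameter M, the use of SHA-1 and all function names are hypothetical.

```python
import hashlib

M, PHI = 256, 0.77351   # number of bitmaps (assumed) and PCSA correction constant


def sketch(chunks) -> list:
    """Compact PCSA descriptor of a set of chunks."""
    bitmaps = [0] * M
    for chunk in chunks:
        h = int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")
        y = h // M
        r = (y & -y).bit_length() - 1 if y else 31   # least-significant 1-bit
        bitmaps[h % M] |= 1 << min(r, 31)
    return bitmaps


def estimate(bitmaps) -> float:
    """PCSA cardinality estimate from the lowest zero bit of each bitmap."""
    total = 0
    for b in bitmaps:
        r = 0
        while b & (1 << r):
            r += 1
        total += r
    return M / PHI * 2 ** (total / M)


def union(a, b) -> list:
    """Descriptor of A ∪ B: bitwise OR of the two descriptors."""
    return [x | y for x, y in zip(a, b)]


def overlap(a, b) -> float:
    """Estimated |A ∩ B| by inclusion-exclusion, clamped at zero."""
    return max(0.0, estimate(a) + estimate(b) - estimate(union(a, b)))


def best_node(super_chunk_sketch, node_sketches: dict) -> str:
    """Pick the node whose stored chunks overlap most with the super-chunk."""
    return max(node_sketches,
               key=lambda n: overlap(super_chunk_sketch, node_sketches[n]))
```

In this sketch the Coordinator only needs one descriptor per node and one per incoming super-chunk, never the chunk hashes themselves.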



36 Load Balancing Existing solution: choose Storage Nodes that do not exceed the average load by more than a percentage threshold. Problem: too aggressive, especially when little data is stored in the system. 36
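A small worked example may show why a percentage threshold is aggressive at low load. The function name and the 5% default are illustrative assumptions (5% is the figure the BloomFilter competitor slide quotes); loads are counted in arbitrary units.

```python
def eligible_nodes(loads: dict, threshold: float = 0.05) -> list:
    """Nodes whose load does not exceed the average by more than the threshold."""
    average = sum(loads.values()) / len(loads)
    return [node for node, load in loads.items()
            if load <= average * (1 + threshold)]


# Nearly empty system: a difference of 2 units already disqualifies node A.
small = {"A": 12, "B": 10, "C": 10, "D": 8}

# Well-filled system: a difference of 20,000 units is still within 5%.
large = {"A": 1_020_000, "B": 1_000_000, "C": 1_000_000, "D": 980_000}
```

With `small`, node A is rejected even though it holds only two units more than the average, so a super-chunk that overlaps heavily with A would be routed elsewhere and its duplicates stored again.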

37 Load Balancing: our solution Bucket-based storage quota management. Measure storage space in fixed-size buckets. Coordinator grants buckets to nodes one by one. No node can exceed the least loaded by more than a maximum allowed bucket difference. 37
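The quota rule above can be sketched as a small grant-or-refuse check. The class and method names are hypothetical, bucket contents are not modelled, and the fallback to the next-best node is one assumed reading of how a refusal is handled.

```python
class BucketQuota:
    """Coordinator-side bookkeeping of how many buckets each node holds."""

    def __init__(self, nodes, max_diff: int):
        self.buckets = {node: 0 for node in nodes}
        self.max_diff = max_diff   # maximum allowed bucket difference

    def request_bucket(self, node) -> bool:
        """Grant one bucket unless the node would exceed the least loaded
        node by more than max_diff buckets."""
        least = min(self.buckets.values())
        if self.buckets[node] + 1 - least > self.max_diff:
            return False
        self.buckets[node] += 1
        return True


def place(quota: BucketQuota, nodes_by_overlap: list):
    """Try nodes in decreasing order of estimated overlap; return the first
    one that is granted a bucket, or None if every request is refused."""
    for node in nodes_by_overlap:
        if quota.request_bucket(node):
            return node
    return None
```

Because the limit is an absolute number of buckets rather than a percentage of the average, it behaves the same whether the system is nearly empty or nearly full.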

38-47 Load Balancing: our solution Bucket-based storage quota management. Animation: a Storage Node asks the Coordinator, "Can I get a new Bucket?" The Coordinator first answers "Yes, you can." On a later request it answers "NO you cannot!" and searches for the node with the second biggest overlap. 38-47

48 Contribution Summary Novel chunk overlap estimation. Based on probabilistic counting PCSA [Flajolet et al. 1985, Michel et al. 2006]. Never used before in storage systems. Novel load balancing mechanism. Operating at chunk-level granularity. Improving co-localization of duplicate chunks. 48

49 Evaluation: Datasets 2 real-world workloads: Wikipedia, Images. 2 competitors [Dong et al. 2011]: MinHash, BloomFilter. 49

[...] binary data, thus covering many common cases. Table 1 presents more details on these datasets. In particular, the Deduplication Factor is the ratio of the original size of each dataset divided by its size after being deduplicated based on our chunking mechanism.

Table 1: Dataset Description
Dataset     Size (GB)   Deduplication Factor   Data Format
Wikipedia                                      HTML
Images                                         OS images

4.2 Competitors Before evaluating the specific aspects of Produck, we compare it against two state-of-the-art cluster-based deduplication storage systems presented in [6]: BloomFilter, a stateful strategy, and MinHash, a stateless one. The latter is used in a commercial product [6], thus representing the [...]

50 Evaluation: Competitors MinHash: Use the minimum hash from a super-chunk as its fingerprint. Assign super-chunks to bins using the mod(# bins) operator. Initially assign bins to nodes randomly and reassign bins to nodes when unbalanced. 50
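The stateless routing described on this slide can be sketched in a few lines. SHA-1 as the chunk hash, the seeded random bin assignment and all function names are assumptions made for illustration; the rebalancing step (reassigning bins when nodes become unbalanced) is not modelled.

```python
import hashlib
import random


def chunk_hash(chunk: bytes) -> int:
    """64-bit hash of a chunk (SHA-1 prefix, chosen for illustration)."""
    return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")


def bin_of(super_chunk: list, num_bins: int) -> int:
    """Fingerprint = minimum chunk hash; bin = fingerprint mod number of bins."""
    fingerprint = min(chunk_hash(chunk) for chunk in super_chunk)
    return fingerprint % num_bins


def assign_bins_randomly(num_bins: int, nodes: list, seed: int = 0) -> dict:
    """Initial random assignment of bins to nodes."""
    rng = random.Random(seed)
    return {b: rng.choice(nodes) for b in range(num_bins)}


def route(super_chunk: list, bin_to_node: dict) -> str:
    """Route a super-chunk without looking at any node's contents."""
    return bin_to_node[bin_of(super_chunk, len(bin_to_node))]
```

Two super-chunks that contain the same set of chunks share the same minimum hash and therefore land in the same bin, which is how this strategy co-locates duplicates without keeping per-node state.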

51 Evaluation: Competitors BloomFilter: The Coordinator keeps a Bloom filter for each of the Storage Nodes. If a node deviates by more than 5% from the average load, it is considered overloaded. 51
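A minimal sketch of this stateful strategy, under stated assumptions: the filter size, the number of hash functions, the double-hashing scheme and the rule "pick the non-overloaded node with the most filter hits" are illustrative choices, not details given on the slide. The class and function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 7):
        self.num_bits = num_bits          # assumed to be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes) -> list:
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def route(super_chunk: list, filters: dict, loads: dict, threshold: float = 0.05) -> str:
    """Among nodes not above the average load by more than the threshold,
    pick the one whose filter matches the most chunks of the super-chunk."""
    average = sum(loads.values()) / len(loads)
    candidates = [n for n in filters if loads[n] <= average * (1 + threshold)]
    return max(candidates, key=lambda n: sum(c in filters[n] for c in super_chunk))
```

Unlike the PCSA descriptor, the filter grows with the number of stored chunks, so the Coordinator's memory scales with the amount of data per node.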

52 Evaluation: Metrics Deduplication: Load balancing: Overall: ED and TD are normalized to the performance of a single-node system to ease comparison. Throughput: 52

53 Evaluation: Effective Deduplication Wikipedia Images 32 nodes : Wikipedia 7% Images 16% 64 nodes : Wikipedia 16% Images 21% 53


55 Evaluation: Throughput Memory: 64 KB for Produck. 9.6 bits/chunk or 168 GB for 140 TB/node. Wikipedia Images 32 nodes: Wikipedia 11X Images 13X 64 nodes: Wikipedia 16X Images 21X 55

56 Evaluation: Load Balancing Wikipedia Load Balancing Images 56



Tradeoffs in Scalable Data Routing for Deduplication Clusters Wei Dong Princeton University Fred Douglis EMC Kai Li Princeton University and EMC Hugo Patterson EMC Sazzala Reddy EMC Philip Shilane EMC