SiLo: A Similarity-Locality based Near- Exact Deduplication Scheme with Low RAM Overhead and High Throughput

Size: px

Start display at page:

Download "SiLo: A Similarity-Locality based Near- Exact Deduplication Scheme with Low RAM Overhead and High Throughput"

Morgan Osborne
7 years ago
Views:

1 SiLo: A Similarity-Locality based Near- Exact Deduplication Scheme with Low RAM Overhead and High Throughput Wen Xia Hong Jiang Dan Feng Yu Hua, Huazhong University of Science and Technology University of Nebraska-Lincoln

2 Data Deduplication Why deduplication? Reduces the storage space overheads. Minimizes the network transmission of redundant data. Deduplication Technique. Data fingerprints: MD5, SHA-1, SHA-256. Remove duplicate data by checking its fingerprints. Deduplication granularity. File-level. Chunk-level. Fixed-length Chunking; Content Defined Chunking. 2

3 Deduplication Challenges Chunking Hashing Indexing Files Chunks Fingerprints Store The Scalability of Deduplication Indexing Deduplicate 800 TB unique data. SHA-1 signature. Avg. 8KB Chunk. 2TB Fingerprints are generated. Global indexing. Disk bottleneck. 3

4 A B C A Locality-based Approaches (1) Input data stream Miss RAM Hit Locality Cache DISK A B C F G H L M N Global index on the disk Input data stream A G N Q 4

5 Locality based approaches (2) DDFS, Sparse Indexing, ChunkStash. Exploit locality of backup streams. It maximizes the RAM utilization and reduces frequent accesses to on-disk index by retaining access locality in the locality cache. Limitations. Work poorly when backup stream lacks locality. High RAM consumed. 5

6 Similarity-based Approaches (1) Backup File 3 A B C J extract the similarity characteristics of File 3 RAM F C Similarity Index Lookup C in the Similarity Index DISK File 1 A B C D File 2 E F G H Achieve a single on-disk index access for chunk lookup per file thus avoid global index on the disk. Deduplicate File 3 with File 1 6

7 Limitation of These Approaches (2) Exploit similarity of backup streams. Avoid global indexing and achieve a single disk read. Minimize the RAM overhead for indexing fingerprints. Limitation. Degradation of Deduplication efficiency. Theorem 1: Consider two files S1 and S2, Let min(h(s)) denote the similarity characteristic of file S. Then similarity degree between the two files is quantified by the probability that min(h(s1))= min(h(s2)), which is dependent on the percentage of data common to both files: Pr[min( H( S )) min( H( S ))] S1 S2 S S 1 2

8 Evaluation of Similarity Approach Similarity based Deduplication efficiency is dependent on the similarity degree of data stream 8

9 Observation The deduplication of small files and large files. Percentage of total file number Percentage of total space Small files ( 64KB) Large files ( 2 MB) 80% 20% 20% 80% Grouping many highly correlated small files into a segment to minimize dedupe overheads Dividing the large files into many small segments to expose more similarity characteristics 9

10 Intuition The combination of similarity and locality. (a) Similarity approach Existing data stream Similar Similar Input data stream Locality Enhancement 10 Potential duplicate

11 System Architecture Overview A disk-inline backup storage system. File Deamon is a deamon program providing a functional interface in users computers. Backup Server is the manager of the backup system that directs all File Agents and Storage Servers. File Deamon Backup Server Deduplication Server User Interface Job MetaData Cache HashTable Block Store Chunking File Agent Job Agent Deduplication Metadata Agent MDS Storage Server is the repository for backedup data. Network Storage Server Deduplication Server is to store and look up all fingerprints of files and chunks. Storage Agent Contain Store Storage Agent Contain Store Storage Agent Contain Store Disk Disk Disk 11

12 Deduplication Server RepChunk ID RAM Similarity Hash Table Locality Cache Block ID LHTable Chunk ID Block Write Buffer Block Read Cache Seg Key SHTable DISK Segment Block Block Block Block Block Block The similarity unit, (sequence of chunks) The locality unit, (sequence of segments) Deduplication Server. It is most likely the performance bottleneck. 12

13 The Similarity Algorithm Structuring data from backup streams into segments according to the following three principles. P1. Correlated small files in a backup stream are to be grouped into a segment. P2. A large file in a backup stream is divided into several independent segments. P3. All segments are of approximately the same size (e.g., 2MB). Block Block Metadata Block Block Metadata Segment1 Metadata Small File1 Small File2 Reduce RAM usage Segment1 Segment1 Small File3 Segment2 Segment3 Segment2 Segment3 Segment2 Metadata Part 1 of large file L1 Segment3 Metadata Eliminate more data Part 2 of large file L1 13

14 The Locality Algorithm The locality algorithm groups several contiguous segments into a block and preserves their locality-layout on the disk. It maximizes the RAM utilization and reduces frequent accesses to on-disk index by retaining access locality in the locality cache. By exploiting the inherent locality in backup streams, the block-based SiLo locality algorithm can eliminate more duplicate data. B1 S 11 S 12 S 13 S 14.. S 1k Segment Similar Block Similar Duplicate more data B2 S 21 S 22 S 23 S 24.. S 2k 14

15 SiLo Workflow The locality algorithm helps detect more potentially duplicate chunks that are missed by the similarity algorithm. (a) Input backup stream (big file) (big file) (Small files) (b) Segmenting large files and grouping small files (segment) (segment) (block) (c) Similarity detection Input segments similar similar Segments in cache or on disk (d) Locality-enhanced similartity detection by chunks filtering N N N N N (potentially duplicate) (potentially duplicate) (dup chunks) (new chunks) (dup chunks) 15

16 RAM Consideration RAM usage of SiLo: The locality cache? The similarity hash table? A small portion. The main portion. RAM usage analysis: SiLo requires only 30 MB for deduplicating 1TB unique data. Extreme Binning requires 300 MB for deduplicating 1TB unique data. (Avg. file size of 200KB). Sparse Indexing uses 170 MB of RAM space for a TB-scale deduplication system, whereas the Sparse Indexing paper estimates that DDFS would require 360 MB RAM to maintain a partial index depending on locality in backup streams. 16

17 Performance Evaluation Interplay of similarity and locality algorithms. Quantitative analysis of our similarity and locality algorithms. Comparison of state-of-the-art work. Locality approach: ChunkStash-HDD. Similarity approach: Extreme Binning. Four datasets. Feature One-set Inc-set Linux-set Full-set Small files Locality Weak Weak Strong Strong Similarity Weak Strong Strong Strong 17

18 Interplay of Similarity and Locality Locality unit Block Metadata Segment1 Block Metadata Segment1 Segment1 Metadata Small File1 Small File2 Small File3 Segment size Segment2 Segment2 Segment2 Metadata Part 1 of large file L1 Similarity unit Segment3 Segment3 Segment3 Metadata Part 2 of large file L1 Percentage of duplicate data eliminated and Time overhead of SiLo deduplication as a function of block size and segment size. The larger the block size is, the more locality can be retained. The smaller the segment size is, the more similarity can be exposed. Segment size 18

19 Locality Enhancement Evaluation Duplicate data missed by our SiLo approach. Duplicate data removed by similarity-only approach. Duplicate data removed by Locality-enhanced approach. The full Exploitation of locality jointly with similarity can remove almost all of the redundant data missed by the similarity detection. 19

20 Duplicate Elimination Locality approach: ChunkStash-HDD SiLo with segment size of 4MB The similarity approach: Extreme Binning SiLo achieves near-exact duplicate elimination under all workloads. 20

21 RAM Usage for Indexing Extreme Binning performs poorly on the Linux-set. SiLo consumes a RAM capacity that is only 1/41 1/60 and 1/3 1/90 respectively of that consumed by ChunkStash and Extreme Binning. 21

22 Deduplication Throughput Our evaluations on deduplication throughput suggest that SiLo outperforms ChunkStash by a factor of about 3 and Extreme Binning by a factor of about

23 Summary SiLo, a near-exact deduplication system. effectively and complementarily exploits similarity and locality achieve high duplicate elimination and throughput at extremely low RAM overheads. Combination of similarity and locality. SiLo proposes a new similarity algorithm that groups many small strongly correlated files into a segment and segments a large file to better expose and exploit their similarity characteristics. SiLo proposes an effective locality approach that captures more similar and duplicate data missed by the probabilistic similarity detection and also improve the deduplication throughput. Our experimental evaluation of SiLo. Quantitative analysis and demonstration of our similarity and locality algorithms. SiLo system consistently and significantly outperforms two existing state-of-the-art systems. 23

61025008,60933002,60873028,60703046, Changjiang innovative group of Education of China No.

24 Questions? Acknowledgements 973 Program of China under Grant No. 2011CB302301, ( 863 Program) of China under Grant No.2009AA01A401 and 2009AA01A402, NSFC No , , , , Changjiang innovative group of Education of China No. IRT0725, Fundamental Research Funds for the central universities, HUST, under grant 2010MS043, The US NSF under Grants NSF-IIS , NSF-CCF and NSF-CNS

SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput

SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput Wen Xia Hong Jiang Dan Feng Yu Hua, wx.hust@gmail.com jiang@cse.unl.edu dfeng@hust.edu.cn csyhua@hust.edu.cn