A Novel Deduplication Avoiding Chunk Index in RAM

1 Zhike Zhang, 2 Zejun Jiang, 3 Xiaobin Cai, 4 Chengzhang Peng
1, First Author: Northwestern Polytechnical University, 127 Youyixilu, Xi'an, Shaanxi, P.R. China, zhangzhike@mail.nwpu.edu.cn
*2, Corresponding Author: Northwestern Polytechnical University, 127 Youyixilu, Xi'an, Shaanxi, P.R. China, claud@nwpu.edu.cn
3, 4 Northwestern Polytechnical University, 127 Youyixilu, Xi'an, Shaanxi, P.R. China, comnwpus@163.com, abelard2009@mail.nwpu.edu.cn

Abstract

The Chunk-lookup Disk Bottleneck Problem is one of the most important problems in deduplication systems. Previous methods reduce the RAM usage of the index considerably, so that part of the index does not have to be read from disk for every chunk lookup. However, these methods still need hundreds of gigabytes of RAM to hold the index for 1 PB of storage space utilization. We design Linear Hashing with Key Groups (LHs), a variation of Linear Hashing, to organize and address bins. Based on LHs, we propose a novel deduplication method that avoids keeping an index in RAM: LHs computes the address of the bin containing the files similar to a given file, so no in-RAM index is needed for this purpose. Our method does not decrease deduplication efficiency compared with Extreme Binning, while it needs only one disk read per file. Experimental results show that, although there is no index in RAM, the deduplication efficiency of our method is up to 12.7% better than that of Extreme Binning. In addition, the overhead of LHs is about 5% for highly redundant datasets.

Keywords: Deduplication, Linear Hashing, Chunk-lookup Disk Bottleneck Problem

1. Introduction

A chunk-based deduplication system finds duplicate chunks and stores only the unique chunks, which saves storage space; such systems are usually used for storing backup data. Because backup data typically contains a tremendous amount of duplicate data, a deduplication system can save a large amount of storage space. To find a duplicate chunk quickly, it is necessary to maintain an index of chunk IDs in RAM. A chunk ID is a signature of a chunk, usually computed by a cryptographic hash function such as SHA-1 or MD5; when two chunk IDs are identical, the two corresponding chunks are identical. When the data size is not very large, it is easy to keep the flat chunk index in RAM. The flat chunk index is an index containing the chunk IDs of all unique chunks in a deduplication system. However, as the size of a deduplication system grows, the index grows too. Generally, the ratio of the size of the flat chunk index to the size of all unique chunks is about 1:100 [1], so the flat chunk index for 1 PB of storage space utilization is about 10 TB. It is hard to keep such a huge index entirely in RAM; part of the index must reside on disk and be loaded into RAM when needed. Then, for every chunk lookup, part of the index has to be loaded from disk into RAM, resulting in extremely low throughput. This is the Chunk-lookup Disk Bottleneck Problem. Several methods have been proposed to solve it. Bloom Filter [2] uses a summary vector to avoid unnecessary index searches for new chunks, and uses Locality Preserved Caching (LPC) to ensure that the descriptors of a duplicate chunk are very likely already in the cache. Sparse Indexing [1] builds a sparse index by sampling chunk IDs at a fixed rate per segment and placing them in RAM.
Both methods are very effective when there is high locality in the data stream. When locality is low, the chunk descriptors are often not in the cache and Bloom Filter has to load part of the on-disk index frequently, resulting in low throughput. Likewise, when locality is low, the sampled chunk IDs cannot represent a whole segment very well and Sparse Indexing cannot find the similar segment through its index, resulting in worse deduplication efficiency than deduplication with the flat chunk index. Extreme Binning (EB) [3] works very well when there is low locality in the data stream. EB chooses the minimum chunk ID of all chunk IDs of a file as the
representative chunk ID and places it in the index in RAM. EB also puts all chunk IDs of the files that share the same minimum chunk ID into the same bin on disk. Deduplication clusters [4-6] have also been presented to decrease the RAM requirement of every single node [15-17], and other methods aim to decrease the fragmentation caused by deduplication [13-14]. All of these methods reduce RAM usage considerably, but the RAM usage is still too large. If the average chunk size is 4 KB, then for 100 TB of storage space utilization Bloom Filter [2] needs about 36 GB of RAM, while Sparse Indexing needs 17 GB of RAM for an equivalent level of deduplication [1], compared with 1500 GB of RAM required by the flat chunk index of Jumbo Store [7]. The ratio of the RAM required by EB to that required by the flat chunk index is about 1:100 [3]. Thus, for 1 PB of storage space utilization, all three methods require hundreds of gigabytes of RAM to hold their indexes.

We propose a novel chunk-based deduplication method that avoids an index in RAM and thus avoids almost all of the RAM needed to hold an index. We only maintain bins on disk. A bin mainly contains a number of chunk IDs and some metadata, such as the physical address of a chunk. For every file, the bin containing the similar files is found and loaded from disk into RAM, and the file is deduplicated against this bin. We design Linear Hashing with Key Groups (LHs), based on Linear Hashing [8-10], to compute the bin address for a file. For every file, the address of the bin containing the chunk IDs of similar files is computed by the bin addressing algorithm of LHs, which needs only the chunk IDs of the file as input and outputs the bin address for this file. After the bin is loaded into RAM, the file is deduplicated against it, and new chunk IDs are found and inserted into the bin. Finally, the new chunks are stored on disk, and the file manifest, which contains all information needed to reconstruct the file, is also stored on disk. We evaluate LHs using three datasets from the real world. Experimental results show that, although there is no index in RAM, the deduplication efficiency of our method is up to 12.7% better than that of Extreme Binning, and the overhead of LHs is about 5% for highly redundant datasets.

The rest of this paper is organized as follows. Section 2 describes the deduplication method using LHs and the details of LHs, including bin addressing, bin splitting and split control. Section 3 presents our simulator, datasets and experimental results. Finally, we present the conclusions in Section 4.

2. Our Approach

In this section, we describe our system and the deduplication process, and present the structure of a bin. We then describe how LHs works in the deduplication system, including LHs addressing, LHs splitting and the split control of LHs. Finally, we analyze the RAM usage of our method.

2.1. Deduplication Process

The key feature of our idea is that we can avoid an index in RAM by using LHs to compute the address of the bin containing the chunk IDs of files similar to the file being processed; EB needs to maintain an index in RAM to do the same thing. We now describe the architecture of our deduplication system and the deduplication process. When a file stream enters the deduplication system, the files are processed one by one. A file is first chunked. Then, the LHs address of the file is computed, and the deduplication system uses this address to load the corresponding bin from the bin store.
New chunk IDs are found and inserted into this loaded bin. After this, the old bin is deleted and the new bin is stored in the bin store under the same LHs address as the old bin. New chunks are stored on disk. Finally, the file manifest of this file, including all chunk IDs and other information, is stored in the file manifest store. The block diagram of the deduplication process is shown in Figure 1.

The LHs address of a file is computed after the file is chunked. All chunk IDs of a file constitute an LHs key group. The minimum chunk ID of the file is found and used as the representative LHs key. The details of how to compute the LHs address of a key group are given in Subsection 2.2.

The Deduplicator uses the LHs address to load a bin from the bin store; one bin needs to be read for every file. First, it searches for the file's whole-file hash in the metadata section of this bin. If the whole-file hash is found, the file is a duplicate file, all chunk IDs of the file are marked as duplicate, and there is no need to update the bin. If the whole-file hash is not found in the
metadata section, the Deduplicator searches for every chunk ID of the file being processed in the chunk ID section of the bin. If a chunk ID is found in the chunk ID section, it is marked as a duplicate chunk ID; if not, it is marked as a new chunk ID and is inserted into the bin. Finally, the Deduplicator sends all new chunk IDs to the bin store, together with the whole-file hash and the minimum chunk ID, so that the metadata section of the bin can be updated. The Deduplicator also stores the file manifest in the file manifest store. The file manifest of a file contains all chunk IDs, chunk addresses, the whole-file hash and other information that can be used to reconstruct the file.

Figure 1. Deduplication Diagram

After the bin store receives the new chunk IDs and the metadata of a file from the Deduplicator, it updates the bin. To ensure that a bin is stored sequentially on disk, we update a bin by simply deleting the old bin and inserting the updated new bin into the bin store. Then, the bin store checks whether a bin split is necessary. There are two prerequisites for splitting a bin: first, the size of a bin exceeds the predefined maximum bin size; second, the load factor of the whole bin store exceeds the load factor threshold. We check the bin size not after every chunk ID insertion, but after all chunk IDs of the file have been inserted into the bin; therefore, one file can trigger at most one bin split. We use the load factor to ensure the fullness of bins. Note that, as shown in Subsection 2.3, the bin to be split is not necessarily the bin that was just updated.

When the bin store splits a bin, it first computes the LHs address of every file in this bin, one by one. We use the minimum chunk ID of a file as the representative chunk ID to compute the LHs address; the mapping from the minimum chunk ID of a file to its whole-file hash is stored in the metadata section of the bin. If the new LHs address of a file is the same as its original bin address, nothing is done for this file. If not, the bin store creates a new bin and moves the metadata of the file, including the minimum chunk ID and the whole-file hash, to the new bin. The bin store should also move all chunk IDs of this file to the new bin. However, a bin holds no information about which chunk IDs a file contains, so the file manifest of this file is loaded from the file manifest store. The bin store then inserts all chunk IDs of this file into the new bin and erases these chunk IDs from the old bin. Note that simply erasing all chunk IDs that are moved to the new bin may leave chunk IDs shared by different files in only one of the bins, which may decrease deduplication efficiency. However, in our algorithm the files with the same minimum chunk ID are placed in the same bin, and files with different minimum chunk IDs rarely share chunks according to Broder's Theorem [12]. For quick reference, we summarize the deduplication process in Figure 2.
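To make the per-file flow described above concrete, the following is a minimal in-memory Python sketch of the deduplication loop. All class and function names (Bin, BinStore, deduplicate_file, the dict-based manifest store) are illustrative assumptions rather than the authors' implementation, and a real bin store keeps each bin sequentially on disk.

```python
# Minimal in-memory sketch of the per-file deduplication flow described above.
# All names are illustrative assumptions; the paper does not publish code.
import hashlib
from dataclasses import dataclass, field

@dataclass
class Bin:
    metadata: dict = field(default_factory=dict)   # whole-file hash -> min chunk ID
    chunk_ids: set = field(default_factory=set)    # chunk ID section

class BinStore:
    def __init__(self, n_bins=4):
        self.bins = {addr: Bin() for addr in range(n_bins)}
        self.n_bins = n_bins

    def lhs_address(self, rep_key):
        # Stand-in for the LHs addressing of Subsection 2.2.
        return int(rep_key, 16) % self.n_bins

    def load(self, addr):
        return self.bins[addr]                     # the one bin read per file

    def store(self, addr, b):
        self.bins[addr] = b                        # delete old bin, write new bin;
                                                   # a real system would now run the
                                                   # split check of Subsections 2.3/2.4

def deduplicate_file(file_bytes, chunks, bin_store, manifest_store):
    """chunks: the chunk payloads (bytes) of one file, as produced by the chunker."""
    chunk_ids = [hashlib.sha1(c).hexdigest() for c in chunks]
    whole_file_hash = hashlib.sha1(file_bytes).hexdigest()
    rep_key = min(chunk_ids)                       # representative chunk ID
    addr = bin_store.lhs_address(rep_key)
    b = bin_store.load(addr)
    new_ids = []
    if whole_file_hash not in b.metadata:          # not a whole-file duplicate
        new_ids = [cid for cid in chunk_ids if cid not in b.chunk_ids]
        b.chunk_ids.update(new_ids)
        b.metadata[whole_file_hash] = rep_key
        bin_store.store(addr, b)
    manifest_store[whole_file_hash] = chunk_ids    # enough to reconstruct the file
    return new_ids                                 # only these chunks are stored on disk
```

In this sketch, lhs_address stands in for the LHs addressing of Subsection 2.2, and the split check of Subsections 2.3 and 2.4 is only indicated by a comment.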

2.2. Bin Addressing

When a file is being deduplicated, the address of the bin against which the file will be deduplicated must be computed first. Then, the bin with this address is loaded into memory and searched for duplicate chunk IDs. We design a variation of LH for bin addressing, namely Linear Hashing with Key Groups (LHs). LHs consists of a group of bins, or buckets, and each bin contains a number of key groups. The key difference between LH and LHs is that the basic unit of LH for addressing and splitting a bucket is a single key, whereas the basic unit of LHs is a group of keys. This difference leads to a different addressing strategy and a different splitting strategy for LHs. In this subsection we describe the addressing of LHs; in the next subsection we describe its splitting.

Figure 2. Deduplication Algorithm

Figure 4. Splitting Algorithm of LHs

All chunk IDs of a file constitute a key group of an LHs bin, and the representative chunk ID of the file is the representative key of the key group. First, we find the minimum chunk ID of a file and use it as the representative chunk ID. This minimum chunk ID is then used as an LH key to compute an LH address, which is the LHs address of the key group. In other words, the LHs address of a key group is the LH address of the representative key in the key group. Denote a key group by G. We use the same symbol definitions as LH: the initial number of bins (or buckets) is N ($N \geq 1$), the maximum size of a bin is B, a key is C, the split pointer is n, the file level is i+1 (i = 0, 1, 2, ...), the threshold of the load factor is t, and the hash function for addressing is h. We also use the same addressing hash functions as LH, shown in Equation 1.

$h_i(C) = C \bmod (2^i N)$    (1)

The addressing algorithm of LHs, which computes the bin address of a key group, is shown in Figure 3.

Figure 3. Addressing Algorithm of LHs

2.3. Bin Splitting

As key groups are inserted into the bins of LHs, collisions occur. A collision means that, after a key group is inserted into a bin, the bin goes from not full to full. When a collision happens, a bin needs to be split into two bins. As bins are split one by one, the number of bins increases, so bin sizes do not keep growing. A bin that is too large needs more than one disk seek to be loaded into RAM, which deteriorates performance.
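Before describing the split itself, the following Python sketch illustrates the addressing of Subsection 2.2 (Equation 1 and Figure 3). The constants N, n and i and all function names are assumed for illustration and are not taken from the paper; the split pointer n and level i are advanced by the splitting algorithm described next.

```python
# Illustrative sketch (not the authors' code) of LHs addressing: the bin
# address of a key group is the classic Linear Hashing address of its
# representative key, i.e. the file's minimum chunk ID.
N = 4        # initial number of bins (assumed example value)
n = 0        # split pointer, advanced by the splitting algorithm
i = 0        # current level, so hash functions h_i and h_{i+1} are in use

def h(level, key):
    """Equation (1): h_i(C) = C mod (2^i * N)."""
    return key % (2 ** level * N)

def lhs_address(chunk_ids):
    """chunk_ids: all chunk IDs of one file (one key group), as hex strings."""
    rep = min(chunk_ids)              # representative key of the key group
    c = int(rep, 16)                  # treat the chunk ID as an integer key
    addr = h(i, c)
    if addr < n:                      # this bin was already split at level i,
        addr = h(i + 1, c)            # so the next-level hash function applies
    return addr
```

Because only the minimum chunk ID enters the hash, all chunk IDs of a file always map to the same bin, which is what allows one bin read per file.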

LH splits a bucket by computing the new addresses of all keys in the bucket and moving the keys whose address differs to the new bucket. LHs instead computes the new addresses of all key groups and moves the key groups whose address differs to the new bucket. Denote a whole-file hash by w. For every file in a bin, the pair of the file's whole-file hash and its minimum chunk ID is kept in the metadata section of the bin; this pair is denoted by (w, r). All chunk IDs of a file constitute a key group of an LHs bin, and the minimum chunk ID of the file is the representative key of the key group.

The bin that is split is the bin pointed to by the split pointer n. All key groups in the bin are processed one by one. First, the LHs address of a key group is computed as the LH address of its representative key, using the hash function $h_{i+1}(r)$; the representative key of every key group can be found in the metadata section of the bin. The new address is then compared with the address of this bin. If the two addresses differ, the file manifest is loaded from disk using the whole-file hash of the file; all chunk IDs in the file manifest constitute the key group, which is moved to the new bin with the new address. If the two addresses are identical, nothing is done. In both cases, the current (w, r) pair is removed from the bin being processed. After all key groups are processed, the split pointer n moves to the next bin. Once n exceeds $2^i N$, all bins at file level i have been split; the split pointer then returns to the first bin, and the pair of hash functions in use changes from $h_i$ and $h_{i+1}$ to $h_{i+1}$ and $h_{i+2}$. The LHs splitting algorithm is summarized in Figure 4.

2.4. Split Control

To increase the storage utilization rate of LHs, we need to control bin splitting. We define the load factor, denoted by a, to quantify the storage utilization rate: it is the total size of all bins divided by the capacity of all bins. The actual size of a bin, denoted by A, is the size of its metadata section plus the size of its chunk ID section. Let the number of bins be M. The load factor of LHs is then given by Equation 2, where $A_j$ is the actual size of bin j. We also use a threshold of the load factor, denoted by t, to control bin splitting.

$a = \frac{\sum_{j=0}^{M-1} A_j}{B \cdot M}$    (2)

We use controlled splitting in our system. With controlled splitting, there are two prerequisites for a bin split: one is an LHs collision, and the other is that the load factor of LHs is larger than the threshold. A higher load factor is preferable in our design, because it means less storage space is used.
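The splitting and split-control logic of Subsections 2.3 and 2.4 can be sketched as follows. The code is illustrative only and relies on assumed interfaces: bins are objects with a metadata dict (whole-file hash to minimum chunk ID), a chunk_ids set and a size() method; h is the hash of Equation (1); manifests maps a whole-file hash to the file's chunk IDs. None of these names come from the paper, and the handling of the (w, r) pairs is one possible reading of the algorithm.

```python
# Illustrative sketch (assumed names, not the authors' code) of bin splitting
# with controlled split. state holds the split pointer n, the level i and the
# initial bin count N; B is the maximum bin size, t the load factor threshold.

def load_factor(bins, B):
    """Equation (2): a = sum of actual bin sizes A_j divided by B * M."""
    return sum(b.size() for b in bins) / (B * len(bins))

def controlled_split(bins, manifests, state, B, t, h, make_bin):
    collided = any(b.size() > B for b in bins)           # prerequisite 1: a collision
    if not (collided and load_factor(bins, B) > t):      # prerequisite 2: load factor > t
        return
    victim = bins[state['n']]            # split the bin pointed to by n, which is
    new_bin = make_bin()                 # not necessarily the bin that collided
    for w, r in list(victim.metadata.items()):
        if h(state['i'] + 1, int(r, 16)) == state['n']:
            continue                                     # this key group stays put
        group = manifests[w]             # extra disk read: load the file manifest
        new_bin.chunk_ids.update(group)
        new_bin.metadata[w] = r
        victim.chunk_ids.difference_update(group)        # erase moved chunk IDs
        del victim.metadata[w]
    bins.append(new_bin)
    state['n'] += 1
    if state['n'] >= 2 ** state['i'] * state['N']:       # all bins at level i split:
        state['n'] = 0                                   # wrap the split pointer and
        state['i'] += 1                                  # switch to h_{i+1}, h_{i+2}
```

The manifests[w] lookup is exactly the kind of extra disk read that the overhead factor of Subsection 3.4 counts.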
2.5. RAM Usage

Previous deduplication methods use an index in RAM to find duplicate chunks. The RAM usage of Bloom Filter [2] depends on the number of unique chunks: the more unique chunks there are, the more RAM is needed. The RAM usage of Sparse Indexing [1] depends on the size of the data stream; as the amount of data stored in the deduplication system grows, its index grows. The RAM usage of Extreme Binning depends on the number of distinct minimum chunk IDs over all files. There is no index in RAM in our design. We do not need an index to decide whether a chunk already exists in the deduplication system, as Bloom Filter does, nor an index to find the similar segment, as Sparse Indexing does, or the similar file group, as Extreme Binning does. Instead, we use the minimum chunk ID of a file to compute the LHs address and then load the bin with this address. Therefore, the RAM usage of our method does not depend on the number of unique chunks, the size of the data stream, or the number of minimum chunk IDs of all files; the RAM cost of the index is avoided entirely. Note that the RAM usage of a deduplication system includes not only an index of chunk IDs but also the stream buffer and other structures, which are necessary for any deduplication system.

3. Experimental Results

In this section, we answer the following questions:

- How do various values of B and t affect deduplication efficiency? B is the maximum bin size and t is the threshold of the load factor; these two parameters are used for the controlled splitting of bins.
- How large is the overhead of LHs? We call the disk read operations caused by bin splitting the overhead of LHs.

To answer these questions, we report results on three datasets from the real world. One dataset is the Linux source code archive. The other two are backup data from a company: one mainly contains full backups and the other mainly contains incremental backups. We implemented a simulator to verify our idea and to generate the results.

3.1. Simulator

Our simulator consists of two parts. One is the Chunker, which takes a file listing all file names of a dataset, together with the dataset itself, as input and generates the chunking result. The other is the Deduplicator, which takes the chunking result as input and outputs the deduplication efficiency, the size of every bin, the bin read counts and other results. The Chunker processes files one by one and writes the chunking result to a chunking result file. It first reads a file and then chunks it with the TTTD chunking algorithm [11]; the mean chunk size is set to 4 KB. The Deduplicator takes the chunking result as input and outputs various statistics, including the original data size, the deduplicated data size, the number of index items and every bin size. It can also record the number of bin reads, the number of bin writes, the load factors of all bins, the number of bin splits, and the number of file manifest reads. It processes the file manifests one by one; a file manifest contains the whole-file hash of a file, all chunk IDs of the file and all chunk sizes.

3.2. Datasets

Table 1. Datasets used in the experiments
Set     No. Files    Size (GB)    No. Unique Files
Linux
HDup
LDup

To verify our algorithm on different workloads, we collected three realistic datasets, each representing a different type of workload. The first dataset is the Linux source code archive, namely Linux. It contains 564 versions in total. Most files in Linux are small, typically dozens of kilobytes, so Linux represents a highly redundant dataset consisting of small files. The second dataset, namely HDup, consists of all the full and incremental backups taken from 21 engineers over about 30 days: 162 full backups and 416 incremental backups. HDup contains a large amount of duplicate data. This is the only dataset for which we only had information about the chunks, not direct access to the files. To verify our algorithm on a low-redundancy dataset, we extract the first full backup and all the incremental backups of every engineer from HDup to construct our third dataset, LDup. Compared with HDup, LDup contains much less duplicate data. The size information of the three datasets is summarized in Table 1.

3.3. Deduplication Efficiency

We compare the deduplication rate of LHs with that of Extreme Binning (EB) and that of perfect deduplication (Perfect). The deduplication rate is the size of the original data divided by the size of the deduplicated data, as given in Equation 3. The bigger the deduplication rate, the better, since less storage space is consumed.

$\text{deduplication rate} = \frac{\text{original data size}}{\text{deduplicated data size}}$    (3)

We collect results for various values of B and t, where B is the maximum bin size and t is the threshold of the load factor.
B and t are the two parameters used to control bin splitting. Since reading several
megabytes from the disk takes almost the same time as reading several kilobytes, we set the maximum value of B to 2 MB and the minimum value to 256 KB. The load factor reflects the average storage utilization rate of all bins; a bigger load factor means fewer bins and fewer index items. Therefore, it does not make sense to test small values of t, and we experiment only with several large values: 0.8, 0.9 and 1.0.

For all three datasets, LHs shows better deduplication efficiency than EB, as shown in Figure 5, Figure 6 and Figure 7. The deduplication rate of LHs is between 2.3% and 12.7% better than that of EB for Linux, and is slightly better than EB for HDup and LDup. The bins of LHs are typically several megabytes, while the bins of EB are dozens of kilobytes; a much bigger bin offers a larger chance of finding duplicate chunks, so the deduplication efficiency of LHs is better.

LHs performs better on HDup and LDup than on Linux. When B is 1 MB and t is 1, the ratio of the deduplication rate of LHs to that of Perfect is 90.77% for HDup, 91.82% for LDup, and 78.16% for Linux. The reason is that most files in the Linux dataset are small, typically from several kilobytes to dozens of kilobytes, so the minimum chunk ID changes with high probability between similar files. Thus, more similar files with different minimum chunk IDs exist in Linux than in HDup and LDup. Since LHs chooses the minimum chunk ID as the representative chunk ID, more similar files are placed in different bins for Linux than for HDup and LDup. Therefore, LHs performs better on HDup and LDup. As B increases, the deduplication rate of LHs increases more for Linux than for HDup and LDup: as the maximum bin size grows, more similar files with different minimum chunk IDs end up in the same bin for Linux, so more duplicate chunk IDs can be found.

Figure 5. Deduplication Efficiency for Linux

Figure 6. Deduplication Efficiency for HDup

3.4. The Overhead of LHs

When a bin is split, some file manifests are read from disk into RAM. We call the disk read operations caused by bin splitting the overhead of LHs. To evaluate this overhead, we introduce the LHs overhead factor, denoted by f, which is the ratio of the file manifest read count to the bin read count, computed by Equation 4.

$f = \frac{\text{file manifest read count}}{\text{bin read count}}$    (4)

The overhead factors for various t and B for Linux, HDup and LDup are shown in Figure 8. The overhead factors for Linux and HDup are very small, about 5%. However, the overhead factor for LDup is much higher, about 40%. Since HDup contains many full backups and Linux contains only full backups, there are many duplicate files in both Linux and HDup.
These duplicate files do not trigger bin splits, but they still cause bins to be read. Therefore, the LHs overhead factor is much smaller for highly redundant data than for low-redundancy data.

Figure 7. Deduplication Efficiency for LDup

Figure 8. Overhead Factors of LHs

The overhead factors for different B or t differ only slightly for HDup. The ratio of the maximum overhead factor to the minimum overhead factor is for HDup, and for LDup. The overhead factors for various B and t differ more for Linux than for HDup and LDup; the ratio of the maximum overhead factor to the minimum overhead factor is for Linux.

4. Conclusions

We design LHs, a variation of LH, to organize and address bins. Based on LHs, we propose a novel deduplication method that uses LHs to compute the address of the bin containing the files similar to a given file, so we do not need to maintain an index in RAM for this purpose, and our method does not decrease deduplication efficiency compared with EB. We compute an LHs address for every file, load the bin with this address, and deduplicate the file against this bin. Experimental results show that, although our method does not maintain an index in RAM, its deduplication efficiency is between 2.3% and 12.7% better than Extreme Binning for the Linux dataset, and slightly better than Extreme Binning for HDup and LDup. The overhead of LHs is about 5% for HDup and Linux, which are both highly redundant datasets, and is much higher for LDup, which is a low-redundancy dataset.

5. Acknowledgements

This paper is supported by the Shaanxi Province NSF grant 2010JM8023 and by the aviation science foundation grant 2010ZD.

References

[1] Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, In Proceedings of the 7th Conference on File and Storage Technologies.
[2] Benjamin Zhu, Kai Li, and Hugo Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, In Proceedings of the 6th USENIX Conference on File and Storage Technologies.
[3] Deepavali Bhagwat, Kave Eshghi, Darrell D.E. Long, and Mark Lillibridge, Extreme binning: Scalable, parallel deduplication for chunk-based file backup, In Proceedings of 2009 IEEE
International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, pp. 1-9.
[4] Cezary Dubnicki, Leszek Gryz, Lukasz Heldt, Michal Kaczmarczyk, Wojciech Kilian, Przemyslaw Strzelczak, Jerzy Szczepkowski, Cristian Ungureanu, and Michal Welnicki, HYDRAstor: A scalable secondary storage, In Proceedings of the 7th Conference on File and Storage Technologies.
[5] Cristian Ungureanu, Benjamin Atkin, Akshat Aranya, Salil Gokhale, Stephen Rago, Grzegorz Całkowski, Cezary Dubnicki, and Aniruddha Bohra, HydraFS: a high-throughput file system for the HYDRAstor content-addressable storage system, In Proceedings of the 8th USENIX Conference on File and Storage Technologies.
[6] Wei Dong, Fred Douglis, Kai Li, Hugo Patterson, Sazzala Reddy, and Philip Shilane, Tradeoffs in scalable data routing for deduplication clusters, In Proceedings of the 9th USENIX Conference on File and Storage Technologies.
[7] Kave Eshghi, Mark Lillibridge, Lawrence Wilcock, Guillaume Belrose, and Rycharde Hawkes, Jumbo Store: providing efficient incremental upload and versioning for a utility rendering service, In Proceedings of the 5th USENIX Conference on File and Storage Technologies.
[8] Witold Litwin, Linear hashing: a new tool for file and table addressing, In Proceedings of the Sixth International Conference on Very Large Data Bases - Volume 6.
[9] Witold Litwin, Marie-Anne Neimat, and Donovan Schneider, RP*: A family of order preserving scalable distributed data structures, In Proceedings of the 20th International Conference on Very Large Data Bases.
[10] Witold Litwin, Marie-Anne Neimat, and Donovan A. Schneider, LH*: a scalable, distributed data structure, ACM Transactions on Database Systems (TODS), ACM, vol. 21, no. 4.
[11] George Forman, Kave Eshghi, and Stephane Chiocchetti, Finding similar files in large document repositories, In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining.
[12] Andrei Z. Broder, On the resemblance and containment of documents, In Proceedings of the Compression and Complexity of Sequences.
[13] Michal Kaczmarczyk, Marcin Barczynski, Wojciech Kilian, and Cezary Dubnicki, Reducing impact of data fragmentation caused by in-line deduplication, In Proceedings of the 5th Annual International Systems and Storage Conference, p. 11.
[14] Young Jin Nam, Dongchul Park, and David H.C. Du, Assuring demanded read performance of data deduplication storage with backup datasets, In Proceedings of 2012 IEEE 20th International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS).
[15] Jiansheng Wei, Ke Zhou, Lei Tian, Hua Wang, and Dan Feng, A Fast Dual-level Fingerprinting Scheme for Data Deduplication, International Journal of Digital Content Technology and its Applications, Vol. 6, No. 1, pp. 271-282.
[16] Zhengda Zhou, and Jingli Zhou, High Availability Replication Strategy for Deduplication Storage System, Advances in Information Sciences and Service Sciences, Vol. 4, No. 8, pp. 115-123.
[17] Zhike Zhang, Deepavali Bhagwat, Witold Litwin, Darrell Long, and Thomas Schwarz, S.J., Improved deduplication through parallel binning, In Proceedings of 2012 IEEE 31st International Performance Computing and Communications Conference (IPCCC).