A Fast Dual-level Fingerprinting Scheme for Data Deduplication

1 Jiansheng Wei, *1 Ke Zhou, 1,2 Lei Tian, 1 Hua Wang, 1 Dan Feng
*1 Corresponding author
1 Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, k.zhou@hust.edu.cn
2 Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA

Abstract

Data deduplication has attracted considerable interest in the research community. Several approaches have been proposed that eliminate duplicate data first at the file level and then at the chunk level to reduce the duplicate-lookup complexity. To meet high-throughput requirements, this paper proposes a fast dual-level fingerprinting (FDF) scheme that can fingerprint a dataset both at the file level and at the chunk level in a single scan of the contents. FDF breaks the fingerprinting process into task segments and further leverages the computing resources of modern multi-core CPUs to pipeline the time-consuming operations. The proposed FDF scheme has been evaluated in an experimental data backup network with real-world datasets and compared with an alternative two-stage approach. Experimental results show that FDF can maintain over 100MB/s fingerprinting throughput, matching the bandwidth of a gigabit network adapter, while being fully pipelined.

Keywords: Fingerprinting, Content-defined chunking, Deduplication, Pipelining, Data backup

1. Introduction

Content-based fingerprinting is an important technology that has been widely used in data backup [1] and storage systems [2] [3] to identify and eliminate duplicate data objects for the purpose of saving network bandwidth and/or storage resources. In order to achieve a high duplicate elimination ratio, many approaches employ content-defined chunking (CDC) algorithms [4] to divide large files into small variable-sized chunks at KB granularity. However, identifying duplicates among massive numbers of data chunks can incur significant query overhead and thus challenge deduplication performance [5]. To this end, several recent approaches, such as SAM [6], MAD2 [7], and Extreme Binning [8], eliminate duplicates first at the file level and then at the chunk level. If a duplicate file is detected, all the chunks belonging to that file can be directly skipped. After checking both file fingerprints and (where necessary) chunk fingerprints with the storage server, only non-duplicate contents need to be actually transferred and stored. These approaches can effectively avoid the duplicate-lookup bottleneck and significantly improve deduplication performance.

On the other hand, a new challenge emerges when fingerprinting file sets both at the file level and at the chunk level. Considering that the reliable hash algorithms (such as MD5 and SHA-1) used for generating fingerprints are compute-intensive, reading and fingerprinting a dataset twice at different levels consumes significant I/O and computation time, which can in turn bottleneck the deduplication throughput.

To overcome this challenge, we propose FDF, a Fast Dual-level Fingerprinting scheme that fingerprints datasets both at the file level and at the chunk level in a single scan of the contents. FDF breaks the dual-level fingerprinting process into task segments and further employs three techniques to optimize the performance.
First, FDF utilizes Rabin's fingerprinting algorithm [9] to divide files into variable-sized chunks while simultaneously capturing and eliminating hot zero-chunks by judiciously selecting chunk boundaries. Second, FDF employs the SHA-1 algorithm to generate fingerprints for files and chunks and further defines a hash context to preserve the intermediate state of the hash algorithm, so that the file-level hashing and the chunk-level hashing can be performed in parallel over the same shared data cache. Third, FDF resolves cache conflicts between different task segments and further pipelines the fingerprinting process by leveraging the computing resources of modern multi-core CPUs, so that the time overhead can be greatly reduced.

The proposed FDF scheme has been prototyped in a data backup network and evaluated with real-world datasets to measure its efficiency. Experimental results reveal that the FDF scheme outperforms the two-stage approach in both fingerprinting performance and memory consumption.

2. Motivation and background

This section presents the necessary background information about data deduplication approaches and content-defined chunking (CDC) methods to further motivate our research.

2.1. Research advances in data deduplication

The purpose of data deduplication is to save network bandwidth and/or improve storage efficiency by identifying and eliminating duplicate data objects in data streams and/or data stores. Deduplication approaches usually work at KB or larger granularity, which distinguishes them from traditional sequential data compression algorithms and delta encoding methods that eliminate redundancy at the byte level. Benefiting from the emergence of highly reliable hash algorithms and the improving performance of modern computers, data objects can be identified and compared through their fingerprints (i.e., hashes that are usually generated by MD5 or SHA-1). Existing data deduplication approaches can operate at three granularities of data objects, namely whole files, fixed-size blocks, or variable-sized chunks generated by a CDC algorithm. Previous research [4] reveals that variable-sized chunk-level deduplication is more effective than whole-file hashing at detecting duplicates among files that are similar but not identical, and it also handles the boundary-shifting problem [10] better than the fixed-size blocking approach. Another important observation is that the space efficiency of variable-sized chunk-level deduplication is highly dependent on the average chunk size: more duplicate information can be detected among smaller chunks for a given dataset [4] [11]. On the other hand, a smaller average chunk size means that more chunks will be generated for a given dataset. For a large storage system, the chunk index can exceed the available RAM capacity and force the query process to access an on-disk index.

To avoid the duplicate-lookup disk bottleneck, DDFS [5] uses a Bloom filter as a fast in-memory index and exploits data locality to accelerate the duplicate detection process. Extreme Binning [8], a distributed fuzzy deduplication approach, groups similar files into bins and eliminates duplicates first at the file level and then at the chunk level. MAD2 [7] distributes file recipes and chunk contents among clustered storage nodes while maintaining data locality, and it further eliminates all duplicates both at the file level and at the chunk level in each node. SAM [6] performs global file-level deduplication and local chunk-level deduplication for datasets belonging to different users in a cloud backup environment. Since file-level deduplication can filter out a large amount of duplicate content and reduce the number of chunk-level duplicate lookups, these dual-level approaches can achieve higher performance and better scalability than DDFS. However, reading and fingerprinting a file set twice at different levels can consume significant I/O bandwidth and computing time, which can in turn bottleneck the deduplication throughput.
Therefore, a highly efficient dual-level fingerprinting scheme is urgently needed.

2.2. Content-defined chunking

Consider two files α and β, where β is derived from α by inserting a data segment X into its contents, which is a common modification in real-world datasets. Obviously, β is very similar to α, and the two files have to be divided into small pieces to detect the duplicate contents. A straightforward method is to break both files into fixed-size blocks, as shown in Figure 1-(a). However, since segment X overwrites part of block C and pushes all of the successive blocks forward, the corresponding block boundaries are shifted. As a result, there is little chance of detecting duplicate contents after block B. In contrast, a CDC method selects dividing points according to the binary contents observed by a sliding window. As Figure 1-(b) shows, a w-byte window slides over the contents of both α and β, and a simple fingerprint is generated from the covered bytes each time the window moves forward. If the fingerprint matches a predefined pattern, the position of the window is chosen as a chunk boundary. In Figure 1-(b), the boundaries of chunk e are located again, so the successive duplicate chunks can still be detected.
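
To make the sliding-window procedure concrete, the following is a minimal sketch of a content-defined chunker in C. The simple rolling hash used here only stands in for the Rabin fingerprint that is formalized in the remainder of this subsection, and the boundary rule (fingerprint mod D = C) together with the size thresholds L and U is likewise explained below; all parameter values are illustrative rather than FDF's actual configuration.

    /* Minimal content-defined chunker: a sliding window with a simple
     * rolling hash standing in for the Rabin fingerprint.  Boundaries
     * are declared where (hash mod D) == C, subject to the lower and
     * upper chunk-size thresholds L and U described in the text.      */
    #include <stddef.h>
    #include <stdint.h>

    #define WIN   48            /* sliding window width w in bytes     */
    #define DIV   4096          /* divisor D: expected avg. chunk size */
    #define PAT   0             /* pattern code C (0 also flags zeros) */
    #define LMIN  1024          /* lower chunk-size threshold L        */
    #define UMAX  65536         /* upper chunk-size threshold U        */

    /* Return the length of the next chunk starting at data[0].        */
    size_t next_chunk(const unsigned char *data, size_t len)
    {
        uint64_t hash = 0;
        size_t i;

        if (len < LMIN)                    /* short tail of the stream */
            return len;

        for (i = 0; i < len; i++) {
            /* slide the window by one byte; a real Rabin fingerprint
             * would use the precomputed-table update described below  */
            hash = (hash << 1) + data[i];
            if (i >= WIN)                  /* drop the byte leaving    */
                hash -= (uint64_t)data[i - WIN] << WIN;

            if (i + 1 < LMIN)              /* chunk still too small    */
                continue;
            if (hash % DIV == PAT)         /* content-defined boundary */
                return i + 1;
            if (i + 1 >= UMAX)             /* hard break-point         */
                return i + 1;
        }
        return len;                        /* no boundary before end   */
    }

A caller would invoke next_chunk() repeatedly, advancing by the returned length and handing each chunk to the hashing stage described in Section 3.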

Figure 1. Comparison between fixed-size blocking and CDC: (a) fixed-size blocking; (b) content-defined chunking (CDC).

Since the fingerprint has to be updated each time the window slides by one byte, the chunking performance can be seriously bottlenecked by the computational overhead. Existing CDC methods usually employ Rabin's fingerprinting algorithm [9] [12] to improve the computational efficiency. Let S = {b_1, b_2, ..., b_m} be a byte string and let w = 48. The Rabin fingerprint of the first 48-byte substring can be defined over the Galois field Z_2 = {0, 1} as

f_Rabin^(1) = (b_1 * p^47 + b_2 * p^46 + ... + b_48) mod M,

where p = 2^8 and M is a predefined constant generated from an irreducible polynomial of degree k over Z_2. If the window slides one byte forward, the fingerprint can be easily updated by removing the first monomial and appending the next byte, so that

f_Rabin^(2) = (b_2 * p^47 + b_3 * p^46 + ... + b_49) mod M = ((f_Rabin^(1) - b_1 * p^47) * p + b_49) mod M.

If the possible values of b_i * p^47 for all 256 byte values are pre-calculated and stored in a cached table, the computational efficiency can be greatly improved. Note that the basic operations on polynomials can be simplified in Z_2: addition is equivalent to bitwise XOR, and multiplication by p can be implemented by shifting left by 8 bits. Defining a divisor D << M and a pattern code C < D, a simple boundary selection method is to choose the positions of the window at which the corresponding fingerprints satisfy f_Rabin mod D = C. Since the fingerprint changes essentially randomly as the window slides forward, approximately one fingerprint out of every D candidates matches the condition. As a result, the expected distance between two selected boundaries is D bytes, which is reflected in the average chunk size. However, there can be extreme cases in which a candidate chunk is too small or too large compared with the average chunk size. To avoid such abnormal chunks, it is necessary to impose two thresholds L and U on the chunk size. A matching boundary is not chosen until the corresponding chunk size reaches the lower threshold L, and a hard break-point is created if the chunk size reaches the upper threshold U without a valid boundary being detected. In practice, D is usually defined between 2^12 and 2^16, and L and U satisfy L < D < U, so that the average chunk size ranges from 4KB to 64KB accordingly.

3. The FDF scheme

The fast dual-level fingerprinting (FDF) scheme is designed to fingerprint datasets both at the file level and at the chunk level in a single scan of the contents by efficiently utilizing system resources and scheduling task segments.

Figure 2 presents a general framework of dual-level deduplication. In a practical deduplication system, raw files can be fingerprinted either (1) on the client end or (2) on the server end. In the former case, the whole-file fingerprints will first be computed and sent to the remote server to distinguish unique files from duplicates.

Then only non-duplicate files will be further chunked and fingerprinted, and a file recipe containing the necessary reconstruction information will be built for each file. During the chunk-level deduplication, only fresh chunk contents need to be actually transferred and stored, which saves both network bandwidth and storage capacity. In some approaches such as Extreme Binning [8], chunk fingerprints have to be generated before the elimination of duplicate files, so that all files have to be fingerprinted at the chunk level beforehand.

Figure 2. A dual-level deduplication framework.

In the latter case, raw files on a client machine will be directly sent to the storage server without considering bandwidth saving. To achieve inline deduplication, the storage server is required to perform dual-level fingerprinting with high performance so as to match the (100MB/s) throughput of a gigabit network adapter. The straightforward solution of buffering files in RAM or on local disks for two-stage fingerprinting can increase the local I/O overhead and bottleneck the deduplication throughput. There are mainly four kinds of data inside the deduplication storage, i.e., the file fingerprint index along with file recipes, and the chunk fingerprint index along with chunk contents, where file recipes record the mapping between file fingerprints and chunk fingerprints to facilitate file reconstruction and chunk retrieval.

3.1. Segmenting the fingerprinting process

In this paper, we mainly focus on optimizing the fingerprinting performance in scenarios where both file fingerprints and chunk fingerprints have to be generated before the deduplication procedure. To find potential optimizations for the fingerprinting process, we first look into the straightforward two-stage fingerprinting approach. As shown in Figure 3, the fingerprinting process can be divided into sequentially executed segments. To fingerprint a file on a client machine, the file is first read and hashed in stage 1 to generate the whole-file hash. In stage 2, the file contents are scanned again and further divided into small chunks, which are then hashed to generate a list of chunk fingerprints. Suppose the throughputs of reading (on-disk files), chunking, and hashing (in-memory data) are TP_r, TP_c, and TP_h, respectively; then the total time overhead of fingerprinting a large file of size S is

T_client-l = T_stage1 + T_stage2 = (S/TP_r + S/TP_h) + (S/TP_r + S/TP_c + S/TP_h) = 2S/TP_r + S/TP_c + 2S/TP_h.

For a small file that can be fully buffered in RAM, the fingerprinting time overhead is reduced to

T_client-s = T_stage1 + T_stage2 = (S/TP_r + S/TP_h) + (S/TP_c + S/TP_h) = S/TP_r + S/TP_c + 2S/TP_h.
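
To put these formulas into perspective, consider a purely hypothetical client whose disk reads at TP_r = 100MB/s, whose chunking runs at TP_c = 400MB/s, and whose hashing runs at TP_h = 250MB/s (illustrative figures only, not measurements from this paper). For a large file of S = 1024MB, the formulas above give

    T_client-l = 2*(1024/100) + 1024/400 + 2*(1024/250) ≈ 20.5 + 2.6 + 8.2 ≈ 31.3 seconds
    T_client-s =   (1024/100) + 1024/400 + 2*(1024/250) ≈ 10.2 + 2.6 + 8.2 ≈ 21.0 seconds

The roughly 10-second gap between the two cases is exactly the cost of reading the file from disk a second time, which is the overhead that a single-scan scheme aims to eliminate.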

To fingerprint files on the server end, data has to be received from the network and buffered locally. In particular, a file has to be written to disk if it is too large for RAM to hold in its entirety. Suppose the throughputs of transferring and writing (file data) are TP_t and TP_w, respectively; then the total time overhead of fingerprinting a large file from the network is

T_server-l = T_stage1 + T_stage2 = (S/TP_t + S/TP_h + S/TP_w) + (S/TP_r + S/TP_c + S/TP_h) = S/TP_t + S/TP_w + S/TP_r + S/TP_c + 2S/TP_h.

For a small file that can be fully buffered in RAM, it is possible for the fingerprinting process to copy data directly from the buffer, which consumes negligible time, and the corresponding total time overhead is

T_server-s = T_stage1 + T_stage2 = (S/TP_t + S/TP_h) + (S/TP_c + S/TP_h) = S/TP_t + S/TP_c + 2S/TP_h.

Obviously, T_server-s is smaller than T_server-l because fingerprinting a small file avoids expensive disk accesses.

Figure 3. Segmentation of the fingerprinting process.

3.2. Capturing extremely hot zero-chunks

Among the fingerprinting task segments, the chunking module is critically important for detecting duplicates among similar files. In particular, previous studies [13] [14] reveal that zero-byte strings may widely exist in different types of files, such as .vmdk files (virtual machine disk images), .iso files (CD/DVD images), and so on. Quickly identifying such zero-byte strings is helpful for eliminating duplicates among dissimilar files and improving the deduplication throughput. We present here a straightforward method to capture extremely hot zero-chunks in data contents. Recall from the content-defined chunking algorithm in Section 2.2 that the Rabin fingerprint of a 48-byte substring starting at offset i is

f_Rabin^(i) = (b_i * p^47 + b_{i+1} * p^46 + ... + b_{i+47}) mod M.

Obviously, f_Rabin^(i) = 0 if all 48 bytes are zeros. Conversely, if f_Rabin^(i) = 0, we can expect all 48 bytes to be zeros with a certain probability. Further, if f_Rabin remains 0 while the window slides over L bytes, where L is the lower threshold of the chunk size, we can expect with high probability that these L bytes form a zero-chunk, and a chunk boundary can be determined. Clearly, it is a reasonable choice to define the pattern code C = 0 to facilitate this chunk boundary selection. In practice, we define L = 2^10 and pre-calculate the SHA-1 sum of 1024 zero bytes to capture and confirm hot zero-chunks. Note that identifying zero-chunks while selecting chunk boundaries is more efficient than the straightforward method that detects and counts zero bytes in a separate pass.
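
As a concrete illustration of this shortcut, the sketch below assumes that the chunker (cf. the sketch in Section 2.2) emits 1024-byte candidate chunks for long zero runs when C = 0, and uses OpenSSL's one-shot SHA1() routine (link with -lcrypto) to pre-compute the digest of 1024 zero bytes. The byte-wise confirmation shown here is merely one simple way to validate a candidate zero-chunk; the paper does not prescribe this exact check.

    /* Capturing hot zero-chunks during fingerprinting: the digest of a
     * 1024-byte zero-chunk is computed once and reused, so confirmed
     * zero-chunks are never hashed again.  (Illustrative sketch only.) */
    #include <stddef.h>
    #include <string.h>
    #include <openssl/sha.h>

    #define ZERO_CHUNK_LEN 1024                 /* L = 2^10             */

    static unsigned char zero_digest[SHA_DIGEST_LENGTH];

    void init_zero_digest(void)
    {
        static const unsigned char zeros[ZERO_CHUNK_LEN];   /* all 0s   */
        SHA1(zeros, ZERO_CHUNK_LEN, zero_digest);
    }

    /* Fingerprint one chunk; returns 1 if it was a captured zero-chunk. */
    int fingerprint_chunk(const unsigned char *chunk, size_t len,
                          unsigned char digest[SHA_DIGEST_LENGTH])
    {
        if (len == ZERO_CHUNK_LEN) {
            size_t i = 0;
            while (i < len && chunk[i] == 0)
                i++;
            if (i == len) {                     /* confirmed zero-chunk */
                memcpy(digest, zero_digest, SHA_DIGEST_LENGTH);
                return 1;
            }
        }
        SHA1(chunk, len, digest);               /* ordinary chunk       */
        return 0;
    }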

3.3. Enabling parallel hashing

As described in Section 3.1, a file has to be read from disk one more time in the chunk-level fingerprinting stage if it is too large to be cached in RAM in its entirety. To avoid such redundant disk accesses, we propose to use a shared data cache for both the file-level fingerprinting and the chunk-level fingerprinting, and to further enable parallel hashing. In the SHA-1 algorithm, which is employed in our fingerprinting scheme, a given data object is appended with a bit 1, k zero bits (0 ≤ k < 512), and a 64-bit big-endian integer recording the length of the data object, so that the resulting data can be divided into 512-bit blocks with no irregular fragments. The 160-bit hash of the data object is first initialized as five 32-bit words, i.e., h0 = 0x67452301, h1 = 0xEFCDAB89, h2 = 0x98BADCFE, h3 = 0x10325476, and h4 = 0xC3D2E1F0. The hash is then updated by a group of complex functions in 80 rounds every time a 512-bit block is input. In the fingerprinting process, it is possible that a data object (file or chunk) has not been fully loaded into the data cache, e.g., the boundary of a chunk is not detected until the end of the cached data is reached. To resolve this problem, we record the intermediate state of the hashing process using a hash context, which is defined as { unsigned long long counter; unsigned long hash_sum[5]; unsigned char buffer[64]; }, where counter records the number of bytes processed, hash_sum records the intermediate SHA-1 sum, and buffer holds the incomplete data block that needs to be made up to 512 bits after fresh data is loaded. By introducing the hash context, the hashing process can be performed in an incremental manner. Most importantly, it becomes possible to perform the file-level hashing and the chunk-level hashing in parallel using a shared data cache, so the I/O overhead can be reduced. Furthermore, if two CPU cores are available to execute the dual-level hashing tasks, the computational efficiency can be greatly improved.
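
To illustrate how a preserved hash state enables dual-level hashing over one shared cache, the following sketch uses OpenSSL's incremental SHA_CTX interface (SHA1_Init / SHA1_Update / SHA1_Final, link with -lcrypto) as a stand-in for the hash context defined above; the boundary array and the emit_chunk_fp() callback are hypothetical placeholders rather than FDF's actual interfaces.

    /* Dual-level hashing over a shared, read-only data cache.  file_ctx
     * and chunk_ctx persist across calls, so files and chunks may span
     * several cache fills.  boundaries[] holds the chunk boundaries
     * (offsets within the cache) found by the chunking stage.          */
    #include <stddef.h>
    #include <openssl/sha.h>

    void emit_chunk_fp(const unsigned char fp[SHA_DIGEST_LENGTH]);

    void dual_level_hash(SHA_CTX *file_ctx, SHA_CTX *chunk_ctx,
                         const unsigned char *cache, size_t cached,
                         const size_t *boundaries, size_t nboundaries)
    {
        unsigned char fp[SHA_DIGEST_LENGTH];
        size_t off = 0, i;

        /* file-level hashing consumes the whole cache in one pass; this
         * call can run on another core concurrently with the loop below,
         * since both only read the shared cache                          */
        SHA1_Update(file_ctx, cache, cached);

        /* chunk-level hashing walks the boundaries found by the chunker  */
        for (i = 0; i < nboundaries; i++) {
            SHA1_Update(chunk_ctx, cache + off, boundaries[i] - off);
            SHA1_Final(fp, chunk_ctx);          /* chunk complete          */
            emit_chunk_fp(fp);
            SHA1_Init(chunk_ctx);               /* start the next chunk    */
            off = boundaries[i];
        }
        /* the tail after the last boundary stays in chunk_ctx and is
         * continued when the next cache is handed over                    */
        if (off < cached)
            SHA1_Update(chunk_ctx, cache + off, cached - off);
    }

Because both levels only read the cache, the two parts above can be dispatched to two different CPU cores without copying or locking the cached data, which is what makes the shared-cache design viable.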
3.4. Pipelining fingerprinting task segments

As a further optimization, apart from improving the chunking algorithm and parallelizing the hashing processes, this subsection focuses on leveraging the computing resources of modern multi-core CPUs to pipeline the fingerprinting task segments and improve the overall fingerprinting performance. As analyzed in Section 3.1, the task segments of reading/receiving, chunking, and hashing are responsible for most of the time overhead and are the most likely to constitute a bottleneck. On the other hand, the rapid development of modern multi-core processors and well-designed OpenMP (Open Multi-Processing) libraries provide an opportunity to execute the task segments in parallel on different CPU cores. However, there are data dependencies that make it a natural choice to run the task segments sequentially. For example, the cached data cannot be fingerprinted at the chunk level until the corresponding chunk boundaries have been determined. To avoid such data dependencies, we propose to use a cache group instead of a single cache for accommodating both data contents and chunk boundaries. As Figure 4 shows, the fingerprinting process is reorganized into three stages, i.e., the data preparation stage, the chunking stage, and the dual-level hashing stage, where each stage can be assigned an independent data cache as well as a boundary cache. As a result, it becomes possible to pipeline the time-consuming task segments and improve the overall fingerprinting performance, as sketched below. For example, the chunking stage can parse data in cache B while the data preparation stage simultaneously reads fresh data into cache C. When data cache C is fully filled, data cache A can be reused for accommodating fresh data if its contents have already been chunked and hashed. A data cache (together with its corresponding boundary cache) is switched and handed over to the next stage once its contents have been processed. In particular, the file-level hashing and the chunk-level hashing can be performed in parallel because they share the same data cache in a read-only manner and have about the same computational complexity. Clearly, at least four CPU cores are required to pipeline and parallelize all the time-consuming task segments.
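
The sketch below shows one way such a cache-group pipeline could be organized with OpenMP sections (compile with -fopenmp): each round runs the three stages concurrently on different members of a three-slot cache group and then rotates the slots, a simplified barrier-per-round approximation of the hand-over described above. The slot structure and the prepare()/chunk_stage()/hash_stage() helpers are illustrative placeholders, and the dual-level hashing stage would internally split the file-level and chunk-level hashing across two further cores as in Section 3.3.

    /* Rotating three-slot cache group pipelining the preparation,
     * chunking and dual-level hashing stages.  At round r the slot
     * filled in round r is prepared, the slot filled in round r-1 is
     * chunked, and the slot filled in round r-2 is hashed.             */
    #include <stddef.h>

    #define SLOTS     3
    #define CACHE_CAP (4 * 1024 * 1024)          /* 4MB per data cache  */

    struct slot {
        unsigned char data[CACHE_CAP];           /* data cache          */
        size_t        len;                       /* bytes cached        */
        size_t        bounds[CACHE_CAP / 1024];  /* boundary cache      */
        size_t        nbounds;
    };

    size_t prepare(struct slot *s);              /* read/receive data   */
    void   chunk_stage(struct slot *s);          /* fill s->bounds      */
    void   hash_stage(const struct slot *s);     /* dual-level hashing  */

    void pipeline(size_t nrounds)                /* nrounds cache fills */
    {
        static struct slot group[SLOTS];
        size_t round;

        /* run nrounds + 2 rounds so the last caches drain through the
         * chunking and hashing stages                                   */
        for (round = 0; round < nrounds + 2; round++) {
            struct slot *pre = &group[round % SLOTS];
            struct slot *chk = &group[(round + SLOTS - 1) % SLOTS];
            struct slot *hsh = &group[(round + SLOTS - 2) % SLOTS];

            #pragma omp parallel sections
            {
                #pragma omp section
                { if (round < nrounds) pre->len = prepare(pre); }
                #pragma omp section
                { if (round >= 1 && round <= nrounds) chunk_stage(chk); }
                #pragma omp section
                { if (round >= 2) hash_stage(hsh); }
            }   /* implicit barrier: all stages finish before rotating   */
        }
    }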

On a client machine, the time overhead of fingerprinting one file that belongs to a large dataset can be expected to be

T_client = max(T_preparation, T_chunking, T_hashing) = max(S/TP_r, S/TP_c, S/TP_h),

where each term is defined as in Section 3.1. Similarly, the expected time overhead of fingerprinting a file on a remote server is

T_server = max(T_preparation, T_chunking, T_hashing) = max(S/TP_t, S/TP_c, S/TP_h).

Obviously, the time overhead can be significantly reduced compared with that of the two-stage approach when dealing with large datasets. We will illustrate how to schedule the fingerprinting task segments on modern quad-core or even dual-core CPUs in Section 4.

Figure 4. Parallelism of fingerprinting task segments.

4. Evaluation and analysis

We evaluate FDF through a prototype running on B-Cloud, a research-oriented distributed system that provides network backup services for user files and other binary data. The B-Cloud system [7] consists of backup clients, front-end backup servers, metadata servers, and back-end storage servers. Specifically, a backup client scans and transfers user-specified datasets to remote backup servers according to predefined backup policies. A group of backup servers cooperatively provides backup services for the purpose of load balancing, and a new backup job is always dispatched to the backup server with the lowest workload level. A backup server splits metadata from file contents while receiving user data. Backup job information as well as the related file metadata is sent to the metadata servers, and file contents are delivered to the storage servers. In particular, file contents are fingerprinted both at the file level and at the chunk level before being actually transferred to storage servers, and duplicate contents are then eliminated through the MAD2 [7] deduplication approach. We implement the FDF scheme in backup clients and backup servers respectively to evaluate its performance, and the duplicate elimination ratios achieved on real-world datasets are also measured and reported. Further, we discuss some implementation issues at the end of this section.

4.1. Experimental setup

The hardware configuration of our experimental backup clients includes a dual-core CPU running at 2.5 GHz, 4GB RAM, a 500GB hard disk, and one gigabit network adapter.

The experimental backup servers are configured as follows: a quad-core CPU running at 2.0 GHz, 4 x 2GB RAM, one RAID controller card with 128MB cache, 8 x 1TB hard disks organized as a RAID-5 partition, and two gigabit network interface cards.

We have collected two real-world datasets from different groups of users. The first dataset was contributed by 15 students in an engineering group and is referred to as the Workgroup set. Each student backs up data from a desktop PC or workstation over a span of 31 days. There are 12.1 million files with a total size of 6.0TB in the Workgroup set. The second dataset was collected from 26 users on a campus network, including file transfer site managers, small website maintainers, and other individuals. Every user runs full or incremental backup jobs independently over a span of 31 days. This dataset is called the Campus set and contains 15.4 million files that amount to a total of 4.7TB of data.

4.2. Fingerprinting performance

This subsection evaluates the performance of our FDF scheme on a dual-core backup client and a quad-core backup server, respectively. We have also implemented the two-stage fingerprinting approach in both environments for comparison. Figure 5-(a) shows the scheduling process of deploying FDF on a dual-core client machine. The reading tasks and the chunking tasks are pipelined on different CPU cores. When a cached data segment has been chunked, the data contents are hashed in parallel to update the corresponding file fingerprint and generate chunk fingerprints. Note that the chunking tasks and the hashing tasks are not pipelined due to the limited number of CPU cores. Clearly, using two data caches associated with two boundary caches is sufficient for deploying the FDF scheme. During the evaluation, the experimental backup client fingerprints a local dataset of 302,550 files that amount to a total of GB data. For the stability of results, we measure the average fingerprinting performance by tracing the total time overhead of processing the whole dataset. It has also been measured that the local disk can achieve an average linear read throughput of 92.1MB/s and an average random read throughput of 84.5MB/s when transferring 1MB data blocks.

Figure 5. Scheduling FDF on (a) a dual-core client machine and (b) a quad-core backup server.

Figure 6. The fingerprinting performance of FDF on (a) a dual-core client machine and (b) a quad-core backup server (series: FDF, single-thread two-stage, and multi-thread two-stage approaches, with the measured disk/network throughput shown for reference; x-axis: the capacity of each cache in MB, logarithmic scale; y-axis: average fingerprinting performance in MB/s).

Figure 6-(a) presents the results under different configurations of cache capacity. As the cache size increases from 1MB to 256MB, the average fingerprinting performance of the FDF scheme fluctuates between 64.7MB/s and 67.0MB/s.

In comparison, the average performance of the single-thread two-stage fingerprinting process increases from 31.6MB/s to 42.1MB/s. It has been observed that over 90% of the file data can be fully buffered in RAM and thus directly reused during the chunk-level fingerprinting stage once the cache capacity reaches 256MB (see Section 3.1). For a fair comparison, we have also implemented the two-stage fingerprinting approach using two concurrent threads to fully utilize the available CPU cores. As shown in Figure 6-(a), the two-thread two-stage fingerprinting approach delivers a throughput of only 23.7MB/s when using 1MB data caches. This is because the disk throughput drops to around 48.8MB/s when small pieces of data are randomly read by two concurrent threads. As the cache capacity increases, the benefit of using two concurrent threads shows up, and the average fingerprinting performance finally grows to 53.5MB/s with 256MB data caches. The results indicate that the fingerprinting performance of the FDF scheme is not as sensitive to cache capacity as that of the two-stage approach, and the FDF scheme can outperform the two-stage approach while using only a few megabytes of RAM.

Figure 5-(b) shows the scheduling process of deploying FDF on a quad-core backup server. The fingerprinting tasks of receiving (data from the network), chunking, and hashing are pipelined and distributed across all the CPU cores. In particular, the file-level hashing and the chunk-level hashing are executed in parallel while sharing the same data cache. Three data caches along with three boundary caches are used for deploying the FDF scheme in this environment. We use another backup server as a client to supply the source data and avoid the disk-access bottleneck. It has been measured that the RAID-5-based storage subsystem can achieve an average linear read throughput of 477.6MB/s and an average linear write throughput of 465.0MB/s when transferring 16MB data blocks. On the other hand, the gigabit network adapter shows an average throughput of 107.4MB/s. As shown in Figure 6-(b), the average throughput of the FDF scheme fluctuates between 102.1MB/s and 105.2MB/s as the cache capacity increases from 1MB to 256MB. The results suggest that deploying the FDF scheme on a quad-core server can further accelerate the fingerprinting performance and achieve a high throughput of over 95% of the available network bandwidth. In comparison, the single-thread two-stage approach, which buffers large files on local disk for the chunk-level fingerprinting, only achieves a throughput ranging from 34.5MB/s to 41.6MB/s, far below the performance of the FDF scheme. For a fair comparison, a four-thread two-stage approach has also been implemented to fully utilize the available CPU cores. The four-thread two-stage approach achieves a throughput ranging from 73.5MB/s to 89.2MB/s as the cache capacity increases. Clearly, the FDF scheme still delivers better fingerprinting performance than the four-thread two-stage approach; note that the latter can consume more RAM and even additional disk space than the former when dealing with large files.

4.3. Duplicate elimination efficiency

As previously discussed, chunk-level deduplication can detect duplicate information between similar files and thus achieve high space efficiency. On the other hand, file-level deduplication can detect and eliminate duplicate files and thus reduce the duplicate-lookup complexity at the chunk level.
Figure 7-(a) reports the number of fingerprints of our experimental datasets at the different levels to show the duplicate-lookup complexities. For the Workgroup set, fingerprints are generated and would have to be deduplicated at the chunk level. By introducing file-level deduplication, the original files are deduplicated into unique files that contain nonzero-chunks and zero-chunks, where the zero-chunks can be directly filtered out by our FDF scheme. Finally, the Workgroup set is deduplicated into unique chunks. It should be noted that a zero-chunk has a fixed size of 1KB in our implementation, while the nonzero-chunks have a much larger average size of around 4KB. Obviously, the dual-level deduplication only needs to process the file fingerprints and the fingerprints of nonzero-chunks, which amounts to only 6.7% of the lookup complexity of pure chunk-level deduplication that examines all the chunk fingerprints. Moreover, capturing zero-chunks in the fingerprinting process can significantly reduce the computational overhead of the deduplication approach. The Campus set initially contains files that can be further divided into chunks. The file-level deduplication detects unique files containing nonzero-chunks and zero-chunks, and unique chunks are finally obtained after the chunk-level deduplication. Similar to the Workgroup set, the duplicate-lookup complexity is greatly reduced by employing dual-level deduplication and capturing zero-chunks in the fingerprinting process.

Figure 7. Duplicate elimination efficiency: (a) the number of fingerprints at different levels (logarithmic scale); (b) data sizes and duplicate elimination ratios.

Figure 7-(b) presents the data sizes and the duplicate elimination ratios at the different deduplication levels for both experimental datasets. The duplicate elimination ratio (DER) is calculated as the original data size divided by the data size after deduplication. At the beginning, there are 6,151.88GB of data in the Workgroup set and 4,778.15GB of data in the Campus set, respectively. After the file-level deduplication, the unique files of the Workgroup set achieve a DER of 10.31, and those of the Campus set result in a DER of 8.0. By further eliminating duplicates at the chunk level, the unique chunks finally produced for the Workgroup set correspond to a further improved chunk-level DER. For the Campus set, the unique chunks correspond to a chunk-level DER that is about 2.3 times higher than the file-level DER. Moreover, Figure 7-(b) also reports the sizes of the zero-chunks contained in the unique files of both datasets. Our experimental results based on real-world datasets reveal that file-level deduplication can eliminate most duplicate data and significantly reduce the duplicate-lookup complexity at the chunk level. Moreover, chunk-level deduplication can detect more duplicate information between similar files and further improve the duplicate elimination ratio. As a result, dual-level fingerprinting as well as dual-level deduplication is recommended when designing a practical data backup/archiving system.

4.4. Discussion

The FDF scheme is designed to fingerprint a dataset both at the file level and at the chunk level in a single scan of the contents. However, as described at the beginning of Section 3, some dual-level deduplication approaches may only want to fingerprint non-duplicate files at the chunk level. In such a case, the FDF scheme can incur unnecessary computational overhead by chunking and hashing contents belonging to duplicate files. If this computational overhead outweighs the benefit of fingerprinting files in a single scan, it becomes a better choice to perform the chunk-level fingerprinting as a second stage after the file-level deduplication. We argue that the principle of our FDF scheme, i.e., pipelining time-consuming task segments by resolving the cache conflicts between them, is still applicable for optimizing and accelerating the overall fingerprinting process.

5. Related work

Chunking methods have been well studied in many previous works. The two-threshold two-divisor (TTTD) chunking approach [15] avoids cutting abnormally large chunks at hard break-points (which behave like fixed-size blocks) by introducing a backup divisor to restrict the actual chunk sizes. Specifically, if the size of the current chunk reaches the predefined upper threshold without finding a boundary match, the chunking process switches to a smaller divisor (see Section 2.2) and tries again to find a content-defined boundary as an alternative to creating a hard break-point.
ADMAD [16] exploits certain metadata information (e.g., file type and file format) to divide files into variable-sized logical units, and further eliminates duplicate units to achieve good space efficiency in archival storage. However, this approach requires the fingerprinting module to recognize the file format and to maintain many application-specific chunking libraries.

Recently, a novel bimodal content-defined chunking approach [17] has been proposed to increase the average chunk size while maintaining a comparable duplicate elimination ratio. Specifically, the bimodal chunking algorithm generates small chunks in the limited regions of transition from duplicate to non-duplicate data and generates large chunks elsewhere, according to the existence of the candidate chunks. Similarly, VS-SWC [18] uses small chunks only at the junction regions between duplicate data and unique data, so that the duplicate elimination ratio can be improved while the number of chunks is kept down. Apart from CDC algorithms, a frequency-based chunking (FBC) algorithm [19] has also been proposed. FBC first samples and identifies frequent fixed-size blocks over a data stream, and then coarsely divides the data stream into large content-defined chunks. If a coarse-grained chunk contains any high-frequency block, it is further divided into fine-grained chunks. As a result, FBC is able to generate fewer chunks than the baseline CDC algorithm while maintaining the duplicate elimination ratio. Compared with the above approaches, which focus on optimizing the chunking algorithm and improving the deduplication ratio, our FDF scheme aims to efficiently fingerprint file sets both at the file level and at the chunk level to improve the deduplication throughput. Apart from the most relevant deduplication approaches introduced in Section 2.1, many other excellent works that eliminate duplicate data in different environments are surveyed in the MAD2 paper [7] and the DCBA paper [20].

6. Conclusion

This paper presents FDF, a fast dual-level fingerprinting scheme that can fingerprint a dataset both at the file level and at the chunk level with high performance in a single scan of the data contents. Experimental results reveal that our FDF scheme can significantly outperform the two-stage fingerprinting approach while using only a small fraction of the memory resources of the latter. Most importantly, the FDF scheme can generally match the throughput of a gigabit network adapter while being fully pipelined. Deduplication results based on real-world datasets show that eliminating duplicate files can greatly reduce the duplicate-lookup complexity at the chunk level. Further, millions or even tens of millions of hot zero-chunks have been captured and pre-eliminated while processing data contents belonging to non-duplicate files.

7. Acknowledgement

This work is supported in part by the National Basic Research Program (973 Program) of China under Grant No. 2011CB and the National High Technology Research and Development Program (863 Program) of China under Grant No. 2009AA01A.

References

[1] Tianming Yang, Dan Feng, Zhongying Niu, Yaping Wan, "Scalable high performance de-duplication backup via hash join", Journal of Zhejiang University-SCIENCE C, vol. 11, no. 5.
[2] Lawrence L. You, Kristal T. Pollack, Darrell D. E. Long, "Deep Store: An Archival Storage System Architecture", In Proceedings of the 21st International Conference on Data Engineering (ICDE).
[3] Jingli Zhou, Ke Liu, Leihua Qin, Xuejun Nie, "Block-Ranking: Content Similarity Retrieval Based on Data Partition in Network Storage Environment", JDCTA: International Journal of Digital Content Technology and its Applications, vol. 4, no. 3, pp.85-94.
[4] Calicrates Policroniades, Ian Pratt, "Alternatives for Detecting Redundancy in Storage Systems Data", In Proceedings of the General Track: 2004 USENIX Annual Technical Conference, pp.73-86.
[5] Benjamin Zhu, Kai Li, Hugo Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System", In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST).
[6] Yujuan Tan, Hong Jiang, Dan Feng, Lei Tian, Zhichao Yan, Guohui Zhou, "SAM: A Semantic-Aware Multi-Tiered Source De-duplication Framework for Cloud Backup", In Proceedings of the 39th International Conference on Parallel Processing (ICPP).
[7] Jiansheng Wei, Hong Jiang, Ke Zhou, Dan Feng, "MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services", In Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST).
[8] Deepavali Bhagwat, Kave Eshghi, Darrell D. E. Long, Mark Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup", In Proceedings of the 17th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS).
[9] Michael O. Rabin, "Fingerprinting by random polynomials", Technical Report No. TR-15-81, Center for Research in Computing Technology, Harvard University, Cambridge, MA, USA.
[10] Udi Manber, "Finding Similar Files in a Large File System", In Proceedings of the Winter 1994 USENIX Technical Conference, pp.1-10.
[11] Purushottam Kulkarni, Fred Douglis, Jason LaVoie, John M. Tracey, "Redundancy Elimination within Large Collections of Files", In Proceedings of the General Track: 2004 USENIX Annual Technical Conference, pp.59-72.
[12] Andrei Z. Broder, "Some applications of Rabin's fingerprinting method", Sequences II: Methods in Communications, Security, and Computer Science, Springer-Verlag, New York, USA.
[13] Keren Jin, Ethan L. Miller, "The Effectiveness of Deduplication on Virtual Machine Disk Images", In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference.
[14] Dirk Meister, André Brinkmann, "Multi-Level Comparison of Data Deduplication in a Backup Scenario", In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference.
[15] Kave Eshghi, Hsiu Khuern Tang, "A Framework for Analyzing and Improving Content-Based Chunking Algorithms", Technical Report No. HPL (R.1), Hewlett-Packard Laboratories, Palo Alto, CA, USA.
[16] Chuanyi Liu, Yingping Lu, Chunhui Shi, Guanlin Lu, David H. C. Du, Dongsheng Wang, "ADMAD: Application-Driven Metadata Aware De-duplication Archival Storage System", In Proceedings of the 5th IEEE International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), pp.29-35.
[17] Erik Kruus, Cristian Ungureanu, Cezary Dubnicki, "Bimodal Content Defined Chunking for Backup Streams", In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST).
[18] Can Wang, Zhiguang Qin, Lei Yang, Peng Nie, "Improved Deduplication Method based on Variable-Size Sliding Window", JDCTA: International Journal of Digital Content Technology and its Applications, vol. 5, no. 9, pp.80-87.
[19] Guanlin Lu, Yu Jin, David H. C. Du, "Frequency Based Chunking for Data De-Duplication", In Proceedings of the 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS).
[20] Jiansheng Wei, Hong Jiang, Ke Zhou, Dan Feng, Hua Wang, "Detecting Duplicates over Sliding Windows with RAM-Efficient Detached Counting Bloom Filter Arrays", In Proceedings of the 6th IEEE International Conference on Networking, Architecture, and Storage (NAS).


A Policy-based De-duplication Mechanism for Securing Cloud Storage

A Policy-based De-duplication Mechanism for Securing Cloud Storage International Journal of Electronics and Information Engineering, Vol.2, No.2, PP.95-102, June 2015 95 A Policy-based De-duplication Mechanism for Securing Cloud Storage Zhen-Yu Wang 1, Yang Lu 1, Guo-Zi

More information

A Efficient Hybrid Inline and Out-of-line Deduplication for Backup Storage

A Efficient Hybrid Inline and Out-of-line Deduplication for Backup Storage A Efficient Hybrid Inline and Out-of-line Deduplication for Backup Storage YAN-KIT Li, MIN XU, CHUN-HO NG, and PATRICK P. C. LEE The Chinese University of Hong Kong Backup storage systems often remove

More information

Data Deduplication and Tivoli Storage Manager

Data Deduplication and Tivoli Storage Manager Data Deduplication and Tivoli Storage Manager Dave Cannon Tivoli Storage Manager rchitect Oxford University TSM Symposium September 2007 Disclaimer This presentation describes potential future enhancements

More information

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2 Using Synology SSD Technology to Enhance System Performance Based on DSM 5.2 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD Cache as Solution...

More information

CURRENTLY, the enterprise data centers manage PB or

CURRENTLY, the enterprise data centers manage PB or IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 61, NO. 11, JANUARY 21 1 : Distributed Deduplication for Big Storage in the Cloud Shengmei Luo, Guangyan Zhang, Chengwen Wu, Samee U. Khan, Senior Member, IEEE,

More information

Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets

Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets Young Jin Nam School of Computer and Information Technology Daegu University Gyeongsan, Gyeongbuk, KOREA 7-7 Email:

More information

Multi-level Metadata Management Scheme for Cloud Storage System

Multi-level Metadata Management Scheme for Cloud Storage System , pp.231-240 http://dx.doi.org/10.14257/ijmue.2014.9.1.22 Multi-level Metadata Management Scheme for Cloud Storage System Jin San Kong 1, Min Ja Kim 2, Wan Yeon Lee 3, Chuck Yoo 2 and Young Woong Ko 1

More information

A Survey on Deduplication Strategies and Storage Systems

A Survey on Deduplication Strategies and Storage Systems A Survey on Deduplication Strategies and Storage Systems Guljar Shaikh ((Information Technology,B.V.C.O.E.P/ B.V.C.O.E.P, INDIA) Abstract : Now a day there is raising demands for systems which provide

More information

Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter Camble HP Labs UC Santa Cruz HP

More information

Online Remote Data Backup for iscsi-based Storage Systems

Online Remote Data Backup for iscsi-based Storage Systems Online Remote Data Backup for iscsi-based Storage Systems Dan Zhou, Li Ou, Xubin (Ben) He Department of Electrical and Computer Engineering Tennessee Technological University Cookeville, TN 38505, USA

More information

An Efficient Deduplication File System for Virtual Machine in Cloud

An Efficient Deduplication File System for Virtual Machine in Cloud An Efficient Deduplication File System for Virtual Machine in Cloud Bhuvaneshwari D M.E. computer science and engineering IndraGanesan college of Engineering,Trichy. Abstract Virtualization is widely deployed

More information

AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment

AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment !000111111 IIIEEEEEEEEE IIInnnttteeerrrnnnaaatttiiiooonnnaaalll CCCooonnnfffeeerrreeennnccceee ooonnn CCCllluuusssttteeerrr CCCooommmpppuuutttiiinnnggg AA-Dedupe: An Application-Aware Source Deduplication

More information

A Method of Deduplication for Data Remote Backup

A Method of Deduplication for Data Remote Backup A Method of Deduplication for Data Remote Backup Jingyu Liu 1,2, Yu-an Tan 1, Yuanzhang Li 1, Xuelan Zhang 1, Zexiang Zhou 3 1 School of Computer Science and Technology, Beijing Institute of Technology,

More information

ABSTRACT 1 INTRODUCTION

ABSTRACT 1 INTRODUCTION DEDUPLICATION IN YAFFS Karthik Narayan {knarayan@cs.wisc.edu}, Pavithra Seshadri Vijayakrishnan{pavithra@cs.wisc.edu} Department of Computer Sciences, University of Wisconsin Madison ABSTRACT NAND flash

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

FAWN - a Fast Array of Wimpy Nodes

FAWN - a Fast Array of Wimpy Nodes University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed

More information

DEDUPLICATION has become a key component in modern

DEDUPLICATION has become a key component in modern IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 3, MARCH 2016 855 Reducing Fragmentation for In-line Deduplication Backup Storage via Exploiting Backup History and Cache Knowledge Min

More information

A Network Differential Backup and Restore System based on a Novel Duplicate Data Detection algorithm

A Network Differential Backup and Restore System based on a Novel Duplicate Data Detection algorithm A Network Differential Backup and Restore System based on a Novel Duplicate Data Detection algorithm GUIPING WANG 1, SHUYU CHEN 2*, AND JUN LIU 1 1 College of Computer Science Chongqing University No.

More information

Online De-duplication in a Log-Structured File System for Primary Storage

Online De-duplication in a Log-Structured File System for Primary Storage Online De-duplication in a Log-Structured File System for Primary Storage Technical Report UCSC-SSRC-11-03 May 2011 Stephanie N. Jones snjones@cs.ucsc.edu Storage Systems Research Center Baskin School

More information

Efficiently Storing Virtual Machine Backups

Efficiently Storing Virtual Machine Backups Efficiently Storing Virtual Machine Backups Stephen Smaldone, Grant Wallace, and Windsor Hsu Backup Recovery Systems Division EMC Corporation Abstract Physical level backups offer increased performance

More information

A Method of Deduplication for Data Remote Backup

A Method of Deduplication for Data Remote Backup A Method of Deduplication for Data Remote Backup Jingyu Liu 1,2, Yu-an Tan 1, Yuanzhang Li 1, Xuelan Zhang 1, and Zexiang Zhou 3 1 School of Computer Science and Technology, Beijing Institute of Technology,

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

BALANCING FOR DISTRIBUTED BACKUP

BALANCING FOR DISTRIBUTED BACKUP CONTENT-AWARE LOAD BALANCING FOR DISTRIBUTED BACKUP Fred Douglis 1, Deepti Bhardwaj 1, Hangwei Qian 2, and Philip Shilane 1 1 EMC 2 Case Western Reserve University 1 Starting Point Deduplicating disk-based

More information

Design and Implementation of a Storage Repository Using Commonality Factoring. IEEE/NASA MSST2003 April 7-10, 2003 Eric W. Olsen

Design and Implementation of a Storage Repository Using Commonality Factoring. IEEE/NASA MSST2003 April 7-10, 2003 Eric W. Olsen Design and Implementation of a Storage Repository Using Commonality Factoring IEEE/NASA MSST2003 April 7-10, 2003 Eric W. Olsen Axion Overview Potentially infinite historic versioning for rollback and

More information

Quanqing XU Quanqing.Xu@nicta.com.au. YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud

Quanqing XU Quanqing.Xu@nicta.com.au. YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud Quanqing XU Quanqing.Xu@nicta.com.au YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud Outline Motivation YuruBackup s Architecture Backup Client File Scan, Data

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

Turnkey Deduplication Solution for the Enterprise

Turnkey Deduplication Solution for the Enterprise Symantec NetBackup 5000 Appliance Turnkey Deduplication Solution for the Enterprise Mayur Dewaikar Sr. Product Manager, Information Management Group White Paper: A Deduplication Appliance Solution for

More information

Trends in Enterprise Backup Deduplication

Trends in Enterprise Backup Deduplication Trends in Enterprise Backup Deduplication Shankar Balasubramanian Architect, EMC 1 Outline Protection Storage Deduplication Basics CPU-centric Deduplication: SISL (Stream-Informed Segment Layout) Data

More information

Hardware/Software Guidelines

Hardware/Software Guidelines There are many things to consider when preparing for a TRAVERSE v11 installation. The number of users, application modules and transactional volume are only a few. Reliable performance of the system is

More information

ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory

ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory Biplob Debnath Sudipta Sengupta Jin Li Microsoft Research, Redmond, WA, USA University of Minnesota, Twin Cities, USA Abstract Storage

More information

VM-Centric Snapshot Deduplication for Cloud Data Backup

VM-Centric Snapshot Deduplication for Cloud Data Backup -Centric Snapshot Deduplication for Cloud Data Backup Wei Zhang, Daniel Agun, Tao Yang, Rich Wolski, Hong Tang University of California at Santa Barbara Pure Storage Inc. Alibaba Inc. Email: wei@purestorage.com,

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information

Read Performance Enhancement In Data Deduplication For Secondary Storage

Read Performance Enhancement In Data Deduplication For Secondary Storage Read Performance Enhancement In Data Deduplication For Secondary Storage A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Pradeep Ganesan IN PARTIAL FULFILLMENT

More information

Reducing Replication Bandwidth for Distributed Document Databases

Reducing Replication Bandwidth for Distributed Document Databases Reducing Replication Bandwidth for Distributed Document Databases Lianghong Xu 1, Andy Pavlo 1, Sudipta Sengupta 2 Jin Li 2, Greg Ganger 1 Carnegie Mellon University 1, Microsoft Research 2 #1 You can

More information

A DESIGN OF METADATA SERVER CLUSTER IN LARGE DISTRIBUTED OBJECT-BASED STORAGE

A DESIGN OF METADATA SERVER CLUSTER IN LARGE DISTRIBUTED OBJECT-BASED STORAGE A DESIGN OF METADATA SERVER CLUSTER IN LARGE DISTRIBUTED OBJECT-BASED STORAGE Jie Yan, Yao-Long Zhu, Hui Xiong, Renuga Kanagavelu, Feng Zhou, So LihWeon Data Storage Institute, DSI building, 5 Engineering

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

CDStore: Toward Reliable, Secure, and Cost- Efficient Cloud Storage via Convergent Dispersal

CDStore: Toward Reliable, Secure, and Cost- Efficient Cloud Storage via Convergent Dispersal CDStore: Toward Reliable, Secure, and Cost- Efficient Cloud Storage via Convergent Dispersal Mingqiang Li, Chuan Qin, and Patrick P. C. Lee, The Chinese University of Hong Kong https://www.usenix.org/conference/atc15/technical-session/presentation/li-mingqiang

More information

WHITE PAPER. Permabit Albireo Data Optimization Software. Benefits of Albireo for Virtual Servers. January 2012. Permabit Technology Corporation

WHITE PAPER. Permabit Albireo Data Optimization Software. Benefits of Albireo for Virtual Servers. January 2012. Permabit Technology Corporation WHITE PAPER Permabit Albireo Data Optimization Software Benefits of Albireo for Virtual Servers January 2012 Permabit Technology Corporation Ten Canal Park Cambridge, MA 02141 USA Phone: 617.252.9600 FAX:

More information

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011 SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,

More information

Primary Data Deduplication Large Scale Study and System Design

Primary Data Deduplication Large Scale Study and System Design Primary Data Deduplication Large Scale Study and System Design Ahmed El-Shimi Ran Kalach Ankit Kumar Adi Oltean Jin Li Sudipta Sengupta Microsoft Corporation, Redmond, WA, USA Abstract We present a large

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Comprehensive study of data de-duplication

Comprehensive study of data de-duplication International Conference on Cloud, ig Data and Trust 2013, Nov 13-15, RGPV Comprehensive study of data de-duplication Deepak Mishra School of Information Technology, RGPV hopal, India Dr. Sanjeev Sharma

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based

More information

Fragmentation in in-line. deduplication backup systems

Fragmentation in in-line. deduplication backup systems Fragmentation in in-line 5/6/2013 deduplication backup systems 1. Reducing Impact of Data Fragmentation Caused By In-Line Deduplication. Michal Kaczmarczyk, Marcin Barczynski, Wojciech Kilian, Cezary Dubnicki.

More information

Implementation and Evaluation of a Popularity-Based Reconstruction Optimization Algorithm in Availability-Oriented Disk Arrays

Implementation and Evaluation of a Popularity-Based Reconstruction Optimization Algorithm in Availability-Oriented Disk Arrays Implementation and Evaluation of a Popularity-Based Reconstruction Optimization Algorithm in Availability-Oriented Disk Arrays Lei Tian ltian@hust.edu.cn Hong Jiang jiang@cse.unl.edu Dan Feng dfeng@hust.edu.cn

More information

Understanding Data Locality in VMware Virtual SAN

Understanding Data Locality in VMware Virtual SAN Understanding Data Locality in VMware Virtual SAN July 2014 Edition T E C H N I C A L M A R K E T I N G D O C U M E N T A T I O N Table of Contents Introduction... 2 Virtual SAN Design Goals... 3 Data

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Building a High Performance Deduplication System Fanglu Guo and Petros Efstathopoulos

Building a High Performance Deduplication System Fanglu Guo and Petros Efstathopoulos Building a High Performance Deduplication System Fanglu Guo and Petros Efstathopoulos Symantec Research Labs Symantec FY 2013 (4/1/2012 to 3/31/2013) Revenue: $ 6.9 billion Segment Revenue Example Business

More information

A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP

A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP Dilip N Simha (Stony Brook University, NY & ITRI, Taiwan) Maohua Lu (IBM Almaden Research Labs, CA) Tzi-cker Chiueh (Stony

More information

2009 Oracle Corporation 1

2009 Oracle Corporation 1 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

Best Practices for Optimizing Your Linux VPS and Cloud Server Infrastructure

Best Practices for Optimizing Your Linux VPS and Cloud Server Infrastructure Best Practices for Optimizing Your Linux VPS and Cloud Server Infrastructure Q1 2012 Maximizing Revenue per Server with Parallels Containers for Linux www.parallels.com Table of Contents Overview... 3

More information