Metadata Feedback and Utilization for Data Deduplication Across WAN

Zhou B, Wen JT. Metadata feedback and utilization for data deduplication across WAN. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 31(3), May 2016.

Bing Zhou and Jiang-Tao Wen, Fellow, IEEE

State Key Laboratory on Intelligent Technology and Systems, Tsinghua University, Beijing, China
Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, China
Department of Computer Science and Technology, Tsinghua University, Beijing, China

Received May 11, 2015; revised January 19

Abstract

Data deduplication for file communication across a wide area network (WAN), in applications such as file synchronization and mirroring in cloud environments, usually achieves significant bandwidth savings at the cost of significant deduplication time overheads. The time overheads include the time required for data deduplication at two geographically distributed nodes (e.g., the disk access bottleneck) and the duplication query/answer operations between the sender and the receiver, since each query or answer introduces at least one round-trip time (RTT) of latency. In this paper, we present a data deduplication system across WAN with metadata feedback and metadata utilization (MFMU), in order to harness the data deduplication related time overheads. In the proposed MFMU system, selective metadata feedbacks from the receiver to the sender are introduced to reduce the number of duplication query/answer operations. In addition, to harness the metadata related disk I/O operations at the receiver, as well as the bandwidth overhead introduced by the metadata feedbacks, a metadata utilization component based on a hysteresis hash re-chunking mechanism is introduced.
Our experimental results demonstrated that MFMU achieved an average of 20%-40% deduplication acceleration, with the bandwidth saving ratio not reduced by the metadata feedbacks, as compared with data deduplication solutions based on the baseline content defined chunking (CDC) used in LBFS (Low-Bandwidth Network File System) and on the existing state-of-the-art Bimodal chunking algorithm.

Keywords: data deduplication, wide area network (WAN), metadata feedback, metadata utilization

1 Introduction

Elimination of redundant data transmissions across the network with the aid of data deduplication technology has been introduced into many geographically distributed file communication systems [1-2] for bandwidth saving and end-to-end latency reduction. In a typical data deduplication based file communication process, the Rabin fingerprint algorithm [3] is first applied to calculate a sequence of fingerprints for each input file at the sender, with the data between any two neighboring fingerprints extracted as a separate data chunk. After file chunking, a batch of SHA-1 hash values is calculated over the separate chunks and sent to the receiver for duplication detection. At the receiver, the received SHA-1 hash values are checked against the previously stored ones. A duplicate SHA-1 hash value identified at the receiver indicates a corresponding duplicate chunk at the sender. Only the chunks confirmed as non-duplicate by duplication detection, together with the metadata for file restoration, are transmitted from the sender to the receiver. The hash collision probability of SHA-1 is extremely small and can be ignored when it is used for duplication detection [4].
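The chunk, hash, and query sequence described above can be illustrated with a minimal sketch. It is not the authors' implementation: a simple polynomial rolling hash stands in for the Rabin fingerprint, the receiver's index is an in-memory set, and the names `cdc_chunks`, `transfer`, `WINDOW`, and `MASK` are hypothetical.

```python
import hashlib
import random

WINDOW = 16            # rolling-hash window in bytes (illustrative value)
MASK = (1 << 6) - 1    # cut where the low 6 bits are zero: ~64-byte expected chunks
BASE, MOD = 257, (1 << 31) - 1


def cdc_chunks(data: bytes) -> list:
    """Split data at content-defined boundaries (rolling hash stands in for Rabin)."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def transfer(data: bytes, receiver_index: set) -> list:
    """Return the chunks that must cross the network; update the receiver's index."""
    sent = []
    for chunk in cdc_chunks(data):
        digest = hashlib.sha1(chunk).digest()
        if digest not in receiver_index:   # duplication query answered "not stored"
            receiver_index.add(digest)
            sent.append(chunk)
    return sent


rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(4096))
index = set()
first = transfer(payload, index)    # cold receiver: every distinct chunk is sent
second = transfer(payload, index)   # same file again: nothing needs to be sent
```

In this sketch every chunk costs one query, which is exactly the per-chunk round-trip cost that the rest of the paper sets out to reduce.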
Regular Paper. This work was supported by the National Science Fund for Distinguished Young Scholars of China and the State Key Program of National Natural Science Foundation of China.
Springer Science + Business Media, LLC & Science Press, China

Bing Zhou et al.: Metadata Feedback and Utilization for Data Deduplication Across WAN

The inherent content overlaps of the data source and the deduplication algorithm jointly determine the output bandwidth saving ratio and the corresponding time and space overheads. Three types of time overheads during data deduplication across the network affect the file communication throughput.

Duplication Query/Answer Overhead. Each remote duplication query from the sender and each duplication answer from the receiver introduce at least one round-trip time (RTT) of extra latency.

Disk Access Overhead. When the deduplication metadata are too large to fit into the RAM at the receiver, each disk look-up or I/O operation incurs an expensive latency of about 10 ms according to [5].

Protocol Related Overhead. The terminations and establishments of TCP connections for sparsely distributed duplication query and answer packets, as well as for the transmissions of non-duplicate chunks, also cause time overhead.

Besides the time overheads, there are also inevitable space overheads affecting the bandwidth saving efficiency. These include the duplication query/answer packets used to determine whether the chunks divided at the sender are duplicates at the receiver, along with the metadata produced at the sender and required at the receiver to reconstruct the original files from the deduplicated data.

To relieve the disk access bottleneck, Data Domain [5] utilized an in-RAM Bloom filter to avoid unnecessary disk look-up operations for the non-duplicate chunks, and an in-RAM cache exploiting data locality to avoid disk look-up operations for subsequent duplicate chunks. A similar in-RAM sparse index data structure was also used in Sparse Indexing [6] to reduce the number of disk look-up operations. To minimize the protocol related overheads, multiple TCP connections were utilized in building a high performance deduplication system [7], in cooperation with the existing methods for avoiding the disk bottleneck.
However, the number of remote duplication queries and answers in each single deduplication thread still needs to be reduced for further throughput optimization, especially in the wide area network (WAN) environment, where the end-to-end latency is usually much larger than that in the local area network (LAN).

The primary problem we address in this paper is to reduce the duplication query/answer related time overhead by introducing a metadata feedback mechanism. With the chosen metadata piggy-backed from the receiver, the sender first conducts duplication identification locally, and only the remaining hash values, for which no duplicates are found at the sender, are sent to the receiver for remote duplication query. With inherent data locality preserved, each metadata feedback is a sequence of hash indices containing the hash values and the corresponding address information of consecutive deduplicated chunks, for which reason the metadata feedback is also called the manifest feedback in this paper.

The second problem we are concerned with is to harness, by metadata utilization, the metadata related disk I/O operations at the receiver as well as the bandwidth overheads of metadata feedbacks. In order to reduce the bytes of each metadata feedback, a hysteresis hash re-chunking based hash granularity adaptation method is introduced. A deduplicated file is initially indexed with an alternation of small and large granularity hash values. To conduct deduplication with hash values of different granularities, a collaborative hit and matching extension deduplication framework across the network is used at both the sender and the receiver. When an incoming hash value is found identical to an existing hash value, a matching extension process is performed by comparing against the hash values before and after the hit hash value until a hash mismatch is encountered.
When the chunk represented by the mismatched hash value, which is usually of a large granularity, contains a boundary between the incoming duplicate and non-duplicate chunks, the mismatched hash value is re-chunked into at most three new hash values of smaller granularity.

In summary, we present a data deduplication based file communication system across WAN with metadata feedback and metadata utilization (MFMU) for an improved trade-off between the bandwidth saving efficiency and the file communication throughput. The traditional deduplication based file communication model and the MFMU model are compared in Fig.1. The kernel idea of MFMU is to use metadata feedback and metadata utilization to trade more computation overhead for fewer remote duplication query operations at the sender and fewer disk I/O operations at the receiver, both of which are much more expensive. The most widely used baseline LBFS deduplication based file communication solution and the state-of-the-art Bimodal chunking solution are used in the experimental evaluation of MFMU.

J. Comput. Sci. & Technol., May 2016, Vol.31, No.3

Fig.1. (a) Traditional model and (b) MFMU model. Dup.: Duplication; Dedup.: Deduplication.

The contributions of this paper include the following.

A metadata feedback method is proposed to reduce the number of remote duplication query/answer operations. When an old hash value is hit by an incoming duplicate hash value, the neighboring hash values are piggy-backed on the duplication answer packets and sent back to the sender, to avoid the remote duplication query/answer related time and space overheads for subsequent possible duplicate chunks at the sender, thereby accelerating the data deduplication process and improving the file communication throughput.

A hysteresis hash re-chunking based hash granularity adaptation method is proposed to save the bytes of metadata required to index the deduplicated data. Over each non-duplicate data slice, which consists of a sequence of consecutive non-duplicate chunks, a few hash values of small granularity are uniformly sampled for subsequent duplication detection, and the hash values between two neighboring sampled hash values are initially merged into a hash value of large granularity. When and only when the data represented by a merged hash value is found straddling duplicate and non-duplicate chunks, the merged hash value is very conservatively re-chunked into at most three consecutive hash values.

A hit and matching extension framework is proposed for data deduplication with hash values of different granularities. The hash values of small granularity are used for duplication detection first. When a duplicate hash value of small granularity is hit, the hash comparison granularity is enlarged to exploit the data locality in the matching extension process.

A theoretical analysis is conducted on the lower bound of the metadata size required for a certain duplication elimination efficiency, together with a comparison of the metadata saving efficiency among the proposed hysteresis hash re-chunking method, the baseline content defined chunking (CDC [1], used in LBFS [1]) and the existing Bimodal [8] chunking algorithm.

The rest of this paper is organized as follows.
The motivation is presented in Section 2. The design and the implementation of the proposed MFMU system are described in detail in Section 3. Theoretical analysis is given in Section 4, with experimental results in Section 5. The discussions and related work are in Section 6 and Section 7 respectively. Section 8 concludes this paper and gives the future work.

2 Motivation

In data deduplication between two geographically distributed network nodes, when no prior knowledge is available, the duplication query/answer related time and space overheads for the non-duplicate chunks are inevitable, because the sender cannot confirm that a chunk is non-duplicate without confirmation from the receiver. In contrast, the duplication query/answer operations can be avoided for the duplicate chunks that are identified at the sender with metadata feedbacks that preserve data locality.

For a better understanding of the data locality information in data, Fig.2 shows the distribution of the numbers and contained bytes of duplicate data slices identified with different numbers of consecutive duplicate chunks, when the expected chunk size (ECS) was set to 8 KB and 16 KB, in a real-world dataset of 1.0 TB disk images. The disk images were collected from 10 PCs used by engineers running the Windows, Linux or Mac operating system with the NTFS, FAT, FAT32, Ext3, Ext4 or HFS+ file system, including the user files as well as the system files. Various real-world applications were run on these PCs. The user-generated files included documents, source code, pictures, and binary executable files, but not video files, as we considered there to be few duplicates within video files that had already been compressed with the H.264 video compression standard.

Fig.2. Distribution of numbers and contained bytes of duplicate data slices identified in a real-world dataset of 1.0 TB disk images. Horizontal axis: upper bound of the length of duplicate data slices (in chunks). Vertical axes: percentage of overall duplicate data slices and percentage of overall duplicate bytes. Series: number of duplicate data slices and number of duplicate bytes, each for ECS = 8 KB and ECS = 16 KB.

From Fig.2, we observed that about 80% of the duplicate bytes were concentrated in the 5% of coarse-grained duplicate data slices which were longer than 20 chunks. When the hash indices hit by the coarse-grained duplicates are piggy-backed to the sender, 80% of the duplicates can be identified at the sender, with the corresponding duplication query/answer related overheads saved.

When a single hash granularity is used in hash comparison based data deduplication, the smaller the hash granularity, the more redundancy can be identified and removed, at the cost of more hash values and more related disk I/O operations. The distribution of the inherent duplicates of the input dataset determines the optimal hashing scheme, in which adaptive hash granularity is used to identify all the duplicates with the fewest hash values and related disk I/O operations. Intuitively, the fine-grained duplicates are indexed with hash values of small granularity and the coarse-grained duplicates are indexed with hash values of large granularity. Fig.3 gives an example of the optimal hashing scheme for a dataset of three files when full prior knowledge of the duplication distribution is available.

Fig.3. Optimal hashing scheme for a dataset of three files when full prior knowledge of duplication distribution is available. Each chunk is represented with a hash value.

When file 1 is the only file to process, the whole of file 1 should be a single chunk. If there is a second file, file 2, that matches slice 1 of file 1, file 1 should be re-chunked into two chunks, namely slice 1 and slice 2. Similarly, if there is a third file, file 3, which matches slice 3 of file 2, file 2 should be re-chunked into three chunks, namely slice 3, slice 4 and slice 1. Only 5 hash values are used to eliminate all the duplicates in this example.
When no prior knowledge is available, the duplication distribution of the input dataset is obtained during the data deduplication process, and a sampling approach is used to create sparse hooks on the detected non-duplicate data slices. Based on the hit hooks and the content overlaps between data slices, we propose the hysteresis hash re-chunking based hash granularity adaptation method. Multiple consecutive non-duplicate hash values of small granularity between two neighboring sampled hooks are initially merged into a single hash value of large granularity. When and only when a boundary between duplicate and non-duplicate data is encountered within the data block represented by the merged hash value, the merged hash value is re-chunked into at most three hash values of smaller granularity. The sparsely sampled hooks help reduce the metadata related disk I/O operations, and the size-reduced manifests help reduce the bandwidth related overheads.

3 Design and Implementation

In this paper, the terms sender and receiver refer to the sender and the receiver of a file to be transferred from one party to another, as opposed to the sender and the receiver of the various messages and other information required to facilitate this file transfer. The Rabin fingerprint algorithm [3] and the SHA-1 hash algorithm are used for data deduplication.

3.1 System and Metadata Overview

The organization of the data deduplication system across WAN is shown in Fig.4. We assume that the connections between the sender and the receiver use the TCP/IP protocol. In order to avoid frequent initiations and terminations of TCP connections, a Packaging module is introduced at each end of the connection to combine small, temporally and sparsely distributed data chunks and messages into a big data packet which is carried in a single TCP session.
There are usually four types of metadata utilized in a typical deduplication system across the network such as the one shown in Fig.4: the metadata produced for duplication identification and elimination (called deduplication metadata henceforth); the metadata produced for file restoration from deduplicated data; the metadata required for storage management, e.g., the inode data structure in a Unix-style file system; and the communication metadata exchanged between the sender and the receiver, e.g., the duplication query/answer packets and manifest feedbacks. These are summarized in Table 1.

All the non-duplicate chunks belonging to an input file are coalesced and written together into a newly created deduplicated file at the receiver, called a DiskChunk (as opposed to the chunk divided using the Rabin fingerprint algorithm) and named after the SHA-1 hash value of the first non-duplicate chunk, in order to reduce the file fragmentation caused by data deduplication. The metadata required for file restoration of an input file is a sequence of pointers, each of which points to a deduplicated data block within the DiskChunks. This sequence is called the FileManifest and is stored in a file named after the original file name in a separate file directory.

Fig.4. Data deduplication system across WAN.

Table 1. Metadata Defined and Their Functionalities in a Typical Deduplication System Across the Network

Metadata Defined                                   Functionality
Hooks and manifests                                Dup. identification
FileManifests                                      Data restoration
Inodes                                             Storage management
Dup. query/answer packets, manifest feedbacks      Dedup. communication

The deduplication metadata include the manifests and the hooks. Consistent with the terminology used in Sparse Indexing [6], a manifest is a sequence of hash indices of adaptive granularities calculated over the consecutive data blocks within a DiskChunk, stored in a file named after the first hash value of the sequence. A hook is an SHA-1 hash value of small granularity sampled from a manifest, stored as a content addressable file named after the sampled hash value and containing the entrance of the manifest. The hook (also called an anchor in [9]) is used for duplication detection, and the manifest (also called a recipe in [10]) is used for preserving and exploiting data locality. In the Data Domain [5] system, the hooks are called segment descriptors, which are stored together in the metadata section of a container that works as the manifest. In this paper, a hash index of a data block refers to the hash value and the address information of the data block.
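As a rough illustration of how compact a hash index can be, the sketch below serializes two records using the field widths listed in Fig.5: a 29-byte manifest entry (20-byte SHA-1, 8-byte byte size, 1-byte hook flag) and the 33-byte in-chunk matching context (20-byte manifest name, 8-byte offset, 1-byte direction, 4-byte index). The grouping of the first three fields into one manifest entry, the little-endian unpadded layout, and the names `MANIFEST_ENTRY`, `MATCH_CONTEXT`, `pack_manifest_entry`, and `unpack_manifest_entry` are assumptions made for this example only.

```python
import hashlib
import struct

# Assumed little-endian, unpadded layouts; only the field widths come from Fig.5.
MANIFEST_ENTRY = struct.Struct("<20sQB")    # SHA-1, byte size, hook flag
MATCH_CONTEXT = struct.Struct("<20sQBI")    # manifest name, offset, direction, index


def pack_manifest_entry(block: bytes, is_hook: bool) -> bytes:
    """Serialize one hash index for a data block."""
    return MANIFEST_ENTRY.pack(hashlib.sha1(block).digest(), len(block), int(is_hook))


def unpack_manifest_entry(raw: bytes):
    """Inverse of pack_manifest_entry: (digest, byte size, is_hook)."""
    digest, size, flag = MANIFEST_ENTRY.unpack(raw)
    return digest, size, bool(flag)


raw_entry = pack_manifest_entry(b"example block", is_hook=True)
digest, size, is_hook = unpack_manifest_entry(raw_entry)
```

With these widths, a manifest of a few hundred entries occupies only a few kilobytes, which gives a sense of the size of a manifest feedback.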
Since the receiver cannot see the duplicates identified at the sender, while the sender can see all the duplicates found by either the sender or the receiver, the FileManifests and the DiskChunks are produced at the sender and then transferred to the receiver. To reduce the bandwidth overhead, the manifests and hooks are produced at the receiver from the DiskChunks, the FileManifests and the non-duplicate hash values with their corresponding byte sizes. As a result, the deduplication-related bandwidth overheads include the duplication query/answer packets, the FileManifests and the manifest feedbacks.

One input file corresponds to one FileManifest and one DiskChunk if the file is not completely duplicate. Each DiskChunk corresponds to one manifest and at least one hook. The formats of the metadata are shown in Fig.5. As an example, Fig.6 shows the FileManifest, DiskChunk, manifest, and hook metadata organized for an input file consisting of nine divided chunks, of which two chunks are detected as duplicates.

3.2 Deduplication Protocol

The complete communication protocol with collaborative deduplication and metadata management is shown in Fig.7, including the following main steps.

Step 1. The sender reads chunks from input files, calculates SHA-1 hash values over the separate chunks, and conducts data deduplication using the cached manifests. A duplication query packet is produced for a batch of hash values of the chunks detected as non-duplicate at the sender and is sent to the receiver.

Fig.5. Metadata formats. (a) FileManifest. (b) Manifest. (c) Hook. (d) Duplication query packet. (e) Duplication answer packet. Fields shown in the figure, in order: DiskChunk Name (20 Bytes SHA-1), Byte Start (8 Bytes), Byte Start (8 Bytes); SHA-1 (20 Bytes), Byte Size (8 Bytes), Hook Flag (1 Byte); Manifest Name (20 Bytes SHA-1); In-Chunk Matching Context (33 Bytes), consisting of Manifest Name (20 Bytes SHA-1), Offset in Bytes (8 Bytes), Direction (1 Byte) and Index (4 Bytes), together with SHA-1 and Byte Size (28 Bytes); Duplicate Data Slice Vector (36 Bytes), consisting of DiskChunk Name (20 Bytes SHA-1), Byte Start (8 Bytes), Start Index in Query Packet (4 Bytes) and Length in Query Packet (4 Bytes), together with the Manifest Feedback.

Fig.6. Example of metadata organization at the receiver. Chunk 5 and chunk 6 are detected as duplicates, and the remaining seven chunks are written together into a DiskChunk with a manifest and two hooks produced. The corresponding FileManifest contains three entries, one for the duplicate data slice chunk 5-6, and two for the two non-duplicate data slices chunk 1-4 and chunk 7-9 respectively.

Fig.7. Extended data deduplication protocol.

Step 2. The receiver parses the duplication query packet, and conducts data deduplication. After data deduplication, a hysteresis hash re-chunking process may be performed on the hit manifests, and the manifest last hit may be piggy-backed on the duplication answer packet which is later sent back to the sender.

Step 3. The sender parses the duplication answer packet, updates the manifest cache, and writes the confirmed non-duplicate chunks out to the DiskChunk file and the corresponding pointer lists out to the FileManifest file.

Step 4. If the data deduplication process for all the divided chunks of an input file is finished, the FileManifest and the deduplicated DiskChunk are created for transmission. Otherwise, steps 1-3 are repeated for the next query batch.

The maximal number of small granularity hash values sent to the receiver in one duplication query batch is called the query granularity. The query granularity can be set to a large value; in our implementation it was set to 100 hash values, considering the cost of the memory for caching the data being processed at the sender and the receiver.

3.3 Sender Deduplication

At the sender, a duplicate chunk can be identified with the aid of the manifests sent back from the receiver and of duplication queries across the network. All the manifest feedbacks are stored in an in-RAM cache in which each manifest is organized into a separate hash table, where the key is the SHA-1 hash value of a deduplicated data block and the value is the corresponding address and byte size of the data block.

When an existing hash value is hit by an incoming duplicate hash value, the hash values before and after the hit hash value in the manifest are utilized for duplication extension in the backward and forward matching extension processes. For convenience of illustration, we term the hash value hit in the manifest the HitHash and the incoming duplicate chunk the HitChunk. For the backward matching extension, temporary hash values of the same granularities are calculated over the buffered chunk bytes before the HitChunk and compared with the hash values before the HitHash in the manifest until a hash value mismatch is found. Similarly, in the forward matching extension, temporary hash values are calculated over the chunk bytes following the HitChunk and compared with the hash values after the HitHash in the manifest until a hash value mismatch is found.
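The backward and forward matching extension just described can be sketched as follows. The sketch assumes that a manifest is an ordered list of (SHA-1 digest, byte size) pairs and that the incoming bytes are available in one buffer; the function name `extend_match` and its parameters are hypothetical, and the in-chunk handling of the first mismatched entry is deliberately left out.

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[bytes, int]   # (SHA-1 digest, byte size) of one manifest hash index


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def extend_match(manifest: List[Entry], hit_index: int,
                 buffer: bytes, hit_start: int) -> Tuple[int, int]:
    """Return the [start, end) byte range of buffer confirmed duplicate around the hit."""
    start = hit_start
    end = hit_start + manifest[hit_index][1]
    # Backward: hash the bytes just before the match at each stored granularity.
    for digest, size in reversed(manifest[:hit_index]):
        if start - size < 0 or sha1(buffer[start - size:start]) != digest:
            break
        start -= size
    # Forward: hash the bytes just after the match at each stored granularity.
    for digest, size in manifest[hit_index + 1:]:
        if end + size > len(buffer) or sha1(buffer[end:end + size]) != digest:
            break
        end += size
    return start, end


# Stored blocks of mixed granularity, and an incoming stream that repeats blocks 1-3.
blocks = [b"a" * 5, b"b" * 8, b"c" * 3, b"d" * 8, b"e" * 6]
manifest = [(sha1(b), len(b)) for b in blocks]
incoming = b"xx" + blocks[1] + blocks[2] + blocks[3] + b"yy"
dup_range = extend_match(manifest, hit_index=2, buffer=incoming, hit_start=10)
```

Here a single hit on the 3-byte block extends to a 19-byte duplicate range without any further look-up, because the temporary hashes are computed at the granularity recorded in the manifest.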
If the byte size of the mismatched hash value in the manifest covers a boundary between incoming duplicate and non-duplicate chunks, the bytes represented by the mismatched hash value are exploited with matching extension, which we call the in-chunk matching extension process. As deduplicated data chunks are all stored at the receiver, the in-chunk matching extension processes are postponed to be finished at the receiver, with the aid of the in-chunk matching context recorded in the duplication query packet including the entrance of the manifest hit, the byte offset of the mismatched hash value in the manifest, the index of the mismatched hash value in the query batch, and the mismatched direction of the postponed in-chunk matching extension (see Fig.5). Each duplication query packet corresponds to a duplication answer packet in which the duplicate hash values of the query packet would be indicated using a sequence of vectors, each of which contains the offset and length of a duplicate data slice (see Fig.5). When a manifest is found piggy-backed on the duplication answer packet, the old version of the manifest with the same name would be replaced. If the manifest feedback is new, it would be stored in the manifest cache following the LRU caching policy, with expired manifests discarded directly by the sender. The manifest feedback is the manifest hit at the receiver by the last duplicate hash value of the duplication query packet. The backward matching extension triggered by the last duplicate hash value has been conducted at the receiver, and the corresponding forward matching extension would be continued at the sender, because of the mismatch of the hash granularities between the hash values in the manifest and the duplication query packet. 
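The manifest cache behaviour described above (one hash table per manifest, replacement of an older version with the same name, and LRU eviction) might be organized along the following lines. This is a sketch under assumptions: the class name `ManifestCache`, its methods, and the capacity counted in manifests rather than bytes are all illustrative choices.

```python
from collections import OrderedDict


class ManifestCache:
    """In-RAM cache of manifest feedbacks with least-recently-used eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity            # assumed to be a number of manifests
        self._manifests = OrderedDict()     # name -> {digest: (address, byte size)}

    def put(self, name, table):
        self._manifests.pop(name, None)     # a newer feedback replaces the old version
        self._manifests[name] = table
        while len(self._manifests) > self.capacity:
            self._manifests.popitem(last=False)   # discard the least recently used

    def lookup(self, digest):
        """Return (manifest name, (address, byte size)) or None if not cached."""
        for name, table in list(self._manifests.items()):
            if digest in table:
                self._manifests.move_to_end(name)   # a hit refreshes the manifest
                return name, table[digest]
        return None


cache = ManifestCache(capacity=2)
cache.put("m1", {b"h1": (0, 8192)})
cache.put("m2", {b"h2": (0, 4096)})
hit = cache.lookup(b"h1")            # refreshes m1, so m2 becomes the eviction victim
cache.put("m3", {b"h3": (0, 2048)})  # exceeds capacity: m2 is discarded
```

A miss in this cache is what triggers a remote duplication query, so the eviction policy directly affects the number of round trips.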
At the sender, for the forward matching extension, temporary hash values of the same granularities are calculated over the incoming chunk bytes and compared with the hash values after the HitHash in the manifest until a hash value mismatch is encountered. Using 10 incoming chunks, Fig.8 gives an example of the hit and matching extension deduplication framework across WAN from the perspective of the sender. The incoming chunk 6 first hits a cached manifest at hash 6, then chunk 2 to chunk 5 are identified as duplicates via backward matching extension, and chunk 7 to chunk 8 are identified as duplicates via in-chunk matching extension at the receiver.

An optimal feedback strategy is to choose the manifests that help to identify as many duplicates as possible at the sender. Assume that the size of a manifest feedback is A bytes and that the manifest helps to identify B duplicate hash values at the sender. If the metadata feedback mechanism were not introduced, these B hash values would produce x bytes for the duplication query packet and y bytes for the duplication answer packet. When A is smaller than x+y, no extra bandwidth overhead is required for the manifest feedback. If the bandwidth overheads are not considered, sending the whole set of deduplication metadata back to the sender can save all the remote duplication queries.

Fig.8. Hit and matching extension deduplication across the network from the perspective of the sender. The in-chunk matching extension for mismatched hash 9-12 and chunk 7 to chunk 10 is performed at the receiver, with the hit manifest updated by hysteresis hash re-chunking.

3.4 Receiver Deduplication

At the receiver, duplicate hash values can be detected via in-chunk matching extension and via the content addressable hooks stored on the disk. Besides duplication identification, the generation and update of deduplication metadata are also managed at the receiver.

After all the incoming hash values are parsed from the duplication query packet, the in-chunk matching extension processes for the mismatched hash values indexed at the sender are conducted first, by restoring the matching context from the entrance of the manifest hit, the byte offset of the mismatched large granularity hash value in the manifest, and the mismatched direction marked in the duplication query packet. The chunk bytes represented by the mismatched hash value in the manifest are reloaded from the disk into the RAM, and temporary hash values of the same granularities are calculated over the reloaded bytes for hash comparison until a hash value mismatch is encountered. As shown in Fig.8, the chunk bytes represented by hash 9-12, as part of the corresponding DiskChunk, are first reloaded from the disk into the RAM, and then temporary hash values of the same granularities are calculated over the reloaded chunk bytes for comparison with the incoming hash values of chunk 7 to chunk 10. When the mismatch between the temporary hash 11 and the incoming hash value of chunk 9 is found, the in-chunk matching extension process stops.
After the marked in-chunk matching contexts are processed, the remaining incoming hash values are checked one by one using the in-RAM manifest cache, the Bloom filter and the on-disk hooks. Except for the last duplicate hash value, both the backward and the forward in-chunk matching extension processes are performed for the hash values which find no duplicates and which surround the newly detected duplicate hash values. For the last duplicate hash value, only the backward matching extension is performed at the receiver, and the forward matching extension is postponed to be finished at the sender with the manifest feedback.

Fig.9 gives an example of the deduplication process conducted at the receiver. Incoming hash 3 and hash 8 hit manifest 1 and manifest 2 respectively, with hash 2, hash 4 and hash 7 identified as duplicates via in-chunk matching extension. The updated manifest 2 is sent back to the sender for subsequent deduplication.

The metadata utilization mechanism is performed on the hash value sequence of each manifest and includes two processes, namely sample based hash merging and hysteresis hash re-chunking. Each in-chunk matching extension process requires one expensive disk I/O operation for reloading the chunk bytes from the disk into the RAM, and is triggered by the mismatch of a large granularity hash comparison. A hysteresis hash re-chunking process is therefore introduced to divide the mismatched large granularity hash value into multiple small granularity hash values after each in-chunk matching extension process, in order to prevent repeated in-chunk matching extension processes from being triggered by identical duplicate data slices in the future.
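A forward in-chunk matching extension followed by re-chunking of the merged hash into at most three finer hashes (the bound stated in Section 2) can be sketched as below. The sketch assumes that the incoming hash values arrive with their byte sizes, that the reloaded block is already in memory, and that the edge block is the span at which the comparison first failed; the function name `in_chunk_extension` is hypothetical and the backward direction is omitted.

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[bytes, int]   # (SHA-1 digest, byte size)


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def in_chunk_extension(block: bytes, incoming: List[Entry]) -> Tuple[int, List[Entry]]:
    """Compare incoming hashes against a reloaded merged block.

    Returns (number of incoming chunks confirmed duplicate,
             manifest entries that replace the merged hash value).
    """
    offset, matched, edge_size = 0, 0, 0
    for digest, size in incoming:
        piece = block[offset:offset + size]
        if len(piece) < size or sha1(piece) != digest:
            edge_size = len(piece)        # span where the comparison failed
            break
        offset += size
        matched += 1
    if edge_size == 0:                    # no mismatch inside the block: keep it merged
        return matched, [(sha1(block), len(block))]
    parts = [block[:offset],                        # data before the edge block
             block[offset:offset + edge_size],      # the edge block itself
             block[offset + edge_size:]]            # data after the edge block
    return matched, [(sha1(p), len(p)) for p in parts if p]


# A merged block covering four 10-byte chunks; the third incoming chunk differs.
block = b"A" * 10 + b"B" * 10 + b"C" * 10 + b"D" * 10
incoming = [(sha1(b"A" * 10), 10), (sha1(b"B" * 10), 10),
            (sha1(b"X" * 10), 10), (sha1(b"D" * 10), 10)]
matched, new_entries = in_chunk_extension(block, incoming)
```

In this run two incoming chunks are confirmed as duplicates and the merged hash is replaced by three entries of 20, 10 and 10 bytes, which mirrors the hash 9-12 example of Fig.8.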
To introduce new hash values conservatively, over each reloaded chunk a large granularity hash value is divided into at most three small granularity hash values: the first is a mismatched Edge-Hash representing an EdgeBlock which does not cover a chunk boundary, and the other two represent the data blocks before and after the EdgeBlock. With the Edge-Hash created, the matching extension process for subsequent identical duplicate data slices is stopped by a hash comparison failure at the Edge-Hash with no in-chunk matching extension process triggered, because the Edge-Hash does not contain a chunk boundary. As examples, in Fig.8, hash 9-12 is re-chunked into three new hash values in which hash 11 is the created Edge-Hash; in Fig.9, hash 2-5 and hash 7-10 of manifest 1 are each re-chunked into three new hash values, where hash 4 and hash 8 are the two created Edge-Hashes, and hash 4-5 of manifest 2 is re-chunked into two new hash values, where hash 4 is the created Edge-Hash.

Fig.9. Hit and matching extension across the network from the perspective of the receiver. The updated manifest 2 is sent back to the sender where the forward matching extension process would be continued.

To reduce the manifest size, a sample based hash merging process is used to adaptively enlarge the granularity of chosen hash values when a manifest is initially produced for a Disk and a received FileManifest. Each non-duplicate data slice, constructed by a sequence of consecutive non-duplicate chunks within the Disk, is first parsed according to the entries recorded in the FileManifest. Over each non-duplicate data slice, the hooks are uniformly sampled from the small granularity SHA-1 hash values of the corresponding chunk sequence using a preset sample period termed the sample distance, while the SHA-1 hash values between two neighboring hooks are merged into a single SHA-1 hash value calculated over the bytes represented by the original multiple hash values. At most two hash values are conservatively and infrequently added in a hysteresis hash re-chunking process, which is conducted only when an in-chunk matching extension process is finished.
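The two metadata utilization processes described above can be sketched together. The sketch is a hedged illustration rather than the authors' code: the names `merge_slice`, `rechunk` and `Entry`, the entry layout, and the way the EdgeBlock offsets are passed in are all assumptions made for this example.

```python
import hashlib
from typing import List, Tuple

# Hypothetical manifest entry: (kind, byte length, SHA-1 digest).
Entry = Tuple[str, int, bytes]


def _entry(kind: str, data: bytes) -> Entry:
    return (kind, len(data), hashlib.sha1(data).digest())


def merge_slice(chunks: List[bytes], sample_distance: int) -> List[Entry]:
    """Sample based hash merging over ONE non-duplicate data slice.

    Every `sample_distance`-th chunk keeps its own hash as a hook; the
    chunks between two neighboring hooks are covered by a single hash
    calculated over their concatenated bytes.
    """
    entries: List[Entry] = []
    run: List[bytes] = []
    for i, chunk in enumerate(chunks):
        if i % sample_distance == 0:
            if run:
                entries.append(_entry("merged", b"".join(run)))
                run = []
            entries.append(_entry("hook", chunk))
        else:
            run.append(chunk)
    if run:
        entries.append(_entry("merged", b"".join(run)))
    return entries


def rechunk(block: bytes, edge_start: int, edge_end: int) -> List[Entry]:
    """Hysteresis re-chunking of one merged block into at most three entries.

    `edge_start` and `edge_end` delimit the EdgeBlock found by a finished
    in-chunk matching extension (how they are found is outside this sketch).
    """
    if not 0 <= edge_start < edge_end <= len(block):
        raise ValueError("EdgeBlock must lie inside the merged block")
    parts: List[Entry] = []
    if edge_start > 0:
        parts.append(_entry("data", block[:edge_start]))
    parts.append(_entry("edge", block[edge_start:edge_end]))
    if edge_end < len(block):
        parts.append(_entry("data", block[edge_end:]))
    return parts
```

Merging is applied per non-duplicate data slice, so a merged hash never straddles two slices, and re-chunking adds at most two entries to a manifest each time it runs.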
Each non-duplicate data slice would trigger at most two in-chunk matching extension processes, one in the forward and one in the backward direction. The metadata harnessing efficiency produced by the sample based hash merging processes would therefore not be significantly affected when the non-duplicates are concentrated in a relatively small number of non-duplicate data slices. Because each merged SHA-1 hash value represents a consecutive byte sequence of the input file, and the SHA-1 hash values generated from a non-duplicate data slice are ordered in accordance with the order in which the corresponding data blocks appear in the input file, the sample based hash merging process does not destroy the data locality within a manifest.

In our preliminary work [11], the sample based hash merging process was conducted over the entire Disk rather than over each independent non-duplicate data slice. The data block represented by a merged hash value could straddle two neighboring non-duplicate data slices, and a short non-duplicate data slice could be completely concealed in a merged hash value of large granularity. As a result, a proportion of future duplicates could not be found, and the duplication elimination ratio dropped by about 5% in our experimental evaluation. In the example shown in Fig.10, chunk 1 to chunk 5 and chunk 90 to chunk 95 are two parsed non-duplicate data slices. Over the first non-duplicate data slice, hash 1 is sampled as a hook and the hash values for chunk 2 to chunk 5 are merged into hash 2-5. Over the second non-duplicate data slice, hash 90 and hash 95 are sampled as two hooks, with the hash values for chunk 91 to chunk 94 merged into hash 91-94.

Fig.10. Sample based hash merging used for size reduction while initially producing a manifest at the receiver. The sample distance is set to 5 hash values.

4 Analysis

4.1 Basic Definitions and Claims

Definition 1. A duplicate data slice is the longest possible byte sequence containing a certain byte in a file detected as being identical to a deduplicated byte sequence in an older file.

Definition 2. A non-duplicate data slice is the longest possible byte sequence containing a certain byte in a file detected as being non-duplicate, interrupted by a duplicate data slice or the end of the file it belongs to.

Definition 3. A referenced data slice is a non-duplicate byte sequence in a file with at least one duplicate data slice detected to be identical to it.

Definition 4. Given a referenced data slice set $S$, a referenced data slice cluster, denoted by $C$, is a subset of $S$ such that, $\forall E \in C$: 1) $E$ is the only element of $C$, or $\exists E' \in C$, $E' \neq E$ and $E \cap E' \neq \Lambda$; 2) $\forall E'' \in S - C$, $E \cap E''$ is $\Lambda$.

Claim 1. Given a dataset, after deduplication, the byte sequence of the dataset is divided into a duplicate data slice set and a non-duplicate data slice set, with a referenced data slice set corresponding to the duplicate data slice set.

Claim 2. Given a dataset, a hash comparison based data deduplication algorithm with $D$ duplicate chunks and $N$ non-duplicate chunks produced, and a given query granularity $Q$, a lower bound on the number of the duplication queries required by the sender is $(N+D)/Q$ when no metadata feedback is utilized, and $(N+D-D')/Q$ when $D'$ duplicate chunks are identified at the sender with the metadata feedback utilized. In the best case, $D'$ is $D-L$, and then the fewest duplication queries are $(N+L)/Q$.

Claim 3.
Given a dataset, a duplicate data slice set, and a corresponding referenced data slice set, for any hash comparison based deduplication algorithm, a lower bound on the number of the hash indices required in manifests is the number of the clusters of the referenced data slice set.

Claim 4. Given a dataset and a referenced data slice set, assume the bytes of the referenced data slice set are indexed using $M$ separate manifests: 1) the number of the clusters of the referenced data slice set is smaller than $M$; 2) for any hash comparison based deduplication algorithm, a lower bound on the number of the hooks required is $M$.

Claim 5. Given a dataset, $L$ duplicate data slices and $K$ non-duplicate data slices, a lower bound on the number of the entries in FileManifests is $L+K$.

Claim 6. Given a manifest feedback to the sender from the receiver, corresponding to a Disk which is constructed by $K$ non-duplicate data slices of $N$ non-duplicate chunks, after sample based hash merging, the bandwidth overhead of the manifest feedback is reduced from $N$ hash values to $N/SD + K$ hash values, when the sample distance is set to $SD$ hash values.

The above claims on the lower bounds can be easily proved by definitions and contradictions. If there existed a number smaller than the lower bound, the given conditions on duplicate data slices and non-duplicate data slices could not be achieved.

4.2 Comparative Analysis

In LBFS [1], where the content defined chunking (CDC) algorithm without metadata harnessing is used, given an expected chunk size $ECS$, each separate chunk is roughly $ECS$ bytes and indexed by at least one SHA-1 hash value. For a deduplicated dataset of $S$ bytes, the hash values required to index this non-duplicate dataset occupy about $20 \times S/ECS$ bytes.

Considering metadata harnessing, the Bimodal [8] chunking algorithm first divides the input files into big chunks, and conducts duplication detection for such big chunks. The non-duplicate big chunks adjacent to duplicate chunks (termed transition points) are then re-chunked into small chunks, followed by the deduplication process on the smaller chunks. Three drawbacks exist in the Bimodal algorithm: 1) the duplicates within the big chunks that are not re-chunked into small chunks cannot be found; 2) inaccurate re-chunkings of big chunks may lead to more than three small chunks produced; 3) before re-chunking a non-duplicate big chunk, an unnecessary duplication query for the big chunk is introduced, affecting the deduplication throughput.

Different from the "big chunk first, small chunk second" deduplication framework of Bimodal, the metadata utilization mechanism of MFMU performs duplication detection for all the hash values of small granularity before they are selectively merged into hash values of large granularity, and re-chunks each merged hash value into at most three hash values based on the hysteresis mechanism when the data block represented by the merged hash value is found straddling duplicate and non-duplicate chunks in the future.

Given a dataset, an $ECS$, and a duplicate data slice set produced by LBFS, assuming that the shortest duplicate data slice contains $SD$ consecutive chunks, when the sample distance used in MFMU is set to $SD$ hash values, each duplicate data slice produced by LBFS can be identified by MFMU. Because each referenced data slice of a duplicate data slice generates a hook in MFMU, each duplicate data slice can be found by hitting a hook and conducting the following bidirectional matching extension process. We assume the same duplicate data slice set can also be produced by Bimodal, when the expected sizes of the small chunks and the big chunks are set to $ECS$ and $BC \times ECS$ respectively. Since each duplicate data slice identified by Bimodal is not shorter than $BC$ small chunks, $BC$ should not be greater than $SD$.
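As a numeric illustration of Claim 2 and Claim 6 from Section 4.1, the following sketch evaluates the two bounds. The function names are hypothetical, and rounding the query count up to whole query packets is an assumption of this sketch; the claims themselves are stated as plain quotients.

```python
def min_queries(n: int, d: int, q: int, d_at_sender: int = 0) -> int:
    """Claim 2: lower bound (N + D - D') / Q on duplication queries.

    n: non-duplicate chunks (N), d: duplicate chunks (D),
    q: query granularity (Q), d_at_sender: duplicates already identified
    at the sender through metadata feedback (D').
    The result is rounded up to whole queries (assumption of this sketch).
    """
    if q <= 0 or not 0 <= d_at_sender <= d:
        raise ValueError("need Q > 0 and 0 <= D' <= D")
    remaining = n + d - d_at_sender
    return (remaining + q - 1) // q


def feedback_hashes(n: int, k: int, sd: int) -> float:
    """Claim 6: hash values in a manifest feedback after merging, N/SD + K."""
    if sd <= 0:
        raise ValueError("sample distance must be positive")
    return n / sd + k
```

With N = 1000, D = 9000 and Q = 100, the bound without feedback is 100 queries; in the best case with L = 50 duplicate data slices (D' = D - L = 8950), it drops to 11 queries.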
Assuming that the same duplicate data slice set, non-duplicate dataset, and referenced data slice set are produced by LBFS, Bimodal, and MFMU respectively, we compare the deduplication metadata size produced and the deduplication metadata related disk I/O operations produced by the different deduplication solutions. In the comparisons, the manifest and hook data structures for data locality leveraging are assumed to be implemented in all deduplication solutions. The LBFS and Bimodal algorithms with the same metadata feedback mechanism, denoted as MF-LBFS and MF-Bimodal respectively, are also implemented to evaluate the number of duplication queries/answers reduced. Given the notations and descriptions summarized in Table 2, the number of duplication queries/answers in the best case is summarized in Table 3, and the deduplication metadata produced and the deduplication metadata related disk I/O operations in the worst case are summarized in Table 4. In MF-Bimodal and Bimodal, since each non-duplicate data slice is adjacent to two duplicate data slices, two more duplication query operations are required for the re-chunked small chunks. As a result, the number of the duplication query operations required by MF-Bimodal is (N+L)/Q+2K.

Table 2. Notations and Descriptions in Analysis

Notation   Description
F          Number of manifests (Disks)
L          Number of duplicate data slices
K          Number of non-duplicate data slices
N          Number of non-duplicate chunks
D          Number of duplicate chunks
Q          Value of query granularity
SD         Value of sample distance
BC         Value of big chunk size
ECS        Value of expected chunk size
S          Value of deduplicated data size

Table 3. Duplication Query/Answer Related Time Overheads (in the Best Case)

Solution     Number of Dup. Queries/Answers
LBFS         (N+D)/Q
MF-LBFS      (N+L)/Q
Bimodal      (N+D)/Q+2K
MF-Bimodal   (N+L)/Q+2K
MFMU         (N+L)/Q

In MFMU, each manifest entry is 29 bytes.
Each duplicate data slice triggers at most two hysteretic hash re-chunking processes with four more manifest entries introduced, at most (Q/SD+1) chunk reloads for in-chunk matching extension at both ends, and one on-disk manifest update operation. In LBFS and Bimodal, each manifest entry is 28 bytes with the unnecessary 1-byte hook flag saved. When $SD > \frac{30N}{(56BC-114)K-116L}$, MFMU produces fewer hooks and fewer manifest bytes, and when $SD > \frac{Q}{(2BC-3)K/L-2}$, MFMU requires fewer deduplication metadata related disk I/O operations, as compared with LBFS and Bimodal. In MFMU, the computation overheads for SHA-1 hash value re-calculation introduced in hash merging and hysteresis hash re-chunking can be described as $O(N \times ECS)$ and $O(2 \times L \times SD \times ECS)$ respectively, assuming one computation operation is required for one byte.

Table 4. Analytic Comparisons Among Different Deduplication Solutions

Solution                     LBFS/MF-LBFS   Bimodal/MF-Bimodal      MFMU
Deduplication metadata produced (in the worst case)
Number of hooks              N              N/BC+2K(BC-1)           N/SD+K
Bytes of manifests           28N            28N/BC+2×28K(BC-1)      2×29(N/SD+K)+4×29L
Deduplication metadata related disk I/O overheads (in the worst case)
Number of Disk outputs       F              F                       F
Number of Disk inputs        0              0                       (Q/SD+1)×L
Number of hook outputs       N              N/BC+2K(BC-1)           N/SD+K
Number of hook inputs        L              L                       L
Number of manifest outputs   F              F                       F+L
Number of manifest inputs    L              L                       L
Summary                      2(F+L)+N       2(F+L)+N/BC+2K(BC-1)    2F+(Q/SD+4)L+N/SD+K

5 Experimental Results

Experiments were designed to evaluate the following for the LBFS, Bimodal and MFMU deduplication algorithms: metadata feedback efficiency, deduplication throughput, metadata utilization efficiency, bandwidth saving ratio, and RAM overhead for metadata feedbacks.

5.1 Experimental Setup, Dataset and Metrics

We built a test bed as shown in Fig.11, using the same hardware and software configurations listed in Table 5 at both the sender and the receiver. The sender and the receiver were connected through a 100 Mbps symmetric link with a Linktropy network emulator² hardware, which was introduced to simulate various WAN conditions of different RTTs and packet loss rates. We used two SATA 3.0 connected hard disks for the implementation, one for storing the input dataset at the sender, and the other for storing the output deduplicated dataset at the receiver.

Fig.11. Layout of the experimental test bed.

The real-world test dataset was 500 GB of disk images, named WorkGroup, described in the motivation section. The test dataset was representative of applications such as remote incremental backups of file systems across WAN.

In our experimental comparisons, the deduplication related metadata include the bytes contained in the manifest, the hook and the FileManifest files, along with the extra inodes for file storage management. Since each input file corresponds to a FileManifest file, the extra inodes refer to the inodes for the manifest, the hook and the Disk files. To quantitatively evaluate the size of the metadata, we assumed each inode was 256 bytes. The bandwidth overhead of data deduplication includes the total bytes for the FileManifests, the manifest feedbacks and the duplication query and answer packets in both directions. The deduplication throughput is defined as the ratio between the input file size and the execution time used in the whole deduplication process across the network. The deduplication ratio is defined as the ratio between the size of the detected duplicates and the input file size. The bandwidth saving ratio is defined as the ratio between the total size of the detected duplicates minus the bandwidth overhead, and the input file size.

² Linktropy. Apr
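The metric definitions of Section 5.1 translate directly into a few formulas. The sketch below restates them in code; the function and parameter names are hypothetical, and only the 256-byte inode size comes from the text.

```python
def deduplication_throughput(input_bytes: float, seconds: float) -> float:
    """Input file size divided by the end-to-end deduplication time."""
    return input_bytes / seconds


def deduplication_ratio(duplicate_bytes: float, input_bytes: float) -> float:
    """Size of the detected duplicates divided by the input file size."""
    return duplicate_bytes / input_bytes


def bandwidth_saving_ratio(duplicate_bytes: float,
                           overhead_bytes: float,
                           input_bytes: float) -> float:
    """(detected duplicates - bandwidth overhead) / input file size.

    `overhead_bytes` covers the FileManifests, the manifest feedbacks and
    the duplication query and answer packets in both directions.
    """
    return (duplicate_bytes - overhead_bytes) / input_bytes


def metadata_bytes(manifest_bytes: int, hook_bytes: int,
                   filemanifest_bytes: int, extra_inodes: int,
                   inode_size: int = 256) -> int:
    """Deduplication related metadata, counting each extra inode as 256 bytes."""
    return (manifest_bytes + hook_bytes + filemanifest_bytes
            + extra_inodes * inode_size)
```

For example, an input of 100 bytes with 60 duplicate bytes detected and 10 bytes of bandwidth overhead gives a deduplication ratio of 0.6 and a bandwidth saving ratio of 0.5.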

Table 5. Hardware and Software Configurations of the Experimental Test Bed

CPU                      Quad-core 1.55 GHz
Memory                   8 GB
Hard disk drive          7200 RPM 3 TB
Network simulator        Linktropy mini
Operating system         Ubuntu Linux kernel
File system              Ext3
TCP                      TCP-Cubic [12]
Test dataset             500 GB disk images
Comparative algorithms   LBFS (using CDC algorithm) [1], Bimodal [8]

5.2 Comparative Solutions and Methodology

As mentioned previously, besides the baseline LBFS and Bimodal algorithms, the LBFS and Bimodal deduplication algorithms with metadata feedback, denoted as MF-LBFS and MF-Bimodal respectively, were also implemented for comparison, in order to evaluate the flexibility and efficiency of the metadata feedback mechanism of MFMU. In MF-LBFS and MF-Bimodal, the manifests most recently hit at the receiver were piggy-backed to the sender for subsequent deduplication.

For a fair comparison of the metadata harnessing efficiency between MFMU and Bimodal, when the sample distance in MFMU was set to SD, the expected sizes of the small and the big chunks in Bimodal were set to ECS and ECS × SD/2 respectively. For the same non-duplicate data slice of ECS × SD bytes, two manifest entries would be produced by both the MFMU and the Bimodal algorithms.

The sample distance in MFMU was set to 10 hash values. The size of the in-memory bloom filter and the size of the manifest cache at the receiver were set to 50 MB and 10 manifests respectively. The size of the manifest feedback cache at the sender was set to 1000 manifests. The query granularity was set to 100 hash values. ECS was set to 2 KB, 4 KB and 8 KB.

In the experiments, we first examined the deduplication throughputs, followed by the bandwidth saving ratios of the different deduplication solutions, and finally the RAM overheads of the metadata feedbacks.
5.3 Metadata Feedback

The deduplication throughput is determined by the transmission time used for the deduplicated data and related metadata, and by the deduplication execution time at the sender and the receiver. The transmission time is mainly determined by the number of duplication query/answer operations, and the deduplication execution time is mainly determined by the number of disk I/O operations at the receiver.

We first evaluated the efficiency of the metadata feedback mechanism, which is designed to identify as many duplicates as possible at the sender and to reduce the number of duplication queries at the sender. As shown in Fig.12(a), about 80% of the overall duplicates were identified with the metadata feedbacks at the sender in MF-LBFS, MF-Bimodal and MFMU, and fewer than 1% of the overall duplicates were identified at the sender in LBFS and Bimodal, where the metadata feedback mechanism was not implemented.

The numbers of the duplication query/answer operations at the sender and the receiver in the different deduplication solutions are given in Fig.12(b). Compared with LBFS, MF-LBFS and MFMU achieved an average of 17% to 35% duplication query/answer reduction for different ECS values. Since a non-duplicate big chunk may require two duplication queries/answers in Bimodal and MF-Bimodal, one for the big chunk and the other for the constituent small chunks, MF-Bimodal required 10% to 15% more duplication queries/answers than MFMU, and Bimodal without metadata feedback required the most duplication queries/answers among all the solutions.

Besides, Fig.12(c) gives the numbers of deduplication metadata related disk I/O operations at the receiver in the different deduplication solutions. Without metadata harnessing, the number of disk I/Os in LBFS and MF-LBFS was almost five times that in MFMU. As analyzed previously, each duplicate data slice may trigger Q/SD+1 in-chunk matching extensions and one extra manifest update process.
Each such process introduces one disk I/O at the receiver. Because the hysteretic hash re-chunking processes prevented identical duplicate data slices from triggering duplicate disk I/Os, the number of the disk I/Os resulting from chunk byte reloads and manifest updates was only about 20% of the number of the overall duplicate data slices detected. Since fewer hooks were produced by sample based hash merging, MFMU required about 30% fewer deduplication metadata related disk I/O operations than Bimodal and MF-Bimodal.

Fig.12. Metadata feedback efficiency and deduplication throughput evaluation. (a) Percentage of duplicates identified at the sender. (b) Duplication query and answer operations performed at the sender and the receiver. (c) Deduplication metadata related disk I/O operations performed at the receiver. (d) Deduplication throughput.

5.4 Deduplication Throughput

Our preliminary work [13], which is essentially MF-LBFS, showed that different packet loss rates at the same RTT and link capacity setting had only slight effects on the deduplication throughput comparisons. We therefore evaluated the deduplication throughputs of the different deduplication solutions under network conditions of 20 ms, 30 ms and 40 ms RTTs separately, with a fixed 0% packet loss rate. The deduplication throughput versus RTT for the different deduplication solutions is shown in Fig.12(d). As expected, different numbers of deduplication metadata related disk I/O operations and duplication query/answer operations led to different deduplication throughputs. With metadata feedbacks utilized at the sender, MF-LBFS and MF-Bimodal achieved a 20% deduplication throughput improvement, as compared with LBFS and Bimodal. Combining the efficiencies of the manifest feedback mechanism and the deduplication metadata related disk I/O reduction at the receiver, MFMU showed about a 40% deduplication throughput improvement compared with LBFS and about a 20% improvement compared with Bimodal.
5.5 Metadata Utilization

The deduplication related metadata normalized against the input file size versus ECS for the different deduplication solutions is plotted in Fig.13(a). Since MFMU performed hysteretic hash re-chunking less frequently and more conservatively, it showed much better metadata harnessing efficiency than Bimodal and MF-Bimodal. The deduplication related metadata produced by MFMU were 50% of those produced by Bimodal and MF-Bimodal respectively, consistent with the experimental results validated in our preliminary work [11]. Because the duplicates identified in MFMU via bi-directional matching extension were more spatially concentrated than those detected by the other solutions, MFMU produced the fewest bytes for FileManifests. Without metadata harnessing, LBFS and MF-LBFS produced the most deduplication metadata, i.e., hooks and manifests. The curve for the lower bound on the deduplication related metadata required by MFMU, denoted by MFMU-LB, is also plotted in Fig.13(a). Specifically, the lower bound indicated that the deduplication metadata which had actually been used for duplication detection were much smaller than the overall deduplication metadata produced by MFMU.

5.6 Bandwidth Saving

The bandwidth overheads introduced by data deduplication, normalized against the input file size, versus the ECS values of the different solutions are plotted in Fig.13(b). Without manifest feedbacks utilized at the sender, LBFS used the most duplication query/answer packets, because almost 100% of the duplicates were identified at the receiver. With the manifest feedback mechanism introduced, the duplication query/answer packets in MF-LBFS significantly decreased, but a significant bandwidth overhead was introduced for the metadata feedbacks. MFMU required more bandwidth for the duplication query packets than Bimodal and MF-Bimodal, because MFMU conducted duplication detection for all the small granularity hash values and introduced extra bytes for the in-chunk matching extension contexts produced at the sender. Bimodal and MF-Bimodal performed duplication detection only for big chunks and for the small chunks of the non-duplicate big chunks adjacent to duplicate big chunks. As a result, Bimodal and MF-Bimodal required fewer bytes for the duplication query/answer packets than MFMU.
Fig.13. Bandwidth saving evaluation (no data compression on deduplicated chunks). (a) Normalized metadata size. (b) Bandwidth overhead. (c) Deduplication ratio. (d) Bandwidth saving ratio.

The bandwidth overheads for the manifest feedbacks in MFMU and MF-Bimodal were similar, due to the similar metadata harnessing efficiencies on the manifests. As shown in Fig.13(b), the combined bandwidth overheads of MFMU were in between those of the other solutions.

The deduplication ratios versus the deduplication related metadata of the different deduplication solutions are plotted in Fig.13(c), where the deduplication related metadata are normalized against the input file size. Consistent with the results in the work of Kruus et al. [8], Bimodal and MF-Bimodal overlooked the duplicates within the big chunks which were not re-chunked into small chunks, leading to the lowest curve of deduplication ratio versus ECS. In MFMU, each non-duplicate data slice generated at least one hook for subsequent duplication detection; the duplicates within big chunks, including some of those not visible to hash comparison in LBFS and MF-LBFS, were detected via bi-directional matching extension and in-chunk matching extension; and the size of the deduplication related metadata was highly suppressed. As a result, MFMU provided a better trade-off between deduplication ratio and deduplication related metadata than LBFS and MF-LBFS, close to the lower bound denoted by MFMU-LB.

Considering the bandwidth overhead, the bandwidth saving ratio as a function of the normalized deduplication related metadata is shown in Fig.13(d). For the same metadata utilized, MF-LBFS provided a lower bandwidth saving ratio than LBFS, because the bandwidth overhead introduced by the metadata feedbacks outweighed the reduction in the bytes of the duplication query and answer packets. Bimodal and MF-Bimodal showed the worst bandwidth saving ratio versus the deduplication related metadata, because of the lowest deduplication ratios achieved.
With the high deduplication ratio maintained and the bandwidth overheads suppressed, MFMU provided the most cost-effective bandwidth saving ratio, and about a 2% improvement in the peak bandwidth saving ratio compared with LBFS. The lower bound on the bandwidth saving ratio versus the normalized deduplication related metadata, denoted by MFMU-LB, is also plotted in Fig.13(d).

5.7 RAM Overhead for Metadata Feedbacks

The manifest hit ratio of the manifest feedback cache at the sender increased as the cache capacity was enlarged, and reached the maximal value of about 80% when the manifest feedback cache was limited to 1000 manifests, consistent with the metadata cache behavior shown in [14]. When the manifest feedback cache at the sender was limited to 1000 manifests, the corresponding RAM overheads for the manifest feedbacks cached at the sender in MF-LBFS, MF-Bimodal and MFMU are compared in Fig.14, from which we can observe that a smaller ECS led to a higher RAM overhead, because of the larger size of each manifest sent back from the receiver. As the sizes of the manifests were suppressed by MFMU and MF-Bimodal with a similar SD and expected big chunk size setting, the maximal RAM space required in MFMU and MF-Bimodal was kept between 10 MB and 39 MB in both cases. Without metadata harnessing, MF-LBFS introduced a RAM overhead of 48 MB to 194 MB at the sender, almost five times the RAM space used by MFMU.

Fig.14. RAM overhead for metadata feedbacks.

6 Discussions

The distributions of the duplicate data slices within the input dataset determine the efficiency and performance of MFMU. The harnessing efficiency in metadata utilization could be validated on most types of dataset, while the corresponding overheads may vary. On a dataset where fine-grained duplicates account for a large portion of the overall duplicates, the metadata harnessing strategy may lead to a noticeable reduction in the deduplication ratio.
When the duplicate data slice is smaller than the data block represented by the query granularity, the metadata feedback may not help to identify the whole duplicate data slice at the sender. In such a case, no metadata feedback should be piggybacked to the sender. More efficient metadata feedback strategies could be introduced in MFMU to identify more duplicates at the sender with less bandwidth overhead consumed.

The goal of this paper is not to improve the bandwidth saving ratio, but rather to accelerate the data deduplication process across WAN. The deduplicated data and various intermediate data generated in the proposed system are usually compressible. Considering the overheads of maintaining reference data blocks at the sender for delta patching [2,14], local data compression using minilzo³ was also applied on every 10 MB deduplicated data block before it was transmitted from the sender to the receiver in our evaluation of the different deduplication solutions. The bandwidth could be saved further by about 27%. The corresponding latency, including the compression latency at the sender and the decompression latency at the receiver, was about 3 ms per MB of processed data.

MFMU is mainly designed and implemented for data deduplication across WAN, where the end-to-end RTT greatly affects the deduplication throughput. In the LAN environment, where the end-to-end RTT may be low enough, the duplication queries/answers saved by the metadata feedbacks may not bring a significant reduction in communication related time overheads. However, the metadata harnessing mechanism of MFMU could still accelerate the duplication detection process at the receiver, with deduplication metadata and related disk I/O operations efficiently saved.

Using a hash value of length $K$ bits to represent a data chunk for duplication identification, and assuming the hash values are independently and identically distributed, the probability that at least one hash collision occurs among $N$ data chunks can be calculated, and its upper bound is $\frac{N(N-1)}{2^{K+1}}$, which is the well-known Birthday Bound [4].
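The Birthday Bound above can be evaluated exactly with rational arithmetic. The following is a small sketch; the function name `birthday_bound` is an assumption made for this example.

```python
from fractions import Fraction


def birthday_bound(n_chunks: int, hash_bits: int) -> Fraction:
    """Upper bound N(N-1) / 2^(K+1) on the probability of any collision
    among n_chunks values of a hash_bits-bit hash, assuming independent
    and identically distributed hash values."""
    return Fraction(n_chunks * (n_chunks - 1), 2 ** (hash_bits + 1))


# 160-bit SHA-1 over 2^50 chunks.
bound = birthday_bound(2 ** 50, 160)
```

For 160-bit SHA-1 values and $2^{50}$ chunks, the bound evaluates to slightly less than $2^{-61}$.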
When the SHA-1 hash value of 160 bits is used and $N$ is $2^{50}$, the probability that a hash collision happens is below $2^{-61}$. When the chunk size is 1 KB, the SHA-1 hash value can be used to represent $2^{20}$ TB of data with a hash collision probability below $2^{-61}$. Byte-by-byte comparison used for duplication identification can thoroughly eliminate the false-positive probability, at the cost of higher computation complexity.

7 Related Work

Bandwidth Saving. This problem is about how to eliminate more duplicates within a given dataset to be transmitted, for more bandwidth saving. Rsync⁴ exploited duplicate data transmissions using a fixed-sized file chunking algorithm, while LBFS [1] used the Rabin fingerprint based content defined chunking algorithm. For further redundancy reduction, delta patching and data compression technologies were introduced to suppress the deduplicated data blocks in TAPER [2], [15] and [14]. However, these existing studies did not consider reducing the communication time overheads of the duplication query/answer operations across wide area networks. On the contrary, [14] requires more metadata exchanging steps across networks to locate an optimal reference data block for delta compression. Lin et al. [16] improved the effectiveness of traditional compressors by introducing coarse-grained data transformation in addition to data deduplication. More file chunking and hashing methods can be found in [17-20]. Black [4] explained that the probability of hash collisions in SHA-1 is negligible in practical systems. MFMU does not aim to achieve greater redundancy reduction efficiency, but guarantees that the bandwidth saving ratio is not affected by metadata harnessing.

Data Deduplication Acceleration. This problem is about how to achieve a better trade-off between deduplication ratio and deduplication speed.
Data locality preserved caching, in-RAM Bloom filters [21], in-RAM sparse indexes [6] and sampling [10] have been standard acceleration techniques for avoiding the disk lookup bottleneck at the receiver during deduplication, as done in Data Domain [5], Sparse Indexing [6], HANDS [22], iDedup [23] and Extreme Binning [10]. Guo and Efstathopoulos [7] utilized a multiple-TCP-connection-based interaction model to optimize the deduplication throughput, where the baseline LBFS protocol was used in each connection. On the basis of the existing acceleration approaches, MFMU accelerates the deduplication process across WAN by reducing the metadata related disk I/O operations at the receiver and the remote duplication query operations at the sender. In addition, MFMU can also be implemented with faster hardware devices such as solid state drives (SSDs), as done in studies such as [24-27].

Deduplication Related Metadata Saving. This problem is about how to achieve a better trade-off between deduplication ratio and metadata related overheads. Meister et al. [28] proposed an entropy coding based post-processing compression method for the FileManifests. FingerDiff [29] coalesced contiguous non-duplicate chunks, up to a maximal number, into one big chunk stored on the disk to reduce data fragmentation and the metadata required for storage management. However, FingerDiff did not harness the deduplication metadata, e.g., the hash indices used during deduplication. Bimodal [8], Sub [9] and [30] first searched for duplicates using a large chunk size, with the non-duplicate big chunks selectively processed again using a small chunk size. In these studies, an unnecessary duplication query was introduced for each non-duplicate big chunk, with a considerably high probability of missing duplicate data within the big chunks that were not re-chunked for further deduplication. MFMU replaces the "big chunk first, small chunk second" deduplication strategy with the bi-directional matching extension mechanism to avoid the unnecessary duplication queries for the non-duplicate big chunks. To maintain the deduplication ratio, MFMU performs duplication detection for all the small-granularity hash values before they are merged into a large-granularity hash value, and exploits duplicates within big data blocks via in-chunk matching extension.

Other deduplication related problems and solutions on deduplication tradeoffs, deduplication restoration, deduplication security and deduplication memory allocation optimization can be found in [31-34], respectively.

3 minilzo. Apr
4 Rsync. Apr

Spyglass [35] uses several metadata search techniques, including index partitioning, file signatures, and incremental crawling, as well as an index versioning mechanism, to achieve fast, scalable file metadata search performance. With a similar goal, SmartStore [36] uses a decentralized semantic-aware metadata organization method to limit the metadata search scope of a metadata query, and thereby to reduce the metadata query latency for large storage systems.
MFMU reduces the deduplication related time overheads mainly by reducing the metadata related disk I/O operations, which results from harnessing the size of the overall deduplication related metadata. MFMU currently does not aim to lower the metadata search latency for the metadata already stored in RAM and on the disk.

8 Conclusions

In this paper, we proposed a data deduplication system across WAN, named MFMU, which combines the metadata feedback and metadata utilization features to accelerate the data deduplication process across WAN while maintaining the bandwidth saving efficiency. In our experiments, about 35% of the duplication queries at the sender and about 80% of the disk I/O operations at the receiver were saved, leading to a 20% to 40% deduplication throughput improvement compared with the other data deduplication solutions evaluated.

In this paper, only the disk image dataset was used for experimental evaluation, and only the one-receiver-one-sender deduplication mode was discussed. In the future, more types of datasets, such as cloud storage related datasets, will be collected and tested. In addition, the deduplication mode will be extended to a one-receiver-multiple-sender mode, in which concurrency control and metadata consistency between multiple threads will be important topics.

References

[1] Muthitacharoen A, Chen B, Mazières D. A low-bandwidth network file system. In Proc. the 18th ACM Symposium on Operating Systems Principles (SOSP), October 2001.
[2] Jain N, Dahlin M, Tewari R. TAPER: Tiered approach for eliminating redundancy in replica synchronization. In Proc. the 10th USENIX Conference on File and Storage Technologies (FAST), December 2005.
[3] Rabin M O. Fingerprinting by random polynomials. Technical Report, TR-15-81, Center for Research in Computing Technology, Harvard University.
[4] Black J. Compare-by-hash: A reasoned analysis. In Proc. the USENIX Annual Technical Conference (ATC), May 2006.
[5] Zhu B, Li K, Patterson H.
Avoiding the disk bottleneck in the Data Domain deduplication file system. In Proc. the 6th USENIX Conference on File and Storage Technologies (FAST), February 2008.
[6] Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proc. the 7th USENIX Conference on File and Storage Technologies (FAST), February 2009.
[7] Guo F, Efstathopoulos P. Building a high-performance deduplication system. In Proc. the USENIX Annual Technical Conference (ATC), June 2011.
[8] Kruus E, Ungureanu C, Dubnicki C. Bimodal content defined chunking for backup streams. In Proc. the 8th USENIX Conference on File and Storage Technologies (FAST), February 2010.
[9] Bartlomiej R, Lukasz H, Wojciech K, Krzysztof L, Cezary D. Anchor-driven subchunk deduplication. In Proc. the 4th Annual International Conference on Systems and Storage (SYSTOR), May 2011, pp.16:1-16:13.
[10] Bhagwat D, Eshghi K, Long D D E, Lillibridge M. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proc. the IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), September 2009.

[11] Zhou B, Wen J. Hysteresis re-chunking based metadata harnessing deduplication of disk images. In Proc. the 42nd IEEE International Conference on Parallel Processing (ICPP), October 2013.
[12] Ha S, Rhee I, Xu L. CUBIC: A new TCP-friendly high-speed TCP variant. In Proc. the ACM Operating Systems Review, Research and Developments in the Linux Kernel (SIGOPS), July 2008, Volume 42.
[13] Zhou B, Wen J. Efficient file communication via deduplication with manifest feedback. IEEE Communications Letters, 2014, 18.
[14] Shilane P, Huang M, Wallace G, Hsu W. WAN-optimized replication of backup datasets using stream-informed delta compression. Transactions on Storage, 2012, 8(4): 13:1-13:26.
[15] Park K, Ihm S, Bowman M, Pai V S. Supporting practical content-addressable caching with CZIP compression. In Proc. the USENIX Annual Technical Conference (ATC), June 2007, pp.14:1-14:14.
[16] Lin X, Lu G, Douglis F, Shilane P, Wallace G. Migratory compression: Coarse-grained data reordering to improve compressibility. In Proc. the 12th USENIX Conference on File and Storage Technologies (FAST), February 2014.
[17] Eshghi K, Tang H K. A framework for analyzing and improving content-based chunking algorithms. Technical Report, HPL R.1, Hewlett Packard Laboratories, Palo Alto.
[18] Min J, Yoon D, Won Y. Efficient deduplication techniques for modern backup operation. IEEE Transactions on Computers, 2011, 60(6).
[19] Pagh R, Rodler F F. Cuckoo hashing. Journal of Algorithms, 2004, 51(2).
[20] Fabiano C, Botelho N G, Hsu W. Memory efficient sanitization of a deduplicated storage system. In Proc. the 11th USENIX Conference on File and Storage Technologies (FAST), February 2013.
[21] Bloom B H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 1970, 13(7).
[22] Wildani A, Miller E, Rodeh O. HANDS: A heuristically arranged non-backup inline deduplication system. In Proc.
the 29th IEEE International Conference on Data Engineering (ICDE), April 2013.
[23] Srinivasan K, Bisson T, Goodson G, Voruganti K. iDedup: Latency-aware, inline data deduplication for primary storage. In Proc. the 10th USENIX Conference on File and Storage Technologies (FAST), February 2012.
[24] Debnath B, Sengupta S, Li J. Stash: Speeding up inline storage deduplication using flash memory. In Proc. the USENIX Annual Technical Conference (ATC), February 2010.
[25] Meister D, Brinkmann A. dedupv1: Improving deduplication throughput using solid state drives (SSD). In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), May.
[26] Chen F, Luo T, Zhang X. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proc. the 9th USENIX Conference on File and Storage Technologies (FAST), February 2011.
[27] Agrawal N, Prabhakaran V, Wobber T, Davis J D, Manasse M, Panigrahy R. Design tradeoffs for SSD performance. In Proc. the USENIX Annual Technical Conference (ATC), June 2008.
[28] Meister D, Brinkmann A, Süß T. File recipe compression in data deduplication systems. In Proc. the 11th USENIX Conference on File and Storage Technologies (FAST), February 2013.
[29] Bobbarjung D R, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Transactions on Storage, 2006, 2(4).
[30] Lu G, Jin Y, Du D. Frequency based chunking for data deduplication. In Proc. the IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), August 2010.
[31] Fu M, Feng D, Hua Y, He X, Chen Z, Xia W, Zhang Y, Tan Y. Design tradeoffs for data deduplication performance in backup workloads. In Proc. the 13th USENIX Conference on File and Storage Technologies (FAST), February 2015.
[32] Fu M, Feng D, Hua Y, He X, Chen Z, Xia W, Huang F, Liu Q.
Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proc. the USENIX Annual Technical Conference (ATC), June 2014.
[33] Tang Y, Yang J. Secure deduplication of general computations. In Proc. the USENIX Annual Technical Conference (ATC), July 2015.
[34] Zhang W, Yang T, Narayanasamy G, Tang H. Low-cost data deduplication for virtual machine backup in cloud storage. In Proc. the 5th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), June 2013.
[35] Leung A W, Shao M, Bisson T, Pasupathy S, Miller E L. Spyglass: Fast, scalable metadata search for large-scale storage systems. In Proc. the 7th USENIX Conference on File and Storage Technologies (FAST), February 2009.
[36] Hua Y, Jiang H, Zhu Y, Feng D, Tian L. Semantic-aware metadata organization paradigm in next-generation file systems. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(2).

Bing Zhou received his B.S. degree in computer science and technology from Nanjing University, Nanjing, in 2009, and M.S. and Ph.D. degrees in computer science and technology from Tsinghua University, Beijing, in 2012 and 2015 respectively. He is working at 2012 Lab, Huawei Technologies Co., Ltd. His research interests include data communication/deduplication and high-performance distributed systems.

Jiang-Tao Wen received his B.S., M.S., and Ph.D. degrees (with honors), all in electrical engineering, from Tsinghua University, Beijing, in 1992, 1994, and 1996, respectively. From 1996 to 1998, he was a staff research fellow at the University of California, Los Angeles (UCLA), where he conducted cutting-edge research on multimedia coding and communications. Many of his inventions there were later adopted by international standards such as H.263, MPEG, and H.264. After UCLA, he served as the principal scientist at PacketVideo Corp., the chief technical officer at Morphbius Technology Inc., the director of Video Codec Technologies at Mobilygen Corp., and a technology advisor at Ortiva Wireless and Stretch, Inc. Since 2009, he has been a professor at the Department of Computer Science and Technology, Tsinghua University, Beijing. He is a world-renowned expert in multimedia communication over hostile networks, video coding, and communications. He has authored many widely referenced papers in related fields. Products deploying technologies that he developed are currently widely used worldwide. He holds over 30 patents with numerous others pending.