Metadata Feedback and Utilization for Data Deduplication Across WAN

Zhou B, Wen JT. Metadata feedback and utilization for data deduplication across WAN. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 31(3), May 2016.

Metadata Feedback and Utilization for Data Deduplication Across WAN

Bing Zhou and Jiang-Tao Wen, Fellow, IEEE

State Key Laboratory on Intelligent Technology and Systems, Tsinghua University, Beijing, China
Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, China
Department of Computer Science and Technology, Tsinghua University, Beijing, China

Received May 11, 2015; revised January 19, 2016.

Abstract   Data deduplication for file communication across wide area network (WAN) in applications such as file synchronization and mirroring of cloud environments usually achieves significant bandwidth saving at the cost of significant time overheads of data deduplication. The time overheads include the time required for data deduplication at two geographically distributed nodes (e.g., the disk access bottleneck) and the duplication query/answer operations between the sender and the receiver, since each query or answer introduces at least one round-trip time (RTT) of latency. In this paper, we present a data deduplication system across WAN with metadata feedback and metadata utilization (MFMU), in order to harness the data deduplication related time overheads. In the proposed MFMU system, selective metadata feedbacks from the receiver to the sender are introduced to reduce the number of duplication query/answer operations. In addition, to harness the metadata related disk I/O operations at the receiver, as well as the bandwidth overhead introduced by the metadata feedbacks, a hysteresis hash re-chunking mechanism based metadata utilization component is introduced. Our experimental results demonstrated that MFMU achieved an average of 20%-40% deduplication acceleration with the bandwidth saving ratio not reduced by the metadata feedbacks, as compared with the baseline content defined chunking (CDC) used in LBFS (Low-bandwidth Network File System) and the existing state-of-the-art Bimodal chunking algorithm based data deduplication solutions.

Keywords   data deduplication, wide area network (WAN), metadata feedback, metadata utilization

1 Introduction

Elimination of redundant data transmissions across the network with the aid of the data deduplication technology has been introduced into many geographically distributed file communication systems [1-2] for bandwidth saving and end-to-end latency reduction. In a typical data deduplication based file communication process, the Rabin fingerprint algorithm [3] is first applied to calculate a sequence of fingerprints for each input file at the sender, with the data between any two neighboring fingerprints extracted as separate data chunks. After file chunking, a batch of SHA-1 hash values is calculated over the separate chunks and sent to the receiver for duplication detection. At the receiver, the received SHA-1 hash values are examined as to whether they are identical to the previously stored ones. A duplicate SHA-1 hash value identified at the receiver indicates a corresponding duplicate chunk at the sender. Only the confirmed non-duplicate chunks after duplication detection, together with the metadata for file restoration, are transmitted from the sender to the receiver. The hash collision probability of SHA-1 is extremely small and can be ignored when it is used for duplication detection [4].
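
To make the pipeline above concrete, the following Python sketch, which is ours and purely illustrative, chunks a byte stream with a simple rolling-hash boundary test standing in for the Rabin fingerprint of [3] and computes one SHA-1 value per chunk; the constants and function names are assumptions for the example, not part of any system evaluated in this paper.

import hashlib

MIN_CHUNK = 2 * 1024           # illustrative lower bound on chunk size
MAX_CHUNK = 64 * 1024          # illustrative upper bound on chunk size
BOUNDARY_MASK = (1 << 13) - 1  # expected chunk size around 8 KB

def rolling_hash(window: bytes) -> int:
    # Stand-in for the Rabin fingerprint: any hash with a roughly uniform
    # output distribution yields content-defined boundaries.
    h = 0
    for b in window:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def cdc_chunks(data: bytes, window: int = 48):
    # Cut the data wherever the windowed hash matches the boundary mask,
    # subject to the minimal and maximal chunk sizes.
    chunks, start, pos = [], 0, 0
    while pos < len(data):
        pos += 1
        length = pos - start
        boundary = (length >= MIN_CHUNK and
                    rolling_hash(data[pos - window:pos]) & BOUNDARY_MASK == 0)
        if boundary or length >= MAX_CHUNK or pos == len(data):
            chunks.append(data[start:pos])
            start = pos
    return chunks

def chunk_fingerprints(chunks):
    # One SHA-1 value per chunk; these are what the sender ships to the
    # receiver in a duplication query.
    return [hashlib.sha1(c).digest() for c in chunks]
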
(Regular Paper. This work was supported by the National Science Fund for Distinguished Young Scholars of China and the State Key Program of National Natural Science Foundation of China.)

The inherent content overlaps of the data source and the deduplication algorithm jointly determine the output bandwidth saving ratio and the corresponding time and space overheads. There are three types of time overheads during data deduplication across the network that affect the file communication throughput.

Duplication Query/Answer Overhead. Each remote duplication query from the sender and each duplication answer from the receiver introduce at least one round-trip time (RTT) of extra latency.

Disk Access Overhead. When the deduplication metadata are too large to fit into the RAM at the receiver, each disk look-up or I/O operation causes an expensive latency of about 10 ms according to [5].

Protocol Related Overhead. The terminations and establishments of TCP connections for sparsely distributed duplication query and answer packets, as well as for the non-duplicate chunk transmissions, also cause time overhead.

Besides the time overheads, there are also inevitable space overheads affecting the bandwidth saving efficiency, including the duplication query/answer packets used to determine whether the chunks divided at the sender are duplicate at the receiver, along with the metadata produced at the sender and required at the receiver to reconstruct the original files from the deduplicated data.

To relieve the disk access bottleneck, Data Domain [5] utilized an in-RAM bloom filter to avoid unnecessary disk look-up operations for the non-duplicate chunks and an in-RAM cache exploiting data locality to avoid disk look-up operations for subsequent duplicate chunks. A similar in-RAM sparse index data structure was also used in Sparse Indexing [6] to reduce the number of disk look-up operations. To minimize the protocol related overheads, multiple TCP connections were utilized while building a high performance deduplication system [7], in cooperation with the existing disk bottleneck avoiding methods. However, the number of remote duplication queries and answers in each single deduplication thread still needed to be reduced for further throughput optimization, especially in the wide area network (WAN) environment where the end-to-end latency is usually significantly larger than that in the local area network (LAN).
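
The in-RAM filtering idea summarized above for [5] can be sketched as follows; this is our own minimal illustration of fronting an on-disk fingerprint index with a Bloom filter and a data locality cache, not the Data Domain implementation, and all class and method names are hypothetical.

import hashlib

class InRamIndex:
    # Bloom filter plus a locality cache placed in front of the on-disk index.
    def __init__(self, bits: int = 8 * 1024 * 1024, hashes: int = 4):
        self.bits = bits
        self.hashes = hashes
        self.bitmap = bytearray(bits // 8)
        self.locality_cache = {}          # fingerprint -> on-disk location

    def _positions(self, fp: bytes):
        for i in range(self.hashes):
            h = int.from_bytes(hashlib.sha1(fp + bytes([i])).digest()[:8], "big")
            yield h % self.bits

    def add(self, fp: bytes, location):
        for p in self._positions(fp):
            self.bitmap[p // 8] |= 1 << (p % 8)
        self.locality_cache[fp] = location

    def maybe_duplicate(self, fp: bytes) -> bool:
        # A negative answer is definite, so the expensive disk look-up is
        # skipped for most non-duplicate chunks.
        return all(self.bitmap[p // 8] & (1 << (p % 8)) for p in self._positions(fp))

    def lookup(self, fp: bytes, disk_lookup):
        if not self.maybe_duplicate(fp):
            return None                     # certainly new: no disk I/O needed
        if fp in self.locality_cache:
            return self.locality_cache[fp]  # duplicate found without disk I/O
        return disk_lookup(fp)              # remaining case: consult the disk
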
The primary problem we address in this paper is to reduce the duplication query/answer related time overhead by introducing a metadata feedback mechanism. With the chosen metadata piggy-backed from the receiver, the sender first conducts duplication identification locally, and only the remaining hash values which cannot find duplicates at the sender are sent out to the receiver for remote duplication query. With the inherent data locality preserved, each metadata feedback is a sequence of hash indices containing the hash values and corresponding address information of consecutive deduplicated chunks, for which reason the metadata feedback is also called the manifest feedback in this paper.

The second problem we are concerned with is to harness the metadata related disk I/O operations at the receiver, as well as the bandwidth overheads of the metadata feedbacks, by metadata utilization. In order to reduce the bytes of each metadata feedback, a hysteresis hash re-chunking based hash granularity adaptation method is introduced. A deduplicated file is initially indexed with an alternation of small and large granularity hash values. To conduct deduplication with hash values of different granularities, a collaborative hit and matching extension deduplication framework across the network is used at both the sender and the receiver. When an incoming hash value is found identical to an existing hash value, a matching extension process is performed by hash comparison with the hash values before and after the hit hash value until a hash mismatch is encountered. When the chunk represented by the mismatched hash value, which is usually of a large granularity, contains a boundary between the incoming duplicate and non-duplicate chunks, the mismatched hash value is re-chunked into at most three new hash values of smaller granularity.

In summary, we present a data deduplication based file communication system across WAN with metadata feedback and metadata utilization (MFMU) for an improved trade-off between the bandwidth saving efficiency and the file communication throughput. The traditional deduplication based file communication model and the MFMU model are compared in Fig.1. The kernel idea of MFMU is to trade additional computation overheads, through metadata feedback and metadata utilization, for fewer of the much more expensive remote duplication query operations at the sender and disk I/O operations at the receiver. The most widely used baseline LBFS deduplication based file communication solution and the state-of-the-art Bimodal chunking solution are used in the experimental evaluation of MFMU.

The contributions of this paper include the following.

A metadata feedback method is proposed to reduce the number of remote duplication query/answer operations. When an old hash value is hit by an incoming duplicate hash value, the neighboring hash values are piggy-backed on the duplication answer packets and sent back to the sender, to avoid the remote

duplication query/answer related time and space overheads for subsequent possible duplicate chunks at the sender, thereby accelerating the data deduplication process and improving the file communication throughput.

Fig.1. (a) Traditional model and (b) MFMU model. Dup.: Duplication; Dedup.: Deduplication.

A hysteresis hash re-chunking based hash granularity adaptation method is proposed to save the bytes of metadata required to index the deduplicated data. Over each non-duplicate data slice, which consists of a sequence of consecutive non-duplicate chunks, a few hash values of small granularity are uniformly sampled for subsequent duplication detection, and the hash values between two neighboring sampled hash values are initially merged into a hash value of large granularity. When and only when the data represented by a merged hash value is found straddling duplicate and non-duplicate chunks, the merged hash value is very conservatively re-chunked into at most three consecutive hash values.

A hit and matching extension framework is proposed for data deduplication with hash values of different granularities. The hash values of small granularity are used for duplication detection first. When a duplicate hash value of small granularity is hit, the hash comparison granularity is enlarged to exploit the data locality in the matching extension process.

A theoretical analysis is conducted on the lower bound of the metadata size required for a certain duplication elimination efficiency, together with a comparison of the metadata saving efficiency among the proposed hysteresis hash re-chunking method, the baseline content defined chunking (CDC, used in LBFS) [1] and the existing Bimodal [8] chunking algorithm.

The rest of this paper is organized as follows. The motivation is presented in Section 2. The design and the implementation of the proposed MFMU system are described in detail in Section 3. Theoretical analysis is given in Section 4, with experimental results in Section 5. The discussions and related work are in Section 6 and Section 7 respectively. Section 8 concludes this paper and gives the future work.

2 Motivation

In data deduplication between two geographically distributed network nodes, when no a priori knowledge is available, the duplication query/answer related time and space overheads for the non-duplicate chunks are inevitable, because the sender cannot confirm that a non-duplicate chunk is non-duplicate without the confirmation by the receiver. The duplication query/answer operations for the duplicate chunks that are identified at the sender with data locality preserving metadata feedbacks, however, can be avoided.

For a better understanding of the data locality information in the data, Fig.2 shows the distribution of the numbers and contained bytes of duplicate data slices identified with different numbers of consecutive duplicate chunks when the expected chunk size (ECS) was set to 8 KB and 16 KB in a real-world dataset of 1.0 TB disk images. The disk images were collected from 10 PCs used by engineers running the Windows, Linux or Mac operating system with the NTFS, FAT, FAT32, Ext3, Ext4 or HFS+ file system, and included the user files as well as the system files. Various real-world applications were run on the collected PCs.
The user-generated files included documents, source codes, pictures, and binary executable files, but not video files, as we considered there to be few duplicates within the video files, which had been compressed with the H.264 video compression standard.

Fig.2. Distribution of the numbers and contained bytes of duplicate data slices identified in a real-world dataset of 1.0 TB disk images (x-axis: upper bound of the length of duplicate data slices in chunks; y-axes: percentage of overall duplicate data slices and percentage of overall duplicate bytes; ECS = 8 KB and 16 KB).
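
The distribution plotted in Fig.2 can be obtained by grouping consecutive duplicate chunks into duplicate data slices and histogramming the run lengths; the short sketch below shows one way such a measurement can be made and is ours, with hypothetical helper names, rather than the tooling used for the paper.

from collections import Counter

def duplicate_slice_lengths(is_duplicate):
    # is_duplicate: one boolean per chunk, in file order; a run of True values
    # is one duplicate data slice and its length is its number of chunks.
    lengths, run = [], 0
    for dup in is_duplicate:
        if dup:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

def slice_length_histogram(lengths):
    # Number of duplicate data slices per slice length; weighting each run by
    # its chunk sizes instead would give the duplicate-byte distribution.
    return Counter(lengths)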

From the figure, we observed that about 80% of the duplicate bytes were concentrated in the 5% of coarse-grained duplicate data slices which were longer than 20 chunks. When the hash indices hit by the coarse-grained duplicates were piggy-backed to the sender, 80% of the duplicates could be identified at the sender, with the corresponding duplication query/answer related overheads saved.

As we know, when a single hash granularity is used in hash comparison based data deduplication, the smaller the hash granularity, the more redundancies can be identified and removed, at the cost of more hash values and related disk I/O operations. The distribution of the inherent duplicates of the input dataset determines the optimal hashing scheme, in which an adaptive hash granularity should be utilized to identify all the duplicates with the fewest hash values and related disk I/O operations. Intuitively, the fine-grained duplicates are indexed with hash values of small granularity and the coarse-grained duplicates are indexed with hash values of large granularity. Fig.3 gives an example of the optimal hashing scheme for a dataset of three files when full a priori knowledge of the duplication distribution is available.

Fig.3. Optimal hashing scheme for a dataset of three files when full a priori knowledge of the duplication distribution is available. Each chunk is represented with a hash value.

When file 1 is the only file to process, the whole of file 1 should be a single chunk. If there is a second file 2 that matches slice 1 of file 1, file 1 should be re-chunked into two chunks, namely slice 1 and slice 2. Similarly, if there is a third file 3 which matches slice 3 of file 2, file 2 should be re-chunked into three chunks, slice 3, slice 4 and slice 1. Only 5 hash values are used to eliminate all the duplicates in this example.

When no a priori knowledge is available, the duplication distribution of the input dataset is obtained during the data deduplication process and a sampling approach is used to create sparse hooks on the detected non-duplicate data slices. According to the hit hooks and the content overlaps between data slices, we propose the hysteresis hash re-chunking based hash granularity adaptation method. Multiple consecutive non-duplicate hash values of small granularity between two neighboring sampled hooks are initially merged into a single hash value of large granularity. When and only when a boundary between duplicate and non-duplicate data within the data block represented by the merged hash value is encountered, the merged hash value is re-chunked into at most three hash values of smaller granularity. The sparsely sampled hooks help reduce the metadata related disk I/O operations and the size-reduced manifests help reduce the bandwidth related overheads.

3 Design and Implementation

In this paper, the terms sender and receiver refer to the sender and the receiver of a file to be transferred from one party to another, as opposed to the sender and the receiver of the various messages and other information required to facilitate this file transfer. The Rabin fingerprint algorithm [3] and the SHA-1 hash algorithm are used for data deduplication.

3.1 System and Metadata Overview

The data deduplication system organization across WAN is shown in Fig.4. We assume that the connections between the sender and the receiver use the TCP/IP protocol.
In order to avoid frequent initiations and terminations of TCP connections, a Packaging module is introduced at each end of the connection to combine small, temporally and sparsely distributed data chunks and messages into a big data packet which is carried in a single TCP session.

There are usually four types of metadata utilized in a typical deduplication system across the network shown in Fig.4: the metadata produced for duplication identification and elimination (called deduplication metadata henceforth), the metadata produced for file restoration from the deduplicated data, the metadata required for storage management, e.g., the inode data structure in a Unix-style file system, and the communication metadata exchanged between the sender and the receiver, e.g., the duplication query/answer packets and the manifest feedbacks, as summarized in Table 1.

All the non-duplicate chunks belonging to an input file are coalesced and written together into a newly created deduplicated file at the receiver, called a DiskChunk (as opposed to the chunk divided using the Rabin fingerprint algorithm) and named after the SHA-1 hash value of the first non-duplicate chunk, in order to reduce the file fragmentation caused by data deduplication.

Fig.4. Data deduplication system across WAN.

The metadata required for the file restoration of an input file is a sequence of pointers, each of which points to a deduplicated data block within the DiskChunks, called the FileManifest, stored in a file named after the original file name in a separate file directory.

Table 1. Metadata Defined and Their Functionalities in a Typical Deduplication System Across the Network
Hooks and manifests: duplication identification
FileManifests: data restoration
Inodes: storage management
Duplication query/answer packets, manifest feedbacks: deduplication communication

The deduplication metadata include the manifests and the hooks. Consistent with the terminologies used in Sparse Indexing [6], a manifest is a sequence of hash indices of adaptive granularities calculated over the consecutive data blocks within a DiskChunk, stored in a file named after the first hash value of the sequence. A hook is an SHA-1 hash value of small granularity sampled from a manifest, stored as a content addressable file named after the sampled hash value and containing the entrance of the manifest. The hook (also called an anchor in [9]) is used for duplication detection and the manifest (also called a recipe in [10]) is used for data locality preserving and exploiting. In the Data Domain [5] system, the hooks are called segment descriptors, which are stored together in the metadata section of a container working as the manifest. In this paper, a hash index of a data block refers to the hash value and the address information of the data block.

Since the receiver cannot see the duplicates identified at the sender while the sender can see all the duplicates found by either the sender or the receiver, the FileManifests and the DiskChunks are produced at the sender and then transferred to the receiver. To reduce the bandwidth overhead, the manifests and hooks are produced at the receiver according to the DiskChunks, the FileManifests and the non-duplicate hash values with their corresponding byte sizes. As a result, the deduplication related bandwidth overheads include the duplication query/answer packets, the FileManifests and the manifest feedbacks. One input file corresponds to one FileManifest and one DiskChunk if the file is not completely duplicate. Each DiskChunk corresponds to one manifest and at least one hook. The formats of the metadata are shown in Fig.5. As an example, Fig.6 shows the FileManifest, DiskChunk, manifest, and hook metadata organized for an input file consisting of nine divided chunks in which two chunks are detected as duplicates.

3.2 Deduplication Protocol

The complete communication protocol with collaborative deduplication and metadata management is shown in Fig.7, including the following main steps.

Step 1. The sender reads chunks from the input files, calculates SHA-1 hash values over the separate chunks, and conducts data deduplication using the cached manifests. A duplication query packet is produced for a batch of hash values of the chunks detected as non-duplicate at the sender and sent out to the receiver.

Fig.5. Metadata formats. (a) FileManifest entry: DiskChunk name (20-byte SHA-1), byte start (8 bytes) and byte size (8 bytes). (b) Manifest entry: SHA-1 (20 bytes), byte size (8 bytes) and hook flag (1 byte). (c) Hook: manifest name (20-byte SHA-1). (d) Duplication query packet: the SHA-1 and byte size of each queried chunk (28 bytes), plus in-chunk matching contexts of manifest name (20-byte SHA-1), offset in bytes (8 bytes), direction (1 byte) and index (4 bytes), i.e., 33 bytes each. (e) Duplication answer packet: duplicate data slice vectors of DiskChunk name (20-byte SHA-1), byte start (8 bytes), start index in the query packet (4 bytes) and length in the query packet (4 bytes), i.e., 36 bytes each, plus an optional manifest feedback.

Fig.6. Example of metadata organization at the receiver. Chunk 5 and chunk 6 are detected as duplicates, and the remaining seven chunks are written together into a DiskChunk with one manifest and two hooks produced. The corresponding FileManifest contains three entries, one for the duplicate data slice chunk 5-6, and two for the non-duplicate data slices chunk 1-4 and chunk 7-9 respectively.

Step 2. The receiver parses the duplication query packet and conducts data deduplication. After data deduplication, a hysteresis hash re-chunking process may be performed on the hit manifests, and the manifest last hit may be piggy-backed on the duplication answer packet, which is then sent back to the sender.

Step 3. The sender parses the duplication answer packet, updates the manifest cache, and writes the confirmed non-duplicate chunks out to the DiskChunk file and the corresponding pointer lists out to the FileManifest file.

Fig.7. Extended data deduplication protocol.

Step 4. If the data deduplication process for all the divided chunks of an input file is finished, the FileManifest and the deduplicated DiskChunk are created for transmission. Otherwise, step 1 to step 3 are repeated for the next query batch.

The maximal number of small granularity hash values sent to the receiver in one duplication query batch is called the query granularity. The query granularity can be set large enough, while in our implementation it was set to 100 hash values, considering the cost of the memory for caching the data being processed at the sender and the receiver.

3.3 Sender Deduplication

At the sender, a duplicate chunk can be identified with the aid of the manifests sent back from the receiver and duplication queries across the network. All the manifest feedbacks are stored in an in-RAM cache in which each manifest is organized into a separate hash table, where the key is the SHA-1 hash value of a deduplicated data block and the value is the corresponding address and byte size of the data block. When an existing hash value is hit by an incoming duplicate hash value, the hash values before and after the hit hash value in the manifest are utilized for duplication extension in the backward and forward matching extension processes. For the convenience of illustration, we term the hash value hit in the manifest the hit hash, and the incoming duplicate chunk the hit chunk.

For the backward matching extension, temporary hash values of the same granularities are calculated over the buffered chunk bytes before the hit chunk and compared with the hash values before the hit hash in the manifest until a hash value mismatch is found. Similarly, in the forward matching extension, temporary hash values are calculated over the chunk bytes following the hit chunk and compared with the hash values after the hit hash in the manifest until a hash value mismatch is found. If the byte range of the mismatched hash value in the manifest covers a boundary between incoming duplicate and non-duplicate chunks, the bytes represented by the mismatched hash value are further exploited with matching extension, which we call the in-chunk matching extension process. As the deduplicated data chunks are all stored at the receiver, the in-chunk matching extension processes are postponed to be finished at the receiver, with the aid of the in-chunk matching context recorded in the duplication query packet, including the entrance of the hit manifest, the byte offset of the mismatched hash value in the manifest, the index of the mismatched hash value in the query batch, and the mismatched direction of the postponed in-chunk matching extension (see Fig.5).

Each duplication query packet corresponds to a duplication answer packet in which the duplicate hash values of the query packet are indicated using a sequence of vectors, each of which contains the offset and length of a duplicate data slice (see Fig.5). When a manifest is found piggy-backed on the duplication answer packet, the old version of the manifest with the same name is replaced. If the manifest feedback is new, it is stored in the manifest cache following the LRU caching policy, with expired manifests discarded directly by the sender. The manifest feedback is the manifest hit at the receiver by the last duplicate hash value of the duplication query packet.
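
Before turning to how a matching extension is split between the two ends, the manifest feedback cache just described can be sketched as follows; this is a minimal illustration under our own (hypothetical) naming and caching details, not the MFMU implementation.

from collections import OrderedDict

class ManifestFeedbackCache:
    # LRU cache of manifests piggy-backed from the receiver. Each manifest is
    # kept as its own hash table mapping a chunk's SHA-1 value to the address
    # and byte size of the deduplicated data block it represents.
    def __init__(self, capacity_manifests: int = 1000):
        self.capacity = capacity_manifests
        self.manifests = OrderedDict()    # manifest name -> {sha1: (address, size)}

    def store(self, name: bytes, entries: dict):
        self.manifests[name] = entries    # a re-sent name replaces the old version
        self.manifests.move_to_end(name)
        if len(self.manifests) > self.capacity:
            self.manifests.popitem(last=False)   # evict the least recently used

    def find_duplicate(self, sha1: bytes):
        # Linear scan over the cached manifests for the sketch; a hit means the
        # chunk is a duplicate and no remote duplication query is needed.
        for name, table in self.manifests.items():
            if sha1 in table:
                self.manifests.move_to_end(name)
                return name, table[sha1]
        return None
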
The backward matching extension triggered by the last duplicate hash value has already been conducted at the receiver, and the corresponding forward matching extension is continued at the sender, because of the mismatch of the hash granularities between the hash values in the manifest and those in the duplication query packet. At the sender, for the forward matching extension, temporary hash values of the same granularities are calculated over the incoming chunk bytes and compared with the hash values after the hit hash in the manifest until a hash value mismatch is encountered. Using 10 incoming chunks, Fig.8 gives an example of the hit and matching extension deduplication framework across WAN from the perspective of the sender. The incoming chunk 6 first hits a cached manifest at hash 6, then chunk 2 to chunk 5 are identified as duplicates via backward matching extension, and chunk 7 to chunk 8 are identified as duplicates via in-chunk matching extension at the receiver.

An optimal feedback strategy is to choose the manifests that can help to identify as many duplicates as possible at the sender. Assume the size of a manifest feedback is A bytes and the manifest helps to identify B duplicate hash values at the sender; if the metadata feedback mechanism were not introduced, these B hash values would produce x bytes for the duplication query packets and y bytes for the duplication answer packets. When A is smaller than x + y, no extra bandwidth overhead is required for the manifest feedback. If the bandwidth overheads are not considered, the whole set of deduplication metadata sent back to the sender can

save all the remote duplication queries.

Fig.8. Hit and matching extension deduplication across the network from the perspective of the sender. The in-chunk matching extension for the mismatched hash 9-12 and chunk 7 to chunk 10 is performed at the receiver, with the hit manifest updated by hysteresis hash re-chunking.

3.4 Receiver Deduplication

At the receiver, duplicate hash values can be detected via in-chunk matching extension and the content addressable hooks stored on the disk. Besides duplication identification, deduplication metadata generation and update are also managed at the receiver.

After all the incoming hash values are parsed from the duplication query packet, the in-chunk matching extension processes for the mismatched hash values indexed at the sender are first conducted, by restoring the matching context from the entrance of the hit manifest, the byte offset of the mismatched hash value of large granularity in the manifest and the mismatched direction marked in the duplication query packet. The chunk bytes represented by the mismatched hash value in the manifest are reloaded from the disk to the RAM, and temporary hash values of the same granularities are calculated over the reloaded bytes for hash comparison until a hash value mismatch is encountered. As shown in Fig.8, the chunk bytes represented by hash 9-12, as part of the corresponding DiskChunk, are first reloaded from the disk into the RAM, and then temporary hash values of the same granularities are calculated over the reloaded chunk bytes for hash comparison with the incoming hash values of chunk 7 to chunk 10. When the hash value mismatch between the temporary hash 11 and the incoming hash value of chunk 9 is found, the in-chunk matching extension process stops.

After processing the marked in-chunk matching contexts, the remaining incoming hash values are checked one by one using the in-RAM manifest cache, the bloom filter and the on-disk hooks. Except for the last duplicate hash value, both the backward and the forward in-chunk matching extension processes are performed for the hash values which cannot find duplicates and are around the newly detected duplicate hash values. For the last duplicate hash value, only the backward matching extension is performed at the receiver, and the forward matching extension is postponed to be finished at the sender with the manifest feedback. Fig.9 gives an example of the deduplication process conducted at the receiver. The incoming hash 3 and hash 8 hit manifest 1 and manifest 2 respectively, with hash 2, hash 4 and hash 7 identified as duplicates via in-chunk matching extension. The updated manifest 2 is sent back to the sender for subsequent deduplication.
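
The in-chunk matching extension described above amounts to recomputing small granularity hashes over a reloaded block and comparing them with the incoming hash values until the first mismatch; the following simplified sketch, with hypothetical names and without the on-disk formats, illustrates the idea for both directions.

import hashlib

def in_chunk_matching_extension(reloaded_bytes, chunk_sizes, incoming_hashes,
                                forward=True):
    # reloaded_bytes: the data block represented by the mismatched large
    # granularity hash value, reloaded from the DiskChunk on disk.
    # chunk_sizes / incoming_hashes: byte sizes and SHA-1 values of the incoming
    # chunks, ordered outward from the hit (forward or backward direction).
    # Returns the number of incoming chunks confirmed as duplicates.
    if not forward:
        reloaded_bytes = reloaded_bytes[::-1]   # walk the block from its end
    matched, offset = 0, 0
    for size, expected in zip(chunk_sizes, incoming_hashes):
        piece = reloaded_bytes[offset:offset + size]
        if not forward:
            piece = piece[::-1]
        if len(piece) < size or hashlib.sha1(piece).digest() != expected:
            break                               # hash mismatch ends the extension
        matched += 1
        offset += size
    return matched
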
The metadata utilization mechanism is performed on the hash value sequence of each manifest and includes two processes, namely, sample based hash merging and hysteresis hash re-chunking.

Since each in-chunk matching extension process requires one expensive disk I/O for reloading the chunk bytes from the disk to the RAM, and the in-chunk matching extension is triggered by the mismatch of a large granularity hash comparison, a hysteresis hash re-chunking process is introduced to divide the mismatched large granularity hash value into multiple small granularity hash values after each in-chunk matching extension process, for the purpose of preventing duplicate in-chunk matching extension processes being triggered by identical duplicate data slices in the future. To introduce new hash values conservatively, over each reloaded chunk a large granularity hash value is divided into at most three small granularity hash values, the first being a mismatched EdgeChunk representing an EdgeBlock which does not cover a chunk boundary, and the other two covering the data blocks before and after the EdgeBlock. With the EdgeChunk created, the matching extension process for subsequent

identical duplicate data slices is stopped by a hash comparison failure at the EdgeChunk, with no in-chunk matching extension process triggered, because the EdgeChunk does not contain a chunk boundary. As examples, in Fig.8, hash 9-12 is re-chunked into three new hash values in which hash 11 is the created EdgeChunk; in Fig.9, hash 2-5 and hash 7-10 of manifest 1 are re-chunked into three new hash values respectively, where hash 4 and hash 8 are the two created EdgeChunks, and hash 4-5 of manifest 2 is re-chunked into two new hash values, where hash 4 is the created EdgeChunk.

Fig.9. Hit and matching extension across the network from the perspective of the receiver. The updated manifest 2 is sent back to the sender, where the forward matching extension process would be continued.

Considering manifest size reduction, a sample based hash merging process is utilized to adaptively enlarge the granularity of chosen hash values while initially producing a manifest for a received DiskChunk and FileManifest. Each non-duplicate data slice, constructed by a sequence of consecutive non-duplicate chunks within the DiskChunk, is first parsed according to the entries recorded in the FileManifest. Over each non-duplicate data slice, the hooks are uniformly sampled from the SHA-1 hash values of small granularity of the corresponding chunk sequence using a preset sample period termed the sample distance, while the SHA-1 hash values between two neighboring hooks are merged into a single SHA-1 hash value calculated over the bytes represented by the original multiple hash values. At most two hash values are conservatively and infrequently added in a hysteresis hash re-chunking process, which is conducted only when an in-chunk matching extension process is finished. Each non-duplicate data slice would trigger at most two in-chunk matching extension processes, in the forward and backward directions. The metadata harnessing efficiency produced by the sample based hash merging processes would therefore not be significantly affected when the non-duplicates are concentrated in a relatively small number of non-duplicate data slices. Because each merged SHA-1 hash value represents a consecutive byte sequence of the input file and the SHA-1 hash values generated from a non-duplicate data slice are ordered in accordance with the order in which the corresponding data blocks appear in the input file, the sample based hash merging process does not destroy the data locality within a manifest.

In our preliminary work [11], the sample based hash merging process was conducted over the entire DiskChunk rather than over each independent non-duplicate data slice. The data block represented by a merged hash value may then straddle two neighboring non-duplicate data slices, and a short non-duplicate data slice may be completely concealed in a merged hash value of large granularity, leading to a proportion of future duplicates that cannot be found and to an about 5% reduction of the duplication elimination ratio in our experimental evaluation.

In the example shown in Fig.10, chunk 1 to chunk 5 and chunk 90 to chunk 95 are two parsed non-duplicate data slices. Over the first non-duplicate data slice, hash 1 is sampled as a hook and the hash values for chunk 2 to chunk 5 are merged into hash 2-5. Over the second non-duplicate data slice, hash 90 and hash 95 are sampled as two hooks, with the hash values for chunk 91 to chunk 94 merged into hash 91-94.
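
Both metadata utilization processes, the sample based hash merging just exemplified and the hysteresis hash re-chunking of Fig.8 and Fig.9, can be sketched as follows; this is our simplification with hypothetical names, and the granularity bookkeeping and on-disk formats are omitted.

import hashlib

def sample_and_merge(chunk_bytes_list, sample_distance=10):
    # Sample based hash merging over one non-duplicate data slice: every
    # sample_distance-th chunk keeps its own SHA-1 value and becomes a hook,
    # while the chunks between two neighboring hooks are merged into a single
    # large granularity hash computed over their concatenated bytes.
    # (The experiments in Section 5 use a sample distance of 10; Fig.10 uses 5.)
    hooks, manifest = [], []              # manifest entries: (sha1, size, is_hook)
    i = 0
    while i < len(chunk_bytes_list):
        hook_bytes = chunk_bytes_list[i]
        hook_hash = hashlib.sha1(hook_bytes).digest()
        hooks.append(hook_hash)
        manifest.append((hook_hash, len(hook_bytes), True))
        merged = b"".join(chunk_bytes_list[i + 1:i + sample_distance])
        if merged:
            manifest.append((hashlib.sha1(merged).digest(), len(merged), False))
        i += sample_distance
    return hooks, manifest

def hysteresis_rechunk(block_bytes, edge_start, edge_end):
    # Hysteresis hash re-chunking of one merged block after an in-chunk matching
    # extension: at most three smaller hash values are produced, one of them the
    # EdgeChunk (block_bytes[edge_start:edge_end]) chosen so that it contains no
    # chunk boundary, so a later identical duplicate slice stops at an ordinary
    # hash mismatch instead of triggering another chunk reload from the disk.
    parts = [block_bytes[:edge_start],
             block_bytes[edge_start:edge_end],
             block_bytes[edge_end:]]
    return [(hashlib.sha1(p).digest(), len(p)) for p in parts if p]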

Fig.10. Sample based hash merging used for size reduction while initially producing a manifest at the receiver. The sample distance is set to 5 hash values.

4 Analysis

4.1 Basic Definitions and Claims

Definition 1. A duplicate data slice is the longest possible byte sequence containing a certain byte in a file detected as being identical to a deduplicated byte sequence in an older file.

Definition 2. A non-duplicate data slice is the longest possible byte sequence containing a certain byte in a file detected as being non-duplicate, interrupted by a duplicate data slice or the end of the file it belongs to.

Definition 3. A referenced data slice is a non-duplicate byte sequence in a file with at least one duplicate data slice detected to be identical to it.

Definition 4. Given a referenced data slice set S, a referenced data slice cluster, denoted by C, is a subset of S such that, for every E ∈ C: 1) E is the only element of C, or there exists E′ ∈ C with E′ ≠ E and E ∩ E′ ≠ Λ; 2) for every E′′ ∈ S - C, E ∩ E′′ is Λ.

Claim 1. Given a dataset, after deduplication, the byte sequence of the dataset is divided into a duplicate data slice set and a non-duplicate data slice set, with a referenced data slice set corresponding to the duplicate data slice set.

Claim 2. Given a dataset, a hash comparison based data deduplication algorithm with D duplicate chunks and N non-duplicate chunks produced, and a given query granularity Q, a lower bound on the number of the duplication queries required by the sender is (N + D)/Q when no metadata feedback is utilized and (N + D - D′)/Q when D′ duplicate chunks are identified at the sender with the metadata feedback utilized. In the best case, D′ is D - L, where L is the number of duplicate data slices, and then the fewest duplication queries are (N + L)/Q.

Claim 3. Given a dataset, a duplicate data slice set, and a corresponding referenced data slice set, for any hash comparison based deduplication algorithm, a lower bound on the number of the hash indices required in the manifests is the number of the clusters of the referenced data slice set.

Claim 4. Given a dataset and a referenced data slice set, assume the bytes of the referenced data slice set are indexed using M separate manifests: 1) the number of the clusters of the referenced data slice set is smaller than M; 2) for any hash comparison based deduplication algorithm, a lower bound on the number of the hooks required is M.

Claim 5. Given a dataset, L duplicate data slices and K non-duplicate data slices, a lower bound on the number of the entries in the FileManifests is L + K.

Claim 6. Given a manifest feedback to the sender from the receiver, corresponding to a DiskChunk which is constructed from K non-duplicate data slices of N non-duplicate chunks, after sample based hash merging, the bandwidth overhead of the manifest feedback is reduced from N hash values to N/SD + K hash values, when the sample distance is set to SD hash values.

The above claims on the lower bounds can be easily proved from the definitions and by contradiction: if there existed a number smaller than the lower bound, the given conditions on the duplicate data slices and non-duplicate data slices could not be achieved.
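
To give the bounds a concrete scale, the following short computation, with example values that are ours and purely illustrative, plugs numbers into Claim 2 and Claim 6.

import math

# Illustrative values only, not measurements from the paper's dataset.
N, D, L = 1_000_000, 400_000, 5_000   # non-duplicate chunks, duplicate chunks, duplicate data slices
Q = 100                               # query granularity (hash values per query batch)

queries_no_feedback = math.ceil((N + D) / Q)   # Claim 2, no metadata feedback
queries_best_case = math.ceil((N + L) / Q)     # Claim 2, best case with feedback (D' = D - L)

# Claim 6, for a single DiskChunk holding n non-duplicate chunks in k slices:
n, k, SD = 10_000, 12, 10                      # SD is the sample distance
feedback_hashes = math.ceil(n / SD) + k        # 1012 hash values instead of 10000

print(queries_no_feedback, queries_best_case, feedback_hashes)   # 14000 10050 1012
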
4.2 Comparative Analysis

In LBFS [1], where the content defined chunking (CDC) algorithm without metadata harnessing is used, given an expected chunk size ECS, each separate chunk is roughly ECS bytes and indexed by at least one SHA-1 hash value. For a deduplicated dataset of S bytes, the size of the hash values required to index this non-duplicate dataset is therefore about 20·S/ECS bytes.

Considering metadata harnessing, the Bimodal [8] chunking algorithm first divides the input files into big chunks, and conducts duplication detection for such

big chunks. The non-duplicate big chunks adjacent to duplicate chunks (termed transition points) are then re-chunked into small chunks, followed by the deduplication process on the smaller chunks. Three drawbacks exist in the Bimodal algorithm: 1) the duplicates within the big chunks that are not re-chunked into small chunks cannot be found; 2) inaccurate re-chunkings of big chunks may lead to more than three small chunks being produced; 3) before re-chunking a non-duplicate big chunk, an unnecessary duplication query for the big chunk is introduced, affecting the deduplication throughput.

Different from the big chunk first, small chunk second deduplication framework of Bimodal, the metadata utilization mechanism of MFMU performs duplication detection for all the hash values of small granularity before they are selectively merged into hash values of large granularity, and re-chunks each merged hash value into at most three hash values based on the hysteresis mechanism when the data block represented by the merged hash value is found straddling duplicate and non-duplicate chunks in the future.

Given a dataset, an ECS, and a duplicate data slice set produced by LBFS, and assuming that the shortest duplicate data slice contains SD consecutive chunks, when the sample distance used in MFMU is set to SD hash values, each duplicate data slice produced by LBFS can be identified by MFMU. Because each referenced data slice of a duplicate data slice would generate a hook in MFMU, each duplicate data slice can be found by hitting a hook and conducting the following bidirectional matching extension process. We assume the same duplicate data slice set can also be produced by Bimodal when the expected sizes of the small chunks and the big chunks are set to ECS and BC·ECS respectively. Since each duplicate data slice identified by Bimodal is not shorter than BC small chunks, BC should not be greater than SD.

Assuming that the same duplicate data slice set, non-duplicate dataset, and referenced data slice set are produced by LBFS, Bimodal, and MFMU respectively, we compare the deduplication metadata size produced and the deduplication metadata related disk I/O operations produced by the different deduplication solutions. In the comparisons, the manifest and hook data structures for data locality leveraging are assumed to be implemented in all the deduplication solutions. The LBFS and Bimodal algorithms with the same metadata feedback mechanism, denoted as MF-LBFS and MF-Bimodal respectively, are also implemented to evaluate the number of duplication queries/answers reduced. Given the notations and descriptions summarized in Table 2, the number of duplication queries/answers reduced in the best case as well as the deduplication metadata produced and the deduplication metadata related disk I/O operations in the worst case are summarized in Table 3 and Table 4 respectively. In MF-Bimodal and Bimodal, since each non-duplicate data slice is adjacent to two duplicate data slices, two more duplication query operations are required for the re-chunked small chunks. As a result, the number of the duplication query operations required by MF-Bimodal is (N + L)/Q + 2K.
Table 2. Notations and Descriptions in Analysis
F: number of manifests (DiskChunks)
L: number of duplicate data slices
K: number of non-duplicate data slices
N: number of non-duplicate chunks
D: number of duplicate chunks
Q: value of the query granularity
SD: value of the sample distance
BC: value of the big chunk size
ECS: value of the expected chunk size
S: value of the deduplicated data size

Table 3. Duplication Query/Answer Related Time Overheads (in the Best Case)
Solution: Number of Dup. Queries/Answers
LBFS: (N + D)/Q
MF-LBFS: (N + L)/Q
Bimodal: (N + D)/Q + 2K
MF-Bimodal: (N + L)/Q + 2K
MFMU: (N + L)/Q

In MFMU, each manifest entry is 29 bytes. Each duplicate data slice triggers at most two hysteretic hash re-chunking processes with four more manifest entries introduced, at most (Q/SD + 1) chunk reloads for the in-chunk matching extensions at both ends, and one on-disk manifest update operation. In LBFS and Bimodal, each manifest entry is 28 bytes, with the unnecessary 1-byte hook flag saved. When SD exceeds thresholds determined by N, K, L, Q and BC, which follow from the expressions in Table 4, MFMU produces fewer hooks and fewer manifest bytes, and requires fewer deduplication metadata related

disk I/O operations, as compared with LBFS and Bimodal.

Table 4. Analytic Comparisons Among Different Deduplication Solutions (columns: LBFS/MF-LBFS, Bimodal/MF-Bimodal, MFMU)
Deduplication metadata produced (in the worst case)
Number of hooks: N | N/BC + 2K(BC - 1) | N/SD + K
Bytes of manifests: 28N | 28N/BC + 2·28K(BC - 1) | 29(N/SD + K) + 4·29L
Deduplication metadata related disk I/O overheads (in the worst case)
Number of DiskChunk outputs: F | F | F
Number of DiskChunk inputs: 0 | 0 | (Q/SD + 1)L
Number of hook outputs: N | N/BC + 2K(BC - 1) | N/SD + K
Number of hook inputs: L | L | L
Number of manifest outputs: F | F | F + L
Number of manifest inputs: L | L | L
Summary: 2(F + L) + N | 2(F + L) + N/BC + 2K(BC - 1) | 2F + (Q/SD + 4)L + N/SD + K

In MFMU, the computation overheads for the SHA-1 hash value re-calculation introduced in hash merging and in hysteresis hash re-chunking can be described as O(N·ECS) and O(2·L·SD·ECS) respectively, assuming one computation operation is required per byte.

5 Experimental Results

Experiments were designed to evaluate the following for the LBFS, Bimodal and MFMU deduplication algorithms: metadata feedback efficiency, deduplication throughput, metadata utilization efficiency, bandwidth saving ratio, and RAM overhead for the metadata feedbacks.

5.1 Experimental Setup, Dataset and Metrics

We built a test bed as shown in Fig.11 using the same hardware and software configurations (Table 5) at both the sender and the receiver. The sender and the receiver were connected through a 100 Mbps symmetric link with a Linktropy network emulator, which was introduced to simulate various WAN conditions of different RTTs and packet loss rates. We used two SATA 3.0 connected hard disks for the implementation, one for storing the input dataset at the sender, and the other for storing the output deduplicated dataset at the receiver. The real-world test dataset was 500 GB of disk images, named WorkGroup, described in the motivation section. The test dataset was representative for applications such as remote incremental backups of file systems across WAN.

Fig.11. Layout of the experimental test bed (sender, Linktropy network emulator, receiver).

In our experimental comparisons, the deduplication related metadata include the bytes contained in the manifest, hook and FileManifest files, along with the extra inodes for file storage management. Since each input file corresponds to a FileManifest file, the extra inodes refer to the inodes for the manifest, hook and DiskChunk files. To quantitatively evaluate the size of the metadata, we assumed each inode was 256 bytes. The bandwidth overhead of data deduplication includes the total bytes for the FileManifests, the manifest feedbacks and the duplication query and answer packets in both directions. The deduplication throughput is defined as the ratio between the input file size and the execution time used in the whole deduplication process across the network. The deduplication ratio is defined as the ratio between the size of the detected duplicates and the input file size. The bandwidth saving ratio is defined as the ratio between the difference of the total size of the detected duplicates and the bandwidth overhead, and the input file size.
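
Written out, with B_in the input file size, B_dup the total size of the detected duplicates, B_ovh the deduplication introduced bandwidth overhead and T the execution time of the whole deduplication process across the network (notation ours), these metrics are:

\text{deduplication throughput} = \frac{B_{\mathrm{in}}}{T}, \qquad
\text{deduplication ratio} = \frac{B_{\mathrm{dup}}}{B_{\mathrm{in}}}, \qquad
\text{bandwidth saving ratio} = \frac{B_{\mathrm{dup}} - B_{\mathrm{ovh}}}{B_{\mathrm{in}}}.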

Table 5. Hardware and Software Configurations of the Experimental Test Bed
CPU: quad-core 1.55 GHz
Memory: 8 GB
Hard disk drive: 7200 RPM, 3 TB
Network simulator: Linktropy mini
Operating system: Ubuntu, Linux kernel
File system: Ext3
TCP: TCP-Cubic [12]
Test dataset: 500 GB disk images
Comparative algorithms: LBFS (using the CDC algorithm) [1], Bimodal [8]

5.2 Comparative Solutions and Methodology

As mentioned previously, besides the baseline LBFS and Bimodal algorithms, to evaluate the flexibility and efficiency of the metadata feedback mechanism of MFMU, the LBFS and Bimodal deduplication algorithms with metadata feedback, denoted as MF-LBFS and MF-Bimodal respectively, in which the manifests most recently hit at the receiver were piggy-backed to the sender for subsequent deduplication, were also implemented for comparison.

For a fair comparison of the metadata harnessing efficiency between MFMU and Bimodal, when the sample distance in MFMU was set to SD, the expected sizes of the small and the big chunks in Bimodal were set to ECS and ECS·SD/2 respectively. For the same non-duplicate data slice of ECS·SD bytes, two manifest entries would then be produced by both the MFMU and the Bimodal algorithms. The sample distance in MFMU was set to 10 hash values. The size of the in-memory bloom filter and the size of the manifest cache at the receiver were set to 50 MB and 10 manifests respectively. The size of the manifest feedback cache at the sender was set to 1000 manifests. The query granularity was set to 100 hash values. ECS was set to 2 KB, 4 KB and 8 KB. In the experiments, we first examined the deduplication throughputs, followed by the bandwidth saving ratios of the different deduplication solutions, and finally the RAM overheads of the metadata feedbacks.

5.3 Metadata Feedback

The deduplication throughput is determined by the transmission time used for the transmissions of the deduplicated data and related metadata, and by the deduplication execution time at the sender and the receiver. The transmission time is mainly determined by the number of duplication query/answer operations, and the deduplication execution time is mainly determined by the number of disk I/O operations at the receiver. We first evaluated the efficiency of the metadata feedback mechanism, which is designed to identify as many duplicates as possible at the sender and thus reduce the number of duplication queries from the sender.

As shown in Fig.12(a), about 80% of the overall duplicates were identified with the metadata feedbacks at the sender in MF-LBFS, MF-Bimodal and MFMU, while fewer than 1% of the overall duplicates were identified at the sender in LBFS and Bimodal, where the metadata feedback mechanism was not implemented. The numbers of the duplication query/answer operations at the sender and the receiver in the different deduplication solutions are given in Fig.12(b). Compared with LBFS, MF-LBFS and MFMU achieved an average of 17%-35% duplication query/answer reduction for different ECS values. Since a non-duplicate big chunk may require two duplication queries/answers in Bimodal and MF-Bimodal, one for the big chunk and the other for the constituent small chunks, MF-Bimodal required 10%-15% more duplication queries/answers than MFMU, and Bimodal without metadata feedback required the most duplication queries/answers among all the solutions. Besides, Fig.12(c) gives the numbers of deduplication metadata related disk I/O operations at the receiver in the different deduplication solutions.
Without metadata harnessing, the number of disk I/Os in LBFS and MF-LBFS was almost five times that in MFMU. As analyzed previously, each duplicate data slice may trigger Q/SD + 1 in-chunk matching extensions and one extra manifest update process, and each such process introduces one disk I/O at the receiver. Because the hysteretic hash re-chunking processes prevented identical duplicate data slices from triggering duplicate disk I/Os, the number of the disk I/Os resulting from chunk byte reloads and manifest updates was only about 20% of the number of the overall duplicate data slices detected. Since fewer hooks were produced by sample based hash merging, MFMU required about 30% fewer deduplication metadata related disk I/O operations than Bimodal and MF-Bimodal.

Fig.12. Metadata feedback efficiency and deduplication throughput evaluation. (a) Percentage of duplicates identified at the sender. (b) Duplication query and answer operations performed at the sender and the receiver. (c) Deduplication metadata related disk I/O operations performed at the receiver. (d) Deduplication throughput.

5.4 Deduplication Throughput

Our preliminary work [13], which is essentially about MF-LBFS, showed that different packet loss rates at the same RTT and link capacity setting had only slight effects on the deduplication throughput comparisons. We therefore evaluated the deduplication throughputs of the different deduplication solutions under network conditions of 20 ms, 30 ms and 40 ms RTTs with a fixed 0% packet loss rate. The deduplication throughput versus RTT for the different deduplication solutions is shown in Fig.12(d). As expected, the different numbers of deduplication metadata related disk I/O operations and duplication query/answer operations led to different deduplication throughputs. With the metadata feedbacks utilized at the sender, MF-LBFS and MF-Bimodal achieved a 20% deduplication throughput improvement, as compared with LBFS and Bimodal. Combining the efficiencies of the manifest feedback mechanism and the deduplication metadata related disk I/O reduction at the receiver, MFMU showed about a 40% deduplication throughput improvement compared with LBFS and about a 20% improvement compared with Bimodal.

5.5 Metadata Utilization

The deduplication related metadata normalized against the input file size versus ECS for the different deduplication solutions is plotted in Fig.13(a). Since MFMU performed hysteretic hash re-chunking less frequently and more conservatively, it showed much better metadata harnessing efficiency than Bimodal and MF-Bimodal. The deduplication related metadata produced by MFMU were 50% of those produced by Bimodal and MF-Bimodal, consistent with the experimental results validated in our preliminary work [11]. Because the duplicates identified in MFMU via bidirectional

matching extension were more spatially concentrated than those detected by the other solutions, MFMU produced the fewest bytes for the FileManifests. Without metadata harnessing, LBFS and MF-LBFS produced the most deduplication metadata, i.e., hooks and manifests. The curve for the lower bound on the deduplication related metadata required by MFMU, denoted by MFMU-LB, is also plotted in Fig.13(a). Notably, the lower bound indicated that the deduplication metadata which had actually been used for duplication detection were much smaller than the overall deduplication metadata produced by MFMU.

5.6 Bandwidth Saving

The data deduplication introduced bandwidth overheads, normalized against the input file size, versus the ECS values of the different solutions are plotted in Fig.13(b). Without manifest feedbacks utilized at the sender, LBFS used the most duplication query/answer packets, because almost 100% of the duplicates were identified at the receiver. With the manifest feedback mechanism introduced, the duplication query/answer packets in MF-LBFS decreased significantly, but with a significant bandwidth overhead introduced for the metadata feedbacks. MFMU required more bandwidth for the duplication query packets, as compared with Bimodal and MF-Bimodal, because MFMU conducted duplication detection for all the small granularity hash values and introduced extra bytes for the in-chunk matching extension contexts produced at the sender. Bimodal and MF-Bimodal performed duplication detection for the big chunks and only for the small chunks of the non-duplicate big chunks adjacent to duplicate big chunks. As a result, Bimodal and MF-Bimodal required fewer bytes for the duplication query/answer packets than MFMU.

Fig.13. Bandwidth saving evaluation (no data compression on deduplicated chunks). (a) Normalized metadata size. (b) Bandwidth overhead. (c) Deduplication ratio. (d) Bandwidth saving ratio.

The bandwidth overheads for the manifest feedbacks in MFMU and MF-Bimodal were similar, due to the similar metadata harnessing efficiencies on the manifests. As shown in Fig.13(b), the combined bandwidth overheads of MFMU were in between.

The deduplication ratios versus the deduplication related metadata of the different solutions are plotted in Fig.13(c), where the deduplication related metadata are normalized against the input file size. Consistent with the results of Kruus et al. [8], Bimodal and MF-Bimodal overlooked the duplicates within the big chunks that were not re-chunked into small chunks, leading to the lowest curve of deduplication ratio versus ECS. Because each non-duplicate data slice generated at least one hook for subsequent duplication detection, because the duplicates within big chunks, including some of those not visible to hash comparison in LBFS and MF-LBFS, were detected via bi-directional matching extension and in-chunk matching extension, and because the size of the deduplication related metadata was highly suppressed, MFMU provided a better trade-off between deduplication ratio and deduplication related metadata than LBFS and MF-LBFS, one that is close to the lower bound denoted by MFMU-LB.

Considering the bandwidth overhead, the bandwidth saving ratio as a function of the normalized deduplication related metadata is shown in Fig.13(d). For the same metadata utilized, MF-LBFS provided a lower bandwidth saving ratio than LBFS because the bandwidth overhead introduced by the metadata feedbacks outweighed the reduction in the bytes of the duplication query and answer packets. Bimodal and MF-Bimodal showed the worst bandwidth saving ratio versus the deduplication related metadata, because of the lowest deduplication ratios achieved. With the high deduplication ratio maintained and the bandwidth overheads suppressed, MFMU provided the most cost-effective bandwidth saving ratio and about a 2% improvement to the peak bandwidth saving ratio compared with LBFS. The lower bound on the bandwidth saving ratio versus the normalized deduplication related metadata, denoted by MFMU-LB, is also plotted in Fig.13(d).

5.7 RAM Overhead for Metadata Feedbacks

The manifest hit ratio of the manifest feedback cache at the sender increased as the cache capacity was enlarged, and reached its maximal value of about 80% when the manifest feedback cache was limited to 1000 manifests, consistent with the metadata cache behavior shown in [14]. With the manifest feedback cache at the sender limited in this way, the corresponding RAM overheads for the manifest feedbacks cached at the sender in MF-LBFS, MF-Bimodal and MFMU are compared in Fig.14, from which we can observe that a smaller ECS led to a higher RAM overhead, because of the larger size of each manifest sent back from the receiver. As the sizes of the manifests were suppressed by MFMU and MF-Bimodal with a similar SD and expected big chunk size setting, the maximal RAM space required in MFMU and MF-Bimodal was kept between 10 MB and 39 MB. Without metadata harnessing, MF-LBFS introduced a RAM overhead of 48 MB to 194 MB at the sender, almost five times the RAM space used by MFMU.

Fig.14. RAM overhead for metadata feedbacks.
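The RAM overhead reported above is essentially the number of cached manifests multiplied by the average manifest size, and the manifest size grows as ECS shrinks because a manifest carries roughly one entry per expected chunk. The sketch below illustrates this scaling; the per-entry size and the average file size are illustrative assumptions, not measured values, while the 1000-manifest cache capacity follows the setting mentioned above.

```python
# Rough estimate of sender-side RAM overhead for a manifest feedback cache:
# cached manifests x average manifest size. The per-entry size and average
# file size are illustrative assumptions; only the 1000-manifest cache
# capacity follows the evaluation setting described in the text.

def manifest_ram_overhead_mb(cache_capacity, avg_file_mb, ecs_kb,
                             bytes_per_entry=32):
    """Estimate RAM (MB) used by cached manifests at the sender.

    A manifest is assumed to hold one fixed-size entry (hash plus
    offset/length bookkeeping) per expected chunk, so a smaller ECS
    means more entries and a larger manifest.
    """
    entries_per_manifest = avg_file_mb * 1024 / ecs_kb
    manifest_bytes = entries_per_manifest * bytes_per_entry
    return cache_capacity * manifest_bytes / (1024 * 1024)

for ecs in (2, 4, 8):  # expected chunk size in KB
    mb = manifest_ram_overhead_mb(cache_capacity=1000,
                                  avg_file_mb=12, ecs_kb=ecs)
    print(f"ECS {ecs} KB -> ~{mb:.0f} MB of cached manifests")
```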
6 Discussions

The distributions of the duplicate data slices within the input dataset determine the efficiency and performance of MFMU. The harnessing efficiency of metadata utilization can be validated on most types of datasets, while the corresponding overheads may vary. On a dataset where fine-grained duplicates account for a large portion of the overall duplicates, the metadata harnessing strategy may lead to a noticeable reduction in the deduplication ratio. When a duplicate data slice is smaller than the data block represented by the query granularity, the metadata feedback may not help to identify the whole duplicate data slice at the sender. In such a case, no metadata feedback should be piggybacked to the sender, as illustrated by the sketch below. More efficient metadata feedback strategies could be introduced in MFMU to identify more duplicates at the sender with fewer bandwidth overheads consumed.
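The following is a minimal sketch of the receiver-side policy implied here, assuming a hypothetical helper that compares the size of a detected duplicate extent with the query granularity; the names and the simple threshold are illustrative and are not part of MFMU's actual implementation.

```python
# Hypothetical receiver-side policy check implied by the discussion above:
# only piggyback a manifest feedback when the detected duplicate extent is
# at least as large as one block at the query granularity. The helper name
# and threshold are illustrative, not MFMU's actual implementation.

def should_piggyback_feedback(duplicate_extent_bytes, query_granularity_bytes):
    """Return True if a manifest feedback is worth sending to the sender."""
    # A duplicate smaller than the query granularity cannot be fully
    # re-identified at the sender from the fed-back manifest, so the
    # feedback bandwidth would be wasted.
    return duplicate_extent_bytes >= query_granularity_bytes

print(should_piggyback_feedback(3 * 1024, 8 * 1024))   # False: skip the feedback
print(should_piggyback_feedback(64 * 1024, 8 * 1024))  # True: piggyback it
```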

This paper does not aim to improve the bandwidth saving ratio, but rather to accelerate the data deduplication process across WAN. The deduplicated data and various intermediate data generated in the proposed system are usually compressible. Considering the overheads of maintaining reference data blocks at the sender for delta patching [2,14], local data compression using miniLZO was also applied to every 10 MB deduplicated data block before it was transmitted from the sender to the receiver in our evaluation of the different deduplication solutions. The bandwidth could be saved further by about 27%. The corresponding latency, including the compression latency at the sender and the decompression latency at the receiver, was about 3 ms per MB of processed data.

MFMU is mainly designed and implemented for data deduplication across WAN, where the end-to-end RTT greatly affects the deduplication throughput. In a LAN environment where the end-to-end RTT is low enough, the duplication queries/answers saved by the metadata feedbacks may not bring a significant reduction in communication related time overheads. However, the metadata harnessing mechanism of MFMU can still accelerate the duplication detection process at the receiver, with deduplication metadata and the related disk I/O operations efficiently saved.

Using a hash value of length K bits to represent a data chunk for duplication identification, and assuming the hash values are independently and identically distributed, the probability that at least one hash collision occurs among N data chunks is upper-bounded by N(N-1)/2^(K+1), the well-known birthday bound [4]. When the 160-bit SHA-1 hash value is used and N is 2^50, this bound gives a collision probability below 2^(-61), i.e., about 4 x 10^(-19). When the chunk size is 1 KB, the SHA-1 hash value can thus be used to represent on the order of 10^6 TB of data with a negligible hash collision probability. Byte-by-byte comparison used for duplication identification can thoroughly eliminate the false-positive probability, at the cost of higher computation complexity.
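The bound can be checked numerically; the short sketch below evaluates it for the SHA-1 setting discussed above.

```python
# Numerical check of the birthday-bound estimate above: the probability of
# at least one collision among N chunks, each represented by a K-bit hash,
# is at most N*(N-1) / 2**(K+1).

from fractions import Fraction

def collision_upper_bound(n_chunks, hash_bits):
    """Birthday-bound upper estimate of the hash collision probability."""
    return Fraction(n_chunks * (n_chunks - 1), 2 ** (hash_bits + 1))

bound = collision_upper_bound(n_chunks=2**50, hash_bits=160)
print(float(bound))                   # ~4.3e-19
print(2**50 * 1024 / 10**12, "TB")    # data covered by 2^50 chunks of 1 KB each
```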
7 Related Work

Bandwidth Saving. This problem concerns how to eliminate more duplicates within a given dataset to be transmitted, for more bandwidth saving. Rsync exploited duplicate data transmissions using a fixed-size file chunking algorithm, while LBFS [1] used the Rabin fingerprint based content defined chunking algorithm. For further redundancy reduction, delta patching and data compression technologies were introduced to suppress the deduplicated data blocks in TAPER [2], [15] and [14]. However, these existing studies did not consider reducing the communication time overheads of the duplication query/answer operations across the wide area network. On the contrary, [14] requires more metadata exchanging steps across the network to locate an optimal reference data block for delta compression. Lin et al. [16] improved the effectiveness of traditional compressors by introducing coarse-grained data transformation in addition to data deduplication. More file chunking and hashing methods can be found in [17-20]. Black [4] explained that the probability of hash collisions in SHA-1 is ignorable in practical systems. MFMU does not aim to achieve greater redundancy reduction efficiency, but guarantees that the bandwidth saving ratio is not affected by metadata harnessing.

Data Deduplication Acceleration. This problem concerns how to achieve a better trade-off between deduplication ratio and deduplication speed. Data-locality-preserved caching, the in-RAM Bloom filter [21], the in-RAM sparse index [6] and sampling [10] have been standard acceleration techniques to avoid the disk lookup bottleneck at the receiver during deduplication, as done in Data Domain [5], sparse indexing [6], HANDS [22], iDedup [23] and Extreme Binning [10]. Guo and Efstathopoulos [7] utilized a multiple-TCP-connection-based interaction model to optimize the deduplication throughput, where the baseline LBFS protocol was used in each connection. On the basis of the existing acceleration approaches, MFMU accelerates the deduplication process across WAN by reducing the metadata related disk I/O operations at the receiver and the remote duplication query operations at the sender. In addition, MFMU can also be implemented with faster hardware devices like solid state drives (SSDs), as done in studies such as [24-27]. A minimal illustration of the in-RAM filtering idea is sketched below.
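As a minimal illustration of the in-RAM filtering idea cited above ([21], as used in systems such as [5]), the sketch below consults a Bloom filter before touching the on-disk chunk index, so a disk lookup is needed only when the filter reports a possible prior occurrence. The sizing and hashing choices are simplified assumptions rather than the design of MFMU or of any of the cited systems.

```python
# Minimal in-RAM Bloom-filter pre-check for chunk hashes: the on-disk chunk
# index is consulted only when the filter reports a possible prior
# occurrence. Sizing and hashing are simplified for illustration.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting SHA-1 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(i.to_bytes(1, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
seen_chunk = hashlib.sha1(b"chunk-A").digest()
bf.add(seen_chunk)
new_chunk = hashlib.sha1(b"chunk-B").digest()
print(bf.might_contain(seen_chunk))  # True: go to the on-disk chunk index
print(bf.might_contain(new_chunk))   # almost certainly False: skip the disk I/O
```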

Deduplication Related Metadata Saving. This problem concerns how to achieve a better trade-off between deduplication ratio and metadata related overheads. Meister et al. [28] proposed an entropy coding based post-processing compression method for the FileManifests. FingerDiff [29] coalesced contiguous non-duplicate chunks, up to a maximal number, into one big chunk stored on the disk to reduce data fragmentation and the metadata required for storage management. However, FingerDiff did not harness the deduplication metadata, e.g., the hash indices used during deduplication. Bimodal [8], subchunk deduplication [9] and [30] first searched for duplicates using a large chunk size, with the non-duplicate big chunks selectively processed again using a small chunk size. In these studies, an unnecessary duplication query was introduced for each non-duplicate big chunk, and duplicate data within the big chunks that were not re-chunked for further deduplication was missed with considerably high probability. MFMU replaces this big-chunk-first, small-chunk-second deduplication strategy with the bi-directional matching extension mechanism to avoid the unnecessary duplication queries for the non-duplicate big chunks. While maintaining the deduplication ratio, MFMU performs duplication detection for all the small-granularity hash values before they are merged into a large-granularity hash value, and exploits duplicates within big data blocks via in-chunk matching extension. Other deduplication related problems and solutions, on deduplication trade-offs, deduplication restoration, deduplication security and deduplication memory allocation optimization, can be found in [31-34], respectively. Spyglass [35] uses several metadata search techniques, including index partitioning, file signatures and incremental crawling, as well as an index versioning mechanism, to achieve fast, scalable file metadata search performance. With a similar goal, SmartStore [36] uses a decentralized semantic-aware metadata organization method to limit the search scope of a metadata query, and thereby to reduce the metadata query latency for large storage systems. MFMU reduces the deduplication related time overheads mainly by reducing the metadata related disk I/O operations, which results from harnessing the size of the overall deduplication related metadata. MFMU currently does not aim to lower the metadata search latency over the metadata already stored in RAM and on the disk.

8 Conclusions

In this paper, we proposed a data deduplication system across WAN, named MFMU, which combines the metadata feedback and metadata utilization features to accelerate the data deduplication process across WAN while maintaining the bandwidth saving efficiency. In our experiments, about a 35% reduction in duplication queries at the sender and about an 80% reduction in disk I/O operations at the receiver were achieved, leading to a 20%-40% deduplication throughput improvement compared with the other data deduplication solutions evaluated. In this paper, only the disk image dataset was used for experimental evaluation and only the one-receiver-one-sender deduplication mode was discussed. In the future, more types of datasets, such as cloud storage related datasets, will be collected and tested. In addition, the deduplication mode will be extended to a one-receiver-multiple-sender mode, in which concurrency control and metadata consistency between multiple threads will be important topics.

References

[1] Muthitacharoen A, Chen B, Mazières D. A low-bandwidth network file system. In Proc. the 18th ACM Symposium on Operating Systems Principles (SOSP), October 2001.
[2] Jain N, Dahlin M, Tewari R. TAPER: Tiered approach for eliminating redundancy in replica synchronization. In Proc. the 10th USENIX Conference on File and Storage Technologies (FAST), December 2005.
[3] Rabin M O. Fingerprinting by random polynomials. Technical Report, TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[4] Black J. Compare-by-hash: A reasoned analysis. In Proc. the USENIX Annual Technical Conference (ATC), May 2006.
[5] Zhu B, Li K, Patterson H. Avoiding the disk bottleneck in the data domain deduplication file system. In Proc. the 6th USENIX Conference on File and Storage Technologies (FAST), February 2008.
[6] Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proc. the 7th USENIX Conference on File and Storage Technologies (FAST), February 2009.
[7] Guo F, Efstathopoulos P. Building a high-performance deduplication system. In Proc. the USENIX Annual Technical Conference (ATC), June 2011.
[8] Kruus E, Ungureanu C, Dubnicki C. Bimodal content defined chunking for backup streams. In Proc. the 8th USENIX Conference on File and Storage Technologies (FAST), February 2010.
[9] Bartlomiej R, Lukasz H, Wojciech K, Krzysztof L, Cezary D. Anchor-driven subchunk deduplication. In Proc. the 4th Annual International Conference on Systems and Storage (SYSTOR), May 2011, pp.16:1-16:13.
[10] Bhagwat D, Eshghi K, Long D D E, Lillibridge M. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proc. the IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), September 2009.

[11] Zhou B, Wen J. Hysteresis re-chunking based metadata harnessing deduplication of disk images. In Proc. the 42nd IEEE International Conference on Parallel Processing (ICPP), October 2013.
[12] Ha S, Rhee I, Xu L. CUBIC: A new TCP-friendly high-speed TCP variant. ACM SIGOPS Operating Systems Review (Research and Developments in the Linux Kernel), July 2008, Volume 42.
[13] Zhou B, Wen J. Efficient file communication via deduplication with manifest feedback. IEEE Communications Letters, 2014, 18.
[14] Shilane P, Huang M, Wallace G, Hsu W. WAN-optimized replication of backup datasets using stream-informed delta compression. ACM Transactions on Storage, 2012, 8(4): 13:1-13:26.
[15] Park K, Ihm S, Bowman M, Pai V S. Supporting practical content-addressable caching with CZIP compression. In Proc. the USENIX Annual Technical Conference (ATC), June 2007, pp.14:1-14:14.
[16] Lin X, Lu G, Douglis F, Shilane P, Wallace G. Migratory compression: Coarse-grained data reordering to improve compressibility. In Proc. the 12th USENIX Conference on File and Storage Technologies (FAST), February 2014.
[17] Eshghi K, Tang H K. A framework for analyzing and improving content-based chunking algorithms. Technical Report, HPL-2005-30 (R.1), Hewlett-Packard Laboratories, Palo Alto, 2005.
[18] Min J, Yoon D, Won Y. Efficient deduplication techniques for modern backup operation. IEEE Transactions on Computers, 2011, 60(6).
[19] Pagh R, Rodler F F. Cuckoo hashing. Journal of Algorithms, 2004, 51(2).
[20] Botelho F C, Shilane P, Garg N, Hsu W. Memory efficient sanitization of a deduplicated storage system. In Proc. the 11th USENIX Conference on File and Storage Technologies (FAST), February 2013.
[21] Bloom B H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970, 13(7).
[22] Wildani A, Miller E, Rodeh O. HANDS: A heuristically arranged non-backup inline deduplication system. In Proc. the 29th IEEE International Conference on Data Engineering (ICDE), April 2013.
[23] Srinivasan K, Bisson T, Goodson G, Voruganti K. iDedup: Latency-aware, inline data deduplication for primary storage. In Proc. the 10th USENIX Conference on File and Storage Technologies (FAST), February 2012.
[24] Debnath B, Sengupta S, Li J. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proc. the USENIX Annual Technical Conference (ATC), February 2010.
[25] Meister D, Brinkmann A. dedupv1: Improving deduplication throughput using solid state drives (SSD). In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), May 2010.
[26] Chen F, Luo T, Zhang X. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proc. the 9th USENIX Conference on File and Storage Technologies (FAST), February 2011.
[27] Agrawal N, Prabhakaran V, Wobber T, Davis J D, Manasse M, Panigrahy R. Design tradeoffs for SSD performance. In Proc. the USENIX Annual Technical Conference (ATC), June 2008.
[28] Meister D, Brinkmann A, Süß T. File recipe compression in data deduplication systems. In Proc. the 11th USENIX Conference on File and Storage Technologies (FAST), February 2013.
[29] Bobbarjung D R, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Transactions on Storage, 2006, 2(4).
[30] Lu G, Jin Y, Du D. Frequency based chunking for data deduplication. In Proc.
the IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), August 2010.
[31] Fu M, Feng D, Hua Y, He X, Chen Z, Xia W, Zhang Y, Tan Y. Design tradeoffs for data deduplication performance in backup workloads. In Proc. the 13th USENIX Conference on File and Storage Technologies (FAST), February 2015.
[32] Fu M, Feng D, Hua Y, He X, Chen Z, Xia W, Huang F, Liu Q. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proc. the USENIX Annual Technical Conference (ATC), June 2014.
[33] Tang Y, Yang J. Secure deduplication of general computations. In Proc. the USENIX Annual Technical Conference (ATC), July 2015.
[34] Zhang W, Yang T, Narayanasamy G, Tang H. Low-cost data deduplication for virtual machine backup in cloud storage. In Proc. the 5th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), June 2013.
[35] Leung A W, Shao M, Bisson T, Pasupathy S, Miller E L. Spyglass: Fast, scalable metadata search for large-scale storage systems. In Proc. the 7th USENIX Conference on File and Storage Technologies (FAST), February 2009.
[36] Hua Y, Jiang H, Zhu Y, Feng D, Tian L. Semantic-aware metadata organization paradigm in next-generation file systems. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(2).

Bing Zhou received his B.S. degree in computer science and technology from Nanjing University, Nanjing, in 2009, and his M.S. and Ph.D. degrees in computer science and technology from Tsinghua University, Beijing, in 2012 and 2015 respectively. He is working at 2012 Lab, Huawei Technologies Co., Ltd. His research interests include data communication/deduplication and high-performance distributed systems.

Jiang-Tao Wen received his B.S., M.S., and Ph.D. degrees (with honors), all in electrical engineering, from Tsinghua University, Beijing, in 1992, 1994, and 1996, respectively. From 1996 to 1998, he was a staff research fellow at the University of California, Los Angeles (UCLA), where he conducted cutting-edge research on multimedia coding and communications. Many of his inventions there were later adopted by international standards such as H.263, MPEG, and H.264. After UCLA, he served as the principal scientist at PacketVideo Corp., the chief technical officer at Morphbius Technology Inc., the director of Video Codec Technologies at Mobilygen Corp., and a technology advisor at Ortiva Wireless and Stretch, Inc. Since 2009, he has been a professor at the Department of Computer Science and Technology, Tsinghua University, Beijing. He is a world-renowned expert in multimedia communication over hostile networks, video coding, and communications. He has authored many widely referenced papers in related fields. Products deploying technologies that he developed are currently widely used worldwide. He holds over 30 patents with numerous others pending.
