Comprehensive study of data de-duplication




International Conference on Cloud, Big Data and Trust 2013, Nov 13-15, RGPV

Comprehensive study of data de-duplication

Deepak Mishra, School of Information Technology, RGPV Bhopal, India
Dr. Sanjeev Sharma, School of Information Technology, RGPV Bhopal, India

Abstract: Cloud computing is an emerging computing paradigm in which the resources of the computing infrastructure are provided as services over the Internet. Storage and network resources in a cloud system are limited, so data de-duplication plays an important role in the cloud infrastructure. This paper discusses the data de-duplication process in detail and examines the methods and processes used to implement it.

Keywords: cloud computing, scalability, virtual machine, deduplication.

I. INTRODUCTION

Clouds are large pools of easily usable and accessible resources. In a cloud, all resources are connected virtually to create a single system image. These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing optimum resource utilization. Cloud storage refers to scalable and elastic storage capabilities delivered as a service using Internet technologies, with elastic provisioning and usage-based pricing that does not penalize users for changing their storage consumption without notice. Any cloud system has five basic characteristics:

a) On-demand self-service
b) Broad network access
c) Resource pooling
d) Measured service
e) Rapid elasticity

Cloud computing comprises both hardware and applications provided to users as services over the Internet. With the fast development of cloud computing, ever more cloud services have emerged, such as PaaS (platform as a service), SaaS (software as a service), and IaaS (infrastructure as a service).
Computing resources [19] are limited, and eventually any system that grows in data or usage will saturate the resources available to it. The resources in question may be, for example, processing capacity for computationally intensive systems or storage capacity for data-intensive systems. Network capability is an important scalability point in distributed systems. Structural scalability concerns the internal design of a system and provides methods for the system to manipulate its own data model, i.e., methods to shrink and expand the data model; it can be understood in terms of the system's deployment. Cloud storage is a paradigm in which online storage is networked and records are stored on several dedicated storage servers. Sometimes these storage servers are maintained by third parties. The concept of cloud storage is derived from cloud computing; it denotes storage accessed over the Internet via Web service application program interfaces (APIs). For example, HDFS (Hadoop Distributed File System, hadoop.apache.org) is a distributed file system that runs on commodity hardware; it was introduced by Apache for managing huge data sets. Data de-duplication is also known as single instancing or intelligent compression. It essentially refers to the removal of replicated data: in the de-duplication process, duplicate data is deleted, and only one copy, or single instance, of the data is stored in the database. Data de-duplication is a term used to describe an algorithm or technique that eliminates duplicate copies of data from storage. It is commonly performed on secondary storage systems such as archival and backup storage.

II. DATA DE-DUPLICATION

The term data de-duplication refers to techniques that [1] save only one single instance of replicated data and provide links to that instance in place of storing additional copies of the data.
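The single-instance idea described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the class and method names are invented for this example.

```python
import hashlib

class SingleInstanceStore:
    """Store each unique file body once; duplicate files become links (sketch)."""

    def __init__(self):
        self.blobs = {}   # fingerprint -> file bytes, stored exactly once
        self.links = {}   # file name   -> fingerprint (the "link" to the instance)

    def put(self, name, data):
        fp = hashlib.sha1(data).hexdigest()  # fingerprint of the whole file
        is_duplicate = fp in self.blobs
        if not is_duplicate:
            self.blobs[fp] = data            # first copy: store the bytes
        self.links[name] = fp                # every copy: store only a link
        return is_duplicate

    def get(self, name):
        return self.blobs[self.links[name]]

store = SingleInstanceStore()
store.put("a.txt", b"hello world")
store.put("b.txt", b"hello world")   # duplicate content: only a link is added
assert len(store.blobs) == 1
assert store.get("b.txt") == b"hello world"
```

Note that the store keeps one blob for both names; deleting a name would only remove a link, which is why real systems must reference-count blobs before reclaiming space.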
With the evolution of backup services from tape to disk, data de-duplication has become a key element of the backup process. It ensures that only one copy of a given piece of data is saved in the datacenter [10]; every user who wants to access that data is linked to that single instance. It is therefore clear that data de-duplication helps to decrease the size of the datacenter. In other words, de-duplication means that the replicas of data that would usually be duplicated in the cloud are controlled and managed so as to shrink the physical storage space required for such replication. The basic steps of de-duplication are:

a) Files are divided into small segments.
b) New and existing data are checked for similarity by comparing fingerprints created with the SHA-1 algorithm (other methods are also applicable).
c) Metadata structures are updated.
d) Segments are compressed.

e) All duplicate data is deleted and a data integrity check is performed.

III. REQUIREMENTS FOR DATA DE-DUPLICATION

There is only one necessary condition for data de-duplication: it should be scalable [20]. That is, de-duplication should be elastic and should not affect the overall storage structure. To handle scalable de-duplication, two methods have been proposed:

1. Sparse indexing: Sparse indexing is a method used to solve the chunk-lookup bottleneck caused by disk access, by using sampling and exploiting the inherent locality within backup streams. It picks a small subset of the chunks in the stream as samples; the sparse index then maps these samples to the existing segments in which they occur. Arriving streams are divided into relatively large segments, and each segment is de-duplicated against only a few of the most similar previous segments.

2. Bloom filters with caching: This approach exploits a Summary Vector, a compact in-memory data structure for discovering new segments; Stream-Informed Segment Layout, a data layout method that improves on-disk locality for consecutively accessed segments; and Locality Preserved Caching with cache fragments, which maintains the locality of the fingerprints of duplicated segments to achieve high cache hit ratios.

IV. TYPES OF DATA DE-DUPLICATION

There are two major categories [18] of data de-duplication:

Offline data de-duplication [7]: In an offline de-duplication setup, data is first written to the storage disk and the de-duplication process takes place at a later time.

Online data de-duplication: In an online de-duplication setup, replicated data is deleted before being written to the storage disk.

Once the timing of data de-duplication has been decided, there are a number of existing techniques that can be applied.
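The Summary Vector described above can be illustrated with a minimal Bloom filter. This is a sketch under assumed sizes (8192 bits, 4 hash positions, all derived from one SHA-1 digest); the production data structure in [6] is engineered quite differently.

```python
import hashlib

class BloomFilter:
    """Compact in-memory summary: answers 'definitely new' or 'probably seen'."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, fingerprint):
        # Derive several bit positions from one SHA-1 digest of the fingerprint.
        digest = hashlib.sha1(fingerprint).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint):
        # False -> the segment is definitely new: skip the on-disk index lookup.
        # True  -> probably a duplicate: verify against the full index.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))

bf = BloomFilter()
bf.add(b"segment-fingerprint-1")
assert bf.might_contain(b"segment-fingerprint-1")  # a stored segment is never missed
```

A negative answer proves the fingerprint is new, so the expensive disk lookup is avoided entirely; a positive answer must still be verified against the index because false positives are possible.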
The most used de-duplication approaches are whole file hashing (WFH), sub-file hashing (SFH), and delta encoding (DE).

Whole File Hashing: In the whole file hashing (WFH) technique, the whole file is passed to a hashing function, usually a cryptographic hash such as MD5 or SHA-1. The cryptographic hash is used to find entire replicated files. This approach is fast, with low computation and low additional metadata overhead, and it works very well for complete system backups, where fully duplicated files are common. However, the large granularity of replica matching prevents it from matching two files that differ by only a single byte or bit of data.

Sub-File Hashing: [12] Sub-file hashing (SFH) is appropriately named: when SFH is used, the file is broken into a number of smaller sections before de-duplication. The number of sections depends on the type of SFH being used. The two most common types are fixed-size chunking and variable-length chunking. In the fixed-size chunking approach, a file is divided into a number of fixed-size pieces called chunks. In the variable-length chunking approach, a file is broken into chunks of variable length; techniques such as Rabin fingerprinting [28] are applied to determine the chunk boundaries. Each section is passed to a cryptographic hash function (usually MD5 or SHA-1) to obtain a chunk identifier, which is used to locate replicated data. Both SFH approaches find replicated data at a finer granularity, but at a price.

Delta Encoding: The term delta encoding (DE) comes from the mathematical use of the delta symbol: in math and science, delta denotes the change, or rate of change, in an object. Delta encoding is applied to express the difference between a source object and a target object. For example, if block A is the source and block B is the target, the DE of B is the difference between A and B that is unique to B.
The expression and storage of the difference depend on how delta encoding is applied. Normally it is used when SFH does not produce results but there is a strong enough similarity between two items/blocks/chunks that storing the difference would take less space than storing the non-duplicate block.

V. CATEGORIES OF DATA DE-DUPLICATION STRATEGIES

Data de-duplication strategies can be categorized according to their operational area. In this respect there are two main strategies:

(1) File-level de-duplication [3]: File-level de-duplication is performed over single files. In this type of de-duplication, two or more files are identified as identical if they have the same hash value.

(2) Block-level de-duplication [3]: Block-level de-duplication is performed over blocks. It first divides files into blocks and stores only a single copy of each block. It can use either fixed-size blocks or variable-sized chunks.

Strategies can be further divided on the basis of their target area [3]:

I. Target-based de-duplication: This type of de-duplication is performed at the target data storage center. The client is unmodified and is not aware of any de-duplication. This technology improves storage utilization, but does not save bandwidth.

II. Source-based de-duplication: This type of de-duplication is performed on the data at the source, before it is transferred. A de-duplication-aware backup agent is installed on the client, which backs up only unique data. The result is increased bandwidth and storage efficiency. However, this places extra computational load on the backup client. Replicas are replaced by pointers, and the actual replicated data is never sent over the network.

VI. KEY FACTORS

When data de-duplication is considered, two questions arise:

1. How does the system find the data duplications?
2. How does the system maintain and manipulate the data to reduce the repetitions, in other words, to de-duplicate them?

For the first question, the system can use the MD5 or SHA-1 [2] algorithm to make a unique fingerprint for each file or data block, and set up a fast fingerprint index to identify the duplications. To delete the duplications, the first step is to discover them. The two usual ways to discover duplications are:

1. Comparing data blocks or files bit by bit: The system can compare data blocks or files bit by bit. This guarantees accuracy, but at the cost of additional time consumption.

2. Comparing data blocks or files by hash values: Comparing blocks or files by hash value is more efficient, but introduces a chance of accidental collisions. The chance of an accidental collision depends on the hash algorithm; however, the chances are very small. Thus, using a combination of hash values to discover duplications greatly reduces the collision probability, and it is acceptable to use a hash function to discover duplications.

For the second question, the system sets up a distributed file system to store data and develops link files to manage files in the distributed file system.
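The two discovery strategies above can be combined: use fingerprints to find candidate duplicates quickly, then optionally verify candidates byte by byte to rule out accidental collisions. A minimal sketch (names are illustrative, not from the paper):

```python
import hashlib

def fingerprint(block):
    """Unique fingerprint per block, as in the MD5/SHA-1 scheme described above."""
    return hashlib.sha1(block).hexdigest()

def find_duplicates(blocks, verify=True):
    """Return indices of blocks that duplicate an earlier block.

    verify=False trusts the hash alone (fast, tiny collision risk);
    verify=True re-checks each candidate byte by byte (slower, exact).
    """
    seen = {}          # fingerprint -> first block seen with that fingerprint
    duplicates = []
    for i, block in enumerate(blocks):
        fp = fingerprint(block)
        if fp in seen and (not verify or seen[fp] == block):
            duplicates.append(i)       # drop this copy; keep only a link
        else:
            seen[fp] = block
    return duplicates

blocks = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
assert find_duplicates(blocks) == [2, 4]
```

With SHA-1 fingerprints the verification step almost never changes the answer, which is why production systems typically rely on the hash alone.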
The de-duplication process would be very hard, if not impossible, to perform manually, since real databases may have thousands of millions of records.

VII. HOW IT WORKS

The data de-duplication process is very systematic and controlled. Here the process is explained with a simple example. Suppose [5] there are some files to be written to the database.

Fig.1: Sample files data and segments

In figure 1 we have three files, named Andy.txt, Andrew.txt and Sim.txt, to be written to the database. When the first file, Andy.txt, is stored, the de-duplication system breaks it into four parts named A, B, C and D (there can be more segments, depending on the de-duplication approach). After breaking the file into segments, the system adds a hash identifier to every segment for reconstruction [11], and all segments are stored in the database separately. When another file, Andrew.txt, comes to be written to the database, it is again divided into four parts named A, B, C and D. These segments are the same as Andy.txt's segments, so the de-duplication system does not store them again; it deletes this copy and provides links to the previously stored segments. When a third file, Sim.txt, comes to be written, the system again breaks it into several parts; in our example it is broken into four parts named E, B, C and D. Here only one part, E, is new; the other parts are already stored in the database, so the system stores only part E and provides links for the other parts. In the end, only five segments of data are stored in place of the 12 blocks shown in fig.1, which clearly reduces the storage space. If one segment's size is 1M and data de-duplication is not applied, the total space needed is 12M; the de-duplication system stores only 5 unique segments and links to the others, saving 7M of space.

VIII.
EXAMPLES OF DE-DUPLICATION STORAGE SYSTEMS

Several de-duplication storage systems have been designed so far. They serve different storage purposes; what they have in common is that all are de-duplication-based storage systems. Some of them are:

Venti: Venti is a network storage system. It uses identical hash values to identify block contents, so that it reduces the data occupation of the storage area. Venti generates blocks for large storage applications and enforces a write-once policy to avoid collisions in the data. This system emerged in the early stages of network storage, so it is not suitable for dealing with vast amounts of data, and it is not scalable.
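The write-once, content-addressed behaviour attributed to Venti can be sketched as follows. This is a toy in-memory sketch of the idea, not Venti's actual on-disk design; the names are invented for illustration.

```python
import hashlib

class WriteOnceStore:
    """Venti-style sketch: blocks are addressed by their hash and never overwritten."""

    def __init__(self):
        self.blocks = {}  # score (hash of contents) -> block bytes

    def write(self, data):
        score = hashlib.sha1(data).hexdigest()  # the block's address IS its hash
        # Write-once policy: identical contents always map to the same score,
        # so re-writing existing data stores nothing new and changes nothing.
        self.blocks.setdefault(score, data)
        return score

    def read(self, score):
        return self.blocks[score]

s = WriteOnceStore()
first = s.write(b"block contents")
second = s.write(b"block contents")  # duplicate write coalesces silently
assert first == second and len(s.blocks) == 1
```

Because a block's address is derived from its contents, de-duplication falls out of the addressing scheme itself rather than requiring a separate comparison step.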

HYDRAstor [4]: HYDRAstor is a scalable secondary-storage solution that includes a back-end consisting of a grid of storage nodes with a decentralized hash index, and a traditional file system interface as a front-end [15]. The back-end of HYDRAstor is based on a directed acyclic graph, which is able to organize large-scale, variable-size, content-addressed, immutable, and highly resilient data blocks. HYDRAstor detects duplications using the hash table. The ultimate target of this approach is to form a backup system; it does not consider the situation where multiple users need to share files.

Extreme Binning: Extreme Binning is a scalable, parallel de-duplication approach aimed at a non-traditional backup workload composed of low-locality individual files. It exploits file similarity instead of locality, and requires only one disk access for block lookup per file. Extreme Binning arranges similar data files into bins and removes duplicated chunks inside each bin; it keeps only the primary index in memory in order to reduce RAM occupation. This approach is not a strict de-duplication method, because duplicates can still exist among different bins.

MAD2: MAD2 is an exact de-duplication network backup service that works at both the file level and the block level. It uses four techniques to achieve high performance: a hash bucket matrix, a Bloom filter array, dual caching, and load balancing based on a Distributed Hash Table. This approach is designed for a backup service, not for a pure storage system.
Duplicate Data Elimination (DDE): DDE applies a combination of content hashing, copy-on-write, and lazy updates to discover and coalesce identical data blocks in a storage area network file system. It always works in the background. What sets DeDu decisively apart from these approaches, on the other hand, is that DeDu de-duplicates exactly and calculates hash values at the client side, right before data transmission, all at the file level.

IX. PERFORMANCE METRICS

Storage Space: When de-duplicated segments are saved in cloud storage, storage space is reduced. Two runs were performed to test this: two files were sent, one original file and a duplicate file, and the same process was repeated for different file sizes, which get saved in different data bins. The storage space used in both tests was recorded. The test outcomes are given below.

Example 1:

Before de-duplication:
File Name: dic.jpg
Compressed file size = 33.1 Kb

After de-duplication:
File Name: dic.jpg
Number of segments created = 1
Compressed file size = 33.1 Kb

Now consider another file, diccas.jpg, which is a copy of dic.jpg.

Before de-duplication:
File Name: diccas.jpg
Compressed file size = 33.1 Kb

After de-duplication:
File Name: diccas.jpg
Number of segments created = 1
Compressed file size = 33.1 Kb

After de-duplication is performed, only one of the files (either dic.jpg or diccas.jpg) is stored; the other file is removed. Results:

Average file size (of both files) = (62.7+0)/2 = 31.3 Kb
Total space saved = 31.3 Kb

The performance chart below shows clearly how data de-duplication reduces the overall size of the storage system.

Fig.2: Performance chart

In this example the file sizes were in KB, which makes the savings look minor. But consider the situation for an entire cloud, where the volume of data can be very large: applied across the whole cloud, the savings can be substantial, as the table below makes clear. The table considers only 2 copies of each file; with many copies, the effect is even more important.

File Name | No. of copies | Before de-dup: File size | After comp. | After de-dup: File size | No. of seg. | After comp. | Space saved
Dic.jpg   | 2 | 2M  | 1.5M | 2M  | 1  | 1.5M | 1M
Ryno.txt  | 2 | 26K | 20K  | 26K | 7  | 20K  | 13K
Sh.mp3    | 2 | 20M | 18M  | 20M | 23 | 18M  | 10M
Dam.avi   | 2 | 2G  | 1.8G | 2G  | 41 | 1.8G | 1G

X. SECURITY ISSUES

Although de-duplication is a secure method in the cloud system [7], there are also some security holes present in this process: a basic property of data de-duplication creates a problem for de-duplication itself. For example, when a query goes to a server, the server responds to that query. Suppose a file is uploaded; then the question arises: did anyone previously store a copy, i.e., is this particular file already stored or not? [8][9] This question can be answered by an attacker who requests to upload a copy of the file and observes whether de-duplication occurs. Note that this is a restricted query. First, the answer is a true/false answer which does not reveal who stored the file. Furthermore, in the basic form of the attack, the attacker can issue the query only once: the query is performed by uploading the file, after which the file is saved at the upload service and the answer to the query will always be positive [7]. The latter limitation can be removed by the following strategy: the attacker starts uploading a file and observes whether de-duplication occurs.
If de-duplication does not happen and a full upload starts, the attacker shuts down the communication channel and terminates the upload. As a result, the attacker's copy of the file is not saved at the server. This allows the attacker to repeat the same test at a later time and check again whether the file has been uploaded. Moreover, by applying this procedure at regular intervals, the attacker can determine the time window in which the file was uploaded. Three attacks [16] on online storage services are possible due to de-duplication. The first two enable an attacker to learn about the contents of other users' files, whereas the third describes a new covert channel.

Attack I: Discovering Files
Attack II: Learning the Details of Files
Attack III: A Covert Channel

XI. BENEFITS

Data de-duplication brings a number of benefits to a cloud network. A cloud network is interconnected, so when one component gains an advantage, the throughput of the whole system automatically increases. The three main benefits of data de-duplication are:

A. Storage-based de-duplication [13] decreases the amount of storage space needed for a given set of data files. It is most effective in applications where many replicas of very similar, or even identical, data are stored on a single shared disk, an unexpectedly common scenario. In the case of data backups, which are regularly performed to guard against data loss, most data in a given backup is unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard-linking) files that have not changed, or by keeping differences between files. Neither approach captures all redundancy, however: hard linking does not help with huge files that have changed only in minor ways, such as an email database, and differencing only finds redundancies in successive versions of a single file (consider a segment that was deleted and later re-added, or a logo image included in many documents).
B. Network data de-duplication [17] is used to cut the number of data bytes that must be transferred between endpoints, which can decrease the amount of bandwidth required.

C. Virtual servers benefit from de-duplication because it permits the nominally distinct system files of each virtual server to be merged into a single shared store. At the same time, if a given server alters a

file, de-duplication will not alter the files on the other servers.

XII. TRADE-OFFS

Whenever data is transformed, concerns arise [6] about potential loss of data. By definition, data de-duplication systems store data differently from how it was written, so users worry about the integrity of their data. And while data de-duplication increases storage efficiency, the benefit comes at a cost: after de-duplication is performed, a file that was written to disk serially may effectively be scattered across the disk, so reading it back can take more time. One method for de-duplicating data relies on the use of hash functions, as also used in cryptography, to recognize matching segments of data [14]. If two different segments of information produce the same hash value, this is known as a collision. The chance of a collision depends on the hash function used, and although the chances are very small, they are always non-zero. The computational intensity of the procedure can also be a challenge. This is seldom a concern for stand-alone devices or machines, where the computation is entirely offloaded from other systems, but it can be a concern when de-duplication is embedded inside devices providing other services. To increase performance, many systems employ both strong and weak hash functions: weak hashes are fast and easy to calculate, but carry a greater risk of hash collision. The reconstruction of files does not need this processing, and any incremental performance penalty associated with reassembling data chunks is unlikely to affect application performance. Another issue with de-duplication concerns its effect on backups, snapshots and archival, especially where de-duplication is applied to primary storage (for example, a NAS filer).
Reading data out of such a storage device causes full reconstruction of the files, so any secondary replica of the data set is likely to be larger than the primary copy. With regard to snapshots, if a file is snapshotted prior to de-duplication, the post-de-duplication snapshot will preserve the whole original file. This means that while the storage volume for primary file copies shrinks, the capacity required for snapshots may grow dramatically. Another issue is the interaction with encryption and compression. While de-duplication is a variety of compression, it works in tension with traditional compression techniques: de-duplication achieves better efficiency against smaller chunks, whereas compression achieves better efficiency against bigger chunks. The aim of encryption is to remove any discernible patterns in the data; thus encrypted data cannot be de-duplicated, even though the underlying data may be redundant. De-duplication ultimately reduces redundancy; if this is not anticipated and planned for, it may undermine the underlying reliability of the system, since a single lost or corrupted chunk now affects every file that references it. Scaling (elasticity) has also been an issue for de-duplication systems. Data de-duplication is very profitable for space saving, but it creates challenges for reliability and performance.

XIII. FUTURE WORK AND ENHANCEMENTS

De-duplication is a technique which saves storage space and bandwidth. This technique has been implemented on the Amazon cloud platform; the Eucalyptus platform is another example. A small subset of the functionalities was successfully implemented. A few of the enhancements that would be desirable for the present application are:

I. De-duplication of files is done per user bucket. We would like to extend it to the entire cloud, keeping each user bucket as a physical virtualization for user images only, not for the data.
II. Concurrent uploads from many node controllers.
III. Metadata structures are maintained as files in this version of the application.
Going forward, metadata can be stored in a database for easy querying.

XIV. CONCLUSION

In this paper we have discussed data de-duplication in detail, covering the methods used to achieve it, and surveyed a broad range of topics and areas for future research in the field. The systems to which data de-duplication is applied are not necessarily homogeneous, which poses a challenge; such challenges can be greater for unstructured files, and they also pose challenges for future file system designs. Nevertheless, data de-duplication is an essential element of cloud systems, and it will contribute to improved cloud performance from both user and business perspectives.

REFERENCES

[1] H. Huang, W. Hung, and K. G. Shin. FS2: Dynamic data replication in free disk space for improving disk performance and energy consumption. In Proc. 20th ACM Symposium on Operating Systems Principles, 2005.

[2] P. Kulkarni, J. LaVoie, F. Douglis, and J. Tracey. Redundancy elimination within large collections of files. In Proc. USENIX 2004 Annual Technical Conference, 2004.

[3] K. Jin and E. Miller. The effectiveness of deduplication on virtual machine disk images. In Proc. SYSTOR 2009: The Israeli Experimental Systems Conference, 2009.

[4] B. Atkin, C. Ungureanu, C. Dubnicki, A. Aranya, S. Rago, S. Gokhale, G. Calkowski, and A. Bohra. HydraFS: A high-throughput file system for the HydraStor content-addressable storage system. In Proc. 8th USENIX Conference on File and Storage Technologies, 2010.

[5] E. Kruus, C. Ungureanu, and C. Dubnicki. Bimodal content defined chunking for backup streams. In Proc. 8th USENIX Conference on File and Storage Technologies, 2010.

[6] B. Zhu, H. Patterson, and K. Li. Avoiding the disk bottleneck in the Data Domain deduplication file system. In Proc. 6th USENIX Conference on File and Storage Technologies, 2008.
[7] Q. Wang, C. Wang, J. Li, K. Ren, and W. Lou. Enabling public verifiability and data dynamics for storage security in cloud computing. In ESORICS 2009, pages 355-370.

[8] A. Juels and B. S. Kaliski Jr. PORs: Proofs of retrievability for large files. In ACM Conference on Computer and Communications Security 2007, pages 584-597.

[9] R. C. Merkle. A certified digital signature. In CRYPTO 1989, pages 218-238.

[10] D. Russell. Data de-duplication will be even bigger in 2010. Gartner, 8 February 2010.

[11] H. Shacham and B. Waters. Compact proofs of retrievability. In ASIACRYPT 2008, pages 90-107.

[12] D. T. Meyer and W. J. Bolosky (Microsoft Research and The University of British Columbia). A study of practical deduplication. In Proc. 9th USENIX Conference on File and Storage Technologies (FAST), 2011.

[13] J. K. Ousterhout and M. Rosenblum. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26-52, February 1992.

[14] M. W. Storer, K. M. Greenan, D. D. E. Long, and E. L. Miller. Secure data deduplication. In Proc. 2008 ACM Workshop on Storage Security and Survivability, October 2008.

[15] C. Ungureanu, B. Atkin, A. Aranya, S. Gokhale, S. Rago, G. Calkowski, C. Dubnicki, and A. Bohra. HydraFS: A high-throughput file system for the HYDRAstor content-addressable storage system. In Proc. 8th USENIX Conference on File and Storage Technologies (FAST), San Jose, CA, February 2010.

[16] E. L. Miller, D. D. E. Long, W. E. Freeman, and B. C. Reed. Strong security for network-attached storage. In Proc. 2002 Conference on File and Storage Technologies (FAST), pages 1-13, Monterey, CA, January 2002.

[17] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proc. 18th ACM Symposium on Operating Systems Principles (SOSP '01), pages 174-187, October 2001.

[18] Atmos: Multi-tenant, distributed cloud storage for unstructured content, 2009. [Online]. Available: http://www.emc.com/products/detail/software/atmos.htm.
[19] Fujitsu's storage systems and related technologies supporting cloud computing, 2010. [Online]. Available: http://www.fujitsu.com/global/