Deploying De-Duplication on Ext4 File System
Usha A. Joglekar 1, Bhushan M. Jagtap 2, Koninika B. Patil 3
1. Asst. Prof., 2, 3. Students, Department of Computer Engineering, Smt. Kashibai Navale College of Engineering, Pune, India

Abstract

Memory space has become one of the most important and costly resources today, in both personal and corporate settings. The amount of data to be stored is increasing rapidly as the number of people using computing devices grows. For effectively improving memory utilization, data deduplication has come under the spotlight. In deduplication, a single unique copy of data is stored in memory and all other copies refer to the address of this original data. The urgent need now is to make data deduplication more mainstream and to integrate it with the file systems of popular operating systems using inline deduplication. Our focus is on the EXT4 file system for Linux, as Linux is one of the most widely used open source operating systems. Also, since bit-level deduplication places a severe overhead on the processor, chunk-level deduplication is the most suitable for our task.

1. Introduction

Data deduplication is a specific form of compression in which redundant data is eliminated, typically to improve storage utilization [4]. All duplicate data is deleted, leaving a single copy which is referenced by all the files containing it. Data storage techniques have gained attention lately due to the exponential growth of digital data. Not just primary data, but backups and distributed databases with separate local memories give rise to several copies of the same data. Large databases not only consume a lot of memory, but can also lead to unnecessarily extended periods of downtime. Handling such a database becomes an enormous task, and backup and restoration also bog down the CPU. Since we cannot compromise on backup space, we must make sure that there are no duplicate files in the data itself and prevent further transmission of multiple copies of data; this is what data deduplication provides. For example, consider a typical server containing a hundred instances of the same file of about 1 MB. Every time we take a backup of the platform, all hundred instances have to be saved again, requiring 100 MB more of storage space.

Figure 1. Deduplication concept

Figure 1 shows two files, File 1 and File 2, which contain one repeating block of data. A block containing GHI is present in both files, so the deduplication process replaces the GHI block in File 2 with a pointer to the same block in File 1.

Data deduplication is not the same as data compression. In data compression, a function is applied to replace a larger, frequently repeating pattern with a smaller pattern. Even though storage space is reduced, the data will still contain redundant patterns. Data deduplication, on the other hand, searches for redundant patterns and saves only one instance of each.
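As a minimal sketch of the concept in Figure 1 (an illustration only, not the paper's EXT4 implementation; the names block_store and add_file are assumptions introduced here), the following Python fragment stores each unique block once, keyed by its fingerprint, and represents each file as a list of references to those blocks, so the shared GHI block is written only once.

    import hashlib

    block_store = {}   # fingerprint -> unique block contents

    def add_file(blocks):
        """Return the file's representation: a list of block references."""
        refs = []
        for block in blocks:
            fp = hashlib.md5(block).hexdigest()   # fingerprint of the block
            if fp not in block_store:             # store the block only if it is new
                block_store[fp] = block
            refs.append(fp)                       # duplicates become references
        return refs

    file1 = add_file([b"ABC", b"DEF", b"GHI"])
    file2 = add_file([b"GHI", b"JKL"])
    print(len(block_store))   # 4 unique blocks stored, although 5 were written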
2. Classification of Deduplication

Deduplication is a vast topic and can be classified in several ways, as discussed below.

Level of Deduplication

Deduplication can take place at various levels depending on the granularity of the operation. We can have very coarse deduplication in the form of file-level deduplication, or extremely fine deduplication in the form of byte-level deduplication; block-level deduplication is intermediate between the two. In file-level deduplication, duplicate files are eliminated and replaced by a pointer to the original file. It is also known as Single Instance Storage, and it can be achieved by comparing each file to all the other files present in memory. Block-level deduplication makes file-level deduplication finer by dividing each file into blocks or chunks. If any of these blocks are repeated in memory, they are removed and replaced by a pointer; this is also known as sub-file level deduplication. Byte-level deduplication is even more refined than block level, but it requires far more processing: the data stream is analysed byte by byte, and any recurring bytes are simply given references to the original bytes instead of being stored again.

Post-Process and Inline Deduplication

Post-process deduplication carries out deduplication after the data has already been stored in memory. Here the need for hash calculations at write time is removed, but when the memory is almost full it may lead to problems, as duplicate data may be present until deduplication is done. In inline deduplication, before the data is written to disk we check whether it is already present or not. Thus we ensure that duplicate data is never present on disk and that we do not waste precious memory space.

Source and Target Deduplication

Source deduplication refers to deduplication carried out at the source itself: before sending the data to the destination, we ensure that no duplicate or repeating data is present in it. In target deduplication, on the other hand, we carry out deduplication in the same place where we finally store the data, that is, at the destination.

3. System Architecture

Data deduplication consists of several sub-processes such as chunking, fingerprinting, indexing and storing. Logical components manage these processes.

Figure 2. System Architecture

3.1. User Interface

This is the operating system interface with the help of which the user reads, writes, modifies or deletes files.

3.2. File Manager

The File Manager manages the namespace, so that the file name in each namespace is unique. It also manages metadata.

3.3. Deduplication Task Manager

Before deduplication can be done, we need to divide the file into chunks and then identify each chunk using a hash function. These IDs, or fingerprints, are stored in a table known as the Hash Store. Each time the deduplication task manager is called, we check whether the current fingerprint matches any of the previously stored fingerprints. The Hash Store holds all the fingerprints recorded since the very first block was stored in memory.
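To make the three stores and the Task Manager's decision concrete, here is a hedged Python sketch. The store names follow the paper, but the in-memory dictionaries and the function write_block are illustrative assumptions, not the EXT4 data structures.

    import hashlib

    chunk_store = {}   # chunk id     -> actual block contents  (Chunk Store)
    hash_store  = {}   # fingerprint  -> chunk id               (Hash Store)
    file_store  = {}   # file name    -> list of chunk ids      (File Store)

    def write_block(filename, block):
        """Deduplication Task Manager: store the block only if its fingerprint is new."""
        fp = hashlib.sha1(block).hexdigest()     # fingerprint the chunk
        if fp in hash_store:                     # duplicate: reference the existing chunk
            chunk_id = hash_store[fp]
        else:                                    # new data: write chunk, record fingerprint
            chunk_id = len(chunk_store)
            chunk_store[chunk_id] = block
            hash_store[fp] = chunk_id
        file_store.setdefault(filename, []).append(chunk_id)

    write_block("a.txt", b"GHI")   # first copy is written to the Chunk Store
    write_block("b.txt", b"GHI")   # second copy only adds a reference in the File Store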
3.4. Store Manager

Deduplication uses three storage components: the File Store, the Hash Store and the Chunk Store. These stores are managed and updated by the Store Manager.

4. General Working

Whenever a user wants to create, modify or delete a file, the inline deduplication process is invoked. Since we are using target deduplication, we carry out deduplication exactly where the data is being stored, in memory. Once the modified data stream has been obtained, the Deduplication Task Manager intervenes and takes control of it instead of letting it be written directly to memory. The data stream is separated into chunks, or blocks of equal size, and a hash function gives us the hash value, or fingerprint, of each chunk. Next, we check whether this fingerprint is present in the Hash Store. If a match is found, a block with the same contents is already present in memory and we do not need to write it again; we simply store a reference to this block in the File Store. If a match is not found in the Hash Store, we write the block to memory, add its fingerprint to the Hash Store and record a reference to the block at the appropriate position in the File Store.

5. Chunking

As we have seen, deduplication can be carried out at three levels: file level, block level and bit level. Block-level deduplication is the most practical. The more important questions are the size of each block and whether it should be fixed or variable. Generally the block size is taken to be 512 bits or a multiple of 512, as conventional hash algorithms work to this tune. As the data stream enters, we can mark a boundary at the end of every 512 bits, so that the next block of 512 bits begins at the following bit. We can thus segregate the data stream into distinct, well-defined blocks.

There are two ways in which we can chunk the data: Fixed Size Chunking (FSC) or Content Defined Chunking (CDC). Fixed Size Chunking is simpler; the data stream is divided into equal-sized chunks. But any change made inside a chunk causes all of the following chunks to shift by that number of bits, due to the avalanche effect, so even a single-bit change will prevent a 99% similar block from matching a previously stored one. On the other hand, if we use CDC, which sets the chunk boundaries dynamically according to the content, we spend far more time and processing power, as the complexity of the operation increases greatly.

As shown in Figure 3(b), a data stream ABCDEF is divided into two blocks of equal size, ABC and DEF. When we add Z after C, under FSC the hash value of the entire following sequence changes: even though DEF is still present, we will not be able to identify it as a duplicate block. If we use variable size chunking (CDC), as shown in Figure 3(a), we can allocate a separate block to Z; DEF remains its own block, its hash value stays the same, and the duplicate data is identified.

Figure 3. Problems in fixed size chunking (FSC)

Using a sliding window of 512 bits, we would have to shift it bit by bit over the entire data stream.
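The contrast between the two chunking strategies can be sketched as follows. This is a toy Python example, not the paper's chunker: the rolling checksum stands in for the 512-bit sliding window discussed above, and the window, divisor, minimum and maximum sizes are illustrative assumptions.

    import random

    def fixed_size_chunks(data, size=64):
        """Fixed Size Chunking: cut every `size` bytes, regardless of content."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def content_defined_chunks(data, window=16, divisor=64, min_size=32, max_size=256):
        """Content Defined Chunking sketch: declare a boundary wherever a rolling
        checksum of the last `window` bytes is divisible by `divisor`, so
        boundaries depend on content rather than on absolute offsets."""
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            rolling += byte
            if i - start >= window:
                rolling -= data[i - window]          # slide the window forward
            length = i - start + 1
            if (length >= min_size and rolling % divisor == 0) or length >= max_size:
                chunks.append(data[start:i + 1])
                start, rolling = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    random.seed(0)
    base = bytes(random.randrange(256) for _ in range(4096))
    edited = base[:1000] + b"Z" + base[1000:]        # insert a single byte
    shared = set(content_defined_chunks(base)) & set(content_defined_chunks(edited))
    # With CDC most chunks are typically unchanged after the insertion,
    # whereas with fixed_size_chunks every chunk after the edit shifts.
    print(len(shared), "chunks shared after a one-byte insertion")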
6. Hashing

Rather than handling the entire original sequence, we compress each chunk using a hash function, giving us a much smaller, handy fingerprint. This makes comparison and storage simpler. We can use any conventional hash function, such as MD5 or SHA-1, to create the fingerprint.

The MD5 algorithm reduces a string of 512 bits to 128 bits. Since the compressed fingerprint is much smaller than the original, two different strings may produce exactly the same fingerprint. This is known as a collision, and in MD5 the probability of collision is 2^-64. SHA-1 is more secure in this regard: it reduces the original string of 512 bits to 160 bits, thereby reducing the probability of collision to 2^-80, which is much lower than that of MD5. However, MD5 gives better throughput in terms of system performance; it is faster and costs only about a third as much as SHA-1. Both MD5 and SHA-1 consist of many rounds and steps of logical, arithmetic and shift operations on the block of data, making them complicated so as to reduce the possibility of a collision. Since both algorithms suffer from the drawbacks mentioned above, the way forward would be to design a deduplication-specific hash algorithm that combines the benefits of SHA-1 and MD5 and is more suitable for our work.
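As a hedged illustration of fingerprinting one 512-bit chunk with the two algorithms discussed (using Python's standard hashlib, not a deduplication-specific hash):

    import hashlib

    chunk = b"A" * 64                       # one 512-bit (64-byte) chunk

    md5_fp  = hashlib.md5(chunk).digest()   # 16 bytes = 128-bit fingerprint
    sha1_fp = hashlib.sha1(chunk).digest()  # 20 bytes = 160-bit fingerprint

    print(len(md5_fp) * 8, len(sha1_fp) * 8)   # 128 160
    # A shorter fingerprint is cheaper to store and compare, but increases
    # the chance that two different chunks collide on the same fingerprint.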
7. Indexing

Once we have created fingerprints for all of the chunks, we can check whether the current chunk is already present in memory. First, let us understand which storage structures, or stores, are used.

7.1 Chunk Store

The actual unique data blocks are kept in the Chunk Store. A chunk may be part of many different files, but only one copy is maintained, and it is referenced by all the files containing it.

7.2 Hash Store

The Hash Store contains the compressed fingerprints of the corresponding chunks held in the Chunk Store. It is these fingerprints that are compared when searching for a duplicate data block.

7.3 File Store

The File Store is a logical table describing all the files. Each file is broken into chunks, and a reference to each chunk is stored in the File Store. In the Linux file system it corresponds to the inode table.

Indexing deals with searching the Hash Store for a matching fingerprint.

8. Update Stores

As stated above, the stores need to be updated whenever a file is modified. If a match is found in the Hash Store, only the File Store needs to be updated. If no matching hash is found, the new fingerprint is added to the Hash Store, the new block is written to memory, and its address is stored in the appropriate place in the File Store.

9. Future Scope

One of the major drawbacks of data deduplication is that a significant amount of time is lost in chunking, fingerprinting and indexing, which degrades system performance. To address this, we can use pipelining and parallelism, especially on multicore systems [1][2], processing on several cores simultaneously. Chunking and hashing are independent of each other, and even the chunking and hashing of adjacent blocks are not interdependent, so all of these operations can be carried out concurrently. The results are then combined at the indexing stage, where they are needed serially, so that no block processed in parallel is missed. If parallelism is used efficiently, it becomes easier to integrate deduplication into the computer, as it reduces runtime overheads and prevents the processor from devoting all of its resources to the deduplication process.

All popular operating systems should employ the deduplication process before the actual writing of data takes place, so as to make the best use of memory. Also, the hashing function should be specific to deduplication: we can combine the advantages of existing hashing algorithms to make a new one which gives high performance while minimizing the risk of collision. Thus deduplication has a lot of scope in the future and should become a basic, built-in functionality of all new-age systems.
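The parallel chunk-fingerprinting idea of Section 9 can be sketched with Python's standard concurrent.futures; this is a hedged example, and the chunk size and worker count are illustrative assumptions rather than values from the paper.

    import hashlib
    from concurrent.futures import ProcessPoolExecutor

    def fingerprint(chunk):
        """Hashing one chunk is independent of every other chunk."""
        return hashlib.sha1(chunk).hexdigest()

    def parallel_fingerprints(data, chunk_size=4096, workers=4):
        # Chunking: split the stream into fixed-size blocks.
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        # Hashing: fingerprint the chunks concurrently on several cores.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            fingerprints = list(pool.map(fingerprint, chunks))
        # Indexing still consumes the fingerprints serially, in stream order,
        # so no block processed in parallel is missed.
        return fingerprints

    if __name__ == "__main__":          # guard required for process-based workers
        data = b"some large data stream" * 10000
        print(len(parallel_fingerprints(data)), "fingerprints computed in parallel")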
10. Conclusion

Data deduplication is more a necessity than a mere tool for making maximum use of memory. It is simple to implement, and its only drawback, consuming too much processing power, can be overcome with the help of parallelism and pipelining. All operating systems should have a built-in facility for deduplication, in all file systems.
11. Mathematical Model

We now provide a model of the system in terms of set theory.

1. Let S be the de-duplication system.
2. Identify the inputs F and M, where F is the file and M is its metadata: S = {F, M, ...}
   F = {i | i is a collection of raw data}
   M = {t | t is a collection of information, i.e. the inodes of the file}
3. The output is identified as O: S = {F, M, O, ...}
   O = {o | o is metadata, i.e. the inode structure of the file}
4. The processor is identified as P: S = {F, M, O, P, ...}
   P = {S1, S2, S3, S4}
5. S1 is the set of chunking-module activities and associated data:
   S1 = {S1_in, S1_p, S1_op}
   S1_in = {f | f is a valid stream of bytes}
   S1_p = {f | f is the chunking function converting S1_in to S1_op}
   S1_p(S1_in) = S1_op
   S1_op = {t | t is the output generated by the chunking algorithm, i.e. chunks of data}
6. S2 is the set of fingerprinting activities and associated data:
   S2 = {S2_in, S2_p, S2_op}
   S2_in = {t | t is a valid chunk of data}
   S2_p = {t | t is the set of functions for fingerprinting}
   S2_p(S2_in) = S2_op
   S2_op = {t | t is the fingerprint of S2_in}
7. S3 is the set of modules for searching the hash store and related activities:
   S3 = {S3_in1, S3_in2, S3_p, S3_op}
   S3_in1 = {t | t is a valid fingerprint of a chunk}
   S3_in2 = {t | t is the hash store}
   S3_p = {t | t is the set of functions for searching fingerprints in the hash store}
   S3_p(S3_in1, S3_in2) = S3_op
   S3_op = {t | t is a flag indicating the presence or absence of the chunk}
8. S4 is the set of modules and activities that update the stores:
   S4 = {S4_in, S4_p, S4_op}
   S4_in = {(t1, t2, t3) | t1 is a valid chunk value, t2 is the fingerprint of t1, t3 is the search flag associated with t1}
   S4_p = {t | t is the set of functions for updating the chunk, hash and file metadata stores}
   S4_p(S4_in) = S4_op
   S4_op = {t | t is the metadata, i.e. the inode of the updated cluster}
9. Identify the failure cases as F: S = {F, M, O, P, F, ...}
   Failure occurs when F = {Φ}, when F = {p | p is a collision of fingerprints}, or when F = {p | p does not have a fingerprint in the hash store}.
10. Identify the success (terminating) cases as E: S = {F, M, O, P, F, E, ...}
    Success is defined as E = {p | p is a successfully updated hash, chunk and file store}.
11. Identify the initial condition as S0: S = {F, M, O, P, F, E, S0}
    The initial condition for deduplication is that redundant data should be present in the chunk store and the respective fingerprints of the chunks must be present in the hash store, i.e. S3_in2 ≠ {Φ} if the chunk store is not null.

12. References

[1] Keren Jin and Ethan L. Miller, "The Effectiveness of Deduplication on Virtual Machine Disk Images", SYSTOR 2009, Haifa, Israel, May 2009.
[2] Wen Xia, Hong Jiang, Dan Feng, Lei Tian, Min Fu and Zhongtao Wang, "P-Dedupe: Exploiting Parallelism in Data Deduplication System", IEEE Seventh International Conference on Networking, Architecture, and Storage, 2012.
[3] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise and P. Camble, "Sparse indexing: large scale, inline deduplication using sampling and locality", in Proceedings of the 7th Conference on File and Storage Technologies (FAST), USENIX Association, 2009.
[4] [Online; accessed 28 September 2013]
[5] [Online; accessed 5 October 2013]
[6] [Online; accessed 5 October 2013]
[7] [Online; accessed 5 October 2013]
