Inline Deduplication



binarywarriors5@gmail.com

1.1 Inline vs. Post-process Deduplication

In target-based deduplication, the deduplication engine can process data for duplicates either in real time (i.e., as the data is sent to the target) or after it has been stored in the target storage. The former is called inline deduplication. [2]

1.2 Ideas of Implementation

1. Chunk-based deduplication: With this method, the data to be deduplicated is broken into variable-length chunks using content-based chunk boundaries, and incoming chunks are compared against the chunks already in the store by hash comparison; only chunks that are not already present are stored. The sparse indexing approach described below uses significantly less RAM than keeping a full chunk index.

To solve the chunk-lookup disk bottleneck problem, we rely on chunk locality: the tendency for chunks in backup data streams to reoccur together. That is, if the last time we encountered chunk A it was surrounded by chunks B, C, and D, then the next time we encounter A (even in a different backup) it is likely that we will also encounter B, C, or D nearby. We perform stream deduplication by breaking up each input stream into segments, each of which contains thousands of chunks. [1]

To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks as samples; our sparse index maps these samples' hashes to the already-stored segments in which they occur. By using an appropriately low sampling rate, we can ensure that the sparse index is small enough to fit easily into RAM while still obtaining excellent deduplication. At the same time, only a few seeks are required per segment to load the information for its chosen segments, avoiding any disk bottleneck and achieving good throughput. Of course, since we deduplicate each segment against only a limited number of other segments, we occasionally store duplicate chunks. However, due to our lower RAM requirements, we can afford to use smaller chunks, which more than compensates for the loss of deduplication that the occasional duplicate chunk causes.
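To make the basic hash-comparison idea above concrete, here is a minimal sketch of chunk-based deduplication against an in-memory index of every stored chunk (the structure the next section calls the full chunk index). The names (ChunkStore, store_chunk) and the in-memory container are illustrative assumptions, not the paper's code; they merely show that duplicates are detected by hash and only new chunks are written.

    import hashlib

    class ChunkStore:
        """Minimal sketch: inline, chunk-based deduplication with a full
        in-memory chunk index (hash -> location). Illustrative only; a real
        system uses content-defined chunking and persistent chunk containers."""

        def __init__(self):
            self.index = {}                # chunk hash -> offset in the container
            self.container = bytearray()   # stand-in for on-disk chunk containers

        def store_chunk(self, chunk: bytes) -> int:
            """Store a chunk only if its hash is not already indexed;
            return the chunk's location either way."""
            h = hashlib.sha1(chunk).digest()   # SHA-1, as used later in the text
            loc = self.index.get(h)
            if loc is None:                    # new chunk: write it and index it
                loc = len(self.container)
                self.container.extend(chunk)
                self.index[h] = loc
            return loc                         # duplicates are just re-referenced

        def store_stream(self, chunks) -> list[int]:
            """Deduplicate a stream of already-chunked data."""
            return [self.store_chunk(c) for c in chunks]

Note that the index here grows with the number of unique chunks ever stored, which is exactly why, at scale, it cannot stay in RAM.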

The chunk-lookup disk bottleneck

The traditional way to implement inline, chunk-based deduplication is to use a full chunk index: a key-value index of all the stored chunks, where the key is a chunk's hash and the value holds metadata about that chunk, including where it is stored on disk. It is not cost effective to keep all of this index in RAM. However, if we keep the index on disk, then, because the stream of incoming chunk hashes has no short-term locality, we will need one disk seek per chunk-hash lookup. This is the chunk-lookup disk bottleneck that needs to be avoided.

1.3 Implementation

Under the sparse indexing approach, segments are the unit of storage and retrieval. A segment is a sequence of chunks. Data streams are broken into segments in a two-step process: first, the data stream is broken into a sequence of variable-length chunks using a chunking algorithm, and, second, the resulting chunk sequence is broken into a sequence of segments using a segmenting algorithm. Segments are usually on the order of a few megabytes. We say that two segments are similar if they share a number of chunks.

Segments are represented in the store by their manifests: a manifest, or segment recipe, is a data structure that allows a segment to be reconstructed from its chunks, which are stored separately in one or more chunk containers so that chunks can be shared between segments. A segment's manifest records its sequence of chunks, giving for each chunk its hash, where it is stored on disk, and possibly its length. Every stored segment has a manifest that is stored on disk.

Incoming segments are deduplicated against similar, existing segments in the store. Deduplication proceeds in two steps: first, we identify among all the segments in the store some that are most similar to the incoming segment, which we call champions, and, second, we deduplicate against those segments by finding the chunks they share with the incoming segment, which do not need to be stored again.

To identify similar segments, we sample the chunk hashes of the incoming segment and use an in-RAM index to determine which already-stored segments contain how many of those hashes. A simple and fast way to sample is to choose as a sample every hash whose first n bits are zero; this results in an average sampling rate of 1/2^n, that is, on average one in 2^n hashes is chosen as a sample. We call the chosen hashes hooks. The in-memory index, called the sparse index, maps hooks to the manifests in which they occur.
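The sampling rule and the two data structures above lend themselves to a short sketch. The following is an assumed representation of manifests and of hook selection by the leading n bits of a SHA-1 hash; names such as Manifest, ChunkRecord, and is_hook, and the value chosen for n, are illustrative and not taken from the source.

    import hashlib
    from dataclasses import dataclass

    SAMPLE_BITS = 6   # n; average sampling rate is 1 / 2**n (assumed value)

    @dataclass
    class ChunkRecord:
        hash: bytes      # SHA-1 of the chunk's contents
        location: int    # where the chunk lives in its chunk container
        length: int      # chunk length (the text notes this is optional)

    @dataclass
    class Manifest:
        """Segment recipe: the ordered chunk records needed to rebuild a segment."""
        chunks: list[ChunkRecord]

    def chunk_hash(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()

    def is_hook(h: bytes, bits: int = SAMPLE_BITS) -> bool:
        """A hash is a hook if its first `bits` bits are zero."""
        value = int.from_bytes(h, "big")
        return value >> (len(h) * 8 - bits) == 0

    # The sparse index lives in RAM and maps each hook to the IDs of the
    # on-disk manifests in which that hook occurs.
    sparse_index: dict[bytes, list[int]] = {}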

The manifests themselves are kept on disk; the sparse index holds only pointers to them. Once we have chosen champions, we can load their manifests into RAM and use them to deduplicate the incoming segment.

a. Chunking technique

We use the Two-Threshold Two-Divisor (TTTD) chunking algorithm to subdivide the incoming data stream into chunks. TTTD produces variable-sized chunks with smaller size variation than other chunking algorithms.

Choosing champions

Only a few well-chosen manifests suffice, so from among all the manifests returned by querying the sparse index we choose a few to deduplicate against. We call the chosen manifests champions. The algorithm by which we choose champions is as follows: we choose champions one at a time until the maximum allowable number of champions has been found or we run out of candidate manifests. Each time we choose, we pick the manifest with the highest non-zero score, where a manifest gets one point for each hook it has in common with the incoming segment S that is not already present in any previously chosen champion. If there is a tie, we choose the manifest that was stored most recently. The choice of champions is based solely on the hooks in the sparse index; that is, it does not involve any disk accesses. We do not give points for hooks belonging to already chosen manifests because those chunks (and, by chunk locality, the chunks around them) are most likely already covered by the previous champions.

b. Deduplicating against the champions

Once we have determined the champions for the incoming segment, we load their manifests from disk. A small cache of recently loaded manifests can speed this process up somewhat, because adjacent segments sometimes have champions in common. The hashes of the chunks in the incoming segment are then compared with the hashes in the champions' manifests in order to find duplicate chunks. We use the SHA-1 hash algorithm to make false positives here extremely unlikely.

Those chunks that are found not to be present in any of the champions are stored on disk in chunk containers, and a new manifest is created for the incoming segment. The new manifest records the location on disk where each incoming chunk is stored. For chunks that are duplicates of a chunk in one or more of the champions, the location is that of the existing chunk, obtained from the relevant manifest. For new chunks, the on-disk location is where that chunk has just been stored. Once the new manifest is created, it is stored on disk in the manifest store. Finally, we add entries for this manifest to the sparse index, with the manifest's hooks as keys. Some of the hooks may already exist in the sparse index, in which case we add the manifest to the list of manifests pointed to by that hook.
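The champion-selection rule described above can be sketched directly from its description: score candidates by how many of the incoming segment's hooks they cover that no earlier champion already covers, and break ties in favour of the most recently stored manifest. The helper names and the sparse_index shape carry over from the earlier sketches and are assumptions, not the paper's code.

    def choose_champions(hooks: list[bytes],
                         sparse_index: dict[bytes, list[int]],
                         stored_order: dict[int, int],
                         max_champions: int) -> list[int]:
        """Pick up to `max_champions` manifest IDs for an incoming segment.

        hooks        -- the sampled hashes of the incoming segment
        sparse_index -- hook -> manifest IDs containing that hook (in RAM)
        stored_order -- manifest ID -> storage sequence number (higher = newer)
        """
        # Candidate manifests and the hooks of S that each one contains.
        candidates: dict[int, set[bytes]] = {}
        for hook in hooks:
            for mid in sparse_index.get(hook, []):
                candidates.setdefault(mid, set()).add(hook)

        champions: list[int] = []
        covered: set[bytes] = set()            # hooks already covered by champions
        while len(champions) < max_champions and candidates:
            # Score = hooks in common with S not covered by a previous champion.
            def score(mid: int) -> tuple[int, int]:
                return (len(candidates[mid] - covered), stored_order.get(mid, -1))

            best = max(candidates, key=score)  # ties fall to the newest manifest
            if score(best)[0] == 0:            # only non-zero scores qualify
                break
            champions.append(best)
            covered |= candidates.pop(best)
        return champions

As the text notes, this step touches only in-RAM data; no disk access is needed until the chosen champions' manifests are loaded.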

To conserve space, it may be desirable to set a maximum limit on the number of manifests that can be pointed to by any one hook. If the maximum is reached, the oldest manifest is removed from the list before the newest one is added.

1.4 Design of the Implementation

Explanation of the design (in my own words): byte streams are broken into chunks, and the chunk sequence is grouped into segments. Each segment is handed to the champion chooser, which selects the appropriate champions (manifests, i.e., information about already-stored segments) that are likely to contain most of the data in the incoming segment. It does this by looking up the segment's hooks in the sparse index held in RAM; the sparse index returns manifest pointers, which give each manifest's location on disk. The champion chooser then picks the most suitable of these manifest pointers; the selected ones are called champion pointers. These pointers are passed to the deduplicator, which retrieves the corresponding manifests from the manifest store on disk and compares them with the incoming segment. Any chunks that differ from those already on disk are written to the container store, and a manifest describing the incoming segment is stored in the manifest store on disk. The corresponding manifest pointers are also added to the sparse index in RAM for future use.
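Putting subsection b and the design explanation together, a deduplicator sketch might look like the following. It reuses the assumed Manifest, ChunkRecord, chunk_hash, and is_hook definitions from the earlier sketches, keeps the champion manifests in memory, and caps the per-hook manifest list as described above; the container and manifest stores are simple stand-ins, not the paper's actual on-disk formats, and the cap value is an assumption.

    MAX_MANIFESTS_PER_HOOK = 16   # assumed cap; the text only says "a maximum limit"

    def deduplicate_segment(segment_chunks: list[bytes],
                            champion_manifests: list[Manifest],
                            container: bytearray,
                            manifest_store: list[Manifest],
                            sparse_index: dict[bytes, list[int]]) -> Manifest:
        """Deduplicate an incoming segment against its champions' manifests."""
        # Hash -> location for every chunk the champions already describe.
        known = {rec.hash: rec.location
                 for m in champion_manifests for rec in m.chunks}

        records = []
        for data in segment_chunks:
            h = chunk_hash(data)
            if h in known:                      # duplicate: re-reference the old chunk
                records.append(ChunkRecord(h, known[h], len(data)))
            else:                               # new chunk: append to the container store
                loc = len(container)
                container.extend(data)
                known[h] = loc
                records.append(ChunkRecord(h, loc, len(data)))

        # Store the new manifest and register its hooks in the sparse index.
        manifest = Manifest(records)
        manifest_store.append(manifest)
        manifest_id = len(manifest_store) - 1
        for rec in records:
            if is_hook(rec.hash):
                entries = sparse_index.setdefault(rec.hash, [])
                entries.append(manifest_id)
                if len(entries) > MAX_MANIFESTS_PER_HOOK:
                    entries.pop(0)              # drop the oldest manifest for this hook
        return manifest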

1.5 Pros of Inline Deduplication [1]

The advantage of inline deduplication is that it does not require the duplicate data to actually be written to disk. The duplicate object is hashed, compared, and re-referenced on the fly.

1.6 Cons of Inline Deduplication [1]

1. Significant system resources are required to handle the entire deduplication operation in milliseconds.

2. Inline, chunk-based deduplication schemes face the chunk-lookup disk bottleneck problem: they traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored. Unfortunately, at scale it is impractical to keep such an index in RAM, and a disk-based index with one seek per incoming chunk is far too slow.

Solution: To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the need for a full chunk index. Since only the sampled chunks' hashes are kept in RAM and the sampling rate is low, we dramatically reduce the RAM-to-disk ratio required for effective deduplication.

1.7 Where can it be used?

Deduplication, which is practical only with random-access devices, removes redundancy by storing duplicate data only once, and it has become an essential feature of disk-to-disk (D2D) backup solutions.

REFERENCES

[1] http://www.vmworld.com/servlet/jiveservlet/previewbody/2695-102-1-2634/wp-7028.pdf
[2] http://blog.druva.com/2009/01/09/understanding-data-deduplication/

By: Kashish Bhatia