Theoretical Aspects of Storage Systems Autumn 2009

1 Theoretical Aspects of Storage Systems Autumn 2009 Chapter 3: Data Deduplication André Brinkmann

3 Outline
  - Data Deduplication
  - Compare-by-hash strategies
  - Delta-encoding based strategies
  - Measurements

4 Motivation
  - Backups: 26 full backups require 26 times the backup capacity
  - Few changes and high redundancy between different backups
  - Similar behavior can be seen for
      - Virtual machine images
      - Home directories
      - Network file systems (LBFS)
  - Data deduplication removes redundant data and tries to ensure that information is stored only once

5 Different Approaches
  - Fingerprinting: use hashing schemes to characterize the content of a data block
  - Delta encoding: search for near-duplicates and store only the delta between blocks

6 Fingerprinting
  Fingerprinting is based on four stages:
  - Chunking: divide the data stream into chunks of fixed or variable size
  - Fingerprinting: calculate a hash function for each chunk
  - Duplicate detection: compare the hash result with the already stored index
  - Update the indexes and store the data
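
The four stages can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `deduplicate` and `restore` are hypothetical, chunks have a fixed size, and an in-memory dictionary stands in for the chunk index and the chunk store.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 8):
    """Run the four stages on `data` (fixed-size chunks for simplicity)."""
    index = {}   # fingerprint -> chunk data (stands in for index + chunk store)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]   # stage 1: chunking
        fp = hashlib.sha1(chunk).digest()          # stage 2: fingerprinting
        if fp not in index:                        # stage 3: duplicate detection
            index[fp] = chunk                      # stage 4: update index, store data
        recipe.append(fp)
    return index, recipe


def restore(index, recipe) -> bytes:
    """Rebuild the original data from the stored chunks."""
    return b"".join(index[fp] for fp in recipe)
```

For an input such as `b"ABCDEFGH" * 3 + b"12345678"`, only two distinct chunks are stored, while the recipe still lists four fingerprints.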

7 Chunking
  [Figure: a data stream or file divided into consecutive chunks]
  - The process of chunking divides the data stream into smaller, non-overlapping blocks
  - Different approaches
      - Static chunking
      - Content-defined chunking
      - File-based chunking

8 Static Chunking
  - Each chunk has a fixed size
  - Very fast approach
  - Vulnerable to shifts inside the data stream
  - Seldom used, e.g. for
      - Virtual machine deduplication
      - Deduplication of block storage

9 Content-defined Chunking
  - Chunks are generated based on their content
      - Fingerprint (hash calculation) for each substring of size w
      - A chunk ends where the fingerprint f satisfies f mod n = c for some constant 0 <= c < n
  - Influence on chunk size
      - Variable length
      - Expected length is n
  U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, 1994.
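
The boundary rule f mod n = c can be illustrated with a deliberately naive Python sketch. The names `window_fingerprint` and `cdc_chunks`, the polynomial hash, and the parameter defaults are assumptions made for this example; the fingerprint is recomputed from scratch for every window and no minimum or maximum chunk length is enforced.

```python
def window_fingerprint(window: bytes, a: int = 257,
                       modulus: int = (1 << 31) - 1) -> int:
    """Polynomial fingerprint of one window (recomputed from scratch)."""
    f = 0
    for byte in window:
        f = (f * a + byte) % modulus
    return f


def cdc_chunks(data: bytes, w: int = 16, n: int = 64, c: int = 0) -> list[bytes]:
    """Split `data` wherever the fingerprint of the last w bytes is c modulo n."""
    chunks = []
    start = 0
    for end in range(w, len(data) + 1):
        if window_fingerprint(data[end - w:end]) % n == c:
            chunks.append(data[start:end])  # boundary found: close the chunk
            start = end
    if start < len(data):
        chunks.append(data[start:])         # trailing data forms the last chunk
    return chunks
```

Because a boundary depends only on the w bytes in front of it, prepending a byte to the input changes the first chunk (or two) and leaves every later chunk identical, which is the property that static chunking lacks.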

10 Content-defined Chunking
  - Each change only impacts its direct neighbors

11 Special Cases
  - Very small chunks
      - E.g. unfortunate repetition of a 48-byte window
      - Require more memory for the fingerprint than for the actual data
  - Big chunks
      - E.g. many runs of 0
      - High memory demand during processing
  - Solution: define minimum and maximum chunk lengths
      - Typically between 2 KByte and 64 KByte

12 Processing Overhead
  - Content-defined chunking requires the calculation of one fingerprint for each substring of length w, i.e. one fingerprint for each word
  - The processing overhead for a fingerprint typically depends on the string length
      - Small string length: good performance, but bad chunking properties
      - Large string length: good chunking properties, but huge performance impact
  - Way out: use rolling hash functions, which allow a new fingerprint to be calculated from the previous fingerprint in constant time

13 Rolling Hash
  - A rolling hash is a hash function where the input is hashed in a window that moves through the input
  - A few hash functions allow a rolling hash to be computed very quickly: the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window
  - Applications besides data deduplication
      - Rabin-Karp string search algorithm
      - Rsync
  Wikipedia: Rolling hash

14 Rabin Fingerprints
  - Rabin fingerprints only require multiplications and additions:
      $F = c_0 a^{k-1} + c_1 a^{k-2} + c_2 a^{k-3} + \dots + c_{k-1} a^0$
  - Typically, all operations are performed modulo n
  - Of course, the choices of a and n are critical for good hashing properties
  - The calculation of a new fingerprint from an old one just requires one addition, one subtraction, and a multiplication by a
  - Performance results: we have measured 102 MByte/s for each processor core of a 2 GHz processor
  M. O. Rabin. Fingerprinting by random polynomials. Technical report, Center for Research in Computing Technology, 1981.
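
The constant-time update can be sketched as follows. This is a plain polynomial rolling hash over the integers modulo n, matching the formula above; it is not Rabin's original construction over polynomials modulo an irreducible polynomial. The function names and the default values of a and n are assumptions for illustration.

```python
def fingerprint(window: bytes, a: int, n: int) -> int:
    """F = c_0*a^(k-1) + c_1*a^(k-2) + ... + c_(k-1)*a^0, all modulo n."""
    f = 0
    for c in window:
        f = (f * a + c) % n
    return f


def roll(f: int, c_out: int, c_in: int, a_pow: int, a: int, n: int) -> int:
    """Slide the window by one byte; a_pow must equal a^(k-1) mod n."""
    # Remove the contribution of the outgoing byte, shift by a, add the new byte.
    return ((f - c_out * a_pow) * a + c_in) % n


def rolling_fingerprints(data: bytes, k: int, a: int = 257,
                         n: int = (1 << 61) - 1) -> list[int]:
    """Fingerprints of all windows of length k, each derived from its predecessor."""
    if len(data) < k:
        return []
    a_pow = pow(a, k - 1, n)
    f = fingerprint(data[:k], a, n)
    result = [f]
    for i in range(k, len(data)):
        f = roll(f, data[i - k], data[i], a_pow, a, n)
        result.append(f)
    return result
```

The product `c_out * a_pow` is an extra multiplication in this sketch; since `c_out` is a single byte, it could be replaced by a lookup in a precomputed table of 256 entries.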

15 Duplicate Detection
  - The system has to check for every chunk whether it is a duplicate or not
  - Compare-by-hash
      - Calculate one fingerprint for every chunk (typically SHA-1)
      - Check whether this fingerprint is already known to the system
  - SHA-1 is still very costly
      - 73.1 MByte/s throughput on each 2 GHz core

16 Compare-by-Hash
  - The applicability of the approach is (at best) disputed (see e.g. [Hen03]):
    "Use of compare-by-hash is justified by mathematical calculations based on assumptions that range from unproven to demonstrably wrong. The short lifetime and fast transition into obsolescence of cryptographic hashes makes them unsuitable for use in long-lived systems. When hash collisions do occur, they cause silent errors and bugs that are difficult to repair."
  V. Henson. An analysis of compare-by-hash. In HOTOS'03: Proceedings of the 9th conference on Hot Topics in Operating Systems

17 Compare-by-Hash
  - Data loss based on accidental collisions
      - Birthday paradox: it is sufficient for one block to have the same hash value as an arbitrary other block to produce silent data corruption
      - Assuming n data blocks and a hash length of b bits, this probability can be bounded by
          $p \le \frac{n(n-1)}{2} \cdot \frac{1}{2^b}$
      - Assume 1 Exabyte of data ($2^{60}$ bytes), a 4 KByte chunk size ($2^{12}$ bytes), and a 160-bit SHA-1 fingerprint: this gives $n = 2^{48}$ chunks and $p < 2^{-65}$
  - But attacks can become successful as soon as SHA-1 gets broken
  J. Black. Compare-by-hash: a reasoned analysis. In ATEC '06: Proceedings of the annual conference on USENIX '06 Annual Technical Conference
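
The bound can be checked with exact arithmetic. The sketch below assumes the standard birthday bound p <= n(n-1)/2 * 2^(-b) and the parameters given on the slide (1 Exabyte of data, 4 KByte chunks, 160-bit fingerprints); the function name `collision_bound` is hypothetical.

```python
from fractions import Fraction


def collision_bound(n_chunks: int, hash_bits: int) -> Fraction:
    """Birthday bound: upper bound on the probability of any accidental collision."""
    return Fraction(n_chunks * (n_chunks - 1), 2) / Fraction(2 ** hash_bits)


n_chunks = 2 ** 60 // 2 ** 12        # 1 Exabyte split into 4 KByte chunks
p = collision_bound(n_chunks, 160)   # 160-bit SHA-1 fingerprint
print(n_chunks == 2 ** 48, float(p))
```

Under these assumptions the bound lies just below 2^-65, roughly 2.7e-20.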

18 Internal Redundancy
  - Redundancy inside an AFS file system at the University of Paderborn
  - Legend: CDC = content-defined chunking, SC = static chunking, Datei (German for "file") = file-based chunking
  D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009

19 Internal Redundancy based on Data Type D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009

20 Internal Redundancy based on File Size
  [Figure: redundancy (0% to 50%) per file-size class (< 4K up to > 2G) for CDC-8, CDC-16, SC-8, SC-16, and Datei (file-based chunking)]
  D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009

21 Temporal Redundancy
  - Redundancy considering previous backup runs
  D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009

22 Bottleneck Index
  - One fingerprint for each chunk
  - Chunk index: index of all previously accessed chunks
  - Assume a chunk size of 8 KByte
      - $2^{40} / 2^{13} = 2^{27}$ chunks for each TByte of data
      - Using 20-byte SHA-1 fingerprints: 2.5 GByte of index for each TByte of data
  - The index cannot be stored in main memory for large-scale storage systems, but storing it on disk results in no locality in index access (except for archiving)
      - Random I/O accesses on disk
      - 100 MByte/s throughput requires up to 24,000 index lookups per second
      - The disk becomes the bottleneck
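
The index-size arithmetic can be reproduced directly. The helper name `chunk_index_size` is hypothetical, and the calculation counts only the raw fingerprints, ignoring any per-entry metadata a real index would add.

```python
def chunk_index_size(data_bytes: int, chunk_bytes: int, fingerprint_bytes: int):
    """Return (number of chunks, raw index size in bytes) for the given data volume."""
    chunks = data_bytes // chunk_bytes
    return chunks, chunks * fingerprint_bytes


# 1 TByte of data, 8 KByte chunks, 20-byte SHA-1 fingerprints
chunks, index_bytes = chunk_index_size(2 ** 40, 2 ** 13, 20)
print(chunks, index_bytes / 2 ** 30)  # number of chunks, index size in GByte
```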

23 What can we do?
  - Use more disks
      - 200 IO/s per disk
      - 10 disks lead to a throughput of 6 MB/s
  - Use SSDs
      - Intel X25E: IO/s
      - 1 SSD achieves 60 MB/s deduplication throughput
  - Target throughput >> 200 MB/s

24 Approach 1: Bloom Filter (1)
  - Probabilistic data structure similar to a set
      - Insert(key), Lookup(key)
      - Lookup(key) = false: item guaranteed not in set
      - Lookup(key) = true: item probably (!) in set
  - Bloom filter for fingerprints
      - Lookup(fp) = false: no chunk index lookup necessary
      - Lookup(fp) = true: lookup still required
  B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In 6th USENIX Conference on File and Storage Technologies, 2008

25 Bloom Filter (2)
  - Data structure
      - Bitmap b of length m
      - k independent hash functions $h_i$
  - Insert(i): set bit positions $h_1(i), \dots, h_k(i)$ to 1
  - Lookup(i): if any of the bits at positions $h_1(i), \dots, h_k(i)$ is 0, the item is definitely not in the set
  - Probability of false positives: e.g. 2% for k = 4 and 1 byte of filter space for each fingerprint (FP)
  - Does this really help?
  B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In 6th USENIX Conference on File and Storage Technologies, 2008
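
A minimal Bloom filter might look like the sketch below. Deriving the k hash functions from SHA-1 with a one-byte salt is an assumption made for brevity, not the construction used by Zhu et al., and `false_positive_rate` uses the usual approximation (1 - e^(-k/bits_per_item))^k.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k                # m bits, k hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key: bytes):
        # Assumption: k hash functions simulated by salting SHA-1 with the index i.
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def lookup(self, key: bytes) -> bool:
        # False: definitely not in the set. True: probably in the set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def false_positive_rate(bits_per_item: float, k: int) -> float:
    """Approximate false-positive probability of a filter at its design load."""
    return (1 - math.exp(-k / bits_per_item)) ** k
```

With 8 bits (1 byte) per fingerprint and k = 4, the approximation gives about 2.4%, in line with the figure of roughly 2% quoted above.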

26 Approach 2: Locality-Preserving Caching
  - A container is a sequence of new chunks
      - 4 MB to 10 MB
      - Metadata (FP) and data are stored together on disk
  - Chunk lookup
      - Read and cache the FPs of the complete container
      - Look up each subsequent chunk in the container cache
  - Idea
      - Long runs of chunks map to the same container
      - One IO for the complete container metadata
  B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In 6th USENIX Conference on File and Storage Technologies, 2008
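
The effect of caching whole containers can be illustrated with a toy model. The class `ContainerCache` and its fields are hypothetical; dictionaries stand in for the on-disk chunk index and container metadata, each miss is counted as one disk lookup, and there is no cache eviction.

```python
class ContainerCache:
    def __init__(self, chunk_index: dict, containers: dict):
        self.chunk_index = chunk_index  # fingerprint -> container id ("on disk")
        self.containers = containers    # container id -> fingerprints it holds
        self.cached = set()             # fingerprints of containers read so far
        self.disk_lookups = 0

    def is_duplicate(self, fp) -> bool:
        if fp in self.cached:
            return True                 # served from the container cache
        self.disk_lookups += 1          # cache miss: consult the chunk index
        container_id = self.chunk_index.get(fp)
        if container_id is None:
            return False                # new chunk
        # Cache the fingerprints of the complete container in one step.
        self.cached.update(self.containers[container_id])
        return True
```

For a stream that revisits chunks in the order they were originally written, most lookups after the first hit on a container are answered from the cache.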

27 Approach 3: Sparse Indexing
  - Do not keep the complete chunk index
  - Divide the index into segments
      - Each segment is a sequence of chunks of 10 MB in total
      - Choose k champions for each segment
  - Champion index (RAM): maps champion to segment
      - Lookup inside the champion index
  - Idea: long runs of chunks come from the same segment
      - 1 (successful) champion lookup delivers the FPs of many chunks
  - Applied inside HP D2D2500, D2D4000
  M. Lillibridge, et al.: Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST'09)

28 Delta-Encoding
  - Stages: Chunking, Near-Duplicate Detection, Delta-Encoding

29 Near-Duplicate Detection
  - Shingling
      - Calculate the hash value of all w-windows of chunk c
      - Choose the k biggest hash values S (shingles, features)
      - Seek the chunk c' with the maximum number of common features
      - Then, with high probability, c' is similar, but not necessarily identical, to c
  - Standard technique in information retrieval
  C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, May 2008

30 Resemblance Detection
  - The resemblance of two chunks A and B is defined as
      $r(A,B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|} \le 1$
  - Calculating r(A,B) exactly is too computationally intensive
  - Just use a clearly defined subset of shingles, e.g. the biggest values (see [Man 1994])
  - Broder suggests using the k minimal fingerprints and shows that the function is an unbiased estimator
  A. Z. Broder. Identifying and filtering near-duplicate documents. In COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
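
Shingling and the resemblance measure can be sketched as follows. The function names, the use of truncated SHA-1 as the window hash, and the defaults for w and k are assumptions for illustration; `resemblance` computes the exact ratio over all shingles, which is the expensive quantity that the feature subset is meant to approximate.

```python
import hashlib


def shingles(chunk: bytes, w: int) -> set[int]:
    """Hash values of all w-byte windows of the chunk."""
    return {int.from_bytes(hashlib.sha1(chunk[i:i + w]).digest()[:8], "big")
            for i in range(len(chunk) - w + 1)}


def features(chunk: bytes, w: int = 4, k: int = 8) -> set[int]:
    """The k biggest shingle hash values, used as the chunk's features."""
    return set(sorted(shingles(chunk, w), reverse=True)[:k])


def resemblance(a: bytes, b: bytes, w: int = 4) -> float:
    """r(A,B) = |S(A) & S(B)| / |S(A) | S(B)| over the full shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    if not union:  # both inputs shorter than w: defined as 0.0 in this sketch
        return 0.0
    return len(sa & sb) / len(union)
```

For example, `b"abcdefghij"` and `b"abcdefghiX"` share six of eight distinct 4-byte windows, so their resemblance is 0.75.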

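31 Delta Encoding
  - Chunk c can be compressed based on chunk c'
      c  = QWIJKLMNOBCDEFGHZDEFGHIJKL
      c' = ABCDEFGHIJKLMNOP
  - Encoding of c relative to c', written as (operation; length; inserted data or 0-based offset into c'), followed by the resulting text
      (Insert; 2; QW)   QW
      (Copy; 7; 8)      IJKLMNO
      (Copy; 7; 1)      BCDEFGH
      (Insert; 1; Z)    Z
      (Copy; 9; 3)      DEFGHIJKL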
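
Applying such a delta is straightforward. In the sketch below, the tuple layout mirrors the (operation; length; argument) notation of the example, with copy offsets read as 0-based positions in c'; the function name `apply_delta` is hypothetical.

```python
def apply_delta(base: str, ops: list[tuple]) -> str:
    """Rebuild a chunk from the base chunk c' and a list of insert/copy operations."""
    out = []
    for kind, length, arg in ops:
        if kind == "insert":
            if len(arg) != length:
                raise ValueError("insert length does not match its data")
            out.append(arg)                        # literal data carried in the delta
        elif kind == "copy":
            out.append(base[arg:arg + length])     # `length` characters from offset `arg`
        else:
            raise ValueError(f"unknown operation: {kind}")
    return "".join(out)


base = "ABCDEFGHIJKLMNOP"
delta = [("insert", 2, "QW"), ("copy", 7, 8), ("copy", 7, 1),
         ("insert", 1, "Z"), ("copy", 9, 3)]
print(apply_delta(base, delta))  # QWIJKLMNOBCDEFGHZDEFGHIJKL
```

Only the three inserted characters travel as literal data; the remaining 23 characters of c are referenced from c'.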

32 Delta Encoding
  - Douglis and Iyengar claim that
    "Delta-encoding itself has been made extremely efficient, and it should not usually be a bottleneck except in extremely high-bandwidth environments. [...] The inclusion of the Ajtai delta-encoding work in a commercial backup system, also support the argument that DERD will not be limited by the delta-encoding bandwidth."
  F. Douglis and A. Iyengar. Application-specific delta-encoding via resemblance detection. In Proceedings of the 2003 USENIX Annual Technical Conference

33 Delta-Encoding: Memory Overhead
  - Diligent
      - Chunks of size 32 MByte
      - Resemblance detection based on 4 KByte for each shingle
      - The k = 8 maximum-value shingles are used for resemblance detection
  - The feature index for 1 TByte has size $\frac{2^{40}}{m} \cdot k \cdot f$ bytes, where m is the chunk size, k is the number of shingles per chunk, and f is the size of a fingerprint
  - For Diligent, the feature index has size $\frac{2^{40}}{2^{25}} \cdot 8 \cdot 16$ bytes = 4 MByte
  - Optimized for backup applications
  Diligent Technologies. HyperFactor -- a breakthrough in data reduction technology. Diligent White Paper
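
The feature-index size can be checked numerically. The sketch assumes the size is (data volume / m) * k * f and takes f = 16 bytes, a value chosen here only because it reproduces the 4 MByte figure for 32 MByte chunks and k = 8; the function name is hypothetical.

```python
def feature_index_bytes(data_bytes: int, chunk_bytes: int, k: int, f: int) -> int:
    """Size of a feature index holding k features of f bytes for every chunk."""
    return (data_bytes // chunk_bytes) * k * f


# 1 TByte of data, 32 MByte chunks, k = 8 features, assumed f = 16 bytes
size = feature_index_bytes(2 ** 40, 32 * 2 ** 20, 8, 16)
print(size / 2 ** 20)  # index size in MByte
```

Compared with the 2.5 GByte chunk index per TByte in the compare-by-hash setting, an index of this size fits comfortably in main memory.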

34 Discussion Delta-Encoding
  - Delta encoding can help to overcome the disk bottleneck in today's deduplication systems, but
      - reconstruction of chunks can trigger reconstruction of additional chunks
      - this causes a slowdown, both for reading and for writing data
      - the problem becomes worse as the system gets older: the depth of the dedup tree increases
  - A clean-up process can help to limit this problem
  - Diligent has, to our knowledge, restricted the maximum tree depth


More information

File Systems Management and Examples

File Systems Management and Examples File Systems Management and Examples Today! Efficiency, performance, recovery! Examples Next! Distributed systems Disk space management! Once decided to store a file as sequence of blocks What s the size

More information

Frequency Based Chunking for Data De-Duplication

Frequency Based Chunking for Data De-Duplication Frequency Based Chunking for Data De-Duplication Guanlin Lu, Yu Jin, and David H.C. Du Department of Computer Science and Engineering University of Minnesota, Twin-Cities Minneapolis, Minnesota, USA (lv,

More information

PLC-Cache: Endurable SSD Cache for Deduplication-based Primary Storage

PLC-Cache: Endurable SSD Cache for Deduplication-based Primary Storage PLC-Cache: Endurable SSD Cache for Deduplication-based Primary Storage Jian Liu, Yunpeng Chai, Yuan Xiao Renmin University of China Xiao Qin Auburn University Speaker: Tao Xie MSST, June 5, 2014 Deduplication

More information

Improving the Database Logging Performance of the Snort Network Intrusion Detection Sensor

Improving the Database Logging Performance of the Snort Network Intrusion Detection Sensor -0- Improving the Database Logging Performance of the Snort Network Intrusion Detection Sensor Lambert Schaelicke, Matthew R. Geiger, Curt J. Freeland Department of Computer Science and Engineering University

More information

Contents. WD Arkeia Page 2 of 14

Contents. WD Arkeia Page 2 of 14 Contents Contents...2 Executive Summary...3 What Is Data Deduplication?...4 Traditional Data Deduplication Strategies...5 Deduplication Challenges...5 Single-Instance Storage...5 Fixed-Block Deduplication...6

More information

The What, Why and How of the Pure Storage Enterprise Flash Array

The What, Why and How of the Pure Storage Enterprise Flash Array The What, Why and How of the Pure Storage Enterprise Flash Array Ethan L. Miller (and a cast of dozens at Pure Storage) What is an enterprise storage array? Enterprise storage array: store data blocks

More information

HTTP-Level Deduplication with HTML5

HTTP-Level Deduplication with HTML5 HTTP-Level Deduplication with HTML5 Franziska Roesner and Ivayla Dermendjieva Networks Class Project, Spring 2010 Abstract In this project, we examine HTTP-level duplication. We first report on our initial

More information

Sistemas Operativos: Input/Output Disks

Sistemas Operativos: Input/Output Disks Sistemas Operativos: Input/Output Disks Pedro F. Souto (pfs@fe.up.pt) April 28, 2012 Topics Magnetic Disks RAID Solid State Disks Topics Magnetic Disks RAID Solid State Disks Magnetic Disk Construction

More information

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication Table of Contents Introduction... 3 Shortest Possible Backup Window... 3 Instant

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Reducing Replication Bandwidth for Distributed Document Databases

Reducing Replication Bandwidth for Distributed Document Databases Reducing Replication Bandwidth for Distributed Document Databases Lianghong Xu, Andrew Pavlo, Sudipta Sengupta Jin Li, Gregory R. Ganger Carnegie Mellon University, Microsoft Research CMU-PDL-14-108 December

More information

Storage Research in the UCSC Storage Systems Research Center (SSRC)

Storage Research in the UCSC Storage Systems Research Center (SSRC) Storage Research in the UCSC Storage Systems Research Center (SSRC) Scott A. Brandt (scott@cs.ucsc.edu) Computer Science Department Storage Systems Research Center Jack Baskin School of Engineering University

More information

COS 318: Operating Systems

COS 318: Operating Systems COS 318: Operating Systems File Performance and Reliability Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/ Topics File buffer cache

More information

Deduplication in unstructured-data storage systems

Deduplication in unstructured-data storage systems ELEKTROTEHNIŠKI VESTNIK 82(5): 233 242, 2015 ORIGINAL SCIENTIFIC PAPER Deduplication in unstructured-data storage systems Andrej Tolič 1,, Andrej Brodnik 1,2 1 University of Ljubljana, Faculty of Computer

More information

Application-Aware Client-Side Data Reduction and Encryption of Personal Data in Cloud Backup Services

Application-Aware Client-Side Data Reduction and Encryption of Personal Data in Cloud Backup Services Fu YJ, Xiao N, Liao XK et al. Application-aware client-side data reduction and encryption of personal data in cloud backup services. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 28(6): 1012 1024 Nov. 2013.

More information

Fragmentation in in-line. deduplication backup systems

Fragmentation in in-line. deduplication backup systems Fragmentation in in-line 5/6/2013 deduplication backup systems 1. Reducing Impact of Data Fragmentation Caused By In-Line Deduplication. Michal Kaczmarczyk, Marcin Barczynski, Wojciech Kilian, Cezary Dubnicki.

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

Tradeoffs in Scalable Data Routing for Deduplication Clusters

Tradeoffs in Scalable Data Routing for Deduplication Clusters Tradeoffs in Scalable Data Routing for Deduplication Clusters Wei Dong Princeton University Fred Douglis EMC Kai Li Princeton University and EMC Hugo Patterson EMC Sazzala Reddy EMC Philip Shilane EMC

More information

Everything you need to know about flash storage performance

Everything you need to know about flash storage performance Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices

More information

Efficient Deduplication in Disk- and RAM-based Data Storage Systems

Efficient Deduplication in Disk- and RAM-based Data Storage Systems Efficient Deduplication in Disk- and RAM-based Data Storage Systems Andrej Tolič and Andrej Brodnik University of Ljubljana, Faculty of Computer and Information Science, Slovenia {andrej.tolic,andrej.brodnik}@fri.uni-lj.si

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD

More information

Reference Guide WindSpring Data Management Technology (DMT) Solving Today s Storage Optimization Challenges

Reference Guide WindSpring Data Management Technology (DMT) Solving Today s Storage Optimization Challenges Reference Guide WindSpring Data Management Technology (DMT) Solving Today s Storage Optimization Challenges September 2011 Table of Contents The Enterprise and Mobile Storage Landscapes... 3 Increased

More information

Chapter 8: On the Use of Hash Functions in. Computer Forensics

Chapter 8: On the Use of Hash Functions in. Computer Forensics Harald Baier Hash Functions in Forensics / WS 2011/2012 2/41 Chapter 8: On the Use of Hash Functions in Computer Forensics Harald Baier Hochschule Darmstadt, CASED WS 2011/2012 Harald Baier Hash Functions

More information

EMAIL DATA DE-DUPLICATION SYSTEM

EMAIL DATA DE-DUPLICATION SYSTEM EMAIL DATA DE-DUPLICATION SYSTEM A Final Project Presented to The Faculty of the Department of General Engineering San José State University In Partial Fulfillment of the Requirements for the Degree Master

More information

E-Guide. Sponsored By:

E-Guide. Sponsored By: E-Guide An in-depth look at data deduplication methods This E-Guide will discuss the various approaches to data deduplication. You ll learn the pros and cons of each, and will benefit from independent

More information

PRUN : Eliminating Information Redundancy for Large Scale Data Backup System

PRUN : Eliminating Information Redundancy for Large Scale Data Backup System PRUN : Eliminating Information Redundancy for Large Scale Data Backup System Youjip Won 1 Rakie Kim 1 Jongmyeong Ban 1 Jungpil Hur 2 Sangkyu Oh 2 Jangsun Lee 2 1 Department of Electronics and Computer

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Efficiently Storing Virtual Machine Backups

Efficiently Storing Virtual Machine Backups Efficiently Storing Virtual Machine Backups Stephen Smaldone, Grant Wallace, and Windsor Hsu Backup Recovery Systems Division EMC Corporation Abstract Physical level backups offer increased performance

More information

Open Access Improving Read Performance with BP-DAGs for Storage-Efficient File Backup

Open Access Improving Read Performance with BP-DAGs for Storage-Efficient File Backup Send Orders for Reprints to reprints@benthamscience.net 90 The Open Electrical & Electronic Engineering Journal, 2013, 7, 90-97 Open Access Improving Read Performance with BP-DAGs for Storage-Efficient

More information

Backup Software Data Deduplication: What you need to know. Presented by W. Curtis Preston Executive Editor & Independent Backup Expert

Backup Software Data Deduplication: What you need to know. Presented by W. Curtis Preston Executive Editor & Independent Backup Expert Backup Software Data Deduplication: What you need to know Presented by W. Curtis Preston Executive Editor & Independent Backup Expert When I was in the IT Department When I started as backup guy at $35B

More information

Reducing impact of data fragmentation caused by in-line deduplication

Reducing impact of data fragmentation caused by in-line deduplication Reducing impact of data fragmentation caused by in-line deduplication Michal Kaczmarczyk, Marcin Barczynski, Wojciech Kilian, and Cezary Dubnicki 9LivesData, LLC {kaczmarczyk, barczynski, wkilian, dubnicki}@9livesdata.com

More information

Reliability-Aware Deduplication Storage: Assuring Chunk Reliability and Chunk Loss Severity

Reliability-Aware Deduplication Storage: Assuring Chunk Reliability and Chunk Loss Severity Reliability-Aware Deduplication Storage: Assuring Chunk Reliability and Chunk Loss Severity Youngjin Nam School of Computer and Information Technology Daegu University Gyeongsan, Gyeongbuk, KOREA 712-714

More information

Building a High Performance Deduplication System Fanglu Guo and Petros Efstathopoulos

Building a High Performance Deduplication System Fanglu Guo and Petros Efstathopoulos Building a High Performance Deduplication System Fanglu Guo and Petros Efstathopoulos Symantec Research Labs Symantec FY 2013 (4/1/2012 to 3/31/2013) Revenue: $ 6.9 billion Segment Revenue Example Business

More information

Data Deduplication in Windows Server 2012 and Ubuntu Linux

Data Deduplication in Windows Server 2012 and Ubuntu Linux Xu Xiaowei Data Deduplication in Windows Server 2012 and Ubuntu Linux Bachelor s Thesis Information Technology May 2013 DESCRIPTION Date of the bachelor's thesis 2013/5/8 Author(s) Xu Xiaowei Name of the

More information

Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup

Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup Deepavali Bhagwat University of California 1156 High Street Santa Cruz, CA 9564 dbhagwat@soe.ucsc.edu Kave Eshghi Hewlett-Packard

More information

bup: the git-based backup system Avery Pennarun

bup: the git-based backup system Avery Pennarun bup: the git-based backup system Avery Pennarun 2010 10 25 The Challenge Back up entire filesystems (> 1TB) Including huge VM disk images (files >100GB) Lots of separate files (500k or more) Calculate/store

More information

Unit 4.3 - Storage Structures 1. Storage Structures. Unit 4.3

Unit 4.3 - Storage Structures 1. Storage Structures. Unit 4.3 Storage Structures Unit 4.3 Unit 4.3 - Storage Structures 1 The Physical Store Storage Capacity Medium Transfer Rate Seek Time Main Memory 800 MB/s 500 MB Instant Hard Drive 10 MB/s 120 GB 10 ms CD-ROM

More information