Theoretical Aspects of Storage Systems Autumn 2009
1 Theoretical Aspects of Storage Systems, Autumn 2009. Chapter 3: Data Deduplication. André Brinkmann
3 Outline: data deduplication; compare-by-hash strategies; delta-encoding based strategies; measurements
4 Motivation. Backups: 26 full backups require 26 times the backup capacity, yet there are only few changes and high redundancy between different backups. Similar behavior can be seen for virtual machine images, home directories, and network file systems (LBFS). Data deduplication removes redundant data and tries to ensure that each piece of information is stored only once.
5 Different Approaches. Fingerprinting: use hashing schemes to characterize the content of a data block. Delta encoding: search for near-duplicates and store only the delta between blocks.
6 Fingerprinting. Fingerprinting is based on four stages: chunking (divide the data stream into chunks of fixed or variable size), fingerprinting (calculate a hash function for each chunk), duplicate detection (compare the hash result with the already stored index), and finally updating the indexes and storing the data.
7 Chunking. The process of chunking divides a data stream or file into smaller, non-overlapping blocks (chunks). Different approaches: static chunking, content-defined chunking, file-based chunking.
8 Static Chunking. Each chunk has a fixed size. This is a very fast approach, but it is vulnerable to shifts inside the data stream, so it is seldom used; applications include virtual machine deduplication and deduplication of block storage.
9 Content-defined Chunking. Chunks are generated based on their content: a fingerprint (hash) is calculated for each substring of size w, and a chunk ends if the fingerprint f satisfies f mod n = c for some constant 0 <= c < n. The choice of n influences the chunk size: chunks have variable length, with an expected length of n. U. Manber: Finding Similar Files in a Large File System. In Proceedings of the USENIX Winter 1994 Technical Conference, 1994.
10 Content-defined Chunking. Each change only impacts its direct neighbors; chunk boundaries elsewhere in the stream are unaffected.
11 Special Cases. Very small chunks (e.g., an unfortunate repetition of a 48-byte window) require more memory for the fingerprint than for the actual data. Big chunks (e.g., many runs of zeros) cause high memory demand during processing. Solution: define a minimum and a maximum chunk length, typically between 2 KB and 64 KB.
12 Processing Overhead. Content-defined chunking requires the calculation of one fingerprint for each substring of length w. The processing overhead for a fingerprint typically depends on the string length: a small string length gives good performance but bad chunking properties, while a large string length gives good chunking properties but a huge performance impact. The way out is to use rolling hash functions, which allow the new fingerprint to be calculated from the previous fingerprint in constant time.
13 Rolling Hash. A rolling hash is a hash function where the input is hashed in a window that moves through the input. A few hash functions allow a rolling hash to be computed very quickly: the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window. Applications besides data deduplication include the Rabin-Karp string search algorithm and rsync. (Wikipedia: Rolling hash)
14 Rabin Fingerprints. Rabin fingerprints require only multiplications and additions: F = c_0·a^(k-1) + c_1·a^(k-2) + c_2·a^(k-3) + ... + c_(k-1)·a^0. Typically, all operations are performed modulo n. Of course, the choices of a and n are critical for good hashing properties. The calculation of a new fingerprint from an old one requires just one addition, one subtraction, and one multiplication by a. Performance results: we have measured 102 MB/s for each processor core of a 2 GHz processor. M. O. Rabin: Fingerprinting by Random Polynomials. Technical report, Center for Research in Computing Technology, 1981.
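The constant-time update claimed above can be checked directly. The sketch below uses hypothetical values for the base a, modulus n, and window size w; it verifies that one subtraction, one multiplication by a, and one addition reproduce the fingerprint computed from scratch.

```python
# Rolling update of F = c_0*a^(w-1) + ... + c_(w-1)*a^0 (mod n).
# Parameters a, n, w are hypothetical, chosen only for illustration.
a, n, w = 101, (1 << 31) - 1, 4

def full_hash(win: bytes) -> int:
    """Recompute the fingerprint of a window from scratch (Horner scheme)."""
    h = 0
    for byte in win:
        h = (h * a + byte) % n
    return h

def roll(h: int, old: int, new: int) -> int:
    """Constant-time update: remove the oldest byte, shift, add the new byte."""
    return ((h - old * pow(a, w - 1, n)) * a + new) % n

data = b"abcdefgh"
h = full_hash(data[0:w])
for i in range(1, len(data) - w + 1):
    h = roll(h, data[i - 1], data[i - 1 + w])
    assert h == full_hash(data[i:i + w])  # rolling update matches recomputation
```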
15 Duplicate Detection. The system has to check for every chunk whether it is a duplicate or not. Compare-by-hash: calculate one fingerprint for every chunk (typically SHA-1) and check whether this fingerprint is already known to the system. SHA-1 is still very costly: 73.1 MB/s throughput on each 2 GHz core.
16 Compare-by-Hash. The applicability of the approach is (at best) debated (see e.g. [Hen03]): "Use of compare-by-hash is justified by mathematical calculations based on assumptions that range from unproven to demonstrably wrong. The short lifetime and fast transition into obsolescence of cryptographic hashes makes them unsuitable for use in long-lived systems. When hash collisions do occur, they cause silent errors and bugs that are difficult to repair." V. Henson: An Analysis of Compare-by-Hash. In HotOS '03: Proceedings of the 9th Conference on Hot Topics in Operating Systems.
17 Compare-by-Hash. Data loss through accidental collisions (birthday paradox): it is sufficient for one block to have the same hash value as an arbitrary other block to produce silent data corruption. Assuming n data blocks and a hash length of b bits, this probability can be bounded by p <= n(n-1) / 2^(b+1). Assume 1 exabyte of data (2^60 bytes), a 4 KB chunk size (2^12 bytes), and a 160-bit SHA-1 fingerprint: then p < 2^-65. But attacks can become successful as soon as SHA-1 gets broken. J. Black: Compare-by-Hash: A Reasoned Analysis. In ATEC '06: Proceedings of the USENIX '06 Annual Technical Conference.
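The bound above can be evaluated exactly for the slide's parameters. A short sketch, using exact rational arithmetic so that nothing is lost to floating-point rounding:

```python
from fractions import Fraction

def collision_bound(n: int, b: int) -> Fraction:
    """Birthday bound on any accidental collision among n chunks
    with b-bit fingerprints: p <= n*(n-1) / 2^(b+1)."""
    return Fraction(n * (n - 1), 2 ** (b + 1))

# 1 exabyte (2^60 bytes) in 4 KB chunks (2^12 bytes) -> 2^48 chunks
n = 2 ** 60 // 2 ** 12
p = collision_bound(n, 160)   # 160-bit SHA-1 fingerprints

# n(n-1) < 2^96, divided by 2^161, so p < 2^-65 (about 3e-20)
assert p < Fraction(1, 2 ** 65)
```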
18 Internal Redundancy. Redundancy inside an AFS file system at the University of Paderborn. CDC = content-defined chunking, SC = static chunking, Datei = file-based chunking. D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009.
19 Internal Redundancy based on Data Type. D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009.
20 Internal Redundancy based on File Size. [Chart: internal redundancy (0% to 50%) by file-size class, from < 4K up to > 2G, for CDC-8, CDC-16, SC-8, SC-16, and file-based chunking (Datei).] D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009.
21 Temporal Redundancy. Redundancy considering previous backup runs. D. Meister and A. Brinkmann: Multi-Level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009.
22 Bottleneck Index. There is one fingerprint for each chunk; the chunk index is an index of all previously accessed chunks. Assume a chunk size of 8 KB: 2^40 / 2^13 = 2^27 chunks for each terabyte of data. Using 20-byte SHA-1 fingerprints, this means a 2.5 GB index for each terabyte of data. The index cannot be stored in main memory for large-scale storage systems, but storing it on disk results in no locality in index access (except for archiving), i.e., random I/O accesses on disk. A throughput of 100 MB/s requires up to 24,000 index lookups per second, so the disk becomes the bottleneck.
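The slide's back-of-the-envelope numbers can be reproduced directly. A small sketch (the reconciliation of the 24,000 lookups/s figure with a ~4 KB average chunk size is my reading of the slide, not stated there explicitly):

```python
# Chunk-index sizing and lookup-rate arithmetic from the slide's numbers.
TIB = 2 ** 40
chunk_size = 8 * 1024        # 8 KB chunks (2^13 bytes)
fp_size = 20                 # SHA-1 fingerprint, 20 bytes

chunks_per_tib = TIB // chunk_size       # 2^27 chunks per TB
index_bytes = chunks_per_tib * fp_size   # 2.5 GB of raw fingerprints per TB
assert chunks_per_tib == 2 ** 27
assert index_bytes == 2.5 * 2 ** 30

# Lookup rate needed at 100 MB/s ingest:
lookups_8k = (100 * 10 ** 6) // (8 * 1024)   # ~12,200/s with 8 KB chunks
lookups_4k = (100 * 10 ** 6) // (4 * 1024)   # ~24,400/s with 4 KB chunks,
                                             # matching the slide's "up to 24,000"
```

Either way the lookup rate is two orders of magnitude above what a single disk (~200 random IO/s) can deliver.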
23 What can we do? Use more disks: at 200 IO/s per disk, 10 disks lead to a throughput of 6 MB/s. Use SSDs (Intel X25-E: IO/s): one SSD achieves 60 MB/s deduplication throughput. The target throughput is >> 200 MB/s.
24 Approach 1: Bloom Filter (1). A probabilistic data structure similar to a set, with Insert(key) and Lookup(key). Lookup(key) = false: the item is guaranteed not to be in the set. Lookup(key) = true: the item is probably (!) in the set. Bloom filter for fingerprints: Lookup(fp) = false means no chunk index lookup is necessary; Lookup(fp) = true means a lookup is still required. B. Zhu, K. Li, and H. Patterson: Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In 6th USENIX Conference on File and Storage Technologies, 2008.
25 Bloom Filter (2). Data structure: a bitmap b of length m and k independent hash functions h_i. Insert(i): set bit positions h_1(i), ..., h_k(i) to 1. Lookup(i): if any of the bits h_1(i), ..., h_k(i) is 0, then the item is definitely not in the set. There is a probability of false positives, e.g. 2% for k = 4 and 1 byte per fingerprint. Does this really help? B. Zhu, K. Li, and H. Patterson: Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In 6th USENIX Conference on File and Storage Technologies, 2008.
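The Insert/Lookup operations just described can be sketched in a few lines. The bitmap size and the way the k hash functions are derived from SHA-1 below are illustrative choices, not the parameters of the Data Domain system:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m-bit array, k hash functions (illustrative)."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key: bytes):
        # derive k independent positions by salting SHA-1 with the index i
        for i in range(self.k):
            h = hashlib.sha1(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def lookup(self, key: bytes) -> bool:
        # False -> definitely absent; True -> probably present
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter(m=8 * 1024, k=4)
bf.insert(b"fingerprint-1")
assert bf.lookup(b"fingerprint-1")   # no false negatives, ever
# a key that was never inserted usually (but not always) returns False
```

A negative answer therefore safely skips the on-disk chunk index; only positive answers still pay for an index lookup.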
26 Approach 2: Locality-Preserving Caching. A container is a sequence of new chunks, 4 MB to 10 MB; metadata (fingerprints) and data are stored together on disk. Chunk lookup: read and cache the fingerprints of the complete container, then look up each sequential chunk in the container cache. Idea: long runs of chunks belong to the same container, so one I/O delivers the metadata of a complete container. B. Zhu, K. Li, and H. Patterson: Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In 6th USENIX Conference on File and Storage Technologies, 2008.
27 Approach 3: Sparse Indexing. Do not keep the complete chunk index. Divide the index into segments (sequences of 10 MB of chunks) and choose k champions for each segment. The champion index (in RAM) maps champion to segment; lookups are performed in the champion index. Idea: long runs of chunks come from the same segment, so one (successful) champion lookup delivers the fingerprints for many chunks. Applied inside the HP D2D2500 and D2D4000. M. Lillibridge et al.: Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST '09).
28 Delta-Encoding. Pipeline: chunking, then near-duplicate detection, then delta-encoding.
29 Near-Duplicate Detection: Shingling. Calculate the hash value of all w-windows of a chunk c and choose the k biggest hash values S (shingles, features). Seek the chunk c' with the maximum number of common features; then, with high probability, c' is similar, but not necessarily identical, to c. Shingling is a standard technique in information retrieval. C. Manning, P. Raghavan, and H. Schütze: Introduction to Information Retrieval. Cambridge University Press, May 2008.
30 Resemblance Detection. The resemblance of two chunks A and B is defined as r(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| <= 1. Calculating r(A,B) exactly is too computationally intensive, so just use a clearly defined subset of shingles, e.g. the biggest values (see [Man 1994]). Broder suggests using the k minimal fingerprints instead and shows that the resulting function is an unbiased estimator. A. Z. Broder: Identifying and Filtering Near-Duplicate Documents. In CPM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching.
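The Broder-style estimator can be sketched as follows. The window size w, the number of features k, and the use of truncated SHA-1 as the shingle hash are illustrative assumptions; the feature set of a chunk is the k minimal shingle hashes, and resemblance is estimated from the feature sets alone.

```python
import hashlib

def features(chunk: bytes, w: int = 4, k: int = 8) -> set:
    """Hash all w-byte windows (shingles) of a chunk and keep the
    k smallest hash values as the chunk's feature set."""
    hashes = {int.from_bytes(hashlib.sha1(chunk[i:i + w]).digest()[:8], "big")
              for i in range(len(chunk) - w + 1)}
    return set(sorted(hashes)[:k])

def estimated_resemblance(fa: set, fb: set) -> float:
    """Estimate r(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| from the features."""
    return len(fa & fb) / len(fa | fb)

a = b"the quick brown fox jumps over the lazy dog"
fa = features(a)
# identical chunks have identical feature sets, so the estimate is 1.0
assert estimated_resemblance(fa, fa) == 1.0
```

Near-duplicate chunks share most of their shingles, hence with high probability most of their minimal hash values, which is what makes the estimator useful for cheap resemblance detection.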
31 Delta Encoding. Chunk c can be compressed based on chunk c'. Example: c = QWIJKLMNOBCDEFGHZDEFGHIJKL, c' = ABCDEFGHIJKLMNOP. The delta is: (Insert; 2; QW) -> QW; (Copy; 7; 8) -> IJKLMNO; (Copy; 7; 1) -> BCDEFGH; (Insert; 1; Z) -> Z; (Copy; 9; 3) -> DEFGHIJKL.
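The example above can be replayed with a few lines of decoder. A minimal sketch of applying (Insert; length; data) and (Copy; length; offset) operations against the reference chunk c':

```python
def delta_decode(ref: bytes, ops) -> bytes:
    """Reconstruct a chunk from a reference chunk and a list of
    ("Insert", length, data) / ("Copy", length, offset) operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "Insert":
            _, length, data = op
            out += data[:length]          # literal bytes from the delta
        else:
            _, length, offset = op
            out += ref[offset:offset + length]  # bytes copied from c'
    return bytes(out)

ref = b"ABCDEFGHIJKLMNOP"   # c'
ops = [("Insert", 2, b"QW"), ("Copy", 7, 8), ("Copy", 7, 1),
       ("Insert", 1, b"Z"), ("Copy", 9, 3)]
assert delta_decode(ref, ops) == b"QWIJKLMNOBCDEFGHZDEFGHIJKL"  # = c
```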
32 Delta Encoding. Douglis and Iyengar claim that "delta-encoding itself has been made extremely efficient, and it should not usually be a bottleneck except in extremely high-bandwidth environments. [...] The inclusion of the Ajtai delta-encoding work in a commercial backup system also supports the argument that DERD will not be limited by the delta-encoding bandwidth." F. Douglis and A. Iyengar: Application-Specific Delta-Encoding via Resemblance Detection. In Proceedings of the 2003 USENIX Annual Technical Conference.
33 Diligent Delta-Encoding: Memory Overhead. Chunks have a size of 32 MB (2^25 bytes); resemblance detection is based on 4 KB for each shingle, and the k = 8 maximum-value shingles are used for resemblance detection. The feature index for 1 TB has size (2^40 / m) · k · f bytes, where m is the chunk size, k is the number of shingles per chunk, and f is the size of a fingerprint. For Diligent, with m = 2^25 and k = 8, the feature index therefore shrinks to only 4 MB per TB. This design is optimized for backup applications. Diligent Technologies: HyperFactor: A Breakthrough in Data Reduction Technology. Diligent White Paper.
34 Discussion: Delta-Encoding. Delta encoding can help to overcome the disk bottleneck in today's deduplication systems, but the reconstruction of chunks can trigger the reconstruction of additional chunks, a slowdown both for reading and for writing data. The problem becomes worse as the system gets older, because the depth of the dedup tree increases. A clean-up process can help to limit this problem; Diligent has, to our knowledge, restricted the maximum tree depth.
Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...
More informationTradeoffs in Scalable Data Routing for Deduplication Clusters
Tradeoffs in Scalable Data Routing for Deduplication Clusters Wei Dong Princeton University Fred Douglis EMC Kai Li Princeton University and EMC Hugo Patterson EMC Sazzala Reddy EMC Philip Shilane EMC
More informationEverything you need to know about flash storage performance
Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices
More informationEfficient Deduplication in Disk- and RAM-based Data Storage Systems
Efficient Deduplication in Disk- and RAM-based Data Storage Systems Andrej Tolič and Andrej Brodnik University of Ljubljana, Faculty of Computer and Information Science, Slovenia {andrej.tolic,andrej.brodnik}@fri.uni-lj.si
More informationUsing Synology SSD Technology to Enhance System Performance Synology Inc.
Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD
More informationReference Guide WindSpring Data Management Technology (DMT) Solving Today s Storage Optimization Challenges
Reference Guide WindSpring Data Management Technology (DMT) Solving Today s Storage Optimization Challenges September 2011 Table of Contents The Enterprise and Mobile Storage Landscapes... 3 Increased
More informationChapter 8: On the Use of Hash Functions in. Computer Forensics
Harald Baier Hash Functions in Forensics / WS 2011/2012 2/41 Chapter 8: On the Use of Hash Functions in Computer Forensics Harald Baier Hochschule Darmstadt, CASED WS 2011/2012 Harald Baier Hash Functions
More informationEMAIL DATA DE-DUPLICATION SYSTEM
EMAIL DATA DE-DUPLICATION SYSTEM A Final Project Presented to The Faculty of the Department of General Engineering San José State University In Partial Fulfillment of the Requirements for the Degree Master
More informationE-Guide. Sponsored By:
E-Guide An in-depth look at data deduplication methods This E-Guide will discuss the various approaches to data deduplication. You ll learn the pros and cons of each, and will benefit from independent
More informationPRUN : Eliminating Information Redundancy for Large Scale Data Backup System
PRUN : Eliminating Information Redundancy for Large Scale Data Backup System Youjip Won 1 Rakie Kim 1 Jongmyeong Ban 1 Jungpil Hur 2 Sangkyu Oh 2 Jangsun Lee 2 1 Department of Electronics and Computer
More informationIn-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
More informationEfficiently Storing Virtual Machine Backups
Efficiently Storing Virtual Machine Backups Stephen Smaldone, Grant Wallace, and Windsor Hsu Backup Recovery Systems Division EMC Corporation Abstract Physical level backups offer increased performance
More informationOpen Access Improving Read Performance with BP-DAGs for Storage-Efficient File Backup
Send Orders for Reprints to reprints@benthamscience.net 90 The Open Electrical & Electronic Engineering Journal, 2013, 7, 90-97 Open Access Improving Read Performance with BP-DAGs for Storage-Efficient
More informationBackup Software Data Deduplication: What you need to know. Presented by W. Curtis Preston Executive Editor & Independent Backup Expert
Backup Software Data Deduplication: What you need to know Presented by W. Curtis Preston Executive Editor & Independent Backup Expert When I was in the IT Department When I started as backup guy at $35B
More informationReducing impact of data fragmentation caused by in-line deduplication
Reducing impact of data fragmentation caused by in-line deduplication Michal Kaczmarczyk, Marcin Barczynski, Wojciech Kilian, and Cezary Dubnicki 9LivesData, LLC {kaczmarczyk, barczynski, wkilian, dubnicki}@9livesdata.com
More informationReliability-Aware Deduplication Storage: Assuring Chunk Reliability and Chunk Loss Severity
Reliability-Aware Deduplication Storage: Assuring Chunk Reliability and Chunk Loss Severity Youngjin Nam School of Computer and Information Technology Daegu University Gyeongsan, Gyeongbuk, KOREA 712-714
More informationBuilding a High Performance Deduplication System Fanglu Guo and Petros Efstathopoulos
Building a High Performance Deduplication System Fanglu Guo and Petros Efstathopoulos Symantec Research Labs Symantec FY 2013 (4/1/2012 to 3/31/2013) Revenue: $ 6.9 billion Segment Revenue Example Business
More informationData Deduplication in Windows Server 2012 and Ubuntu Linux
Xu Xiaowei Data Deduplication in Windows Server 2012 and Ubuntu Linux Bachelor s Thesis Information Technology May 2013 DESCRIPTION Date of the bachelor's thesis 2013/5/8 Author(s) Xu Xiaowei Name of the
More informationExtreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup
Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup Deepavali Bhagwat University of California 1156 High Street Santa Cruz, CA 9564 dbhagwat@soe.ucsc.edu Kave Eshghi Hewlett-Packard
More informationbup: the git-based backup system Avery Pennarun
bup: the git-based backup system Avery Pennarun 2010 10 25 The Challenge Back up entire filesystems (> 1TB) Including huge VM disk images (files >100GB) Lots of separate files (500k or more) Calculate/store
More informationUnit 4.3 - Storage Structures 1. Storage Structures. Unit 4.3
Storage Structures Unit 4.3 Unit 4.3 - Storage Structures 1 The Physical Store Storage Capacity Medium Transfer Rate Seek Time Main Memory 800 MB/s 500 MB Instant Hard Drive 10 MB/s 120 GB 10 ms CD-ROM
More information