Efficient File Storage Using Content-based Indexing




João Barreto (joao.barreto@inesc-id.pt) and Paulo Ferreira (paulo.ferreira@inesc-id.pt)
Distributed Systems Group, INESC-ID Lisbon, Technical University of Lisbon
http://www.gsd.inesc-id.pt/

Why Use Content-based Indexing for File Storage?

Extending content-based indexing (e.g. as used by LBFS [MCM01]) from network transfer of file contents to local file storage is a natural step. It is particularly interesting as storage-efficient support for:
- Versioning file systems
- Resource-constrained embedded file systems

However, access performance must remain acceptable and the storage gains must be significant.
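To make the technique concrete, here is a minimal sketch of content-defined chunking in the spirit of LBFS. The constants and the rolling-hash update are illustrative assumptions: LBFS itself uses Rabin fingerprints over a 48-byte sliding window, while this sketch substitutes a toy polynomial hash.

```python
# Simplified sketch of LBFS-style content-defined chunking.
# A real implementation uses Rabin fingerprints over a sliding window;
# the rolling update below is a toy stand-in for illustration only.

WINDOW = 16           # minimum bytes before a boundary may fire (assumed value)
MASK = (1 << 11) - 1  # boundary when low 11 bits are zero (~2 KiB expected chunks)

def chunk_boundaries(data: bytes):
    """Yield end offsets of content-defined chunks in `data`."""
    h = 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF  # toy rolling-hash update
        if i >= WINDOW and (h & MASK) == 0:
            yield i + 1  # chunk boundary found at this byte
            h = 0
    yield len(data)      # final (possibly short) chunk

def chunks(data: bytes):
    """Split `data` into the chunks delimited by chunk_boundaries()."""
    start = 0
    for end in chunk_boundaries(data):
        if end > start:  # skip a duplicate final boundary
            yield data[start:end]
            start = end
```

Because boundaries depend only on local content, inserting bytes into a file shifts boundaries near the edit but leaves most chunks (and therefore most hashes) unchanged, which is what makes cross-file similarity detectable.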

Existing Solutions: the Chunk Repository Storage Model

To some extent, all existing file storage architectures based on content-based indexing share a core storage model [CN02, QD02, BF04]. File contents are divided into disjoint chunks of data, each stored individually in a repository of chunks under a unique hash of its contents. The actual files are then stored as sequences of possibly shared references to chunks in the repository.

[Figure: example of the chunk repository model — files 1-3 stored as sequences of chunk references r_i into a repository of (hash h_i, contents c_i) entries.]
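The chunk repository model described above can be sketched in a few lines. The class and method names are illustrative, not taken from any of the cited systems; SHA-1 stands in for whatever collision-resistant hash a real system would use.

```python
# Minimal sketch of the chunk-repository storage model: each chunk is
# stored once, keyed by a hash of its contents, and each file is a
# sequence of chunk references. Names are illustrative assumptions.
import hashlib

class ChunkRepository:
    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk contents
        self.files = {}   # file name -> list of chunk hashes (references)

    def store_file(self, name, file_chunks):
        """Store a file given as a list of chunk byte strings."""
        refs = []
        for c in file_chunks:
            h = hashlib.sha1(c).hexdigest()
            self.chunks.setdefault(h, c)  # a shared chunk is stored only once
            refs.append(h)
        self.files[name] = refs

    def read_file(self, name):
        """Reassemble a file by following its chunk references."""
        return b"".join(self.chunks[h] for h in self.files[name])
```

Storing two files that share a chunk keeps only one copy of it in the repository, which is exactly the source of both the storage gains and the metadata/fragmentation penalties discussed next.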

Problems of the Chunk Repository Storage Model

Storage penalties grow as the chunk size is decreased:
1. Increased chunk metadata overhead (mainly chunk hashes);
2. Increased internal fragmentation, if the chunk repository is stored on a block-based device;
3. Lower chunk compression ratios.

This trade-off restricts the choice of the expected chunk size to relatively high values; hence, existing solutions do not fully exploit the similarity that may exist in a file system.

Sequential read performance is penalized even if the file being accessed shares no chunks with other files, since chunks are stored in a randomly organized repository.
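The first penalty is easy to quantify with a back-of-the-envelope calculation. Assuming a 20-byte SHA-1 hash per chunk (an assumption; the slide does not name a hash function), halving the expected chunk size doubles the relative hash overhead:

```python
# Rough estimate of per-chunk hash overhead vs. expected chunk size.
# HASH_BYTES = 20 assumes SHA-1; this is an illustrative assumption.
HASH_BYTES = 20

def metadata_overhead(expected_chunk_size: int) -> float:
    """Fraction of stored bytes spent on per-chunk hashes."""
    return HASH_BYTES / expected_chunk_size

for size in (8192, 1024, 256):
    print(f"{size:5d} B chunks -> {metadata_overhead(size):.2%} hash overhead")
```

With 8 KiB chunks the hash overhead is about 0.24%, but at 256-byte chunks it already reaches roughly 7.8% of the stored data, before counting fragmentation and lost compression, which is why existing systems keep the expected chunk size high.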

The Proposed Storage Model

Two distinguishing principles:
1. If a file shares no chunks with the rest of the file system, it is stored in plain form. Subsequent files that share chunks with that file reference those portions of its plain contents.
2. Hashes of chunks are not stored permanently alongside the contents of the file system. Content similarity detection for new file system data therefore requires indexing the whole contents of the file system, which is performed periodically in the background.

[Figure: the previous example under the proposed model — unshared files kept as plain contents c_i; shared portions stored as references r_i into other files' plain contents.]
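The first principle can be sketched as follows. A file with no shared chunks is stored as plain bytes; a later file is stored as a recipe mixing literal unshared chunks with references into the plain contents of earlier files. All names and the (source, offset, length) reference layout are illustrative assumptions, not the paper's on-disk format.

```python
# Sketch of the proposed model: plain storage for unshared files,
# byte-range references into plain contents for shared portions.
# Class, method, and field names are illustrative assumptions.
class PlainStore:
    def __init__(self):
        self.plain = {}    # file name -> plain contents (bytes)
        self.recipes = {}  # file name -> list of entries:
                           #   bytes (unshared, stored inline), or
                           #   (src_name, offset, length) reference

    def store_plain(self, name, data):
        """Store a file with no shared chunks in plain form."""
        self.plain[name] = data

    def store_with_refs(self, name, entries):
        """Store a file as literal chunks plus references into plain files."""
        self.recipes[name] = entries

    def read(self, name):
        if name in self.plain:
            return self.plain[name]  # plain file: read directly
        out = []
        for e in self.recipes[name]:
            if isinstance(e, bytes):
                out.append(e)                       # unshared portion
            else:
                src, off, ln = e                    # shared portion
                out.append(self.plain[src][off:off + ln])
        return b"".join(out)
```

Note that reading a plain file touches no repository at all, which is what preserves sequential read performance for files without similarity.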

Chunk Coalescing

Chunk coalescing optimises cases where consecutive pointers to contiguous shared chunks are detected. Such pointers are replaced by a single multiple-chunk pointer, thus reducing storage overhead and allowing faster access. In practice, this is comparable (though not always equivalent) to using a larger chunk size whenever resorting to a smaller size would yield no additional similarity gains.

[Figure: chunk coalescing example — before coalescing, file 2 holds consecutive references r_3 and r_n to contiguous chunks c_3 and c_n of file 3; after coalescing they are replaced by a single multiple-chunk reference r_3,n.]
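Representing references as (source file, offset, length) tuples, as in the sketch of the proposed model above, coalescing reduces to merging adjacent tuples that point at contiguous byte ranges of the same source. This is an illustrative sketch, not the paper's algorithm:

```python
# Sketch of chunk coalescing: consecutive references to contiguous byte
# ranges of the same source file are merged into one multi-chunk
# reference. References are (src, offset, length) tuples (assumed layout).
def coalesce(refs):
    merged = []
    for src, off, ln in refs:
        if merged:
            prev_src, prev_off, prev_ln = merged[-1]
            if prev_src == src and prev_off + prev_ln == off:
                # Contiguous in the same source: extend the previous reference.
                merged[-1] = (prev_src, prev_off, prev_ln + ln)
                continue
        merged.append((src, off, ln))
    return merged
```

For example, three back-to-back 4-byte references into the same file collapse into one 12-byte reference, cutting per-reference overhead by two thirds and turning three lookups into one contiguous read.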

Advantages

If there is no similarity, no storage overhead is imposed and read access performance is identical to that of a regular file system.

The storage penalties that resulted from dividing files into smaller chunks are eliminated:
- If a chunk is not shared, there is no chunk storage overhead;
- If a chunk is shared, the storage overhead is negligible and always compensated by the gains resulting from the increased chunk sharing:
  - Hash values are not stored along with contents;
  - Chunk coalescing optimises chunk reference overhead;
  - With an underlying block-based file system, internal fragmentation is, on average, not affected;
  - Data compression may be applied to the unshared portion of each file as a whole, achieving higher compression ratios than compressing individual smaller chunks.

Current Status

Partially functional in a simulator. Currently being implemented as a Linux Virtual File System.

References

[BF04] J. Barreto and P. Ferreira. A replicated file system for resource-constrained mobile devices. In Proceedings of the IADIS International Conference on Applied Computing, 2004.
[CN02] L. Cox and B. Noble. Pastiche: Making backup cheap and easy. In Proceedings of the Fifth ACM/USENIX Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002.
[MCM01] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Symposium on Operating Systems Principles, pages 174-187, 2001.
[QD02] S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002.