Probabilistic Deduplication for Cluster-Based Storage Systems
Davide Frey, Anne-Marie Kermarrec, Konstantinos Kloudas
INRIA Rennes, France
Motivation
The volume of stored data increases exponentially, and the services provided are highly dependent on that data.
Motivation
Traditional solutions combine data in tarballs and store them on tape. Pros: cost efficient. Cons: low throughput.

                   Tape        Disk
Acquisition cost   $407,000    $1,620,000
Operational cost   $205,000    $573,000
Total cost         $612,000    $2,193,000
* Source: www.backupworks.com
Deduplication
Store data only once and replace duplicates with references.
Challenges
Single-node deduplication systems: compact indexing structures; efficient duplicate detection.
Cluster-based solutions: single-machine tradeoffs; deduplication vs. load balancing.
We focus on cluster-based deduplication systems.
Example: Deduplication vs. Load Balancing
A client wants to store a file and sends it to the Coordinator.
The Coordinator computes the overlap between the file's contents and those of each Storage Node: A 10%, B 30%, C 60%, D 0%.
To maximize DEDUPLICATION, the new file should go to node C (largest overlap).
To achieve LOAD BALANCING, the new file should go to node D.
Goal: Scalable Cluster Deduplication
- Good data deduplication. Maximize: ideally, match the deduplication of a single-node system.
- Load balancing. Minimize imbalance: ideally, the load ratio across nodes equals 1.
- Scalability. Minimize memory usage at the Coordinator.
- Good throughput. Minimize CPU/memory usage at the Coordinator.
PRODUCK architecture
Client: split the file into chunks of data; store and retrieve data.
Coordinator: assign chunks to nodes; keep the system load balanced.
Storage Nodes: store the chunks; provide directory services.
Client: chunking
Chunks: produced with content-based chunking techniques; the basic deduplication unit.
Super-chunks: groups of consecutive chunks; the basic routing and storage unit.
Client: chunking
Split the file into chunks, then organize the chunks into super-chunks.
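As an illustration of the chunking stage, here is a minimal Python sketch of content-based chunking and super-chunk grouping. The window size, boundary mask, and super-chunk size are illustrative values, not PRODUCK's actual parameters; a real chunker would use a rolling hash (e.g., Rabin fingerprints) instead of rehashing each window.

```python
import hashlib

WINDOW = 16            # bytes of context used to decide a boundary (illustrative)
MASK = (1 << 11) - 1   # boundary when the fingerprint has 11 low zero bits
SUPER_CHUNK_SIZE = 4   # chunks per super-chunk (illustrative)

def chunk(data: bytes) -> list:
    """Content-based chunking sketch: cut wherever the hash of the last
    WINDOW bytes matches a fixed bit pattern, so boundaries depend on the
    content itself and an insertion only shifts boundaries locally."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data)):
        fp = int.from_bytes(hashlib.sha1(data[i - WINDOW:i]).digest()[:4], "big")
        if fp & MASK == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # trailing chunk
    return chunks

def super_chunks(chunks: list, size: int = SUPER_CHUNK_SIZE) -> list:
    """Group consecutive chunks into super-chunks, the routing/storage unit."""
    return [chunks[i:i + size] for i in range(0, len(chunks), size)]
```

Because boundaries are chosen by content rather than by fixed offsets, two files sharing a long run of bytes produce mostly identical chunks even if the run appears at different offsets.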
Coordinator: goals
Estimate the overlap between a super-chunk and the chunks of a given node, to maximize deduplication.
Distribute the storage load equally among nodes, to keep the system load balanced.
Coordinator: our contributions
Novel chunk overlap estimation, based on probabilistic counting (PCSA) [Flajolet et al. 1985, Michel et al. 2006]; never used before in storage systems.
Novel load balancing mechanism, operating at chunk-level granularity and improving the co-localization of duplicate chunks.
Coordinator: Overlap Estimation
Main observation: we do not need the exact matches, only an estimate of the size of the overlap.
PCSA provides: compact set descriptors, accurate intersection estimation, and computational efficiency.
Coordinator: Overlap Estimation
Each chunk y in the original set is hashed, and ρ(y) = min{k : bit(hash(y), k) = 1}, the position of the least-significant 1-bit of the hash, is computed. For every chunk, bit ρ(y) is set in a BITMAP.
INTUITION: P(bitmap[0] = 1) = 1/2, P(bitmap[1] = 1) = 1/4, P(bitmap[2] = 1) = 1/8, ... so the position of the lowest unset bit grows with the number of distinct chunks hashed, and yields a cardinality estimate.
Coordinator: Overlap Estimation
Union cardinality: bitwise-ORing two bitmaps yields exactly the bitmap of the union of the two sets, e.g. BITMAP(A) = 11011000, BITMAP(B) = 01001100, BITMAP(A ∪ B) = 11011100.
Intersection cardinality then follows by inclusion–exclusion: |A ∩ B| = |A| + |B| − |A ∪ B|.
Coordinator: Overlap Estimation
PCSA set cardinality estimation. Set intersection estimation. Selection of the best storage node.
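The estimation pipeline above can be sketched as follows. For clarity this uses a single Flajolet–Martin bitmap; actual PCSA averages over many bitmaps (stochastic averaging) to reduce variance, so treat the parameters and function names here as illustrative.

```python
import hashlib

BITS = 32       # bitmap width (illustrative)
PHI = 0.77351   # Flajolet–Martin correction constant

def rho(x: int) -> int:
    """Position of the least-significant 1-bit of x (0-based)."""
    return (x & -x).bit_length() - 1 if x else BITS - 1

def sketch(chunks) -> int:
    """Build a bitmap for a set of chunks: set bit rho(hash(chunk)) for each."""
    bitmap = 0
    for c in chunks:
        h = int.from_bytes(hashlib.sha1(c).digest()[:4], "big")
        bitmap |= 1 << rho(h)
    return bitmap

def estimate(bitmap: int) -> float:
    """Cardinality ~ 2^R / PHI, where R is the index of the lowest 0-bit."""
    r = 0
    while bitmap >> r & 1:
        r += 1
    return 2 ** r / PHI

def estimate_intersection(a: int, b: int) -> float:
    """OR-ing two bitmaps gives exactly the bitmap of the set union, so
    inclusion-exclusion yields |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return estimate(a) + estimate(b) - estimate(a | b)

def best_node(sc_sketch: int, node_sketches: dict) -> str:
    """Select the node whose stored chunks overlap most with the super-chunk."""
    return max(node_sketches,
               key=lambda n: estimate_intersection(sc_sketch, node_sketches[n]))
```

Note that merging sketches is exact (a bitwise OR), even though the cardinality read off a sketch is only an estimate; this is what makes the per-node descriptors so compact.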
Load Balancing
Existing solution: choose Storage Nodes that do not exceed the average load by a percentage threshold.
Problem: too aggressive, especially when little data is stored in the system.
Load Balancing: our solution
Bucket-based storage quota management: measure storage space in fixed-size buckets; the Coordinator grants buckets to nodes one by one; no node may exceed the least-loaded node by more than a maximum allowed bucket difference.
Load Balancing: our solution
Walkthrough: when a node fills its current bucket, it asks the Coordinator, "Can I get a new bucket?" If the grant keeps the node within the maximum allowed bucket difference of the least-loaded node, the Coordinator answers "Yes, you can." Otherwise it answers "No, you cannot!", and the Coordinator searches for the node with the second-biggest overlap.
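The bucket-granting rule can be sketched as a small quota manager; the class and method names are hypothetical, not PRODUCK's API:

```python
class BucketQuotaManager:
    """Coordinator-side quota sketch: a node is granted a new fixed-size
    bucket only if that keeps it within `max_diff` buckets of the
    least-loaded node."""

    def __init__(self, nodes, max_diff: int = 1):
        self.buckets = {n: 0 for n in nodes}  # buckets granted per node
        self.max_diff = max_diff

    def can_grant(self, node) -> bool:
        least = min(self.buckets.values())
        return self.buckets[node] + 1 - least <= self.max_diff

    def grant(self, node) -> bool:
        """Answer a node's 'Can I get a new bucket?' request."""
        if not self.can_grant(node):
            return False  # caller falls back to the next-best overlap node
        self.buckets[node] += 1
        return True
```

When the node with the biggest overlap is denied a bucket, the Coordinator simply retries with the node with the next-biggest overlap, trading a little deduplication for a bounded load imbalance.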
Contribution Summary
Novel chunk overlap estimation, based on probabilistic counting (PCSA) [Flajolet et al. 1985, Michel et al. 2006]; never used before in storage systems.
Novel load balancing mechanism, operating at chunk-level granularity and improving the co-localization of duplicate chunks.
Evaluation: Datasets
2 real-world workloads containing binary data, thus covering many common cases. Table 1 presents more details on these datasets. The Deduplication Factor is the ratio of each dataset's original size to its size after being deduplicated with our chunking mechanism.

Table 1: Dataset Description
Dataset     Size (GB)   Deduplication Factor   Data Format
Wikipedia   522         1.96                   HTML
Images      142         4.27                   OS images

Evaluation: Competitors
Before evaluating the specific aspects of PRODUCK, we compare it against two state-of-the-art cluster-based deduplication storage systems presented in [Dong et al. 2011]: BloomFilter, a stateful strategy, and MinHash, a stateless one. The latter is used in a commercial product.
Evaluation: Competitors
MinHash: use the minimum chunk hash of a super-chunk as its fingerprint. Assign super-chunks to bins by taking the fingerprint modulo the number of bins. Initially assign bins to nodes at random, and reassign bins when the system becomes unbalanced.
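A minimal sketch of the MinHash routing rule described above; the hash function and bin count are illustrative assumptions:

```python
import hashlib

NUM_BINS = 64  # illustrative cluster configuration, not from the paper

def chunk_hash(chunk: bytes) -> int:
    return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")

def superchunk_bin(chunks) -> int:
    """MinHash routing: fingerprint a super-chunk with its minimum chunk
    hash, then map the fingerprint to a bin with mod(# bins)."""
    fingerprint = min(chunk_hash(c) for c in chunks)
    return fingerprint % NUM_BINS
```

Because the minimum hash depends only on the super-chunk's content, identical super-chunks always land in the same bin with no coordinator state, which is what makes the scheme stateless.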
Evaluation: Competitors
BloomFilter: the Coordinator keeps a Bloom filter for each of the Storage Nodes. A node that deviates by more than 5% from the average load is considered overloaded.
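A sketch of how a per-node Bloom filter lets the Coordinator estimate overlap; the filter parameters and the overlap function are assumptions for illustration, not the exact scheme of [Dong et al. 2011]:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter (sizes are illustrative)."""

    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: bytes):
        # Derive k probe positions by salting the hash with the probe index.
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits >> p & 1 for p in self._positions(item))

def overlap(super_chunk, node_filter: BloomFilter) -> float:
    """Estimated overlap: fraction of the super-chunk's chunks that are
    (probably) already stored on the node, per its Bloom filter."""
    return sum(c in node_filter for c in super_chunk) / len(super_chunk)
```

The statefulness is the cost: the Coordinator must keep every node's filter up to date, and filter size grows with the number of stored chunks.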
Evaluation: Metrics
Deduplication, load balancing, an overall metric, and throughput (formulas omitted). ED and TD are normalized to the performance of a single-node system to ease comparison.
Evaluation: Effective Deduplication
(Plots for Wikipedia and Images.) 32 nodes: Wikipedia 7%, Images 16%. 64 nodes: Wikipedia 16%, Images 21%.
Evaluation: Throughput
(Plots for Wikipedia and Images.) 32 nodes: Wikipedia 11×, Images 13×. 64 nodes: Wikipedia 16×, Images 21×.
Memory: 64 KB for PRODUCK, versus 9.6 bits/chunk, i.e., 168 GB for 140 TB/node.
Evaluation: Load Balancing
(Plots for Wikipedia and Images.)