Reducing Replication Bandwidth for Distributed Document Databases

Similar documents
Speeding Up Cloud/Server Applications Using Flash Memory

WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression

Understanding EMC Avamar with EMC Data Protection Advisor

Data Deduplication HTBackup

A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique

'09 Linux Plumbers Conference

IMPLEMENTATION OF SOURCE DEDUPLICATION FOR CLOUD BACKUP SERVICES BY EXPLOITING APPLICATION AWARENESS

Deduplication Demystified: How to determine the right approach for your business

Protect Data... in the Cloud

MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services

A Deduplication File System & Course Review

Backup Software Data Deduplication: What you need to know. Presented by W. Curtis Preston Executive Editor & Independent Backup Expert

STORAGE. Buying Guide: TARGET DATA DEDUPLICATION BACKUP SYSTEMS. inside

3Gen Data Deduplication Technical

Building a High Performance Deduplication System Fanglu Guo and Petros Efstathopoulos

Demystifying Deduplication for Backup with the Dell DR4000

Theoretical Aspects of Storage Systems Autumn 2009

Reference Guide WindSpring Data Management Technology (DMT) Solving Today's Storage Optimization Challenges

Read Performance Enhancement In Data Deduplication For Secondary Storage

CEMEX en Concreto con EMC. Jose Luis Bedolla EMC Corporation Back Up Recovery and Archiving

Metadata Feedback and Utilization for Data Deduplication Across WAN

Inline Deduplication

Hardware Configuration Guide

Wide-area Network Acceleration for the Developing World. Sunghwan Ihm (Princeton) KyoungSoo Park (KAIST) Vivek S. Pai (Princeton)

Estimating Deduplication Ratios in Large Data Sets

ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory

Quanqing XU YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

MongoDB Developer and Administrator Certification Course Agenda

Internet Storage Sync Problem Statement

Data Deduplication and Corporate PC Backup

idedup Latency-aware inline deduplication for primary workloads Kiran Srinivasan, Tim Bisson Garth Goodson, Kaladhar Voruganti

EMC BACKUP-AS-A-SERVICE

ALG De-dupe for Cloud Backup Services of personal Storage Uma Maheswari.M, DEPARTMENT OF ECE, IFET College of Engineering

Contents. WD Arkeia Page 2 of 14

In Memory Accelerator for MongoDB

A Data De-duplication Access Framework for Solid State Drives

Edelta: A Word-Enlarging Based Fast Delta Compression Approach

Sharding and MongoDB. Release MongoDB, Inc.

Cost Effective Backup with Deduplication. Copyright 2009 EMC Corporation. All rights reserved.

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

A Survey on Aware of Local-Global Cloud Backup Storage for Personal Purpose

M710 - Max 960 Drive, 8Gb/16Gb FC, Max 48 ports, Max 192GB Cache Memory

File System Management

HP StoreOnce: reinventing data deduplication

MongoDB and Couchbase

Top Ten Questions. to Ask Your Primary Storage Provider About Their Data Efficiency. May Copyright 2014 Permabit Technology Corporation

A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP

UNDERSTANDING DATA DEDUPLICATION. Thomas Rivera SEPATON

Technical White Paper for the Oceanspace VTL6000

VM-Centric Snapshot Deduplication for Cloud Data Backup

Tertiary Backup objective is automated offsite backup of critical data

Data deduplication is more than just a BUZZ word

Key Components of WAN Optimization Controller Functionality

De-duplication-based Archival Storage System

Symantec Backup Exec Blueprints

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication

Don't Get Duped By Dedupe. 7 Technology Circle Suite 100 Columbia, SC Phone: sales@unitrends.com URL:

Optimize VMware and Hyper-V Protection with HP and Veeam

Data Deduplication in a Hybrid Architecture for Improving Write Performance

The Pros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen Storage Platforms. Abhijith Shenoy Engineer, Hedvig Inc.

GET. tech brief FASTER BACKUPS

Lessons Learned while Pushing the Limits of SecureFile LOBs. by Jacco H. Landlust. Sunday, 3 March 13

Veeam Best Practices with Exablox

Hybrid Cloud Storage System. Oh well, I will write the report on May 1st

Cloud Services. May 28 th, 2014 Athens, Greece

Data Reduction Methodologies: Comparing ExaGrid s Byte-Level-Delta Data Reduction to Data De-duplication. February 2007

Testpassport Q&A: Higher Quality, Better Service. We offer free update service for one year. http://www.testpassport.com

Hyper-converged IT drives: - TCO cost savings - data protection - amazing operational excellence

Multi-level Metadata Management Scheme for Cloud Storage System

Asymmetric Caching: Improved Network Deduplication for Mobile Devices

Barracuda Backup Deduplication. White Paper

Online De-duplication in a Log-Structured File System for Primary Storage

STORAGE SOURCE DATA DEDUPLICATION PRODUCTS. Buying Guide: inside

UNDERSTANDING DATA DEDUPLICATION. Jiří Král, Director of Technical Development, STORYFLEX a.s.

Cloud De-duplication Cost Model THESIS

E-Guide. Sponsored By:

DEXT3: Block Level Inline Deduplication for EXT3 File System

Primary Data Deduplication Large Scale Study and System Design

UNDERSTANDING DATA DEDUPLICATION. Tom Sas Hewlett-Packard

NoSQL: Going Beyond Structured Data and RDBMS

Release Notes. LiveVault. Contents. Version Revision 0

A block based storage model for remote online backups in a trust no one environment

VDI Without Compromise with SimpliVity OmniStack and Citrix XenDesktop

Effective Planning and Use of IBM Tivoli Storage Manager V6 and V7 Deduplication

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

Get Success in Passing Your Certification Exam at first attempt!

A Survey on Deduplication Strategies and Storage Systems

Tradeoffs in Scalable Data Routing for Deduplication Clusters

Technical Overview Simple, Scalable, Object Storage Software

Data Compression and Deduplication. LOC Cisco Systems, Inc. All rights reserved.

Performance and scalability of a large OLTP workload

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Trends in Enterprise Backup Deduplication

The exponential growth of mobile data traffic has led

Transcription:

Reducing Replication Bandwidth for Distributed Document Databases. Lianghong Xu¹, Andy Pavlo¹, Sudipta Sengupta², Jin Li², Greg Ganger¹ (¹Carnegie Mellon University, ²Microsoft Research)

Document-oriented Databases

{ "_id": "55ca4cf7bad4f75b8eb5c25c", "pageid": "46780", "revid": "41173", "timestamp": "2002-03-30T20:06:22", "sha1": "6i81h1zt22u1w4sfxoofyzmxd", "text": "The Peer and the Peri is a comic [[Gilbert and Sullivan]] [[operetta]] in two acts just as predicting, The fairy Queen, however, appears to all live happily ever after." }

Update

{ "_id": "55ca4cf7bad4f75b8eb5c25d", "pageid": "46780", "revid": "128520", "timestamp": "2002-03-30T20:11:12", "sha1": "q08x58kbjmyljj4bow3e903uz", "text": "The Peer and the Peri is a comic [[Gilbert and Sullivan]] [[operetta]] in two acts just as predicted, The fairy Queen, on the other hand, is ''not'' happy, and appears to all live happily ever after." }

Update: reading a recent document and writing back a similar one.
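This access pattern (read the latest revision, tweak it, and insert a whole new document) is what makes consecutive versions so similar. A minimal sketch of it with pymongo follows; the database name, collection name, and field handling are assumptions for illustration, not a schema the slides prescribe.

# Sketch of the update pattern shown above: fetch the most recent revision of a
# page and insert a new, largely identical document. "wiki" and "revisions" are
# assumed names; only the access pattern matters here.
from datetime import datetime, timezone
from pymongo import MongoClient

revisions = MongoClient()["wiki"]["revisions"]

latest = revisions.find_one({"pageid": "46780"}, sort=[("timestamp", -1)])

new_rev = dict(latest)
new_rev.pop("_id")                           # let MongoDB assign a fresh _id
new_rev["revid"] = "128520"
new_rev["timestamp"] = datetime.now(timezone.utc).isoformat()
new_rev["text"] = latest["text"].replace("predicting", "predicted")  # small edit
# (a real client would also recompute "sha1")

# Most of the inserted bytes duplicate the previous revision's text.
revisions.insert_one(new_rev)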

Replication Bandwidth: [diagram: the primary database ships operation logs, each carrying the full inserted or updated documents (like the two revisions above), across the WAN to the secondary replicas]. Goal: reduce WAN bandwidth for geo-replication.

Why Deduplication? Why not just compress? Oplog batches are small and do not contain enough internal overlap. Why not just use diff? Diff needs application guidance to identify the source document. Dedup finds and removes redundancies across the entire data corpus.

Traditional Dedup: Ideal. [diagram, legend: chunk boundary, modified region, duplicate region. The incoming byte stream is split into chunks 1-5; the chunk boundaries isolate the modified region, chunks 1, 2, 4, and 5 dedup away, and the dedup'ed data is sent to the replicas.]

Traditional Dedup: Reality. [same diagram, but the modified region straddles the chunk boundaries, so only chunk 4 dedups away.] Send almost the entire document.

Similarity Dedup (sdedup). [diagram: rather than relying on exact chunk matches, sdedup delta-encodes the incoming data against a similar existing document.] Only send the delta encoding.
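As a minimal sketch of that last step, the toy delta codec below uses Python's difflib as a stand-in for the compact delta compressor sdedup actually uses; encode_delta and apply_delta are illustrative names, not the system's API.

# Toy delta codec: difflib stands in for sdedup's real delta compressor.
import difflib

def encode_delta(source: str, target: str) -> list:
    """Edit operations that rebuild target from source."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, source, target).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse bytes already at the replica
        else:
            ops.append(("insert", target[j1:j2]))  # ship only the new bytes
    return ops

def apply_delta(source: str, ops: list) -> str:
    """Reconstruct the target document on the secondary."""
    parts = []
    for op in ops:
        parts.append(source[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(parts)

old = "just as predicting, The fairy Queen, however, appears"
new = "just as predicted, The fairy Queen, on the other hand, is ''not'' happy, and appears"
delta = encode_delta(old, new)
assert apply_delta(old, delta) == new   # only the small delta crosses the WAN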

Compress vs. Dedup. [chart comparing compression against deduplication; 20GB sampled Wikipedia dataset, MongoDB v2.7, 4MB oplog batches.]

sdedup Integration. [diagram: on the primary node, client insertions and updates go to the database and its oplog; the sdedup encoder, backed by a source document cache, turns oplog entries into dedup'ed oplog entries. Unsynchronized oplog entries flow through the oplog syncer to the sdedup decoder on the secondary node, which reconstructs the oplog entries from source documents and replays them into its own database.]
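A rough sketch of what this means for the data that crosses the WAN, assuming a dedup'ed oplog entry carries a source-document id plus a delta; the entry fields and helper names below are illustrative, not sdedup's or MongoDB's actual format.

# Illustrative shape of a dedup'ed oplog entry and its reconstruction.

def encode_entry(doc_id, text, best_match, make_delta):
    """Primary side: ship a delta when the encoder found a similar source doc."""
    if best_match is None:
        return {"doc_id": doc_id, "full_text": text}      # cold path: full copy
    src_id, src_text = best_match
    return {"doc_id": doc_id, "src_id": src_id, "delta": make_delta(src_text, text)}

def decode_entry(entry, source_docs, apply_delta):
    """Secondary side: rebuild the document text before replaying the entry."""
    if "full_text" in entry:
        return entry["full_text"]
    src_text = source_docs[entry["src_id"]]               # local lookup, no WAN traffic
    return apply_delta(src_text, entry["delta"])

# Toy round trip, with an identity "delta" standing in for the real codec.
source_docs = {"doc-1": "old revision text"}
entry = encode_entry("doc-2", "new revision text", ("doc-1", source_docs["doc-1"]),
                     make_delta=lambda src, tgt: tgt)
assert decode_entry(entry, source_docs, apply_delta=lambda src, d: d) == "new revision text"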

sdedup Encoding Steps: (1) identify similar documents, (2) select the best match, (3) delta compression.

Identify Similar Documents. [diagram: the target document is Rabin-chunked into features (32, 17, 25, 41, 12); consistent sampling keeps the top features (41, 32) as its similarity sketch; each sketch feature is looked up in the feature index table to find candidate documents, which are scored by the number of matching features. Similarity scores: Doc #1: 1, Doc #2: 2, Doc #3: 2.]
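A simplified sketch of this step: chunk the document, keep the top-K chunk hashes as its similarity sketch (consistent sampling), and score candidates by how many sketch features they share in a feature index table. Fixed-size chunking with MD5 stands in for Rabin fingerprinting here, and K and the chunk size are arbitrary choices.

# Simplified similarity search: sketch = top-K chunk hashes, scored via an index.
import hashlib
from collections import defaultdict

K = 4  # sketch size (illustrative)

def sketch(text: str, chunk_size: int = 64) -> set:
    """Consistent sampling: keep the K largest chunk-hash features."""
    chunks = (text[i:i + chunk_size] for i in range(0, len(text), chunk_size))
    features = {int(hashlib.md5(c.encode()).hexdigest(), 16) for c in chunks}
    return set(sorted(features, reverse=True)[:K])

feature_index = defaultdict(set)   # feature -> ids of documents containing it

def index_document(doc_id, text):
    for f in sketch(text):
        feature_index[f].add(doc_id)

def candidate_scores(text):
    """Similarity score = number of sketch features shared with the target."""
    scores = defaultdict(int)
    for f in sketch(text):
        for doc_id in feature_index[f]:
            scores[doc_id] += 1
    return dict(scores)

On the slide's example, this scoring yields the initial ranking used on the next slide: Doc #2 and Doc #3 with score 2, Doc #1 with score 1.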

Select the Best Match. Is the candidate in the source document cache? If yes, reward +2. Initial ranking: Doc #2 (score 2) and Doc #3 (score 2), then Doc #1 (score 1). Final ranking: Doc #3 (cached, score 4) first, then Doc #1 (cached, score 3), then Doc #2 (not cached, score 2).
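The selection step then amounts to adding the cache bonus and taking the top candidate. The sketch below reproduces the slide's example; the +2 reward comes from the slide, while the tie-breaking behavior is an arbitrary choice.

# Best-match selection: similarity score plus a +2 bonus for cached candidates.

def select_best_match(initial_scores: dict, cached_ids: set):
    final = {doc_id: score + (2 if doc_id in cached_ids else 0)
             for doc_id, score in initial_scores.items()}
    return max(final, key=final.get) if final else None

# The slide's example: Doc #3 wins (2 + 2 = 4) over Doc #1 (1 + 2 = 3) and Doc #2 (2).
scores = {"Doc #1": 1, "Doc #2": 2, "Doc #3": 2}
assert select_best_match(scores, cached_ids={"Doc #1", "Doc #3"}) == "Doc #3"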

Evaluation. MongoDB setup (v2.7): 1 primary node, 1 secondary node, 1 client. Node configuration: 4 cores, 8GB RAM, 100GB HDD storage. Dataset: Wikipedia dump (20GB sampled out of ~12TB); additional datasets are evaluated in the paper.

Compression. [bar chart: compression ratio of sdedup vs. trad-dedup at chunk sizes of 4KB, 1KB, 256B, and 64B on the 20GB sampled Wikipedia dataset; plotted values are 38.4, 38.9, 26.3, and 15.2 for sdedup and 9.9, 9.1, 4.6, and 2.3 for trad-dedup, so sdedup peaks near 38-39x while trad-dedup stays below 10x.]

Memory. [bar chart: memory footprint (MB) of sdedup vs. trad-dedup at chunk sizes of 4KB, 1KB, 256B, and 64B on the 20GB sampled Wikipedia dataset; plotted values are 34.1, 47.9, 57.3, and 61.0 MB for sdedup and 80.2, 133.0, 272.5, and 780.5 MB for trad-dedup.]

Other Results (See Paper): negligible client performance overhead; quick and easy failure recovery; sharding does not hurt the compression ratio; more datasets (Microsoft Exchange, Stack Exchange).

Conclusion & Future Work. sdedup: similarity-based deduplication for replicated document databases. Much greater data reduction than traditional dedup: up to 38x compression ratio for Wikipedia. Resource-efficient design with negligible overhead. Future work: more diverse datasets; dedup for local database storage; different similarity search schemes (e.g., super-fingerprints).