Reducing Replication Bandwidth for Distributed Document Databases
Lianghong Xu¹, Andy Pavlo¹, Sudipta Sengupta², Jin Li², Greg Ganger¹
¹Carnegie Mellon University, ²Microsoft Research
#1 You can sleep with grad students but not undergrads.
#2 Keep a bottle of water in your office in case a student breaks down crying.
#3 Kids love MongoDB, but they want to go work for Google.
Reducing Replication Bandwidth for Distributed Document Databases. In ACM Symposium on Cloud Computing (SoCC), pp. 1-12, August 2015. More Info: http://cmudb.io/doc-dbs
Replication Bandwidth: the primary database ships operation logs (oplogs) over the WAN to a secondary replica (e.g., MongoDB MMS).
Goal: reduce bandwidth for WAN geo-replication.
Why Deduplication?
Why not just compress? Oplog batches are small and do not contain enough internal overlap for a compressor to exploit.
Why not just use diff? Diff needs application guidance to identify the source document to diff against.
Deduplication finds and removes redundancy automatically.
Traditional Dedup: incoming data is split at chunk boundaries, duplicate chunks are dropped, and only the deduplicated chunks (here chunks 1, 2, 4, 5 out of 1-5) are sent to replicas.
Traditional Dedup: when small modified regions are scattered across a document, the surrounding chunks no longer match exactly, so the entire document must be sent.
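To make the chunk-and-dedup idea concrete, below is a minimal sketch of traditional chunk-based dedup, not the talk's actual trad-dedup implementation. It assumes fixed-size chunks and SHA-1 fingerprints for simplicity (real systems usually use content-defined Rabin chunking), and the names seen_chunks and dedup are illustrative.

```python
# Minimal sketch of traditional chunk-based dedup (illustrative, not the
# talk's implementation): split a document into chunks, fingerprint each
# chunk, and ship only chunks the replica has not seen before.
import hashlib

CHUNK_SIZE = 4096          # bytes; the talk evaluates 4KB down to 64B
seen_chunks = set()        # fingerprints of chunks already sent/stored

def dedup(document: bytes):
    """Return (unique_chunks, recipe): the chunk data to ship plus the
    ordered fingerprint list needed to reassemble the document."""
    recipe, unique_chunks = [], []
    for off in range(0, len(document), CHUNK_SIZE):
        chunk = document[off:off + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()
        recipe.append(fp)
        if fp not in seen_chunks:          # duplicate chunks are dropped
            seen_chunks.add(fp)
            unique_chunks.append((fp, chunk))
    return unique_chunks, recipe
```

Because a chunk is only skipped when its bytes match exactly, scattered edits (as on this slide) defeat the scheme: every touched chunk changes its fingerprint and has to be resent.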
Similarity Dedup: instead of requiring exact chunk matches, find a similar existing document and delta-encode each modified region against it; only the delta encoding is sent.
Compress vs. Dedup: 20GB sampled Wikipedia dataset, MongoDB v2.7, 4MB oplog batches.
sdedup: Similarity Dedup. On the primary node, client insertions and updates go to the database and the oplog; unsynchronized oplog entries pass through the delta compressor, which draws source documents from a source document cache. The deduplicated oplog entries are shipped to the secondary node, where the oplog syncer's delta decompressor uses the source documents to re-construct the oplog entries and replay them against the database.
Encoding Steps: (1) identify similar documents, (2) select the best match, (3) delta compression.
Identify Similar Documents: the target document is split with Rabin chunking, and consistent sampling over the chunk fingerprints produces a similarity sketch (e.g., features 32 and 41). Each sketch feature is looked up in a feature index table that maps features to previously indexed documents; candidate documents are scored by how many sketch features they share with the target (e.g., Doc #2: 2, Doc #3: 2, Doc #1: 1).
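A rough sketch of this step, under stated assumptions: fixed-size chunking stands in for Rabin (content-defined) chunking, SHA-1 stands in for the chunk fingerprint, and the sketch is the k smallest fingerprints, one common form of consistent sampling. The names (sketch, feature_index, index_document, find_candidates) and the sketch size are illustrative, not the sdedup API.

```python
# Sketch of similarity detection: sample a few chunk fingerprints per
# document ("features"), index them, and score candidates by feature overlap.
import hashlib
from collections import defaultdict

CHUNK_SIZE = 4096      # stand-in for Rabin chunking's average chunk size
SKETCH_SIZE = 4        # features kept per document (assumed value)

feature_index = defaultdict(set)   # feature -> ids of documents containing it

def sketch(doc: bytes):
    """Chunk the document and consistently sample its chunk fingerprints."""
    features = set()
    for off in range(0, len(doc), CHUNK_SIZE):
        chunk = doc[off:off + CHUNK_SIZE]
        features.add(int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big"))
    return sorted(features)[:SKETCH_SIZE]      # k smallest = consistent sample

def index_document(doc_id: str, doc: bytes):
    """Record a replicated document's sketch features in the index."""
    for f in sketch(doc):
        feature_index[f].add(doc_id)

def find_candidates(target: bytes):
    """Score previously indexed documents by shared sketch features."""
    scores = defaultdict(int)
    for f in sketch(target):
        for doc_id in feature_index.get(f, ()):
            scores[doc_id] += 1
    return dict(scores)        # e.g. {"Doc #2": 2, "Doc #3": 2, "Doc #1": 1}
```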
Selecting the Best Match: candidates are first ranked by similarity score, then re-ranked with a 3x score reward for documents already in the source document cache (no extra database fetch is needed).
Initial ranking: Doc #2 (score 2), Doc #3 (score 2), Doc #1 (score 1).
Final ranking: 1. Doc #3 (cached, score 6), 2. Doc #1 (cached, score 3), 3. Doc #2 (not cached, score 2).
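A small sketch of the re-ranking rule on this slide. The 3x reward for cached source documents comes from the slide; the function name select_best_match, the source_cache structure, and the tie-breaking behavior are assumptions for illustration.

```python
# Sketch of step 2: boost candidates whose source document is already in the
# primary's source document cache, then pick the highest final score.
CACHE_REWARD = 3   # from the slide: cached candidates get a 3x score reward

def select_best_match(scores: dict, source_cache: dict):
    """scores maps candidate doc id -> similarity score (shared features).
    Returns the best candidate id, or None if there are no candidates."""
    def final_score(doc_id):
        return scores[doc_id] * (CACHE_REWARD if doc_id in source_cache else 1)
    return max(scores, key=final_score) if scores else None

# Reproducing the slide's example: Doc #3 and Doc #1 are cached, so Doc #3
# wins with a final score of 2 * 3 = 6.
cache = {"Doc #1": b"...", "Doc #3": b"..."}
assert select_best_match({"Doc #2": 2, "Doc #3": 2, "Doc #1": 1}, cache) == "Doc #3"
```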
Delta Compression: byte-level diff between the source and target documents, based on the xdelta algorithm but tuned for improved speed with minimal loss of compression. Encoding emits descriptors for duplicate/unique regions plus the unique bytes; decoding takes the source document and the encoded output and concatenates the byte regions in order.
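Here is a simplified sketch of the delta encode/decode round trip. Python's difflib stands in for the xdelta-style byte-level diff that sdedup actually uses, and the COPY/INSERT descriptor format is an illustrative assumption; it mirrors the slide's idea of descriptors for duplicate/unique regions plus the unique bytes.

```python
# Sketch of delta compression: describe the target as COPY regions taken from
# the source plus INSERT regions of unique bytes; decoding concatenates the
# byte regions in order. (difflib is a stand-in for xdelta.)
from difflib import SequenceMatcher

def delta_encode(source: bytes, target: bytes):
    """Encode target as ("COPY", source_offset, length) / ("INSERT", bytes) ops."""
    ops, pos = [], 0
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for src_off, tgt_off, length in matcher.get_matching_blocks():
        if tgt_off > pos:                               # unique region
            ops.append(("INSERT", target[pos:tgt_off]))
        if length:                                      # duplicate region
            ops.append(("COPY", src_off, length))
        pos = tgt_off + length
    return ops

def delta_decode(source: bytes, ops) -> bytes:
    """Rebuild the target from the source document and the encoded output."""
    out = bytearray()
    for op in ops:
        if op[0] == "COPY":
            _, src_off, length = op
            out += source[src_off:src_off + length]
        else:
            out += op[1]
    return bytes(out)

src = b"The quick brown fox jumps over the lazy dog."
tgt = b"The quick brown fox leaps over the lazy dog!"
assert delta_decode(src, delta_encode(src, tgt)) == tgt
```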
Evaluation: MongoDB (v2.7) setup with 1 primary node, 1 secondary node, and 1 client. Node config: 4 cores, 8GB RAM, 100GB HDD storage. Datasets: Wikipedia dump (20GB sampled out of ~12TB) and Stack Exchange data dump (10GB sampled out of ~100GB).
Compression: Wikipedia (20GB sampled dataset). Compression ratio by chunk size:
4KB: sdedup 38.4x vs. trad-dedup 2.3x
1KB: sdedup 38.9x vs. trad-dedup 4.6x
256B: sdedup 26.3x vs. trad-dedup 9.1x
64B: sdedup 15.2x vs. trad-dedup 9.9x
Memory: Wikipedia (20GB sampled dataset). Index memory usage by chunk size:
4KB: sdedup 34.1MB vs. trad-dedup 80.2MB
1KB: sdedup 47.9MB vs. trad-dedup 133.0MB
256B: sdedup 57.3MB vs. trad-dedup 272.5MB
64B: sdedup 61.0MB vs. trad-dedup 780.5MB
Compression: StackExchange (10GB sampled dataset). Both sdedup and trad-dedup achieve only modest compression ratios on this dataset, roughly 1.0x to 1.8x across the 4KB, 1KB, 256B, and 64B chunk sizes.
Throughput Overhead
Dedup + Sharding: compression ratio is nearly unchanged as the number of shards grows: 38.4x (1 shard), 38.2x (3 shards), 38.1x (5 shards), 37.9x (9 shards). 20GB sampled Wikipedia dataset.
Failure Recovery: behavior around the failure point. 20GB sampled Wikipedia dataset.
Conclusion: similarity-based deduplication for replicated document databases. sdedup for MongoDB (v2.7) achieves much greater data reduction than traditional dedup (up to 38x compression ratio for Wikipedia), with a resource-efficient design for inline deduplication and negligible performance overhead.
What's Next? Port the code to MongoDB v3.1, integrating sdedup into the WiredTiger storage manager. Need to test with more workloads. Try not to get anyone pregnant.
WiredTiger vs. sdedup compression ratio (20GB sampled Wikipedia dataset):
Snappy: 1.6x
zlib: 3.0x
sdedup (no compression): 38.4x
sdedup + Snappy: 60.8x
sdedup + zlib: 114.5x
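A quick sanity check on these numbers: the gains compose roughly multiplicatively, since sdedup removes cross-document redundancy and the block compressor then squeezes what remains. 38.4 × 3.0 ≈ 115, close to the measured 114.5x for sdedup + zlib, and 38.4 × 1.6 ≈ 61, close to the 60.8x for sdedup + Snappy.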
END @andy_pavlo