IBM Research - Haifa
Estimating Deduplication Ratios in Large Data Sets
Danny Harnik, Oded Margalit, Dalit Naor, Dmitry Sotnikov, Gil Vernik
Estimating dedupe and compression ratios: some motivation
- Data reduction does not come for free: it incurs overheads, sometimes significant, and is not always worth the effort. Better to know in advance.
- Different techniques give different ratios. What technique to use? Chunks? Fixed or variable-sized? Full-file? Compression?
- Sometimes it is better to consolidate storage pools, sometimes not.
- Different data reduction ratios mean a different number of disks to buy. Disks = money!
How do you estimate efficiently?
- Option 1: according to data type, application type, etc. Can be grossly inaccurate: I've seen the same DB application with dedupe ratios of 1:2 and 1:50. Better to actually look at the data.
- Option 2: sample a small portion of the data set and deduce from it. Problematic when deduplication is involved.
Data set == picture; identical block == identical picture block
- In real life we don't see the full data set at once. Rather, we probe locations.
- In sampling we only probe a small number of locations.
Why sampling is problematic for dedupe estimation
- Suppose the dedupe ratio is 1:2. In order to see the duplicates, the sample must hit the exact same block twice.
- Birthday paradox: need ~√N elements to hit a collision, and Ω(√N) to see a large number of collisions (see the toy simulation below).
- What about triple collisions?
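To make the birthday-paradox point concrete, here is a toy Python simulation (not from the talk) of a data set with dedupe ratio 1:2: every block is stored exactly twice, and we count how many uniformly random probes it takes before any block is seen a second time.

```python
import random

def samples_until_first_duplicate(num_unique_blocks):
    """Data set with dedupe ratio 1:2: every unique block appears exactly twice.
    Probe block locations uniformly at random (without replacement) and count
    how many probes it takes before the same underlying block is seen twice."""
    # Locations 0..2N-1; location i holds block id i // 2 (each block stored twice).
    locations = list(range(2 * num_unique_blocks))
    random.shuffle(locations)
    seen = set()
    for probes, loc in enumerate(locations, start=1):
        block_id = loc // 2
        if block_id in seen:
            return probes
        seen.add(block_id)
    return len(locations)

if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        trials = [samples_until_first_duplicate(n) for _ in range(20)]
        avg = sum(trials) / len(trials)
        # The number of probes grows roughly like sqrt(N), per the birthday paradox.
        print(f"N={n:>9,}  avg probes to first duplicate ~ {avg:,.0f}  (sqrt(2N) ~ {(2*n)**0.5:,.0f})")
```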
Formal limitations of sampling for Distinct Elements
Lower bounds: one can show two data sets with far-apart dedupe ratios but the same behavior under sampling of O(N^(1-ε)) elements.
- Charikar, Chaudhuri, Motwani & Narasayya 2000: showed basic bounds and empirical failures of approximating distinct elements.
- Raskhodnikova, Ron, Shpilka & Smith 2009: need Ω(N^(1-ε)) samples to approximate to within a multiplicative factor.
- Valiant & Valiant 2011: need Ω(N / log N) samples for the same task.
Bottom line: need to look at essentially the whole data set.
Algorithms that see the whole data set: challenges
- Still a hard task: high resource requirements (memory, time, disk accesses).
- Naïve approach: simulate the actual dedupe, just don't store the data. Problem: simply storing the table of hash indices is too big for memory. E.g. [Meyer & Bolosky, FAST '11]: metadata for 7 TB of data came to 24 GB of metadata. Using an efficient dynamic data structure requires additional overhead. Many solutions, none easy. (A sketch of the naïve approach follows below.)
- Goal 1: find a memory-efficient estimation scheme with high accuracy.
- Goal 2: be as efficient as possible (time & CPU). More pronounced when local (LZ) compression must be integrated with deduplication.
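For reference, a minimal sketch of the naïve "simulate the dedupe, keep the whole hash index" approach; the chunking and hashing choices here (SHA-1, an in-memory Counter) are illustrative, not the paper's, and the in-memory index is exactly the part that does not scale.

```python
import hashlib
from collections import Counter

def naive_dedupe_ratio(chunks):
    """Naïve estimation: hash every chunk and keep the full index in memory.
    Accurate, but the index (one entry per unique chunk) can reach tens of GB
    for multi-TB data sets, which is precisely the problem described above."""
    index = Counter()                  # hash -> number of occurrences
    total = 0
    for chunk in chunks:               # chunks: iterable of bytes objects
        index[hashlib.sha1(chunk).digest()] += 1
        total += 1
    unique = len(index)
    return total / unique if unique else 1.0   # e.g. 2.0 means a 2:1 dedupe ratio
```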
Plan for rest of talk Our Algorithm Integrating compression Analysis both formal and empirical Related work Full-file deduplication a special case
Our Algorithm: Sample & Scan
(Illustration: tallies of the sampled blocks, e.g. 4, 1, 2, 1, 1, 1)
- Sample phase: create a base sample and store it in memory.
- Scan phase: count appearances only of elements in the base sample.
- Summary: average the base-sample tallies to give the estimation.
A sketch of the flow follows below.
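A minimal sketch of the Sample & Scan flow, assuming fixed-size chunks already split out and a uniform random sample of chunk positions; the function name and parameters are illustrative, not the paper's. For clarity the chunks are held in a list, whereas the real scheme streams them from disk in both phases.

```python
import hashlib
import random

def sample_and_scan(chunks, base_sample_size, seed=0):
    """Two-pass dedupe-ratio estimation (a sketch of the Sample & Scan idea).

    Sample phase: pick random chunk positions, keep only their hashes in memory.
    Scan phase:   count how many times each sampled hash appears in the data set.
    Summary:      a chunk with duplicate count d contributes 1/d of a chunk to
                  the deduplicated output; the average contribution over the
                  sampled positions estimates output_size / input_size.
    """
    chunks = list(chunks)                          # iterable of non-empty bytes objects
    rng = random.Random(seed)
    positions = rng.sample(range(len(chunks)), base_sample_size)

    # Sample phase: only the sampled hashes are kept in memory.
    pos_hash = {p: hashlib.sha1(chunks[p]).digest() for p in positions}

    # Scan phase: tally appearances of the sampled hashes only.
    tallies = dict.fromkeys(pos_hash.values(), 0)
    for chunk in chunks:
        h = hashlib.sha1(chunk).digest()
        if h in tallies:
            tallies[h] += 1

    # Summary: average contribution 1/d per sampled position.
    contributions = [1.0 / tallies[pos_hash[p]] for p in positions]
    output_fraction = sum(contributions) / len(contributions)
    return 1.0 / output_fraction                   # dedupe ratio (e.g. 2.0 = 2:1)
```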
What about compression?
(Illustration: the sampled blocks' tallies 4, 1, 2, 1, 1, 1 now annotated with compression ratios such as 0.04, 0.9, 0.05, 0.6, 0.2, 0.4)
- Add compression evaluation only to elements in the base sample, and only in the sample phase. Crucial, since compression is a time-consuming operation.
- In the summary, take the compression stats into account (see the extended sketch below).
- Note: estimating dedupe and compression ratios separately is not sufficient for an accurate estimate of the combination.
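The same sketch extended with compression, with zlib standing in for whatever local (LZ) compressor is actually targeted; only the sampled chunks are ever compressed.

```python
import hashlib
import random
import zlib

def sample_and_scan_with_compression(chunks, base_sample_size, seed=0):
    """Sample & Scan extended with compression (a sketch).

    Only the sampled chunks are compressed, in the sample phase; the scan
    phase is unchanged.  A sampled chunk with duplicate count d and
    compression ratio c contributes c/d of a chunk to the reduced output."""
    chunks = list(chunks)                          # non-empty bytes objects
    rng = random.Random(seed)
    positions = rng.sample(range(len(chunks)), base_sample_size)

    # Sample phase: hash *and* compress only the sampled chunks.
    pos_hash = {p: hashlib.sha1(chunks[p]).digest() for p in positions}
    comp_ratio = {p: len(zlib.compress(chunks[p])) / len(chunks[p]) for p in positions}

    # Scan phase: tallies of the sampled hashes only (no compression here).
    tallies = dict.fromkeys(pos_hash.values(), 0)
    for chunk in chunks:
        h = hashlib.sha1(chunk).digest()
        if h in tallies:
            tallies[h] += 1

    # Summary: combined contribution c/d per sampled position.
    contributions = [comp_ratio[p] / tallies[pos_hash[p]] for p in positions]
    return sum(contributions) / len(contributions)   # reduced_size / original_size
```

Because each sampled block carries its own compression ratio, correlations between how well a block compresses and how often it repeats are captured, which is exactly what multiplying two separately estimated ratios would miss.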
Formal Analysis
- We prove a relation between the size of the base sample and the accuracy achieved for a given confidence parameter. Accuracy: by how much we may err. Confidence: with what probability.
- Intuition: each block in the data set has a contribution to the overall reduced output. Example: if a block's dedupe count is 4, then each of its copies contributes 1/4 of a block to the overall output.
- We attempt to estimate the average contribution over the whole data set. Average contribution = total output size / total input size.
- Estimation of averages is well studied; our estimate should form a normal distribution around the actual value. A concentration bound of this flavor is sketched below.
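The paper's exact bound is not reproduced here; the following Hoeffding-style bound for averages of per-block contributions in (0, 1] captures the relation between sample size m, relative accuracy ε, and confidence δ.

```latex
% A Hoeffding-style bound (a sketch, not the paper's exact statement).
% X_i \in (0,1] is the contribution of the i-th sampled block (1/d, or c/d with
% compression); \bar{X} = \tfrac{1}{m}\sum_{i=1}^{m} X_i estimates the true
% output fraction \mu = \text{output size}/\text{input size}.
\Pr\bigl[\, |\bar{X} - \mu| \ge \epsilon\,\mu \,\bigr]
  \;\le\; 2\exp\!\bigl(-2\,m\,\epsilon^{2}\mu^{2}\bigr)
\qquad\Longrightarrow\qquad
m \;\ge\; \frac{\ln(2/\delta)}{2\,\epsilon^{2}\mu^{2}}
\ \text{suffices for confidence } 1-\delta .
```

Since μ = 1/r for an r:1 dedupe ratio, the required sample size grows like r²/ε², which is why the next slide assumes some bound on the dedupe ratio.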
Example of sample size vs. error and confidence

Dedupe Ratio | Error | Confidence | Sample Size | Memory
3:1          | 0.01  | 0.0001     | 44,557      | 10.7 MB
5:1          | 0.01  | 0.0001     | 1,237,938   | 29.7 MB
15:1         | 0.01  | 0.0001     | 11,142,000  | 267 MB

- The error is a percentage of the output size. Good dedupe means a small output, so a 1% error is much smaller in absolute values. This is why some bound on the dedupe ratio is required.
- If an error as a percentage of the original size is acceptable, then life is easier and the memory requirements become even smaller.
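As a sanity check, a small calculation under the assumed Hoeffding-style formula sketched above; it closely matches the larger rows of the table, though the paper's own constants and rounding may differ.

```python
import math

def sample_size_hoeffding(dedupe_ratio, rel_error, delta):
    """Sample size from the assumed Hoeffding-style bound: quadratic in the
    dedupe-ratio bound, quadratic in 1/error, logarithmic in 1/confidence."""
    mu = 1.0 / dedupe_ratio            # output fraction under an r:1 dedupe ratio
    return math.ceil(math.log(2.0 / delta) / (2.0 * rel_error**2 * mu**2))

for r in (3, 5, 15):
    print(r, sample_size_hoeffding(r, rel_error=0.01, delta=0.0001))
```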
Empirical Analysis
We evaluated our estimation on 4 real-life data sets:
1. Workstation data
2. File repository of an organization
3. Backup repository
4. Exchange DB periodical backups
The largest data set was the file repository, with 7 TB of data. The tests back up our formal analysis.
(Figure: convergence of the error percentage as the base sample grows)
(Figure: the error is indeed normally distributed around the actual ratio)
Related Work
- Distinct elements is a heavily studied topic: DB analysis, streaming algorithms [FM85], [AMS99], [KNW2010], and many more.
- Distinct-elements works: focus on one-pass algorithms; always fixed-size elements, not files or variable-sized blocks; do not consider compression.
- Our work: not one pass; comparable to most works memory-wise; when combined with compression we give the best performance (minimal number of chunks to compress).
Full-file deduplication
Files have properties and metadata that can help us: a different file length means no dedupe; a different hash on the first block means no dedupe.
Our algorithm (a sketch of the scan-phase filter follows below):
- In the sample phase, also keep the file length and first-block hash of the base sample.
- During the scan phase: check whether the file size is relevant to the base sample; if not, discard. Check whether the first-block hash is relevant to the base sample; if not, discard. If still relevant, then read the whole file from disk.
Overall: all metadata is scanned, but not all files are read!
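A sketch of the scan-phase filter for full-file dedupe; the FIRST_BLOCK size, SHA-1, and the helper names are illustrative assumptions, and base_sample is assumed to map each sampled file's full-content hash to the (size, first-block hash) pair recorded in the sample phase.

```python
import hashlib
import os

FIRST_BLOCK = 4096  # bytes hashed as the "first block" (an illustrative choice)

def first_block_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(FIRST_BLOCK)).digest()

def full_file_hash(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.digest()

def scan_phase(all_paths, base_sample):
    """Scan-phase filter for full-file dedupe (a sketch; names are illustrative).

    base_sample: sampled file's full hash -> (size, first-block hash).
    A file is read in full only if both its size and its first-block hash
    match some sampled file; everything else is filtered out on metadata
    or a single block read."""
    sizes = {meta[0] for meta in base_sample.values()}
    first_hashes = {meta[1] for meta in base_sample.values()}
    tallies = dict.fromkeys(base_sample, 0)

    for path in all_paths:
        if os.path.getsize(path) not in sizes:          # metadata only, no read
            continue
        if first_block_hash(path) not in first_hashes:  # reads one block only
            continue
        h = full_file_hash(path)                        # full read, rare case
        if h in tallies:
            tallies[h] += 1
    return tallies
```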
(Figure: full-file estimation, percentage of data read from disk vs. desired error guarantee)
What to take home from this talk?
1. Be careful when doing sampling for dedupe estimation.
2. We have a good algorithm if you can run a full scan or already have metadata available.
3. Our algorithm integrates compression and dedupe naturally, with practically no overhead when adding compression.
4. For full-file deduplication we reduce the data reads substantially!
Thank You!