Data Reduction: Deduplication and Compression. Danny Harnik IBM Haifa Research Labs



Motivation
- Reducing the amount of data is a desirable goal.
- Data reduction: an attempt to compress the huge amounts of data at hand.
- Is it possible? Information-theoretically? Technically?
- Is it financially worth it?
  - Storage is becoming cheaper all the time.
  - Data reduction requires resources and time.

Compression and Deduplication
- Compression: what is the most succinct representation of this file?
- Deduplication: hasn't this file appeared before?
- Different workloads give different results: some favor compression, some favor dedup.
- Sometimes the combination is best.

Compression

Compression
- Zip runs an algorithm called DEFLATE, a combination of two techniques:
  - Lempel-Ziv [1977]
  - Huffman code [1952]
- Will show these 2 techniques + arithmetic encoding.
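A quick way to see DEFLATE at work is Python's standard `zlib` module, which wraps a DEFLATE stream in a small zlib header and checksum. The sketch below is only an illustration; the sample inputs are made up for the demonstration and are not from the talk.

```python
import os
import zlib

# Highly repetitive input: back-references and short Huffman codes help a lot.
repetitive = b"let it be, " * 1000

# Random bytes have (almost) no redundancy, so DEFLATE cannot shrink them.
random_data = os.urandom(len(repetitive))

packed_repetitive = zlib.compress(repetitive, 9)  # 9 = strongest compression level
packed_random = zlib.compress(random_data, 9)

print(len(repetitive), "->", len(packed_repetitive), "bytes (repetitive)")
print(len(random_data), "->", len(packed_random), "bytes (random)")

# Decompression restores the original bytes exactly (lossless).
assert zlib.decompress(packed_repetitive) == repetitive
assert zlib.decompress(packed_random) == random_data
```

The repetitive input shrinks to a small fraction of its size, while the random input stays at roughly its original length.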

LZ77 Compression
- Go over a stream. At each point, search for the longest identical string that has already appeared in the past.
  - If none appeared, write the string as is.
  - If one appeared, save a pointer to the start of the string (how many bytes back) and the length of the current string.
- Many variations:
  - How far to search back? Typically 32KB.
  - LZ78 holds a dictionary table.
- A good approximation of the entropy for some sources.
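The search-and-point idea can be sketched in a few lines. This is a deliberately naive, brute-force version for illustration, not how DEFLATE implementations search. The token format (a plain integer for a literal byte, a `(distance, length)` tuple for a back-reference) and the `min_len` threshold are assumptions made for this sketch.

```python
def lz77_encode(data: bytes, window: int = 32 * 1024, min_len: int = 3):
    """Greedy LZ77: emit literal bytes or (distance, length) back-references."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the window for the longest earlier match.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlapping copy); that is allowed.
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))  # pointer back + length
            i += best_len
        else:
            tokens.append(data[i])  # literal byte, stored as an int
            i += 1
    return tokens


def lz77_decode(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte by byte, so overlapping copies work
        else:
            out.append(tok)
    return bytes(out)


text = b"abracadabra abracadabra abracadabra"
tokens = lz77_encode(text)
print(len(text), "bytes ->", len(tokens), "tokens")
assert lz77_decode(tokens) == text
```

Real encoders replace the quadratic search with hash chains or similar structures, and then entropy-code the tokens (with Huffman codes in DEFLATE).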

Huffman Code
- An information-theoretic approach to compression.
- The characters (or bytes) in a typical text of n characters are not uniformly distributed.
- Use the skewed distribution to achieve a shorter representation: the most popular character gets the shortest representation.
- E.g., in a typical English text: use the shortest encoding for "e" and the longest for "q".
- Huffman code: a method of representing a text using nearly its Shannon entropy worth in bits.
- Optimal when considering just single characters.

Huffman Code (example)
[Figure: Huffman tree built for the text "this is an example of a huffman tree". Example taken from: http://sector0.dk/?p=29]
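The construction behind such a tree can be sketched as follows: repeatedly merge the two least frequent subtrees, prefixing "0" to the codes on one side and "1" on the other. The function name `huffman_code` and the dictionary-based tree representation are choices made for this sketch. The exact bit patterns depend on how ties are broken, so they may differ from the figure, but the total encoded length does not.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict:
    """Return a prefix-free code {symbol: bitstring} built from symbol counts."""
    counts = Counter(text)
    if len(counts) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


text = "this is an example of a huffman tree"
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
print(len(text) * 8, "bits as 8-bit characters ->", len(encoded), "bits Huffman-coded")
for sym in sorted(code, key=lambda s: (len(code[s]), s)):
    print(repr(sym), code[sym])
```

Frequent symbols such as the space end up with short codes and rare ones such as "x" with long codes, which is the skew the slide describes.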

Deduplication

Deduplication
- Similar to Lempel-Ziv 78, but at a whole different scale.
  - The basic block is typically ~4KB, 8KB, 16KB, or a full file, rather than a byte or a string of bytes.
- An ongoing process: need to address a file after it is saved and closed.
- Two main approaches:
  - Inline dedup: process data as it arrives.
  - Offline dedup: a background process; first save the data, then dedup in spare time.

How to dedupe?
- Fingerprint each block using a hash function. Common hashes used: SHA-1, SHA-256, others.
- Store an index of all the hashes already in the system.
- New block:
  - Compute its hash.
  - Look the hash up in the index table.
  - If new: add it to the index.
  - If known hash: store as a pointer to the existing data.
- If the hash is known, do you want to look at the actual data?
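The steps above can be sketched as a toy in-memory store. The class name `DedupStore`, its methods, and the tiny block size used in the demo are all assumptions for illustration; a real system keeps the index and the block data in separate, persistent structures. The byte comparison on a known hash is one possible answer to the last question on the slide.

```python
import hashlib


class DedupStore:
    """Toy fixed-block deduplicating store (in memory, illustrative only)."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block data (index and data in one dict)
        self.files = {}   # file name -> list of fingerprints (pointers)

    def put(self, name: str, data: bytes) -> int:
        """Store a file; return the number of new blocks actually written."""
        new_blocks = 0
        recipe = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            fp = hashlib.sha256(block).hexdigest()   # compute hash
            if fp not in self.blocks:                # look up in index
                self.blocks[fp] = block              # new: add to index
                new_blocks += 1
            elif self.blocks[fp] != block:           # optional paranoid check
                raise RuntimeError("hash collision detected")
            recipe.append(fp)                        # pointer to the data
        self.files[name] = recipe
        return new_blocks

    def get(self, name: str) -> bytes:
        return b"".join(self.blocks[fp] for fp in self.files[name])


store = DedupStore(block_size=4)
print(store.put("a.bin", b"AAAABBBBAAAA"))  # blocks AAAA, BBBB, AAAA
print(store.put("b.bin", b"BBBBCCCC"))      # BBBB is already stored
assert store.get("a.bin") == b"AAAABBBBAAAA"
```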

Client-side deduplication
- A method to save bandwidth as well as storage.
- Also known as source-based dedupe or WAN deduplication.
- The client computes the hash and sends it to the server.
  - If new: the server requests the data from the client (upload data).
  - Otherwise (dedupe): skip the upload and add a new pointer to the data.
[Figure: Client and Server. The client hashes "Let it be.mp3" to 2fd4e1; the server index holds 2fd4e1 for "Let it be.mp3".]
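The exchange can be sketched with two in-process objects standing in for the network. The names `Server`, `offer`, `upload`, and `client_backup` are hypothetical, as is the choice of whole-file hashing with SHA-1; the point is only that the second client sends a hash and no payload.

```python
import hashlib


class Server:
    def __init__(self):
        self.index = {}     # hash -> data
        self.pointers = {}  # (user, filename) -> hash

    def offer(self, user: str, filename: str, digest: str) -> bool:
        """The client sends only the hash. Return True if an upload is needed."""
        if digest in self.index:
            self.pointers[(user, filename)] = digest  # dedupe: add a pointer only
            return False
        return True

    def upload(self, user: str, filename: str, data: bytes) -> None:
        digest = hashlib.sha1(data).hexdigest()  # server recomputes the hash
        self.index[digest] = data
        self.pointers[(user, filename)] = digest


def client_backup(server: Server, user: str, filename: str, data: bytes) -> int:
    """Back up one file; return the number of payload bytes sent."""
    digest = hashlib.sha1(data).hexdigest()
    if server.offer(user, filename, digest):
        server.upload(user, filename, data)
        return len(data)
    return 0


server = Server()
song = b"placeholder bytes standing in for an mp3 file" * 100
print("alice sent", client_backup(server, "alice", "Let it be.mp3", song), "bytes")
print("bob sent", client_backup(server, "bob", "Let it be.mp3", song), "bytes")
```

In this sketch the server stores one copy of the data and two pointers to it.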

Choice of hash function
- In most deduplication systems fingerprinting is done using a cryptographic hash, usually SHA-1, which has an output of 160 bits.
- Probability p of a collision, where n is the number of blocks and b is the number of bits in the hash:

  $$p \le \frac{n(n-1)}{2} \cdot \frac{1}{2^{b}}$$

- The above is true for any random hash function. However, a malicious adversary may choose blocks especially to create a collision. This is why a cryptographic hash is used.
- A cryptographic hash is typically more expensive to compute than a random-like (non-cryptographic) hash function.
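To get a feel for the magnitude of the bound n(n-1)/2 * 2^(-b), it helps to plug in numbers. The helper below is a sketch that works in log2 to avoid floating-point underflow. The example system size (2^40 blocks, which is 4 PiB at 4KB per block) is an assumption chosen for illustration.

```python
import math


def log2_collision_bound(n_blocks: int, hash_bits: int) -> float:
    """Return log2 of the bound n(n-1)/2 * 2^(-b); requires n_blocks >= 2."""
    pairs = n_blocks * (n_blocks - 1) // 2  # number of block pairs, exact integer
    return math.log2(pairs) - hash_bits


n = 2 ** 40  # assumed: 2^40 blocks, i.e. 4 PiB of 4KB blocks
for name, bits in (("SHA-1", 160), ("SHA-256", 256)):
    print(f"{name}: p <= 2^{log2_collision_bound(n, bits):.1f}")
```

For SHA-1 this gives a bound of about 2^-81: accidental collisions are not the practical concern, which is why the remaining worry on the slide is a malicious adversary.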

Issues
- Smaller blocks = better dedup.
- But smaller blocks = more work: more fingerprints, more searches, more metadata.
- Bottom line: the choice of block size depends on the workload, e.g. a file system with a 1KB page size.
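The metadata side of this trade-off is simple arithmetic. The sketch below assumes the worst case where every block is unique, a 20-byte (SHA-1 sized) fingerprint per block, and no other per-entry overhead. The function name and the 1 TiB data size are illustrative choices.

```python
def index_size_bytes(data_bytes: int, block_size: int, fingerprint_bytes: int = 20) -> int:
    """Fingerprint storage needed if every block is unique (worst case)."""
    n_blocks = -(-data_bytes // block_size)  # ceiling division
    return n_blocks * fingerprint_bytes


TIB = 1 << 40
GIB = 1 << 30
for kib in (1, 4, 8, 16):
    size = index_size_bytes(TIB, kib * 1024)
    print(f"{kib:>2} KiB blocks: {size / GIB:.2f} GiB of fingerprints per TiB of data")
```

Halving the block size doubles the number of fingerprints to compute, look up, and store.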

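Alignment issues
- What if we insert 1 byte into an existing file? The data is almost identical, but fixed-block dedup will fail miserably, because every block after the insertion point shifts by one byte and no longer matches.
- Solution: variable block size.
- Rabin-Karp fingerprinting: compute a rolling hash and cut when the hash equals 0 mod p. The average block size is then p.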
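The effect is easy to reproduce with a small experiment. The sketch below uses a Rabin-Karp style polynomial rolling hash over a sliding window and cuts wherever the hash is 0 mod p. The constants `WINDOW`, `BASE`, and `MOD` and the value p = 64 are assumptions picked for the demo, not values from the talk. Production systems use much larger averages and add minimum and maximum block sizes.

```python
import random

WINDOW = 16                       # bytes covered by the rolling hash (assumed)
BASE = 257                        # polynomial base (assumed)
MOD = (1 << 61) - 1               # prime modulus of the hash itself (assumed)
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def chunks(data: bytes, p: int = 64):
    """Content-defined chunking: cut after positions where hash == 0 mod p."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # shift in the new byte
        if i + 1 >= WINDOW and h % p == 0:          # cut point depends on content only
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out


def fixed_blocks(data: bytes, size: int = 64):
    return [data[i:i + size] for i in range(0, len(data), size)]


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
modified = b"X" + original  # insert a single byte at the front

for name, split in (("fixed", fixed_blocks), ("variable", chunks)):
    before, after = split(original), split(modified)
    shared = len(set(before) & set(after))
    print(f"{name}: {shared} of {len(before)} original blocks still found")
```

With fixed blocks essentially nothing matches after the insertion, while with content-defined cuts only the block containing the inserted byte changes.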

Existing data reduction solutions (A sample of solutions for storage systems)

Deduplication: some systems and applications
- Content Addressable Storage (CAS), mainly for archiving: Venti (Lucent), Centera (EMC), JumboStore (HP), Hydrastor (NEC)
- Backup:
  - Virtual Tape Library (VTL) backup: Diligent (IBM), DataDomain (EMC), D2D (HP)
  - Backup with client-side dedup:
    - Cloud backup services: Mozy (EMC), Dropbox, ...
    - Avamar (EMC), Ocarina (Dell), NetBackup (Symantec)
    - Tivoli Storage Manager (IBM)
- Primary (mainly file systems), useful for VM images:
  - NetApp Filer: 2-to-1 ratio guarantee on some VMware usage
  - ZFS (Sun open source file system)
  - Dell (planned for next year)

Compression in storage systems
- Real-time (inline): RTC (IBM), ZFS (Oracle), Nimble Storage
- Offline / mix:
  - EMC Data Compression
  - Dell (planned for next year: dedupe inline, compression offline)
  - NetApp: writes online, updates offline
- Backup

Dedup vs. Compression vs. both
[Figure: "Compression and Deduplication for Various Data Types". Y-axis: data reduction ratio (compressed size / original size), from 0 to 1.2. X-axis: data type (VM Images, Medical Images, Website Archive, Project Repository, DB2 TPC, Laptop1 (29.9GB)). Series: Compress (Gzip), DedupV (4K, var), DedupV+Compress, DedupF (4K, fix), DedupF+Compress, Compress+DedupV, Compress+DedupF.]
Data taken from C. Constantinescu, J. Glider, D. Chambliss: Mixing Deduplication and Compression on Active Data Sets. DCC 2011

Summary
- Data reduction is a useful concept, but not for all cases.
- Compression and deduplication: two similar concepts at the two ends of the same scale.
- The large scale in dedupe creates new challenges.
- Different challenges and use cases.
- No one solution fits all.