Open Source Data Deduplication



Similar documents
SDFS Overview. By Sam Silverberg

Protecting the Microsoft Data Center with NetBackup 7.6

DEDUPLICATION NOW AND WHERE IT S HEADING. Lauren Whitehouse Senior Analyst, Enterprise Strategy Group

09'Linux Plumbers Conference

SOP Common service PC File Server

How To Protect Your Data From Being Damaged On Vsphere Vdp Vdpa Vdpo Vdprod (Vmware) Vsphera Vdpower Vdpl (Vmos) Vdper (Vmom

Backup and Disaster Recovery Planning On a Budget. Presented by: Najam Saeed Lisa Ulrich

For Hyper-V Edition Practical Operation Seminar. 4th Edition

Recoup with data dedupe Eight products that cut storage costs through data deduplication

Rethinking Backup in a Virtualized World

Backup Software Data Deduplication: What you need to know. Presented by W. Curtis Preston Executive Editor & Independent Backup Expert

Deduplication Demystified: How to determine the right approach for your business

VMware Backup and Deduplication. Written by: Anton Gostev Product Manager Veeam Software

Evaluation of the deduplication file system "Opendedup"

" " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " Datto Makes Virtual Hybrid Cloud Backup Easy! Product Analysis!

Data deduplication technology: A guide to data deduping and backup

LDA, the new family of Lortu Data Appliances

Seriously: Tape Only Backup Systems are Dead, Dead, Dead!

Hardware Configuration Guide

Dell AppAssure cloud replication

Backup Retention and Data Availability

Data Compression and Deduplication. LOC Cisco Systems, Inc. All rights reserved.

(Formerly Double-Take Backup)

Redefining Backup for VMware Environment. Copyright 2009 EMC Corporation. All rights reserved.

Demystifying Deduplication for Backup with the Dell DR4000

Efficient Backup with Data Deduplication Which Strategy is Right for You?

VMware vsphere Data Protection 6.0

VMware vsphere Data Protection 5.8 TECHNICAL OVERVIEW REVISED AUGUST 2014

UNDERSTANDING DATA DEDUPLICATION. Tom Sas Hewlett-Packard

CA ARCserve Family r15

Business-centric Storage FUJITSU Storage ETERNUS CS800 Data Protection Appliance

Trends in Application Recovery. Andreas Schwegmann, HP

Server and Storage Virtualization with IP Storage. David Dale, NetApp

VMware vsphere Data Protection 6.1

An Authorized Duplicate Check Scheme for Removing Duplicate Copies of Repeating Data in The Cloud Environment to Reduce Amount of Storage Space

Optimizing Ext4 for Low Memory Environments

Restoration Technologies. Mike Fishman / EMC Corp.

Online Backup Plus Frequently Asked Questions

Investigation of storage options for scientific computing on Grid and Cloud facilities

VMware VDR and Cloud Storage: A Winning Backup/DR Combination

Backup and recovery as agile as the virtual machines being protected

Checklist and Tips to Choosing the Right Backup Strategy

Barracuda Backup Deduplication. White Paper

Backup and Recovery Best Practices With Tintri VMstore

ITCertMaster. Safe, simple and fast. 100% Pass guarantee! IT Certification Guaranteed, The Easy Way!

Data Protection with NETGEAR Storage. Smart Solutions for Disk-to-Disk Backup and Disaster Recovery for SMB s and Multi-Office Environments

UNDERSTANDING DATA DEDUPLICATION. Thomas Rivera SEPATON

Database Virtualization Technologies

Large Installation Administration. Comparing open source deduplication performance for virtual machines

EMC DATA DOMAIN OVERVIEW. Copyright 2011 EMC Corporation. All rights reserved.

Understanding EMC Avamar with EMC Data Protection Advisor

Deduplication has been around for several

Price Comparison ProfitBricks / AWS EC2 M3 Instances

Backup and Recovery Redesign with Deduplication

Investigation of storage options for scientific computing on Grid and Cloud facilities

A Survey on Deduplication Strategies and Storage Systems

Dell PowerVault DL2200 & BE 2010 Power Suite. Owen Que. Channel Systems Consultant Dell

Kingston Communications Virtualisation Platforms

Competitive Analysis Retrospect And Our Competition

VMware Data Recovery. Product Overview

Symantec NetBackup deduplication general deployment guidelines

Side channels in cloud services, the case of deduplication in cloud storage

Virtualization Backup/Replication Solution Comparison:

How To Run A Modern Business With Microsoft Arknow

Hardware/Software Guidelines

CEMEX en Concreto con EMC. Jose Luis Bedolla EMC Corporation Back Up Recovery and Archiving

Data Protection & Cloud. Corradino Milone PreSales Commvault Italia

Cloud Storage Backup for Storage as a Service with AT&T

Journey to the All-Flash Data Center

Trends in Data Protection and Restoration Technologies. Mike Fishman, EMC 2 Corporation (Author and Presenter)

AccuGuard Enterprise for RDX

Is VMware Data Recovery the replacement for VMware Consolidated Backup (VCB)? VMware Data Recovery is not the replacement product for VCB.

Lessons Learned while Pushing the Limits of SecureFile LOBs. by Jacco H. Landlust. zondag 3 maart 13

Reducing Backups with Data Deduplication

OVERVIEW. IQmedia Networks Technical Brief

esxreplicator Contents

Investigation of Storage Systems for use in Grid Applications

Supported File Systems

Microsoft Hyper-V chose a Primary Server Virtualization Platform

Next Generation Backup Solutions

Recommended Backup Strategy for FileMaker Server 10 and 11 for Macintosh & Windows Updated September 2010

Data Reduction: Deduplication and Compression. Danny Harnik IBM Haifa Research Labs

NetBackup Best Practice Using Tape Storage with Deduplicating Disk Storage

OBM / FREQUENTLY ASKED QUESTIONS (FAQs) Can you explain the concept briefly on how the software actually works? What is the recommended bandwidth?

Use of Hadoop File System for Nuclear Physics Analyses in STAR

ADVANCED DEDUPLICATION CONCEPTS. Larry Freeman, NetApp Inc Tom Pearce, Four-Colour IT Solutions

Redefining Microsoft SQL Server Data Management

Data deduplication is more than just a BUZZ word

Abila Grant Management. System Requirements

EMC DATA DOMAIN PRODUCT OvERvIEW

Cloud Computing and Amazon Web Services

The LCO Group. Backup & Disaster Recovery. Protect your data. Protect your business.

efolder BDR for Veeam Cloud Connection Guide

DEDUPLICATION BASICS

EMC BACKUP-AS-A-SERVICE

Backup Software Presented by W. Curtis Preston Executive Editor & Independent Backup Expert

The safer, easier way to help you pass any IT exams. Exam : E Backup Recovery - Avamar Expert Exam for Implementation Engineers.

ABB Technology Days Fall 2013 System 800xA Server and Client Virtualization. ABB Inc 3BSE en. October 29, 2013 Slide 1

UNDERSTANDING DATA DEDUPLICATION. Jiří Král, ředitel pro technický rozvoj STORYFLEX a.s.

The future is in the management tools. Profoss 22/01/2008

Transcription:

Open Source Data Deduplication Nick Webb Red Wire Services, LLC nickw@redwireservices.com www.redwireservices.com @disasteraverted

Introduction What is Deduplication? Different kinds? Why do you want it? How does it work? Advantages / Drawbacks Commercial Implementations Open Source implementations, performance, reliability, and stability of each (disclaimer)

What is Data Deduplication Wikipedia:... data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored, along with references to the unique copy of data. Deduplication is able to reduce the required storage capacity since only the unique data is stored. Depending on the type of deduplication, redundant files may be reduced, or even portions of files or other data that are similar can also be removed...

Why Dedupe? Save disk space and money (less disks) Less disks = less power, cooling, and space Improve Writes (maybe) Be smart

Where does it Work Well? Backups/Archives (Secondary Storage) Online backups with limited bandwidth Save disk space additional full backups take little space Replicate disk to disk over slow links, only unique blocks are transfered Virtual Machines (Primary & Secondary) File Shares (Primary)

Not a Fit Unique data Video Pictures Music Encrypted files (many vendors dedupe, then encrypt)

Drawbacks Slow writes, slower reads High CPU/memory utilization (use a dedicated server) Increases data loss risk / corruption

How Does it Work?

Without Dedupe

With Dedupe

Block Reclamation In general, blocks are not removed/freed when a file is removed We must periodically check blocks for references, a block with no reference can be deleted, freeing allocated space Process can be expensive, scheduled during off-peak

Commercial Implementations Just about every backup vendor Symantec CommVault Cloud: Asigra, Baracuda, Dropbox, Mozy, etc. NAS/SAN/Backup Targets NEC HydraStor DataDomain/EMC Avamar Quantum NetApp

Commercial Implementations Dedupe is a competitive advantage on the commercial side Most do it, but many different flavors Fixed block versus sliding block size Global Dedupe

Open Source Implementations Lessfs SDFS (Open Dedupe) ZFS btrfs (soonish) Limited (file based / SIS) BackupPC (reliable!) Rdiff-backup Others?

How Good is it? Many see 10-30x deduplicaiton meaning 10-30 times more logical object storage than physical Especially true in backup or virtual environments

SDFS / OpenDedupe www.opendedup.org Java 7 Based / platform agnostic Uses fuse S3 storage support Snapshots Inline or batch mode deduplication Supposedly fast (290MBps+ on great H/W) Support for global/clustered dedupe Probably most mature OSS Dedupe (IMHO)

SDFS Pro Con Works when configured properly Appears to be multithreaded Slow / resource intensive (CPU/Memory) Fragile, easy to mess up options, leading to crashes, little user feedback Standard POSIX utilities do not show accurate data (e.g. df, must use getfattr -d <mount point>, and calculate bytes GB/TB and % free yourself) Slow with 4k blocks, recommended for VMs

LessFS www.lessfs.com Written in C = Less CPU Overhead Have to build yourself (configure && make && make install) Has replication Uses fuse

LessFS Pro Con Does inline compression by default as well Reasonable VM compression with 128k blocks Fragile (i.e. not robust) Single Threaded? Stats/FS info hard to see: sudo cat /fuse/.lessfs/lessfs_stats less Per file accounting, no totals

Other OSS ZFS? Tried it, and empirically it was a drag, but I have no hard data (got like 3x dedupe with IDENTICAL full backups of Vms)

Kick the Tires Test data set; ~330GB of data 22GB of documents, pictures, music Virtual Machines 220GB Windows 2003 Server with SQL Data 2003 AD DC ~60GB 2003 Server ~8GB Two OpenSolaris VMs, 1.5 & 2.7GB 3GB Windows 2000 VM 15GB XP Pro VM

Kick the Tires Test Environment AWS High CPU Extra Large Instance ~7GB of RAM ~Eight Cores ~2.5GHz each ext4

Compression Performance First round (all unique data) If another copy was put in (like another full), we should expect 100% reduction for that non-unique data (1x dedupe per run) FS Home Data % Home Reduction VM Data % VM Reduction Combined % Total Reduction MBps SDFS 4k 21GB 4.50% 109 GB 64% 128GB 61% 16 lessfs 4k (est.) 24GB -9% N/A 51% N/A 50% 4 SDFS 128k 21GB 4.50% 255 GB lessfs 128k 21GB 4.50% 130 GB tar/gz --fast 21GB 4.50% 178 GB 16% 276GB 15% 40 57% 183GB 44% 24 41% 199GB 39% 35

Compression ( Unique Data) 7 0. 0 0 % 6 4 % 6 1 % 6 0. 0 0 % 5 1 % 5 0 % 5 7 % 5 0. 0 0 % 4 4 % 4 0. 0 0 % 4 1 % 3 9 % 3 0. 0 0 % 2 0. 0 0 % 1 6 % 1 5 % % H o m e R e d u c t i o n % V M R e d u c t i o n % T o t a l R e d u c t i o n 1 0. 0 0 % 4. 5 0 % 0. 0 0 % - 1 0. 0 0 % S D F S 4 k l e s s f s 4 k ( e s t. ) - 9 % 4. 5 0 % S D F S 1 2 8 k 4. 5 0 % l e s s f s 1 2 8 k 4. 5 0 % t a r / g z - - f a s t % V M R e d u c t i o n % H o m e R e d u c t i o n % T o t a l R e d u c t i o n

Write Performance (don't trust this) M B p s 4 0 3 5 3 0 2 5 2 0 M B p s 1 5 1 0 5 0 r a w S D F S 4 k l e s s f s 4 k S D F S 1 2 8 k l e s s f s 1 2 8 k t a r / g z - - f a s t

Load (SDFS 128k)

Open Source Dedupe Pro Con Free Playtime Unstable, PIA, not in repos yet Efforts behind them seem very limited No/Poor documentation

The Future Right now all kinds of backup and storage businesses are pushing dedupe, but eventually it will become zip, and everyone will expect it Until then, everyone is fighting over what is better, and the OSS community seems to be waiting... brtfs Dedupe planned (off-line only)

Conclusion Dedupe is great, if it works and it meets your performance and storage requirements Currently, after 20+ hours using OSS Dedupe, I feel it is not ready Especially considering commercial versions; sometimes it's cheaper just to by 10x storage. If you have to do it now, in production, go commercial (sorry) unless...

Questions?

Open Source Data Deduplication Nick Webb nickw@redwireservices.com @disasteraverted Linuxfest Northwest 2011