Deduplication School 2010




W. Curtis Preston, Executive Editor, TechTarget; Founder/CEO, Truth in IT, Inc. Follow on Twitter @wcpreston

A Little About Me
When I started as the backup guy at a $35B company in 1993:
- Tape drive: QIC 80 (80 MB capacity)
- Tape drive: Exabyte 8200 (2.5 GB & 256 KB/s)
- Biggest server: 4 GB ('93), 100 GB ('96)
- Entire data center: 200 GB ('93), 400 GB ('96)
- My TiVo now has 5 times the storage my data center did!
Consulting in backup & recovery since '96. Author of O'Reilly's Backup & Recovery and Using SANs and NAS. Webmaster of BackupCentral.com. Founder/CEO of Truth in IT. Follow me on Twitter @wcpreston

A Little Bit About Truth in IT, Inc.
- Inspired by Consumer Reports, but designed for IT
- No advertising, no partners = no need to SPIN
- No huge consulting fees just to find out which products work and which ones don't (such fees typically start at $10K and go all the way to $100K!)
- Funded instead by a $999 annual subscription
- Private online community with written research, testing results, podcasts of interviews with users of products, and direct communication with real customers of the products you're interested in, all included
- In beta now at http://www.truthinit.com

Agenda
- Understanding Deduplication
- Using Deduplication in Backup Systems
- Using Data Reduction in Primary Systems
- Recent Backup Software Advancements
- Backing up Virtual Servers
- Backups on a Budget
- Stump Curtis

Session 1 Understanding Deduplication

Why Disk? First, a little history

History of My World, Part I
When I joined the industry (1993):
- Disks were 4 MB/s, tapes were 256 KB/s
- Networks were 10 Mb shared
Seventeen years later (2010):
- Disks are 70 MB/s, tapes are 120 MB/s
- Networks are 10 Gb switched
Changes in 17 years:
- 17x increase in disk speed (luckily, RAID has created virtual disks that are way faster)
- 500x increase in tape speed!
- 1000x+ increase in network speed
(Slide images: QIC 80 at 60 KB/s, Exabyte 8200 at 256 KB/s, DECStation 5000)

More History
Plan A: Stage to disk, spool to tape
- Pioneered by IBM in the '90s, widely adopted in the late '00s
- Large, very fast virtual disk as a caching mechanism to tape
- Only need enough disk to hold one night's backups
- Helps backups; does not help restores
Plan B: Backup to disk, leave on disk
- AKA the early VTL craze
- Helps backups and restores
- Disk was still way too expensive to make this feasible for most people

Plan C: Dedupe
It's perfect for traditional backup:
- Fulls back up the same data every day/week/month
- Incrementals back up the entire file when only one byte changes
- Both back up a file 100 times if it's in 100 locations
- Databases are often backed up full every day
- Tons of duplicate blocks! Average actual reduction of 10:1 and higher
It's not perfect for everything:
- Pre-compressed or encrypted data
- File types that don't have versions (multimedia)

Naysayers
- Eliminate all but one copy? No, just eliminate duplicates per location
- What about hash collisions? More on this later, but this is nothing but FUD. If you're unconvinced, use a delta differential approach
- Doesn't this have immutability concerns? Everything that changes the format of the data has immutability concerns (e.g. sector-based storage, tar, etc). The job of backup/archive applications is to verify same in/out
- What about the dedupe tax? Let's talk more about this one in a bit

Is There a Plan D?
Some pundits/analysts think dedupe (especially target dedupe) is a band-aid, and will eventually be done away with via backup-software-based dedupe, delta backups, etc. Maybe this will happen in a 3-5 year time span, maybe it won't. (In fact, some backup software companies will tell you they don't need no stinking dedupe appliances.) That's still no argument for not moving on what's available to solve your problems now.

How Dedupe Works

Your Mileage WILL Vary
You really can get 10x to 400x. It depends on:
- Frequency of full backups (more fulls = more dupes)
- How much of a given incremental backup contains versions of other files (multimedia generally doesn't have versions)
- Length of retention (longer retention = more dupes)
- Redundancy in a single full backup (if your product notices)
Things that confuse dedupe:
- Encrypting data before the dedupe process sees it
- Compressing data before the dedupe process sees it
- Multiplexing to a VTL

How Do They Identify Duplicate Data?
Two very different methods:
- Chunking/hashing: Asigra, EMC Avamar, Symantec PureDisk, CommVault Simpana, EMC Data Domain, Greenbytes, FalconStor VTL & FDS, NEC, Quantum DXi
- Delta differential: ExaGrid, IBM ProtecTIER, Ocarina, SEPATON
Some systems may use a hybrid approach.

Chunking/Hashing Method
- Slice all data into segments or chunks
- Run each chunk through a hashing algorithm (e.g. SHA-1)
- Check the hash value against all other hash values
- A chunk with an identical hash value is discarded
- Will find redundant blocks between files from different file systems, even different servers
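
To make the mechanics concrete, here is a minimal sketch of chunk-and-hash dedupe, assuming fixed-size 8 KB chunks and an in-memory dict standing in for the chunk store. It is an illustration, not any vendor's implementation; real products typically use variable-size, content-defined chunking for better hit rates.

```python
# Minimal chunk-and-hash dedupe sketch (illustrative assumptions:
# fixed 8 KB chunks, dict as chunk store).
import hashlib

CHUNK_SIZE = 8 * 1024  # the 8 KB chunk size assumed throughout these slides

def dedupe_stream(stream, chunk_store):
    """Store only chunks whose SHA-1 fingerprint hasn't been seen before.

    chunk_store: dict mapping hex digest -> chunk bytes.
    Returns the 'recipe' (list of digests) needed to rebuild the stream.
    """
    recipe = []
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in chunk_store:   # new, unique chunk: store it
            chunk_store[digest] = chunk
        recipe.append(digest)           # a duplicate costs only one entry
    return recipe

def rehydrate(recipe, chunk_store):
    """Rebuild the original stream from its recipe (i.e. a restore)."""
    return b"".join(chunk_store[d] for d in recipe)

if __name__ == "__main__":
    import io
    store = {}
    data = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE  # 3 dupes + 1 unique
    recipe = dedupe_stream(io.BytesIO(data), store)
    assert rehydrate(recipe, store) == data
    print(f"{len(recipe)} chunks referenced, {len(store)} stored")  # 4 vs 2
```

Run a second "full backup" of the same data through dedupe_stream and the chunk store grows by nothing; only the small recipe is added, which is the whole point.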

Delta Differential Method
- Correlate backups, using mathematical methods or metadata
- Compare similar backups byte-by-byte
Example: Tonight's backup of Exchange instance Elvis is seen as similar to last night's backup of Elvis. Tonight's backup of Elvis is compared byte-by-byte to last night's backup of Elvis, and redundant segments are found.
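
For contrast, a minimal sketch of the delta-differential idea, under the simplifying assumption that corresponding segments line up at the same offsets; real products correlate and align the two backups far more cleverly.

```python
# Delta-differential sketch (illustrative): pair tonight's backup with
# the most similar prior backup, keep only segments that actually differ.
SEGMENT = 8 * 1024

def delta_against_baseline(baseline: bytes, tonight: bytes):
    """Return (unchanged_refs, changed_segments) for tonight's backup.

    unchanged_refs: segment indexes that can be read from the baseline.
    changed_segments: {index: bytes} that must be stored anew.
    """
    unchanged, changed = [], {}
    longest = max(len(baseline), len(tonight))
    for offset in range(0, longest, SEGMENT):
        old = baseline[offset:offset + SEGMENT]
        new = tonight[offset:offset + SEGMENT]
        if new == old:          # byte-for-byte identical: just reference it
            unchanged.append(offset // SEGMENT)
        else:                   # redundancy is found only within this pair
            changed[offset // SEGMENT] = new
    return unchanged, changed
```

Note the design consequence: because the comparison is only between like backups (tonight's Elvis vs. last night's Elvis), nothing is found across dissimilar datasets, which is exactly the trade-off the next slide describes.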

Hashing & Delta Differential
Hashing:
- Most used method with the most mileage
- Some are concerned about hash collisions (more on this later)
- Compares everything to everything, and therefore gets more dedupe out of similar data in dissimilar datasets (e.g. production and test copies of the same data)
Delta differentials:
- Faster than hashing
- No concern about hash collisions
- Only compares like backups, so will get no dedupe on similar data in dissimilar datasets, but does get more dedupe on the same data
What will you get? Only testing with your data will answer that question.

Hash Collisions: The Real Numbers
Number of hashes and amount of data needed to reach a desired collision probability (assuming an 8 KB chunk size):

Hash Size        | Probability 10^-15            | Probability 10^-5
128 bits (MD5)   | 8.2 x 10^11 hashes (6.6 PB)   | 8.2 x 10^16 hashes (20.9 YB)
160 bits (SHA-1) | 5.4 x 10^16 hashes (432.5 EB) | 5.4 x 10^21 hashes (1,371,181 YB)

- 10^-15: the odds of a single disk writing incorrect data and not knowing it (Undetectable Bit Error Rate, or UBER). With SHA-1, we have to write 432.5 EB to get those odds.
- 10^-5: the worst odds of a double-disk RAID5 failure. We have to write 1,371,181 YB to reach those odds.
Original formula here: http://en.wikipedia.org/wiki/birthday_attack
The original formula was modified with a MacLaurin series expansion to mitigate Excel's lack of precision, and is here: backupcentral.com/hash-odds.xls
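
The table's order of magnitude can be reproduced with the small-probability birthday approximation p ~ n^2 / 2^(b+1), i.e. n ~ sqrt(p * 2^(b+1)). Here is a quick sketch of that arithmetic (mine, not the hash-odds.xls spreadsheet, which uses a more precise series expansion):

```python
# Reproduce the collision table's order of magnitude with the small-p
# birthday approximation: n ~ sqrt(p * 2^(bits+1)). Assumes 8 KB chunks.
import math

CHUNK = 8 * 1024  # bytes per chunk

def hashes_for_probability(bits: int, p: float) -> float:
    """Hashes needed before collision probability reaches p."""
    return math.sqrt(p * 2.0 ** (bits + 1))

for name, bits in [("MD5", 128), ("SHA-1", 160)]:
    for p in (1e-15, 1e-5):
        n = hashes_for_probability(bits, p)
        data = n * CHUNK  # total bytes you'd have to chunk and hash
        print(f"{name}: p={p:g} -> {n:.1e} hashes, {data:.1e} bytes")
```

For SHA-1 at p = 10^-15 this gives about 5.4 x 10^16 hashes, or roughly 4.4 x 10^20 bytes of 8 KB chunks, matching the table's 432.5 EB row.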

Where Is the Data Deduped?
Target dedupe:
- Data is sent unmodified across the LAN and deduped at the target
- No LAN/WAN benefits until you replicate target to target
- Cannot compress or encrypt before sending to the target
Source dedupe:
- Redundant data is identified at the backup client
- Only new, unique data is sent across the LAN/WAN
- LAN/WAN benefits; can back up remote/mobile data
- Allows for compression and encryption at the source
Hybrid:
- Fingerprint data at the source, dedupe at the target
- Allows for compression and encryption at the source
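
A sketch of the source-dedupe conversation implied above. The two-call protocol (unknown/put) is hypothetical, standing in for whatever a real product uses; the point it illustrates is that only new, unique chunks cross the wire, and the client is free to compress them first.

```python
# Source-dedupe sketch (hypothetical protocol, not any product's):
# the client fingerprints chunks, asks the server which are new,
# and ships only those.
import hashlib
import zlib

CHUNK = 8 * 1024

class DedupeServer:
    def __init__(self):
        self.store = {}                       # digest -> compressed chunk

    def unknown(self, digests):
        """Which of these fingerprints has the server never seen?"""
        return [d for d in digests if d not in self.store]

    def put(self, digest, blob):
        self.store[digest] = blob

def source_backup(data: bytes, server: DedupeServer) -> int:
    """Back up `data`; return how many bytes actually crossed the wire."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    digests = [hashlib.sha1(c).hexdigest() for c in chunks]
    wanted = set(server.unknown(digests))     # one small round-trip
    sent = 0
    for chunk, digest in zip(chunks, digests):
        if digest in wanted:
            blob = zlib.compress(chunk)       # source side can compress first
            server.put(digest, blob)
            sent += len(blob)
            wanted.discard(digest)            # don't resend within this backup
    return sent
```

Running source_backup twice with the same data sends essentially zero bytes the second time, which is why this model works for remote offices and laptops over thin links.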

Let's Make It More Complicated
- Standalone target dedupe: dedupe appliance separate from the backup software
- Integrated target dedupe: target dedupe from a backup software vendor that backs up to POD*
- Standalone source dedupe: full dedupe solution that only does source dedupe
- Integrated source dedupe: backup software that can dedupe at the client (or not)
- Hybrid: also from a backup software company
*Plain Ol' Disk

Name That Dedupe
- Standalone target dedupe: Data Domain, ExaGrid, Greenbytes, IBM, NEC, Quantum, SEPATON
- Integrated target dedupe: Symantec NetBackup
- Integrated source dedupe: Asigra, Symantec NetBackup
- Standalone source dedupe: EMC Avamar, i365 EVault, Symantec NetBackup
- Hybrid: CommVault Simpana

Multi-node Deduplication AKA Global Deduplication AKA Clustered Deduplication

What We're Not Talking About
Remember hashing vs. delta differential dedupe:
- Delta compares like to like
- Hashing compares everything to everything
Some sales reps from some companies (that don't have multi-node/global dedupe) are calling the latter global dedupe. It's not. At a minimum this is honest confusion; possibly it's subterfuge to confuse the buyer.

Single-node/Local vs. Multi-node/Global
Assume a customer buys multiple nodes of a dedupe system. Suppose, then, that they back up exactly the same client to each of those multiple nodes:
- If the vendor fails to recognize the duplicate data and stores it multiple times, it has single-node/local dedupe
- If the vendor recognizes duplicate data across multiple nodes and stores it on only one node, it has multi-node/global dedupe

Doctor, It Hurts When I Do This
Single-node/local dedupe vendors say "then don't do that. Why would you do that?" They tell you to split up your datasets and send a given dataset to only one appliance. Easy to do if:
- Your dataset sizes never change
- A given dataset never outgrows a node
Some single-node sales reps will point out that this also doesn't harm your dedupe ratio, because most dedupe comes from comparing like to like. They're also the same ones claiming they get better dedupe because they compare all to all. Which is it?

Multi-node Is the Way to Go
- Especially for larger environments and budget-conscious environments that buy as they go
- With multi-node dedupe you can load-balance and treat the system the same as you would a large tape library
- Single-node dedupe pushes the vendors to ride the crest of the CPU/RAM wave
- Multi-node vendors can ride behind the wave, saving cost without reducing value

Multi/Single-node Dedupe Vendors
Multi-node/global:
- EMC Avamar (12 nodes)
- ExaGrid (10 nodes)
- NEC (55 nodes)
- SEPATON (8 nodes)
- Symantec PureDisk, NetBackup & Backup Exec
- Diligent (2 nodes)
Single-node/local (as of Mar 2010):
- EMC Data Domain
- NetApp ASIS
- Quantum

When Is It Deduped? AKA Inline or Post Process?

Get Out the Swords
We'd have just as much luck trying to settle these arguments:
- Apple vs. Windows
- Linux vs. either of them
- Linux vs. FreeBSD
- VMware vs. the mainframe (the original hypervisor)
- Cable modem vs. DSL
Initial common sense leans to inline, but post-process offers a lot of advantages. You cannot pick based on concept; you must pick based on price/performance.

What's the Difference?
- This only applies to target dedupe
- Inline is synchronous dedupe; post-process is asynchronous dedupe
- Both are deduping as the data is coming into the device (with most products and configs)
- The question is really where the dedupe process reads the native data from. If it reads it from RAM, we're talking inline; if it reads it from disk, we're talking post-process.

Inline & Post-process: An I/O Walkthrough

Step            | IL Hash    | IL Delta   | PP Hash     | PP Delta
Ingest (100%)   | RAM write  | RAM write  | Disk write  | Disk write
New segment     | RAM read   | RAM read   | Disk read   | Disk read
Old segment     | RAM read   | Disk read  | RAM read    | Disk read
Match (90%)     | -          | -          | Disk delete | Disk delete
No match (10%)  | Disk write | Disk write | -           | -

For every 100 GB ingested:
- An inline hash system writes 10 GB to disk
- An inline delta system writes 10 GB and reads 100 GB from disk
- A post-process hash system writes 100 GB, reads 100 GB, and deletes 90 GB from disk
- A post-process delta system writes 100 GB, reads 200 GB, and deletes 90 GB from disk
Common sense suggests inline has a major advantage; things change when you consider the dedupe tax.
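
The summary bullets fall straight out of the table; a few lines of arithmetic make the derivation explicit, using the table's assumptions of 100 GB ingested and a 90% match rate:

```python
# Arithmetic behind the walkthrough's summary bullets: disk I/O per
# 100 GB ingested, assuming the table's 90% match rate.
INGEST = 100.0   # GB ingested
MATCH = 0.9      # fraction of segments that duplicate older data

# (disk writes, disk reads, disk deletes) in GB, per the table's rows
systems = {
    "inline hash":        (INGEST * (1 - MATCH), 0.0,        0.0),
    "inline delta":       (INGEST * (1 - MATCH), INGEST,     0.0),
    "post-process hash":  (INGEST,               INGEST,     INGEST * MATCH),
    "post-process delta": (INGEST,               2 * INGEST, INGEST * MATCH),
}
for name, (writes, reads, deletes) in systems.items():
    print(f"{name}: write {writes:.0f} GB, read {reads:.0f} GB, "
          f"delete {deletes:.0f} GB")
```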

The Chair Recognizes Inline
- When you're done with backups, you're done with dedupe
- Backups begin replicating as soon as they arrive
- The post-process vendors need a staging area
- The post-process vendors don't start deduping until a backup is done; that will make things take longer

The Chair Recognizes Post-process
- When backups are done, dedupe is almost done
- Replication begins as soon as the first backup is done
- We wait until a backup is done, not until all the backups are done (unless you tell us to)
- The staging area allows initial backups to be faster, allows copies and recent restores to come from native data, and allows for staggered implementation of dedupe (selectively dedupe only what makes sense)
- You don't need as much staging disk as you might think
- Inline vendors may slow down large backups and restores: they always rehydrate; we only rehydrate older data

Inline & Post-process Vendors
Inline:
- EMC Data Domain
- IBM ProtecTIER
- NEC HydraStor
Post-process:
- ExaGrid
- Greenbytes
- Quantum DXi
- SEPATON DeltaStor

How Does Replication Work?
- Does replication use dedupe?
- Can I replicate many-to-one, one-to-many, and cascading?
- If deduping many-to-one, will it dedupe globally across those appliances?
- Can I control what gets replicated and when? (e.g. production vs. development)

Is There an Index?
- What happens if the index is destroyed? How do you protect against that?
- Does it need its index to read the data?
- What do you do to verify data integrity?
- What about malicious people?
Some dedupe vendors aren't very good at answering these questions, partially because they don't get them enough. Make sure you ask them.

Truth in IT Backup Concierge Service
Community of verified but anonymous end-users (no vendors). Included in the base service:
- Billable product & strategy-related questions
- Learn from other customers' questions & answers
- Much less expensive than traditional consulting
- Talk to real people using the products you are interested in
- Podcast interviews with end-users and thought leaders
- Unbiased product briefings written by experts
Coming soon:
- Reports of lab tests by experts
- Field test reports designed by us, conducted by end-users
One year subscription: $999

Session Two
Using Deduplication in Backup Systems
Using Data Reduction in Primary Systems

The Dedupe Tax, AKA the Rehydration Problem
- Essentially a read from very fragmented data
- Not all dedupe systems are equally adept at reassembling Humpty Dumpty
- Especially visible during tape copies & restores of large systems (single-stream performance)
- A recent POC of three major vendors showed a 3x difference in performance!
- Remember to test both the replica source & destination

Isn't It Cheaper Just to Buy Tape?
Tape is cheaper than ever and keeps getting cheaper, but:
- You must encrypt if you're using tape
- You must use D2D2T to stream modern tape drives
- You must constantly tweak to ensure you're doing it right
- Take all that away and use dedupe: it may not be cheaper, but it's definitely better
Buy JBOD/RAID instead?
- Even if it were free, you still have to power it
- The power/cooling bill will be 10-20x more with JBOD/RAID
- Replication isn't feasible, so you're stuck with tape for offsite (see above)

Let's Talk About What Matters
- What are the risks of their approach? Data integrity questions
- How big is it? What's my dedupe ratio? How big can it grow (local vs. global)?
- How fast is it? How fast can it back up, restore, and copy my data? How fast is replication?
- How much does it cost? Pricing schemes are all over the board; try to get them on an even playing field
- Also consider operational costs: adding storage, replacing drives (how long does a rebuild take?), monitoring, etc.

Advanced Uses of Deduplication

Eliminate Tape Shipping
- Offsite backups without shipping tapes
- Backups with no human hands on them
- Make tapes offsite from the replicated copy and never move them
- No tapes shipped = no need to encrypt tapes

Shorter Recovery Point Objectives
Most companies run backups once per day. Even though they back up their transaction logs throughout the day, the logs are only sent offsite once per day. Dedupe and replication could get them offsite immediately, throughout the day.

VMware Backup
One of the challenges with typical VMware backup is the I/O load it places on the server. Source dedupe can perform an incremental-forever backup with a much lower I/O load. This could allow you to continue simpler backups without having to invest in VCB.

ROBO & Laptop Backups
- Dedupe software can protect even the largest laptops over the Internet
- It can also protect relatively large remote sites without installing hardware
- Restores can be done remotely (for slower RTOs) or locally using a local recovery server (for quicker RTOs)

Where to Use Target/Source Dedupe
- Laptops, VMware, Hyper-V are easy: it's got to be source
- Small, remote sets of data are also an easy decision. You could do target with a remote backup server, but cost usually pushes people to source.
- A medium-sized (<1 TB) remote site could use a remote target system or a remote source-dedupe backup server that replicates to the central office (CO)
- A medium-large datacenter could also use either
- A large datacenter (10 TB+) might start to find things it doesn't like about a source system
- Should do a POC to decide

Source Dedupe: Remote Backup Server?
If using source dedupe to back up a remote office, should you back up directly to a centralized backup server, or back up to a remote backup server that replicates to a central server? It's all about the RTO you need. Decide on your RTO, test a totally remote restore, and see if it can meet it. If not, use a remote server.

How Big is Too Big to Replicate Backups?
- Whether a remote office is replicating to a CO, or a CO is replicating its backups to a DR site, there is a limit to how much you can replicate
- Make sure you've done all you can to maximize your deduplication ratio. A 10:1 site will need twice as much bandwidth as a 20:1 site (see the sketch below).
- It depends on the daily deduplicated change rate, which is a factor of your data types and dedupe ratio
- It's now common to protect 1 TB over typical WAN lines, and much more over dedicated lines
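
A back-of-envelope sketch of the bandwidth claim (the function and figures below are illustrative, not from the slide): daily replicated data is roughly the day's backups divided by the effective dedupe ratio, which is why a 10:1 site needs twice the bandwidth of a 20:1 site.

```python
# Back-of-envelope WAN sizing for replicating deduped backups
# (illustrative assumptions: decimal units, a full 24-hour window).
def wan_mbps_needed(daily_backup_gb, dedupe_ratio, window_hours=24):
    """Sustained WAN Mb/s needed to replicate one day's deduped backups."""
    replicated_gb = daily_backup_gb / dedupe_ratio
    seconds = window_hours * 3600
    return replicated_gb * 8 * 1000 / seconds   # GB -> megabits -> Mb/s

for ratio in (10, 20):
    print(f"1 TB/day at {ratio}:1 -> {wan_mbps_needed(1000, ratio):.1f} Mb/s")
# 10:1 needs ~9.3 Mb/s sustained; 20:1 needs half that.
```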

Test, Test, Test!!!

Test Everything
- Installation and configuration, including adding additional capacity
- Support: call and ask stupid questions
- Dedupe ratio: must use your data, must use your retention settings, must fill up the system
- All speeds, with all your data types (especially true if using local dedupe):
  - Backup speed
  - Copy speed (extremely important to test)
  - Restore speed
  - Aggregate performance
  - Single-stream performance: backup speed, restore and copy speed (especially if going to tape)
- Replication performance
- Lag time (if using post-process)
- Dedupe speed (if using post-process)
- Loss of physical systems: drive rebuild times; reverse replication to replace an array?
- Unplug things and see how it handles it. Be mean!

Testing Methods: Source Dedupe
- Must install on all data types you plan to back up
- Must task the system to the point that you plan to use it (VMware, anyone?)
- It's OK to back up many redundant systems; that's kind of the point
- Remember to test the speed of copy to tape if you plan to do so

Testing Methods: Target Dedupe
- Copy production backups into the IDT/VTL using your backup software's built-in cloning/migration/duplication features
- Use dedicated drives if possible and script it to run 24x7
- You must fill up the system, expire some data, then add more data to see steady-state info
- Copy/backup to one system, replicate to another, record the entire time, then restore/copy data from the replicated copy

Data Reduction in Primary Storage

A Whole New Ball Game
- In the primary space we use the term data reduction, as it's more inclusive than dedupe
- A very different access pattern; latency is much more important
- The standard in the backup world is tape: just don't be slower than that and you're OK
- The standard in the primary world is disk: anything you do to slow it down will kill the project
- You will not get the same ratios as backup
- Summary: the job is harder and the rewards are fewer. And yet, some are still trying it.

Options
- Compression
- File-level dedupe
- Sub-file-level dedupe
Some files compress but don't dedupe; some files dedupe but don't compress well.

Vendors
- Compression: Storwize, Ocarina
- File-level dedupe: EMC Celerra
- Sub-file-level dedupe: NetApp ASIS, Ocarina, Greenbytes, Exar/Hifn, Sun/Oracle
Usually you get compression or dedupe; Ocarina & Exar claim to do both compression and sub-file-level dedupe.

Pros/Cons of Primary Data Reduction
- Saves disk space and power/cooling
- Can have a positive or negative impact on performance; you must test to see which
- Does not usually help backups: data is re-duped (rehydrated) before being read by any app, including backup
- The exception to the above rule is NetApp SnapMirror to tape

Contact Me
Email: curtis@backupcentral.com
Websites to which I contribute:
- http://www.backupcentral.com
- http://www.searchstorage.com
- http://www.searchdatabackup.com
Follow me on Twitter @wcpreston
My upcoming venture: http://www.truthinit.com