BALANCING FOR DISTRIBUTED BACKUP



Similar documents
Content-aware Load Balancing for Distributed Backup. Deepti Bhardwaj EMC Philip Shilane EMC

HP StoreOnce D2D. Understanding the challenges associated with NetApp s deduplication. Business white paper

HP StoreOnce & Deduplication Solutions Zdenek Duchoň Pre-sales consultant

Everything you need to know about flash storage performance

Tradeoffs in Scalable Data Routing for Deduplication Clusters

Deduplication has been around for several

Dell PowerVault DL2200 & BE 2010 Power Suite. Owen Que. Channel Systems Consultant Dell

Turnkey Deduplication Solution for the Enterprise

STORAGE SOURCE DATA DEDUPLICATION PRODUCTS. Buying Guide: inside

A Deduplication-based Data Archiving System

Barracuda Backup Server. Introduction

Top Ten Questions. to Ask Your Primary Storage Provider About Their Data Efficiency. May Copyright 2014 Permabit Technology Corporation

Protect Microsoft Exchange databases, achieve long-term data retention

Data Backup Options for SME s

DEDUPLICATION BASICS

Using HP StoreOnce Backup Systems for NDMP backups with Symantec NetBackup

Disk-to-Disk-to-Offsite Backups for SMBs with Retrospect

Distributed Backup with the NetVault Plug-in for VMware for Scale and Performance

Accelerating Backup/Restore with the Virtual Tape Library Configuration That Fits Your Environment

HP StoreOnce: reinventing data deduplication

SYMANTEC NETBACKUP APPLIANCE FAMILY OVERVIEW BROCHURE. When you can do it simply, you can do it all.

Solution Overview. Business Continuity with ReadyNAS

Archive Data Retention & Compliance. Solutions Integrated Storage Appliances. Management Optimized Storage & Migration

STORAGE. Buying Guide: TARGET DATA DEDUPLICATION BACKUP SYSTEMS. inside

DEDUPLICATION NOW AND WHERE IT S HEADING. Lauren Whitehouse Senior Analyst, Enterprise Strategy Group

Demystifying Deduplication for Backup with the Dell DR4000

Deduplication, Incremental Forever, and the. Olsen Twins. 7 Technology Circle Suite 100 Columbia, SC 29203

Characteristics of Backup Workloads in Production Systems

Reducing Backups with Data Deduplication

Get Success in Passing Your Certification Exam at first attempt!

Characteristics of Backup Workloads in Production Systems

Long term retention and archiving the challenges and the solution

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything

Business-centric Storage FUJITSU Storage ETERNUS CS800 Data Protection Appliance

Maximizing Business Continuity and Minimizing Recovery Time Objectives in Windows Server Environments

ExaGrid Product Description. Cost-Effective Disk-Based Backup with Data Deduplication

Disaster Recovery. Maximizing Business Continuity and Minimizing Recovery Time Objectives in Windows Server Environments.

Understanding EMC Avamar with EMC Data Protection Advisor

Overcoming Backup & Recovery Challenges in Enterprise VMware Environments

A Comparative TCO Study: VTLs and Physical Tape. With a Focus on Deduplication and LTO-5 Technology

WHITE PAPER. How Deduplication Benefits Companies of All Sizes An Acronis White Paper

Symantec NetBackup 7.5 for VMware

Reduced Complexity with Next- Generation Deduplication Innovation

How To Make A Backup System More Efficient

EMC DATA DOMAIN EXTENDED RETENTION SOFTWARE: MEETING NEEDS FOR LONG-TERM RETENTION OF BACKUP DATA ON EMC DATA DOMAIN SYSTEMS

3Gen Data Deduplication Technical

VMware vsphere Data Protection

HP Store Once. Backup to Disk Lösungen. Architektur, Neuigkeiten. rené Loser, Senior Technology Consultant HP Storage Switzerland

Combining Onsite and Cloud Backup

Presentation Identifier Goes Here 1

Keys to Successfully Architecting your DSI9000 Virtual Tape Library. By Chris Johnson Dynamic Solutions International

Backup to the Future. Hugo Patterson, Ph.D. Backup Recovery Systems, EMC

EMC VNX2 Deduplication and Compression

Future-Proofed Backup For A Virtualized World!

Don t be duped by dedupe - Modern Data Deduplication with Arcserve UDP

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

Hardware Configuration Guide

Sales Tool. Summary DXi Sales Messages November NOVEMBER ST00431-v06

Business-Centric Storage FUJITSU Storage ETERNUS CS800 Data Protection Appliance

Optimizing Data Protection Operations in VMware Environments

Data Deduplication in Tivoli Storage Manager. Andrzej Bugowski Spała

Using HP StoreOnce Backup systems for Oracle database backups

Protecting enterprise servers with StoreOnce and CommVault Simpana

Veritas Backup Exec 15: Deduplication Option

WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression

The assignment of chunk size according to the target data characteristics in deduplication backup system

WHITE PAPER. Permabit Albireo Data Optimization Software. Benefits of Albireo for Virtual Servers. January Permabit Technology Corporation

Data Deduplication HTBackup

Deduplication Demystified: How to determine the right approach for your business

EMC Backup Storage Solutions: The Value of EMC Disk Library with TSM

Best Practices: Take the Guesswork out of Backup and Recovery This Is the Presentation Title

Introduction. Silverton Consulting, Inc. StorInt Briefing

ETERNUS CS800 data protection appliance featuring deduplication to protect your unique data

WHITE PAPER. Storage Savings Analysis: Storage Savings with Deduplication and Acronis Backup & Recovery 10

ESG REPORT. Data Deduplication Diversity: Evaluating Software- vs. Hardware-Based Approaches. By Lauren Whitehouse. April, 2009

Capacity Forecasting in a Backup Storage Environment

UNDERSTANDING DATA DEDUPLICATION. Tom Sas Hewlett-Packard

Tandberg Data AccuVault RDX

Availability Digest. Data Deduplication February 2011

STORAGECRAFT SHADOWPROTECT 5 SERVER/SMALL BUSINESS SERVER

<Insert Picture Here> Refreshing Your Data Protection Environment with Next-Generation Architectures

Total Cost of Ownership Analysis

CrashPlan PRO Enterprise Backup

ABOUT DISK BACKUP WITH DEDUPLICATION

Barracuda Backup Deduplication. White Paper

LDA, the new family of Lortu Data Appliances

CHAPTER 9. The Enhanced Backup and Recovery Solution

Data-Deduplication Reducing Data to the Minimum

MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services

Eight Considerations for Evaluating Disk-Based Backup Solutions

Using Data De-duplication to Drastically Cut Costs

Nimble Storage for VMware View VDI

E-Guide. Sponsored By:

Driving Data Migration with Intelligent Data Management

GIVE YOUR ORACLE DBAs THE BACKUPS THEY REALLY WANT

Deduplication and Beyond: Optimizing Performance for Backup and Recovery

Backup Software Data Deduplication: What you need to know. Presented by W. Curtis Preston Executive Editor & Independent Backup Expert

Sharding and MongoDB. Release MongoDB, Inc.

WHITE PAPER BRENT WELCH NOVEMBER

Quantum DXi6500 Family of Network-Attached Disk Backup Appliances with Deduplication

Transcription:

CONTENT-AWARE LOAD BALANCING FOR DISTRIBUTED BACKUP Fred Douglis 1, Deepti Bhardwaj 1, Hangwei Qian 2, and Philip Shilane 1 1 EMC 2 Case Western Reserve University 1

Starting Point Deduplicating disk-based backup storage Variable, content-defined chunks Strong hashes of content to find duplicates 2

Starting Point Deduplicating disk-based backup storage Variable, content-defined chunks Strong hashes of content to find duplicates Focused on making full backups after the first one use minimal extra disk space Internal deduplication duplicates from multiple copies of the same file from the same source Unchanged files dedupe trivially, while chunk-level deduplication catches changes scattered within large regions of unchanged content 3

Starting Point Deduplicating disk-based backup storage Variable, content-defined chunks Strong hashes of content to find duplicates Focused on making full backups after the first one use minimal extra disk space Internal deduplication duplicates from multiple copies of the same file from the same source Unchanged files dedupe trivially, while chunk-level deduplication catches changes scattered within large regions of unchanged content Deduplication can avoid sending the data at all Send the hashes and then only send new chunks 4

Starting Point Deduplicating disk-based backup storage Variable, content-defined t d chunks Strong hashes of content to find duplicates Focused on making full backups after the first one use minimal i extra disk space Internal deduplication duplicates from multiple copies of the same file from the same source Unchanged files dedupe trivially, while chunk-level deduplication catches changes scattered within large regions of unchanged content Deduplication can avoid sending the data at all Send the hashes and then only send new chunks Technology now common in backup products 5

Problem Statement Large-scale IT environment Hundreds d or thousands of systems ( clients ) ( li t ) to backup Many backup appliances to send the data 6

Problem Statement Large-scale IT environment Hundreds d or thousands of systems ( clients ) ( li t ) to backup Many backup appliances to send the data Impact of deduplication Affinity: send the same client to the same appliance so it will deduplicate well Moving it to a new system will cause everything to be written again Overlap: benefit from sending similar systems to the same backup appliance External deduplication, spanning clients 7

Problem Statement Large-scale IT environment Hundreds d or thousands of systems ( clients ) ( li t ) to backup Many backup appliances to send the data Impact of deduplication Backup Server Assignment Affinity: send the same client to the same Dedupe Dedupe Dedupe appliance so it will deduplicate well Storage Storage Storage Moving it to a new system will cause everything to be written again Overlap: benefit from sending similar systems to the same backup appliance External deduplication, spanning clients Simple approach: cluster clients by type 8

Benefits of Overlap Co-locating duplicate content Reduces capacity requirements May take a host from being overloaded to highly loaded, or highly loaded to moderately Reduces throughput requirements Duplicate copies in later clients first full are skipped Ongoing transfers benefit only if identical content being written to multiple hosts during a backup interval Deduplication changes traditional backup administration Backup devices are not all created equal They re not all identical tapes There is a stickiness to the assignment in order to benefit from savings But sometimes data migration benefits outweigh costs 9

Benefits of Overlap Co-locating duplicate content Reduces capacity requirements May take a host from being overloaded to highly loaded, or highly loaded to moderately Reduces throughput requirements Duplicate copies in later clients first full are skipped Ongoing transfers benefit only if identical content being written to multiple hosts during a backup interval Deduplication changes traditional backup administration Backup devices are not all created equal They re not all identical tapes There is a stickiness to the assignment in order to benefit from savings But sometimes data migration benefits outweigh costs Where do we put clients and when do we have to give in and move them? 1

Goals Capacity allocation Send data to backup appliances in the best way to fit them within constraints Balanced load Content-aware for best deduplication 11

Goals Capacity allocation Send data to backup appliances in the best way to fit them within constraints Balanced load Content-aware for best deduplication Performance (throughput) Support many backup streams simultaneously Avoid overloading any individual appliances Increased deduplication reduces overhead on network and appliance 12

Use Cases Sizing and deployment Figure out requirements (and assignments) from clean slate First assignment Given a set of clients and appliances, determine best assignments Reconfiguration Adjust when clients or appliances are added or removed, or load shifts Disaster recovery & replication i Select mappings of appliances onto other appliances for off-site replication 13

Approach Minimize a utility function Cost of a configuration is a function of capacity utilization and performance requirements Compare costs directly to identify best configuration Lots of tradeoffs E.g., migrate a client to a new appliance to reduce capacity overload, but pay a penalty for data movement Identify overlap Sample fingerprints for each client Find cases of significant overlap Ignore the rest 14

How Much Overlap is There? Many systems will have little or no overlap Some systems will have similar il overlap with many other systems, so picking one in particular has no advantage Want to identify special affinity in cases of high overlap among 2, or few, hosts Studied 21 hosts from saved workstation backups and live systems One host with 5% overlap with another and almost 25% additional overlap with a third Virtual machine images particularly likely to have high overlap Deduplicatio on (%) 1 9 8 7 6 4 host16 host21 5 2nd match host2 3 host16 2 1 host2 host4 Available deduplication Best Match Widely deduped host1 host2 host3 host4 host5 host6 host7 host8 host9 host1 host11 host12 host13 host14 host15 host16 host17 host18 host19 host2 host21 15

Cost Calculation In total, the cost for a given configuration is the sum of: A small, weighted penalty for imbalance in capacity or throughput Dedupe Storage Dedupe Storage Dedupe Storage Dedupe Storage Dedupe Storage Dedupe Storage 16

Cost Calculation In total, the cost for a given configuration is the sum of: A small, weighted penalty for imbalance in capacity or throughput A stepped penalty for exceeding thresholds in capacity or throughput Thresholds 1..8 Overload region Warning region Low region High cost for everything above this point Moderate cost for everything from here to the overload region Low cost. 17

Cost Calculation In total, the cost for a given configuration is the sum of: A small, weighted penalty for imbalance in capacity or throughput A stepped penalty for exceeding thresholds in capacity or throughput A small penalty for migrating off an existing appliance 18

Cost Calculation In total, the cost for a given configuration is the sum of: A small, weighted penalty for imbalance in capacity or throughput A stepped penalty for exceeding thresholds in capacity or throughput A small penalty for migrating off an existing appliance A very large penalty for each client that does not fit on its appliance In our experiments presented today, this penalty is the dominant cost. Above 1 means overload and below it means fit Smaller penalties are used to pick among plausible choices (A more formal definition appears in the paper) 19

Algorithms Compare intelligent assignment to brute force such as round-robin or random All the brute force approaches quite fast Random Pick arbitrary assignments. If random selection is full, iterate to find new appliance. Compute cost of configuration Repeat N times and take best result Round-robin Assign to each appliance in turn Skip a full appliance to find one with available capacity if possible Bin packing Assign based on size from largest to smallest (less likely to overflow) Simulated annealing Shuffles assignments from the current best position to try and improve the cost The first three take any existing assignments as a given; only annealing will migrate a client Generally, all work well under low load; annealing can adapt better to overload 2

Annealing Example utilization swap move 21

Evaluation (Simulations) 22

Incremental Assignment Experiment Define a number of clients of fixed size: small, medium, large, 2 per iteration 2 G B 1 GB 2 2 TB 2 TB TB 23

Incremental Assignment Experiment Define a number of clients of fixed size: small, medium, large, 2 per iteration Repeatedly put a set of clients into system and assign to appliances Better dedupe within a class than across 2 G B 1 GB 2 2 TB 2 TB TB 24

Incremental Assignment Experiment 2 TB 2 TB 2 TB 2 TB Periodically add a new appliance to increase capacity At the same time, forget 1/3 of existing assignments (so some assignments have a penalty for movement and some don t) Especially high dedupe with the corresponding client from other iterations stress overlap affinity If new load outpaces capacity, high cost. If the new appliance is added to keep up with added load, low cost. 25

Incremental Assignment Experiment Define a number of clients of fixed size: small, medium, large, 2 per iteration Repeatedly put a set of clients into system and assign to appliances Better dedupe within a class than across Periodically add a new appliance to increase capacity At the same time, forget 1/3 of existing assignments (so some assignments have a penalty for movement and some don t) Especially high dedupe with the corresponding client from other iterations stress overlap affinity If new load outpaces capacity, high cost. If the new appliance is added to keep up with added load, low cost. 26

When Capacity is Exceeded Co ost 1-1 23 45 1 6 2 4 6 8 1 1.5 1..5 apacity Used Frac ctional C Clients Cap w/o Dedupe Cap w/dedupe 27

When Capacity is Exceeded Co ost 1-1 23 45 2 clients added per iteration 1 6 2 4 6 8 1 1.5 1..5 apacity Used Frac ctional C Add appliance Clients Cap w/o Dedupe Cap w/dedupe 28

When Capacity is Exceeded Co ost 1-1 23 45 Offered load above 1. indicates overflow 2 clients added per iteration 1 6 2 4 6 8 1 1.5 1..5 apacity Used Frac ctional C Add appliance Clients Cap w/o Dedupe Cap w/dedupe 29

When Capacity is Exceeded Co ost 1 6 1 5 1 4 1 3 1 2 1 1 1 1-1 2 4 6 Clients 8 1 1.5 1..5 apacity Used Frac ctional C Cap w/o Dedupe Cap w/dedupe Random 3

When Capacity is Exceeded Co ost 1 6 1 5 1 4 1 3 1 2 1 1 1 1-1 2 4 6 Clients 8 1 1.5 1..5 apacity Used Frac ctional C Cap w/o Dedupe Cap w/dedupe Random 31

When Capacity is Exceeded Co ost 1 6 1 5 1 4 1 3 1 2 1 1 1 1-1 2 4 6 Clients 8 1 1.5 1..5 apacity Used Frac ctional C Cap w/o Dedupe Cap w/dedupe Random 32

When Capacity is Exceeded Co ost 1 6 1 5 1 4 1 3 1 2 1 1 1 1-1 2 4 6 Clients 8 1 1.5 1..5 apacity Used Frac ctional C Cap w/o Dedupe Cap w/dedupe Random Round Robin 33

When Capacity is Exceeded Co ost 1 6 1 5 1 4 1 3 1 2 1 1 1 1-1 2 4 6 Clients 8 1 Cap w/o Dedupe Cap w/dedupe Random Round Robin Bin Packing 1.5 1..5 apacity Used Frac ctional C 34

When Capacity is Exceeded Co ost 1 6 1 5 1 4 1 3 1 2 1 1 1 1-1 2 4 6 Clients 8 1 Cap w/o Dedupe Cap w/dedupe Random Round Robin Bin Packing Sim. Annealing 1.5 1..5 apacity Used Frac ctional C 35

When Capacity is Exceeded A few cases where annealing is the only Co ost 1 6 1 5 1 4 1 3 1 2 1 1 1 1-1 approach with an acceptable cost 2 4 6 Clients 8 1 Cap w/o Dedupe Cap w/dedupe Random Round Robin Bin Packing Sim. Annealing 1.5 1..5 apacity Used Frac ctional C 36

When Capacity is Exceeded Annealing is an order of magnitude lower cost, but it s still a very high cost Co ost 1 6 1 5 1 4 1 3 1 2 1 1 1 1-1 2 4 6 Clients 8 1 Cap w/o Dedupe Cap w/dedupe Random Round Robin Bin Packing Sim. Annealing 1.5 1..5 apacity Used Frac ctional C 37

Roughly Fitting Within Capacity 38

Roughly Fitting Within Capacity Several cases where bin packing and annealing improve on the others when cost already low 39

Roughly Fitting Within Capacity Costs only occasionally very high 4

What Else? Refer to the paper for: A more detailed discussion of overlap computation Some other examples of using the assignment tool Overhead analysis Simulated annealing often works much better but is dramatically slower Variants Ignoring previous assignments How to penalize for each client that doesn t fit Impact of content-awareness t Backup slides for Q&A 41

Summary In a large IT environment, important to automate assignment of clients to backup appliances to optimize for capacity and throughput Taking content overlap into account can reduce capacity requirements and may improve throughput due to duplicate suppression Many options for how to balance load All work well if not overloaded Bin Packing somewhat better than the other simple techniques as limits approached Simulated Annealing can handle some extra overload cases 42

THANK YOU 43