DEDUPLICATION NOW AND WHERE IT'S HEADING
Lauren Whitehouse, Senior Analyst, Enterprise Strategy Group
Need Dedupe?
Before/After Dedupe
Deduplication
[Diagram labels: Production Data; Deduplication in Backup Process; Backup Disk]
Dedupe Evolution
  Block-level deduplication technology: Eliminate redundancy within and between files
  Deduplication appliances: Changes the economics of disk-to-disk backup
  Multinode/Grid solutions: Multi-node configurations introduce ability to deliver HA, load balancing, performance increase and global deduplication
  Backup with deduplication: Deduplication becomes a more pervasive feature in backup software
  File-level deduplication or single-instance storage: Eliminate redundancy across files
  WAN optimization: Optimizes network bandwidth; aids with data transport between sites
  VTL with deduplication: Tape-centric disk-to-disk
  Symantec OST interface: Symantec solves catalog tracking of deduped copies
  Dedupe on tape: Ability to create tapes that contain deduplicated data, now for optimized long-term retention
  What's next?
Data Growth Out of Control?
Managing the Data Deluge
At approximately what rate do you believe your total volume of data is growing annually? (Percent of respondents)
[Bar chart. Series: 100 or fewer servers (N=247); more than 100 servers (N=246). Categories: 1% to 10% annually; 11% to 20% annually; 21% to 30% annually; 31% to 40% annually; more than 40% annually]
62% with <100 servers have <20% growth/year
63% with >100 servers have >20% growth/year
Storage Spending Priorities
In which data storage areas will your organization make the most significant investments over the next 12-18 months? (Percent of respondents, five responses accepted, N=289)
  Backup and recovery solutions: 36%
  Data replication solution for off-site disaster: 24%
  Purchase new SAN storage systems: 23%
  Improved storage management software tools: 21%
  Storage virtualization: 21%
  Data reduction technologies: 18%
  Purchase more power-efficient storage hardware: 18%
  Tiered storage: 17%
  Use cloud storage services as way to source: 17%
  Tape replacement: 15%
  Purchase new NAS storage systems: 15%
  Advanced file storage / file system technology: 14%
  Storage encryption solution: 12%
  Converged data and storage networking: 9%
  Unified storage systems: 9%
  Increase use of flash-based SSDs: 8%
Why Do We Need Dedupe? Data Growth
Deduplication Creates Efficiencies in D2D Backup
Financial benefits
  Reduce disk costs; delay capital expenditures
  Lower bandwidth costs
  Reduce power & cooling costs
  Tape replacement savings
Operational benefits
  Reduce operational overhead in backup
  Reduce time and resource needs for recovery
Business benefits
  Increase retention periods
  Improve recovery objectives
  Improve backup consolidation from ROBOs
  Improve DR
Best Dedupe Fit?
  Traditional file-level backup
  ROBO use cases
  Virtualized environments
and Worst Fit?
  Pre-compressed or encrypted data
  File types that don't have versions (multimedia)
What Impacts Reduction Ratios?
  Backup strategy (full vs. incremental or differential)
  Change rate between backups
  Retention
  When data is encrypted or compressed
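To see how change rate and retention pull the ratio in different directions, here is a minimal sketch of a toy model. It is an assumption for illustration, not a formula from the slides: every retained backup is a full of the same size, the first full is stored whole, and each later full adds only its changed fraction. The function name `estimated_dedupe_ratio` is hypothetical.

```python
def estimated_dedupe_ratio(retained_fulls: int, change_rate: float) -> float:
    """Toy estimate of logical-to-stored ratio for repeated full backups.

    Assumption: the first full is stored whole and each later full
    contributes only `change_rate` (0.0 to 1.0) of new data.
    """
    if retained_fulls < 1:
        raise ValueError("retained_fulls must be at least 1")
    if not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be between 0.0 and 1.0")
    logical = float(retained_fulls)                     # in units of one full
    stored = 1.0 + (retained_fulls - 1) * change_rate   # first full + changes
    return logical / stored
```

Under this model, longer retention and lower change rates both raise the ratio, while data that never repeats (a change rate of 1.0, as with encrypted or pre-compressed data) stays at 1:1.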
Typical Dedupe Ratios
On average, what degree of capacity reduction has your organization experienced by using data deduplication technology? (Percent of respondents, N=140)
  Less than 10x reduction: 29%
  10x to 20x reduction: 56%
  More than 20x reduction: 11%
  Don't know: 5%
Capacity Savings
Weekly full backup over 8 weeks; 6-week retention; 20:1 deduplication ratio
[Chart: Protected Capacity (TB) and Stored Capacity (TB) by Retention Period (weeks 1 to 8). Data labels for weeks 1 to 8: 1.25, 1.67, 1.88, 1.67, 1.79, 1.76, 1.84, 2.00]
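The basic arithmetic behind a deduplication ratio can be sketched as follows. This is a simplification that applies one flat ratio to the protected capacity; the weekly figures on the chart vary, so a real system does not behave this uniformly. The function names are hypothetical.

```python
def stored_capacity_tb(protected_tb: float, dedupe_ratio: float) -> float:
    """Physical capacity needed to hold `protected_tb` of logical backup data."""
    if dedupe_ratio <= 0:
        raise ValueError("dedupe_ratio must be positive")
    return protected_tb / dedupe_ratio


def percent_saved(dedupe_ratio: float) -> float:
    """Share of capacity avoided at a given ratio, as a percentage."""
    if dedupe_ratio <= 0:
        raise ValueError("dedupe_ratio must be positive")
    return (1.0 - 1.0 / dedupe_ratio) * 100.0
```

For example, `stored_capacity_tb(40, 20)` evaluates to 2.0 TB, and a 20:1 ratio corresponds to roughly 95% less disk than storing every backup in full.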
Which Dedupe Approach Is Best? Backup Software VTL Gateway Appliance NAS Dedupe Device
Identifying Duplicates
Hash algorithms
  More popular approach
  Fixed block size
  Variable block size
  Sliding window block size
  Hash collisions (false positives) a remote risk
  Central index of IDs
Delta differences
  Faster
  No false positives
  Global deduplication across different backup streams is a limitation
Hybrid approach
  Combines delta differencing & hash calculation
  Less CPU- and memory-intensive
  Index is smaller
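The hash-based approach with a central index of IDs can be sketched as below. This is a minimal illustration, not any vendor's implementation: it assumes fixed-size blocks, SHA-256 as the hash, and an in-memory dictionary standing in for the index. The names `dedupe_fixed_blocks` and `rehydrate` are hypothetical.

```python
import hashlib
from typing import Dict, List


def dedupe_fixed_blocks(data: bytes, block_size: int,
                        index: Dict[bytes, bytes]) -> List[bytes]:
    """Split data into fixed-size blocks and store each unique block once.

    Returns the "recipe": the ordered list of block IDs needed to rebuild data.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        block_id = hashlib.sha256(block).digest()  # ID used for duplicate lookup
        if block_id not in index:                  # unseen block: store it
            index[block_id] = block
        recipe.append(block_id)
    return recipe


def rehydrate(recipe: List[bytes], index: Dict[bytes, bytes]) -> bytes:
    """Rebuild the original data from its recipe and the shared index."""
    return b"".join(index[block_id] for block_id in recipe)
```

Backing up `b"ABCD" * 4` with a block size of 4 produces a four-entry recipe but only one stored block. The sketch treats equal hashes as equal blocks, which is where the remote collision risk noted above comes from.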
Data Deduplication Where?
[Diagram labels: Backup Source; Backup Initiator; Backup Target; VMs (Apps, OS) on ESX Server; Remote or Branch Office; WAN]
Data Deduplication When?
  Inline deduplication: before data is written to disk
  Post-process deduplication: after data is written to disk
[Diagram labels: Backup Source; Backup Initiator; Backup Target; VMs (Apps, OS) on ESX Server; Remote or Branch Office; WAN]
Inline vs. Post-Process
Inline
  Requires less I/O
  Replication can begin immediately
  Re-assembly of data for recovery could impact performance
  Examples: EMC Data Domain; IBM ProtecTIER; NEC Hydrastor; Symantec NBU 5000 Series; typically all software approaches
Post-Process
  Requires more I/O
  Requires disk landing zone (staging area)
  Dedupe & replication processes overlap
  Most recent full kept in native format
  Examples: Exagrid; FalconStor; GreenBytes; HP VLS; Quantum DXi; Sepaton DeltaStor
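The difference in I/O between the two timings can be illustrated with a toy model that counts block writes and reads. It is a sketch under simplifying assumptions (one block per I/O, an in-memory dictionary as the deduplicated store, a list as the landing zone) and does not describe how any listed product works internally. The function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def inline_backup(blocks: List[bytes], store: Dict[bytes, bytes]) -> Dict[str, int]:
    """Deduplicate before writing: only unique blocks reach disk."""
    io = {"writes": 0, "reads": 0}
    for block in blocks:
        key = hashlib.sha256(block).digest()
        if key not in store:
            store[key] = block
            io["writes"] += 1
    return io


def post_process_backup(blocks: List[bytes], store: Dict[bytes, bytes]) -> Dict[str, int]:
    """Write everything to a landing zone first, then deduplicate from it."""
    io = {"writes": 0, "reads": 0}
    landing_zone: List[bytes] = []
    for block in blocks:            # pass 1: land the backup in native format
        landing_zone.append(block)
        io["writes"] += 1
    for block in landing_zone:      # pass 2: read back and deduplicate
        io["reads"] += 1
        key = hashlib.sha256(block).digest()
        if key not in store:
            store[key] = block
            io["writes"] += 1
    return io
```

Both functions end with the same deduplicated store; the post-process path simply pays for the extra landing-zone writes and the read-back pass, which is the staging-area cost noted in the list above.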
Single- vs. Multi-Node Solutions
Single-Node Dedupe
  Performance & capacity is limited to upper threshold
  Forklift upgrade
  Add more islands of dedupe
  Over-purchase to accommodate future growth
  Examples: EMC Data Domain; Fujitsu CS; GreenBytes; Quantum
Multi-Node Dedupe
  Manages multiple deduplication systems as one
  More linear throughput & capacity scaling
  Load balancing
  Examples: IBM ProtecTIER; EMC Avamar; Exagrid EX Series; FalconStor FDS; HP VLS; NEC HydraStor; Sepaton DeltaStor; Symantec NetBackup 5000 Series
Local vs. Global Dedupe
Local
  Single domain: backup data passes through an individual system and is compared with data passing through the same system
  Examples: EMC Data Domain; Fujitsu; GreenBytes; Quantum
Global
  Deduplication across domains means backup data is compared with data within its system as well as other systems in the domain
  Can result in higher dedupe ratios
  Examples: Exagrid; FalconStor; HP VLS; IBM ProtecTIER; NEC; Sepaton; Symantec NBU 5000 Series; typically most backup software solutions
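Why a global scope can yield higher ratios is easy to show with a small sketch. It assumes each system receives a list of blocks and keeps a set of SHA-256 IDs as its index; with a local scope every system has its own set, and with a global scope all systems share one. The function name `blocks_stored` is hypothetical.

```python
import hashlib
from typing import List, Set


def blocks_stored(blocks_per_system: List[List[bytes]], global_scope: bool) -> int:
    """Count blocks written across all systems for a given dedupe scope."""
    shared_index: Set[bytes] = set()
    stored = 0
    for system_blocks in blocks_per_system:
        # Local scope: each system starts with its own empty index.
        index = shared_index if global_scope else set()
        for block in system_blocks:
            key = hashlib.sha256(block).digest()
            if key not in index:
                index.add(key)
                stored += 1
    return stored
```

With two systems that each back up the same operating system image plus one distinct application, the local scope stores four blocks and the global scope stores three, because the shared image is recognized across systems.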
Dedupe Approaches
Software-Based
  Content-aware; dedupe can be policy-based
  Can be more cost-effective
  Flexibility in disk selection
  End-to-end bandwidth efficiency; remote site backup
  Global dedupe
  Simplified management: single console, policy engine
  Can extend to tape
  Examples: Arkeia; Asigra; Atempo; CA; Cofio; CommVault; Druva; EMC Avamar; I365; IBM; PHD Virtual; Quest; Symantec NBU & BE; Veeam
Hardware-Based
  Multiple backup vendor environments
  No impact on application performance
  Optimized replication
  Scalability of some solutions may cause disruptive upgrades or dedupe islands
  Examples: EMC; Exagrid; FalconStor; Fujitsu; GreenBytes; HP; IBM; NEC; Quantum; Sepaton; Symantec
High-Value Feature Target system integration with backup catalogs and lifecycle policies Symantec OpenStorage (OST) EMC Networker
What's New in Dedupe?
  New dedupe techniques (Example: Arkeia Progressive)
  Dedupe on tape (Example: CommVault)
  Target solutions moving processes upstream (Example: Data Domain Boost)
  Modular dedupe (Example: HP StoreOnce)
  Dedupe in hardware/software from same vendor (Example: Symantec)
  Ongoing improvements in capacity and performance
Disruptive Trends
Purchase Considerations
Which of the following considerations would you say are most important in your organization's evaluation and selection of data deduplication technology? (Percent of respondents, N=145, five responses accepted)
  Cost of solution: 64%
  Ease of implementation/use: 46%
  Impact on backup/recovery performance: 35%
  Integration with existing backup processes: 33%
  Scalability of solution: 31%
  Vendor service and support: 24%
  Ability to deduplicate across systems/data sets as: 23%
  Ability to replicate deduplicated data off-site: 21%
  Existing relationship with vendor: 17%
  Where deduplication occurs: 17%
  Granularity of deduplication: 14%
  Deduplication ratio: 12%
  Experience of vendor in backup implementation: 10%
  When deduplication occurs: 9%
Before Seeking Out Solutions
Understand your needs
  Capacity and throughput requirements/planning
  Full backup size; incremental backup size
  Number of full/incremental backups per week
  Change rate of data
  Projected growth rate
  Retention policies
  Full backup window
  Offsite copy window
  Performance requirements
  Requirements for offsite copies
  Budget
How is Dedupe Evolving?
  Mix of hardware & software approaches
  Scale requirements: performance; capacity
  Focus on recovery considerations: speed of rehydration and restore
  Reliability: criticality of the index (how is it protected?)
  New architectures
  New packaging
  New dedupe techniques
THANKS! laurenw@esg-global.com Twitter: lauwhitehouse Blog: www.dataprotectionperspectives.com
APPENDIX
Fixed- vs. Variable-Length Blocks
Fixed-Length Blocks
  Initial examination: Block A, Block B, Block C, Block D, Block E
  Subsequent examination (change in file): Block A, Block B, Block F, Block G, Block H
  Downstream blocks F, G & H change = no duplication detected after the change
Variable-Length Blocks
  Initial examination: Block A, Block B, Block C, Block D
  Subsequent examination (change in file): Block A, Block E, Block C, Block D
  Downstream blocks C & D unchanged = duplication detected
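The boundary-shift effect described above can be reproduced with a short sketch. The variable-length chunker here is a deliberately simple stand-in: it cuts after a marker byte, whereas production systems typically derive boundaries from a rolling hash over a sliding window. All function names are hypothetical.

```python
from typing import List


def fixed_chunks(data: bytes, size: int) -> List[bytes]:
    """Cut data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, marker: int = ord(".")) -> List[bytes]:
    """Cut data after each marker byte, so boundaries follow the content."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == marker:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):           # trailing bytes with no closing marker
        chunks.append(data[start:])
    return chunks


def duplicates_found(old: List[bytes], new: List[bytes]) -> int:
    """Number of chunks in the new version already present in the old one."""
    seen = set(old)
    return sum(1 for chunk in new if chunk in seen)


original = b"AAAA.BBBB.CCCC.DDDD."
modified = b"AAAA.XBBBB.CCCC.DDDD."   # one byte inserted into the second block
```

Comparing `original` and `modified` with 5-byte fixed chunks finds only the first chunk again, because the inserted byte shifts every later boundary. The content-defined chunker still matches the chunks after the edit, since its boundaries move with the data.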
Time to DR
[Timeline diagram. Post-process dedupe: Backup Job, Replication Time. Inline dedupe: Backup Job, Replication Time]