File Systems for Data Protection
Windsor Hsu, Ph.D., EMC
SNIA Legal Notice
The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations and literature under the following conditions:
- Any slide or slides used must be reproduced in their entirety without modification
- The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations
This presentation is a project of the SNIA Education Committee.
Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as, legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.
The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.
NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.
Abstract: File Systems for Data Protection
There is a strong industry trend to adopt disk-based platforms for data protection, but not all such systems were designed to satisfy the requirements of data protection. This tutorial will explain some of the differences between a general purpose file system and a file system designed for data protection. Participants will:
- Appreciate that a file system targeted at data protection has unique requirements
- Understand at an overview level the different techniques for satisfying these requirements and their operational implications
- Become equipped to make informed decisions about selecting a file system for data protection
Data Protection Transformation
[Diagram: backup clients feed backup/media servers, which write to onsite retention storage; replication over the WAN provides offsite disaster recovery storage]
- Deduplicating storage systems take the role of tape libraries
- Replication of reduced data enables vaulting over the WAN
Tape Use Decline
[Bar chart, % of respondents: impact on use of tape among dedupe backup users. "Tape Reduced" (reduced / will reduce the amount of tape used): 59.3%. "Tape Gone" (eliminated or will eliminate tape): 32.4%. The remaining categories (no impact on tape use, increased / will increase use of tape, don't use tape) account for the rest, with 5.4% the next value shown]
Note: N=204, data is not weighted. Base: all respondents who have seen/expect to see specific results with tape reduction/elimination.
Source: IDC Deduplication Study, November 2009
Tape Automation Market Decline
[Chart: tape automation factory revenue as a % of external storage factory revenue, declining from 11.9% in 1999]
- Tape use is under re-evaluation
- Restore is moving to disk
- DR will be from WANs
Source: IDC Tape Automation Worldwide Forecast 2010-2014, April 2010
Protection Storage versus Primary Storage

             Primary Storage                  Protection Storage
Workload     Continuous random accesses,      Large batches of accesses,
             mostly reads, lots of            mostly writes, few
             metadata accesses                metadata accesses
Cost         Dollars / IOPS                   Dollars / TB
Performance  Latency and throughput, IOPS     Sequential throughput
Safety       Backup to protection storage     Storage of last resort
Storage for Data Protection
- Cost (low $/TB): compete with tape; think "Tier Z"
- Performance (high sequential throughput): finish backups within a shrinking backup window
- Scale and scalability: effectively handle large and growing data sets
- Safety: must work; last line of defense against data loss
Cost: Low $/TB
Data Reduction
- Regular storage array: 1:1
- LZ compression: ~2:1*
- Single instance storage: ~3:1*
- Fixed block: ~3:1*
- Variable-size segment (data deduplication): ~20:1* (see the chunking sketch below)
For backups, data deduplication achieves massive data reduction, with savings in replication WAN bandwidth, power, heat, cooling, and management.
* Typical data reduction rate for backups; actual rate depends on data and algorithm
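To see why variable-size segments deduplicate better than fixed blocks, consider what happens when a byte is inserted near the front of a file. Below is a minimal Python sketch of the two chunking approaches; the hash, mask, and size parameters are illustrative assumptions, not any particular product's algorithm.

```python
import hashlib

def fixed_chunks(data, size=8192):
    """Fixed-block chunking: one inserted byte shifts every later block,
    so previously seen data no longer matches."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, mask=0x1FFF, min_size=2048, max_size=65536):
    """Content-defined chunking: boundaries are chosen where a hash of the
    content matches a pattern, so an insertion only disturbs the segment
    it lands in (toy hash; parameters are illustrative)."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF
        n = i - start + 1
        if (n >= min_size and (h & mask) == mask) or n >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedupe(chunks, store):
    """Keep one copy of each unique segment, keyed by fingerprint."""
    for c in chunks:
        store.setdefault(hashlib.sha256(c).digest(), c)
    return store
```

Running both over a file before and after a one-byte insertion shows fixed blocks re-storing nearly everything, while variable-size segments re-store only the chunk the insertion landed in.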
Data Deduplication: Technology Overview
Example backup sequence (letters denote data segments; only segments not already stored are written):
- Friday full backup: A B C D A E F G
- Monday incremental: A B H
- Tuesday incremental: C B I
- Wednesday incremental: E G J
- Thursday incremental: A C K
- Second Friday full backup: B C D E F L G H
Physical store after the sequence: A B C D E F G H I J K L

Backup                  Logical   Estimated Reduction   Physical
Friday full             1 TB      2-4x                  250 GB
Monday incremental      100 GB    7-10x                 10 GB
Tuesday incremental     100 GB    7-10x                 10 GB
Wednesday incremental   100 GB    7-10x                 10 GB
Thursday incremental    100 GB    7-10x                 10 GB
Second Friday full      1 TB      50-60x                18 GB
TOTAL                   2.4 TB    7.8x                  308 GB
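The total reduction follows directly from summing the columns: cumulative reduction is total logical bytes divided by total physical bytes actually stored. A quick check of the slide's arithmetic:

```python
# Reproduce the table's totals from the per-backup physical sizes.
backups = [  # (name, logical_gb, physical_gb)
    ("Friday full",        1000, 250),
    ("Mon incremental",     100,  10),
    ("Tue incremental",     100,  10),
    ("Wed incremental",     100,  10),
    ("Thu incremental",     100,  10),
    ("2nd Friday full",    1000,  18),
]
logical = sum(l for _, l, _ in backups)    # 2400 GB = 2.4 TB
physical = sum(p for _, _, p in backups)   # 308 GB
print(f"{logical / 1000:.1f} TB logical / {physical} GB physical "
      f"= {logical / physical:.1f}x")      # 7.8x
```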
Data Deduplication: Store More for Longer with Less

Backup date   Cumulative Logical   Estimated Reduction   Physical
First full    1 TB                 4x                    250 GB
April 7       2.4 TB               8x                    308 GB
April 14      3.8 TB               10x                   366 GB
April 21      5.2 TB               12x                   424 GB
April 28      6.6 TB               14x                   482 GB
May 31        12.2 TB              17x                   714 GB
June 30       17.8 TB              19x                   946 GB
July 31       23.4 TB              20x                   1,178 GB
TOTAL         23.4 TB              20x                   1,178 GB

Rule of thumb: physical capacity needed for 3 months of data protection ~ logical size of the data protected
Data Deduplication: Store More for Longer with Less
[Chart: data stored vs weeks in use (1 to 20), deduplication storage vs traditional storage]
10-30 times less data stored versus fulls + incrementals with typical retention policies
Inline vs Post-Process Deduplication
Inline (deduplication before storing): data flows through deduplication and only unique segments are stored.
- Other activities unimpeded
- Predictable
- Simpler
Post-process (deduplication after storing): data is stored raw, then deduplicated.
- 3x disk accesses to the shared store
- 2x capacity: the current backup before dedupe plus 3 months of deduped backups
- The more processes, the more resource contention:
  - Copy to tape: too slow to stream tape
  - Recovery: unpredictability in meeting SLA
  - Replication: poor time-to-disaster-recovery
  - Deduplication: delayed if interleaved with backup/restore
- More administration to fight these issues
A minimal sketch of the two pipelines follows.
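This sketch contrasts the two data paths in Python; it is illustrative only, with the dictionaries standing in for the on-disk store and staging area.

```python
import hashlib

def inline_backup(stream, store):
    """Inline: each segment is fingerprinted before it reaches disk and
    only new segments are written, once."""
    for seg in stream:
        fp = hashlib.sha256(seg).digest()
        if fp not in store:
            store[fp] = seg                  # single write for new data

def post_process_backup(stream, staging, store):
    """Post-process: land the raw backup first (staging must hold a full
    undeduped backup), then re-read it and dedupe in a later pass. This
    is the slide's 3x disk accesses: write raw, read back, write deduped."""
    staging.extend(stream)                   # access 1: raw landing
    for seg in staging:                      # access 2: read back later
        fp = hashlib.sha256(seg).digest()
        if fp not in store:
            store[fp] = seg                  # access 3: deduped write
    staging.clear()                          # reclaim staging capacity
```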
Performance: High Sequential Throughput
Speed Matters
[Chart: data protected (TB) vs throughput (TB/hour) for a 12-hour backup window]
A faster protection storage system can protect larger data sets.
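The chart's relationship is linear: with a fixed backup window, protectable data scales directly with throughput. A quick worked example, taking the 12-hour window as given:

```python
# Data protectable in one backup window = throughput x window length.
window_hours = 12
for tput_tb_per_hour in (5, 10, 20, 30):
    print(f"{tput_tb_per_hour:>2} TB/hr x {window_hours} h = "
          f"{tput_tb_per_hour * window_hours} TB protected")
```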
Challenge in Fast Deduplication
The bottleneck is looking up the index to determine whether a segment is already in the system. Random index lookups limit each disk to ~1 MB/sec:
- A 7.2K RPM SATA drive sustains <120 seeks/sec
- 120 seeks/sec with 8 KB segments: 0.96 MB/sec/disk
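The arithmetic behind those figures, and the spindle count it implies for a 1 GB/sec target:

```python
# Index-lookup arithmetic from the slide (decimal units).
seeks_per_sec = 120                    # ~7.2K RPM SATA drive
segment_kb = 8                         # segment size
mb_per_disk = seeks_per_sec * segment_kb / 1000
print(f"{mb_per_disk:.2f} MB/sec per disk")                   # 0.96
print(f"Disks to reach 1 GB/sec: {1000 / mb_per_disk:.0f}")   # ~1042
```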
How to Dedupe Fast?
- Parallel disk lookups: need ~1000 disks to go 1 GB/sec
- Fast disks for the index of segments already in the system: FC disks, short-stroked disks, SSDs, etc.
- Index in RAM: RAM needs to be ~1% of storage
- Cache index in RAM: needs temporal or spatial locality to work
- Stream Informed Segment Layout (SISL): segments tend to arrive in the same order, so store segments in the order received (see the sketch below)
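A minimal sketch of the locality idea behind stream-informed layout: because backup streams repeat in roughly the same order, one on-disk index lookup can prefetch the fingerprints of a whole container of neighboring segments into RAM. The container size and data structures here are illustrative assumptions, not SISL's actual implementation.

```python
class LocalityIndex:
    """Locality-prefetched fingerprint lookups: segments are written in
    arrival order into containers, so one on-disk index hit prefetches
    all neighboring fingerprints, turning many seeks into ~1 per container."""

    def __init__(self, segs_per_container=1024):
        self.containers = []     # sealed lists of (fingerprint, segment)
        self.on_disk_index = {}  # fingerprint -> container id ("slow" path)
        self.cache = set()       # fingerprints currently in RAM
        self.open = []           # container being filled
        self.segs_per_container = segs_per_container

    def contains(self, fp):
        if fp in self.cache:                  # cheap RAM hit
            return True
        cid = self.on_disk_index.get(fp)      # one "disk" access
        if cid is None:
            return False
        # Warm the cache with the whole container: the next segments of
        # this stream are now likely to hit in RAM.
        self.cache.update(f for f, _ in self.containers[cid])
        return True

    def add(self, fp, seg):
        self.open.append((fp, seg))
        self.cache.add(fp)                    # visible before sealing
        if len(self.open) >= self.segs_per_container:
            cid = len(self.containers)
            self.containers.append(self.open)
            for f, _ in self.open:
                self.on_disk_index[f] = cid
            self.open = []
```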
Scale and Scalability
Proportionate Performance Scaling
[Chart: performance vs capacity, with lines for 4 weeks and 8 weeks of retention]
For a fixed retention period, performance must scale with capacity: a system holding twice the protected data must ingest twice as much in the same backup window.
CPU-Centric versus Spindle-Bound
[Chart: throughput (MB/s, up to ~1,500) vs number of disk spindles (50 to 200)]
- CPU-centric architecture (e.g. SISL-based): throughput scales with CPU performance, largely independent of spindle count
- Spindle-bound architecture: throughput gated by disk IOPS, higher with FC disks than with SATA disks
Data sets are growing rapidly; a CPU-centric architecture provides the required scalability.
Challenge of Large Consolidated Data Centers
[Diagram: a media manager pool routing backups across multiple dedupe pools; capacity used over days 1-4]
If a backup is rerouted to a new dedupe domain, the capacity required is the same as a first full.
Large Dedupe Domain
- Best practice: send the same data set consistently to the same dedupe domain (see the routing sketch below)
- Challenge: as data sets grow, it is difficult to rebalance workloads among dedupe domains
- Solution: a large dedupe domain
- Results: simplified capacity management, higher storage utilization, higher backup success rate
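A minimal sketch of the best practice above: hashing a data set's identity gives a stable mapping to a dedupe domain. The hashing scheme is an illustrative assumption, not a product behavior; note how growing the number of domains would remap data sets, which is exactly the rebalancing problem a single large domain avoids.

```python
import hashlib

def route(dataset_name, domains):
    """Map a data set to a dedupe domain by hashing its identity, so the
    same data set always lands where its earlier backups deduplicate best."""
    h = int.from_bytes(hashlib.sha256(dataset_name.encode()).digest()[:8], "big")
    return domains[h % len(domains)]

domains = ["dedupe-domain-a", "dedupe-domain-b", "dedupe-domain-c"]
for ds in ("exchange-01", "oracle-07", "fileshare-12"):   # hypothetical names
    print(ds, "->", route(ds, domains))                   # stable run after run
# Adding a fourth domain would remap many data sets, forcing each to
# re-store a first full in its new domain.
```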
Safety
Storage of Last Resort
- When primary storage fails: restore from backup
- When backup fails: no copy to restore from; data is lost
Design for Data Protection
- Make sure data is stored correctly when originally backed up
- Make sure new backups do not put old backups at risk
- Make sure old backups stay correct and recoverable
- Be prepared for the worst: design the system for recoverability
- Maintain a copy in another location
End-to-End Verification
- Test recoverability asynchronously after backups: file system consistency and data integrity on disk (sketch below)
- Primary storage can't verify after write; it would be too slow, so primary storage discovers problems during restore
- The end-to-end check verifies all file system data and metadata
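A minimal sketch of asynchronous verification, assuming a simple fingerprint-plus-checksum segment store; a real system would also walk the file system tree to check metadata consistency.

```python
import hashlib
import zlib

def ingest(segment, store, pending):
    """At backup time: store the segment with an end-to-end checksum."""
    fp = hashlib.sha256(segment).digest()
    store[fp] = (segment, zlib.crc32(segment))
    pending.append(fp)

def verify_async(store, pending):
    """Later, off the backup's critical path: re-read every newly written
    segment and confirm it still matches what the client sent."""
    for fp in pending:
        segment, crc = store[fp]
        assert hashlib.sha256(segment).digest() == fp, "fingerprint mismatch"
        assert zlib.crc32(segment) == crc, "checksum mismatch"
    pending.clear()
```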
Fault Avoidance and Containment
Overwriting good data with new data puts previous backups at risk, so (see the sketch below):
- Use a log-structured file system
- Eliminate partial-RAID-stripe writes
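A minimal sketch of the append-only discipline, with an illustrative stand-in for full-stripe buffering:

```python
class AppendOnlyLog:
    """Old containers are immutable: new backups only ever append, so a
    crash or bug while writing new data cannot corrupt old backups.
    Buffering to a full stripe before writing avoids partial-RAID-stripe
    updates (the stripe size here is an illustrative assumption)."""

    STRIPE = 1 << 20  # 1 MiB, stand-in for a full RAID stripe

    def __init__(self):
        self.log = []              # sealed, never-overwritten containers
        self.buffer = bytearray()  # data waiting for a full stripe

    def write(self, data: bytes):
        self.buffer += data
        while len(self.buffer) >= self.STRIPE:
            self.log.append(bytes(self.buffer[:self.STRIPE]))  # append only
            del self.buffer[:self.STRIPE]
```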
Fault Detection and Recoverability
- Verify that all data read is correct
- Verify that all data, even data not being read, remains correct (see the scrub sketch below)
- Ensure all metadata structures can be rebuilt
- Ensure file system check (FSCK) is fast when needed
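A minimal sketch of the second point, scrubbing data that no restore has touched; the container layout is an illustrative assumption.

```python
import hashlib

def scrub(containers):
    """Periodically re-read every container, including data no restore has
    touched, and compare it against its stored fingerprint so latent disk
    errors surface before a restore depends on the data."""
    bad = []
    for cid, (data, fingerprint) in enumerate(containers):
        if hashlib.sha256(data).digest() != fingerprint:
            bad.append(cid)  # candidates for repair from RAID or a replica
    return bad
```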
Flexible and Efficient WAN Vaulting
- Data center (DC) disaster recovery: primary DC replicating over the WAN to a remote DC
- Remote office data protection: remote sites replicating over the WAN to primary and remote DCs
- Multi-site tape consolidation
Time to DR-Ready
[Timeline chart comparing four approaches against the backup window: how soon after a backup starts is an offsite copy ready for disaster recovery?]
- Inline dedupe: replicate during the backup; DR-ready soonest
- Adaptive post-process dedupe: backup to cache, then dedupe, replicate, and present at <50% of ingest speed; DR-ready later
- Scheduled post-process dedupe: backup to cache, then dedupe, replicate, and present at <50% of ingest speed on a schedule; DR-ready later still
- VTL/tape/truck: backup to VTL, copy to tape, recall tapes, truck to and from storage; time to DR-ready is an open question
Summary
A file system for data protection is different from a file system for primary storage. A file system for data protection must offer:
- Fast and effective data reduction
- CPU-centric inline data deduplication
- Scalable performance
- A large deduplication domain
- Resiliency fitting a storage of last resort
- Efficient and flexible replication
Q&A / Feedback
Please send any questions or comments on this presentation to SNIA: trackfilemgmt@snia.org
Many thanks to the following individuals for their contributions to this tutorial:
- Windsor Hsu
- Hugo Patterson
- Joseph White
- Nancy Clay
(SNIA Education Committee)