Enabling Disaster Recovery Through Data Replication Technology
June 7, 2010
Christophe Bertrand, BA (Hons), MBA
Roselinda Schulman, CBCP
Hitachi Data Systems
© 2005 Hitachi Data Systems

About This Session
- Time is of the essence
- Objective: Familiarize non-technical attendees with replication technology
- Focus on Disaster Recovery/Business
- Questions please

About This Session (Cont'd)
- What is a disaster from an IT perspective?
- Business considerations: RTO/RPO; types of recovery technologies; tiers of recovery
- Introduction to the various types of data replication technologies available today

Types of Disasters (1/2)
The objective of a remote-replication solution is not to keep two copies of the data! It is to survive a disaster!
- Predictable
  - You see it coming!
  - Perform an orderly shutdown
  - Switch to a backup data center
  - No data loss and no physical damage
- Lingering
  - Sudden event for an indeterminate duration (e.g., power failure)
  - No physical damage
  - Systems come back with no data loss... ideally

Types of Disasters (2/2)
The objective of a remote-replication solution is not to keep two copies of the data! It is to survive a disaster!
- Rolling
  - Disasters occur over a span of time
  - Seconds to hours
  - During that span, components fail independently, resulting in corrupted and unusable data
- Sudden event
  - Physical damage
  - Failover to backup data center

IT Life Is Full of Small Disasters
[Chart: Probability (High to Low) plotted against Loss Potential x Vulnerability (Low to High). Labels shown: Virus Attacks, Security Breach, Maintenance Event, Insignificant, Hacking, Data Corruption, Hardware Faults, Network Problems, Failed Backup, Utility Failures, Natural Disaster, Terrorism.]

A Surge of Regulations
- Self-regulation replaced by Law and Regulation
- Uncle Sam is getting in the game with a vengeance
- Responsibility cannot be passed on to Service Providers
- Sarbanes-Oxley Act of 2002
- Basel Capital Accord

BC/DR Lessons Learned by Customers From Recent Events
- Document and automate as much as possible; avoid reliance on one or two experts
- Some of the biggest obstacles to timely and successful recovery were:
  - Breakdown in regional infrastructure (transportation, communication, etc.) prevented access to the recovery site
  - Inadequate or untested disaster recovery plans
  - Service providers overwhelmed during a widespread disaster
  - Tape alone was not an effective means of recovery for all applications

Business Considerations
- Recovery-time Objective (RTO)
  - RTO describes the time within which specific business functions must be restored
- Recovery-point Objective (RPO)
  - The point in time to which data must be restored to successfully resume processing
- The actual solution will be based on cost vs. recovery time
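
To make the RTO and RPO definitions concrete, here is a minimal sketch that measures both for one incident and compares them with targets. The function names, the incident timestamps, and the 24-hour and 8-hour targets are hypothetical values chosen for illustration; they do not come from the presentation.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_copy: datetime, outage_start: datetime) -> timedelta:
    """Data-loss window: time between the last usable copy and the outage."""
    return outage_start - last_recoverable_copy


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Service interruption: time between the outage and resumed processing."""
    return service_restored - outage_start


def meets_objectives(rpo: timedelta, rto: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data-loss window and the downtime are within target."""
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical incident: last backup at 02:00, outage at 14:30, service back at 20:30.
last_copy = datetime(2010, 6, 7, 2, 0)
outage = datetime(2010, 6, 7, 14, 30)
restored = datetime(2010, 6, 7, 20, 30)

rpo = achieved_rpo(last_copy, outage)   # 12 hours 30 minutes of updates at risk
rto = achieved_rto(outage, restored)    # 6 hours of service interruption
ok = meets_objectives(rpo, rto, timedelta(hours=24), timedelta(hours=8))
print(rpo, rto, ok)
```

With a tighter RPO target, for example 4 hours, the same incident would fail the check even though the recovery time is unchanged, which is why the two objectives are set separately.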

Business Considerations
[Chart: cost of outage over time. Vertical axis: Cost. Horizontal axis: Recovery Time Objective (Minutes, Hours, Days). Labels shown: Online Revenue Producing Applications; Acceptable Cost/Time Window; Back office, Batch Applications; Cost of Solution and time-to-recover.]

Diversity in Data Protection Requirements
- Different types of data require different levels of protection
- Risk assessment required to evaluate business criticality and cost to recover
[Chart axes: Importance of Data (Less to More), Amount of Data (More to Less), Recovery Time (Delayed to Immediate). Technologies shown: Local Disk Mirroring; Completely Duplicated/Interconnected recovery site; Remote Disk Mirroring; Remote PiT mediated copy; Disk-to-disk backup and recovery; Out of Region and multiple data center strategies; Electronic Vaulting; Tape Back-up Off-site; Tape On-site.]

Disaster Recovery Tiers: Definitions
- Recovery Time Objective (RTO)
  - Time to resume operation at the secondary IT facility; this is the duration of the service interruption.
- Recovery Point Objective (RPO)
  - Worst-case time between last back-up and interruption time.

Technology Tier            RPO Range               RTO Range       Minimum # Disk Copies   Distance   Regional Disaster Support
Tier 1  Tape BU            24-168 hours            48-168 hours    N/A                     Any        Yes
Tier 2  Disk PiTs          4-36 hours              4-24 hours      3 (Note 1)              Any        Yes
Tier 3a Synch              0-2 minutes             1-8 hours       2 (Note 1)              Limited    No
Tier 3b Synch W/Failover   0-2 minutes             5-60 minutes    2 (Note 1)              Limited    No
Tier 4a Asynch             0-5 minutes (Note 2)    1-8 hours       2 (Note 1)              Any        Yes
Tier 4b Asynch W/Failover  0-5 minutes (Note 2)    30-90 minutes   2 (Note 1)              Any        Yes
Tier 5  3DC                0-2 minutes             1-8 hours       3-7 (Notes 1, 3)        Any        Yes

Note 1: Best practice is one additional copy for doing DR testing without impacting the ongoing replication session
Note 2: Network problems will extend the RPO
Note 3: Depends on vendor and method deployed
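
The tier table can be read as a lookup: given a required RPO, a required RTO, and whether regional disasters must be survived, which tiers qualify? The sketch below is one possible encoding, assuming that a tier qualifies only when the upper end of its quoted range is within the requirement. The `Tier` class and the `candidate_tiers` function are hypothetical names introduced for illustration; the numbers are the upper bounds of the table's ranges, converted to minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    technology: str
    worst_rpo_min: int   # upper end of the RPO range, in minutes
    worst_rto_min: int   # upper end of the RTO range, in minutes
    regional: bool       # "Regional Disaster Support" column


HOUR = 60

# Upper bounds taken from the tier table in the presentation.
TIERS = [
    Tier("Tier 1", "Tape BU", 168 * HOUR, 168 * HOUR, True),
    Tier("Tier 2", "Disk PiTs", 36 * HOUR, 24 * HOUR, True),
    Tier("Tier 3a", "Synch", 2, 8 * HOUR, False),
    Tier("Tier 3b", "Synch W/Failover", 2, 60, False),
    Tier("Tier 4a", "Asynch", 5, 8 * HOUR, True),
    Tier("Tier 4b", "Asynch W/Failover", 5, 90, True),
    Tier("Tier 5", "3DC", 2, 8 * HOUR, True),
]


def candidate_tiers(rpo_min: int, rto_min: int, need_regional: bool) -> list[str]:
    """Names of tiers whose worst-case RPO and RTO fit the requirement."""
    return [
        t.name
        for t in TIERS
        if t.worst_rpo_min <= rpo_min
        and t.worst_rto_min <= rto_min
        and (t.regional or not need_regional)
    ]


# Example requirement: lose at most 5 minutes of data, be back within 2 hours,
# and survive a regional disaster.
print(candidate_tiers(5, 2 * HOUR, need_regional=True))
```

Using the upper bound of each range is a conservative design choice. Note 2 of the table warns that network problems can extend the asynchronous RPO beyond the quoted range, so a real assessment would need more than this lookup.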

Local Replication
- Clones
  - Volume-level, point-in-time copy
  - Allows completely independent processing of copy
  - Little impact to performance of primary volume
- Snapshots
  - Volume-level view of the data at a point in time
  - Not a full volume copy
  - Trade-off: less disk utilization vs. potential performance impact to primary volume

Local Replication Uses
- Zero-downtime backup
- Disaster Recovery testing with real production data
- Additional data copies for added protection
- Immediate access to time-critical data
- Application development/testing

Remote Replication (1/5)
Synchronous:
- Provides a remote mirror
- The remote copy is identical to the local copy
- Allows very fast restart/recovery with no data loss
- Potential impact on application performance
- Distance depends on:
  - Application write activity
  - Network bandwidth
  - Response-time tolerance and other factors
  - Typically < 40 km

Synchronous write to remote subsystem
Local write does not complete until confirmed write to cache of Secondary System
[Diagram: Host, Primary (P-VOL), Secondary (S-VOL)]
(1) Write I/O
(2) Synchronous remote copy
(3) Remote copy complete
(4) Write complete
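
The distance limit for synchronous replication follows from the write sequence: the host waits for the remote copy to be acknowledged, so every write pays the round trip between sites. The sketch below estimates that penalty under stated assumptions: light in optical fiber travels at roughly 5 microseconds per kilometer one way, each write needs one round trip by default (some protocols need more), and switch or equipment delays are ignored. The function name and the 0.5 ms local write time are hypothetical.

```python
FIBER_US_PER_KM = 5.0  # approximate one-way propagation delay in optical fiber


def sync_write_latency_ms(local_write_ms: float, distance_km: float,
                          round_trips: int = 1) -> float:
    """Estimated host-visible latency of one synchronously replicated write.

    Adds the propagation delay of the remote copy and its acknowledgement to
    the local write time. Remote cache write time and device delays are
    assumed to be negligible here.
    """
    propagation_ms = 2 * distance_km * FIBER_US_PER_KM * round_trips / 1000.0
    return local_write_ms + propagation_ms


# Compare a 0.5 ms local write at several site separations.
for km in (10, 40, 400):
    print(km, "km:", round(sync_write_latency_ms(0.5, km), 3), "ms per write")
```

Under these assumptions the added delay is small at tens of kilometers and grows to several times the local write time at hundreds of kilometers, which is consistent with the slide's point that usable distance depends on write activity and response-time tolerance.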

Different Types of Replication (2/5)
Asynchronous
- Primary write activity is disconnected from secondary write activity
- Application continues with little or no performance impact
- Potential for some data loss and possible lack of I/O consistency, depending on the vendor's implementation

Asynchronous write to remote storage system
Host I/O process completes as soon as write completes to cache of Primary System. The write data is then asynchronously transferred to the Secondary System.
[Diagram: Host, Primary (P-VOL), S-VOL]
(1) Write I/O
(2) Write complete
(3) Asynchronous remote copy
(4) Remote copy complete

Data Integrity
- Why should you care about data integrity anyway?
  - Restart vs. recovery
  - Hours vs. minutes
  - Data needs to be I/O or crash recovery consistent
- How do you guarantee data integrity?
  - Update sequence validation
  - Missing record detection
  - Subsequent and successful settling

Different Types of Replication (3/5)
Storage-hardware based
- No dependence on server, file system, database, or operating system
- Vendor-specific implementations
- No host resources required
- Usually highest performance
- Management
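
The Data Integrity slide lists update sequence validation, missing record detection, and settling as ways to keep an asynchronous copy consistent. The sketch below illustrates the general idea only and is not any vendor's implementation: the primary is assumed to tag each write with a sequence number, and the secondary holds back any write that arrives ahead of a gap, so the applied data always reflects a prefix of the primary's write order. The class and method names are hypothetical.

```python
class SecondaryVolume:
    """Toy secondary that applies replicated writes strictly in sequence order."""

    def __init__(self) -> None:
        self.applied: dict[str, str] = {}              # block -> data settled so far
        self.pending: dict[int, tuple[str, str]] = {}  # seq -> (block, data) held back
        self.next_seq = 1                              # next sequence number to apply

    def receive(self, seq: int, block: str, data: str) -> None:
        """Accept a write that may arrive out of order or more than once."""
        if seq < self.next_seq:
            return  # already applied, ignore the duplicate
        self.pending[seq] = (block, data)
        self._settle()

    def _settle(self) -> None:
        """Apply held writes while the sequence is unbroken."""
        while self.next_seq in self.pending:
            block, data = self.pending.pop(self.next_seq)
            self.applied[block] = data
            self.next_seq += 1

    def missing(self) -> list[int]:
        """Sequence numbers not yet received that block later writes."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]


vol = SecondaryVolume()
vol.receive(1, "A", "a1")
vol.receive(3, "A", "a3")     # arrives early: held back, not applied
print(vol.applied, vol.missing())
vol.receive(2, "B", "b2")     # gap filled: writes 2 and 3 settle in order
print(vol.applied, vol.missing())
```

Holding back write 3 until write 2 arrives is what keeps the secondary restartable: at any moment it looks like the primary did at some earlier point in time, rather than like a mix of old and new updates.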

Different Types of Replication (4/5)
Host-software based
- Heterogeneous/storage independent: works with most storage hardware
- Operating system dependent
- Can leverage existing hosts and IP networks
- Some impact to server performance; uses host resources
- Centralized management features need to be carefully considered

Different Types of Replication (5/5)
Fabric-based or network-based, appliance or switch
- Emerging technology
- Storage and operating system independent
- Proprietary hardware/appliances are required
- SAN environments
- Performance and management considerations
- Market acceptance

Summary
- Many replication choices: Risk Analysis, Business Impact Analysis are a good place to start
- A necessary add-on to traditional data protection methods
- It's about recovery and rapid restart
- Enables more frequent and simpler disaster recovery testing
- Enables planned outages for scheduled maintenance, migrations, consolidation, or backups

Questions?