Storage and Disaster Recovery
Matt Tavis, Principal Solutions Architect
The Business Continuity Continuum
[Figure: continuum running from High Availability through Data Backup to Disaster Recovery]
- High Availability, Storage Backup, and Disaster Recovery form a continuum of continuity of operations (COOP) solutions to avert data loss and application downtime
- In the face of internal or external events, how do you:
  - Keep your application running 24x7? (HA)
  - Make sure your data is safe? (Backup Storage)
  - Get an application back up after a major disaster? (DR)
Disaster Recovery
- DR is one end of the continuum
- Recover from any event within a defined period of time (Recovery Time Objective, RTO) and with a bounded amount of data loss (Recovery Point Objective, RPO)
- Goal: application restarted and data recovered within an acceptable period of time
- Application may run at lower function or lower capacity
- Traditional IT model: DR is off-site
  - Low-end DR: off-site backups
  - High-end DR: full hot-site DR
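To make the two objectives concrete, here is a minimal sketch that checks one recovery event against an RTO and an RPO. The function name, the timestamps, and the 4-hour / 1-hour targets are illustrative assumptions, not values from the slides.

```python
from datetime import datetime, timedelta

def meets_objectives(last_good_copy, outage_start, service_restored, rto, rpo):
    """Return (rto_met, rpo_met) for a single recovery event."""
    downtime = service_restored - outage_start         # compared against the RTO
    data_loss_window = outage_start - last_good_copy   # compared against the RPO
    return downtime <= rto, data_loss_window <= rpo

# Hypothetical event: last backup 6 hours before the outage, service back after 3 hours.
outage = datetime(2011, 6, 1, 12, 0)
result = meets_objectives(
    last_good_copy=outage - timedelta(hours=6),
    outage_start=outage,
    service_restored=outage + timedelta(hours=3),
    rto=timedelta(hours=4),   # assumed target
    rpo=timedelta(hours=1),   # assumed target
)
print(result)  # (True, False): recovered in time, but lost more data than allowed
```

In this example the RTO is met while the RPO is missed, which shows why the two objectives have to be tracked separately.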
How Does AWS Change Traditional DR?
- AWS is useful for everything from traditional low-end DR to high-end HA, but it encourages a rethinking of traditional DR / HA practices
- Everything in the cloud is off-site and (potentially) multi-site
  - Using multiple sites (multiple AZs) comes largely for free
  - Using multiple geographically distributed sites (multiple Regions) is significantly cheaper and easier
- Tends to move the default design point away from cold Disaster Recovery toward hot High Availability, which blends application scaling and COOP design points
- Makes it easier to stack multiple mechanisms, e.g., basic HA within one Region, DR site in a second Region
AWS Regions and Zones
- Regions are completely separate clouds
- Multiple connected Zones in each Region, with private inter-AZ connectivity
- AWS services use AZs to provide their high-reliability SLAs; you should too
[Figure: map of AWS Regions (US East, US West, GovCloud, EU West, Japan, Singapore), each shown with Zones A and B; one Region also shows a Zone C]
AWS Backup Storage Capabilities
- Amazon Simple Storage Service (S3)
  - Highly durable blob storage
  - Highly useful for archival and backup
- Elastic Block Store (EBS) and EBS Snapshots
  - Persistent data volumes for EC2 instances
  - Redundant within a single Zone
  - Snapshot backups provide long-term durability, plus volume sharing / cloning capability within a Region
Copyright 2011 Amazon Web Services
Data Migration to AWS
- Amazon Import/Export
  - Migration of large amounts of data to AWS
  - "Virtual sneakernet": send hard drives to AWS
- Continual Data Backup
  - Backup products: many products and partners here
  - Replication (mirroring, DB replication, log shipping, etc.)
  - Managed file transfer products
  - Scripted: rsync, tsunami, etc.
- Amazon VM Import
  - Support for migrating virtual machines and disks to AWS
  - Windows-only today, with more OSes over time
Architectural Patterns Overview
- Variety of approaches exist
- Tradeoff between RTO/RPO vs. cost and complexity
- Example architectural patterns:

  Approach                                 RTO                RPO
  Backup and Restore                       Hours to days      Day(s)
  Pilot Light for Quick Recovery           Hours              Minutes to hours
  Fully Functioning Low-Capacity Standby   Minutes to hours   Minutes to hours
  Multi-Site Hot Standby                   Zero to minutes    Immediate to minutes
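One way to read the table is as a lookup: pick the least costly pattern whose RTO and RPO fit the targets. The sketch below does that under loud assumptions: the hour values are illustrative upper bounds chosen to stand in for the qualitative ranges above (they are not figures from the slides), and the ordering assumes cost and complexity rise from Backup and Restore to Multi-Site Hot Standby.

```python
# (name, assumed worst-case RTO in hours, assumed worst-case RPO in hours)
# The numbers are ILLUSTRATIVE ASSUMPTIONS, not values from the presentation.
PATTERNS = [  # ordered from lowest to highest cost and complexity
    ("Backup and Restore", 72.0, 24.0),
    ("Pilot Light for Quick Recovery", 8.0, 4.0),
    ("Fully Functioning Low-Capacity Standby", 2.0, 2.0),
    ("Multi-Site Hot Standby", 0.25, 0.25),
]

def cheapest_pattern(rto_target_h, rpo_target_h):
    """Return the first (cheapest) pattern meeting both targets, or None."""
    for name, rto_h, rpo_h in PATTERNS:
        if rto_h <= rto_target_h and rpo_h <= rpo_target_h:
            return name
    return None

print(cheapest_pattern(12, 6))  # Pilot Light for Quick Recovery
```

Replacing the assumed bounds with measured values from your own DR tests makes the lookup meaningful for a real system.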
Backup and Restore: Pros and Prep
- Advantages
  - Simple to get started
  - Extremely cost effective (mostly backup storage)
- Preparation phase
  - Take backups of current systems
  - Store backups in S3
  - Describe the procedure to restore from backup on AWS
    - Know which AMI to use; build your own as needed
    - Know how to restore the system from backups
    - Know how to switch to the new system
    - Know how to configure the deployment
Backup and Restore: Recovery Approach
- In case of disaster
  - Bring up required infrastructure in AWS
    - EC2 instances with prepared AMIs, Load Balancing, etc.
  - Restore system from S3 backups
  - Switch over to the new system
    - Adjust DNS records to point to AWS
- Objectives
  - RTO: as long as it takes to bring up infrastructure and restore the system from backups
  - RPO: time since last backup
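The two objectives for this pattern can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, assuming the RTO is the sum of provisioning, restore, and switchover time, and the worst-case RPO equals the backup interval; the function name and every input value are hypothetical.

```python
def estimate_backup_restore(backup_interval_h, provision_h, data_gb,
                            restore_gb_per_h, switch_h):
    """Return (estimated RTO, worst-case RPO) in hours for Backup and Restore."""
    restore_h = data_gb / restore_gb_per_h      # time to pull data back from backups
    rto_h = provision_h + restore_h + switch_h  # bring up infra + restore + switch DNS
    worst_rpo_h = backup_interval_h             # disaster strikes just before the next backup
    return rto_h, worst_rpo_h

# Hypothetical system: daily backups, 500 GB restored at 100 GB/hour.
rto_h, rpo_h = estimate_backup_restore(
    backup_interval_h=24, provision_h=1, data_gb=500,
    restore_gb_per_h=100, switch_h=0.5,
)
print(rto_h, rpo_h)  # 6.5 24
```

The restore term usually dominates, so measuring real restore throughput during a test is the most useful input to refine.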
Backup and Restore: High-level Architecture
[Figure: the existing data center (front-end server, application server, database server, storage) sends code/logs, data dumps, and data files to a data backup bucket]
Pilot Light for Quick Recovery: Pros and Prep
- Advantages
  - Reduced RTO and RPO
  - Very cost effective (very few 24/7 resources)
- Preparation phase
  - Enable replication of all critical data to AWS
    - Standby DB, replica, mirror, etc.
    - Reduced infrastructure that runs 24/7 in AWS
  - Prepare all required resources for automatic start
    - AMIs, network settings, Load Balancing, etc.
    - Only runs when used for DR
    - Reserved Instances
Pilot Light for Quick Recovery: Recovery Approach
- In case of disaster
  - Automatically bring up resources around the replicated core data set
  - Scale the system as needed to handle current production traffic
  - Switch over to the new system
    - Adjust DNS records to point to AWS
- Objectives
  - RTO: as long as it takes to detect the need for DR and automatically scale up the replacement system
  - RPO: depends on replication type
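The "adjust DNS records" step can be prepared ahead of time as data. The sketch below only builds a change description; it does not call AWS. It assumes DNS is hosted in Amazon Route 53 (the slides do not name the DNS service at this step), and the hostname, IP address, and helper name are hypothetical. The dictionary follows the shape of a Route 53 change batch with an UPSERT action.

```python
def build_dns_switch(record_name, aws_ip, ttl=60):
    """Build an UPSERT change batch that repoints an A record at the AWS stack."""
    return {
        "Comment": "DR failover: point production hostname at AWS",
        "Changes": [{
            "Action": "UPSERT",  # create the record, or overwrite it if it exists
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "TTL": ttl,  # a short TTL lets clients pick up the switch quickly
                "ResourceRecords": [{"Value": aws_ip}],
            },
        }],
    }

# Hypothetical values: example hostname and a documentation-range IP address.
batch = build_dns_switch("www.example.com.", "203.0.113.10")
print(batch["Changes"][0]["ResourceRecordSet"]["ResourceRecords"])
```

With the boto3 library, a dictionary like this would typically be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the hosted zone ID. Keeping the batch as plain data means it can be reviewed and tested before a disaster.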
Pilot Light for Quick Recovery: High-level Architecture
[Figure: the existing data center (front-end server, application server, database server, storage) feeds AWS through DB replication and data backups; AWS holds pre-canned, role-based AMIs, a real-time replication DB, and a data backup bucket]
Fully Functioning Low-Capacity Standby: Pros and Prep
- Advantages
  - Can take some production traffic at any time
  - Cost savings (IT footprint smaller than full DR)
- Preparation
  - Similar to Pilot Light
  - All necessary components running 24/7, but not scaled for production traffic
  - Best practice: continuous testing
    - Trickle a statistical subset of production traffic to the DR site
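Trickling a statistical subset of traffic to the DR site is commonly done with weighted routing. The sketch below simulates that idea locally with a seeded random generator; the 99:1 weights, the site names, and the function name are assumptions for illustration, and real routing would happen in DNS or a load balancer rather than in application code.

```python
import random
from collections import Counter

def route_requests(n_requests, dr_weight, primary_weight, seed=0):
    """Simulate weighted routing and count how many requests reach each site."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    sites = rng.choices(
        ["primary", "dr"],
        weights=[primary_weight, dr_weight],
        k=n_requests,
    )
    return Counter(sites)

# Assumed weights: roughly 1% of requests should land on the DR site.
counts = route_requests(100_000, dr_weight=1, primary_weight=99)
dr_share = counts["dr"] / 100_000
print(f"DR site share: {dr_share:.2%}")
```

Setting `dr_weight` to 0 sends no traffic to the standby, while raising it shifts load gradually, which is also a gentle way to rehearse a failover.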
Fully Functioning Low-Capacity Standby: Recovery Approach
- In case of disaster
  - Immediately fail over the most critical production load
    - Adjust DNS records to point to AWS
  - (Auto) scale the system further to handle all production load
- Objectives
  - RTO: for critical load, as long as it takes to fail over; for all other load, as long as it takes to scale further
  - RPO: depends on replication type
Fully Functioning Low-Capacity Standby: High-level Architecture
[Figure: the existing data center (front-end server, application server, database server, storage) feeds AWS through DB replication and data backups; AWS runs a warm front-end tier and a warm application tier, each in an Auto Scaling group, plus a real-time replication DB and a data backup bucket, reached through a zero-weight DNS route]
Multi-Site Hot Standby: Pros and Prep
- Advantages
  - At any moment can take all production load
- Preparation
  - Similar to Low-Capacity Standby
  - But fully scaling in/out with production load
Multi-Site Hot Standby: Recovery Approach
- In case of disaster
  - Immediately fail over all production load
    - Adjust DNS records to point to AWS
- Objectives
  - RTO: as long as it takes to fail over
  - RPO: depends on replication type
Multi-Site Hot Standby: High-level Architecture
[Figure: the existing data center (front-end server, application server, database server, storage) feeds AWS through DB synchronization and data backups; AWS runs a hot front-end tier and a hot application tier, each in an Auto Scaling group, plus a real-time replication DB and a data backup bucket, reached through an active DNS route]
Best Practices for Being Prepared
- Start simple and work your way up
  - Backups in AWS as a first step
  - Incrementally improve RTO/RPO as a continuous effort
- Check for any software licensing issues
- Exercise your DR solution
  - Game Day
  - Ensure backups, snapshots, AMIs, etc. are working
- Monitor your monitoring system
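Part of checking that backups, snapshots, and AMIs are working can be automated as a freshness check. The sketch below flags any artifact whose last successful run is older than an allowed age. The inventory names, timestamps, and the one-day threshold are hypothetical; a real check would read them from your backup tooling and should be paired with actual restore tests.

```python
from datetime import datetime, timedelta

def stale_backups(last_success, now, max_age):
    """Return the sorted names of artifacts whose last success is older than max_age."""
    return sorted(name for name, ts in last_success.items() if now - ts > max_age)

# Hypothetical inventory of last successful runs.
now = datetime(2011, 6, 1, 12, 0)
inventory = {
    "db-dump": now - timedelta(hours=5),
    "app-logs": now - timedelta(days=3),
    "ami-web": now - timedelta(days=40),
}
print(stale_backups(inventory, now, timedelta(days=1)))  # ['ami-web', 'app-logs']
```

Running a check like this on a schedule, and alerting when the check itself stops running, is one simple way to monitor the monitoring.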
DR Solution Providers
- http://aws.amazon.com/solutions/solution-providers/
- http://aws.amazon.com/solutions/case-studies/
Conclusion
- Advantages of DR with AWS
  - Various building blocks available
  - Fine control over cost vs. RTO/RPO tradeoffs
  - Ability to scale up rapidly when needed
  - Pay for what you use, and only when you use it (when an event happens)
  - Ability to easily and effectively test your DR plan
  - Availability of multiple locations worldwide
  - Variety of Solution Providers
Thank You!