Architecting DR Solutions with VMware Site Recovery Manager
Breakout Session # BC1693
John Arrasjid, Consulting Architect, VMware, Worldwide BC/DR Practice Lead
Will Crittenden, Consulting Architect, VMware, Worldwide BC/DR Practice Deputy
September, 2008
Contributors: Kevin James, Lee Dilworth and Mornay Van Der Walt
Disclaimer This session may contain product features that are currently under development. This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined. These features are representative of feature areas under development.
Agenda Objectives For DR Designs Influencing Factors for Geographical Failover Disaster Recovery Planning with SRM Case study of an SRM deployment
Objectives What are you trying to achieve? Issues and challenges Building Blocks
Objectives What are you trying to achieve? Examples: Limited business services vs. every business service Satisfy regulatory compliance requirements Failover to your own DR site vs. a third party DR site Fit a budget vs. fit a need Availability vs. performance What DR site topologies will you support
Challenges Determine issues and challenges to be addressed Service Level Agreements (RPOs and RTOs) Business Continuity (Strategic) Disaster Recovery (Tactical) Distance between your sites Budget Planning Process Technology Staffing
Building Blocks for Business Continuity
Backup, Recovery & Archival: data vaulting
Redundancy: reduction in single points of failure (SPOFs)
Clustering: rapid failover within a site
Host-Based Replication: faster data recovery at recovery site
Array-Based Replication: rapid data recovery
Mirrored Sites: full protection for continuous availability
Building blocks to meet Business Continuity requirements
Influencing Factors Availability, Performance, and Cost Key Questions Influencing Factors Regulatory Compliance
Disaster Type Classification
Not all types involve a disaster.
Total loss of site (catastrophic, no warning): primary site lost
  Examples: fire, earthquake, terrorism
  No advance warning. Sites disconnected during recovery. Primary site not preserved.
Blackout (serious, no warning): primary site temporarily inaccessible
  Examples: blackout, fire on adjoining floor
  No advance warning. Sites disconnected during recovery. Primary site preserved.
Migration (controlled, scheduled operation): primary site lost, but users have warning
  Examples: flooding after a storm, migration after an acquisition, hardware upgrade, moving to the secondary site to help avoid outages due to imminent weather or storm
  Advance warning. Sites connected during recovery. Primary site not preserved.
Planned (controlled, scheduled operation): planned temporary outage, but users have warning
  Examples: heat wave, building maintenance
  Advance warning. Sites connected during recovery. Primary site preserved.
Production Test (controlled, scheduled operation): production test of an actual recovery/failback, but users have warning
  Examples: recovery/failback test
  Advance warning. Sites disconnected during recovery. Primary site preserved.
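The classification above comes down to three yes/no attributes per scenario: advance warning, site connectivity during recovery, and whether the primary site is preserved. As a minimal sketch, the table could be encoded as a lookup. The names `Scenario`, `SCENARIOS`, and `classify` are illustrative and not part of any VMware product.

```python
from typing import NamedTuple, Optional


class Scenario(NamedTuple):
    advance_warning: bool     # do users have warning?
    sites_connected: bool     # are the sites connected during recovery?
    primary_preserved: bool   # is the primary site preserved?


# Attribute values transcribed from the slide; the structure itself is illustrative.
SCENARIOS = {
    "Total loss of site": Scenario(False, False, False),
    "Blackout": Scenario(False, False, True),
    "Migration": Scenario(True, True, False),
    "Planned": Scenario(True, True, True),
    "Production test": Scenario(True, False, True),
}


def classify(advance_warning: bool, sites_connected: bool,
             primary_preserved: bool) -> Optional[str]:
    """Return the scenario name matching the three attributes, or None."""
    key = Scenario(advance_warning, sites_connected, primary_preserved)
    for name, scenario in SCENARIOS.items():
        if scenario == key:
            return name
    return None


if __name__ == "__main__":
    print(classify(True, True, False))
```

Combinations that the slide does not list, such as no warning with the sites still connected, return `None` rather than a guess.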
Availability, Performance, and/or Cost? What types of failure are you trying to cover? Which systems must be available? How quickly must those systems be available? Which systems must operate at the same performance level? What technologies, software and hardware, will be needed? What is your budget? [Diagram: Cost, Performance, Availability] The types of failure coverage will determine not only the technology, but also how the technology will be deployed and how the infrastructure topology will be set up.
Key Questions for DR Design Considerations
What systems must be available?
  What applications are mission critical?
  Is availability or performance more important?
How much of my business capacity will run at the remote site, and for how long will I be able to sustain that load?
Distance to protect against geographic disasters
Infrastructure requirements
  What technologies (software and hardware) will be needed?
Remote site operations
Test frequency
Budget
NOTES:
Which applications are the most business critical, and what do I need restored to successfully resume business as usual?
Which applications will be most visible to those who are in charge (C-level approval)?
Am I planning to sustain business activity at the remote site, or is this just a temporary solution until the main site is available again?
With the resources available at the recovery site, how much business will be available, how much will I need from the beginning, and how quickly can I add resources if necessary?
Am I planning for a natural disaster, or just for a site outage?
The recovery site should be at least 25 miles away to provide geographical separation and protect against natural disasters. If the primary site is in a suspect area, more distance will help prepare for disasters.
Greater distance increases the cost of replicating data quickly.
Distances over ~200 km rule out synchronous replication.
Will the remote site act as a secondary datacenter in times of normal operation? If bandwidth is available, it may be preferable to have production services residing in both sites.
Do you have the proper licensing for cold standby servers (applications) in the recovery site?
How often does your environment change? If changes are frequent, testing should be done more often.
The lower the RTO and RPO, the more it will cost.
Network Influencing Factors
Distance: synchronous or asynchronous
Bandwidth: throughput vs. latency
Hop count: large pipe vs. more hops
[Diagram examples]
  Distance: Bldg. A to Bldg. B, 1000 ft. (synchronous); San Francisco to New York, 2905 miles (4675 km) (asynchronous)
  Bandwidth: San Francisco to New York City over a T1 (1.544 Mbps) vs. an OC-48 (2488 Mbps)
  Hop count: San Francisco to New York City over a direct link (synchronous) vs. through intermediate hops A, B, C, D, E (asynchronous)
All three factors must be considered to determine the best options for replication of storage.
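The arithmetic behind the distance and bandwidth examples can be sketched as follows. This rough calculation assumes light travels through fiber at about 200,000 km/s and uses the nominal line rates of a T1 (1.544 Mbps) and an OC-48 (2488.32 Mbps). It ignores protocol overhead, equipment delay, and routing detours, so real figures will be worse. The 10 GB payload is a hypothetical value chosen for illustration.

```python
# Assumption: signal propagation speed in fiber is roughly 200,000 km/s.
FIBER_KM_PER_S = 200_000.0


def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip propagation delay in milliseconds."""
    return 2 * distance_km / FIBER_KM_PER_S * 1000


def transfer_seconds(size_gb: float, link_mbps: float) -> float:
    """Time to send size_gb (decimal gigabytes) over a link, ignoring overhead."""
    bits = size_gb * 8 * 1000**3
    return bits / (link_mbps * 1_000_000)


campus_km = 1000 * 0.0003048   # 1000 ft between Bldg. A and Bldg. B
coast_km = 4675                # San Francisco to New York

if __name__ == "__main__":
    print(f"campus round trip: {round_trip_ms(campus_km):.4f} ms")
    print(f"coast-to-coast round trip: {round_trip_ms(coast_km):.2f} ms")
    # Hypothetical 10 GB of changed data to replicate.
    print(f"10 GB over T1: {transfer_seconds(10, 1.544) / 3600:.1f} hours")
    print(f"10 GB over OC-48: {transfer_seconds(10, 2488.32):.1f} seconds")
```

In a synchronous design each acknowledged write generally waits for at least one round trip to the remote array. That is why the campus link suits synchronous replication while the coast-to-coast link pushes designs toward asynchronous replication.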
Protected and Recovery Sites with Flat VLAN Stretched VLAN
Protected and Recovery Sites with Disparate VLANs
Typical Enterprise Architecture
[Diagram: Internet and internal network; web tier (production web servers on ESX), application tier (production app servers on ESX), database tier (physical servers, Unix DB, mainframe, SQL database servers, ESX)]
The typical enterprise includes a heterogeneous server environment with multi-layered applications. Each application may have one or more upstream dependencies (applications that it depends upon) and downstream dependencies (applications dependent upon it). Typically there will be multiple networks, VLANs, firewalls, routers, and switches.
Virtualized Application
[Diagram: Internet and internal network; web tier (production web servers on ESX), application tier (production app servers on ESX), database tier (physical servers, Unix DB, mainframe, SQL database servers, ESX)]
When looking at virtualized applications there may also be upstream and downstream dependencies. Here we show a number of VMs, three of which, shown on the internal network, are part of a multi-tiered application. This implies that the application on each of these three VMs will have an upstream or downstream dependency on the other applications.
Physical/Virtual Hybrid
[Diagram: Internet and internal network; web tier (production web servers on ESX), application tier (production app servers on ESX), database tier (physical servers, Unix DB, mainframe, SQL database servers, ESX)]
In this physical and virtual hybrid, we show three applications, three firewalls, a Unix database, a physical server, and a mainframe that are all related to each other in a multi-tiered application. In this case, each item shown in green has an upstream or downstream dependency on the others. The business determines which business processes need to be available at the time of a disaster. These business processes map to application services. Application services map to systems (physical and virtual). This final mapping applies to the SRM workflows and to the external callouts that cover non-virtual dependencies. It is this mapping of virtual systems that will be pre-programmed into your SRM recovery plan, with callouts via external scripts to physical systems to maintain upstream and downstream dependencies.
Regulatory Compliance What compliance guidelines control your business? How does this affect your Virtual Infrastructure? How does this impact your security procedures? How does this impact your backup, recovery and archival procedures? How does this impact your staff? Regulatory compliance will likely affect your decisions in developing a DR solution for your business. Ensure that the development of workflows will support the guidelines that govern your organization. This will include access controls, recovery steps, change control and tracking, and notification.
Regulatory Compliance What compliance guidelines control your business? Service Level Agreements (SLAs) Recovery Point Objectives (RPOs) Recovery Time Objectives (RTOs) Manual vs. automatic Failback requirements Security and access controls Technologies to use Are there requirements to ensure that data is isolated to its own media?
What Drives a DR Solution to Success? Understanding which parts of the business need to be protected Complete business processes = simplified workflow creation Design and test DR Plan (ongoing) Core Virtualization Virtual Infrastructure configuration at DR site Well-designed architecture and resource management Handling of network and storage at recovery site Replication setup and management Network configuration Operational Readiness
Disaster Recovery Planning Customers may need a variety of services in order to achieve recoverability Business Continuity planning Recovery personnel, office facilities, succession planning, etc. Data center planning / build out Facilities, locations, etc. Technology planning / build out VMware Infrastructure Network design Data backup / replication Recovery planning Application analysis, recovery planning DR testing Validate recovery plans Scope most relevant to Site Recovery Manager implementations
DR Planning and Logical Design with SRM
DR Planning and Site Recovery Manager (SRM)
Business requirements must provide precedence for failover (recovery) sequencing
Each application must be evaluated to understand dependencies and mappings to the business services being protected
Assessment, planning, design, and testing processes are similar but are extended through the use of automation
Upstream (Application X depends on these items): Domain Name Services, Authentication Server, Time Protocol Server, Database Server
Application X (multi-tiered application server)
Downstream (these items depend on Application X): Web Server, Users
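One way to turn the upstream/downstream picture into a recovery sequence is a topological sort, where every item starts only after everything it depends on. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and only the dependencies named on the slide. It illustrates the ordering idea and is not how SRM stores or evaluates dependencies.

```python
from graphlib import TopologicalSorter

# Map each item to the set of items it depends on (its upstream dependencies).
# Only the relationships stated on the slide are encoded here.
depends_on = {
    "Application X": {
        "Domain Name Services",
        "Authentication Server",
        "Time Protocol Server",
        "Database Server",
    },
    "Web Server": {"Application X"},
    "Users": {"Application X"},
}

# static_order() yields every node after all of its dependencies.
recovery_order = list(TopologicalSorter(depends_on).static_order())

if __name__ == "__main__":
    for position, item in enumerate(recovery_order, start=1):
        print(position, item)
```

The relative order of items with no dependency between them (for example, the four upstream services) is not significant. A circular dependency raises `graphlib.CycleError`, which is itself a useful finding during application analysis.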
Pre-Requisites for SRM Failover Each component is required at both sites. Resources at recovery site must support the business SLAs for applications. Note: VC and the SRM server instance can coexist on the same physical server or VM. Infrastructure layout: this is a starting point of what is needed from the standpoint of the VMware Infrastructure. The following slides outline additional requirements.
SRM Setup Overview Protected Site Pairing of Site A with Site B Array Manager Configuration Inventory Preferences Protection Group Setup Recovery Site Recovery Plan Setup Test your Recovery Plan SRM allows you to test your recovery plans without impacting production services Practice makes perfect so test and test again
Site A and Site B Topology
Inventory Mapping These VMs are mapped to Networks, Compute Resources and Virtual Machine Folders that are available at the target site
Array Manager Configuration Protection Side
Array Manager Configuration Recovery Side
Array Manager Configuration Replicated Datastores
Protection Group Setup
Recovery Plan for Complete Site Failover
Protected VM shutdown
Prepare storage
External scripts
Suspend non-critical VMs
Protected VM recovery: High / Normal / Low
Recover no-power-on VMs
SRM Failover Overview
Shut down protected VMs in Site A
  If online: SRM orchestrates the controlled shutdown of protected VMs
  If offline: no action taken against protected VMs in Site A
Promote the storage in Site B
  Replicated datastores are promoted to be read/write enabled
Suspend non-critical VMs in Site B
  VMs identified as non-critical are shut down during failover
Protected VMs from Site A powered up in Site B
  High priority VMs start up first, followed by Normal and Low priority VMs
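The four phases above can be sketched as a small planning function. This is an illustrative model only: the `VM` class, the `plan_failover` function, and the sample VM names are hypothetical, and it does not call any SRM or VirtualCenter API. It shows the ordering logic: a conditional shutdown at Site A, storage promotion, suspension of non-critical VMs at Site B, then power-on by priority.

```python
from dataclasses import dataclass
from typing import List

PRIORITY_ORDER = {"high": 0, "normal": 1, "low": 2}


@dataclass
class VM:
    name: str
    site: str               # "A" = protected site, "B" = recovery site
    priority: str = "normal"
    critical: bool = True   # only meaningful for VMs already running at Site B


def plan_failover(vms: List[VM], site_a_online: bool) -> List[str]:
    """Return the ordered failover steps as human-readable strings."""
    steps = []
    protected = [vm for vm in vms if vm.site == "A"]

    # Phase 1: controlled shutdown, only if Site A is reachable.
    if site_a_online:
        steps += [f"shut down {vm.name} at Site A" for vm in protected]

    # Phase 2: make the replicated datastores writable at Site B.
    steps.append("promote replicated datastores at Site B to read/write")

    # Phase 3: free capacity at Site B.
    steps += [f"suspend {vm.name} at Site B"
              for vm in vms if vm.site == "B" and not vm.critical]

    # Phase 4: power on protected VMs, High first, then Normal, then Low.
    for vm in sorted(protected, key=lambda v: PRIORITY_ORDER[v.priority]):
        steps.append(f"power on {vm.name} at Site B ({vm.priority} priority)")
    return steps


if __name__ == "__main__":
    inventory = [
        VM("web01", "A", "low"),
        VM("sql01", "A", "high"),
        VM("app01", "A", "normal"),
        VM("dev-test01", "B", critical=False),
    ]
    for step in plan_failover(inventory, site_a_online=False):
        print(step)
```

With `site_a_online=False` the plan begins at storage promotion, matching the "if offline, no action taken" branch.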
Design Considerations Network address space Disparate networks Stretched VLANs Datacenter connectivity Dedicated point-to-point connection Storage Infrastructure SAN Infrastructure Disparate networks will require updates to the Guest OS IP address, potential updates to application network configurations, and updates to DNS. Stretched VLANs will eliminate the need to change IP addresses and DNS entries. SLAs tied to RPO and RTO will determine bandwidth and hop count between sites. SAN Infrastructure currently requires iSCSI or Fibre Channel SAN. Datastore can be VMFS or RDM.
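To make the disparate-network case concrete, the sketch below maps a guest IP address from a protected-site subnet to the same host offset in a recovery-site subnet, using Python's standard `ipaddress` module. The subnets and the `remap_ip` helper are hypothetical. This only illustrates the address planning involved, not the mechanism SRM uses to customize guest IP settings.

```python
import ipaddress


def remap_ip(address: str, protected_net: str, recovery_net: str) -> str:
    """Map an address to the same host offset in the recovery subnet."""
    src = ipaddress.ip_network(protected_net)
    dst = ipaddress.ip_network(recovery_net)
    ip = ipaddress.ip_address(address)

    if ip not in src:
        raise ValueError(f"{address} is not in {protected_net}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("subnets must be the same size for a 1:1 mapping")

    # Keep the host portion, swap the network portion.
    offset = int(ip) - int(src.network_address)
    return str(dst.network_address + offset)


if __name__ == "__main__":
    # Hypothetical subnets: 10.1.20.0/24 at the protected site,
    # 10.2.20.0/24 at the recovery site.
    print(remap_ip("10.1.20.15", "10.1.20.0/24", "10.2.20.0/24"))
```

Keeping the host offset identical at both sites makes the mapping predictable. Any DNS records that point at the old address still need to be updated separately.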
Design Considerations Server Type Traditional rack servers Blade servers DNS Services Active Directory Services Dedicated AD for testing and failover Same production AD Server type can be traditional rack servers or blade servers. This can change the cabling complexity, cost, feature capabilities, and other aspects of designing the DR Plan and the deployment requirements. DNS services must support testing as well as actual failover requirements. Use dedicated, but isolated, Active Directory to facilitate BCDR testing and actual failovers, or use the same production environment. Most customers will choose dedicated AD to avoid risk of impacting production during test situations.
Design Considerations VirtualCenter Infrastructure VirtualCenter at both sites ESX hosts at both sites Site Recovery Manager at both sites VMware ESX Host Infrastructure Resource requirements Will you fail over all VMs or a subset? Can Distributed Power Management help? Data Protection SRM provides failover, backup provides additional recovery options The ESX host infrastructure will be influenced by the number of virtual machines that will need to run at the DR site. If all VMs fail over and require the same performance, you will be looking at a similar number of servers. If only a small subset will fail over, the number of ESX hosts required may be greatly reduced.
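The sizing point in the notes can be illustrated with a back-of-the-envelope calculation. The consolidation ratio (VMs per host), the spare-host count, and the VM counts below are hypothetical planning inputs, not figures from this session. Real sizing should be based on measured CPU, memory, and I/O demand.

```python
import math


def hosts_needed(vm_count: int, vms_per_host: int, spare_hosts: int = 1) -> int:
    """Estimate ESX hosts for the DR site from a simple consolidation ratio."""
    if vm_count < 0 or vms_per_host <= 0 or spare_hosts < 0:
        raise ValueError("counts must be non-negative and ratio positive")
    return math.ceil(vm_count / vms_per_host) + spare_hosts


if __name__ == "__main__":
    # Hypothetical inputs: 25 VMs per host, one spare host for maintenance.
    print("fail over all 300 VMs:", hosts_needed(300, 25), "hosts")
    print("fail over a 60-VM subset:", hosts_needed(60, 25), "hosts")
```

Failing over only the mission-critical subset shrinks the recovery-site footprint roughly in proportion to the number of VMs, which is the trade-off the slide describes.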
Technology Planning / Build Out Develop VI architectural design for primary/recovery sites VirtualCenter, SRM, VMware ESX hosts SAN replication, capacity requirements Networking connectivity and bandwidth requirements Implement VMware Infrastructure components as required VirtualCenter, SRM, VMware ESX hosts Configure Site Recovery Manager Protected site, recovery site, SAN connection
Sample Logical Design Blueprint This diagram illustrates a logical design of all the required components. This is a typical starting point for a full design. The following slides provide some of the implementation details as an example of the physical design.
Failover (Recovery) Planning Develop/validate the recovery plan for each application: Identify in-scope application SLAs RPO/RTO requirements Analyze application architecture & dependencies Identify all application components required at DR site Develop your recovery plan Develop failback plan Develop acceptance test plan and criteria Automate your recovery plans for virtual components (configure in SRM)
Protected Site Multi-tiered Application
Sample Workflow with SRM
This workflow assumes that the SRM Server in the protected site is available at the initiation of the failover (recovery). We also assume that infrastructure services such as DNS are available at the recovery site. SRM will orchestrate the failover:
Step 1: Shut down Web Access Portal.
Step 2: Shut down Sharepoint Portal.
Step 3: Shut down Application Server.
Step 4: Shut down SQL Database.
Step 5: Prepare storage at secondary site.
Step 6: Manually confirm DNS update.
Note: Shutdown of VMs in the protected site will only occur if there is connectivity between the VCs in the protected and recovery sites.
Sample Workflow with SRM (continued)
Note: Integrated DNS updates via SRM are not currently supported in SRM 1.0. If VMs are failed over across disparate networks with SRM, IP updates can be automated with SRM; however, DNS updates need to be coordinated with the DNS infrastructure team. It is recommended to complete the DNS updates prior to step 6 in this workflow example.
Step 7: SQL Database is started.
Step 8: Manual callout to have an admin verify that SQL services have started correctly.
Step 9: Application server, which queries information from the database, is started.
Step 10: Sharepoint is started.
Step 11: External Web Access is started.
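In the eleven steps of this sample workflow, the shutdown sequence is the exact reverse of the startup sequence, with storage preparation and two manual checkpoints in between. The sketch below generates the step list from the startup order alone. The `build_workflow` function and the `MANUAL:` prefix are illustrative conventions, not SRM features.

```python
from typing import List

# Startup order from steps 7 to 11: each tier starts after the tier it depends on.
STARTUP_ORDER = [
    "SQL Database",
    "Application Server",
    "Sharepoint Portal",
    "Web Access Portal",
]


def build_workflow(startup_order: List[str]) -> List[str]:
    """Build the full step list: reverse-order shutdown, storage, then startup."""
    # Shut down consumers before the services they depend on.
    steps = [f"Shut down {name}" for name in reversed(startup_order)]
    steps.append("Prepare storage at secondary site")
    steps.append("MANUAL: confirm DNS update")

    # Bring up the first tier, pause for verification, then the rest.
    first, *rest = startup_order
    steps.append(f"Start {first}")
    steps.append(f"MANUAL: verify {first} services started correctly")
    steps += [f"Start {name}" for name in rest]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(build_workflow(STARTUP_ORDER), start=1):
        print(f"Step {number}: {step}")
```

Deriving both sequences from one list keeps them consistent when a tier is added or reordered.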
DR Testing Coordinate/participate in the DR testing with combined physical/virtual DR team Execute disaster recovery test for virtual components Facilitate acceptance test for virtual components SRM will automatically document the success/failure of the recovery plan for virtual components. This should be combined with the success/failure of the physical components that were failed over. Identify recovery plan gaps Develop gap remediation recommendations
Case Study of an SRM Deployment
Customer Infrastructure Overview
[Diagram: two sites 30 miles apart, each with VirtualCenter, SRM, and VMware Infrastructure, with replication between them]
Overall: 500+ physical machines; ~300 virtual machines; production spread across both sites; RPO 30 minutes; RTO 4 hours
At each site: 12 ESX 3.x servers; 1 VirtualCenter Management Server + DB; 1 SRM Server instance + DB; EMC Symmetrix storage, SRDF/A
Between sites: 2G connection
Both sites are initially configured with the VMware Infrastructure. VMware Site Recovery Manager (SRM) is layered on top to provide disaster recovery automation. Asynchronous replication is used between the two sites.
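A quick sanity check on these figures is to ask how much changed data the inter-site link could carry within one RPO window. The sketch below assumes "2G" means 2 Gbit/s and treats the share of the link available to replication (`usable_fraction`) as a hypothetical planning input. The customer's actual change rate is not given in this session, so this yields only an upper bound.

```python
def max_change_gb_per_window(link_gbps: float, rpo_minutes: float,
                             usable_fraction: float = 1.0) -> float:
    """Upper bound on changed data (decimal GB) the link can ship per RPO window."""
    bits = link_gbps * 1e9 * rpo_minutes * 60 * usable_fraction
    return bits / 8 / 1e9


if __name__ == "__main__":
    # Assumption: "2G connection" = 2 Gbit/s; RPO of 30 minutes from the slide.
    print("full link:", max_change_gb_per_window(2, 30), "GB per 30 minutes")
    # Hypothetical: only half the link is available to replication traffic.
    print("half link:", max_change_gb_per_window(2, 30, 0.5), "GB per 30 minutes")
```

If the sustained change rate across the replicated datastores approaches this bound, asynchronous replication will fall behind and the 30-minute RPO cannot be met.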
Phase 1: Assessment Complete Business Impact Analysis Use subset of BIA for workflow generation Generate dependency list (upstream and downstream) Create workflow Setup POC to evaluate workflow and infrastructure
Typical Design Challenges Identified Application understanding Lack of a Configuration Management Database (CMDB) Lack of infrastructure documentation Lack of complete BIA or BC/DR Plan Budget pressure Note: All of these challenges are to be expected to some degree at most sites.
Results of Utilizing SRM with VMware Infrastructure Aligned with customer goals DR for mission critical applications DR becomes a property of the VM Pre-programmed DR plan in VC Provide an audit trail Streamline and simplify DR testing with zero production impact Provide a repeatable, reliable approach for DR failover Support datacenter migration in next few years
VMware BCDR References
Download book on BCDR at: http://www.vmware.com/files/pdf/practical_guide_bcdr_vmb.pdf
Visit the Site Recovery Manager site to learn more at: http://www.vmware.com/products/srm/
Download the Site Recovery Manager Compatibility Matrix, which is available at: http://www.vmware.com/pdf/srm_10_compat_matrix.pdf
Download the Site Recovery Manager Evaluator's Guide, which is available at: http://www.vmware.com/pdf/srm_10_eval_guide.pdf
Site Recovery Manager Lighthouse Program Partners VMware would like to thank the following partners for completing advanced Site Recovery Manager technical training and for setting up SRM in their lab environment Alliance Technologies Arraya Solutions Brigh Technologies Clearpath Solutions Group Coleman Technologies, Inc. Computer Design & Integration LLC Data Strategy Datalink Entisys Focus Technology Solutions Forsythe GreenPages Technology Solutions IT Partners LogicsOne Long View Systems PDS Presidio Siwel Consulting, Inc. Varrow Vicom VIRTERA
Q&A