Disaster Recovery Hosting Provider Selection Criteria
By, Solution Director
6/18/07

As more and more companies choose to use Disaster Recovery (DR) services, the questions that keep coming up are: What business services need to be implemented in a remote disaster recovery site? What architecture is required? And finally, which site provider should I select? This paper attempts to reveal some of the nuances of determining business data and service needs, the network architecture of disaster recovery solutions, and disaster recovery objectives, as they relate to the selection of a Disaster Recovery provider site.

Before selecting a DR site provider, the business must first define what level of DR services it will require, as this will be one of the drivers of the network architecture and of the overall complexity and cost of the DR solution. The first information needed is the services and requirements for the DR site and an assessment of the current production environment to ascertain all components within the production site. The decision about which services need to be available should be made with regard to business needs, compliance requirements, if any, and any legal obligations to the company's employees, investors or owners. Normally, network architecture, infrastructure topology and services are determined first.

In conjunction with the architecture of the DR solution comes the level of service or availability required of the DR solution. In most cases of hosted DR, the only Service Level Agreement (SLA) available from the hosting provider covers network and server / operating system availability, and only if the network and servers are managed directly and exclusively by the DR provider's site team. These SLAs can range from 97% to 99.999% availability. On a side note, the higher the availability, the greater the impact of the SLA on the cost of the hosting service.
The five nines, as the 99.999% SLA is referred to by the industry, requires redundancy of all networking components and services, which can double or triple the cost of a 97% SLA solution. The downtime permitted by a given SLA can be calculated with an online SLA calculator.

Application availability can be obtained, but at substantial cost. It is not normally in scope for the provider and is usually managed by the client desiring these services. The reasoning is that the DR provider's site team may not have the expertise to support, troubleshoot and resolve the client's application issues; therefore, application availability is usually not supported by the DR provider's site team. Application support can, however, be provided by 3rd party providers or contractors.

Each disaster recovery solution is unique, depending upon the business requirements for the DR site. This paper considers six of the most basic and common DR solutions. Each solution can be combined or blended with other solutions, sites and services, and customized to suit the business's unique needs.
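The downtime arithmetic behind these SLA percentages is simple enough to sketch without an online calculator. The following is a minimal illustration, assuming a 365-day year and that the SLA is measured over the whole year; the function name `allowed_downtime_hours` is hypothetical and not part of any provider's tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumption: leap years are ignored


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime, in hours, that an availability SLA permits over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    # The permitted downtime is the unavailable fraction of the period.
    return period_hours * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # SLA levels spanning the 97% to 99.999% range mentioned in the text.
    for sla in (97.0, 99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% availability allows {hours:.3f} hours "
              f"({hours * 60:.2f} minutes) of downtime per year")
```

Under these assumptions, a 97% SLA permits about 262.8 hours (roughly 11 days) of downtime per year, while five nines permits about 5.26 minutes, which gives a sense of why the redundancy needed for the higher SLA costs so much more.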
Disaster Recovery Solutions

This paper addresses six basic classifications of DR solutions, abbreviated as FMDRLB, FMDR, SDDR, MDR, NODR and COLO. Each is described below with some specifics of the services provided.

1. Full Mirrored with Data Replication and Load Balancing between the Production and DR Datacenters, Active/Active or Active/Passive (with automatic fail-over) (FMDRLB)
   a. Most Costly
      i. Equipment and connectivity duplicated between datacenters.
      ii. Load Balancing (LB) hardware, services or software required.
         1. LB can be Geographical, Load Dependent IP specific or Round-Robin.
         2. Resilient replication connectivity and hardware or software is required.
            a. Real-time replication required.
      iii. Full Backup architecture duplicated in both datacenters.
      iv. All Server OS and Application licensing required for both datacenters.
      v. Datacenter staff manages equipment monitoring, patching and upgrades.
   b. Fully Resilient
      i. Usually an Active / Active Configuration.
   c. Each DR solution architected to handle full loading, independently.

2. Full Mirrored with Data Replication between the Production and DR Datacenters (FMDR)
   a. Costly
      i. Equipment and connectivity duplicated between datacenters.
      ii. Replication connectivity and software required.
      iii. Full Backup architecture duplicated in both datacenters.
      iv. All Server OS and Application licensing required for both datacenters.
         1. OS and Application licensing may be shared between sites, depending on the vendor's license.
            a. Only one site active at any time.
      v. Datacenter staff manages equipment monitoring, patching and upgrades.
   b. Fully Resilient
      i. Active / Passive (with automatic or manual fail-over) Configuration.
   c. Each DC architected to handle full loading, independently.

3. Scaled Down DR (SDDR)
   a. Costly
      i. Equipment and connectivity may be scaled down to provide the same functions and applications, but in a less efficient manner than the Production site.
         1. Instead of fully resilient servers as in the production DC, the DR DC can use single servers running the same applications or multiple applications, depending on the client's understanding of the impact upon performance.
         2. Instead of the full data pipe provided at the production site, a smaller data pipe connection may be provided for DR.
      ii. Some form of data replication required, whether by backup tapes, files, transaction logs, etc.
      iii. Backup services provided by the DC, dependent on the client's requirements.
      iv. Backup architecture is datacenter specific, to allow for restores and backups of the DR data until the production site is back up and functional.
      v. All Server OS and Application licensing required for both datacenters.
      vi. Datacenter staff manages equipment monitoring, patching and upgrades.
   b. Somewhat Resilient
      i. Active / Passive (with manual fail-over) Configuration.
   c. Only the Production DC can handle full loading.
      i. The DR site may handle most of the load, but in a lower performance capacity.

4. Minimal (MDR)
   a. Barebones functionality compared to Production.
   b. Less Costly
      i. Equipment level and connectivity drastically cut back.
      ii. Servers performing many functions and running multiple applications.
         1. Maintains most production functionality but with a drastic impact on performance, compared to the production site.
      iii. Some form of data replication required, whether by backup tapes, files, transaction logs, etc.
      iv. Backup services provided by the DC, dependent on the client's requirements.
      v. Backup architecture is datacenter specific, to allow for restores and backups of the DR data until the production site is back up and functional.
      vi. All Server OS and Application licensing required for both datacenters.
         1. OS and Application licensing may be shared between sites, depending on the vendor's license.
            a. Only one site active at any time.
      vii. Datacenter staff manages equipment monitoring, patching and upgrades.
   c. Somewhat Resilient
      i. Active / Passive (with manual fail-over) Configuration.
   d. Only the Production DC can handle full loading.
      i. The DR site can handle a subset of the production functionality in a lower performance capacity.
5. Notice Only (NODR)
   a. Least Costly of managed services hosting.
   b. An HTML notice page, hosted on Datacenter equipment, announces that the Production site is down and includes contact information for the client's business, with an ETA for the return of functionality, if required by the client.
   c. Equipment and connectivity are provided by the DC and provide no functionality other than the notification that the production site is down.
      i. Datacenter staff manages equipment monitoring, patching and upgrades.
      ii. Backup services may be provided by the DC or by the client, dependent on the client's requirements.
   d. No Redundancy

6. Co-Location Only (COLO)
   a. Least Costly to Costly.
      i. Equipment and connectivity are provided by the client and are enclosed in a secured area of the DC.
         1. Any redundancy or resiliency is entirely the responsibility of the business or hosting client, and any performance issues are the client's responsibility to resolve.
         2. Connectivity services are provided by the hosting provider, and availability is contractually guaranteed only up to a point of demarcation between the DC managed hosting area and the secured area in which the client's equipment resides.
      ii. Any data replication is fully the responsibility of the client.
      iii. Backup services may be provided by the DC or by the client, dependent on the client's requirements.
      iv. The client is responsible for Server OS and Application licensing required for both datacenters.
      v. The client manages equipment monitoring, patching and upgrades.
      vi. The datacenter manages physical access to the enclosed secure area of the DC in which the client equipment is located and is responsible for nothing else.

After a decision has been made on the architecture of the disaster recovery environment, the recovery time and recovery point objectives must then be determined. These two objectives are further drivers of the cost and complexity of a total disaster recovery solution tailored to the business's needs.
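The six classifications described in this section can be summarized as a small lookup table, which may help when comparing candidate solutions side by side. This is only an illustrative sketch: the class `DRSolution`, its field names and the helper function are hypothetical, and the values are condensed from the outline in this paper (for COLO, whether the DR site can carry the full load depends entirely on what the client deploys, so it is recorded as unknown).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DRSolution:
    code: str                                  # abbreviation used in this paper
    name: str
    relative_cost: str                         # cost label taken from the outline
    failover: str                              # fail-over style named in the outline
    dr_site_handles_full_load: Optional[bool]  # None means it depends on the client


DR_SOLUTIONS = [
    DRSolution("FMDRLB", "Full Mirrored with Data Replication and Load Balancing",
               "Most Costly", "automatic", True),
    DRSolution("FMDR", "Full Mirrored with Data Replication",
               "Costly", "automatic or manual", True),
    DRSolution("SDDR", "Scaled Down DR", "Costly", "manual", False),
    DRSolution("MDR", "Minimal", "Less Costly", "manual", False),
    DRSolution("NODR", "Notice Only",
               "Least Costly of managed services hosting", "none", False),
    DRSolution("COLO", "Co-Location Only",
               "Least Costly to Costly", "client responsibility", None),
]


def solutions_handling_full_load() -> List[str]:
    """Return the codes of solutions whose DR site is architected for full load."""
    return [s.code for s in DR_SOLUTIONS if s.dr_site_handles_full_load]


if __name__ == "__main__":
    print(solutions_handling_full_load())
```

A table like this is only a starting point, since the paper notes that the basic solutions can be blended and customized.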
Disaster Recovery Objectives

As mentioned previously, the other considerations that drive the cost and complexity of a Disaster Recovery solution are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). Understanding the requirements behind these objectives is necessary for a DR solution that meets the business's needs and can deliver the required services when called upon to do so. Note that, with regard to cost, the lower the RTO and RPO, the more expensive and complex the DR solution will be. Businesses need to understand that meeting the desired RTO and RPO may require upgrades to the current production site in equipment, services, software, connectivity and architecture, in addition to deploying the equipment, software, connectivity, new architecture and all associated components at the DR site.

To determine what the RTO and RPO goals should be, what the objectives mean first needs to be defined.

Recovery Time Objective (RTO) is the time objective measured from the moment a disaster is initially declared to the time when the Disaster Recovery Site or Production Site is back up and operational.

Recovery Point Objective (RPO) is the data loss objective: the amount of data, expressed as a span of time counted from the moment the disaster initially occurred, that the client can afford to lose without recovery by the time the Disaster Recovery Site or Production Site is back up and fully operational.

RTO requirements can range from zero seconds to 1 week in duration.
   o Durations of 2 to 6 hours are more common.
RPO requirements can range from zero seconds of data loss to 5 days.
   o Durations of 30 minutes to 4 hours are more common.

The longer RPO durations are usually for businesses that do not deal with emergency services such as governmental, military or vital services, or with financial or healthcare information, and whose data loss is recoverable by other methods, such as contracted or archival services. Again, as stated above, the smaller the RTO and/or RPO, the more expensive the DR solution will be, due to the cost and complexity of the total solution.
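One way to reason about these objectives is to compare them with what a proposed design can actually deliver. The sketch below is a simplified illustration, not a sizing method: it assumes that worst-case data loss equals the interval between replication or backup runs (transfer and restore verification time are ignored), and the names `RecoveryObjectives` and `meets_objectives` are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum time from disaster declaration to restored service
    rpo: timedelta  # maximum span of data the business can afford to lose


def meets_objectives(objectives: RecoveryObjectives,
                     estimated_recovery_time: timedelta,
                     replication_interval: timedelta) -> Dict[str, bool]:
    """Check a proposed design against the RTO and RPO.

    Assumption: the worst-case data loss is one full replication interval.
    """
    return {
        "rto_met": estimated_recovery_time <= objectives.rto,
        "rpo_met": replication_interval <= objectives.rpo,
    }


if __name__ == "__main__":
    # Example objectives within the common ranges quoted above.
    goals = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=30))

    # Nightly tape backups: recovery may be fast enough, but up to a day of data is lost.
    print(meets_objectives(goals, timedelta(hours=3), timedelta(hours=24)))

    # Transaction logs shipped every 15 minutes stay inside the 30-minute RPO.
    print(meets_objectives(goals, timedelta(hours=3), timedelta(minutes=15)))
```

In this example the nightly backup design fails the RPO while the 15-minute log shipping design meets both objectives, which mirrors the point that tighter objectives push the design toward more frequent, and more expensive, replication.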
Once the architecture, topology and RTO/RPO objectives have been determined, the criteria that the hosting provider(s) will have to meet can then be considered.

DR Hosting Facilities

Location is one of the first criteria that needs to be considered. Potential geographical threats such as earthquakes, hurricanes, tornadoes, military bases, airports, refineries and chemical facilities, forests, etc. will determine the location of the DR site(s). DR sites are usually geographically separated from the production datacenter.
   o Datacenters located well apart from the Production Datacenter are preferable in most DR scenarios and in some cases required.

Geographical Locations
What is the geographical footprint of the provider?
   o Global Locations? If the business has partners or branches located in other countries or continents, then a provider with facilities globally would be preferred and considered.
   o Regional Locations? If the business has partners or branches located within the region, then a provider with facilities region-wide would be preferred and considered. Such facilities are located in areas, regionally, that do not suffer from the same geographical threats as the production site.
   o National Locations? If the business has partners or branches located nationally, then a provider with facilities located nation-wide would be preferred and considered. Such facilities are located in areas, nationally, that do not suffer from the same geographical threats as the production site.

Once a location has been determined, the following will then need to be considered:

Required Architecture & Topology
Are there required architectures or network topologies for the DC's managed solutions?
   o This may also drive the cost and complexity of the overall solution.

Firewall & Security Services
What firewall and security services are provided at the DC level?
   o Security coming into the DC should not be the only security provided.
   o The solution should also provide firewalling from within the DC to ensure security from potential competitors or hackers that may also reside in the DC.
   o Some infrastructures and topologies require firewalls between various zones or areas within the hosted solution.
   o Some hosting facilities provide additional security services or may require 3rd party security services to limit the hosting facility's risk and liabilities.

Data, Voice & Video Infrastructure
Is the infrastructure Data, Voice and Video capable?
   o Reviewing the business's current and future services and growth plans may also drive the cost and complexity of the overall solution. If the business intends to have DR data, video, voice, email, FAX, phone systems, etc., then the capabilities of the DC hosting infrastructure will have to be evaluated to ensure support for such services and for their SLAs, such as QoS, streaming audio-video, etc.

Backbone Connectivity
In a DR incident, all of the company's data and services are at risk, because the production site is down, all the business-critical services are now single-ended with no backup site, and the data is usually traversing longer than normal distances to reach the business's users and/or clients. Therefore the provider will need to have reliable and resilient backbone connectivity to and from the other associated sites and the internet. In some instances dual providers are required, depending on the type of business and the services that the company provides.
What backbone connectivity services are available at each of the provider's Datacenters?
   o AT&T?
   o Verizon?
   o Sprint?
   o Time Warner?
   o Others?

Hardened Facilities
Are the provider's Datacenters hardened and protected from natural and man-made disasters?
   o Some Datacenters are specially constructed to protect the contents from damage or destruction.
Note: These facilities are more expensive to reside in and should only be considered if the business's data or services require such facilities.
Contracted SLAs
What are the contractually guaranteed availability and uptime levels?
   o The Service Level Agreement (SLA) available from most providers covers network and server / OS availability.
   o This availability applies only if the DR provider's site team manages the servers on which the business's services and applications run. SLAs range from 97% to 99.999% availability.

Facilities Physical Resources
What is the facilities infrastructure of the Datacenter?
   o Multi-provider for power?
   o Power Growth Factors?
   o Potential Physical Growth Factors?
   o Current Utilization of Power, Connectivity, IP allocations, and IP Space?

Physical Access
   o What security is in place to guard against unauthorized access into the DC as well as into the DC managed hosting area?
   o Are the Co-Lo areas separate and well defined from the DC managed hosting areas?

Are tours of the Datacenter available to prospective clients with advance notice?
   o Most reputable hosting centers will allow tours of their facilities with advance notice.
   o Ensure that statements made during the tour are true.
      - Ask to inspect any testing schedules to verify statements about periodic testing.
      - Ask for references from currently hosted businesses.
      - Observe the facility for conditions that may contradict the statements made.
      - Ask about any doubts that you may have, or about observed conditions that raise doubts.

What are the management options provided by the provider?
   o Of the Servers and Network equipment, what management options are provided by the provider?
   o Servers will require some form of remote management, such as a Remote Access Card or similar device.
      - DC Provider Team Managed
      - 3rd Party Managed
      - Business Managed
   o Routers and Switches usually require some form of remote management, such as a Code Activated Switch, Management Router or similar device.
      - DC Provider Team Managed
      - 3rd Party Managed
      - Business Managed

In the event of a Co-Location hosted environment, what processes are required for 3rd party vendors to gain access to the Co-Location client space?
The Disaster Recovery Kit (DR Kit)

The DR Kit should be developed in conjunction with the DR site solution activities. Because many of the tasks and much of the information will be readily available during DR site planning and design, building the kit at the same time takes advantage of that work and provides additional risk mitigation. The DR Kit information can also assist in deploying the DR site elements.

The Disaster Recovery Kit is a tool for rebuilding the business's environment from scratch. It is designed to contain all the information regarding the business's IT environment. The DR Kit is a living document that changes constantly as business requirements, new services, new hardware, new applications or tools come into the business's IT environment. The kit is a valuable asset in rebuilding a business's IT environment that has been destroyed by a natural or man-made disaster.

The DR Kit is usually electronic in nature and stored on media that allows for easy retrieval if needed. Files and data can be saved on the media or embedded within the document itself. This ensures that all the information required to rebuild the entire business IT infrastructure can be retrieved from this documentation.
The DR Kit should contain:
   o Hardware configurations for every router, switch, or server found within the IT environment.
   o Documentation of all applications running within the IT environment, along with licensing information. This applies to both servers and workstations.
   o Source files for all custom developed tools and applications.
   o Configuration files for every network device, so that the devices can be configured quickly, efficiently and to the same specifications as the originals when rebuilding the environment.
   o Scanned copies of service contracts for network hardware, servers, workstations, or special devices.
   o Serial numbers and/or service / asset tags for workstations, servers, network devices, or special hardware.
   o Network diagrams indicating connectivity, both internal and external, and any special IP addressing and subnetting required.
   o All connectivity contracts and services, along with the contract numbers, expiration dates and contact information for the services.
   o Any special services required for the IT environment to function correctly, with description, vendor contact information and contract numbers, where applicable.

This is just a small sampling of the items that should be part of a DR Kit. Each kit will be unique and specific to the business's needs and requirements.

Summary

As shown above, disaster recovery solutions can be complex and difficult for businesses to comprehend and implement. Hopefully, this paper provides some insight into the steps and requirements of disaster recovery solutions and the selection of a DR site provider. Only with correct planning, architecture, design, services, and implementation will disaster recovery solutions come to fruition and meet the business's DR needs. Disaster recovery is but one facet of Business Continuance Planning (BCP) and its activities. Disaster recovery plans and solutions do not take the place of BCP activities.

Below are some disaster recovery solutions that have been successfully architected and implemented and that provided the required business services.

Meet the Author: The author is a Solution Director and Manager with the Revere Group. His specialties are: Network Architecture and Infrastructure Engineering and Design, Hosting Services Architecture Engineering and Design, LAN / WAN design, Desktop & Server Systems Engineering and Design, Software Development and Support, and Project Management. Recently, as a consultant with one of the largest banks in the US, he provided the infrastructure architectural designs and assisted in the implementation of many Hosting Solutions for the bank's organizations and branches.
Also, as a Systems Consultant, he provided Project and Team Management for a large networking corporation's Server Consolidation Initiative, managing the technical project and an international team of VMware migration engineers. Mr. Ostler has also served as a Microsoft Infrastructure Architect, providing Microsoft Infrastructure assessments and architectural recommendations and designs for the U.S. Army and other military and federal entities and organizations.