Architecting for Disaster Recovery A Practitioner View


Octavian Paul ROTARU
ACMS, Montréal, PQ, Canada
Octavian.Rotaru@ACM.org

Abstract

Few businesses have the capability to recover effectively after a disaster. For the vast majority of organizations, business continuity management activities are compromised by limited budgets and insufficient time and resources. A well-made contingency plan can save an organization from going out of business should an incident or disaster occur. This paper gives a practical perspective on disaster recovery plans and fault-tolerant architectures. It is intended as an easy-to-read practical guide for disaster recovery practitioners. Practical advice, guidelines, and tips and tricks are presented in an attempt to make Disaster Recovery Planning look less murky.

Keywords: Business Continuity (BC), Business Resilience, Data Replication, Disaster Recovery (DR), Fault Tolerant Architectures.

1. Introduction

Most organizations today depend heavily on their IT infrastructure and their data to provide service to their customers, but how many of them are really ready for a disaster scenario, either natural or man-made?

Business Continuity Planning (BCP), together with testing and execution, is referred to collectively as Business Resiliency Planning or Business Continuity Management (BCM). This paper addresses only issues related to the IT infrastructure side of BCM, more specifically the Disaster Recovery infrastructure and how well it is prepared for a real disaster scenario.

Cyber-infrastructure protection, business continuity and disaster recovery, includes safeguarding and ensuring the reliability and availability of key information assets, including personal information of citizens, consumers and employees. [2]

The business risk associated with service disruptions is an inescapable concern for many organizations.
Depending on the criticality of the data handled and the service rendered, business continuity approaches cover a wide range of options.

Disaster Recovery is a critical issue when it comes to information security and business resumption. It covers a wide range of activities, from backing up data and retrieving it from backups to repairing networking capabilities and rebuilding primary production sites. Disaster Recovery planning is the preparation for recovery from any disaster, and its main aim is to help an organization become resilient after a disaster [4].

Susanto [11] considers IT to be the most important issue of all when discussing BC and DR, not only because it is the foundation and backbone of the business but also because IT can play an important role in strategy development and in improving the efficiency of the whole BCP.

2. Fault-Tolerant Architectures and Cost

The basic system architecture considered for disaster recovery consists of a primary site and a backup site. The primary site is the one that handles the production functions, while the backup site is usually a stand-by location that can be used to run production functions if needed. The backup location needs to store enough information so that, if the primary location is unavailable, the information available at the backup site can be used to recover the data lost at the primary site and resume production activities.

Backup sites are classified into two main categories:

- Data Recovery Sites: data is available at an alternate location, but service cannot be resumed until the primary site is back online.
- Service Recovery Sites: both data and processing capabilities are available at the backup site, and service can be resumed from the alternate location.

Both data recovery sites and service recovery sites require a way to synchronize or back up data, either online or at pre-defined time intervals.
Online synchronization of data allows service to be resumed much faster from alternate locations. However, this approach incurs a higher synchronization cost.

Cloud computing also offers a good platform for disaster recovery. Cloud-based applications can be accessed from any stand-by location, provided that the required communication lines are in place.

A healthy compromise between cost and business objectives is usually hard to achieve in a disaster scenario. However, there are ways to combine functions and save on costs without compromising your goals:

Distributed environments (Active-Active)

A simple way to reduce cost is to distribute processing between multiple sites. If one of the sites is affected by a disaster, the service loss is limited to the capacity of that site, and the service is still rendered, even if at reduced capacity.

The costs of communication lines, remote clustering and data synchronization are the main drawbacks of distributed processing environments. Data needs to be synchronized between the sites to allow distributed service processing. Load distribution mechanisms are also required.

Some companies distribute the load based on geographical areas, while others distribute the load evenly between the centers, irrespective of the geographical origin of the request. With even load distribution, one of the sites needs to provide the front-end and load balancing services. If the front-end site is lost, a backup front-end site needs to be available to take over load balancing.

Regional processing centers do not require a load balancing service, but each regional site needs to have a backup site available to take over at any given moment. Regional processing centers reduce the need for data replication. Each site can have one or even two backup sites to which stand-by data is replicated. Instead of replicating all data to all locations, you replicate only parts of the data (regions) to other regional centers that are ready to take over the load if needed.

Use the processing power in a backup site for alternate purposes

Another option is to use the processing power in your backup site to serve other business needs.
For example, a DR site can be used for testing and development or for any other functions that are not mission critical in a disaster scenario.

Share the cost of disaster recovery

Some organizations choose to share the cost of disaster recovery by sharing resources. A common alternate site is usually set up. All the partners replicate their data to the alternate site, and the site has enough processing power available to handle, in a disaster scenario, the processing needs of any one of the organizations that share the cost. The main assumption is that only one organization will use the processing power of the alternate center at any given time.

3. Tips and Tricks for efficient DR planning

Clearly define your recovery goals

One of the most challenging parts of disaster recovery planning is to define your recovery goals and get them approved by all the stakeholders. Clearly defined disaster recovery goals are the backbone of a valid business-resilient architecture and define its requirements.

Each organization needs to have well-defined Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for its infrastructure. The recovery time objective defines how long the business can go without a specific application, function or service. The RTO is the maximum allowable outage time that the business can tolerate during a disaster scenario. The recovery point objective is the point in time to which you must recover data, as defined by your organization. The RPO defines the acceptable loss of data in a disaster situation.

RPO and RTO are independent parameters. RPO is more important than RTO if data availability is more important than service recovery. On the other hand, if service recovery is critical, service availability may overshadow the availability of data. In any case, the complete prevalence of RTO over RPO, or the other way around, is an extreme scenario.
For most organizations the requirements are somewhere in the middle, even if one of the recovery parameters is more important than the other.

Together, the RPO and RTO define the guidelines of disaster recovery planning for an organization, and they need to be in sync with the organization's mission statement and goals. The RPO and RTO of any organization are translated into an architecture that has a price tag, and ultimately you end up comparing that price tag with the existing budget.

Defining RTO and RPO is a very sensitive exercise. Goals that are either too ambitious or too low should be avoided. The goals need to be realistic so that they can be translated into a disaster recovery plan. Many organizations define unrealistic goals that translate into plans that will fail if a disaster ever occurs, because of the many assumptions they rest on.

In most organizations, a single RTO and RPO definition for the entire environment is almost impossible to establish. Each business function or service has its own level of criticality.
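
The idea of separate goals per business function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service names and the RTO/RPO values are hypothetical, and the ordering rule (shortest RTO first, then shortest RPO) is one possible convention, not a prescribed one.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryGoal:
    service: str
    rto: timedelta  # maximum tolerable outage for this service
    rpo: timedelta  # maximum tolerable window of lost data


# Hypothetical goals, for illustration only.
GOALS = [
    RecoveryGoal("transaction_history", rto=timedelta(hours=72), rpo=timedelta(hours=24)),
    RecoveryGoal("customer_accounts", rto=timedelta(hours=1), rpo=timedelta(0)),
    RecoveryGoal("reporting", rto=timedelta(days=7), rpo=timedelta(days=1)),
]


def recovery_order(goals):
    """Most urgent first: shortest RTO, ties broken by shortest RPO."""
    return sorted(goals, key=lambda g: (g.rto, g.rpo))


def unmet_goals(goals, measured):
    """Return the services whose measured results exceeded their goals.

    `measured` maps a service name to a (recovery_time, data_loss) pair of
    timedeltas observed during a DR test. Services with no measurement are
    reported as unmet as well.
    """
    failed = []
    for goal in goals:
        result = measured.get(goal.service)
        if result is None or result[0] > goal.rto or result[1] > goal.rpo:
            failed.append(goal.service)
    return failed
```

Calling `recovery_order(GOALS)` gives the order in which services would be brought back, and `unmet_goals` can be fed with the timings recorded in a DR test to see which goals were missed.
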

For example, for a bank it is crucial to preserve the customer database and account balances, but the transaction history is not equally important. Not knowing the balances of the accounts will result in revenue loss. The telecommunications industry is a similar example. You need to know the customer details and what each customer is due to pay, while displaying call details on the bill is only a nice-to-have.

Defining different goals for each set of data and each service helps prioritize their recovery in case of a disaster and reduces the cost of your DR infrastructure.

Filter mission critical data

The amount of data being stored and processed is growing at a very high pace, and the question an organization needs to answer is how much of it is really mission critical and needs to be preserved in case of a disaster. For many organizations, preserving all the available data is simply a nice-to-have and not a necessity.

Careful business impact analysis is required to identify the impact of losing information, define priorities and filter what is really critical. Minimizing the volume of data and the number of business services that need to be preserved following a disaster is the main way to minimize the cost of DR. Replicating and backing up only critical data reduces the cost and at the same time simplifies planning.

The law of parsimony applies perfectly to disaster recovery plans: the simplest of two or more competing solutions is to be preferred. A complicated disaster recovery plan that has too many variables and needs too much manual intervention is, most of the time, bound to fail. Key resources may not always be available in the aftermath of a natural disaster, and a complicated disaster recovery plan may be inapplicable because it assumes the availability of those key resources (human or material).

The best option you have is to keep the business continuity plan document to the absolutely bare minimum.
Don't overcomplicate procedures and processes. Provide just simple information that the crisis management team can use as the basis of taking action and decisions. [1]

From a Disaster Recovery point of view, data can be classified into a few categories, each requiring a different approach:

- Temporary Data
In most organizations there is no need to replicate temporary data to the alternate site. Examples of temporary data are work files created by long-running batch processes and temporary files created by online transactions. Temporary data is not required in a Disaster Recovery scenario unless the long-running jobs must resume from a point close to the one where the primary site became unavailable. Such a DR approach is needed only for applications of extreme criticality. Very few organizations have DR goals that ambitious and can also cover the cost of such an architecture. Apart from data replication and processing power at the alternate site, a lock-step mechanism is also required. Most of the time, temporary data becomes unusable the moment the process that created it crashes, and it is re-extracted or re-created once the process reruns at the primary site, or at the alternate site in case of DR. Because it is perishable, temporary data is ignored in most organizations when planning DR, and the assumption is that all the processes interrupted by the disaster event will need to run again from the beginning once the alternate site is up.

- Raw Data
Raw data is data that is waiting to be processed; once processed, it is no longer needed. In certain industries the volumes of raw data are extremely large, and replicating them to DR is very costly. Unprocessed raw data is sometimes needed in DR, while in certain situations it can be regenerated. The decision to make raw data available in DR is most of the time driven by legal requirements and not by business decisions.
- Replaceable Data
Most organizations collect data that is not critical but helps employees perform their jobs faster, or helps IT systems run faster or more efficiently. Such data can usually be regenerated or collected again following a disaster event. Database indexes are a good example. You need to have the data available, but indexes can be rebuilt. Rebuilding the indexes takes a while, and database access will be slow during this time, but the information is redundant. Most organizations can avoid replicating replaceable data and wait for it to be rebuilt after switching to the alternate site.

- Mission-Critical Data
Mission-critical data is always replicated to the alternate site in one way or another. The success or failure of a disaster recovery depends on the ability to make mission-critical data available.

Stay away from unrealistic assumptions

"We just need to preserve the data. We will buy the servers (or any other equipment) required after the disaster occurs. We will install them and resume service very fast."

Your vendors will no doubt provide the hardware or any equipment that you need, but how long will it take? And even if the equipment is provided immediately, how long will it take to install and configure it?

Making such an assumption is dangerous because the availability of equipment in the aftermath of a natural disaster is usually limited. A natural disaster affecting a certain area may affect equipment vendors as well. Availability will be limited, and many organizations may compete for any available piece of equipment. Furthermore, installing equipment requires time and people with skills that may be hard to find.

Infrastructure projects take time to implement, and assuming that they will be done in a very short time is not realistic. Think about your last similar infrastructure project and its duration. Multiply that duration by three, and you have a very optimistic estimate of how long the same implementation will take during disaster recovery.

"We need to recover the service as soon as possible. We will reprocess the data while in parallel we will handle new incoming transactions."

Processing old data in parallel with new transactions requires processing power that is usually not available in a DR scenario. Your backup environment needs to be strong enough to process incoming transactions and, at the same time, to catch up and reprocess the data that was lost. Reprocessing data is usually a lengthy process that assumes resource availability.

"It will never happen to us."

One of the biggest problems of any disaster recovery architecture is cost. Making it cost-effective and still capable of restoring both data and service in a timely manner is a very complicated problem even for the best system architects. Ostrich-like upper management sees disaster recovery plans as an expense and not as a necessity, assuming that they will never be needed. The "this will never happen to us" approach is both dangerous and counter-productive when dealing with disaster recovery plans and resilient system architectures.
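
Returning to the reprocessing assumption discussed above, a short worked calculation shows why spare capacity matters. The sketch below is a simplification under stated assumptions: it treats throughput as constant, and the transaction volumes in the usage note are hypothetical.

```python
def catch_up_hours(backlog, incoming_per_hour, capacity_per_hour):
    """Hours needed to clear a backlog while new work keeps arriving.

    Only the capacity left over after serving incoming transactions can be
    spent on the backlog. Returns None when there is no spare capacity, in
    which case the backlog is never cleared.
    """
    spare = capacity_per_hour - incoming_per_hour
    if spare <= 0:
        return None
    return backlog / spare
```

With a hypothetical backlog of 1,200,000 transactions, 100,000 new transactions per hour and a DR site able to process 150,000 per hour, `catch_up_hours(1_200_000, 100_000, 150_000)` gives 24 hours. A DR site sized only for the normal load (100,000 per hour) returns `None`: it never catches up.
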
Upper management support and firm commitment are a must for implementing resilient infrastructures, and managers who only pay lip service to disaster recovery planning are doing more harm than they imagine. Convincing management that the risk of a disaster is real is the biggest hurdle any DRP specialist must overcome. The price tag of a disaster-resilient infrastructure is the main problem for most organizations in today's economic climate, and creativity is required to drive costs down and make the solution more attractive and easier to present to executives who think mainly in terms of dollars.

Protect Personal Information

In many countries, organizations that deal with personal information are subject to a strict set of rules. An organization is responsible for the protection and fair handling of personal information at all times, even during a disaster recovery scenario. Care in collecting, using and disclosing personal information is essential to continued consumer confidence.

Canada is one of the countries that regulate how private sector organizations collect, use and disclose personal information in the course of commercial business, under the Personal Information Protection and Electronic Documents Act (PIPEDA) that became law in April. Each business is subject to the laws of the country where it operates. The reason to bring PIPEDA into this discussion is the ten principles of fair information practices developed under the auspices of the Canadian Standards Association [3]:

1. Accountability
2. Identifying purposes
3. Consent
4. Limiting collection
5. Limiting use, disclosure, and retention
6. Accuracy
7. Safeguards
8. Openness
9. Individual Access
10. Challenging Compliance

The ten principles of fair information practices listed above can constitute the backbone of a successful DR plan. Limiting collection reduces the amount of data you need to safeguard and keep accurate.
Clear accountability and well-identified purposes for collecting information help identify the stakeholders and make it easier to develop an efficient disaster recovery plan.

4. Information Assurance Techniques

There are multiple ways to make sure that data is always available and can be accessed and used in case of a DR. Most information assurance techniques fall into two categories:

- Backup
- Data Replication

If restore time is not a problem, data backups to tape or to virtual tape libraries (VTL) can be effective methods of data recovery. Tape or VTL (disk) backups can be used to restore data once a disaster occurs, provided that enough storage is available at the alternate site and that the tapes, either physical or virtual, can be made available in a timely manner (recalled to the site from the vault for physical tapes, or already present at the alternate site for virtual tape backups).
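
Whether restore time "is a problem" can be estimated before a disaster rather than during one. The following is a rough sketch under stated assumptions: sustained throughput is constant, decimal units are used (1 GB = 1000 MB), and the figures in the usage note are hypothetical rather than vendor numbers.

```python
def estimated_restore_hours(data_gb, throughput_mb_per_s, recall_hours=0.0, reruns=0):
    """Rough estimate of the time to restore from tape or VTL.

    data_gb             -- volume to restore, in gigabytes (1 GB = 1000 MB)
    throughput_mb_per_s -- sustained restore throughput, in megabytes per second
    recall_hours        -- time to bring the media to the alternate site
    reruns              -- number of times the restore has to be repeated,
                           for example after a media failure
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    one_pass_hours = data_gb * 1000 / throughput_mb_per_s / 3600
    return recall_hours + (1 + reruns) * one_pass_hours
```

For a hypothetical 7,200 GB restored at 200 MB/s, one pass takes 10 hours. Adding a 4-hour recall from the vault gives 14 hours, and a single rerun pushes the estimate to 24 hours. Comparing such an estimate with the RTO is a quick first check of whether backup alone is a realistic option.
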

Virtual tape libraries can replicate content over distance, allowing backups taken at the primary site to be replicated to remote locations and be ready for recovery when needed.

Data replication is the process of sharing information in order to ensure consistency between redundant sources. The purpose of data replication is to improve reliability and fault tolerance. Data replication can be done in many ways, and the results differ. From live data replication methods to regular data copies, the data replication goals and techniques need to be in harmony with your DR goals.

The choice between backup and data replication is usually driven by the recovery time objective. If your RTO is very tight, you cannot afford to wait for a tape recovery to complete. The quality of the tapes may also influence the time to restore. Having multiple copies mitigates the risk of a restore failing because of a bad tape, but running the restore again is time consuming. Aggressive RTO goals imply data replication.

Once the decision between backup and data replication is made, the way to back up or replicate the data is driven by the recovery point objective. Aggressive RPO goals usually require live data replication.

Live replication can be done in different ways, depending on the characteristics of the data that needs to be replicated. Database systems can use transactional replication. All transactions running at the primary site can be replicated at the alternate site, either by using the redo logs (transferring them to the alternate site at pre-defined time intervals) or by running the same transaction simultaneously at different sites. Database replication usually imposes a master-slave relationship between the original and the replicas.

Disk storage replication is done by distributing updates of a block device to several physical disks located at different sites.
Disk storage replication can be classified into two categories, depending on the way it handles write operations: synchronous replication and asynchronous replication. Storage replication covers a wider range of applications and can be used for any kind of data.

Synchronous replication guarantees zero data loss. An atomic write operation either completes at both sites or not at all. The biggest disadvantage of synchronous replication is that the primary site needs to wait for the alternate site to confirm the write before proceeding. As the distance between the sites grows, the delay introduced by the communication lines increasingly impacts write performance.

Asynchronous replication doesn't guarantee zero data loss but eliminates the performance penalty. An atomic write is considered complete as soon as the local storage acknowledges it. Data is replicated to the alternate site at predefined time intervals (with a small lag). If the local storage is lost, the remote storage is not guaranteed to have the most current copy of the data, and information will be lost.

All remote data replication techniques require considerable bandwidth. The cost of communication lines is substantial and becomes an ongoing operational cost.

Most storage vendors offer data replication solutions, among which the most notable are EMC SRDF [5], NetApp SnapMirror [6], Hitachi TrueCopy [7], IBM Copy Services [8], HP Continuous Access [9], and FalconStor CDP [10].

The choice between synchronous and asynchronous replication is usually based on the RPO. If your RPO is zero, the only available choice is synchronous data replication, and the performance penalty cannot be avoided. However, if the RPO is greater than zero, an asynchronous data replication technique can be used, and the acceptable replication lag is driven by the defined RPO.

Semi-synchronous replication techniques are also available and provide a good compromise between synchronous and asynchronous methods.
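
The selection logic described in this section (backup versus replication from the RTO, synchronous versus asynchronous from the RPO) can be sketched as follows. This is a minimal illustration, not a sizing tool. It assumes that light travels roughly 200 km per millisecond in optical fiber, so the latency figure is only a lower bound that ignores equipment and protocol overhead; the function names are hypothetical, and semi-synchronous replication is left out for brevity.

```python
from datetime import timedelta

# Assumption: signals in optical fiber cover roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def sync_write_penalty_ms(distance_km):
    """Lower bound on the extra latency of each synchronous write.

    The primary site must wait for one round trip to the alternate site.
    """
    return 2 * distance_km / FIBER_KM_PER_MS


def choose_technique(rto, rpo, estimated_restore_time):
    """Pick an information assurance technique from the recovery goals.

    - RPO of zero: only synchronous replication avoids data loss.
    - Restore from backup slower than the RTO: replicate asynchronously,
      keeping the replication lag within the RPO.
    - Otherwise: backup is enough, taken at least once per RPO interval.
    """
    if rpo == timedelta(0):
        return "synchronous replication"
    if estimated_restore_time > rto:
        return "asynchronous replication"
    return "backup"
```

For example, `sync_write_penalty_ms(100)` gives 1.0, that is, at least one extra millisecond per write for sites 100 km apart, and `choose_technique(timedelta(hours=4), timedelta(minutes=15), timedelta(hours=12))` returns `"asynchronous replication"` because a 12-hour restore cannot meet a 4-hour RTO.
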
With semi-synchronous replication the performance penalty is also reduced, because atomic writes are acknowledged by the remote site as soon as they are received rather than when the write is completed.

5. DRP Testing

Testing is an essential part of disaster recovery planning. A plan that was never tested will probably never work in a real disaster scenario.

A new disaster recovery plan requires more frequent testing. After each test, the plan needs to be reviewed in order to make any necessary corrections. The changed procedures need to be retested and incorporated into the disaster recovery plan.

Disaster Recovery plans can be tested in several ways [13, 14]:

- Structured Walk-Through Testing: DR team members meet to verbally walk through specific steps of the plan, trying to identify gaps, bottlenecks and other weaknesses, or to confirm the effectiveness of the plan.
- Checklist Testing: ensures that the organization complies with the requirements of the DR plan.
- Simulation Testing: a disaster scenario is simulated in a way that does not impact normal operations.
- Parallel Testing: testing is performed at the alternate site while production is not impacted.

- Full-interruption Testing: production systems are shut down and the disaster recovery plan is activated in a situation as real as possible. This is the best way to test your DR plan, but it is costly and disrupts normal operations.

There will always be surprises during DR testing. Unexpected results will occur and alterations to the plans will be needed. The ultimate goal of testing the DR plan is to reduce the sources of error and make your DR plan as good as possible, in order to avoid unpleasant surprises when the plan is used for real.

6. DRP Guidelines

1. Check the legal requirements applicable to your organization.
Legal requirements can strongly influence the cost of your DR solution, and your DR solution needs to be harmonized with them.

2. Make sure that all business processes are properly documented.
You cannot protect what you don't know. All business processes, data inventories, data flows, and data classifications need to be available when DR planning is done.

3. Classify your data.
Data classification will help you decide on your DR strategy. Make sure that only what is really important is made available in DR. What you don't collect, you don't have to pay to store and provide information assurance for.

4. Define clear DR goals.
Make sure that the business understands those goals and is in complete agreement with them. The best way to make the business decision makers understand DR goals is to discuss scenarios with them. Start by taking a set of very specific DR goals and analyze what their business impact will be.

5. Fine-tune your DR goals.
Try to avoid a general set of DR goals that is meant to cover all types of data and services. Even if fine-tuned DR goals add complexity, they reduce the cost of the DR solution.

6. Create DR plans that meet your DR goals and choose the one you want to implement.
There are always multiple ways to implement a DR solution, and each architecture has its advantages and disadvantages.
My advice is to apply the law of parsimony and choose the simplest one. A disaster scenario is not the time to test exotic technologies. Stick with what you know best.

7. Include as much information as possible in your DR kit.
More details than needed will probably do no harm. Missing critical information may make your DR plan fail or increase the time required for recovery. Keep your plan as concise as possible and put additional information in annexes to make sure you have it at hand if needed.

8. Don't try to achieve too much too soon.
Try not to overstate your DR capability and readiness. Take time to test every function as soon as it is recovered. Diagnose problems early and do not leave testing for the end.

9. Avoid making assumptions.
PBX and digital phone lines may not work in a real DR scenario, and the same is true of many other services; this is only one example. Don't assume that services will be available, and always prepare for the worst case.

10. Avoid the easy way.
Recovering first the functions that you know are easy to recover is a temptation that needs to be avoided. Business functions need to be recovered in order of importance, not according to how easy it is to recover them. In a real DR scenario, always follow the plan and the defined priorities (the same applies to DR plan testing).

11. Check for opportunities to combine high availability and disaster recovery.
High availability is a business requirement for many organizations. High availability architectures protect mission-critical applications and services from hardware failure. The usual implementation uses stand-by hardware that is available at the primary site. Combining high availability and disaster recovery architectures can reduce the cost of both, by using the DR hardware available at the remote site for high availability failover in case of hardware failure.
Combining high availability and disaster recovery architectures is not always possible, and it needs to be carefully analyzed.

12. Automate as much as possible.
Automation can protect your solution from human error. Limiting human intervention can make your disaster recovery plan succeed even when critical human resources are not available.

13. Test your DR architecture and plans as often as possible.
Regular testing of your DR architecture and plans builds confidence in them. Knowing that the plan was dress-rehearsed many times is the best assurance you can have.
Quite simply, a plan which has not been tested cannot be assumed to work. Likewise, a plan documented, tested once and then filed away to await the day of need provides no more than a false sense of security. [12]

14. Keep your DR plans and architecture up to date.
Make sure that any application change is analyzed and, if needed, reflected in DR. Identify as early as possible the impact on your DR of any change, regardless of its scope (a new service or changes to existing ones). Reflecting changes in your DR plans and architecture requires budget, and the cost and impact need to be well understood and communicated.

15. Regularly review your DR goals.
Business needs may change, and a review of the DR goals is often required. New business contexts require adjustments to the DR goals, which in turn trigger changes in the DR architecture and plans.

7. Conclusions

Disaster recovery architecture and plans are driven by many factors. The number of variables involved is very high, budget being one of the most important, and so is the temptation to make unrealistic assumptions. Proper disaster recovery planning and an IT infrastructure ready to support it are crucial for the survival of organizations that face disasters.

This paper provides recommendations for developing an effective disaster recovery plan and discusses the architectural options available, proposing a set of guidelines that can help practitioners create solid Disaster Recovery plans while avoiding common mistakes.

Finally, the only recommendation that I can make is to use your common sense and keep the solutions you choose as simple as possible. Simplicity never failed me in the design of fault-tolerant architectures, and it is the biggest lesson I learned.

8. References

[1] David Honour, Business Continuity on a Limited Budget, The Business Continuity Institute.
[2] Constantine Karbaliotis, Critical Interests: Business Continuity, Disaster Recovery and Privacy, Symantec, September 2009.
[3] Office of the Privacy Commissioner of Canada, PIPEDA: A Guide for Businesses and Organizations, Your Privacy Responsibilities, Canada's Personal Information Protection and Electronic Documents Act (PIPEDA), Updated September 2009.
[4] Philip Clark, Contingency Planning and Strategies, Proceedings of InfoSecCD 2010, October.
[5] Symmetrix SRDF Product Page, EMC.
[6] NetApp SnapMirror Product Page, NetApp.
[7] Products: Hitachi TrueCopy (R) Remote Replication, HDS.
[8] Donald Chesarek, John Hulsey, Mary Lovelace, John Sing, IBM System Storage FlashCopy Manager and PPRC Manager Overview, IBM RedBooks paper, 5.pdf
[8] Nick Clayton, Global Mirror Whitepaper, IBM TechDocs, 2008, 03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/ WP
[9] HP StorageWorks Continuous Access EVA, QuickSpecs, 7_div/11617_div.PDF
[10] FalconStor Continuous Data Protector (CDP) - Overview.
[11] Lukman Susanto, Business Continuity/Disaster Recovery Planning, 2003.
[12] U.S. Department of Commerce National Bureau of Standards, FIPS PUB 87 Federal Information Processing Standards Publication, Guidelines for ADP Contingency Planning, 1981 March 27.
[13] Geoffrey Wold, Testing Disaster Recovery Plans, Disaster Recovery Journal, Vol. 3, No. 3, p. 34.
[14] Guy Witney Krocker, Disaster Recovery Testing: Cycle the Plan, Plan the Cycle, SANS Institute InfoSec Reading Room, 2002.