Beyond Traditional Disaster Recovery Goals
Augmenting the Recovery Consistency Characteristics

Octavian Paul ROTARU
American Sentinel University
Octavian.Rotaru@ACM.org

Abstract

For most organizations the disaster recovery goals are limited to the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). This perspective on disaster recovery overlooks important factors that contribute to the successful implementation of a Disaster Recovery plan. Evaluating metrics beyond recovery time (RTO) and recovery point (RPO) is essential to meet the recovery commitments of an organization. The purpose of this paper is to review existing Disaster Recovery metrics that can augment the Recovery Point and Recovery Time objectives and to propose new metrics for Recovery Consistency. The Recovery Consistency Objective (RCO) measures the overall data consistency of a Disaster Recovery solution after recovery. RCO adds data consistency objectives to the disaster recovery objectives of an organization, but RCO alone is often not enough to evaluate consistency. Going beyond the traditional disaster recovery goals, this paper introduces an assessment method and metrics for the consistency of the module interfaces in addition to the module consistency. The RCO as well as the proposed interface consistency metrics are evaluated in the context of the seven disaster recovery tiers defined by the SHARE User Group.

Keywords: Business Continuity (BC), Business Resilience, Disaster Recovery (DR), Disaster Recovery Goals, Recovery Consistency Objective (RCO).

1. Introduction

A successful disaster recovery plan has well defined goals that are in line with the business requirements. Defining the recovery objectives is one of the most important steps in creating a disaster recovery plan, and the objectives are the result of the Business Impact Analysis (BIA).
The maximum acceptable downtime in case of a disaster will vary depending on the nature of the business and the financial impact of the downtime. Depending on the criticality of the data handled and the service rendered, the business continuity approaches cover a wide range of options. Most organizations today depend heavily on their IT infrastructure and their data in order to provide service to their customers, and the recovery objectives will drive the selection of the disaster recovery strategy and the cost of the IT infrastructure required to support it.

Cyber-infrastructure protection, including business continuity and disaster recovery, involves safeguarding and ensuring the reliability and availability of key information assets, including personal information of citizens, consumers and employees [13]. However, reliability and availability need to be backed by data consistency in order to provide proper recovery. Susanto [9] considers IT to be the most important issue of all when discussing BC and DR, not only because it is the foundation and backbone of the business but also because IT can play an important role in strategy development and in improving the efficiency of the whole BCP.

In today's complex business environment data and application consistency is becoming more and more important. As outlined in [14], managing a combined store consisting of database data and file data in a robust and consistent manner is a challenge for large scale software systems. In such hybrid systems, images, videos, engineering drawings, etc. are stored as files on a file server, while meta-data referencing/indexing such files is created and stored in a relational database to take advantage of efficient search. Consistency between database content and files is required for the application to function properly post recovery. Defining a Recovery Point and Recovery Time objective is often not enough to ensure successful recovery following a disaster event.
Recovery Consistency Characteristics (RCC), as well as Recovery Object Granularity (ROG) and Recovery Time Granularity (RTG) need to be assessed in order to discover risks to which the environment is exposed.
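
The hybrid database and file store described above can be checked mechanically after a restore. The following is a minimal sketch, not a method taken from [14]: the function name, the sample paths, and the decision to report files without meta-data as a separate finding are all assumptions made for illustration.

```python
def check_store_consistency(db_file_refs, files_on_server):
    """Compare file paths referenced by database meta-data with the files
    actually present on the file server after recovery."""
    referenced = set(db_file_refs)
    present = set(files_on_server)
    return {
        # Meta-data rows that point at files which were not recovered.
        "missing_files": sorted(referenced - present),
        # Recovered files that no meta-data row refers to.
        "orphan_files": sorted(present - referenced),
        "consistent": referenced == present,
    }


# Hypothetical post-recovery state: the database was restored to a later
# point in time than the file server.
report = check_store_consistency(
    db_file_refs=["img/001.png", "img/002.png", "cad/plan_a.dwg"],
    files_on_server=["img/001.png", "cad/plan_a.dwg", "tmp/old.mp4"],
)
print(report)
```

In this hypothetical run the database references a file that the file server no longer holds, which is the kind of mismatch that RPO and RTO alone would not reveal.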
2. Metrics that augment the Recovery Point and Recovery Time Objectives

Proper evaluation of a disaster recovery solution requires well defined metrics and risk assessment. The Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are usually driven by Service Level Agreements (SLA) that the organization is contractually or legally bound to. RTO defines the time required to recover the lost data, while RPO defines the potential loss of data (the time gap between the most recent data point that can be recovered and the disaster event). Even if RTO and RPO are enough to measure SLAs, these two metrics do not measure the overall consistency of the data or the risks to which the organization is exposed in case of a disaster event. Meeting the defined RTO and RPO doesn't mean that processing can be resumed. The recovered data may be inconsistent if components are recovered at different points in time. More comprehensive metrics are needed to assess the quality of the recovery plan.

The recovery metrics used by most organizations in their business continuity plans fall into three main categories:

1. Recovery Time Characteristics
1.1. Recovery Time Objective (RTO) is the main Recovery Time Characteristic defined for any DR solution; it defines how quickly service (data and application) is recovered following a disaster scenario.
1.2. Recovery Time Granularity (RTG) measures the time spacing between the points in time that can be selected as a recovery point. RTG defines a logical recovery point selection.
2. Recovery Data Characteristics
2.1. Recovery Point Objective (RPO) defines the time gap between the disaster event and the point in time to which data can be recovered. It is essentially a measurement of how much data (measured as the time span of lost updates) is estimated to be lost following a disaster.
2.2. Recovery Object Granularity (ROG) measures the granularity of the objects that a disaster recovery solution is capable of recovering.
3. Recovery Consistency Characteristics
3.1.
Recovery Consistency Objective (RCO) measures the usability of recovered data by the associated applications. RCO is defined as a percentage, evaluating the number of entities that are consistent after recovery.

RTG complements RTO and RPO in situations in which logical failures are encountered. For example, a data replication solution with a zero RPO and a well defined RTO will recover from a physical failure but not from a logical failure. Data corruption that is not detected in time will be replicated and will compromise the ability to recover. In such a situation, if no other way to recover exists, the RTG will be undermined, and another recovery solution needs to be put in place to provide a recovery point at a time prior to the disaster event. As a result, RTO will increase significantly.

The object granularity defined by ROG can be a storage volume, a file system, a database, a cluster package or service (including all associated storage), etc. Going below a volume or a file system in terms of granularity proves in most situations to be very expensive and requires labor-intensive manual intervention.

Measuring the Recovery Consistency Characteristics only in terms of RCO is often not enough, and the purpose of the next sections of this paper is to assess RCO and introduce new metrics to complement it.

3. Why RCO is not enough

Data is point in time consistent only if all of the interrelated data components are exactly as they were at a single instant in time. Disaster recovery plans usually define only recovery point and recovery time objectives. The Recovery Time Objective (RTO) is the duration of time within which a business process must be restored after a disaster in order to avoid unacceptable consequences associated with business continuity disruptions. The Recovery Point Objective (RPO) is the maximum tolerable period in which data may be lost. In many circumstances, the consistency of the data may be compromised even if the RPO and RTO are met.
In this context the introduction of the Recovery Consistency Objective (RCO) is necessary in order to evaluate the data consistency following recovery. RCO is defined as a percentage measuring the deviation between the actual and the targeted state of business data across systems. RCO is calculated as a percentage that relates the number of modules of the system that are consistent after recovery to the total number of modules of the system:

$$\mathrm{RCO} = \frac{c}{t} = \frac{t - i}{t}$$

c = Number of Consistent Modules
t = Total Number of Modules
i = Number of Inconsistent Modules

where $t = c + i$.

Even if the recovery point objective and recovery time objective are properly evaluated and can be met,
the system can be restored in an inconsistent state, and some of the applications may not be able to recover properly.

Let's consider as an example a complex system that spans multiple storage systems. If replication is synchronous for all storage systems, the alternate site will always be in sync with the main production site, and data and application consistency will be preserved. However, if the storage systems use different replication techniques, then the recovery point will be different for each of them. Even if the general RPO is preserved (every storage frame at the alternate site holds data no older than the general RPO allows), the difference between the recovery points of different storage devices may result in data inconsistency between applications. Modules may try to access data assuming that it is in sync and fail to find the records that are needed. Similar inconsistencies may occur when different replication techniques (synchronous and asynchronous) are used for data inside the same frame.

Depending on the criticality of the application and the way applications or modules are implemented, such application data inconsistencies (different recovery points) may prolong the recovery time and may require manual intervention. Evaluation and correction will take time and extend the recovery time beyond what the business can tolerate. The recovery consistency objective is very important in such situations and reflects the individual requirements of the corresponding business data for cross-system consistency.

An unplanned IT outage can equate to a disaster, depending on the scope and severity of the problem. RCO is more important than RTO and RPO in the context of BCP processes. RPO and RTO emphasize traditional IT Disaster Recovery Planning (DRP), while RCO goes beyond DRP. An important part of preparing for a disaster is to understand the type of risks your organization is facing.
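
The multi-storage scenario above can be made concrete with a short sketch. It assumes hypothetical frame names, timestamps, and a 30-minute RPO, and it computes RCO as consistent modules divided by total modules, as described in the text. None of the function names come from the paper.

```python
from datetime import datetime, timedelta


def rco(consistent_modules, total_modules):
    """RCO as a percentage: consistent modules relative to all modules."""
    if total_modules <= 0:
        raise ValueError("total_modules must be positive")
    return 100.0 * consistent_modules / total_modules


def recovery_point_report(disaster_time, recovery_points, rpo):
    """Check each storage system against the RPO and measure the skew
    between the earliest and latest recovery points."""
    meets_rpo = {
        name: disaster_time - point <= rpo
        for name, point in recovery_points.items()
    }
    skew = max(recovery_points.values()) - min(recovery_points.values())
    return meets_rpo, skew


# Hypothetical values chosen only for illustration.
disaster = datetime(2024, 1, 1, 12, 0)
points = {
    "frame_a": datetime(2024, 1, 1, 12, 0),   # synchronous replication
    "frame_b": datetime(2024, 1, 1, 11, 50),  # asynchronous, 10 minutes behind
    "frame_c": datetime(2024, 1, 1, 11, 35),  # periodic copy, 25 minutes behind
}
meets_rpo, skew = recovery_point_report(disaster, points, rpo=timedelta(minutes=30))

print(all(meets_rpo.values()))  # every frame is within the general RPO
print(skew)                     # yet the frames disagree with each other
print(rco(consistent_modules=8, total_modules=10))
```

Every frame satisfies the RPO on its own, but the 25-minute skew between frames is where cross-application inconsistency can arise.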
The risk of data inconsistency following disaster recovery is measured by RCO. The cost associated with creating and assuring availability for the enterprise rises dramatically as you approach the requirement for 100% availability. In certain contexts data inconsistency is acceptable as long as system availability is maintained and downtime is reduced. In other business environments data inconsistency may have staggering costs.

4. Improved consistency assessment metrics for disaster recovery

RCO measures the usability of recovered data by the associated applications. RCO calculated as described above does not take into consideration the interfaces/links between application modules. Let's assume for example that our system has 3 modules, and each module is consistent after recovery. In this example the RCO is 100%. However, if the recovery point in time differs between the 3 modules, some of the links between modules may be inconsistent.

In my view interfaces/links between modules of the same system need to be considered when calculating the overall consistency of the system following recovery. If all modules except one are recovered at the same point in time and one module at a different point in time, the interfaces between the module with the different recovery point and the others may be compromised.

A better way to assess the overall consistency of the system is to combine RCO with a consistency objective for module interfaces, the Recovery Interface Consistency Objective (RICO). The proposed RICO can likewise be calculated as a percentage, based on the following formula:

$$\mathrm{RICO} = \frac{c_i}{t_i} = \frac{t_i - i_i}{t_i}$$

$c_i$ = Number of Consistent Module Interfaces
$i_i$ = Number of Inconsistent Module Interfaces
$t_i$ = Total Number of Module Interfaces

It is more practical to merge RCO and RICO into a single consistency metric.
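
The three-module example can be sketched as follows. This is an illustration under stated assumptions: RICO is taken to be the share of consistent interfaces among all interfaces, by analogy with RCO, and an interface is treated as consistent only when both of its modules were recovered to the same point in time. The module names and timestamps are hypothetical.

```python
def rico(module_recovery_points, interfaces):
    """RICO as a percentage: consistent interfaces relative to all interfaces.

    Assumption: an interface is consistent only if both modules it connects
    share the same recovery point.
    """
    if not interfaces:
        raise ValueError("at least one interface is required")
    consistent = sum(
        1
        for a, b in interfaces
        if module_recovery_points[a] == module_recovery_points[b]
    )
    return 100.0 * consistent / len(interfaces)


# Each module is internally consistent (RCO = 100%), but one module was
# recovered to an earlier point in time than the other two.
points = {"orders": "12:00", "billing": "12:00", "inventory": "11:45"}
interfaces = [
    ("orders", "billing"),
    ("orders", "inventory"),
    ("billing", "inventory"),
]
print(rico(points, interfaces))
```

Only the interface between the two modules that share a recovery point counts as consistent here, so the interface consistency is about one third even though every module passes its own consistency check.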
The proposed measurement for recovery consistency is the Recovery Total Consistency Objective (RTCO), calculated based on the following formula:

$$\mathrm{RTCO} = \frac{n \cdot \mathrm{RCO} + m \cdot \mathrm{RICO}}{n + m}$$

where n and m are weight parameters that can be defined based on the number of modules and interfaces and the importance of interfaces vs. modules. RTCO covers both module consistency and interface consistency, providing a better measurement than RCO.

5. Disaster Recovery Tiers and Goals

The SHARE User Group established 7 tiers of Disaster Recovery solutions. Each of these tiers addresses different requirements and corresponds to a different set of disaster recovery goals. The table below provides disaster recovery goals estimated for each of
the 7 Disaster Recovery tiers. Understanding the 7 tiers of Disaster Recovery and the associated goals helps organizations evaluate the DR solution that they currently have in place and determine which level matches their business requirements.

Tier 0 - No offsite data

Tier 0 has no DR solution in place. There is no alternate location (hot site) available, no saved information, and no documentation or DR plans. Tier 0 offers no recovery options following a disaster. The ability to recover following a disaster is completely unpredictable and exposes the business to the risk of not being able to recover. No DR goals can be defined, and recovery time and recovery point are unpredictable. Even if backups are done, the solution is categorized as Tier 0 if the backups are stored at the same location as the production environment and no proper vaulting procedure is in place.

Tier 1 - Off-site vaulting

Tier 1 DR relies on backups that are stored at an offsite storage facility (vaulting). No alternate environment (location, hardware, etc.) is available in which to restore the data in case of a disaster. RCO depends on the consistency of the backups. Backups taken at different times for different application modules may render the application or some of its modules and interfaces inconsistent. RPO is defined by the frequency of the backups. RTO is very hard to define, as a new site needs to be built from scratch (location, infrastructure and equipment), and usually RTO is higher than a week. Recovery time depends on when hardware can be supplied or when a building for the new infrastructure can be located and prepared, and can take months.

Tier 2 - Offsite vaulting with a hot site

Tier 2 relies on backups sent offsite for recovery (the same as Tier 1) plus a hot site. An alternate location is available, and backups can be transported there from the offsite storage facility in the event of a disaster. The availability of a hot site reduces the recovery time.
RTO can be estimated and is lower than in Tier 1. No time is required to locate an alternate location or to purchase and install hardware. The RTO is driven by the time required to recall the backups to the hot site and load (restore) them. RCO and RPO are similar to those offered by Tier 1.

Tier 3 - Electronic Vaulting

Tier 3 includes everything offered by Tier 2 plus electronic vaulting of a subset of the critical data. Electronic vaulting requires communication lines between the 2 sites and the creation and transmission of backups (traditional backups or data replication) more frequently than in the regular backup process. The recovery time improves and can be as low as one day. The recovery point improves for the critical data that is electronically vaulted to the remote site. The recovery consistency objective is the same as in Tier 2: the solution continues to be exposed to the risk of inconsistency. There is no notable RCO improvement when compared with Tier 2.

Tier 4 - Electronic vaulting to secondary active site

Tier 4 consists of two data centers with electronic vaulting between the two sites. The secondary site is also active, and recovery can be bidirectional. The workload is shared between the two active sites and critical data is continuously transmitted between them, while the recovery of non-critical data continues to rely on off-site vaulting. Data loss is still possible in Tier 4, so the recovery point depends on the frequency with which the two sites are synchronized. The recovery time is lower, but the risk of inconsistency still exists between critical and non-critical data.

Tier 5 - Transaction integrity (two-site, two-phase commit)

Tier 5 maintains selected data in sync between the 2 sites. Transactions involving the selected critical set of data are committed at the same time at both locations (single commit scope).
Both the primary site and the secondary site need to be updated before the update request is considered successful. A high bandwidth connection is required between the two sites.
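
The rule that both sites must be updated before a request succeeds can be sketched as a two-phase commit over two in-memory stand-ins for the data centers. This is a minimal illustration, not a description of any particular product: the `Site` class, its methods, and the sample key are hypothetical, and real implementations also handle logging, timeouts, and coordinator failure.

```python
class Site:
    """Hypothetical in-memory stand-in for one data center."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.pending = {}     # updates staged in phase 1
        self.committed = {}   # updates made durable in phase 2

    def prepare(self, key, value):
        """Phase 1: stage the update and vote on whether it can be committed."""
        if not self.available:
            return False
        self.pending[key] = value
        return True

    def commit(self, key):
        """Phase 2: make the staged update durable."""
        self.committed[key] = self.pending.pop(key)

    def rollback(self, key):
        """Discard a staged update, if any."""
        self.pending.pop(key, None)


def two_site_update(primary, secondary, key, value):
    """Commit at both sites or at neither (single commit scope)."""
    sites = (primary, secondary)
    if all(site.prepare(key, value) for site in sites):
        for site in sites:
            site.commit(key)
        return True
    for site in sites:
        site.rollback(key)
    return False


primary = Site("primary")
secondary = Site("secondary")

ok = two_site_update(primary, secondary, "acct-42", 100)

secondary.available = False  # simulate losing the secondary site
failed = two_site_update(primary, secondary, "acct-42", 250)

print(ok, failed)
print(primary.committed, secondary.committed)
```

When the secondary site cannot vote, the primary discards its staged change, so the two sites never diverge on the protected data. The cost is that updates are refused while either site is unreachable.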
Recovery consistency is improved, but it is uncertain whether transaction integrity of the critical data is enough to make the whole system consistent.

Tier 6 - Zero or near-zero data loss

Tier 6 involves immediate transfer of data to the alternate site. Data replication is used in order to maintain the two locations in sync. The recovery time is very low, as is the recovery point (data is in sync, so the data loss is minimal or zero). There is no dependence on the applications or other resources to provide data consistency. Recovery consistency is assured. Data is considered lost only if a transaction has commenced but the request was not satisfied. This tier encompasses zero loss of data and almost immediate transfer to the secondary platform. In such a configuration RCO is usually 100%.

6. Conclusions

Disaster recovery architecture and plans are driven by business requirements, and consistency is often overlooked and not properly evaluated as a disaster recovery parameter. The goal of this paper was to introduce new metrics for consistency following disaster recovery. Interface consistency is introduced and evaluated in addition to module consistency. RICO and RTCO are proposed as alternate metrics to complement RCO and enhance the Recovery Consistency Characteristics. The disaster recovery goals (RPO, RTO and RCO) are evaluated in the context of the 7 tiers of disaster recovery defined by the SHARE User Group.

7. References

[1] Octavian Paul Rotaru, Architecting for Disaster Recovery A Practitioner View, WORLDCOMP, Proceedings of SAM 2011, July 2011.
[2] Cathy Warrick, John Sing, A Disaster Recovery Solution Selection Method, IBM RedBook, http://www.redbooks.ibm.com/redpapers/pdfs/redp3847.pdf, February 2004.
[3] C. Brooks, M. Bedernjak, I. Juran, J. Merryman, Disaster Recovery Strategies with Tivoli Storage Management, IBM RedBook, http://www.redbooks.ibm.com/redbooks/pdfs/sg246844.pdf, November 2002.
[4] Philip Clark, Contingency Planning and Strategies, Proceedings of InfoSecCD 2010, October 2010.
[5] Richard Cocchiara, Beyond disaster recovery: becoming a resilient business, IBM Global Services, ftp://ps.boulder.ibm.com/common/ssi/rep_wh/n/buw03014USEN/BUW03014USEN.PDF, June 2009.
[6] David Rudawitz, Enterprise Architecture and Disaster Recovery Planning On the way to an effective Business Continuity Planning Philosophy, Antervorte Consulting LLC, http://www.antevorte.com/whitepapers/enterprise_architecture_and_disaster_recovery_planning.pdf, November 2003.
[7] N. Arshad, D. Heimbigner, A. Wolf, Dealing with failures during failure recovery of distributed systems, Proceedings of DEAS 05, NY, USA, 2005.
[8] Y. Edmund Lien, P. J. Weinberger, Consistency, concurrency, and crash recovery, Proceedings of the 1978 ACM SIGMOD International Conference on Management of Data, NY, USA, 1978.
[9] Lukman Susanto, Business Continuity/Disaster Recovery Planning, 2003, http://www.susanto.id.au/papers/bcdrp10102003.asp
[10] U.S. Department of Commerce National Bureau of Standards, FIPS PUB 87 Federal Information Processing Standards Publication, Guidelines for ADP Contingency Planning, March 27, 1981.
[11] Geoffery Wold, Testing Disaster Recovery Plans, Disaster Recovery Journal, Vol. 3, No. 3, p. 34.
[12] Guy Witney Krocker, Disaster Recovery Testing: Cycle the Plan, Plan the Cycle, SANS Institute InfoSec Reading Room, 2002.
[13] Constantine Karbaliotis, Critical Interests: Business Continuity, Disaster Recovery and Privacy, Symantec, September 2009.
[14] Suparna Bhattacharya, C. Mohan, K. W. Brannon, I. Narang, Hui-I Hsiao, M. Subramanian, Coordinating backup/recovery and data consistency between database and file systems, Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data.