DR Drills Harbinger of IT Recovery Readiness Sanovi DRM, making IT Recovery Work


Disaster Recovery Management Software
Issue 2

DR Drills Harbinger of IT Recovery Readiness
Sanovi DRM, making IT Recovery Work

Contents
- Improve Your IT Disaster Recovery Plan, and Your Ability to Recover From Disaster
- Best Practices for Planning and Managing Disaster Recovery Testing
- Disaster Recovery Management (DRM), an essential building block for IT Recovery
- About Sanovi Technologies

Featuring research from Gartner

Dear Reader

Performing regular DR Drills is a key strategy to improve recovery readiness of your IT applications. In this newsletter we present two very revealing Gartner research reports. These reports cover technologies and processes that help make DR Drills more manageable and improve drill success rate, which in turn helps increase overall IT recovery readiness.

Working with IT teams across several organizations for the past several years, we have come across similar challenges in doing DR Drills. I would tabulate the top three to be:
- IT applications are complex, owing to large interdependencies, making manual recovery unwieldy and error prone.
- Dependence on people and expertise to do the right thing at the right time.
- Are my primary and DR systems in sync? How do I find out if my DR solution is adhering to set recovery SLAs?

There are capabilities that address the above challenges:
- Data dependency mapping technology provides tracking of service level dependencies.
- DR process automation technology mitigates dependence on people and eliminates operator errors.
- Validation/monitoring tools help check primary and DR environment equivalence and keep track of DR SLA metrics.

These are capabilities that will dramatically improve the overall recovery readiness of an organization. Sanovi DRM takes a life cycle approach that includes the above capabilities and offers a comprehensive way to manage DR.

As you read through the recent trends in DR Drills/testing in this newsletter, I trust you will consider how a life cycle approach to DR can add resiliency and recovery readiness to your IT.

Lakshman Narayanaswamy
Co-founder & VP Products
Sanovi Technologies

Refer to page 2 to read about Sanovi's inclusion in Gartner's research "Best Practices for Planning and Managing Disaster Recovery Testing."

From the Gartner Files: Improve Your IT Disaster Recovery Plan, and Your Ability to Recover From Disaster

Many organizations have inconsistent IT disaster recovery plans that vary in quality, scope and detail. We help disaster recovery and business continuity planners improve their IT disaster recovery plans, and their ability to recover from disaster, by outlining best practices for key problems.

Key Challenges
- Minor discrepancies, omissions and oversights in an organization's disaster recovery plan can have a major impact on the time required to recover from a disaster and the associated business impact.
- While most organizations claim to have some form of IT disaster recovery plan in place, there are wide-ranging differences in quality, scope and detail level from one plan to another.
- Respondents to the 2011 Gartner Risk Management Disciplines Survey were asked which types of disasters their organizations planned for. IT outage was ranked highest among the 13 categories, with 66% of respondents stating that they plan for IT outages.

Recommendations
- Organizations should focus their disaster recovery plans specifically on the recovery of IT services, and should clearly define the intended use and scope of the plan as a critical first step.
- Two to three senior executives in the organization should be authorized to make a disaster declaration, and only after specific criteria have been met to qualify the event as a disaster.
- Organizations should include the details of ongoing recovery operations and failback processes and procedures as highlighted sections in the disaster recovery plan.

Analysis
IT organizations spend considerable time and money developing and managing IT disaster recovery plans they hope will reduce downtime and minimize the business impact when a disaster arises. Although most large organizations claim to have some form of IT disaster recovery plan in place, based on the numerous plan reviews Gartner performs each year, there are significant differences in quality, scope and detail level from one plan to another. Disaster recovery plans should be specific enough to address the individual recovery requirements, technologies and processes of an organization. Although no two plans are exactly alike, there are certain issues all organizations should consider and missteps to avoid when developing their plans. Having a focused, detailed and well-organized disaster recovery plan can mean the difference between smooth recovery operations and chaos during a disaster. This research looks at common mistakes organizations make within their IT disaster recovery plans, and provides recommendations for improvement.

Define the Scope of the Plan
A common mistake organizations make when developing disaster recovery plans is not limiting their scope exclusively to the recovery of IT services. For example, some organizations include general business continuity requirements, which typically fall outside the purview of IT. Despite IT service recovery being a key part of overall business continuity, each department should have its own plan, coordinated at a high level, but managed and owned separately. Organizations should focus disaster recovery plans specifically on the recovery of IT services, and should clearly define the intended use and scope of the plan as a critical first step. This includes developing a concise statement about what's included and what's not, who the intended audience is and how the document should be used. The scope also should identify the specific locations, businesses, companies and functions covered by the recovery plan.

Note: Business continuity management (BCM) ensures business resilience before, during and after an operational disruption.
BCM includes supplier management, crisis management, emergency management, IT disaster recovery management (IT DRM), business recovery, contingency planning and preparedness.

Identify Key Terminology
Most disaster recovery plans reviewed by Gartner fail to include a formal glossary of key terminology and language. Because most recovery plans must address a wide variety of individuals with varying levels of knowledge from multiple internal and external organizations, an advanced understanding of language or terminology cannot be assumed. A well-defined and easily accessible glossary of key terms and phrases should be included in all disaster recovery plans. Establishing early in the recovery document a common language and terminology (including industry-specific terms, recovery terminology, commonly used acronyms, location and facility names, and abbreviations) helps minimize misinterpretations and potential mistakes.

DR Drills Harbinger of IT Recovery Readiness is published by Sanovi. Editorial content supplied by Sanovi is independent of Gartner analysis. All Gartner research is used with Gartner's permission, and was originally published as part of Gartner's syndicated research service available to all entitled Gartner clients. Gartner, Inc. and/or its affiliates. All rights reserved. The use of Gartner research in this publication does not indicate Gartner's endorsement of Client Name's products and/or strategies. Reproduction or distribution of this publication in any form without Gartner's prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. The opinions expressed herein are subject to change without notice. Although Gartner research may include a discussion of related legal issues, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner is a public company, and its shareholders may include firms and funds that have financial interests in entities covered in Gartner research. Gartner's Board of Directors may include senior managers of these firms or funds. Gartner research is produced independently by its research organization without input or influence from these firms, funds or their managers. For further information on the independence and integrity of Gartner research, see "Guiding Principles on Independence and Objectivity" on its website.

Make the Plan Easy to Use
Although it may seem a basic point, one constant with good disaster recovery plans is that they are well-organized, easily navigated and easy to use. Organizations often structure their recovery plans as novels instead of reference documents. Disaster recovery plans are rarely read from front to back, and are most likely to be used during a crisis, not as leisure reading beforehand. To improve effectiveness and ease of use, organizations should separate their disaster recovery plans into multiple, stand-alone sections or subdocuments. For example, a recovery planning section covers items such as methodologies, management and program goals, while a recovery operations section focuses on recovery processes and procedures. Target each section to the specific audience or individual role, and format and organize the plan for the targeted user and by content (see Table 1).

Reference Roles, Not Individuals' Names
Having an accurate and up-to-date recovery plan is critical for success. Unfortunately, it is not uncommon for recovery plans to be out of date. Organizations typically do not update their plans frequently enough to keep pace with the rate of personnel changes associated with the individuals who are assigned recovery responsibilities. This opens the door for tasks to be assigned to people who are no longer in the required role, have left the company or have changed their contact information. Avoid the use of individuals' names and contact information in the recovery document, and use roles and job titles instead. References to roles and job titles can be indexed against an appendix of individual names and contact information. This way, only the appendix needs to be updated on a regular basis, and this can be achieved automatically via standard HR reports.

Address Ongoing Recovery and Failback, as Well as Failover
Most disaster recovery plans Gartner reviews focus almost exclusively on failover processes and procedures.
These plans usually fail to include adequate levels of detail, if any details are addressed at all, on what should happen in operations after a disaster failover occurs, or on re-establishing production operations via failback. Ongoing recovery operations and failback procedures are almost as important as failover, and should be covered in detail in all disaster recovery plans. Organizations should ensure that disaster postmortem processes are established to understand the root cause of the disaster and how it impacted IT, and to assess recovery performance.

Consider the Types of Disasters to Plan For
What types of disasters should organizations plan for? Two common approaches to answering this question are:
- "One size fits all," where all types of disaster scenarios are treated the same
- Individual subplans to address a wide array of potential disaster scenarios

While there is no right answer, many recovery plans we review are overly general or too comprehensive and complex. Organizations should plan for disaster scenarios based on their ability to manage and benefit from including the various scenarios. Scenarios based on criteria such as notification time (e.g., a tornado warning is in effect starting tomorrow at 12 noon), type of disaster and potential business impact should be established only if material differences exist in the way the type of disaster is managed. Organizations should avoid planning for disasters that are highly unlikely to occur (e.g., a blizzard in the Caribbean). Figure 1 shows 2011 Gartner Risk Management Disciplines Survey respondents' answers to the question, "What disaster scenarios does your organization plan for in its business continuity management efforts?"

Maintain Version and Configuration Control
Maintaining consistency between production and recovery environments remains one of the biggest disaster recovery testing and exercising challenges organizations face.
While configuration and asset management tools can help, few organizations use them or other tools as part of ongoing disaster recovery plan updates. Establish formal processes via the use of management tools and libraries, or manually, to ensure that all hardware and software references in a disaster recovery plan are up to date, and represent actual production and recovery configurations. Specific version and patch-level details should be included for all hardware, software and OSs, and these should be updated on a regular basis. For example, it is insufficient to state "Windows 2000" in the recovery plan for a server running Windows 2000 Advanced Server Service Pack 4.

Table 1. Recovery Planning and Recovery Operation: Document Differences
Item | Recovery Planning | Recovery Operations
Target | IT leaders | IT operations
Formatting | Paragraphs and sections | Bulleted lists
Order | Varied | Sequential
Writing | Detailed | Straightforward and concise
Indexed | Not important | Highly important
Knowledge assumption | High | Low
Source: Gartner (June 2012)

Figure 1. Common Disasters Organizations Plan for in BCM Efforts
N = 159
Source: Gartner (June 2012)

Codify What Constitutes a Disaster
Defining what qualifies as a disaster and how it is declared are key considerations not covered by most recovery plans in adequate detail or focus. Yet, this is especially important, given the cost and potential level of disruption associated with declaring a disaster. Organizations must ensure that processes and safeguards are established and documented within the disaster recovery plan to protect against mistaken declarations. Two to three senior executives should be authorized to declare a disaster, and this should occur only after specific criteria have been met to qualify the event as a disaster. Similar processes and criteria should be established to declare the end of a disaster, and to initiate failback procedures.

Include Testing in the Disaster Recovery Plan
Disaster recovery testing is challenging and expensive, but is a critical component of disaster recovery preparedness. Given the time and money spent on disaster recovery testing, it is surprising we don't see it called out more regularly or covered in enough detail within disaster recovery plans. Testing should be a highlighted section of all disaster recovery plans, and should include specific details, such as when it is scheduled throughout the year, what types of tests are planned, which applications or business functions will be tested, and what testing processes and procedures should be followed. Besides physical recovery testing, organizations should establish a regular paper test schedule of when major reviews and walk-throughs of the recovery plan occur (see "Best Practices for Planning and Managing Disaster Recovery Testing").

Consider the Communication Infrastructure
The communication infrastructure is a top recovery priority for many organizations.
However, since it is not necessarily seen as an application or a business service, it is not always called out or prioritized appropriately within disaster recovery plans. The communication infrastructure should be considered a high-priority recovery function, and treated similarly to other mission-critical business services. This is especially important when business continuity functions such as an emergency response system might depend on the availability of the communication infrastructure for operation. Even for execution of the recovery plan, primary and alternative communication methods should be established and documented.

Evidence
This research is the result of over 40 disaster recovery document reviews and analyses, as well as direct discussions with Gartner clients regarding the creation and management of disaster recovery documents and plans.

Source: Gartner Research G , Kevin Knox, 4 June 2012
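
The recommendation above to record exact version and patch levels lends itself to a mechanical check. The following is a minimal sketch and not part of the Gartner research: it assumes the production and recovery inventories are available as simple dictionaries, and the component names and version strings are hypothetical. It reports every component whose recorded version differs between the two sites.

```python
def find_drift(production, recovery):
    """Return {component: (production_version, recovery_version)} for mismatches.

    A component that is missing from one site is reported with None on that side.
    """
    drift = {}
    for component in sorted(set(production) | set(recovery)):
        prod_version = production.get(component)
        dr_version = recovery.get(component)
        if prod_version != dr_version:
            drift[component] = (prod_version, dr_version)
    return drift


# Hypothetical inventories for one server at each site (illustrative values only).
production = {
    "os": "Windows 2000 Advanced Server Service Pack 4",
    "database": "9.2.0.8",
    "backup_agent": "6.5",
}
recovery = {
    "os": "Windows 2000 Advanced Server Service Pack 3",
    "database": "9.2.0.8",
}

drift = find_drift(production, recovery)
for component, (prod_version, dr_version) in drift.items():
    print(f"{component}: production={prod_version!r} recovery={dr_version!r}")
```

A report like this only helps if the inventories themselves are exact, which is why a bare "Windows 2000" entry would hide the service-pack mismatch shown here.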

From the Gartner Files: Best Practices for Planning and Managing Disaster Recovery Testing

Annual costs for disaster recovery testing can be as high as $150,000. Solutions for discovering and mapping software and data dependencies among Web-based applications are likely to become essential for DR testing/exercising, as part of an organization's best practices.

Overview
The time and resource costs of disaster recovery (DR) plan exercising, especially that which is supported by manual or semimanual processes, have become the most significant IT DR management (IT-DRM) pain point for many of Gartner's clients. Specific steps can be taken and technologies can be deployed to reduce recovery plan testing costs and complexities.

Key Findings
- The annual costs of DR testing can reach or exceed $150,000 for many Gartner clients. These costs could go even higher, as new business applications are rolled into production.
- Tools capable of discovering and mapping software and data dependencies between Web-based applications are likely to become essential for managing efficient and effective recovery testing/exercising.
- The need for more thorough business application inquiry and transaction testing will drive enterprises to assess organizational and test management consolidation and integration to more efficiently scale recovery testing in the future.

Recommendations
- Evaluate IT service dependency mapping technologies from vendors such as BMC Software (Tideway), CA Technologies, HP, IBM, Neebula, ServiceNow and VMware to assess the extent to which they can simplify the testing process and make it more reliable.
- Pilot software change management tools (from vendors such as BMC Software, CA Technologies, HP, IBM Maximo, SAP and ServiceNow) and procedures that have the potential to most effectively synchronize change implementation between primary production and secondary recovery data centers.
- Evaluate the possible savings that can be gained by consolidating the application testing resources, processes and tools used by the DR and quality assurance (QA) testing teams.

Strategic Planning Assumption
By the end of 2014, 15% of enterprises will have significantly reduced or eliminated traditional DR testing as a result of supporting more resilient IT operations.

Analysis
DR testing is critical for supporting business resiliency. However, as the scope of mission-critical business processes, applications and data increases, sustaining the quality and thoroughness of the test process can be a challenge. Gartner client recovery and continuity-specific inquiries indicate that many enterprises are now implementing new approaches for managing recovery exercising, mostly because of the increasing cost and logistical complexity of traditional approaches.

Gartner research shows the importance of effectively managing recovery exercising costs. In one study of the exercising costs of federal government agencies (see "Cost-Cutting IT: Should You Cut Back Your Disaster Recovery Exercise Spending?"), clients reported that IT-DRM annual exercise budget allocations ranged from $20,000 to more than $150,000, depending on the size, location, number of participants, scope of exercise and organizational structure of the governmental unit. Results from nongovernment client inquiries have shown that it isn't unusual for the annual cost of DR exercising to be between $75,000 and $150,000.

Gartner has identified some of the key reasons enterprises find DR testing increasingly difficult and/or costly:
- Increasingly complex dependencies: Web applications and services often have logically meshed relationships with, and dependencies on, other applications and data, some of which are often part of a lower recovery tier (see Table 1).

Table 1. Recovery Tiers
Tier | Service Levels
1 | 24/7 scheduled; 99.9% availability (less than 45 minutes/month); recovery time objective (RTO) = two to eight hours; recovery point objective (RPO) = four hours
2 | 24/6 3/4 scheduled; 99.5% availability (less than 3.5 hours per month); RTO = eight to 24 hours; RPO = four hours
3 | 18/7 scheduled; 99% availability (less than 5.5 hours per month); RTO = one to three days; RPO = one day
4 | 24/6 1/2 scheduled; 98% availability (less than hours per month); RTO = more than three days; RPO = one day
Source: Gartner (August 2011)

- Inconsistencies: These occur between the current state of the data center infrastructure, applications and data, and their state at the time of the last recovery test. This may affect the extent to which production applications and data can be successfully recovered, unless robust change and configuration management processes (and tools) are in place. For example, a monthly volume of even a few hundred changes to a data center's OS, middleware, applications or management agents can result in a difference of thousands of changes between the current production configuration and the production configuration at the time of the last recovery test.
- Lack of resources: With the increasingly complex scope of testing, enterprises rarely have adequate recovery testing resources to exercise all production application inquiries and transactions on a regular basis. Some organizations test only their most mission-critical applications. Others rotate testing among applications, while still others focus on systems that have failed previous tests. A frequent result is that lower-priority applications are tested far less frequently, and their recoverability is qualified as being on a "best effort" basis.

In light of these challenges, Gartner is increasingly seeing clients rethink their test strategies and implement a series of best practices.

Establishing a Minimum Acceptable Level of Recovery Testing
The 2011 Gartner Risk Management Survey shows that enterprises test recoverability, on average, once or twice a year. However, anecdotal evidence based on more than 3,000 DR-related Gartner client inquiries in a three-year period suggests that fewer and fewer of these live tests involve all production applications and data. Instead, tests are specific to an individual recovery tier (typically, the recovery tier corresponding to the most mission-critical applications) or include an affinity group of production applications that have related software and data dependencies.
This means that many organizations follow the 80/20 rule: 80% of the testing is done on the applications that are the most mission-critical (which are often 20% or less of the total number of production applications). Despite this data, however, you shouldn't completely ignore test procedures for less critical applications and data. Rather, IT must ensure the recovery of the business processes and supporting applications, the loss of which would cause the greatest loss of revenue, productivity or organizational reputation.

In terms of how often an organization should conduct testing, we offer the following baselines, again subject to your organization's special circumstances:
- Conduct live testing for Tier 1 and Tier 2 applications and data at least twice per year.
- Initiate more frequent (monthly, quarterly) manual or (ideally) automated testing on application affinity groups.
- Perform failover and failback testing during the same or separate planned downtime periods.
- Ensure that the required data restoration and application activation cycle times meet or beat the RTO and RPO targets.

Regardless of how you determine recovery tier definitions, it is important to begin thinking about how you can best test recoverability, especially for the most mission-critical application data. Test more frequently the related applications and data that support a smaller set of key business processes, and shift the testing focus to how IT can best meet or beat the associated recovery targets.

Pain Point Remediation Alternatives

Automated Dependency Mapping
The challenge of ensuring that all required software and data dependencies are addressed in a recovery configuration will become more complex, as new business applications that have been purchased, created by in-house development teams, or acquired through merger and acquisition (M&A) activity are turned over to production. Increasingly mature IT service dependency mapping tools can help.
These products, available from vendors such as BMC Software, CA Technologies, HP and IBM, enable IT organizations to discover, document and track relationships by mapping dependencies among the infrastructure components, such as servers, networks, storage and applications, that form an IT service (see "IT Service Dependency Mapping Tools: Market Dynamics Update"). These tools are used primarily for applications, servers and databases; however, a few discover network devices (such as switches and routers), mainframe-unique attributes and virtual infrastructures, thereby presenting a complete service map. Although these tools are often bought in conjunction with configuration management database (CMDB) projects, we have seen a significant increase in their acquisition and use for data center-specific projects, such as IT-DRM modernization and data center consolidation.

Data dependency mapping products from 21st Century Software, AppAssure, Bocada, Continuity Software, InMage and Sanovi are software products that provide automated data, metadata and index consistency assurance between production files and databases and their replicas that are maintained at one or more recovery sites. Background software agents determine and report on the likelihood of achieving specified recovery targets, based on analyzing and correlating data from applications, databases, clusters, OSs, virtual systems, networking and storage replication mechanisms. These products perform their consistency checking on data located on direct-attached storage (DAS), storage-area-network (SAN)-connected storage or network-attached storage (NAS) at the primary production and secondary recovery data centers.

Synchronizing Distributed Change
Ensuring 100% change consistency between the production data center configuration, applications and data and their recovery data center counterparts is a challenging task. At a minimum, the recovery infrastructure at the secondary site must be dedicated, although this may not be the case for the recovery facility itself. Typically, asynchronous data replication (either host- or storage controller-based) and server virtualization are used to support a partial or full development and testing configuration that is used by in-house application development, support and testing teams during normal production hours. In this scenario, synchronizing changes between the primary production and development and test (which can or might support recovery) configurations is typically managed by the development and testing teams, in conjunction with operations support. This may involve the automated replication of updated production virtual server images to the secondary configuration, in parallel or in tandem with production data replication.

Several product options support virtual server replication, including offerings from such vendors as Acronis, Asigra, Atempo, BakBone Software, CA Technologies, CommVault, Double-Take Software, EMC, FalconStor Software, HP, i365, IBM, InMage, Microsoft, NetApp, Novell, PHD Virtual, Quest Software, Symantec, Syncsort and Veeam. However, for recovery configurations that include a mix of physical and virtual servers, as well as a combination of shrink-wrapped and in-house-developed applications, the use of IT process automation tools that orchestrate infrastructure configuration, provisioning and change updating is likely to be required. (Further information on the current state of IT process automation, change and configuration management can be found in "Hype Cycle for IT Operations Management, ")

Consolidating Testing Personnel, Tools and Skill Sets
One approach that has met with some client success is consolidating what were previously separate QA and recovery testing teams into a single organization.
Organizational consolidation, together with the consolidation and standardization of testing platforms and scripts, is an approach that can be used to support preproduction turnover regression testing, as well as ongoing DR testing. Organizations that implemented this approach did so to address a lack of recovery testing breadth and depth. Given the increasing numbers of mission-critical applications requiring recovery, as well as the related numbers of inquiries and transactions, it became clear that manual or semimanual testing processes could only provide limited recovery assurance. This was because the extent to which a full set of production inquiries and transactions could be consistently exercised by the recovery exercising team was limited by testing time constraints.

In one specific instance, a recovery team was able to meet the required RTO and RPO targets for the most mission-critical applications, but the recovery of the production environment, as perceived by the business unit end users, was short-lived, because undiscovered (and, therefore, unaddressed) software and data dependencies resulted in several inquiries and transactions prematurely aborting or incurring unacceptably long response times. The net result was that the recovery team won the battle by supporting the required RTOs and RPOs, but lost the war, because the usability and effectiveness of the recovery operations configuration was limited.

A new approach was needed that could not only improve the breadth and depth of application testing coverage, but could increase the efficiency and effectiveness of recovery exercising as a whole. Following an assessment of the technical benefits and cost savings that could result from a merger of the internal QA and the DR testing teams, a decision was made to consolidate them into a single organization and to standardize the management and automation of test processes by leveraging many of the tools, scripts and staff resources that were already in place.

The benefits that have been realized by some of the early adopters of this approach include increasingly reliable and more-effective test exercises, combined with more-thorough testing of representative production inquiries and transactions against the recovery configuration. The latter improves the likelihood that recovery operations can be initiated within required RTO and RPO targets, and ensures more stable recovery operations.

Summary
IT-DRM managers may recognize one or more of these approaches as potentially adding value to their IT-DRM programs. Regardless of which side of the issue you see your organization leaning toward, it is important to consider the key technologies your organization uses, because, for many organizations, the use of more traditional recovery testing and technology that helps manage more sustained availability may not be so much a case of "either/or" in the next five years, but rather a case of "and."

Source: Gartner Research G , John Morency, 16 August
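
The Recovery Tiers table in this report pairs each availability percentage with a monthly downtime allowance. As a rough illustration of that arithmetic (a sketch under stated assumptions, not Gartner's method), the function below assumes a 30-day month and reads "24/6 3/4 scheduled" as 24 hours a day for 6.75 days a week; both readings are assumptions, and the function name is hypothetical.

```python
def allowed_downtime_hours(availability_pct, hours_per_day=24.0,
                           days_per_week=7.0, days_per_month=30.0):
    """Hours of unplanned downtime per month that still meet the availability target.

    Only scheduled service hours count, so a service scheduled for fewer
    hours has a smaller downtime budget at the same availability percentage.
    """
    scheduled_hours = hours_per_day * days_per_month * (days_per_week / 7.0)
    return scheduled_hours * (1.0 - availability_pct / 100.0)


# Tiers 1 to 3 from the table (assumed 30-day month).
tier1 = allowed_downtime_hours(99.9)                      # 24/7 scheduled
tier2 = allowed_downtime_hours(99.5, days_per_week=6.75)  # 24/6 3/4 scheduled
tier3 = allowed_downtime_hours(99.0, hours_per_day=18.0)  # 18/7 scheduled

print(f"Tier 1: {tier1 * 60:.1f} minutes per month")
print(f"Tier 2: {tier2:.2f} hours per month")
print(f"Tier 3: {tier3:.2f} hours per month")
```

Under these assumptions the three results fall just inside the bounds the table states for Tiers 1 to 3 (45 minutes, 3.5 hours and 5.5 hours per month).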

Disaster Recovery Management (DRM), an essential building block for IT Recovery

Introduction
Disaster Recovery Management provides a systematic lifecycle approach using tools and best practices to monitor and manage IT Disaster Recovery. A typical DR solution has several subsystems and logical relationships. The subsystems include servers, applications, data replication, networks and storage across the primary data center and DR site. Logical relationships include order of recovery, interdependency between components and actions required to recover a subsystem. Disaster Recovery Management encompasses all these subsystems and relationships, and provides orchestration of IT system recovery.

How confident are you with your organization's IT recovery?
There are common myths about IT recovery readiness. One of the biggest misconceptions is that if data replication is in place, then the application is recovery ready. Data replication is only one of the important ingredients of recovery readiness. The others include process, people and integration with technology subsystems on primary and DR site. All of these are required for predictable recovery. To assess your organization's recovery readiness, answer the following questions:
1. Have you defined Recovery Point and Recovery Time Objectives for all your critical applications?
2. When was the last DR drill? Was it successful?
3. When doing a DR drill, have you found the run book to be out of sync with the current configuration?
4. Have you been delaying DR drills due to insufficient resources or due to worries about impact to production?
5. While it is common to perform a DR drill for one, two or three applications, have you performed a DR drill of several applications together, which is what will be required in the event of a larger outage?
6. Does management get a weekly report on application recovery readiness status and recovery SLA report?
7. Do you have the reports and evidence to show to audit and regulators about your applications' DR capability?
8. If your Data Base Administrator quits, will your DR work?

In seeking answers to the above questions, we encourage you to think about your Disaster Recovery capabilities and highlight some of the challenges of having an IT Disaster Recovery solution that will work when required.

DR Challenges
Businesses are impacted when critical applications are not available and cannot be recovered within the set recovery SLAs. Major challenges faced by organizations and their IT groups are:
- Production Downtime: Manual drills and unpredictable outcomes cause critical applications to be down, impacting business.
- Deployment & Operational Cost: When every deployment of DR becomes a professional services engagement, cost and project times escalate.
- Lack of visibility into DR SLAs: When management and IT operations do not know if their recovery solutions are meeting service levels, it leads to lack of confidence and reduced ROI.
- Manual Operations: Being dependent on people to execute recovery steps at the time of crisis exposes the business to more risks. People tend to make more mistakes when performing in a crisis situation.
- Need for DR Expertise: A typical enterprise uses heterogeneous technologies. Without a single dashboard to monitor and automate recovery and drill steps, the organization is dependent on various technology experts being available to recover its systems.

Disaster Recovery Management: a comprehensive approach to IT Recovery
IT Disaster Recovery Management (IT-DRM) is an emerging discipline that enables IT to meet business-set recovery objectives. Without IT-DRM, IT recovery is largely manual and expertise dependent, with little visibility into how well recovery service levels are being met.

Gartner says: "The potential business impact of this emerging technology is high, reducing the amount of spare infrastructure needed to ensure HA, as well as helping to ensure that recovery policies work when failures occur." [1]

IT DR lifecycle

A DR solution for an application must be designed to meet key DR metrics. We briefly review the key DR metrics and DR processes that must be covered by a DR solution.

Key DR Metrics

Recovery Point Objective (RPO): The amount of application data, measured in time, that an organization can afford to lose before the loss adversely impacts the business. For example, a bank that cannot afford to lose any data for its ATM application has an RPO of zero.

Recovery Time Objective (RTO): The amount of time an application can be down before its non-availability impacts business. For example, an application with an RTO of two hours must be recovered in under two hours after it becomes unavailable due to an outage.

Data Lag: The amount of data by which the DR site is behind the primary production site. The unit of data lag depends on the technology deployed to replicate data and is usually measured in MB or number of files.

DR Processes

There are several processes in an IT Disaster Recovery solution that must be thought out and designed for. There must be a run book that has a series of steps for each DR process.

Provision: Deploy a best-practice DR solution for the application, using the best DR infrastructure and the best-practice procedures recommended by application vendors.

Monitoring: Perform real-time monitoring of DR metrics to ensure the objectives are met and the DR systems are healthy and ready to go.

Validation: Perform daily or weekly configuration checks to ensure the DR systems are up to date with production systems with regard to ongoing change management updates.

Test/DR Drills: Perform quarterly or half-yearly DR drills, including Switchover and Switchback of the application at the DR site, and validate DR readiness. Switchover is when production is brought down and services are made available from the DR site. The business user typically tests the application that has come up on the DR site. The Switchback process moves services back to production, and the Normal Copy process resumes.

Failover Recovery: Document and automate the application recovery steps, including Failover and Fallback procedures for different scenarios. Recover the applications within the Recovery Time Objective when invoked under crisis. When an outage occurs on production, the Failover process is invoked to recover services on the DR site. After the cause of the outage has been rectified, the Fallback process covers the steps to move services back to the production site.

Reports: Furnish yearly audit and compliance reports on DR drills and other DR activities to meet regulatory requirements. Furnish weekly reports on DR status to management and application owners in the organization.

IT-DRM software must offer capabilities to monitor and report on DR metrics as well as provide automation of all of the DR processes. The solution must offer the following capabilities:

Monitoring and validation of recovery service levels

Recovery Point and Recovery Time are metrics that need to be monitored for a DR solution. Real-time monitoring of RPO and RTO helps verify that applications are meeting their recovery objectives.

[1] Gartner Inc., Hype Cycle for Business Continuity Management and IT Disaster Recovery Management, 2011, 20 July 2011, G
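As a rough illustration of monitoring recovery service levels, the Python sketch below compares a measured data lag and a measured recovery time against an application's RPO and RTO. It is a minimal sketch under stated assumptions, not a description of how any DRM product works: the names `RecoveryObjectives` and `check_readiness` are hypothetical, and data lag is assumed to have already been converted to time rather than MB or number of files.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Business-set targets for one application (hypothetical structure)."""
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable downtime


def check_readiness(objectives, data_lag, measured_recovery_time):
    """Return a list of SLA deviation messages; an empty list means none.

    data_lag: how far the DR copy is behind production (assumed to be time).
    measured_recovery_time: recovery duration observed in the latest drill.
    """
    deviations = []
    if data_lag > objectives.rpo:
        deviations.append(
            f"RPO deviation: data lag {data_lag} exceeds RPO {objectives.rpo}"
        )
    if measured_recovery_time > objectives.rto:
        deviations.append(
            f"RTO deviation: recovery took {measured_recovery_time}, "
            f"RTO is {objectives.rto}"
        )
    return deviations


# Example values are illustrative only. The two-hour RTO mirrors the
# example in the text; the 15-minute RPO is an assumption.
example = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=2))
for message in check_readiness(example, timedelta(minutes=20), timedelta(minutes=90)):
    print(message)
```

With an RPO of zero, as in the ATM example, any non-zero data lag is reported as a deviation, which is why such applications typically depend on synchronous replication.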

Automation of Failover and DR Drill processes

In the lifecycle of a DR solution there are several stages, each requiring several steps to be performed. Failover is a series of steps that brings up the application on the DR site when the primary is down. Switchover is a series of steps that shuts down the primary and brings up the DR site in a planned manner. Automating these steps makes the DR process predictable and reliable.

Unified management approach that takes an application view of recovery

Application recovery requires that the various components, including operating system, network, storage, data protection and applications, be recovery ready. A unified approach helps interface with and manage a complete view that includes event management across the stack.

Analytics & Reporting for compliance and regulatory purposes

Regulatory authorities require evidence of control that demonstrates that drills have been conducted. RPO and RTO trending reports help IT managers identify saturation of resources such as network bandwidth and draw focus to recovery steps that are time consuming.

Sanovi Disaster Recovery Management

Sanovi DRM takes a comprehensive view of the various DR processes and enables DR monitoring, reporting, testing and workflow automation of complex IT infrastructure and applications. The Sanovi DRM suite offers a unified disaster recovery management class of product that delivers real-time DR readiness validation with clear business and operational advantages. Sanovi DRM layers on top of existing DR infrastructure to provide DR management capability and ensures customers' DR investments are protected. Sanovi DRM interoperates with leading platforms including Microsoft Windows, IBM AIX, HP-UX, Oracle Solaris and several Linux flavors. Various virtualization platforms and leading data replication technologies from EMC, Hitachi, HP, Oracle, Microsoft and Symantec are also supported.
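The idea of failover as a series of steps that must run in a dependable order can be sketched as follows. This is an illustration only: the step names and the dependency map are invented for the example, the actions are placeholders supplied by the caller, and nothing here reflects the internals of Sanovi DRM. The ordering relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical failover steps, each mapped to the steps that must finish first.
FAILOVER_DEPENDENCIES = {
    "stop_replication": set(),
    "mount_dr_storage": {"stop_replication"},
    "start_database": {"mount_dr_storage"},
    "start_application": {"start_database"},
    "redirect_network": {"start_application"},
}


def run_workflow(dependencies, actions):
    """Run each step after its prerequisites and stop at the first failure.

    dependencies: maps a step name to the set of steps it depends on.
    actions: maps a step name to a callable that returns True on success.
    Returns the steps that completed successfully, in execution order.
    """
    completed = []
    for step in TopologicalSorter(dependencies).static_order():
        if not actions[step]():
            break  # a failed step halts the workflow for operator attention
        completed.append(step)
    return completed


# Placeholder actions that always succeed; real actions would call out to
# storage, database and network tooling.
placeholder_actions = {step: (lambda: True) for step in FAILOVER_DEPENDENCIES}
print(run_workflow(FAILOVER_DEPENDENCIES, placeholder_actions))
```

Declaring dependencies rather than a fixed script order means a new component can be added by stating what it depends on, and a cyclic definition is rejected by `TopologicalSorter` instead of failing midway through a recovery.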
Sanovi DRM Recovery Monitor provides monitoring that includes alerts on DR solution health, current RPO, exception reporting and policy-driven actions for real-time DR readiness validation. Sanovi DRM Recovery Monitor offers:

- Dashboard view of application recovery solution and health
- Real-time RPO/RTO monitoring
- Replication monitoring
- Event alerts and policy-based corrective responses
- Application / Database environment monitoring to identify change

Sanovi DRM Recovery Manager offers out-of-box DR solutions and automation based on industry best practices. Whether you are deploying DR solutions for the first time or looking to automate time-consuming manual execution, Recovery Manager offers powerful automation. It provides automation of failover and fallback recovery at the secondary site.

Recovery Manager is built upon a powerful automation engine that understands and orchestrates the dependencies required for successful recovery of applications. A central web-based console offers an easy way to coordinate execution and track the status of recovery actions as they execute.

Sanovi DRM Drill Manager offers comprehensive automation of DR drill workflows that dramatically reduces the time and expertise required to automate your drill run book. Sanovi's Recovery Automation Library (RAL) is a collection of recovery actions. By using RAL actions, complex drill workflows can be built or customized without the need for hand-crafted scripting and DR expertise. The web-based workflow execution console enables coordination and tracking amongst remote teams when executing drills.

Sanovi DRM Reporting offers extensive reporting and analytics on various aspects of DR. Reports for RPO, RTO, DR drill, replication and workflow execution details are available. Advanced Reporter can readily interface with standard reporting packages in the industry, enabling users to generate customized reports.

- Web-based console for easy collaboration and tracking of drill status
- Out-of-box switchover and switchback workflows

- Recovery Automation Library for building recovery workflows
- Built-in granular non-intrusive tests
- Drill reports and analytics on drill performance

Sanovi Replicator: Keeping Application Environment in Sync

Sanovi DRM offers integrated, easy-to-deploy replication that is suited to keeping application environments in sync. Application environments, including the folders and configuration files required for the application to recover on the DR side, can be replicated when they change on the primary.

Success story: HDFC Bank

HDFC Bank is one of India's largest financial institutions. It has a network of 1412 branches and 2890 ATMs over 528 cities throughout India. The bank offers services in wholesale, retail and treasury banking and has current assets of over Rs 45 billion. The bank has been named amongst the top 50 companies in Asia by Forbes magazine.

The bank has its primary data centre in Mumbai and its remote data centre in Bangalore. The application environment is a heterogeneous mix of Unix and Windows platforms, SAN-based storage and three-tier application architecture. Synchronous block-based replication is deployed to a near site and asynchronous replication to the remote site. File-based replication is used for applications with a recovery point of more than sixty minutes. Oracle, Sybase and MSSQL databases are in use for various applications.

The bank has key processes in place to effect change management on the primary and remote systems. These were largely manual processes that required project management and coordination amongst the various stakeholders. DR drills are the surest way to check for application recovery readiness. The challenge in doing DR drills more frequently is the time and resources required to prepare and execute a drill. Starting with the recovery workflow library gave the bank the head start that was key to meeting the DR solution deployment project timelines.
Every supported DR solution signature in the product ships with a fail-over workflow for application recovery and with switchover and switchback workflows for DR drills. Using these workflows ensured that the solutions deployed by the bank followed industry best practices and that the recovery automation had met software quality control metrics. The bank used this opportunity to review its current solutions and processes. Further, custom business processes specific to the bank were easily added by extending the out-of-box workflows.

"Application fail-over that would take two hours earlier now happens in twenty minutes using Sanovi's DR automation engine, our recovery confidence level is phenomenal. We are no longer dependent on the right person being available to recover the applications, it happens at the press of a button."

The bank has measured the following business benefits of using Sanovi DRM.

Productivity
- Over 85% reduction in application fail-over time
- 100% increase in frequency of DR drills
- Over 75% reduction in the time required for DR drill preparation & validation

Operational Efficiency
- Real-time alerts on RPO and RTO deviations of critical applications
- Fivefold increase in the number of DR solutions deployed with no increase in staff

Business Benefits

Sanovi DRM enables the following business benefits:

Reduce Business Exposure to IT Outages

Non-availability of IT applications poses a huge business risk. Sanovi DRM dramatically reduces this exposure by ensuring IT recovery readiness. Sanovi's software enables IT to recover applications within business-set recovery objectives, thus minimizing the impact of IT outages.

Achieve Higher Operational Efficiency

"Do it faster. Do it cheaper. Do more of it" is the IT mantra. Sanovi DRM can help realize an efficiency gain of over 40% by enabling DR experts and IT operations to deploy, test and recover IT applications in a scalable manner.
As new technologies are adopted to keep up with growing business needs, Sanovi's software masks complexity while providing familiar metrics and operational procedures.

Adopt Industry Best Practices

Successful recovery is the coming together of people, process and technologies. Sanovi DRM implements DR best practices that are aligned to ITIL service delivery practices. By deploying Sanovi DRM, your IT recovery process easily integrates with the best practices of incident, configuration, change, audit and service level management processes.

Enable an Agile Organization

IT automation is the cornerstone of an agile IT organization. Sanovi DRM delivers out-of-box monitoring and recovery automation over heterogeneous platforms and applications from a single console. No longer is DR only for outages caused by acts of God. The new agility enables you to switch to DR with confidence and reduce the impact of planned and unplanned downtime.

Financial Sense in Disaster Recovery Management

The primary reason organizations deploy Disaster Recovery solutions is to reduce the financial impact of IT outages. The table below demonstrates how DR Management software helps reduce DR operational costs while increasing IT availability through enhanced IT recovery readiness.

DR automation drives the Return on Investment of DR Management software in two key areas. First, DR automation enables scaling of DR drills as more applications need to be tested at regular intervals. Second, with single-button failover recovery of applications, the confidence to invoke an application on the DR site increases dramatically, resulting in reduced downtime and higher utilization of DR assets. The chart below graphs how DR automation scales as the number of applications increases without an increase in the number of DR/IT personnel.

Cost of doing DR Operations: Traditional approach vs. DR Management Software

Personnel required for DR: The DR team works on DR strategy, plan readiness, coordination amongst various teams and DR readiness reporting. The DR team size does not have to increase as more applications are brought under DR. A central web-based console for coordination, DR dashboards with real-time status on DR readiness and inbuilt DR best practices help the DR team manage more applications.

DR Process: IT processes for change, policy and backup management that are implemented on the primary must also be implemented on the DR site. Event monitoring, exception reporting, SLA compliance and analytics are part of the DR process. Real-time monitoring of DR health and data replication, validation of primary and DR environment equivalence, and exception reporting when DR SLAs are not being met drive down operational cost while increasing DR readiness.

DR automation: Automation of the steps involved in doing DR drills and failover recovery. Customers using DR automation have reported up to a 75% reduction in the people cost of doing DR drills and failover recovery. Further, the time to recover applications falls because manual intervention is reduced; customers have reported up to a 90% reduction in the time to do DR drills. All of this is achieved while improving collaboration and communication amongst the various teams.

Sanovi customers are realizing up to 75% reduction in DR operations and over 60% reduction in the number of personnel and resources required to do DR drills.

Source: Sanovi

About Sanovi Technologies

Sanovi helps organizations across the globe proactively manage disaster recovery (DR) environments and assure business managers that applications can be recovered in compliance with service level agreements. The Sanovi DR Management Suite is a comprehensive family of enterprise-class DR management software solutions for validating, monitoring, testing and automating recovery. For more information visit