Automated Disaster Recovery With BMC Atrium Orchestrator




BEST PRACTICES WHITE PAPER

Applying the capabilities of IT Process Automation to help meet the daily challenges faced by Disaster Recovery / IT Service Continuity Professionals

Contents

Introduction
The Challenges
Automation Drives Business Recovery
Modular Approach to Disaster Recovery Automation
Fast and Efficient Communication
The Daily Disasters
Loss of a Data Center: Active-Active
Loss of a Data Center: Active-Passive
Benefits and Showing Value
Cost of Downtime
Cost of Hardware
Cost of Staff
Risk of disaster recovery plan drift
Risk of infrequent testing
The Complete BMC Solution
Conclusions

Introduction

Events such as 9/11, Hurricane Katrina, and other recent, high-profile disasters have brought home the devastation that can strike any of us unexpectedly, at a moment's notice. And while the concern for human tragedy will always remain the central one during such times, there is a growing understanding of and concern for the impact on businesses, as well. Indeed, as technology has become more important to the delivery of core business services, planning and preparing for disaster scenarios has taken on an essential role in ensuring continuity of those services.

Disruption of business services, whether the result of minor technology failures that occur daily or of full-scale disasters, can be highly detrimental to a company's financials, as well as to its reputation. Once you acknowledge the value technology has to your organization, you must also consider the consequences if that technology becomes temporarily unavailable or is more severely impaired by catastrophic failure.

Business continuity planning is used to identify needs, analyze consequences, and develop recovery strategies specifically designed to ensure operational continuity at a minimum acceptable level in the event of a disruption to the business. Such events come in a variety of shapes and sizes, but as the old adage goes, they always tend to come at the most unexpected and inopportune times. The full impact of such events on production services is, as a rule, directly dependent on the speed and efficiency with which IT Operations can detect a disruption, triage the impact on services, and then execute the recovery process.

The Challenges

The common reality across IT is that, even today, detection is left to existing piece-part monitoring tools, and the overall Disaster Recovery Lifecycle is managed in an ad hoc fashion, with procedures developed manually on the fly.
This often results in less-than-optimal recovery times, which directly affect an organization's bottom line. In some cases, the disaster recovery process may involve automated scripts that are triggered manually, in conjunction with step-by-step instructions codified in procedural documents. Even so, the execution of disaster recovery processes is highly manual, difficult to coordinate, and cumbersome.

Business continuity managers also face challenges around testing their disaster recovery plans and procedures and keeping them current. Although many business continuity managers now sit on change review boards, it is not uncommon for some changes to slip through the net, rendering the disaster recovery plan out of date and ineffective. This disaster recovery plan drift problem is exacerbated by the infrequency with which testing is done. Most organizations test their disaster recovery plans only once a year (if at all!), and this is usually a very expensive operation requiring tens if not hundreds of skilled IT staff working over nights and weekends.

Automation Drives Business Recovery

BMC Atrium Orchestrator is an enterprise-class automation platform that speeds and simplifies the process of developing and maintaining business recovery scenarios and, most importantly, restoring service. BMC Atrium Orchestrator automates any set of repeatable manual tasks and scripts as a workflow, ensuring speed, accuracy, and consistency of execution. Through the BMC Atrium Orchestrator Development Studio, business recovery scenarios can be rendered, maintained, and executed as repeatable workflows, all from a single location that is accessible globally.
Whether recovery scenarios call for multiple database restores, porting multiple applications to backup data center servers, reconfiguring SAN-based storage, or simply notifying large numbers of people quickly, a BMC Atrium Orchestrator workflow can be built and executed to automate the required tasks.

To provide a more complete solution, BMC Atrium Orchestrator also leverages the capabilities provided by other enterprise management solutions. For example, event management solutions have become much more effective at identifying technical problems and associating them with impacts on business services. This maturity enables rapid identification of service-impacting problems, which the automation tool can then pick up as a trigger to quickly initiate a disaster recovery plan.

As an example, the BMC ProactiveNet Performance Management solution provides a real-time view of available business services and the relative priority and importance of those services to the business. The underlying service models can accurately assess the impact of any supporting technical component on the overall availability and performance of a business service. An additional benefit of the solution's service impact management functionality is that the encapsulated service models are generated from information within the BMC Atrium CMDB. So long as the information in the CMDB is current (through proper change processes), the service models are also automatically kept current, which helps alleviate some of the issues associated with disaster recovery plan drift.

Figure 1: BMC ProactiveNet Performance Management Business Service Impacted

BMC Atrium Orchestrator complements and easily links to other existing management systems, IT service management solutions, and individual infrastructure devices, enabling centralized control and visibility across an entire technology infrastructure. BMC Atrium Orchestrator can execute recovery scenarios in a fully automated mode or a semi-automated mode (e.g., generating change requests or e-mails and gathering authorizations before initiating a disaster recovery plan), or it can be used as a centralized interface from which an operator runs individual recovery scenarios as necessary.

Modular Approach to Disaster Recovery Automation

Fully automating all disaster recovery processes is not something that tends to happen overnight, so an incremental approach is required.
There are various levels of disaster recovery protection that provide building blocks to help form a more complete end-to-end solution.
»» Component Protection: Most hardware (compute, network, and storage) provides some form of high-availability or clustering technology, which is the first line of defense against failures.
»» Application / Business Service Protection: Business services can be monitored and managed as a logical entity. Each business service can have its own set of recovery processes that can be initiated in isolation from any other business service.
»» Site / Data Center Protection: In extreme circumstances, the loss of a location is a risk that organizations need to protect themselves against. In these cases, a much more complex and coordinated plan is required, in which multiple business services are prioritized and recovered to a remote site.
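The building-block idea above can be illustrated with a short Python sketch. It is not Atrium Orchestrator code: the `ServiceRecovery` class, the `recover_site` function, and the service names are hypothetical, and each recovery step is a placeholder that only returns a description. The sketch assumes that a site-level plan is simply the per-service procedures run in priority order.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ServiceRecovery:
    """Recovery procedure for one business service (hypothetical structure)."""
    name: str
    priority: int  # lower number = recovered first
    steps: List[Callable[[], str]] = field(default_factory=list)

    def run(self) -> List[str]:
        # A service can be recovered in isolation from any other service.
        return [step() for step in self.steps]


def recover_site(services: List[ServiceRecovery]) -> List[str]:
    """Site-level plan: run every service procedure in priority order."""
    log: List[str] = []
    for svc in sorted(services, key=lambda s: s.priority):
        log.append(f"recovering {svc.name}")
        log.extend(svc.run())
    return log


# Placeholder steps; a real workflow would call out to infrastructure tools.
trading = ServiceRecovery(
    "trading", 1,
    [lambda: "restore trading db", lambda: "restart trading app"],
)
email = ServiceRecovery("email", 2, [lambda: "restart mail servers"])

print(trading.run())                   # service-level recovery on its own
print(recover_site([email, trading]))  # site-level recovery, by priority
```

Because the site-level plan only composes the service-level procedures, each service can be automated and tested first, and the site plan grows incrementally as more services are added.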

A key factor when approaching disaster recovery is cost. As in many situations, there is a cost-versus-risk balance that needs to be considered as part of any plan. What is the acceptable downtime for a component or business service, and what will the impact be on the business? How much are you willing to invest to keep downtime to a minimum? Again, various tiers of protection can be implemented for each component or business service.

For high-priority services, an active-active approach may be appropriate. Although costly to implement, dedicated hardware and constant data synchronization between remote sites can enable an extremely fast recovery process. In other cases, dedicated hardware is too expensive, so an active-passive configuration may be more appropriate. Here, a plan is put in place to use test or development platforms in the event of a disaster and to reconfigure these systems to run the production environments. Typically, the time taken to recover active-passive configurations is longer, and the steps taken to implement failover are more complex and risky.

There are also considerations about the ownership of disaster recovery hardware. These days, organizations may choose to have their own dedicated secondary data center, or they may rent space from a specialist disaster recovery service provider. With the advent of cloud computing, there are further options that enable organizations to build their recovery plans on hosted resources in a public cloud.

Fast and Efficient Communication

Regardless of the nature of a disaster, there is always a need to communicate quickly and effectively with all employees who may be impacted, whether critical IT personnel needed to restore and verify services, users impacted by outages, or staff required to report to an alternate site. In each of these situations, BMC Atrium Orchestrator can be the single point of control and execution of communications.
Workflows that interact with voice systems can also be executed to establish bridges for announcements, call out to critical resources as part of recovery to notify responsible IT personnel, or deliver instructions to employees based on the nature of the disaster.

The Daily Disasters

While we may not think of the outages that occur on a daily basis as disasters, they are certainly events that disrupt business services. As discussed, the procedures for restoring service after these daily outages should be leveraged for larger events that can, in many cases, truly be classified as disasters.

Loss of a database environment is an event that can occur at any time, as well as during a major disaster. While applications and networks may be working fine, without the database the business service is disrupted. BMC Atrium Orchestrator workflows can be written to execute database fallback or restore scenarios in support of any or all environments. In the event of the loss of a database, or of a larger event that affects multiple databases, BMC Atrium Orchestrator workflows can be accessed and executed from any location to restore database services.

BMC Atrium Orchestrator workflows can be designed to execute in a fully automated fashion or interactively. In this scenario, the BMC Atrium Orchestrator recovery workflow would intercept an event, such as an SNMP trap or similar notification from an event management system, and based on the event type, automatically execute steps to:
a. Verify that a problem exists and that the event was not part of a planned outage
b. Determine what other resources may be affected and require attention
c. Document the current state by opening a service desk incident or generating and distributing a report
d. Provision hardware and software resources to replicate the operating environment
e. Recover the data to the status quo ante
f. Restore associated resources
g. Close the incident or update the report
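Steps a to g can be sketched as a small Python function to make the control flow concrete. This is an illustrative outline only, not an Atrium Orchestrator workflow: the event dictionary, the `run_recovery` function, the `planned_outages` list, and the optional `confirm` callback (standing in for interactive execution) are all assumptions, and each step merely logs a placeholder action instead of doing real work.

```python
from typing import Callable, Dict, List, Optional

Event = Dict[str, str]  # hypothetical shape: {"type": ..., "resource": ...}


def run_recovery(
    event: Event,
    planned_outages: List[str],
    confirm: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Walk steps a to g for one event and return a log of what was done."""
    resource = event["resource"]

    # a. Verify the problem is real and not part of a planned outage.
    if resource in planned_outages:
        return ["ignored: planned outage"]
    log = [f"a: verified outage of {resource}"]

    steps = [
        ("b", f"assess resources affected by {resource}"),
        ("c", f"open incident for {resource}"),
        ("d", "provision replacement hardware and software"),
        ("e", "recover data to its prior state"),
        ("f", "restore associated resources"),
        ("g", "close incident"),
    ]
    for label, action in steps:
        # Interactive mode: an operator may halt the workflow before a step.
        if confirm is not None and not confirm(action):
            log.append(f"{label}: halted by operator")
            break
        log.append(f"{label}: {action}")  # placeholder for the real action
    return log


event = {"type": "snmp_trap", "resource": "orders-db"}
for line in run_recovery(event, planned_outages=[]):
    print(line)
```

Passing a `confirm` function turns the same sequence into an interactive run in which an operator approves each step; leaving it out gives the fully automated behavior described above.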

Figure 2: Example Disaster Recovery Workflow

While this example depicts a fully automated recovery process, it would be just as easy to insert pauses in the workflow to report progress to date and request operator confirmation of the next steps to perform. You can see how this isolated event-and-recovery process can be incorporated into a larger process to recover from an outage that affects an entire end-to-end business service.

Loss of a Data Center: Active-Active

Many IT organizations employ a dual data center strategy in which business services run live in two hot data centers. Both data centers are set up to run all critical business services, and at any point in time, each service is running live in one of the two data centers.

Various types of events can result in what is operationally the loss of a data center. Events such as power loss, building destruction, or a disaster that impacts telecommunications services make the business services in that data center unavailable. The first step in this situation is to determine what was lost and which databases and applications were running in the failed site. Once that has been determined, procedures can be executed to restore business services in the operational data center.

BMC Atrium Orchestrator workflows can be written and executed to identify the services that were running in the failed site and provide fast guidance as to what requires recovery in the operational site. BMC Atrium Orchestrator workflows can then be used to execute the appropriate recovery scenarios to quickly restore service.

Loss of a Data Center: Active-Passive

In some environments, running a secondary hot data center is not practical. These IT organizations typically employ a cold or warm backup site that contains the IT infrastructure components needed to recover critical business services, but does not keep database and application infrastructure environments running.
In this case, BMC Atrium Orchestrator workflows can be written for each specific business service to execute the tasks that load and bring up application environments, applications, and databases in the backup data center.

Benefits and Showing Value

Business continuity managers will often want, or be required, to show the value that automation provides. Executives will want to see the return on any investment made in automation technology or understand the degree to which risk has been mitigated. Typically, the three big cost indicators for a disaster recovery plan are:
»» Cost of downtime of a service to the business (this can be both a financial cost and an impact on business reputation)
»» Cost of hardware / real estate to implement plans
»» Cost of staff to test or, if necessary, implement disaster recovery plans
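The first indicator lends itself to simple arithmetic. The Python sketch below uses the illustrative figures given in the cost-of-downtime discussion in this paper ($500k per hour, with recovery cut from 4 hours to 30 minutes). The `downtime_cost` function is a hypothetical helper, and it assumes cost accrues linearly with outage time, which a real business impact analysis may not bear out.

```python
def downtime_cost(cost_per_hour: float, hours_down: float) -> float:
    """Cost of an outage, assuming cost accrues linearly with time."""
    return cost_per_hour * hours_down


COST_PER_HOUR = 500_000  # illustrative figure: $500k per hour of downtime

manual = downtime_cost(COST_PER_HOUR, 4.0)     # 4-hour manual recovery
automated = downtime_cost(COST_PER_HOUR, 0.5)  # 30-minute automated recovery
avoided = manual - automated

print(f"manual: ${manual:,.0f}")        # manual: $2,000,000
print(f"automated: ${automated:,.0f}")  # automated: $250,000
print(f"avoided: ${avoided:,.0f}")      # avoided: $1,750,000
```

Under these assumptions, a single incident recovered in 30 minutes instead of 4 hours avoids $1.75 million in downtime cost.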

There are also risk factors to consider:
»» Currency of disaster recovery plans and procedures
»» Frequency with which they can be tested

Automation can help in all of these areas.

Cost of Downtime

The biggest measurable benefit of automation is likely to be the reduction in the time taken to recover a service. If the loss of a critical business service can cost a business $500k an hour, reducing the recovery window from 4 hours to 30 minutes is a very compelling story.

Cost of Hardware

With the advent of virtualization technologies, there is much greater flexibility in the use (or not) of dedicated physical hardware. Virtual images can now easily be copied and migrated between physical hypervisor hosts, then rebooted and reconfigured on the fly using automation technology. Not only does this shorten the recovery time in an active-passive situation, but existing hardware running non-production virtual images can quickly be repurposed to host the production environment and restore service.

Cost of Staff

In non-automated environments, large numbers of expensive, experienced IT engineers are required to properly test or implement a disaster recovery plan. In larger environments, testing can involve hundreds of staff working over a weekend. Without automation, a disaster recovery playbook that describes the recovery procedures is walked through step by step by many different IT teams (Network, UNIX, Oracle DB, etc.). After each step, the IT team responsible for the next step in the playbook needs to be contacted. That team then needs to notify the next team that its steps have been completed, and so forth. Automation would manage both the communication and the orchestration of a recovery plan, vastly reducing the number of people required to either test or initiate the disaster recovery plan.

Risk of disaster recovery plan drift

A common problem, which is often exacerbated by the infrequency of testing, is that disaster recovery plans quickly become out of date.
Perhaps part of a service is moved to a new server, or additional load balancers are added to make the service more resilient. In either case, one of two things will happen:
a. If a service does go down and the disaster recovery plan is initiated, at some point during the recovery process something isn't going to work, which will add time to the recovery window and impact the business while the error is tracked down.
b. You will get false notifications of disasters when really the business services are functioning just fine. In most cases, you'd expect confirmation of a disaster before any plan would be initiated, so these kinds of issues should cause limited exposure. Still, it is an unwanted distraction.

The solution in both of these cases is tight change and configuration control and, again, automation can play its part in ensuring these processes are always executed as part of any infrastructure update. BMC Atrium Orchestrator has specific runbooks that integrate with server, network, and database configuration tools and automatically generate change tasks for any updates made, thus keeping an accurate audit record of change and keeping the CMDB up to date. In the BMC ProactiveNet Performance Management example, this would in turn maintain the business service models and the impacts of technology faults on supporting configuration items, ensuring that the monitoring and alerting mechanism is also current and accurate.

Risk of infrequent testing

As noted in the section above, infrequent testing of a disaster recovery plan results in inaccuracies in the plan. Whereas before automation a disaster recovery plan could take tens or even hundreds of staff many hours to test, automation could test the plan in a fraction of the time using far fewer people. The combination of fewer people and much faster testing means that plans can be tested much more frequently, greatly reducing the risk of out-of-date plans.
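Plan drift of the kind described in this section (a service moved to a new server, a load balancer added) can be detected by comparing what the recovery plan expects with what the configuration records say today. The Python sketch below is a minimal illustration under stated assumptions: the plan and the CMDB contents are represented as plain dictionaries mapping a service name to a set of component names. The `find_drift` function and all names in the sample data are hypothetical, and no real CMDB API is used.

```python
from typing import Dict, Set

Inventory = Dict[str, Set[str]]  # service name -> component names


def find_drift(plan: Inventory, cmdb: Inventory) -> Dict[str, Dict[str, Set[str]]]:
    """Report, per service, where the recovery plan and the CMDB disagree."""
    drift: Dict[str, Dict[str, Set[str]]] = {}
    for service in plan.keys() | cmdb.keys():
        planned = plan.get(service, set())
        actual = cmdb.get(service, set())
        missing = actual - planned  # in production, but not in the plan
        stale = planned - actual    # in the plan, but no longer in production
        if missing or stale:
            drift[service] = {"missing_from_plan": missing, "stale_in_plan": stale}
    return drift


# Hypothetical data: app01 was replaced by app02 and a load balancer was added.
plan = {"trading": {"app01", "db01"}}
cmdb = {"trading": {"app02", "db01", "lb01"}}

print(find_drift(plan, cmdb))
```

Running a comparison like this after every approved change, rather than once a year at test time, is one way to surface drift before a recovery is attempted.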

The Complete BMC Solution

[Figure: The complete BMC solution. A business service user, the service model, the CMDB / CMS, and Atrium Orchestrator, with the Trading Service running at Site A and recovered at Site B / cloud. Numbered callouts 1 to 6 are explained below.]

Authorizations: Decision makers confirm the disaster (example: change requests generated in Remedy ITSM wait on approvals).

Failover process (e.g.):
»» Shut down what's left of the production environment
»» Re-allocate resources at the DR site
»» Data synchronization
»» Restart service at the DR site
»» Re-direct users

1. Service model generated from CMDB
2. Real-time monitoring of service model through BMC ProactiveNet Performance Management
3. Service-impacting event causes service outage / failure
4. Service impact alert picked up by BMC Atrium Orchestrator
5. Atrium Orchestrator initiates associated DR workflow
6. Service is automatically recovered at secondary site and service resumed

Conclusions

The disaster recovery and business continuity processes in place at most companies typically consist of written procedures augmented by traditional systems management tools for recovering IT resources. This fragmented approach extends recovery time and hinders continuous process improvement initiatives. BMC Atrium Orchestrator provides a single point of visibility and control for executing business recovery in the event of minor daily disasters or major events that disrupt business services. It provides immediate value by allowing you to build out your recovery processes incrementally, automating key recovery processes first, until you have a fully integrated end-to-end process. And by implementing automation, you get a reliable, repeatable process that will serve as the foundation for continuous process improvement.

Business Runs on IT. IT Runs on BMC Software.

Business thrives when IT runs smarter, faster and stronger. That's why the most demanding IT organizations in the world rely on BMC Software across distributed, mainframe, virtual and cloud environments. Recognized as the leader in Business Service Management, BMC offers a comprehensive approach and unified platform that helps IT organizations cut cost, reduce risk and drive business profit. For the four fiscal quarters ended December 31, 2010, BMC revenue was approximately $2 billion. Visit www.bmc.com for more information.

BMC, BMC Software, and the BMC Software logo are the exclusive properties of BMC Software, Inc., are registered with the U.S. Patent and Trademark Office, and may be registered or pending registration in other countries. All other BMC trademarks, service marks, and logos may be registered or pending registration in the U.S. or in other countries. Oracle is a registered trademark of Oracle Corporation. UNIX is the registered trademark of The Open Group in the US and other countries. All other trademarks or registered trademarks are the property of their respective owners. © 2011 BMC Software, Inc. All rights reserved.