The 5 Most Commonly Used Disaster Recovery Process


DR Risk Assessment White Paper

This document provides an overview of Equilibrium's disaster recovery risk analysis and remediation methodology. This methodology was developed over a period of 10+ years by referencing industry best practices.

Author: Glen Hampton, President

Equilibrium's Disaster Recovery Methodology

Equilibrium's Disaster Recovery Risk Analysis & Remediation Lifecycle methodology is a pragmatic and cost-effective approach that consists of five modular phases of engagement:

- Phase 1 - Failure Mode and Effects Analysis Risk Assessment
- Phase 2 - Risk Remediation Recommendations
- Phase 3 - Implement Remediation Measures
- Phase 4 - Troubleshooting & Recovery Procedure Flow Charts
- Phase 5 - DR Testing & Recovery Procedures

Phase 1 - FMEA Risk Assessment

During Phase 1, the Failure Mode and Effects Analysis (FMEA) methodology is used as a risk assessment tool to systematically determine the organization's threshold for acceptable risk, identify where potential points of failure are located within the IT environment, understand how severely each failure would affect operations, and state the high-level plans to resolve these failures before they become a problem. Both physical and logical infrastructures are represented in the FMEA process to ensure that physical single points of failure are documented and the logical architecture is defined.

The key delimiter in this phase is the threshold that is defined by the team (a blend of client and Equilibrium representatives). This threshold defines which failure modes are carried into Phase 2 and which ones are acceptable to leave without remediation. The threshold is defined as the process develops and potential failures that require remediation become apparent.

Equilibrium IT Solutions DR Risk Assessment White Paper - EQInc.com

Phase 2 - Risk Remediation Planning

During Phase 2, Equilibrium develops a formal remediation report that systematically defines the remediation strategy for each potential failure that exceeds the threshold, using three variables:

- Risk Priority Number, and the amount by which it exceeds the threshold
- Budgetary costs, at a high level, associated with implementing a solution to remediate the potential failure
- Complexity that the solution introduces into the IT computing environment

The output of this remediation report is assembled into recommendations, which are prioritized and focus on improving the computing environment. The recommendations include budgetary costs per initiative for both product requirements and labor efforts.

Phase 3 - Implementation

During Phase 3, the approved solutions identified to remediate risk to the business are implemented. This work is typically accomplished by a combination of internal IT staff, Equilibrium's engineers and consultants, and appropriate third-party vendors. Activities in this phase include:

- Provide project management throughout the term of the implementation.
- Work with the client to coordinate the scheduling and successful completion of the recommendations.
- Create a detailed project plan to track tasks and milestones throughout the implementation.
- Provide weekly status updates on overall progress.
- Specify, quote and requisition the products required to complete the items in the recommendations.
- Illustrate, at a high level, the attributes of the future-state computing environment.

Phase 4 - Troubleshooting & Recovery Procedure Flow Charts

During Phase 4, the troubleshooting and recovery procedure flow charts are developed. These flow charts are used to determine the cause of a failure in the environment. This pre-planned documentation guides the internal IT staff through the logical troubleshooting and recovery steps, pointing the way to the specific failed area of the environment that needs to be repaired.

Phase 5 - DR Testing & Recovery Procedures

During Phase 5, the Disaster Recovery (DR) testing procedures and the Recovery Procedures are defined and documented.

The DR testing procedures define the detailed step-by-step process for failing over and failing back the production environment in the case of a disaster. They should cover everything from failing a single point in the environment to failing over the entire data center. They must include a fail-back procedure to ensure the environment can successfully be brought back to the primary location. Tests should be scheduled to ensure the recovery procedures function efficiently and the staff is versed in the process.

The Recovery Procedures define the step-by-step process for resolving a failed device. These procedures are extremely detailed and include vendor phone numbers, serial numbers, key codes, service tags, IP addresses, circuit IDs, etc., to ensure the successful resolution of a failed device. This verbose documentation is sometimes referred to as the DR Run Book. Armed with it, someone unfamiliar with the environment should be able to step in and help with recovery measures.

Disaster Recovery Overview

Every business should prepare for the potential impact disasters can have on business operations, whether they result from forces of nature, acts of terrorism, careless (or malicious) employee actions or simple hardware failure. Every business, no matter what size, should re-examine its IT preparedness for a disaster and its disaster response plans.

Data loss can be devastating. Business records today are increasingly in electronic form. Dependency on these records, and on the tools used to process and store them, continues to grow. Most electronic records, such as emails, never get printed out. If electronic records are lost, they might be impossible to re-create. For most businesses, data loss is not an option.

Some businesses are now obligated to comply with legal requirements for retention of electronic information. Those organizations must implement the technologies and policies needed to ensure the safe preservation and availability of their data and the timely recovery of that data if disaster strikes.

Disaster recovery starts with sound planning and design. Most companies do not want to put extensive costs into the "what if" of disaster recovery, but they are starting to realize that if a disaster did occur they would need to be prepared. Disaster recovery is a form of insurance that protects your business data in the case of a disaster.

Levels of Disaster

A disaster can be as simple as a lost or corrupted file or database, or as extensive as the destruction of the entire facility. Planning for disaster should encompass all levels of disaster.

Disaster Recovery Scenarios

- Server room fire
- Building fire
- Malicious intent or sabotage
- Theft of all server equipment
- Natural disaster
- Pandemic
- Extended power outage
- Civil unrest

Most businesses have standard daily backups of their data. This usually covers most simple disasters, such as a file deletion or corruption. The design and layout of the daily backup needs to cover all aspects of the data to ensure any and all data can be recovered. Processes and procedures should be in place to ensure this is well defined and repeatable.

Many businesses do not realize that there are many single points of failure in their environment. A single point of failure is a device or solution that has no physical backup if it fails. The data might be protected, but if the device fails there is no redundant device to take over. These could be firewalls, routers, switches or servers. Every company is different. For example, if the Internet is not a critical requirement for standard operations, then the devices that keep the Internet available are not critical to the company, so a single point of failure in that area is not an issue.

If the entire data center is lost for any reason, the process of recovery gets more complicated. A new location may need to be configured to support the company's data center, hardware usually needs to be replaced, and so on. This process needs to be defined to ensure that recovery runs smoothly, the amount of time it takes to recover is known, and the amount of data loss is kept to an acceptable level.

Level of Protection

Every company is unique when it comes to disaster recovery. Many questions need to be answered about the protection of the electronic data and systems in the case of a disaster, and many variables need to be defined.

- What level of Disaster Recovery is right for my business?
- What is the acceptable amount of data lost in the case of a disaster?
- How long can my environment be down without devastating effects?
- Does my data need to be taken off site? If so, how far off site and how often?
- Are there any single points of failure that affect critical systems in my environment?
- Do I have a recovery procedure and recovery plan in the case of a disaster?

In most cases there is no simple answer to these questions. The person or persons in the organization who hold some of the information required to answer them may not hold all of it. A team is usually created to resolve this.

How long can my environment be down without devastating effects?

In Disaster Recovery terms this is called the Recovery Time Objective (RTO). Most businesses know the cost of their down time. Every hour they are down is an hour of production that is lost. Most companies can tolerate a few hours of down time, but this is usually limited.

What is the acceptable amount of data loss?

In Disaster Recovery terms this is called the Recovery Point Objective (RPO). Most companies utilize tape backup as their primary solution to protect their data. These backups usually run every night, which means the data restored after a disaster could be up to 24 hours old. Is this acceptable? Can you lose the last 24 hours of your data and still operate the business? Can the data be recreated? If the answer to any of these questions is NO, then the standard tape backup is not sufficient for your company. There are many solutions that can reduce the RPO. These would need to be reviewed, and the solution that best fits your company's needs would need to be implemented.

Does my data need to be taken off site? If so, how far off site and how often?

The answer to this question is almost always YES. There are many ways to get your data off site. It can be replicated off site, which eliminates the human factor by letting the systems take the data off site automatically. There are also services that come to your facility and take the data off site for you. You can also designate a person in the company to take the tapes home. The distance is usually a minimum of 30 miles, which helps cover for natural disasters.

How often is another difficult question. If you use the automated method of getting your data off site, then it is always off site and you do not have to worry about it. If you designate a person in the company to take the data off site, then you can define how frequently this happens. It could be as frequent as every day.
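To make the RTO and RPO questions in this section concrete, the following minimal Python sketch compares a backup schedule and an estimated recovery time against the two objectives. The function names and the example targets (4 hours of tolerable data loss, 8 hours of tolerable down time, a 6-hour recovery estimate) are illustrative assumptions, not figures from this paper, and the model ignores how long the backup itself takes to run.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Data written just after one backup is unprotected until the next one,
    so in this simplified model the worst-case loss is one full interval."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(estimated_recovery: timedelta, rto: timedelta) -> bool:
    """True if the estimated time to restore service stays within the RTO."""
    return estimated_recovery <= rto


nightly_tape = timedelta(hours=24)

# Hypothetical targets: the business tolerates 4 hours of lost data
# and 8 hours of down time.
print(meets_rpo(nightly_tape, rpo=timedelta(hours=4)))        # False
print(meets_rto(timedelta(hours=6), rto=timedelta(hours=8)))  # True
```

In this example the nightly tape backup fails the RPO check, which is the situation the text describes: a shorter backup or replication interval would be needed.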

Are there any single points of failure that affect critical systems in my environment?

A single point of failure can be defined as any device or system that does not have redundancy built into the design. For example, if the Internet is critical to your business and you have a single router or firewall as part of that solution, this would be deemed a single point of failure on a critical system. Another example would be a server that is not utilizing RAID. If any drive is lost on this server, the server is inaccessible or functionally down. Single points of failure on noncritical systems are usually not an issue.

There are many ways to resolve critical single points of failure. First you need to define them. The best way to define critical points of failure is to systematically go through your entire infrastructure and define the critical points and any redundancy. A methodology originally developed in the 1940s for the aviation industry, then adopted by other industries, called Failure Mode and Effects Analysis (FMEA), is used to help complete this.

NOTE: See the section "FMEA Methodology for DR Risk Analysis" in this document for more details.

The FMEA and the analysis report that follows it are documents that granularly define every individual piece of the infrastructure and its importance to the organization as a whole. The FMEA ranks each component and system with a level of severity and defines the probability of failure and your ability to detect that failure.

Do I have a recovery procedure and recovery plan in the case of a disaster?

All the disaster recovery systems in the world will not help you if you do not know how to recover from a disaster. If the only recovery knowledge is in the heads of people in your organization and those people are not available, you effectively have no recovery procedure.
Having a properly documented plan that includes detailed recovery procedures for any and all disaster scenarios will ensure that any IT engineer can perform the recovery steps.

If you have a disaster site that you fail over to in the case of an emergency, there must be a procedure to fail back as well. Once you are running in a disaster scenario, you will need to get the updated or changed data back onto your production environment once that environment is operational again. If this part of the process is not defined, you may be running on the disaster site longer than required. Also remember that the data keeps changing while you are running at the disaster site. You should still be running backups in case data is lost at the DR site.

Testing, Quality Assurance and Documentation

Now that the entire organization is covered by a DR plan and/or procedure, you can relax. Think again. The disaster recovery plan needs to be tested periodically to ensure that it operates as planned. If tape backups are your primary backup solution, you should be doing periodic restores of files, directories, and potentially an entire server to ensure the process has been documented properly and can be executed by any IT engineer. Backups should be watched and documented weekly to ensure they continue to operate as designed. Many times backups start to fail and the failure is not caught for several months, or until a file or directory is needed. At that point it is too late.

If there are standby devices that back up the primary device, then these should be tested for fail over and fail back as well. Recovery procedures and checklists should be created to support the process if it is not automated.

In the case of a fail-over site, the initial test should be handled very carefully. Recovery Procedures and Work Plans should be created for fail over and fail back. The entire process should be defined before any process is initiated. This planned outage should be communicated to all users so that everyone knows the systems will be down for a period of time for DR fail over and fail back. The duration should be known prior to the start of the process. Procedures should also be in place to ensure that, if the business is running at the DR site, the data is protected as if it were at the primary site.

Documentation is one of the most important steps in the disaster recovery process. Without it a DR plan could fail miserably. Every step of the process must be defined in detail so any IT engineer can perform the process without incident. Practice strict control of these documents. Only the latest version should be available, to ensure the process is followed as designed. Also keep a printed copy of the documentation in several places.
There should be a copy at the primary site as well as off site in case of a site-wide disaster. These documents are living documents and should be reviewed and updated annually. If they are not kept up to date, the repercussions could be devastating. A process should be defined and adhered to that keeps these documents current and available in the case of a disaster.

FMEA Methodology for DR Risk Analysis

Overview

The Failure Mode and Effects Analysis (FMEA) methodology is a risk assessment tool that helps systematically define where potential points of failure are located within the IT environment, determine how critical the problems are, and logically lay out the plans to resolve them before they become a problem.

The purpose of the FMEA:

- To attempt to resolve potential failures in order of critical nature.
- To systematically define the potential failures and lay out the plan to resolve them.
- To assist in the transformation of network management from reactive to proactive.

There are three indices that collectively generate the priority ranking of a failure. Each is rated from 1 to 10.

Severity

This relates to the relative impact of a failure on the part of the infrastructure with which the device, system or software is associated. Severity ranking is based on the extent of the outage. If the productivity of the entire network is affected, the severity is higher than if a single user is affected. Each organization will have its own definition of what is a short-term and a long-term outage.

Detection

This relates to the ability to effectively detect that a problem has occurred or will occur. Controls and monitoring systems are the keys to increasing the capability of detecting a problem. If there is limited or no detection capability, the detection ranking will be higher than if a complete monitoring system with paging or email capability were operational.

Occurrence

This relates to the probability that the failure will occur. There are several factors that can increase the occurrence probability. The age of devices, systems or software can increase the probability that a failure will occur. Also, if there is insufficient redundancy or a single point of failure, the probability of a failure is higher. MTBF (Mean Time Between Failures) is known for most hardware on the market today and can help determine the probability of a failure. If the device, system or software is older, then the occurrence factor is going to be higher than if new devices, systems or software have been implemented.

Each failure mode has its own priority ranking. This allows the failures to be prioritized. The next step is to determine which index is causing the high priority ranking and attempt to reduce it.

1. Severity of a failure mode cannot change unless the design of the infrastructure changes.
2. Detection can be reduced by implementing monitoring tools that email or page engineers when a problem has occurred or will occur.
3. Occurrence can be reduced by upgrading devices, systems or software and/or eliminating single points of failure.

Defining the Document

Device, System or Software

Utilizing network diagrams and the hardware and software inventory, populate the first column with all hardware and software pertinent to business operations. Logical details should also be present to better understand the severity of some of these devices.

- Infrastructure
    o Power
        - Uninterruptible power supplies (UPS)
        - Generators
    o Cabling
    o Switches
    o Routers
- Security
    o Firewall
    o VPN
    o Virus Protection
    o Patching
- Server & Data
    o Server Hardware
    o Appliances
    o Data Storage (SAN, NAS, DASD, etc.)
    o Backups
- Operating System & Services
    o Windows Server
    o Active Directory, DNS, DHCP, etc.
    o Citrix
    o VMware (Virtual Center, DRS, HA, etc.)
- Applications & Databases
    o Messaging (Exchange, BES, etc.)
    o Databases (SQL, Oracle, etc.)
    o Vertical Applications, ERM, CRM, etc.
    o File & Print
    o Web

Failure Mode

Populate this column with any possible failure for each specific device, system or software. Keep the failure modes lined up with each item in the first column. There can be several failure modes for each item. Add lines if necessary. A failure is anything that can cause this device, system or software to go off line or fail, including but not limited to: hardware failure, hardware configuration issue, power failure, UPS failure, and cabling failure.

Effects of Failure

The effects of a failure can be a variety of things, from a single user not being able to see a server to the entire network being inaccessible. Keep the effects lined up with the failure modes to minimize confusion. There can be many effects per failure mode. An effect is the potential impact of a failure of a device, system or software, including but not limited to: cannot print, cannot save documents, cannot access the Internet, email is down, server inaccessible, etc.

Cause of Failure

The cause of the failure can also be a variety of things. It can be as simple as user error or as complex as installation or configuration issues, including viruses and hackers. There can be many causes of a single failure mode.

RPN - Risk Priority Number

The Risk Priority Number (RPN) is calculated by multiplying the determined values for the three indices:

RPN = Severity x Detection x Occurrence

The Risk Priority Number is a key tool for identifying the various types of failure that can impact specific areas of your business operations. The key delimiter is the threshold that is defined by the team. This threshold defines which failure modes are carried forward and which ones are acceptable to leave without remediation. The threshold is defined as the process develops and potential failures that require remediation become apparent.

Risk Remediation Report

This may be the most important phase of this process. Once the FMEA is complete, the process of remediation starts. The Risk Remediation Report includes two areas of focus:

- Analysis of potential failures that exceed the threshold
- Mitigation of potential failures

During the analysis stage of this report there are three variables used to define what is going to be mitigated.

Risk Priority Number

This number, along with the threshold that was defined in the FMEA stage, is utilized in the remediation report. Only the potential failures that are above the threshold are brought to this stage of the process.

Budgetary Costs

During the FMEA process a budgetary cost was associated with any RPN that exceeded the threshold. These budgetary costs help in evaluating any remedial actions that may filter down into the next phase of this process.

Infrastructure Complexity

Adding redundancy to an existing infrastructure can add considerable complexity as well. This report helps to show the level of complexity that is being added. The complexity variable is represented on a scale from 1 to 10, with 10 being the most complex.
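The scoring described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names, the sample worksheet rows, the cost figures and the threshold of 100 are all hypothetical, since the paper leaves the threshold to the team. With each index rated from 1 to 10, the RPN ranges from 1 to 1000.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    item: str                    # device, system or software
    failure: str                 # failure mode
    severity: int                # 1-10
    detection: int               # 1-10, higher means harder to detect
    occurrence: int              # 1-10
    budgetary_cost: float = 0.0  # high-level remediation cost (hypothetical)
    complexity: int = 1          # 1-10, 10 being the most complex

    def __post_init__(self) -> None:
        for name in ("severity", "detection", "occurrence", "complexity"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: product of the three indices."""
        return self.severity * self.detection * self.occurrence

    def dominant_index(self) -> str:
        """Name of the index contributing most to the RPN (first one wins ties)."""
        scores = {
            "severity": self.severity,
            "detection": self.detection,
            "occurrence": self.occurrence,
        }
        return max(scores, key=scores.get)


def remediation_candidates(modes, threshold):
    """Return (mode, excess) pairs for failure modes whose RPN exceeds
    the threshold, ordered by how far they exceed it."""
    over = [(m, m.rpn - threshold) for m in modes if m.rpn > threshold]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Hypothetical worksheet rows and threshold.
worksheet = [
    FailureMode("Internet router", "hardware failure", 8, 9, 3, 2500.0, 4),
    FailureMode("File server", "disk failure without RAID", 7, 3, 6, 1200.0, 2),
    FailureMode("Print server", "spooler crash", 3, 2, 5),
]
THRESHOLD = 100

for mode, excess in remediation_candidates(worksheet, THRESHOLD):
    print(mode.item, mode.rpn, excess, mode.dominant_index(),
          mode.budgetary_cost, mode.complexity)
```

Each printed row carries the three analysis variables (the RPN and its excess over the threshold, the budgetary cost, and the complexity) plus the index that drives the ranking, which indicates whether better monitoring, an upgrade, or a design change is the relevant lever.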

These three variables are used to define what is going to make it to the Observations and Recommendations (O&R) stage, which is the second area of focus for this report. For example, if a potential failure has a low RPN (but still above the threshold), a high budgetary cost and a high complexity rating, it would probably not make it into the O&R stage. Conversely, if a potential failure has a high RPN, a low budgetary cost and a low complexity rating, it would probably make it to the O&R stage.

The O&R section has a detailed description of each issue that requires mitigation and a recommendation that defines a solution directed towards the root cause of the potential failure. Also included in this section is an investment list that defines the hardware, software, and labor needed to complete each O&R.

The final part of the report lists each O&R in order of criticality, followed by a second list that shows the order of implementation.

Failure Process Flow Chart

A failure process flow chart represents the process that a technician or engineer would follow to determine the cause of a failure. Each flow chart should focus on a specific type of failure:

- Can't print
- Can't get to the Internet
- Can't get to the File and Print server
- Can't access email
- Can't log in

These flow charts should be detailed enough that any technician can follow them and determine the root cause of the problem. The resolution should not be part of these documents. Once the root cause has been determined, the Hardware and Software Recovery Procedures should be followed to resolve the issue.

Hardware and Software Recovery Procedures

There should be a Recovery Procedure for EVERY production device. These procedures depict a step-by-step process for replacing a failed device. These steps should include:

- Installation and configuration parameters.
- Serial numbers, key codes, service tags, IP addresses, etc.
- Phone numbers for technical support.
- Warranty or service plans for each device and the expiration date.
- Circuit IDs.
- Location of all hardware pertaining to each failure. This should include spares, secondary servers, and standby devices.
- Fail-over procedures if a secondary device is in standby mode.
- Any limitations of secondary or standby devices.

There should also be a Failure Procedure for EVERY production software installation. These procedures depict a step-by-step process for reinstalling the failed software. These steps should include:

- Installation and configuration parameters and procedures.
- Key codes, location of CDs, serial numbers, etc.
- Data restore procedures if data is to be restored from tape.

Installation and Configuration Procedures

Each device on the network that has a specific configuration or installation should have a step-by-step procedure describing how that device was configured. These procedures should include any specific information pertaining to that device. If a configuration file is required, that file should be included. In the case of a disaster these devices will need to be rebuilt per these procedures. The procedures need to be accurate and reviewed regularly to ensure they are current.
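One way to keep the device recovery procedures in this section complete and current is to hold the required details as structured records and check them for gaps. The sketch below is an assumption about how that could be done, not part of Equilibrium's methodology; the field names, the sample device and the dates are hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class RecoveryProcedure:
    device: str
    install_parameters: Optional[str] = None
    serial_number: Optional[str] = None
    ip_address: Optional[str] = None
    support_phone: Optional[str] = None
    warranty_expires: Optional[date] = None
    spare_location: Optional[str] = None
    failover_procedure: Optional[str] = None


def missing_fields(proc: RecoveryProcedure) -> List[str]:
    """Names of the run-book details that are still blank for this device."""
    return [f.name for f in fields(proc) if getattr(proc, f.name) in (None, "")]


def warranty_lapsed(proc: RecoveryProcedure, today: date) -> bool:
    """True if a warranty date is recorded and it is already in the past."""
    return proc.warranty_expires is not None and proc.warranty_expires < today


# Hypothetical entry with several details not yet documented.
core_switch = RecoveryProcedure(
    device="Core switch",
    serial_number="SN-EXAMPLE-001",
    ip_address="192.0.2.10",
    warranty_expires=date(2020, 1, 31),
)

print(missing_fields(core_switch))
print(warranty_lapsed(core_switch, today=date(2021, 6, 1)))
```

Running such a check as part of the regular document review would flag devices whose procedures are incomplete or whose service plans have expired.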

Disaster Recovery Testing

There should be a procedure and a plan to test random failovers on a scheduled basis. These plans or procedures should include detailed step-by-step instructions on how to fail a device over to a standby device or a spare, and how to fail back. These plans should include:

- Standby or spare device locations
- Technical support contact information
- Warranty or service plans and the expiration dates
- Serial numbers, service tags, and key codes
- Failback procedures
- Contact information if there is a problem with the failover or failback
- Make and model of each device (primary and secondary)
- Any limitations of secondary or standby devices

There should also be a plan to test the staff in case of a disaster. These regularly scheduled tests should include:

- A disaster scenario procedure
- A way of creating a disaster on paper to represent a real disaster
- A way of randomly assigning a disaster to each regularly scheduled session
- A copy of the Disaster Recovery Manual for each participant
- A time limit on each written or verbal resolution
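The requirement to assign a disaster at random to each scheduled session can be met with something as small as the Python sketch below. The scenario list and the `pick_scenario` function are illustrative assumptions, not part of the original plan; the only design choice is to avoid repeating the previous session's scenario.

```python
import random
from typing import List, Optional

# Example scenarios only; a real list would come from the organization's
# own Disaster Recovery Manual.
SCENARIOS = [
    "Satellite office router failure",
    "File and print server failure",
    "Email server failure",
    "Loss of Internet connectivity",
]


def pick_scenario(
    scenarios: List[str],
    previous: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a scenario at random, skipping the one used last session."""
    rng = rng or random.Random()
    candidates = [s for s in scenarios if s != previous] or list(scenarios)
    return rng.choice(candidates)


# Usage: remember the last scenario and pass it in at the next session.
last_session = "Email server failure"
this_session = pick_scenario(SCENARIOS, previous=last_session)
```

Passing a seeded `random.Random` instance as `rng` makes a drill schedule reproducible, which can be useful when the schedule has to be audited later.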

Appendix A FMEA Spreadsheet Example

Appendix B Failure Process Flowchart Example

Starting event: User calls Helpdesk. Cannot get to the Internet or any off-site resources such as email.

Decision points in the flowchart (each answered Yes or No):

- Is this limited to one workstation?
- Does the router ping?
- Can you ping through the router?
- Is there something wrong with the switch?
- Can you ping the device?
- Is the device through another router?
- Do the router and its connection look OK?
- Can you ping the next router?
- Can you ping the other device?

Actions and outcomes in the flowchart:

- Try to ping the router.
- Try to ping a device on the other side of the router.
- T1 or remote router is probably down. Check the remote router. If it is operational, call SBC and have them test the line.
- Check connections and workstation configuration.
- Check the switch connection. Trace the connection back to the switch.
- Try pinging the device the user is trying to access.
- Replace the switch.
- Something may be wrong with the device. Check the server or application.
- Check the router and its connection at the switch.
- Something may be wrong with the device. Check the server or the switch the server is plugged into.
- Replace the router.
- Ping the next router.
- Reboot the router. If this does not work, replace the router.
- Check the router and/or the switch the router is plugged into.
- Try pinging another device through that router.
- T1 is probably down. Call SBC and have them test the line.
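The first few decisions of a chart like this one can be expressed as a small function. The Python sketch below is one plausible reading of the opening branches, not a transcription of the flowchart: the branch order, the `diagnose` function, and its parameters are assumptions, and the `can_ping` callable is supplied by the caller so the logic can be exercised without touching a network.

```python
from typing import Callable


def diagnose(
    limited_to_one_workstation: bool,
    can_ping: Callable[[str], bool],
    router: str,
    far_side_device: str,
    target: str,
) -> str:
    """Walk the assumed opening branches and return the next action."""
    if limited_to_one_workstation:
        return "Check connections and workstation configuration."
    if not can_ping(router):
        return "Check the router and its connection at the switch."
    if not can_ping(far_side_device):
        return "T1 or remote router is probably down. Check the remote router."
    if can_ping(target):
        return ("Something may be wrong with the device. "
                "Check the server or application.")
    return "Continue: is the device through another router?"


# Example with made-up host names: the local router answers, but nothing
# beyond it does.
reachable = {"local-router"}
next_step = diagnose(
    limited_to_one_workstation=False,
    can_ping=lambda host: host in reachable,
    router="local-router",
    far_side_device="remote-host",
    target="mail-server",
)
```

In this example `next_step` points at the T1 or the remote router. As with the flow charts themselves, the function stops at identifying the likely cause and leaves the resolution to the recovery procedures.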

Appendix C Cisco 1720 Satellite Office Router Recovery Procedures

In the event that the Cisco 1720 router fails at a Satellite Office, the IT Director, Network Administrators, and Equilibrium Service Center (ESC) team would be notified by an email alert sent to the Distribution List from Equilibrium Systems Monitoring (ESM). The current notification systems use the network to send an email to the IT Director. If the Satellite Office router does not respond to a ping after 2 minutes, a WAN Link down notification would be sent via email and SMS text to the Network Manager.

Once it is known that the system is down, the Network Manager would diagnose the failure to determine what caused the connections to fail. If the connection to the Satellite Office is down and cannot be pinged, the first thing to determine is whether the 1720 router is down or whether there is another point of failure on the network. Check to make sure the 3745 Corporate Ethernet is still up. On a Windows computer at a DOS prompt, type ping 10.10.10.123. If you get a reply, then the Ethernet port to Corporate is up. If you do not get a reply, then the 3745 has a problem.

Once it has been determined that the Cisco 1720 has failed, follow these recovery steps:

1. Call Cisco tech support (7x24x4 hour response).
   a. 800-553-2447 Contract Number: 2823318 (7x24x4 support)
2. Implement the spare Cisco 1720 router.
   a. Remove all connections from the failed Cisco 1720 router.

   b. Remove the Cisco 1720 from the rack.
   c. Install the spare Cisco 1720. This router does not have a rack mount capability, so place the router on top of the switches. It may need to be strapped down or Velcroed into place so that it is secure and will not fall.
   d. Connect the cables in the same configuration as the original router.
3. The 1720 may not have the correct configuration on it. If not, it will not function correctly. You will have to connect to the router via a console cable and put the proper configuration on the router.
   a. Try to ping the router first. On a Windows computer at a DOS prompt, type ping 10.10.31.254. If you do not get a reply, then you will have to install the configuration for that location.
   b. The configuration is located on the TMF-NT-VAULT server in the \\tmf-ntvault\data\infosys\network\cisco\cisco tftp server directory. Open the file associated with this router, highlight all of its contents, and click Edit / Copy from the menu.
   c. Using a console cable, connect your laptop to the router via the console port on the router and the serial port on your laptop. Use HyperTerminal to access the router's current configuration. The HyperTerminal program is located in Accessories, i.e. PROGRAMS/ACCESSORIES/COMMUNICATIONS/HYPERTERMINAL.
   d. Set the name to Cisco.

   e. Set Connect using to COM1.
   f. Set Bits per second to 9600.
   g. You may have to press Enter a few times to get the router to respond.
   h. It will prompt you for a password.
   i. Type in the password, then enable, and then the enable password:
        ABC123
        enable
        ABC123
   j. config t (This will get you into configuration mode.)
   k. Now right-click on the screen and paste the information that you copied from the file on the server into the router.

   l. You will need to get out of configuration mode.
   m. end (This will bring you back to the enable prompt.)
   n. Once this is complete, you can save the configuration to the router.
   o. write memory (This will save the configuration to memory.)
   p. Now you should be able to ping the router. It should be fully operational at this time. On a Windows computer at a DOS prompt, type ping 10.10.10.234.
4. If the router is pingable, Telnet into the Cisco 1720 and make sure all connections are up / up and connected correctly.
   a. On a Windows computer at a DOS prompt, type Telnet 10.10.10.234. This will take you to the user access verification screen.
   b. Type in the password, then enable, and then the enable password:
        ABC123
        enable
        ABC123
   c. Type show interface to display all interfaces and their current state. Use the space bar to move from screen to screen.
5. If all information is correct and all of the interfaces are up / up, then you will need to check that you have connectivity to all facilities for data.
   a. Go to any computer and open Windows Explorer. Check to see if you can access servers that are at other ABC Company facilities. If you can log in to those servers, the Cisco 1720 is operational for data.

Cisco should show up with a new Cisco 1720 as long as the damage to the unit was not due to human error. The spare router is not covered by SMARTnet, so it will have to be taken out of service once the new router has been received. You will need to repeat steps 2-5 during off hours so as not to disrupt current operations. Place the spare Cisco 1720 back in its box and back into storage.
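The up / up check in step 4 can be done by eye, but if the show interface output is captured to a text file it can also be checked with a short script. The Python sketch below is an illustration only: it assumes the usual Cisco IOS status line of the form "<interface> is <state>, line protocol is <state>", and the sample output, interface names, and function names are made up for the example.

```python
import re
from typing import Dict, List, Tuple

# Matches status lines such as "Serial0 is up, line protocol is down".
STATUS_LINE = re.compile(
    r"^(\S+) is (administratively down|up|down), line protocol is (up|down)",
    re.MULTILINE,
)


def interface_states(show_interface_output: str) -> Dict[str, Tuple[str, str]]:
    """Map each interface name to its (interface, line protocol) state."""
    return {
        match.group(1): (match.group(2), match.group(3))
        for match in STATUS_LINE.finditer(show_interface_output)
    }


def interfaces_not_up_up(show_interface_output: str) -> List[str]:
    """Return the interfaces that are anything other than up / up."""
    states = interface_states(show_interface_output)
    return [name for name, state in states.items() if state != ("up", "up")]


# Illustrative capture; real output carries many more detail lines.
SAMPLE = """\
FastEthernet0 is up, line protocol is up
  (detail lines omitted)
Serial0 is up, line protocol is down
  (detail lines omitted)
"""

problems = interfaces_not_up_up(SAMPLE)
```

Here `problems` contains only Serial0, the interface whose line protocol is down. An empty list would mean every interface in the capture is up / up and the connectivity check in step 5 can proceed.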

Conclusion

Assessing your need for DR improvements can be confusing, as there is often a wide scope of technology involved. A common approach is to leverage third-party experts to guide you through what is required in your environment.

About Equilibrium

Founded in 2004, Equilibrium dedicates half of its business to IT project consulting and the other half to IT services. Equilibrium specializes in IT strategy, security, cloud, datacenter upgrades, and ongoing operations.

Visit us at EQinc.com
Request a Free Consultation: http://eqinc.com/request-a-free-consultation
For more information: info@eqinc.com, 773-205-0200