Building a Disaster Recovery Testing Program
Presented by Steve Carroll
Email: scarroll@aboundresources.com
Phone: 717-256-1865
About Our Speaker
Steve Carroll is a Senior Consultant with Abound Resources. With more than 25 years of experience as a community financial institution executive, Steve has worked in a variety of capacities in financial institutions, including consultant and CEO. Since 1996, Steve has worked as a lead consultant on more than 100 financial institution consulting engagements across the country. His areas of expertise include business continuity planning, risk management, strategic business planning, and strategic technology planning. Steve has developed software applications to assist financial institutions in improving their risk management positions, including Abound Resources' bplan Web-based Business Continuity Planning system. Steve has completed Institute of Financial Education courses at the University of Texas at Austin, the University of Georgia, and the University of Connecticut.
Who We Are
- Management consulting firm for the Community Financial Institution (CFI) industry
- We empower CFIs to achieve their goals. Goals achieved. Guaranteed.
- Based in Austin, TX; clients in 40+ states
- Founded in 1997 by industry execs and Big 5 consultants
- 500+ software evaluations
- Vendor Neutral
- Advisors average 25+ years in CFI management: lending, cash management, compliance, operations and IT
What We Do Sales & Marketing
Presentation Highlights
- Regulatory Issues & Terminology
- Building a Testing Program
- Conducting Tests
- Examples
  - High Availability Environments
  - A Simple Pandemic Exercise
- Summary
  - Top Five Testing Mistakes
Regulatory Background
FFIEC Guidance, March 2008
- The Board must approve the Testing Program & review test results
- IT is responsible for DR Testing
- The Crisis Management Team should be involved in the testing process
- Those responsible for Facilities should be involved in the process
- Test results must be subjected to an Independent Review
Common Regulatory and Audit Findings
- There is no Comprehensive Testing Program in place
- Testing activities show an over-reliance on a single testing methodology (Table-top)
- Test activities do not involve departments/users in a meaningful way
- Test documentation is inadequate
  - No step-by-step restoration procedures
  - No order of restoration defined
Terminology
- Testing Program: a schedule of test events spanning a complete testing cycle. What? When? Who? How? Where?
- DR Test: an event that demonstrates that a given resource can be restored to a production state within a target time frame using a documented restoration method.
- BCP Test: an event that demonstrates that a Business Function can be completed using a contingency procedure.
Budgeting Your BCP Effort
[Pie chart: share of BCP effort by activity. Slices: Business Impact Analysis, Risk Assessment, Documentation, Emergency Response, Testing. Values: 32%, 36%, 9%, 5%, and 18% (Testing).]
Testing Methodologies

Methodology             Type of Test/Administrator
----------------------  --------------------------
Tabletop Exercise       BCP/BCP Coordinator
Due Diligence           BCP/BCP Coordinator
Vendor Service Levels   BCP/BCP Coordinator
Independent Review      BCP/BCP Coordinator
Incident Tracking       BCP/BCP Coordinator
Compatibility Testing   Disaster Recovery/IT
Simulation              Disaster Recovery/IT
Production Testing      Disaster Recovery/IT
Getting Started
- Testing Team
  - Technical representation
  - Operations
  - Department Staff
- Inventory of Resources
  - Software Applications (Core and Network)
  - Critical Services (Data Communications, Internet)
  - Outsourced Applications & Services
- List of Business Functions
  - By Department
  - Linked to Resources (if possible)
Build a Database

Resource Name*       Critical Level  Test?  RTO (hours)  RPO (hours)  MAD (hours)  Control Group
-------------------  --------------  -----  -----------  -----------  -----------  -------------
Core System          1               Yes    8            2            72           Core
Data Communications  1               Yes    1            --           24           Network
Loan Prospector      3               Yes    48           72           96           Loans
Fedline              1               Yes    4            24           48           Fed
Network Files        3               Yes    24           72           72           Network
Branch Capture       1               Yes    4            8            48           Item Proc
Internet Access      1               Yes    1            --           24           Internet
Acrobat Reader       5               No     96           --           120          --
EMail                1               Yes    12           24           48           Email
Internet Banking     1               Yes    4            8            24           Core
Laser Pro            3               Yes    48           72           96           Loans
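The inventory above can also be kept as structured records, which makes it easy to derive an order of restoration. The following is a minimal Python sketch, not the presenter's tool: the `Resource` class, its field names, and the sort rule (criticality level first, then shortest RTO) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resource:
    name: str
    critical_level: int            # 1 = most critical
    test: bool                     # the "Test?" flag
    rto_hours: int
    rpo_hours: Optional[int]       # None where the slide shows "--"
    mad_hours: int
    control_group: Optional[str]   # None where the slide shows "--"


INVENTORY = [
    Resource("Core System", 1, True, 8, 2, 72, "Core"),
    Resource("Data Communications", 1, True, 1, None, 24, "Network"),
    Resource("Loan Prospector", 3, True, 48, 72, 96, "Loans"),
    Resource("Fedline", 1, True, 4, 24, 48, "Fed"),
    Resource("Network Files", 3, True, 24, 72, 72, "Network"),
    Resource("Branch Capture", 1, True, 4, 8, 48, "Item Proc"),
    Resource("Internet Access", 1, True, 1, None, 24, "Internet"),
    Resource("Acrobat Reader", 5, False, 96, None, 120, None),
    Resource("EMail", 1, True, 12, 24, 48, "Email"),
    Resource("Internet Banking", 1, True, 4, 8, 24, "Core"),
    Resource("Laser Pro", 3, True, 48, 72, 96, "Loans"),
]


def restoration_order(resources):
    """Resources flagged for testing, sorted by criticality, then RTO, then name.

    The sort rule is an assumption; an institution may order restoration differently.
    """
    flagged = [r for r in resources if r.test]
    return sorted(flagged, key=lambda r: (r.critical_level, r.rto_hours, r.name))


for r in restoration_order(INVENTORY):
    print(f"Level {r.critical_level}  RTO {r.rto_hours:>2}h  {r.name}")
```

Keeping the "--" entries as `None` rather than zero avoids treating a resource with no RPO as one that tolerates no data loss.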
Assign Criticality Levels
- Criticality is assigned to both Resources and Business Functions
  - Sometimes called Mission Critical or Business Critical
- Better to use multiple criticality levels for flexibility
  - Three or five levels, matched to a time frame
  - Example: Level 1 = 1 to 24 hours, Level 2 = 24 to 48 hours, etc.
- Test Flag
  - Will we test this? (yes or no)
  - Level 1 Resources usually tested by default
Assign Target Timeframes
- Recovery Time Objective (RTO): the target time frame for resource restoration.
- Recovery Point Objective (RPO): the maximum data loss a given information system can tolerate, measured in time.
  - Can be assigned to any application, but should be applied at a minimum to Transaction Interfaces.
  - RPOs should be supplemented with a description of how lost data could be reconstructed.
- Maximum Allowable Downtime (MAD): the estimated maximum downtime for a given resource.
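One way to put these targets to work is to compare a measured restoration time against them after a test. The sketch below is illustrative only: the function names and the three outcome labels are assumptions, the RTO and MAD figures come from the resource table, and the elapsed time is hypothetical.

```python
def recovery_margin(rto_hours, mad_hours):
    """Hours of slack between the restoration target (RTO) and the MAD."""
    return mad_hours - rto_hours


def classify_result(elapsed_hours, rto_hours, mad_hours):
    """Label a measured restoration time against the RTO and MAD targets."""
    if elapsed_hours <= rto_hours:
        return "within RTO"
    if elapsed_hours <= mad_hours:
        return "missed RTO, within MAD"
    return "exceeded MAD"


# (RTO, MAD) in hours, taken from the resource table
targets = {
    "Core System": (8, 72),
    "Fedline": (4, 48),
    "Internet Banking": (4, 24),
}

for name, (rto, mad) in targets.items():
    print(f"{name}: RTO {rto}h, MAD {mad}h, margin {recovery_margin(rto, mad)}h")

# Hypothetical test outcome: Core System restored in 6.5 hours
print(classify_result(6.5, *targets["Core System"]))
```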
Assign Control Groups
- Create a Control Group for resources that should be tested together.
  - Examples: Core System, Loan Systems, Internet (Web sites)
- Examples of ways to group:
  - By Server
  - By Criticality Level
  - By Application type
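Grouping the inventory by Control Group shows which resources would be restored in the same test event. This is a small Python sketch under the assumption that each resource carries a single Control Group label, as in the resource table; the function name is hypothetical.

```python
from collections import defaultdict

# (resource name, control group) pairs from the resource table;
# None marks a resource with no Control Group ("--" on the slide).
RESOURCES = [
    ("Core System", "Core"),
    ("Data Communications", "Network"),
    ("Loan Prospector", "Loans"),
    ("Fedline", "Fed"),
    ("Network Files", "Network"),
    ("Branch Capture", "Item Proc"),
    ("Internet Access", "Internet"),
    ("Acrobat Reader", None),
    ("EMail", "Email"),
    ("Internet Banking", "Core"),
    ("Laser Pro", "Loans"),
]


def by_control_group(pairs):
    """Map each Control Group to the resources tested together in it."""
    groups = defaultdict(list)
    for name, group in pairs:
        if group is not None:
            groups[group].append(name)
    return dict(groups)


for group, members in sorted(by_control_group(RESOURCES).items()):
    print(f"{group}: {', '.join(members)}")
```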
Create Test Events
Build a Control Document for each Test Event:
- Statement of Objective: be clear and concise
  - Example: "Show that the [software application] can be restored onto new hardware from backup media. Users will log in and verify that the system has been returned to a production state."
- Description of Test Environment
  - How will hardware be replaced?
  - Preinstalled software
  - External connections needed
- Most likely test methodology
- Test Date
- Who is responsible?
- Who will be present?
- What documentation (evidence) will be retained?
- Write a Test Script
  - Include a Reviewer section
Example Test Script

Step #  Start Time  Activity                                                                  Expected Results  Actual Results
------  ----------  ------------------------------------------------------------------------  ----------------  --------------
1                   Set up server hardware, workstation & test LAN
2                   Install O/S & backup/restore utility
3                   Install application from original media (download from vendor Web site)
4                   Locate backup image for application data & restore to server
5                   Install client onto workstation
6                   Have user log in and verify that work can resume
7                   User runs samples of typical transactions
8                   Print screens and reports; retain for documentation
9
Build a Testing Timeline
- Test Cycle: 12 months
- Assign a target test date to each Test Event/Control Group
- Strategically space test events across the entire Test Cycle
  - Easy tests can happen more quickly
  - Allow more time for complex tests
- Consider likely unplanned outages (Incident Tracking)
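The spacing idea can be sketched as a small scheduling routine. The Python example below is one possible approach, not a prescribed method: it assumes each test event is given a relative effort weight (hypothetical values here) and assigns target dates so that heavier events get a longer lead time within a 365-day cycle.

```python
from datetime import date, timedelta


def build_timeline(events, cycle_start, cycle_days=365):
    """Assign a target date to each (name, effort_weight) pair.

    Events are taken in the given order. Each event's date falls at its
    cumulative share of the total effort, so complex tests (larger weights)
    get more preparation time before their target date.
    """
    total = sum(weight for _, weight in events)
    timeline = []
    elapsed = 0
    for name, weight in events:
        elapsed += weight
        offset_days = cycle_days * elapsed // total
        timeline.append((name, cycle_start + timedelta(days=offset_days)))
    return timeline


# Hypothetical effort weights per Control Group (1 = easy, 4 = complex)
EVENTS = [("Internet", 1), ("Email", 1), ("Network", 2), ("Loans", 2), ("Core", 4)]

for name, target in build_timeline(EVENTS, date(2025, 1, 1)):
    print(f"{target.isoformat()}  {name}")
```

Unplanned outages logged through Incident Tracking could be added to the same list afterwards, since they may serve as evidence for a Control Group that has not yet had its scheduled test.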
Resource Restoration Methods
- Applications & Data
  - Restore from backup
  - Reinstall from original media
  - Installed in multiple locations (redundant)
  - High Availability System failover
- Hardware
  - Backup Equipment
  - Replace from available market sources (add time to RTO!)
- Services
  - Contact Supplier
Test Day
- Print the appropriate Test Control Documents and Scripts, or open the documents on a laptop
- Line up the Test Participants
  - Test Administrator
  - Technical support
  - Observers/documenter
  - Department users when appropriate
- Execute the test script: note the start time and results for each step. Complete the document as you go.
- Testing is 80% preparation and documentation, 20% execution
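For teams that work from a laptop rather than paper, the "note the start time and results for each step" routine could be captured in a simple record. The following Python sketch is an assumption about how such a log might look; the class, its fields, and the recorded results are hypothetical, and the activities are taken from the example test script.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ScriptStep:
    number: int
    activity: str
    start_time: Optional[datetime] = None
    actual_result: Optional[str] = None
    passed: Optional[bool] = None   # None until the step has been recorded

    def record(self, actual_result, passed, start_time=None):
        """Store the step's start time and outcome in one call."""
        self.start_time = start_time or datetime.now()
        self.actual_result = actual_result
        self.passed = passed


def open_steps(script):
    """Step numbers that have no recorded result yet."""
    return [step.number for step in script if step.passed is None]


script = [
    ScriptStep(1, "Set up server hardware, workstation & test LAN"),
    ScriptStep(2, "Install O/S & backup/restore utility"),
    ScriptStep(3, "Install application from original media"),
]

# Hypothetical results entered by the documenter during the test
script[0].record("Hardware racked and on test LAN", True, datetime(2025, 3, 4, 9, 0))
script[1].record("O/S installed; restore utility failed to start", False,
                 datetime(2025, 3, 4, 9, 40))

print("Open steps:", open_steps(script))
print("Failed steps:", [s.number for s in script if s.passed is False])
```

Recording failed steps explicitly, rather than leaving them blank, keeps the completed script usable as evidence for the Independent Review.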
Reporting
- Electronic files are better than physical
- Create a folder structure on your network: folders and test events with the same name
- Scan completed Test Scripts and Control Documents; attach to Test Event
- Keep a schedule of all Test Events, past and future
  - Be able to sort by Date and Status (Pending, Complete)
- When you're ready to distribute, copy or Zip the folder structure for emailing, or copy it to media
- Print a Test Summary by Date showing test status
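The schedule and the Test Summary by Date could be produced from a simple list of test events. This Python sketch assumes a minimal record with a name, date, and status; the event dates and the function name are hypothetical.

```python
from datetime import date

# Hypothetical schedule of Test Events, past and future
EVENTS = [
    {"name": "Core", "date": date(2025, 9, 15), "status": "Pending"},
    {"name": "Internet", "date": date(2025, 2, 6), "status": "Complete"},
    {"name": "Loans", "date": date(2025, 6, 10), "status": "Pending"},
    {"name": "Email", "date": date(2025, 3, 14), "status": "Complete"},
]


def summary_by_date(events, status=None):
    """Events sorted by date (then name), optionally filtered by status."""
    rows = [e for e in events if status is None or e["status"] == status]
    return sorted(rows, key=lambda e: (e["date"], e["name"]))


print("Test Summary by Date")
for e in summary_by_date(EVENTS):
    print(f'{e["date"].isoformat()}  {e["status"]:<8}  {e["name"]}')

print("Pending only:", [e["name"] for e in summary_by_date(EVENTS, "Pending")])
```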
High Availability Environments
- Virtual servers
  - Pro: can immediately cut RTOs in half
  - Con: testing challenges
- Bandwidth
- Licensing
- Staffing
- Core Synchronization
  - Synch how often? Is more always better?
- Production testing is complex
A Simple Pandemic Exercise
Preparation (BCP Coordinator)
- Create an Excel spreadsheet with 2 columns
  - Column A = Employee Name
  - Column B = Department
- Use a random selection formula to select 40% of the employee records
- Make sure you can reference which departments become impacted (use a count record formula or a pivot table)
- Practice this until you can do it quickly
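The slide builds the simulator in Excel. For readers who prefer a script, the same selection can be sketched in Python: pick 40% of the roster at random, then count absences per department (the equivalent of the count formula or pivot table). The roster, function names, and the optional `seed` parameter are assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical roster: (employee name, department), matching Columns A and B
ROSTER = [
    ("Employee 01", "Lending"), ("Employee 02", "Lending"),
    ("Employee 03", "Lending"), ("Employee 04", "Operations"),
    ("Employee 05", "Operations"), ("Employee 06", "Operations"),
    ("Employee 07", "IT"), ("Employee 08", "IT"),
    ("Employee 09", "Compliance"), ("Employee 10", "Compliance"),
]


def simulate_absences(roster, rate=0.40, seed=None):
    """Randomly select `rate` of the employee records as absent.

    Passing a seed makes the draw repeatable, which helps when practicing.
    """
    rng = random.Random(seed)
    count = round(len(roster) * rate)
    return rng.sample(roster, count)


def absences_by_department(absent):
    """Count absent employees per department."""
    return Counter(department for _, department in absent)


absent = simulate_absences(ROSTER, seed=1)
for department, count in absences_by_department(absent).most_common():
    print(f"{department}: {count} absent")
```

Calling `simulate_absences(ROSTER)` without a seed gives a fresh draw each time, which suits the live exercise.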
A Simple Pandemic Exercise
- Pull your team together for the exercise meeting
- Use the Pandemic Simulator (Excel spreadsheet) to determine which employees are absent with the flu
- Determine which Department has the highest level of absenteeism and is most affected
- Review the Business Functions for that Department and develop a strategy for dealing with the incoming work
- Document everything
Top Five Testing Mistakes
5. Procrastination/Cramming
4. Hiding Failed Tests
3. Reliance on a single methodology
2. Failure to leverage real life
1. Documentation missing or confusing
Questions
Contact
Steve Carroll, Senior Consultant
Abound Resources, Inc.
Cell: 717-256-1865
Email: scarroll@aboundresources.com
Twitter: @bankbcp