Business Continuity and Disaster Recovery Planning
More than 20% of all small and medium-sized businesses suffer a major disaster every 5 years. Almost all that lose their data for 10 days or more file for bankruptcy within a year. (Source: www.palindrome.com)
Project initiation steps
Recovery and continuity planning requirements
Business impact analysis
Selecting, developing, and implementing disaster and continuity plans
Backup and offsite facilities
Types of drills and tests
Disaster: any disruptive event (natural or man-made) that interrupts normal system operations so significantly that a considerable, coordinated effort is required to recover.
Geological: earthquakes, volcanoes, lahars, tsunamis, landslides, and sinkholes
Meteorological: hurricanes, tornadoes, wind storms, hail, ice storms, snow storms, rainstorms, and lightning
Other: avalanches, fires, floods, meteors and meteorites, and solar storms
Health: widespread illnesses, quarantines, and pandemics (remember anthrax? What will you do if they find anthrax in the mailroom?)
Labor: strikes, walkouts, and slowdowns that disrupt services and supplies
Social-political: war, terrorism, sabotage, vandalism, civil unrest, protests, demonstrations, cyber attacks, and blockades
Materials: fires, hazardous materials spills
Utilities: power failures, communications outages, water supply shortages, fuel shortages, and radioactive fallout from power plant accidents
Damage to facilities and equipment Utility outages Communication outages Transportation/delivery delays Personnel unavailable (or unable to travel) to work
Remember CIA? Which of these security services (security pillars) does business continuity and disaster recovery planning support?
Disasters are a fact of life Personnel need to be trained and prepared for their occurrence
Plan Type: Description
Business Resumption Plan: Focuses on necessary business processes instead of IT procedures.
Continuity of Operations Plan (COOP): Establishes management and headquarters after a disaster. Outlines roles and authorities, orders of succession, and individual role tasks.
IT Contingency Plan (ITCP): Plan for restoring systems, networks, and major apps after a disruption at the original facility.
Crisis Communications Plan: Provides procedures for disseminating internal and external communications; a means to provide critical status information and control rumors.
Cyber Incident Response Plan: Provides procedures for mitigating and correcting a cyber attack; addresses mitigation and isolation of affected systems, cleanup, and loss minimization.
Disaster Recovery Plan (DRP): How to recover IT mechanisms after a disaster. Focuses on disasters that require IT processing to take place at another facility.
BCP and DRP are two distinct, but related, plans
Business Continuity Plan (BCP): ensures that the business will continue to operate before (includes a focus on prevention), during, and after an event. A strategic (long-term) plan that identifies alternate personnel, equipment, and facilities.
Disaster Recovery Plan (DRP): a tactical, shorter-term plan that focuses on the immediate response and recovery of critical IT systems during a disruption. Contains procedures for emergency response (assessment, salvage, repair, and eventual restoration of damaged facilities and systems).
NIST 800-34: Contingency Planning Guide for Information Technology Systems. Seven step process for BCP and DRP projects.
ISO 17799: Code of Practice for Information Security Management. Section 14 addresses business continuity management.
BS25999: Code of Practice for Business Continuity Management.
NFPA 1600: Standard on Disaster / Emergency Management and Business Continuity Programs.
NFPA 1620: The Recommended Practice for Pre-Incident Planning.
HIPAA: Requires a documented and tested disaster recovery plan.
Cheaper cyber insurance (reduced risk from long term outages) Market advantage Process improvements Improved organizational maturity
(ISC)²: Project initiation; Business Impact Assessment; Recovery strategy; Plan design and development; Implementation; Testing; Continual maintenance
Pre-planning Activities/Policy: integrate law and regulations; define the scope, goals, and roles; choose project team members; develop project plan and project charter; management approval
BIA: identify critical functions (criticality analysis and impact statements) and resources; calculate MTD (Maximum Tolerable Downtime) and other key metrics (RTO, RPO); identify threats; calculate risks; identify backup solutions
Identify Preventive Controls: implement controls; mitigate risk
Develop Recovery Strategies: business process; facility; supply and technology; user and user environment; data
Develop BCP: document procedures, recovery solutions, roles and tasks, and emergency response
Exercise/Test/Drill: test plan; improve plan; train employees
Maintain BCP: integrate into change control process; assign responsibility; update plan; distribute after updating
Identify a business continuity coordinator to lead the BCP team
Develop the team: business units, senior management, IT department, security department, communications department, legal department
Develop a project plan
Gain management approval
Business Impact Analysis (BIA): formal method for determining how a disruption to the organization's IT systems will impact the mission. Consists of 2 processes: identification of critical assets, and a comprehensive risk assessment.
Steps: Description
Identify critical assets: IT assets that are mission-essential and must be recovered first. Identify interdependencies.
Conduct BCP/DRP-focused risk assessment: identify risks to each asset; conduct vulnerability analysis.
Determine Maximum Tolerable Downtime (MTD): the maximum time each business process can be inoperative before significant damage occurs or long-term viability is threatened. MTD = RTO + WRT. Consists of two metrics:
  Recovery Time Objective (RTO): maximum time allowed to recover business or IT systems (from disaster onset to resumption of business processes)
  Work Recovery Time (WRT): time required to configure a recovered system
Statements of Impact
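The MTD = RTO + WRT relationship can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the 24-hour and 8-hour figures are hypothetical, not values from the course material.

```python
from datetime import timedelta


def max_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """Return MTD as the sum of RTO and WRT (MTD = RTO + WRT)."""
    return rto + wrt


def fits_within_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """True when recovery time plus work recovery time stays inside the MTD."""
    return rto + wrt <= mtd


# Hypothetical figures for a single business process.
rto = timedelta(hours=24)   # time allowed to recover the system
wrt = timedelta(hours=8)    # time to configure the recovered system
print(max_tolerable_downtime(rto, wrt))  # 1 day, 8:00:00
```

If the business owner states a shorter MTD than the sum, either the RTO or the WRT has to shrink; fits_within_mtd makes that comparison explicit.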
Term: Definition
Recovery Point Objective (RPO): level of data/work loss or system inaccessibility (measured in time) resulting from a disaster that an organization can withstand, counted backwards from the onset of the disaster
Mean Time Between Failures (MTBF): average amount of time a system or device runs before it fails
Mean Time to Repair (MTTR): length of time to recover a failed device or system
Minimum Operating Requirements: minimum environmental and connectivity requirements needed to operate
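Because the RPO is counted backwards from the onset of the disaster, one way to test it is to measure the gap between the last good backup and the disaster. This is a minimal sketch under that assumption; the function names, timestamps, and 12-hour RPO are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, disaster_onset: datetime) -> timedelta:
    """Work lost, measured in time, counted backwards from the disaster onset."""
    if last_good_backup > disaster_onset:
        raise ValueError("backup timestamp is later than the disaster onset")
    return disaster_onset - last_good_backup


def meets_rpo(last_good_backup: datetime, disaster_onset: datetime, rpo: timedelta) -> bool:
    """True when the data loss window does not exceed the RPO."""
    return data_loss_window(last_good_backup, disaster_onset) <= rpo


# Hypothetical scenario: nightly backup at 02:00, disaster at 17:30 the same day.
backup = datetime(2024, 1, 10, 2, 0)
onset = datetime(2024, 1, 10, 17, 30)
print(data_loss_window(backup, onset))                # 15:30:00
print(meets_rpo(backup, onset, timedelta(hours=12)))  # False
```

In this scenario a nightly backup cannot satisfy a 12-hour RPO, which is the kind of gap the BIA is meant to surface.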
RPO: Technologies
8-14 days: New equipment, data recovery from backup
4-7 days: Cold systems, data recovery from backup
2-3 days: Warm systems, data recovery from backup
12-24 hours: Warm systems, recovery from high speed
6-12 hours: Hot systems, recovery from high speed backup media
3-6 hours: Hot systems, data replication
1-3 hours: Clustering, data replication
<1 hour: Clustering, near real time data replication
Adapted from CISSP Guide to Security Essentials
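The table above can be read as a lookup: given a target in hours, pick the slowest (and usually cheapest) row whose worst case still meets the target. The sketch below is one possible reading, not part of the source: the function name is hypothetical, row labels are copied as they appear in the table, and treating each row's upper bound as its worst case is an assumption.

```python
# (upper bound in hours, technology), ordered from slowest to fastest.
TIERS = [
    (336, "New equipment, data recovery from backup"),        # 8-14 days
    (168, "Cold systems, data recovery from backup"),         # 4-7 days
    (72, "Warm systems, data recovery from backup"),          # 2-3 days
    (24, "Warm systems, recovery from high speed"),           # 12-24 hours
    (12, "Hot systems, recovery from high speed backup media"),  # 6-12 hours
    (6, "Hot systems, data replication"),                     # 3-6 hours
    (3, "Clustering, data replication"),                      # 1-3 hours
    (1, "Clustering, near real time data replication"),       # <1 hour
]


def technology_for(objective_hours: float) -> str:
    """Return the slowest tier whose upper bound does not exceed the objective."""
    if objective_hours <= 0:
        raise ValueError("objective must be positive")
    if objective_hours >= TIERS[0][0]:
        return TIERS[0][1]          # anything looser than 14 days
    for upper_bound, technology in TIERS:
        if upper_bound <= objective_hours:
            return technology
    return TIERS[-1][1]             # objectives under 1 hour need the fastest tier


print(technology_for(30))    # Warm systems, recovery from high speed
print(technology_for(0.25))  # Clustering, near real time data replication
```

A 30-hour objective falls between two rows, so the lookup picks the 12-24 hour row rather than the 2-3 day row, which could miss the target.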
For each process, describe the impact on the rest of the organization if the process is incapacitated. Examples: inability to process payments; inability to produce invoices; inability to access customer data for support purposes.
Fortification of facility Redundancy (clustered servers, drives, etc.) Power lines Fire suppression/detection Redundant vendor support Insurance UPS/generators Data backup technologies Media protection safeguards Inventory
5 steps that we'll discuss: 1. Business process recovery 2. Facility recovery 3. Supply and technology recovery 4. User environment recovery 5. Data recovery
Define critical steps of a company's processes: required roles; required resources; input and output mechanisms; workflow steps; time for completion; interfaces with other processes
3 types of disruptions:
Nondisasters: disruption in service due to a device malfunction or failure
Disasters: an event that causes the loss of the entire facility for a day or longer
Catastrophes: major disruption that destroys the facility, requiring operations to move to an offsite facility
Type of offsite facility: Advantages / Disadvantages
Hot Site: fully configured with equipment and lines; data retrieved and loaded from backup site. Advantages: high availability (can be ready immediately or within a matter of hours). Disadvantages: expensive!
Warm Site: anywhere in between. Advantages: less expensive. Disadvantages: not immediately available (requires some setup and restoration).
Cold Site: supplies basic environment (electrical, AC, plumbing) but no systems; can also just be a reciprocal agreement. Advantages: least expensive. Disadvantages: lowest availability, longest restoration time; operational testing not available.
Note: For CISSP exam purposes, a hot site here is a subscription service not owned by the company!
Redundant sites:
Redundant site: equipped and configured exactly like the production site; data can be streamed live
Rolling hot site: large truck or trailer turned into a work area
Multiple processing centers: distributed through multiple locations
Recovery team must be able to recreate the environment: Hardware? Software? Configuration manuals? Where are your recovery plans stored?
How long will it take for new equipment to arrive? Many have requirements within 24 hours (do you have a contract with your vendor that provides for this?)
Backups: do you have the apps and operating systems to support your restored data (remember that we covered types of backups last week)?
Ensure that at least two copies of the company's operating system software and critical apps are available, one onsite and one offsite. Test these to ensure you can restore!
Employee notification: develop a Crisis Communications Plan
Call tree: used to rapidly communicate information throughout an organization by assigning the responsibility for contacting employees to other employees (e.g., Margaret calls Bob and 9 other people, Bob then calls 10 people, who each call 10 people, etc.)
Identify users who need to return to work and how they need to work
Can you return to paper processes? Can you automate processes?
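The call tree described above grows by a factor of 10 each round, so very few rounds are needed even for a large organization. The sketch below shows one way to lay out such a tree and count the rounds; the function names, the breadth-first assignment scheme, and the sample names other than Margaret and Bob are assumptions for illustration.

```python
def build_call_tree(names: list[str], fanout: int = 10) -> dict[str, list[str]]:
    """Assign callees breadth-first: names[0] starts the tree, each caller gets up to `fanout` people."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    tree = {}
    for i, caller in enumerate(names):
        first = i * fanout + 1
        callees = names[first:first + fanout]
        if callees:
            tree[caller] = callees
    return tree


def rounds_to_notify(headcount: int, fanout: int = 10) -> int:
    """Rounds of calls needed to reach `headcount` people, not counting the initiator."""
    if headcount < 0 or fanout < 1:
        raise ValueError("headcount must be >= 0 and fanout >= 1")
    reached, callers, rounds = 0, 1, 0
    while reached < headcount:
        newly_called = callers * fanout   # everyone called last round now makes calls
        reached += newly_called
        callers = newly_called
        rounds += 1
    return rounds


print(build_call_tree(["Margaret", "Bob", "Ann", "Carl", "Dee"], fanout=2))
# {'Margaret': ['Bob', 'Ann'], 'Bob': ['Carl', 'Dee']}
print(rounds_to_notify(1000))  # 3
```

With a fan-out of 10, three rounds reach up to 1,110 people, which is why call trees are kept shallow and each caller's list is kept short.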
Covered last week (it is all in how the archive bit is handled, remember?)
Full backup: every file is backed up and the archive bit is cleared
Differential backup: only files with the archive bit set are backed up, but the archive bit is left on the file (so the backup is cumulative until the next full backup runs and clears the bits; restoring requires the last full backup and the last differential)
Incremental backup: only files with the archive bit set are backed up, and the archive bit is cleared (restoring requires layering the incremental tapes in order over the full backup)
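The difference between the three backup types comes down to which files are copied and whether the archive bit is cleared afterwards. The following toy model illustrates that rule; it represents a file system as a dictionary of file name to archive bit, and the file names and function names are hypothetical.

```python
def full_backup(files: dict[str, bool]) -> list[str]:
    """Copy every file, then clear every archive bit."""
    copied = sorted(files)
    for name in files:
        files[name] = False
    return copied


def differential_backup(files: dict[str, bool]) -> list[str]:
    """Copy files whose archive bit is set; leave the bits set."""
    return sorted(name for name, bit in files.items() if bit)


def incremental_backup(files: dict[str, bool]) -> list[str]:
    """Copy files whose archive bit is set, then clear those bits."""
    copied = sorted(name for name, bit in files.items() if bit)
    for name in copied:
        files[name] = False
    return copied


# True means the archive bit is set (file changed since the bit was last cleared).
files = {"a.doc": True, "b.xls": True, "c.txt": True}
print(full_backup(files))          # ['a.doc', 'b.xls', 'c.txt']

files["a.doc"] = True              # a.doc is modified
print(differential_backup(files))  # ['a.doc']
files["b.xls"] = True              # b.xls is modified
print(differential_backup(files))  # ['a.doc', 'b.xls']  (cumulative)

print(incremental_backup(files))   # ['a.doc', 'b.xls']
files["c.txt"] = True              # c.txt is modified
print(incremental_backup(files))   # ['c.txt']  (only changes since the last backup)
```

The second differential run still includes a.doc because its bit was never cleared, while the second incremental run contains only c.txt, which is why incremental restores need every tape in order.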
Disk shadowing: online backup storage (disk mirroring is a one-to-one relationship; disk shadowing uses multiple drives to create shadow sets)
Electronic vaulting: makes copies of files as they are modified and periodically transmits them to offsite backup storage (common in banks)
Remote journaling: moves only the deltas that have taken place
Close enough, or with provisions in place, to access media when needed? Far enough away to withstand a regional disaster? Closed on weekends or holidays? Security controls commensurate with the production facility? Availability of a bonded transport system (Iron Mountain)? Does data need to be encrypted if leaving the production facility?
Method of transferring risk
Cyberinsurance: new type of insurance that covers DoS, malware, privacy-related lawsuits, downstream liability, etc.
Business interruption insurance: covers loss of revenue in the event something bad happens
BCP coordinator needs to define teams:
Damage assessment team: determines the cause of the disaster, the potential for further damage, and whether or not to activate the BCP
Restoration team: responsible for getting the alternate site into a working and functioning environment
Salvage team: responsible for starting the recovery of the original site
Media relations team
Security team
Telecommunications team
Reconstitution phase: when a company moves back to its original site or a new site
Test Type: Purpose
DRP Review: Most basic; the team that developed the DRP reads it from start to finish to ensure that it is complete
Checklist (consistency): Often performed concurrently with a structured walkthrough or tabletop test; lists all necessary components required for recovery
Structured Walkthrough/Tabletop: Group walks through the process on paper
Simulation Test/Walkthrough Drill: Teams actually carry out the recovery process (disaster is simulated); scope of simulation can vary
Parallel Processing: Recovery of crucial processing components at an alternate computing facility and then restoration from a previous backup (without disrupting production)
Partial and Complete Business Interruption: Risky! Processing is stopped at the primary location and transitioned to the alternate location
At least annually!! Identify test objectives and scope Identify Lessons Learned Revise the plan after testing (I look for lessons learned as an audit item) Note: BCPs are updated whenever there are significant changes to the organization
Determine how frequently to train (at least annually)
Good idea to train different roles more regularly
Train so that everyone knows the initial steps and where to find the plans: first aid and CPR; starting emergency power; call tree
http://www.bcmpedia.org/w/images/thumb/1/19/call_tree.png/400px-call_tree.png
Plans updated whenever there is a change to the environment Plans reviewed for updates at least annually if no changes Track and document all planned changes and implement a formal approval process for all substantial changes Changes must be auditable!
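The maintenance rule above (update on any environment change, otherwise review at least annually) can be expressed as a simple check. This is only a sketch: the function name, the 365-day interval as a stand-in for "annually", and the dates are assumptions.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=365)  # assumed reading of "at least annually"


def review_due(last_reviewed: date, today: date, environment_changed: bool = False) -> bool:
    """A plan needs attention after any environment change or once a year has passed."""
    return environment_changed or (today - last_reviewed) >= REVIEW_INTERVAL


print(review_due(date(2024, 1, 15), date(2025, 2, 1)))                           # True
print(review_due(date(2024, 6, 1), date(2025, 2, 1)))                            # False
print(review_due(date(2024, 6, 1), date(2025, 2, 1), environment_changed=True))  # True
```

A check like this could feed a change-control report so that overdue plans show up as auditable items.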
NIST SP 800-34 (now Rev. 1)
ISO/IEC-27031 (draft): part of the ISO 27000 series; addresses Information and Communications Technology (ICT) and the Information Security Management System (ISMS)
BS-25999 (2 parts): British business continuity standard
BCI (Business Continuity Institute): 6-step Good Practice Guidelines
Lack of management support No coordination with vendors Lack of testing Lack of prioritization Lack of training and awareness
Cloud environments complicate disaster recovery
Cloud environments can be a part of an organization's DR process
Must plan how personnel will access the cloud
Which of the following is the number one priority of all BCP and DRPs? A. The elimination of potential outages B. The reduction of potential outages C. Protection and welfare of employees D. The minimization of potential outages
Maximum Tolerable Downtime (MTD) comprises which two metrics? A. Recovery Point Objective (RPO) and Work Recovery Time (WRT) B. Recovery Point Objective (RPO) and Mean Time to Repair (MTTR) C. Recovery Time Objective (RTO) and Mean Time to Repair (MTTR) D. Recovery Time Objective (RTO) and Work Recovery Time (WRT)
An example of risk transference is: A. Offsite storage B. Insurance C. Maintaining spare equipment offsite D. Fire suppression
What is one of the first steps in identifying a BCP? A. Identify backup solution B. Decide whether the company needs to perform a walk-through, parallel, or simulation test C. Perform a business impact analysis D. Develop a business resumption plan.
Which plan details the steps required to restore normal business operations/mission after recovery from a disruptive event? A. Business Continuity Plan (BCP) B. Business Resumption Plan (BRP) C. Continuity of Operations Plan (COOP) D. Occupant Emergency Plan (OEP)
Which draft Business Continuity guideline ensures continuity of Information and Communications Technology (ICT) as a part of the organization's Information Security Management System (ISMS)? A. BCI B. BS-7799 C. ISO/IEC-27031 D. NIST SP 800-34
Which of the following best describes the difference between an Information Systems Contingency Plan and Disaster Recovery Plan? A. Information Systems Contingency Plan procedures are developed for recovery of the system regardless of site or location after a non-disaster B. Disaster Recovery Plan procedures are developed for recovery of the system regardless of site or location C. Disaster Recovery Plan can be activated at the system's current location or at an alternate site D. Information Systems Contingency Plan is developed for disasters that require restoration of IT systems at an alternate site.
What is the primary objective of a disaster recovery plan? A. To recover critical processes in a timely manner B. To manage public relations after a crisis C. To minimize financial loss during a normal operations outage D. To redesign the security infrastructure of the organization after an emergency
A critical company asset would most likely have which of the following MTD values? A. Minutes to hours B. Days C. Weeks D. Months