Incident Management, Business Continuity and IT Disaster Recovery
Aggeliki Tsohou
Lecturer, Ionian University, Department of Informatics, Greece
atsohou@ionio.gr

Contents
- Information Security Incident Management
  - Terminology
- Business Continuity Management
  - Terminology
  - Business Continuity facts and practices
- Case study presentation

Terminology
- Incident: a situation that might be, or could lead to, a business disruption, loss, emergency or crisis
- Information Security Incident: a single or a series of unwanted or unexpected information security events that have a significant probability of compromising business operations and threatening information security
- Information Security Event: an identified occurrence of a system, service or network state indicating a possible breach of information security policy or failure of controls, or a previously unknown situation that may be security relevant

Terminology
- Information Security Incident Management: processes for detecting, reporting, assessing, responding to, dealing with, and learning from information security incidents
- Information Security Incident Response Team (ISIRT): a team of appropriately skilled and trusted members of the organization, which handles information security incidents during their lifecycle. At times this team may be supplemented by external experts, for example from a recognized computer incident response team or Computer Emergency Response Team (CERT)

Background
- Risk management, operational security, forensics, awareness and training, and compliance monitoring are some of the key information security priorities of organizations
  Source: Ernst & Young (2013), Under cyber attack: EY's Global Information Security Survey 2013
- The number of reported security incidents keeps growing: from 3.4 million per year in 2009 to 42.8 million in 2014
  Source: PricewaterhouseCoopers, Managing cyber risks in an interconnected world: Key findings from The Global State of Information Security Survey 2015

Examples of Information Security Incidents
- Denial of Service Incidents: resource elimination and resource starvation
- Information Gathering Incidents
  - By technical means: e.g., pinging network addresses to find systems that are alive, scanning the available network ports on a system to identify the related services (e.g. e-mail, FTP, Web) and the software version of those services
  - By non-technical means: e.g., theft of intellectual property stored electronically, misuse of information systems
- Unauthorized Access: e.g., buffer overflow attacks that attempt to gain privileged access to a target, attempts to retrieve password files

Incident Management as part of an ISMS (ISO 27001)
Objective: To ensure a consistent and effective approach to the management of information security incidents, including communication on security events and weaknesses
Controls:
- Management responsibilities and procedures should be established to ensure a quick, effective and orderly response to information security incidents
- Information security events should be reported through appropriate management channels as quickly as possible
- Employees and contractors using the organization's information systems and services should be required to note and report any observed or suspected information security weaknesses in systems or services

Incident Management as part of an ISMS (ISO 27001)
Objective: To ensure a consistent and effective approach to the management of information security incidents, including communication on security events and weaknesses
Controls:
- Information security events should be assessed and it should be decided if they are to be classified as information security incidents
- Information security incidents should be responded to in accordance with the documented procedures
- Knowledge gained from analysing and resolving information security incidents should be used to reduce the likelihood or impact of future incidents
- The organization should define and apply procedures for the identification, collection, acquisition and preservation of information that can serve as evidence

Information Security Incident Management
Phases (figure: cycle): Plan and Prepare, Use, Review, Improve

Plan & Prepare
- Information security incident management policy, and commitment of senior management
- Development and documentation of an information security incident management scheme
- Update of all corporate, system, service and network security policies
- ISIRT establishment with defined roles and responsibilities
- Information security incident management awareness briefings and training
- Information security incident management scheme testing

Use
- Information security event detection and reporting, by human or automatic means
- Collection of information on information security events, and assessment against defined criteria to decide which events are to be categorized as information security incidents
- Responses to information security incidents, including:
  - Response in real time or near real time
  - Crisis activities and activation of business continuity
  - Forensic analysis
  - Logging
  - Resolution

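
The assessment step above (deciding which events become incidents) can be sketched in Python. This is only an illustration: the slides do not state the criteria, so the severity scale, the threshold and the critical-asset flag are hypothetical assumptions, as are all names.

```python
from dataclasses import dataclass


@dataclass
class SecurityEvent:
    description: str
    severity: int                 # hypothetical scale: 1 (low) to 5 (critical)
    affects_critical_asset: bool  # hypothetical flag set during assessment


def is_incident(event: SecurityEvent, severity_threshold: int = 3) -> bool:
    """Assumed criteria: severity at or above the threshold,
    or any event that touches a critical asset."""
    return event.severity >= severity_threshold or event.affects_critical_asset


# Example events, loosely based on the incident types listed earlier
events = [
    SecurityEvent("Port scan from external address", 2, False),
    SecurityEvent("Repeated failed logins on payroll server", 2, True),
    SecurityEvent("Web server resource starvation", 4, False),
]
incidents = [e for e in events if is_incident(e)]
```

In practice the criteria would come from the organization's incident management scheme; the point is only that they are written down before an event occurs.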
Review
- Further forensic analysis
- Identification of lessons learnt from incidents
- Identification of improvements to security
- Identification of improvements to the information security incident management scheme

Improve
- Make improvements to security risk analysis and management review results
- Initiate improvements to security
- Make improvements to the information security incident management scheme

Benefits
- Reducing adverse business impacts, for example disruption and financial loss, caused by information security incidents
- Strengthening the information security incident prevention focus
- Strengthening of prioritization and evidence
- Contributing to budget and resource justifications
- Improving updates to risk analysis and management results
- Providing enhanced information security awareness and training program material
- Providing input to information security policy and related documentation reviews

Terminology
- Business Continuity: strategic and tactical capability of the organization to plan for and respond to incidents and business disruptions in order to continue business operations at an acceptable pre-defined level

Terminology
- Business Continuity Management (BCM): holistic management process that identifies potential threats to an organization and the impacts to business operations that those threats, if realized, might cause, and that provides a framework for building organizational resilience with the capability for an effective response that safeguards the interests of its key stakeholders, reputation, brand and value-creating activities

Terminology
- Business Continuity Plan: documented collection of procedures and information that is developed, compiled and maintained in readiness for use in an incident to enable an organization to continue to deliver its critical activities at an acceptable pre-defined level

BCM is complementary to risk management
- Risk management helps an organization understand the risks to its operations or business, and the consequences of those risks.
- BCM helps an organization recognize what needs to be done before an incident occurs to protect its people, premises, technology, information, supply chain, stakeholders and reputation when the incident happens.
- BCM helps an organization take a realistic view of the responses that are likely to be needed when an incident occurs, so that it can be confident that it will manage through any consequences.

Incident Timeline
(figure; source: BS 25999-1:2006)

BCM lifecycle and its elements
(figure; source: BS 25999-1:2006)

BCM lifecycle and its elements
- BCM programme management: enables the business continuity capability to be both established and maintained in a manner appropriate to the size and complexity of the organization.
- Understanding the organization: provides information that enables prioritization of an organization's products and services and the urgency of the activities that are required to deliver them. Business Impact Analysis is the core process of this phase.
- Determining BCM strategy: allows an appropriate response to be chosen for each product or service, such that the organization can continue to deliver those products and services:
  - at an acceptable level of operation, and
  - within an acceptable timeframe during and following a disruption.

BCM lifecycle and its elements
- Developing and implementing a BCM response: creation of a management framework and a structure of incident management, business continuity and business recovery plans that detail the steps to be taken during and after an incident to maintain or restore operations.
- BCM exercising, maintaining and reviewing: demonstrating the extent to which the organization's strategies and plans are complete, current and accurate, and identifying opportunities for improvement.
- Embedding BCM in the organization's culture: making BCM part of the organization's core values and instilling confidence in all stakeholders in the ability of the organization to cope with disruptions.

Business Impact Analysis
- Identification of the activities, assets and resources, including those outside the organization, that support the delivery of the fundamental products and services
- Identification of interdependencies between the organization's activities
- Identification of any reliance on external organizations, and any reliance placed upon the organization by other organizations

Business Impact Analysis
For each activity:
- establish the maximum tolerable period of disruption by identifying:
  - the maximum time period after the start of a disruption within which the activity needs to be resumed (recovery time objective),
  - the minimum level at which the activity needs to be performed on its resumption,
  - the length of time within which normal levels of operation need to be resumed (maximum tolerable downtime);
- identify any inter-dependent activities, assets, supporting infrastructure or resources that also have to be maintained continuously or recovered over time

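
The per-activity parameters above can be recorded and checked with a small Python sketch. The class, field names and all figures are hypothetical and for illustration only; the sketch simply assumes that the recovery time objective must not exceed the maximum tolerable downtime, and orders activities by how urgently they must be resumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActivityBIA:
    name: str
    rto_hours: float          # activity must be resumed within this time
    min_service_level: float  # fraction of normal capacity needed on resumption
    mtd_hours: float          # normal operation must be restored within this time

    def validate(self) -> None:
        if not 0 < self.min_service_level <= 1:
            raise ValueError(f"{self.name}: minimum level must be in (0, 1]")
        if self.rto_hours > self.mtd_hours:
            raise ValueError(f"{self.name}: RTO exceeds maximum tolerable downtime")


def recovery_order(activities: List[ActivityBIA]) -> List[str]:
    """Return activity names sorted by urgency (shortest RTO first)."""
    for activity in activities:
        activity.validate()
    return [a.name for a in sorted(activities, key=lambda a: a.rto_hours)]


# Hypothetical figures, not taken from any real analysis
activities = [
    ActivityBIA("Payroll", rto_hours=72, min_service_level=0.5, mtd_hours=168),
    ActivityBIA("Patient admissions", rto_hours=4, min_service_level=0.8, mtd_hours=24),
    ActivityBIA("Document archive", rto_hours=48, min_service_level=0.3, mtd_hours=120),
]
order = recovery_order(activities)
```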
IT Disaster Recovery
- Information Technology (IT) is essential to most organizations, and very few can operate for anything other than a short period of time without computer support
- IT disaster recovery is the process by which IT and associated infrastructure are recovered following a disruption to services
- An IT disaster recovery solution is decided based on two main parameters:
  - a recovery point objective
  - a recovery time objective, which should be less than the maximum tolerable downtime

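
The two parameters above lend themselves to a simple consistency check, sketched here in Python. The function name and the figures are hypothetical. The sketch assumes that the worst-case data loss equals the interval between back-ups (ignoring the time a back-up takes), so that interval must not exceed the recovery point objective.

```python
from typing import List


def check_dr_solution(rpo_hours: float, rto_hours: float,
                      mtd_hours: float, backup_interval_hours: float) -> List[str]:
    """Return a list of problems; an empty list means the figures are consistent."""
    problems = []
    if rto_hours >= mtd_hours:
        problems.append("RTO must be less than the maximum tolerable downtime")
    if backup_interval_hours > rpo_hours:
        problems.append("back-up interval exceeds the RPO")
    return problems


# Daily back-ups, 24 h RPO, 3-day RTO, 1-week maximum tolerable downtime
consistent = check_dr_solution(rpo_hours=24, rto_hours=72,
                               mtd_hours=168, backup_interval_hours=24)

# A 4 h RPO cannot be met with daily back-ups, and the RTO is too long
inconsistent = check_dr_solution(rpo_hours=4, rto_hours=200,
                                 mtd_hours=168, backup_interval_hours=24)
```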
IT Disaster Recovery Facilities
- Cold Sites: facilities with adequate space and infrastructure (electric power, telecommunications connections, and environmental controls) to support information system recovery activities
- Warm Sites: partially equipped office spaces that contain some or all of the system hardware, software, telecommunications, and power sources
- Hot Sites: facilities appropriately sized to support system requirements and configured with the necessary system hardware, supporting infrastructure, and support personnel
- Mobile Sites: self-contained, transportable shells custom-fitted with specific telecommunications and system equipment necessary to meet system requirements
- Mirrored Sites: fully redundant facilities with automated real-time information mirroring. Mirrored sites are identical to the primary site in all technical respects

IT Disaster Recovery Facilities
Site      Cost         H/W      Telecommunication  Time     Location
Cold      Low          None     None               Long     Fixed
Warm      Medium       Partial  Partial/Full       Medium   Fixed
Hot       Medium/High  Full     Full               Short    Fixed
Mobile    High         Depends  Depends            Depends  Mobile
Mirrored  High         Full     Full               Zero     Fixed

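
The table above can be read as a selection rule: pick the cheapest site type whose recovery time is still acceptable. A minimal Python sketch follows, assuming an ordinal ranking of the time categories and leaving out mobile sites because their characteristics are listed as "Depends". The names are hypothetical.

```python
# Fixed-location site types from the table, ordered from cheapest to most expensive
SITES = [
    ("Cold", "Long"),
    ("Warm", "Medium"),
    ("Hot", "Short"),
    ("Mirrored", "Zero"),
]

# Assumed ordinal ranking of the table's recovery time categories
TIME_RANK = {"Zero": 0, "Short": 1, "Medium": 2, "Long": 3}


def cheapest_site(max_recovery_time: str) -> str:
    """Return the cheapest site type that recovers within the given time category."""
    limit = TIME_RANK[max_recovery_time]
    for name, recovery_time in SITES:
        if TIME_RANK[recovery_time] <= limit:
            return name
    raise ValueError("no site type satisfies the requirement")


choice = cheapest_site("Medium")
```

Because the list is ordered by cost, the first match is the cheapest acceptable option, which mirrors the cost/time trade-off the table describes.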
Trade-off between cost and time (illustrative figure)

IT Disaster Recovery Strategies
- Infrastructure considerations:
  - Organization owning the facilities
  - Organization renting the facilities
- Back-up considerations:
  - Frequency
  - Location
  - Labeling
  - Storage media
  - Media disposal processes
  - Means of transportation

IT Disaster Recovery Strategies
- Hardware considerations:
  - Agreements with suppliers
  - Hardware inventory
  - Re-use of existing compatible hardware
- People considerations:
  - Recovery groups
  - Hierarchy of people and groups

Program Exercise
Assurance that the Business Continuity and IT Disaster Recovery plans will work as anticipated when required, through:
- exercising the technical, logistical, administrative, procedural and other operational systems of the plans
- exercising the plan arrangements (e.g. roles, responsibilities) and infrastructure (e.g. locations)
- exercising the technology and telecommunications recovery, including the availability and relocation of staff
- verifying that the business continuity plan incorporates all critical activities and their dependencies
- validating the effectiveness and timeliness of restoration of critical activities

Maintenance of plans
Regular review of:
- Business activities and their criticality
- Interdependencies
- Hardware and software that support activities
- Names and contact details of recovery groups
- Names and contact details of suppliers
- Recovery facilities

Recent challenges and trends (Ernst and Young, 2012 Global Information Security Survey)
- Studies show that two out of five businesses that experience a disaster go out of business within five years

Recent challenges and trends (Ernst and Young, 2012 Global Information Security Survey)
- 17% of the respondents said that their organizations do not have a BCM program in place
- Of the organizations that do have a BCM program, only 25% believe that their program reflects a leading practice approved by senior management, with defined standards and guidelines, roles and responsibilities, and tools and techniques

Recent challenges and trends (Ernst and Young, 2012 Global Information Security Survey)
- The most common problem with BCM programs is the lack of governance integration between business continuity and IT disaster recovery
- Other challenges:
  - Lack of senior management support
  - Unclear roles and responsibilities
  - Conflicting priorities of the business
  - Ineffective coordination between the business and IT
  - Constant changes in the business and in IT

Recent challenges and trends (Ernst and Young, 2012 Global Information Security Survey)
Problems indicating a lack of integration between business continuity and IT disaster recovery:
- Business owners assuming that IT backs up all information and can quickly and successfully recover it after an interruption
- IT teams implementing a recovery solution that does not meet the needs of the business
- Top management rejecting IT's requests for disaster recovery funding
- IT teams not informing management and other personnel about the interdependencies of the critical systems and applications
- No consideration of the ramifications if people are not available to support IT systems and business processes

Statistics for the survey question: "Which of the following applies to your BCM strategy?" (figure)

Issues noticed by practitioners
- It is difficult to make business continuity plans that cover the whole organization
- Business continuity plans may be designed but never used in reality
- Business continuity plans may be designed but never updated
- Business continuity plans may be designed but never tested and exercised
- Upon an incident, older versions of the plans may be used instead of the current ones
- After an incident, lessons learnt may not be documented

Issues noticed by practitioners
- It may be difficult to anticipate an incident that the organization should nonetheless be prepared for, because the incident might be something completely new, e.g.:
  - the Stuxnet cyberweapon
  - the cyberattacks on Estonia
- It is not always easy to predict severe impacts, such as reputation damage due to IT security incidents, e.g.:
  - the Edward Snowden case in the US, which caused severe reputation impacts due to information leakage (the extent of the leaks may never be known, according to US investigators)
  - the cyberattack on the Finnish Ministry of Foreign Affairs

Academic and practical problems
Confusion in the terminology used in organizations and the literature:
- Activities: Disaster recovery planning, Incident response planning, Business continuity planning, Incident management response, Business impact analysis, Vulnerability assessment, Contingency planning, Crisis management planning, IT preparedness, Security incident management
- Plans: Business Continuity Plan, Business Recovery Plan, Continuity of Operations Plan, IT Contingency Plan, Crisis Communication Plan, Cyber Incident Response Plan, Disaster Recovery Plan, Occupant Emergency Plan

Academic research
The few studies that exist are conceptual, or only describe empirical data gathered through interviews or case studies. Examples:
- Werlinger, R., Muldner, K., Hawkey, K., Beznosov, K. (2010), "Preparation, detection, and analysis: the diagnostic work of IT security incident response", Information Management & Computer Security, 18 (1), 26-42
- Omar, A., Alijani, D., Mason, R. (2011), "Information Technology Disaster Recovery Plan: Case Study", Academy of Strategic Management Journal, 10 (2), 127-141
- Al-Badi, A., Ashrafi, R., Al-Majeeni, A., Mayhew, P. (2009), "IT disaster recovery: Oman and Cyclone Gonu lessons learned", Information Management & Computer Security, 17 (2), 114-126

Academic research
(figure; source: Bajgoric, N. (2006), "Information technologies for business continuity: an implementation framework", Information Management & Computer Security, 14 (5), 450-466)

Academic research
- Under-investigated research area
- Interdisciplinary nature with unclear boundaries
- Difficulty of access to empirical data about real security incidents and how they were handled

CASE

Case: The Hospital
- According to the hospital management, the maximum tolerance period was 2 weeks
- According to our analysis, the maximum tolerance period was 1 week

Case: The Hospital
The disaster recovery planning method followed:
- Context establishment
- Impact analysis
- Identification of threats
- Identification of impacts
- Identification of requirements for recovery
- Disaster recovery planning

Context establishment
- 1 server hosting the Document Management application
- 1 server hosting the Blood Results application
- 1 server hosting the Human Resources and Payroll application
- 1 server hosting the Patients Management application and the Logistics application
- 1 web server
- The Document Management application
- The Blood Results application
- The Human Resources and Payroll application
- The Patients Management application
- The Logistics application
- MS Office

Impact analysis
- The leading question for developing the security plan was: which threats are more probable?
- The leading question for developing the disaster recovery plan was: which threats have the most severe impact and, in particular, which can lead to total loss of the system?

Impact analysis
- Terrorism
- Fire
- Earthquake
- Wilful damage
- Theft by outsiders
Threats which may lead to partial and total loss of data (threats with High or Very high probability):
- Masquerading by insiders
- Masquerading by outsiders

Disaster Recovery planning
Back-up processes: three generations of back-ups
- First: every day, data only, stored at the same physical location
- Second: total back-up, weekly, taken on Sunday and loaded on Monday morning, stored outside the facilities
- Third: annual back-up, kept for every year and stored outside the facilities
Roles:
- Group for personnel safety
- Group for Disaster Recovery plan execution

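
The three back-up generations above can be expressed as a small scheduling rule, sketched here in Python. The function name is hypothetical, and the date of the annual back-up (31 December) is an assumption, since the slides do not say when it is taken.

```python
from datetime import date
from typing import List


def backups_due(day: date) -> List[str]:
    """Return the back-up generations to run on the given day."""
    due = ["daily data back-up (same location)"]  # first generation: every day
    if day.weekday() == 6:                         # Monday is 0, so Sunday is 6
        due.append("weekly total back-up (off-site)")
    if (day.month, day.day) == (12, 31):           # assumed date for the annual back-up
        due.append("annual back-up (off-site)")
    return due


monday = backups_due(date(2024, 3, 11))    # only the daily back-up
sunday = backups_due(date(2024, 3, 10))    # daily and weekly
year_end = backups_due(date(2023, 12, 31)) # a Sunday that is also year end
```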
Disaster Recovery planning
- Training for the plan execution
- Annual full rehearsal
- Back-ups will be stored in fire-protective cabinets with at least 30 minutes of tolerance
- Warm Disaster Recovery site:
  - At least 3 servers
  - Telecommunications
  - Alarm
- Hardware maintenance contract with:
  - Maximum 1 week for replacement in case of disaster

References
- ISO/IEC TR 18044:2004, Information technology -- Security techniques -- Information security incident management
- BS 25999-1:2006, Business continuity management -- Code of practice, British Standards