Bank of Canada. IT Operations and Infrastructure Services



Bank of Canada
IT Operations and Infrastructure Services

Bank of Canada (BoC) White Paper
Business Continuity Plan (BCP) versus Disaster Recovery Plan (DRP)

For presentation to the XXXIV MEETING ON CENTRAL BANK SYSTEMATIZATION, Sept 7-9, 2011, SANTIAGO, CHILE

Date of Issue: August 17, 2011
Bank of Canada, Information Technology Services (ITS)
Daniel Schaffler, Victor Baez with Daniel Lamoureux


1. Who we are

The Bank of Canada is the nation's central bank, with four main areas of responsibility:

- Monetary policy: The Bank contributes to solid economic performance and rising living standards for Canadians by keeping inflation low, stable and predictable. Since 1991, the Bank's monetary policy actions toward this goal have been guided by a clearly defined inflation target.
- Currency: The Bank designs, produces and distributes Canada's bank notes and replaces worn notes. It deters counterfeiting through leading-edge bank note design, public education and collaboration with law-enforcement agencies.
- Financial System: The Bank promotes a stable and efficient financial system in Canada and internationally. To this end, the Bank oversees Canada's key payment, clearing and settlement systems; acts as lender of last resort; assesses risks to financial stability; and contributes to the development of financial system policies.
- Funds Management: The Bank provides effective and efficient funds-management services for the Government of Canada, as well as on its own behalf and for other clients. For the government, the Bank provides treasury-management services and administers and advises on the public debt and foreign exchange reserves. In addition, the Bank provides banking services to critical payment, clearing and settlement systems.

Our principal role, as defined in the Bank of Canada Act, is to promote the economic and financial welfare of Canada.

The Bank was founded in 1934 as a privately owned corporation. In 1938, it became a Crown corporation belonging to the federal government. Since that time, the Minister of Finance has held the entire share capital issued by the Bank. Ultimately, the Bank is owned by the people of Canada.

The Bank is not a government department and conducts its activities with considerable independence compared with most other federal institutions. For example:

- The Governor and Senior Deputy Governor are appointed by the Bank's Board of Directors (with the approval of Cabinet), not by the federal government.
- The Deputy Minister of Finance sits on the Board of Directors but has no vote.
- The Bank submits its expenditures to its Board of Directors. Federal government departments submit theirs to the Treasury Board.
- Bank employees are regulated by the Bank itself, not by federal public service agencies.
- The Bank's books are audited by external auditors appointed by Cabinet on the recommendation of the Minister of Finance, not by the Auditor General of Canada.

1.1. History

Canadians have always been firm believers in the value of insurance, and the Bank of Canada as an institution is no exception. Making the business case for Business Continuity Management (BCM) and Disaster Recovery Planning (DRP) has never been at issue for the Bank. The necessity of fulfilling our legislated mandate, practicing good governance, and being prepared to provide uninterrupted, essential services vital to the national and global financial community has been well understood and accepted, and indeed has been an integral part of the Bank's culture from the early days of the mainframe to the more recent distributed computing environment.

The IT computing environment running the Bank's core applications in the latter part of the 20th century consisted of a mainframe computer dedicated to the production environment, and another mainframe computer in a separate, geographically distant location dedicated to the testing and development environments. The production environment was very tightly controlled and restricted, with a high degree of policy and procedure in place to manage access to the production environment and the migration of changes from development to production.

Disaster recovery plans were in place to facilitate complete recovery of the production mainframe computing and network environment, and included off-site vaulting of what were deemed vital records (e.g., tape backups of the system, software, applications and data). A dedicated Disaster Recovery Coordinator position was staffed, with responsibility for overseeing the creation and management of disaster recovery plans and procedures to ensure business continuity.

Disaster recovery plans were exercised and tested twice yearly and entailed full recovery of the production mainframe environment onto the development mainframe. Additional exercises, limited to the IT Services department (at the time known as the Automation Services Department (ASD)), were also undertaken periodically to ensure that ASD was well positioned to provide IT services to meet the business continuity requirements of the larger Bank. Those exercises also provided a means by which ASD could ensure that changes introduced into the production environment had been factored into business continuity plans.

With the advent and adoption of the distributed computing environment, the complexity of business continuity management increased significantly. Through it all, the Bank has placed, and will continue to place, high importance on the need to have proven strategies and plans in place.

1.2. Our Changing World

"Everything flows and nothing stays."
Heraclitus, Greek Philosopher (c.535-c.475 BC)

Just as computing hardware capacity has grown exponentially over the long term, as described by Moore's law, so too has the complexity of both the business and computing environments evolved, along with the spectrum of risks to be dealt with. In today's world, business and IT organizations are:

- Striving to leverage lessons learned and apply best practices
- Increasingly required to demonstrate their value (profitability, cost-effectiveness)
- Re-defining relationships between internal and external business units
- Dealing with changing business models
- Addressing rapidly changing technologies, standards and practices

The Bank of Canada, like every other organization, has been subject to the technological, natural and environmental forces that shape the nature of our business and the delivery of our services, and the changes that need to be undertaken to stay current and sustainable.

Complexity of environment adapting to emerging technologies

Most, if not all, IT organizations are additionally trying to manage IT as a business, amidst a dramatically changing technology landscape that is both much more complex and more challenging than previous computing environments. Computing platform changes (mainframe vs. distributed vs. cloud; centralized vs. decentralized), networking advances (multimedia; VoIP; data and voice convergence; fiber), server and storage evolution (clusters; virtualization; NAS; SAN; backup and recovery technology), applications development (SaaS; multi-threading) and services all come with various benefits and risks, and can dramatically alter the business continuity posture of an IT organization. Continuous vigilance is required with respect to availability, reliability and sustainability, along with alignment between business requirements, service level expectations and the cost of doing business.

Increased risk spectrum

The spectrum of risks facing an organization from natural and environmental forces has changed significantly over the course of the last two decades. From 1998 to the present, the Bank has rapidly responded to, and successfully dealt with, the following naturally occurring events, with marginal disruption to the conduct of business and the delivery of services:

- An ice storm affecting power distribution lines across central Canada
- A magnitude 5.0 earthquake in central Canada

On the environmental front, the same can be said for the following actual and on-going events:

- The new millennium (Y2K)
- Heightened terrorism threat after September 11
- Severe Acute Respiratory Syndrome (SARS) outbreak
- "Take the Capital" protest in Ottawa against the G8 meeting being held in Alberta
- The largest power outage in North American history
- Suspicious package deposited outside the Bank's head office
- Potential influenza pandemic

- 2010: Major equipment failure affecting the Bank's telephony system
- Heightened occurrence of cyber attacks
- On-going: succession planning and the loss of corporate knowledge
- On-going: supply chain (including outsourcing) and vendor-induced disruptions

The above-mentioned events and threats, as well as the potential for future ones, have all contributed directly to the constant re-assessment, strengthening and evolution of the Bank of Canada's risk management, business continuity and IT service continuity management plans and posture.

2. Overview of Risk Management Framework, Continuity of Operations Program, and IT Service Continuity Management

2.1. Risk Management Framework

The Bank developed its risk management framework in 1971 in consultation with the Board of Directors. Risk management is viewed by the Bank as particularly important to sound governance, decision making and accountability. The framework supports informed decision making by ensuring that the appropriate competencies, analytic tools, consultation and communication form the foundation for innovation and responsible risk taking.

The risk management framework is fully integrated with the Bank's corporate management processes. It is incorporated into the annual planning, priority-setting and budget processes, and into the quarterly and yearly stewardship processes. It is supported by an in-house tool for tracking and classifying operational risk events, which gives the Bank insight into the nature of problems that arise with its processes and systems.

The Bank has had a long-standing, well-established security and administrative framework for safeguarding its personnel and assets (physical, information and financial). The safeguards include: policies and standards; personnel screening; physical and logical security equipment and processes; business continuity planning; and security awareness programs.

Continuity of Operations (COOP)

Mandate

The Bank of Canada delivers services that are essential to the economic well-being of the nation. To ensure that those services, and the Bank's role in the global financial community, continue to be delivered during a disruptive event, the Bank has created a Continuity of Operations (COOP) program. The COOP program encompasses all disciplines necessary to enable recovery of essential Bank services after a disruptive event, with emphasis on the protection of Bank employees and property.

Principles and framework

The Bank of Canada Continuity of Operations program has been explicitly designed to meet the standards set by applicable sections of the Bank Security Policy (BSP) and the National Fire Protection Association (NFPA) 1600 Standard on Disaster/Emergency Management and Continuity of Operations Programs. Compliance with the BSP is mandatory, but compliance with NFPA 1600 is voluntary.

The Continuity of Operations program is an ongoing management and governance process mandated and supported by senior management, and resourced to ensure that the necessary steps are taken to identify the impact of potential losses, maintain viable recovery strategies and plans, and help ensure continuity of key functions and processes through exercising, rehearsal, testing, training and maintenance.

Relation between COOP and the Risk Management Framework

The COOP program is a key part of the Bank's risk framework for safeguarding its personnel and assets (physical, information and financial), as the COOP program guides, supports, and promotes the Bank's plans to:

- Help ensure the safe evacuation of all persons on Bank property in an emergency
- Continue its critical business in the event of a disaster or crisis

Business Impact Analysis (BIA)

The fundamental goal of the Continuity of Operations discipline is to identify the mission-critical processes that support key Bank activities, the maximum recovery timeframes for those processes after a disruptive event, and the procedures used to restore those processes after a business interruption.

To maintain an inventory of Bank-wide business processes and functions (not all of which are IT related or have an IT connotation), the COOP program conducts a Business Impact Analysis (BIA). The BIA is the mechanism for identifying and prioritizing the Bank's critical business processes, based on the impact, injury and loss that would result if a process were to become unavailable for any reason.

The Business Impact Analysis is necessarily a point-in-time snapshot that reflects the functions of the Bank and the recovery priorities and timeframes as they exist when the BIA is created. However, the Bank is an evolving organization, and the COOP program recognizes that functions and priorities may shift over time. Therefore, the COOP Program Office is charged with responsibility for facilitating a review of the BIA every two years, to ensure that each Department's inputs accurately represent the Bank's current operational state.

IT Service Continuity Management (ITSCM)

Mandate

Information Technology Service Continuity Management (ITSCM) is concerned with managing the organization's ability to continue to provide a pre-determined and pre-approved level of IT service to support the minimum business requirements following an interruption to the business. The interruption may range from an application or system failure to a complete loss of the business premises.

Principles and framework

ITSCM is based on the IT Infrastructure Library (ITIL). ITIL is a publicly available framework used by organizations worldwide to establish and improve capabilities in IT Service Management, and to provide value to customers in the form of services. The main benefits of ITIL include:

- Alignment with business needs
- Negotiated achievable service levels
- Predictable, consistent processes
- Efficiency in service delivery, with well-defined processes
- Measurable, improvable services and processes
- Common language and terms

ITSCM is a mature process within the IT Service Management group of the IT Services (ITS) department. It is supported by IT senior management, and resourced to ensure that the necessary steps are taken to identify the impact of potential losses, maintain viable IT recovery strategies and plans, and help ensure IT continuity of key functions and processes through exercising, rehearsal, testing, training and maintenance.

Relation between ITSCM and COOP

ITSCM is a key part of the overall Continuity of Operations process and is dependent upon information derived through this process. ITSCM is focused on the continuity of IT services to the business, and the COOP program is concerned with the Business Continuity management process that incorporates all services upon which the business depends, one of which is IT.

ITSCM supports the overall COOP process by ensuring that the required IT infrastructure, applications, and services identified as critical by the business can be recovered within the required, and agreed upon, business timescales. To accomplish this goal, ITSCM ensures that proactive measures are:

- in place to minimize or avoid business disruptions caused by IT outages,
- supported as part of normal IT service deliverables, and
- factored into all IT projects and initiatives.

ITSCM provides a framework that minimizes risk for the management and provision of IT services (for either actual or potential disruptions) to defined service levels. Accordingly, ITSCM focuses not only on reactive contingency measures, but also on proactive measures to avoid serious business disruptions.

Input from the BIA

Information technology is a critical enabling resource that is often required to restore the operation of many Bank processes and functions. Therefore, ITSCM is responsible for reviewing the BIA and ensuring that the Bank's information technology recovery plan accurately reflects the business priorities and timeframes as they are represented in the BIA.

The BIA also provides the means to categorize business processes into Tiers based on the maximum allowable downtime. These Tiers allow IT to:

- Define service levels for applications and infrastructure by tier, instead of defining service levels for each application or service
- Identify the Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for the applications and IT infrastructure, depending on the business processes they support

The categorization by Tiers from the BIA has the following benefits:

- Covers all business line applications and foundational, enterprise-wide services (e.g., network connectivity; application hosting; storage management; backup/recovery; remote access; etc.)
- Provides clarity on services provided by ITS, and links them to the costs incurred
- Provides clarity on Disaster Recovery posture, since Disaster Recovery solutions can be offered to the business in Tiers (e.g., Critical, Standard, etc.)
- Applies criteria of critical vs. non-critical services, providing guidelines for initial prioritization and cost saving opportunities (where to focus attention and resources)

ITSCM Process relation with ITIL and operational processes

Ensuring the continuity of IT services in the event of a disruption requires a thorough understanding of the IT services provided and how they operate under normal circumstances. The ITSCM process must be aware of, and take account of, any factors that affect the operation of IT services. ITSCM is therefore a process that is engaged in all activities in ITS. Consequently, it receives feeds from various operational processes and entities, such as:

- Configuration Management: IT components and relationships
- Change Management: Ensuring the currency and accuracy of the Continuity Plans through the identification of changes affecting or modifying the IT continuity posture
- Problem Management: Impact analysis of problems that are affecting, or may affect, the continuity solutions and ITSCM plans
- Incident Management: Early notification of incidents that can potentially interrupt IT services and could require the activation of ITSCM plans
- Service Level Management: Service Levels based on BIA criticality, detailing what service levels must be maintained under normal circumstances and in a disaster situation
- IT Project Management and Delivery Process: Assessment of ITSCM requirements and proposed continuity solutions, based on ITSCM policy
- IT Enterprise Architecture: Assessment of long-term architecture vision and design compliance with approved continuity solutions and plans

Current Recovery Posture

The Bank's main data center is located at Head Office, with an alternate center located 20 km away, which serves as the recovery site in the event of a business disruption. The alternate site is equipped with computer systems, data links, and staff work areas that enable the Bank to continue critical operations if the Head Office location is inaccessible or unavailable.

The current alternate posture provides:

- On-site support for a number of users based on BIA requirements
- Support for remote access connectivity
- Local workstations configured with business line applications
- A flexible recovery workspace. Because corporate applications and business line tools can be provided via workspace virtualization, users who start their personal workspace see their own familiar, personalized work space, where they can access their files, applications, settings and entire desktop. It is not dependent on a pre-defined workstation.

Split operations

The Bank conducts pre-defined critical business functions from two sites simultaneously, so that should an event affect either site, the remaining site will settle the day's work. The implementation of split operations strengthened and deepened the Bank's operational resiliency.

The future posture, a more resilient environment

The Bank recently launched a project to increase the resilience of the environment by relocating the main data centre. The strategy was developed through a review of the threats with potential to impact the Bank's operations, the extent to which those threats can be mitigated by increasing geographic separation between sites and the associated operational risks, as well as what other central banks and similar organizations are doing in this area. The strategy is to:

- implement split operations for pre-defined critical operations
- locate the Bank's main data centre and business recovery 6 to 20 km from Head Office, and
- locate the Bank's alternate data centre 20 to 50 km from the main data centre

2.4. Human capital

A fundamental best practice for all COOP planning is to plan for the worst-case scenario, to ensure that the Bank is prepared for a variety of situations, even though we do not necessarily know to what degree we may be challenged. In this instance, one of the worst cases would be coping with a severely reduced workforce.

To mitigate this scenario, Managers from across the Bank worked with the COOP Office to develop a categorization matrix identifying those functions in their areas that were time-critical, and determined whether or not those functions could be performed remotely. They then identified the individuals who currently fulfill those functions (referred to as the core group), as well as a pool of individuals with the skill sets to do the necessary work if people in the core group were unable to.

This categorization matrix also provides the guidelines for:

- Identifying areas of vulnerability

- Establishing remote access priority, ensuring that the pre-determined staff can continue to conduct business-critical operations

2.5. Awareness

The success of ITSCM depends on a continuing commitment at all levels in the organization and on people's awareness of their respective responsibilities. IT service continuity requirements are factored in alongside operational activities.

Each department has a Departmental Emergency Response Coordinator, who is responsible for the coordination and logistics of the department's Continuity of Operations plan, and for the dissemination of information provided by the COOP program office.

Testing and exercises

ITSCM driven exercises

ITSCM policies mandate that IT continuity solutions and plans must be tested on a regular basis to evaluate the effectiveness of the recovery capability, and to identify and address any deficiencies. The purpose of this policy is to validate that the applications and infrastructure at the alternate site can operate in isolation from the primary site and meet the required Recovery Time and Recovery Point Objectives. It also serves to identify and resolve problems in the IT infrastructure that could impact the recovery capabilities of critical bank processes, and ensures compliance with the Audit Department's requirements for regular ITSCM assessments and reviews of tests and events for operational readiness.

Disaster Recovery exercises

The objective of the Disaster Recovery (DR) exercises is to demonstrate the ability to activate the production applications and IT environment at the alternate site, within the prescribed Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). These tests validate the preparedness of ITS to recover operations.

The DR exercises are conducted twice yearly, during a weekend, with the participation of all IT groups and Bank business lines to validate critical systems. The DR exercise has predefined conditions, control points, success criteria, and a strict command and control structure.

A report on the test results is distributed to the COOP program, IT Management, and the Audit Department. This report measures compliance with Recovery Time and Recovery Point Objectives and the quality of the individual test results and reports, and highlights updates required to IT plans and IT infrastructure. Results are analyzed to determine whether there are variations from the pre-determined level of services for the business, and to implement the necessary measures to remedy deficiencies.

Table top exercises

Table top exercises are paper-based exercises conducted in ITS by ITSCM in preparation for the DR exercises. The objective is to identify gaps in the Disaster Recovery plans.

COOP driven exercises

Bank-wide continuity tests

Every two years, the COOP Program Office conducts full-scale tests during business hours, while the business is conducting real business transactions at the alternate site. These tests utilize close-to-life scenarios and situational injects during the exercise, presenting a realistic and challenging operating environment. The objective is to stress-test the continuity of operations plans (business continuity plans) for the departments, including ITS, and identify gaps in those plans. These exercises are designed with a focus on specific situations and objectives, and normally include exercise press releases, on paper and video, as part of the injects during the exercise.

Call tree exercises

As part of the ongoing business of individual departments for Bank-wide readiness, all departments and business lines must ensure that they maintain up-to-date contact information for all of their employees. As such, a Bank-wide Call Tree exercise is conducted once a year. In order to represent a realistic scenario, staff are requested not to alter their regular routine to accommodate this exercise.

Simulations

These are paper-based exercises conducted by the COOP Program Office. The objective of these one-day exercises, which utilize close-to-life scenarios, is to train personnel and to identify gaps in the continuity of operations plans.

Evaluating our preparedness and response

Test result reports are reviewed by the Audit Department and, if required, observations are presented to the Bank's senior management. A review of results is conducted with the departments, and tasks are assigned to areas to follow up on gaps found during the tests.
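As an illustration of how a DR exercise report might measure compliance with Recovery Time and Recovery Point Objectives by tier, the following is a minimal sketch. The tier names "Critical" and "Standard" come from the paper, but the RTO/RPO values, the service names and every identifier (`TIER_TARGETS`, `ExerciseResult`, `compliance_report`, `gaps`) are hypothetical and do not describe the Bank's actual tooling or targets.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical targets per BIA tier. The paper names the tiers but does
# not publish their RTO/RPO values; these numbers are placeholders.
TIER_TARGETS = {
    "Critical": {"rto": timedelta(hours=2), "rpo": timedelta(minutes=15)},
    "Standard": {"rto": timedelta(hours=24), "rpo": timedelta(hours=4)},
}


@dataclass
class ExerciseResult:
    service: str
    tier: str
    recovery_time: timedelta  # measured time to restore the service at the alternate site
    data_loss: timedelta      # measured age of the most recent recovered data


def compliance_report(results):
    """Return one row per service stating whether its tier's RTO and RPO were met."""
    report = []
    for r in results:
        target = TIER_TARGETS[r.tier]
        report.append({
            "service": r.service,
            "tier": r.tier,
            "rto_met": r.recovery_time <= target["rto"],
            "rpo_met": r.data_loss <= target["rpo"],
        })
    return report


def gaps(report):
    """List the services that missed at least one objective and need follow-up."""
    return [row["service"] for row in report
            if not (row["rto_met"] and row["rpo_met"])]
```

Defining targets per tier rather than per application mirrors the tier-based service levels described under Input from the BIA: a new service only needs a tier assignment to be included in the report.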

3. Lessons learned from the occurrence of real contingencies

Real events have given the Bank the occasion to update and fine-tune its contingency plans, policies, processes, communications, and decision-making in a real-time atmosphere, and have allowed the Bank to test the effect of its response measures in supporting the Bank's critical processes.

A success factor for the Bank in dealing with emerging risks has been its rapid reaction to events and rapid implementation of improvements to contingency plans. The information gathered in the BIA has been an excellent tool for minimizing the time needed to adapt plans to new risks, as the Bank does not need to repeat the data gathering exercise when situations arise.

In situations when staff cannot access the Bank premises, either as a risk reduction measure or as a consequence of the event (e.g., ice storm, Severe Acute Respiratory Syndrome (SARS) outbreak, power outage, potential influenza pandemic), the Bank has adapted the continuity plans to deal with the reduction of staff and with the increase in remote access by taking the following actions:

- Identify key functions that would be affected by a shortage of staff
- Identify the minimum necessary number of staff for critical processes during peak periods
- Identify a pool of staff from which to draw in case of staff shortage
- Identify processes that can be done by remote access (users at home), and processes that must be conducted on site
- Implement the response in stages for a shortage of staff or the need for social distancing
- Provide training in the use of personal protective equipment for staff who are identified as required to be on site to perform time-critical processes
- Review dependencies on key suppliers

- Revise policies and procedures for remote access, and strengthen processes to assure priority access for critical processes in case of remote access bandwidth limitations
- Provide mobile devices to staff performing critical processes/services
- Implement a flexible recovery workspace that can be accessed from any Bank-issued PC or from a user-owned PC, providing a personalized work space that is not dependent on a pre-defined workstation

In events that require staff to continue operations at the alternate site (e.g., power outage, suspicious package deposited outside the Bank's headquarters, 2010 earthquake), the Bank has taken the following actions:

- Implemented split operations, to conduct pre-defined critical business functions from two sites simultaneously
- Increased the capacity and redundancy of emergency power distribution (redundant diesel generators), and signed agreements with vendors to guarantee fuel supply at the primary and alternate sites
- Increased the capacity and redundancy of the Data Centers' cooling
- Assured a minimum number of seats at the recovery site per department, and provided flexibility by increasing the number of workstations available to the departments through a virtual workspace, which is a work space that is not dependent on a pre-defined workstation
- Strengthened the Incident Management Team structure

The Bank will also be undertaking the following:

- Relocating the main data center away from Head Office
- Implementing lights-out Data Centers (all Data Center management will be done remotely)
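The staffing-related actions above, together with the categorization matrix described under Human capital, can be pictured as a small data structure. The sketch below is illustrative only: the field names, function names and staff names are hypothetical, and the paper does not describe how the Bank actually records its matrix. It flags time-critical functions that would fall below their minimum staffing given a set of absent staff, and derives an ordering for priority remote access.

```python
from dataclasses import dataclass, field


@dataclass
class CriticalFunction:
    name: str
    remote_capable: bool   # can the function be performed by remote access?
    min_staff: int         # minimum staff needed during peak periods
    core_group: list = field(default_factory=list)   # staff who currently perform it
    backup_pool: list = field(default_factory=list)  # staff with the skills to step in


def vulnerable_functions(functions, absent):
    """Return functions whose available core and backup staff fall below the minimum."""
    flagged = []
    for f in functions:
        available = {p for p in f.core_group + f.backup_pool if p not in absent}
        if len(available) < f.min_staff:
            flagged.append(f.name)
    return flagged


def remote_access_priority(functions):
    """Order staff for priority remote access: core groups first, then backup pools.

    Only remote-capable functions are considered; each person appears once.
    """
    ordered = []
    for group in ("core_group", "backup_pool"):
        for f in functions:
            if f.remote_capable:
                for person in getattr(f, group):
                    if person not in ordered:
                        ordered.append(person)
    return ordered
```

For example, a function with a core group of one and a backup pool of one, needing two staff at peak, is flagged as soon as either person is absent, which is the kind of vulnerability the matrix is meant to surface.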