Technology Resilience and Failover policy
Status: Approved
Contents
1 Introduction
2 Technology Resilience and Failover policy
2.1 Policy scope
2.2 Policy statements
2.3 Exception management
2.3.1 Exception criteria
2.3.2 Exception process
2.4 Rationale
2.5 Policy notes
3 RACI matrix
1 Introduction

This policy defines the responsibility of BBC service providers and internal BBC technology providers to perform failover exercises that test the resilience and failover capabilities of technology delivered to the BBC, in alignment with British Standards BS25999 and BS25777. The frequency of these exercises depends on the criticality of the service and any contracts in place, but in general at least one such test must be performed in each twelve-month period unless formal dispensation has been granted.

Each BBC division is responsible for, and owns, its business continuity plans. Systems classified as of high importance to the BBC will form part of certain business continuity plans, specifically those plans where loss of technology is in scope. One element of continuity planning is the rehearsal of plans against the loss of technology. The technology failover tests described in this document can be combined with the rehearsal of business continuity plans, and their scope, frequency and results will be reflected in those plans during the ongoing cycle of business continuity planning and rehearsal.
2 Technology Resilience and Failover policy

2.1 Policy scope

This policy applies to all technology where a backup or standby system is inherent in the design. BBC stakeholders are accountable for the running of failover tests and have the authority to grant service providers formal dispensation from carrying out tests.

2.2 Policy statements

Technology Service Providers: Service providers and internal BBC technology providers shall perform failover tests of systems delivered to the BBC, where failover capability to a standby system is inherent in the design.

When to test: Testing will take place both before a system is used in production and at least once per annum during the lifetime of the system.

Business as usual activities: There will be no interruption to business as usual activities during the tests, unless agreed in advance with the BBC service owners and stakeholders.

Disaster Recovery: Documented disaster recovery operational procedures for testing failover to a standby system will be reviewed and updated in light of the tests at least once per year by the service provider or internal BBC technology provider.

Correcting Issues: Actions to rectify issues discovered during tests shall be completed by the service provider or internal BBC technology provider in a timescale agreed with the BBC stakeholders, and must always be completed before the next test.

Extra Testing: Depending on the seriousness of the issue(s), an extra test may be required to confirm that remediation has been successful.

2.3 Exception management

2.3.1 Exception criteria

The BBC may grant formal dispensation to the service provider if there is a high number of similar systems in scope and it would be impossible to test failover to all standby systems in a twelve-month period because of the number of tests that would have to be carried out. In such circumstances the stakeholders and the service providers will agree on a number of these systems to be tested in a calendar year.
Dispensation may also be granted by the BBC stakeholders if it is deemed that the tests cannot go ahead because there would be unacceptable interruption to the business or to the BBC audience. Failover testing to a standby system will not take place if existing contracts do not include a requirement to do so, unless otherwise agreed between the BBC and the service provider.

2.3.2 Exception process

Dispensation must be agreed with the relevant BBC stakeholders for the system.
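The testing rule in section 2.2 and the exceptions in section 2.3 can be illustrated with a short sketch. This is not part of the policy: the record fields, function names and status strings are all hypothetical, and the check assumes that "once per annum" is measured from the date of the last completed test.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SystemRecord:
    """Hypothetical record of one system with a standby in its design."""
    name: str
    last_failover_test: Optional[date]      # None if the system has never been tested
    dispensation_granted: bool = False      # formal dispensation agreed with BBC stakeholders
    contract_requires_testing: bool = True  # False if the existing contract has no test clause


def add_twelve_months(d: date) -> date:
    """Return the same calendar day one year later (29 Feb maps to 28 Feb)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def failover_test_status(system: SystemRecord, today: date) -> str:
    """Classify a system against the annual testing rule (illustrative only)."""
    if not system.contract_requires_testing:
        return "not required by contract"
    if system.dispensation_granted:
        return "dispensation granted"
    if system.last_failover_test is None:
        return "test required before production use"
    if today > add_twelve_months(system.last_failover_test):
        return "overdue"
    return "in date"
```

For example, a system last tested on 1 March 2010 would be reported as "overdue" on 16 March 2011, unless dispensation had been recorded for it.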
2.4 Rationale

To achieve adequate continuity of service, the BBC requires its third-party technology service providers and internal BBC technology providers to commit to continuity measures which ensure that business continuity arrangements for technology continue to meet the needs of the BBC in the event of a technology incident or failure. This includes having adequate, tested mechanisms in place for seamless failover and for recovery of systems in a reasonable timescale following a failure.

2.5 Policy notes

The purpose of failover testing is to learn lessons and to set expectations. The lessons learned during a controlled test scenario provide valuable knowledge and information which can be used if the system has to fail over to a backup system because of an unexpected failure in the live environment.

The BBC stakeholders are accountable for the services delivered by the system and reserve the right to cancel tests at short notice.

The scope of the tests shall be clearly defined and agreed with the BBC stakeholders in advance of any test taking place, so that both the BBC and its service providers have a clear understanding of their roles in the failover testing process and fully understand the expected impact of the test.

While the testing of failover systems is the core activity of this process, the distribution of communications and test documentation is key to its success. To ensure that communications reach the right recipients, distribution lists must be reviewed and updated before and after each test, so that those accountable, responsible, consulted and informed understand their roles and are fully involved from start to finish.

All tests shall follow the relevant agreed change control procedures to ensure that appropriate fallback plans are in place to return to normal operations if, for example, unexpected technical issues arise.
When a test has been completed, the system shall be returned to its normal operating status. Following each completed test, a report detailing the findings will be sent to the BBC stakeholders. Lessons learned and issues discovered during tests will be recorded, documented and reported to the BBC stakeholders, so that there is a clear understanding of the expected impact on the business in the event of a real failure of the system during normal operations.

The service provider or internal BBC technology provider will own the resilience testing operational procedures and documentation and will grant the BBC read-only access to this documentation.

The service provider will present the BBC with a summary report of its resilience testing activities at least once every three months. If a service provider is responsible for only one annual failover test, a single annual report will suffice.

It is the responsibility of the service provider to take appropriate steps to address technical, process or operational issues discovered during testing. Other parties will need to be consulted or kept informed during the process. A summary of roles can be found in the RACI matrix in section 3 of this document.
3 RACI matrix

The RACI matrix is a framework for defining the process obligations of roles and actors, using the following classifications:

Responsible: executes the process
Accountable: is the business owner of the quality of the end result
Consulted: provides knowledge, information and other input into the process
Informed: is given information about the process execution

Roles (matrix columns):
1. Service provider: Operations
2. Service provider: Service Management
3. Service provider: Business Continuity
4. BBC: Business stakeholders
5. BBC: Business Continuity Strategy
6. BBC: Service Assurance
7. BBC: Business Continuity

Activity               1  2  3  4  5  6  7
Procedures and scope   R  C  C  A  C  C  I
Communications         C  R  I  A  I  I  I
Change control         R  A  I  I  I  I  I
Agree test date        R  C  C  A  C  C  I
Agree test frequency   C  R  C  A  C  I  I
Carry out test         R  C  I  A  I  I  I
Implement actions      C  R  C  A  I  I  I
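The RACI matrix above can also be held as data and checked mechanically. The sketch below is illustrative only. It assumes that the matrix columns follow the order in which the role names appear in this document, and it applies the common RACI convention (not stated in this policy) that each activity has exactly one Accountable role and at least one Responsible role. The function names are hypothetical.

```python
# Column order is an assumption: it follows the order in which the role
# names appear in this document.
ROLES = [
    "Service provider: Operations",
    "Service provider: Service Management",
    "Service provider: Business Continuity",
    "BBC: Business stakeholders",
    "BBC: Business Continuity Strategy",
    "BBC: Service Assurance",
    "BBC: Business Continuity",
]

# One letter per role, in ROLES order, for each activity in the matrix.
RACI = {
    "Procedures and scope": "RCCACCI",
    "Communications": "CRIAIII",
    "Change control": "RAIIIII",
    "Agree test date": "RCCACCI",
    "Agree test frequency": "CRCACII",
    "Carry out test": "RCIAIII",
    "Implement actions": "CRCAIII",
}


def validate_raci(matrix, roles):
    """Return a list of problems; an empty list means the matrix is well formed."""
    problems = []
    for activity, codes in matrix.items():
        if len(codes) != len(roles):
            problems.append(f"{activity}: expected {len(roles)} entries, got {len(codes)}")
        if set(codes) - set("RACI"):
            problems.append(f"{activity}: contains a code other than R, A, C or I")
        if codes.count("A") != 1:
            problems.append(f"{activity}: must have exactly one Accountable role")
        if codes.count("R") < 1:
            problems.append(f"{activity}: must have at least one Responsible role")
    return problems


def roles_with(matrix, roles, activity, code):
    """List the roles that hold the given RACI code for an activity."""
    return [role for role, c in zip(roles, matrix[activity]) if c == code]
```

Calling `validate_raci(RACI, ROLES)` reports no problems for the matrix as transcribed, and `roles_with(RACI, ROLES, "Carry out test", "A")` names the BBC business stakeholders as the accountable party for carrying out a test.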
Document control

Layer: Overarching
Domain: Security
Subject: Continuity
Title: Technology Resilience and Failover
Policy Reference: cont_13
Document Owner: Andy Leigh
Role: Head of Information Security and Business Continuity Strategy
Contact: Dermot O'Sulivan

Version   Date                Author             Status
1.1       16th March 2011     Dermot O'Sulivan   Approved
1.0       2nd February 2011   Dermot O'Sulivan   Draft
0.1       2nd January 2011    Dermot O'Sulivan   Draft