Technical Services Briefing Document
TalentLink Disaster Recovery & Service Continuity
Version 1.2 (January 2012)
Contents

- Overview
- Planning for Service Continuity
- Disaster Recovery Process
- Business Continuity Management

TalentLink Disaster Recovery & Service Continuity 04/01/2012 Page 2
Overview

Document Purpose

The purpose of this document is to describe the provisions made for disaster recovery of the TalentLink service hosted by Lumesse at the Frankfurt Data Centre. It is also intended to give the reader insight into the steps taken by Lumesse to prevent service failure and to minimise its impact should it occur.

Scope

Lumesse has implemented a Business Continuity Management System (BCMS) based on the BS25999 standard. As part of this BCMS, Lumesse maintains a disaster recovery plan to cater for total site loss of a production data centre. The objective is to recover service at a secondary production data centre, within the Lumesse secure service network, that is prepared for the purpose.

While a similar approach is taken for disaster recovery of other Lumesse services and data centres, the scope of this document is specific to the TalentLink infrastructure hosted within the Frankfurt Data Centre.

The in-scope scenario for disaster recovery is an extended total loss of access to, or complete loss of, the Lumesse Data Centre in Frankfurt. Examples of events that could trigger such a scenario include fire, considerable equipment theft, storm, flood, malware outbreak, lightning, or facility shutdown due to other life-threatening circumstances.

The disaster recovery service has not been devised to cater for momentary or short-lived service failures or component failures. These situations are handled by standard incident and problem management processes.

Service Summary

The objective of the disaster recovery service for TalentLink in Frankfurt, Germany is to restore full service functionality at the Lumesse Data Centre in Milton Keynes, UK within a Recovery Time Objective (RTO) of 48 hours and within a Recovery Point Objective (RPO) of 24 hours.

As explained in this document, this is achieved through the use of virtualised infrastructure and storage replication techniques, combined with pre-existing capacity and standard operating procedures.

The Technical Services team maintains standard operating procedures to support the execution of the disaster recovery process. These procedures are tested, exercised and reviewed each year.
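As an illustration of how the two objectives in the Service Summary can be checked against an incident timeline, the following minimal Python sketch may help. Only the 48-hour RTO and 24-hour RPO come from this document. The function names, the example timestamps, and the choice to start the RTO clock at the moment of failure are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Objectives stated in the Service Summary.
RTO = timedelta(hours=48)
RPO = timedelta(hours=24)


def within_rpo(last_replicated: datetime, failure: datetime) -> bool:
    """True if the data lost (failure minus last replicated copy) fits the RPO."""
    return failure - last_replicated <= RPO


def within_rto(failure: datetime, service_restored: datetime) -> bool:
    """True if service was restored within the RTO (clock start is an assumption)."""
    return service_restored - failure <= RTO


# Hypothetical timeline, for illustration only.
failure = datetime(2012, 1, 4, 9, 0)
last_replicated = datetime(2012, 1, 3, 22, 0)   # 11 hours before the failure
service_restored = datetime(2012, 1, 5, 21, 0)  # 36 hours after the failure

print(within_rpo(last_replicated, failure))    # True
print(within_rto(failure, service_restored))   # True
```

In this hypothetical timeline both objectives are met: 11 hours of data are at risk against a 24-hour RPO, and service returns after 36 hours against a 48-hour RTO.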
Planning for Service Continuity

When developing and deploying solutions for hosting its SaaS offerings, Lumesse designs infrastructure to deliver high availability and to provide continuity of service if the disaster recovery procedure is invoked. The purpose of this section is to provide an overview of the steps taken to achieve this.

Lumesse Frankfurt Data Centre

Lumesse hosts the TalentLink service at the TeleCity Group Data Centre in Frankfurt, Germany. TeleCity is a leading provider of premium data centre services in Europe and was selected by Lumesse in part because of its accreditation to the ISO 27001:2005 standard for information security management and the ISO 9001:2008 standard for quality management.

Attributes of the data centre that contribute to service continuity include:

- Direct, redundant connection to multi-carrier Internet services for high availability
- Dedicated, secure caged area for Lumesse with card and PIN access to ensure only approved personnel have access
- 24x7x365 on-site engineering services available on demand within a formal Network Operations Centre
- Building Management System (BMS) maintaining temperature and humidity levels on the data floor areas
- Very Early Smoke Detection Apparatus (VESDA) and Inergen gas fire extinguishing systems
- 24x7 security enforcement from an on-site team supported by closed circuit camera systems, automated electronic lock systems and access control list approval for visitors, including Lumesse personnel

Standard Hosting Components

Lumesse uses the following core hardware components to host the TalentLink service:

- HP DL 380 and 580 servers configured to a high level of component redundancy
- Compellent Storage Area Network (SAN) for Fibre Channel and SATA disk storage
- Cisco Fibre Channel storage switches configured for high availability
- Cisco Catalyst Chassis Local Area Network (LAN) switches configured for high availability
- All SAN and local storage configured in redundant arrays
- HP iLO Advanced solution to support unattended operations by the Lumesse Technical Services team

To support disaster recovery, Lumesse deploys these standard infrastructure components at all data centres hosting the TalentLink service.
Server Virtualisation

TalentLink is hosted using a server virtualisation approach based on VMware vSphere. vSphere allows a flexible and rapid approach to service provision and also supports high availability of services, protecting against server hardware failure.

Service Monitoring and Support

To maintain good visibility of service levels, Lumesse monitors infrastructure and service condition using the Nimsoft monitoring solution, supplemented by proprietary monitoring tools supplied with infrastructure components such as the Compellent SAN. Services are monitored and managed by the Lumesse Technical Services team, which also provides an out-of-hours service to ensure that support staff are available at any time.

To further protect service availability, TeleCity Group Frankfurt provides Lumesse with additional oversight of the service management console and will contact on-call engineers for support in the event of service alerts, escalating through the Technical Services management team as required.

Backup and Recovery

Backup and recovery of the TalentLink service is achieved through a combination of SAN to SAN replication of the virtual infrastructure used to host the service and a tapeless vaulting solution, also replicated between data centres, to cover key data components.

All data (including candidate document data) and application assets are hosted on the Compellent SAN within the Frankfurt data centre. The SAN takes regular snapshots of storage during the day. The Technical Services team can restore these immediately within the data centre should the need arise to recover server images, candidate documents or complete database copies.

The Oracle database is backed up using the i365 EVault backup and recovery service, which deploys a specific agent for Oracle database support. Backups are taken daily and versions of data are maintained on a grandfather, father, son basis. Backups are automatically encrypted using 256-bit AES encryption for additional protection. Backup success rates are monitored and restore tests are regularly performed. Database backups are retained for a maximum of 6 months.

All data backups for TalentLink are automatically replicated from the Frankfurt Data Centre to the Milton Keynes Data Centre. EVault backups are replicated every 24 hours, while Compellent SAN data is replicated on a continuous basis during the day. This approach ensures that all data assets required for service recovery are already located at the recovery data centre and that sufficient capacity to operate the service is continually available.
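To make the grandfather, father, son retention scheme concrete, here is a minimal Python sketch of one way such a policy could be expressed. It is not the EVault configuration: only the daily backup frequency and the six-month maximum retention come from this document. The tier rules (first of the month, Sundays), the tier lengths, and the function names are assumptions chosen for illustration.

```python
from datetime import date

# Assumed tier lengths; only the roughly six-month ceiling comes from the document.
SON_DAYS = 7            # keep every daily backup for a week
FATHER_DAYS = 35        # keep Sunday backups for about five weeks
GRANDFATHER_DAYS = 183  # keep first-of-month backups for about six months


def tier(backup_day: date) -> str:
    """Classify a daily backup as grandfather, father or son (assumed rules)."""
    if backup_day.day == 1:
        return "grandfather"
    if backup_day.weekday() == 6:  # Sunday
        return "father"
    return "son"


def keep(backup_day: date, today: date) -> bool:
    """True if the backup is still inside the retention window of its tier."""
    age = (today - backup_day).days
    limit = {
        "son": SON_DAYS,
        "father": FATHER_DAYS,
        "grandfather": GRANDFATHER_DAYS,
    }[tier(backup_day)]
    return 0 <= age <= limit


today = date(2012, 1, 4)
for day in (date(2011, 12, 20), date(2011, 12, 25), date(2011, 8, 1)):
    print(day, tier(day), keep(day, today))
```

Under these assumed rules, most daily backups expire quickly while a thinning set of weekly and monthly versions is kept, and no version outlives the six-month maximum.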
Disaster Recovery Process

Technical Solution

Disaster recovery is achieved through a combination of virtual machines and SAN to SAN replication to re-create the environment in another data centre as quickly as possible with minimal loss of data. The use of storage replication ensures that software revisions and patch levels at the recovery centre are kept consistent with the production data centre. Even though the candidate data consists of a very large number of small files, these too are replicated and are immediately available to the recovery team.

The diagram below illustrates the key components at the primary and recovery data centres and the replication between them.

[Figure: TLK - High Level Disaster Recovery Process. Components shown: TLK web and application servers on VMware vSphere servers at both Frankfurt and Milton Keynes, SAN to SAN replication between the sites, VM registration, and DNS redirection of customers.]

The following table details the recovery approach for each of the TalentLink service components that need to be recovered in the event of a disaster.
| Component | Recovery Approach | Process |
|---|---|---|
| Candidate data | SAN to SAN Storage Replication | Continuous, automated replication to achieve a recovery point objective of no greater than 24 hours |
| Database Server | Cloning of production database server | Refresh of database server image for recovery process following TalentLink version release, typically each month |
| Web Servers | SAN to SAN Storage Replication | Continuous, automated replication to achieve a recovery point objective of no greater than 24 hours. On invocation: register servers in DR environment |
| Application Servers | SAN to SAN Storage Replication | Continuous, automated replication to achieve a recovery point objective of no greater than 24 hours. On invocation: register servers in DR environment |
| DNS | Create DNS entries | Change Time To Live (TTL) in recovery preparation. Change DNS records on invocation |
| Firewall | Firewall rules pre-established in Milton Keynes data centre | Recovery rule set enabled by Lumesse 24x7x365 security services provider at the time of invocation |
| Load balancer | Rules created in advance on Milton Keynes load balancer installation | Recovery rules already in place |
| Database transaction logs | SAN to SAN Storage Replication | Continuous, automated replication to achieve a recovery point objective of no greater than 24 hours. On invocation: apply transaction logs to recovered database to minimise RPO |
| Database data | EVault backup replication | Daily backup replication to ensure availability of database at recovery site. On invocation: restore database into recovery database server |

Invocation

The following roles are authorised to initiate the disaster recovery process:

- Chief Technology Officer
- Head of Technical Services
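The on-invocation actions in the table, together with the authorisation rule above, can be sketched as a simple checklist generator. This Python sketch is illustrative only and is not the Lumesse standard operating procedure. The two authorised roles and the action descriptions are taken from this document. The function name and the ordering of the steps are assumptions: the database is restored before transaction logs are applied, and DNS is changed last so that customers are redirected only once the service is running.

```python
# Roles the document authorises to invoke disaster recovery.
AUTHORISED_ROLES = {"Chief Technology Officer", "Head of Technical Services"}

# On-invocation actions from the component table; the ordering is an assumption.
INVOCATION_STEPS = [
    ("Firewall", "Enable recovery rule set in Milton Keynes"),
    ("Web Servers", "Register servers in DR environment"),
    ("Application Servers", "Register servers in DR environment"),
    ("Database data", "Restore database into recovery database server"),
    ("Database transaction logs", "Apply transaction logs to recovered database"),
    ("DNS", "Change DNS records"),
]


def invoke(role: str) -> list[str]:
    """Return the ordered checklist, or raise if the role may not invoke DR."""
    if role not in AUTHORISED_ROLES:
        raise PermissionError(f"{role} is not authorised to invoke disaster recovery")
    return [
        f"{number}. {component}: {action}"
        for number, (component, action) in enumerate(INVOCATION_STEPS, start=1)
    ]


for line in invoke("Head of Technical Services"):
    print(line)
```

Calling `invoke` with any other role raises `PermissionError`, mirroring the rule that only the two named roles may start the process.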
Recovery Team

The recovery team comprises systems engineers, database administrators and application specialists from the Technical Services team, as well as key third-party service providers contracted to be available on a 24x7x365 basis.

Communications

During the disaster recovery invocation, communication to all stakeholders shall be coordinated by the following roles:

- Head of Worldwide Corporate Communications and PR
- Global Director of Support

Restoring to Normal Operating State

Once the primary data centre has been restored to full capability, replication shall be re-established between it and the recovery data centre. Once replication latency has reached appropriate levels, a change window shall be arranged to reverse the disaster recovery process and re-establish the normal operating state.
Business Continuity Management

Although the purpose of this document is to describe the disaster recovery capability and processes, this document forms part of the formal Business Continuity Management System (BCMS) of Lumesse. This chapter gives a brief description of the BCMS.

Business Continuity Management System (BCMS)

Lumesse has implemented a BCMS in line with the BS25999 market standard. Lumesse aims to be accredited against this standard in 2012 for its services. As part of the BCMS the following governance concepts have been implemented:

- Management has set and communicated the business continuity objectives, with due regard to acceptable levels of risk, contractual duties and the interests of its key stakeholders
- Management has established and communicated a Business Continuity policy
- A formal BCM governance structure has been implemented
- Staff are trained to ensure their competency and knowledge of the BCM objectives, procedures and processes

Business Impact Analysis & Risk Assessments

Business Impact Analyses (BIA) are performed and regularly updated at department and location level to identify critical dependencies, activities and resources. The results of the BIA are signed off at an appropriate management level. Risk assessments are performed and regularly updated to ensure all risks stay within a formalised risk tolerance level.

Business Continuity Plans

Based on the Business Impact Analysis and the risk assessments, formal Business Continuity plans have been established to ensure Lumesse can handle an interruption in services. All business processes are regularly analysed for single points of failure and for critical resources in terms of people, applications, infrastructure and suppliers. Threats and risks to each of these components are regularly assessed and mitigated.

To ensure people are aware of the business continuity plans and of possible threats to business processes, the business continuity plans are regularly exercised. Lessons learned and improvements to the business continuity plans are implemented.

Continuous Improvement

To ensure the BCMS and the Business Continuity plans are continuously improved, the BCM exercises and DR tests that are executed are formally reviewed for lessons learned. Lessons learned are implemented via a formal corrective actions procedure. A combination of internal and external audits is used to carry out focussed assessments of the BCMS.