Information Services Disaster Recovery Plan Test Exercise
Held Wednesday 17th June
Draft B2 22/06/09
Mark Ellis, Brian Heaton, Steve Pettett, Gill Woodhams
Introduction

On Wednesday 17th June 2009, a test exercise was run to evaluate the effectiveness of the Information Services Disaster Recovery Plan. A scenario was prepared in advance by Brian Heaton, Mark Ellis and Alex Watson. Alex was later replaced by Steve Pettett as a result of other work pressures. The scenario was researched as carefully as possible to ensure that it was a) possible, b) believable and c) a sufficient test of the disaster recovery plan. Several real-world incidents were referred to, and in particular it is worth thanking the authors of the report into the fire at City University in London. This report provided some valuable insight. The scenario itself was observed by Mark Ellis (MKE), Brian Heaton (BDH), Steve Pettett (SMP) and Gill Woodhams (GW).

Background

The background to the scenario, which was not provided to the team but which gives the reader a better context for understanding how the scenario progressed, was drawn up in advance and reads as follows.

Contractors working on roof repairs to Cornwallis South have decided to store their materials in G.19 prior to commencing work. The materials include roofing felt, adhesives, bitumen products and gas cylinders connected to stoves and burners. Overnight, gas leaking from a faulty valve permeates G.19 and spreads into the adjacent Gents and the corridor near S22. At 07:00 a spark from the immersion heater switch triggers an explosion and subsequent fire. An alarm is received at Campus Watch, who send someone to check. Smoke is seen issuing from Cornwallis and this information is radioed back. Campus Watch immediately call Kent Fire & Rescue Service (KFRS), who dispatch 2 pumps/tenders, which arrive on site at 07:12. On being notified that the zone in which the alarm has been signalled contains electrical plant supplying power to a nearby data centre, KFRS ask for the power to be isolated.

The duty electrician is called upon and he isolates the power by shutting down the whole Cornwallis distribution board at 07:23. NB: on loss of primary mains supply to the data centre, the generator automatically cuts in and starts to provide backup power. The fire/smoke detector in G.19 has disabled the fresh air intake, but at some point the ducting in G.19 is damaged by explosion/fire. At 07:27 KFRS discover the generator and ask for it to be shut down. No keys are listed in Campus Watch, so KFRS break into the compound and hit the emergency stop on the generator. When the fire is extinguished at 08:45, KFRS will initially assess the extent of the damage and report on timescales for access to the building.

It should be noted that this background scenario, whilst not probable, is feasible and, as far as the planning team were able to ascertain, the resulting damage that is described would be a reasonably accurate outcome of such an incident.
Methodology

The scenario was run as a largely paper exercise. From the initiation through to the conclusion, information was passed into the Silver Team by the observers, who variously acted the roles of KFRS, the Gold Team, the Estates Silver Team etc. It was anticipated by the planning team that real time and scenario time would be in sync at first and that scenario time would then be accelerated to allow as much ground as possible to be covered. In reality, it did not prove possible to accelerate very far into the future in the time available to run the scenario. For the sake of completeness, the planned scenario timeline is included as Appendix B; however, the event as it unfolded ran slightly differently and this is detailed in the main report. The Silver Team were asked to record all of their firm actions and decisions on a flip chart so that the output could be analysed at a later date. In addition, the observers and others made notes.

Initiation

At 08:50, John Sotillo (JS) was called and asked to attend Senate Committee Room 2 (SCR2) as a fire was evident in Cornwallis South. He was asked to assume that this call had been to his home at around 08:05. JS arrived in SCR2 at 09:00 and was welcomed by the observation team, who indicated that there was indeed a fire in Cornwallis South that looked serious. From this time on, the scenario was running.

Live Scenario

The live scenario proceeded as shown in the table in Appendix A. Items recorded on the action chart are shown in blue; other events are shown in green and input from the Gold Team is shown in red. A breakdown of initial events prior to the arrival of JS can be found by examining Appendix B. In Appendix A, these early events are simply summarised prior to 09:00. The record is derived from the action chart kept on the day along with notes from the observation team. It is accepted that other actions and decisions may have been taken and even noted in individual records.

However, to highlight the need to communicate actions and decisions effectively, only those items recorded on the chart were deemed to have happened rather than merely been discussed. The remainder of this report will not dwell significantly on the actual events and actions taken, as the observation team feel that this is too case specific. Instead, the observation team met after the test scenario to identify wider observations and to draw conclusions and recommendations that might be helpful.
Observations

In drawing up this report and reviewing the scenario, the authors have identified the following observations as being worthy of comment.

1. The team worked well together; most discussion was constructive and resulted in decisions being made. There was no panic.
2. A good analysis of the overall situation was made at the outset of the exercise and a method of recording decisions and actions was established.
3. Liaison with Estates was good throughout.
4. There appeared to be very little liaison with other Silver Teams.
5. One of the early requests of the Gold Team, for a range of options to be presented depending on different scenarios of the extent of the damage in the machine room, was not given proper consideration.
6. The DR plan, while meeting the needs of our auditors, is too detailed to be immediately useful when a disaster occurs. It was referred to very little during the exercise.
7. Admin support is needed for the Silver Team leader immediately and should be considered a priority when calling the team together.
8. Too little consideration was given to staff welfare; a Bronze Team to support staff and to keep them informed as far as possible should have been formed at the outset.
9. More utilisation of staff in Bronze Teams would have enabled them to make a constructive contribution to the business continuity/disaster recovery process, thus helping them to feel valued (and taking their minds, at least partly, off the condition of their offices).
10. It wasn't clear whether anyone had been commissioned to test what was working in terms of network/systems at an early stage, before staff were dispatched to Medway.
11. The lack of a prioritised list of systems to restore required time to be spent on decisions that could have been made in advance of any disaster occurring. The list of priorities should vary according to the time of year but it should be fixed by IS and not subject to alteration by others.
12. It appeared that, with a few notable exceptions, little thought was given to the formation or management of Bronze Teams.
13. It was apparent that when the Silver Team leader was absent no-one was leading or managing activity, although constructive discussion did take place.
14. Both technical experts left the room together to assess the machine room, leaving the Silver Team with considerably reduced technical input.
15. It appeared that no consideration was given to re-locating SDS at the Canterbury campus, e.g. would it have been possible to move SDS to Electronics (who have appropriate facilities) so that local access could have been given there?
16. The team allowed themselves to be distracted by the call from Hospitality, which should have been dealt with quickly and then dismissed.

Conclusions

The authors have drawn the following conclusions from observing the exercise.

1. The exercise was worthwhile. Those observing concluded that it went well and commend the IS staff involved accordingly.
2. Teamwork and internal communication within the team were good; however, more thought needs to be given to remaining staff. More immediate action is needed in dealing with affected staff and this should be a higher priority.
3. IS staff are not sufficiently familiar with the overall business continuity plan system and the IS plan in particular. Better familiarity with the plan would have resulted in a more streamlined approach, deploying Bronze Teams more effectively.
4. An effective Silver Team Leader is essential, and it should be recognised that any event heavily involving IS systems is likely to result in the Director of IS being seconded to the Gold Team. The Silver Team Leader should therefore ideally be another person.
5. In an event having such wide-ranging implications for the University, IS should perhaps think outside the box a little more in terms of available resources. There are other physical and human resources that might be deployed to bring about a swift restoration of services.
6. Staff should not be swayed or distracted by incoming queries and issues. A real event would doubtless generate hundreds of such matters and they should not be allowed to bring the Silver Team to a halt. A sufficiently robust message held by the helpdesk would need to suffice. Other Silver Teams and ultimately the Gold Team need to make their own decisions based on the best information that has been released.
Recommendations

Much can be drawn from the conclusions above; however, the authors feel that it is worth making the following specific recommendations.

1. Update and revise the IS Disaster Recovery Plans to reflect the current situation. For example, Business Systems is now part of IS and is not mentioned in current plans. Other references relate to the old server room infrastructure.
2. Produce, with some consultation, a Service Priority List. Much time was lost in discussion in this area. Other institutions have and use such lists, and they are often time-varying, depending on events and/or terms starting and ending. Such a list would be invaluable in driving forward in the event of an emergency. The list should be published and, where possible, not subject to external influence once agreed.
3. Update the contact details for key members of IS and key associated people.
4. Put in place a mechanism to review the contact details on a regular basis.
5. Include in the Disaster Recovery pack some key details of Service Level Agreements. The planning team spent much time getting data on SLA response times for fire suppression systems, KentMAN WaveStream connections etc. Having this information to hand in the event of a real incident would likely prove highly valuable.
6. Put in place a mechanism to review the Disaster Recovery plan annually.
7. Ensure that all staff are familiar with the plan; there is no time to read it in detail on the day.
8. Run another exercise in 12 months' time to test the revised plan.
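Recommendation 2 calls for a Service Priority List that varies with the time of year. As a minimal sketch of how such a list might be recorded and looked up, the following could serve as a starting point. The service names are taken from this report, but the period names, the dates and the orderings are hypothetical placeholders that IS would need to agree, and `PRIORITIES`, `CALENDAR`, `period_for` and `restoration_order` are illustrative names only.

```python
from datetime import date

# Hypothetical restoration orders per period. Service names come from the
# report; the orderings are placeholders for IS to agree, not findings.
PRIORITIES = {
    "exams": ["SDS", "web presence", "campus network", "internet access"],
    "default": ["web presence", "campus network", "internet access", "SDS"],
}

# Hypothetical calendar of (first day, last day, period name).
# The dates are invented for illustration.
CALENDAR = [
    (date(2009, 5, 1), date(2009, 6, 30), "exams"),
]


def period_for(day, calendar=CALENDAR):
    """Return the name of the period containing `day`, or "default"."""
    for first, last, name in calendar:
        if first <= day <= last:
            return name
    return "default"


def restoration_order(day, calendar=CALENDAR, priorities=PRIORITIES):
    """Return a copy of the agreed restoration order for the given day."""
    return list(priorities[period_for(day, calendar)])


if __name__ == "__main__":
    # Print the order that would apply on the day of the exercise.
    order = restoration_order(date(2009, 6, 17))
    for position, service in enumerate(order, start=1):
        print(position, service)
```

Holding the list as plain data of this kind would allow it to be published with the plan and changed only through the regular review, rather than being debated during an incident.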
Appendix A
Action & Decision Timeline

Time   Action
07:00  Small explosion; fire alarm activated
07:12  KFRS arrive
08:05  Estates duty manager arrives and asks Campus Watch to call JS
08:45  Main fire under control
08:55  JS arrives, is briefed by Estates Duty Manager
09:05  JS starts calls to IS Silver Team
09:12  First Silver Team members arrive
09:17  Calls go to Christine + Angela
09:20  A call comes in from the Telephone Exchange saying that they are getting a lot of calls and that they are unable to contact the helpdesk
09:20  Telephone exchange getting a lot of calls
09:22  Update from KFRS. The corridor is badly damaged and the fire suppression system has been activated. Unable to enter data centre until vented.
09:25  JS is called to Gold Team meeting
09:45  JS returns from Gold Team meeting
10:00  Gold Team say Estates Silver Team are offering Bob Eager room in Darwin for use as helpdesk call centre; offer accepted
10:05  KFRS offer to escort two people into Cornwallis for a quick look. Estates are sending ACW. Who from IS? Answer: MAB
10:08  JCH has confirmed Bob Eager room and that Estates will supply telephones
10:10  Collect some staff to make notices for all receptions with latest info & contact number:
10:15  Estates requested to place automated response message on
10:20  Establish equipment Bronze Team
10:25  Equipment team contact purchasing to establish emergency purchasing procedures
10:30  Equipment team collect list of useful kit IT staff have with them and kit in rooms under IS control which could be moved
10:30  Contact KentMAN + KCC etc.
10:30  Begin contacting staff and redirecting them to Darwin
10:30  Web: external wordpress.com blog running for emergency cover. Assessment of [getting] kent.ac.uk back online under way. Business Systems planning move of kit & services to Medway
10:40  Contact centre set up in Bob Eager room. Contact number advertised on automated answer on 4888 & staffed
10:42  Bronze Team hotline to helpdesk staffed
10:45  Silver Team hotline to helpdesk staffed
10:45  Establish skill (ops) at Medway
10:46  Contact Estates to coordinate new fibres into C.S. server room once damage assessment made
10:46  Equipment Bronze Team contact Hospitality to establish exact needs for weekend conference. Critical needs, not critical needs
10:50  Informing Bronze & Silver Teams of hotlines + switchboard of alternatives to
11:00  Team established at Medway: JGAM, JC, DAM, DME, JJY
11:00  More diesel for generator + assessment that generator is fit for purpose
11:00  Inform Hospitality in 2 hours for their Sat. Conference
11:10  Contact server room maintenance company via Estates. MAB: call server maintenance engineers
11:40  Gold Team require a message to pass to C&DO. Services likely tomorrow. Watch specific webpage
12:00  Library identified for staff.
12:00  KFRS indicate that they can take 4 people to look around Cornwallis. KentMAN person, Adcocks person, MAB and PWR attend
12:10  Bronze Teams reconvene at 13:00
12:10  Update information team
12:12  Update C&DO to inform all departments
12:15  JS inform Gold Team of updates
12:20  Inform Bronze Team of 1pm & hourly meets.
12:20  Inform Hospitality of risks to conference
12:25  Informed Jon Pink to expect SDS at Medway by 2pm but access only possible at Medway
12:30  Updated auto reply + info team
12:45  Silver Team is briefed by PWR & MAB about what they have found during the recce
13:00  With latest update assess which Bronze Teams need to have staff available overnight. Establish shift pattern if appropriate
13:10  Order lunch
13:10  Rota for tech staff to work through the night
13:30  Draft update for C&DO
13:40  SDS up and running at Medway. Inform J Pink, Gold Team, Info Team and auto message
13:45  Update auto reply
13:45  Operating shift pattern
14:00  A BT engineer has arrived and inspected the equipment. He is satisfied that it should work but the fibre is dead. He has in turn called out a fibre team. They are expected within 2 hours. An assessor from ComSol is here and is arranging to bring in two people to pull a priority fibre of the team's choice.
14:10  Silver Team agrees to identify priorities by 16:00
15:00  Inform all that basic web page up with info for staff, students & non-university visitors
       BT engineer arrives and starts to repair fibre for one WaveStream circuit
       VESDA fire suppression reset will allow power from generator. Cannot be recharged with gas for 48 hours risk...
       One WaveStream circuit complete. Once power available access to KentMAN & beyond will be available.
05:00  Generator started to provide power to router & core switch
05:30  ComSol complete fibre link from CDC to Library
06:00  Silver Team reconvene. Overnight, 1x WaveStream has been repaired and the generator has been started, thus regaining external connectivity. 1x SM and 1x MM fibre have been completed to the library but need patching. The A/C pipework has been progressed but a delivery of gas is still pending.
06:15  Fibre to library patched; campus accessible
06:30  Decision to start some services that have been tested overnight and monitor temperature
07:00  Update to all: basic services available: net, internet, staff , SDS. Next update at 10:00 with likely schedule of service restoration
09:00  Inform Comp Sci of news
09:00  Adcocks report first A/C unit is online
09:30  Inform Hospitality of more positive message for their conference
10:00  Info update to all; positive message
12:00  Adcocks report second A/C unit is online and third is not far behind...
       Verizon circuit repaired completing JANET access from Canterbury...
       Second WaveStream circuit repaired restoring redundancy to KentMAN
Appendix B
Planned Scenario Timeline

07:00  CW  Explosion, fire alarm activated
07:02  CW  CW arrive at Cornwallis, confirm fire by radio
07:03  CW  CW notify KFRS, vehicles dispatched
07:05  CW  CW begin evacuation of accessible areas of Cornwallis
07:12  KFRS  Arrive onsite, directed to scene by CW
07:14  CW  Notifies duty manager of Estates
07:14  KFRS  Assess site, request power shutdown. CW contact duty electrician
07:23  KFRS  Duty electrician kills all power to whole Cornwallis complex
07:25  KentMAN  Alarms indicate loss of contact to RNEP and WaveStream 1
07:26  KentMAN  Alarms indicate loss of contact to WaveStream 1
07:27  KFRS  KFRS have identified a generator running after power removed. Campus Watch say they have no keys. KFRS break into compound and stop generator
07:28  Unisys  Alarms for loss of KSPN links
07:28  KentMAN  Alarm for loss of link to PoP router
07:32  Data Centre fire suppression is activated; system kills all incoming power and shuts down UPS outputs; all equipment immediately loses power
08:05  Estates duty manager arrives on site and asks CW to contact John Sotillo and member of Gold Team
08:45  KFRS  Fire controlled; damping down starts
08:50  KFRS  Initial damage assessment.
09:05  JPMRS arrives on scene and liaises with Gold Team member, who orders assembly of IS Silver Team and Estates Silver Team
09:10  KFRS  KFRS give initial verbal report on extent of damage to JPMRS, SMP & Gold Team Member. Building still closed to all except KFRS. Corridors are severely damaged, as are several offices. The plant room and everything in it is a total loss. Can't see much in the rooms in the middle and can't get in until room vented due to the fire suppression system. Will get back to you later...
09:10  IS Silver Team assembled in Senate to be briefed by JPMRS
09:30  JPMRS  JPMRS called away from Silver Team to attend Gold Team
09:45  Gold  JPMRS returns from Gold Team to Silver Team with initial instructions. 1. Must have web presence A.S.A.P. 2. Must have plans for different scenarios depending on what is found when building entered 3. Verify all staff safe & accounted for
09:50  KFRS  KFRS offer to escort 2 people into the building with a digital camera to examine extent of damage
09:55  Gold  Gold Team instructs CS helpdesk be set up in Darwin Tower as advised by Estates Silver Team
10:20  KFRS  2x people return with details of visible damage. Unable to enter CDC as room has not been vented but appears intact. Cable loss and office damage is obvious at this point
10:30  Gold  Incoming communication from Gold Team. Exams processing and boards of examiners severely affected due to lack of SDS. Along with web presence, this is to be considered top priority as being the core of the University business. Provide options and timescales...
12:00  KFRS  KFRS declares building safe to enter for limited staff. CDC has been vented.
12:30  Full extent of the damage now apparent including power cables, data, aircon pipes etc.
13:00  JPMRS required to attend a lunchtime meeting of Gold Team. Initial damage assessment & action plan required
13:45  Gold  Output from Gold Team: 1. Ensure all decisions and actions being logged (should already be happening!) 2. Make sure all displaced staff accommodated 3. If staff will be working beyond normal hours, prepare rota and organise catering 4. Prepare a position statement for staff giving outline of expected return to service of various key systems, to be distributed by the end of the day. Statement to be vetted by C&DO.
20:00  KFRS  KFRS satisfied site no longer a risk and hand back site in full