Dronfield Henry Fanshawe School
Policy No: S33
ICT Disaster Recovery Plan
Revision No: 1
Date Issued: September 2012
Committee: Author: Statutory RDD
Date Adopted: September 2012
Minute No: 12/20
Review Date: September 2015

The following are situations that would affect the functionality of the Dronfield Henry Fanshawe School network system, together with their proposed solutions. Please note that backups are taken of all critical servers and data.

Whole School Power Failure
Chance of failure - <0.5%
Effect - Critical
Should the school suffer a complete power failure, all the computers would shut down. The servers have uninterruptible power supplies (UPS) and would shut themselves down automatically after 15 minutes. The UPSs would shut the servers down safely, resulting in no loss of server data. Any data on the desktops that had not already been saved to the servers would be lost. A whole school power failure would result in the closure of the school until the situation was resolved. Once the power was restored, all the servers would need to be restarted manually.
Solution: There is no effective workaround to this problem. The school would need to liaise with the utilities power supplier to determine the seriousness of the problem and the downtime.
Downtime - Unknown

Partial Power Failure - A-Block - Critical
Chance of failure - <1%
Effect - Critical
A failure of the power supply to A-Block would result in the loss of the Server Room and all the main ICT suites.
Solution: There is no effective workaround to this problem. The school would need to liaise with the utilities power supplier and the school electrician to determine the seriousness of the problem and the downtime.
Downtime - Unknown

Partial Power Failure - Server Room - Serious
Effect - Low
A failure of power to the Server Room would result in the failure of the whole network due to the loss of the core switches.
Solution: The server room is powered from 2 different power distribution boards. Should the server room board fail, equipment can be switched over to the secondary board.
Downtime - Max 2 hours

Failure of Core Switches
Chance of failure - <0.5%
Effect - None
Both of the core switches have several redundancy features built in, but it is still possible for them to fail totally.
Solution: Each of the core switches has redundancy built in to the switch, and there are two switches per function.

Failure of Room Switches
Chance of failure - <1%
Effect - Low
Failure of room-based switches would cause limited data loss.
Solution: A room-based switch is always held as a spare and used if required.
Downtime - 30 minutes

Failure of Blade Enclosure, Virtual Connects and Power Supplies
Chance of failure - <0.5%
Effect - Critical
Total failure of the Blade Enclosure is very unlikely, as the parts within it are non-moving and have multiple redundancy, including 6 power supplies and 2 network controllers. However, if the enclosure did fail, this would take down every server in the enclosure and restrict access to critical data.
Solution: The Blade Enclosure and its internal components are covered by a 5 year next working day warranty.
Downtime - Access to admin and curriculum data could be restored within 24 hours via the backup server located in F-Block. The main service that would not be available is the 200 thin clients located around school, which would be offline until the Blade Enclosure was restored (2 days).

Failure of Individual Blade Host Server (VMware or Citrix)
Chance of failure - <1%
Effect - None
The servers are configured in an n+1 configuration, meaning that there is always 1 extra server available should one fail.
Solution: Failure of a host will result in the resources on that server being shared among the remaining servers.
Downtime - None

Failure of SAN1 (VMware)
Chance of failure - <0.1%
Effect - Low
SAN1 is a storage area network device with multiple redundant parts, including redundant hard disks, power supplies and network connections. A complete failure of the SAN would result in some servers being unavailable, depending on which servers were hosted on that SAN.
Solution: Failure is unlikely, and the failure of an individual redundant part would have no effect. In the very unlikely event that the SAN enclosure failed, some data would migrate automatically to SAN2; otherwise data would need to be restored from backup. The SAN enclosure is on a next day warranty and could be replaced.
Downtime - 48 hours

Failure of SAN2 (VMware)
Chance of failure - <0.1%
Effect - Low
SAN2 is a storage area network device with multiple redundant parts, including redundant hard disks, power supplies and network connections. A complete failure of the SAN would result in some servers being unavailable, depending on which servers were hosted on that SAN.
Solution: Failure is unlikely, and the failure of an individual redundant part would have no effect. In the very unlikely event that the SAN enclosure failed, some data would migrate automatically to SAN1; otherwise data would need to be restored from backup. The SAN enclosure is on a next day warranty and could be replaced.
Downtime - 48 hours

Failure of SAN3 (Citrix)
Chance of failure - <0.1%
Effect - Low
SAN3 is a storage area network device with multiple redundant parts, including redundant hard disks, power supplies and network connections. A complete failure of the SAN would result in the loss of the Citrix environment and the virtual desktops feeding the thin clients (10ZiGs).
Solution: Failure is unlikely, and the failure of an individual redundant part would have no effect. In the very unlikely event that the SAN enclosure failed, we would be reliant on a replacement enclosure being supplied by HP. The SAN enclosure is on a next day warranty and could be replaced.
Downtime - 48 hours

Failure of Master Domain Controller (DHFS-V-DC01)
Effect - Low
The failure of the master domain controller would limit the issuing of DHCP IP addresses. Other domain controllers would take over other functionality such as DNS.
Solution: DHCP would be installed on another domain controller. Fix the master domain controller as soon as possible.
Downtime - 1 hour

Failure of Domain Controllers (DHFS-V-DC02, DHFS-V-DC03 and Backup)
Effect - None
The failure of a domain controller would have limited effect on functionality, as the other DCs would take over its functions. We have enough domain controllers to continue working.
Solution: Fix the failed domain controller as soon as possible.

Failure of Admin Server
Effect - Medium
The failure of the Admin server would cause loss of access to staff and admin based data.
Solution: Give temporary access to the backup server until the Admin server is fixed. Fix the Admin server as soon as possible and restore data from the backup server to Admin.
Downtime - 2 to 3 hours

Failure of Storage Server
Effect - Medium
The failure of the Storage server would cause loss of access to student and curriculum based data.
Solution: Give temporary access to the backup server until the Storage server is fixed. Fix the Storage server as soon as possible and restore data from the backup server to Storage.
Downtime - 2 to 3 hours

Failure of SQLServers (SQLServer and SQLServer2)
Chance of failure - <10%
Effect - Low
Loss of MIS system and eportal.
Solution: There are 2 eportal servers and any failure would have limited effect, as the other server could be used. The failed server would then be fixed as soon as possible.

Failure of SQLMain
Effect - High
Loss of SQL database functionality, including access to MIS, Exchequer and Opera.
Solution: Install SQL on another VM and restore the data for MIS, Exchequer and Opera.
Downtime - 1 day

Failure of Web Server
Effect - Medium
Loss of school website and VLE.
Solution: Create another VM and restore data from backup.
Downtime - 1-2 days

Failure of Anti-Virus Server (DHFS-V-CTRL01)
Effect - None
The failure of the Anti-Virus server would result in the AV not being updated on the servers and machines in school. The machines would continue to work with their current updates. The impact of missing AV updates for 1 day would be negligible.
Solution: Recreate the VM and install Sophos.

Failure of VM Management Server (DHFS-V-MGMT01)
Effect - None
This would result in loss of control of the 10ZiG thin clients.
Solution: We would create a new VM and reinstall the 10ZiG management software.

Failure of VMware vCenter Server (DHFS-V-VC01)
Effect - Low (High)
This would result in servers not being able to vMotion to another host. Failure of this server alone would not result in any downtime, but a subsequent loss of a host would mean that the virtual servers on that host would not remain up.
Solution: Restore vCenter to another host (migrate servers to remaining hosts).
Downtime - 0.5 days

Failure of Phone System Server (DHFS-V-Unify)
Effect - High
This would result in a loss of the internal phone system.
Solution: Whilst the server is being recovered, any calls to the school would be redirected to a mobile phone in reception or a personal mobile associated with a DDI.
Downtime - 1 day

Failure of Backup Server
Effect - Low
The failure of the Backup server would leave us unable to back up or recover data.
Solution: Transfer operation of the Backup server temporarily to another machine located in F-Block. Fix the Backup server as soon as possible.

Failure of Proxy Server
Effect - Medium
Should the Proxy server fail, external access to the web site, intranet and VLE would be lost.
Solution: Temporarily install all data on another server. Fix the Proxy server as soon as possible.
Downtime - 1 day

Failure of Exchange
Effect - High
Loss of email access.
Solution: Either fix the Exchange server or copy its functionality to another server. Restore backups if necessary.
Downtime - 1-2 days

Failure of Appserv1
Effect - Low
Loss of access to student controlled assessment work and Impero.
Solution: Fix the server and restore backups if necessary.
Downtime - 1 day

Failure of Citrix Environment Servers
Effect - Medium
Servers:
DHFS-V-CSG01
DHFS-V-CTX01
DHFS-V-CTXDS01
DHFS-V-DDC01
DHFS-V-DDC02
DHFS-V-FS01
DHFS-V-WI01
DHFS-V-WI02
NFS_03
The above servers control the Citrix environment. Although these servers do have some redundancy, there are 2 servers whose failure would cause a total loss of access to virtual desktops. These are DHFS-V-CSG01 and NFS_03.
Solution: Each of the servers can be reinstalled or restored from backup. NFS_03 holds the base images, so this server would need to be recovered and the images restored.
Downtime - 1-2 days

Failure of Door Control Server (DHFS-V-DoorCtrl)
Effect - Low
The failure of the door control system would result in the door controllers failing to receive updates regarding user identities. The doors would continue to open based on their current data.
Solution: The doors would continue to work, or would be defaulted to open whilst school was in operation. Restore the server as quickly as possible.
Downtime - None

Failure of Print Control Server (DHFS-V-PCUT)
Effect - Low
The print control server looks after all network printing in school. Its failure would result in loss of network printing.
Solution: Install Papercut on another virtual server and restore the Papercut settings.
Downtime - 0.5-1 day

Failure of Lightspeed Web Filter
Chance of failure - <1%
Effect - High
Loss of secure internet. Internet access could be maintained but it would not be properly filtered.
Solution: This is proprietary equipment and is under a 3 year next day replacement warranty. The school could run unfiltered but this would not be advisable.
Downtime - 1-2 days

Failure of Router
Chance of failure - <0.5%
Effect - Low
Loss of internet and email access.
Solution: The main router is the property of KCOM and as such we have no control over repair times. We do have a backup ADSL line which could be used for access. It sits behind the Lightspeed web filtering server, so secure internet access could be maintained, but the service would be degraded.

Fire in A-Block
Chance of failure - <0.1%
Effect - Critical
A fire in A-Block would be the most serious situation for the computer network, as this is the location of the Server Room containing all the servers and core switches, and of most of the major ICT suites.
Solution: Although most of the equipment could be bought reasonably easily off the shelf, the main problem would be redirecting fibre optic cabling. We would be able to get limited access to other buildings within 2-3 days, but full access would take 7-14 days. Restoring the server room and ICT suites would be dependent on the level of destruction. The main delay would be restoring the infrastructure of the building.
Downtime - Unknown

Fire in Other Blocks (Non A-Block) - Serious
Chance of failure - <0.1%
Effect - High
A fire in a building other than A-Block would require restoration of the infrastructure.
Solution: Restore the building infrastructure as soon as possible. ICT equipment could be bought within 7 days.
Cabling would take longer.
Downtime - Unknown
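
The scenarios in this plan could also be held as a small machine-readable register, so that they can be sorted for review by effect and chance of failure. The following is a minimal Python sketch only and is not part of the adopted plan. The names Scenario, EFFECT_RANK, REGISTER and by_priority are hypothetical, the numeric ranking of the effect labels is an assumption, and only a handful of entries are shown, with values copied from the sections above.

```python
from dataclasses import dataclass

# Assumed ordering of the effect labels used in this plan (higher = worse).
EFFECT_RANK = {"None": 0, "Low": 1, "Medium": 2, "High": 3, "Critical": 4}


@dataclass(frozen=True)
class Scenario:
    name: str
    chance_upper_pct: float  # upper bound taken from "Chance of failure - <x%"
    effect: str              # one of the keys of EFFECT_RANK
    downtime: str            # free text, as written in the plan


# A few illustrative entries copied from the plan; not the full register.
REGISTER = [
    Scenario("Whole School Power Failure", 0.5, "Critical", "Unknown"),
    Scenario("Partial Power Failure - A-Block", 1.0, "Critical", "Unknown"),
    Scenario("Failure of Room Switches", 1.0, "Low", "30 minutes"),
    Scenario("Failure of Lightspeed Web Filter", 1.0, "High", "1-2 days"),
    Scenario("Failure of Router", 0.5, "Low", "Not stated"),
    Scenario("Fire in A-Block", 0.1, "Critical", "Unknown"),
]


def by_priority(register):
    """Return scenarios with the worst effect first.

    Ties on effect are broken by the higher chance of failure, then by name.
    """
    return sorted(
        register,
        key=lambda s: (-EFFECT_RANK[s.effect], -s.chance_upper_pct, s.name),
    )


if __name__ == "__main__":
    for s in by_priority(REGISTER):
        print(f"{s.effect:<8} <{s.chance_upper_pct}%  {s.name}  (downtime: {s.downtime})")
```

Sorting on effect first reflects the way the plan itself labels each scenario. A different ordering, for example by expected downtime, would need the downtime values to be recorded as numbers rather than free text.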