Disaster Recovery Exercise - February 2012
Services Unit Report

Scenario

The north part of the ceiling of the Forum server room has collapsed, leading to the destruction of all equipment in racks 0, 1 and 2. Equipment in other racks is unaffected but, due to safety concerns, no access to the server room is permitted. Services affected by this incident must be identified and restored as quickly as possible.

Affected equipment

The following items of equipment were affected:

Name          Role                  Notes
squonk        AFS file server       (1)
crocotta      AFS file server       (1)
bunyip        AFS file server       (1)
cameleopard   AFS file server       (1)
ifevo1        AFS disk array        (2)
bioboy        Virtual server data
mermaid       CVS/SVN server
stumer        Samba server

(1) Data from ifevo2 mounted on these hosts remained intact.
(2) This array contained only AFS data.

Initial difficulties

The first task was to identify which equipment had actually been affected by this event. Since no up-to-date documentation recording the contents of the racks in the machine room seems to exist, some other method of determining the contents of the destroyed racks had to be found (had this been a real-life occurrence, it would of course have been much easier to determine which hardware was no longer contactable). This was eventually done by consulting the rfe maps for the switchable plug bars attached to the racks in question.
Since the possibility exists that these maps might be out of date or incorrect (though we have no reason to think that this is the case for the maps we consulted), it seems desirable to have a more formal and rigorous procedure for recording rack contents.

Once ifevo1 had been identified as one of the affected items of equipment, it was necessary to determine which partitions were stored on that array. This procedure was considerably simplified by the presence of a recent printout of the contents of the array on the desk of one member of the unit. This is clearly information which needs to be both accurate and readily available in the event of an incident similar to the one portrayed in this exercise. Some work has been done on determining disk array contents, and this work should be built upon to automatically record this information in a safe location at regular intervals.

Status of backups

All mirrors and backups were completed successfully on the day preceding the incident. This means that the AFS backup volumes were updated between 19:00 and 22:00 on Sunday the 26th and the mirrors of the CVS, SVN and Samba data were updated between 00:00 and 01:00 on Monday the 27th.

This does expose one weakness in our tape backup strategy. We currently back up the mirrors of the CVS, SVN and Samba data rather than the live data, and these backups take place before the mirroring. Each tape backup therefore captures the mirror made the previous night, so any data for these services recovered from tape would have been on the order of 24 hours out of date. There is no real reason why this data cannot be backed up directly from the relevant servers, and this should be done as a matter of urgency.

Actions taken

AFS data

There are two possible courses of action available when AFS data becomes unavailable due to an event of this sort. One is to obtain replacement hardware for the servers and storage lost and restore the lost AFS volumes to this hardware. The other is to promote the offsite RO copies of the affected volumes to be RW volumes.
Both procedures have their advantages and disadvantages. Restoring from tape has the disadvantage that it takes a considerable amount of time. The inconvenience to users during this period can be somewhat ameliorated by allowing them read-only access to the data contained in the RO volumes. The big advantage of restoring RW data from tape is that once the data is restored, no further action is required.
How long will the restore of data take? Special tools exist within TiBS to speed up the recovery of entire partitions or servers, but at the moment we have little real idea of how long the process might take. We should carry out a test restore of an entire partition in the near future and extrapolate from that approximate times for restoring entire servers and sites.

Promoting the RO volumes to RW has the major advantage that users regain full access to their data in a very short timeframe. We estimate that in this case all access to AFS data would have been restored well within 2 hours of the start of the process. The main disadvantage of this approach is the time taken to restore the service to its original state. Although the RW volumes will once more be available, they will be on the 'wrong' site (our policy is to have all RW volumes located in the central area) and there will be no offsite RO volumes, leaving us very vulnerable to further failures. A very rough estimate, based on the time taken to move all the RW volumes to the central area in the first place, is that it would take several weeks to restore the service to its original state. Consideration needs to be given to how we might most efficiently carry out this process.

The specifics of this incident mean that there is a third alternative to consider. Because some of the data mounted on the destroyed servers survived (the arrays holding it were in a separate rack), it would be possible to promote the RO copies of the volumes on the destroyed array to RW and to mount the surviving RW volumes on a different server, probably one of the servers in Appleton Tower. Data restored in this way would take longer to make available than a simple promotion of an RO volume but would still be available more quickly than if it had been restored from tape. We estimate that data restored in this way would be available by the end of the first day of the incident.
This is not a procedure of which we have practical experience within the School, and we should try it out in the near future.

Non-AFS data

Two services were affected by this incident: CVS/SVN and the admin Samba service. In both cases, only the server hardware was located in the destroyed racks; the service data was located on an array in a surviving rack. As mentioned in the backups section, even if this had not been the case, it would have been possible to retrieve the data from the off-site mirrors and, failing that, from the backup tapes, though as explained above the data recovered from tape would have been on the order of 24 hours old.

One issue raised by this exercise is the availability of server hardware
suitable for restoring services to. In most cases, the simplest solution is to set up a VM for the service and accept, in the short term at least, any performance issues which may result from this. In the case of services using fibre-mounted storage, however, this is not an option, as we cannot currently use fibre-attached storage with VMs. In this case, a supply of physical server hardware and fibre HBAs is needed. It is obviously desirable that this cache of hardware should not be located in a server room or indeed in a basement area, given that flood is probably one of the more likely scenarios which we have to plan for.

Since the CVS/SVN service had been the subject of a major restoration effort last year, we decided that restoring the Samba service would offer more possibility of lessons being learned. Our experiences are recounted in the section below.

Restoration of the Samba service

When considering the restore of the Samba service, there were at least three possibilities: installing new hardware, re-using existing hardware, or using a VM. Since we had no new, unused hardware to hand, the choice was between existing hardware and a VM. The initial choice was to create a VM to host the restored Samba service, but it was discovered that the current VM installation options do not allow SL5 to be installed (and our Samba service requires SL5, as there was no SL6 component, etc). It was therefore decided to re-use an existing machine, fantoosh.inf, and restore the service to this.

The requisite resources from stumer's profile were copied across to fantoosh and configured on that host, generating the relevant files. If this happened for real, the partitions containing the data would be restored, and so paths would remain the same (and would not have to be tweaked as in the exercise). Currently, the Samba data is mirrored to disk, and then that mirror partition is dumped to tape.
As a test of this, the data was recovered from the tape backup of the mirror (rather than from the extant mirror, as would be the case in an actual incident, assuming the mirror survived, of course). The time this takes varies with the current TiBS load and can be several hours (for example, restoring 37 MB from a 453 GB mirror partition backup took 7 hours). This would have been quicker had the data been backed up directly to tape, rather than via the mirror partition.

Once the profile configuration was complete, and the password and user files had been copied into place from the adminsmb restore, it was possible to run the samba component and start the daemon, which made the shares available to other hosts.
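
The timing figure above, combined with the earlier suggestion of extrapolating from a test restore, lends itself to a rough calculation. The Python sketch below is illustrative only. It assumes that restore time scales linearly with the amount of backup data that has to be read, and that the 7 hours were dominated by reading through the 453 GB partition backup rather than by the 37 MB actually recovered. Neither assumption has been tested, the function names are our own, and the 2000 GB server size is a made-up example rather than a measured figure.

```python
def effective_rate_gb_per_hour(backup_gb, hours):
    """Effective read rate implied by a single measured restore."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return backup_gb / hours


def estimate_restore_hours(target_gb, rate_gb_per_hour):
    """Linear extrapolation: time to restore target_gb at the given rate."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return target_gb / rate_gb_per_hour


# Figures from the Samba test restore: a 453 GB backup read in 7 hours.
rate = effective_rate_gb_per_hour(453, 7)

# Hypothetical server holding 2000 GB of data (not a real figure).
hours = estimate_restore_hours(2000, rate)

print(f"Effective rate: {rate:.1f} GB/hour")
print(f"Estimated restore time for 2000 GB: {hours:.1f} hours")
```

Because the measured time varies with TiBS load, any such estimate is at best an order-of-magnitude guide until a test restore of an entire partition has been carried out.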
Lessons learned and actions arising

Lessons

1. Some system should be put in place to record and regularly update the contents of server room racks.
2. The configuration of the various disk arrays needs to be dumped to a secure location at regular intervals.
3. Service data should be backed up, wherever possible, directly from the host providing the service.
4. There needs to be a supply of server and fibre hardware available for cases where a VM cannot act as a suitable replacement.

Actions

1. Instigate a rack content recording system.
2. Arrange for regular dumping of disk configuration information (services unit).
3. Arrange for service data to be backed up directly (services unit).
4. Carry out a test restore of an entire AFS partition (services unit).
5. Experiment with mounting AFS volumes on a different server (services unit).
6. Arrange for a supply of replacement server and fibre hardware.
7. Give consideration to the best way to restore the AFS service to normality after promoting offsite RO volumes (services unit).
8. Consider the implications of ext3/4 journals being held on SSDs internal to a server when the disks are on a SAN elsewhere: if the machine and its SSD (and so the journals) are destroyed but the file systems on the SAN survive, what would that mean when trying to bring the file systems back online on another machine (without the journal)?
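
As a starting point for lesson 1 and action 1, the Python sketch below shows one possible shape for a rack content record and the lookup we needed during this exercise. It is a minimal illustration, not a proposal for the final system. The assignment of hosts to individual racks is invented (the exercise only tells us that these hosts were somewhere in racks 0 to 2), `example-host` is a placeholder, and the function names are our own.

```python
import json
from datetime import datetime, timezone

# Hypothetical inventory: rack number -> hosts in that rack.
# Which host sits in which of racks 0-2 is invented for illustration.
RACKS = {
    0: ["squonk", "crocotta", "bunyip"],
    1: ["cameleopard", "ifevo1", "bioboy"],
    2: ["mermaid", "stumer"],
    3: ["example-host"],  # placeholder for an unaffected rack
}


def affected_hosts(racks, destroyed):
    """Return a sorted list of every host located in the destroyed racks."""
    hosts = []
    for rack in destroyed:
        hosts.extend(racks.get(rack, []))
    return sorted(hosts)


def snapshot(racks, when=None):
    """Serialise the inventory with a UTC timestamp so stale copies are obvious."""
    when = when or datetime.now(timezone.utc)
    record = {
        "recorded_at": when.isoformat(),
        "racks": {str(rack): hosts for rack, hosts in racks.items()},
    }
    return json.dumps(record, indent=2, sort_keys=True)


print(affected_hosts(RACKS, [0, 1, 2]))
```

The output of `snapshot` would need to be written somewhere outside the server room at regular intervals for the record to be of any use in an incident of this kind.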