Disaster Recovery Design Through Collaboration and Creative Data Management
Bob Booth
University of Illinois at Urbana-Champaign, CITES

Design Criteria - Constraints
- Budget constraints; not enough for:
  - Equivalent equipment
  - Complete replication
  - Off-campus hot site
  - Remote tape facilities

Collaboration - Design
- Another department was running similar backup software, but needed assistance:
  - Moving toward a tape-less design
  - No off-site facility
  - No backup of their backup system
  - Limited budget
- Use the strength of the software to create a mutually beneficial off-site recovery system

Collaboration - Constraints
- Adequate space available on both servers
- Network pipe large enough
- Available 24/7
- Technical coordination / service administration

Collaboration Agreement
- Using the backup software's resources, communications would be set up between the two systems
- Enough space would be 'made available' on both systems for each other's off-site needs
- Each administrator would be 'on-call' for technical assistance if needed
- Sufficient privilege would be granted to each administrator at the application level

Using the System
- The other server was used for infrastructure backups at first
- After the other server was converted from tape to disk, extra space was made available, so we began additional off-site backups of all of our production systems
- Problem: we discovered very quickly that we had 'too much data' for them to handle

Creative Data Management
- The space problem forced us to make a decision about our servers and our data
- For the purpose of backups only, we decided to go with a 'tiered' approach:
  - Mission-critical servers that could benefit from having a readily available third backup would be 'TIER 1' (2nd off-site copy)
  - All others would be 'TIER 2' (only one off-site copy)

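The tiering rule above can be sketched in a few lines of Python. This is only an illustration: the `Server` class, the `mission_critical` flag, and the server names are hypothetical and do not come from the actual backup configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    mission_critical: bool  # hypothetical flag; the real criterion is a judgment call


def assign_tiers(servers):
    """Group server names by backup tier: 1 for mission-critical, 2 for all others."""
    tiers = {1: [], 2: []}
    for server in servers:
        tier = 1 if server.mission_critical else 2
        tiers[tier].append(server.name)
    return tiers


# Made-up example inventory.
inventory = [
    Server("mail", mission_critical=True),
    Server("dev-wiki", mission_critical=False),
    Server("payroll", mission_critical=True),
]
tiers = assign_tiers(inventory)
```

Keeping the rule to a single yes/no question per server makes the tier list easy to review and to revisit when the available off-site space changes.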
Semi Passive Spare
- A smaller system was placed in our backup data center (about 4 miles away)
- Enough memory and disk were installed to make it usable
- A complete copy of the production backup server is kept on this spare and is kept in 'sync' daily
- Infrastructure is dumped daily, DR plans hourly
- The system is designed to be operated in isolation from our production site

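One way to confirm that the spare really is in sync is a freshness check against the schedule above (infrastructure daily, DR plans hourly). The following is a minimal sketch; the item names and the way timestamps are gathered are assumptions, not part of the system described here.

```python
from datetime import datetime, timedelta

# Maximum acceptable age per item, taken from the schedule on the slide.
# The key names are hypothetical.
MAX_AGE = {
    "infrastructure_dump": timedelta(days=1),
    "dr_plan": timedelta(hours=1),
}


def stale_items(last_updated, now):
    """Return the sorted names of items that are missing or older than allowed."""
    return sorted(
        name
        for name, limit in MAX_AGE.items()
        if name not in last_updated or now - last_updated[name] > limit
    )


# Made-up timestamps: the dump is 11 hours old, the DR plan is 90 minutes old.
now = datetime(2024, 1, 2, 12, 0)
last_updated = {
    "infrastructure_dump": datetime(2024, 1, 2, 1, 0),
    "dr_plan": datetime(2024, 1, 2, 10, 30),
}
stale = stale_items(last_updated, now)
```

In practice a small grace period on top of each limit would avoid false alarms when a transfer runs a few minutes late.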
Design Benefits
- Passive system: no additional application license fees
- Simple: easy to recreate on 'enterprise hardware'
- Isolated: it does not depend on the production system to operate
- Can be used for initial testing of software patches
- Fully testable Tier 1 disaster recovery plan
- Easy to document

Sample DR Implementation
- Assume that the main site is inoperable
- Head to the backup data center and begin restoring the 'latest' infrastructure dump
- Contact the collaborating department, inform them of the emergency, and ask them to be on standby
- While waiting, contact vendors to populate the DR site with recovery hardware
- Inform the 'vault' to return off-site tapes to the DR site
- When the restore completes, bring up the spare and review the most recent DR plan

Sample DR Implementation (cont.)
- Assume that Tier 1 networking is restored
- Inform sys-admins of available restore data
- Assume that Tier 1 spares are allocated
- Begin restores of Tier 1 systems
- Hardware arrives at the recovery site
- Begin restoration of the production server using the recovered DR plans
- Allow restores of Tier 2 systems
- Allow restores of non-production systems

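The restore sequence above (Tier 1 first, then Tier 2, then non-production) amounts to sorting systems by restore phase. A minimal sketch follows, assuming each system carries one of three hypothetical phase labels; the system names are invented for the example.

```python
# Restore phases in the order the slides allow them to start.
# The label strings are assumptions made for this sketch.
RESTORE_PHASES = ["tier1", "tier2", "non-production"]


def restore_order(systems):
    """Sort system names by restore phase, then alphabetically within a phase.

    `systems` maps a system name to its phase label.
    """
    return sorted(
        systems,
        key=lambda name: (RESTORE_PHASES.index(systems[name]), name),
    )


# Made-up systems for illustration.
systems = {
    "wiki-test": "non-production",
    "payroll": "tier1",
    "print": "tier2",
    "auth": "tier1",
}
order = restore_order(systems)
```

A system with an unknown phase label raises `ValueError`, so a mislabeled system cannot silently drop out of the restore queue.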
Daily Operations
- Overnight, backups come into the server, into disk storage pools
- Infrastructure is dumped to tape periodically
- Backups are separated by policy and migrate to on-site tape
- On-site tapes are copied to off-site tape and removed to the vault, along with the infrastructure dump, once a day
- Tier 1 backups are copied again, to the collaboration server's storage
- Infrastructure and the recovery plan are sent to the off-site passive spare, and updated recovery information is sent to the collaboration server

[Diagram: daily data flow. Primary server (backup data, on-site disk storage, virtual tape, on-site tape, DR plan); off-site tape, sent off-site daily; passive spare server, ~4 miles away over a 1 Gb network (space for infrastructure dump, DR plan; server database sent off-site daily, after all off-site copies are done); collaboration server, ~7 miles away over a 1 Gb network (additional copy of Tier 1). Legend: red = Tier 1 data, blue = Tier 2 data, orange = internal, black = off-site copy.]

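The daily flow can be summarized as a routing table from each kind of data to the places it ends up. The sketch below is one reading of the slide, not the backup software's configuration: the key names and destination strings are invented, and treating the 'updated recovery information' as the DR plan is an assumption.

```python
# Where each kind of data ends up after the daily cycle (hypothetical labels).
DESTINATIONS = {
    "tier1": ["on-site tape", "off-site tape (vault)", "collaboration server"],
    "tier2": ["on-site tape", "off-site tape (vault)"],
    "infrastructure": ["off-site tape (vault)", "passive spare"],
    # Assumption: the recovery information sent to the collaboration
    # server is the DR plan itself.
    "dr_plan": ["passive spare", "collaboration server"],
}

ON_SITE = {"on-site tape"}


def off_site_copies(kind):
    """Return the off-site destinations for one kind of data."""
    return [dest for dest in DESTINATIONS[kind] if dest not in ON_SITE]


tier1_off_site = off_site_copies("tier1")
tier2_off_site = off_site_copies("tier2")
```

Written this way, the table shows at a glance that Tier 1 data has two off-site copies while Tier 2 data has one.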
Outcome
- Allows for simple testing of upgrades
- Easy to rebuild
- Testing for audits is repeatable
- Can work in isolation
- Documentation is easy, and based on real output
- System is easy to expand
  - Pour in some cash to build/enhance infrastructure
- Disaster 'warm fuzzies'

Conclusions
- Protect your disaster recovery infrastructure
- Keep it simple
  - Don't over-engineer it or make it too complicated to deal with in a disaster
- Keep it as independent as possible
  - Don't tie it to critical systems whose loss would make it impossible to run in a disaster
- Double-check all possible connections to critical infrastructure
- Document everything; pictures are nice too ;-)