HOW PERTH COUNTY CAN IMPROVE ITS DISASTER RECOVERY PREPAREDNESS USING SERVER VIRTUALIZATION TECHNOLOGIES AMCTO STUDENT NO: 209564
CONTENTS
Executive Summary
Scope
Disaster Recovery: Are We Prepared?
    Disaster Recovery and Business Continuity
    Measures of Disaster Recovery
Server Virtualization: An Introduction
    What is Server Virtualization?
    How Does it Work?
    What are the Benefits?
A Disaster Recovery Scenario
    Disaster Scenario
    Perth County: Today
        Computing Infrastructure
        Backup and Data Protection Process
        Data Recovery Process
        Evaluation
    Perth County: Tomorrow (A Disaster Recovery Proposal)
        Computing Infrastructure
        Backup and Data Protection Process
        Data Recovery Process
        Evaluation
Challenges and Considerations
Conclusion
Bibliography
EXECUTIVE SUMMARY

No one expects a disaster to happen, but they do. Whether natural or human-caused, disasters can have a devastating impact on businesses of any size that are not adequately prepared.

The concept of virtualization has been around for over forty years, but only with recent developments in the technology have smaller municipalities begun to embrace it as a viable option for their data centres. As server virtualization technology evolves and industry adoption increases, organizations are recognizing benefits that reach far beyond the most popular justifications: hardware consolidation, reduced operating costs (lower power and cooling requirements) and reduced capital costs (fewer servers to purchase). The industry has only recently realized that virtualization can be leveraged beyond these initial benefits to provide a platform for enhanced disaster recovery strategies.

This report explains how the Municipality of Perth County can use server virtualization to significantly improve its level of disaster recovery preparedness. It does this by comparing two disaster recovery models: the current state of disaster recovery at Perth County and a new, proposed model based on server virtualization technologies.

This report will show that the current state of disaster preparedness at Perth County is inadequate by today's standards and would fail to provide an acceptable recovery timeframe in a disaster situation. It will then show how Perth County could realize a significant improvement in disaster recovery timeframes, data safety and security by adopting the proposed disaster recovery model based on server virtualization.
SCOPE

This report discusses how the Municipality of Perth County can dramatically improve its level of disaster recovery preparedness by implementing a server virtualization project. In particular, it focuses on system backup and recovery models as major components of disaster recovery preparedness. It does not consider Perth County's disaster recovery preparedness as it relates to the organization's overall business continuity plan or the status of that plan.

The report compares and evaluates two disaster recovery scenarios:

- One outlining the current state of data backup and recovery in place at Perth County.
- The other, a proposed new data backup and recovery model based on server virtualization technologies.

The scope of this report is restricted to how server virtualization technology can affect the backup and recovery of the Perth County data centre servers and the applications they provide to staff. Since the focus is on server backup and recovery only, this report does not consider other technology systems, such as telephone systems, mobile devices or internet connectivity, as part of the disaster recovery preparedness evaluation.

The majority of research for this report was gathered from independent research publications, technical white papers, online articles, vendor documentation, and personal use of and experience with the current Perth County systems and procedures.
DISASTER RECOVERY: ARE WE PREPARED?

On the afternoon of Sunday, August 21, 2011 a tornado (rated as an F3 on the Fujita Scale) ripped through the town of Goderich, severely damaging the historic downtown and homes in the surrounding area and causing damages in the range of $75 to $100 million 1. As a result, many municipal services were severely disrupted for extended periods. The town of Goderich is only 73 km from Stratford, Ontario, the location of the Municipality of Perth County's main office building and data centre.

What if such a disaster struck your municipality, crippling your organization's data centre? What level of service could you provide to your citizens if your organization had no access to email, data files (e.g. financial records, employee data) or the internet?

Disasters such as tornadoes, fires, floods and lightning strikes aren't everyday occurrences in Southern Ontario, but they can and do happen. Smaller threats like extended power outages, hardware failures, computer viruses and even human error, though not as overwhelming, are more probable and can be just as disruptive. Whatever the cause of a server or data centre outage, anything that comes between employees and the applications and data they use for work has a negative impact on productivity and service delivery.

If the risks of data centre disruptions are real, so too are the associated costs. Potential costs associated with a data centre disruption at Perth County might include:

- Business losses: Provincial Offences fines not paid, permits not issued.
- Loss of employee productivity: a large percentage of all jobs now rely on computers to input and analyze data. With no paper-based alternatives, staff would have little to do.
- Liability and penalties associated with regulatory compliance: critical municipal information kept only in a digital format could be lost. This loss could result in financial penalties arising from legal requirements such as Municipal Freedom of Information and Protection of Privacy Act (MFIPPA) requests or the financial reporting requirements that apply to all municipalities.
- Reputation loss: members of council and senior staff will be scrutinized for the lack of disaster planning if disruptions cause extended periods of diminished service levels for citizens, which could affect future political ambitions.

Unfortunately, many small and medium-sized municipalities haven't recognized the impact a disaster can have on their organization. Despite increasing warnings from technology experts, it seems that many still think it can't happen to them. Research on the disaster preparedness of small and medium-sized businesses (SMBs) has found that:

1 2011 Goderich, Ontario tornado, http://en.wikipedia.org/wiki/2011_goderich,_ontario_tornado (accessed July 20, 2012)
- One in three (33%) has experienced a significant outage in the past two years because of a disaster or emergency such as a power outage, server crash, storage failure, cooling failure, fire, flood, earthquake, hurricane or tornado. 2
- More than one in five (21%) have lost critical business data as a result of an accident, disaster or emergency in the past two years. 3
- Only 50% of SMBs have any kind of disaster preparedness plan in place. 4
- Of the 50% who do not have a plan, 52% don't think computer systems are critical to business and 40% said that disaster preparedness was not a priority. 5
- Downtime costs SMBs a median of $12,500 per day. 6

Though the research dealt with small and medium-sized businesses, there's little reason to believe the results wouldn't be similar for municipalities of the same size. Given the risks and costs associated with data centre disruptions, organizations of any size should put some form of disaster recovery preparedness plan or process in place.

DISASTER RECOVERY AND BUSINESS CONTINUITY

Disaster recovery certainly isn't a new topic, but with the list of high-profile disasters in recent years (9/11, Hurricane Katrina, the northeast blackout of 2003, the Canadian ice storm of 1998) there has been a definite increase in its importance and adoption across all business sectors and sizes. Disaster recovery can be considered a form of insurance that protects your IT assets when a disaster strikes.

When discussing disaster recovery, many people confuse it with the term business continuity. Though closely related, the two terms have very different meanings. Business continuity can be described as the processes and procedures an organization must put in place to ensure that mission-critical functions can continue during and after a disaster 7, while disaster recovery refers to specific steps taken to resume operations in the aftermath of a catastrophic natural disaster or national emergency 7. In more humorous terms, it's been said that "the difference is business continuity is keeping the patient alive. Disaster recovery is getting them back to being healed and walking again." 8

Disaster recovery is considered just one part of a larger business continuity system. This report concentrates on the technological aspects of disaster recovery, specifically how information technology systems are recovered in major service disruption scenarios.

MEASURES OF DISASTER RECOVERY

This report uses several factors and measures to compare and evaluate the disaster recovery preparedness models it outlines. Two of the most common measures of disaster recovery preparedness are recovery time objective (RTO) and recovery point objective (RPO).

Recovery time objective measures the amount of time a computer system or application can stop functioning before the outage is considered intolerable to the organization. 9 It is also referred to as the measure of downtime.

Recovery point objective describes a point in time to which data must be restored in order to be acceptable to the owner(s) of the processes supported by that data. It is often thought of as the time between the last available backup and the time a disruption could potentially occur 10. It is also referred to as the measure of data loss.

2 The Benefits of Virtualization for Small and Medium Business, http://www.vmware.com/files/pdf/vmware-smb-Survey.pdf (accessed July 20, 2012)
3 The Benefits of Virtualization for Small and Medium Business, http://www.vmware.com/files/pdf/vmware-smb-Survey.pdf (accessed July 20, 2012)
4 Symantec 2011 SMB Disaster Preparedness Survey - Global: January 2011, http://www.symantec.com/content/en/us/about/media/pdfs/symc_2011_smb_dp_survey_report_global.pdf (accessed July 21, 2012)
5 Symantec 2011 SMB Disaster Preparedness Survey - Global: January 2011, http://www.symantec.com/content/en/us/about/media/pdfs/symc_2011_smb_dp_survey_report_global.pdf (accessed July 21, 2012)
6 Symantec 2011 SMB Disaster Preparedness Survey - Global: January 2011, http://www.symantec.com/content/en/us/about/media/pdfs/symc_2011_smb_dp_survey_report_global.pdf (accessed July 21, 2012)
Figure 1.1: The timeline of Recovery Point Objective and Recovery Time Objective. [Timeline, in order: Last Good Backup, Disaster Strikes, Systems Recovered. The Recovery Point Objective spans the interval from the last good backup to the disaster; the Recovery Time Objective spans the interval from the disaster to the point where systems are recovered.]

7 Business Continuity and Disaster Recovery (BCDR), http://searchstorage.techtarget.com/definition/business-Continuity-and-Disaster-Recovery-BCDR (accessed August 2012)
8 Disaster Recovery 101: What you need to know, http://www.computerworld.com/s/article/print/9221831/disaster_recovery_101_what_you_need_to_know (accessed September 2, 2012)
9 Consolidated Disaster Recovery Using Virtualization, http://whitepapers.theregister.co.uk/paper/view/508/conslidated-dr-white-paper.pdf (accessed August 12, 2012)
10 Consolidated Disaster Recovery Using Virtualization, http://whitepapers.theregister.co.uk/paper/view/508/conslidated-dr-white-paper.pdf (accessed August 12, 2012)
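The two measures can be made concrete with a small calculation. The Python sketch below is illustrative only: the function name and the incident timestamps are hypothetical and do not come from Perth County records. It computes the data-loss window (the figure compared against an RPO) and the downtime (the figure compared against an RTO) for a single incident.

```python
from datetime import datetime, timedelta


def recovery_measures(last_good_backup, disaster, systems_recovered):
    """Return (data_loss_window, downtime) for one incident.

    data_loss_window is what gets compared against the RPO;
    downtime is what gets compared against the RTO.
    """
    if not (last_good_backup <= disaster <= systems_recovered):
        raise ValueError("timestamps must be in chronological order")
    data_loss_window = disaster - last_good_backup
    downtime = systems_recovered - disaster
    return data_loss_window, downtime


# Hypothetical incident: nightly backup at 23:00, disaster the next
# afternoon, systems back online two mornings later.
last_backup = datetime(2012, 9, 3, 23, 0)
disaster_time = datetime(2012, 9, 4, 15, 0)
recovered_time = datetime(2012, 9, 6, 9, 0)

loss, down = recovery_measures(last_backup, disaster_time, recovered_time)
print("Data loss window:", loss)
print("Downtime:", down)
```

In this made-up incident the organization would lose 16 hours of data and be down for 42 hours; whether that is acceptable depends on the RPO and RTO the organization has set.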
When using these measures to evaluate any recovery system, the smaller the RPO and RTO, the better prepared your organization is to recover from a service disruption of any magnitude.

Another factor we will consider is that of backup sites. One of the most important aspects of disaster recovery is having a location from which the recovery can take place. This location is known as a backup site. In the event of a disaster, a backup site is where your data centre will be recreated and where you will operate from for the length of the disaster. There are three common types of backup sites 11:

- Cold site: has no equipment or provisions at the ready; everything must be brought in, set up and configured at the time of recovery.
- Warm site: already has the hardware required for the recovery process set up and configured, but the backup media must still be retrieved before recovery can begin.
- Hot site: contains a virtual mirror of your current data centre, with all systems configured and waiting to begin recovery.

As you move from cold site to hot site, the time to recover from a disaster (RTO) decreases, but the cost to implement the site increases dramatically.

Now that we have a basic understanding of disaster recovery, why we need it and some of the factors involved, the next section introduces the concept of virtualization, its benefits and how these relate to disaster recovery.

11 Backup Site, http://en.wikipedia.org/wiki/backup_site (accessed September 2, 2012)
SERVER VIRTUALIZATION: AN INTRODUCTION

Before we can evaluate the disaster preparedness models outlined in this report, we need to introduce and explain the concept of server virtualization.

Even though virtualization has been around for decades, it was long viewed as a technology for mainframe computers, affordable only to large enterprises. That changed when VMware delivered the benefits of virtualization to industry-standard x86-based platforms (a lower-cost hardware platform for servers), making virtualization affordable for businesses of all sizes 12.

Since it became affordable and its advantages were realized, there has been a steady increase in the number of small and medium-sized businesses (SMBs) adopting server virtualization. In a recent survey of SMBs, 64% reported that they had already adopted server virtualization 13.

WHAT IS SERVER VIRTUALIZATION?

Let's start with one basic definition of virtualization:

"Virtualization lets you run multiple virtual machines on a single physical machine, with each virtual machine sharing the resources of that one physical computer across multiple environments. Different virtual machines can run different operating systems and multiple applications on the same physical computer." 14

In the past, what we considered a server was a physical machine with an operating system (e.g. Windows Server 2008) and applications running on it (e.g. an email server). That physical machine and its resources (CPU, memory, hard drive, network card, etc.) were tightly coupled and dedicated to that one operating system and its applications. If you needed another server to run a different application (e.g. a database server), you purchased another physical machine with its own operating system. This was typically done to isolate applications from one another and prevent conflicts.
Virtualization breaks the operating system's dependence on the underlying physical hardware, allowing multiple operating systems and their applications to reside on the same physical machine and share all of its resources. Figure 1.2 shows what the traditional and virtual server concepts look like.

Think of this very simplified example of houses and roommates. Without (before) virtualization, every person has to live in their own separate house with their own bedroom and a specific brand of stove and dishwasher (e.g. LG, Kenmore, GE, etc.). With virtualization, people can now live together in the same house and still have separate bedrooms, but they share the stove and dishwasher (resources) and no longer care what brand they are. You can already see some of the benefits of this arrangement.

Figure 1.2: The traditional server model vs. the virtual server model.
Before virtualization:
- Single operating system per physical machine
- Software and hardware tightly coupled
After virtualization:
- Hardware independence of the operating system
- Multiple virtual machine operating systems can run on the same physical hardware

HOW DOES IT WORK?

Before virtualization, the operating system would sit on top of the physical hardware and talk to it directly in order to use resources like memory or the CPU (see Figure 1.2 above). In a virtualized environment, the original operating system layer is replaced by a new virtualization, or hypervisor, layer that sits between the actual hardware resources and each running copy of an operating system. Each operating system runs inside a separate virtual machine. These virtual machines gain access to the hardware layer only through calls to the hypervisor layer, which is responsible for resource allocation.

In essence, the hypervisor acts as an interpreter for the virtual machines. When a virtual machine needs to use the network card, it contacts the hypervisor and tells it what it needs to do. The hypervisor then translates that request so that it can be understood and executed on the specific make and model of network card in the physical server. The beauty of this system is that the virtual machines see the same hardware regardless of the specific make or model of hardware in the physical server. This allows virtual machines to run on any server running a hypervisor without extra configuration changes.

12 Virtualization Overview, VMware, http://www.vmware.com/pdf/virtualization.pdf (accessed August 24, 2012)
13 State of SMB IT 1H 2012, http://www.spiceworks.com/marketing/insights (accessed September 10, 2012)
14 What is virtualization? http://www.vmware.com/virtualization/what-is-virtualization.html (accessed August 24, 2012)
The hypervisor also ensures that all virtual machines are isolated from one another. Even though virtual machines share the physical resources of a single physical machine, they remain completely isolated from each other, as if they were separate physical machines. In some cases, the operating system itself has no way of knowing that it is running in a virtualized environment. If a bug or virus crashed one virtual machine, the other machines would remain unaffected and continue on as if nothing had happened.

Using virtualization, the operating system and application can be encapsulated into a single file. This essentially wraps the entire server up into one file that contains all the information required to run that server on another physical machine. Encapsulation makes virtual machines incredibly portable and easy to manage.

WHAT ARE THE BENEFITS?

Now that we understand the basics and the power of server virtualization, let's discuss some of the major benefits Perth County can realize by undertaking a server virtualization project.

HARDWARE CONSOLIDATION AND UTILIZATION

Historically, the traditional server model was the norm: one physical server for each application workload. Unfortunately, this often led to hardware over-provisioning, where the hardware capacity purchased was much more than was required to do the job. Most servers in a traditional server scenario operate at only about 5-15% of their total capacity 15.

By virtualizing your servers and then consolidating them onto fewer physical servers, you can maximize your hardware capacity utilization while reducing the number of physical servers needed to run your data centre. For example, you might have three physical servers, each running a separate application (e.g. mail server, web server and database server) at utilization rates below 40%, and virtualize them to run all three as virtual servers on one physical server (see Figure 1.3).

15 Server Consolidation, http://www.vmware.com/solutions/consolidation/consolidate.html (accessed September 12, 2012)
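The consolidation idea can be sketched as a simple packing exercise. The Python example below is a rough illustration, not a sizing tool: the `consolidate` function, the 80% headroom limit and the sample utilization figures are all assumptions made for this sketch. It also assumes every workload is measured as a fraction of one identical host and ignores memory, storage and network limits, which a real capacity plan would have to include.

```python
def consolidate(utilizations, host_capacity=0.8):
    """Group workloads onto hosts using first-fit decreasing.

    utilizations: each workload's load as a fraction of one host (0.15 = 15%).
    host_capacity: assumed ceiling per host, leaving headroom for peaks.
    Returns a list of hosts, each a list of the loads placed on it.
    """
    hosts = []
    for load in sorted(utilizations, reverse=True):
        if load > host_capacity:
            raise ValueError(f"workload {load} exceeds host capacity")
        for host in hosts:
            # Small tolerance avoids rejecting an exact fit due to rounding.
            if sum(host) + load <= host_capacity + 1e-9:
                host.append(load)
                break
        else:
            hosts.append([load])  # no existing host had room
    return hosts


# Hypothetical mail, web and database servers at 15%, 10% and 5% load.
workloads = [0.15, 0.10, 0.05]
hosts = consolidate(workloads)
print(f"{len(workloads)} physical servers -> {len(hosts)} host(s)")
```

With these sample loads all three workloads fit on a single host; heavier workloads simply spill onto additional hosts.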
Figure 1.3: Server consolidation through virtualization.

By consolidating your hardware and increasing its utilization, you will also see these other benefits from virtualization:

REDUCED PHYSICAL INFRASTRUCTURE COSTS

Fewer physical servers are required to run your data centre, which means less hardware to buy and maintain (warranties, spare parts, etc.). A 2011 survey 16 suggests that the average rate of server consolidation directly attributed to virtualization is 5:1. When hardware cost savings and other capital reductions are totalled, VMware estimates 17 savings of more than $3,000 annually for every server you virtualize.

REDUCED DATA CENTRE OPERATIONAL COSTS

With fewer physical servers in your data centre, you will require less physical space to house them, less electricity to power them and less electricity to cool the data centre, all leading to lower operating costs and a smaller carbon footprint. According to one report by Gartner Inc. 18, the effective use of virtualization can reduce server energy consumption by up to 82% and floor space by 85%.

16 Virtualization Industry Quarterly Survey, http://www.v-index.com/consolidation-ratio.html (accessed July 22, 2012)
17 Server Consolidation, http://www.vmware.com/solutions/consolidation/consolidate.html (accessed September 12, 2012)
PORTABILITY AND FLEXIBILITY

Each virtual machine is implemented as a single file or a small collection of files containing the operating system and application files plus the virtual machine configuration. Because virtual machines are encapsulated into files, you can manage them the same way you manage other files. For example, you can move and copy a virtual machine from one physical server to another just like any other file, or save a virtual machine on any standard data storage medium, from a pocket-sized USB flash drive to an enterprise storage area network (SAN).

A common feature of most virtualization software allows a running virtual machine to be migrated from one physical server to another with no downtime for the end user.

The benefits of encapsulation were what piqued the interest of the disaster recovery community. The ability to quickly and easily back up, move, copy and restore virtual machines to another local or remote data centre allows organizations to greatly reduce the time it takes to bring important, top-priority systems and applications back online after a disruption. This is an extremely powerful concept that can be leveraged for a variety of purposes, including enhanced disaster recovery and reduced RTOs.

In the next section, we will use virtualization and the benefits it brings to propose a new disaster recovery model that Perth County could implement in place of its current systems. We will compare Perth County's current disaster recovery systems and processes against this proposed model using the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) metrics discussed earlier, and outline the pros and cons of each.

18 Server Virtualization: A Step Toward Cost Efficiency and Business Agility, http://www.avanade.com/documents/research%20and%20insights/server%20virtualization%20paper%20final %2001-14-09.pdf (accessed September 12, 2012)
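Because an encapsulated virtual machine is just a file, copying it to another location can be handled like any other file copy. The Python sketch below is a minimal illustration using only the standard library; the function names and the paths in the usage comment are hypothetical. It assumes the virtual machine is powered off (or that the file is a completed backup image), since copying the disk file of a running machine without the hypervisor's own snapshot tools can produce an inconsistent copy.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large VM images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_vm_image(source, destination_dir):
    """Copy one VM file into destination_dir and verify the copy by checksum."""
    source = Path(source)
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / source.name
    shutil.copy2(source, target)  # copies file contents and metadata
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"checksum mismatch after copying {source}")
    return target


# Hypothetical usage (paths are placeholders, not real County locations):
# copy_vm_image("/vmstore/mail-server.vmdk", "/mnt/backup-site/vmstore")
```

The checksum comparison is a design choice for this sketch: it confirms that the copy at the destination matches the source before the copy is relied on for recovery.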
A DISASTER RECOVERY SCENARIO

For the purpose of this report we need to set up a scenario that would require recovery from a disaster, because the requirements and options for recovery differ depending on the severity of the disaster.

DISASTER SCENARIO

Similar to the tornado that hit Goderich, Ontario in 2011, this scenario assumes that a natural disaster has hit Perth County, specifically the city of Stratford, the location of Perth County's administration building and central data centre. The damage is severe enough that the building is closed to all staff, and both the utility power and the generator power feed to the building have been disabled. The building and everything inside have been rendered useless.

With this scenario in mind, we will now compare and evaluate how Perth County would recover from this disruption. The first model to be examined is Perth County's current state of disaster preparedness; the second is a proposed system based entirely on Perth County's full adoption of server virtualization technology.

PERTH COUNTY: TODAY

COMPUTING INFRASTRUCTURE

Perth County has grown quickly in the past few years in its adoption and use of technology. It hosts its own website, Geographic Information Systems (GIS), email and database systems, to name a few. The Technology Services division is only six years old and has two full-time staff.

To run the current operation there are twelve physical servers housed in the central data centre. Though virtualization has been implemented to a small extent, it has only been on a go-forward basis. This has resulted in a hybrid infrastructure of physical and virtual servers (60% physical, 40% virtual). Using server virtualization for disaster recovery has not been a consideration to this point.
Consistent with findings in non-virtualized environments, average server utilization is low (average CPU (Central Processing Unit) usage: 4%; average memory usage: 51%). Figure 1.4 shows a snapshot of the Perth County Accounting server's utilization through a regular business day.
Figure 1.4: Snapshot of Perth County Accounting server utilization.

BACKUP AND DATA PROTECTION PROCESS

Perth County's current approach to disaster recovery relies entirely on tape-based media created nightly and stored off-site at the homes of staff members.

Each night a backup routine, managed by server software, first copies all critical data to a set of disk drives located onsite in the data centre and then copies that same data to the tape drive, also located in the County's data centre. Data is copied to disk first so that simple daily file recoveries can be achieved without having to retrieve the tape media from the off-site location. The tape, not the onsite disk storage, is considered the disaster recovery option.

Each morning staff swap out the nightly backup tape for the next day's tape in the rotation and then take the nightly tape off-site to a staff person's residence (located within the city limits) for safekeeping.

In this process, only critical data (files and databases created by staff) is backed up. System state data is not included. System state data consists of items such as information about user accounts, folder structures, permissions, applications and other critical configuration settings.

Any time a new service is brought online, a manual process must occur to add any critical data it produces to the backup routine.

So on any given day there would be tape backups, located at various sites within the city limits, containing backups of the County's critical data.

DATA RECOVERY PROCESS

In the disaster scenario considered for this report, a natural disaster has rendered the Perth County administration building and the data centre in it inoperative and inaccessible. With the current technology and methodologies in place, what would the recovery process look like?
BACKUP SITE

Besides the main administration building located in Stratford, Ontario, Perth County has several other buildings scattered throughout the geographic County. Some are in the same city (Stratford) while others are located up to 55 km away. In the current disaster recovery model, no one building has been identified as the backup site. In the disaster considered in this report, a suitable recovery site must first be identified (with power and cooling considerations in mind) and prepared.

Once a backup site has been identified, compatible replacement equipment, including servers, networking equipment and a compatible tape drive (to read the backup tapes), will have to be sourced and purchased. Remember that all of this equipment was in the main data centre, to which we have no access. Restoring the server infrastructure at the backup site would require twelve physical servers to get back to full service. The server procurement process alone could take five or more business days.

SERVER RECOVERY

Once the equipment arrives, staff must install, configure and set up everything from scratch. Since all software installation files are kept in the data centre on site, they will need to be downloaded from the internet before any installation of servers or applications can begin. In particular, the servers will need an operating system and specific application software installed and configured before use. In the case of certain proprietary applications, arrangements will have to be made with the vendors to assist in the installation and configuration of their products.

Once all software installations and configurations are complete and the servers are brought up to an operational level, the tape backups can be used to restore the critical data. In this situation, the off-site backup tapes are the only source for data recovery, as all other data sources reside on disk drives in the building's data centre and are unavailable.
The current amount of backup data will take approximately four to six hours to restore and test before production use can commence.

EVALUATION

Perth County's current state of disaster preparedness can be characterized as a "tape and pray" approach. Whether due to cost or other limitations, this minimal approach to disaster recovery leaves the County exposed to longer than normal recovery times if the system were ever used in a real-life disaster recovery situation.

RECOVERY TIME OBJECTIVE (RTO)

With no backup site identified, confusion would ensue as staff held discussions, carried out evaluations and made decisions to identify one. This step alone would have a negative impact on any desired RTO.

In addition, with no backup site and no access to the existing infrastructure, all replacement hardware will have to be sourced, purchased and delivered to the backup site. Even if a vendor agrees to expedite orders, delivery cannot reasonably be expected in less than two to three days.

With a hybrid server model (60% physical, 40% virtual), the number of physical servers required to fully recover would be twelve, a large number to acquire, set up, install and configure in a short period of time with only two full-time IT staff.
Because only critical data, and not server-specific configuration, is backed up to the tapes, it will take longer to rebuild all of the server systems. Many of the applications are proprietary and will require extra coordination with the vendors to complete their specific configurations, which presents a huge hurdle to meeting recovery time objectives.

RECOVERY POINT OBJECTIVE (RPO)

Current backup methods take a single snapshot of all the critical data on the servers. To minimize system impact, these backups take place at night when there is little staff activity. Tapes are then retrieved and taken off-site the next day. In this model, staff must reasonably expect to lose at least one business day's worth of data in a recovery situation. This also assumes that all off-site tapes are in a state from which they can be recovered.

Off-site tape rotation is a valid concept, but the current implementation has major flaws. Since all off-site tapes are kept at the homes of staff who live within the city limits, it is reasonable to expect that a city-wide disaster will negatively affect those homes and the tapes in them. In addition, human error can creep into any manual tape handling process, such as rotating tapes and taking them off-site to a staff member's home, where they can encounter further risks to their reliability.

One benefit of this system is cost. With the only tangible costs being a tape drive and a small amount of tape media, the cost to maintain this system is quite small.

From all of this it can be estimated that the time to fully recover at a backup site would be in the range of seven to ten days, with a minimum of one day's worth of data loss. I doubt many senior staff or members of council would agree to that RTO or RPO. In today's municipal settings, even in a disaster situation, that amount of downtime would be hard to justify.
Not being able to recover in a reasonable timeframe will negatively impact Perth County's ability to provide services to its citizens at an acceptable level.
PERTH COUNTY: TOMORROW (A DISASTER RECOVERY PROPOSAL)

Let's look into the future at what Perth County's disaster preparedness might look like if it adopts a full server virtualization model.

COMPUTING INFRASTRUCTURE

Perth County has adopted a philosophy of "virtualize everything" when it comes to its server infrastructure. It has gone back and, where possible, converted all physical servers to virtual machines, leaving only two server applications on their own physical hardware. It has appropriately sized and acquired new hardware and has reduced its physical server requirements from twelve down to five (assuming an average of six virtual machines per physical server), with room to grow if required. It has reduced the amount of electricity required to power and cool its data centre and has maximized the utilization of all physical servers.

When the County moved to a full virtualization server model, newer, more power-efficient hardware was purchased to replace the older hardware. The older hardware was relocated from the main data centre in Stratford to one of the County's remote locations almost 55 km away. This older hardware, though not as capable, is still sufficient to run the number of virtual machines required in a disaster recovery situation. In this setup four older servers were relocated to the warm backup site.

The focus of Perth County's disaster recovery preparedness now includes full virtual machine backups and off-site replication of this data to a warm backup site; no more "tape and pray".

BACKUP AND DATA PROTECTION PROCESS

The backup routine is now totally different. On a weekly basis a full backup is completed of each virtual machine in its entirety. Then, throughout the week, a backup is completed every two hours containing only the changes made to the virtual machine since its last full backup (a differential backup).
All backups are first made to disk drives located in the data centre at the Perth County administration building, and then the same backup data is replicated across a Wide Area Network (WAN) link to the warm backup site onto the older server hardware (see Figure 1.5). There is no tape hardware or media to deal with; every virtual machine backup includes all of the critical data, the system state and configurations, as well as all the files necessary to bring that virtual machine up quickly and easily on any other host server. Any time a new virtual machine is brought into production, it is simply added to the backup routine and all of its data, settings and configurations are automatically protected. So on any given day, the remote warm backup site would hold a complete copy of every server, with backup data refreshed every two hours.
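The schedule described above (a weekly full backup plus a differential every two hours) implies a simple restore rule: take the most recent full backup, then apply only the most recent differential taken after it. The following Python sketch illustrates that rule under stated assumptions. The `Backup` record, the `restore_chain` and `data_loss_window` helpers, and the sample timestamps are all hypothetical and do not represent any particular backup product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Backup:
    """Hypothetical record of one virtual machine backup."""
    taken_at: datetime
    kind: str  # "full" or "differential"


def restore_chain(backups: List[Backup],
                  disaster_at: datetime) -> Tuple[Backup, Optional[Backup]]:
    """Pick the latest full backup and the latest differential taken after it.

    A differential holds every change since the last full backup, so only
    the newest one is needed on top of that full backup.
    """
    usable = [b for b in backups if b.taken_at <= disaster_at]
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup exists before the disaster")
    full = max(fulls, key=lambda b: b.taken_at)
    diffs = [b for b in usable
             if b.kind == "differential" and b.taken_at > full.taken_at]
    diff = max(diffs, key=lambda b: b.taken_at) if diffs else None
    return full, diff


def data_loss_window(chain: Tuple[Backup, Optional[Backup]],
                     disaster_at: datetime) -> timedelta:
    """Time between the newest restorable backup and the disaster."""
    full, diff = chain
    newest = diff.taken_at if diff else full.taken_at
    return disaster_at - newest


# Assumed schedule: full backup Sunday 00:00, differentials every two hours.
week_start = datetime(2012, 9, 2, 0, 0)
schedule = [Backup(week_start, "full")] + [
    Backup(week_start + timedelta(hours=2 * i), "differential")
    for i in range(1, 84)
]

disaster = datetime(2012, 9, 5, 15, 30)  # assumed disaster time
chain = restore_chain(schedule, disaster)
loss = data_loss_window(chain, disaster)
print("Restore full backup from:", chain[0].taken_at)
print("Then apply differential from:", chain[1].taken_at if chain[1] else None)
print("Work lost:", loss)
```

In this example the data lost is whatever was created between the last differential and the disaster, which can never exceed the two-hour backup interval as long as replication to the warm site keeps pace.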
Figure 1.5: The new backup and data protection model

DATA RECOVERY PROCESS

So again we consider the disaster scenario in which the Perth County administration building and the data centre in it have been rendered inoperative and inaccessible. With this newly proposed backup and recovery model in place, what would the recovery process look like?

BACKUP SITE

As part of the conversion to full virtualization, a warm site was identified and established at a location far enough away (55 km) to avoid most disasters that would strike the Stratford administration building. Hardware replaced during the conversion to full virtualization is in place and online at this location. This site has adequate power and cooling to run the servers, and the infrastructure is already in place to be brought online in a disaster recovery scenario.

SERVER RECOVERY

Once a disaster has been declared, the recovery process takes just a few steps. IT staff will work with the virtual machine backups that have been replicated from the main data centre, bringing each one back online in a systematic manner. Once the virtual machines are online, the latest differential backups (taken every two hours) can be restored to each system. A full system recovery has been achieved when the systems are all online with the latest data backups applied.

EVALUATION

The disaster recovery model proposed here for Perth County is characterized by its use of virtualization technologies and off-site data replication. It provides a level of data protection and disaster
preparedness that will allow Perth County to recover to full service levels quickly and with little effort, in timeframes that would be acceptable to staff, council and citizens alike.

RECOVERY TIME OBJECTIVE (RTO)

With a warm backup site and all required infrastructure already in place, the recovery process can commence as soon as a disaster is called and confirmed. The warm site contains all of the hardware and data backups required to recover the entire infrastructure. Because full virtual machine backups are used for recovery, there is no need to contact software vendors and arrange for assistance when installing and configuring proprietary applications; they are already installed and configured in the backed-up virtual machine.

RECOVERY POINT OBJECTIVE (RPO)

With weekly full backups of the entire virtual machines and differential backups every two hours, data can be restored to a point no more than two hours before the disaster. Staff can therefore reasonably assume that they will lose, at most, the data created in the two hours prior to the disaster.

The warm site servers house the replicated backup data on multiple disk drives configured for redundancy (e.g. RAID 5, RAID 10). In such a configuration, at least one disk can fail before the data is put at risk. With tape media, the data sits on a single tape, and damage to or failure of that one tape will render all of the data on it useless.

By using virtualization as the basis of disaster recovery, there is no direct reliance on the make or model of hardware deployed at the warm site. This allows the County to use older, recycled hardware to help keep the costs of the warm site down. Though the older hardware may not perform as well as that in the live data centre, it will still allow the County to perform a full recovery and operate at an acceptable level of service during the disaster scenario.
With a model based on full virtualization and off-site data replication, it is reasonable to expect that the County can establish and meet an RTO of hours and an RPO of two hours of data loss. These are numbers I would fully expect all senior staff and council to support.

Though no system is without its challenges, the benefits of this model are too numerous to ignore: a smaller RTO and RPO can be established and met, financial savings can be realized through full virtualization (fewer physical servers, cheaper to cool and power), and IT staff no longer have to deal with the care and administration of tape media.

Today's municipalities are facing increasing reliance on digital data to meet legislative requirements and compliance. The proposed disaster recovery model will ensure that critical system data is safe, secure and accessible, and that the municipality can recover from a disaster in a reasonable time frame.

Table 1: A summary of the two disaster recovery models

                            Current Model                  New Model
                            (Based on Tape)                (Based on Virtualization)
Computing infrastructure    12 physical servers            5 physical servers
Backup site                 None identified or prepared    Warm site located 55 km away
Time needed to recover      7 to 10 days                   Hours
Expected data loss          At least 24 hours' worth       Up to 2 hours' worth
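The data loss figures in Table 1 follow from the backup interval of each model: a disaster that strikes just before the next backup loses one full interval of work. The Python sketch below works through that arithmetic. The `worst_case_data_loss` helper and the off-site lag values are illustrative assumptions, not measurements from Perth County.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         offsite_lag: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on lost work for a given backup routine.

    One full backup interval can be lost if the disaster strikes just before
    the next backup. If the newest backup has not yet reached the off-site
    copy, the lag it needs to get there is lost as well.
    """
    return backup_interval + offsite_lag


# Intervals taken from the two models: nightly tape vs. two-hour differentials.
tape_model = worst_case_data_loss(timedelta(hours=24))
virtual_model = worst_case_data_loss(timedelta(hours=2))
improvement = tape_model / virtual_model

print("Tape model, ignoring off-site lag:", tape_model)
print("Virtualization model, ignoring replication lag:", virtual_model)
print("Reduction factor:", improvement)

# Assumed lags, for illustration only: tapes leave the building about
# 12 hours after the nightly backup; WAN replication finishes in 30 minutes.
tape_with_lag = worst_case_data_loss(timedelta(hours=24), timedelta(hours=12))
virtual_with_lag = worst_case_data_loss(timedelta(hours=2), timedelta(minutes=30))
print("Tape model with assumed lag:", tape_with_lag)
print("Virtualization model with assumed lag:", virtual_with_lag)
```

The lag term shows why the tape model's loss is described as "at least" one day: until the newest tape has physically left the building, the most recent recoverable copy is the one from the night before.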
CHALLENGES AND CONSIDERATIONS

There are a few challenges that would face this proposed disaster recovery model. The majority involve the cost to implement. As with all problems solved by technology, the question is not whether the technology can solve the problem but whether you can afford the solution.

STAFF TIME COMMITMENT

With only two full-time staff and a regular schedule of project-related work and daily break-fix issues, finding the time to undertake a full virtualization project and establish a warm backup site would be challenging. All current projects and priorities must be examined and evaluated by senior staff in order to decide what work must be put on hold and what might be cancelled or put off until another fiscal year. Project buy-in from senior staff is crucial to making these decisions and dealing with the consequences.

WIDE AREA NETWORK (WAN) BANDWIDTH COSTS

The data replication from the main data centre to the warm backup site will require an adequate level of internet bandwidth. Though the costs for bandwidth are reasonable, achieving adequate performance may require a higher than expected investment and ongoing cost. This will depend on the amount of data that needs to be replicated. Basic replication and speed tests can be completed by IT staff to gauge whether current internet connections will suffice.

SERVER HARDWARE AND SOFTWARE COSTS

The full server virtualization project would see the County reduce its physical server count from twelve to five. Doing so will require the purchase of three new servers and the required operating system licenses. These new servers would have more computing capacity than the ones they are replacing and would come at a higher cost. Any time a municipal budget line increases or capital purchases are made, for whatever reason, it will draw attention and concern.
If senior staff is fully committed to the project, they can help council to see the value, benefits and return on investment of the project, easing any budgetary concerns.
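The WAN bandwidth question raised above can be roughed out before any speed tests are run: each differential backup must finish replicating before the next one starts two hours later. The Python sketch below estimates the link speed that requires. The function names, the 20 GB differential size and the 70% link efficiency are hypothetical placeholders; real figures would come from the County's own replication tests.

```python
def required_mbps(data_gb: float, window_hours: float,
                  efficiency: float = 0.7) -> float:
    """Link speed (megabits per second) needed to move data_gb in the window.

    efficiency is an assumed fraction of the nominal link speed that is
    actually usable for replication after protocol overhead and other traffic.
    """
    if data_gb < 0 or window_hours <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data size, window or efficiency")
    bits = data_gb * 8 * 1000**3      # decimal gigabytes to bits
    seconds = window_hours * 3600
    return bits / seconds / 1_000_000 / efficiency


def fits_window(data_gb: float, window_hours: float, link_mbps: float,
                efficiency: float = 0.7) -> bool:
    """True if the link can replicate the data before the next backup."""
    return link_mbps >= required_mbps(data_gb, window_hours, efficiency)


# Hypothetical example: a 20 GB differential and a two-hour backup interval.
needed = required_mbps(data_gb=20, window_hours=2)
print(f"Link speed needed: {needed:.1f} Mbps")
print("50 Mbps link sufficient:", fits_window(20, 2, link_mbps=50))
print("10 Mbps link sufficient:", fits_window(20, 2, link_mbps=10))
```

Because a differential backup grows through the week as changes accumulate since the last full backup, the largest differential (the one just before the next weekly full) is the size to test against.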
CONCLUSION

In this digital age even small municipalities are relying more and more on technology to help them provide the services required by their citizens. Guarding against any threat to the security and recoverability of their data, whether natural or human-caused, must be considered a priority. Taking the stance of "it could never happen to us" is no longer acceptable or reasonable; just ask the town of Goderich, Ontario.

This report has shown that the current level of disaster preparedness at Perth County is not adequate to safeguard its vital corporate data in a disaster scenario. It has also explained how, by implementing an alternative disaster preparedness model, including the establishment of a warm backup site, the municipality could:

- Reduce its overall infrastructure and operating costs through consolidation and energy savings
- Create a data centre that could easily respond to change and scale for future growth
- Greatly improve its ability to respond to disaster situations, reducing both the time to recover from any disaster and the amount of data loss that would occur
- Establish a disaster recovery model that would meet the expectations and needs of staff, council and the citizens of Perth County

This report has shown that even small municipalities require a solid disaster recovery strategy. With vital senior staff support and council's approval, a system can be designed and implemented that will provide Perth County with the data protection and security it requires to recover quickly and easily from a disaster situation. When a municipality can recover quickly from a disaster, it can ensure its citizens are provided the highest level of service possible in any situation.
BIBLIOGRAPHY

2011 Goderich, Ontario tornado, http://en.wikipedia.org/wiki/2011_goderich,_ontario_tornado (accessed July 20, 2012)

The Benefits of Virtualization for Small and Medium Business, http://www.vmware.com/files/pdf/vmware-smb-survey.pdf (accessed July 20, 2012)

Symantec 2011 SMB Disaster Preparedness Survey - Global: January 2011, http://www.symantec.com/content/en/us/about/media/pdfs/symc_2011_smb_dp_survey_report_global.pdf (accessed July 21, 2012)

Business Continuity and Disaster Recovery (BCDR), http://searchstorage.techtarget.com/definition/business-continuity-and-disaster-recovery-bcdr (accessed August 2012)

Disaster Recovery 101: What you need to know, http://www.computerworld.com/s/article/print/9221831/disaster_recovery_101_what_you_need_to_know (accessed September 2, 2012)

Consolidated Disaster Recovery Using Virtualization, http://whitepapers.theregister.co.uk/paper/view/508/conslidated-dr-white-paper.pdf (accessed August 12, 2012)

Backup Site, http://en.wikipedia.org/wiki/backup_site (accessed September 2, 2012)

Virtualization Overview, VMware, http://www.vmware.com/pdf/virtualization.pdf (accessed August 24, 2012)

State of SMB IT 1H 2012, http://www.spiceworks.com/marketing/insights (accessed September 10, 2012)

What is virtualization?, http://www.vmware.com/virtualization/what-is-virtualization.html (accessed August 24, 2012)

Server Consolidation, http://www.vmware.com/solutions/consolidation/consolidate.html (accessed September 12, 2012)

Virtualization Industry Quarterly Survey, http://www.v-index.com/consolidation-ratio.html (accessed July 22, 2012)

Server Virtualization: A Step Toward Cost Efficiency and Business Agility, http://www.avanade.com/documents/research%20and%20insights/server%20virtualization%20paper%20FINAL%2001-14-09.pdf (accessed September 12, 2012)