Utilising the Cloud for Disaster Recovery



Craig Scott, Head of ICT Services, South Tyneside College
Supported by AoC

Introduction

Disaster recovery is something that IT managers spend a considerable amount of time planning and preparing for, in the hope that they will never have to implement those plans. Over the years users have come to expect IT to be always on and available 24/7 so that they can study or carry out the duties associated with their job role. These availability and reliability expectations also affect disaster recovery provision: it is no longer sufficient to rely on restoration from backup; redundant hardware and facilities are required. This paper discusses the factors that must be considered when planning for disaster recovery and identifies how cloud services can be used as a disaster recovery solution.

Determining Project Scope

Disaster Recovery: what is it?

The most important starting point for the project is to define what you mean by disaster recovery. To you and your team, is a disaster the failure of a single server? A fire in your data centre? A power outage affecting your entire site? Or all of the above? Until you know what you are trying to protect yourself from it is difficult to ensure that you have adequate processes and procedures in place. A risk-based approach can help you to identify potential disasters, the impact they will have on your services and the likelihood of their occurrence.

Disaster Recovery vs. High Availability

High Availability (HA) is typically used to describe systems which are connected by high-speed, low-latency links and often share components. Many vendors provide failover clustering technologies that deliver high availability, such as Microsoft Windows Failover Clustering and Oracle Real Application Clusters. HA solutions are designed to minimise the downtime of business-critical services and can protect against hardware failure of specific components. HA clusters generally offer automated failover with minimal data loss. Typically the constituent parts of a failover cluster are located in the same data centre, or at least on the same LAN (i.e. multiple data centres within the same building or campus).

As a general rule, Disaster Recovery (DR) refers to the provision of offsite facilities that are geographically separate from the primary facilities. A consequence of the geographic separation is the introduction of higher-latency links. The high latency, and potential unreliability, of these links makes them unsuitable for use by many clustering technologies.

The lines between HA and DR are blurred by some newer technologies which can provide the levels of failover and reliability typically associated with HA over WAN links, Microsoft Exchange Database Availability Groups being a typical example.

Defence in Depth

HA and DR are not mutually exclusive options and can be combined to further reduce the risk of service outage.

Objectives

The success of any project depends upon clearly defined and understood objectives, without which it is impossible to measure the success or effectiveness of the project. The exact objectives will vary from project to project but at a minimum you should consider:

Physical Separation

Based on your risk assessment of the potential disasters, what is the minimum level of physical separation you require between your live and DR systems? Options to consider include:

- Different building
- Different campus
- Different town/city
- Different area of the country
- Different country
- Different continent

Acceptable Downtime

The initial reaction from many IT managers and business managers is that no downtime is acceptable. However, if the building containing your primary data centre and finance department burns to the ground, it will take time for the finance team to be relocated to different premises, time to find computers for them to use and so on; how quickly do you really need to restore access to your finance system?

Acceptable Data Loss Window

Whilst zero data loss is certainly desirable, as the level of synchronicity between live and DR systems increases so do the costs, either in terms of the technology required or the bandwidth used to maintain synchronicity. Databases which handle real-time transactions, such as on-line or face-to-face enrolments, normally require a small data loss window; ideally the window should be no more than a handful of transactions. If you lose a day of transactions can you recreate that data? Does the person who enrolled via your website know you have lost their data? Do you even know who they are? For other systems a larger window may be acceptable: what would be the impact of losing the last 3-4 hours of data from your file servers? Is this any different from someone forgetting to press save and losing a file?

Capacity/Performance

What level of capacity and performance is acceptable for your DR services? Thought needs to be given to whether your DR services need to give your users the same level of performance as your live systems. Your DR system may also introduce new bottlenecks, such as the available WAN/internet bandwidth between the DR facilities and your users. The amount of expansion capacity and historical capacity also needs to be considered.

Acceptable Restoration Time

If you have had to activate your DR services, at some point you will want to switch back to your live services. How will you do this? Will the failback result in any downtime? The answers to many of these questions will vary from system to system.

The Cloud Options

Maintaining DR facilities can be expensive, both in terms of investment in hardware (hardware which you hope you will never need to use) and the time required to maintain and administer it. Using the cloud to host your DR facilities can eliminate or reduce a number of these costs. Most major cloud providers have globally dispersed, redundant data centres which will generally be hundreds of miles away from your facilities.

Infrastructure as a Service (IaaS)

Selecting an IaaS option removes the need to invest in hardware and construct a secondary server room/data centre. An IaaS DR solution involves renting sufficient computing resources from a cloud provider to create a virtual data centre in the cloud. You are then responsible for creating and maintaining the virtual machines which provide your DR facilities.

Platform as a Service (PaaS)

With PaaS the cloud provider is responsible for the hardware, operating systems and services. This removes the need for you to maintain and patch virtual machines. An example of PaaS is the Microsoft Azure SQL Database service: Microsoft is responsible for the hardware, operating systems and SQL Server installation; you only need be concerned with your database. In some cases you may be forced down an IaaS route by the need to install third-party software on a server; in other cases PaaS may be appropriate. For example, you may need to use IaaS for your finance system DR because you need to install a third-party finance server product, but you can use PaaS to provide DR for your website.

Alternatives to Disaster Recovery - Software as a Service (SaaS)

When looking at the services for which you need to provide DR facilities it is worth asking whether there is a better way to deliver those services. By moving services such as e-mail from traditional on-premises solutions to cloud hosting you remove the need to invest time and money in providing DR facilities for them; the availability and accessibility of those services becomes the cloud provider's concern.

Selecting a Cloud Provider

Platform

The cloud is a rapidly expanding area of the IT sector, both in terms of the services offered and the companies providing those services. Some providers have invested in the development of proprietary platforms, such as Amazon EC2 or Windows Azure, whilst others have built services on off-the-shelf products, such as VMware.

Compatibility

Compatibility between your cloud provider's platform and your on-premises virtualisation platform can affect the options available for your data replication strategy. If the two platforms are compatible, or can be managed by the same virtualisation management platform, such as Microsoft System Centre Virtual Machine Manager, you may be able to move, or replicate, data and virtual machines between your on-premises solution and your cloud solution.

Compliance

The requirements of the Data Protection Act (1998) are often cited as a barrier to the use of the cloud, in particular the need to obtain subject consent prior to transferring data outside of the EU. You should not assume that because a cloud provider is based in the UK, or Europe, your data will be stored within the EU. Most major cloud providers have data centres located within the EU and some allow you to select the region, or even the individual data centre, that will be used to store your data.

Security - Physical

Reputable cloud service providers should be able to provide information on the security accreditations with which their services and data centres comply. Many providers deliver services to customers in the financial, health care and defence sectors as well as to local and national governments, and as such will already comply with extremely stringent security requirements.

Connectivity

For your data to reach the data centres of your chosen cloud provider it will probably need to travel across the public internet, so it is important to ensure that the data is protected in transit. Most SaaS and PaaS solutions have been developed from the ground up as internet services and will make use of SSL and HTTPS to provide secure connectivity; for example, HTTPS to connect to a web-based SaaS e-mail solution or SFTP to transfer files to a PaaS-hosted website.

IaaS services typically require a Virtual Private Network (VPN) to connect the hosted virtual machines to your on-premises LAN. Site-to-site VPNs require a device at both sites to terminate the connection, so it is important to confirm that you have a suitable endpoint device capable of handling your end of the connection and that the device will work with your cloud provider's VPN implementation.

Pricing Model & Contract Offerings

Is it necessary for all of your DR assets to be operational 24x7, or do you simply need them ready and waiting to be fired up? Most cloud providers' pricing is based on the size, allocated storage and hours of usage of a virtual machine. Applications built around an n-tier model will have application servers that host websites or application software; you may only need to fire up the virtual machines hosting these application server roles for a few hours a month for testing and patching. Does your cloud provider's pricing structure reflect this usage model?

Understanding Risk

An analysis of the roles and workloads of your systems will help you to identify the level of risk that the loss of a system poses, and therefore the level of DR protection and effort that it warrants. Systems are often comprised of multiple servers, each fulfilling a distinct role. The impact of loss, and the ease of restoration, will vary depending upon the role of the server.

Suggested roles are listed below:

Data Storage
- Description: servers holding non-transactional data
- Data change frequency: high
- Ease of recreating data: moderate
- Acceptable data loss: moderate (< 4 hours)
- Examples: file servers, mailbox servers etc. where users can recreate documents

Database
- Description: database servers
- Data change frequency: high
- Ease of recreating data: low
- Acceptable data loss: low (< 30 minutes)
- Examples: SQL Server, Oracle, MySQL etc., especially on-line systems where it may not be possible to recreate data (i.e. e-registers, on-line enrolment)

Application
- Description: servers which do not store volatile data
- Data change frequency: low
- Ease of recreating data: high
- Acceptable data loss: high
- Examples: web servers, middle-tier servers etc. where static content is updated infrequently (i.e. software upgrades, website redesign)

Data Replication Strategy

Obviously it is necessary for the data in each of your DR systems to be updated regularly and to be no older than the acceptable data loss window you have identified for that system. It is important to select a replication method that is appropriate for the level of risk and the acceptable data loss window.

Approaches

Application Replication

Many enterprise-class applications incorporate their own replication technologies, for example Microsoft Exchange Database Availability Groups, Oracle Data Guard and MySQL master/slave replication. Where application replication technologies are available they should be considered the preferred option, as they are designed to replicate data in a manner that makes sense to the application.

File System Level

In some cases simply copying files from the live systems to the DR systems will suffice to replicate the data. Tools such as robocopy and rsync are able to intelligently determine what differences exist between source and destination locations, copy only new or changed files to the DR location and remove redundant files from the DR site. Services such as the Distributed File System (DFS) built into Windows Server can be used to automate and manage file replication. It is important to check that a file system copy is appropriate for the type of data being replicated; using file system replication to copy the data files of your SQL Server whilst it is running could result in data corruption.
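As a minimal sketch of the file-system approach, the robocopy run below mirrors a live file share to a DR file server. The server, share and log paths are hypothetical; /MIR, /R, /W and /LOG+ are standard robocopy switches, and in practice a run like this would be scheduled outside working hours.

    # Mirror the live share to the DR file server. /MIR keeps the destination in
    # sync and removes files deleted from the source; /R:2 /W:5 limit retries on
    # locked files; /LOG+ appends a report (the log folder is assumed to exist).
    robocopy \\FS-LIVE\StaffShared \\FS-DR\StaffShared /MIR /R:2 /W:5 "/LOG+:C:\DRLogs\staffshared.log"

As noted above, a copy of this kind is only suitable for data that is closed and consistent on disk; databases should rely on their own replication mechanisms instead.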

Virtual Machine Replication

Replication or cloning of entire virtual machines is also a strategy that should be considered. It is especially useful where all the components of a single system are located on one virtual machine. This approach should also be considered for application/middle-tier servers where significant time and effort has been expended customising or configuring the middle-tier components.

Best Approach

Complex systems often consist of multiple servers, each of which has a distinct role within that system. Consider a student records system: this will probably consist of a database server, two identical application servers and a client application. The database will be experiencing constant changes and you need to ensure that in the event of a disaster you don't lose any records; on the other hand, the software on the application servers is updated via a controlled process every six months when the software vendor releases an update. In this scenario it would be appropriate to use the database system's inbuilt replication technology to protect the database and virtual machine replication to replicate one of the application servers; you might only replicate the virtual machine once a month as it has a low degree of data volatility.

The approaches compare as follows:

Application replication
- Pros: application aware; transaction rollback; corruption detection; automatic failover
- Cons: can be complicated to set up; requires two installations of the application software; may require additional licences; may introduce additional overhead on live systems
- Data granularity: variable but appropriate for the application (i.e. database transaction, Active Directory object, e-mail message etc.)
- Recommended for: databases, mailboxes, LDAP (including Active Directory)

File system replication
- Pros: simple to set up
- Cons: excludes open files; requires scripts and/or additional software
- Data granularity: file level
- Recommended for: file shares

VM replication
- Pros: replicates the entire server
- Cons: can be complicated to set up; lots of data to transfer; servers may require reconfiguration once activated
- Data granularity: virtual machine (though some solutions allow block level)
- Recommended for: application servers, single-server systems

Software Licences

Typically when you create a virtual machine in the cloud the machine will be based on a template which has a cost associated with it, usually charged hourly, weekly or monthly. In most cases these prices include the cost of the licence for the operating system used by the template. The same usually applies to PaaS, in that the charge for the period will include the licence costs for all the components of that service. For example, you don't need to purchase licences for Microsoft SQL Server to use Microsoft's Azure SQL Database platform.

You do need to ensure that you are adequately licenced for any software you install on the virtual machines you create in the cloud. Consider a scenario where you create a virtual machine to host Microsoft Exchange Server because you want to use Exchange Database Availability Groups to provide application-level replication for your e-mail system. In this scenario you probably wouldn't need to purchase a licence for Windows Server (as this will be included in the cost you are paying for the virtual machine), but you will need to buy a licence for the copy of Microsoft Exchange you have installed on that server. Some software vendors make provision in their education and volume licensing schemes for installing additional copies of their software for disaster recovery purposes. Obviously you don't want to spend money on licences you don't need; try checking the software vendor's website for licensing FAQs, contacting the retailer from whom you purchased the software, or contacting the vendor directly if you are unsure about what you are or aren't allowed to do with your existing licences.

Considering Failover

If you have to activate your DR facilities, how will your users and client devices know where to find the systems they need to connect to? Most modern networks make use of DNS to locate servers and services; in some cases you may be using IP addresses to locate services. It is probable that your DR facilities will be on a different IP subnet from your live systems, and your clients need to be informed of this to allow them to connect to your DR facilities.

Active Directory & DNS

Assuming that you are utilising Microsoft Active Directory (AD), the servers on your DR site will need access to AD and its associated DNS in order to operate. It is therefore recommended that you maintain at least one operational Domain Controller in your DR facilities. This will also provide inherent DR for your AD and DNS infrastructures without any further work on your part.

IP Address Allocation

If you have chosen to replicate virtual machines to your DR site, do these virtual machines have static IP addresses assigned? If so, you will need to log in to each VM as you bring it online and assign a new IP address. Consider whether you can use DHCP to assign IP addresses to your servers.

Application Aware Failover

If an application has some form of application-level replication it may also have application-level failover. Microsoft Exchange Database Availability Groups (DAGs) are one example: with DAGs the Exchange client access servers automatically connect to the mailbox server which is hosting the active database.

Distributed File System (DFS)

Switching to an alternate file server normally involves finding all references to the UNC path of the failed file server and replacing them with references to the new file server. DFS allows the creation of a fault-tolerant file share containing folders that refer to one or more real file shares. By configuring an active and an inactive referral for each file share, one referencing your live system and the other your DR system, all you need do to fail over is change the referrals appropriately.
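A minimal sketch of that referral switch, assuming the DFS Namespace PowerShell module available from Windows Server 2012 (older servers can achieve the same with dfsutil). The namespace and target paths are hypothetical:

    # Take the referral pointing at the live file server offline...
    Set-DfsnFolderTarget -Path "\\college.ac.uk\Shares\Staff" -TargetPath "\\FS-LIVE\StaffShared" -State Offline
    # ...and bring the referral pointing at the DR copy online.
    Set-DfsnFolderTarget -Path "\\college.ac.uk\Shares\Staff" -TargetPath "\\FS-DR\StaffShared" -State Online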

DNS for Failover

It is assumed that you have created a Domain Controller in your DR site that is also a DNS server, thus providing resilience for your DNS. Most of your clients will be using DNS to locate the servers and services to which they connect, so in many cases switching to your DR facilities may involve no more than changing DNS entries so that they point at the DR system. Consideration needs to be given to the TTL values of the DNS entries, as these determine how long your clients will cache the returned DNS data; if your records have a TTL of an hour it could take that long before some of your clients can access your DR services. You should ensure that the TTL values for critical DNS records are consistent with your failover objectives.

When planning for DR it is recommended that you review the way your clients currently locate their servers; where possible, avoid the use of IP addresses or server names and use DNS alias (CNAME) records instead. For example, instead of using http://servername.college.ac.uk/ebs, create a DNS CNAME for ebs-live.college.ac.uk which refers to servername.college.ac.uk; that way, if you have to switch to your DR system, all you need do is update the CNAME record.

Replicated Virtual Machines

In most cases failover of replicated VMs will be as simple as powering on the VM, checking it has an appropriate IP address and ensuring that DNS reflects the current IP address. Where the VM is part of a multi-tier application and you have also failed over database-tier components, you may need to update the application with the new address of the database server. This process can be simplified through the use of DNS aliases and application-specific redirects: for example, you might create a DNS alias studentrecords-live.college.ac.uk which points at your live database server and use this address when installing/configuring application-tier components; in the event of failure all you need to do is change where the DNS alias points.

Network Load Balancers

Network load balancers (NLBs) provide an option for failover of some services; good quality load balancers will be able to detect server and application failure automatically and redirect traffic. However, you also need to consider DR for your NLB: if you position an NLB on your live site which is configured to redirect traffic to your DR site, what will you do if the NLB itself is out of action?

Planning

Once you've carried out your risk assessment you will have a better idea of the disasters that you may encounter and how probable each of them is. As you have hopefully realised, you are more likely to encounter situations where one, or a small number, of related systems have failed, probably as a result of a hardware failure or software problem. The level of detail in your DR plan should reflect how critical the system is and how quickly it needs to be recovered. You may have generic processes that apply across multiple systems; for example, if you have multiple database servers with identical DR processes a single process is probably sufficient. Whilst it is possible to create detailed scripts and automated procedures that can be used to activate DR facilities, every disaster tends to be different and needs to be assessed individually. The process to fix a disaster of type A may in fact make a disaster of type B worse.

The best approach is scenario based: start with the highest probability and highest impact risks and work down to those with the lowest probability and impact. An important consideration in your planning is who has the authority to declare a disaster and invoke the DR plan. In some cases invoking the DR plan may result in more overall disruption than leaving a particular service offline for an hour while you fix it.

Testing

It is essential to test your DR processes regularly. The scope of testing needs to be considered on a system-by-system basis; also consider whether you need to test every system. Again, if you have 20 servers with an identical process, do you need to test them all regularly? For systems with transparent application-level replication and failover, testing should be straightforward and can be done regularly. In cases where a failover would be disruptive, is simulating failover sufficient for the system in question?

Example Implementation

Background

Until the summer of 2011 South Tyneside College (STC) operated across two major campuses (Westoe & Hebburn) and a third specialist campus (MSTC). STC's primary data centre was located on the main campus (Westoe), with a smaller server room at Hebburn; the MSTC had only a single server. Systems had been established for some time to replicate data and services between Westoe and Hebburn, allowing either campus to act as a DR site for the other.

For reasons of operational efficiency a decision was taken to close Hebburn. Due to the high cost of creating the necessary facilities and upgrading the data links it was not feasible to establish DR facilities for Westoe at the MSTC, so a redundant server room in a separate building on the Westoe campus was refurbished for DR use.

Challenge

The primary data centre supports 46 physical servers and 69 virtual machines; a further 16 physical servers are located in the secondary server room to support DR. The hardware in the secondary server room had previously been the live hardware from Hebburn and was planned for replacement in summer 2013, with estimated replacement costs in the region of £50,000-£60,000. Examination of the available options indicated that using the cloud for our DR facilities would result in savings of around 10-15% and provide a truly offsite solution. The work involved would also allow us to gradually migrate a number of live services from on-premises to the cloud in future, producing further cost savings.

Planning Numbers

Given the levels of resilience and HA provided by the equipment in the primary data centre, we only expect to need to activate the DR facilities in the event of a disaster which renders our main campus unusable (fire, flood, prolonged power outage etc.). Under these circumstances we anticipate that the major performance bottleneck will be the available bandwidth of the internet connection(s) used to connect to the virtual data centre. Based on this supposition the following criteria were applied to determine whether a system or server was within the scope of the project:

- Where multiple load-balanced application servers for the same service existed we would provide only one DR server.
- Where we had split large workloads across non-load-balanced servers (i.e. file servers) we would consolidate these workloads on one DR server.
- Servers in the DMZ would be excluded where their services duplicated in-scope LAN servers.
- Servers used to support physical equipment which would likely be inaccessible during a disaster would be excluded from scope, on the grounds that if our buildings are out of action so is the equipment they contain; print servers, wi-fi controllers etc. would therefore not be required.

Analysis of the roles and workloads of our servers indicated that our disaster recovery strategy needed to support a minimum of 29 servers.

Workloads

Of the 29 systems within project scope we identified 9 database servers and 3 data store servers (file server, mailbox & Active Directory). The remaining servers fit into the application server category.

Replication & Failover Strategies

Based on the workloads of the systems in scope, a combination of application-level, file-system-level and virtual machine cloning was adopted. For a small number of cases it was recognised that the best option was to build a new application server in the cloud, due to the comprehensive application-level functionality provided by that system, for example Microsoft Exchange Client Access Servers.

Application Level Replication

Application-level replication was selected for Active Directory (AD has inherent replication), Microsoft SQL Server, MySQL Server and Microsoft Exchange. All of these applications have built-in multi-server replication mechanisms which allowed for recovery windows of less than 15 minutes. Failover procedures for these systems are either automatic/inherent (i.e. Active Directory & Exchange) or require a flag to be set within the application to indicate the primary server (SQL Server & MySQL). In the case of SQL Server and MySQL Server it is also necessary to update the configurations of the application servers/client applications to reference the DR servers as opposed to the live servers.
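Where application-tier and client configurations reference the database tier through a DNS alias, as suggested earlier (for example studentrecords-live.college.ac.uk), this repointing step can be reduced to updating that alias on the DR domain controller. A minimal sketch using dnscmd, which is present on Windows Server 2008 R2 DNS servers; the DNS server and DR database server names are hypothetical:

    # Delete the existing alias and recreate it pointing at the DR database server,
    # with a short TTL (300 seconds) so that clients pick up the change quickly.
    dnscmd dc-dr01.college.ac.uk /RecordDelete college.ac.uk studentrecords-live CNAME /f
    dnscmd dc-dr01.college.ac.uk /RecordAdd college.ac.uk studentrecords-live 300 CNAME sql-dr01.college.ac.uk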

File Level Replication

File-level replication was used to replicate data from the 4 on-premises file servers to the single cloud-based file server using the built-in robocopy command and its mirroring/synchronization option. The synchronization was scheduled to run overnight, as a one-working-day recovery window was deemed adequate for file services. As Microsoft DFS is used in all links and paths that reference the file shares on the file servers, failover involves disabling the referral to the on-premises servers and enabling the referral to the cloud servers.

Virtual Machine Replication - Database Servers

A small number of simple systems, some quite critical, have all their components installed on a single virtual machine. These applications either do not have a high workload or do not have scalable architectures. Systems falling within this category include the payroll system, the library management system, Active Directory Certificate Services and an Oracle Express server used for teaching purposes. For these systems virtual machine level replication was selected with a nightly replication interval. Failover requires the virtual machines to be brought online; they will automatically register their new IP addresses with DNS.

Virtual Machine Replication - Application Servers

The remaining systems all fulfilled application front-end/middle-tier roles, so virtual machine replication was selected as the replication strategy. As updates and changes to the live servers are carried out via a controlled change management process, a weekly virtual machine refresh was deemed sufficient. Failover requires the virtual machines to be brought online; they will automatically register their new IP addresses with DNS. In some cases it is also necessary to update the database server references to refer to the DR database servers.

Cloud Provider Selection

Once the workloads, replication and failover strategies had been decided upon, a review of the services offered by various cloud service providers was undertaken. As it was identified that 60% of the virtual machines required for the DR solution would only need to be powered up for testing and patching for a couple of hours each month, providers with an hourly pricing model were favoured. Compatibility with existing systems was also a factor in provider selection. The virtualisation infrastructure at STC is based on Microsoft Hyper-V (Windows 2008 R2) managed by Microsoft System Centre Virtual Machine Manager (MSCVMM) 2012, so solutions that offered management integration with MSCVMM and virtual machine migration from Hyper-V were favoured.

Consideration of the above factors, plus pricing, resulted in the selection of the Microsoft Windows Azure platform; Microsoft were able to offer favourable educational pricing. However, as we were the first UK institution to sign up to Azure via an education agreement, we discovered that Microsoft's sign-up procedures were not fully developed, which resulted in delays of many months. It should be noted that we have been assured by Microsoft that these procedures are now fully developed and have been used successfully by other institutions.

Implementation Process

Implementation of the solution was approached via the following sequence:

1. Establish VPN connectivity. STC uses a pair of Smoothwall UTM-3000 appliances to provide internet content filtering and firewall services. The Smoothwall UTM-3000 supports IPSec site-to-site VPNs, as does Windows Azure, and establishing a site-to-site VPN between the two systems was relatively straightforward.

2. Build & commission a Domain Controller in Azure. The first server created in Azure was a Domain Controller to provide Active Directory and DNS services to our other servers. This was accomplished by installing Windows Server 2008 R2 on a new virtual machine, promoting the server to a Domain Controller and installing the DNS server role.

3. Build database, mailbox and file servers. Servers were built to host these roles and the appropriate application software installed (i.e. Microsoft Exchange, Microsoft SQL Server etc.).

4. Establish replication. Application-level replication was established for:
   - Exchange: the DR server was added to the Exchange Database Availability Group and new replication targets were added for the existing mailbox databases within the DAG (a brief sketch of this step follows after this list).
   - SQL Server: database log shipping was selected as the most appropriate replication method, and new log shipping partnerships were created using the wizards built into SQL Server Management Studio.
   - File servers: initial replication of file data was accomplished via the robocopy command-line tool; subsequent replication runs used the /mir switch to synchronize the data on the replica servers.

5. Establish virtual machine replication. Virtual machine replication was initially achieved by copying backups of the VHD files of live virtual machines to Azure using the csupload command-line tool. However, work is ongoing to use System Centre App Controller and System Centre Orchestrator to accomplish these tasks in future.
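By way of illustration of the Exchange part of step 4, adding a DR database copy and checking its health might look broadly like the following in the Exchange Management Shell (database and server names are hypothetical; an activation preference of 2 marks the copy as a secondary):

    # Add a copy of an existing DAG database to the DR mailbox server hosted in Azure.
    Add-MailboxDatabaseCopy -Identity "MDB01" -MailboxServer "MBX-DR01" -ActivationPreference 2
    # Confirm the copy is healthy and keep an eye on the copy and replay queue lengths.
    Get-MailboxDatabaseCopyStatus -Identity "MDB01\MBX-DR01" | Format-List Status,CopyQueueLength,ReplayQueueLength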

Future Developments

Partly as a result of our experiences with this project, it is the intention of STC to make significantly more use of cloud computing services. In some cases we have identified that increased adoption of cloud services may in fact increase costs but will offer us significantly better functionality.

Office 365

A project is underway to migrate all staff & student e-mail content, 500GB of SharePoint content, and the contents of staff & student My Documents folders (approximately 1TB of files) to Office 365.

Hyper-V Replica

Windows Server 2012 introduced the ability to have active/passive replicas of individual virtual machines (a brief sketch of enabling an on-premises replica appears at the end of this section). An Azure implementation of this technology, which will allow Azure to participate as one side of the partnership, is in development. Once available, this solution will be used to accomplish VM replication to Azure.

Server Migration

Work carried out to date has proven that it is feasible and practical for us to host servers in Windows Azure. Over the next 3 years an increasing proportion of our server infrastructure will be moved from on-premises hardware to Azure. The migration to Office 365 is the first step of this process, as it eliminates the need for on-premises e-mail, file storage and SharePoint servers.
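For reference, a minimal sketch of enabling the Windows Server 2012 Hyper-V Replica feature between two on-premises hosts, using the Hyper-V PowerShell module; the host and VM names are hypothetical, and Kerberos over HTTP (port 80) is just one of the supported authentication options. The Azure-hosted equivalent described above was still in development at the time of writing.

    # On the replica (DR) host: allow it to receive replication over Kerberos/HTTP.
    Set-VMReplicationServer -ReplicationEnabled $true -AllowedAuthenticationType Kerberos `
        -ReplicationAllowedFromAnyServer $true -DefaultStorageLocation "D:\ReplicaVMs"
    # On the primary host: enable replication for a VM and start the initial copy.
    Enable-VMReplication -VMName "LIB-APP01" -ReplicaServerName "HV-DR01" `
        -ReplicaServerPort 80 -AuthenticationType Kerberos
    Start-VMInitialReplication -VMName "LIB-APP01"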

Appendix 1

Cloud provider costing template. For each provider, record the specification and price of the virtual machine sizes on offer, the storage, bandwidth and VPN pricing, and your own requirements, then calculate an estimated monthly and annual cost:

- Provider and OS
- Virtual machines (Small, Medium and Large): CPU, RAM, HDD, price per hour
- Storage: space (price per GB per month) and IOPS (price per million per month)
- Bandwidth: in and out (price per GB per month)
- VPN: price per hour
- Requirements: number of hours per Small/Medium/Large VM per month; storage (GB per month); IOPS per month; bandwidth in and out (GB per month); VPN hours per month
- Estimated cost per month and annual cost

Appendix 2

Disaster risk assessment template. For each disaster, record:

- Disaster
- Scope: College / Campus / Building / Service
- Impact assessment: downtime, likelihood, impact, score
- Controls
- Residual risk: downtime, likelihood, impact, score