University of Illinois, ECE 542 / CS 536, Spring 2015 Hari Ramasamy, Ph.D. Manager and Research Staff Member, IBM Research Member, IBM Academy of Technology hvramasa@us.ibm.com http://researcher.watson.ibm.com/researcher/view.php?person=us-hvramasa Experiences with Building Disaster Recovery for Enterprise-Class Clouds Acknowledgments: Long Wang, Richard Harper, Mahesh Viswanathan (IBM)
Outline What is Cloud? What is Disaster Recovery? Core concepts behind Enterprise-Class Cloud DR Challenges in Enterprise-Class Cloud DR DR Life Cycle and Use Cases Reference Architecture DR Solutions for an enterprise cloud platform (IBM's Cloud Managed Services) Lessons Learned Summary
What is Cloud Computing? Essential characteristics [NIST, 2009]: On-demand self-service; Broad network access of cloud services; Resource pooling and sharing across apps/tenants; Rapid/automated provisioning and (later) release of services; Resource utilization tracking and Pay-as-you-go. Building blocks of Cloud Computing: Standardization, Virtualization, Automation.
Cloud Terminology. Cloud: the actual resources (HW, SW, building, etc.) that enable cloud services. Cloud Service: what users can buy or request on a Cloud. Cloud Computing: the model of getting and using cloud resources and services.
Types of Clouds. Based on service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS). Based on ownership or deployment models: Public Clouds, Private Clouds, Hybrid Clouds. Based on who manages: Managed Clouds (cloud provider manages IT services such as monitoring, patching, security, load balancing, and even certain applications on behalf of cloud clients); Unmanaged Clouds (service management is the client's responsibility).
Cloud Computing As A Service Business Processes Collaboration Industry Applications Software as a Service CRM/ERP/HR Middleware High Volume Transactions Database Web 2.0 Application Runtime Development Tooling Platform as a Service Operating System + Standard Prog. Languages Data Center Servers Networking Storage Fabric Shared virtualized, dynamic provisioning Infrastructure as a Service 6
Infrastructure as a Service (IaaS) Provides a barebones virtual machine with an operating system 7
Platform as a Service (PaaS) Provides an application development platform 8
Software as a Service (SaaS) Provides an application Google search, email, and other applications 9
A Quick Comparison of Cloud Types: SaaS (Application), PaaS (Platform), IaaS (HW + OS). Moving toward SaaS: Quicker to Value (Less Work). Moving toward IaaS: Fewer Constraints (Increasing Flexibility).
What is Disaster Recovery? According to Wikipedia, Disaster Recovery (DR) is "the policies and procedures... for recovery or continuation... of vital technology infrastructure and systems... following a natural or human-induced disaster." Disaster Types: Floods, Hurricanes, Volcanoes, Earthquakes, Fires, Terrorist Attacks, Hacker Attacks, Alien monsters. IT Infrastructure and Systems: Servers, Storage, Network, Software, Configuration. Policies and Procedures for Recovery: Geographic Dispersion, Recovery Orchestration, Recovery Automation, Detailed Plans, Data copies, DR Drills, Periodic Testing, Detection.
Disaster Recovery vs. High Availability
Disaster Recovery (DR): processes and procedures that enable the continuation or recovery of technology infrastructure or systems after a natural or human-induced disaster causes an interruption.
High Availability (HA): ability of a system to continue being accessible despite failures of system component(s).
Both increase overall availability, but there are differences:
 | Recovery Target | Failure Type | Triggering Event
Disaster Recovery (DR) | Entire technology infrastructure | Site-wide disasters | Executive decision
High Availability (HA) | Individual components or functions | Failures of individual computing components | Failure detection or administrator action
Disaster Recovery for Enterprise-class Clients. Enterprise-class clients: examples include banks, financial institutions, hospitals, governments, and utility companies. Many are regulation-bound to have DR coverage. DR requirements are very stringent: aggressive Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Most large companies spend between 2% and 4% of their IT budget on DR planning. Business impact of loss of IT infrastructure and data can be huge: cost of downtime could dissolve the business; ubiquitous nature of IT in business; irreparable brand damage; loss of customer data and reputation. Market opportunity for Business Continuity/Disaster Recovery: around $32 billion in 2015 [Source: IBM].
Recovery Point Objective and Recovery Time Objective 14
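A minimal sketch of how the two objectives are commonly evaluated after an incident. The function names and the timeline are illustrative assumptions, not taken from the talk. The achieved recovery point is taken as the gap between the last replicated copy and the disaster (the data-loss window), and the achieved recovery time as the gap between the disaster and service restoration; some contracts measure recovery time from the disaster declaration instead.

```python
from datetime import datetime, timedelta

def recovery_metrics(last_replica, disaster, restored):
    """Return (achieved recovery point, achieved recovery time) as timedeltas."""
    data_loss_window = disaster - last_replica  # updates made after the last copy are lost
    downtime = restored - disaster              # how long the service was unavailable
    return data_loss_window, downtime

def meets_objectives(last_replica, disaster, restored, rpo, rto):
    """True when both the data-loss window and the downtime are within the objectives."""
    achieved_rp, achieved_rt = recovery_metrics(last_replica, disaster, restored)
    return achieved_rp <= rpo and achieved_rt <= rto

# Hypothetical timeline: last replica 10 minutes before the disaster,
# service restored 3.5 hours after it.
disaster = datetime(2015, 3, 1, 12, 0)
last_replica = disaster - timedelta(minutes=10)
restored = disaster + timedelta(hours=3, minutes=30)

within_sla = meets_objectives(
    last_replica, disaster, restored,
    rpo=timedelta(minutes=15), rto=timedelta(hours=4),
)
```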
Disaster Recovery for Enterprise-class Clients on the Cloud. Ability to recover the cloud infrastructure and the workloads hosted on it. Potential Benefits to Customers (Cloud Users): Self-service model (on-demand DR protection activation; on-demand, non-intrusive DR tests). Resiliency made cheaper (pay only for workloads that need to be DR-protected; no upfront capital expenses). Improved agility in responding to outages. Challenges to Cloud Providers: more aggressive SLAs; scale and diversity; inter-dependencies and coordination of server DR and app DR; DR of management capabilities; regulatory requirements (e.g., location).
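One way to picture the inter-dependency challenge is as an ordering problem: a component can only be recovered after the components it depends on. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The component names and the dependency map are hypothetical, chosen only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workload: each key lists the components it depends on.
depends_on = {
    "db-server": set(),
    "db-application": {"db-server"},
    "app-server": {"db-application"},
    "web-server": {"app-server"},
}

# static_order() yields every component after all of its dependencies,
# and raises graphlib.CycleError if the dependencies are circular.
recovery_order = list(TopologicalSorter(depends_on).static_order())
```

With a linear chain such as this one, the order is fully determined; with independent branches, several valid orders exist and those branches could be recovered in parallel.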
DR Life Cycle and Basic DR Use Cases: DR Deployment, DR Steady State, DR Test, DR declaration, Failover, Failback
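The life cycle can be sketched as a small state machine. The stage names come from the slide; the allowed transitions below are an assumption about how the stages connect (the slide's arrows are not recoverable from the text), so treat this as an illustration rather than the deck's definition.

```python
# Assumed transitions between the life-cycle stages named on the slide.
TRANSITIONS = {
    "deployment": {"steady_state"},
    "steady_state": {"test", "declaration"},
    "test": {"steady_state"},          # a DR test returns to steady state
    "declaration": {"failover"},       # a declared disaster triggers failover
    "failover": {"failback"},
    "failback": {"steady_state"},
}

def advance(state, next_state):
    """Move to next_state if the (assumed) life cycle allows it."""
    if next_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {next_state}")
    return next_state

# Walk one full cycle: deploy, test once, then a real disaster and return.
state = "deployment"
for step in ["steady_state", "test", "steady_state",
             "declaration", "failover", "failback", "steady_state"]:
    state = advance(state, step)
```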
Reference Architecture for Cloud DR. At DR site: VMs and applications/appliances may or may not exist before failover; Management Systems may or may not exist before failover, or may be limited.
Reference Architecture for Cloud DR: Replication
Replication Method | Recovery Time Objectives | Recovery Point Objectives | Cost
Synchronous Replication | Seconds-minutes | Seconds-minutes | $$$
Asynchronous Replication | Minutes-few days | Minutes-few days | $$
Backup-Restore | Days-weeks | Days-weeks | $
Replication Levels. Storage-level replication: any updates to the VM's state at the primary site's storage are mirrored to the DR site's storage. Host-level replication: requires installation of an agent in each host; different agents for different OSes. App-level replication: may be required for certain apps even if other options are technically possible.
Replication Modes: Active-active (live/live, hot DR, warm DR); Active-passive (cold DR, warm DR)
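A toy model of why synchronous replication gives a tighter recovery point than asynchronous replication. The class and method names are hypothetical; real replication happens in storage arrays, host agents, or applications, not in a Python list. In synchronous mode a write is acknowledged only after the DR copy has it; in asynchronous mode the write is acknowledged at once and shipped later, so anything still queued is lost if the primary site fails.

```python
from collections import deque

class ReplicatedVolume:
    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []        # blocks stored at the primary site
        self.replica = []        # blocks stored at the DR site
        self.pending = deque()   # blocks acknowledged but not yet shipped

    def write(self, block):
        self.primary.append(block)
        if self.synchronous:
            self.replica.append(block)   # acknowledge only after the DR copy is updated
        else:
            self.pending.append(block)   # acknowledge immediately, ship later

    def ship(self):
        """Drain the asynchronous queue to the DR site."""
        while self.pending:
            self.replica.append(self.pending.popleft())

    def lost_on_disaster(self):
        """Number of acknowledged writes missing from the DR site right now."""
        return len(self.primary) - len(self.replica)

sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in ["b1", "b2", "b3"]:
    sync_vol.write(block)
    async_vol.write(block)
```

The price of the synchronous path is that every write waits on the inter-site link, which is one reason it sits at the expensive end of the table above.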
Reference Architecture for Cloud DR: Networking. Physical WAN network link between sites must have adequate bandwidth. Network design should: support multiple replication streams; support secure segregation of data streams (e.g., VPNs, VLANs); provide secure access channels for cloud admins and clients; support adequate segregation between accounts within the same client. Client network environment may need to be pre-staged at the DR site. Network management capabilities may need to be pre-staged at the DR site: switching/routing configurations; load balancing configurations.
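A back-of-the-envelope sketch of what "adequate bandwidth" can mean: the link must at least keep up with the rate at which protected data changes, summed over all replication streams. The daily change volumes, the client names, and the headroom factor are hypothetical; real sizing also has to account for peak write rates, compression, and protocol overhead.

```python
def required_mbps(changed_gb_per_day: float, headroom: float = 1.0) -> float:
    """Average link rate (Mbit/s) needed to ship one day's changed data within a day."""
    megabits = changed_gb_per_day * 8 * 1000   # 1 GB = 8,000 Mbit (decimal units)
    return headroom * megabits / 86_400        # 86,400 seconds per day

# Hypothetical replication streams sharing one inter-site link.
streams_gb_per_day = {"client-a": 500.0, "client-b": 120.0}

# Assumed 2x headroom so the link can absorb bursts above the daily average.
total_mbps = sum(required_mbps(gb, headroom=2.0) for gb in streams_gb_per_day.values())
```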
Reference Architecture for Cloud DR: Management. Management Services for Enterprise Workloads include virus scanning, patching, directory services, monitoring, backup/restore, load balancing, network security, and compliance. At DR site: VMs and applications/appliances may or may not exist before failover; Management Systems may or may not exist before failover, or may be limited.
Reference Architecture for Cloud DR: Control. Orchestration and Automation: overall coordination of steps in the DR lifecycle, particularly failover (workload recovery, environment recovery, management recovery); drive automatic steps in the DR lifecycle. Administration: self-service portal(s) that allow clients and admins to launch DR operations such as selecting which VMs should be replicated, initiating a DR test or failover, viewing replication/recovery status, defining user roles, and specifying access permissions; self-service portal(s) that allow admins to launch DR operations such as enrolling clients into DR protection and pre-staging the client network environment (e.g., VLANs).
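A minimal sketch of what the orchestration role can look like in code: run the failover steps in a fixed order, record a status for each (the kind of information a portal could display), and stop at the first failure so later steps do not run against a half-recovered site. The `run_failover` function, the step order, and the no-op step bodies are assumptions for illustration, not the orchestrator described in the talk.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]

def run_failover(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order; return (step name, status) pairs, halting on first failure."""
    status = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            status.append((name, f"failed: {exc}"))
            break                      # do not continue past a failed step
        status.append((name, "done"))
    return status

# Hypothetical step order; each lambda stands in for real recovery automation.
failover_steps: List[Step] = [
    ("environment recovery", lambda: None),   # e.g., bring up pre-staged VLANs
    ("workload recovery", lambda: None),      # e.g., start DR VMs from replicas
    ("management recovery", lambda: None),    # e.g., re-attach monitoring, backup
]

report = run_failover(failover_steps)
```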
IBM: Business Continuity and Resiliency Services Broad experience Broad solution capabilities Industry-specific, globally available expertise Credibility More than 50 years of business continuity and disaster recovery experience More than 7,800 Business Continuity & Resiliency Services contracts with 5,000+ clients Unique insights based on the work of 30,000 industry specialists worldwide Global resiliency centers designed for multivendor environments, with over 200 hardware and software vendors supported, including HP, Oracle, Cisco and our own IBM products Business process and technology expertise to help you design and implement the right solution for your business 150 resiliency centers across 50 countries Five million square feet of floor space for disaster recovery, with 41,000 work area recovery seats Knowledge of local, regional and global regulations Over 1,800 professionals dedicated to business continuity Track record of 100 percent success in meeting commitments to clients who have declared a disaster External validation by analysts that have reported favorably on IBM's breadth of offerings and geographic coverage
DR Solutions for an Enterprise-class IaaS-PaaS Managed Cloud (IBM Cloud Managed Services)
 | Cloud-to-Cloud | Cloud-to-Dedicated DR Site | Cloud-to-Repurposed Customer Site
Description | Failover to similar cloud site | Failover to dedicated DR site | Failover to custom site
RPO/RTO Specifications | 15 min/4 hours | 15 min/4 hours | 15 min/4 hours
Regional Availability | Another cloud site in same region | No cloud site but a purpose-built DR site in same region | No cloud site but a customer-owned site in same region
VM Provisioning at DR Site | Cloud's VM provisioning | Dedicated DR site's VM recovery mechanism | VMware vCenter
Replication Type | Storage-based, rsync, application-based | Host-based and application-based | Storage-based
Replication Mode | Active-active | Active-passive and Active-active | Active-passive
Post-failover Management | Full management | Limited management | Limited management
Networking | VLANs over dedicated fiber link | VPN, MPLS, or Point-to-Point with Layer 2 or Layer 3 routing | VLANs over dedicated link
Control | Custom DR Orchestrator | Dedicated DR site's DR automation | Custom DR Orchestrator
Cloud to Cloud Disaster Recovery (IBM CMS) CMS Cloud A CMS Cloud B Primary VMs File-level or App-level or Host-level replication Secondary VMs Pre-provisioned DR VMs (maybe suspended) Automated DR failover DR Control DR Metadata DR Control Storage-level replication Storage Storage 24
Cloud to Cloud Disaster Recovery (IBM CMS) Primary / Secondary VMs CMS Cloud A File-level or App-level or Host-level replication CMS Cloud B Secondary / Primary VMs Pre-provisioned DR VMs (maybe suspended) DR Control Storage DR Metadata Storage-level replication DR Control Storage Automated DR failover 25
IBM CMS Cloud-to-Cloud Disaster Recovery Overview. Failover site can be leveraged for other workloads (e.g., dev/test). 4-hour recovery time objective (RTO), 15-minute recovery point objective (RPO). Full CMS Management capabilities at recovery site. IBM makes disaster declaration. Single annual DR test included; option to purchase additional tests. Individual workload(s) can be tested. DR services can be ordered anytime after initial onboarding. Primary focus on Infrastructure DR (as of 1Q 2015). IBM-managed SAP and Oracle Services have DR options that were defined leveraging base CMS capabilities. Enhancements to include middleware and database services within DR scope are planned for a future release.
[Figure: CMS Site-to-Site Disaster Recovery, with Fail Over and Fail Back between CMS data centers. One side lists Raleigh, Lisbon, Ehningen, Portsmouth, Makuhari, Winterthur, Toronto, Boulder; the other lists Barcelona, Montpellier, Lisbon, Sydney, Ehningen, Boulder; more to come.]
IBM CMS Cloud to Cloud DR: Steady-State Operations 27
IBM CMS Cloud to Cloud DR: Failover 28
Cloud to Dedicated DR Site (IBM CMS) Primary VMs CMS Cloud A File-level or App-level or Host-level replication Dedicated DR Site in same region Secondary VMs Pre-provisioned DR servers for Managed Applications. DR Control DR Metadata DR Control Other VMs provisioned during failover. Storage Storage 29
Outline of a Sample Failover Procedure 30
Lessons Learned in Enterprise-Cloud DR: DR should cover workloads and management. Standardization vs. customizability. Data management is central to DR design. Regulations and regional requirements may trump technology in DR. Find an acceptable balance between cost and risk mitigation. Automation is a must to achieve low RTOs. DR testing should be flexible and non-disruptive.
Summary and Takeaways: Cloud DR is the ability to recover the cloud infrastructure and workloads hosted on it. Cloud-based DR-as-a-Service has many benefits for enterprise-class cloud users. Cloud-based DR-as-a-Service raises many technical challenges for cloud providers. Many trade-offs to be considered: standardization vs. customizability; cost vs. risk mitigation; regulations vs. technical aspects. Automation is key. The share of enterprises considering cloud-based DR is expected to grow from 17% in 2014 to 50% in 2018 [Evolve IP Survey, 2015]. Stay tuned, exciting stuff is happening in cloud DR.