Experiences with Building Disaster Recovery for Enterprise-Class Clouds

University of Illinois, ECE 542 / CS 536, Spring 2015
Hari Ramasamy, Ph.D.
Manager and Research Staff Member, IBM Research
Member, IBM Academy of Technology
hvramasa@us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-hvramasa
Acknowledgments: Long Wang, Richard Harper, Mahesh Viswanathan (IBM)

Outline
- What is Cloud?
- What is Disaster Recovery?
- Core concepts behind Enterprise-Class Cloud DR
- Challenges in Enterprise-Class Cloud DR
- DR Life Cycle and Use Cases
- Reference Architecture
- DR Solutions for an enterprise cloud platform (IBM's Cloud Managed Services)
- Lessons Learned
- Summary

What is Cloud Computing?
Essential characteristics [NIST, 2009]:
- On-demand self-service
- Broad network access to cloud services
- Resource pooling and sharing across apps/tenants
- Rapid, automated provisioning and (later) release of services
- Resource utilization tracking and pay-as-you-go
Building blocks of cloud computing:
- Standardization
- Virtualization
- Automation

Cloud Terminology
- Cloud: the actual resources (HW, SW, buildings, etc.) that enable cloud services
- Cloud Service: what users can buy or request on a cloud
- Cloud Computing: the model of getting and using cloud resources and services

Types of Clouds
Based on service models:
- Infrastructure as a Service (IaaS)
- Platform as a Service (PaaS)
- Software as a Service (SaaS)
Based on ownership or deployment models:
- Public Clouds
- Private Clouds
- Hybrid Clouds
Based on who manages:
- Managed Clouds: the cloud provider manages IT services such as monitoring, patching, security, load balancing, and even certain applications on behalf of cloud clients
- Unmanaged Clouds: service management is the client's responsibility

Cloud Computing As A Service (layered diagram, top to bottom)
- Software as a Service: business processes, collaboration, industry applications, CRM/ERP/HR
- Platform as a Service: middleware, high-volume transactions, database, Web 2.0 application runtime, development tooling, operating system + standard programming languages
- Infrastructure as a Service: servers, networking, storage, data center fabric; shared, virtualized, dynamic provisioning

Infrastructure as a Service (IaaS)
Provides a barebones virtual machine with an operating system

Platform as a Service (PaaS)
Provides an application development platform

Software as a Service (SaaS)
Provides an application (e.g., Google search, email, and other applications)

A Quick Comparison of Cloud Types
The three types are ordered SaaS (Application), PaaS (Platform), and IaaS (HW + OS).
- Moving toward SaaS is quicker to value (less work).
- Moving toward IaaS gives fewer constraints (increasing flexibility).

What is Disaster Recovery?
According to Wikipedia, Disaster Recovery (DR) is "the policies and procedures... for recovery or continuation... of vital technology infrastructure and systems... following a natural or human-induced disaster."
- Disaster types: floods, hurricanes, volcanoes, earthquakes, fires, terrorist attacks, hacker attacks, alien monsters
- IT infrastructure and systems: servers, storage, network, software, configuration
- Policies and procedures for recovery: geographic dispersion, recovery orchestration, recovery automation, detailed plans, data copies, DR drills, periodic testing, detection

Disaster Recovery vs. High Availability
- Disaster Recovery (DR): processes and procedures that enable the continuation or recovery of technology infrastructure or systems after a natural or human-induced disaster causes an interruption
- High Availability (HA): the ability of a system to remain accessible despite failures of system component(s)
Comparison:
- DR: the recovery target is the entire technology infrastructure. The failure type is site-wide disasters. The triggering event is an executive decision.
- HA: the recovery target is individual components or functions. The failure type is failures of individual computing components. The triggering event is failure detection or administrator action.
Both increase overall availability, but they differ in scope, in the failures they address, and in how recovery is triggered.

Disaster Recovery for Enterprise-class Clients
- Enterprise-class clients: e.g., banks, financial institutions, hospitals, governments, utility companies
- Many are bound by regulation to have DR coverage
- DR requirements are very stringent: aggressive Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
- Most large companies spend between 2% and 4% of their IT budget on DR planning
- Because IT is ubiquitous in business, losing IT infrastructure and data can have a huge business impact:
  - the cost of downtime could put a company out of business
  - irreparable brand damage
  - loss of customer data and reputation
- Market opportunity for Business Continuity/Disaster Recovery: around $32 billion in 2015 [Source: IBM]

Recovery Point Objective and Recovery Time Objective
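
The following sketch illustrates the two objectives:
- The achieved RPO is measured as the time between the last recoverable point and the disaster.
- The achieved RTO is measured as the time from the disaster until service is restored.

Both are then checked against targets. The function names and the incident timeline are hypothetical. Starting the RTO clock at the disaster itself is also an assumption, since some SLAs start it at disaster declaration instead. The 15-minute RPO and 4-hour RTO targets used here are the figures quoted for IBM CMS.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_point: datetime, disaster_time: datetime) -> timedelta:
    """Data-loss window: updates made after the last recoverable point are lost."""
    return disaster_time - last_recoverable_point


def achieved_rto(disaster_time: datetime, service_restored: datetime) -> timedelta:
    """Outage duration, measured here (an assumption) from the disaster until service is restored."""
    return service_restored - disaster_time


def meets_objectives(last_recoverable_point: datetime, disaster_time: datetime,
                     service_restored: datetime, rpo_target: timedelta,
                     rto_target: timedelta) -> bool:
    """True if both the achieved RPO and the achieved RTO are within their targets."""
    return (achieved_rpo(last_recoverable_point, disaster_time) <= rpo_target
            and achieved_rto(disaster_time, service_restored) <= rto_target)


# Hypothetical incident timeline.
last_point = datetime(2015, 3, 1, 9, 50)
disaster = datetime(2015, 3, 1, 10, 0)
restored = datetime(2015, 3, 1, 13, 30)
print(achieved_rpo(last_point, disaster))  # 0:10:00
print(achieved_rto(disaster, restored))    # 3:30:00
print(meets_objectives(last_point, disaster, restored,
                       rpo_target=timedelta(minutes=15),
                       rto_target=timedelta(hours=4)))  # True
```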

Disaster Recovery for Enterprise-class Clients on the Cloud
Cloud DR: the ability to recover the cloud infrastructure and the workloads hosted on it
Potential benefits to customers (cloud users):
- Self-service model: on-demand DR protection activation; on-demand, non-intrusive DR tests
- Resiliency made cheaper: pay only for workloads that need to be DR-protected; no upfront capital expenses
- Improved agility in responding to outages
Challenges to cloud providers:
- More aggressive SLAs
- Scale and diversity
- Interdependencies and coordination of server DR and app DR
- DR of management capabilities
- Regulatory requirements (e.g., location)

DR Life Cycle and Basic DR Use Cases
Diagram stages: DR Deployment, DR Steady State, DR Test, DR declaration, Failover, Failback
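
The slide names the life-cycle stages, but the arrows between them are not recoverable from the text. The sketch below models the life cycle as a small state machine under an assumed set of transitions:
- Deployment leads to steady state.
- From steady state, a DR test returns to steady state.
- From steady state, a DR declaration leads to failover, then failback, then back to steady state.

The class and the transition table are hypothetical.

```python
from enum import Enum


class DRState(Enum):
    DEPLOYMENT = "DR Deployment"
    STEADY_STATE = "DR Steady State"
    TEST = "DR Test"
    DECLARATION = "DR declaration"
    FAILOVER = "Failover"
    FAILBACK = "Failback"


# Assumed transitions: the slide names the states but not the exact arrows.
TRANSITIONS = {
    DRState.DEPLOYMENT: {DRState.STEADY_STATE},
    DRState.STEADY_STATE: {DRState.TEST, DRState.DECLARATION},
    DRState.TEST: {DRState.STEADY_STATE},
    DRState.DECLARATION: {DRState.FAILOVER},
    DRState.FAILOVER: {DRState.FAILBACK},
    DRState.FAILBACK: {DRState.STEADY_STATE},
}


class DRLifecycle:
    """Hypothetical tracker that rejects transitions not in TRANSITIONS."""

    def __init__(self) -> None:
        self.state = DRState.DEPLOYMENT
        self.history = [self.state]

    def move_to(self, target: DRState) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {target.value}")
        self.state = target
        self.history.append(target)
```

In this model a failover cannot start directly from steady state. It must be preceded by a DR declaration, mirroring the slide's point that DR is triggered by a decision rather than by automatic failure detection.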

Reference Architecture for Cloud DR
At the DR site:
- VMs and applications/appliances may or may not exist before failover
- Management systems may or may not exist before failover, or may be limited

Reference Architecture for Cloud DR: Replication
Replication methods (RTO / RPO / cost):
- Synchronous replication: seconds-minutes / seconds-minutes / $$$
- Asynchronous replication: minutes-few days / minutes-few days / $$
- Backup-restore: days-weeks / days-weeks / $
Replication levels:
- Storage-level replication: any updates to the VM's state at the primary site's storage are mirrored to the DR site's storage
- Host-level replication: requires installing an agent on each host, with different agents for different OSes
- App-level replication: may be required for certain apps even if other options are technically possible
Replication modes:
- Active-active (live/live, hot DR, warm DR)
- Active-passive (cold DR, warm DR)
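
The toy model below makes the RPO column of the replication comparison concrete by contrasting synchronous and asynchronous replication of a single volume:
- Synchronous: a write is acknowledged only after the DR copy has applied it, so no acknowledged write is lost.
- Asynchronous: writes queue until the next shipping cycle, and whatever is still queued when disaster strikes is lost.

This is a hypothetical in-memory sketch. It ignores the WAN round-trip latency that synchronous replication adds to every write, which is a large part of why synchronous replication costs more and is usually limited to shorter distances.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model: primary writes are mirrored to a DR copy, synchronously or asynchronously."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary = []        # writes acknowledged at the primary site, in order
        self.dr_copy = []        # writes applied at the DR site, in order
        self.pending = deque()   # asynchronous writes not yet shipped

    def write(self, data) -> None:
        self.primary.append(data)
        if self.synchronous:
            # Acknowledge only after the DR site has applied the write.
            self.dr_copy.append(data)
        else:
            # Acknowledge immediately; ship later in a batch.
            self.pending.append(data)

    def ship_pending(self) -> None:
        """One asynchronous replication cycle: drain queued writes to the DR site."""
        while self.pending:
            self.dr_copy.append(self.pending.popleft())

    def lost_on_disaster(self) -> list:
        """Writes acknowledged at the primary but missing at the DR site.

        The DR copy is always a prefix of the primary in this model.
        """
        return self.primary[len(self.dr_copy):]
```

With asynchronous replication, the data lost on disaster is bounded by how much changes between shipping cycles. That bound is what the RPO of an asynchronous setup reflects.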

Reference Architecture for Cloud DR: Networking
- The physical WAN link between sites must have adequate bandwidth
- Network design should:
  - support multiple replication streams
  - support secure segregation of data streams (e.g., VPNs, VLANs)
  - provide secure access channels for cloud admins and clients
  - support adequate segregation between accounts within the same client
- The client network environment may need to be pre-staged at the DR site
- Network management capabilities may need to be pre-staged at the DR site: switching/routing configurations, load balancing configurations
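
A first-cut check of whether the WAN link has adequate bandwidth is to convert the daily change rate at the primary site into an average bit rate. The hypothetical helper below does that arithmetic. The peak-to-average factor and protocol overhead are assumed inputs, not figures from the slides. Real sizing would also account for compression, the replication method, and the RPO target.

```python
def required_wan_mbps(changed_gb_per_day: float,
                      peak_to_average: float = 1.0,
                      protocol_overhead: float = 0.0) -> float:
    """Average WAN bandwidth (Mbit/s) needed to keep up with a daily change rate.

    changed_gb_per_day: data changed per day at the primary site, in decimal GB (1e9 bytes).
    peak_to_average: assumed ratio of peak to average change rate.
    protocol_overhead: assumed fractional overhead (e.g. 0.1 for 10%).
    """
    bits_per_day = changed_gb_per_day * 1e9 * 8
    average_bps = bits_per_day / 86_400  # seconds per day
    return average_bps * peak_to_average * (1 + protocol_overhead) / 1e6


print(required_wan_mbps(864))  # 80.0
```

When several replication streams share one link, their daily change rates can be summed before calling the function.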

Reference Architecture for Cloud DR: Management
Management services for enterprise workloads include virus scanning, patching, directory services, monitoring, backup/restore, load balancing, network security, and compliance.

Reference Architecture for Cloud DR: Control
Orchestration and automation:
- Overall coordination of steps in the DR lifecycle, particularly failover: workload recovery, environment recovery, management recovery
- Drive automatic steps in the DR lifecycle
Administration:
- Self-service portal(s) that allow clients and admins to launch DR operations, such as:
  - selecting which VMs should be replicated
  - initiating a DR test or failover
  - viewing replication/recovery status
  - defining user roles
  - specifying access permissions
- Self-service portal(s) that allow admins to launch DR operations, such as:
  - enrolling clients into DR protection
  - pre-staging the client network environment (e.g., VLANs)
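
One way to sketch the orchestration layer's job of coordinating failover steps is to run the steps in dependency order. The example below uses the standard-library graphlib.TopologicalSorter (Python 3.9+). The step names, their dependencies, and the actions are hypothetical, not the CMS orchestrator's actual design. In particular, the assumption that the client network environment is brought up before workloads and management services is illustrative only.

```python
from graphlib import TopologicalSorter

# Hypothetical failover steps, mapped to the steps they depend on.
FAILOVER_STEPS = {
    "environment_recovery": set(),                                    # e.g. bring up client VLANs at the DR site
    "management_recovery": {"environment_recovery"},                  # e.g. monitoring, patching, directory services
    "workload_recovery": {"environment_recovery"},                    # e.g. resume pre-provisioned DR VMs
    "verify_and_report": {"management_recovery", "workload_recovery"},
}


def run_failover(steps, actions):
    """Run each step's action only after all of its prerequisites; return the execution order."""
    order = []
    for step in TopologicalSorter(steps).static_order():
        actions[step]()  # an exception here propagates and halts the remaining steps
        order.append(step)
    return order
```

Because an exception raised by an action propagates out of run_failover, a failed step stops the sequence and leaves later steps unrun for an operator to inspect. Independent steps (here, management and workload recovery) could also be run in parallel using the sorter's get_ready()/done() interface.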

IBM: Business Continuity and Resiliency Services
Broad experience; broad solution capabilities; industry-specific, globally available expertise; credibility:
- More than 50 years of business continuity and disaster recovery experience
- More than 7,800 Business Continuity & Resiliency Services contracts with 5,000+ clients
- Unique insights based on the work of 30,000 industry specialists worldwide
- Global resiliency centers designed for multivendor environments, with over 200 hardware and software vendors supported, including HP, Oracle, Cisco, and IBM's own products
- Business process and technology expertise to help you design and implement the right solution for your business
- 150 resiliency centers across 50 countries
- Five million square feet of floor space for disaster recovery, with 41,000 work area recovery seats
- Knowledge of local, regional, and global regulations
- Over 1,800 professionals dedicated to business continuity
- Track record of 100 percent success in meeting commitments to clients who have declared a disaster
- External validation by analysts who have reported favorably on IBM's breadth of offerings and geographic coverage

DR Solutions for an Enterprise-class IaaS-PaaS Managed Cloud (IBM Cloud Managed Services)
Cloud-to-Cloud
- Description: failover to a similar cloud site
- RPO/RTO: 15 min / 4 hours
- Regional availability: another cloud site in the same region
- VM provisioning at DR site: the cloud's VM provisioning
- Replication type: storage-based, rsync, application-based
- Replication mode: active-active and active-passive
- Post-failover management: full management
- Networking: VLANs over VPN, MPLS, or point-to-point with Layer 2 or Layer 3 routing
- Control: custom DR orchestrator
Cloud-to-Dedicated DR Site
- Description: failover to a dedicated DR site
- RPO/RTO: 15 min / 4 hours
- Regional availability: no cloud site, but a purpose-built DR site in the same region
- VM provisioning at DR site: the dedicated DR site's VM recovery mechanism
- Replication type: host-based and application-based
- Replication mode: active-passive
- Post-failover management: limited management
- Networking: VLANs over a dedicated link
- Control: the dedicated DR site's DR automation
Cloud-to-Repurposed Customer Site
- Description: failover to a custom site
- RPO/RTO: 15 min / 4 hours
- Regional availability: no cloud site, but a customer-owned site in the same region
- VM provisioning at DR site: VMware vCenter
- Replication type: storage-based
- Replication mode: active-active
- Post-failover management: limited management
- Networking: dedicated fiber link
- Control: custom DR orchestrator
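
The three DR options can be encoded as data and filtered against a client's requirements. The sketch below does that using values as read from the slide's comparison table, whose flattened layout makes the replication-mode column the least certain reading. The DROption type, its field names, and the candidates function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DROption:
    name: str
    rpo_minutes: int
    rto_hours: int
    target_site: str
    replication_modes: frozenset
    post_failover_management: str  # "full" or "limited"


OPTIONS = [
    DROption("Cloud-to-Cloud", 15, 4, "another cloud site in same region",
             frozenset({"active-active", "active-passive"}), "full"),
    DROption("Cloud-to-Dedicated DR Site", 15, 4, "purpose-built DR site in same region",
             frozenset({"active-passive"}), "limited"),
    DROption("Cloud-to-Repurposed Customer Site", 15, 4, "customer-owned site in same region",
             frozenset({"active-active"}), "limited"),
]


def candidates(options, max_rpo_minutes, max_rto_hours, mode=None, need_full_management=False):
    """Names of options whose RPO/RTO, replication mode, and management level satisfy the request."""
    return [o.name for o in options
            if o.rpo_minutes <= max_rpo_minutes
            and o.rto_hours <= max_rto_hours
            and (mode is None or mode in o.replication_modes)
            and (not need_full_management or o.post_failover_management == "full")]


print(candidates(OPTIONS, 15, 4, need_full_management=True))  # ['Cloud-to-Cloud']
```

Because all three options share the same 15-minute/4-hour objectives, the choice between them turns on the other rows: what kind of site is available in the region, the replication mode, and how much management is needed after failover.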

Cloud to Cloud Disaster Recovery (IBM CMS)
Diagram: primary VMs in CMS Cloud A replicate to secondary VMs in CMS Cloud B via file-level, app-level, or host-level replication. Cloud A storage replicates to Cloud B storage via storage-level replication. Cloud B holds pre-provisioned DR VMs (possibly suspended). DR Control components at both sites share DR metadata and drive automated DR failover.

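Cloud to Cloud Disaster Recovery (IBM CMS)
Diagram: CMS Cloud A hosts "Primary / Secondary VMs" and CMS Cloud B hosts "Secondary / Primary VMs", so each cloud holds primaries for some workloads and secondaries for others. The sites are linked by file-level, app-level, or host-level replication and by storage-level replication between their storage. The diagram also shows pre-provisioned DR VMs (possibly suspended), DR Control at both sites sharing DR metadata, and automated DR failover.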

IBM CMS Cloud-to-Cloud Disaster Recovery Overview
- Failover site can be leveraged for other workloads (e.g., dev/test)
- 4-hour recovery time objective (RTO), 15-minute recovery point objective (RPO)
- Full CMS management capabilities at the recovery site
- IBM makes the disaster declaration
- Single annual DR test included; option to purchase additional tests
- Individual workload(s) can be tested
- DR services can be ordered anytime after initial onboarding
- Primary focus on infrastructure DR (as of 1Q 2015)
- IBM-managed SAP and Oracle services have DR options that were defined leveraging base CMS capabilities
- Enhancements to include middleware and database services within DR scope are planned for a future release
Diagram (CMS Site-to-Site Disaster Recovery): Fail Over and Fail Back arrows link two groups of CMS data centers.
- One group: Raleigh, Lisbon, Ehningen, Portsmouth, Makuhari, Winterthur, Toronto, Boulder
- Other group: Barcelona, Montpellier, Lisbon, Sydney, Ehningen, Boulder
- More to come

IBM CMS Cloud to Cloud DR: Steady-State Operations

IBM CMS Cloud to Cloud DR: Failover

Cloud to Dedicated DR Site (IBM CMS)
Diagram: primary VMs in CMS Cloud A replicate via file-level, app-level, or host-level replication to secondary VMs at a dedicated DR site in the same region, with storage at each site. The DR site has pre-provisioned DR servers for managed applications, and other VMs are provisioned during failover. DR Control components at both sites share DR metadata.

Outline of a Sample Failover Procedure

Lessons Learned in Enterprise-Cloud DR
- DR should cover both workloads and management
- Standardization vs. customizability
- Data management is central to DR design
- Regulations and regional requirements may trump technology in DR
- Find an acceptable balance between cost and risk mitigation
- Automation is a must to achieve low RTOs
- DR testing should be flexible and non-disruptive

Summary and Takeaways
- Cloud DR is the ability to recover the cloud infrastructure and the workloads hosted on it
- Cloud-based DR-as-a-Service has many benefits for enterprise-class cloud users
- Cloud-based DR-as-a-Service raises many technical challenges for cloud providers
- Many trade-offs must be considered: standardization vs. customizability, cost vs. risk mitigation, regulations vs. technical aspects
- Automation is key
- The share of enterprises considering cloud-based DR is expected to grow from 17% in 2014 to 50% in 2018 [Evolve IP Survey, 2015]
- Stay tuned: exciting things are happening in cloud DR