The Promise of Virtualization for Availability, High Availability, and Disaster Recovery - Myth or Reality?
B. Jostmeyer, Tivoli System Automation, bjost@de.ibm.com
IBM Systems Management, Virtualisierung und Storage Symposium, 15-17 November 2010, Marriott Hotel Heidelberg
Virtualization technologies play an important role in data centers, especially in service-oriented (cloud) environments. Much of the attention goes to virtualization technologies for distributed server platforms such as VMware, System p, Sun Solaris Zones, and others. Virtualization provides several benefits, but this presentation concentrates on availability, high availability, and disaster recovery. Over the next hour I give an overview of some existing virtualization technologies and how they increase application availability by reducing planned downtime. I also discuss the limitations of virtualization technologies compared with traditional high availability solutions, and how combining both worlds can improve overall availability.
Agenda
- Part I: Introduction
- Part II: Usage of Virtualization Technology for Availability and High Availability: Capabilities and Limitations
- Part III: Usage of Virtualization Technology for Disaster Recovery: Capabilities and Limitations
Introduction to Virtualization Technologies
- Virtualization in simple words:
  - Abstracting from the hardware
  - Empowering a single piece of hardware to run multiple independent systems
- The primary value of virtualization is to enhance the overall utilization of hardware.
- Different virtualization areas:
  - Server virtualization: provides multiple virtual machines (VMs) on one physical server, each acting as host for an operating system
  - Storage virtualization and network virtualization: not in the scope of this talk
- Hypervisor is the term for the component that manages virtual machines and their resources on one physical server.
[Figure: several VMs running on one server (HW)]
Virtualization and High Availability
- More than 80% of enterprises have adopted server virtualization, but only 20% of all server workload runs on virtual machines
  - Lack of confidence when it comes to high availability of the virtual infrastructure
  - Better management tools are predicted to raise the adoption rate to 48% by 2012
- Virtualized landscapes have the same high availability needs: stay in business 24x7x365
  - Failures that cause service outages happen in hardware as well as in the software stack
  - Whenever maintenance is required, avoid service interruption if possible
  - Be prepared for the worst: recover the business at another site
Business Continuity Definitions
- High Availability: a system provides service during defined periods, at acceptable or agreed-upon levels, and masks UNPLANNED OUTAGES from end users.
- Continuous Operations: a system operates continuously and masks PLANNED OUTAGES from end users.
- Continuous Availability: a system delivers non-disruptive service to the end user 7 days a week, 24 HOURS A DAY (there are NO outages).
Business Relevance of Availability
- Commerce is handled over the internet, and computing centers are growing
- At the same time, businesses need to ensure that their systems are available 24/7
- Downtime translates directly into loss of revenue
- Average cost of 1 hour of downtime: $42,000, but the cost can be much higher
- The overall availability picture has to consider planned and unplanned outages

Availability   Downtime per year
90%            36.5 days
95%            18.25 days
99%            3.65 days
99.9%          8.76 hours
99.99%         52.6 minutes
99.9999%       31.5 seconds
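The downtime figures in the table follow directly from the availability percentage. The sketch below reproduces that arithmetic and, as an illustration only, multiplies it by the average hourly cost quoted on the slide. The function names are made up for this example, and a flat hourly cost over a 365-day year is an assumption.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, i.e. a non-leap year as in the table


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected downtime per year (in hours) for a given availability level."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


def downtime_cost_per_year(availability_percent: float,
                           cost_per_hour: float = 42_000.0) -> float:
    """Yearly downtime cost, assuming a flat hourly cost (the slide's average)."""
    return downtime_hours_per_year(availability_percent) * cost_per_hour


if __name__ == "__main__":
    for pct in (90, 95, 99, 99.9, 99.99, 99.9999):
        hours = downtime_hours_per_year(pct)
        cost = downtime_cost_per_year(pct)
        print(f"{pct}% available: {hours:.4f} h/year, about ${cost:,.0f}")
```

Dividing the hours by 24 gives the values in days, and multiplying by 60 or 3600 gives minutes or seconds.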
Business Continuity Issues
- Causes of outages (Source: Gartner Group, 2007):
  - 40% operations errors
  - 40% application failures
  - 20% environmental factors, HW, OS, power, disasters
- Reasons for planned downtime: maintenance, tests
- Reasons for unplanned downtime: operator errors, application failures, environmental failures, OS failures, HW failures, disasters
- Additional challenges caused by dynamically created services (IaaS, PaaS, SaaS)
- Loss of business
- Loss of customers: the competition is just a mouse click away
- Loss of credibility, brand image, and stock value
Virtualization Marketing Messages...
Part II: Usage of Virtualization Technology for Availability and High Availability - Capabilities and Limitations
Agenda
- Part I: Introduction
- Part II: Usage of Virtualization Technology for Availability and High Availability: Capabilities and Limitations
  - Virtual Server Mobility
  - Automatic Restart of Virtual Servers
  - Added value through combination of virtualization features with HA software
  - Fault Tolerance
- Part III: Usage of Virtualization Technology for Disaster Recovery: Capabilities and Limitations
Virtual Server Mobility
- Virtual server mobility can move complete, running VM images (hosting OS and applications) from one server to another with no downtime of the service.
- Examples: VMware vMotion, POWER Live Partition Mobility, ...
- Customer benefit: mobility can help to reduce planned downtime for maintenance steps (e.g. HW maintenance). After the guests have been moved, a server can be shut down.
- Limitation: guest mobility cannot be used in unplanned failure situations (HW/SW).
[Figure: VMs moved between servers I and II; the evacuated server is OFFLINE]
Automatic Restart of Virtual Server(s)
- Hypervisors can detect unplanned VM outages (e.g. OS failure), unplanned hypervisor outages, or HW failures and restart failed images.
- In case of an OS failure, the hosting hypervisor detects the failure and restarts the image.
- In case of a hypervisor or server HW failure, a backup hypervisor detects the outage and restarts all unavailable virtual servers.
- Example: VMware High Availability
[Figure: the VMs of a failed server (OFFLINE) are restarted on the second server]
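The two recovery cases above can be sketched as a small simulation. This is not how VMware High Availability or any other product is implemented. `Host`, `recover`, and the "least loaded surviving host" placement rule are assumptions chosen only to make the behavior concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A physical server with its hypervisor (hypothetical model)."""
    name: str
    alive: bool = True
    vms: dict = field(default_factory=dict)  # VM name -> True if running


def recover(hosts):
    """Run one recovery pass and return a log of the actions taken."""
    actions = []
    survivors = [h for h in hosts if h.alive]
    for host in hosts:
        if host.alive:
            # Guest failure: the hosting hypervisor restarts the image in place.
            for vm, running in list(host.vms.items()):
                if not running:
                    host.vms[vm] = True
                    actions.append(f"restart {vm} on {host.name}")
        elif survivors:
            # Host failure: a backup hypervisor restarts every lost guest.
            for vm in list(host.vms):
                target = min(survivors, key=lambda h: len(h.vms))
                target.vms[vm] = True
                del host.vms[vm]
                actions.append(
                    f"restart {vm} on {target.name} (failover from {host.name})")
    return actions


if __name__ == "__main__":
    a = Host("A", vms={"vm1": True, "vm2": False})
    b = Host("B", alive=False, vms={"vm3": True})
    for line in recover([a, b]):
        print(line)
```

Note that the model only ever looks at whether a VM is running. Nothing in it knows what happens inside the guest, which is the limitation discussed on the following slides.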
Sample Classification for Business Applications
- Class 1: Unimportant Business Application
  - Unplanned application downtime can be longer than a day (RTO > 1 day)
  - Very long service windows (planned downtime) are accepted
  - IT configuration: no redundant components, no monitoring
- Class 2: Important Business Application
  - Unplanned application downtime can be several hours to a day (RTO < 1 day)
  - Long service windows (planned downtime) are accepted
  - Configuration: mostly no redundant components; the application is monitored and, in a failure situation, recovered manually
- Class 3: Mission Critical Business Application
  - Unplanned application downtime has to be avoided (RTO < x mins)
  - Service windows have to be avoided or kept very short
  - Configuration: redundant components (HW and SW) and technology for automated recovery (e.g. Tivoli System Automation)
- Hypervisor (high) availability features are extremely attractive for class 2 business applications. Reason: simple and easy to use ("with one mouse click").
- Hypervisor (high) availability features provide significant added value for class 3 business applications, but are not sufficient (the limitations are explained on the following pages).
Hypervisor Technology Limitation - Overview
- The management scope of a hypervisor is the set of virtual servers. A hypervisor has no knowledge of the business applications hosted inside the virtual servers.
- Hypervisor technology limitation: no application awareness
  1. No detection of application failures (SW failures): if the business application within the VM fails, the virtualization technology does not detect it
  2. Automatic restart of a virtual server does not guarantee that the application works properly afterwards: the application type can cause restart problems
  3. No awareness of application dependencies: the virtualization technology is not aware of dependencies between application components running in different VMs
Limitation 1: Hypervisors do not detect application failures
- An unplanned outage of a business application running inside a virtual server results in a business service interruption unless another high availability product has been configured to observe the status of that application.
[Figure: one of the VMs on hypervisor I is in an unknown state (?)]
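A minimal sketch of this gap, assuming a hypothetical `Guest` record with one flag the hypervisor can observe and one that only a monitor inside the guest can observe. The names are illustrative and do not correspond to any product API.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    name: str
    vm_running: bool   # visible to the hypervisor
    app_healthy: bool  # visible only to a monitor running inside the guest


def hypervisor_detects_outage(guest: Guest) -> bool:
    """The hypervisor only knows whether the virtual server itself is up."""
    return not guest.vm_running


def app_monitor_detects_outage(guest: Guest) -> bool:
    """An application-aware HA product also checks the application."""
    return not (guest.vm_running and guest.app_healthy)


def undetected_by_hypervisor(guests):
    """Names of guests with an outage that the hypervisor cannot see."""
    return [g.name for g in guests
            if app_monitor_detects_outage(g) and not hypervisor_detects_outage(g)]


if __name__ == "__main__":
    guests = [
        Guest("web", vm_running=True, app_healthy=True),
        Guest("db", vm_running=True, app_healthy=False),   # application crashed
        Guest("app", vm_running=False, app_healthy=False),  # whole VM is down
    ]
    print(undetected_by_hypervisor(guests))
```

In this example only the guest whose VM is still running but whose application has failed goes unnoticed at the hypervisor level.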
Excursus - High Availability for Different Application Types
- Stateless application (multiple instances)
  - Recommendation: no failover required; running multiple instances in parallel provides implicit HA
  - Example: Web servers on System I, System II, and System III
- Warm standby (Type I; stateful application component, single instance)
  - Recommendation: use SA MP for warm standby
  - Example: DB2 (System I, System II)
- Hot standby (Type II; stateful application component, single instance)
  - Recommendation (for existing proprietary hot standby solutions): use SA MP for split-brain resolution and automation
  - Example: DB2 HADR (System I, System II)
- Active/Active (Type III; stateful application component)
  - No failover required, but the implementation requires a very complicated infrastructure to support data integrity and resiliency (e.g. DB2 pureScale)
  - Recommendation: use SA MP for split-brain resolution and automation
  - Example: DB2 on System I, System II, and System III
Limitation 2: Automatic Virtual Server Restart is not always sufficient
- Automatic virtual server restart does not work for applications of Type II (hot standby).
- Sample scenarios for hot standby applications: SAP Central Services, DB2 HADR
- SAP (enqueue LPAR, enqueue replication LPAR): the SAP core component SAP Central Services (SCS), consisting of enqueue server and enqueue replication server, will hang after a simple restart of the virtual server running the enqueue server. Automation logic is required to start the enqueue server on the virtual server where the enqueue replication server is already running.
- DB2 (primary LPAR, secondary LPAR): to exploit the DB2 HADR feature, the role of the DB2 secondary has to be changed to primary after a failure of the DB2 primary.
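The difference between "restart in place" and a role takeover can be illustrated with a small sketch. It models roles as plain strings and is not the DB2 HADR or SA MP interface. The function name, the role names, and the "down" state for the failed node are assumptions for illustration.

```python
def recover_hot_standby(roles, failed_node):
    """Return a new role map after `failed_node` has failed.

    `roles` maps node name -> "primary" or "standby" (hypothetical model).
    A failed primary is not simply restarted; the standby is promoted instead.
    """
    new_roles = dict(roles)
    if roles.get(failed_node) == "primary":
        standby = next((n for n, r in roles.items() if r == "standby"), None)
        if standby is None:
            raise RuntimeError("no standby available for takeover")
        new_roles[standby] = "primary"  # role change instead of a restart
    # The failed node is marked as down until it is reintegrated.
    new_roles[failed_node] = "down"
    return new_roles


if __name__ == "__main__":
    before = {"lpar1": "primary", "lpar2": "standby"}
    print(recover_hot_standby(before, "lpar1"))
```

A hypervisor-level restart would only bring the failed virtual server back up; the promotion step is the part that needs application-aware automation.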
Limitation 3: No awareness of application dependencies across virtual servers
- Relationships between business application components are not known to any hypervisor, so an application may not work after recovery.
- Sample scenario: recovery of a DB2 node (via automatic virtual server restart) requires a recycle of the J2EE container. Because the virtualization technology is not aware of the application dependency, the application hangs.
- Recovery of a database after failure often requires a recycle of the J2EE application.
[Figure: Web servers, WAS, and DB2 running in separate LPARs]
System Automation for Multiplatforms - an Overview
Tivoli System Automation for Multiplatforms:
- Provides a high availability cluster
- Automates startup and shutdown of complex, stateful applications in the correct sequence
- Actively monitors all resources and reacts to outages of SW and HW components by automatic restart in the correct context
- Automation policies define the automation scope of System Automation
  - Describe resources, groups, and relationships
  - Define the desired target availability situation
- No need to develop automation scripts, workflows, or actions
[Figure: cluster nodes connected by heartbeat and a shared disk]
Application Automation & High Availability
Automation and availability are two major functional aspects provided by the SA product family.
- Automation
  - Automate complex operations and reduce skill requirements (application skills, operating system skills)
  - Focus on dependencies between business-relevant applications (relationships such as Runs On and Depends On)
  - Support changing automation goals
  - SA monitors applications, systems, file systems, and networks
  - SA choreographs startup and shutdown of these resources
- High availability for applications
  - Avoid downtime: keep business-critical applications running 24x7
  - SA provides an HA cluster for redundancy
  - SA uses the automation aspect to assure availability
[Figure: cluster nodes connected by heartbeat and a shared disk]
System Automation for Multiplatforms - Usage in Virtualized Environments
- Value statement: combining virtualization technology with classical HA clustering provides the best of both.
- Benefits ("best of both"):
  - SA provides recovery for application failures (hypervisor limitation 1)
  - SA provides recovery automation for hot standby applications (hypervisor limitation 2)
[Figure: (1) DB2 in an SA MP HA cluster spanning hypervisors I and II; (2) DB2 primary and DB2 secondary in an SA MP HA cluster spanning hypervisors I and II]
Reduced Planned Downtime for Clustered Mission Critical Applications
- Value statement: combining virtualization technology with classical HA clustering provides the best of both.
- Benefits ("best of both"):
  - Virtualization avoids planned downtime of the production workload
  - SA provides recovery from unplanned HW/SW failures
- Scenario description:
  1. Operator moves the guest running the SAP production system to another server without impacting HA redundancy
  2. System Automation recognizes the guest move and assures application high availability through the application standby/secondary
  3. System Automation detects an application failure and recovers the application on the standby server
[Figure: servers S1, S2, and S3 with an HA cluster (SA); numbers mark the scenario steps]
Recovery from Unplanned HW/SW Failures for Mission Critical Applications
- Value statement: combining virtualization technology with classical HA clustering provides the best of both.
- Benefits ("best of both"):
  - Virtualization avoids planned downtime of the production workload
  - SA provides recovery from unplanned HW/SW failures
- Scenario description:
  1. Hardware failure of the system where the application is running
  2. System Automation detects the node failure and recovers the application on the standby server
  3. Guest is moved to a spare server
  4. System Automation re-establishes the cluster
[Figure: servers S1, S2, and S3 with an HA cluster (SA); numbers mark the scenario steps]
Tivoli System Automation Application Manager
- The problem: business applications are complex and difficult to manage because of
  - a multi-tiered SW stack (application components) that builds up the overall business application
  - application components running in a heterogeneous platform environment
  - start/stop dependencies between application components (which are often not documented)
- The solution: Tivoli SA Application Manager lets operators handle a business application as a single instance. Tivoli System Automation Application Manager
  - aggregates a multi-tiered SW stack into a single business application instance
  - provides various adapters for the heterogeneous platform environment
  - knows the start/stop dependencies between the application components
  - can automatically restart after application failures
[Figure: an AIX HA cluster and a Linux HA cluster]
SA Application Manager: Automation of Multi-tiered Applications
- Value statement: combining virtualization technology with classical HA clustering provides the best of both.
- Benefits ("best of both"): SA Application Manager can manage relationships in multi-tiered business applications (hypervisor limitation 3)
- SA Application Manager automation policy for the Web Portal: HTTP (Ref) StartsAfter WAS (Ref) StartsAfter DB2 (Ref)
- Recovery of a database after failure often requires a recycle of the J2EE application
[Figure: Web servers, WAS, and DB2 in LPARs on KVM, VMware, and PowerVM]
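The StartsAfter chain in the policy above is a dependency graph, and a valid start sequence is a topological order of it. The sketch below uses Python's standard `graphlib` module to show the idea. It is not the System Automation policy format or engine, and the function names and the dictionary representation are assumptions.

```python
from graphlib import TopologicalSorter

# Resource -> set of resources it must start after (from the example policy).
STARTS_AFTER = {
    "HTTP": {"WAS"},
    "WAS": {"DB2"},
}


def start_order(starts_after):
    """Order in which resources can be started (prerequisites first)."""
    return list(TopologicalSorter(starts_after).static_order())


def stop_order(starts_after):
    """Shutdown runs in the reverse of the start order."""
    return list(reversed(start_order(starts_after)))


def dependents_of(resource, starts_after):
    """Resources that directly or indirectly start after `resource`.

    These are the candidates for a recycle once `resource` has been recovered.
    """
    found = []
    frontier = [resource]
    while frontier:
        current = frontier.pop()
        for res, prereqs in starts_after.items():
            if current in prereqs and res not in found:
                found.append(res)
                frontier.append(res)
    return found


if __name__ == "__main__":
    print("start:", start_order(STARTS_AFTER))
    print("stop:", stop_order(STARTS_AFTER))
    print("recycle after DB2 recovery:", dependents_of("DB2", STARTS_AFTER))
```

Treating every transitive dependent as a recycle candidate is a simplification; the slide itself only states that the J2EE application often needs a recycle after database recovery.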
Outlook: Automated Maintenance with no Application Downtime
- Value statement: automation (SA AppMan) will allow server evacuation in a single step. Today, best practice is a stepwise evacuation of virtual servers using guest mobility, which is a time-consuming manual operation task.
- System Automation benefits:
  - Automated, stepwise relocation of virtual servers (SA)
  - Application impact assessment of the evacuation operation
- Scenario description:
  1. Operator initiates server evacuation
  2. System Automation automates the guest mobility of the virtual servers (stepwise guest mobility) and, if necessary, also stops virtual servers
  3. Operator turns server S3 off
[Figure: servers S1, S2, and S3 with HA clusters; numbers mark the relocation steps]
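A stepwise evacuation can be thought of as a planning problem: for each guest on the server to be emptied, pick a target or stop the guest. The sketch below is one possible planner under stated assumptions, not the product's algorithm. The capacity limit per server and the rule that two members of the same HA cluster must not share a server are assumptions added for illustration, as are all names.

```python
def plan_evacuation(source, placement, capacity, cluster_of):
    """Plan the evacuation of `source` as a list of (action, guest, target).

    placement:  server -> list of guests currently running there
    capacity:   server -> maximum number of guests (hypothetical limit)
    cluster_of: guest -> HA cluster name, for clustered guests only
    """
    steps = []
    # Working copy of the load on all servers except the one being evacuated.
    load = {s: list(g) for s, g in placement.items() if s != source}
    for guest in placement[source]:
        cluster = cluster_of.get(guest)
        target = None
        for server, guests in load.items():
            # Keep HA redundancy: never co-locate two members of one cluster.
            has_peer = cluster is not None and any(
                cluster_of.get(g) == cluster for g in guests)
            if len(guests) < capacity[server] and not has_peer:
                target = server
                break
        if target is None:
            steps.append(("stop", guest, None))  # no suitable target available
        else:
            load[target].append(guest)
            steps.append(("move", guest, target))
    return steps


if __name__ == "__main__":
    placement = {"S1": ["a1"], "S2": ["b1"], "S3": ["a2", "c1", "d1"]}
    capacity = {"S1": 2, "S2": 2}
    cluster_of = {"a1": "A", "a2": "A"}
    for step in plan_evacuation("S3", placement, capacity, cluster_of):
        print(step)
```

Once every guest on S3 has been moved or stopped, the operator can turn the server off, which corresponds to step 3 of the scenario.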
Part III: Usage of Virtualization Technology for Disaster Recovery - Capabilities and Limitations
Disaster Recovery for Virtualized Platforms
- Replication of virtual server images across sites (Site I, Site II): storage-based replication of images and data
- Classical DR solutions replicate application data
- Examples: VMware Site Recovery Manager, Tivoli Productivity Center for Replication (TPC-R)
- Limitation: no application awareness
[Figure: Site I and Site II each hold DB2-A, DB2-B, DB2-C and the guests Linux A, Linux B, Win C]
Disaster Recovery with System Automation Application Manager
- Disaster recovery environments are multi-site data center setups
- Metro/Global Mirror replication technologies are employed to ensure that all business-relevant data is available at the failover site. The RPO in such an environment is typically in the range of hours.
- A DR plan exists that contains instructions for the case of a disaster
- With System Automation Application Manager, a DR solution can be created that integrates replicated storage setups with multi-tiered business applications.
[Figure: Stuttgart (Production) and Böblingen (Backup)]
SA Application Manager & Disaster Recovery - Components View
[Figure: SA Application Manager (Operations Console, Automation JEE Framework on WebSphere, Automation Engine) manages applications through adapters to SA MP and HACMP clusters at Site I and Site II, and manages data replication between the sites through TPC-R]
Outlook - SA Disaster Recovery Manager
- Manage business applications
- Manage site relocation for multi-tiered applications controlling different hypervisors
- Scenario:
  1. Operator wants to move the multi-tiered, cross-platform SAP production system to Site II
  2. System Automation stops all SAP applications in the correct sequence and starts the virtual guests on Site II
  3. With the help of TPC-R, System Automation switches the replication direction for the SAP DB
  4. System Automation starts SAP again on Site II
[Figure: SA App Man and TPC-R control SAP Prod (Web and WAS on VMware, SAP DB Prod in LPARs on PowerVM) and SAP Dev at Site I and Site II, with replication sessions A and B between DSxxxx storage systems]
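The relocation scenario is essentially an ordered runbook. The sketch below generates such a runbook as plain text steps from a given start order; it does not call System Automation or TPC-R, and the function name, the step wording, and the component names in the usage example are hypothetical.

```python
def relocation_plan(start_order, source_site, target_site):
    """Build an ordered list of relocation steps (illustrative runbook only).

    start_order lists the application components in the order they are started;
    stopping happens in the reverse order.
    """
    plan = [f"stop {c} on {source_site}" for c in reversed(start_order)]
    plan.append(f"start virtual guests on {target_site}")
    plan.append(f"switch replication direction to {target_site} -> {source_site}")
    plan.extend(f"start {c} on {target_site}" for c in start_order)
    return plan


if __name__ == "__main__":
    components = ["SAP DB", "SAP application server", "Web server"]
    for number, step in enumerate(
            relocation_plan(components, "Site I", "Site II"), start=1):
        print(f"{number}. {step}")
```

Switching the replication direction before the database is started on the target site mirrors steps 3 and 4 of the scenario: the target copy becomes the writable side, and the former production site receives the updates.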
Outlook: Manage Availability in Hybrid Cloud Environments
[Figure: operators use System Automation Application Manager to manage resources, applications, and their dependencies across HA Cluster A, HA Cluster B, and HA Cluster C, spanning physical and virtualized servers on premise and IaaS/PaaS public cloud services off premise]
Summary - Value of HA Clustering in Virtualized Environments
- Virtualization in a data center improves utilization and thus reduces HW costs.
- Virtualization technologies can help to enhance availability for planned outages.
- High availability can only be ensured by redundancy of HW and SW.
- Availability of business applications can only be ensured by a true HA cluster, since hypervisors do not manage applications.
[Figure: an SA cluster spanning VMs on two servers (HW)]
Summary - Value of Virtualization & System Automation for Datacenters
- Disaster recovery solutions for virtualized environments require the replication of your business-relevant data to a remote site.
- SA Application Manager provides a way to coordinate site failovers of your business applications, even when they run clustered in virtualized environments.
[Figure: System Automation Application Manager above SA clusters spanning VMs on three servers (HW), with data replication between them]
More about High Availability and System Automation in Cloud Infrastructures
- The Promise of Virtualization for High Availability
- Cloud Resiliency
- Business Continuity for heterogeneous Infrastructures
Thank you!
Need more information?
- New wiki on developerWorks: https://www.ibm.com/developerworks/wikis/display/tivoli/tivoli+system+automation
- Contact:
  - Thomas Lumpp, STSM - SA AM / SA MP, thomas.lumpp@de.ibm.com, +49-7031-16-3057
  - Bernd Jostmeyer, Lead Developer SA AM, bjost@de.ibm.com, +49-7031-16-4106