WHITE PAPER
Best Practices to Ensure SAP Availability
Software for Innovative Open Solutions




Abstract

Ensuring the continuous availability of mission-critical systems is a high priority for corporate IT groups. This paper presents five best practices that IT administrators can implement to maximize the uptime of their SAP system. We also introduce SIOS Technology Corp.'s SteelEye Protection Suite for SAP, the leading platform for disaster recovery and business continuity of SAP systems.

Introduction to Availability

SAP systems are the lifeblood of many businesses. Downtime of these mission-critical systems can result in lost revenue, decreased productivity, customer frustration and loss, and employee stress. A study titled "The Costs of Enterprise Downtime" by Infonetics Research places the cost of downtime for organizations at 3.6 percent of annual revenue. That's an astounding $18 million each year for a half-billion-dollar firm. In the ERP space, where products such as SAP provide the infrastructure for applications on which businesses run, the cost of downtime can approach $1 million per hour, according to a survey on the cost of enterprise application downtime.

At the core of all SAP systems is SAP NetWeaver, the integrated technology platform. The overall SAP system also includes database servers, network infrastructure, client access devices, etc. For administrators tasked with ensuring high availability of SAP systems, monitoring and recovery of SAP NetWeaver and all related SAP solution components, including the message and enqueue services, is critical to avoid single points of failure (SPoF).

What is high availability?

Availability is defined as the amount of time that application services are accessible to end users. The availability of a system depends on every layer: the infrastructure, such as servers and storage at the hardware layer; the network; the operating system; base services such as the database; and the applications. Availability is measured as the percentage of time that the application is available to end users.
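To make the percentage definition concrete, the following minimal Python sketch converts between uptime and availability, and turns an availability target into a yearly downtime budget. The function names are illustrative assumptions, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def availability_percent(uptime_minutes: float, total_minutes: float) -> float:
    """Percentage of the measurement period during which the service was up."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * uptime_minutes / total_minutes


def downtime_budget_minutes(target_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Maximum downtime allowed in the period while still meeting the target."""
    return period_minutes * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.99 percent target leaves a budget of roughly 52.6 minutes of downtime per year.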
An application is highly available when it approaches 99.99 percent availability, known as four 9s. Four 9s should be the goal for SAP deployments, which equates to less than one hour of downtime for an entire year.

Preventing downtime is objective number one for IT professionals. But when downtime occurs, it is important to restore the system to operation as quickly as possible. The target for this interval is known as the Recovery Time Objective (RTO). When an SAP system is down, minutes count. Organizations should work to optimize their RTO.

What Causes Downtime?

Based on extensive feedback from clients, Gartner estimates that, on average, 40 percent of unplanned application downtime is caused by application failures. Another 40 percent is caused by operator errors. And about 20 percent is due to hardware failure. According to Gartner, the leading reasons for unplanned downtime experienced by enterprise systems are listed in Table 1.

Table 1: Causes of Unplanned Downtime

Type of Outage: Application failure
Description: An application failure brings the system to an immediate halt.
Examples: Software bugs, performance issues, application changes

Type of Outage: People issues
Description: An operator error can occur when IT staff fails to perform an operations task or executes a task incorrectly.
Examples: Loading the wrong patch, mistakenly powering down the system

Type of Outage: Operating system failure
Description: Corrupted code can cause a slowdown or complete failure of an operating system that may only be detected after the impact is felt.
Examples: Software bugs

Type of Outage: Hardware failure
Description: Any hardware component failure can have a local or system-wide impact.
Examples: Server failure, network failure, drive failure

Type of Outage: Power outage, heating / cooling issues
Description: Without proper backup provisions, a power outage or the failure of a cooling system can bring down a site quickly.
Examples: Brownout or blackout, power cable break, air conditioner failure

Type of Outage: Natural disaster
Description: An unexpected flash flood or a fire and accompanying smoke can take out a data center for an extended period of time.
Examples: Flooding from broken pipes, fire or smoke, extreme weather

IT professionals must examine all causes of downtime when designing and implementing high-availability systems. This includes not only unplanned downtime, but planned downtime as well. Planned downtime can be just as disruptive to operations, especially in follow-the-sun organizations that span the globe. The most common reason for planned downtime is system maintenance.

Table 2: Causes of Planned Downtime

Type of Outage: Planned system changes
Description: Planned system downtime takes place when performing routine, periodic maintenance, upgrades, or new deployments. The impact of the planned outage depends upon the scope of the changes being made.
Examples: Software release updates, bug fixes/patches, config changes, database reorganizations, import of transport requests, profile parameter changes, middleware component upgrades

Best Practices to Ensure SAP Availability

This paper offers five best practices that help organizations worldwide as they endeavor to build SAP high availability and disaster recovery solutions. Implementing these practices will increase the likelihood of achieving SAP availability and RTO objectives.

1. Use clustering to eliminate Single Points of Failure

As Figure 1 illustrates, an SAP environment consists of multiple layers, including Presentation, Application, and Database. Each layer can represent a single point of failure depending on its configuration; therefore, each should be protected by a high-availability solution.

Figure 1: SAP Architecture Overview

Likewise, the SAP core itself is composed of several cooperating services that may run on a single server or be distributed across several servers in a cluster. Some of these services have redundancy built into them, while others do not and represent single points of failure in the SAP environment. The SAP Central Instance (Enqueue and Message Server for ABAP), the SAP System Central Services (SCS) Instance (Enqueue and Message Server for Java), the database server, and the NFS server represent single points of failure and should be protected.

Switchover is a standard technique for increasing the availability of critical SAP NetWeaver environments by clustering together a number of servers (or virtual machines) to eliminate single points of failure. A switchover mechanism ensures that the resources assigned to a node in the cluster are automatically reassigned to another node in the cluster if the first node fails. This ensures that affected resources remain available.

2. Protect all components of the SAP stack with proactive monitoring and dependency management

It is not enough to simply perform hardware monitoring for the servers on which SAP is running. Every component in the application stack must be monitored to ensure complete availability of the entire solution. SAP is a complex system with numerous dependencies, and each of those dependencies must be monitored as part of the overall high-availability solution. Examples of dependencies include IP addresses for user access to the system, or the database running under SAP. If the database is down, SAP is down. The high-availability solution must understand the dependencies between all of these components, carefully monitor them, and then take the dependencies into account during recovery and switchover. An example of an SAP resource dependency graph is shown in Figure 2.

Figure 2: Example SAP dependencies

Advanced switchover solutions constantly monitor all components of the SAP solution stack, including servers, databases, network connections, and SAP services; understand the dependencies between the components; and take automatic recovery action if a problem is detected.

3. Optimize the cluster configuration to achieve RTO requirements

The restoration time of any failed application is determined by the time required to detect the failure plus the time required to complete a recovery procedure:

$T_{RESTORE} = T_{DETECT} + T_{RECOVER}$

For SAP, detection of an outage occurs through monitoring the health of a number of components, including:

- the physical server on which SAP runs
- individual SAP services
- IP addresses used by clients to access the SAP server, and associated network connections
- databases on which needed information resides

These checks should be configured to ensure best operation in your specific SAP environment. For example, you may want to vary the intervals at which heartbeats are sent between servers or the number of heartbeats that must be missed before a system is determined to be down. You must also decide whether to allow the monitoring software to take an automatic recovery action on failure detection, or whether a system admin should be alerted so that a human makes the final decision on failover. Adding an admin notification and confirmation to the recovery process will add time to the recovery, but can prevent false failovers, especially in WAN configurations that may be subject to intermittent and short-lived network outages.

There also may be certain site-specific error conditions for which you want to optimize the detection algorithm; perhaps certain services tend to go down frequently, or you have determined that a pattern of log file entries is a precursor to SAP outages. Monitoring for these conditions should be regular and frequent so that detection and subsequent recovery are as fast as possible. The switchover solution deployed should be tunable to optimize detection and recovery for your specific configuration and RTO requirements.

4. Use switchover to eliminate downtime during planned maintenance

IT professionals view high availability as an insurance policy against unplanned downtime. But in fact, high-availability solutions may be used more often for managing and/or protecting against planned downtime. As we noted earlier, applying patches and performing upgrades are a major cause of downtime for SAP environments. This is because many organizations don't deploy clustering, so there is no way to perform an automatic failover to the backup site. Switchover can be used to manually move the SAP solution stack among servers to prevent downtime during planned maintenance.

5. Plan carefully and test the solution regularly, particularly after configuration changes

The most important of the best practices is proper solution planning, testing, deployment, and validation.
This begins at the initial decision to implement an SAP high-availability or disaster-recovery solution and continues as long as it is in production. In the planning phase, you must answer these critical questions, which have been discussed previously:

- What is the Recovery Time Objective for the solution?
- What pieces of the solution besides the SAP server and processes themselves must be monitored and recovered?
- Is a minimum set of functionality acceptable for some time following recovery and, if so, what is that subset?
- Are there site-specific error conditions that should be optimized for in the monitoring phase?
- Will automatic failover be allowed, or should administrator notification be the first recovery action?

With these requirements documented, you begin design. For disaster recovery implementations, the choice of a disaster recovery site should be the first decision made. While some organizations decide to use remote corporate offices as backup locations, many look to co-located hosting centers that provide full redundancy for power and network connectivity. An analysis of bandwidth requirements and availability must be done. Designing a sufficient connection between sites in terms of bandwidth and latency is critical to deployment success. At this stage, you should also identify and document the servers, the storage capacity and configuration, the network routing between the primary and disaster recovery sites, and the method of client redirection that will be used.

Personnel with the following skills need to be involved in the entire process, from design through final validation:

- Operating system administration
- SAP server administration
- Clustering software administration
- Network routing and troubleshooting

People with these skills are critical to building the end-to-end solution and should be involved in developing the initial design, in building the test environment, and hands-on in the deployment and subsequent validation.
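The planning questions above can be combined with the restore-time relation from practice 3 (time to restore equals time to detect plus time to recover) into a simple check of whether a proposed configuration can meet its RTO. The Python sketch below is only an illustration: the class and field names are hypothetical, and it assumes that detection time is roughly the heartbeat interval multiplied by the number of missed heartbeats, which a real cluster product may calculate differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    """Hypothetical planning record; every field name here is illustrative."""
    rto_seconds: float                  # Recovery Time Objective for the solution
    heartbeat_interval_seconds: float   # time between heartbeats
    missed_heartbeats: int              # misses before a node is declared down
    recover_seconds: float              # measured or estimated recovery time
    automatic_failover: bool = True     # False: an admin confirms the failover
    admin_confirm_seconds: float = 0.0  # extra delay when a human must confirm

    def detect_seconds(self) -> float:
        # Simplifying assumption: detection takes interval * missed heartbeats.
        return self.heartbeat_interval_seconds * self.missed_heartbeats

    def restore_seconds(self) -> float:
        # Restore time = detect + recover, plus confirmation delay if manual.
        confirm = 0.0 if self.automatic_failover else self.admin_confirm_seconds
        return self.detect_seconds() + confirm + self.recover_seconds

    def meets_rto(self) -> bool:
        return self.restore_seconds() <= self.rto_seconds


if __name__ == "__main__":
    automatic = RecoveryPlan(rto_seconds=300, heartbeat_interval_seconds=5,
                             missed_heartbeats=3, recover_seconds=120)
    manual = RecoveryPlan(rto_seconds=300, heartbeat_interval_seconds=5,
                          missed_heartbeats=3, recover_seconds=120,
                          automatic_failover=False, admin_confirm_seconds=600)
    for name, plan in (("automatic", automatic), ("manual", manual)):
        print(name, plan.restore_seconds(), plan.meets_rto())
```

The example values are invented. They show how an administrator confirmation step, which helps prevent false failovers, can push the estimated restore time past the RTO and therefore needs to be weighed during planning.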
The testing phase is often difficult because the production SAP environment is hard to emulate properly within a test lab. The closer you can get to a mirrored production environment within the test lab, the fewer issues you will see arise during deployment. In testing, you are looking to answer several questions:

- Does the failover software perform as expected on both detection and recovery?
- Does any data replication software perform as expected in terms of speed and data consistency?
- Is there any noticeable performance impact on end users from the presence of the clustering or data replication software?
- Are the various clients able to seamlessly migrate to SAP following all switchover scenarios?

Given your assembled team of experts and a successful testing phase where you have emulated the production environment, deployment should be straightforward. Of course, you will want to have scheduled sufficient time for testing of recovery scenarios. Each of these tests should validate that client redirection works as planned and that all services needed for a fully functional SAP environment are recovered.
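One way to check after a switchover test that every needed service has been recovered is to walk a dependency graph in start order and flag anything that is down or that depends on something that is down. The Python sketch below assumes Python 3.9 or later for the standard-library graphlib module. The resource names, the graph itself, and the idea of a "healthy" set reported by health probes are hypothetical illustrations, not the configuration of any particular cluster product.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each resource maps to the resources it needs.
DEPENDENCIES = {
    "sap_central_instance": {"database", "nfs", "virtual_ip"},
    "database": {"shared_storage"},
    "nfs": {"shared_storage"},
    "virtual_ip": set(),
    "shared_storage": set(),
}


def start_order(dependencies):
    """Return resources so that each appears after everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def unrecovered(dependencies, healthy):
    """Resources that are down, or that depend (directly or not) on one that is."""
    blocked = set()
    for resource in start_order(dependencies):
        needs = dependencies.get(resource, set())
        if resource not in healthy or needs & blocked:
            blocked.add(resource)
    return blocked


if __name__ == "__main__":
    print("start order:", start_order(DEPENDENCIES))
    # Suppose the probes report everything healthy except the database.
    healthy_after_test = set(DEPENDENCIES) - {"database"}
    print("not recovered:", sorted(unrecovered(DEPENDENCIES, healthy_after_test)))
```

In this made-up scenario the check reports both the database and the SAP central instance as not recovered, reflecting the point made earlier that if the database is down, SAP is down.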

Following the initial deployment, it is recommended that failover tests be run after any change (installation of patches, introduction of new services, etc.) to the SAP environment. A worst-case scenario is a failure on the primary server where the switchover solution fails to bring SAP into service because of an administrative change that was not accounted for in the recovery process. Not surprisingly, most of the error conditions that are reported back to us result from an administrator making a change within the SAP environment and not realizing the impact on the switchover solution. Even in a static environment, a test should be run at least quarterly to ensure proper monitoring and recovery.

SteelEye Protection Suite for SAP

SIOS Technology Corp.'s SteelEye Protection Suite for SAP ensures continuous availability of the entire SAP system, assuring the high availability of clustered systems. To enable automatic system and application recovery if the system goes down, SteelEye Protection Suite for SAP allows applications to fail over to other servers in the cluster, minimizing the risk of a single point of failure. It monitors all components of the end-to-end solution and takes appropriate recovery action, taking into account the dependencies between solution components. SteelEye Protection Suite for SAP monitors system and application health, maintains client connectivity, and provides uninterrupted data access wherever clients reside.

SteelEye Protection Suite provides two critical functions for the SAP environment: SteelEye LifeKeeper provides the ability to monitor the health of all critical single point of failure components and to take an appropriate recovery action when degradation in health is detected. At user-defined intervals, LifeKeeper will check the health of the SAP CI and/or SCS, the DB, NFS, the IP addresses being used by client connections, and underlying system services.
If a problem is detected, LifeKeeper will attempt to recover the troubled resource locally on the same server. If this is not successful, a switchover to the standby server will be initiated. In this same vein, if the entire system on which these components are running should experience a failure, LifeKeeper will migrate all of the server processes to the correct standby server. Monitoring and protection are provided at both the individual component and system level.

SIOS Technology Corp.'s SteelEye Protection Suite for SAP:

- Provides high availability protection for the entire SAP NetWeaver application stack
- Automatically manages application servers
- Can optionally start and stop the Java Instance in Java-only environments
- Uses standard SAP utilities to monitor the enqueue and message server
- Performs dependency management and monitoring for the underlying database
- Fully integrates with the Protection Suite GUI for administration and monitoring
- Can easily protect other applications through Protection Suite Extender with simple scripting

For more information, visit our website at www.us.sios.com.

SIOS Technology Corp.
US/Canada 866.318.0108
Europe +44 1494 429382
Int'l +1 (650) 843-0655
2929 Campus Drive, Suite 250, San Mateo, CA 94403

© 2010 SIOS Technology Corp. All rights reserved. SIOS, SIOS Technology, LifeKeeper and SteelEye DataKeeper and associated logos are registered trademarks or trademarks of SIOS Technology Corp. and/or its affiliates in the United States and/or other countries. All other trademarks are the property of their respective owners. Dec 10 # 653