Best Practices to Ensure SAP Availability

Abstract

Ensuring the continuous availability of mission-critical systems is a high priority for corporate IT groups. This paper presents five best practices that IT administrators can implement to maximize the uptime of their SAP systems. We also introduce SIOS Technology Corp.'s SteelEye Protection Suite for SAP, the leading platform for disaster recovery and business continuity of SAP systems.

Introduction to Availability

SAP systems are the lifeblood of many businesses. Downtime of these mission-critical systems can result in lost revenue, decreased productivity, customer frustration and loss, and employee stress. A study by Infonetics Research, The Costs of Enterprise Downtime, places the cost of downtime for organizations at 3.6 percent of annual revenue. That's an astounding $18 million each year for a half-billion-dollar firm. In the ERP space, where products such as SAP provide the infrastructure for applications on which businesses run, the cost of downtime can approach $1 million per hour, according to a survey on the cost of enterprise application downtime.

At the core of all SAP systems is SAP NetWeaver, the integrated technology platform. The overall SAP system also includes database servers, network infrastructure, client access devices, etc. For administrators tasked with ensuring high availability of SAP systems, monitoring and recovery of SAP NetWeaver and all related SAP solution components is critical to avoiding single points of failure (SPoF), including the message and enqueue services.

What is high availability?

Availability is defined as the amount of time that application services are accessible to end users. Availability of a system spans every layer: the infrastructure, such as servers and storage at the hardware layer; the network; the operating system; base services such as the database; and the applications. Availability is measured as the percentage of time that the application is available to end users.
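Because availability is expressed as a percentage of time, any availability target converts directly into a yearly downtime budget. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function name is illustrative and not part of any SAP or SIOS tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


# Compare a few common availability targets.
for level in (99.0, 99.9, 99.99):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% available -> {minutes:.1f} minutes of downtime per year")
```

Under this assumption, 99.99 percent availability leaves a budget of roughly 52.6 minutes per year, while 99.9 percent already allows more than eight hours.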
An application is highly available when it approaches 99.99 percent availability, known as four 9s. Four 9s should be the goal for SAP deployments; it equates to less than one hour of downtime over an entire year.

Preventing downtime is objective number one for IT professionals. But when downtime occurs, it is important to restore the system to operation as quickly as possible. The target for this interval is known as the Recovery Time Objective (RTO). When an SAP system is down, minutes count. Organizations should work to optimize their RTO.

What Causes Downtime?

Based on extensive feedback from clients, Gartner estimates that, on average, 40 percent of unplanned application downtime is caused by application failures. Another 40 percent is caused by operator errors, and about 20 percent is due to hardware failure. The leading reasons for unplanned downtime experienced by enterprise systems, according to Gartner, are listed in Table 1.
| Type of Outage | Description | Examples |
|---|---|---|
| Application failure | An application failure brings the system to an immediate halt. | Software bugs; performance issues; application changes |
| People issues | An operator error can occur when IT staff fails to perform an operations task or executes a task incorrectly. | Loading the wrong patch; mistakenly powering down the system |
| Operating system failure | Corrupted code can cause a slowdown or complete failure of an operating system that may only be detected after the impact is felt. | Software bugs |
| Hardware failure | Any hardware component failure can have a local or system-wide impact. | Server failure; network failure; drive failure |
| Power outage, heating/cooling issues | Without proper backup provisions, a power outage or the failure of a cooling system can bring down a site quickly. | Brownout or blackout; power cable break; air conditioner failure |
| Natural disaster | An unexpected flash flood or a fire and accompanying smoke can take out a data center for an extended period of time. | Flooding from broken pipes; fire or smoke; extreme weather |

Table 1: Causes of Unplanned Downtime

IT professionals must examine all causes of downtime when designing and implementing high-availability systems. This includes not only unplanned downtime, but planned downtime as well. Planned downtime can be just as disruptive to operations, especially in follow-the-sun organizations that span the globe. The most common reason for planned downtime is system maintenance.

| Type of Outage | Description | Examples |
|---|---|---|
| Planned system changes | Planned system downtime takes place when performing routine, periodic maintenance, upgrades, or new deployments. The impact of the planned outage depends upon the scope of the changes being made. | Software release updates; bug fixes/patches; config changes; database reorganizations; import of transport requests; profile parameter changes; middleware component upgrades |

Table 2: Causes of Planned Downtime
Best Practices to Ensure SAP Availability

This paper offers five best practices that help organizations worldwide as they endeavor to build SAP high-availability and disaster recovery solutions. Implementing these practices will increase the likelihood of achieving SAP availability and RTO objectives.

1. Use clustering to eliminate Single Points of Failure

As Figure 1 illustrates, an SAP environment consists of multiple layers, including Presentation, Application, and Database. Each layer can represent a single point of failure depending on its configuration; therefore, each should be protected by a high-availability solution.

Figure 1: SAP Architecture Overview

Likewise, the SAP core itself is composed of several cooperating services that may be run on a single server or may be distributed across several servers in a cluster. Some of these services have redundancy built into them, while others do not and represent single points of failure in the SAP environment. The SAP Central Instance (Enqueue and Message server for ABAP), the SAP System Central Services (SCS) Instance (Enqueue and Message Server for Java), the Database server, and the NFS server represent single points of failure and should be protected.

Switchover is a standard technique for increasing the availability of critical SAP NetWeaver environments by clustering together a number of servers (or virtual machines) to eliminate single points of failure. A switchover mechanism ensures that the resources assigned to a node in the cluster are automatically reassigned to another node in the cluster if the first node fails. This ensures that affected resources remain available.

2. Protect all components of the SAP stack with proactive monitoring and dependency management

It is not enough to simply perform hardware monitoring for the servers on which SAP is running. Every component in the application stack must be monitored to ensure complete availability of the entire solution. SAP is a complex system with numerous dependencies, and each of those dependencies must be monitored as part of the overall high-availability solution. Examples of dependencies include IP addresses for user access to the system, or the database running under SAP. If the database is down, SAP is down. The high-availability solution must understand the dependencies between all of these components, carefully monitor them, and then take the dependencies into account during recovery and switchover. An example of an SAP resource dependency graph is shown in Figure 2.

Figure 2: Example SAP dependencies

Advanced switchover solutions constantly monitor all components of the SAP solution stack, including servers, databases, network connections, and SAP services; understand the dependencies between the components; and take automatic recovery action if a problem is detected.

3. Optimize the cluster configuration to achieve RTO requirements

The restoration time of any failed application is determined by the time required to detect the failure plus the time required to complete a recovery procedure. Time to restore is defined by the formula $T_{RESTORE} = T_{DETECT} + T_{RECOVER}$.

For SAP, detection of an outage occurs through monitoring the health of a number of components, including:

- the physical server on which SAP runs
- individual SAP services
- IP addresses used by clients to access the SAP server and associated network connections
- databases on which needed information resides

These checks should be configured to ensure best operation in your specific SAP environment. For example, you may want to vary the intervals at which heartbeats are sent between servers or the number of heartbeats
that must be missed before a system is determined to be down. You must also decide whether to allow the monitoring software to take an automatic recovery action on failure detection, or to have a system administrator alerted so that a human makes the final decision on failover. Adding an administrator notification and confirmation to the recovery process will add time to the recovery, but can prevent false failovers, especially in WAN configurations that may be subject to intermittent and short-lived network outages.

There may also be certain site-specific error conditions for which you want to optimize the detection algorithm; perhaps certain services tend to go down frequently, or you have determined that a pattern of log file entries is a precursor to SAP outages. Monitoring for these conditions should be regular and frequent so that detection and subsequent recovery are as fast as possible. The switchover solution deployed should be tunable to optimize detection and recovery for your specific configuration and RTO requirements.

4. Use switchover to eliminate downtime during planned maintenance

IT professionals view high availability as an insurance policy against unplanned downtime. But in fact, high-availability solutions may be used more often for managing and/or protecting against planned downtime. As we noted earlier, applying patches and performing upgrades are a major cause of downtime for SAP environments. This is because many organizations don't deploy clustering, so there is no way to perform an automatic failover to the backup site. Switchover can be used to manually move the SAP solution stack among servers to prevent downtime during planned maintenance.

5. Plan carefully and test the solution regularly, particularly after configuration changes

The most important of the best practices is proper solution planning, testing, deployment, and validation.
This begins at the initial decision to implement an SAP high-availability or disaster-recovery solution and continues as long as it is in production. In the planning phase, you must answer these critical questions, which have been previously discussed:

- What is the Recovery Time Objective for the solution?
- What pieces of the solution besides the SAP server and processes themselves must be monitored and recovered?
- Is a minimum set of functionality acceptable for some time following recovery and, if so, what is that subset?
- Are there site-specific error conditions that should be optimized for in the monitoring phase?
- Will automatic failover be allowed, or should administrator notification be the first recovery action?

With these requirements documented, you begin design. For disaster recovery implementations, the choice of a disaster recovery site should be the first decision made. While some organizations decide to use remote corporate offices as backup locations, many look to co-located hosting centers that provide full redundancy for power and network connectivity. An analysis of bandwidth requirements and availability must be done. Designing a sufficient connection between sites in terms of bandwidth and latency is critical to deployment success. At this stage, you should also identify and document the servers, the storage capacity and configuration, the network routing between the primary and disaster recovery sites, and the method of client redirection that will be used.

Personnel with the following skills need to be involved in the entire process, from design through final validation:

- Operating system administration
- SAP server administration
- Clustering software administration
- Network routing and troubleshooting

These skills are critical to building the end-to-end solution, and the people who have them should be involved in developing the initial design, in building the test environment, and hands-on in the deployment and subsequent validation.
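Before moving on to testing, it can help to write the planning answers down in a form that can be checked against the relation from practice 3, T_RESTORE = T_DETECT + T_RECOVER. The sketch below is one hedged way to do that. The class, its field names, and the sample numbers are hypothetical, and it assumes a simple model in which a failure is detected after a fixed number of consecutive missed heartbeats; real clustering software may detect failures differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    """Planning answers for one protected system (all fields are illustrative)."""

    rto_seconds: float                  # Recovery Time Objective agreed in planning
    heartbeat_interval_seconds: float   # time between heartbeats
    missed_heartbeats: int              # misses tolerated before a node is declared down
    restart_seconds: float              # measured time to bring the stack up on the standby
    admin_confirm_seconds: float = 0.0  # 0 for automatic failover, else time for approval

    def detect_seconds(self) -> float:
        # Assumed model: detection takes N consecutive missed heartbeats.
        return self.heartbeat_interval_seconds * self.missed_heartbeats

    def recover_seconds(self) -> float:
        # Administrator confirmation, if required, counts as part of recovery.
        return self.admin_confirm_seconds + self.restart_seconds

    def restore_seconds(self) -> float:
        # T_RESTORE = T_DETECT + T_RECOVER
        return self.detect_seconds() + self.recover_seconds()

    def meets_rto(self) -> bool:
        return self.restore_seconds() <= self.rto_seconds


# Hypothetical numbers: a 5-minute RTO, with and without a human in the loop.
automatic = RecoveryPlan(rto_seconds=300, heartbeat_interval_seconds=5,
                         missed_heartbeats=3, restart_seconds=120)
confirmed = RecoveryPlan(rto_seconds=300, heartbeat_interval_seconds=5,
                         missed_heartbeats=3, restart_seconds=120,
                         admin_confirm_seconds=600)

for name, plan in (("automatic", automatic), ("admin-confirmed", confirmed)):
    print(name, plan.restore_seconds(), "seconds, meets RTO:", plan.meets_rto())
```

With these sample values the automatic plan fits the RTO and the administrator-confirmed plan does not, which illustrates the trade-off between avoiding false failovers and recovering quickly.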
The testing phase is often difficult because properly emulating the production SAP environment within a test lab is hard. The closer you can get to a mirrored production environment within the test lab, the fewer issues will arise during deployment. In testing, you are looking to answer several questions:

- Does the failover software perform as expected on both detection and recovery?
- Does any data replication software perform as expected in terms of speed and data consistency?
- Is there any noticeable performance impact on end users from the presence of the clustering or data replication software?
- Are the various clients able to seamlessly reconnect to SAP following all switchover scenarios?

Given your assembled team of experts and a successful testing phase where you have emulated the production environment, deployment should be straightforward. Of course, you will want to schedule sufficient time for testing of recovery scenarios. Each of these tests should validate that client redirection works as planned and that all services needed for a fully functional SAP environment are recovered.
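A recovery test of this kind can be timed with a small polling loop. The sketch below is an illustration only: the probe names are hypothetical, and each probe is assumed to be a caller-supplied function that returns True once its service (for example the client IP address, the database, or an SAP service) answers again. The clock and sleep functions are parameters so the loop can be exercised without waiting in real time.

```python
import time
from typing import Callable, Mapping


def time_to_recover(
    probes: Mapping[str, Callable[[], bool]],
    timeout_seconds: float,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll every probe until all report healthy; return the elapsed seconds.

    Raises TimeoutError naming the services that are still down.
    """
    start = clock()
    while True:
        # Collect the names of all services whose probe still fails.
        down = sorted(name for name, probe in probes.items() if not probe())
        elapsed = clock() - start
        if not down:
            return elapsed
        if elapsed >= timeout_seconds:
            raise TimeoutError(
                f"still down after {elapsed:.0f}s: {', '.join(down)}"
            )
        sleep(poll_seconds)
```

In a test run, the call would start immediately after the failure is injected, with the timeout set to the RTO, so that a TimeoutError marks a failed test and the returned value can be recorded for comparison across configuration changes.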
Following the initial deployment, it is recommended that failover tests be run after any change (installation of patches, introduction of new services, etc.) to the SAP environment. A worst-case scenario is a failure on the primary server where the switchover solution fails to bring SAP into service because of an administrative change that was not accounted for in the recovery process. Not surprisingly, most of the error conditions that are reported back to us result from an administrator making a change within the SAP environment and not realizing the impact on the switchover solution. Even in a static environment, a test should be run at least quarterly to ensure proper monitoring and recovery.

SteelEye Protection Suite for SAP

SIOS Technology Corp.'s SteelEye Protection Suite for SAP ensures continuous availability of the entire SAP system, assuring the high availability of clustered systems. To enable automatic system and application recovery if the system goes down, SteelEye Protection Suite for SAP allows applications to fail over to other servers in the cluster, minimizing the risk of a single point of failure. It monitors all components of the end-to-end solution and takes appropriate recovery action, taking into account the dependencies between solution components. SteelEye Protection Suite for SAP monitors system and application health, maintains client connectivity, and provides uninterrupted data access wherever clients reside.

SteelEye Protection Suite provides two critical functions for the SAP environment: SteelEye LifeKeeper provides the ability to monitor the health of all critical single-point-of-failure components and to take an appropriate recovery action when degradation in health is detected. At user-defined intervals, LifeKeeper will check the health of the SAP CI and/or SCS, the DB, NFS, the IP addresses being used by client connections, and underlying system services.
If a problem is detected, LifeKeeper will attempt to recover the troubled resource locally on the same server. If this is not successful, a switchover to the standby server will be initiated. In the same vein, if the entire system on which these components are running should experience a failure, LifeKeeper will migrate all of the server processes to the correct standby server. Monitoring and protection are provided at both the individual component and system level.

SIOS Technology Corp.'s SteelEye Protection Suite for SAP:

- Provides high-availability protection for the entire SAP NetWeaver application stack
- Automatically manages application servers
- Can optionally start and stop the Java Instance in Java-only environments
- Uses standard SAP utilities to monitor the enqueue and message server
- Performs dependency management and monitoring for the underlying database
- Fully integrates with the Protection Suite GUI for administration and monitoring
- Can easily protect other applications through Protection Suite Extender with simple scripting

For more information, visit our website at www.us.sios.com.

SIOS Technology Corp.
US/Canada 866.318.0108
Europe +44 1494 429382
Int'l +1 (650) 843-0655
2929 Campus Drive, Suite 250, San Mateo, CA 94403

© 2010 SIOS Technology Corp. All rights reserved. SIOS, SIOS Technology, LifeKeeper and SteelEye DataKeeper and associated logos are registered trademarks or trademarks of SIOS Technology Corp. and/or its affiliates in the United States and/or other countries. All other trademarks are the property of their respective owners.

Dec 10 # 653