The Benefits of Data Replication in HACMP/PowerHA Cluster Management Implementations
Executive Summary

In today's fast-paced world, businesses both large and small face increasing internal and external demands for data protection and efficient, uninterrupted operations. Even a brief interruption in services and processes can have potentially disastrous results that businesses cannot afford to risk. IT departments are tasked with accommodating these requirements while also being expected to do more with less in a contracting economy.

Fortunately, technological advances in AIX high availability, clustering, disaster recovery, and continuous operations have continually risen to meet these challenges, ensuring that both planned outages (due to maintenance and upgrades) and unplanned outages (due to environmental conditions, operator error, or software bugs) result in minimal data loss. The HACMP high availability solution, which clusters multiple servers to shared storage, offers automatic recovery of applications and system resources if the primary server fails, thereby maintaining the highest levels of data currency in that scenario.

Nonetheless, clustering is only part of a truly resilient IT infrastructure: should the shared storage become damaged or otherwise unusable, significant disruption of business-critical applications will still occur. That is why the other essential component of a truly resilient AIX environment is data replication technology, which protects the database by maintaining a storage clone in an offsite location. This way, both the servers and the storage are redundant. Still, not all replication solutions are alike.
Ensuring Protection for the Server

With the advent of globalization and of business demand for stricter service-level agreements (SLAs) that require the highest level of availability of business-critical services and servers, high availability solutions became critical components in information systems not just for large enterprises, but also for medium-sized and small businesses that, in many ways, are even more vulnerable to system outages. While a larger enterprise may have the human, technical, and financial resources to cope with and survive an unplanned outage, smaller businesses that lack similar resources can easily be put out of business if a core IT function becomes unavailable even for a short period of time.

High availability, sometimes also referred to as fault resilience, refers to technology with which servers and business services can achieve availability in the range of 99.99% to 99.999%. High availability systems should be designed for businesses that can endure short periods of downtime; in contrast, fault tolerant systems are designed to achieve virtually continuous operation, although that level of availability requires fully redundant hardware and software components, resulting in higher solution cost.

High availability for the AIX operating system is accomplished by cost-efficiently utilizing redundant hardware and software components, together with clustering software that manages the system and is responsible for monitoring system health and performing the necessary recovery actions should a failure occur. Since there is typically sufficient system capacity available to temporarily host services (either readily or via AIX's Capacity on Demand facility), high availability clustering improves service availability not only during unplanned events, but also during scheduled maintenance.
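To make the availability range cited above concrete, the following minimal Python sketch converts an availability percentage into the yearly downtime it permits. The function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.99, 99.999):
        minutes = annual_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.99% availability corresponds to roughly 53 minutes of downtime per year, and 99.999% to roughly 5 minutes.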
Because high availability clustering products enable the administrator to handle system resources in groups, they can significantly improve system administration and change management practices, thereby contributing to higher achievable service levels and reducing administration labor expenses.

To provide high availability for AIX servers on the System p/AIX platform, IBM's high availability solution for AIX, High Availability Cluster Multi-Processing (HACMP, now called PowerHA), has traditionally been utilized. HACMP, in its base offering, is a high availability solution that provides capabilities to assist with monitoring the cluster, automatically recovering applications and system resources if a failure occurs, and easing system and cluster administration and maintenance via its Single System Image-like capabilities. HACMP provides a very mature, robust, and feature-rich environment for high availability, with built-in capabilities to support 32 nodes, complex multi-tier business application environments, and AIX's superior virtualization features. It provides protection for network resources, applications, logical volume manager (LVM) resources, and other resources that may be less commonly used but are equally important for certain environments.

HACMP's base offering is typically used in a shared-disk environment; protection against disk or disk array failure typically has to be achieved by disk mirroring technologies (RAID solutions, LVM mirroring, etc.). A typical local HACMP cluster is depicted in the following diagram:
[Figure: a local HACMP cluster. Two AIX nodes connected by an Ethernet (IP) network LAN with heartbeat monitoring, attached to a shared data store.]

High availability in the realm of a local data center was, for many years, the typical business continuity solution for AIX environments. While there were solutions available for AIX that addressed the need for disaster recovery (another major component of business continuity), those solutions were expensive, challenging to install, and difficult to administer. The past decade, however, witnessed a surge of interest in proper disaster recovery practices and solutions, driven by large as well as SMB-type businesses. This is because, for a company of any size, a failure of a data center, be it an unexpected power outage, a scheduled site maintenance project, or an environmental disaster, must not jeopardize service-level agreements; must not place unacceptable risk on the business's ability to continue serving its global market; and must not risk its ability to satisfy stringent regulatory compliance requirements such as HIPAA, Sarbanes-Oxley, Basel II, or other country-specific regulations.

It is important to mention the third and final, yet equally important, component of business continuity: continuous operations. Any high availability and disaster recovery solution has to be relatively easy for the current IT staff to use and should introduce the least possible disruption to existing IT processes; otherwise, the solution itself may become a source of disruption to the business.

Considerations for Disaster Recovery Solutions

To satisfy the emerging business need for disaster recovery for AIX environments and servers, solutions targeted toward disaster recovery have been introduced and have gained wider acceptance over the last decade. HACMP's disaster recovery family of products, branded as HACMP/XD, provides various options for disaster recovery.
These solutions typically provide automated failover capabilities for the data (controlled by HACMP's cluster management functionality) and rely on HACMP to provide availability of all the other resources necessary for the business service, such as applications and network resources. Disaster recovery solutions typically replicate data to a geographically distant remote location either synchronously or asynchronously, via either an IP-based or a proprietary connection (ESCON link, etc.).

With synchronous replication, the application's write request is considered complete only once both servers have written the data to their respective disks. While some business services do require the characteristics of synchronous replication, the downside is that the application may be slowed down, because writing over the WAN takes significantly longer than writing to the SAN or to direct-attached disks, and network bandwidth between the two sites has to be sized to handle the peak data load. These solutions are also sensitive to unexpected network use. They therefore typically require very careful network bandwidth sizing and ongoing management, as well as quasi-dedicated networks, which ensure that an unexpected network load (due to increased user workload, a network backup operation, etc.) does not interfere with the performance of critical business applications.

Asynchronous replication solutions, on the other hand, buffer data at the local site; the application's write request is complete as soon as the data is written to the local disk. The advantage of these solutions is that the application's performance is not affected noticeably, and network bandwidth can be sized to the average data load. Any excess data that cannot be replicated to the disaster recovery site immediately because of network bandwidth limitations is buffered and replicated later, as bandwidth allows.
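The sizing trade-off described above can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions: the workload figures, the link speeds, and the function name are hypothetical, and the model ignores latency, compression, and protocol overhead. It tracks how much data an asynchronous replicator would hold in its local buffer when the link is sized near the average load rather than the peak.

```python
from typing import List, Sequence


def simulate_backlog(write_rates_mbps: Sequence[float], link_mbps: float) -> List[float]:
    """Return the replication backlog (in megabits) after each one-second interval.

    Each interval adds the data written by the application and drains
    at most what the link can carry during that second.
    """
    backlog = 0.0
    history = []
    for rate in write_rates_mbps:
        backlog = max(0.0, backlog + rate - link_mbps)
        history.append(backlog)
    return history


# Hypothetical workload: a steady 20 Mbit/s with a two-second burst of 100 Mbit/s.
workload = [20, 20, 100, 100, 20, 20, 20, 20]
average_load = sum(workload) / len(workload)  # 40 Mbit/s
peak_load = max(workload)                     # 100 Mbit/s

# A link sized for the peak never builds a backlog (what synchronous replication needs).
peak_sized = simulate_backlog(workload, link_mbps=peak_load)

# A link sized just above the average buffers the burst and drains it afterwards.
average_sized = simulate_backlog(workload, link_mbps=50)

if __name__ == "__main__":
    print("average load:", average_load, "Mbit/s, peak load:", peak_load, "Mbit/s")
    print("backlog with peak-sized link:   ", peak_sized)
    print("backlog with average-sized link:", average_sized)
```

In this model the backlog is also the amount of data that has not yet reached the disaster recovery site, so its peak gives a rough sense of the recovery point exposure at that moment. The link must still exceed the average load, or the backlog never drains.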
Asynchronous replication solutions also typically cope with network outages better. For the majority of business applications and businesses, asynchronous replication solutions provide superior performance and cost efficiency while maintaining satisfactory recovery point objectives (RPOs) and recovery time objectives (RTOs).

One important requirement for replication solutions is write-order fidelity, which means that writes at the recovery site have to occur in the exact order in which they occur at the production site; otherwise, applications cannot be recovered reliably. If the replication solution experiences a network outage of any length, it still has to be able to provide a consistent image of the data with write-order fidelity.

Compared with local high availability solutions, disaster recovery solutions have unique design challenges that stem from the geographic distance and the presence of two or more copies of the data. One such consideration, discussed above, is the choice between synchronous and asynchronous replication. It is also important to weigh automated against non-automated failover options. In a local data center environment, it is relatively straightforward to detect the failure of network and other components of the IT infrastructure, and an unnecessary failover typically has limited effects on the business. These challenges become much more complex in a disaster recovery cluster. WAN outages can occur much more easily; providing redundant networks to distinguish between network failures and site failures becomes more difficult; and a failover to the disaster recovery site can have significant effects on the business: the disaster recovery site's environment may be less powerful; having the users reach the application service via the WAN may introduce slower application response times; certain client and router reconfigurations may be required to allow the clients to connect directly to the disaster recovery site; and, when the business is ready to move production back to the main site, the site fallback will introduce another outage. For these reasons, while it may seem tempting to fail over the service automatically, customers typically opt for manual disaster recovery failover.

Rolling Disasters

One additional scenario that is more likely in a disaster recovery configuration is a failure condition the industry refers to as a rolling disaster. In this case, certain components of the system start to fail gradually while replication is still occurring. Corrupt transactions are added to the database and replicated, which results in data corruption not only at the production site, but also in the disaster recovery server's data image. Eventually, the entire site fails, but by that point, the image of the data on the disaster recovery server is unusable for business purposes.

With traditional replication solutions, the only available image of the data is either the current replica (as with HACMP/XD, HAGEO, or GLVM) or a point-in-time (PIT) copy of the data, which has to be taken at predetermined intervals (e.g., Veritas Volume Replicator with FlashSnap) and therefore affects RPO targets. If none of the available copies is suitable for business purposes (e.g., the latest image was corrupted five minutes ago, but the latest PIT copy is from an hour ago), the business has to decide whether to revert to the last good copy (which may be last night's backup in many instances) or attempt to repair the available data image, both of which significantly degrade RPO and/or RTO.
In essence, a rolling disaster is a combination of a physical and a logical disaster, in the sense that the production site experiences a physical disaster and the data becomes corrupt prior to the complete failure of the production site.

Disaster recovery solutions should not only provide protection against physical and logical disasters, but should also be effective in mitigating extensive downtime due to scheduled maintenance. Such maintenance procedures include upgrading applications, the operating system, or hardware; moving servers from one location to another; performing power maintenance at the production site; and maintaining the network infrastructure, to mention just a subset of common scenarios.

Additional Data Protection with EchoStream for AIX

Traditional data replication solutions, both general solutions and the ones available for AIX, typically replicate data to the disaster recovery site. In conjunction with a clustering solution, they provide automated failover capabilities, thereby achieving a tier-7 disaster recovery solution: data is replicated to the disaster recovery site and can be used to start up business services should the production site experience a failure condition.
The disadvantage of traditional disaster recovery and data replication solutions is fourfold:

1. Either they are unable to protect against logical as well as rolling disasters, or they rely on predetermined snapshot points to provide some level of protection, with degraded recovery point and recovery time objectives. Because of this, they place a heavy burden on the administration staff to ensure proper operating procedures that result in an acceptable balance among recovery point objectives, ongoing replication performance, and recovery time objectives.

2. Solutions that have been available for the AIX platform are either difficult and expensive to configure and maintain, or do not include functionality needed by the majority of businesses, such as asynchronous replication and manual failover capability.

3. If a data replication solution relies on sector-by-sector storage hardware replication or predetermined snapshot points, it is difficult for IT to efficiently use this second set of data for other workloads, such as reporting, business intelligence, and data warehousing. In addition, backup tapes typically cannot be made from the data on the target system, which means that the backup must still be run on the production system, with planned downtime required to do so.

4. If the solution does not offer capabilities to easily create snapshots without affecting replication, then disaster recovery testing (an important component of an overall business continuity plan) either is difficult and therefore cannot be performed sufficiently often, or may degrade recovery point and recovery time objectives if replication is affected.

Going Beyond Traditional Recovery

EchoStream for AIX from Vision Solutions is an innovative disaster recovery solution that addresses the disadvantages discussed above.
It is an asynchronous, IP-based disaster recovery solution; hence, it is able to utilize network bandwidth efficiently, without noticeably impacting application performance. Due to its continuous data protection (CDP) capabilities, it is able to assist not only with unplanned physical disasters, but also with the far more common logical disasters, as well as rolling disasters. Its virtual snapshot capability assists in ensuring disaster recovery readiness, and its easy disaster recovery capability enables a quick manual switch to the disaster recovery site.

Protecting the Data as Well as the Server with Real-Time Data Replication

While Vision Solutions offers a tier-7 disaster recovery solution by giving customers the option to combine clustering and replication products, many enterprise-level customers have chosen to combine EchoStream's replication and disaster recovery characteristics with HACMP's feature-rich environment. The benefits of this combined system architecture are many:

HACMP's mature, feature-rich, robust capabilities are utilized for automated high availability within the main data center. Should a localized failure condition occur, HACMP can recover critical system resources.
HACMP's Single System Image (SSI) type capabilities greatly reduce system administration time and resource requirements.

EchoStream provides data replication and disaster recovery capabilities to one or more disaster recovery sites. Due to EchoStream's flexible replication capabilities, cascaded or star-like replication topologies can also be configured.

EchoStream replicates data asynchronously to the disaster recovery site through an IP-based connection, either with or without data compression. Only changes are replicated, thereby ensuring efficient bandwidth utilization. By compressing the communication flow, customers typically achieve five to six times the effective network bandwidth.

EchoStream provides single-click manual failover capability to the disaster recovery site. As discussed above, under most circumstances, disaster recovery failover is not an automated process, and automating it would introduce unacceptable risks.

After a disaster recovery failover due to either a planned or an unplanned outage, EchoStream allows for a very network-efficient resynchronization process, requiring only the changes that occurred after the failover to be synchronized back to the original production site. Once the business is ready to fail the production application back to the original production site, the failback procedure is similarly simple.

EchoStream's true CDP feature, which not only replicates changes to the disaster recovery site, but also tracks and stores each change in buffers as it occurs, can be utilized to quickly restore data and to recover from logical disasters. In other words, you can recover objects from any point in time should an object be deleted or otherwise corrupted.

EchoStream's virtual snapshot capability can be utilized for a wide variety of business uses, ranging from offloading backup procedures, through data retrieval, to business report generation.
The benefits of this capability are discussed in more detail later in this paper.

Since EchoStream is a software-based replication solution that runs on the AIX server, it does not require major changes to existing data center operating procedures. It can run on any underlying storage solution, in either a heterogeneous or a homogeneous storage environment, thereby allowing for gradual introduction and full utilization of existing capital investments. Similar and dissimilar storage options are accommodated. Businesses can leverage their existing storage and SAN investment, including hardware, software, and staff knowledge. This allows businesses to grow their storage size and performance with business needs and helps them avoid vendor lock-in.

Replication for HACMP System Architectures

In its simplest case, this configuration consists of two HACMP nodes at the production site and one server at the disaster recovery site, as depicted in the diagram below. The servers could be either standalone AIX servers or logical partitions (virtual servers) running on the same physical AIX server. The HACMP cluster can be either an existing one, in which case EchoStream would be added for disaster recovery purposes, or a new cluster that requires superior local high availability and disaster recovery characteristics.
[Figure: Local HA with Remote Replication. Two AIX nodes in an HACMP cluster, with EchoStream for AIX replicating to an offsite AIX server.]

In the configuration depicted above, HACMP is utilized in a shared-disk configuration on two nodes to make business services highly available. HACMP can be used in either a hot standby or a mutual takeover configuration. An EchoStream context (EchoStream's replication group, which can contain several logical volumes among which EchoStream maintains write-order fidelity) is made part of the HACMP resource group.

During regular operations, one of the HACMP nodes hosts the resource group, and replication occurs from that server to the recovery server at the disaster recovery site. If there is a failure condition on the server hosting the resource group, or if the administrator initiates a resource group movement to the other HACMP node, the resource group is taken over. As part of the failover procedure, EchoStream is stopped on the server originally hosting the resource group and is then started on the server that takes over the resource group. From the recovery server's perspective, this is merely a short suspension in replication.

If the site fails or the administrator moves the production service to the disaster recovery site, the disaster recovery failover is achieved by issuing an EchoStream command that brings up EchoStream and the file system at the disaster recovery site, after which the application can be started, resulting in a short RTO.

Extending Replication to Include True Continuous Data Protection (CDP)

As discussed above, EchoStream provides its data replication and disaster recovery capabilities by utilizing what is referred to as "true continuous data protection" (true CDP) technology. In essence, this technology is similar to transaction logging in that each write I/O is buffered in the form of redo and undo logs and can later be used to reconstruct earlier images of the data, allowing for advanced any-point-in-time data recovery.
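The general idea of undo and redo journaling can be sketched in a few lines of Python. This is a toy model for illustration only, not EchoStream's implementation: the class names, the block-level granularity, and the integer timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class JournalEntry:
    timestamp: int          # logical time of the write
    block: int              # block number that was written
    undo: Optional[bytes]   # contents before the write (None if the block was new)
    redo: bytes             # contents after the write


class JournaledVolume:
    """Toy volume that records an undo/redo entry for every write."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}
        self.journal: List[JournalEntry] = []

    def write(self, timestamp: int, block: int, data: bytes) -> None:
        # Entries are appended in write order, which preserves write-order fidelity.
        if self.journal and timestamp < self.journal[-1].timestamp:
            raise ValueError("writes must be journaled in order")
        self.journal.append(JournalEntry(timestamp, block, self.blocks.get(block), data))
        self.blocks[block] = data

    def image_at(self, timestamp: int) -> Dict[int, bytes]:
        """Rebuild the image as of `timestamp` by undoing newer writes."""
        image = dict(self.blocks)
        for entry in reversed(self.journal):
            if entry.timestamp <= timestamp:
                break
            if entry.undo is None:
                image.pop(entry.block, None)
            else:
                image[entry.block] = entry.undo
        return image

    def replay_to(self, timestamp: int) -> Dict[int, bytes]:
        """Rebuild the same image by redoing writes onto an empty volume."""
        image: Dict[int, bytes] = {}
        for entry in self.journal:
            if entry.timestamp > timestamp:
                break
            image[entry.block] = entry.redo
        return image


volume = JournaledVolume()
volume.write(1, 0, b"orders-v1")
volume.write(2, 1, b"customers-v1")
volume.write(3, 0, b"CORRUPTED")       # a bad write noticed only afterwards

last_good = volume.image_at(2)          # recovery point chosen after the fact
```

Because every write is journaled, the recovery point (here, time 2) is picked after the corruption is discovered; nothing had to be scheduled in advance. Rolling back from the current image and replaying forward from an empty one yield the same result in this model, provided the journal is complete.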
Since changes are continuously tracked and buffered as they occur, a significant advantage EchoStream has over other, more traditional replication products is that the recovery point can be chosen after a problem occurs, rather than having to rely on placing snapshot points before each major operation. This greatly improves recovery point and recovery time objectives not only during physical disasters, but also during logical disasters. Rather than having to restore the previous night's backup in order to recover an accidentally deleted file, a logical, virtual image of the data can be reconstructed within minutes, from which the deleted file can then be recovered. This same process can be extended to databases and to deleted records and tables.

During physical disasters, the recovery point objective (RPO) essentially becomes a continuum. Rather than having only the latest image of the data available, which may be unusable from a business perspective even if it is consistent at the file system and database levels, the business can now evaluate whether an earlier recovery point might be more suitable. Criteria for that decision might include the need to bring the AIX server in sync with auxiliary systems that have worse RPO characteristics; recovery points that are significant to business processes (e.g., recovering to the middle of end-of-day processing may not be desirable); or the occurrence of a rolling disaster that corrupted the latest recovery point. This evaluation is easily performed with virtual snapshots, which are discussed in greater detail below.

Why Is the Inclusion of True CDP Superior to Replication Without CDP?

As discussed above, EchoStream's true CDP functionality adds capabilities to all aspects of both disaster recovery and business continuity, capabilities that are necessary to satisfactorily address the unique challenges that occur in disaster recovery environments.
Perhaps EchoStream's most-utilized feature is its snapshot capability, which lets companies achieve better utilization of their recovery server. EchoStream replicates data to the disaster recovery server while journaling data updates as they occur. The journals are buffered on the recovery server and can be archived to tape media to provide an expanded recovery window (which may be mandated by regulatory compliance requirements).

Snapshots are taken on the recovery server, which removes any risk that snapshots could impose on the production server. On the recovery server, the administrator can create either read-only or read/write-capable virtual snapshots, which can then be utilized for a variety of purposes (listed below). EchoStream uses copy-on-write snapshot technology, which allows for very disk-space-efficient creation of snapshots, with very low disk-use overhead (typically on the order of a few percent of the protected data set's size). It is important to note that, while snapshots are in use on the recovery server, replication continues uninterrupted, so the business is not exposed to degraded recovery point or recovery time objectives.
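The reason copy-on-write snapshots need so little extra disk space can be seen in a minimal Python sketch. It is a generic illustration of the copy-on-write technique with hypothetical class names, not a description of EchoStream's internals, and it models only a read-only snapshot.

```python
from typing import Dict, List, Optional


class Snapshot:
    """Read-only view of a volume as it was when the snapshot was taken."""

    def __init__(self, volume: "Volume") -> None:
        self._volume = volume
        # Holds original contents only for blocks overwritten since the snapshot.
        self._saved: Dict[int, Optional[bytes]] = {}

    def preserve(self, block: int) -> None:
        # Copy the old contents once, just before the first overwrite.
        if block not in self._saved:
            self._saved[block] = self._volume.blocks.get(block)

    def read(self, block: int) -> Optional[bytes]:
        if block in self._saved:
            return self._saved[block]
        # Unchanged blocks are read straight from the live volume.
        return self._volume.blocks.get(block)

    @property
    def blocks_copied(self) -> int:
        return len(self._saved)


class Volume:
    """Toy block volume that supports copy-on-write snapshots."""

    def __init__(self, blocks: Dict[int, bytes]) -> None:
        self.blocks = dict(blocks)
        self.snapshots: List[Snapshot] = []

    def snapshot(self) -> Snapshot:
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, block: int, data: bytes) -> None:
        for snap in self.snapshots:
            snap.preserve(block)
        self.blocks[block] = data


volume = Volume({0: b"jan", 1: b"feb", 2: b"mar"})
snap = volume.snapshot()       # instantaneous: no data is copied yet
volume.write(1, b"FEB-v2")     # replication keeps updating the live volume
volume.write(1, b"FEB-v3")     # a second overwrite copies nothing further
```

After these writes, the snapshot still returns the original contents of block 1 while having copied only one of the three blocks. The space a snapshot consumes grows with the amount of data changed during its lifetime, not with the size of the protected data set.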
Snapshots can then be used for a variety of purposes:

1. Most importantly, the combination of snapshots and CDP allows you to easily and resource-efficiently recover data without impacting production. If there is data corruption (caused by users, administration, or an application), you can easily reverse it by creating a snapshot of an earlier point in time, performing any necessary investigation and validation, and then reapplying the data onto the production server, in most cases without affecting production and the users of the system.

2. One of the most common ways to use snapshots is to offload the tape backup procedure from the production server to the recovery server, thereby completely eliminating the need for a "backup window." A snapshot can be created on the recovery server.

3. A snapshot is an effective way to manage your reporting, business intelligence, and data mining requirements.

4. With a snapshot, you can perform non-intrusive disaster recovery readiness testing to ensure that service-level agreements are met.

5. When you need to test application upgrades or software patches before rolling them out to production, a snapshot ensures the success of the process by providing a place to return to if anything goes awry.

6. A snapshot is useful for creating an isolated "sandbox" training system for new employees, which can both minimize employee ramp-up time and ensure that practice activities and mistakes do not impact actual production systems.

Summary

In today's competitive economy, data and service availability is crucial to a business's survival. Inefficiency cannot be tolerated. Hardware has become increasingly dependable, but unplanned outages caused by physical disasters, logical disasters, and rolling disasters still happen. And even planned outages for maintenance and upgrades can have a negative effect on business performance.
In the case of unplanned outages, HACMP software for System p/AIX provides monitoring, failure detection, and automated application recovery to help protect business-critical applications, and the businesses that rely on them, from failing. And during planned outages, the HACMP solution can transfer applications and data to backup systems so that users still have access. A reliable, cost-effective IT infrastructure that keeps a business running 24x7 is no longer a luxury. It's a necessity.
Easy. Affordable. Innovative. Vision Solutions.

Vision Solutions, Inc. is the world's leading provider of high availability, disaster recovery, and data management solutions for the IBM System i and System p markets. With a portfolio that spans the industry's most innovative and trusted HA brands, Vision's itera, MIMIX, and OMS/ODS keep business-critical information continuously protected and available. Complementing Vision's availability offerings, Vision Director delivers a highly integrated set of applications that proactively monitors, manages, and optimizes System i servers, databases, and application environments to help ensure the continued health of System i servers. Affordable and easy to use, Vision products help to ensure business continuity, increase productivity, reduce operating costs, and satisfy compliance requirements. Vision also offers advanced cluster management, data management, and systems management solutions, and provides support for i5/OS, Windows, and AIX operating environments. As IBM's largest high availability Premier Business Partner, Vision Solutions oversees a global network of business partners and services and certified support professionals to help our customers achieve their business goals. Privately held by Thoma Cressey Bravo, Inc., Vision Solutions is headquartered in Irvine, California with offices worldwide. For more information call 801-799-0300 or toll free at 800-957-4511, or visit.

15300 Barranca Parkway, Irvine, California 92618

Copyright 2010, Vision Solutions. IBM and System i are trademarks of International Business Machines Corporation. WP_ReplicationHACMP_E_1005