Availability Digest. What is Active/Active? October 2006



Similar documents
Availability Digest. MySQL Clusters Go Active/Active. December 2006

Connect Converge / Converged Infrastructure

Why Your Business Continuity Plan May be Inadequate

Availability Digest. Banks Use Synchronous Replication for Zero RPO February 2010

Eliminating End User and Application Downtime:

Companies are moving more and more IT services and

HA / DR Jargon Buster High Availability / Disaster Recovery

High Availability and Disaster Recovery for Exchange Servers Through a Mailbox Replication Approach

DISASTER RECOVERY BUSINESS CONTINUITY DISASTER AVOIDANCE STRATEGIES

Leveraging Virtualization for Disaster Recovery in Your Growing Business

Four Steps to Disaster Recovery and Business Continuity using iscsi

Disaster Recovery Solutions for Oracle Database Standard Edition RAC. A Dbvisit White Paper

Windows Server Failover Clustering April 2010

Informix Dynamic Server May Availability Solutions with Informix Dynamic Server 11

Disaster Recovery for Oracle Database

Five Secrets to SQL Server Availability

Shadowbase Data Replication Solutions. William Holenstein Senior Manager of Product Delivery Shadowbase Products Group

High Availability Database Solutions. for PostgreSQL & Postgres Plus

Availability Digest. Payment Authorization A Journey from DR to Active/Active December, 2007

Neverfail for Windows Applications June 2010

Windows Geo-Clustering: SQL Server

TABLE OF CONTENTS THE SHAREPOINT MVP GUIDE TO ACHIEVING HIGH AVAILABILITY FOR SHAREPOINT DATA. Introduction. Examining Third-Party Replication Models

The Economic Impact Of Zero Data Loss. White Paper

Leveraging Virtualization in Data Centers

Exchange Data Protection: To the DAG and Beyond. Whitepaper by Brien Posey

Cisco Nexus 1000V and Cisco Nexus 1110 Virtual Services Appliance (VSA) across data centers

Developing a Complete RTO/RPO Strategy for Your Virtualized Environment

Availability Digest. Real-Time Fraud Detection December 2009

Protecting Microsoft SQL Server

Availability Digest. Backup Is More Than Backing Up May 2009

Synchronous Replication of Remote Storage

be architected pool of servers reliability and

The Impact Of The WAN On Disaster Recovery Capabilities A commissioned study conducted by Forrester Consulting on behalf of F5 Networks

IT Service Management

Ingres Replicated High Availability Cluster

Performance Monitoring AlwaysOn Availability Groups. Anthony E. Nocentino

Code Subsidiary Document No. 0007: Business Continuity Management. September 2015

courtesy of F5 NETWORKS New Technologies For Disaster Recovery/Business Continuity overview f5 networks P

Shared Machine Room / Service Opportunities. Bruce Campbell November, 2011

Business Continuity: Choosing the Right Technology Solution

DISASTER RECOVERY PLANNING GUIDE

Perforce Backup Strategy & Disaster Recovery at National Instruments

Westek Technology Snapshot and HA iscsi Replication Suite

ETM System SIP Trunk Support Technical Discussion

Database Resilience at ISPs. High-Availability. White Paper

EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi

Informatica MDM High Availability Solution

Tk20 Backup Procedure

Data Replication and the Clearinghouse S Trading System

Contents. SnapComms Data Protection Recommendations

Availability Digest. Stratus Avance Brings Availability to the Edge February 2009

Availability Digest. Cellular Provider Goes Active/Active for Prepaid Calls September 2008

Surround SCM Backup and Disaster Recovery Solutions

High Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper

HP StorageWorks Data Protection Strategy brief

DeltaV Virtualization High Availability and Disaster Recovery

Federated Application Centric Infrastructure (ACI) Fabrics for Dual Data Center Deployments

Server Clusters : Geographically Dispersed Clusters For Windows 2000 and Windows Server 2003

Cloud Computing Disaster Recovery (DR)

Technical Considerations in a Windows Server Environment

High Availability Solutions for the MariaDB and MySQL Database

A SWOT ANALYSIS ON CISCO HIGH AVAILABILITY VIRTUALIZATION CLUSTERS DISASTER RECOVERY PLAN

Eliminate SQL Server Downtime Even for maintenance

The Shift Cloud Computing Brings to Disaster Recovery

Planning and Implementing Disaster Recovery for DICOM Medical Images

Reduce your downtime to the minimum with a multi-data centre concept

Availability Digest. TANDsoft FileSync Adds Deduplication February 2014

HRG Assessment: Stratus everrun Enterprise

Rajesh Gupta Best Practices for SAP BusinessObjects Backup & Recovery Including High Availability and Disaster Recovery Session #2747

Cloud Backup and Recovery

PROTECTING MICROSOFT SQL SERVER TM

Enterprise Key Management: A Strategic Approach ENTERPRISE KEY MANAGEMENT A SRATEGIC APPROACH. White Paper February

Real-time Data Replication

Avaya Aura Virtualized Environment

Real-time Protection for Hyper-V

Whitepaper. 10 Reasons to Move to the Cloud

Synchronous Data Replication

A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES

The Microsoft Large Mailbox Vision

SQL Server AlwaysOn Deep Dive for SharePoint Administrators

Unitrends Backup & Recovery Solutions and Disaster Recovery Best Practices

High Availability and Disaster Recovery Solutions for Perforce

Planning for the Worst SAS Grid Manager and Disaster Recovery

Connectivity. Alliance Access 7.0. Database Recovery. Information Paper

IBM TotalStorage IBM TotalStorage Virtual Tape Server

Disaster Recovery: The Only Choice for Mission-Critical Operations and Business Continuity

Abhi Rathinavelu Foster School of Business

A Study on Cloud Computing Disaster Recovery

MEDITECH Disaster Recovery

Virtual Disaster Recovery

The Weill Cornell Medical College and Graduate School of Medical Sciences. Responsible Department: Information Technologies and Services (ITS)

Creating A Highly Available Database Solution

VERITAS Volume Replicator in an Oracle Environment

System Migrations Without Business Downtime. An Executive Overview

WHITE PAPER: TECHNICAL. Enhancing Microsoft SQL Server 2005 Availability with Veritas Storage Foundation for Windows

Delivering Fat-Free CDP with Delphix. Using Database Virtualization for Continuous Data Protection without Storage Bloat.

DISASTER RECOVERY STRATEGIES FOR ORACLE ON EMC STORAGE CUSTOMERS Oracle Data Guard and EMC RecoverPoint Comparison

Multi-Datacenter Replication

Migration and Disaster Recovery Underground in the NEC / Iron Mountain National Data Center with the RackWare Management Module

High Availability & Disaster Recovery Development Project. Concepts, Design and Implementation

Transcription:

the Availability Digest What is Active/Active? October 2006 It is a fundamental fact that any system can and will fail at some point. The secret to achieving extreme availabilities is to let it fail, but fix it fast. This is the premise behind active/active. If a service outage is too short for to notice, it will not be perceived as a service outage. In effect, service availability has been maintained. From a high-level perspective, active/active architectures accomplish fast recovery by distributing the user over multiple independent and geographically distributed processing nodes. Should a node fail, the of that node are switched to one or more surviving nodes in the network. This switchover can often be done in subseconds taking no more time than the resubmission of a failed transaction, an otherwise common occurrence. 1 Active/Active Architectures An active/active system is a network of independent processing nodes, each having access to a common replicated such that all nodes can participate in a common. redundant network redundant network An Active/Active System with Network-Attached Storage An Active/Active System with Direct-Attached Storage The network comprises two or more processing nodes connected to a redundant communications network. Any user can be connected to any node over this network. The nodes all have access to two or more copies of the. The copies may be connected directly to the network, or they may each be directly attached to one of the nodes. 1 Much more information concerning active/active architectures may be found in the book entitled Breaking the Availability Barrier: Survivable Systems for Enterprise Computing, by Dr. Bill Highleyman, Paul J. Holenstein, and Dr. Bruce Holenstein, published by AuthorHouse; 2004. 1

The copies are kept in synchronism by ensuring that any change to one copy is immediately propagated to the other copies. There are many techniques for doing this, such as network transactions or. These techniques are described later in this article. Should a node or its attached fail, the connected to that node have lost their services. In this case, they are switched to another node or are distributed across multiple surviving nodes to immediately restore service. Users at other nodes are unaffected. (See Do You Know Where Your Train Is? for a description of one effective way to switch.) Providing that the nodes and copies are geographically distributed, active/active systems provide disaster tolerance. Should a disaster take out a node or a site, there are others in the network to take its place. There are many ways to keep two or more copies in synchronization. These include network transactions, asynchronous, and synchronous. Whatever the technique, one result is mandatory. The copies must always maintain referential integrity so that each can be used actively by any copy. Referential integrity typically means that transactions initiated by a node be committed in the same order at any of the copies. In some cases, this requirement may extend to the updates within a transaction. This is known as preserving the natural flow of all updates. Network Transactions Using network transactions, the scope of a transaction includes all copies. This results in each copy of a item across the network being locked before any copy is updated. As a result, all copies are kept in exact synchronism. One problem with network transactions is that each lock request and each update must individually flow across the network and a completion response received. In widely dispersed systems, such a round trip could take tens of milliseconds; and performance can be seriously affected. This is a clear example of the compromise between availability and performance. 2. update 2. update 1. start transaction 3. commit Network Transaction Asynchronous Replication An asynchronous extracts changes made to its source from some sort of a change queue (such as a change log or an audit trail) and sends these changes to a. This is done under the covers with respect to the, which is therefore unaware of the activity. Consequently, there is no performance impact on the. source change queue 2. send change Asynchronous Replication 2

However, there is a delay between the time that a change is applied to the source and the time that it is subsequently applied to the. This delay is known as latency and typically ranges from hundreds of milliseconds to a few seconds. As a consequence, should a node fail, there will likely be transactions that, having been applied to the source, are still in the pipeline at the time of the failure. These transactions never make it to the and, in effect, are lost. 2. send changes Data Collision Another issue with asynchronous is collisions. If a item is updated on each of two copies within the latency time, each will be replicated to the other copy and will overwrite the original update to that copy. The two copies are now different, and both are wrong. There are several techniques for avoiding collisions or for detecting and resolving them. These techniques will be discussed in later articles. Synchronous Replication With synchronous, changes are replicated to the via asynchronous but are held there and used only to lock the affected 6. commit? items. When the source node is ready to commit the transaction, it checks 5. yes with all copies to ensure that they have been able to obtain locks on all 4. ready? 2. send change of the affected items. It does this by source change sending a query behind the last update queue over the channel. If all copies are ready, the source node instructs them to apply the updates Synchronous Replication and to release their locks. Otherwise, all copies are instructed to abort the transaction. 2 Synchronous, like network transactions, guarantees that all copies will be in exact synchronism (as opposed to asynchronous, which keeps the copies in near-synchronism because of latency). Thus, no transactions are lost as the result of a failure; and collisions cannot occur. In this case, the is delayed as it waits for the commit to complete across the network. This is called latency. However, this delay compares to network transaction delays, which must wait for each update as well as the commit to complete across the network. As a consequence, synchronous is generally more efficient if copies are widely distributed or if transactions are large. Network transactions may be more efficient for collocated copies and short transactions. 2 See Holenstein, B. D., Holenstein, P. J., Strickler, G. E., Collision Avoidance in Bidirectional Data Replication, United States Patent 6,662,196; December 9, 2003. 3

Other Advantages of Active/Active There are many other advantages that an active/active architecture brings: Elimination of planned downtime: A node can be upgraded by simply switching its to other nodes. The node then can be brought down and its hardware, operating system,, or s upgraded and tested. At this point, the node can be returned to service by returning its to it. This technique effectively eliminates planned downtime. Data locality: As compared to an active/backup configuration in which the backup system is not processing transactions, an active/active configuration can be distributed to provide locality. Users can be connected to their nearest respective node, thus improving performance. Use of all purchased capacity: There is no idle backup system sitting around in an active/active system. Therefore, in a multinode active/active architecture, less capacity may need to be purchased. For instance, in a five-node configuration, if each node can carry 25% of the load, full capacity is provided even in the event of a node failure. However, only 125% of required capacity must be purchased rather than 200% for an active/backup system. Online capacity expansion: Capacity easily can be added by installing a new node and then switching some to the new node. Load balancing: The load across the network can be rebalanced by moving from a heavily loaded node to lightly loaded nodes. Risk-free failover testing: Failover testing can be risk-free and not require any user downtime. In an active/backup system, it may take hours to fail over, during which time the are denied service. The same downtime impact occurs when are switched back to the primary node following failover testing. Furthermore, what if the backup node turns out to be nonoperational? In an active/active system, it is known that the other nodes are working; and failover takes seconds at most. Lights-out operation: In a multinode active/active network, it may not be necessary to have every site staffed since should a node fail, the system continues operating anyway. Time to recover the failed node is less time-critical, especially if service can still be provided even after a second node failure. RPO and RTO When deciding upon which active/active architecture to use, one has to consider two important corporate goals, RPO and RTO. RPO is the recovery point objective. It specifies the amount of that can be lost in the event of a failure. Tape backups can lose hours of (as well as having hours of recovery time). Asynchronous can lose seconds of, depending upon how fast the is (some have latencies measured in subsecond times). Synchronous and network transactions provide a zero RPO. RTO is the recovery time objective. It states the tolerance of the operation to an outage. Some s can be down for minutes or hours without disastrous consequences. 4

For other s, downtime of minutes or even seconds can be unacceptable. The primary advantage of active/active systems is that they provide essentially a zero RTO. Active/Active Issues There are several important active/active system issues that must be understood and resolved. How will user switching be handled? Can it be done automatically? Is there a chance of collisions? If so, how will they be handled? Are lost transactions acceptable following a node failure? To what extent? Can they be recovered? Can the s be run in an active/active environment? Do they require modification? What performance impact can be tolerated when implementing an active/active architecture? What additional cost can be tolerated? This may include hardware, software licenses, networking, people, and sites. In Summary There is always a fundamental compromise between availability, cost, and performance. Active/active technology can be tailored to a particular to optimize this compromise. 5