Continuous Availability Suite

Neverfail's Continuous Availability Suite is at the core of every Neverfail solution. It provides a comprehensive software solution for High Availability (HA) and Disaster Recovery (DR) through continuous data protection without the need for any specialized hardware. In this technical paper we outline how this is achieved from an architectural design and operational perspective.

Introduction

Neverfail provides software that protects companies against the impact of IT outages. An effective business continuity strategy understands the business impact of server downtime and focuses resources on the critical processes that the business depends on. We call this Continuous Availability. The Neverfail approach is to ensure that key Windows-based business applications remain available to your organization 24x7 through planned or unplanned downtime.

Neverfail Logical Architecture and the Continuous Availability Suite

The Neverfail Continuous Availability Suite comprises several interrelated concepts and components that work together to provide a high level of availability and protection from a broad range of failures. Drawing on years of experience and a mature set of software tools, the Neverfail solution applies the best ideas from a number of availability concepts. By combining elements of data replication, server clustering, network management, and application-specific monitoring of key performance indicators, Neverfail provides assurance against downtime and great value in the same solution.
The Neverfail logical architecture (Figure 1) outlines the key logical components of a single instance of the Neverfail Continuous Availability Suite. We describe each component of this architecture below.
Figure 1: The Neverfail Logical Architecture

SCOPE

The first component to consider in the suite is SCOPE, an acronym for Server Check, Optimization, and Performance Evaluation. It is both a software tool and a process that ensures the success of a Neverfail product implementation by providing current, accurate, and complete information about the server environment. The SCOPE tool and process provide detailed information about the current running state of your server environment and recommendations for optimizing your servers before installing Neverfail's products.

Heartbeat

At the core, Neverfail's Heartbeat component orchestrates operations and manages communications between servers. It performs the complex core product functions of replicating data to and from other Neverfail instances at the Windows kernel level while live applications are running in the operating system above. Heartbeat also manages coordinated failovers, switchovers, and switchbacks between the various servers in a Neverfail Cluster, synchronizing activity as required between Active and Passive instances.

Application Management Framework

The other major component of Neverfail is the Application Management Framework (AMF). The AMF is responsible for real-time detection of faults, discovery of changes in the state of any protected application, management of inter-dependencies, and live registration and de-registration of new protected applications through specific business application modules or third-party adapters.
Neverfail business application modules encapsulate information about how best to protect a specific application, including inter-dependencies between related services and registry entries. Because the AMF knows the state of each business application module (including the state of any associated resources such as services or registry entries), it can be configured to manage interdependencies between applications as well.

As an example, the Microsoft Exchange application module may detect that while emails are being sent correctly, none are being received by the protected server, and therefore agreed service levels are not being met. In this event Neverfail can be set to raise an alert automatically through any of several notification methods, or even to switch over to another instance of Exchange. However, switching Exchange over to another instance may affect the communications latency requirement between Exchange and a related server such as a BlackBerry Enterprise Server (BES). So even though the BES is operating correctly, we may wish to switch this application over at the same time in a coordinated fashion.

Finally, the AMF can be customized to protect any crash-consistent Windows application. Using a generic business application module, Neverfail can monitor and manage the state of any application's related Windows services. Custom tasks can also be implemented to provide application-specific monitoring of key performance indicators. This means we can extend the protection afforded by Neverfail beyond that supported by standard business application modules, as long as the applications concerned meet certain restartability conditions.
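The coordinated switchover in the Exchange and BES example can be pictured as a walk over declared application links. The following is a minimal sketch only, not Neverfail's implementation: the function name `coordinated_switchover_set`, the `links` mapping, and the application names are all hypothetical, and the assumption is simply that every application linked (directly or indirectly) to the one being switched over should move with it.

```python
from collections import deque


def coordinated_switchover_set(trigger_app, links):
    """Return every application that should switch over with trigger_app.

    `links` is a hypothetical mapping from an application name to the
    applications whose placement must follow it (for example, because of
    a latency requirement between them).
    """
    selected = {trigger_app}
    to_visit = deque([trigger_app])
    while to_visit:
        app = to_visit.popleft()
        for peer in links.get(app, ()):
            if peer not in selected:
                selected.add(peer)
                to_visit.append(peer)
    return selected


# Hypothetical declaration: Exchange and BES must stay close to each other.
links = {"Exchange": ["BES"], "BES": ["Exchange"], "FileServer": []}

print(sorted(coordinated_switchover_set("Exchange", links)))
```

A breadth-first walk is used so that chains of links (A to B, B to C) are followed without visiting any application twice.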
Continuous Availability Director

The Neverfail Continuous Availability Director (CAD) is a presentation interface that allows multiple instances of Heartbeat to be monitored and managed from a single location. This client application can be run locally on any member of a Neverfail group, or on any desktop or other remote host.

Given the public name of the protected server, the Continuous Availability Director connects to the Heartbeat instance on that server and provides visibility of the primary and secondary (and tertiary, if configured) servers in that Neverfail cluster. The role, status, and state of each server are readily available on the overview page. Additional tabs provide detailed visibility into the state of the application, the network, the data, and the replication processes. Changes to the configuration on any of these tabs can be made from within the CAD without disrupting the replication process.

Furthermore, unique application groups may be defined in the Continuous Availability Director for the purpose of both monitoring and managing more complex applications. When application groups are defined within the CAD, warnings and alerts from any member server roll up to the group level. The status of all groups is visible simultaneously within the CAD, so issues within the protected infrastructure are quickly identified across the enterprise. Additionally, application groups can optionally be configured to switch over to a remote DR location as a collection. By coordinating the switchover of the entire application group, Neverfail reduces the complexity and the dependencies related to moving an application to a DR site.
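The roll-up of member warnings and alerts to the group level can be illustrated with a small sketch. It assumes, purely for illustration, three status names ordered by severity and a rule that a group reports the most severe status of any member; the names `SEVERITY` and `group_status` and the server names are hypothetical and are not taken from the CAD.

```python
# Hypothetical severity ordering; the CAD's real status values are not
# described in this paper.
SEVERITY = {"ok": 0, "warning": 1, "alert": 2}


def group_status(member_statuses):
    """Return the most severe status reported by any member server.

    `member_statuses` maps a server name to one of the SEVERITY keys.
    An empty group is reported as "ok".
    """
    if not member_statuses:
        return "ok"
    return max(member_statuses.values(), key=SEVERITY.__getitem__)


mail_group = {"mail01": "ok", "bes01": "warning", "sql01": "ok"}
print(group_status(mail_group))
```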
WANSmart

Lastly, Neverfail offers an optional WAN optimization feature called WANSmart. While the core product includes basic data compression over a wide area network, WANSmart goes further by implementing an on-the-fly data deduplication algorithm to dramatically reduce the amount of bandwidth required to support replication. Similar to many hardware-based WAN optimization solutions, WANSmart works in software between the local and remote servers to reduce the amount of data that needs to be sent to the remote site. When servers have a high change rate on their protected data, WANSmart can help to minimize the cost of the connection between the datacenters and help to ensure that changes arrive at the DR site faster than if the data were sent in an uncompressed state.

Levels of Protection Provided

Neverfail can provide five levels of protection to ensure that end-user clients remain connected in the event of a failure:

1. Server Protection

Neverfail continues to provide availability to end-user clients in the event of a hardware failure or operating system crash. When deployed, Neverfail monitors the active server by sending "I'm alive" messages from the first passive server to the active server, which reciprocates with an acknowledgement over a network connection referred to as the Neverfail Channel. If the passive server detects that the process or heartbeat has failed, it can initiate a failover. A failover occurs when the passive server detects that the active server is no longer responding. This can be because of hardware failure or loss of network connectivity on the active server.
Rather than the active server being gracefully closed, it has been deemed to have failed and requires no further operations. In a failover, the passive server is brought up immediately to take on the role of the active server. The mechanics of failover are discussed later.

2. Application Protection

The Neverfail instance running on the active server locally monitors the applications and services it has been configured to protect through business application modules. If a protected application should fail, Neverfail first tries to restart the application on the active server. If the restart fails, Neverfail can initiate a switchover. A switchover gracefully closes down any protected applications that are running on the active server and restarts them on the passive server, along with the application or service that caused the failure. The mechanics of switchover are discussed in more detail later in this series.

3. Network Protection

Neverfail proactively monitors the ability of the active server to communicate with the rest of the network by polling, at regular intervals, up to three defined nodes around the network, such as the default network gateway, the primary DNS server, and the Global Catalog server. If all three nodes fail to respond (for example, if a network card or local switch fails), Neverfail can gracefully switch the roles of the active and passive servers (referred to as a switchover), allowing the previously passive server to assume a network identity identical to that of the previously active server. After the switchover, the newly active server continues to service the clients.
4. Performance Protection

Neverfail proactively monitors system performance attributes to ensure that your protected applications are actually operational and providing service to your end users, and that the performance of those applications is adequate for the needs of those users. Neverfail business application modules define these monitoring and pre-emptive repair capabilities. They allow the application framework to monitor application services and key performance indicators to ensure that protected applications are truly operational and not in a hung or stopped state. Pre-defined rules and adjustable thresholds allow Neverfail to monitor specific application attributes to ensure that they remain within normal operating ranges. Rules can be enabled or disabled as desired, and can be set to trigger specific corrective actions whenever these attributes fall outside their respective ranges.

5. Data Protection

Neverfail ensures that the data files that applications or users require in the application environment are made available should a failure occur. Neverfail can be configured to protect files, folders, and even specific registry settings of the active server by mirroring them in real time to the passive servers. This means that if a failover occurs, all files that were protected on the failed server remain available to users after the failover, on the server that has assumed the active role.
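The responses described for server, application, and network protection can be summarized as a single decision function. This is a simplified sketch under stated assumptions, not the product's logic: it merges observations that in practice come from different servers (the passive server notices missed heartbeats, the active server monitors its own applications and network), and the `Observation` fields, the `decide` function, and the 30-second timeout are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART_APPLICATION = "restart application"
    SWITCHOVER = "switchover"  # graceful role swap
    FAILOVER = "failover"      # active server deemed failed


@dataclass
class Observation:
    seconds_since_heartbeat: float  # time since the last acknowledgement
    application_running: bool       # protected application state
    restart_attempted: bool         # has a local restart already failed?
    reachable_nodes: int            # monitored network nodes that responded
    monitored_nodes: int            # monitored network nodes configured


def decide(obs, heartbeat_timeout=30.0):
    """Pick a response; the timeout value is an illustrative assumption."""
    # Server protection: the active server has stopped responding.
    if obs.seconds_since_heartbeat > heartbeat_timeout:
        return Action.FAILOVER
    # Network protection: every monitored node failed to respond.
    if obs.monitored_nodes > 0 and obs.reachable_nodes == 0:
        return Action.SWITCHOVER
    # Application protection: restart locally first, then switch over.
    if not obs.application_running:
        if obs.restart_attempted:
            return Action.SWITCHOVER
        return Action.RESTART_APPLICATION
    return Action.NONE


healthy = Observation(1.0, True, False, 3, 3)
print(decide(healthy).value)
```

Ordering the checks from most to least severe keeps the sketch deterministic when several conditions hold at once; whether the product prioritizes them this way is not stated in this paper.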
Neverfail provides all five protection levels simultaneously to ensure that all facets of the user environment are maintained at all times and that the network (the Principal Public Network) continues to operate through as many failure scenarios as possible.

Architectural Concepts

Neverfail describes the organization of servers in terms of clusters, cluster status, and relationships between clusters. Neverfail refers to a cluster of two servers as a Neverfail Pair and a cluster of three servers as a Neverfail Tertiary configuration. Installing Neverfail assigns an Identity (Primary, Secondary, or Tertiary) to each server and results in a Neverfail Pair or a Neverfail Tertiary. Note that a Neverfail cluster can include machines that also participate in a VMware or Microsoft cluster.

Each server is assigned both an Identity (Primary, Secondary, or Tertiary) and a Role (Active or Passive). Identity describes the physical instance of the server, while Role describes what the server is doing. Once an Identity is assigned to a server it normally does not change, whereas the Role of the server changes depending on the operation the server is performing, as determined by Neverfail.

In its simplest form, Neverfail operates as a Neverfail Pair with one server performing an active role (normally the Primary server) while the other server performs a passive role (normally the Secondary server). The server in the active role provides application services to users and serves as the source for replication, while the server in the passive role serves as the standby server and target for replicated data. This configuration supports replication of data between the active and passive servers over the Neverfail Channel.

When deployed as a pair, Neverfail can be deployed for either high availability (HA) using a high-speed LAN connection, or disaster recovery (DR) using a lower-bandwidth
WAN connection where bandwidth optimization may be required. When deployed in the Tertiary configuration, Neverfail provides HA of protected applications to end users via the LAN, and simultaneously provides DR with a third server located at a remote site, via a WAN.

Monitoring & Managing Availability

Neverfail's Continuous Availability Director (CAD) contains tools that allow you to monitor and manage multiple instances of Neverfail. The CAD provides a single point of management, control, and protection for business-critical applications and IT services, allowing you to take a business-centric view of business-critical applications, servers, IT services, and their interdependencies.

The Continuous Availability Director is started by invoking the Heartbeat management client. The first screen presented is the Heartbeat Servers Overview screen, which shows all of the instances of Heartbeat running in your organization.

Figure 2: Heartbeat Servers (Overview) Screen

After the management client is running, the navigation panel on the left allows you to view and select the groups and cluster connections that can be managed within the CAD. Selecting a group or cluster in the navigation panel points the CAD to that group or cluster, and the client then shows information related only to the selected group or cluster.

Figure 3: Continuous Availability Director Management Client
Selecting a Neverfail cluster in the navigation panel of the management client shows an overview screen for that cluster. The overview provides status information on the protected applications, network, file system, and registry. The first screen to open by default is the Server Summary. This allows you to view the roles that servers are performing (active or passive), the actions that the servers are currently performing, and summary information on the status of communications and data replication between servers.

Figure 4: Server Summary Screen

The lower panel displays status information for each server in the cluster. To change the currently displayed server, click a server in the graphical representation in the upper panel, or select the server Identity tab (Primary Server, Secondary Server, or Tertiary Server) in the bottom panel.

This graphical interface to monitoring and management significantly reduces the ongoing burden of protecting business-critical applications by providing centralized monitoring, management, and configuration of all Neverfail-protected servers and applications. The CAD also simplifies and expedites root cause analysis across multiple systems, applications, services, office locations, etc.

There are many other screens that offer a rich set of monitoring and management functions. While it is beyond the scope of this paper to cover the complete operability of the Continuous Availability Director, the included administration guide provides detailed instructions on every facet of the application.
Creating a shared-identity secondary server through day-zero cloning

The Neverfail Continuous Availability Suite provides an extremely high level of uptime through the deployment of a warm standby server that can take over the active role for the protected application. While it is a completely separate and unique server, this
secondary server shares the same system identity as the primary protected server. The Windows hostname, address space, security identifiers, and installed applications are all completely identical between these two servers.

Neverfail's Continuous Availability Suite is designed to run on either two or three connected Windows servers that are in effect clones of one another. Each cloned server can operate independently or in concert with the others, continuously synchronizing changes to the protected data and application state information. However, from an end user's point of view, only one of these servers can be seen as active at any one time. The other servers are referred to as being passive.

In order to create this shared-identity secondary server, Neverfail employs a unique cloning process that copies the system state and applications from the active primary server and applies them to an empty Windows server of the same operating system level. The result is a simplified installation of the stand-by servers, as there is no need to install and configure the protected application and any required ancillary applications on the new stand-by server. Because the secondary server starts off with an installed operating system, there is no need for similarity of hardware between the active and stand-by servers.

While the primary and secondary servers do share the same identity and the same applications, these two machines are in fact two separate servers.
This is different from simply having a complete copy of a protected server, as is common with many fast-backup solutions and virtualization technologies. The main reason for having a true secondary server is that Neverfail provides protection from issues that arise within the operating system or the application itself. Should the protected server suffer a failure within the operating system, or should the application itself fail to continue operations, the stand-by server can quickly be made active with zero loss of data while the issue is being resolved on the original server.

Isolating stand-by servers with a network packet filter driver

While the stand-by servers in a Neverfail pair or tertiary do share the same Windows identity, and sometimes the same IP address, as the protected Primary server, only the single active server is ever visible to the public network. Neverfail implements a network packet filter technology to isolate the stand-by server or servers and prevent duplicate server names or addresses from appearing on the network.

The network filter driver is a selective filter. Rather than blocking all traffic from passing through the NIC, the driver blocks all NetBIOS traffic and all protected TCP/IP traffic. While in most cases this amounts to 100% of the configured NIC address space, it is possible to allow unique additional IP addresses to pass through the filter driver to provide access in and out of a given host even when it is in a passive state.

In addition to the public NIC that is used when a given server is active, a Neverfail pair also uses a second network connection referred to as the Neverfail Channel. The Neverfail Channel is a crucial component of the architecture and is configured to provide dedicated communications between servers. When deployed in a pair
configuration, the Primary and Secondary servers each require at least two network interface cards (NICs): one NIC for the Principal (Public) Network connection, and at least one NIC for the Neverfail Channel connection.

Figure 5: Neverfail Pair Communications (diagram labels: Primary Server (Active), Secondary Server (Passive), Neverfail Channel, LAN / WAN)

In a tertiary configuration, each server has a Neverfail Channel connection to each other server in the cluster. This architecture minimizes the impact of a failure of any one server or network component. If a channel connection between two servers becomes compromised, communication and replication can automatically resume using the remaining connections. Figure 6 illustrates communications between servers in a tertiary configuration.

Figure 6: Neverfail Tertiary Configuration (diagram labels: Primary Server (Active), Secondary Server (Passive), Tertiary Server (Passive), Neverfail Channel, LAN / WAN, WAN)

In either configuration, whether there is one stand-by server or two, and whether a stand-by server is local or remote, the Windows identity and primary IP address of that server are isolated from the public network until needed. From an administrative perspective, any server in the Neverfail cluster can be accessed through the Neverfail Channel, through a defined management address, or through any kind of console access technology.
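The selective behavior of the packet filter described above can be sketched as a per-packet decision. This is an illustration only and not the driver's actual logic, which runs in the Windows network stack: the function `passes_filter`, its parameters, and the example addresses are hypothetical. NetBIOS is identified here by its well-known ports (137, 138, and 139), and addresses come from the 192.0.2.0/24 documentation range.

```python
import ipaddress

# Well-known NetBIOS ports: name service, datagram service, session service.
NETBIOS_PORTS = frozenset({137, 138, 139})


def passes_filter(role, local_address, port, protected_addresses):
    """Return True if traffic on local_address:port may pass the filter.

    role                -- "active" or "passive"
    protected_addresses -- set of ipaddress objects shared with the
                           active server (hidden while passive)
    """
    if role == "active":
        return True  # the active server is fully visible
    if port in NETBIOS_PORTS:
        return False  # never expose the shared Windows name while passive
    # Any address outside the protected set (for example, a unique
    # management address) is allowed through even when passive.
    return ipaddress.ip_address(local_address) not in protected_addresses


protected = {ipaddress.ip_address("192.0.2.10")}  # shared public address

print(passes_filter("passive", "192.0.2.10", 443, protected))  # shared address
print(passes_filter("passive", "192.0.2.50", 3389, protected))  # management address
```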
Combined HA & DR through tertiary server support

A Neverfail pair may be deployed with the secondary server either local or remote relative to the protected primary server. With the secondary server local to the primary, this is considered to be a high availability (HA) configuration, typically providing a fully automated failover to an adjacent server in the event of a single-host failure. With the secondary server remote from the primary server, this is considered to be a disaster recovery (DR) configuration, typically providing push-button failover to a remote datacenter in the event of either a host failure or a complete site failure at the primary datacenter.

Neverfail provides the ability to create a combined HA & DR solution through a tertiary support model. In this tertiary configuration, the primary server is protected by a local secondary stand-by server and a remote tertiary stand-by server simultaneously.

The concepts and configuration in a tertiary configuration are the same as they are in a simple pair. All three machines share the same identity, and those machines on the same subnet also share the same public IP address. An additional NIC is required in each machine in order to facilitate the Neverfail Channel connections to each of the other machines in the cluster.

Failover in a tertiary configuration is often configured to provide automatic local failover and push-button remote failover. When performing a planned switchover from the Continuous Availability Director, the operator is given the option of performing a switchover to either the secondary or the tertiary machine in the cluster.
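The distinction between a fixed Identity and a changing Role, and the operator's choice of switchover target in a tertiary cluster, can be modeled in a few lines. The sketch below is an assumption-laden illustration rather than Neverfail's data model: the `Server` class and `switchover` function are hypothetical names, and the only rule enforced is that exactly one server is active after a switchover.

```python
from dataclasses import dataclass


@dataclass
class Server:
    identity: str  # "Primary", "Secondary", or "Tertiary"; fixed at install
    role: str      # "active" or "passive"; changes during operation


def switchover(cluster, target_identity):
    """Make the server with target_identity active and all others passive."""
    if target_identity not in {server.identity for server in cluster}:
        raise ValueError(f"no server with identity {target_identity!r}")
    for server in cluster:
        server.role = "active" if server.identity == target_identity else "passive"


cluster = [
    Server("Primary", "active"),
    Server("Secondary", "passive"),
    Server("Tertiary", "passive"),
]

# Planned switchover to the remote DR server: roles move, identities do not.
switchover(cluster, "Tertiary")
print([(server.identity, server.role) for server in cluster])
```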
Reducing hardware expense through virtualization

Protecting applications on physical servers or data on SAN arrays has traditionally required an investment in similar hardware for HA/DR failover. While making significant investments in production hardware is easy to justify, making a similar justification for secondary hardware is often more challenging. For a cost-effective implementation, reducing the amount of secondary hardware required is always desirable. Whether that means replicating to a smaller number of physical machines or replicating to less expensive hardware components, Neverfail provides that level of flexibility.

At the server level, Neverfail's architecture dictates that there will be at least one stand-by server for each protected primary server. While Neverfail's solution works well for protecting stand-alone physical servers, it is not a requirement to have a similar physical server as the stand-by secondary. Neverfail is completely hardware agnostic and runs equally well on both physical and virtual machines. Furthermore, Neverfail's unique cloning process makes moving from a physical platform in production to a virtual platform for DR a breeze.

From a storage perspective, the only requirement is that the logical drives within the operating system are aligned between the servers. If the protected data resides on the E: and F: drives in production, changes to that data will be replicated to the E: and F: drives on the stand-by server as well. While the production machine may have a dedicated direct-attached storage array or be connected to SAN storage, the stand-by machine may have a single internal SATA drive, or mirrored internal SATA drives, with three logical drive partitions.
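The drive-alignment requirement lends itself to a simple pre-deployment check. The sketch below is hypothetical and is not part of SCOPE or any Neverfail tool: `missing_drives` and the example paths are illustrative names, and the check only compares drive letters, not capacity or file system.

```python
from pathlib import PureWindowsPath


def missing_drives(protected_paths, standby_drives):
    """Return drive letters used by protected data but absent on the stand-by.

    protected_paths -- Windows paths of protected files or folders
    standby_drives  -- logical drives present on the stand-by, e.g. "E:"
    """
    needed = {PureWindowsPath(path).drive.upper() for path in protected_paths}
    available = {drive.upper() for drive in standby_drives}
    return sorted(needed - available)


# Production keeps data on E: and F:; this stand-by only has C: and E:.
protected = [r"E:\Exchange\Mailbox", r"F:\Exchange\Logs"]
print(missing_drives(protected, ["C:", "E:"]))
```

`PureWindowsPath` is used so the check parses Windows-style paths correctly on any platform where the script happens to run.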
With a shared-nothing architecture and support for P2V failover, implementing a complete HA/DR solution should require minimal investment in secondary hardware.

For an example of cost reductions in a DR environment, consider a complex business application running on five or more physical servers, each attached to dedicated drives in a SAN array. This entire environment can be protected by five secondary virtual machines running on a single hypervisor host using direct-attached SATA drives. The cost savings in hardware alone are significant, and may even justify considering a tertiary protection solution. By adding one more hypervisor host, combined with Neverfail's tertiary model, this complex application can be protected with local HA secondary servers and remote DR tertiary servers simultaneously.

Conclusion

Neverfail's continuous availability solution is unique in that it can protect businesses from a broad range of impacts to application availability. It protects critical applications against failures of physical server hardware, network infrastructure, the operating system, and the applications themselves.

The stand-by server architecture, whether deployed in a pair or in a tertiary model, ensures that a warm server is ready to take over the active role should any problem arise. Since these servers share the same identity as the protected servers, few or no changes to the infrastructure are required to support the switchover. Neverfail's unique cloning process also simplifies deployment of the secondary servers by handling not only the system state but the installed applications as well.
By supporting applications from within the operating system, Neverfail has the visibility and capability to monitor and manage the application directly. Neverfail also easily supports mixed physical and virtual environments across different vendors and with dissimilar hardware configurations.

Through the Continuous Availability Director, administrators can manage HA and DR deployments throughout the entire enterprise, regardless of where they're deployed. If a problem occurs, Neverfail can take a variety of pre-emptive, corrective actions, including fully coordinated failover of all components within the ecosystem. The net result is the elimination of end-user downtime and continuous availability for business applications.

All rights reserved. Neverfail is a trademark of Neverfail Group Limited. All other trademarks are trademarks of their respective companies. No part of this publication may be reproduced, transmitted, transcribed, or translated into any language or computer language, in any form or by any means, without prior express, written consent of Neverfail Group Limited.

Neverfail, Inc.
5914 West Courtyard Drive, Suite 160B
Austin, TX 78730
Tel: 512 327 5777 | info@neverfailgroup.com | www.neverfailgroup.com