White Paper Overview

This paper examines the history of supporting high-availability network environments. By examining first- and second-generation availability solutions, we can learn from the past and understand why third-generation clustering technology is quickly becoming an essential requirement in modern enterprise data centers.

Learn about:
- The limitations of first- and second-generation availability solutions
- Third-generation clustering technology: how it works and how it benefits enterprise productivity and profitability
- Clustering as part of Array Networks' integrated AFE enterprise data center appliances
Introduction

With the expectations of Internet users for content, quality, and performance constantly taxing the resources of even the largest Web infrastructures, downtime isn't an option. Estimates of hundreds of millions of dollars in lost value have demonstrated that web site and application downtime can be as damaging as a successful hacker attack. For Web infrastructure to deliver on its promise of increased productivity and profitability, it must be impervious to failure. Accomplishing this feat, however, is a non-trivial task that typically requires a large investment. Redundant connectivity, redundant hardware, and a well-designed architecture take time, highly qualified (and thus highly paid) IT staff, and, of course, a lot of hardware.

This paper begins with a quick look at the history of providing high availability on the Web. By examining first- and second-generation solutions, we learn from past mistakes before proceeding to our proposed solution: providing high availability using third-generation Application Front End (AFE) appliances.

A Brief History

First-generation Web infrastructure had a simple goal: just make it work. At the time, hosting options, Web server choices, and networking options were limited at best. Typical networks carried light enough traffic that a single Unix server could easily handle the load. The problem with many of these networks was that they provided neither a scalable nor a highly available architecture. To solve the first round of availability problems, a DNS-based solution was put into place: a single hostname would point to multiple IP addresses, and clients would cycle through the choices in a round-robin fashion, thereby distributing the load across multiple servers. This technique is aptly called Round-Robin DNS.
The problem with this technique is that should a host become unavailable for whatever reason, DNS would still return its IP address to clients (see Figure 1). In a two-server scenario where both IP addresses were listed for a single hostname, users would experience a 50% failure rate when connecting. Given that many Web sites were hobby-oriented and non-commercial, this was an acceptable drawback. But as commercial Web sites and applications started to appear, each expected to contribute millions in profits and productivity gains, a better solution was required.
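The 50% failure rate described above is easy to see in a short simulation. This is an illustrative sketch only (the failure model and server list are assumptions, not Array code): DNS keeps handing out a dead server's address, so every other connection attempt fails.

```python
import itertools

# Round-Robin DNS: one hostname maps to several A records,
# and clients simply cycle through them.
records = ["200.200.200.1", "200.200.200.2"]

# Simulate one server being down; DNS keeps returning its address anyway.
down = {"200.200.200.2"}

def connect(ip):
    """Stand-in for a TCP connect attempt: succeeds only if the host is up."""
    return ip not in down

# 100 clients cycling through the two records in round-robin order.
attempts = [connect(ip) for ip, _ in zip(itertools.cycle(records), range(100))]
failure_rate = attempts.count(False) / len(attempts)
print(failure_rate)  # 0.5 -- half of all connection attempts fail
```

With two records and one dead host, exactly half the attempts land on the unavailable server, matching the 50% failure rate noted above.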
Figure 1: Round-Robin DNS entries for www.example.com with two servers

    www.example.com.    IN  A  200.200.200.1
                        IN  A  200.200.200.2

The second generation of web sites turned to Server Load Balancers (SLBs) to solve two problems with one stone: providing both a highly available Web site and a scalable infrastructure. In the simplest model, server load balancers are proxy servers ("virtual servers") sitting in front of the Web servers ("real servers"). The load balancer accepts all of the incoming connections for the Web site on a single advertised IP address and then opens connections to the Web servers, spreading the connections across all of the servers so that no one server is over-taxed (see Figure 2).

Figure 2: Server Load Balancer for www.example.com using three web servers

What makes this technique well suited to high availability environments is that the load balancer is able to monitor the health of all of the servers it balances for. If a server becomes unavailable, the load balancer can react intelligently and stop sending new connections to it. The load is automatically redistributed, so the site remains up. Once the server is repaired and brought back online, the load balancer notices and again allows new connections to reach it.

Unfortunately, the honeymoon for this technology was short. As Web infrastructure gained the ability to scale to literally thousands of Web servers, other problems became apparent. To solve these problems, task-specific appliances such as SSL accelerators and reverse proxy caches became commonplace, each adding latency. Furthermore, sites began to experience "Noah's Ark Syndrome": the need to buy everything in pairs for high availability. Among the second-generation load balancers themselves, only 1+1 redundancy is available. The solution is to look toward the third generation.
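The health-monitoring behavior described above can be sketched in a few lines. This is a toy model under assumed names (the `LoadBalancer` class and server names are illustrative, not a real SLB, which performs its checks over the network): unhealthy servers are skipped, and a repaired server simply rejoins the rotation.

```python
class LoadBalancer:
    """Toy server load balancer: round-robins new connections across
    healthy real servers and skips any that fail a health check."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(servers)
        self._next = 0

    def mark_down(self, server):
        self.healthy.discard(server)   # stop sending new connections to it

    def mark_up(self, server):
        self.healthy.add(server)       # repaired server rejoins the pool

    def pick(self):
        # Cycle through the pool, returning the next healthy server.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")

lb = LoadBalancer(["web1", "web2", "web3"])
lb.mark_down("web2")                    # health check fails for web2
picks = [lb.pick() for _ in range(4)]
print(picks)  # ['web1', 'web3', 'web1', 'web3'] -- load redistributed
```

Unlike Round-Robin DNS, the failed server's address is never handed to a client, so users see no failures while web2 is down.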
High Availability Using Third-Generation Appliances

Looking at a networking rack afflicted with Noah's Ark Syndrome, what a third-generation appliance should be becomes apparent: a system that is well integrated (to eliminate Noah's Ark Syndrome), high performance (to keep up with the Internet's growth), and capable of N+1 clustering. As of this writing, the leading vendor in the 3G space is Array Networks. Array Networks AFE appliances integrate the key features needed in the enterprise data center into a single device: clustering, application firewall, SLB, GSLB, caching, CDD, and SSL acceleration. For this paper, we will focus on the clustering feature.

How It Works

Perhaps the best way to demonstrate how N+1 clustering works, and the benefits it provides, is to walk through setting up clustering on an Array AFE appliance. Let us examine the process step by step:

1. Configure each Array in the cluster with all of the virtual IPs that are to be hosted by the cluster. Each virtual IP is assigned a priority on each unit. The resulting matrix should look something like Figure 3.

Figure 3: Matrix showing each unit's virtual IP priority

                  VIP 200.200.200.1   200.200.200.2   200.200.200.3
    Array Unit 1      100             70              50
    Array Unit 2      60              90              40
    Array Unit 3      20              30              80

2. Make certain that the outside interfaces are able to communicate with one another via multicast. The upstream router should also be able to communicate with all three units on all three virtual IPs. The inside interfaces on each unit should be able to communicate with all of the real servers represented by all three virtual IPs. A simple network configuration is shown in Figure 4.
Figure 4: A highly available network using an Array AFE

When the Array AFEs start, they use the layer 2 switch on their outside interfaces to send multicast packets to one another. As each unit discovers the others, they share their clustering configurations. Once all of the units have each other's clustering configuration, they can examine the matrix of virtual IP priorities. Using this information, each unit finds the virtual IPs for which it holds the highest priority value. The unit with the highest priority for a virtual IP becomes its master; the other two units become its backups.

During normal operation, each unit sends a heartbeat packet to the other units on the outside interface. This packet informs the other units that the sender is okay and is capable of servicing requests to the virtual IPs for which it has the highest priority value. Should a unit become unavailable (e.g., it is taken down to help manage the load at another site), the other units will notice that a heartbeat packet has not arrived within a configurable period of time, and the unit with the next highest priority for each affected virtual IP is automatically elected as the new master. For example, if Unit 1, the master for virtual IP 200.200.200.1, becomes unavailable, Unit 2 becomes master for 200.200.200.1 since it has a priority value of 60. If Unit 1 were brought back into the cluster, it would take back master status for 200.200.200.1 because its priority of 100 is greater than Unit 2's priority of 60 (see Figure 3).
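The election just described reduces to a simple rule over the priority matrix in Figure 3: for each virtual IP, the live unit with the highest priority is master. A minimal sketch, using the Figure 3 values (the function name is illustrative; a real cluster exchanges this state via multicast heartbeats rather than a shared in-memory table):

```python
# Priority matrix from Figure 3: priorities[unit][vip]
priorities = {
    "Unit 1": {"200.200.200.1": 100, "200.200.200.2": 70, "200.200.200.3": 50},
    "Unit 2": {"200.200.200.1": 60,  "200.200.200.2": 90, "200.200.200.3": 40},
    "Unit 3": {"200.200.200.1": 20,  "200.200.200.2": 30, "200.200.200.3": 80},
}

def elect_masters(priorities, alive):
    """For each VIP, the live unit with the highest priority becomes master."""
    vips = next(iter(priorities.values())).keys()
    return {
        vip: max(alive, key=lambda unit: priorities[unit][vip])
        for vip in vips
    }

# Normal operation: every unit's heartbeat is arriving.
print(elect_masters(priorities, ["Unit 1", "Unit 2", "Unit 3"]))
# {'200.200.200.1': 'Unit 1', '200.200.200.2': 'Unit 2', '200.200.200.3': 'Unit 3'}

# Unit 1's heartbeat stops: Unit 2 (priority 60) takes over 200.200.200.1.
print(elect_masters(priorities, ["Unit 2", "Unit 3"]))
# {'200.200.200.1': 'Unit 2', '200.200.200.2': 'Unit 2', '200.200.200.3': 'Unit 3'}
```

Re-running the election with Unit 1 restored to the alive set returns 200.200.200.1 to Unit 1, matching the take-back behavior described above.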
It should be noted that each virtual IP's master/backup status is managed individually. Just because Unit 1 is master for 200.200.200.1 does not mean it is master for all of the virtual IPs in the cluster. In this example, Unit 2 is the master for 200.200.200.2 and Unit 3 is the master for 200.200.200.3. If Unit 3 were to fail, Unit 1 would take over 200.200.200.3 since its priority value of 50 is greater than Unit 2's priority value of 40 for this virtual IP.

The N+1 Win

One observation about the second generation of high availability tools was that they offered only 1+1 redundancy: one active unit taking the load and one hot spare ready to go live for a given virtual IP. As a site grows and its units load-balance more virtual IPs, the 1+1 model becomes a limiting factor. To overcome this limitation, third-generation architectures have moved to N+1 redundancy: to achieve redundancy for n appliances (n being an arbitrary number), you need n+1 appliances. The key here is n: as your site grows, you are no longer limited to hosting your virtual IPs on a single unit. Instead, you can spread them over n units, giving you substantially greater capacity.

Technically minded observers will notice that being able to scale the same platform to n units translates into a higher level of redundancy, a lesson third-generation appliance makers learned from RAID technology. Simple 1+1 mirroring is fine when both units are working, but should one become unavailable, leaving the lifeblood of your company running on a single unit is an incredibly stressful situation. N+1 technology, such as RAID 5, has clearly demonstrated its superiority. The business case is just as noteworthy: by using a third-generation appliance capable of scaling to match your infrastructure's growth, you protect your investment, since older units need not be thrown away.
Even better are the lowered IT costs that come with rolling out technologies with which your staff is already familiar.

Conclusion

High availability has been a notoriously complex topic. Hopefully this paper has made it easier to understand and to visualize implemented in your own network. With third-generation appliances providing high availability through clustering along with tight feature integration, it simply doesn't make sense to build your business's web site without clustering in mind.
In examining third-generation solutions, we have seen that Array AFEs accomplish everything a network engineer could want on a data center wish list. Even better, the business justifications for moving to Array AFEs for high availability are as clear as it gets: less network confusion, less rack space, and less power consumption all translate into lower operational costs through smaller IT staffing and smaller co-location bills. Combining the technical and business reasons, using third-generation solutions (such as those from Array Networks) for high availability is an obvious choice.

About Array Networks

Array Networks is a world leader in secure application acceleration and deployment appliances for global enterprises. Built upon the Array SpeedStack(TM) technology, Array's unified secure content access solutions enable industry-leading performance, integration, scalability, and ease of implementation and management. Headquartered in Milpitas, California, with sales offices in the U.S., Europe, Asia Pacific, and Latin America, Array engineers and manufactures its products in Silicon Valley and sells them through direct and indirect channels across the globe.

Array Networks, Inc.
www.arraynetworks.net