Scalability and Availability for a Web Farm
CHAPTER 3

IN THIS CHAPTER
Understanding Availability
Understanding Scalability
Scaling a Web Farm

one nine, or .9999, which is four nines. A four nine site is 99.99% available. The formula for calculating this ratio is

100 - (Unscheduled Downtime / Period of Time) × 100

For example, if a site is down for one hour in a day, the availability of that site in one day is 100 - (1/24) × 100 = 95.83%. If a site is down one hour in a week, the availability is 100 - (60/10080) × 100 = 99.40%.

Is 100% Availability Realistic?

For a Web farm to obtain 100% availability, it cannot have any unscheduled downtime. This should always be the goal of any Web farm, but it is unlikely to be met. From the availability formula, a four nine site can have about 3,154 seconds (roughly 53 minutes) of downtime in a year, long enough to replace a network cable! This availability comes at considerable cost in redundant systems, and most businesses accept one or two nines as the goal. Table 3.1 shows some availability measurements.

TABLE 3.1 Availability Measurements

Percent Available    Downtime Per Year
90%                  36.5 Days
95%                  18.25 Days
98%                  7.3 Days
99%                  3.65 Days
99.9%                8.76 Hours
99.99%               52.56 Minutes
99.999%              315 Seconds
99.9999%             31.5 Seconds

Assessing a Web Farm's Availability

Availability measurements are useful only if an organization uses them properly. A successful strategy for considering availability statistics is to break the measurements into categories. These categories will differ for each Web farm, but basic categories include server availability, application availability, and network availability. Because these categories are independent of one another, each measurement can be considered separately. This separation provides hard numbers for where money and resources should be spent to improve availability. If network availability is low, it doesn't make sense to invest time and effort in solving application availability problems.

All complex systems fail. Any failure can affect a Web farm's availability rating if the failure has no contingency.
Realistic availability goals are achievable through redundancy.
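As a quick check on the availability formula and the figures in Table 3.1, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not part of the original chapter: the function names are hypothetical, and a year is assumed to be 365 days.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def availability_percent(unscheduled_downtime, period):
    """Apply the chapter's formula: 100 - (downtime / period) * 100.

    Both arguments must use the same unit (hours, minutes, ...).
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return 100 - (unscheduled_downtime / period) * 100


def downtime_seconds_per_year(percent_available):
    """Invert the formula: seconds of downtime a yearly rating allows."""
    return (1 - percent_available / 100) * SECONDS_PER_YEAR


if __name__ == "__main__":
    # One hour down in a day, then one hour (60 minutes) down in a week.
    print(round(availability_percent(1, 24), 2))
    print(round(availability_percent(60, 10080), 2))
    # Yearly downtime budget for each rating listed in Table 3.1.
    for rating in (90, 95, 98, 99, 99.9, 99.99, 99.999, 99.9999):
        print(rating, round(downtime_seconds_per_year(rating), 1), "seconds")
```

Working in seconds keeps every row of the table in one unit; dividing by 86,400 or 3,600 recovers the days and hours shown in the table.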

Deploying and Managing Microsoft .NET Web Farms

Redundancy is the first line of defense against any site failure. In hardware systems, redundancy comes from complete copies of critical components. For example, a system with a redundant CPU has a spare CPU that is not used unless the primary CPU fails. If this failure occurs, the spare CPU takes over, enabling the system to continue processing without missing an instruction. With software systems like SQL Server 2000, redundancy comes from a cluster of servers, each capable of owning the database of the others. Redundancy eliminates single points of failure in complex systems. Each availability area has different ways to solve this problem.

Monitoring is the final step in assessing the availability of a Web farm. Without monitoring, it is not possible to determine when a component fails. Each area of a Web farm has different methodologies for error reporting.

Understanding Network Availability

The majority of network systems in a Web farm are physical. When building a network for high availability, redundancy must be considered at every level. Every point in the network should have a backup. From the connection to the Internet to the routers that move traffic throughout the farm, each level must be considered.

Not every system in a Web farm network has to be redundant, however, and a cost-benefit analysis is the best way to determine at which points to build in redundancy. For example, Foo.com has a front-end router that handles traffic from the Internet. If this router fails, the site is down. However, router failures are rare, and Foo.com decides to take a calculated risk and not have a cold spare sitting unused. Foo.com accepts this downtime, if it occurs, because it has a four-hour response time agreement with the router manufacturer. However, Foo.com decides that having redundant connections to the Internet is important, so it spends the extra money to have two connections to the Internet at all times.
Each business will make the same decisions differently.

The most common area of failure in any network system is at the cable level. It is usually cost prohibitive to have redundant network cables between every point in a network. Luckily, it is relatively easy to diagnose and replace a faulty cable.

Monitoring the network for failures is best handled by vendor-specific tools. Tools exist today that take the pulse of the entire network, ensuring that each connection is functioning correctly. When a failure does occur, these tools can alert an administrator to the problem.

NOTE
More information on building Web farm networks is found in Chapter 4, "Planning a Web Farm Network."

Understanding Server Availability

Single points of failure in server hardware include CPUs, hard drives, motherboards, and network cards. Hardware redundancy is purely a cost issue. Servers with three levels of redundancy for every internal system are available on the market today, for a price. Some businesses will invest considerable dollars to achieve complete hardware redundancy.

With .NET Enterprise Servers it is possible to create redundant configurations without the need for extreme hardware redundancy. The .NET platform eliminates single points of failure by providing simple software solutions so that multiple servers can handle the same task. With .NET, an administrator can designate two or more servers to provide redundancy and increased availability at the server level. This availability creates a natural pathway to scalability for a Web farm.

Monitoring server availability is best accomplished with the server vendor's tools. The available monitoring software should be a factor in the decision to standardize on a particular hardware platform. These centralized monitoring tools can inform an administrator when any critical component of a server fails, from the CPU to the power supply. Without tools like this, monitoring for hardware failures is a crapshoot. If the drivers for a piece of hardware log errors to the NT event log, then tools like Microsoft Health Monitor can provide an alert. Watching the NT event log with Health Monitor is also a way to alert on failures in redundant software systems like Windows Cluster Server and Network Load Balancing.

NOTE
More information on Health Monitor is available in Chapter 14, "Monitoring a Web Farm with Application Center."

Understanding Application Availability

A more subtle aspect of building highly available systems is application availability.
Application availability measures how the functions and features of a specific application perform throughout the Web site lifecycle. By gauging application availability, actual uptime is measurable. If any functional portion of a site, such as credit card authorization, must function to complete a transaction, then that portion's availability affects the overall site availability measurement. Even if the Web site itself is successfully providing content, if a transaction fails at any point, the site is unavailable.

Considering application availability leads to a new way of thinking about application design. While load-balancing techniques help eliminate single points of failure in physical systems, software single points of failure are a more difficult problem to solve. With the credit card example, the application must be robust enough to continue the transaction either by deferring

credit card authorization or switching to a different credit card service. Availability planning for credit card processing must consider redundant connections to a lending institution and load-balanced redundancy for credit card processing servers, and it must also provide service-level redundancy.

Building software systems that have this level of redundancy is a unique challenge. Each application will have different requirements. However, at the fundamental design level there are a few key points to consider when orchestrating an application availability solution.

Create software systems with well-defined boundaries. In the credit card example, this means it should be fairly easy to tell when an application has entered the credit card processing engine. This enables an application to drop in a different system as long as the boundaries into that system look the same.

When an application data path leaves the boundaries of a Web farm, that path is a good candidate for application availability planning. For instance, if an application makes a WAN connection to an external bank, this connection is by definition not under the control and management of the IT staff that manages the Web farm. In situations like this, alternative mechanisms from the Network layer to the Application layer should be thought through to provide the highest level of redundancy. This may mean purchasing a second WAN connection of a lesser speed and cost and having an agreement for credit card processing with a second bank.

For transaction-oriented systems, build into the architecture a way to move to a batch-oriented processing engine. In the credit card example, this means the same information would be gathered to process a credit card authorization, but the authorization would not happen in real time. The functionality of an application would be decreased in batch mode, but at least the transaction could be completed at a later time.
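The three design points above (a well-defined boundary, a second provider, and a batch fallback) can be sketched together in Python. This is an illustrative sketch under stated assumptions, not the chapter's implementation: `Payment`, `AuthorizationGateway`, `ProcessorUnavailable`, and the two bank functions are hypothetical names, and the in-memory list stands in for whatever durable queue a real site would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


class ProcessorUnavailable(Exception):
    """Raised by a processor that cannot be reached (hypothetical)."""


@dataclass
class Payment:
    order_id: str
    amount_cents: int


# The boundary: every processor is a callable that takes a Payment and
# returns an authorization code, so one provider can replace another.
Processor = Callable[[Payment], str]


@dataclass
class AuthorizationGateway:
    processors: Sequence[Processor]
    batch_queue: List[Payment] = field(default_factory=list)

    def authorize(self, payment: Payment) -> Tuple[str, Optional[str]]:
        # Try each provider in order: primary bank first, then backups.
        for processor in self.processors:
            try:
                return ("authorized", processor(payment))
            except ProcessorUnavailable:
                continue
        # Every real-time path failed: keep the order for batch processing
        # instead of telling the customer to come back later.
        self.batch_queue.append(payment)
        return ("deferred", None)


def primary_bank(payment: Payment) -> str:
    # Simulates an outage on the primary WAN link.
    raise ProcessorUnavailable("primary WAN link is down")


def secondary_bank(payment: Payment) -> str:
    return "SECOND-" + payment.order_id


if __name__ == "__main__":
    gateway = AuthorizationGateway([primary_bank, secondary_bank])
    print(gateway.authorize(Payment("1001", 4999)))
```

Because both banks sit behind the same callable signature, adding or removing a provider changes only the list passed to the gateway, not the code that calls `authorize`.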
It is much better to switch to batch-oriented processing than to tell a customer to come back later. Later may never come.

The most complicated monitoring problem is application availability monitoring. Many applications rely on customer input to determine when critical systems fail. Beyond this, most monitoring endeavors for applications are custom built. Delivery schedules for critical application components should include time to build and implement the appropriate monitoring. In some cases, there are tools that generate replay scripts that can simulate a user on a Web site. These tools can be used to test the functionality of a site and report when errors occur. Even with tools like this, any remotely sophisticated site will likely need a full-time resource to manage and maintain the scripts.

Understanding Scheduled Downtime

Scheduled downtime should be the only reason a site becomes unavailable. Whenever a site has a scheduled release, hardware upgrade, planned ISP outage, or other required downtime scenario, this is scheduled downtime. Consider these downtimes as a separate measurement

from unscheduled downtime. While it is important to improve this availability number, the goal should stay within what is reasonable in today's Web farms. New .NET Enterprise technologies, like Application Center 2000, will help improve the scheduled availability rating. Most of the improvements will come from improving the process for releasing new features into the production Web farm.

Measuring Overall Availability

When measuring overall availability, a business should consider total downtime as a separate measurement from total unscheduled downtime. Each area of availability is calculated separately to help direct where efforts to improve availability should be made. Keep a running total of the network, server, application, unscheduled, and scheduled availabilities. Combine the network, server, and application availability to create the unscheduled availability quotient. Combine all the availability ratings to create the overall availability rating. Table 3.2 has the availability ratings and goals for Foo.com, based on a six-month period.

TABLE 3.2 Foo.com's Availability Goals for Six Months

Availability Type    Availability Goal    Downtime    Availability Measurement
Network                                   hour        99.97%
Server                                    hours       99.88%
Application                               hours       99.77%
Scheduled                                 hours       98.39%
Unscheduled                               hours       99.61%
Overall                                   hours       98%

Foo.com does a good job of preventing unscheduled downtime. It is 0.3% away from its stated unscheduled availability goal. However, this is approximately 12 hours of downtime. In order to achieve this goal, application availability problems need to be addressed. To improve overall availability, Foo.com needs to work on the time it takes to perform scheduled maintenance tasks. Foo.com needs to make up 43 hours to reach its overall availability goal of 99%.

Understanding Scalability

Stress is a universal term that applies to both natural systems and computer systems.
When a human feels stress, his coping mechanism differs according to the type of stress. When a Web farm experiences stress, its coping mechanisms differ according to the type of stress, too. A Web farm becomes stressed under many different scenarios. Some of the following items illustrate such stress factors:

FIGURE 3.1 Scaling up Foo.com's Web Server.

NEW TERM
Scaling out is accomplished by increasing the number of servers in a layer of a Web farm.

Determining How to Scale Up

When a server is overutilized, there are four different scenarios to consider. Two of the utilization problems come down to a relationship between CPU and memory. In fewer cases, utilization problems relate to I/O bottlenecks. Even fewer problems relate to network throughput and performance. Knowing a few simple concepts can help an administrator diagnose these problems and make the right choice between scaling up and scaling out.

In the first and most common scenario, the CPU utilization on a box under load with no application problems is stuck at 100%. In this situation, the CPU has too much work to do. There is adequate memory, and thus hard drive usage is not the problem. The only way to address this problem is to buy a bigger CPU or add a second one, either to the same server or in an additional server. When CPU utilization stays this high, there will almost never be problems with memory, network speed, or disk space: if there were, the CPU would have to wait on these devices, and during those waits it cannot perform work. Thus, the CPU would not stay at 100%.

FIGURE 3.2 A simple load-balancing diagram with different clusters. (Diagram labels: Internet; a Web cluster of Web01, Web02, and Web03 behind virtual server VWeb01, using Network Load Balancing; a SQL cluster of SQL01 and SQL02 with virtual servers VSQL01 and VSQL02, using Windows Cluster Server; an LDAP cluster of LDAP01 and LDAP02 behind virtual server VLDAP01, using Network Load Balancing; a COM+ cluster of COM01 and COM02, using Component Load Balancing.)

NEW TERM
A cluster is a group of servers that act as one entity through software- or hardware-based balancing systems.

The original definition of a Web farm encompassed only the servers that provided HTTP services. The .NET Web farm has evolved into a collection of different cluster types. Just as a real farm has chickens, pigs, horses, and cattle, a .NET Web farm has Web clusters, COM+ clusters, SQL Server clusters, and other clusters.

FIGURE 3.4 Foo.com's simple .NET Web farm using HLB. (Diagram labels: FOO.COM; a hardware load balancing device in front of Web01, Web02, and Web03.)

Scaling COM+ Using Component Load Balancing

COM+ Services via DCOM use RPC, a connected network protocol, for remote COM+ component invocations. This means that once a client and a server have established a connection, they remain connected, as with FTP. This is the opposite of HTTP, which is a disconnected protocol. A special technology called Component Load Balancing (CLB) was created to provide scalability and availability for COM+ and to work around the connected nature of DCOM.

CLB works by intercepting component creation requests before they leave the client computer. CLB maintains a list of servers to which it sends remote component activations, usually the members of a COM+ cluster in a server architecture like that shown in Figure 3.5. Through a combination of statistics gathering using WMI and a round-robin load-balancing algorithm, CLB picks the best server for a component's activation. CLB still uses DCOM, so the client server in Figure 3.5 has a DCOM connection to each member of the COM+ cluster.

NOTE
More information on Component Load Balancing and COM+ clusters can be found in Chapter 13, "Deploying Application Center 2000 COM+ Clusters."

Scaling the Data Tier

Scaling the data tier is much more complicated than scaling the Web and component tiers. Disparate entities like LDAP directory services and SQL Server 2000 each require a different mechanism for dealing with scalability problems. Some, like SQL Server, have limited options for scaling out and could force application changes under load. However, careful planning can help prevent situations in which the data tier becomes a bottleneck.
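To make the Component Load Balancing description earlier in this section more concrete, the following Python sketch combines gathered statistics with round-robin selection. It is only an illustration of the idea, not CLB's actual algorithm or API: the class name, the response-time dictionary, and the rule of reordering the list fastest-first are all assumptions, and the statistics that CLB would gather through WMI are passed in by hand.

```python
from typing import Dict, List


class ComponentBalancer:
    """Hypothetical sketch: a routing list walked in round-robin order."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._order = list(servers)
        self._next = 0

    def update_statistics(self, response_ms: Dict[str, float]) -> None:
        # Reorder the routing list so the fastest server is tried first.
        # Servers with no measurement sort to the end of the list.
        self._order.sort(key=lambda s: response_ms.get(s, float("inf")))
        self._next = 0

    def pick(self) -> str:
        # Round-robin: hand out servers in list order, then wrap around.
        server = self._order[self._next]
        self._next = (self._next + 1) % len(self._order)
        return server


if __name__ == "__main__":
    balancer = ComponentBalancer(["COM01", "COM02"])
    print([balancer.pick() for _ in range(3)])
    # Assumed measurements: COM02 is currently responding faster.
    balancer.update_statistics({"COM01": 40.0, "COM02": 12.0})
    print([balancer.pick() for _ in range(3)])
```

The selection happens on the caller's side, which mirrors the text's point that CLB intercepts creation requests before they leave the client computer.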

FIGURE 3.5 Component Load Balancing in a Web farm. (Diagram annotations: Web cluster members maintain a list of servers for CLB, in this case COM01 and COM02. CLB opens a DCOM connection to both servers in the COM+ cluster, which hosts a COM+ application, to achieve scale and redundancy.)

Scaling Out Directory Services

LDAP is a TCP/IP-based directory service. Scaling out directory services like LDAP is accomplished in much the same way as scaling out the Web services tier. Technologies like hardware and software load balancing provide linear scalability for any TCP/IP-based data services system.

Improving Availability Using SQL Server 2000 with a Windows Server Cluster

With SQL Server 2000 and Windows 2000 there are two types of shared-nothing clusters: those that use a disk to maintain cluster state and those that use the network. A shared-nothing cluster that uses a disk to maintain cluster state is called a Windows Server cluster. The operating systems of cluster members have access to the same physical data on a shared disk. This system does not rely on complicated distributed lock management techniques, so even though there is a shared disk, the clusters are considered shared nothing.

Windows Server clusters improve the availability of SQL Server by tolerating the failure of one node of the cluster. Clients make requests of a virtual SQL server through a Windows Server cluster virtual network name. When a SQL server fails, the virtual network name and SQL server are restarted on a different member of the cluster. Availability is greatly improved because new client requests to the virtual name move to the failover cluster member. When the failed node is repaired, the virtual name and SQL server can be moved back to their original cluster member. Figure 3.6 shows a shared-disk SQL Server 2000 cluster and what happens during a failover.
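The failover behavior described above can be modeled in a few lines of Python. This is a toy model under stated assumptions, not how Windows Server clusters are implemented: the class and method names are hypothetical, the node and virtual server names are borrowed from Figure 3.2, and real clusters also move disk ownership and restart the SQL Server service, which the sketch ignores.

```python
from typing import List, Optional


class FailoverCluster:
    """Toy model: a virtual network name owned by one node at a time."""

    def __init__(self, virtual_name: str, nodes: List[str]) -> None:
        if not nodes:
            raise ValueError("a cluster needs at least one node")
        self.virtual_name = virtual_name
        self.nodes = list(nodes)
        self.healthy = set(nodes)
        self.preferred = nodes[0]  # original owner of the virtual name
        self.owner: Optional[str] = nodes[0]

    def fail(self, node: str) -> None:
        self.healthy.discard(node)
        if node == self.owner:
            # Restart the virtual server on the first surviving member.
            survivors = [n for n in self.nodes if n in self.healthy]
            self.owner = survivors[0] if survivors else None

    def repair(self, node: str, fail_back: bool = False) -> None:
        self.healthy.add(node)
        if self.owner is None:
            self.owner = node
        elif fail_back and node == self.preferred:
            # Optionally move the virtual name back to its original node.
            self.owner = node

    def resolve(self) -> str:
        # Clients only know the virtual name, never the physical node.
        if self.owner is None:
            raise RuntimeError(self.virtual_name + " has no healthy node")
        return self.owner


if __name__ == "__main__":
    cluster = FailoverCluster("VSQL01", ["SQL01", "SQL02"])
    cluster.fail("SQL01")
    print(cluster.resolve())
    cluster.repair("SQL01", fail_back=True)
    print(cluster.resolve())
```

Clients in this model call only `resolve()`, which is the point of the virtual network name: the physical node behind it can change without any change on the client side.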

