Measurement and Management of Internet Services Using Agilent Firehunter

Similar documents

WHITE PAPER September CA Nimsoft For Network Monitoring

Network Management and Monitoring Software

WHITE PAPER OCTOBER CA Unified Infrastructure Management for Networks

Using IPM to Measure Network Performance

VMWARE WHITE PAPER 1

The ISP Column A monthly column on all things Internet

MEASURING WORKLOAD PERFORMANCE IS THE INFRASTRUCTURE A PROBLEM?

GoToMyPC Corporate Advanced Firewall Support Features

Cover. White Paper. (nchronos 4.1)

Agilent Technologies Performing Pre-VoIP Network Assessments. Application Note 1402

Using TrueSpeed VNF to Test TCP Throughput in a Call Center Environment

About Firewall Protection

Understanding Slow Start

The Problem with TCP. Overcoming TCP s Drawbacks

Internet Protocol: IP packet headers. vendredi 18 octobre 13

IBM Tivoli Monitoring for Network Performance

STANDPOINT FOR QUALITY-OF-SERVICE MEASUREMENT

Closing The Application Performance Visibility Gap Inherent To Citrix Environments

A Passive Method for Estimating End-to-End TCP Packet Loss

Diagnosing the cause of poor application performance

White Paper. The Ten Features Your Web Application Monitoring Software Must Have. Executive Summary

EXPLORER. TFT Filter CONFIGURATION

CiscoWorks Internetwork Performance Monitor 4.0

SiteCelerate white paper

Low-rate TCP-targeted Denial of Service Attack Defense

Testing VoIP on MPLS Networks

High-Speed TCP Performance Characterization under Various Operating Systems

Cisco Bandwidth Quality Manager 3.1

Sage ERP Accpac Online

Sage 300 ERP Online. Mac Resource Guide. (Formerly Sage ERP Accpac Online) Updated June 1, Page 1

Fifty Critical Alerts for Monitoring Windows Servers Best practices

How To Provide Qos Based Routing In The Internet

Key Components of WAN Optimization Controller Functionality

Linux MDS Firewall Supplement

Performance Tuning Guide for ECM 2.0

Stateful Inspection Technology

Accurate End-to-End Performance Management Using CA Application Delivery Analysis and Cisco Wide Area Application Services

Network Simulation Traffic, Paths and Impairment

DDL Systems, Inc. ACO MONITOR : Managing your IBM i (or AS/400) using wireless devices. Technical White Paper. April 2014

Frequently Asked Questions

Features Overview Guide About new features in WhatsUp Gold v12

IP SLAs Overview. Finding Feature Information. Information About IP SLAs. IP SLAs Technology Overview

Monitoring Application Response Time Components and analysis of various approaches

Applications. Network Application Performance Analysis. Laboratory. Objective. Overview

APPLICATION PERFORMANCE MONITORING

Testing & Assuring Mobile End User Experience Before Production. Neotys

How To Manage Performance On A Network (Networking) On A Server (Netware) On Your Computer Or Network (Computers) On An Offline) On The Netbook (Network) On Pc Or Mac (Netcom) On

Cisco and Visual Network Systems: Implement an End-to-End Application Performance Management Solution for Managed Services

AS/400e. TCP/IP routing and workload balancing

Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking

IP Filtering for Patton RAS Products

CloudLink - The On-Ramp to the Cloud Security, Management and Performance Optimization for Multi-Tenant Private and Public Clouds

IxLoad - Layer 4-7 Performance Testing of Content Aware Devices and Networks

3.1 RS-232/422/485 Pinout:PORT1-4(RJ-45) RJ-45 RS-232 RS-422 RS-485 PIN1 TXD PIN2 RXD PIN3 GND PIN4 PIN5 T PIN6 T PIN7 R+ PIN8 R-

Cisco Performance Visibility Manager 1.0.1

Improving the Performance of TCP Using Window Adjustment Procedure and Bandwidth Estimation

Measuring IP Performance. Geoff Huston Telstra

Requirements of Voice in an IP Internetwork

Ten Best Practices for Optimizing ADC Deployments

Monitoring Android Apps using the logcat and iperf tools. 22 May 2015

Optimizing IT Performance

Application Note. Windows 2000/XP TCP Tuning for High Bandwidth Networks. mguard smart mguard PCI mguard blade

User s Manual TCP/IP TO RS-232/422/485 CONVERTER. 1.1 Introduction. 1.2 Main features. Dynamic DNS

co Characterizing and Tracing Packet Floods Using Cisco R

Features Overview Guide About new features in WhatsUp Gold v14

Nimsoft for Network Monitoring. A Nimsoft Service Level Management Solution White Paper

Guide to DDoS Attacks December 2014 Authored by: Lee Myers, SOC Analyst

Virtual private network. Network security protocols VPN VPN. Instead of a dedicated data link Packets securely sent over a shared network Internet VPN

WhatsUp Gold v11 Features Overview

VisuSniff: A Tool For The Visualization Of Network Traffic

RFC 6349 Testing with TrueSpeed from JDSU Experience Your Network as Your Customers Do

Question: 3 When using Application Intelligence, Server Time may be defined as.

Proxy Server, Network Address Translator, Firewall. Proxy Server

Packet Capture and Expert Troubleshooting with the Viavi Solutions T-BERD /MTS-6000A

Content Distribution Networks (CDN)

Avaya ExpertNet Lite Assessment Tool

Whitepaper. A Guide to Ensuring Perfect VoIP Calls. blog.sevone.com info@sevone.com

LESSON Networking Fundamentals. Understand TCP/IP

Chapter 4 Firewall Protection and Content Filtering

Solving complex performance problems in TCP/IP and SNA environments.

Per-Flow Queuing Allot's Approach to Bandwidth Management

Mission-Critical Java. An Oracle White Paper Updated October 2008

EXPERIMENTAL STUDY FOR QUALITY OF SERVICE IN VOICE OVER IP

File Transfer And Access (FTP, TFTP, NFS) Chapter 25 By: Sang Oh Spencer Kam Atsuya Takagi

QAME Support for Policy-Based Management of Country-wide Networks

Computer Networks CCNA Module 1

Chapter 12 Supporting Network Address Translation (NAT)

Transport Layer Protocols

Monitoring Traffic manager

Transcription:

Measurement and Management of Internet Services Using Agilent Firehunter This white paper reviews the approaches taken to gather critical measurements from ISP services like Web, news, and email. It also demonstrates how these measurements can be used to help with problem detection and isolation. Comparisons are made between Agilent Firehunter s approach and those taken by some favorite public domain tools. This white paper contains the following main topics: Introduction Agilent Firehunter Measurement Technologies Active Measurements Passive Measurements Firehunter's Use Of Active And Passive Measurements Measurement Details Service Quality Measurements Network Measurements Infrastructure Service Measurements Server Measurements Demonstration of Firehunter's Utility News Problem Diagnosis Email Performance Tuning DNS Problem Detection Network Monitoring Relation to Existing Products and Tools Summary References Legal Notices

Introduction Quality of service (QOS), which refers predominantly to the availability and performance of Internet services, is fast emerging as a key differentiation for Internet Service Providers (ISPs) as they attempt to gain a dominant share of the Internet marketplace. In a PC Week study [1], 96 percent of respondents listed service availability as their main consideration in choosing ISPs. Furthermore, 93 percent of respondents perceived service performance as their second key expectation from ISPs. To meet the expectations of their subscribers and to attract new subscribers, ISPs must measure and manage their QOS. Doing so requires a shift in the management paradigm that ISP operations personnel are used to. From monitoring and managing network links in terms of metrics such as packet loss and delay, using tools such as ping and traceroute, ISP operations personnel must begin to look at a totally new set of metrics that represent QOS from the perspective of subscribers. Examples of such metrics include the accessibility of a specific Usenet newsgroup, the delay between transmission and reception of an email message, the response time for retrieval of a web page, and so forth. More than ever before, to meet subscriber expectations it is also becoming increasingly important for ISPs to monitor their systems to detect problems in advance of subscriber complaints. Furthermore, accurate yet rapid diagnosis of detected problems is also essential to minimize subscriber-perceived service downtimes. All of these requirements warrant a new generation of technologies for service quality measurement and problem diagnosis. Agilent Firehunter Agilent Firehunter is a system that is targeted at enabling ISPs to measure and manage the quality of their services. Firehunter includes a heterogeneous set of measurements that enable end-to-end, top-to-bottom management of ISP network, server, and infrastructure components. Measurements of web, email, and news services enable ISPs to assess the quality of service they offer to subscribers. To facilitate problem diagnosis, Firehunter includes performance and usage measurements of infrastructure services such as the Domain Name Service (DNS). Measurements of server and network utilization and performance complete a comprehensive measurement suite for ISPs. Ongoing measurements, coupled with intelligent baselining and thresholding capabilities, enable early detection of potential problems. The unique service modeling capability of Firehunter enables rapid determination of the root causes of problems. A sophisticated set of graphical interfaces enables Firehunter to serve the unique requirements of ISP operations personnel, capacity planners, and customer support staff. This document describes in detail the measurements that form the basis of Firehunter. Firehunter's service modeling and baselining capabilities, which facilitate analysis of the measurement results, are described in Agilent Firehunter Services Models and Baselining [2]. The following sections describe the reasons for choosing the measurements that are included in Firehunter and the operation of these measurements. Experimental results demonstrating the effectiveness of the measurements are presented later in the document. These results, which highlight the practical utility of all of the measurements, have been obtained from a realworld deployment of the measurements in a top-tier commercial ISP environment. Due to resource constraints, not all of the measurements described in this document are available as part of Firehunter 1.0. Additional measurements will be incorporated in future releases of Firehunter. Agilent Technologies Page 2 Published 1998

Measurement Technologies There are two predominant classes of measurements: active and passive. This section discusses each of these measurement classes and how Firehunter uses them in combination. Active Measurements Measurements in this class explicitly stimulate traffic to networks, servers, and service applications to assess various metrics of interest about these elements. For example, to measure the availability and performance of a web server, an active measurement issues an HTTP GET request for a web page and observes the status code in the response from the server as well as the total response time for retrieving the web page. To obtain measurements that are representative of the real world, the synthetic workload generated by the active measurements must be determined based on observations of subscriber workloads obtained while the system is in operation. Furthermore, to truly reflect subscriber-perceived problems, the active measurements must use the same network path and the same set of infrastructure services (DNS, Network File Service (NFS)) that subscribers use while accessing the ISP services. In cases where this constraint cannot be met, additional measurements may need to be made and correlated to assess the state of the elements of interest in isolation. For instance, if the performance of a DNS server has to be assessed using a different network path than the one subscribers use, the DNS response time measured by an active measurement must be correlated with the network response time measured over the path used for the measurement. In this case, the DNS server's performance is poor only if the DNS response time is abnormally high and the network response time is not. The main advantage of active measurements is that they do not rely on any monitoring capabilities built into the networks, servers, and service applications. On the other hand, since they introduce additional traffic if they are not carefully engineered, these measurements can end up impacting the QOS they are intended to measure! Moreover, the measurements are initiated from external locations, without requiring access to the ISP systems. Measurement results therefore do not provide sufficient information about the subscribers' use services and the service applications' use of resources to enable detailed problem diagnosis and capacity planning. Consequently, the utility of these measurements is limited to service quality assessment and preliminary problem diagnosis. Passive Measurements As the name suggests, measurements in this class utilize instrumentation built into network, server, and service application components. Since they do not require the explicit stimulation of traffic, these measurements have minimal impact on ISP systems. Consequently, they also scale better for larger ISP systems. Moreover, since they use built-in instrumentation that tracks real subscriber traffic, the measurements truly reflect the QOS perceived by subscribers. Besides being useful for service quality assessment, passive measurements can also provide information about resource and service usage, which is critical for detailed problem diagnosis and capacity planning. Passive measurements can be obtained from different sources: From the servers: Web, email, and news services are based on the reliable Transmission Control Protocol (TCP) that involves the explicit communication of acknowledgements back from subscriber client applications (for example, web browsers, and email clients) back to the servers. Based on acknowledgement feedback from the clients, the service applications can detect both the start of different service transactions from clients and the end of these transactions. As part of their normal operation, the service applications can measure the service performance in terms of response times for different service transactions. This information can be stored in log files for further analysis by management applications. Alternatively, the information can be exposed via management information bases. Resource and service usage information can also be obtained from the servers. Agilent Technologies Page 3 Published 1998

From subscriber clients: Monitoring capabilities built into subscriber client applications, or available using special-purpose instrumentation software such as Net.Medic [3] can provide an assessment of service quality as observed by subscribers. From special-purpose network probes: External network probes can be used to snoop network transmissions and analyze the captured packets to yield measurements of services and networks. Firehunter's Use of Active and Passive Measurements While the problems associated with the distribution of special-purpose software make clientside passive measurements less attractive to ISPs, the complexity and cost of implementation make the network probe-based approach less viable for service quality assessments. Since a great majority of service applications (web, email, and news servers) do not have sufficient instrumentation capabilities built in to enable the passive assessment of service quality, Firehunter incorporates active measurements for service quality assessment. However, passive measurements from the servers can provide a wealth of information for detailed problem diagnosis and capacity planning, which Firehunter exploits. Figure 1 illustrates the components of Firehunter. A diagnostic measurement server (DMS) serves as a host for agents that make active measurements of service quality. Software agents operating on the ISP servers track resource and service usage passively. The unique combination of active and passive measurements included in Firehunter provides information that is critical for effective operational monitoring and capacity planning for ISPs. To maximize the effectiveness of the measurements and to minimize their impact on the ISP system, Firehunter's flexible service model allows the user to easily configure where the agents are placed and how often the measurements are executed. Figure 1. Components of Firehunter: the deployment of a DMS and measurement agents in an ISP server farm Agilent Technologies Page 4 Published 1998

Measurement Details Firehunter provides service quality, network, infrastructure service, and server measurements. This section describes each of these measurements in detail. Service Quality Measurements To measure the QOS of web, email, and news services, Firehunter includes custom agents: Web service measurements: The monitoring agent for web services uses active measurements to assess the web services' availability and performance. Like a typical web client retrieving a web page, this web service agent resolves the IP address of a target web server, establishes a TCP connection with the web server, and then issues a GET request to a specified web page (see Figure 2). By interpreting the HTTP response header returned by the server, the web service agent determines the availability of the web service. The overall response time to retrieve the web page is determined. In addition, this agent derives a breakdown of the overall response time into the following categories: Time for DNS resolution TCP connection establishment time Server response time-----the time between when the GET request is issued to the time when the HTTP response header is received Data transfer time-the time between when the HTTP response header is received and the web page retrieval is completed The response time components can reveal potential bottlenecks. Clearly, the characteristics of the network links (for example, delay and packet loss) interconnecting the DMS to the target web server impact all of the response time components. Assuming little change in the intervening network's characteristics, an increase in delay or an increase in the DNS response time is an indicator of DNS-related problems. A large TCP connection establishment time is typically indicative of bottlenecks in the server host (note that although the TCP connection is established to the port the web server application is listening to, TCP connection establishment is entirely handled in the kernel, rather than in the web server application). The server response time is indicative of the web server application processing delays. Included in this measure could be queuing delays at the server waiting for a web server application process to be allocated to handle the new request, delays for DNS processing required by the server, and file system access times for retrieval of the web page. Agilent Technologies Page 5 Published 1998

Figure 2. Operation of the web service measurement agent Email service measurements: The ability of a subscriber to send mail to other subscribers (whether they are subscribers of the same ISP or of a different ISP) is distinct from the ability of the subscriber to receive mail from the email servers. To assess the availability and performance of these distinct email operations, the monitoring system includes the following active measurement agents: Mail delivery agent: Firehunter's mail delivery agent assesses the capabilities of an ISP's email system to accept mail messages for delivery to one or more target destinations. Since the ISP system may use different mail servers to handle messages destined for local subscribers and those destined for external locations on the Internet, to verify the different mail delivery paths, different local and remote destinations can be specified as targets to the mail delivery agent. This agent uses the Simple Mail Transfer Protocol (SMTP) to communicate with one or more ISP mail servers, transmits the specified mail messages, and reports the availability and performance of the mail servers. Mail retrieval agent: Firehunter's mail retrieval agent emulates a subscriber accessing a mailbox from the ISP's mail server. While doing so, this agent assesses the availability and performance of the mail retrieval components of the email system independent of the mail delivery components. In addition, this agent can also measure the delay between the transmission of a mail message and its reception at the intended destination. Figure 3 depicts the operation of this agent, which uses the Post Office Protocol Version 3 (POP3) for communication with an ISP's POP3-based mail server. In its default mode of operation, this agent retrieves all the messages in a specific subscriber mailbox. To obtain consistent measures over time, a constant sized mailbox must be used for retrieval. Like the web service agent, the mail retrieval agent tracks the overall response time for accessing the mailbox as well as the individual components of response time: TCP connection establishment time, time to authenticate the subscriber, and time to retrieve the subscriber's mailbox. These response time components can be used to identify potential bottlenecks that may exist during retrieval of mail messages in the ISP system. While retrieving messages from a mailbox, the mail retrieval agent can be instructed to selectively recognize messages with specific filter criteria (for example, specific subject field in the mail headers) and to perform additional actions for these messages. For instance, the agent can be instructed to estimate the delay between the transmission and reception of the message by comparing the origination time of the mail message (included in the mail header) and the retrieval time of the message. Agilent Technologies Page 6 Published 1998

Figure 3. Operation of the mail retrieval measurement agent News service measurements: Availability and performance of the news service is measured using the news service agent, the operation of which (see Figure 4) isvery similar to that of the other active measurement agents described earlier. Figure 4. Operation of the news service measurement agent The ability of the web, email, and news service quality measurements to assess the subscriber-perceived QOS is dependent on the location of the diagnostic measurement server (DMS). In an extreme configuration, the DMS is located at a point of presence (POP) site and uses a dial-up line to make its measurements. The service quality measurements are representative of subscriber-perceived performance. However, locating DMSs at several POP sites can be overly expensive. Moreover, simultaneous active measurements from a number of POP sites can also end up congesting the servers they measure. At the other end of the spectrum of possibilities is a DMS that is centrally located, possibly in an ISP's server farm. A Agilent Technologies Page 7 Published 1998

single set of measurements that assesses the service quality offered by the ISP's web, email, and news servers to the DMS is used to minimize the measurement overheads imposed on the servers. However, since in this case the DMS is located close to the ISP's servers, the service quality measurements it provides do not reflect the impact of the networks interconnecting the ISP's POP sites to the server farm. In fact, since they use different network paths to access web, email, and news services from the server farm, the QOS observed by subscribers is dependent on the POP sites they use to connect to the ISP. To obtain an endto-end perspective of service quality, the service quality measurements described above must be complemented with measurements of the network's effects on service quality. Network Measurements Until now, ISPs have extensively used tools such as ping and traceroute to detect and diagnose network problems. Firehunter incorporates ping's capabilities to assess network connectivity and provide end-to-end delay information that is useful for problem diagnosis. Application servers, routers, and terminal servers in the different POPs can be targets for the network connectivity and delay measurements. While ping and traceroute give an indication of changes in network characteristics, these tools are not sufficient to quantify the impact of these changes on the availability and performance of subscriber-visible services. For services that use TCP for reliable communication, a useful network performance metric is throughput, defined as the rate of reliable transmission of packets between a source and a destination. The throughput achievable between any sourcedestination pair is a complex function of several factors such as the socket buffer size in use, characteristics of the source and destination's TCP implementation, processing capabilities of the source and destination, bursts of packet loss (if any), round-trip packet transmission delays, and so forth. The complexity of this relationship makes it almost impossible to estimate the throughput achievable based on packet loss and delay measurements available from ping. Many public domain tools such as throughput TCP (ttcp) and netperf [4] have been widely used for measuring throughput. A key drawback of these tools is the need for custom software applications to be executed at the source and destination, respectively, in order to enable the measurement. To enable throughput measurements without requiring special-purpose instrumentation at the targets, Firehunter's throughput monitoring agent builds on the concept of the Traceroute Reno (Treno) tool that has been developed as part of the Internet Provider Performance Metrics (IPPM) working group of the Internet Engineering Task Force [5]. Treno emulates a TCP-based data transfer using User Datagram Protocol (UDP) packets to transmit data to a target. The UDP packets are sized equivalent to typical TCP packets and are directed at a non-existent port on the target. Any IP addressable target responds to such packets by transmitting an ICMP error message that is almost equivalent in size to the TCP acknowledgment packets that a subscriber client machine transmits in response to a TCP data packet. By incorporating TCP's delayed acknowledgment, window opening, and retransmission schemes in the server application, Treno emulates data transfers over a TCP connection. In its basic form, Treno was intended to permit subscribers to compare the performance offered by different network providers, and for network providers to monitor the performance of their networks. In an effort to measure the best possible throughput achievable when using TCP as the transport protocol over an IP network, Treno continues to open its transmission window, which represents the amount of data for which acknowledgements are outstanding at any time, until it encounters a packet loss. In an extreme scenario, Treno can continue to grow its transmission window to an extent that its packet transmissions flood the network for the duration of the test. Moreover, in practice, the TCP stacks on subscriber PCs are configured with a default maximum window size that restricts the amount of data transmitted simultaneously. Since Treno has no window size restriction built in and since in many cases, the throughput achieved over a TCP connection is known to be proportional to the window size used, Treno overestimates the throughput achieved by a subscriber during a normal TCP-based data transmission. Agilent Technologies Page 8 Published 1998

To overcome these limitations, Firehunter's throughput monitoring agent extends the concepts of Treno by building in restrictions on the maximum window size. Furthermore, by restricting the amount of data transferred during each measurement to match typical data transfer sizes that subscribers use when retrieving web, email, and news content, Firehunter's throughput monitoring agent ensures that its measurements reflect subscriber perceptions of network throughput. Since it does not rely on custom software in the targets for throughput measurements, Firehunter's throughput monitoring agent can measure throughput from the DMS to any IPcapable device, such as application servers, routers, terminal servers in the POP sites, or even subscriber PCs and terminals. An approximation of end-to-end service quality can be obtained by combining the web, email, and news service quality measurements with throughput measurements directed to subscriber PCs and terminals. For example, if the response time measured by the DMS for retrieval of a 40 KB web page is 5 milliseconds (ms), and the throughput achieved between the DMS and a subscriber's PC for a 40 KB transfer is 20 KBps, the end-to-end response time for the 40 KB transfer from the web server to the subscriber's PC can be approximated to be: 0.005 + 40*8/20 = 16.005 ms. However, such a measurement is most likely to be limited by the bandwidth bottleneck on the dial-up lines. A more useful metric for ISPs to track is the throughput achievable during transmission from the server farm to the different POP sites. The terminal servers at the POP sites can serve as targets for such a measurement (see Figure 1). A reduction in throughput is usually a good early indicator of problems that may impact subscriber-perceived QOS in the future. Problem diagnosis is facilitated by the breakdown of end-to-end QOS to service quality measurements that are most likely to be dominated by server farm problems and throughput measurements that are more reflective of network problems. In the event that a reduction in network throughput is detected, ping and traceroute can be used to isolate the root cause of the problem. Infrastructure Service Measurements Although they are not directly visible to subscribers, infrastructure services such as DNS, which handles translation of host names to IP addresses, and the NFS, which enables remote data access, play a crucial role in the operation of Internet services. Any deterioration in availability and performance of these infrastructure services usually manifests as subscribervisible problems with one or more of the Internet services. To monitor these infrastructure services, Firehunter includes the following measurements: DNS measurements: Truly accurate measurements that assess DNS performance as perceived by subscribers can only be obtained using passive measurements that track address resolutions requested by subscriber applications. Such measurements can be made either at the DNS server (requiring modifications to the server application), at DNS clients (for example, some web proxy servers log DNS response time as one of the components of the overall response time for subscriber requests), or by using nonintrusive external probes. In the absence of such passive monitoring capabilities in most ISP systems, Firehunter uses active measurements to emulate typical DNS requests and assess DNS availability and performance. Figure 5. Firehunter's distinct measurements for DNS server cache hits and misses Agilent Technologies Page 9 Published 1998

Since the ability of a DNS server to service requests from its cache is independent of its ability to resolve mappings not in the cache, the DNS agent includes separate measures of DNS cache hits and misses for a server (see Figure 5). By issuing a non-recursive query to a DNS server, the agent forces the server to respond based on its local state alone, thereby emulating a DNS cache hit at the server. By requesting address resolution to a well-known Internet site and by observing the status returned by the DNS server in its response, the DNS agent measures the availability of the DNS server. To measure the performance of DNS cache misses, the DNS agent issues a request for address resolution for a randomly chosen host. Comparison of DNS response times with the network response times provides an indication of whether the network or the DNS server is a performance bottleneck. NFS measurements: One of the common ways of scaling an ISP system is by deploying back-end content servers that store web, email, and news content, and permit front-end servers (FESs) to access the content via NFS. NFS monitoring experience developed over the years in enterprise networks is directly applicable to the ISP domain. Passive measurements that track NFS statistics maintained by the server and client machines yield various statistics about the NFS sub-system. There is a richer set of NFS metrics that can be collected from the client end than from the server end. On the server end, the number of NFS calls handled by the server over time can be an indicator of potential overload conditions. On the client end, the rate of NFS calls issued by the client, the percentage of duplicate responses received from the server(s), the percentage of timeouts observed by the client, and the percentage of retransmissions can all be tracked over time to detect NFS anomalies. A rule of thumb often used is that during normal operation, the percentage of retransmissions and duplicate responses observed by the client should be much less than 5 percent. When this is not the case, a comparison of the percentage of retransmissions with the duplicate responses provides an indicator of the cause of the problem. If very few duplicate responses have been received, the interconnecting network is potentially lossy. On the other hand, if the percentage of duplicate responses is almost equal in magnitude to the percentage of retransmissions, this implicates a slow-down at the server end as the cause of the problem (for example, because there are too few NFS server (nfsd) processes executing on the server). A look at the client statistics that indicate the number of NFS requests waiting for service can indicate if there is a bottleneck at the client end. When there are too few NFS client (biod) processes executing on the client to handle remote NFS accesses, the count of waiting requests grows. Server Measurements Measurements taken on ISP servers can yield a wealth of information that is critically important for isolating problems. Firehunter's agents executing on ISP servers passively track various statistics of interest. Among the measurements obtained using these agents are the CPU utilization of the server, free memory availability, virtual memory page-scanning rate, and packet transmission and reception rate through each of the interfaces of the server. For servers that support TCP-based services (web, email, news, and so forth), the rate of connections to and from the server and the number of connections currently established on the server are useful measures of the server's workload. A breakdown of connections based on the specific ports to which they relate is also available, thereby enabling workload to be determined for each service (SMTP, POP3, HTTP, and so forth), even in cases when the same physical host supports multiple services. By monitoring the different states of the TCP connections to a server, it is possible to detect abnormal conditions such as the malfunctioning of specific servers or the occurrence of SYN attacks. Agilent Technologies Page 10 Published 1998

Demonstration of Firehunter's Utility This section presents experimental results demonstrating the utility of Firehunter's measurements in real-world ISP environments. News Problem Diagnosis Based on ongoing measurements of an ISP's news service, Firehunter's recorded typical response times for retrieving a set of news articles amounting to 10 KB of data from the news FESs were in the range of 1 to 6 seconds. In the scenario under consideration, Firehunter observed that the response times for news access had dramatically increased to over 100 seconds at certain times of the day. Further analysis of the measurement results (see Figure 6) revealed that one of the three FESs in use, news-fes3, was performing significantly worse than the others. Restrictions on our access to this FES prevented further diagnosis of the problem. Figure 6. Performance of their news FESs The poor performance continued for several days, as the ISP's operations personnel failed to even detect that the problem existed, let alone fix the problem. Interestingly, a week later, the performance measured for all three FESs had degraded significantly (see Figure 7). While news-fes3 performed poorly in the period 8/19-8/21, all three FESs experienced problems on 8/22. Agilent Technologies Page 11 Published 1998

Figure 7. Performance of the three news FESs during a subsequent period Analysis of the breakdown of the response time for one of the FESs, news-fesi (see Figure 8) indicated that on 8/22, while the set-up time dominated during the period from 7 AM to 10 AM, the server response time was the dominant component during the period from 10 AM onwards. The changes in the dominant response time components suggest that there were two different problems (with potentially different sources) that were affecting news performance. A similar behavior was observed for the other FESs as well, indicating that the same problems were affecting all the FESs. Figure 8. Detailed breakdown of the performance of news-fes1 on 8/22 Agilent Technologies Page 12 Published 1998

Passive measurements available from agents operating on one of the FESs, news-fes1, provided additional information essential for diagnosis. Comparison of the rates of TCP connections handled by news-fes1 on 8/22 with baselines computed from historical data indicated that there had not been any increase in the number of connections to and from the news-fes. Analysis of the rates of packets to and from news-fes1's external interface also failed to indicate any change in subscriber-generated news workload. However, analysis of the rates of packets to and from news-fes1's internal interface revealed the source of the problem. Figure 9 depicts the packet traffic handled by news-fes1's internal interface over a week. There were two traffic peaks one on 8/17 and the other on 8/22 corresponding to an incoming traffic rate of over 45,000 pkts/min, both of which corresponded to times when news performance observed using the news service agent had degraded. The second peak coincides with the 7 AM to 10 AM time period on 8/22. The above analysis leads to the conclusion that the source of the first problem is congestion on news-fes1's network interface. Rerouting of traffic from the 10Mbps interface to a 100 Mbps interface around this time removed the performance bottleneck. Figure 9. Packet traffic handled by the internal interface of news-fes The second problem observed on 8/22 from 10 AM onwards was attributable to NFS problems. Figure 10 compares the NFS statistics gathered from news-fes1 for 8/21 and 8/22. While there had not been a detectable increase in NFS calls per minute issued by news-fes1 to the NFS server, the percentage of NFS time-outs had increased dramatically from close to 0 to almost 10 percent. Comparison of network response times between news-fes1 and the NFS server did not indicate any change from normal. This leads to the conclusion that a slowdown of the NFS server was the cause of the problem. Since the NFS server is accessed by all the news FESs, performance problems were observed when accessing all of the news FESs. Agilent Technologies Page 13 Published 1998

Figure 10. Client measurements of NFS obtained from news-fes1 Email Performance Tuning Figure 11 depicts the performance of an ISP's mail delivery system. In this case, Firehunter was used to track the delay between the transmission of a message and its receipt at a local subscriber mail account. As is evident from the figure, starting with a mail delivery delay of about 15 minutes on 8/27 when Firehunter was deployed at this site, mail delivery performance of the ISP's mail system continued to degrade. By 9/10, mail delivery delay had reached close to 40 minutes at peak times. This rapid increase in mail delivery delays was attributable to a failure of the mail queue processing algorithms implemented by the mail server application. As is noticeable from the figure, tuning of the mail server application, together with the introduction of new mail handling policies by the ISP (for example, to queue mail messages for no more than 3 days), resulted in a significant drop in the mail delivery time noticeable to subscribers. By 9/22 mail delivery delays were well under 5 minutes. This scenario is representative of Firehunter's utility. Besides providing measurements that can enable ISPs to assess the quality of services they offer, Firehunter provides a means by which ISP operation staff can directly observe the impact of the performance tuning and system reconfigurations they perform. Agilent Technologies Page 14 Published 1998

DNS Problem Detection Figure 11. Performance of an ISP's mail delivery system Since DNS is a common infrastructure service that impacts all the ISP services, such as web, email, and news, failure of DNS is often widely noticed. While hard failures caused by DNS availability problems are easily spotted, soft failures caused by DNS performance problems often take a long time to resolve. The DNS measurements that Firehunter includes are targeted at spotting these very cases. Figure 12 depicts a scenario in which the response time for DNS resolutions from a server's cache had dramatically increased from a few tens of milliseconds to over 500 ms around 10 PM on 8/13. The network performance measured by ICMP Echo response times had not changed significantly during the same period, thereby implicating the DNS server as the source of the problem. As is often times the case with DNS servers, memory leaks in the server software turned out to be the source of the problem in this case. Agilent Technologies Page 15 Published 1998

Network Monitoring Figure 12. Detection of a DNS performance problem Another infrastructure component that impacts all of the subscriber-visible services is the network infrastructure. As alluded to earlier, Firehunter includes a measurement that estimates the throughput achievable over a TCP connection between an ISP's server farm and a POP site. To evaluate the integrity of this measurement, we compared the throughput measured by Firehunter's throughput monitor with that achieved over a TCP connection using the netperf public domain tool, for the same source-destination pair. The data transfer size was set to 40 KBytes, and the TCP socket buffer size used was 8 KBytes. Packet loss and round-trip delay measurements were made at the same time for correlation with the throughput measurements. Figure 13 compares the results of throughput measurements obtained using netperf and using Firehunter. In order to evaluate the suitability of Firehunter's throughput measurement as a service-level metric that can be used to offer a reasonable minimum guarantee of performance, the 90 percentile values that represent the throughput that is achieved at least 90 percent of the time were chosen for comparison. The percentile values were computed over a 30-minute time window. Figure 13(a) illustrates a remarkable degree of correlation between the throughput measurements observed using netperf and that observed using Firehunter. While the instantaneous measurements (not shown in Figure 13) can vary because of the dynamic nature of network phenomena, the 90-percentile values of throughput show a 97 percent correlation. Figure 13(b), which shows the variation in network round-trip delays and packet loss values during the same time period, serves to explain the reasons for the different drops in throughput. The three valleys in the throughput graphs map directly to periods of high packet loss. The three peaks around 800 KBps correspond to periods of high network delays with no packet loss. Agilent Technologies Page 16 Published 1998

Relation to Existing Products and Tools Conventional network management systems (HP OpenView, IBM Tivoli, CA Unicenter, and so forth) provide very limited capabilities to assist an ISP in detection and diagnosis of problems relating to Internet services. Over the years, several measurement tools that have been developed and made available in the public domain for monitoring IP networks and servers. Tools such as ping and traceroute have been used for network monitoring and diagnosis, and those such as top, vmstat, and iostat have been used for monitoring servers. The lack of an integrated, easy-to-use monitoring system has meant that ISP operations personnel spend a significant amount of time manually troubleshooting problems. The time taken for problem diagnosis has a direct impact on subscriber-perceived QOS. As the complexity of ISP systems grows to handle scale and to deliver newer, more sophisticated services, problem detection and diagnosis have tended to take longer and longer. This has resulted in growing numbers of unsatisfied subscribers, contributing to an industry-wide churn [1]. The Virtual_Adrian toolkit from Sun Microsystems [6] is intended to simplify the task of problem diagnosis for ISPs by providing a variety of statistics relating to the performance of their servers. However, the toolkit does not provide any capability for modeling and measuring Internet services. The focus on management of services, rather than just management of servers, is one of the distinguishing characteristics of Firehunter. Figure 13. Comparison of throughput measurements using netperf and using Firehunter Agilent Technologies Page 17 Published 1998

Summary At the other extreme of the spectrum of products is the WhatsUp monitoring solution from Ipswitch [7]. Using only active measurements, WhatsUp tracks the availability and performance of various services. However, WhatsUp's measurements are limited in functionality compared to Firehunter. Each measurement attempts to establish connections to the respective TCP ports and issue a single service-specific command, which does not truly represent typical subscriber accesses to the ISP services. Consequently, WhatsUp's measurements are not sufficient to assess the service quality of ISP services. For example, WhatsUp's measurements do not indicate how long it will take a subscriber to download a 10 KB set of news articles from a news server. Moreover, since it predominantly uses active measurements, WhatsUp does not provide sufficient information for problem diagnosis. One of the early systems that attempted to provide an integrated solution for ISPs is the network operations console, NOCOL [8]. Using a distributed architecture that includes agents for different types of network, server, and service measurements, NOCOL attempts to provide information essential for problem detection and diagnosis. NOCOL tests service availability by merely connecting to the service port; additional diagnostic information as well as service performance measurements are lacking. However, the lack of an easy-to-use user interface and the absence of sufficient service-specific measurements have been major deterrents in its widespread adoption. Firehunter uses a manager-agent architecture similar to NOCOL's. The main novelty of Firehunter is the support it offers for problem identification and diagnosis. By incorporating sophisticated and comprehensive measurement mechanisms that track the availability, performance, and usage of subscriber-visible services (web, email, and news), as well as infrastructure services (such as DNS and NFS), Firehunter provides the information essential for problem diagnosis. An easily customizable web-based user interface enables network, server, and service measurements to be visualized through a common interface, enabling ISP operations personnel to identify the root cause of problems. Recently, Inverse Networks has devised a measurement suite for relative comparison of the QOS delivered for services offered by different ISPs [9]. These measurements, which are designed to emulate typical subscriber accesses, are executed from one or more measurement locations. By dialing in to each of the ISP's POP sites periodically, the measurements provide an estimate of the percentage of times subscribers are likely to experience busy signals while connecting to the ISP. Once a subscriber dial-in connection is established, measurements of web and email availability and performance are taken. Since it includes service measurements alone, this measurement suite does not provide sufficient information for root-cause detection or capacity planning. Another drawback of these measurements is their inability to proactively detect problems in advance of subscriber complaints. Since they emulate subscriber accesses over dial-up lines, the performance reported by these measurements is limited by the capacity of the dial-up lines. Consequently, these measurements are able to report problems only when their magnitude increases to such an extent that they are observable via the dial-up lines, which happens around the same times that subscribers notice the problems. In contrast, by incorporating measurements from strategic locations in an ISP's system, Firehunter enables proactive detection and diagnosis of problems. The IPPM working group of the Internet Engineering Task Force has been exploring tools for assessing network quality [10]. Initial draft metrics have been proposed, and early prototypes of tools are being evaluated. As the metrics become standard, future releases of Firehunter will include appropriate IPPM measurements in their measurement suites. This document has described the measurements that form the basis of Agilent Firehunter, a system for measurement and management of ISP services. Using a combination of active and passive measurements of ISP services, infrastructure, networks, and servers, Firehunter gathers a variety of information that is critical for operational monitoring, capacity planning, and subscriber support. Agilent Technologies Page 18 Published 1998

References Legal Notices 1 Rebecca Wetzel, Customers rate ISP services, PC Week, November 1997, http://www.zdnet.com/pcweek/sr/1110/10isp.html. 2 Brian Atkins, Agilent Firehunter Service Models and Baselining, Helwett-Packard Company, March 1998. 3 Bob Metcalfe, Net.Medic lets you know when to pull the plug on your Internet connection, InfoWorld, April 14, 1997. 4 Rick Jones, Netperf: A network performance benchmark: Revision 2.1, Agilent Technologies, February 1996. 5 Matthew Mathis and Jamshid Mahdavi, Diagnosing Internet congestion with a transport layer performance tool, Proceedings of INET'96, June 1996, http://www.psc.edu/~mathis/htmlpapers/inet96.treno.html. 6 Adrian Cockcroft, Advanced monitoring and tuning, Sun World On-Line, October 1995, http://www.sun.com/951001/columns/adrian/column2.html. 7 Mike Avery, WhatsUp Gold enhances network supervision tasks, InfoWorld, Vol. 20, Issue 4, January 26, 1998. 8 R. Enger and J. Reynolds, RFC 1470: FYI on a network management tool catalog: Tools for monitoring and debugging TCP/IP Internets and interconnected devices, June 1993. 9 Jackie Poole and Amy Doan, Inverse profile checks up on ISP performance, InfoWorld, April 28, 1997. 10 Vern Paxson, Guy Almes, Jamshid Mahdavi, and Matthew Mathis, Framework for IP Performance Metrics, Internet Engineering Task Force (IETF) Network working group Internet Draft, February 1998. Copyright 1998 Agilent Technologies Inc. All Rights Reserved. Agilent Technologies Page 19 Published 1998