Monitoring Application Response Time: Components and analysis of various approaches

Explores the components and analysis of application response time; describes new approaches that can simplify the analysis of sources of delay for networked applications.

Executive summary

- Client/server applications are rapidly proliferating.
- The number of application response time issues is increasing, for a variety of reasons. Specialists may be required to address the problem.
- IT staffs are shrinking.
- Previously available solutions solved the problem technically, but they were so difficult to deploy and maintain that they gained no real market traction.
- This paper explores the components of application response time, the approaches available, and how new approaches can greatly simplify the process of analyzing sources of delay in networked applications.

Table of Contents

Executive Summary
Understanding Application Response Time: The End User Experience
Importance of Monitoring Application Response Time
Components of Application Response Time
Response Time Analysis Approaches
SuperAgent: A Passive, Server-Side Approach
Business Value of Response Time Analysis
Conclusion

Instant vision into application performance. That's Network SuperVision. That's Fluke Networks' promise to you.

Understanding application response time - the end user experience

Computer users hate to wait. For heavy transactional environments, response times longer than three seconds significantly impact worker productivity. End users, however, are concerned with more than absolute times. Response time expectations tend to be defined by past experience: users judge an application against the baseline performance they have come to expect from it. If their current experience differs significantly from those expectations, support calls and complaints increase dramatically.

Importance of monitoring application response time

Application response time problems have grown with the proliferation of server-based applications, and pinpointing the source of application delays has turned out to be a difficult task. A network manager at a Fortune 500 financial services company states that too much time is spent isolating problems. The manager adds, "And even after the help desk has supposedly isolated the problem as being in the network, only about 50% of the time does it actually turn out to be a network problem. Sometimes we have to dispatch complete teams of people just to figure out where the problems are."

Network, application and MIS managers are motivated to keep business-critical applications running smoothly across their networks. Determining where performance problems lie, and who is responsible for solving them, is one of the most time-consuming challenges facing IT groups. Because human expertise is in critically short supply, companies want the maintenance and management of any response time monitoring process to cost few man-hours.

What is needed is:

- A clear understanding of response time components,
- An understanding of the tools and approaches available, and
- An understanding of the business uses for response time tools.

Components of application response time

Response time defined

Consider a user sitting at a PC (client), using an application that is communicating with a server across the network. The client sends a request to the server, and the server responds with one or more packets in reply. If it is a reliable application using positive acknowledgements, the client acknowledges receipt of the response message. The client may then send another request to the server. In general, a transaction (e.g., placing an order, performing a query) may consist of a number of client requests and corresponding server responses. The time elapsed from when the client sends the request (packet-level or transaction-level) to when it receives the last packet in the response is referred to herein as the Total Response Time. Network, server and application behavior all contribute to the Total Response Time.

The network

The network contributes to Total Response Time through a variety of mechanisms. The selection of protocols (e.g., Frame Relay or ATM, EIGRP or OSPF, FIFO or CBWFQ) strongly influences the delay experienced by a packet as it traverses the network. There is processing delay (a catch-all term for the various actions taken from when a packet is received by a node until it is assigned to a transmission queue), queueing delay (when other packets are present), transmission or serialization delay (the time elapsed between transmission of the first and last bits of the frame, determined by the link capacity), and propagation delay (the time it takes a bit to travel across the link, dependent on the physical medium and distance). Packet corruption and loss will either degrade the quality of information or introduce additional delay due to the need for retransmissions. In enterprise terrestrial networks, queueing and transmission delay are often the dominant components of network delay. In satellite networks, the propagation delay (coupled with the access protocol) can dominate.
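
As a rough illustration of the transmission and propagation terms described above, the following Python sketch computes them for two example links. The `Link` type, the function names, and the default propagation speed of 2e8 m/s (roughly two-thirds of the speed of light, a common approximation for copper and fibre) are assumptions made for this example only; processing and queueing delay depend on device load and are simply passed in as inputs.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical description of a single link (illustration only)."""
    capacity_bps: float                       # link capacity, bits per second
    length_m: float                           # physical path length, metres
    propagation_speed_m_per_s: float = 2.0e8  # assumed signal speed in the medium


def transmission_delay(frame_bytes: int, link: Link) -> float:
    """Serialization delay: time to clock every bit of the frame onto the link."""
    return frame_bytes * 8 / link.capacity_bps


def propagation_delay(link: Link) -> float:
    """Time for one bit to travel the length of the link."""
    return link.length_m / link.propagation_speed_m_per_s


def one_way_delay(frame_bytes: int, link: Link,
                  processing_s: float = 0.0, queueing_s: float = 0.0) -> float:
    """Sum of the four per-hop delay components named in the text."""
    return (processing_s + queueing_s
            + transmission_delay(frame_bytes, link)
            + propagation_delay(link))


# A 1,000 km terrestrial T1 (1.544 Mbit/s) versus a geostationary satellite
# hop at the same rate (up and down, about 2 x 35,786 km, signal at ~3e8 m/s).
t1 = Link(capacity_bps=1.544e6, length_m=1_000_000)
satellite = Link(capacity_bps=1.544e6, length_m=2 * 35_786_000,
                 propagation_speed_m_per_s=3.0e8)

for name, link in (("terrestrial T1", t1), ("satellite hop", satellite)):
    print(f"{name}: transmission {transmission_delay(1500, link) * 1000:.2f} ms, "
          f"propagation {propagation_delay(link) * 1000:.2f} ms")
```

With these assumed figures, a 1500-byte frame takes about 7.8 ms to serialize on either link, while propagation takes about 5 ms on the terrestrial path and roughly 239 ms on the satellite hop, which is in line with the observation that propagation delay can dominate in satellite networks.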

The server

Server delay is affected both by server and application design. Server performance is affected by processor speed, memory, I/O performance, and disk drive speed, as well as configuration settings. Application design includes architecture and algorithms.

The application

Application delay is affected by interdependent factors such as application design (e.g., are sessions persistent or transient?), transaction size, protocol selection (e.g., UDP or TCP, Tahoe or Reno), and network infrastructure. The fewer round-trips an application requires to complete a given transaction, the less sensitive it will be to the network infrastructure. However, the number of round-trips can itself depend on the network infrastructure, because of retransmissions.

Response time analysis approaches

There are several different approaches to response time analysis, based on the type of monitoring (passive vs. active) and the location of the monitor (server-side vs. client-side). The choice of approach affects total cost of ownership, the effectiveness and accuracy of response time measurements, and complexity of deployment. Each approach has merit, and several vendors in the marketplace support each methodology.

Server-side vs. client-side monitors

Server-side: Server-side monitors are deployed on a server (an agent) or near it (an appliance). Because these monitors need not be installed at client sites, they greatly reduce deployment and management costs. Since they are deployed on or near the servers, they can provide an unrestricted view of all clients and all transactions to and from the server farm. Their proximity also lets them provide the most accurate server delay statistics. A server-side agent is installed on the server to be monitored; care should be taken to ensure that it does not interfere with the server's operation. A server-side appliance may be either a pass-through (in-line) or a pass-by (tap) device.
In-line tools pass data through the device, much as a router does, and are therefore an additional potential point of failure for the application service. Tools that tap into the line cause no additional issues should they fail.

Client-side: Client-side monitors are deployed at the various client sites of interest. They provide the most accurate measure of end-to-end delay but have difficulty separating the network and server delay components. Two common client-side approaches are to ping the server periodically, or to take the TCP connection setup time as the network round-trip time and assume it is constant throughout the session. The first method can introduce gross inaccuracies because network devices may handle ICMP pings differently from application packets (routing, queueing, discarding, servicing). Both methods rely on sampling that may not be representative of the network conditions experienced by the actual application packets. Deployment of client-side monitors, while rarely easy, poses particular challenges for ISPs.

Passive vs. active monitors

Passive: A passive monitor is a non-intrusive device that observes actual application traffic. It typically either decodes the packets (minimally to the transport layer and possibly to the application layer) or uses the ARM (Application Response Measurement) API to identify the beginning and end of an application transaction. Since the analyzed data is actual end-user activity, this approach provides a representative measurement of the end user experience. Passive monitors are available as either client-side

or server-side tools. A passive server-side monitor has the capability to monitor all users, all transactions, all the time. A limitation of the passive approach is that it cannot be used for verifying service availability. Because there are no scheduled transactions, it cannot determine with 100% accuracy whether there is a connectivity failure or users are simply on holiday. It can, however, correlate historical information to arrive at a reasonable conclusion, provided the failure does not occur when users are normally inactive.

Active: An active monitor is an emulated client, normally installed on select desktops in a client-side approach. Active monitors replay simulated transactions on a scheduled basis. Their scheduled nature allows them to serve as a 24x7 check of network availability, regardless of clients' daily usage patterns. The active monitors execute scripts that in turn generate emulated transactions. A script must be constructed for each transaction of each application to be monitored, which adds management effort to ensure that the transactions continue to reflect user behavior over time. Active monitoring also introduces additional load on the network and servers. Because the requests are repetitive, caching (on the server, network, or monitor) may significantly skew results.

SuperAgent - a passive, server-side approach

Overview

SuperAgent is a passive server-side appliance that analyzes response-time behavior for TCP applications. SuperAgent identifies the source of slow application performance across wide area networks with no need for endpoint agents and without adding load to the network or servers. SuperAgent measures real end-user experience for all locations, all the time. The measurements are continuously updated during client-server interactions to reflect changing network conditions.

SuperAgent is available in a standalone or multiunit configuration. It connects to a mirrored port on a switch near the server farm and examines TCP packet header information. The SuperAgent solution is not restricted to specific applications such as HTTP or SAP/R3; it can monitor any TCP-based application. The response time delay is separated into network, application and server components to clearly identify bottlenecks.

[Fig. 1 - Standalone configuration. Diagram labels: Switch, Mirrored Port, Database, Transaction, Application, HTML User Interface, Data Store, SuperAgent Engine, Standalone SuperAgent]

Application response time measurement

SuperAgent separates response time into application, network, and server delay components, giving a full view of where time is being spent during a transaction. It calculates the network component of Round-Trip Response Time by tracking the time differential between the server's response and the corresponding acknowledgement from the client. SuperAgent's algorithms compensate for the fact that TCP does not need an acknowledgement for every packet, nor does it need to respond immediately. SuperAgent also compensates for client processing time.

Response time composition

A view of a typical response time graph (Fig. 2) reveals how SuperAgent separates key delay components to assess network, server and application response time.

[Fig. 2 - SuperAgent response time graph]

SuperAgent collects the response times by monitoring all the TCP/IP packets to and from the servers associated with the application. The packets are analyzed to decompose response time into the following components:

Connection Time (Conn Time) represents the time it takes to establish a TCP session (a communication path) between the client and server before data transfer can begin.

Server Delay (Srv Delay) represents the amount of time it takes for the server to process the request.

Data Transfer (Data Xfer) represents the time from when the server begins responding to a request (first packet sent) until all of the information has been placed on the network (last packet sent). Response flow is governed by the TCP protocol configuration as well as the application's design. This delay normally includes multiple round-trip times across the network to complete the transaction.

Round-Trip Time Retry (RTT Retry) represents the additional delay due to retransmissions.

Network Round-Trip Time (Network RTT) represents the amount of time it takes for a packet to traverse a round-trip on the network. Client and server processing time are excluded when calculating this value.
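
The component definitions above can be made concrete with a small Python sketch that derives them from packet timestamps observed at a server-side tap. This is an illustration under stated assumptions, not SuperAgent's algorithm: the `Transaction` record, its field names, and the use of the minimum (server packet, client ACK) interval as a crude way to discount delayed acknowledgements are all hypothetical, and RTT Retry is left out because it requires retransmission tracking.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Transaction:
    """Hypothetical timestamps (seconds) seen by a monitor next to the server."""
    syn: float             # client SYN observed
    handshake_ack: float   # client ACK completing the TCP handshake observed
    request_last: float    # last packet of the client request observed
    response_first: float  # first response packet sent by the server
    response_last: float   # last response packet sent by the server
    # (time a server packet was seen, time the matching client ACK was seen)
    ack_pairs: List[Tuple[float, float]]


def decompose(t: Transaction) -> Dict[str, Optional[float]]:
    """Split one transaction into the delay components named in the text."""
    rtt_samples = [acked - sent for sent, acked in t.ack_pairs]
    return {
        "conn_time": t.handshake_ack - t.syn,
        "srv_delay": t.response_first - t.request_last,
        "data_xfer": t.response_last - t.response_first,
        # Assumption: the smallest sample is least inflated by delayed ACKs
        # and client processing, so it is taken as the network round trip.
        "network_rtt": min(rtt_samples) if rtt_samples else None,
    }


example = Transaction(
    syn=0.000, handshake_ack=0.080,
    request_last=0.085, response_first=0.145, response_last=0.385,
    ack_pairs=[(0.001, 0.080), (0.145, 0.425), (0.385, 0.466)],
)
for component, seconds in decompose(example).items():
    print(f"{component}: {seconds * 1000:.0f} ms")
```

In the example, the second acknowledgement arrives 280 ms after the server's packet, far longer than the other two samples (about 80 ms each); a delayed or cumulative ACK could produce such a sample, which is why a monitor cannot simply average raw packet-to-ACK intervals.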

To provide a thorough analysis of the response time components described above, SuperAgent presents a comprehensive view of application performance through the following measurements:

Average response times to provide a view of overall response time.

Standard deviations to measure the variability of response time measurements. This measurement helps determine whether responses are clustered within a narrow range or whether there are wide variances from the norm that could affect the average.

Data volume to measure how much data (bytes and packets) is flowing to and from the servers. For example, data volume indicates whether there was enough usage to merit investigation of a high response time.

Response size to measure response time by specific transaction sizes. This measurement provides insight into whether delays were caused by abnormally large responses or by a large volume of small responses. It also allows alarms to be configured differently for large and small downloads.

Percentiles to measure the percentage of users experiencing a given response time. Viewing percentiles establishes the scope of a problem. For example, are 50% of users experiencing excessive response times, or are fewer than 10% of users seeing the problem?

Number of observations to measure the statistical significance of the response times being viewed. If the response time is high but is based on only one measurement, probably no action need be taken. If the response time is high and is based on a large number of observations, there is a real problem.

Session reports to analyze the TCP session data by measuring open, complete, timed out, and refused session counts. Open, complete, and timed out sessions provide insight into end user usage patterns. Refused session counts can indicate a SYN attack or a server that cannot respond to all requests.

QoS graphs to analyze Quality of Service.
This data provides metrics for understanding the impact of response time on users. Data analyzed includes rates per user, number of users, data loss rate, and server goodput.

Once a detailed view of traffic traversing the network is established, SuperAgent can then be used to deliver value for the enterprise.

Business value of response time analysis

SuperAgent is designed to integrate well into existing network infrastructure. It complements the other tools in a company's overall network performance management strategy by using an open database, standard formats for exporting data (CSV, XML), and SNMP traps. A response time analyzer such as SuperAgent lends itself to various business applications. Network and application managers can expect to be responsible for some or all of the areas detailed below, and an effective, easy-to-implement tool will help in accomplishing the assigned tasks.

Daily monitoring

Through the use of meaningful alerts, SuperAgent can be used on a daily basis to assist IT departments in monitoring the response times of applications and to identify critical times when network and application bottlenecks can occur. Alarms can be set based on automatic baselines collected by SuperAgent. When using the auto-baselining feature, users can dial in the sensitivity levels they desire, customizing the alarms to their specific environment. Such information can help Help Desks respond to end-user calls with correct and timely information, and will increase overall client satisfaction.

Troubleshooting

IT departments can use SuperAgent's ability to analyze the performance of critical applications and network transit times. They can isolate response time issues and determine whether such issues are due to network infrastructure or to application/server capabilities. Using SuperAgent to identify the source of response time problems can save hours of troubleshooting and thousands of dollars in lost productivity.
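
The baseline alarms, and several of the measurements listed earlier (averages, standard deviations, percentiles, number of observations), can be illustrated with a short Python sketch. It is only one plausible way to combine them and is not a description of SuperAgent's auto-baselining: the function names, the "mean plus k standard deviations" threshold, the `sensitivity` parameter, and the minimum observation count of 30 are all assumptions chosen for the example.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_alarm(baseline: Sequence[float], current: Sequence[float],
                 sensitivity: float = 3.0, min_observations: int = 30) -> bool:
    """Alarm when the current average exceeds the baseline by a margin.

    Assumed rule: threshold = baseline mean + sensitivity * baseline stdev.
    Too few current observations never raise an alarm, reflecting the point
    that a high value based on one measurement rarely merits action.
    """
    if len(current) < min_observations:
        return False
    threshold = statistics.mean(baseline) + sensitivity * statistics.stdev(baseline)
    return statistics.mean(current) > threshold


# Hypothetical response times in seconds.
baseline_times = [1.0, 1.2, 0.8, 1.1, 0.9]
slow_period = [2.0] * 40      # sustained slowdown, many observations
single_outlier = [2.0]        # one slow transaction only

print("alarm on sustained slowdown:", should_alarm(baseline_times, slow_period))
print("alarm on single outlier:", should_alarm(baseline_times, single_outlier))
print("90th percentile of baseline:", percentile(baseline_times, 90))
```

Lowering `sensitivity` makes the alarm fire on smaller departures from the baseline; raising `min_observations` makes it more conservative during quiet periods.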

Trouble assignment

In a typical company, network support is distributed across multiple IT teams, each responsible for a different technology of the network. For example, a typical IT department is composed of several groups: a WAN group that handles routing and transport issues; LAN groups that handle switching, client, and patching issues; application groups that manage client applications; and perhaps several server groups that handle computer room operations across a multi-tiered network. When response time problems occur in the network, isolating the type and location of the problem (and thus which team is responsible for resolution) can cause logistical confusion and waste valuable resources and time. SuperAgent gives IT departments the ability to quickly identify whether the problem is related to the network, a specific server, or a specific application running on that server, and which region of the network is affected.

Geographically dispersed clients

Many problems faced by IT departments and service providers involve response issues with geographically dispersed clients. SuperAgent provides data and analysis by region using subnet groupings, and can pinpoint application response delays associated with diverse network topologies. An analysis of the SuperAgent data may demonstrate the need to upgrade a remote link servicing a group of users or to offload

application processes to additional servers. SuperAgent may also be used to target locations for new servers, such as database servers, mail servers, domain controllers, and intranet web sites.

N-Tier application analysis

N-Tier applications are those applications that act across several different computer systems in a distributed network topology. For example, an applet composed of scripts might be installed on a client that accesses an application containing powerful algorithms on a server. The server in turn will query a database system located on yet another server. Isolating response issues in such an architecture can be time-consuming and error-prone, as response issues may be a function of available bandwidth, transit topology, back-office processing, and/or server sizing. One would need to view the network interdependencies and application conversations between the three computers to identify and resolve problems. Such an analysis is a core capability of SuperAgent. SuperAgent can also be used in the prototyping of custom N-tier applications, as programmers can see how various approaches affect application performance across multiple servers and diverse WAN and LAN topologies.

Service Level Agreements (SLAs)

IT departments often have performance-based SLAs, which are hard to measure. They are faced with the task of ensuring that external entities such as service providers are meeting SLAs. They must also demonstrate to company executives, directors, and managers that internal SLAs for user groups within their own company are being met. With alarms that can be configured to specific thresholds, SuperAgent can notify IT staff of slow response, instead of waiting for end users to do so. SuperAgent can then measure, analyze, and document user-experienced response times in terms of the contribution from the server or application being accessed by the user versus the network contribution.
This quantification can then be used to determine whether response time issues lie in other areas of the network, such as leased portions of a network (for example, public Frame Relay networks), or within the application server environment itself. Thus, SLA compliance can be monitored for exceptions and causes, and the effectiveness of associated solutions can be measured empirically by comparing SuperAgent data before and after a remedy is implemented.

Service providers often have different priorities for various customers. SuperAgent allows classes of users to be defined and separate SLAs applied to each class. This provides support for companies using multiple connection technologies such as LAN, WAN and satellite, or having different grades of service. SuperAgent can also be used to tailor SLAs to the actual performance characteristics of different services provided by Application Service Providers (ASPs). Setting reasonable and expected levels for SLAs can allow IT departments to purchase the required level of service cost-efficiently, while helping ASPs provide a layered service model containing multiple cost/service points.

Conclusion

This white paper has described response time components as well as monitoring approaches. We believe that, while each approach has its benefits and drawbacks, overall the passive, server-side approach provides the best blend of accurate measurement and ease of deployment. Response time monitoring and the ability to quickly determine the source of a problem provide business value to IT departments in terms of productivity and minimized downtime.

Fluke Networks
P.O. Box 777, Everett, WA USA 98206-0777

Fluke Networks operates in more than 50 countries worldwide. To find your local office contact details, go to www.flukenetworks.com/contact.

© 2003 Fluke Corporation. All rights reserved. Printed in U.S.A. 10/2003 2112769 D-ENG-N Rev A