The Network or The Server? How to find out fast!

White Paper

Contents
Getting to the Bottom of Performance Problems Quickly
Collaborating across the IT Performance Boundary
Copyright Information

by John Q. Walker and Kent Erickson

A user calls and reports that a networked application is running poorly. Is it the network, or is it the server? Whether you're an applications manager, a network manager or a server administrator, you'll need to answer that question again and again.
To use your resources efficiently, you need to solve and eliminate performance mysteries by tracking problems to their source, even if the source isn't in your department. Learning about the other side of application troubleshooting helps move you past the endless questions about who's responsible for slowdowns, and on to providing solutions. Knowing where the problem is helps you help users faster.

Getting to the Bottom of Performance Problems Quickly

The following questions help you tell network trouble from server trouble and provide early, targeted intervention to keep networked application performance at its best. Some questions provide direction no matter what your job description is. Where appropriate, though, we've offered both network- and server-focused troubleshooting strategies to give you a broader perspective on performance issues.

1. What counts as a performance problem, anyway?

The answer depends on the application. Users use the word "slow" for problems with a client-server, back-and-forth transaction. Multimedia streaming applications tend to "look ugly" or "sound bad." Poor quality usually means long response time, although for multimedia applications such as video-on-demand or voice-over-IP it can mean variable response time. It's important to measure performance both when users aren't complaining and when they are, so you can see what's changed, and it's important to measure it from both a network and a system/application viewpoint.

Problems in the Network

Performance for large data transfers is reflected by throughput. Throughput depends on how much bandwidth is available on the slowest link and how much other traffic there is. Looking at throughput is okay for bulk data transfers, but many business applications involve lots of back-and-forth traffic. For these applications, user experience is reflected by response time, that is, how long it takes to complete the transaction.
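As a rough illustration of how throughput and response time differ, the sketch below models a transaction as a number of request/response turns, each paying one network round trip, plus the time to push the payload through the slowest link. The function name, its parameters, and the example figures are assumptions chosen for illustration and are not taken from any NetIQ tool. The model also ignores server time, protocol overhead, and competing traffic.

```python
def estimate_response_time(turns, round_trip_s, payload_bytes, bandwidth_bps):
    """Rough response-time estimate for a back-and-forth transaction.

    Assumed model: every turn costs one network round trip, and the total
    payload is serialized once over the slowest link.
    """
    latency_part = turns * round_trip_s
    transfer_part = (payload_bytes * 8) / bandwidth_bps
    return latency_part + transfer_part


# Hypothetical transaction: 50 turns, 80 ms round trip, 100 KB of data.
slow_link = estimate_response_time(50, 0.080, 100_000, 1_500_000)
fast_link = estimate_response_time(50, 0.080, 100_000, 15_000_000)
low_latency = estimate_response_time(50, 0.040, 100_000, 1_500_000)

print(f"1.5 Mbps, 80 ms round trip: {slow_link:.2f} s")
print(f"15 Mbps, 80 ms round trip:  {fast_link:.2f} s")
print(f"1.5 Mbps, 40 ms round trip: {low_latency:.2f} s")
```

With these made-up numbers, ten times the bandwidth saves less than half a second, while halving the round-trip time saves two seconds.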
Increased bandwidth often improves throughput, but usually does little for users' response time. Response time is most affected by the end-to-end network latency [1], that is, how long it takes to get through the network from end to end.

Multimedia performance problems happen when data gets lost or scrambled. In contrast to transaction-oriented traffic, the streaming traffic used by multimedia applications is sent in one direction. Its throughput rate is preset by the sender, and its response time isn't a concern since there are no transactions. Multimedia performance problems can usually be described in terms of lost data and the variation in the arrival rate, also known as jitter.

Problems at the Server

Server problems manifest themselves in events. From the perspective of a server administrator, performance can be affected either by sudden system or application errors, or by an overloaded hardware subsystem.

In the Windows world, the operating system and most applications are good at writing events to the event log in case of a sudden error. A server administrator wants to know that a critical event has been written, on what system, and what user applications are affected. NetIQ's AppManager watches for event log messages, alerts administrators that there is a problem, and provides a simple view of the applications installed on each server.

Server hardware subsystems should be monitored individually. Assuming a system has been set up properly, the most common cause of poor server performance in business applications has historically been the disk subsystem, followed by memory, and then CPU. Since any of these three subsystems can contribute to a performance problem, it's important to track how busy each subsystem is by looking at both the percentage of time it's busy and the number of tasks waiting to be serviced.

Copyright NetIQ Corporation 2000-2001.
2. How was the problem reported?

A problem reported by just one user may be traceable to the user's computer. Problems reported by lots of users point to something more pervasive in the network or at a server. There are some problems that no users see: examples are alerts displayed on a system management console, or a configuration mismatch found by someone on the network or server administration team. An end-to-end user problem where something's clearly wrong is usually handled more promptly than an alert where there's otherwise no observable effect.

3. Is it an availability or reachability problem?

Is there really a performance problem (that is, there's a slowdown) as opposed to an availability problem (that is, it just doesn't work or the application can't be reached)?

Check the Network

For applications using IP addresses or domain names, it's easy to use Ping, since the software to respond to an incoming Ping is built into every TCP/IP stack. However, there's no Ping for IP Multicast, IP Quality of Service (QoS), SNA, or Novell networking. For these protocols, the good news is that Ping's function (and, even better, the ability to do a third-party Ping) is built into NetIQ's Performance Endpoints.

Check the Server

When a complaint occurs about slow response time, it may mean that one component in a redundant system has failed, causing a server on another continent to respond. Many large organizations using Windows NT4 domains have experienced this type of problem with logon authentication when a Backup Domain Controller fails. By monitoring system and service availability, NetIQ's AppManager helps ensure that the obvious problems are noticed and brought to an administrator's attention.

4. Can you validate what users are experiencing?

Being able to recreate a user performance problem increases your efficiency. You can prove the problem is not a one-time event. It lets you remove the user from troubleshooting and debugging steps.
You can get a good idea that the problem is solved before going back and telling the user it's fixed. NetIQ's Chariot, which actively generates traffic that looks like application traffic and measures end-to-end performance, can help you troubleshoot performance problems without getting your users involved and can help you verify that problems are fixed.

5. When was the problem observed?

When was a performance problem first observed? Is it a one-time event, or is there a pattern? If it's an event, what changed or what occurred externally? Many new problems occur after something changes in the network or the servers. Was there a configuration change? Was there an upgrade? Was a new component added? Are there new personnel? Was there a power failure?

6. Is there a pattern?

If a performance problem isn't a one-time event, there's probably a pattern to its occurrence. For example, does the problem occur every day at noon, or on the last business day of the month? Has it been occurring since last Saturday? Has performance gradually been getting worse over the past three months?

Sometimes a long-term trend or pattern can result in an event. For example, a disk drive slowly consumes its free space and at some point runs out completely, causing an event. Similarly, more and more users may be added to a network, at some point exceeding the capacity of the router assigned to handle their traffic.

Monitor the Network

Use NetIQ's End2End network monitoring to document network performance between hundreds or thousands of connected computers. End2End can run tiny scheduled tests, say every 15 minutes, all day long, every day. You can look back over many hours, days, or weeks to see similar patterns of network performance. Dozens of pre-built End2End reports make it easy to find the specific correlations you're looking for.

Monitor the Server

Long-term performance management is called capacity planning. NetIQ's AppManager includes reports that track application and server availability and performance by day, week, or month. A simple graph can often be used to instantly spot a server subsystem that has become overloaded and is creating a problem, and regular reviews of capacity-planning reports can help prevent these types of problems from ever occurring by enabling timely server upgrades.

7. Where was the problem observed?

Is the problem observed by users of the same application, all in the same geographic location but not other locations? If the problem's in just one location, it's probably a network problem. Does the problem occur for all users of a particular server or application, no matter where they're located? Then the problem is probably in the network near the server or in the server itself. Is it just one user? Then it's probably a client configuration or server setup problem for that client, not a network problem.

Modern applications are often provided by several computers working together; for example, a Web site might consist of several HTTP servers, a transaction server, and a database server. Network monitoring can show that the Web site is slow, but system monitoring can show which system in the Web site is not performing.
It's important to monitor every system that provides services, with both common metrics for raw server performance and detailed application metrics to allow problem diagnosis.

Figure 1: The End2End Transaction Time Percentage report displays the percentage of response time spent in the client computer, in the network, and in the server computer. In this example, an increasing percentage of time is being spent in the server computer, indicating a possible server problem.
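The location questions in this section can be read as a simple triage rule. The sketch below encodes that rule for a list of complaint records. The record fields, the function name, and the returned wording are hypothetical, and the function sees only the complaints it is given, so the result is a first guess about where to look rather than a diagnosis.

```python
from collections import namedtuple

# Hypothetical complaint record: who reported it, from where, against which server.
Report = namedtuple("Report", ["user", "location", "server"])


def suggest_starting_point(reports):
    """Return a first guess at where to look, based on who is affected."""
    if not reports:
        return "no complaints recorded"

    users = {r.user for r in reports}
    locations = {r.location for r in reports}
    servers = {r.server for r in reports}

    if len(users) == 1:
        return "client configuration or server setup for that client"
    if len(locations) == 1:
        return "network serving that location"
    if len(servers) == 1:
        return "network near the server, or the server itself"
    return "no single pattern; gather more data"


# Example: two users in different (made-up) locations, both using the same server.
example = [
    Report("user_a", "site_1", "web01"),
    Report("user_b", "site_2", "web01"),
]
print(suggest_starting_point(example))
```

The order of the checks mirrors the text: a single user is treated as a client issue first, a single location as a network issue next, and a single server across several locations as a problem at or near that server.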
8. What's the performance history of this application? Of your network and server?

Monitor the Network

For network problems, it's helpful to know the paths the traffic is taking and what juncture points in the network are likely to be causing problems. Route monitoring lets you isolate a problem to a specific device or hop in a network; it also lets you see the routes taken by traffic when no problems were reported and compare those to routes taken when performance was poor.

NetIQ's End2End network monitoring lets you isolate network performance patterns from user behavior. You can collect performance information when there isn't an historical base, even when there are no users around to run applications. When a complaint is reported, set up End2End to monitor the connection between the client and server. What's different between the periods when users complain and when they don't? If the network measurements remain consistent, yet problems are reported at different times, the problem's probably not in the network.

End2End application monitoring functions are valuable for looking behind anecdotal complaints and seeing what's really happening. Application monitoring lets you see a long-term set of trends in performance, and break transaction times into their component parts.

Monitor the Server

Use NetIQ's AppManager server monitoring to collect data on disk busy percentage, disk queue lengths, memory utilization and paging rates, and CPU busy percentage and queue lengths. When a complaint is received, you'll have a graphical view of how the performance of the system differed between the time of the complaint and times when performance was acceptable. That's an easy way to determine if you can solve the problem with a simple hardware upgrade, or by balancing applications across several servers.

NetIQ's application modules are valuable for diagnosing and correcting application-specific problems at a deeper level than raw hardware performance. For example, the SQL Server module monitors the cache hit ratio; a change to a stored procedure might force the database to retrieve data from disk instead of from the cache, greatly reducing the performance of individual queries while raw system measurements still seem acceptable.

Many application modules also include the ability to perform scheduled synthetic transactions. These are a useful aid in comparing an end user's view of a transaction with the server's view. For example, the IIS module can schedule retrieval of specific Web pages either from the monitored server or another server, recording the time it takes for the page to load. If you have user complaints about poor Web server performance, you can use the records of synthetic transactions to help determine if the problem is on the server or the network.

9. How's the total transaction time being spent?

A breakdown of transaction time can tell you a lot about where the problem lies. End2End application monitoring reports show what portion of transaction time is spent in the client machine, what portion in the network, and what portion in the server. Find the most recent time, location, and application with bad performance, and see what the client/network/server breakdown has to say about which component used the most response time. Compare this breakdown to other times when the performance was good, say last month at the same day of the week and time of day. What's changed? If the server time is worse but the network time is constant, the problem's probably in the server. Look at other time periods with the same pattern: what's similar? What differences can you spot between those times and times of better performance?
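The comparison of a good period against a bad one can be sketched as follows, assuming you can already obtain the seconds spent in the client, the network, and the server for a transaction. The dictionary layout, the function names, and the sample numbers are illustrative assumptions, not the format of an End2End report.

```python
def breakdown_percentages(times):
    """Convert per-component seconds into percentages of total transaction time."""
    total = sum(times.values())
    if total <= 0:
        raise ValueError("total transaction time must be positive")
    return {part: 100.0 * seconds / total for part, seconds in times.items()}


def largest_increase(baseline, current):
    """Return the component whose time grew the most, and by how many seconds."""
    growth = {part: current[part] - baseline[part] for part in baseline}
    worst = max(growth, key=growth.get)
    return worst, growth[worst]


# Hypothetical measurements in seconds: a good period and a bad period.
good = {"client": 0.2, "network": 0.5, "server": 0.3}
bad = {"client": 0.2, "network": 0.5, "server": 1.3}

print(breakdown_percentages(bad))
print(largest_increase(good, bad))
```

In this made-up case the network time is unchanged while the server time has grown by about a second, which matches the situation where the problem is probably in the server.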
Figure 1 shows an example breakout of response time: of the total response time for a transaction, how much time is spent at the client, how much at the server, and how much in the network? How do these percentages change over the course of a day?

Collaborating across the IT Performance Boundary

In many businesses, different teams are responsible for network and server performance. But anyone who works with networked applications shares the goal of efficient response and resolution for application performance problems. Both groups benefit when the source of a problem (be it client, network or server) is quickly identified. Using the questions, tools and strategies we've recommended here can help you anticipate, track and troubleshoot problems wherever they crop up.

About The Authors

John Q. Walker is the director of network development of NetIQ Corporation. He was a co-founder of Ganymede Software, which joined NetIQ Corporation in spring 2000. He can be reached at johnq@netiq.com.

Kent Erickson is a senior developer with NetIQ Corporation. He joined Mission Critical Software in Houston, TX, which merged with NetIQ Corporation. He can be reached at kent.erickson@netiq.com.

References

1. "It's the Latency, Stupid," Stuart Cheshire, May 1996, available on the Web: http://rescomp.stanford.edu/~cheshire/rants/latency.html.

Acknowledgments

This paper was suggested by Katherine Demacopoulos. Several people improved this paper by contributing excellent reviews: Joana Bacon, Marya DeVoto, Jeff Hicks, Jim McQuaid, and Kim Shorb.
Copyright Information

NetIQ Corporation provides this document "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of merchantability or fitness for a particular purpose. Some states do not allow disclaimers of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This document and the software described in this document are furnished under a license agreement or a non-disclosure agreement and may be used only in accordance with the terms of the agreement. This document may not be lent, sold, or given away without the written permission of NetIQ Corporation. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, or otherwise, without the prior written consent of NetIQ Corporation. Companies, names, and data used in this document are fictitious unless otherwise noted.

This document could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein. These changes may be incorporated in new editions of the document. NetIQ Corporation may make improvements in and/or changes to the products described in this document at any time.

© 1995-2001 NetIQ Corporation, all rights reserved.

U.S. Government Restricted Rights: Use, duplication, or disclosure by the Government is subject to the restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause of the DFARs 252.227-7013 and FAR 52.227-29(c) and any successor rules or regulations.
AppManager, the AppManager logo, AppAnalyzer, Knowledge Scripts, Work Smarter, NetIQ Partner Network, the NetIQ Partner Network logo, Chariot, End2End, Pegasus, Qcheck, OnePoint, the OnePoint logo, OnePoint Directory Administrator, OnePoint Resource Administrator, OnePoint Exchange Administrator, OnePoint Domain Migration Administrator, OnePoint Operations Manager, OnePoint File Administrator, OnePoint Event Manager, Enterprise Administrator, Knowledge Pack, ActiveKnowledge, ActiveAgent, ActiveEngine, Mission Critical Software, the Mission Critical Software logo, Ganymede, Ganymede Software, the Ganymede logo, NetIQ, and the NetIQ logo are trademarks or registered trademarks of NetIQ Corporation or its subsidiaries in the United States and other jurisdictions. All other company and product names mentioned are used only for identification purposes and may be trademarks or registered trademarks of their respective companies.