The eG Suite: Enabling Real-Time Monitoring and Proactive Infrastructure Triage
White Paper
Restricted Rights Legend

The information contained in this document is confidential and subject to change without notice. No part of this document may be reproduced or disclosed to others without the prior permission of eG Innovations, Inc. eG Innovations, Inc. makes no warranty of any kind with regard to the software and documentation, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose.

Copyright

Copyright 2003 eG Innovations. All rights reserved. eGurkha and eG ASPlite are trademarks of eG Innovations. All other trademarks, marked and not marked, are the property of their respective manufacturers. Specifications subject to change without notice.
Introduction

In the recent past, the complexity of Internet/Intranet services has grown dramatically. Many new business models, new customer-focused services, and efficient on-line collaboration services have emerged that improve the overall operational efficiency of businesses. To support these new services, IT infrastructures have grown in complexity. Rather than supporting simple client-server applications, IT infrastructures are now designed to comprise multiple inter-operating tiers. The front-end includes firewalls to safeguard against malicious attacks, web servers to handle user traffic, and load balancers to distribute traffic amongst all the web servers. The back-end has grown to be even more complex. While the web servers mainly act as HTML gateways that forward user requests, it is the middleware application servers hosting the business logic components that communicate with database servers, payment gateways, order processing systems, etc., to accomplish the business functions.

Figure 1: Silo-based monitoring is no longer sufficient for managing multi-tier IT infrastructures

While the use of multi-tier architectures helps with respect to infrastructure scalability, it also poses interesting challenges for monitoring and management. For example, consider a user logging into a multi-tier web site (Figure 1). The user request is received by a web server and forwarded to the login application running on the middleware application server, which in turn accesses a backend database. If there is a problem with the database service (say, database access is 50% slower than normal), it is likely that the login application will be affected, and that the web server will also be affected.
In this example, a single problem has rippled through and affected multiple infrastructure tiers, resulting in a number of alarms - e.g., from the database server, application server, web server, etc. Since the end-to-end service involves multiple dependent application and network elements, a failure in one of the tiers (e.g., database) can affect the other tiers as well (e.g., web server, applications, etc.). Consequently, problem identification and diagnosis in multi-tier infrastructures is a huge challenge.

Owing to the inter-dependencies between the different tiers of a multi-tier infrastructure, monitoring solutions that look at the target environment as a collection of distinct, independent silos are often not capable of helping operators quickly determine when and where problems originate. For instance, in the example in Figure 1, a database problem could be impacting the performance of the web server and application server tiers. A monitoring solution based on the silo approach will indicate that problems exist in all the infrastructure tiers, but will not be capable of identifying what the source of the problem is or where it lies.
Service Monitoring - The Need

To further compound matters, as IT infrastructures have grown in complexity, it has also become impossible for a single operator/administrator to be responsible for the entire infrastructure. Typically, the maintenance team for a large infrastructure comprises application developers - those who develop the applications - and domain experts like the WebLogic administrator, network administrator, database administrator, etc. While the application developers and domain experts are responsible for putting together the infrastructure, yet another group, the service managers, are responsible for the 24x7 operation and performance of the end-to-end service. Since a service involves multiple, heterogeneous applications and network devices, it is not reasonable to expect that a service manager has expertise in all the domains involved in the service (web, network, database, application, etc.).

While there are a variety of tools available for managing specific applications and network devices in depth, these tools are mainly appropriate for the domain experts and application developers, who are interested in optimizing the performance of the infrastructure components under their control, and in advanced troubleshooting and diagnosis - e.g., which SQL query is consuming too many resources, which Java component is leaking memory, etc. On the other hand, the service managers are primarily interested in keeping the service running with good quality of service. Their interest is mainly in determining when a problem happens and which domain is the cause of the problem - is it the network? is it the server? is it the database? Knowing this, a service manager can quickly determine, when a problem happens, which domain expert or application developer to hold responsible for solving it. The term "triage" refers to the process by which a service manager ranks the current problems in an IT infrastructure by importance and priority, and sorts them based on their need for immediate action.
In today's environment, the infrastructure triage process is very cumbersome and time-consuming. When a problem is reported, the service manager has to bring together all the domain experts and application developers to review the problem report and analyze which domain(s) could be causing the problem. The fact that each domain expert and application developer could be using disparate tool sets, with widely different user interfaces, makes the triage process extremely complicated and time-consuming (see Figure 2). A rule of thumb is that it takes eight hours on average to find the cause of a problem, and that over 80% of the time to repair is actually spent in problem isolation.

Figure 2: Infrastructure triage is often cumbersome and time-consuming
In order to effectively maintain and manage IT infrastructures, service managers require monitoring and management solutions that can enable them to determine the following in real-time:

- How is the service performing? The service managers should be able to obtain service quality reports that can quantify the performance being delivered to users of the infrastructure services.

- If there is a problem, which domain is the cause of the problem - is it the network? server? database? application? It is critical for the service manager to be able to quickly pin-point, when a problem happens, which domain could be the cause of the problem, and what the potential problem could be. Depending on the nature of the problem, the service managers themselves should be able to correct simple or frequently recurring problem situations. For more complex issues, the problem reports provided to the service managers should allow them to quickly hand over the problems to the appropriate domain expert or application developer for immediate troubleshooting and resolution. Keeping in mind that the service manager may not have the expertise or the time to sift through tons of data, the monitoring solution should be simple to use and effective, i.e., enable the service manager to perform his/her tasks without needing to spend a lot of time and effort.

- Where are the potential bottlenecks in service delivery and how can the service performance be optimized? Even when the service is performing as expected, there may be periodic trends in service usage that could point to potential future problem situations. The ideal monitoring solution for a service manager will provide proactive indicators of system bottlenecks that, if corrected in advance, could avert future performance problems.

Proactive Infrastructure Triage (TM) using the eG suite

The eG suite is a comprehensive real-time monitoring and proactive infrastructure triage solution that addresses the key requirements of IT infrastructure service managers. Figure 3 illustrates how the eG solution operates.
While network monitoring solutions focus on the network elements alone, and silo-based application monitors focus on individual applications, the eG suite takes a holistic view of the entire IT infrastructure. Taking the end-user perspective, the eG suite tracks the service performance in terms of availability, response times, and usage. To complement the service level monitoring (which can reveal potential problems with the service) and to further triage a service problem, the eG suite tracks the health of the individual IT infrastructure components including network devices, servers, applications, etc. Specialized monitors for over fifty popular application platforms, support for most common Microsoft Windows and Unix server operating environments, and coverage of basic network monitoring requirements ensure that the eG suite provides comprehensive insights into an IT infrastructure's performance.

The eG suite is targeted primarily at the service managers. To understand the function of the eG service manager, which is the central component of the eG architecture, let us draw an analogy to the function of a general physician handling a medical complaint from a patient. Most of the time, the general physician is the first point of contact for the patient. In many cases, the general physician him/herself is able to deduce where the problem is and prescribe a remedy. In more complex cases, the general physician directs the patient to an expert (e.g., dentist, eye specialist, neurologist, etc.) who will be able to correct the problem. The eG service manager performs a similar function for IT infrastructures. A service manager can use the eG manager to determine how the infrastructure services are performing. When a problem is detected, the eG manager provides the next level of detailed diagnosis. In a majority of cases, using this information, the service manager can proceed to fix the problem.
In cases where service managers do not have the necessary access, or where additional troubleshooting or expertise is necessary, they can forward the problem to the appropriate domain expert. Through its infrastructure triage capability, the eG manager helps a service manager determine which domain(s) is the cause of the problem, so that the problem can be routed to the right expert (Figure 3).
Figure 3: Automatic infrastructure triage with the eG suite eliminates finger-pointing

Figure 4 summarizes the benefits that accrue from using the eG suite for IT service management:

- Service managers can greatly benefit from the infrastructure triage capabilities - they can quickly figure out which expert to contact in the event of a problem.

- By reducing the finger-pointing between domain experts and application developers, the eG suite ensures that problems are resolved faster - well before users can notice them - thus enabling higher service availability and improved user satisfaction.

- With the eG suite in place, service managers too have to spend less time in problem resolution, and hence they can focus their time and energies on more productive activities.

- By ensuring that the domain experts and application developers are involved in troubleshooting only those problems that are relevant to their areas of expertise and responsibility, the eG suite ensures that these experts are efficiently used.

- Since it provides a 100% web-based interface, the eG suite facilitates collaborative management in multi-domain environments - for example, in a hosted environment, the service manager is responsible for the network and server infrastructure, but the application layer is the responsibility of the user. In such situations, both the service manager and users can access the eG manager and obtain a consistent view of the status of their infrastructure. Since it clearly quantifies the performance across the different tiers and layers of the infrastructure, the eG suite enables service managers and users to quickly figure out whether a problem relates to the user domain or the service provider domain.
This powerful capability ensures that users can monitor their own applications, and that they need not even call the service manager in the event that a problem is being caused by a fault in their application(s). The consequent reduction in support calls to the service manager can result in a significant cost saving in such multi-domain environments.

Figure 4: Benefits of using the eG suite for IT service management (callouts: service operators can identify and solve problems without requiring expert assistance; audit performance; identify/fix common problems easily; eliminates finger-pointing; service operators know which expert to contact; efficient use of experts - call them only when necessary)
A service monitoring solution like the eG suite is intended to augment current monitoring and maintenance practices, not radically change them. For instance, domain experts will still need to use silo-based expert tools like network sniffers, database tuning tools, source code optimization solutions, etc., for fine-tuning the performance of the infrastructure components they control. By providing a high degree of visibility for service managers into the functioning of the different domains of an IT infrastructure, the service monitoring solution enables more effective streamlining and efficient operation of the IT infrastructure.

The eG Difference

Having highlighted the need for a service monitoring and infrastructure triage solution in IT infrastructures, in this section we will focus on what makes the eG suite the preferred solution for most service managers. The key characteristics of the eG suite and how they benefit customers are discussed below:

Scalable, 100% WEB-BASED architecture: Although it uses the conventional manager/agent architecture that is widely used by most management systems, the eG suite is unique in its use of web technologies. The eG architecture itself is built along the lines of multi-tier web architectures and hence supports small and large IT infrastructures equally well. All communications between the manager and agents use HTTP/HTTPS. The key advantage of this approach is that it permits the manager and agents to be in different physical locations, possibly separated by multiple demilitarized zones. In fact, the agents can even reside within private Intranets and still be managed by an eG manager in a central location. This architecture is ideally suited for large enterprises and managed service provider environments. Many IT infrastructures have virtual private networks deployed between the managed environment and the network operations center simply to allow secure access to the monitored servers.
By innovatively using the web protocols (HTTP/HTTPS) and agent polling technology, eG's 100% web-based architecture offers an easy-to-deploy solution at a much lower cost for monitoring and managing your IT infrastructures across geographically disparate networks.

Single agent technology: As alluded to earlier, the eG suite includes extensive monitoring capabilities for networks, servers, and applications. The table below (Figure 5) summarizes the variety of IT infrastructure components monitored by the eG agents.

Component Type: Component Brand
Operating systems: Windows NT, 2000, 2003 server, AIX, HPUX, Red Hat Linux, Solaris (SNMP-based support for Novell NetWare and other operating systems)
Web servers: Apache, iPlanet/SunONE, Microsoft IIS, IBM HTTP Server, Oracle HTTP Server
Web application servers: WebLogic, ColdFusion, ATG, iPlanet/SunONE, Microsoft transaction server, WebSphere, SilverStream, JRun, Tomcat, Oracle 9i OC4J, Oracle Forms Servers, Borland Enterprise Server
Database servers: Oracle, Microsoft SQL server, DB2 UDB, Sybase, MySQL
Network devices: Cisco routers, Cisco Catalyst switches, Baystack hub, Local director, any MIB-II compliant device
Microsoft applications: Active Directory, BizTalk server, Windows Internet Name Service (WINS), Domain controller, FTP server, DNS server, DHCP server, Print server, Proxy Server, File server, Event logs
Firewalls: Check Point FireWall-1, Cisco PIX
Thin-client servers: Citrix MetaFrame, Microsoft Terminal server
Email servers: Microsoft Exchange, Lotus Domino R5, SunONE/iPlanet messaging server
Messaging servers: MSMQ, WebSphere MQ, FioranoMQ
Others: Tuxedo domain servers, Network printers, NetApp filers and NetCache, Novell GroupWise

Figure 5: IT infrastructure components monitored by the eG suite
Most silo-based monitoring solutions require one agent module per application that is monitored. With such a model, separate agent licenses need to be purchased depending on various parameters like the deployment platform, the types and number of applications monitored, the number of CPUs, etc. In contrast, the eG suite offers a powerful single agent licensing policy for its agents. As per this policy, a single eG agent can monitor all the applications executing on a server (assuming one IP address per server). Moreover, agent licenses are not tied to operating systems or node-locked, thereby allowing operators to pick and choose where they want to deploy the agents, and to even dynamically change the location of the agents. Furthermore, the agent licensing is not tied to the hardware capabilities of the server being monitored by the agent. Its simple and cost-effective agent licensing model makes the eG suite an attractive solution for IT infrastructure monitoring.

Real-time, PROACTIVE MONITORING of the TRUE end-user experience: The experience that an IT infrastructure offers to its users is governed predominantly by how well its application components perform. Many monitoring tools use emulated requests to monitor web transactions. The drawbacks of such emulation-only techniques are:

- This approach cannot be used to monitor critical transactions such as payment, registration, etc.

- Moreover, since they merely sample the functioning of the target environment, these emulation techniques typically detect and report problems only when they are severe enough to impact end-user performance, i.e., they are useful mainly for reactive monitoring.

In order to avoid the drawbacks of the emulation-only approach, eG agents deploy a proprietary web-adapter technology that enhances vanilla web servers with the capability to track and report various metrics relating to individual web sites and even web transactions in real-time.
The monitoring is done in an implementation-independent manner, as a result of which eG agents are able to monitor Java (Servlets, EJB, JSPs) and non-Java implementations (ASP, PHP, CGI, etc.) with equal felicity. Since it is able to monitor real-user transactions to web servers in real-time, eG's web adapter technology enables the agents to proactively monitor and quantify all anomalies that may occur in an IT infrastructure. Figure 6 below shows real-time monitoring of web transactions for a web site. In this example, the user registration transaction is experiencing a problem, and the percent errors is 100%, implying that users are not able to register via the web site.

Figure 6: Tracking real user transactions to a web site
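The "percent errors" metric in the registration example can be illustrated with a minimal sketch. This is not the eG web adapter, whose internals are proprietary and not described here; it only assumes that each real-user request is available as a (transaction name, HTTP status) pair and that a 5xx status counts as an error. The function name `percent_errors` and the record layout are hypothetical.

```python
from collections import Counter


def percent_errors(requests):
    """Return {transaction: percentage of requests that failed}.

    `requests` is an iterable of (transaction_name, http_status) pairs.
    Treating any status >= 500 as an error is an assumption of this sketch.
    """
    totals = Counter()
    errors = Counter()
    for name, status in requests:
        totals[name] += 1
        if status >= 500:
            errors[name] += 1
    # Every transaction in `totals` has at least one request, so no division by zero.
    return {name: 100.0 * errors[name] / totals[name] for name in totals}
```

Because every real request is counted rather than a periodic sample, a transaction such as registration shows up at 100% errors as soon as all of its requests fail.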
Automatic thresholding: All the metrics collected by an agent are subjected to thresholding, i.e., comparing their values with pre-defined upper or lower bounds to determine if there is any abnormality. Many monitoring solutions require administrators to specify the thresholds for every measurement. Explicitly configuring thresholds for each and every metric being collected can be a laborious process spanning days or even months for large IT infrastructures. To simplify the configuration process for an IT infrastructure, the eG suite includes a unique automatic threshold computation capability. In this approach, the eG manager computes the thresholds dynamically, using tried and tested statistical quality control techniques to analyze past values of the metrics and to automatically set the upper and lower bounds for each metric from the historical data. Since the values of the metrics vary from time to time, the historical thresholds are also time-varying. This ensures fast and easy setup of the eG system, as administrators do not have to configure thresholds for each and every metric (Figure 7).

Figure 7: Automatic thresholding capability of the eG suite

Automatic infrastructure triage: To ensure that IT infrastructures operate with minimum downtime, it is critical to perform problem detection and diagnosis instantly and accurately. Correlation of various problems reported at the network, system, and application layers is critical for speedy and accurate problem diagnosis. Most application monitoring solutions do not include any specialized correlation capability - manual analysis of the collected data is essential to determine the root-cause of problems. In contrast, the eG suite uses a novel, patented correlation and automatic infrastructure triage technology. To implement this capability, the eG manager incorporates a series of heuristics that take into account the configured site topologies and pre-built models of different network and application components.
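As an aside on the automatic thresholding described above, the idea of time-varying bounds derived from history can be sketched as follows. The paper does not say which statistical quality control technique the eG manager uses; this sketch assumes classic control limits (mean plus or minus k standard deviations) computed separately for each hour of the day. The function names, the hourly bucketing and the default k = 3 are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def compute_thresholds(history, k=3.0):
    """Derive (lower, upper) bounds per hour of day from past measurements.

    `history` is an iterable of (hour_of_day, value) pairs. Bucketing by
    hour and using mean +/- k standard deviations are assumptions of this sketch.
    """
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)

    thresholds = {}
    for hour, values in buckets.items():
        centre = mean(values)
        spread = pstdev(values)  # population standard deviation of the bucket
        thresholds[hour] = (centre - k * spread, centre + k * spread)
    return thresholds


def is_abnormal(thresholds, hour, value):
    """Flag a new measurement that falls outside the bounds for its hour."""
    lower, upper = thresholds[hour]
    return value < lower or value > upper
```

With bounds kept per hour, a value that is routine at a busy time of day can still be flagged as abnormal at a quiet time, which is the point of making the thresholds time-varying.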
By automatically correlating across the network, system, and application layers, the eG suite is able to accurately identify and report the root-cause of problems. For the example in Figure 6, where different transactions of a web site are failing, Figure 8a depicts the service topology, i.e., the data flow/dependency between the different applications and network components involved in providing this service. The color coding in the figure denotes the current status of all the applications/network components involved in delivering the service. It is obvious from the color coding that although there are many components that are experiencing problems, the root-cause of the problem is the Oracle database server.

Figure 8a: Topology representation of a problem
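The reasoning behind reading a root-cause off a service topology can be sketched in a few lines. eG's patented correlation technology is not described in this paper, so what follows is a deliberately simple stand-in heuristic: among the components currently raising alarms, report those that do not depend, directly or indirectly, on any other alarmed component. The function names and the dictionary-based topology format are hypothetical.

```python
def transitive_dependencies(depends_on, component):
    """Return every component that `component` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(component, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(current)
        stack.extend(depends_on.get(current, ()))
    return seen


def find_root_causes(depends_on, alarmed):
    """Return alarmed components with no alarmed component beneath them.

    `depends_on` maps a component to the components it relies on.
    This is an illustrative heuristic, not the eG suite's algorithm.
    """
    alarmed = set(alarmed)
    roots = set()
    for component in alarmed:
        below = transitive_dependencies(depends_on, component) - {component}
        if not (below & alarmed):
            roots.add(component)
    return roots
```

For the login example, a topology in which the web server depends on the application server, which in turn depends on the Oracle database, yields only the database as the root-cause when all three tiers are in alarm.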
Further drilldown into the Oracle database server reveals that the real problem is due to one of the tablespaces having run out of its allocated space (see Figure 8b). Due to its ability to represent the service inter-dependencies as a service topology graph, and its ability to model the different applications as a set of hierarchical layers, the eG manager is able to automatically analyze the current state of the infrastructure components and pin-point the root-cause of problems. In Figure 8c, the eG alarm window clearly pin-points that the root-cause of the service failure in Figure 6 is the Oracle database tablespace issue. This prioritized information is made available via SMS, email, over the web, or via SNMP traps, thereby ensuring that service managers can quickly triage their IT infrastructure from anywhere, at any time.

Figure 8b: The layer model showing that the Tablespaces layer in the Oracle database is the cause of the problem

Figure 8c: eG's alarm window showing automatic prioritization of alerts

Simple and Fast Provisioning

Since only one agent needs to be installed per server, thresholds can be auto-determined, and pre-defined models determine what metrics need to be collected by each agent (depending on what applications are monitored by the agent), the deployment of the eG suite is very rapid. A browser-based interface ensures a near-zero learning curve for users. Moreover, since eG's auto-triage technology does not involve setting up elaborate correlation rules and circuits, users can get the eG system up and running in a matter of hours, not weeks or months. In addition, since the eG agents are auto-upgradable from the central manager, elaborate reconfiguration/reinstallation of the software is not necessary when new versions are released or support for new IT infrastructure components is introduced.
Real-Time and Post-Facto Analysis

Besides laying extensive emphasis on real-time monitoring and troubleshooting, the eG suite also includes elaborate reporting and analysis capabilities for off-line analysis and proactive capacity planning. Different types of reports can be generated for the different levels of management in an organization. Operations reports provide in-depth insights across network, system, and applications, thereby providing clear indicators of performance bottlenecks and trends that could be affecting infrastructure performance. Users have the flexibility to customize the operations reports to suit their individual needs and preferences (Figure 9).

Figure 9: An operations report showing critical system metrics of all Citrix MetaFrame servers in an IT infrastructure

Figure 10: An executive report summarizing the performance

Executive reports for management executives (Figure 10) offer comprehensive health reports that summarize the overall state of each of the infrastructure components. By reviewing a report of a server's health, an executive can determine what percentage of the time the server's operation was trouble-free. By comparing the performance reports of the different components, executives can quickly determine where the problem-prone areas of their infrastructure are. Comparison of performance across time periods can also provide indications of whether the infrastructure performance is improving over time.
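The "percentage of the time trouble-free" figure in an executive report comes down to simple arithmetic, sketched below. The sketch assumes that the reporting period is divided into equal measurement intervals, each marked as trouble-free or not; the function name and the boolean sample format are hypothetical and do not reflect how the eG suite stores its data.

```python
def trouble_free_percentage(samples):
    """Return the percentage of measurement intervals without any alarm.

    `samples` is a sequence of booleans, True where the interval was
    trouble-free. Equal-length intervals are an assumption of this sketch.
    """
    if not samples:
        raise ValueError("no samples in the reporting period")
    trouble_free = sum(1 for ok in samples if ok)
    return 100.0 * trouble_free / len(samples)
```

Computing the same figure for two consecutive periods, or for two different servers, gives the kind of comparison described above: a rising percentage over time suggests that infrastructure performance is improving.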
Summary

This white paper has outlined how the eG suite makes IT infrastructure monitoring and triage easy, effective, and efficient. IT administrators and service managers can use the eG suite at all stages of the software lifecycle. In the development phase, the eG suite can be used in conjunction with load/stress testing tools to help fine-tune application performance. In the deployment stage, the eG suite is used to ensure that the developed applications are meeting the performance expected of them. In the maintenance stage, the eG suite ensures that the IT infrastructure and its services are meeting the service levels expected of them through its 24x7 monitoring and instantaneous troubleshooting capabilities.

Figure 11: How the eG suite helps at different stages of the software lifecycle

About eG Innovations

eG Innovations is the leading provider of enterprise-class monitoring and management solutions for IT infrastructure. The company's 100% web-based monitoring solutions are especially suited for mission-critical infrastructures where proactive monitoring, rapid diagnosis, and instant recovery are critical. Customers worldwide use the eG solutions to improve the quality of their services, thus increasing their competitive positioning, lowering their operational costs, and optimizing the usage of their infrastructures.

For More Information

eG Innovations, Inc
33, Wood Ave, South, Suite 600, Iselin, New Jersey USA
Ph: (866) Email : Web :