Data Center Real User Monitoring

Data Center Real User Monitoring Alert System Administration Guide Release 12.3

2 Please direct questions about Data Center Real User Monitoring or comments on this document to: Customer Support Copyright 2015 Compuware Corporation. All rights reserved. Unpublished rights reserved under the Copyright Laws of the United States. U.S. GOVERNMENT RIGHTS-Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in Compuware Corporation license agreement and as provided in DFARS (a) and (a) (1995), DFARS (c)(1)(ii) (OCT 1988), FAR (a) (1995), FAR , or FAR (ALT III), as applicable. Compuware Corporation. This product contains confidential information and trade secrets of Compuware Corporation. Disclosure is prohibited without the prior express written permission of Compuware Corporation. Use of this product is subject to the terms and conditions of the user's License Agreement with Compuware Corporation. Documentation may only be reproduced by Licensee for internal use. The content of this document may not be altered, modified or changed without the express written consent of Compuware Corporation. Compuware Corporation may change the content specified herein at any time, with or without notice. All current Compuware Corporation product documentation can be found at Compuware, FrontLine, Network Monitoring, Enterprise Synthetic, Server Monitoring, Dynatrace Network Analyzer, Dynatrace, VantageView, Dynatrace, Real-User Monitoring First Mile, and Dynatrace Performance Network are trademarks or registered trademarks of Compuware Corporation. Cisco is a trademark or registered trademark of Cisco Systems, Inc. Internet Explorer, Outlook, SQL Server, Windows, Windows Server, and Windows Vista are trademarks or registered trademarks of Microsoft Corporation. Firefox is a trademark or registered trademark of Mozilla Foundation. Red Hat and Red Hat Enterprise Linux are trademarks or registered trademarks of Red Hat, Inc. J2EE, Java, and JRE are trademarks or registered trademarks of Oracle Corporation. 
VMware is a trademark or registered trademark of VMware, Inc. SAP and SAP R/3 are trademarks or registered trademarks of SAP AG. Adobe Reader is a registered trademark of Adobe Systems Incorporated in the United States and/or other countries. All other company and product names are trademarks or registered trademarks of their respective owners. Local Build: April 1, 2015, 12:36

Contents

Introduction
  Who Should Read This Guide
  Related Publications
  Organization of the Guide
Chapter 1 Alert System
  Types of Alerts
  Alert States and Notifications
  Means of Alert Delivery
  Defining an Alert Process Overview
Chapter 2 Managing Alert Definitions
  Viewing Alert Definitions
  Editing Alerts
  Editing a User-Defined Alert on a Single Device
  Enabling and Disabling Alerts on Devices
  Duplicating User-Defined Alerts
  Deleting Alerts
  Working with Predefined Alerts
  Compatibility of User-Defined Alerts with Different CAS Versions
Chapter 3 Defining New Alerts
  Configuring Trigger Conditions for Alerts
  Configuring Optional Alert Detector Settings
  Comparison Modes Overview
  Condition Types Overview
  Limitations on Using Baseline Conditions
  Limitations on Using Metrics in Alert Definitions
  Specifying Output Filters
  Filter Syntax
  Modifying Alert Propagation Settings
  Configuring Triggering Conditions for Link Performance Alerts
  Configuring Output Filters for Link Performance Alerts
Chapter 4 Managing Alert Notification Recipients
  Defining Trap Clients
  Adding a New Compuware Open Server
  Configuring Scripts
Chapter 5 Configuring Alert Notifications
  Sending SNMP Alert Notifications to a Single Trap Manager
  Disabling the Alert Engine on CAS
Appendix A Alert Usage Example in a Web-based Environment
  Alert Definition Example: High Server Time for Service
  Alert Definition Example: Abnormal URL Traffic for Software Service User
  Alert Definition Example: Operation attributes(1)
Appendix B Alert Usage Example in an Enterprise Environment
  Alert Definition Example: Network Performance for Site
  Alert Definition Example: Excessive Number of Servers Used by User. Top Software Service Identified
  Alert Definition Example: New Server Detected
Appendix C Dimensions Available for User-defined Alert Definitions
Appendix D Metrics Available for User-defined Alert Definitions
  Real user performance (probe)
  Synthetic backbone
  Application user experience
  Internetwork Traffic
  Network Link
  Enterprise Synthetic and Sequence
Appendix E Alert Definitions Provided with DC RUM
  _TIME_4_URL AMD_DROP_PCKTS_ALL_I AMD_DROP_PCKTS_DRVR AMD_DROP_PCKTS_SNGL_I AMD_NOTRAFFIC_DRVR AMD_NOTRAFFIC_I AMD_SSL_ENGINE AMD_SSL_STATUS AMD_UNIDIR_TRAFF APPL_ABNOR AVL_DROP_4_APPL DATABASE_SIZE DISKS_STORAGE EXC_ACT EXC_ACT EXC_ACT_SIMPLE FLOW_DROP_4_CL_LOC
  HOT_IP HTTP_SERV_EFF INCORR_LOGIN LOAD_TIME_4_URL LOAD_TIME_4_URL_4_CLI LOC_CL_UP_STR_EFF LOC_HTTP_STR_EFF LOSS_RATE LOW_OPER_4_CAP_MOD LOW_OPER_4_SYS_MOD METRIC_ALM_ METRIC_ALM_ METRIC_ALM_ METRIC_ALM_ NEW_APP NEW_SERVER NEW_SERVICE NEW_USER NEW_WORKSTATION OP_GAP_4_SRV PAGE_LOAD RBAND SERVC_PERF SRV_ERR_GROW_4_HTTP_REQS SSL_APPL_INOPER SUSP_CLI_TRAFF SUSP_URL_TRAFF SVR_TIME_4_URL TFC_LVL TFC_SUSP TRANSMETRIC_ALM_ TRANSMETRIC_ALM_ TRANSMETRIC_ALM_ TRANSMETRIC_ALM_ URL_RESP_EFF USER_AVAILABILITY VPN_DROP_OFF
Glossary
Index

INTRODUCTION

Who Should Read This Guide

This manual is intended for administrators of Data Center Real User Monitoring report servers and RUM Console. The administrative tasks described in this manual can be performed only by users with administrative privileges. It is assumed that the reader is familiar with basic networking concepts and with concepts related to managing applications under Microsoft Windows.

Related Publications

Documentation for your product is distributed on the product media. For Data Center RUM, it is located in the \Documentation directory. It can also be accessed from the Media Browser.

Go online for fast access to information about your Dynatrace products. You can download documentation and FAQs as well as browse, ask questions, and get answers on user forums (requires subscription). The first time you access FrontLine, you are required to register and obtain a password. Registration is free.

PDF files can be viewed with Adobe Reader version 7 or later. If you do not have the Reader application installed, you can download the setup file from the Adobe Web site.

Organization of the Guide

This guide is organized as follows:
- Alert System [p. 9] - Introduces the concept of alerts.
- Managing Alert Definitions [p. 17] - Describes how to manage your alerts: listing, sorting, disabling, enabling, editing, cloning, and creating from scratch.
- Defining New Alerts [p. 25] - Describes how to configure alert detector settings and how to control the contents and delivery of alert messages.
- Managing Alert Notification Recipients [p. 47] - Describes how to add and list the alert notification recipients.
- Configuring Alert Notifications [p. 53] - Explains how to create a notification message template and how to assign notifications to recipients.
- Dimensions Available for User-defined Alert Definitions [p. 75] - Lists dimensions that are available for creating metric alert definitions.
- Metrics Available for User-defined Alert Definitions [p. 79] - Lists metrics that are available for creating metric alert definitions.
- Alert Definitions Provided with DC RUM [p. 121] - Lists the alerts that are supported by at least one of the DC RUM report servers: Central Analysis Server or Advanced Diagnostics Server.

CHAPTER 1 Alert System

The alert mechanism enables you to be proactive when dealing with problems and to remove problems before they start affecting users.

In the reactive model of dealing with problems, you react to problems reported by your users (for example, website users). In such a scenario, the CAS monitors a given website and the AMD continuously measures operation time for every operation, transaction, and user. Using the gathered data, the report server displays all details on charts and makes it possible to measure performance and troubleshoot problems. When users report problems, you look at the reports and find out that, for example, the problem is with HTTP response time from a certain server. You then fix the problem: reboot the server, restart the process, or take other corrective action. In other words, you react to a problem that has already affected your users.

In the proactive model, you detect problems before your users can notice them. For this, you need two things: knowledge of how problems manifest themselves in your particular environment, and the means of detecting such situations. For example, if long HTTP response time is the best early indicator of developing problems, you could display a chart showing the HTTP response time metric and take action if the metric rises above a certain value. It is even better to automate the process and let the system inform you when the metric exceeds the threshold. This is exactly what the alert mechanism was designed to do.

Ideally, the system could inform a designated operator about the problem and feed data into an alert management engine. The engine could then perform a corrective action such as restarting the offending server or process. Thus, the alert mechanism enables you to move some of the responsibility and intelligence from a human operator (watching the charts) to the machine (acting on alerts).
Defining and Modifying Alerts

For an alert to be raised, you need to specify the alert triggering conditions, which requires careful observation and knowledge of the system. You need to ensure that:
- You understand what you are trying to achieve.
- You have gathered your requirements.
- You know how problems in the monitored system manifest themselves.
- You can translate your intentions into alert configuration.

You must ensure that alerts detect error situations and nothing but error situations. In other words, failure notifications must be sent and corrective actions performed whenever they are needed, and only then.

When configuring alerts, first consider what the system would be showing if you were troubleshooting a failure in a reactive mode. These could be, for example, slow operations, HTTP response time, SSL handshake errors, stopped pages, 5xx HTTP errors on the login URL, or some textual information that needs to be captured with application error recognition. Then ask yourself which values, over a given time duration, are still acceptable and which values mean a real problem. For example, 5 minutes of high server time might not signify a problem, but if it stays high for more than 15 minutes it might be a problem, particularly if after 30 minutes you also see 5xx HTTP errors. Then you have to react. With this type of information, you can start looking for the right alerts to configure.

It is not enough to detect alert conditions and then trigger and send alert notifications. You need a business process that ensures that the situation is fixed as soon as possible. In other words, it is not enough to generate many alerts from monitoring tools if you still react to problems only when users call to complain.

Usage Scenarios for Alerts

The alert system can satisfy various user requirements and operational scenarios, such as:
- Notifying the recipient of both the beginning and the end of the alert condition. The user is notified when an alert condition is raised and also when the situation returns to normal.
- Notifying the recipient only if a given condition lasts for a certain period of time, or if a given event is repeated several times. This enables the user to focus on real issues and not on insignificant or intermittent glitches.
- Notifying the recipient several times at regular intervals throughout the duration of the problem.

Types of Alerts

Alerts can be divided into different categories based on the underlying detector mechanism and on their function. Alert detectors are the actual mechanisms responsible for analyzing the monitored traffic and for recognizing alert triggering events. The detector mechanism determines such things as the types and number of parameters that a given alert takes (or can be modified to take), the speed of processing, and user access to the actual detector code.

In most cases, you will be working with user-defined metric alerts, which provide a simple and fast mechanism for performing complex queries on a set of predefined metrics or on expressions that combine such metrics. They are easy to create, modify, and use. They execute quickly: up to 1,000 alert definitions can be processed in one reporting cycle. It is recommended that metric alerts be used whenever possible because of their speed of execution and ease of modification.

The following types of metric alerts can be configured:

- Real user performance (probe): These alerts monitor traffic between a client and a server. They are based on traffic monitored by AMD, including the elements that are configured on the CAS: applications, transactions, reporting groups, tiers, regions, areas, and sites.
- Application user experience: These alerts are based on the data provided in the Application, Transaction, and Tier data view.
- Enterprise Synthetic and sequence: These alerts monitor transactions and track the HTTP-based software service activity of synthetic agents and standard users. They are based on traffic monitored by DC RUM or Enterprise Synthetic.
- Citrix/WTS hardware: These alerts monitor the performance of Citrix servers or Windows Terminal Services (for example, the number of active or open sessions).
- Network link: These alerts monitor link utilization.
- Internetwork traffic: These alerts monitor traffic coming in and going out of a specific site.
- Synthetic backbone: These alerts report problems related to Dynatrace Synthetic Monitoring transactional traffic.

Although it is recommended that metric alerts be used whenever possible, not every possible alert condition can be expressed as a metric alert. This is why a set of predefined SQL-based alerts is provided. These alerts perform SQL queries on the traffic monitoring database. The benefit of using them is that there are no constraints on the complexity of the queries: any event that can be expressed in SQL can be detected. However, the SQL queries take a considerable amount of time to execute, so performance problems can result. You cannot create new SQL-based alert definitions or duplicate the existing ones in the RUM Console. You can, however, modify some of the detector settings (for example, change the threshold values) or delete the alert definitions.

Among the predefined alert definitions, there are also a few non-SQL alerts that were designed for specific purposes and that can be modified in only limited ways. Most of them monitor and report on resources of a report server and cannot be deleted from the system.

The predefined alerts are grouped based on the type of event on which they report:
- Anomalies: Alerts sent when an abnormal situation is detected (for example, when there are too many services detected for a single user).
- Diagnostics: Alerts that are related to the resources of a report server (for example, free space on the server hard drives or free space for the server database).
- New objects: Alerts that are sent when a user, server, or service registers for the first time in the monitored network.
- Performance: Alerts that report mainly errors that occur during the execution of operations and abnormal time metric values for the operations. They also notify recipients about application availability problems.

Alert States and Notifications

The alert system is a multi-layer mechanism. For a DC RUM user, the most important elements of this mechanism are alert states and notifications. An alert is raised if the monitored traffic meets the conditions specified in the alert definition, such as when a particular metric exceeds a defined threshold value. An optional notification can then be sent.

Alert States

If a given metric exceeds its threshold value, an alert state might not be triggered immediately. Exactly when an alert is triggered is defined in the alert definition. You will often want to raise an alert only after a threshold has been exceeded a specific number of times in a given time interval. Similarly, notifications are not sent in direct response to the triggering conditions but in connection with alert states being raised, remaining on, or being lowered.

For example, an alert can be raised:
- As soon as the triggering conditions are fulfilled (after just one occurrence of the alert condition).
- After a specified number of occurrences of a given condition.

An alert state can then be lowered, or will expire:
- Immediately after the condition that triggered the alert has ceased to occur.
- If the triggering condition has not reappeared for a specified number of minutes.
- If the triggering condition has not reappeared for a specified number of reporting cycles.

A condition can repeat a number of times, but after an alert is triggered (raised), it remains raised until it is turned off or expires (is lowered). Similarly, after an alert is raised, a notification can be sent zero or more times while the alert state remains on, and a notification can be sent when the alert is turned off.
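The raise and lower rules described above can be pictured as a small state machine. The following Python sketch is only an illustration and is not DC RUM code: the class name `AlertState` and the parameters `raise_after` and `lower_after` are hypothetical, and it makes the simplifying assumption that occurrences are counted as consecutive reporting cycles rather than within a configurable time interval.

```python
from dataclasses import dataclass


@dataclass
class AlertState:
    """Hypothetical model of one alert's on/off state across reporting cycles."""

    raise_after: int = 1   # assumed: consecutive occurrences needed to raise the alert
    lower_after: int = 1   # assumed: quiet reporting cycles needed to lower the alert
    raised: bool = False
    occurrences: int = 0
    quiet_cycles: int = 0

    def update(self, condition_met: bool) -> str:
        """Process one reporting cycle; return 'raised', 'on', 'lowered' or 'off'."""
        if condition_met:
            self.quiet_cycles = 0
            if self.raised:
                return "on"            # already raised: the state simply stays on
            self.occurrences += 1
            if self.occurrences >= self.raise_after:
                self.raised = True
                return "raised"
            return "off"               # not enough occurrences yet
        if not self.raised:
            self.occurrences = 0       # simplification: the streak is broken
            return "off"
        self.quiet_cycles += 1
        if self.quiet_cycles >= self.lower_after:
            self.raised = False
            self.occurrences = 0
            self.quiet_cycles = 0
            return "lowered"
        return "on"                    # condition absent, but the alert has not expired


if __name__ == "__main__":
    state = AlertState(raise_after=3, lower_after=2)
    for met in [True, True, True, False, False]:
        print(met, state.update(met))
```

With `raise_after=1` and `lower_after=1`, the sketch reduces to the simplest case in the text: the alert is raised on the first occurrence and lowered as soon as the condition ceases.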
Notifications

After an alert is raised, an optional notification can be sent. Whether a notification is sent depends on the alert definition. Later, if the alert state remains on, repeated notifications can also be sent as needed. In particular, while the alert remains on, an alert notification can be repeated:
- In every reporting cycle
- Every specified number of minutes

Alert cancellation notifications are also possible: an alert definition can specify that a notification should also be sent when the alert state is lowered, that is, when the alert reverts to the off state.
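As a rough illustration of these notification rules, the sketch below plans which reporting cycles would produce a notification, given the alert state in each cycle. It is a hypothetical model, not the product's implementation: the function name `plan_notifications`, its parameters, and the 5-minute cycle length are all assumptions made for the example, and it assumes that a notification on raise is enabled.

```python
from typing import List, Optional


def plan_notifications(alert_on: List[bool],
                       cycle_minutes: int = 5,
                       repeat_minutes: Optional[int] = None,
                       notify_on_lower: bool = False) -> List[str]:
    """Return, per reporting cycle, 'raise', 'repeat', 'cancel' or '' (nothing sent).

    alert_on        -- alert state (on/off) in each consecutive reporting cycle
    cycle_minutes   -- assumed length of one reporting cycle
    repeat_minutes  -- repeat interval while the alert stays on; None disables repeats
    notify_on_lower -- whether to send a cancellation when the alert is lowered
    """
    plan = []
    was_on = False
    minutes_since_sent = 0
    for is_on in alert_on:
        action = ""
        if is_on and not was_on:
            action = "raise"                      # alert state has just been raised
            minutes_since_sent = 0
        elif is_on and was_on:
            minutes_since_sent += cycle_minutes   # alert state remains on
            if repeat_minutes is not None and minutes_since_sent >= repeat_minutes:
                action = "repeat"
                minutes_since_sent = 0
        elif was_on and notify_on_lower:
            action = "cancel"                     # alert state has just been lowered
        plan.append(action)
        was_on = is_on
    return plan


if __name__ == "__main__":
    states = [False, True, True, True, True, False]
    print(plan_notifications(states, repeat_minutes=10, notify_on_lower=True))
```

Setting `repeat_minutes` equal to `cycle_minutes` models the "in every reporting cycle" option; leaving it as `None` models a single notification when the alert is raised.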

Notifications are sent not in direct response to triggering conditions, but in response to alert states being raised, remaining on, or being lowered. One alert can send a number of notifications. After an alert is turned on, it remains in the on state until it is turned off or expires.

Means of Alert Delivery

Alert notifications can be sent to a specified e-mail address, sent via SNMP traps, or delivered to COS. Notifications are sent to recipients based on subscriptions. Users, referred to as alert subscribers, can select which alerts they want to receive, apply additional filtering criteria, and select the delivery mechanism.

When e-mail is the selected delivery mechanism, all alerts that have occurred within a single monitoring interval are by default sent in one message. Every enabled alert, even if it has no recipients defined, is generated and can be viewed in the alert logs. All alert notifications, whether e-mailed or not, are recorded in alert logs. For more information, see Alert Log Viewer in the Data Center Real User Monitoring Administration Guide.

When traps are the selected delivery medium, a separate trap is associated with each alert notification. Each trap has an associated trap definition, identified by an OID, in the MIB in the alarms.mib file. This MIB can be imported on the trap recipient to correctly interpret the meaning of the alert and automate any corrective actions. Refer to your network management platform manual for information on how to install third-party MIBs.

Alert notifications can also be delivered to a supported COS release. You can check the release number of the currently running module in the Administration Console. To open the Administration Console from the Windows Start menu, choose Programs > Compuware > Compuware Open Server > Administration Console.
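The difference between e-mail and trap delivery described in this section (one e-mail message per monitoring interval versus one trap per notification) can be sketched as follows. This is a simplified illustration only: the function names, the `(interval, alert name)` tuple layout, and the message formats are assumptions, and per-subscriber filtering is not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical notification record: (monitoring interval label, alert name)
Notification = Tuple[str, str]


def email_messages(notifications: List[Notification]) -> Dict[str, str]:
    """Batch all notifications from the same monitoring interval into one message body."""
    by_interval: Dict[str, List[str]] = defaultdict(list)
    for interval, alert_name in notifications:
        by_interval[interval].append(alert_name)
    return {interval: "\n".join(names) for interval, names in sorted(by_interval.items())}


def trap_messages(notifications: List[Notification]) -> List[str]:
    """Produce one separate trap per notification, with no batching."""
    return [f"{interval} {alert_name}" for interval, alert_name in notifications]


if __name__ == "__main__":
    notes = [("10:00", "LOSS_RATE"), ("10:00", "NEW_SERVER"), ("10:05", "LOSS_RATE")]
    print(email_messages(notes))   # two messages, one per interval
    print(trap_messages(notes))    # three traps, one per notification
```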
Defining an Alert Process Overview

Defining an alert is a process that begins with identifying the need for an alert, and then goes on to defining alert settings and arranging for the alert message to be sent to the correct audience. It is useful to follow a top-level procedure to ensure that all the required steps have been completed.

Before You Begin

You should be familiar with DC RUM components and basic monitoring concepts. Refer to the Data Center Real User Monitoring Getting Started.

You need to install the following DC RUM components:
- The latest version of AMD. Refer to the Data Center Real User Monitoring Agentless Monitoring Device Installation Guide.
- The latest version of RUM Console. Refer to the Data Center Real User Monitoring RUM Console Installation Guide.
- The latest version of CAS. Refer to the Data Center Real User Monitoring Central Analysis Server Installation Guide.
- Optionally: The latest version of ADS. Refer to the Data Center Real User Monitoring Advanced Diagnostics Server Installation Guide.

Make sure that default ports are available for communications between the individual DC RUM components. For more information, see Network Ports Opened for DC RUM in the Data Center Real User Monitoring Administration Guide.

To send alert notifications via e-mail, a user with report server administrator privileges must configure the server to use an existing SMTP server. For more information, see Specifying the SMTP Server for Scheduled Report Mailing in the Data Center Real User Monitoring Administration Guide.

To define an alert in the RUM Console, execute the following steps:
1. Identify a business need for an alert.
   For a discussion of what an alert mechanism could be used for, see Alert System [p. 9].
2. Add a new alert definition.
   New alerts are created from scratch with the alert definition wizard. You can also duplicate an existing alert that best suits your need and then modify the settings of the duplicated definition. The first step of creating a new definition is selecting the alert type, specifying the alert name, and providing a description that will later help you identify the purpose for which the alert has been defined. For more information, see Defining New Alerts [p. 25] and Duplicating User-Defined Alerts [p. 21].
3. Configure the detector settings.
   Choose the dimensions and metrics that you want to monitor and to which the alert should be applied, and define the conditions in which the alert should be triggered. For more information, see Configuring Trigger Conditions for Alerts [p. 26].
4. Optional: Define alert output filters.
   Additional conditions can be set on alert output fields, so that only alerts satisfying those conditions are raised. To specify them, follow the information in Specifying Output Filters [p. 37] and Filter Syntax [p. 37].
5. Optional: Modify default propagation characteristics.
   It is important to send alert notifications in the correct circumstances and with the correct frequency. By default, an alert is raised after 1 monitoring interval during which the triggering conditions were fulfilled. In most cases, the default setting will be sufficient, but you can modify it so that the alert meets your requirements. For more information, see Modifying Alert Propagation Settings [p. 40].
6. Optional: Define alert notification recipients.
   This is required only if you plan to send alert notifications to trap recipients or Compuware Open Server, and not only to CAS users. The information about CAS users is downloaded automatically from Central Security Server and displayed in the RUM Console. For more information, see Managing Alert Notification Recipients [p. 47].
7. Optional: Configure an alert notification.
   Define an alert message template and enable the notifications for selected recipients. If a raised alert has no configured and enabled notifications, it will only be reported in the logs. For more information, see Configuring Alert Notifications [p. 53].
8. Review and publish the definition on the devices.

CHAPTER 2 Managing Alert Definitions

The actions you can perform on an alert definition depend on the alert type. All definitions can be listed, sorted, enabled, disabled, or have their names, descriptions, and notification messages modified. User-defined metric alerts can also be created from scratch or duplicated and saved under new names. Use the Alert Management screen in RUM Console as your primary control panel for alert definitions.

Viewing Alert Definitions

Use the Alert Management screen as your primary control panel for creating and editing alert definitions.
1. Start and log on to RUM Console.
2. Select Alerts from the RUM Console top menu.
   This opens the Alerts tab of the Alert Management window. On this tab, the definitions are grouped into user-defined and predefined alerts. For each definition, you can see a list of devices reporting on traffic to which the definition can be applied.
   NOTE: If you upgraded to release 12.3 from an earlier version of DC RUM, you may in some situations notice more than one alert definition with the same name on the user-defined alert list. This means that, despite having identical names, the alerts have different detection or notification settings. If, on upgrade, two or more definitions with identical configurations (including the alert names) are detected, they are merged into one entry, even if they were detected on different devices.
   Switching from the Alerts tab to the Devices or Recipients tab gives you another perspective on the available alert definitions. The Devices tab offers a list of the report servers in your network together with the alert definitions assigned to each device, whereas the Recipients tab offers a list of the configured notification recipients (CAS users, trap recipients, and COS instances) and the alert definitions assigned to each recipient.
3. Optional: Change view settings.

   - Switch between the user-defined and predefined alert definitions, using the links provided at the top of the alert list.
   - Display disabled alert definitions. By default, the alert list shows only the enabled definitions. In fresh DC RUM installations, there are a few predefined definitions in the User-defined group that are disabled by default. They serve as examples showing how correct definitions of metric-based alerts are built. Many of the alert definitions in the Predefined group are disabled as well. To view all alerts, select the Show disabled check box at the top of the list.
   - Display specific alert types. By default, the alert list shows all definitions, but using the Type to display list you can browse the alerts by category. For more information, see Types of Alerts [p. 10].
   - Using the Filter box, search for a specific alert definition. Type a string that you want to look for. If found, all occurrences of the string are highlighted. To turn off the filter, clear the Filter field contents.
   - Click any column heading to sort by that column. Click the same heading again to reverse the sort order.

Editing Alerts

User-defined and predefined alerts differ in terms of the availability of editing options. The settings for built-in alerts can only be edited for each device separately. This is mainly because many of these alerts are designed to monitor the resources of report servers. The user-defined alerts, however, are modified on all report servers at the same time. Modification of settings on one device is also possible, but requires you to follow a different procedure. For more information, see Editing a User-Defined Alert on a Single Device [p. 19].

To edit an alert definition:

User-defined alerts
1. Start and log on to RUM Console.
2. In the RUM Console, select Alerts from the top menu.
   The Alert Management window appears.
3. Choose the User-defined alert group.
4. Access the editor.
   - Select an alert with a mouse click and click Edit Alert. This takes you directly to the first alert definition wizard step, where you can rename the definition, modify the alert description, and change the alert assignment to report servers.
   - From the Actions menu available for a selected alert, choose Edit alert. This is also a shortcut to the first wizard step.

   - From the Actions menu available for a selected alert, choose Edit notifications. This is a shortcut to the configuration of notification messages for the selected alert.

Predefined alerts
5. On the Alerts tab of the Alert Management screen, choose the Predefined alert group.
6. Select an alert with a mouse click.
7. Access the editor.
   Since the predefined alerts can only be modified for a selected device, you will find the links to the editor beside each report server. You can choose between 2 options:
   - From the Actions menu available for a selected report server, choose Edit alert. This takes you directly to the first alert definition wizard step, where you can rename the definition, modify the alert description, and change the alert assignment to report servers.
   - From the Actions menu available for a selected report server, choose Edit notifications. This is a shortcut to the configuration of notification messages for the selected alert.
8. Save and publish the new configuration.

Editing a User-Defined Alert on a Single Device

To edit an alert definition on a single CAS, you first need to remove it from this device. Assume that an alert definition is assigned to two devices, CAS A and CAS B, and you want to modify the definition only on CAS A.
1. Start and log on to RUM Console.
2. In the RUM Console, select Alerts from the top menu.
   The Alert Management window appears.
3. Select an alert definition that you want to modify.
4. Click Edit to access the wizard.
5. Click the CAS list to view all the devices to which the alert is assigned.
6. Remove the alert from CAS A.
   This is done by clearing the check box beside the IP address of the device.
7. Click Finish to save the changes.
8. On the summary screen, click Apply.
9. In the Save configuration pop-up window, choose to save the configuration as a draft.
10. Back on the Alert Management screen, duplicate the modified alert definition.
11. Edit the duplicated definition according to your needs.
    a. Modify the definition name.
    b. Assign the definition to CAS A and remove it from CAS B.
    c. Modify the detector and notification settings.
    d. Click Finish to save the changes.
    e. On the summary screen, click Apply.
    f. In the Save configuration pop-up window, choose to save the configuration as a draft.
12. Click Publish Configuration to apply new settings to the devices.

Enabling and Disabling Alerts on Devices

The Alert Management screen enables you to assign alert definitions to each device separately or to several devices at the same time. You can do it when browsing either the list of available definitions or the list of report servers available in your network.

To enable or disable an alert definition on a report server:
1. Start and log on to RUM Console.
2. In the RUM Console, select Alerts from the top menu.
   The Alert Management window appears.

Alerts tab
The Alerts tab enables you to browse all alert definitions and the report servers available for each definition.
3. On the Alerts tab, choose an alert group.
   Select the User-defined or Predefined alert group.
4. Select alerts that you want to manage.
   - Select one definition with a mouse click.
   - Use check boxes to select several definitions.
5. Enable or disable the selected alerts.
   - Enable or disable a single definition on all available report servers. From the Actions menu in the table row corresponding to the alert definition, select Enable alert or Disable alert. Note that only one of these options will be available, depending on the current status of the definition.
   - Enable or disable a single alert on a single device. In the table row corresponding to a given report server, click Enable or Disable, depending on your needs. Only one of these options will be available, depending on the current status of the definition.
   - Enable or disable several definitions on all available report servers. From the Actions menu at the top of the alert list, select Enable selected or Disable selected. The change will be applied to all available report servers. Note that the Actions menu will be active only if you used check boxes to select alert definitions.

Devices tab
The Devices tab enables you to browse all report servers in your network, together with the alert definitions available for assignment.
6. On the Alert Management screen, switch to the Devices tab.
   The screen shows a complete list of report servers in your DC RUM installation together with the assigned alerts.

7. Choose a report server with a mouse click. By default, the list of alerts will show which alerts are currently enabled on the selected device. To also display the definitions that are not enabled on this device, select the Show disabled option.
8. Enable or disable alerts on the selected device.
• Enable or disable a single definition on the selected report server. In the table row corresponding to a given definition, click Enable or Disable. Only one of these options will be available, depending on the current status of the definition.
• Enable or disable several definitions on the selected report server. From the Actions menu at the top of the alert definition list, select Enable selected or Disable selected. The change will be applied to the selected report server. Note that the Actions menu will be active only if you used check boxes to select alert definitions.
9. Click Publish Configuration to apply new settings to the devices.

Duplicating User-Defined Alerts

Duplicating an alert definition and editing the settings is a good way to create a new alert based on an existing definition. Note that duplication is not possible for any of the predefined alert definitions.

To duplicate an alert:
1. Start and log on to RUM Console.
2. In the RUM Console, select Alerts from the top menu. The Alert Management window appears.
3. Choose the User-defined alert group.
4. With a mouse click, select the alert that you want to duplicate.
5. From the Actions menu that corresponds to the selected alert definition, select Duplicate alert. This opens the first alert definition wizard screen. Make any adjustments you need and proceed to the subsequent wizard steps. The detector settings are copied from the original alert definition, but are available for editing. To edit the detector settings, proceed as in Editing Alerts [p. 18] and Configuring Trigger Conditions for Alerts [p. 26].
To edit the notification settings, proceed as described in Configuring Alert Notifications [p. 53].
After the definition is ready, save and publish the configuration.

Deleting Alerts

On the Alert Management screen, you can delete any of the user-defined definitions and some of the predefined ones. Deleting is not possible for the following predefined alerts:
• INCORR_LOGIN
• LOW_OPER_4_SYS_MOD
• LOW_OPER_4_CAP_MOD
• DATABASE_SIZE
• DISKS_STORAGE
• HOT_IP
• SYS_STATUS
• SUSP_CLI_TRAFF
• SUSP_URL_TRAFF
• EXC_ACT2

To delete an alert from the system:
1. Start and log on to RUM Console.
2. In the RUM Console, select Alerts from the top menu. The Alert Management window appears.
3. On the Alerts tab, choose an alert group. Select User-defined or Predefined alert group.
4. Select one or more alerts that you want to remove from the system.
5. Delete the definitions.
• To delete one definition on all available devices, choose Delete alert from the Actions menu that corresponds to the given alert.
• To delete several alerts at a time from all available devices, select them using the check boxes and choose Delete selected from the Actions menu at the top of the alert list.
• For user-defined alerts only: To delete a definition from a single CAS, click Edit Alert to access the definition wizard. On the list of CASes, clear the check box beside the IP address of the given device.
6. Click Publish Configuration to apply new settings to the devices.

Working with Predefined Alerts

Unlike user-defined alerts, the editing options for predefined alerts are limited. In fresh DC RUM installations, some of the predefined definitions are enabled by default and some are disabled. For information about the default status of each alert, see Alert Definitions Provided with DC RUM [p. 121].
You can edit the detector and notification settings for a predefined alert only on a single CAS. Also, you cannot change the default assignment of an alert to a given report server.
Some of the predefined alerts cannot be deleted from the system. The list of such alerts includes the following:
• INCORR_LOGIN
• LOW_OPER_4_SYS_MOD
• LOW_OPER_4_CAP_MOD
• DATABASE_SIZE

• DISKS_STORAGE
• HOT_IP
• SYS_STATUS
• SUSP_CLI_TRAFF
• SUSP_URL_TRAFF
• EXC_ACT2
You can modify the alert name, description, propagation settings, output filters, notification settings, and detector parameter values, but you cannot change the default list of parameters in the detector settings. Unlike metric-based user-defined alerts, the extended editor functionality is not available for predefined alerts and the detector parameters and their values are displayed in the form of a simple list.
Enabling and disabling of the definitions works in the same way as for the user-defined alerts. For more information, see Enabling and Disabling Alerts on Devices [p. 20].

Compatibility of User-Defined Alerts with Different CAS Versions

The RUM Console is able to manage alert definitions on different versions of the Central Analysis Server (CAS) and to determine which alert definitions can be published on each report server.
The Alert Management screen in the RUM Console lists alert definitions coming from all CASes. However, definitions coming from one server may not be available for publishing on other servers, depending on the server version, alert type, metrics, and parameters supported by each of these devices. In general, alert definitions compatible with older CAS versions are available for publication on the latest CAS version, but alert definitions compatible with the latest CAS version may not be available for publication on older CAS versions.

When defining a new alert:
• You can only assign a new definition to a CAS that supports the selected alert type. This is automatically ensured by the RUM Console, so you will not be allowed to assign an alert to a report server that does not support a particular type.
• You can only choose from a common set of metrics supported by every CAS selected for publishing in the alert definition. This set is defined and displayed automatically for you in the alert definition wizard.

When editing an existing alert:
• If you change the alert type to one that is not supported by a CAS in your installation, this server will automatically be removed from the list of devices to which the definition is published.
• If you remove a server from the list of CASes to which the alert should be published, and the system discovers that the alert definition uses metrics that are not supported by the remaining devices, you will see a warning message listing unsupported metrics. In this situation, you have to either modify the list of servers or edit the alert detector settings so that they do not use unsupported parameters.
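The "common set of metrics" rule described above can be pictured as a set intersection across the selected servers. The following Python fragment is only an illustrative sketch, not the RUM Console implementation: the server names, metric names, and both helper functions are hypothetical.

```python
def common_metrics(supported_by_cas, selected_cases):
    """Return the metrics supported by every selected CAS (set intersection)."""
    metric_sets = [set(supported_by_cas[cas]) for cas in selected_cases]
    return set.intersection(*metric_sets) if metric_sets else set()


def unsupported_metrics(alert_metrics, supported_by_cas, selected_cases):
    """Return the alert's metrics that at least one selected CAS does not support."""
    usable = common_metrics(supported_by_cas, selected_cases)
    return sorted(set(alert_metrics) - usable)


# Hypothetical inventory: the newer server supports one extra metric.
SUPPORTED = {
    "cas-new": {"slow_operations", "client_operations", "user_count"},
    "cas-old": {"slow_operations", "client_operations"},
}

alert_metrics = ["slow_operations", "user_count"]

# Publishing to both servers leaves one metric unsupported (warning case).
print(unsupported_metrics(alert_metrics, SUPPORTED, ["cas-new", "cas-old"]))
# Publishing to the newer server only leaves nothing unsupported.
print(unsupported_metrics(alert_metrics, SUPPORTED, ["cas-new"]))
```

A non-empty result corresponds to the situation in which you must either change the list of servers or edit the detector settings.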


CHAPTER 3 Defining New Alerts

The new alerts that you can define in DC RUM operate on a provided set of predefined metrics, or on expressions built from such metrics. After they are defined, they are easy to modify and use, and they execute quickly: up to 1,000 alert definitions can be processed in one monitoring interval.

Before You Begin
• It is assumed that you have the latest version of the AMD added and configured in the RUM Console. For more information, see Adding an AMD to the Devices List in the Data Center Real User Monitoring Smart Packet Capture User Guide.
• It is assumed that you have the latest version of the CAS added and configured in the RUM Console and connected to your AMD. For more information, see Adding a CAS to Devices List in the Data Center Real User Monitoring Smart Packet Capture User Guide.
• It is assumed that you have the latest version of the RUM Console installed. For more information, see RUM Console Installation Guide.

To add a new alert definition:
1. Start and log on to RUM Console.
2. In the RUM Console, select Alerts from the top menu. The Alert Management window appears.

Specify Basic Settings
3. Click Add Alert. This opens the alert definition wizard.
4. Select an alert type from the list. You can choose one of the following alert types:
• Real user performance (probe)
These alerts monitor traffic between a client and a server. They are based on traffic monitored by AMD, including the elements that are configured on the CAS: applications, transactions, reporting groups, tiers, regions, areas and sites.
• Application user experience
These alerts are based on the data provided in the Application, Transaction and Tier data view.
• Enterprise Synthetic and sequence
These alerts monitor transactions and track the HTTP-based software service activity of synthetic agents and standard users. They are based on traffic monitored by DC RUM or Enterprise Synthetic.
• Citrix/WTS hardware
These alerts monitor the performance of Citrix servers or Windows Terminal Services (for example, the number of active or open sessions).
• Network link
These alerts monitor link utilization.
• Internetwork traffic
These alerts monitor traffic coming in and going out of a specific site.
• Synthetic backbone
These alerts report problems related to Dynatrace Synthetic Monitoring transactional traffic.
For each of the alert types, there is a predefined set of dimensions and metrics that make it easier for you to configure the alert triggering conditions.
5. Specify the alert name.
6. Optional: Provide an additional alert description to help you keep track of the alert functionality.
7. From the provided list, choose the report servers to which the new alert will be applied. By default, all available report servers are selected.
NOTE
Assigning or un-assigning an alert to a device is not the same as enabling or disabling a definition on a device. A disabled definition is still available on a CAS (although it is inactive), but when you clear the assignment to a device, the definition will be removed from this particular device.
8. Click Next to specify detector settings for the new alert. To find out how to configure detector settings for a link performance alert, see Configuring Triggering Conditions for Link Performance Alerts [p. 43]. For all other alert types, follow the steps described in Configuring Trigger Conditions for Alerts [p. 26].
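The difference between assignment and enablement mentioned in the note under step 7 can be illustrated with a small data model. This is a hypothetical sketch for explanation only: the `AlertRegistry` class, its methods, and the device and alert names are assumptions and do not reflect how DC RUM stores definitions.

```python
class AlertRegistry:
    """Hypothetical model: device -> {alert name: enabled flag}."""

    def __init__(self):
        self._by_device = {}

    def assign(self, device, alert, enabled=True):
        # Assigning places the definition on the device.
        self._by_device.setdefault(device, {})[alert] = enabled

    def set_enabled(self, device, alert, enabled):
        # Enabling or disabling only makes sense for an assigned definition.
        if alert not in self._by_device.get(device, {}):
            raise KeyError(f"{alert} is not assigned to {device}")
        self._by_device[device][alert] = enabled

    def unassign(self, device, alert):
        # Clearing the assignment removes the definition from this device.
        self._by_device.get(device, {}).pop(alert, None)

    def is_present(self, device, alert):
        return alert in self._by_device.get(device, {})

    def is_active(self, device, alert):
        return self._by_device.get(device, {}).get(alert, False)


registry = AlertRegistry()
registry.assign("cas-1", "slow-pages")
registry.set_enabled("cas-1", "slow-pages", False)
print(registry.is_present("cas-1", "slow-pages"))  # still on the device
print(registry.is_active("cas-1", "slow-pages"))   # but inactive
registry.unassign("cas-1", "slow-pages")
print(registry.is_present("cas-1", "slow-pages"))  # removed from the device
```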
Configuring Trigger Conditions for Alerts

Detector settings precisely describe the conditions under which an alert is raised and lowered and when it expires. This basic procedure applies to the following user-defined alert types:
• Real user performance (probe)
• Application user performance
• Enterprise synthetic and sequence
• Citrix/WTS hardware
• Internetwork performance
• Synthetic backbone
For alerts related to link performance, follow the steps described in Configuring Triggering Conditions for Link Performance Alerts [p. 43].

To configure when the alert should be triggered:

Specify which part of traffic you want to monitor with your alert
1. Leave the dimension unselected (default setting). As a result, the thresholds are applied to the aggregated metric values for all dimensions and the notification also contains aggregated information for all dimensions (one entry). To learn how to modify the default settings and how this affects your alert definition, refer to Configuring Optional Alert Detector Settings [p. 29].
2. Leave the default dimension filter settings. The default dimension filters limit the conditions in which the alert is triggered to the real traffic type or to a specific transaction source, depending on the alert type.
3. Select a metric to which the alert will be applied. Depending on the complexity of an alert, you can choose between two options:
• In the Monitored metric section, select a metric from the predefined list. Apart from metric names, the list contains additional controls that enable you to change the default display settings, such as whether metric mnemonics (IDs) are displayed or whether metrics are displayed in alphabetical order or in groups based on their function. For an explanation of all available metrics, see Metrics Available for User-defined Alert Definitions [p. 79].
• Define a compound metric using the Compound metric builder. With the builder, you can define expressions composed of basic metrics and arithmetic operators.
To define a compound metric:
a. Open the metric list.
b. Access the Compound metric builder from the Build compound metric link that is available on the metric list.
c. Click Add metric and select a metric from the list. You need to add at least two metrics, or more, depending on the complexity of the expression you want to define.
d. Choose an arithmetic operator from the list.

The list includes addition +, subtraction -, division /, and a percentage value % (quotient multiplied by 100). Only the addition + and subtraction - operators can be combined in the same expression.
e. Click OK to save the configuration.
For example, define a metric to report the percentage of slow operations in all detected client operations. In the compound metric builder, first add and select the Slow operations metric, then add and choose Client operations, and then choose the % operator and save the definition.

Specify the conditions under which the alert should be raised
The value of the selected metric is constantly monitored and compared against the thresholds that you define. If the conditions are met, the alert is raised. The thresholds can be either single values or ranges of values that are defined by specifying more than one condition.
4. Leave the default comparison mode settings. A comparison mode describes how the metric value should be compared against a threshold value. There are three modes to choose from: single, absolute, and relative. The default is single mode, which works as follows:
• If only a value condition is specified, the current value of the metric being monitored is compared with the threshold value specified in this condition.
• If a baseline condition is specified, the current value of the metric is compared with the baseline value multiplied by the baseline multiplier.
For more information, see Comparison Modes Overview [p. 32].
5. Click Add condition. At least one condition must be set for an alert definition. Note that for an alert to be raised, all defined conditions must be fulfilled.
6. Choose a type for the newly added condition. There are three condition types: value, baseline, and cut-off. Note that depending on other detector settings, some condition types may not be available for selection. For example, in a basic scenario with the dimensions unselected you can use only value conditions. For an explanation of how to use and combine conditions, see Condition Types Overview [p. 33] and Limitations on Using Baseline Conditions [p. 34].
7. Define an alert triggering threshold. For each condition you add, you need to choose a relational operator from a list and define a threshold value. Operators include:
• greater than
• less than
• greater than or equal to
• less than or equal to
• equal to (for value and cut-off conditions only)

NOTE
Threshold values defined for alert-triggering conditions are not the same as the performance thresholds set in the server database for the purpose of generating performance reports.
• Threshold values defined for alert-triggering conditions are used only for the defined alerts and affect only one report: the Alert Log Viewer report.
• Performance thresholds set in the server database for the purpose of generating performance reports affect the appearance of, and values presented on, all of the performance reports.
8. Click Next to configure alert notifications. For more information, see Configuring Alert Notifications [p. 53].

What to Do Next
To define more detailed alert triggering rules, you can modify the default detector settings and configure a number of additional filters. For more information, see Configuring Optional Alert Detector Settings [p. 29].

Configuring Optional Alert Detector Settings

There are several detector settings that you can modify for the alert to be triggered in very specific situations. These settings apply to the following user-defined alert types:
• Real user performance (probe)
• Application user performance
• Enterprise synthetic and sequence
• Citrix/WTS hardware
• Internetwork performance
• Synthetic backbone
For alerts related to link performance, follow the steps described in Configuring Triggering Conditions for Link Performance Alerts [p. 43].

Choose the settings that apply in your situation:

Specify which part of traffic you want to monitor with your alert
• Select a dimension that you want to monitor.
If you select a dimension, the thresholds that you define are applied to each dimension of a selected type (for example, each tier or software service). Also, in the alert notification message, you receive information about each selected dimension separately (a separate entry for each dimension).
If a dimension is not selected, the thresholds are applied to the aggregated metric values for all dimensions and the notification also contains aggregated information for all dimensions (one entry).

To select a dimension:
1. Click Add dimension to add a dimension selection box to the Monitored dimensions section.
2. Move the pointer over the dimension name and click it to display a complete dimension list.
3. Select a dimension from the list. For an explanation of all available dimensions, see Dimensions Available for User-defined Alert Definitions [p. 75].
• Specify additional dimension filters.
By defining filters, you narrow down the applicability of the alert to a specific dimension or to a range of dimensions. You can define filters for dimensions other than those selected in the Select a dimension that you want to monitor step, or even define filters if no dimension was selected.
To add a dimension filter:
1. Click Add filter to add a filter selection box to the Dimension filters section.
2. Move the pointer over the dimension name to display a complete dimension list.
3. Select a dimension from the list. For an explanation of all available dimensions, see Dimensions Available for User-defined Alert Definitions [p. 75].
4. Define a filter expression. For information about filter syntax, see Filtering on text fields [p. 38].

Specify the conditions under which the alert should be raised
• Change the default comparison mode setting.
A comparison mode describes how the metric value should be compared against a threshold value. There are three modes to choose from: single (default), absolute, and relative. In single mode, threshold conditions are compared to a metric value from the last monitoring interval. This is the default and is sufficient for most purposes. For more information, see Comparison Modes Overview [p. 32].
• Only if a baseline condition is selected: Select the baseline source.
This setting specifies whether to use the average or pinned baselines for the baseline condition. By default, the average source is selected. This option is available only for generic performance, transaction performance, and Citrix performance alerts. It will not work unless the selected baseline type is enabled for that combination of metrics and dimensions. For more information, see Baseline Modes in the Data Center Real User Monitoring Administration Guide.
• Specify whether an additional alert should be raised if no traffic data is observed.
Typical alerts are triggered by activity in data files located on the AMD. However, if no traffic data is observed or analyzed, and no data is sent to the report server, the alert cannot be triggered. The Treat no data as zero (0) option allows that specific alert to be raised when no traffic data at all is observed for the monitored software service, server, or site.

NOTE
• If DC RUM becomes inoperative and no monitored traffic data is generated, the alert will not be raised because the performance alert processing is triggered by incoming new monitored traffic data.
• Citrix performance alerts support the Treat no data as zero (0) option only for the Number of sessions and Number of active user sessions metrics.
For example, consider a monitored software service for which a number of operations are expected to occur during every monitoring period. Under normal circumstances, the report would show data for the entire time range. If no data was received for that software service by the report server, the report would not reflect the missing monitoring period. The report would present analysis on incomplete data (the missing data sets would be unacknowledged).
By default, this option is disabled. To enable it, select the Treat no data as zero (0) check box.
• Select an auxiliary metric.
This is an optional parameter and is not compared with any thresholds. You select it from a predefined metric list or compose it from several metrics as described in Step 3 [p. 27]. The value of this metric can be included in the alert message, so it can be used for supplying additional information to the alert recipient.
For example, if in Step 3 [p. 27] you defined an expression that calculates the percentage of slow operations in the total number of detected operations, you can select Operations as your auxiliary metric. This results in both the percentage of slow operations and the total number of operations being reported in the notification message. If the number of operations turns out to be very small, then even if the percentage of slow operations is substantial, the problem may not be serious even though the triggering conditions are met and the alert is raised.
An auxiliary metric can also be used when defining output filters for an alert. For more information, see Specifying Output Filters [p. 37].
• Configure output filters.
On the Output filters tab, you can set additional conditions on alert output fields, so that only alerts satisfying these conditions are raised. For more information, see Specifying Output Filters [p. 37].
• Modify default alert propagation settings.
This is done on the Propagation settings tab. By default, the alert is raised after one interval in which the defined conditions are met and lowered after one interval during which the conditions are not fulfilled, but you can change this setting and also specify when the alert should be re-issued or canceled. Additionally, you can define additional messages that are appended to the main notification message whenever a specific event occurs. For more information, see Modifying Alert Propagation Settings [p. 40].
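The slow-operations example used for the compound metric and the auxiliary metric can be worked through numerically. The sketch below is an illustration only: the function names and the returned dictionary are hypothetical, the threshold is treated as a simple "greater than" value condition, and returning `None` for a zero denominator is an assumption, since this guide does not say how the product handles that case.

```python
def percentage(numerator, denominator):
    """The % operator: quotient multiplied by 100 (None if undefined)."""
    if denominator == 0:
        return None  # assumption: no value when there were no operations
    return numerator / denominator * 100


def evaluate(slow_operations, operations, threshold_pct):
    """Compare the compound metric with a threshold; report the auxiliary metric."""
    value = percentage(slow_operations, operations)
    raised = value is not None and value > threshold_pct
    # The auxiliary metric is reported but never compared with a threshold.
    return {"raised": raised, "metric": value, "auxiliary": operations}


# 3 slow operations out of 4: 75% exceeds a 50% threshold, so the alert is
# raised, but the auxiliary value shows that only 4 operations were observed.
print(evaluate(3, 4, 50))
# 300 slow operations out of 4000 is 7.5%, so no alert is raised.
print(evaluate(300, 4000, 50))
```

The first result shows why the auxiliary metric is useful in the notification: the percentage alone would suggest a serious problem.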

Comparison Modes Overview

A comparison mode describes how a metric value should be compared against a threshold value. There are three modes to choose from: single, absolute, and relative.
In single mode, threshold conditions are compared to a metric value from the last monitoring interval. This is the default and is sufficient for most purposes. In some cases, however, you may be interested in an abnormal change of a certain metric value between two monitoring intervals. Then you should use either absolute or relative mode. These two modes are similar, but an absolute comparison uses the simple difference between the metric values, while a relative comparison uses the difference expressed as a percentage. Note that while the interpretation of the value threshold condition for these two modes is quite natural, the baseline condition uses a more complex formula to normalize comparison against the baseline.

Single
• If only a value condition is specified, the current value of the metric being monitored is compared with the threshold value specified in this condition.
• If a baseline condition is specified, the current value of the metric is compared with the baseline value multiplied by the baseline multiplier.
• A cut-off condition is not available in the single mode.

Absolute
• If no baseline condition is specified, the increments in the measured metric are compared to the value specified in the value condition. That is, if the measured metric assumed the value of A in one monitoring interval and then value B in the next one, the value of B - A will be taken.
• If a baseline condition is specified, the alert calculates a percentage increase in the actual absolute increment of the metric versus the absolute baseline increment. This means that for two subsequent monitoring intervals, we measure the absolute increment in the value of the metric and subtract from it the increment of the baseline value over the same monitoring intervals:
(B - A) - (baseline_2 - baseline_1)
We then take the resulting value relative to the difference in baselines, that is, we divide it by the absolute (positive) value of (baseline_2 - baseline_1) and multiply it by 100%:
(((B - A) - (baseline_2 - baseline_1)) / |baseline_2 - baseline_1|) * 100%
where the pipe symbols (| |) denote the absolute (that is, always positive) value of a number. The result is compared with the value the user entered in the baseline threshold field.

Relative
• If no baseline condition is specified, the relative increments in the measured metric are compared to the value specified in the threshold field. That is, if the measured metric assumed the value of A in one monitoring interval and then value B in the next one, the value that would be taken is:
((B - A)/A) * 100%

• If a baseline condition is specified, the alert calculates a percentage increase in the actual relative increment of the metric versus the relative baseline increment. The calculations are similar to those performed for the absolute mode, except that all differences in metric values or in baseline values are relative:
(B - A)/A and (baseline_2 - baseline_1)/baseline_1
This gives the following formula:
(((B - A)/A - (baseline_2 - baseline_1)/baseline_1) / (|baseline_2 - baseline_1| / baseline_1)) * 100%
The result is compared with the value the user entered in the baseline threshold field.

Condition Types Overview

When defining a new alert, you have to define at least one condition that describes when the alert should be raised. If more conditions are specified, all must be fulfilled for an alert to be raised. There are three condition types you can choose: value, baseline, and cut-off.

Value condition
Use it to specify a constant threshold value for the metric or metric expression. By defining two value conditions for a selected metric, you can specify a range of values. For example, if one condition is set to greater than or equal to 100, and the other to less than or equal to 200, the range covers all values between 100 and 200, including 100 and 200. You can define a maximum of two value conditions for your alert.

Baseline condition
Use it to compare the metric value with the calculated baseline value (see baseline data [p. 164]), usually multiplied by a specified number (baseline multiplier).
In the single alert comparison mode, the current value of the metric being monitored is compared with the baseline value multiplied by the baseline multiplier. For example, for a metric measuring the number of transactions, greater than 2 means that for the alert to be raised the number of transactions must be greater than two baseline values.
In the absolute or relative alert comparison modes, the calculated absolute or relative value of the metric increment is compared to the specified percentage value.
The baseline condition may be accompanied by a value condition. For example, if the value condition is set to greater than 500 and the baseline condition is set to greater than 2, the alert will be raised when the number of transactions exceeds two baseline values and also exceeds the absolute value of 500 transactions.
NOTE
A baseline condition is only available if the alert detector settings include at least one dimension or combination of dimensions from the predefined list. For more information, see Limitations on Using Baseline Conditions [p. 34].
You can define only one baseline condition.

Cut-off condition
Use it to eliminate false alerts being triggered by insignificant changes of the monitored metric. For example, an increase in response time from 0.1 ms to 0.5 ms (milliseconds) may appear to be a substantial change (400%), but in real terms it is only 0.4 ms, which might be insignificant. Setting the cut-off condition for response time to >1 will ensure that the change in the metric will be considered only for values above 1 ms.
This condition applies only to the last monitoring interval, and is available only for the differential (absolute and relative) comparison modes.
You can define a maximum of two cut-off thresholds for a selected metric.

Availability of conditions for different comparison modes
The availability of condition types is affected by the selected comparison mode in the alert detector settings. A comparison mode describes how the current metric values are compared against thresholds or baselines. Depending on the nature of the metric, you can monitor whether its value simply exceeds another predefined value (single mode), or you can monitor how much the value changed over time, as compared with how fast the baseline changed for the corresponding period of time (absolute and relative mode).
Combinations of conditions allowed in the single comparison mode:
• Value
• Baseline
• Value and baseline
Combinations of conditions allowed in the absolute and relative comparison modes:
• Value
• Baseline
• Value and baseline
• Value and cut-off
• Baseline and cut-off
• Value, baseline and cut-off
The comparison mode is not the only element that may limit the use of some conditions. As mentioned in the note above, a baseline condition is only available if the alert detector settings include at least one dimension or combination of dimensions from the predefined list. For more information, see Limitations on Using Baseline Conditions [p. 34].
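The comparison-mode formulas and the allowed condition combinations can be restated as a short Python sketch. It is meant as a reading aid rather than a description of the product's internals: all function and variable names are hypothetical, `prev` and `curr` stand for A and B, `base_prev` and `base_curr` stand for baseline_1 and baseline_2, and the sketch does not handle zero denominators (a previous value of 0 or an unchanged baseline).

```python
def absolute_change(prev, curr):
    """Absolute mode without a baseline condition: B - A."""
    return curr - prev


def relative_change(prev, curr):
    """Relative mode without a baseline condition: ((B - A) / A) * 100%."""
    return (curr - prev) / prev * 100


def absolute_vs_baseline(prev, curr, base_prev, base_curr):
    """Absolute mode with a baseline condition."""
    base_delta = base_curr - base_prev
    return ((curr - prev) - base_delta) / abs(base_delta) * 100


def relative_vs_baseline(prev, curr, base_prev, base_curr):
    """Relative mode with a baseline condition (all increments are relative)."""
    metric_rel = (curr - prev) / prev
    base_rel = (base_curr - base_prev) / base_prev
    return (metric_rel - base_rel) / (abs(base_curr - base_prev) / base_prev) * 100


# Condition combinations listed in the text for each comparison mode.
BASIC = [{"value"}, {"baseline"}, {"value", "baseline"}]
WITH_CUTOFF = [combo | {"cut-off"} for combo in BASIC]
ALLOWED = {
    "single": BASIC,
    "absolute": BASIC + WITH_CUTOFF,
    "relative": BASIC + WITH_CUTOFF,
}


def is_allowed(mode, conditions):
    """Check whether a set of condition types is allowed in a comparison mode."""
    return set(conditions) in ALLOWED[mode]


# Metric grows by 60 while the baseline grows by 20: (60 - 20) / 20 * 100.
print(absolute_vs_baseline(100, 160, 50, 70))
# Metric doubles while the baseline grows by half.
print(relative_vs_baseline(100, 200, 50, 75))
# The cut-off example from the text: 0.1 ms to 0.5 ms is a change of about 400%.
print(relative_change(0.1, 0.5))
print(is_allowed("single", ["value", "cut-off"]))
```

In the last example the relative change is large, but a cut-off condition of >1 applied to the latest value (0.5 ms) would not be met, so the change would be ignored.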
Limitations on Using Baseline Conditions

A baseline condition is used to specify a multiplier of a baseline value for a metric. Whether it is available when you configure a new metric alert depends on a few other alert detector settings.
For a baseline condition to work, you first need to select a monitored dimension in the alert detector settings. Although this step is not obligatory for other condition types, the baseline condition will not be available for selection if you skip choosing at least one dimension group. Selecting a dimension also means that you choose to receive alert notifications for each representative of the group (for example, every tier or every software service) and not one generic notification for all dimensions.
When configuring an alert using a baseline condition, remember that baseline values used by this condition type may in some situations not be calculated, as is the case during the first 48 hours after you install DC RUM. To prevent an alert definition from being activated before the baselines are calculated, you need to enable Delayed processing. For more information, see Modifying Alert Propagation Settings [p. 40].
Another limitation is that a baseline threshold is available only for some dimensions or dimension combinations.
NOTE
To use an alert based on a baseline condition, you must enable the selected baseline type on the report server. Otherwise, the alert will not be triggered and the notifications will not be sent. For more information, see Baseline Modes in the Data Center Real User Monitoring Administration Guide.

Generic performance alerts
For a generic performance alert, the following dimensions can be used with a baseline condition:
• Software service
• Reporting group
• Application
• Transaction
• Tier
• Site
• Area
• Region
• WAN link name
Note, however, that the above dimensions can be used either singly or only in one of the following combinations:
• Software service, server IP address
• Software service, server IP address, operation
• Software service, server IP address, task
• Software service, server IP address, module
• Software service, server IP address, service
• Reporting group, software service, server IP address, operation
• Application, software service, server IP address, operation
• Transaction, software service, server IP address, operation
• Tier, software service, server IP address, operation

   Site, software service
   Area, software service
   Region, software service
   WAN link name, software service

Transaction performance alerts

For a transaction performance alert, the following dimensions can be used with a baseline condition:
   Application
   Transaction
   Region
   Site
   Area

The dimensions can be used in the following combinations:
   Application, transaction
   Application, region
   Application, area
   Application, site
   Transaction, region
   Transaction, area
   Transaction, site

Citrix performance alerts

For a Citrix performance alert, the only dimension that can be used with a baseline condition is a software service. It can only be combined with a server IP address.

Limitations on Using Metrics in Alert Definitions

Because of the character of some dimensions and dimension combinations, some metrics may not be available for alert definitions.
   Distribution metrics are not calculated for WAN links and sites.
   User metrics (unique users and performance, network or availability affected users) are not calculated for WAN links.
   User metrics (unique users and performance, network or availability affected users) are not calculated for the following dimension combinations:
      Reporting group, software service, server IP address, operation
      Application, software service, server IP address, operation
      Transaction, software service, server IP address, operation

      Tier, software service, server IP address, operation

Specifying Output Filters

You can set additional conditions on metric alert output fields, so that only alerts satisfying these conditions are raised.

To add an output filter for an alert:
1. In the second step of the alert definitions wizard (Define triggering and propagation conditions), switch to the Output filters tab.
2. Click Add filter group.
   A pane is displayed in which you can select an element (a dimension, a unit of measure, an auxiliary metric value, or an alert comparison mode) that you want to filter on, and where you can type a filtering expression.
   a. Move the pointer above the element (dimension, metric, or other) to display a list of elements to which you can apply a filter.
   b. Select an element from the list.
   c. Type the filter expression.
      When defining a filter, use the basic syntax guidelines hidden under the icon. For more information, see Filter Syntax [p. 37].

Basic usage example

Assume that in the detector settings you have defined a compound metric that reports the percentage of slow operations in the total number of detected operations, and that you selected operations as an auxiliary metric, so that the value of this metric can be reported in the notification message.

To have the alert raised only if the number of operations exceeds 1000, add a filter group, select Auxiliary metric value from the element list, and then set the filter value to >1000. Such a condition is applied to the entire traffic, but you can make it apply only to selected dimensions such as specific software services. To do that, add another filter in the same filter group, choose Software service from the element list, and specify the service name as the filter value.

If you want the alert to be raised when the number of operations exceeds 1000 for software service A or 2000 for software service B, you need two separate filter groups, one for each software service.

NOTE
For information on output filters for link performance alerts, see Configuring Output Filters for Link Performance Alerts [p. 45].

Filter Syntax

The described syntax rules apply to the Dimension filters and Output filters in the alert definition wizard.

Filtering on numeric fields

The following syntax can be used for numeric fields:
   A single numeric value
      To match one particular value.
   All numbers less than the specified value
      Use a less-than sign < followed by the number. For example, <400 means all values less than 400.
   All numbers greater than the specified value
      Use a greater-than sign > followed by the number. For example, >400 means all values greater than 400.
   All numbers less than or equal to the specified value
      Use a less-than sign < followed by an equal sign = and the number. For example, <=400 means all values less than or equal to 400.
   All numbers greater than or equal to the specified value
      Use a greater-than sign > followed by an equal sign = and the number. For example, >=400 means all values greater than or equal to 400.
   A range of numbers
      Use a dash to specify a range of numbers, including the numbers at both ends of the interval. For example, 127-255 means all values between 127 and 255, including 127 and 255.
   A negative condition
      Use a tilde character ~ to match all values except those that conform to the specified pattern. For example, ~400 will filter all values that are not 400.
   Logical disjunction (OR) of your match conditions
      Use a pipe symbol | to filter values that match one of the specified conditions. For example, 400|500 will filter all values that are 400 or 500.
   Logical conjunction (AND) of your match conditions
      Use an ampersand & to filter values that match all of the specified conditions. For example, >400 & <500 will filter all values greater than 400 and less than 500. Note that specifying a range would not be equivalent, since a range includes both end values.
   Enumeration of values to match
      Use a comma to enumerate values.
   Any value
      An empty pattern means there is no filter and all values will be accepted.
   Value range suffixes
      You can also use the suffixes k, M, G, T for kilo, mega, giga, and tera.
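The numeric rules above can be illustrated with a small evaluator. This is only a sketch for reasoning about filter expressions, not the product's parser. The function name matches_numeric is hypothetical, and two behaviors are assumptions because the guide does not state them: AND binds tighter than OR and enumeration, and the suffixes are decimal multipliers (k = 1000). Negative numbers are not handled.

```python
import re

# Assumption: suffixes are decimal multipliers (k = 1000, M = 10**6, ...).
SUFFIX = {"k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}
_NUM = r"(\d+(?:\.\d+)?)([kMGT]?)"


def _num(digits, suffix):
    """Convert a digit string plus optional suffix to a number."""
    return float(digits) * SUFFIX.get(suffix, 1)


def _atom(term, value):
    """Evaluate one condition: ~negation, comparison, range, or single value."""
    term = term.strip()
    if term.startswith("~"):
        return not _atom(term[1:], value)
    m = re.fullmatch(r"(<=|>=|<|>)\s*" + _NUM, term)
    if m:
        bound = _num(m.group(2), m.group(3))
        return {"<": value < bound, ">": value > bound,
                "<=": value <= bound, ">=": value >= bound}[m.group(1)]
    m = re.fullmatch(_NUM + r"\s*-\s*" + _NUM, term)
    if m:  # a range includes both end values
        low, high = _num(m.group(1), m.group(2)), _num(m.group(3), m.group(4))
        return low <= value <= high
    m = re.fullmatch(_NUM, term)
    if m:
        return value == _num(m.group(1), m.group(2))
    raise ValueError(f"unsupported filter term: {term!r}")


def matches_numeric(pattern, value):
    """Return True if value passes the filter; an empty pattern accepts all."""
    if not pattern.strip():
        return True
    # Assumed precedence: '|' and ',' separate alternatives, '&' joins
    # conditions inside one alternative.
    return any(all(_atom(t, value) for t in alt.split("&"))
               for alt in re.split(r"[|,]", pattern))
```

For instance, matches_numeric(">400 & <500", 400) is False while matches_numeric("127-255", 127) is True, which mirrors the remark that a range is not equivalent to a pair of strict comparisons.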
Filtering on text fields

The following syntax can be used for text fields:
   Match any string containing the specified pattern
      If you specify an unquoted string, it will be matched with all strings that contain that substring. For example, RG will filter all strings containing RG, such as RG_1, BUSS_RG, BG_RG_3.

   Match a string exactly
      If you enclose the string you specify in quotation marks ("), it will be matched exactly, that is, it will only filter that string and not strings that contain it. For example, "RG_2".
   Any character
      If in a given place in your pattern you want to relax your match condition to match any character, use a question mark ?. For example, A?C will match ABC as well as ACC.
   Any substring
      If in a given place in your pattern you want to relax your match condition to match any substring, use the asterisk character *. For example, *RG will filter all strings ending with RG and RG* will filter all strings starting with RG.
   A negative condition
      You can request to match all strings except those that conform to the specified pattern. To do this, precede your pattern with the tilde character ~. For example, ~WWW will filter all strings that do not contain the substring WWW.
   Logical disjunction (OR) of your match conditions
      To filter a string that matches one of the specified strings, use a pipe symbol |. For example, WWW|HTTP will filter all strings containing WWW or HTTP.
   Logical conjunction (AND) of your match conditions
      To filter a string that matches all of the specified strings, use an ampersand &. For example, WWW&HTTP will filter all strings containing WWW and HTTP.
   Enumeration of strings to match
      Use a comma to enumerate values.
   Match any string
      An empty string means there is no filter and all values will be accepted. This is equivalent to specifying a single asterisk *.

You can combine conditional syntax with logical syntax as shown in the example below.

Example of combined filter syntax

Assume that you need to filter out the following services:
   SMTP_PROD
   DNS
   FTP
   HTTP
For that purpose, type the following expression:
   ~SMTP_PROD & ~DNS & ~FTP & ~HTTP

Filtering on IP address fields

The following syntax can be used for IP address fields: #.#.#.#, where # is any integer from 0 to 255.
You can also use an asterisk * in place of a number. Both IPv4 and IPv6 address types are supported.
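The #.#.#.# form with asterisks can be sketched as an octet-by-octet comparison. The function name matches_ipv4 is hypothetical, and the sketch covers only the IPv4 dotted form described here. The guide does not describe how IPv6 patterns are written, so they are left out.

```python
def matches_ipv4(pattern, address):
    """Compare an IPv4 address against a #.#.#.# pattern where '*' is a wildcard."""
    pattern_parts = pattern.split(".")
    address_parts = address.split(".")
    if len(pattern_parts) != 4 or len(address_parts) != 4:
        raise ValueError("expected four dot-separated fields")
    for p, a in zip(pattern_parts, address_parts):
        if not (a.isdigit() and 0 <= int(a) <= 255):
            raise ValueError(f"invalid address octet: {a!r}")
        if p == "*":
            continue  # the asterisk stands in for any number
        if not (p.isdigit() and 0 <= int(p) <= 255):
            raise ValueError(f"invalid pattern octet: {p!r}")
        if int(p) != int(a):
            return False
    return True
```

With this reading, the pattern 10.1.*.* accepts 10.1.20.3 and rejects 10.2.0.1.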

Modifying Alert Propagation Settings

Alert propagation characteristics specify how the alert is triggered and canceled. By default, an alert is raised after one monitoring interval during which the triggering conditions are fulfilled. In most cases the default settings will be sufficient, but you can modify them as needed.

You can modify the propagation settings in the Detector settings step of the alert definition wizard.
1. Switch to the Propagation settings tab.
2. Select the propagation characteristics you want to apply to the alert.
   The following alert propagation options are supported:
   Raise after n intervals
      This setting specifies the number of intervals that must occur before the alert can be raised.
   Cancel after n intervals
      After the alert occurs, this option specifies the number of subsequent monitoring intervals for which the alert condition must not be present for the alert to cancel itself. That is, after this many monitoring intervals without an event occurring, it will be assumed that conditions are back to normal. The default value is one interval. This option has to be enabled if you want to receive notification messages after the alert state is over. For more information, see Configuring Alert Notifications [p. 53].
   Reissue every n intervals
      Specifies the number of events that must occur following the last notification of the alert before the alert notification is re-issued.
   Reissue every n minutes
      Specifies the number of minutes that must elapse following the last notification of the alert before the alert notification is re-issued.
   Abort after n minutes
      The number of minutes after which the alert will expire, regardless of whether the triggering conditions are fulfilled. The default value is 60 minutes.
   Delayed processing
      This option affects mainly alert definitions that require normal (baseline or average) values of metrics to be calculated. It means that the alert will become active a predefined time after system startup (default: two days). It can be enabled in situations when the baselines are not yet calculated (for example, during the first 48 hours after you install DC RUM). See also normal value [p. 165] and baseline data [p. 164].
      The option should also be enabled if you want to use predefined New objects alerts reporting on new applications, servers, services, users or workstations detected in your network. This will enable DC RUM to learn which objects belong to your network before raising an alert.

   NOTE
   Reissuing and aborting alerts is possible only if you first specify when the alert should be canceled. However, after the canceling of an alert is enabled, aborting is automatically enabled as well and cannot be disabled, whereas re-issuing is optional. When you want your alert to be reissued, you can specify either the number of intervals or the number of minutes after which it should be repeated, but not both.

   Some of these options, when selected, allow you to define an annotation, which is additional text to be appended to the basic alert message. An annotation is propagation-specific, which means that the annotation for an alert cancellation can be different from the annotation for an alert re-issue.
3. Proceed to the next wizard step or finish the configuration.
   To proceed to notification configurations, click Next.
   To save the configuration, choose Finish.

Visual representation of alert propagation settings

The diagrams below show the relationships between the time when alert triggering conditions are met, the moment when an alert is raised, and the moment when an alert notification is sent. The time shown is not the real time of an event but the time as it is reported in the Alert Log, where the timestamp from the beginning of a given interval is used. For example, if you specify that the alert should be raised after 3 consecutive monitoring intervals, and the triggering conditions are met, the reported timestamp will be that of the beginning of the third interval. This is why on a diagram it may look as if the alert was actually raised after the second interval, and not after the third one. However, aborting an alert is marked exactly as it happens. For example, if an abort time is set to 15 minutes, it is shown at the beginning of the fourth interval.

Example 1. Alert raised after n intervals.
To ensure that the alert is triggered only when the problem persists in time, you can configure the alert to be sent only after several consecutive intervals (by default, 1) during which the conditions specified in the alert definition are met.
Example setting:
   Raise after 3 monitoring intervals during which the conditions are fulfilled.
[Figure: timeline of notifications (alert raised), alert state, and conditions fulfilled per monitoring interval; horizontal axis: Time [min]]
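The interplay of Raise after n intervals and Cancel after n intervals can be sketched as a small state machine. The function simulate and its parameter names are hypothetical. The sketch assumes that the intervals must be consecutive, as in Example 1, and it ignores reissuing, aborting, and delayed processing. Events are labeled with the zero-based index of the interval in which they occur, which corresponds to the Alert Log convention of using the timestamp from the beginning of that interval.

```python
def simulate(conditions, raise_after=1, cancel_after=1):
    """Return (interval_index, event) pairs for a sequence of per-interval results.

    conditions: iterable of booleans, True when the triggering conditions
    were fulfilled in that monitoring interval.
    """
    events = []
    active = False
    met_run = 0      # consecutive intervals with conditions fulfilled
    clear_run = 0    # consecutive intervals without the conditions
    for index, met in enumerate(conditions):
        if met:
            met_run += 1
            clear_run = 0
        else:
            clear_run += 1
            met_run = 0
        if not active and met_run >= raise_after:
            active = True
            events.append((index, "raised"))
        elif active and clear_run >= cancel_after:
            active = False
            events.append((index, "canceled"))
    return events
```

With raise_after=3, three fulfilled intervals in a row produce a "raised" event labeled with index 2, the third interval, which is why a diagram can look as if the alert was raised after the second interval.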

Example 2. Alert canceled after n intervals.
Specifying when the alert should be canceled means that it is not automatically lowered until the specified number of intervals elapses.
Example settings:
   Raise after 1 monitoring interval during which the conditions are met.
   Cancel after 4 monitoring intervals during which the conditions are not fulfilled.
   Abort after 30 minutes.
[Figure: timeline of notifications (alert raised, alert canceled; alert aborted abandoned due to earlier cancellation), alert state, and conditions fulfilled per monitoring interval; horizontal axis: Time [min]]

Example 3. Alert aborted after n minutes after the alert was triggered.
You can configure the alert to be aborted even if the triggering conditions are still present.
Example settings:
   Raise after 1 monitoring interval during which the conditions are met.
   Cancel after 1 interval during which the conditions are not fulfilled.
   Abort after 15 minutes.
[Figure: timeline of notifications (alert raised, alert aborted; alert canceled abandoned due to earlier abort), alert state, and conditions fulfilled per monitoring interval; horizontal axis: Time [min]]

Example 4. Alert reissued every n minutes after the alert was triggered.
In this example, an alert is first raised immediately after the trigger conditions appear, and then reissued every ten minutes.
Example settings:
   Raise after 1 monitoring interval during which the conditions are met.
   Cancel after 2 intervals during which the conditions are not fulfilled.
   Abort after 40 minutes.
   Reissue every 10 minutes.
[Figure: timeline of notifications (alert raised, alert reissued, alert canceled; alert aborted abandoned due to earlier cancellation), alert state, and conditions fulfilled per monitoring interval; horizontal axis: Time [min]]

Configuring Triggering Conditions for Link Performance Alerts

Link performance alerts are metric alerts that operate on a set of predefined dimensions and metrics that report on link utilization.

To configure when a link performance alert should be raised:
1. Optional: Specify a dimension filter.
   By defining filters, you narrow down the applicability of the alert to a specific dimension or to a range of dimensions. For a link performance alert, you can filter traffic on the following dimensions:
   Link name
      The name of the monitored link as visible on the Link View Status - Links CAS report.
   Link type
      The type of the monitored link as visible on the Link View Status - Links CAS report.
   Link alias
      The alias of the monitored link as visible on the Link View Status - Links CAS report.
   To add a dimension filter:
   a. Click Add filter to add a filter selection box to the Dimension filters section.
   b. Hover the mouse pointer over the dimension name to display a complete dimension list.
   c. Select a dimension from the list.
   d. Define a filter expression.
      For information about filter syntax, see Filtering on text fields [p. 38].
2. Define alert triggering conditions.
   a. Specify the alert mode.

      Before you define the alert triggering conditions, you need to specify a number of generic detector settings for an alert (for example, the type of traffic to which the alert will be applied):
      Choose which value (current or baseline) of the metrics describing the traffic should be compared with the thresholds you will define. The current or baseline value refers to one of the four metrics for a selected traffic type:
         Incoming bytes
         Outgoing bytes
         Incoming utilization
         Outgoing utilization
      Specify the time frame (the number of intervals or minutes) for which the current or baseline metric values should be calculated.
      Decide on the traffic type (link usage traffic or SNMP, that is, router-related traffic) to which the alert should be applied.
      Specify whether the alert should be raised when the current or baseline value of one of the above metrics exceeds the threshold for traffic going in at least one direction (incoming or outgoing), or in both directions at the same time.
   b. Define triggering condition details.
      You can define a maximum of four conditions that describe when exactly the alert should be raised: two for incoming traffic and two for outgoing traffic. By default, all conditions are disabled.
      Minimum incoming threshold
         Minimum incoming utilization or volume. If the value of the selected metric falls below this threshold, the alert is raised. If you do not want an alert to be raised for low utilization or volume, leave this condition disabled.
      Maximum incoming threshold
         Maximum incoming utilization or volume. If the value of the selected metric exceeds this threshold, the alert is raised. If you do not want an alert to be raised for high utilization or volume, leave this condition disabled.
      Minimum outgoing threshold
         Minimum outgoing utilization or volume. If the value of the selected metric falls below this threshold, the alert is raised. If you do not want an alert to be raised for low utilization or volume, leave this condition disabled.
      Maximum outgoing threshold
         Maximum outgoing utilization or volume. If the value of the selected metric exceeds this threshold, the alert is raised. If you do not want an alert to be raised for high utilization or volume, leave this condition disabled.
3. Optional: Switch to the Output filters tab to configure additional filtering rules.
   You can set additional conditions on metric alert output fields, so that only alerts satisfying these conditions are raised. For an explanation of how the output filters work, refer to Configuring Output Filters for Link Performance Alerts [p. 45].

4. Modify default alert propagation settings.
   This is done on the Propagation settings tab. By default, the alert is raised after one interval in which the defined conditions are met and lowered after one interval during which the conditions are not fulfilled, but you can change this setting and also specify when the alert should be re-issued or canceled. You can also define additional messages that are appended to the main notification message whenever a specific event occurs. For more information, see Modifying Alert Propagation Settings [p. 40].
5. Click Next to configure alert notifications.
   For more information, see Configuring Alert Notifications [p. 53].

Configuring Output Filters for Link Performance Alerts

You can set additional conditions on link alert output fields, so that only alerts satisfying these conditions are raised. For a link performance alert, the predefined set of elements on which you can use filters is limited to elements related to link characteristics and link utilization.

To configure output filters for a link alert:
1. In the second step of the alert definitions wizard (Define triggering and propagation settings), switch to the Output filters tab.
2. Click Add filter group.
   A pane is displayed in which you can select an element that you want to filter on and type a filtering expression. The list of available elements includes the link-related dimensions and metrics and an alert mode.
   a. Hover the mouse pointer above the element (dimension, metric, or other) name to display a list of elements to which you can apply a filter.
   b. Select an element from the list.
   c. Type the filter expression.
      When defining a filter, use the basic syntax guidelines hidden under the icon. For more information, see Filter Syntax [p. 37].
   While filtering by a metric value or by a specific dimension is a standard action, filtering by an alert mode requires you to know the code names by which you can refer to each mode:
   SINGLEDIR
      Raise the alert if utilization in at least one direction, incoming or outgoing, is outside the specified threshold values.
   BOTHDIR
      Raise the alert if utilization in both directions, incoming and outgoing, is outside the specified threshold values.
   SNMPSINGLEDIR
      Raise the alert if SNMP utilization in at least one direction, incoming or outgoing, is outside the specified threshold values.
   SNMPBOTHDIR
      Raise the alert if SNMP utilization in both directions, incoming and outgoing, is outside the specified threshold values.

   BASELINE SINGLEDIR
      Raise the alert if baseline utilization in at least one direction, incoming or outgoing, is outside the specified threshold values.
   BASELINE BOTHDIR
      Raise the alert if baseline utilization in both directions, incoming and outgoing, is outside the specified threshold values.
   BASELINE SNMPSINGLEDIR
      Raise the alert if baseline SNMP utilization in at least one direction, incoming or outgoing, is outside the specified threshold values.
   BASELINE SNMPBOTHDIR
      Raise the alert if baseline SNMP utilization in both directions, incoming and outgoing, is outside the specified threshold values.
3. Click Next to proceed to configuring alert notifications or click Finish to save the configuration.
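The difference between the single-direction and both-direction modes can be sketched as follows. The names outside and link_alert are hypothetical, and the sketch assumes that the caller has already picked the right metric values (current or baseline, link usage or SNMP) for the mode. It only shows how the four optional thresholds combine; a disabled condition is represented by None.

```python
def outside(value, minimum=None, maximum=None):
    """True if value falls below an enabled minimum or exceeds an enabled maximum."""
    below = minimum is not None and value < minimum
    above = maximum is not None and value > maximum
    return below or above


def link_alert(mode, incoming, outgoing,
               in_min=None, in_max=None, out_min=None, out_max=None):
    """Decide whether the alert is raised for a mode code name such as BOTHDIR."""
    incoming_out = outside(incoming, in_min, in_max)
    outgoing_out = outside(outgoing, out_min, out_max)
    if mode.endswith("SINGLEDIR"):   # at least one direction
        return incoming_out or outgoing_out
    if mode.endswith("BOTHDIR"):     # both directions at the same time
        return incoming_out and outgoing_out
    raise ValueError(f"unknown alert mode: {mode!r}")
```

For example, with a maximum threshold of 90 in both directions, incoming utilization of 95 and outgoing utilization of 10 raise the alert in SINGLEDIR mode but not in BOTHDIR mode.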

CHAPTER 4 Managing Alert Notification Recipients

Alert notifications can be sent to addresses of report server users, to trap clients, to Compuware Open Server (COS), to mobile users, or to scripts. Information about report server and mobile users is automatically downloaded from Central Security Server and is not configurable on the Alert Management screen, but trap recipients, COSes, and scripts need to be added and configured before they can receive notification messages.

To browse the configured recipients and alert definitions available for assignment, and to define new trap clients or COSes or to configure scripts, go to the Recipients tab on the Alert Management screen.

If the Central Security Server is bound to an LDAP server, the LDAP users are displayed on the Recipients tab only after they log in to a report server or RUM Console.

If a report server user for which notifications are enabled is removed from the CSS database, that user will be marked with a warning sign until those notifications are deleted, at which time the user will also be removed from the list of recipients.

To add or enable notifications for the selected recipients, use the Actions menu available for each alert.

Alert notifications are assigned to report server users based on user names. You can assign notifications to a user for whom no address has been configured, but the notification messages will not be sent as long as no address is defined for that user.

Defining Trap Clients

To receive traps generated by the report server, you need to specify trap clients on the Trap Recipients screen.

Before You Begin
Traps make use of UDP port 162. For traps to function, this port must not be disabled and there must be at least one trap receiver activated and running on the server.

To add a trap recipient:
1. Start and log on to RUM Console.
2. In the RUM Console, select Alerts from the top menu.

   The Alert Management window appears.
3. On the Alert Management screen, switch to the Recipients tab and choose Trap recipients.
   A screen opens listing all trap recipients and available alert definitions.
4. Click Recipient Configuration.
   The Recipient Configuration window opens.
5. On the Trap Recipients tab, click Add.
   The Add Trap Recipient window opens.
6. Specify the trap parameters.
   Host
      IP address of the trap recipient.
   Read Community Name
      Community name of the trap recipient.
   Port
      Port number of the trap recipient.
7. Click Save to save the data and close the pop-up window.
   The new trap client appears on the trap recipients list. By default, the added recipient is activated.
8. Click Close.
9. Optional: Enable IPv6 support.
   By default, a report server generates SNMP version 1 traps, which support only IPv4 addresses. If you want the notifications to also include IPv6 addresses, you must use SNMP version 2 traps and modify the commonalarm-hcbs.properties configuration file on the report server.
   a. Locate and open the commonalarm-hcbs.properties file on a machine that will send alert notifications.
      In a default installation, the entire file path is:
      <install_dir>\config\commonalarm-hcbs.properties
   b. In the file, append an additional value to the MibBuilder.flags property.
      MibBuilder.flags = definestrapswithsmiv2, combinesagent2trapobjid
   c. Save the configuration.

What to Do Next
You can define multiple trap recipients and activate or deactivate them as required. You can also edit the recipients and change the host, Read Community Name, and port information at any time. To activate, deactivate, and edit a trap client, use the links in the Actions table column on the Recipient Configuration screen. To delete a trap client from the list, select the check box beside the recipient name and click the Delete button.

Adding a New Compuware Open Server

Compuware Open Server (COS) version is capable of receiving alert notification messages from DC RUM. You can check the version number of the currently running module in the Administration Console. To open the Administration Console, from the Windows Start menu choose Programs > Compuware > Compuware Open Server > Administration Console.

To add a new Compuware Open Server:
1. Start and log on to RUM Console.
2. In the RUM Console, select Alerts from the top menu.
   The Alert Management window appears.
3. On the Alert Management screen, switch to the Recipients tab and choose Compuware Open Server.
   The screen lists the configured Compuware Open Servers and available alert definitions.
4. Click Recipient Configuration.
   The Recipient Configuration window opens.
5. Click Add.
   The Add Compuware Open Server window opens.
6. Specify information required for the new Compuware Open Server.
   The information includes the server name or IP address, port numbers, and user credentials.
7. Click Save to confirm the new COS information.
8. Click Close to go back to the recipient list.

Configuring Scripts

CAS is able to run command line scripts (for example, batch files) with a set of parameters defined by a user. The parameter values are calculated based on the same macros as for message notifications.

Before You Begin
Before configuring scripts, ensure that:
   You know the location of the executable (a script file or a script interpreter).
   You are familiar with special fields and macros that are mapped to real traffic data and can be used as parameters. For more information, see Step 2 [p. 53].
   The user that runs the Central Analysis Server service has permissions for running the script.

To configure script recipients:
1. Start and log on to RUM Console.

2. In the RUM Console, select Alerts from the top menu.
   The Alert Management window appears.
3. On the Alert Management screen, switch to the Recipients tab and choose Scripts.
   The screen lists all script recipients and available alert definitions.
4. Click Recipient Configuration.
   The Recipient Configuration window opens.
5. On the Scripts tab, click Add.
   The Add Script Recipient window opens.
6. Specify the script parameters.
   Description
      Provide a short description of what the script does.
   Alert type
      Select the alert type for which the script will be executed.
   Executable
      Point to a file containing a program that is executed when an alert occurs. This can be, for example, a batch file (.bat), an executable file (.exe), or an interpreter that executes a script. The file must be located on the same server as CAS.
   Parameters
      Type script parameters. You can use the available macros as parameter values.
      Windows batch files run by the cmd.exe shell do not accept multi-line parameter values. To pass such a parameter value to the cmd executable, for example {notificationmessage}, you need to apply the {encodeurl()} macro to escape newline and other unsupported characters. For example, {encodeurl(notificationmessage)}. Note also the 8192 character command line limit imposed by the cmd executable.
      These limitations do not apply to Windows PowerShell and Python scripts.
   NOTE
   Parameters are configured per script, not per alert. The same script can be a recipient of more than one alert type. To send script notifications for two different types of alerts (requiring two different parameter configurations), you have to add the script recipient twice on the recipient configuration screen.

   Figure 5. Example Script Recipient Configuration
7. Click Save to save the configuration and close the Add Script Recipient window.
   The new script recipient appears in the recipient table.
8. Click Close to leave the Recipient Configuration screen.

What to Do Next
To edit a script recipient, use the links in the Actions table column on the Recipient Configuration screen. To delete a script recipient from the list, select the check box beside the recipient name and click Delete.
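A script recipient that receives a parameter passed through {encodeurl()} has to decode it before use. The following is a minimal sketch of such a script, assuming that the macro produces standard percent-encoding (the guide describes it only as %-denoted escape sequences) and that the encoded notification message arrives as the first command line argument. The function names are hypothetical. What the script does with the decoded text is up to you; here it is simply printed.

```python
import sys
from urllib.parse import unquote


def decode_parameter(raw):
    """Turn %-escaped sequences (for example %0A for a newline) back into characters."""
    return unquote(raw)


def main(argv):
    """Decode the first argument and print it; return a process exit code."""
    if len(argv) < 2:
        print("usage: script <encoded notification message>", file=sys.stderr)
        return 1
    print(decode_parameter(argv[1]))
    return 0


# Run only when an argument was actually supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

In the Parameters field, such a script would be given {encodeurl(notificationmessage)} so that a multi-line message survives as a single argument.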


CHAPTER 5 Configuring Alert Notifications

Unless you configure notification messages, the information about raised alerts will only be available in the alert logs.

Before You Begin
For the alert notifications to be sent, first you need to configure the SMTP server. For more information, see Specifying the SMTP Server for Scheduled Report Mailing in the Data Center Real User Monitoring Administration Guide.

To add and enable sending notifications for a recipient:
1. Select the recipient type.
   To have the notification sent to a user address or to COS, define the message in the Message template text box on the notification configuration screen.
   To have the notification sent to an SNMP trap recipient, you need to specify a separate message on the Trap recipient tab. Due to trap limitations, messages longer than 512 characters will be truncated.
2. Formulate a message template.
   The template can be composed of generic textual information and fields (metrics, dimensions, and other) selected from the list that is accessed via the icon. When a notification is sent, each field is mapped to real traffic data. The meaning of the fields is explained beside each field number:
      {0} Alert Name
      {1} Application
      {2} Transaction
   For example, to be notified about a high number of Message Queue errors detected for a queue, you could define the following template:
      The count of Message Queue errors for queue {8} has exceeded the predefined threshold of {19} (metric value: {17}{18}). The total number of operations for this queue was {20}{21}.

   When received in an e-mail, the message will resemble this example:
      The count of Message Queue errors for queue MQOPEN has exceeded the predefined threshold of 200 (metric value: 342k). The total number of operations for this queue was 533k.
   Below the numbered fields, a set of macros is provided for additional specialized processing of the message, for example:
      {getipaddress()} Normalize IP address
      {urlhostbase} Report Server Address
   Some of the macros are stand-alone while others accept a parameter in the form of a reference to the given field number. Places requiring such a reference are indicated by round brackets ().
   {getipaddress()}
      Normalize IP address: accepts a reference to an IP address field in the alarm template, and returns an IPv4 dotted quad.
   {urlhostbase}
      Returns the report server URL.
   {sts}
      Returns the timestamp of the start of the last monitoring interval for Software service, operation, and site data (zdata). Note that a timestamp is expressed as a number and not as a date/time string.
   {encodeurl()}
      Encodes the given field in the alarm template, such as a URL, to replace special characters, for example spaces, using %-denoted escape sequences.
   {etszdata}
      Returns the timestamp of the end of the last monitoring interval for Software service, operation, and site data (zdata). Note that a timestamp is expressed as a number and not as a date/time string.
   {winsup4transdata}
      Returns the timestamp of the start of the monitoring interval for Synthetic and sequence transaction data (transdata). Note that a timestamp is expressed as a number and not as a date/time string. This macro works only for ADS.
   {getdate()}
      Translates a reference to a timestamp field to a date and time string.
   {etstransdata}
      Returns the timestamp of the end of the monitoring interval for transdata. Note that a timestamp is expressed as a number and not as a date/time string.
{samplinginterval} Returns the length of the monitoring interval, expressed in minutes. Add report URL Includes a link to a DMI report in the alert notification message. This option is available only for CAS 12.3 or later. In the Add report URL window: 54

a. In the Section list, select the report group. If you have more than one report server attached to the RUM Console, you see only the sections that are common to all report servers.
b. In the Link to report list, select the DMI report whose URL will be included in the notification message. You can choose from all saved reports, regardless of the report owner. If you have more than one report server attached to the RUM Console, you see only the reports that are common to all servers.
c. Select monitored dimensions to be used as report filters. The list of monitored dimensions is taken from the Triggering and propagation settings screen. For more information, see Configuring Optional Alert Detector Settings [p. 29].
d. Decide whether to use the alert generation time as a report time range.
3. Optional: Define the message subject for email recipients.
The subject is a key for aggregating alert notifications. Each email has a default subject defined. If you leave the default message subject, you will receive a single notification message about all alerts raised in a single monitoring interval. To change the default subject, type the new subject in the text box. You can use the same macros as in Step 2 [p. 53]. This option is available only for CAS 12.3 or later.
NOTE Notifications with a user-defined message subject are sent separately. Alert aggregation is not available. If you use macros, the notifications will be aggregated for those alerts that have the same parameter values in the message subject (for example, the same application name).
4. Assign the notification to a selected recipient.
To keep the default notification settings, select Enable notifications from the Actions menu for a given recipient.
To limit sending of notifications to specific conditions, select Edit notifications from the Actions menu for a given recipient. This opens the notification edit screen. See Step 5 [p. 55] through Step 10 [p. 57].
5. Optional: Specify when the notifications should be sent.
By default, the notification is always sent to the specified recipient, provided the alert conditions are met. However, you can define a set of filters that limit sending notifications to specific situations. You can define several filtering criteria in a notification, and all of them have to be met for a message to be sent.
To limit sending of notifications to specific conditions:
a. In the Notification pane, click Add filter.
b. Place the mouse pointer over the element list and choose an element (for example, a metric) to which you want to apply the filter.

c. Define a filter expression. For information about filter syntax, see Filtering on text fields [p. 38].
6. Optional: Specify whether you want to receive a single notification message about all alerts raised in a single monitoring interval.
NOTE This option is available only if your selected recipient is a CAS user. For notifications sent by traps or to a COS, all messages are always sent separately.
This option is configured with the Aggregate check box for each device. By default, the check box is selected. An example of an aggregated message:
Timestamp: date and time
Realized bandwidth too low
Affected location: Default
Normally: bps
Currently: bps
Affected users: 15
Realized bandwidth too low
Affected location: MY Location
Normally: bps
Currently: bps
Affected users: 9
Realized bandwidth too low
Affected location: Sienna AS
Normally: bps
Currently: bps
Affected users: 3
If you cleared the Aggregate check box, you would receive three separate notifications instead of the above message.
7. Optional: Add another notification.
You can have more than one notification in an alert definition. Note that all defined filtering criteria have to be met for at least one notification, and not necessarily for all notifications. To add a notification, click Notification and then repeat Step 5 [p. 55] through Step 6 [p. 56].
8. Optional: Specify whether you want to receive a notification when the alert state is finished.
For that purpose, select the Send on alert finish check box. When the alert state is finished, you will receive a notification that reports the current metric value. To change the default behavior and make the alert mechanism report the metric value from the moment the alert state started, you need to use the Advanced Properties Editor configuration screen on the CAS. Note that this setting will only work if, in the alert propagation settings, you have specified when the alert should be canceled and aborted. For more information, see Modifying Alert Propagation Settings [p. 40].
9. Optional: Enable or disable the notifications for mobile users.
On the Mobile tab, specify whether you want mobile users to receive a notification by selecting the Send push notifications check box.

10. Click OK to save the notifications.
11. Click Next to review the alert definition summary and to publish the configuration.

What to Do Next

At any time, you can disable, enable, or delete the alert notifications for the selected recipients:
On the Recipients tab of the Alert management screen.
On the Users, Traps, or COS tabs of the Configure Alert Notifications screen.
(Deleting only) On the notification edit screen for a selected recipient.
To change the notification status, use the Actions menu.

Sending SNMP Alert Notifications to a Single Trap Manager

When several report servers send notifications to a single trap manager, you must enable unique object identifier (OID) generation, either automatically or manually, or the manager will map trap definitions and instances incorrectly.

To initiate creation of unique trap OIDs on a machine that will send alert notifications:
1. Locate the configuration file commonalarm-hcbs.properties on a machine that will send alerts. In a default installation, the entire file path is:
<install_dir>\config\commonalarm-hcbs.properties
2. Add two properties to this file to give you control over the way OIDs are generated.
MibBuilder.flags = combinesagent2trapobjid
With this setting, OID sub-identifier strings are generated automatically based on the service IP address and port number.
MibBuilder.agentId = NON_NEGATIVE_INTEGER
With this setting, you define the OID sub-identifier manually, where NON_NEGATIVE_INTEGER is a non-negative integer you specify. Note that you cannot use this property alone; use it in combination with MibBuilder.flags to enable OID generation.

Disabling the Alert Engine on CAS

To disable the entire alert system, edit the report server configuration file commonalarm-hcbs.properties.
1. Open the commonalarm-hcbs.properties file in a text editor. You will find this file on the report server in the <install_dir>\config\ directory.
2. Add the following line to the file:
CAE_AlarmEngine.doNOTprocessAlarms=true
The alert engine (detection and generation mechanisms) will be disabled when you restart the report server, and alert notifications will not be sent.

What to Do Next

To restore the alert engine operation, remove this line or comment it out.
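Before restarting the report server, it can be useful to check the properties described in the two sections above. The following Python sketch is not part of the product: the helper names parse_properties and check_alert_config are hypothetical, the agentId value 7 is an arbitrary example, and the parser handles only simple key=value lines with # or ! comments rather than the full Java properties syntax. It reports MibBuilder.agentId used without MibBuilder.flags and notes whether the alert engine is disabled.

```python
def parse_properties(text):
    """Parse simple key=value lines; lines starting with # or ! are comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_alert_config(props):
    """Return a list of findings for the alert-related keys (hypothetical helper)."""
    findings = []
    has_flags = props.get("MibBuilder.flags") == "combinesagent2trapobjid"
    agent_id = props.get("MibBuilder.agentId")
    if agent_id is not None:
        if not agent_id.isdigit():
            findings.append("MibBuilder.agentId must be a non-negative integer")
        if not has_flags:
            findings.append("MibBuilder.agentId requires MibBuilder.flags")
    if props.get("CAE_AlarmEngine.doNOTprocessAlarms") == "true":
        findings.append("alert engine is disabled")
    return findings


sample = """\
# Unique trap OIDs with a manually chosen sub-identifier (7 is an arbitrary example)
MibBuilder.flags = combinesagent2trapobjid
MibBuilder.agentId = 7
# Commented out, so the alert engine stays enabled:
# CAE_AlarmEngine.doNOTprocessAlarms=true
"""
print(check_alert_config(parse_properties(sample)))
```

With the sample content, the script prints an empty list, because agentId is combined with flags and the disabling line is commented out.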

APPENDIX A Alert Usage Example in a Web-based Environment

In this example, the CAS is monitoring a public Web site providing an electronic channel for selling goods to a global customer base. Customers access the service through the Internet with Web browsers. All users are tracked individually by IP address.

Gathering requirements from observations

The following requirements for alerts were defined:
1. Detect problems when HTTP servers are not able to serve pages quickly enough.
2. Detect and inform about spiders/competition trying to harvest prices from our price pages.
3. Detect and inform about missing products.

These requirements were derived from the following observations and assumptions:
1. From time to time, due to the internal design of server processes, the Web servers have internal processing problems leading to slower execution of transactions and, over time, to application slow-down. Typically the servers respond to HTTP requests in around ms, but when problems appear this increases to 500 ms, which is still not noticeable by users, and after minutes, if no action is taken, response times reach 1 s, 5 s, and finally servers stop responding after 1 h. This does not concern any particular URL, is not related to the load, and is detectable on the server level.
2. The competition runs spider software on external computers with anonymous IP addresses. The software tries to gather information about pricing by loading many different price pages (pages that contain the action=showprice parameter/string in the URL). Because the Web site is public and anonymous, there is no other way to prevent such situations than by automatically detecting IP addresses that load many price pages in short periods of time. Such addresses can then be banned. Typical Web site users load no more than 25 pages in 5 minutes; anything above that is suspicious. If, in addition to that, most (>70%) pages are price pages, this indicates that we are dealing with a suspect address.
3. Sometimes, when there is a bad referral or a mistake in the page logic, the Web site user may ask to buy or quote a non-existent product. This causes the server to display the message The requested product does not exist and ask the user to start over again. To detect such errors, we will use metric alert reporting on Operation attributes(1).

The following examples show how you might implement such alerts.

Alert Definition Example: High Server Time for Service

For this example, we need to detect situations in which server response time exceeds the usual values by 200%. This has to be measured for the whole server (IP address), not for any particular URL. We will have to configure the baseline comparison threshold to 3 (300%) and specify that we want to monitor server IP addresses and software services. Additionally, we may want to receive information about the number of clients who use a specific software service on a given server. For that purpose, we will select the number of unique clients as an auxiliary metric. Finally, we need to ensure that the detected behavior is a real problem. We do this by checking that server time does not return to normal within 30 minutes, and we raise the alert only after that period. We assume that you want to receive notifications about alert states by email.

1. In the RUM Console, select Alerts from the top menu. The Alert Management window appears.
2. On the Alerts tab, click User-defined.
3. Click Add Alert to access the alert configuration wizard.

Specify Basic Settings

4. From the Alert type list, select Generic performance.
5. In the Alert name box, specify a descriptive name that will later help you identify the alert.
6. Optional: Provide a short description of the alert.
7. Optional: Modify the default assignment of the alert to the report servers. By default, the definition is assigned to all available report servers.
8. Click Next. The Define Triggering and Propagation Conditions screen of the alert definition wizard is displayed.

Define Triggering and Propagation Conditions

9. On the Define Triggering and Propagation Conditions page of the wizard, on the Detection settings tab, set the values that will trigger the alert.
a. Click Add dimension to add a dimension selection box to the Monitored dimensions section.
b. Move the mouse cursor over the dimension name and click it to display a complete dimension list.
c. From the dimension list, select Server IP address.
d. Repeat Step 9.a [p. 60] through Step 9.c [p. 60], but choose Software service instead of Server IP address.
e. Leave the default dimension filter settings.

The default settings ensure that the alert will not operate on synthetic traffic. Do not add any other dimension filters unless, for example, you want to limit the applicability of your alert to a specific server IP address range or a specific software service.
f. Leave the Comparison mode set to Single (default setting).
g. From the list of metrics, select Server time.
h. Click Add condition and choose a Value condition from the list.
i. Set the threshold value to 0.15 s to specify that anything below 150 milliseconds will not be considered a problem. Leave the default relational operator (greater than) selected.
j. Click Add condition and choose a Baseline condition. Note that both the value and the baseline condition have to be fulfilled for the alert to be raised.
k. Select the baseline type. The default type is average baseline. For more information, see Baseline Modes in the Data Center Real User Monitoring Administration Guide.
l. Set the baseline multiplier to 3 to specify that the detected value needs to be 3 times the baseline value. Leave the default relational operator (greater than) selected.
m. Define the auxiliary metric. Under Auxiliary metric, select Unique users from the metric list.
10. In this example, you do not need to configure the Output filters for the alert.
11. Click the Propagation settings tab to specify how the alert will be propagated.
a. Set Raise after to 6 to indicate that the alert is to be raised after 6 intervals (the thresholds need to be exceeded in 6 consecutive monitoring intervals, which would be 30 minutes if the monitoring intervals are the default length of 5 minutes).
b. Select Delayed processing to indicate that we do not want to activate the alert until baseline values are calculated.
12. Click Next. The Configure Alert Notifications screen of the alert definition wizard is displayed.

Configure Alert Notifications

13. On the Configure Alert Notifications page of the wizard, specify the notification message template.
In the template, you can use generic textual information and elements selected from the list hidden under the icon. When a notification is sent, these elements are substituted with the real-traffic data served by the AMD. For example:
High server time of the service running on server {6} ({7}), via software service {5}. Server time values - current: {17} {18}, threshold: {19} {18}. The number of users connecting to the service: {20}.
When delivered, the message may look like this:
High server time of the service running on server (MyServer), via software service MyService. Server time values - current: 330 ms, threshold: 150 ms. The number of users connecting to the service:
14. Assign and enable the notification for a selected user.
If you want to receive a notification by email, go to Actions > Enable notifications for a specific report server user listed on the Users tab. If you skip this step, the alerts will be written only to the alert log.
15. Click Next. The Review Summary screen of the alert definition wizard is displayed.

Review Summary

16. On the Review Summary page of the wizard, verify your alert settings before you apply them to the report servers. If you need to change anything, click Previous to go back to the appropriate page of the wizard.
17. Click Apply.
In the pop-up window, you can either save your changes as a draft, if you intend to make more changes, or publish them immediately to make them live.

Alert Definition Example: Abnormal URL Traffic for Software Service User

In this example, we want to raise the alert if someone loads more than 25 pages, most of which are price list pages. Abnormal usage of the Web site can be detected by the Abnormal URL traffic for software service user alert, where we calculate the number of URLs loaded by any single user, compare this to the number of URLs with sensitive information (price pages in our case), and raise an alert if this exceeds the threshold. Configuration of this alert requires the definition of thresholds (such as the total number of pages and the percentage of price pages in all loaded pages) and the price page pattern (*action=showprice*).

1. In the RUM Console, select Alerts from the top menu. The Alert Management window appears.
2. On the Alerts tab, click Predefined. The predefined alerts are listed. In this example, we configure a predefined alert to suit our purposes. Details concerning the selected alert are shown under the list.
3. Select Show disabled to display the alerts that are disabled by default.
4. In the Filter box above the list, type abnormal to filter the list on that word. You want to find the Abnormal URL traffic for software service user alert.
5. Click the Abnormal URL traffic for software service user alert to select it. That line will be highlighted in the list and the details concerning that alert will be displayed under the list.

6. In the alert details and devices section (under the list), in the Actions column, select Actions > Edit alert for the device to which you want to apply this alert. When more than one CAS is listed, be sure to select the row for the CAS to which you intend to apply the alert. When you select Edit alert, the alert wizard will open for the selected alert and device.

Specify Basic Settings

7. On the Specify Basic Settings page of the wizard, click Next to skip to the next screen. In this example, there is no need to change the information on this tab. It is possible to edit the description and name, but this is generally not recommended, because you change the threshold values and other parameters, not the underlying predefined alert mechanism.

Define Triggering and Propagation Conditions

8. On the Detection Settings tab, set the values that will trigger the alert.
Set Lower limit of the unacceptable number of URLs to 25 to indicate that we might be interested if a single user loads more than 25 pages of any kind (restricted or not). If they are loading 25 or fewer pages, we do not care about them, but we might be worried if they load more than 25 pages, depending on whether they also match the other parameters.
Set Lower limit of the unacceptable number of restricted URLs to 70 to indicate that we are definitely interested if someone loads more than 25 pages (see above) and more than 70 percent of those pages match the Restricted URLs parameter.
Set Restricted URLs to *action=showprice* to match any of our price list pages.
9. Click Next. The Configure Alert Notifications screen of the alert definition wizard is displayed.

Configure Alert Notifications

10. On the Configure Alert Notifications page of the wizard, click Next to skip to the next tab. In this example, no changes are needed on this tab. Normally, however, you would use the three tabs (Users, Trap Recipients, and Compuware Open Servers) to specify where and how to send out alerts. If you specify nothing here, the alerts will be written only to the alert log.
11. Click Next. The Review Summary screen of the alert definition wizard is displayed.

Review Summary

12. On the Review Summary page of the wizard, verify your alert settings before you apply them. If you need to change anything, click Previous to go back to the appropriate page of the wizard.
13. Click Apply.

In the pop-up window, you can either save your changes as a draft, if you intend to make more changes, or publish them immediately to make them live.

Alert Definition Example: Operation attributes(1)

We do not have any out-of-the-box alerts for detecting application responses or errors based on responses, but we can define a new metric-based alert that detects when the metric Operation attributes(1) is greater than 0 and breaks the results down by URL to inform the operator which page caused the error.

1. In the RUM Console, select Alerts from the top menu. The Alert Management window appears.
2. On the Alerts tab, click User-defined. In this example, we will define our own alert.
3. Click the Add Alert button. When you select Add alert, the alert wizard will open.

Specify Basic Settings

4. In the Alert type list, select Generic performance.
5. Type an Alert name for your alert. You need to give your alert a simple but descriptive name. If you are creating a number of alerts, you should name them consistently and with sorting in mind (so that similar alerts appear together in tables of alerts).
6. Optional: Type an Alert description for your alert. A good practice is to give your alert a short, simple description that will remind you why you created the alert and what it does. This description is displayed on the Alert Management screen when you select an alert in the list.
7. In the Report servers selected for the alert list, specify the report server installations to which your alert should be applied. All active report servers in your console device list appear in this table, and by default, the new alert is enabled on all of them.
8. Click Next. The Define Triggering and Propagation Conditions page of the alert definition wizard is displayed.

Define Triggering and Propagation Conditions

9. On the Detection Settings tab, specify the conditions under which the alert will be raised.
a. In the Monitored dimensions section, click Add dimension and select Operation from the dimension list.
b. In the Dimension filters section, click Add filter and select Server IP address from the list.
c. In the Server IP address text box, type your server's IP address.

d. In the Monitored metric section, open the list of metrics and select Operation attributes(1) from the list.
e. Click Add condition and select Value condition. The default values (greater than and 0) are correct for this example.
10. Click Next. The Configure Alert Notifications screen of the alert definition wizard is displayed.

Configure Alert Notifications

11. Define a notification message template.
NOTE For the notification to be sent, there needs to be at least one recipient configured in the RUM Console. For more information, see Managing Alert Notification Recipients [p. 47].
In the template, use generic textual information and elements selected from the list hidden under the icon. For this alert, you could specify the simple message:
Alert: {0} Operation: {8} Operation attributes: {17}
When a notification is sent, it will contain the name of the alert, the name of the operation, and the number of detected operation attributes for each operation.
12. For a selected user, go to Actions > Enable notifications.
13. Click Next. The Review Summary screen of the alert definition wizard is displayed.

Review Summary

14. On the Review Summary page of the wizard, verify your alert settings before you apply them to the report servers. If you need to change anything, click Previous to go back to the appropriate page of the wizard.
15. Click Apply.
In the pop-up window, you can either save your changes as a draft, if you intend to make more changes, or publish them immediately to make them live.
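The way numbered fields in a message template are replaced with alert data, as in the notification templates of this appendix, can be pictured with a short Python sketch. This is an illustration only and not the product's implementation: the render_template helper and the sample values are hypothetical, and macros such as {getdate()} are deliberately left untouched.

```python
import re


def render_template(template, fields):
    """Replace each {N} placeholder with the value of field N.

    Placeholders whose number is missing from `fields`, and non-numeric
    macros such as {getdate()}, are left unchanged.
    """
    def substitute(match):
        index = int(match.group(1))
        return str(fields.get(index, match.group(0)))

    return re.sub(r"\{(\d+)\}", substitute, template)


template = "Alert: {0} Operation: {8} Operation attributes: {17}"
values = {0: "Missing product", 8: "/shop/quote", 17: 4}  # made-up sample data
print(render_template(template, values))
# prints: Alert: Missing product Operation: /shop/quote Operation attributes: 4
```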

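The triggering conditions from the High Server Time for Service example in this appendix (a value condition of 0.15 s, a baseline multiplier of 3, and Raise after set to 6) can be summarized in a short Python sketch. It is a simplified model, not the alert engine's actual logic: the function name, the constant baseline, and the sample measurements are assumptions made for illustration.

```python
def should_raise(server_times, baseline, value_threshold=0.15,
                 baseline_multiplier=3, raise_after=6):
    """Return True once both conditions hold in `raise_after` consecutive intervals.

    server_times: one server time measurement (in seconds) per monitoring interval.
    baseline: baseline server time in seconds (held constant here for simplicity).
    """
    consecutive = 0
    for value in server_times:
        exceeds_value = value > value_threshold
        exceeds_baseline = value > baseline_multiplier * baseline
        if exceeds_value and exceeds_baseline:
            consecutive += 1
            if consecutive >= raise_after:
                return True
        else:
            # Any interval that fails a condition resets the count.
            consecutive = 0
    return False


# Baseline of 100 ms; six consecutive 5-minute intervals at 330 ms (30 minutes).
print(should_raise([0.33] * 6, baseline=0.10))  # True
# A recovery in the fourth interval resets the count, so no alert is raised.
print(should_raise([0.33, 0.33, 0.33, 0.12, 0.33, 0.33], baseline=0.10))  # False
```

Requiring both conditions at once mirrors the note in the example that the value and baseline conditions must both be fulfilled, so a server with a very low baseline does not trigger the alert on harmless fluctuations.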

APPENDIX B Alert Usage Example in an Enterprise Environment

In this example, the CAS is monitoring an enterprise wide area network with thousands of users grouped in hundreds of remote locations, connected by a private network to a single data center where applications are hosted on dozens of servers.

Gathering requirements from observations

The following requirements for alerts were defined:
1. Detect low-performance network locations.
2. Detect malicious software trying to spread over the network.
3. Detect new active IP addresses that accept connections in the data center.

These requirements were derived from the following observations:
1. Some locations are connected with private leased lines and some use VPN connections over the Internet. In both cases it is essential to detect situations in which network performance starts affecting user experience.
2. When a workstation or desktop is infected with certain malicious software, it may start to contact many machines, trying to spread the malicious software over the network. We need to detect client IP addresses that suddenly increase the number of network connections or connection attempts, and detect the application and port they are trying to use.
3. No new machines should be installed in or connected to the data center without prior authorization. If we detect a new IP address accepting connections, we should raise an alert.

The following examples show how you might implement such alerts.

Alert Definition Example: Network Performance for Site

In this example, we want to raise an alert if we detect low-performance network sites.

1. In the RUM Console, select Alerts from the top menu. The Alert Management window appears.

2. On the Alerts tab, click User-defined.
3. Click Add Alert to access the alert configuration wizard.

Specify Basic Settings

4. From the Alert type list, select Internetwork performance.
5. In the Alert name box, specify a descriptive name that will later help you identify the alert.
6. Optional: Provide a short description of the alert.
7. Optional: Modify the default assignment of the alert to the report servers. By default, the definition is assigned to all available report servers.
8. Click Next. The Define Triggering and Propagation Conditions screen of the alert definition wizard is displayed.

Define Triggering and Propagation Conditions

9. On the Detection settings tab, set the values that will trigger the alert.
a. Click Add dimension to add a dimension selection box to the Monitored dimensions section.
b. Move the mouse cursor over the dimension name and click it to display a complete dimension list.
c. From the dimension list, select Site.
d. Leave the default dimension filter settings. The default settings ensure that the alert will operate only on real traffic, and that traffic within the site will not be taken into account. Do not add any other dimension filters unless, for example, you want to limit the applicability of your alert to a specific site.
e. Leave the Comparison mode set to Single (default setting).
f. From the list of metrics, select Network performance.
g. Click Add condition and choose a Value condition from the list.
h. Set the threshold value to 50 and change the default operator to less than. The alert will be raised if the current value of network performance for a site is below the specified threshold.
10. In this example, you do not need to configure the Output filters for the alert.
11. Click the Propagation settings tab to specify how the alert will be propagated. Set Raise after to 3 to indicate that the alert is to be raised after three events. The value needs to be below the threshold in three consecutive monitoring intervals, which means 15 minutes if the monitoring intervals are the default length of 5 minutes.
12. Click Next. The Configure Alert Notifications screen of the alert definition wizard is displayed.

Configure Alert Notifications

13. On the Configure Alert Notifications page of the wizard, specify the notification message template.

In the template, you can use generic textual information and elements selected from the list hidden under the icon. When a notification is sent, these elements are substituted with the real-traffic data served by the AMD. For example:
Alert ID {0}: The network performance for a site {1} has exceeded the predefined threshold {4}. Current value: {3}
When delivered, the message may look like this:
Alert ID My Internetwork Alert: The network performance for a site Default Data Center has exceeded the predefined threshold <50. Current value:
14. Assign and enable the notification for a selected user.
If you want to receive a notification by email, go to Actions > Enable notifications for a specific report server user listed on the Users tab. If you skip this step, the alerts will be written only to the alert log.
15. Click Next. The Review Summary screen of the alert definition wizard is displayed.

Review Summary

16. On the Review Summary page of the wizard, verify your alert settings before you apply them to the report servers. If you need to change anything, click Previous to go back to the appropriate page of the wizard.
17. Click Apply.
In the pop-up window, you can either save your changes as a draft, if you intend to make more changes, or publish them immediately to make them live.

Alert Definition Example: Excessive Number of Servers Used by User. Top Software Service Identified

In this example, we want to raise the alert if we detect malicious software trying to spread over the network. For detecting users affected by such software, the best choice is the Excessive number of servers used by user. Top software service identified alert, which was designed specifically for this purpose.

1. In the RUM Console, select Alerts from the top menu. The Alert Management window appears.
2. On the Alerts tab, click Predefined. The predefined alerts are listed. In this example, we configure a predefined alert to suit our purposes. Details concerning the selected alert are shown under the list.
3. In the Filter box above the list, type excessive to filter the list on that word.

You want to find the Excessive number of servers used by user. Top software service identified alert.
4. Click the Excessive number of servers used by user. Top software service identified alert to select it. That line will be highlighted in the list and the details concerning that alert will be displayed under the list.
5. In the alert details and devices section (under the list), in the Actions column, select Actions > Edit alert for the device to which you want to apply this alert. When more than one CAS is listed, be sure to select the row for the CAS to which you intend to apply the alert. When you select Edit alert, the alert wizard will open for the selected alert and device.

Specify Basic Settings

6. On the Specify Basic Settings page of the wizard, click Next to skip to the next screen. In this example, there is no need to change the information on this tab. It is possible to edit the description and name, but this is generally not recommended, because you change the threshold values and other parameters, not the underlying predefined alert mechanism.

Define Triggering and Propagation Conditions

7. On the Detection Settings tab, set the values that will trigger the alert.
Set the Multiplier of the normal number of servers parameter to 5, indicating that the alert will be raised if the user attempts to connect to more than five times the normal number of servers. The baseline (normal) value, multiplied by the specified ratio, constitutes the upper limit of the acceptable number of servers that are fully monitored, that is, servers for which all statistical information can be obtained from the monitored traffic.
Set the Lower limit of the unacceptable number of servers parameter to 20, indicating that anything below 20 servers will not be considered a problem. This threshold also applies to servers for which all statistical information can be obtained from the monitored traffic (fully monitored servers). For the alert to be raised, both this threshold and the baseline threshold need to be exceeded at the same time.
Set the Alternative lower limit of the unacceptable number of servers parameter to 100, indicating that an alert will be raised if a user attempts to connect to more than 100 servers. This threshold applies to the total number of servers, both fully monitored and those for which only some basic statistics can be obtained from the traffic. It is used only if the thresholds defined by the two other detector parameters are not exceeded.
8. Click the Propagation settings tab to specify how the alert will be propagated.
9. Click Next. The Configure Alert Notifications screen of the alert definition wizard is displayed.
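How the three detector parameters set in the Detection Settings step interact can be sketched in a few lines of Python. This is a simplified reading of the description above and not the product's code: the function and argument names are hypothetical, and treating each limit as a strict "greater than" comparison is an assumption.

```python
def excessive_servers(fully_monitored, total, baseline,
                      multiplier=5, lower_limit=20, alt_lower_limit=100):
    """Return True if one user's server counts would trigger the alert.

    fully_monitored: number of fully monitored servers the user contacted.
    total: number of all servers contacted, fully monitored or not.
    baseline: the normal number of fully monitored servers for this user.
    """
    # Primary check: the baseline-based limit and the lower limit must both
    # be exceeded, and both apply to fully monitored servers only.
    if fully_monitored > multiplier * baseline and fully_monitored > lower_limit:
        return True
    # Fallback check: the alternative limit applies to the total number of servers.
    return total > alt_lower_limit


print(excessive_servers(fully_monitored=30, total=35, baseline=4))   # True
print(excessive_servers(fully_monitored=15, total=18, baseline=2))   # False
print(excessive_servers(fully_monitored=10, total=120, baseline=4))  # True
```

The second call shows why the lower limit matters: 15 servers is more than five times a baseline of 2, but it stays under 20, so no alert is raised.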

Configure Alert Notifications
10. On the Configure Alert Notifications page of the wizard, click Next to skip to the next tab. In this example, there are no changes to make on this tab. Normally, however, you would use the three tabs (Users, Trap Recipients, and Compuware Open Servers) to specify where and how to send out alerts. If you specify nothing here, the alerts will be written only to the alert log.
11. Click Next. The Review Summary screen of the alert definition wizard is displayed.
Review Summary
12. On the Review Summary page of the wizard, verify your alert settings before you apply them to the report servers. If you need to change anything, click Previous to go back to the appropriate page of the wizard.
13. Click Apply. In the pop-up window, you can save your changes as a draft if you intend to make more changes, or publish them immediately if you want them to go live now.

Alert Definition Example: New Server Detected
In this example, we want to raise the alert if we detect new active IP addresses that accept connections in the data center. To detect new active IP addresses, we will use the New server detected alert and configure it to filter IP ranges belonging to Data Center subnets.
1. In the RUM Console, select Alerts from the top menu. The Alert Management window appears.
2. On the Alerts tab, click Predefined. The predefined alerts are listed. In this example, we configure a predefined alert to suit our purposes. Details concerning the selected alert are shown under the list.
3. Select the Show disabled check box to display alerts that are disabled by default.
4. In the Filter box above the list, type new to filter the list on that word. You want to find the New server detected alert.
5. Click the New server detected alert to select it. That line will be highlighted in the list and the details concerning that alert will be displayed under the list.
6. In the alert details and devices section (under the list), in the Actions column, select Actions Edit alert for the device to which you want to apply this alert. When more than one CAS is listed, be sure to select the row for the CAS to which you intend to apply the alert.

When you select Edit alert, the alert wizard will open for the selected alert and device.
Specify Basic Settings
7. On the Specify Basic Settings page of the wizard, click Next to skip to the next screen. In this example, there is no need to change the information on this tab. It is possible to edit the description and name, but this is generally not recommended, because you change the threshold values and other parameters, not the underlying predefined alert mechanism.
Define Triggering and Propagation Conditions
8. View the Detection Settings tab; there are no settings to make for this alert.
9. Click the Output filters tab to specify when to raise this alert.
a. On the Output filters tab, click Add filter group.
b. In the list, select Server IP address.
c. In the Server IP address edit box, set the address to *. This will limit the scope of the alert to the data center subnet. Note that you can change the above address to one that matches your network.
10. Click the Propagation settings tab to specify how the alert will be propagated. Leave Raised after at the default setting, 1, indicating that the alert is to be raised after one interval during which a new server was observed. Enable Delayed processing so that the alert is not raised before the system learns which devices belong to your network and which can be considered new.
11. Click Next. The Configure Alert Notifications screen of the alert definition wizard is displayed.
Configure Alert Notifications
12. On the Configure Alert Notifications page of the wizard, click Next to skip to the next tab. In this example, there are no changes to make on this tab. Normally, however, you would use the three tabs (Users, Trap Recipients, and Compuware Open Servers) to specify where and how to send out alerts. If you specify nothing here, the alerts will be written only to the alert log.
13. Click Next. The Review Summary screen of the alert definition wizard is displayed.
Review Summary
14. On the Review Summary page of the wizard, verify your alert settings before you apply them to the report servers. If you need to change anything, click Previous to go back to the appropriate page of the wizard.
15. Click Apply. In the pop-up window, you can save your changes as a draft if you intend to make more changes, or publish them immediately if you want them to go live now.
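The effect of restricting the New server detected alert to data center addresses can be illustrated with Python's standard ipaddress module. This is a sketch of the idea only: the address pattern used in the Output filters step is not reproduced here, so the subnet below is a placeholder from the documentation address range, and the function name is hypothetical.

```python
import ipaddress

# Hypothetical data center subnet (placeholder from the documentation range);
# substitute the ranges that match your own network.
DATA_CENTER_SUBNETS = [ipaddress.ip_network("192.0.2.0/24")]


def in_alert_scope(server_ip: str) -> bool:
    """Return True if a newly seen server address falls in a data center subnet."""
    addr = ipaddress.ip_address(server_ip)
    return any(addr in net for net in DATA_CENTER_SUBNETS)


for ip in ("192.0.2.17", "198.51.100.5"):
    print(ip, "in scope" if in_alert_scope(ip) else "filtered out")
```

Only addresses inside the listed subnets would pass the filter, which mirrors the intent of limiting the alert scope to the data center.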

APPENDIX C Dimensions Available for User-defined Alert Definitions

Metric alert definitions operate on pre-defined sets of dimensions originating from the AMD. This list contains dimensions that you can use when configuring detector settings for new metric alert definitions. Note, however, that a different set of dimensions is available for each alert category (basic, EUE data, Citrix data, or Point-to-Point data).

Analyzer
The name of the traffic analyzer. For more information, see Concept of Protocol Analyzers.

Application
A universal container that can accommodate transactions.

Client area
Sites, areas, and regions define a logical grouping of clients and servers, or Backbone nodes in the case of Synthetic Backbone reports, into a hierarchy. They are based on manual definitions and/or on clients' BGP Autonomous System names, CIDR blocks or subnets. Sites are the smallest groupings of clients and servers. Areas are composed of sites. Regions are composed of areas.

Client region
Sites, areas, and regions define a logical grouping of clients and servers, or Backbone nodes in the case of Synthetic Backbone reports, into a hierarchy. They are based on manual definitions and/or on clients' BGP Autonomous System names. Sites are the smallest groupings of clients and servers. Areas are composed of sites. Regions are composed of areas.

Client site
Sites, areas, and regions define a logical grouping of clients and servers, or Backbone nodes in the case of Synthetic Backbone reports, into a hierarchy. They are based on manual definitions, clients' BGP Autonomous System names, CIDR blocks or subnets. Sites are the smallest groupings of clients and servers. Areas are composed of sites. Regions are composed of areas.

Client site UDL
A dimension designed to filter only the User Defined Links. By default, it is set to true (Yes) for the WAN Optimization Sites report.

Client site WAN Optimized Link
Indicates whether a site to which the client belongs is selected as both a UDL and a WAN optimized link.

Internal traffic handling
Indicates whether the traffic within a site should be monitored. Used for filtering traffic when you define an internetwork alert.

Is front-end tier?
Indicates whether a given tier is a front-end tier for a selected application.

Link alias
A custom name created by a user for a selected link.

Link name
A link name, as reported by the information source (Network Monitoring Probe, Flow Collector, AMD).

Link type
The type of a monitored link, for example Ethernet or Frame Relay.

Module
Module is the third level in the reporting hierarchy. For example, in database monitoring this is the database name, and in SOAP monitoring this is the SOAP service. This entity can be broken down into smaller units such as tasks.

Operation
For HTTP, this is the URL of the base page to which the hit belongs. For other analyzers, this can be a query, an operation type, or an operation status. The operation is ascertained by the AMD, based on the referrer, timing relations between hits, and per-transaction monitoring configured on the AMD. This dimension can assume the value of a particular operation, if this operation is monitored.
Note: The visibility of this dimension on reports depends on whether another dimension related to servers (for example, server IP or server DNS) has been used when formulating the query.
The All other operations record serves as a catch-all for all the traffic that has been seen to or from a server but was not classified as belonging to a specific operation monitored by name. It accounts for statistics of:
- operations that were not reported in operation-specific records (for example, those that fall outside the top N reported operations for a specific analyzer). In such a case, the number of operations and slow operations, as well as operation time and other transactional statistics, will be reported as an aggregate or average;
- traffic that was not classified to any operation (for example, idle TCP session closure or a TCP handshake without any operation). In such a case, only volumetric statistics (bytes, packets) will be reported for this traffic.

Reporting group (obsolete)
Reporting group is a universal container that can accommodate software services, servers, URLs or any combination of these. Reporting groups can contain software services of every type. Advanced Diagnostics Server can import reporting group configuration from Central Analysis Server.

Server IP address
The IP address of the server.

Server name
The name of the server resolved by a DNS server.

Service
Service is the highest level of the multi-level reporting hierarchy. For example, in SAP GUI monitoring this is the business process. This entity can be broken down into smaller units such as modules.

Software service
The software service name, where a software service is understood as a service implemented by a specific piece of software, offered on a TCP or UDP port of one or more servers and identified by a particular TCP port number.

Task (obsolete)
Task is the second level in the reporting hierarchy. For example, in HTTP monitoring this is the page name; in database monitoring this is the operation name (may contain a regular expression if configured on the AMD) or operation type prefix; and in SOAP monitoring this is the SOAP method. This entity can be broken down into smaller units such as operations or operation types.

Tier
A specific point of the application where we measure data. It can be a specific traffic type or a server.

Traffic type
The type of client traffic: real or synthetic, that is, generated by a synthetic agent.

Transaction
A universal container that can accommodate operations. This metric refers only to transactions without errors.

Transaction source
Indicates whether the transaction comes from Synthetic Monitoring probes, Agentless Monitoring Device, Cerner RTMS, or is user-defined.

WAN link name
The name of the WAN link.

stepname
An operation-task pair attribute or an (ordered) alias for operation occurrences. Step names are assigned in a configuration file read by the report server or come from third-party sources. Within a given task, steps enable you to distinguish operations by name. Because steps are assigned sequence numbers, you can follow the order of operations recorded for a given task. Each step is always related to one operation and one task. It is possible, however, for several operations to have identical step names and for each task to have more than one step, so that many-to-one relationships are likely to occur on reports.


APPENDIX D Metrics Available for User-defined Alert Definitions

User-defined alert definitions operate on a pre-defined set of metrics originating from the AMD.

Real user performance (probe)

Aborts
The number of operations aborted by the client. It applies to all TCP-based protocols. For example, for HTTP/HTTPS, it is the number of operations manually stopped by the user by clicking the Stop or Refresh button or by selecting another URL. Note that, in the case of HTTP, this number includes Short aborts and Long aborts.

Affected users (availability)
The number of unique users that were affected by availability problems.

Affected users (network)
The number of unique users that experienced network performance problems.

Affected users (performance)
The number of users that experienced application performance problems. For transactional protocols, a problem is noted if at least one operation is completed in a time longer than the performance threshold. For transactionless TCP-based protocols, a problem is noted if user wait per kb of data is longer than the threshold value.

Application Delivery Channel Delay
In a WAN-optimized scenario, Application Delivery Channel Delay (ADCD) is a quality metric expressed in milliseconds. The ADCD is determined by initial observation of the traffic between a client and a server. ADCD is a derivative of the RTT measured on a WAN link, expressed as time, and as such it can be understood as latency: a larger ADCD indicates higher network latency. ADCD also includes time spent in the data center WOC for traffic buffering and processing. A change of ADCD from its initial value reflects a change of quality in the WAN optimization service. For example, a sudden increase of ADCD would suggest that the quality of the service has worsened and, conversely, a sudden decrease of ADCD could suggest an improvement in WAN optimization.

Application Delivery Channel Delay (range 1)
The number of operations whose ADCD value is within range 1 as defined in the RUM Console.

Application Delivery Channel Delay (range 2)
The number of operations whose ADCD value is within range 2 as defined in the RUM Console.

Application Delivery Channel Delay (range 3)
The number of operations whose ADCD value is within range 3 as defined in the RUM Console.

Application Delivery Channel Delay (range 4)
The number of operations whose ADCD value is within range 4 as defined in the RUM Console.

Application performance
For transactional protocols, this is the percentage of software service operations completed in a time shorter than the performance threshold. For SMTP and transactionless TCP-based protocols, this is the percentage of monitoring intervals in which user wait time per kb of data was shorter than the threshold value.

Attempts
The number of monitoring intervals during which attempts were made to connect to a server. Note that this is counted separately for each server, client and software service. Thus, if in a given monitoring interval there are attempts to connect to three different servers, the Attempts metric will be incremented by three for that one monitoring interval. The actual value shown on the report is the sum total of all the attempts, for all the monitoring intervals, in the period covered by the report.

Availability (TCP)
Availability limited to the network context, calculated using the following formula:
Availability (TCP) = 100% * (All attempts - Failures (TCP)) / All attempts
where
All attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts not classified as a failure.

Availability (application)
Availability limited to the application context, calculated using the following formula:
Availability (application) = 100% * (All attempts - Failures (application)) / All attempts
where
All attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts not classified as a failure.

Availability (total)
The percentage of successful attempts, calculated using the following formula:
Availability (total) = 100% * (All attempts - All failures) / All attempts
where
All attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts not classified as a failure
All failures = all failures (transport) + all failures (TCP) + all failures (application).
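The availability formulas share one pattern, which the following Python sketch makes explicit. It assumes each formula reads as 100% * (All attempts - failures) / All attempts, and that the transport variant follows the same pattern; the function and argument names are illustrative and not part of the product.

```python
def availability(successful_ops: int, standalone_hits_ok: int, aborts_ok: int,
                 failures_transport: int, failures_tcp: int,
                 failures_application: int):
    """Compute the four availability percentages from raw counts.

    standalone_hits_ok and aborts_ok are the standalone hits and aborts
    that were NOT classified as failures.
    """
    all_failures = failures_transport + failures_tcp + failures_application
    all_attempts = (all_failures + successful_ops
                    + standalone_hits_ok + aborts_ok)
    if all_attempts == 0:
        return None  # availability is undefined when nothing was attempted

    def pct(failures: int) -> float:
        return 100.0 * (all_attempts - failures) / all_attempts

    return {
        "transport": pct(failures_transport),
        "tcp": pct(failures_tcp),
        "application": pct(failures_application),
        "total": pct(all_failures),
    }


print(availability(successful_ops=90, standalone_hits_ok=0, aborts_ok=0,
                   failures_transport=2, failures_tcp=3,
                   failures_application=5))
```

Because every variant divides by the same All attempts figure, the per-context percentages can all look healthy while the total is noticeably lower, since the total subtracts the failures of all three contexts at once.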

Availability (transport)
Availability limited to the transport context, calculated using the following formula:
Availability (transport) = 100% * (All attempts - Failures (transport)) / All attempts
where
All attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts not classified as a failure.

Bad MOS calls
The number of VoIP calls with a Mean Opinion Score (MOS) rating below the acceptable threshold.

Bad R-factor calls
The number of VoIP calls with an R-factor value below the acceptable value.

Bad delay calls
The number of VoIP calls with delay above the acceptable level.

Bad jitter calls
The number of VoIP calls with jitter exceeding the acceptable level.

Bad lost packets calls
The number of VoIP calls with a loss rate above the acceptable level.

Call duration
The average call duration.

Calls
The total number of VoIP calls. Note that for a selected software service the number of calls as seen from the sites' perspective may differ from the number seen from the endpoints' perspective. This is because in one site we may have two users taking part in the same call.

Client ACK RTT
Client ACK RTT is the time it takes for an ACK packet with no payload to travel from the user to the AMD and back again.

Client ACK RTT measurements
This metric keeps track of how many Client ACK RTT measurements were made. ACK measurement is performed during ACK packet transmission from either the server or the client side of the transaction.

Client RTT
Client RTT is the time it takes for a SYN ACK packet (sent by a server) to travel from the AMD to the client and back again, as shown in the following picture.
[Figure: TCP handshake between Client, AMD, and Server (SYN, SYN ACK, ACK) with timestamps T1 to T9; Client RTT spans T5 to T8.]
A client RTT measurement begins when the SYN ACK packet from the server to the client passes by the AMD (T5). The packet reaches the client machine (T6) and is processed, while an acknowledgment is sent back to the server (T7). Client processing time impact (T7-T6) is again very low. Client RTT measurement ends when the ACK packet reaches the AMD (T8). Therefore, the Client Round Trip Time is calculated as T8-T5.
Depending on the actual setup, Client RTT measurements may vary dramatically. In corporate environments, it may be a few milliseconds for LAN-connected clients or a few dozen milliseconds for WAN-connected clients. In the case where the client is coming from the Internet, the end-to-end Client RTT measurement is a compound of transit time through the Internet backbone as well as through the "last mile" access network. The impact of the last mile can be easily calculated, based on the connection speed and the packet size (56 B in the case of a TCP SYN packet). For a 28 kbps dial-up connection, this amounts to 16 milliseconds one way, or 32 milliseconds for a complete round-trip measurement. For a 1.6 Mbps DSL line, this contributes 560 microseconds to a complete client RTT measurement.

Client RTT (range 1)
The number of operations whose client RTT value is within range 1 as defined in the RUM Console.

Client RTT (range 2)
The number of operations whose client RTT value is within range 2 as defined in the RUM Console.

Client RTT (range 3)
The number of operations whose client RTT value is within range 3 as defined in the RUM Console.

Client RTT (range 4)
The number of operations whose client RTT value is within range 4 as defined in the RUM Console.

Client TCP data packets
The total number of TCP packets sent by the clients, excluding the traffic control packets.

Client TCP data packets lost
The number of lost TCP data packets sent by the clients, excluding the traffic control packets. The number of lost TCP packets always relates to the context of the counter, for example, an application, a server or any other entity.
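The last-mile contribution mentioned in the Client RTT description is the time needed to put one packet on the access link, which is packet size divided by link speed. A minimal Python sketch of that arithmetic follows; the function name is illustrative, and the 56-byte packet size and link speeds are the ones quoted in the text.

```python
def serialization_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """One-way time in milliseconds to transmit a packet on a link."""
    return packet_bytes * 8 * 1000.0 / link_bps


SYN_PACKET_BYTES = 56  # packet size quoted in the Client RTT description

for label, bps in (("28 kbps dial-up", 28_000), ("1.6 Mbps DSL", 1_600_000)):
    one_way = serialization_delay_ms(SYN_PACKET_BYTES, bps)
    # A round-trip measurement crosses the last mile twice.
    print(f"{label}: {one_way:.3f} ms one way, {2 * one_way:.3f} ms round trip")
```

For the dial-up link this gives about 16 ms one way and 32 ms for the round trip. For the DSL line it gives roughly 0.28 ms one way, which is small next to typical Internet backbone transit times.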
Client bytes
The number of bytes sent by the clients. Note that this includes headers.

Client not responding errors
The number of errors of category Client not responding. Errors of this category occur when the server closes the TCP session with a RESET packet after the client has been idle for too long. Such a situation happens when the server TCP/IP stack detects that a network connection to the client exists, but the client remains idle and does not respond. In such a case, the server closes the TCP session with a RESET packet. This may occur when the client has been silently disconnected from the network, for example, due to link failure, or the client has crashed. Note that this error will not occur if the client session has ended gracefully, that is, by closing the client application.

Client operation size
The size of a client operation. Note: an operation can be split over several packets. For traffic parsed with HTTP and SSL decrypted analyzers, Client operation size is the size in bytes of the operation request (HTTP GET or POST).

Client operations
The number of operations (for HTTP/SSL this is equivalent to the number of pages, for DB/2 it is equivalent to the number of queries) from the client side. For traffic analyzed with the analyzers General-volume and ICA (Citrix), this is the number of client data transfers for which network realized bandwidth was measured.

Client packets
The number of packets sent by the clients.

Client packets lost (client to AMD)
The number of packets sent by a client that were lost between the client and the AMD and needed to be retransmitted.

Client realized bandwidth
Client realized bandwidth refers to the actual transfer rate of client data when the transfer attempt occurred, and takes into account factors such as loss rate (retransmissions). Thus, it is the size of an actual transfer divided by the transfer time.

Closed TCP connections
The total number of successful or failed TCP connections.

Connection establishment timeout errors
The number of TCP errors of category Connection establishment timeout errors. This category of errors applies when there was no response from the server to the SYN packets transmitted by the client.

Connection refused errors
The number of TCP errors of category Connection refused errors, also referred to as Session establishment errors. This category of errors applies when a server rejects a request from a client to open a TCP session. Such a situation usually happens when the server runs out of resources, either due to operating system kernel configuration or lack of memory.

Custom metric (1)(avg)
The average value of user-defined metrics in category 1 observed in the HTTP or XML traffic.

Custom metric (1)(cnt)
The number of occurrences of user-defined metrics in category 1 observed in the HTTP or XML traffic.

Custom metric (1)(sum)
The sum of all values of user-defined metrics in category 1 observed in the HTTP or XML traffic.

Custom metric (2)(avg)
The average value of user-defined metrics in category 2 observed in the HTTP or XML traffic.

Custom metric (2)(cnt)
The number of occurrences of user-defined metrics in category 2 observed in the HTTP or XML traffic.

Custom metric (2)(sum)
The sum of all values of user-defined metrics in category 2 observed in the HTTP or XML traffic.

Custom metric (3)(avg)
The average value of user-defined metrics in category 3 observed in the HTTP or XML traffic.

Custom metric (3)(cnt)
The number of occurrences of user-defined metrics in category 3 observed in the HTTP or XML traffic.

Custom metric (3)(sum)
The sum of all values of user-defined metrics in category 3 observed in the HTTP or XML traffic.

Custom metric (4)(avg)
The average value of user-defined metrics in category 4 observed in the HTTP or XML traffic.

Custom metric (4)(cnt)
The number of occurrences of user-defined metrics in category 4 observed in the HTTP or XML traffic.

Custom metric (4)(sum)
The sum of all values of user-defined metrics in category 4 observed in the HTTP or XML traffic.

Custom metric (5)(avg)
The average value of user-defined metrics in category 5 observed in the HTTP or XML traffic.

Custom metric (5)(cnt)
The number of occurrences of user-defined metrics in category 5 observed in the HTTP or XML traffic.

Custom metric (5)(sum)
The sum of all values of user-defined metrics in category 5 observed in the HTTP or XML traffic.

Excluded operations
The number of operations for which the operation time was above a safety threshold. The term "operations" refers to operations in the context of the particular protocol, and can mean HTTP/HTTPS page loads, database queries, XML (transactional services) operations, Jolt transactions on a Tuxedo server, e-mails, DNS requests, Oracle Forms submissions, MQ operations, VoIP calls, MS Exchange operations, or SAP operations.

Failures (TCP)
The total number of operations that failed due to Connection refused or Connection establishment timeout errors.

Failures (application)
The number of operation attributes of all types set to be reported as an application failure.

Failures (total)
The total number of failures, that is, all Failures (transport) + all Failures (TCP) + all Failures (application).

Failures (transport)
The number of operations that failed due to problems in the transport layer. These include protocol errors, SSL alerts classified as a failure, and incomplete responses selected to be classified as failures.

HTTP client errors (4xx)
The sum of all HTTP client errors (4xx). This includes four categories of 4xx errors: by default, HTTP Unauthorized (401, 407) errors, HTTP Not Found (404) errors, custom client (4xx) errors, and Other HTTP (4xx) errors. The contents of the first three categories can be configured by users. However, two types of 4xx errors are of particular importance: 401 errors, related to server-level authentication, and 404 errors, indicating requests for non-existent content. These two error types are reported separately, by specific metrics.
401 Unauthorized - The server reports this error when the user's credentials supplied with the request do not satisfy page access restrictions. The HTTP server layer, not the application layer, reports 401 errors. The AMD will report on "Unauthorized" errors only if server-level authentication has been configured. This is common practice for sites that are comfortable with very basic user access policies. Most commercial-grade applications (for example, most online banking or online shopping applications) do not rely on server-level authentication, but rather authenticate users on the application layer. In such a case, even if authentication fails, the server will typically send a 200 OK response and the authentication error will be explained in the page content. So this kind of error is not very common on commercial sites.
404 Not Found - The server reports "Not Found" errors when it cannot fulfill a client request for a resource.
Usually this happens because of a malformed URL, which points to a non-existent page or image. Such a URL request may result from a user who misspelled the URL, who tried to access a URL stored in a "Favorites" folder a long time ago, or who made some other mistake. Malformed URLs may also exist in invalid or incorrectly designed Web pages, so the error will be reported by browsers trying to load such a page. A significant and constant number of these errors usually indicates that some pages on the server have design-related or link validation issues. In some cases, 404 errors result from server overload. It is good practice to check whether the percentage of errors is load-related.

HTTP client errors - category 3 (default name)
The number of HTTP custom client errors (4xx). By default, there is no specific error type assigned here.

HTTP not found errors 404 (default name)
The number of observed custom HTTP 404 Not found errors.

HTTP other client errors (4xx)
The number of HTTP other client errors (4xx). There are four categories of HTTP client errors (4xx), of which three can be configured by users. By default, the first category includes HTTP Unauthorized (401, 407) errors, and the second category includes HTTP Not Found (404) errors. The third category has no default error types assigned, and can be configured by a user. Finally, the group of HTTP Other (4xx) errors contains all errors that do not fall into any other client errors category. The number is calculated based on the formula:
[HTTP errors 4xx] - [HTTP Not Found errors 404] - [HTTP Not Authorized (401, 407)] - [HTTP errors configured by user].

HTTP other server errors (5xx)
The number of HTTP server errors (5xx) that do not fall into categories 1 or 2 of custom HTTP server errors (5xx).

HTTP redirect time
The average amount of time between the moment a user went to a particular URL and the moment this user was redirected to another URL and issued a request to that new URL. The HTTP redirect time refers to the transactions for which redirection actually took place.

HTTP response time (range 1)
The number of operations whose HTTP response time is within range 1 as defined in the RUM Console.

HTTP response time (range 2)
The number of operations whose HTTP response time is within range 2 as defined in the RUM Console.

HTTP response time (range 3)
The number of operations whose HTTP response time is within range 3 as defined in the RUM Console.

HTTP response time (range 4)
The number of operations whose HTTP response time is within range 4 as defined in the RUM Console.

HTTP server errors (5xx)
The number of observed HTTP server errors (5xx). The 5xx response status codes indicate cases in which the Web server is aware that there was a server error or that it is incapable of performing the request. The presence of such errors usually means that the Web server does not function as intended. The following 5xx errors are defined by the HTTP protocol standards:
500 Internal Server Error - The server encountered an unexpected condition, which prevented it from fulfilling the request.
501 Not Implemented - The server does not support the functionality required to fulfill the request.
502 Bad Gateway - The server received an invalid response from a back-end application server.
503 Service Unavailable - The server is currently unable to handle the request due to a temporary overloading or maintenance of the server.
504 Gateway Timeout - The server did not receive a response from a back-end application server.
505 HTTP Version Not Supported - The server does not support the HTTP protocol version that was used in the request message.
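The relationship between the 4xx categories and the "other client errors" formula can be sketched as a small counting routine. This is an illustration under stated assumptions rather than the AMD's actual classification logic: the set and key names are hypothetical, and the custom category is left empty to mirror the documented default.

```python
from collections import Counter

# Default category contents as described above; users can reconfigure them.
UNAUTHORIZED = {401, 407}
NOT_FOUND = {404}
CUSTOM_CLIENT = set()  # category 3 has no error types assigned by default


def client_error_breakdown(status_codes):
    """Count HTTP responses per error category and derive 'other 4xx'."""
    counts = Counter()
    for code in status_codes:
        if 400 <= code <= 499:
            counts["4xx"] += 1
            if code in UNAUTHORIZED:
                counts["unauthorized"] += 1
            elif code in NOT_FOUND:
                counts["not_found"] += 1
            elif code in CUSTOM_CLIENT:
                counts["custom"] += 1
        elif 500 <= code <= 599:
            counts["5xx"] += 1
    # Other 4xx = all 4xx - Not Found - Unauthorized - user-configured errors
    counts["other_4xx"] = (counts["4xx"] - counts["not_found"]
                           - counts["unauthorized"] - counts["custom"])
    return counts


sample = [200, 401, 404, 404, 403, 410, 500, 503]
print(dict(client_error_breakdown(sample)))
```

In the sample, 403 and 410 belong to no configured category, so they are the two responses that end up in the "other 4xx" count.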

HTTP server errors category 1 (default name)
The number of custom HTTP server errors (5xx), category 1. By default, there are no specific error types assigned to this category.

HTTP server errors category 2 (default name)
The number of custom HTTP server errors (5xx), category 2. By default, there are no specific error types assigned to this category.

HTTP server image time
The total amount of time it takes for images (non-HTML content) to be prepared for delivery.

HTTP unauthorized errors 401, 407 (default name)
The number of observed custom HTTP authentication-related errors. These include "HTTP 401 Unauthorized" and "HTTP 407 Proxy authentication required" errors. HTTP servers generate "401 Unauthorized" errors when anonymous clients are not authorized to view the requested content and must provide authentication information in the Authorization request header, in answer to the server's WWW-Authenticate challenge. The 401 errors are similar to "403 Forbidden" errors, but are used when authentication is possible and has failed or has not yet been provided. The 407 error is similar to 401, but it indicates that the client should first authenticate with a proxy server. The AMD will report these errors only if server-level authentication has been configured. Simple and basic user access policies are common in Web sites that do not store user-sensitive or business-critical information. Most commercial-grade HTTP-based applications, such as home banking applications or online shopping sites, rely on application-level authentication rather than server-level authentication. Such applications are designed so that even if user authentication fails, the HTTP server usually sends the 200 OK response code and the authentication error message in the page content. Therefore, the 401 Unauthorized and 407 Proxy authentication required error codes are quite rare in commercial environments.

Hits
The number of subcomponents of error-free operations. Note that this metric is recorded at the time when the monitored operations are closed. In the case of HTTP, this is when the whole page has been loaded. Compare "Hits (started)". For example, when the user issues an HTTP GET, a "Hit (started)" is reported immediately, whereas if a whole page is loaded and the operation is closed, it is reported as a "Hit".

Hits (range 1)
The number of operations whose hit count is within range 1 as defined in the RUM Console.

Hits (range 2)
The number of operations whose hit count is within range 2 as defined in the RUM Console.

Hits (range 3)
The number of operations whose hit count is within range 3 as defined in the RUM Console.

Hits (range 4)
The number of operations whose hit count is within range 4 as defined in the RUM Console.

Hits (started)
The number of subcomponents of operations. Unlike the "Hits" metric, "Hits (started)" is recorded immediately, not at the end of an operation. For example, when the user issues an HTTP GET, a "Hit (started)" is reported immediately, whereas a "Hit" is reported only when the whole page is loaded and the operation is closed.

Idle sessions
The number of idle TCP sessions, that is, sessions that have not been active for longer than a predefined time-out (5 minutes by default).

Idle time
The part of the operation time spent between receiving a part of the response and requesting a subsequent part. It enables you to isolate the time taken by the client from the time when the data was still being transmitted on the network.

Incomplete Responses
The number of incomplete responses, that is, partial and server-aborted responses, as well as situations when a server did not respond to the request at all or responded in an unrecognizable way.

LAN-WAN byte ratio
The amount of compression performed, expressed as a percentage: 100% for pass-through; greater than 100% if there are more bytes on the WAN side, including both pass-through and optimized traffic; less than 100% if there are fewer bytes on the WAN side, including both pass-through and optimized traffic.

Network performance
The percentage of total traffic that did not experience network-related problems (traffic in which the values of loss rate and RTT did not exceed configured thresholds).

Network performance affected bytes
The volume of TCP traffic that did experience network-related problems. The traffic measured here includes both directions of data transfer, to and from the client (downstream and upstream), but does NOT include bytes transferred internally within the site.
By network-related problems we mean excessive RTT or loss rate: at any given moment, traffic is considered to be experiencing network-related problems if, at that particular time, the values of loss rate or RTT exceed pre-configured thresholds. In situations when RTT measurements prove to be insufficient, ACK RTT may also become an additional criterion for determining network problems.

Network performance relevant bytes
The total volume of TCP traffic. Includes both directions of data transfer, to and from the client (downstream and upstream), but does NOT include bytes transferred internally within the site.

Network time
The time the network (between the user and the server) takes to deliver requests to the server and to deliver operation information back to the user. In other words, network time is the portion of the overall time that is due to the delivery time on the network.
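The relationship between the three network performance metrics can be sketched as follows. The formula is an assumption inferred from the definitions above (unaffected traffic as a share of all relevant traffic), not a documented AMD calculation, and the function name is hypothetical.

```python
def network_performance_pct(relevant_bytes: int, affected_bytes: int) -> float:
    """Share of TCP traffic (in percent) that did not hit network problems.

    Assumed formula: 100 * (relevant - affected) / relevant.
    """
    if relevant_bytes < 0 or affected_bytes < 0:
        raise ValueError("byte counts must be non-negative")
    if affected_bytes > relevant_bytes:
        raise ValueError("affected bytes cannot exceed relevant bytes")
    if relevant_bytes == 0:
        # Assumption: with no traffic observed, nothing was affected.
        return 100.0
    return 100.0 * (relevant_bytes - affected_bytes) / relevant_bytes


if __name__ == "__main__":
    # 250 of 1000 observed bytes were sent while RTT or loss rate
    # exceeded the configured thresholds.
    print(network_performance_pct(1000, 250))
```

Under this reading, an alert on "Network performance" dropping below a threshold is equivalent to the affected share of traffic rising above the complementary value.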

Operation attributes (1)
The number of operation attributes of type 1, observed for the given software service.

Operation attributes (2)
The number of operation attributes of type 2, observed for the given software service.

Operation attributes (3)
The number of operation attributes of type 3, observed for the given software service.

Operation attributes (4)
The number of operation attributes of type 4, observed for the given software service.

Operation attributes (5)
The number of operation attributes of type 5, observed for the given software service.

Operation length
The number of packets contained in an average operation.

Operation load time (range 1)
The number of operations that were loaded in time within range 1 as defined in the RUM Console.

Operation load time (range 2)
The number of operations that were loaded in time within range 2 as defined in the RUM Console.

Operation load time (range 3)
The number of operations that were loaded in time within range 3 as defined in the RUM Console.

Operation load time (range 4)
The number of operations that were loaded in time within range 4 as defined in the RUM Console.

Operation size (range 1)
The number of operations whose byte count is within range 1 as defined in the RUM Console.

Operation size (range 2)
The number of operations whose byte count is within range 2 as defined in the RUM Console.

Operation size (range 3)
The number of operations whose byte count is within range 3 as defined in the RUM Console.

Operation size (range 4)
The number of operations whose byte count is within range 4 as defined in the RUM Console.

Operation time
The time it took to complete an operation.
The term "operation" refers to an operation in the context of a particular protocol, and can mean HTTP/HTTPS page loads, database queries, XML (transactional services) operations, Jolt transactions on a Tuxedo server, e-mails, DNS requests, Oracle Forms submissions, MQ operations, VoIP calls, MS Exchange operations, or SAP operations. Note that an operation can be split over several packets. For HTTP and HTTPS, operation time is equal to the redirect time plus the network time plus the server HTTP time plus the server think time.

Operations
The number of operations. The term "operations" refers to operations in the context of the particular protocol, and can mean HTTP/HTTPS page loads, database queries, XML (transactional services) operations, Jolt transactions on a Tuxedo server, e-mails, DNS requests, Oracle Forms submissions, MQ operations, VoIP calls, MS Exchange operations, or SAP operations.

Other SSL errors (default name)
SSL alerts other than those for SSL errors 1 and SSL errors 2.

Other time
Part of the operation time, calculated as Operation time - Server time - Network time - Idle time.

Out of contract bytes
The number of bytes marked as Out-of-contract in the TOS field in the IP header. This setting can signify that the data was sent over and above a certain preset limit.

Out of contract packets
The number of packets marked as Out-of-contract in the TOS field in the IP header. This setting can signify that the data was sent over and above a certain preset limit.

Percentage of optimized traffic (bytes)
Indicates how traffic is distributed between two separate branches: optimized traffic and passed-through traffic. The higher the value, the more bytes are optimized. Low values may indicate poorly configured optimization or optimization device overload.

RTT measurements
The number of RTT measurements. An RTT measurement occurs during every TCP handshake, so it provides some insight into the number of attempted TCP sessions and the potential accuracy of the RTT measurements that are reported.

Realized bandwidth (range 1)
The number of operations whose realized bandwidth is within range 1 as defined in the RUM Console.

Realized bandwidth (range 2)
The number of operations whose realized bandwidth is within range 2 as defined in the RUM Console.

Realized bandwidth (range 3)
The number of operations whose realized bandwidth is within range 3 as defined in the RUM Console.
Realized bandwidth (range 4)
The number of operations whose realized bandwidth is within range 4 as defined in the RUM Console.

Redirect time
The average amount of time that was spent between the time when a user went to a particular URL and the time this user was redirected to another URL and issued a request to that new URL. The difference between Redirect Time and HTTP Redirect Time is that the former counts all operations, while the latter refers only to those operations for which redirection actually took place.
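The "Other time" calculation given earlier in this appendix (Operation time - Server time - Network time - Idle time) can be sketched as shown below. The function name and the choice of milliseconds as the unit are assumptions made for the illustration.

```python
def other_time_ms(operation_ms: int, server_ms: int,
                  network_ms: int, idle_ms: int) -> int:
    """Other time = Operation time - Server time - Network time - Idle time."""
    components = server_ms + network_ms + idle_ms
    if components > operation_ms:
        # The components are parts of the operation time, so together
        # they should not exceed it.
        raise ValueError("component times exceed the operation time")
    return operation_ms - components


if __name__ == "__main__":
    # A 1200 ms operation: 400 ms server, 500 ms network, 100 ms idle.
    print(other_time_ms(1200, 400, 500, 100))
```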

Redirect time (range 1)
The number of operations whose redirect time is within range 1 as defined in the RUM Console.

Redirect time (range 2)
The number of operations whose redirect time is within range 2 as defined in the RUM Console.

Redirect time (range 3)
The number of operations whose redirect time is within range 3 as defined in the RUM Console.

Redirect time (range 4)
The number of operations whose redirect time is within range 4 as defined in the RUM Console.

Request time
The time it took the client to send the HTTP request to the server (for example, by means of an HTTP GET or HTTP POST).
Note: This time includes TCP connection setup time and SSL session setup time (if any). It starts when the client starts the TCP session on the server and ends when the server receives the whole request. Sometimes an operation is slow because of a big request rather than a large response.

Response messages
The total number of protocol-specific server responses. That includes both errors and other identifiable response strings, as configured in monitoring.

SSL conn. setup per operation
The time it took to establish an SSL connection between the client and the server, weighted per number of operations. For HTTP-based software services, a single operation means a single page.

SSL conn. setup per session
The time it took to establish an SSL connection between the client and the server.

SSL errors 1 (default name)
If not explicitly configured, general SSL alerts from the following list: 10, 20, 21, 22, 30, 40, 49, 50, 51.

SSL errors 2 (default name)
If not explicitly configured, general SSL alerts from the following list: 41, 42, 43, 44, 45, 46, 48.

Server ACK RTT
RTT measurement performed during ACK packet transmission, from the server side of the operation. Also provided are minimum, maximum and standard deviation values.
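The default grouping of SSL alert codes into "SSL errors 1", "SSL errors 2", and "Other SSL errors" can be expressed as a simple lookup. This is a minimal sketch of the default (not explicitly configured) assignment listed above; the constant and function names are hypothetical and do not correspond to any product API.

```python
# Default alert code lists, copied from the metric definitions.
SSL_ERRORS_1 = frozenset({10, 20, 21, 22, 30, 40, 49, 50, 51})
SSL_ERRORS_2 = frozenset({41, 42, 43, 44, 45, 46, 48})


def classify_ssl_alert(code: int) -> str:
    """Return the default metric name that counts the given SSL alert code."""
    if code in SSL_ERRORS_1:
        return "SSL errors 1"
    if code in SSL_ERRORS_2:
        return "SSL errors 2"
    # Anything not in either list falls into the catch-all metric.
    return "Other SSL errors"


if __name__ == "__main__":
    for alert in (40, 42, 47):
        print(alert, classify_ssl_alert(alert))
```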
Server ACK RTT measurements
The number of Server ACK RTT measurements that were made. An ACK RTT measurement is performed during ACK packet transmission, from either the server side or the client side of the transaction.

Server RTT
The time it takes for a SYN packet (sent by a user) to travel from the AMD to a monitored server and back again. Also provided are minimum, maximum and standard deviation values.

[Figure: Server RTT measurement. Timeline of the SYN, SYN ACK, and ACK packets exchanged between the Client, the AMD, and the Server, with time points T1 to T9.]

Server TCP data packets
The total number of TCP packets sent by the servers, excluding the traffic control packets.

Server TCP data packets lost
The number of lost TCP data packets sent by the servers, excluding the traffic control packets. The number of lost TCP packets always relates to the context of the counter, for example, an application, a client, or any other entity.

Server bytes
The number of bytes sent by servers. The number includes headers.

Server loss rate (range 1)
The number of operations whose server loss rate is within range 1 as defined in the RUM Console.

Server loss rate (range 2)
The number of operations whose server loss rate is within range 2 as defined in the RUM Console.

Server loss rate (range 3)
The number of operations whose server loss rate is within range 3 as defined in the RUM Console.

Server loss rate (range 4)
The number of operations whose server loss rate is within range 4 as defined in the RUM Console.

Server not responding errors
The number of Server Not Responding errors. This category of errors applies when the client closes the TCP session with a RESET packet after the server has failed to respond for too long.

Server operation size
The size of a server operation. In HTTP and HTTPS (decrypted and non-decrypted), server operation size equals the operation size.

Server packets
The number of packets sent by the servers.

Server packets lost (AMD to client)
The number of packets sent by a server that were lost between the AMD and the client and needed to be retransmitted.

Server realized bandwidth
The actual transfer rate of server data when the transfer attempt occurred, taking into account factors such as loss rate (retransmissions). Thus, it is the size of an actual transfer divided by the transfer time.
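The "size of an actual transfer divided by the transfer time" definition of server realized bandwidth can be sketched as follows. The units (bytes and seconds, giving bytes per second) and the function name are assumptions for the illustration; the source does not state the units used by the product.

```python
def realized_bandwidth(transfer_bytes: int, transfer_seconds: float) -> float:
    """Realized bandwidth = size of the actual transfer / transfer time.

    Retransmissions lengthen the transfer time, so a lossy transfer of
    the same size yields a lower realized bandwidth.
    """
    if transfer_seconds <= 0:
        raise ValueError("transfer time must be positive")
    return transfer_bytes / transfer_seconds


if __name__ == "__main__":
    clean = realized_bandwidth(1_000_000, 4.0)   # no retransmissions
    lossy = realized_bandwidth(1_000_000, 5.0)   # same payload, slower
    print(clean, lossy)
```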

Server response time
The amount of time it takes for a server to provide its initial response to a user's operation request. Often servers will respond with some information quickly, before all the information is ready for delivery. Together with the server think time, the server response time sums to the overall server time. Note that if there was no think time recorded for the operation, it equals the server time.

Server session termination errors
The number of Server Session Termination errors. This category of errors applies when the server detects an error on the software service level and closes the TCP session with a RESET packet.

Server time
The time it took the server to produce a response for the given request.

Server time (range 1)
The number of operations whose server time is within range 1 as defined in the RUM Console.

Server time (range 2)
The number of operations whose server time is within range 2 as defined in the RUM Console.

Server time (range 3)
The number of operations whose server time is within range 3 as defined in the RUM Console.

Server time (range 4)
The number of operations whose server time is within range 4 as defined in the RUM Console.

Short aborts
The number of transactions stopped before timeout. For HTTP, this is the number of page loads manually stopped by the user, either by clicking the Stop or Refresh button or by selecting another URL, before 8 seconds of waiting for the page download (8 seconds is the default). For XML, this is the number of transactions stopped before a threshold number of seconds of waiting (8 seconds is the default).

Slow operations
The number of operations for which the operation time was above a predefined threshold value.
The term "operations" refers to operations in the context of the particular protocol, and can mean HTTP/HTTPS page loads, database queries, XML (transactional services) operations, Jolt transactions on a Tuxedo server, e-mails, DNS requests, Oracle Forms submissions, MQ operations, VoIP calls, MS Exchange operations, or SAP operations. Note that slow operations for SMB are not determined using the time threshold, but using maximum and minimum realized bandwidth thresholds.

Slow operations (application design - # of components)
The number of slow operations caused by the number of components, which is one of the detailed reasons in the application design category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (application design - redirect time)
The number of slow operations caused by redirect time, which is one of the detailed reasons in the application design category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (application design - request size)
The number of slow operations caused by request size, which is one of the detailed reasons in the application design category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (application design - response size)
The number of slow operations caused by response size, which is one of the detailed reasons in the application design category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (client/3rd party)
The number of slow operations caused by the client/3rd party category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (data center)
The number of slow operations caused by the data center category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (multiple reasons)
The number of slow operations caused by multiple reasons, that is, operations for which the algorithm was not able to determine one primary reason for slowness. Note that this includes only successful operations. Failures and aborted operations are not taken into account.
Slow operations (network - latency)
The number of slow operations caused by latency, which is one of the detailed reasons in the network category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (network - loss rate)
The number of slow operations caused by loss rate, which is one of the detailed reasons in the network category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (network - other)
The number of slow operations caused by factors other than latency or loss rate, which is one of the detailed reasons in the network category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow user sessions
The number of client sessions that contained at least one slow operation (page load for HTTP or HTTPS).

Standalone hits
The number of hits not associated with any operation, such as orphaned redirects, unauthorized hits, and discarded hits (no server response).

TCP SYN time
The time needed to establish a connection on the TCP/IP layer, that is, the average time it took to transfer SYN packets.

Total bytes compression
The data optimization observed, expressed as a byte reduction and a percentage, where a lower byte count on the WAN side means a higher reduction: 0% for pass-through; less than 0% if more bytes were observed on the WAN side, including both pass-through and optimized traffic; greater than 0% if fewer bytes were observed on the WAN side, including both pass-through and optimized traffic. This metric should not exceed 100%.

Total bytes on LAN side
The sum of bytes (client's and server's) observed on the LAN side before network traffic is directed into the WAN Optimization Controller (WOC).

Total bytes on WAN side
The sum of bytes (client's and server's) observed on the WAN side after network traffic leaves the WAN Optimization Controller (WOC), including bytes that have been passed through and those that have been marked as optimized.

Transfer time

Unique users
The number of unique users detected in the monitored traffic.

User sessions
The number of user HTTP sessions. The count can be identified by information contained in intercepted HTTP cookies or by HTTP authorization.

VoIP Jitter
VoIP average jitter measured by the probe, for both downstream and upstream traffic. Jitter is a variation in voice data transit delay, in milliseconds. In general, higher levels of jitter are more likely to occur on either slow or heavily congested links.

VoIP MOS
VoIP average Mean Opinion Score (MOS) rating of the call quality, for both downstream and upstream traffic.

VoIP R-factor
VoIP average R-factor value, for both downstream and upstream traffic.
It is a transmission quality rating. An R-factor score is derived from multiple VoIP metrics, including latency, jitter, and loss.

VoIP RTCP Jitter
VoIP average jitter as reported by the Real-Time Transport Control Protocol (RTCP), for both downstream and upstream traffic. Jitter is a variation in voice data transit delay, in milliseconds. Higher levels of jitter are more likely to occur on either slow or heavily congested links.

VoIP delay
VoIP average networking delay, as reported by the Real-Time Transport Control Protocol (RTCP), measured for both downstream and upstream traffic.

VoIP loss rate
The percentage of VoIP packets lost or discarded that needed to be retransmitted, measured for both upstream and downstream traffic.

Zero window size events
The client sets a zero window size in the TCP header when it wants the other side to slow down data transmission because it cannot keep up with the transmission speed. This indicates that the receiving machine is busy with other tasks.

Synthetic backbone

1st byte time
The time between the completion of the TCP connection with the destination server that will provide the displayed page's HTML, graphic, or other component and the reception of the first packet (also known as first byte) for that object. Overloaded web servers often have a long first byte time.

Application performance
For transactional protocols, this is the percentage of software service operations completed in a time shorter than the performance threshold. For SMTP and transactionless TCP-based protocols, this is the percentage of monitoring intervals in which user wait time per kB of data was shorter than the threshold value.

Byte limit exceeded errors
The number of byte limit errors. A byte limit error occurs when the attempt to download a page or object was blocked because the reported size of the object was greater than the current limit.

Bytes (average)
The average number of bytes downloaded for the pages per test.

Bytes (sum)
The total number of bytes downloaded for pages.

Bytes downloaded
The total number of downloaded bytes.

Connect time
The time (in seconds) that it takes to connect to a Web server across a network.
After obtaining the target IP address by using the DNS lookup, the Dynatrace Performance Network Agent establishes a TCP connection with the device at that IP address. TCP connections are started by the agent transmitting a special "SYN" packet and then receiving an "ACK" packet from the server. The elapsed time between transmitting the SYN to the server and receiving the ACK response is the Initial Connection time.

Connections (average)
The average number of connections established for pages per test.

Connections (sum)
The total number of connections established for pages.

Content match errors
The number of errors caused by the content not matching the condition defined for the page or object.

Content time
The time required to receive the content of a page or page component, starting with the receipt of the first content and ending with the last packet received.

DNS time
The time it takes to translate the host name into the IP address. The Dynatrace Performance Network Agent performs this translation by using the Internet's standard Domain Name Service (DNS).

Failing objects
The number of failing page objects within the tests.

Failing pages
The number of unsuccessful attempts to access a page within a test.

HTTP client errors
The number of HTTP client errors.

HTTP server errors
The number of HTTP server errors.

Hosts (average)
The average number of hosts associated with the page.

Hosts (sum)
The total number of hosts associated with the page(s).

Number of 200 objects
The total number of page objects with a return code in the 2xx range.

Number of 300 objects
The total number of page objects with a return code in the 3xx range.

Number of 400 objects
The total number of page objects with a return code in the 4xx range.

Number of 500 objects
The total number of page objects with a return code in the 5xx range.

Number of objects
The number of tested page objects, including successful and failing objects.

Number of pages
The number of tested pages.

Number of tests
The number of test executions.

Objects (average)
The average number of objects downloaded per page.

Objects (sum)
The total number of downloaded page objects.

Objects with network errors
The total number of page objects with network-related errors.

Objects with server errors
The total number of page objects with server-related errors.

Page availability
The percentage of successful pages vs all pages.

Page response time
The average time it took the page to produce a response for the given request.

Pages executed
The number of pages tested.

Pages executed (failed)
The number of tested pages reported as failed.

Pages executed (slow)
The number of tested pages reported as slow.

Processing time
The average client-side processing time.

Response time
The total time required to download a complete web page and its objects. For full object tests, response time is the time, measured in seconds, from when a user clicks on a link to the time that the Web page is fully downloaded. For HTML-only tests, it is the time, measured in seconds, from when a user clicks on the link to the time when the root object is downloaded. Response time encompasses the collection of all objects that make up a page, including third-party content on off-site servers, graphics, frames, redirections, and so on. For an operation flow (a series of interactive operations on several web pages), the response time is measured from the start of the operation flow (the moment a user clicks on a link) to the end of the operation flow (the moment the content of the last web page is downloaded).

SSL time
The time it takes to establish a Secure Socket Layer (SSL) connection and exchange SSL keys.

Socket time-out errors
The number of errors caused by no response to the attempts to open a TCP connection to the server.

Successful objects
The number of successfully tested page objects.

Successful pages
The number of successful attempts to access a page within a test.

Test availability
The percentage of successful tests vs all tests.

User script errors
The number of user script errors. A user script error occurs when JavaScript on the page did not or was not able to execute properly.
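"Page availability" and "Test availability" are both defined above as the percentage of successful items versus all items. A minimal sketch of that ratio follows; the function name and the handling of an empty sample (returning None) are assumptions, since the source does not say what is reported when nothing was executed.

```python
from typing import Optional


def availability_pct(successful: int, total: int) -> Optional[float]:
    """Percentage of successful pages (or tests) vs all pages (or tests)."""
    if successful < 0 or total < 0 or successful > total:
        raise ValueError("expected 0 <= successful <= total")
    if total == 0:
        # Assumption: availability is undefined when nothing was executed.
        return None
    return 100.0 * successful / total


if __name__ == "__main__":
    # 45 successful page attempts out of 50 executed pages.
    print(availability_pct(45, 50))
```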

Application user experience

Aborted transactions
The number of aborted transactions due to the HTTP timeout. This metric is calculated only for Client tiers.

Aborts
The number of aborted operations. This metric is calculated only for the Network tier, the Client network tier, the Client optimized network tier, and data center tiers.

Affected users (availability)
The number of unique users that were affected by TCP availability problems. For Client optimized network, Client network, and Network tiers, this metric is not calculated.

Affected users (network)
The number of unique users that experienced network performance problems.

Affected users (performance)
The number of unique users that experienced application performance problems or network performance problems. For the Client optimized network tier, this metric is not calculated.

Application Monitoring Operations
The number of Application Monitoring operations.

Application health index
The percentage of fast operations calculated as "Fast Operations / (Failures + Operations) * 100%".

Availability (TCP)
Availability limited to the network context, calculated using the following formula:
Availability (TCP) = 100% * (All Attempts - Failures (TCP)) / All Attempts
where All attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts not classified as a failure.

Availability (application)
Availability limited to the application context, calculated using the following formula:
Availability (application) = 100% * (All Attempts - Failures (Application)) / All Attempts
where All attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts not classified as a failure.

Availability (total)
Depending on the particular tier, this term may mean:
For Client tiers: the percentage of successful attempts, calculated as 100% - 100% * (failures / attempts).
For the Citrix/WTS (presentation) tier: the percentage of successful TCP connection attempts, calculated as 100% - 100% * (failures / attempts).
For other Network tiers: the percentage of successfully sent packets, calculated as 100% - 100% * (sent packets that were lost / total number of sent packets).

For other Data center tiers: the percentage of successful attempts, calculated using the following formula:
Availability (total) = 100% * (All Attempts - All failures) / All Attempts
where
all attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts
all failures = all failures (transport) + all failures (TCP) + all failures (application).

Availability (transport)
Availability limited to the transport context, calculated using the following formula:
Availability (transport) = 100% * (All Attempts - Failures (Transport)) / All Attempts
where All attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts not classified as a failure.

Average CPU utilization
The percentage of elapsed time that the processor spent executing non-idle threads. This counter is the primary indicator of processor activity, and shows the average percentage of busy time.

Average memory utilization
The average percentage of used physical memory (RAM).

Client RTT
Client RTT is the time it takes for a SYN ACK packet (sent by a server) to travel from the AMD to the client and back again, as shown in the following picture.

[Figure: Client RTT measurement. Timeline of the SYN, SYN ACK, and ACK packets exchanged between the Client, the AMD, and the Server, with time points T1 to T9.]

A client RTT measurement begins when the SYN ACK packet from the server to the client passes by the AMD (T5). The packet reaches the client machine (T6) and is processed, and an acknowledgment is sent back to the server (T7). Client processing time impact (T7-T6) is again very low. The Client RTT measurement ends when the ACK packet reaches the AMD (T8). Therefore, the Client Round Trip Time is calculated as T8-T5.
Depending on the actual setup, Client RTT measurements may vary dramatically. In corporate environments, it may be a few milliseconds for LAN-connected clients or a few dozen milliseconds for WAN-connected clients.
In the case where the client is coming from the Internet, the end-to-end Client RTT measurement is a compound of transit time through the Internet backbone as well as through the "last mile" access network. The impact of the last mile can be easily calculated, based on the connection speed and the packet size (56 B in the case of a TCP SYN packet). For a 28 kbps dial-up connection, this amounts to 16 milliseconds one way, or 32 milliseconds for a complete round-trip measurement. For a 1.6 Mbps DSL line, this makes 56 microseconds towards complete client RTT measurement.

Client Volume
The number of client transmitted bytes.
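The Client RTT calculation (T8-T5) and the last-mile estimate for the dial-up case can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, timestamps are taken to be in milliseconds, and the last-mile delay is modeled purely as serialization time (packet size in bits divided by link speed), which matches the 28 kbps figure quoted above.

```python
def client_rtt_ms(t5_ms: float, t8_ms: float) -> float:
    """Client RTT = T8 - T5.

    T5: SYN ACK passes the AMD on its way to the client.
    T8: the client's ACK reaches the AMD.
    """
    if t8_ms < t5_ms:
        raise ValueError("T8 must not precede T5")
    return t8_ms - t5_ms


def last_mile_delay_ms(packet_bytes: int, link_bps: int) -> float:
    """One-way serialization delay of a packet on the access link."""
    if link_bps <= 0:
        raise ValueError("link speed must be positive")
    return packet_bytes * 8 * 1000 / link_bps


if __name__ == "__main__":
    one_way = last_mile_delay_ms(56, 28_000)   # 56-byte packet, 28 kbps
    print(one_way, 2 * one_way)                # one way, round trip
    print(client_rtt_ms(10.0, 42.0))
```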

Client loss rate
The percentage of total packets sent by a client that were lost and needed to be retransmitted. This metric is calculated only for the following tiers: RUM sequence transactions, Citrix/WTS (presentation), Client optimized network (for WAN and Pass-through deployment only), and tiers based on TCP-based analyzers.

DNS errors
The number of DNS errors.

Database errors
The number of database errors in the database analyzer:
For TDS, which includes Sybase and MS SQL Server, any value from the following table is considered an error.
For MySQL, if an ERR_Packet is returned, the error count is incremented.
An error with a severity level of 19 or higher stops the execution of the current SQL batch and the error message is written to the error log.

Errors that can be corrected by the user:
11: The given object or entity does not exist.
12: SQL statements that do not use locking because of special options. In some cases, read operations performed by these SQL statements could result in inconsistent data, because locks do not guarantee consistency.
13: Transaction deadlock errors.
14: Security-related errors such as permission denied.
15: Syntax errors in the SQL statement.
16: General errors that can be corrected by the user.

Software errors that cannot be corrected by the user and that require system administrator action:
17: The SQL statement caused the database server to run out of resources (such as memory, locks, or disk space for the database) or to exceed some limit set by the system administrator.
18: There is a problem in the database engine software, but the SQL statement completes execution, and the connection to the instance of the database engine is maintained. System administrator action is required.
19: A non-configurable database engine limit has been exceeded and the current SQL batch has been terminated.
System problems:
20-25: Fatal errors, meaning that the database engine task that was executing a SQL batch is no longer running. The task records information about what occurred and then terminates. In most cases, the application connection to the instance of the database engine also terminates. If this happens, depending on the problem, the application might not be able to reconnect.

Database warnings
The number of database warnings in the database analyzer:

For TDS, which includes Sybase and MS SQL Server, this count will always be zero. TDS does not track anything as a warning.
For MySQL, if an OK_Packet is returned, the warning count value in that packet is checked and the total warning field is updated with the returned number.

End-to-end RTT
The time it takes for a SYN packet to travel from the client to a monitored server and back again.

Failures (TCP)
The number of operations that failed due to one of the TCP errors.

Failures (application)
The number of operation attributes of all types set to be reported as an application failure.

Failures (total)
The total number of failures, that is, all Failures (transport) + all Failures (TCP) + all Failures (application).

Failures (transport)
The number of operations that failed due to problems in the transport layer. You can configure Failures (transport) to include the following: protocol errors, SSL alerts, aborts, and incomplete responses.

Fast operations/transactions
The number of operations or transactions for which the execution time was below a predefined threshold value. These include HTTP/HTTPS page loads, SQL database queries, XML (transactional services) operations, e-mails, DNS requests, Oracle Forms submissions, MQ operations, MS Exchange operations, SAP operations, and transactions (for RUM data).

HTTP errors
The number of observed HTTP client errors (4xx) and server errors (5xx).

Idle time
The part of the operation time spent between receiving a part of the response and requesting a subsequent part. It enables you to isolate the time taken by the client from the time when the data was still being transmitted on the network.

Incomplete responses
The number of incomplete responses, that is, partial and server-aborted responses, as well as situations when a server did not respond to the request at all or responded in an unrecognizable way.

LDAP errors
The number of LDAP errors.
The LDAP errors are reported in the following categories:
LDAP critical errors
LDAP server errors
LDAP security errors
LDAP syntax errors
LDAP client errors

Long aborts
For HTTP, this is the number of operations manually stopped by the user, by either clicking the Stop or Refresh button or selecting another URL, after at least 8 seconds of waiting for the page download (8 seconds is the default). For XML, this is the number of transactions stopped after at least a threshold number of seconds of waiting (8 seconds is the default).

MQ appl. errors
The number of operation attributes of all types set to be reported as MQ application errors for software services based on an MQ analyzer.

MQ errors
The total number of IBM WebSphere Message Queue errors, including client errors, server errors, protocol errors, and security errors.

MS Exchange errors
The total number of RPC server and RPC protocol errors.

Network performance
The percentage of total traffic that did not experience network-related problems (traffic in which the values of loss rate and RTT did not exceed configured thresholds).

Network time
The time the network takes to deliver the request to the server and to deliver the resulting response back to the user. In other words, network time is the portion of the operation time that is spent on transferring data over the network.

Operation attributes
The number of operation attributes of all types (type 1 to 5) observed for the given software service.

Operation/Transaction time
The average value of operation or transaction time for all operations or transactions performed on the particular tier.

Operations/Transactions
Depending on the tier definition and on the traffic analyzer used, this metric shows the number of:
HTTP(S) operations
SQL database queries
XML (transactional services) operations
E-mail messages
DNS requests
Oracle Forms submissions
MQ operations
MS Exchange operations
SAP operations
Cerner transactions
Transactions (for RUM data)

Other time
For RUM sequence transactions, the other time is a sum of the client time, the client response time, and the application processing time. For synthetic transactions, the other time is equal to the client time. For RUM Browser data, the other time is equal to the client time if provided by the Application Monitoring server. The other time is not calculated for Dynatrace Performance Network data.

Performance
Depending on the particular tier, the term performance can mean:
For Client tiers: the percentage of transactions completed in a time shorter than the defined time threshold, calculated as 100% * (1 - slow transactions/all transactions).
For the Client optimized network tier: the percentage of compressed bytes.
For other Network tiers: the percentage of total traffic that did not experience network-related problems.
For Data center tiers: for transactional protocols, this is the percentage of software service operations completed in a time shorter than the performance threshold. For transactionless, TCP-based protocols, this is the percentage of monitoring intervals in which user wait time per kB of data was shorter than the threshold value.

RMI/Simple parser errors
The total number of RMI/Simple parser errors.

RTT measurements
The number of RTT measurements. An RTT measurement occurs during every TCP handshake, so it provides some insight into the number of attempted TCP sessions and the potential accuracy of the RTT measurements that are reported. This metric is calculated only for the following tiers: RUM sequence transactions, Citrix/WTS (presentation), Client optimized network (for LAN only), and tiers based on TCP-based analyzers.

Realized bandwidth
The actual transfer rate of server data when the transfer attempt occurred. This metric takes into account factors such as loss rate (retransmissions).
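As an illustration of the Performance metric for Client tiers, the following sketch computes the percentage of transactions that completed faster than the threshold. It assumes that every transaction is either slow or not slow, so the share of fast transactions is one minus the share of slow ones; the function name is hypothetical.

```python
def client_tier_performance(slow_transactions: int, all_transactions: int) -> float:
    """Percentage of transactions that finished faster than the time threshold."""
    if all_transactions == 0:
        raise ValueError("no transactions observed in this interval")
    # Assumption: a transaction that is not slow counts as completed in time.
    return 100.0 * (1 - slow_transactions / all_transactions)


# 25 slow transactions out of 200 observed in the reporting period.
print(client_tier_performance(slow_transactions=25, all_transactions=200))
```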
Redirect time
The average amount of time spent between the moment a user went to a particular URL and the moment this user was redirected to another URL and issued a request to that new URL. The difference between Redirect Time and HTTP Redirect Time is that the former counts all operations, while the latter refers only to those operations for which redirection actually took place.

Response messages
The total number of protocol-specific server responses. This includes both errors and other identifiable response strings, as configured in monitoring.

SAP errors
The number of errors detected on the protocol level in communication between the SAP application server and the SAP GUI client, as well as between the SAP application server and third-party clients using Remote Function Calls (RFC).

SMTP errors
The total number of SMTP errors.

SSL errors
The number of all SSL alerts. This metric is the sum of SSL errors 1, SSL errors 2, and Other SSL errors.

Server RTT
The time it takes for a SYN packet to travel from the AMD to a monitored server and back again. This metric is calculated only for the following tiers: RUM sequence transactions, Citrix/WTS (presentation), Client optimized network (for LAN only), and tiers based on TCP-based analyzers.

[Figure: TCP handshake timeline between Client, AMD, and Server (SYN, SYN ACK, ACK; time points T1 to T9), with Server RTT marked.]

Server TCP data packets
The total number of TCP packets sent by the servers, excluding the traffic control packets. This metric is calculated only for the following tiers: RUM sequence transactions, Citrix/WTS (presentation), Client optimized network (for LAN only), and tiers based on TCP-based analyzers.

Server Volume
The number of server transmitted bytes.

Server loss rate
The percentage of total packets sent by a server that were lost between the AMD and the server and needed to be retransmitted. This metric is calculated only for the following tiers: RUM sequence transactions, Citrix/WTS (presentation), Client optimized network (for WAN and Pass-through deployment only), and tiers based on TCP-based analyzers.

Server time
The time it took the server to produce a response for the given request.

Short aborts
The number of transactions stopped before timeout. For HTTP, this is the number of page loads manually stopped by the user, by either clicking the Stop or Refresh button or selecting another URL, before 8 seconds of waiting for the page download (8 seconds is the default). For XML, this is the number of transactions stopped before a threshold number of seconds of waiting (8 seconds is the default).
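The distinction between Short aborts and Long aborts comes down to comparing the user's waiting time with a threshold. A minimal sketch, assuming the default 8-second threshold and a hypothetical helper name:

```python
ABORT_THRESHOLD_SECONDS = 8.0  # default threshold quoted in the metric definitions


def classify_abort(wait_seconds: float,
                   threshold: float = ABORT_THRESHOLD_SECONDS) -> str:
    """Label an aborted operation by how long the user waited before stopping it."""
    # "At least" the threshold counts as a long abort; anything shorter is short.
    return "long abort" if wait_seconds >= threshold else "short abort"


for waited in (2.5, 8.0, 14.0):
    print(f"stopped after {waited} s: {classify_abort(waited)}")
```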
Slow operations (application design - # of components)
The number of slow operations caused by the number of components, which is one of the detailed reasons in the application design category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (application design - redirect time)
The number of slow operations caused by redirect time, which is one of the detailed reasons in the application design category as calculated using the primary reason for slowness

algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (application design - request size)
The number of slow operations caused by request size, which is one of the detailed reasons in the application design category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (application design - response size)
The number of slow operations caused by response size, which is one of the detailed reasons in the application design category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (client/3rd party)
The number of slow operations caused by the client/3rd party category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (data center)
The number of slow operations caused by the data center category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (multiple reasons)
The number of slow operations caused by multiple reasons, that is, when the algorithm was not able to determine one primary reason for slowness.

Slow operations (network - latency)
The number of slow operations caused by latency, which is one of the detailed reasons in the network category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.
Slow operations (network - loss rate)
The number of slow operations caused by loss rate, which is one of the detailed reasons in the network category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations (network - other)
The number of slow operations caused by factors other than latency or loss rate, which is one of the detailed reasons in the network category as calculated using the primary reason for slowness algorithm. Note that this includes only successful operations. Failures and aborted operations are not taken into account.

Slow operations/transactions
The number of operations or transactions for which the execution time was above a predefined threshold value. These include HTTP/HTTPS page loads, SQL database queries, XML (transactional services) operations, e-mails, DNS requests, Oracle Forms submissions, MQ operations, MS Exchange operations, SAP operations, and transactions (for RUM data).

TCP errors
The total number of TCP errors.

Those errors may indicate server or application problems, so measuring them is critical to understanding the issues that may affect end-user experience. AMDs measure and report on the following types of TCP errors:

Connection Refused Errors - A client attempts to open a TCP session with a server, which rejects the request. The SYN packet from the client is followed by a RESET packet from the server, with matching TCP sequence numbers. This error is typically caused by resource exhaustion on the server, which is unable to accept more concurrent TCP sessions. This may be either a configuration issue (too few resources allocated in the kernel) or lack of memory. SYN flood attacks typically result in servers being unable to accept new connections.

Server session termination error - The server unexpectedly terminates a connection that was successfully opened. The server sends a RESET packet to the client. Such an error originates at the application that uses the monitored TCP session. It does not necessarily mean application failure; usually it means that the application encountered a condition in which it decided to immediately terminate the session with the client, for example, because of an application security policy violation by the client.

Session Abort - The client unexpectedly terminates a connection that was successfully opened. The client sends a RESET packet to the server. These errors are inspected in the context of the client application and may or may not be reported. For example, the browser running HTTP may terminate the load of a GIF file if it is older than the one that it had previously cached, and this is normal behavior. However, if all connections to the server are terminated because the user hits the STOP button, then this is abnormal session termination and is reported as "Aborted operation" or "Stopped Page".
Client not responding errors (server timeout errors) - The server networking stack assumes that the network connection to the client exists, but the client remains idle and does not respond. In such a case, the server closes the TCP session with a RESET packet. Such a condition may occur when the client has been silently disconnected from the network, for example, due to a link failure, or the client has crashed. Note that this error will not occur if the client has ended the session gracefully, for example, by closing the client application.

Server not responding errors (client timeout errors) - The client networking stack assumes that the network connection to the server exists, but the server remains idle and does not respond. In such a case, the client closes the TCP session with a RESET packet. This may occur either during the Session Setup phase (no response to the SYN packet) or during a normal data exchange process. Such a situation may result from intermittent network problems between the client and the server. If the traffic is routed through asymmetric paths across the Internet, which is often the case, the path from the server to the client may be broken.

Total bandwidth usage
The number of all transmitted bits (client + server) per second.

Total network time
The difference between Total transaction time and the sum of Total server time and Total redirect time. This metric is calculated only for the Data center tiers and for the following dimension combinations: Application-Tier and Application-Transaction-Tier.

Total redirect time
The sum of the averages of redirect time of all operations assigned to a transaction. This metric is used to indicate the redirect time used to achieve the result of multi-step transactions. It is calculated only for Data center tiers and for the following dimension combinations: Application-Tier and Application-Transaction-Tier.

Total server time
The sum of the averages of server time of all operations assigned to a transaction. This metric is used to indicate the server time used to achieve the result of multi-step transactions. It is calculated only for Data center tiers and for the following dimension combinations: Application-Tier and Application-Transaction-Tier.

Total transaction time
The sum of the averages of operation time of all operations assigned to a transaction. This metric is used to indicate the total time used to achieve the result of multi-step transactions. It is calculated only for Data center tiers and for the following dimension combinations: Application-Tier and Application-Transaction-Tier.

Transaction errors
The number of errors that originate from Synthetic Monitoring transactions or RUM sequence transactions.

Transactional service errors
The total number of transactional service errors.

Two-way loss rate
The average loss rate calculated for both directions: the sum of client and server retransmitted packets divided by the sum of total client and server packets.

Unique users
The number of unique users detected in monitored traffic. Note that for RUM Browser the notion of users refers to visits.

Volume
The number of all transmitted bytes (client + server).
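Two of the derived metrics defined above, Total network time and Two-way loss rate, are simple combinations of other metrics. The sketch below shows the arithmetic; the function names are hypothetical, times are assumed to share one unit (seconds here), and the loss rate is assumed to be expressed as a percentage, like the one-way loss rates.

```python
def total_network_time(total_transaction_time: float,
                       total_server_time: float,
                       total_redirect_time: float) -> float:
    """Transaction time that remains after server and redirect time are removed."""
    return total_transaction_time - (total_server_time + total_redirect_time)


def two_way_loss_rate(client_retransmitted: int, server_retransmitted: int,
                      client_packets: int, server_packets: int) -> float:
    """Retransmitted packets in both directions as a percentage of all packets."""
    total_packets = client_packets + server_packets
    if total_packets == 0:
        return 0.0  # assumption: no traffic means nothing was lost
    return 100.0 * (client_retransmitted + server_retransmitted) / total_packets


print(total_network_time(2.5, 1.0, 0.5))   # seconds spent on the network
print(two_way_loss_rate(3, 1, 150, 50))    # percent of packets retransmitted
```

Note that the two-way figure is weighted by packet counts, so it is not the same as averaging the client and server loss rates.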
Internetwork Traffic

Attempts
The number of monitoring intervals during which attempts were made to connect to a server. Note that this is counted separately for each server, client, and software service. Thus, if in a given monitoring interval there are attempts to connect to three different servers, the Attempts metric will be incremented by three for that one monitoring interval. The actual value shown on the report is the sum total of all the attempts, for all the monitoring intervals, in the period covered by the report.

Availability (total)
The percentage of successful attempts, calculated using the following formula:
Availability (total) = 100% * (All attempts - All failures) / All attempts

where
All attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts not classified as a failure
All failures = all failures (transport) + all failures (TCP) + all failures (application).

Bad MOS calls
The number of VoIP calls with a Mean Opinion Score (MOS) rating below the acceptable threshold.

Bad R-factor calls
The number of VoIP calls with an R-factor value below the acceptable value.

Bad delay calls
The number of VoIP calls with delay above the acceptable level.

Bad jitter calls
The number of VoIP calls with jitter exceeding the acceptable level.

Bad lost packets calls
The number of VoIP calls with loss rate above the acceptable level.

Call duration
The average call duration.

Calls
The total number of VoIP calls. Note that for a selected software service the number of calls as seen from the sites' perspective may differ from the number seen from the endpoints' perspective. This is because two users in one site may take part in the same call.

Client not responding errors
The number of errors of category Client not responding. Errors of this category occur when the server closes the TCP session with a RESET packet after the client has been idle for too long. This happens when the server TCP/IP stack assumes that the network connection to the client exists, but the client remains idle and does not respond. It may occur when the client has been silently disconnected from the network, for example, due to link failure, or the client has crashed. Note that this error will not occur if the client session has ended gracefully, for example, by closing the client application.

Closed TCP connections
The total number of successful or failed TCP connections.
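The Availability (total) formula and its two component sums can be sketched as follows. The sketch reads the numerator as All attempts minus All failures; the function and parameter names are hypothetical, and returning None for an interval with no attempts is an assumption rather than documented product behavior.

```python
from typing import Optional


def availability_total(successful_operations: int,
                       standalone_hits_ok: int,
                       aborts_ok: int,
                       failures_transport: int,
                       failures_tcp: int,
                       failures_application: int) -> Optional[float]:
    """Percentage of attempts that were not classified as failures."""
    all_failures = failures_transport + failures_tcp + failures_application
    all_attempts = (all_failures + successful_operations
                    + standalone_hits_ok + aborts_ok)
    if all_attempts == 0:
        return None  # assumption: availability is undefined without attempts
    return 100.0 * (all_attempts - all_failures) / all_attempts


# 90 successful operations, 3 standalone hits and 2 aborts not classified
# as failures, plus 5 failures of various kinds: 100 attempts in total.
print(availability_total(90, 3, 2, 2, 2, 1))
```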
Connection establishment timeout errors
The number of TCP errors of category Connection establishment timeout errors. This category of errors applies when there was no response from the server to the SYN packets transmitted by the client.

Connection refused errors
The number of TCP errors of category Connection refused errors, also referred to as Session establishment errors. This category of errors applies when a server rejects a request from a client to open a TCP session. Such a situation usually happens when the server runs out of resources, either due to operating system kernel configuration or lack of memory.

Downstream TCP packets
The total number of TCP packets sent in the downstream direction, excluding the traffic control packets.

Downstream VoIP Jitter
VoIP average jitter measured by the probe in the downstream traffic, that is, from a remote VoIP phone to the local endpoint. Jitter is a variation in voice data transit delay, in milliseconds. The jitter is the mean value of the deviation, computed from the previous jitter value (LastJitter) and the deviation of the last packet (D). A creation time stamp is written in RTP packets. To calculate the deviation for the last packet, the creation time stamp of the previous packet (LastTRTP) is subtracted from the creation time stamp of the last packet (TRTP). The result is then subtracted from the difference between the arrival time stamp of the last packet (TM) and the arrival time stamp of the previous packet (LastTM):
D = absolute value ((TM - LastTM) - (TRTP - LastTRTP)).

Downstream VoIP MOS
VoIP average Mean Opinion Score (MOS) measured in the downstream direction, that is, from a remote VoIP phone to the subscriber. It is within a range from 1 to 5. MOS is calculated based on statically configured parameters and dynamically measured call variables. Statically configured parameters are codec parameters and MOS constants. Dynamically measured call variables are latency, size of frame, and loss rate. MOS may be unavailable if there is no RTCP traffic in the call.

Downstream VoIP R-factor
VoIP average R-factor value in the downstream direction, that is, from a remote VoIP phone to the subscriber. A value derived from metrics such as latency, jitter, and packet loss, the R-factor value helps quickly assess the quality of experience for VoIP calls on the network. Typical scores range from 50 (bad) to 90 (excellent).

Downstream VoIP RTCP Jitter
VoIP average jitter as reported by the Real-time Transport Control Protocol (RTCP) for the downstream traffic, that is, from a remote VoIP endpoint to the local endpoint.
Jitter reflects a variation in voice data transit delay, in milliseconds. The jitter is the mean value of the deviation, computed from the previous jitter value (LastJitter) and the deviation of the last packet (D). A creation time stamp is written in RTP packets. To calculate the deviation for the last packet, the creation time stamp of the previous packet (LastTRTP) is subtracted from the creation time stamp of the last packet (TRTP). The result is then subtracted from the difference between the arrival time stamp of the last packet (TM) and the arrival time stamp of the previous packet (LastTM):
D = absolute value ((TM - LastTM) - (TRTP - LastTRTP)).

Downstream VoIP delay
VoIP weighted average networking delay in the downstream direction, that is, from a remote to the local VoIP endpoint. The Delay for one call is calculated as follows:
Delay = Latency + LookAheadDelay + JitterBufferDelay + PLSize / BaseFrameSize * BaseFrameDuration
where Latency is not the Delay from a Report Block of an RTCP packet. It is calculated on the basis of time stamps (measured by the Probe) of RTCP packets and Delays extracted from Report Blocks of RTCP packets. Other parameters apart from PLSize are codec specific. PLSize is the current RTP payload size.

Downstream VoIP loss rate
The percentage of VoIP packets lost or discarded that needed to be retransmitted, measured for downstream traffic.
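The deviation D used by the jitter metrics can be sketched directly from its definition. The text does not state the weights used to fold D into the running jitter value, so the sketch below borrows the 1/16 smoothing gain of the interarrival jitter estimator from the RTP specification (RFC 3550) purely as an assumption; the product may use different weights. All function names are hypothetical, and time stamps are assumed to be in milliseconds.

```python
from typing import List, Tuple


def deviation(tm: float, last_tm: float, trtp: float, last_trtp: float) -> float:
    """D = |(TM - LastTM) - (TRTP - LastTRTP)| for two consecutive packets."""
    return abs((tm - last_tm) - (trtp - last_trtp))


def update_jitter(last_jitter: float, d: float, gain: float = 1 / 16) -> float:
    """Fold a new deviation into the running jitter (gain is an assumed value)."""
    return last_jitter + (d - last_jitter) * gain


def stream_jitter(packets: List[Tuple[float, float]]) -> float:
    """Running jitter over (creation time stamp, arrival time stamp) pairs."""
    jitter = 0.0
    for (last_trtp, last_tm), (trtp, tm) in zip(packets, packets[1:]):
        jitter = update_jitter(jitter, deviation(tm, last_tm, trtp, last_trtp))
    return jitter


# Packets created every 20 ms; the third one arrives 16 ms late.
packets = [(0.0, 100.0), (20.0, 120.0), (40.0, 156.0)]
print(stream_jitter(packets))
```

A perfectly paced stream yields D = 0 for every packet pair, so the jitter stays at zero; only changes in spacing between creation and arrival contribute.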

Downstream bandwidth usage
Downstream traffic bandwidth per data resolution (hour/day/week/month).

Downstream bytes
The number of bytes transferred in the downstream direction (to the subscriber).

Downstream packets
The number of packets transmitted in the downstream direction.

Downstream packets lost
The number of lost TCP data packets sent in the downstream direction, excluding the traffic control packets.

Downstream realized bandwidth
Realized bandwidth in the downstream direction, to a site.

Failures (total)
The total number of failures, that is, all Failures (transport) + all Failures (TCP) + all Failures (application).

Local ACK RTT
RTT measurement performed during ACK packet transmission, from the local site side of the transaction.

Local ACK RTT measurements
This metric keeps track of how many local ACK RTT measurements were made. ACK measurement is performed during ACK packet transmission from either the server or the client side of the transaction.

Local RTT
The round-trip time measured for the local site.

Network performance
The percentage of total traffic that did not experience network-related problems (traffic in which the values of loss rate and RTT did not exceed configured thresholds).

RTT measurements
The number of RTT measurements. An RTT measurement occurs during every TCP handshake, so it provides some insight into the number of attempted TCP sessions and the potential accuracy of the RTT measurements that are reported.

Remote ACK RTT
The time it takes for an ACK packet with no payload to travel from the remote site to the AMD and back again.

Remote ACK RTT measurements
This metric keeps track of how many remote ACK RTT measurements were made. ACK measurement is performed during ACK packet transmission from either the server or the client side of the transaction.

Remote RTT
The round-trip time measured for the remote site.
Server not responding errors
The number of Server Not Responding errors. This category of errors applies when the client closes the TCP session with a RESET packet after the server has failed to respond for too long.

Server session termination errors
The number of Server Session Termination errors. This category of errors applies when the server detects an error on the software service level and closes the TCP session with a RESET packet.

Successful attempts
The number of monitoring intervals during which successful attempts were made to connect to a server. Note that this is counted separately for each server. Thus, if in a given monitoring interval there are attempts to connect to three different servers, the Successful attempts metric will be incremented by three for that one monitoring interval. Note also that if TCP errors occur but the connection is established during the given monitoring interval, this monitoring interval is counted as a success (for that server).

TCP errors
The total number of TCP errors. Those errors may indicate server or application problems, so measuring them is critical to understanding the issues that may affect end-user experience. AMDs measure and report on the following types of TCP errors:

Connection Refused Errors - A client attempts to open a TCP session with a server, which rejects the request. The SYN packet from the client is followed by a RESET packet from the server, with matching TCP sequence numbers. This error is typically caused by resource exhaustion on the server, which is unable to accept more concurrent TCP sessions. This may be either a configuration issue (too few resources allocated in the kernel) or lack of memory. SYN flood attacks typically result in servers being unable to accept new connections.

Server session termination error - The server unexpectedly terminates a connection that was successfully opened. The server sends a RESET packet to the client. Such an error originates at the application that uses the monitored TCP session.
It does not necessarily mean application failure; usually it means that the application encountered a condition in which it decided to immediately terminate the session with the client, for example, because of an application security policy violation by the client.

Session Abort - The client unexpectedly terminates a connection that was successfully opened. The client sends a RESET packet to the server. These errors are inspected in the context of the client application and may or may not be reported. For example, the browser running HTTP may terminate the load of a GIF file if it is older than the one that it had previously cached, and this is normal behavior. However, if all connections to the server are terminated because the user hits the STOP button, then this is abnormal session termination and is reported as "Aborted operation" or "Stopped Page".

Client not responding errors (server timeout errors) - The server networking stack assumes that the network connection to the client exists, but the client remains idle and does not respond. In such a case, the server closes the TCP session with a RESET packet. Such a condition may occur when the client has been silently disconnected from the network, for example, due to a link failure, or the client has crashed. Note that this error will not occur if the client has ended the session gracefully, for example, by closing the client application.

Server not responding errors (client timeout errors) - The client networking stack assumes that the network connection to the server exists, but the server remains idle and does not respond. In such a case, the client closes the TCP session with a RESET packet. This may occur either during the Session Setup phase (no response to the SYN packet) or during a normal data exchange process. Such a situation may result from intermittent network problems between the client and the server. If the traffic is routed through asymmetric paths across the Internet, which is often the case, the path from the server to the client may be broken.

Total bandwidth usage
The number of all transmitted bits (client + server) per second.

Total bytes
The number of all transmitted bytes (client + server).

Upstream TCP packets
The total number of TCP packets sent in the upstream direction, excluding the traffic control packets.

Upstream VoIP Jitter
Average jitter measured by the probe in the upstream traffic, that is, from a local to a remote VoIP endpoint. Jitter reflects variation in voice data transit delay, in milliseconds. The jitter is the mean value of the deviation, computed from the previous jitter value (LastJitter) and the deviation of the last packet (D). A creation time stamp is written in RTP packets. To calculate the deviation for the last packet, the creation time stamp of the previous packet (LastTRTP) is subtracted from the creation time stamp of the last packet (TRTP). The result is then subtracted from the difference between the arrival time stamp of the last packet (TM) and the arrival time stamp of the previous packet (LastTM):
D = absolute value ((TM - LastTM) - (TRTP - LastTRTP)).

Upstream VoIP MOS
VoIP average Mean Opinion Score (MOS) measured in the upstream direction, that is, from a subscriber to a remote VoIP phone. It is within a range from 1 to 5. MOS is calculated based on statically configured parameters and dynamically measured call variables.
Statically configured parameters are codec parameters and MOS constants. Dynamically measured call variables are latency, size of frame, and loss rate. MOS may be unavailable if there is no RTCP traffic in the call.

Upstream VoIP R-factor
VoIP average R-factor value in the upstream direction, that is, from a subscriber to a remote VoIP phone. A value derived from metrics such as latency, jitter, and packet loss, the R-factor value helps quickly assess the quality of experience for VoIP calls on the network. Typical scores range from 50 (bad) to 90 (excellent).

Upstream VoIP RTCP Jitter
VoIP average jitter as reported by the Real-time Transport Control Protocol (RTCP) for the upstream traffic, that is, from a local VoIP endpoint to a remote one. Jitter reflects a variation in voice data transit delay, in milliseconds.

Upstream VoIP delay
VoIP average networking delay in the upstream direction, that is, from a local to a remote VoIP endpoint. The Delay for one call is calculated as follows:
Delay = Latency + LookAheadDelay + JitterBufferDelay + PLSize / BaseFrameSize * BaseFrameDuration
where Latency is not the Delay from a Report Block of an RTCP packet. It is

114 Appendix D Metrics Available for User-defined Alert Definitions calculated on the basis of time stamps (measured by the Probe) of RTCP packets and Delays extracted from Report Blocks of RTCP packets. Other parameters apart from PLSize are codec specific. PLSize is the current RTP payload size. Upstream VoIP loss rate The percentage of VoIP packets lost or discarded that needed to be retransmitted, measured for upstream traffic. Upstream bandwidth usage The number of upstream bits per second. Upstream bytes The number of bytes transmitted in the upstream direction. Upstream packets The number of packets transmitted in the upstream direction. Upstream packets lost The number of lost TCP data packets sent in the upstream direction, excluding the traffic control packets. Upstream realized bandwidth Realized bandwidth in the upstream direction, from a site, area or region. VoIP Jitter VoIP average jitter measured by the probe, for both downstream and upstream traffic. Jitter is a variation in voice data transit delay, in milliseconds. In general, higher levels of jitter are more likely to occur on either slow or heavily congested links. VoIP MOS VoIP average Mean Opinion Score (MOS) rating of the call quality, for both downstream and upstream traffic. VoIP R-factor VoIP average R-factor value, for both downstream and upstream traffic. It is a transmission quality rating, with a typical range of An R-Factor score is derived from multiple VoIP metrics, including latency, jitter, and loss. VoIP RTCP Jitter VoIP average jitter as reported by Real Time Transport Protocol (RTCP), for both downstream and upstream traffic. Jitter is a variation in voice data transit delay, in milliseconds. Higher levels of jitter are more likely to occur on either slow or heavily congested links. VoIP delay VoIP average networking delay, as reported by Real Time Transport Protocol (RTCP), measured for both downstream and upstream traffic. 
VoIP loss rate
The percentage of VoIP packets lost or discarded that needed to be retransmitted, measured for both upstream and downstream traffic.

Zero window size events
The client sets a zero window size in the TCP header when it wants the other side to slow down data transmission because it cannot keep up with the transmission speed. This indicates that the receiving machine is busy with other tasks.
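The deviation and delay formulas given for the VoIP jitter and VoIP delay metrics can be sketched in a few lines of Python. This is a minimal illustration, not the probe's implementation: the function names are hypothetical, all time values are assumed to be in milliseconds, and the 1/16 smoothing gain in update_jitter is the estimator from the RTP specification (RFC 3550), assumed here because the coefficients of the jitter formula are not legible in this guide.

```python
def interarrival_deviation(tm, last_tm, trtp, last_trtp):
    """D = |(TM - LastTM) - (TRTP - LastTRTP)|, as defined for VoIP jitter."""
    return abs((tm - last_tm) - (trtp - last_trtp))


def update_jitter(last_jitter, d, gain=1.0 / 16.0):
    """Running mean of the deviation.

    The 1/16 gain is the RFC 3550 estimator; it is an assumption,
    not a value taken from this guide.
    """
    return last_jitter + (d - last_jitter) * gain


def voip_delay(latency, look_ahead_delay, jitter_buffer_delay,
               pl_size, base_frame_size, base_frame_duration):
    """Delay = Latency + LookAheadDelay + JitterBufferDelay
    + PLSize / BaseFrameSize * BaseFrameDuration."""
    return (latency + look_ahead_delay + jitter_buffer_delay
            + pl_size / base_frame_size * base_frame_duration)
```

For example, two packets created 20 ms apart that arrive 25 ms apart give a deviation of 5 ms, and a payload twice the base frame size adds two base frame durations to the delay.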

Network Link

Average CPU utilization
The percentage of elapsed time that the processor spent executing non-idle threads. This counter is the primary indicator of processor activity, and shows the average percentage of busy time.

Average disk utilization
The percentage of elapsed time that disk storage was busy servicing read or write requests. This counter shows the average percentage of busy time.

Average memory utilization
The average percentage of used physical memory (RAM).

Average number of active sessions
The average number of active Windows Terminal Services sessions.

Average number of open sessions
The average number of open Windows Terminal Services sessions.

Maximum CPU utilization
The percentage of elapsed time that the processor spent executing non-idle threads. This counter is the primary indicator of processor activity, and shows the maximum percentage of busy time.

Maximum disk utilization
The percentage of elapsed time that disk storage was busy servicing read or write requests. This counter shows the maximum percentage of busy time.

Maximum memory utilization
The percentage of used physical memory. This counter shows the maximum percentage of used RAM.

Maximum number of active sessions
The maximum number of active Terminal Services sessions.

Maximum number of open sessions
The maximum number of open Terminal Services sessions.

Minimum CPU utilization
The percentage of elapsed time that the processor spent executing non-idle threads. This counter is the primary indicator of processor activity, and shows the minimum percentage of busy time.

Minimum disk utilization
The percentage of elapsed time that disk storage was busy servicing read or write requests. This counter shows the minimum percentage of busy time.

Minimum memory utilization
The percentage of used physical memory. This counter shows the minimum percentage of used RAM.
Minimum number of active sessions
The minimum number of active Terminal Services sessions.

Minimum number of open sessions
The minimum number of open Terminal Services sessions.

Enterprise Synthetic and Sequence

Aborted transactions
The number of aborted transactions (transaction error code: -3). An aborted transaction is reported when one or more consecutive URLs detected in the traffic match the defined transaction steps, but the next URL detected does not match the transaction definition.

Affected users (availability)
The number of unique users that were affected by availability problems.

Affected users (performance)
The number of users that experienced application performance problems. For transactional protocols, a problem is noted if at least one operation takes longer to complete than the performance threshold. For transactionless TCP-based protocols, a problem is noted if the user wait per kb of data is longer than the threshold value.

Application Delivery Channel Delay
In a WAN-optimized scenario, Application Delivery Channel Delay (ADCD) is a quality metric expressed in milliseconds. The ADCD is determined by initial observation of the traffic between a client and a server. ADCD is derived from the RTT measured on a WAN link and is expressed as time, so it can be understood as latency: a larger ADCD indicates a higher network latency. ADCD also includes the time spent in the data center WOC for traffic buffering and processing. A change of ADCD from its initial value reflects a change in the quality of the WAN optimization service. For example, a sudden increase of ADCD suggests that the quality of the service has worsened; conversely, a sudden decrease of ADCD could suggest an improvement in WAN optimization.

Application performance
For transactional protocols, this is the percentage of software service transactions completed in a time shorter than the performance threshold. For transactionless TCP-based protocols, this is the percentage of monitoring intervals in which user wait time per kb of data was shorter than the threshold value.
Application processing time
The average time spent by the software service on operation processing.

Attempts

Availability (total)
The percentage of successful attempts, calculated using the following formula:
Availability (total) = 100% * (All attempts - All failures) / All attempts
where
All attempts = all failures + all successful operations + all standalone hits not classified as a failure + all aborts not classified as a failure
All failures = all failures (transport) + all failures (TCP) + all failures (application)

Client RTT
Client RTT is the time it takes for a SYN ACK packet (sent by a server) to travel from the AMD to the client and back again, as shown in the following picture.

[Figure: Client RTT measurement. Timing diagram of the SYN, SYN ACK, and ACK packets exchanged between the client, the AMD, and the server (T1 to T9); Client RTT spans T5 to T8.]

A Client RTT measurement begins when the SYN ACK packet from the server to the client passes by the AMD (T5). The packet reaches the client machine (T6) and is processed, and an acknowledgment is sent back to the server (T7). Client processing time impact (T7-T6) is again very low. The Client RTT measurement ends when the ACK packet reaches the AMD (T8). Therefore, the Client Round Trip Time is calculated as T8-T5.

Depending on the actual setup, Client RTT measurements may vary dramatically. In corporate environments, it may be a few milliseconds for LAN-connected clients or a few dozen milliseconds for WAN-connected clients. When the client connects over the Internet, the end-to-end Client RTT measurement is a compound of transit time through the Internet backbone and through the "last mile" access network. The impact of the last mile can be easily calculated from the connection speed and the packet size (56 B in the case of a TCP SYN packet). For a 28 kbps dial-up connection, this amounts to 16 milliseconds one way, or 32 milliseconds for a complete round-trip measurement. For a 1.6 Mbps DSL line, this makes 56 microseconds towards the complete Client RTT measurement.

Client bytes
The number of bytes sent by the clients. Note that this includes headers.

Client packets
The number of packets sent by the client.

Client response time
The average time spent by the client side on transaction processing.

Client time
Client time is the time interval between the last data packet of the transaction response message from the TCP session server and the first packet of the acknowledgment from the TCP session server to the client. Client time is similar to server time, but measured in the context of the transaction response message.
Client time (failed transactions)
The client time for all failed transactions (transactions with a -2 status code). This metric is valid only for the 'Transactions (Synthetic Monitoring)' transaction source.

Client time (requests)
The client time for all transaction requests (both requests that became successful transactions and requests that ended as transactions with errors). This metric is valid only for the 'Transactions (Synthetic Monitoring)' transaction source.

Failed transactions
For Synthetic Monitoring transactions, it is the number of transactions for which the give-up threshold was exceeded. For RUM transactions, failed transactions are all transactions with status other than -3 (aborted).
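The Client RTT arithmetic described above (T8-T5, plus the last-mile contribution derived from packet size and connection speed) can be checked with a short Python sketch. The function names are hypothetical and the sketch only restates the arithmetic in the text; it does not reflect how the AMD performs the measurement.

```python
def client_rtt(t5, t8):
    """Client RTT = T8 - T5: from the SYN ACK passing the AMD (T5)
    to the client's ACK reaching the AMD (T8)."""
    return t8 - t5


def last_mile_one_way_seconds(packet_bytes, link_bits_per_second):
    """Time needed to put one packet on the access link, in seconds."""
    return packet_bytes * 8 / link_bits_per_second


# The 56-byte packet on a 28 kbps dial-up link from the text.
one_way = last_mile_one_way_seconds(56, 28_000)
round_trip = 2 * one_way
print(f"one way: {one_way * 1000:.0f} ms, round trip: {round_trip * 1000:.0f} ms")
```

The same helper can be applied to other link speeds to estimate how much of a measured Client RTT is attributable to the access network.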

Failures (total)
The total number of failures, that is, all Failures (transport) + all Failures (TCP) + all Failures (application).

HTTP abort error
This error is reported when one of the URLs in a transaction detected in the monitored traffic does not match the transaction definition. This refers to any URL in a sequence of URLs, except the first one.

HTTP client errors (4xx)
The sum of all HTTP client errors (4xx). This includes four categories of errors (4xx): by default, HTTP Unauthorized (401, 407) errors, HTTP Not Found (404) errors, custom client (4xx) errors, and Other HTTP (4xx) errors. The contents of the first three categories can be configured by users.

HTTP client errors - category 3 (default name)
The number of HTTP custom client errors (4xx). By default, there is no specific error type assigned here.

HTTP not found errors 404 (default name)
The number of observed custom HTTP 404 Not found errors.

HTTP other client errors (4xx)
The number of HTTP other client errors (4xx). There are four categories of HTTP client errors (4xx), of which three can be configured by users. By default, the first category includes HTTP Unauthorized (401, 407) errors and the second category includes HTTP Not Found (404) errors. The third category has no default error types assigned and can be configured by a user. Finally, the group of Other HTTP (4xx) errors contains all errors that do not fall into any other client errors category. The number is calculated based on the formula:
[HTTP errors 4xx] - [HTTP Not Found errors 404] - [HTTP Not Authorized (401, 407)] - [HTTP errors configured by user]

HTTP other server errors (5xx)
The number of HTTP server errors (5xx) that do not fall into categories 1 or 2 of custom HTTP server errors (5xx).

HTTP server errors (5xx)
The number of all observed HTTP server errors (5xx).

HTTP server errors category 1 (default name)
The number of custom HTTP server errors (5xx), category 1.
By default, there are no specific error types assigned to this category.

HTTP server errors category 2 (default name)
The number of custom HTTP server errors (5xx), category 2. By default, there are no specific error types assigned to this category.

HTTP timeout error
This type of error is reported if the time between the occurrence of consecutive URLs constituting a transaction exceeds the predefined timeout value.

HTTP unauthorized errors 401, 407 (default name)
The number of observed custom HTTP authentication related errors. These include "HTTP 401 Unauthorized" and "HTTP 407 Proxy authentication required" errors.

HTTP servers generate "401 Unauthorized" errors when anonymous clients are not authorized to view the requested content and must provide authentication information in response to the server's WWW-Authenticate challenge. The 401 errors are similar to "403 Forbidden" errors, but are used when authentication is possible and has either failed or not yet been provided. The 407 error is similar to 401, but it indicates that the client should first authenticate with a proxy server. The AMD will report these errors only if server-level authentication has been configured. Simple and basic user access policies are common in Web sites that do not store user-sensitive or business-critical information. Most commercial-grade HTTP-based applications, such as home banking applications or online shopping sites, rely on application-level authentication rather than server-level authentication. Such applications are designed so that even if the user authentication fails, the HTTP server usually sends the 200 OK response code and places the authentication error message in the page content. Therefore, the 401 Unauthorized and 407 Proxy authentication required error codes are quite rare in commercial environments.

Incomplete transaction error
This error indicates that a transaction was reported although the monitored traffic did not match the first steps in the transaction definition.

Network time
The time the network takes to deliver the request to the server and to deliver the resulting response back to the user. In other words, network time is the portion of the operation time that is spent on transferring data over the network.

Network time (failed transactions)
The network time for all failed transactions (transactions with a -2 status code). This metric is valid only for the 'Transactions (Synthetic Monitoring)' transaction source.
Network time (requests)
The network time for all transaction requests (both requests that became successful transactions and requests that ended as transactions with errors). This metric is valid only for the 'Transactions (Synthetic Monitoring)' transaction source.

No response error
The number of errors of the category No response. These errors are reported when a request is detected in the monitored traffic, but the actual operation following this request is not observed.

RTT measurements
The number of RTT measurements.

Server RTT
The time it takes for a SYN packet to travel from the AMD to a monitored server and back again.

[Figure: Server RTT measurement. Timing diagram of the SYN, SYN ACK, and ACK packets exchanged between the client, the AMD, and the server (T1 to T9), with the Server RTT interval marked.]

Server bytes
The number of bytes sent by servers. The number includes headers.

Server packets
The number of packets sent by the servers.

Server time
The time it took the server to produce a response for the given request.

Server time (failed transactions)
The server time for all failed transactions (transactions with a -2 status code). This metric is valid only for the 'Transactions (Synthetic Monitoring)' transaction source.

Server time (requests)
The server time for all transaction requests (both requests that became successful transactions and requests that ended as transactions with errors). This metric is valid only for the 'Transactions (Synthetic Monitoring)' transaction source.

Slow transactions
The number of transactions for which the transaction time was above a predefined threshold value.

Transaction requests
The number of all transaction requests, both requests that became successful transactions and requests that ended as transactions with errors.

Transaction time
The time it took to complete a transaction.

Transactions
The number of transactions.

Unique users
The number of unique users detected in the monitored traffic.

Server re-transmissions
The number of re-transmitted TCP packets sent by a server.
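Two of the derived metrics in this appendix, Availability (total) and HTTP other client errors (4xx), are plain arithmetic over other counters. The following Python sketch restates those formulas so they can be checked against sample numbers. The function and argument names are hypothetical, and returning None for an interval with no attempts is an assumption of this sketch rather than documented product behavior.

```python
def all_failures(transport, tcp, application):
    """All failures = failures (transport) + failures (TCP) + failures (application)."""
    return transport + tcp + application


def availability_total(failures, successful_operations,
                       standalone_hits_not_failed, aborts_not_failed):
    """Availability (total) = 100% * (All attempts - All failures) / All attempts."""
    attempts = (failures + successful_operations
                + standalone_hits_not_failed + aborts_not_failed)
    if attempts == 0:
        return None  # assumption: undefined when nothing was attempted
    return 100.0 * (attempts - failures) / attempts


def other_client_errors(errors_4xx, not_found_404,
                        unauthorized_401_407, user_configured):
    """[4xx] - [404] - [401, 407] - [errors configured by user]."""
    return errors_4xx - not_found_404 - unauthorized_401_407 - user_configured


failures = all_failures(transport=2, tcp=3, application=5)
print(availability_total(failures, 85, 3, 2))   # 10 failures in 100 attempts
print(other_client_errors(50, 20, 10, 5))
```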

APPENDIX E Alert Definitions Provided with DC RUM

The following alerts are supported by at least one of the DC RUM report servers: Central Analysis Server or Advanced Diagnostics Server. Unless specified otherwise, the alert definitions apply to Central Analysis Server. Alert definitions often refer to terms such as baseline or the normal value, to which the current value of a parameter is compared.

404_TIME_4_URL
This alert indicates availability problems with the specified URL pattern. It is triggered based on the average server time per URL and on the number of HTTP 404 errors for the specified URL pattern.

Characteristics
Name: Low P2P MOS over VoIP software service for site
Type: performance
Status (default): disabled
Detector: SQL-based

Message
URL Pattern: URL_pattern
Metric: Page Download Time + Page Not Found Response
Threshold: load_time_threshold ms + number page not found responses
Description: at least one server time of downloads downloads resulting 404 code lasted more than time_threshold ms. The mean server time for downloads downloads was time ms.
Summary Statistics: URL

Detector parameters

Parameter: URL pattern
Valid values: a valid URL, including wildcards

Parameter: Threshold
Valid values: number
Description: Threshold value, in milliseconds, of the server time.

Parameter: Req no of occurrences
Valid values: number
Description: Threshold value of the number of HTTP 404 errors for the URL pattern.

All the alert parameters are defined in the <install_dir>\config\alarmdetectorparams-rtm.config file in the following form:
404_TIME_4_URL String: "url_pattern" = threshold_in_ms,number_of_occurrences_per_pattern
Example:
404_TIME_4_URL String: " = 20000,200

AMD_DROP_PCKTS_ALL_I
The alert is triggered when the number of packets dropped by all AMD interfaces while reading packets from a wire is too high.

Characteristics
Name: Packets dropped by all AMD interfaces
Type: diagnostics
Status (default): enabled
Detector: built-in, SQL

Message
On AMD {getipaddress(4)}:{5} - {6} of {7} received packets were dropped (rate is {8}%), {9} of {10} transferred packets were dropped (rate is {11}%). Alert generated for the time interval from ({getdate(1)}) to ({getdate(2)}) based on {3} amdstats file(s).

Detector parameters

Parameter: Received dropped packets threshold
Valid values: number
Description: Threshold value describing the maximum acceptable number of dropped packets received by all AMD interfaces. Default value:

Parameter: Transferred dropped packets threshold
Valid values: number
Description: Threshold value describing the maximum acceptable number of dropped packets transferred by all AMD interfaces. Default value:
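As an illustration of the 404_TIME_4_URL entry format shown earlier on this page ("url_pattern" = threshold_in_ms,number_of_occurrences_per_pattern), the following Python sketch splits one such line into its three parts. The parser is a hypothetical helper written for this example, not part of DC RUM, and the URL pattern used in the usage line is an invented placeholder because the pattern in the guide's own example is not legible.

```python
import re

# One 404_TIME_4_URL entry: quoted URL pattern, then two integers.
_ENTRY = re.compile(
    r'^404_TIME_4_URL\s+String:\s*"(?P<pattern>[^"]*)"\s*=\s*'
    r'(?P<threshold_ms>\d+)\s*,\s*(?P<occurrences>\d+)$'
)


def parse_404_time_4_url(line):
    """Return (url_pattern, threshold_in_ms, number_of_occurrences)."""
    match = _ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a 404_TIME_4_URL entry: {line!r}")
    return (match.group("pattern"),
            int(match.group("threshold_ms")),
            int(match.group("occurrences")))


# Hypothetical URL pattern; the thresholds are the ones from the example above.
entry = '404_TIME_4_URL String: "http://www.example.com/*" = 20000,200'
print(parse_404_time_4_url(entry))
```

A helper like this can be useful for validating hand-edited entries before restarting the report server, since a malformed line raises an error instead of being silently misread.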

AMD_DROP_PCKTS_DRVR
The alert is raised when the number of packets dropped while reading packets from the device driver is too high.

Characteristics
Name: Packets dropped by AMD driver
Type: diagnostics
Status (default): enabled
Detector: built-in, SQL

Message
On AMD {getipaddress(4)}:{5} - {6} of {7} packets were dropped (rate is {8}%) Alert generated for the time interval from ({getdate(1)}) to ({getdate(2)}) based on {3} amdstats file(s).

Detector parameters

Parameter: Drop packets threshold
Valid values: number
Description: Threshold value describing the maximum acceptable number of dropped packets read from the AMD driver. Default value:

AMD_DROP_PCKTS_SNGL_I
The alert is raised when the number of packets dropped by a given interface while reading packets from a wire is too high.

Characteristics
Name: Packets dropped on AMD interface
Type: diagnostics
Status (default): enabled
Detector: built-in, SQL

Message
Interface {6} from AMD {getipaddress(4)}:{5} - {7} of {8} received packets were dropped (rate is {9}%), {10} of {11} transferred packets were dropped (rate is {12}%). Alert generated for the time interval from ({getdate(1)}) to ({getdate(2)}) based on {3} amdstats file(s).

Detector parameters

Parameter: Received dropped packets threshold
Valid values: number
Description: Threshold value describing the maximum acceptable number of dropped packets received by a given AMD interface. Default value:

Parameter: Transferred dropped packets threshold
Valid values: number
Description: Threshold value describing the maximum acceptable number of dropped packets transferred by a given AMD interface. Default value:

AMD_NOTRAFFIC_DRVR
The alert is raised when the number of packets processed by the driver is too low.

Characteristics
Name: Low number of packets processed by driver
Type: diagnostics
Status (default): enabled
Detector: built-in, SQL

Message
On AMD {getipaddress(4)}:{5} - Only {8} of {6} packets were processed. Alert generated for the time interval from ({getdate(1)}) to ({getdate(2)}) based on {3} amdstats file(s).

Detector parameters

Parameter: Processed packets threshold
Valid values: number
Description: Threshold value describing the minimum acceptable number of packets processed by the AMD driver. Default value: 100

AMD_NOTRAFFIC_I
The alert is raised when the traffic level on an interface is below the threshold.

Characteristics
Name: Low traffic on active AMD interface
Type: diagnostics
Status (default): enabled
Detector: built-in, SQL

Message
On active interface {6} from AMD {getipaddress(4)}:{5} traffic was below alert threshold. RX bytes {8}. Alert generated for the time interval from ({getdate(1)}) to ({getdate(2)}) based on {3} amdstats file(s).

Detector parameters

Parameter: Received bytes threshold
Valid values: number
Description: Threshold value describing the minimum acceptable number of bytes received by an active sniffing AMD interface. Default value: 100

AMD_SSL_ENGINE
The alert is raised when an error status code is detected for the SSL engine.

Characteristics
Name: SSL engine status
Type: diagnostics
Status (default): enabled
Detector: built-in, SQL

Message
On AMD {getipaddress(4)}:{5} SSL engine error status detected. Alert generated for the time interval from ({getdate(1)}) to ({getdate(2)}) based on {3} amdstats file(s).

Detector parameters
This alert has no parameters to customize.

AMD_SSL_STATUS
The alert is raised when SSL decryption errors are detected, for example, when some sessions were not decrypted.

Characteristics
Name: SSL sessions not decrypted
Type: performance
Status (default): enabled
Detector: built-in, SQL

Message
On AMD {getipaddress(4)}:{5} - {15} of {8} finished sessions were not decrypted due to no private key found, {17}% of finished sessions not decrypted due to incompleted SSL handshake, {18}% of finished sessions not decrypted ot partially decrypted. Alert generated for the time interval from ({getdate(1)}) to ({getdate(2)}) based on {3} amdstats file(s).

Detector parameters

Parameter: Percentage of finished sessions not decrypted due to incomplete SSL handshake threshold
Valid values: number
Description: Threshold value describing the maximum acceptable percentage of finished sessions that were not decrypted because of incomplete SSL handshake. Default value: 10

Parameter: Sessions with no private key found threshold
Valid values: number
Description: Threshold value describing the maximum acceptable number of sessions for which no private key was found. Default value: 10

Parameter: Percentage of not decrypted ot partially decrypted sessions threshold
Valid values: number
Description: Threshold value describing the maximum acceptable percentage of sessions that were not decrypted or partially decrypted. Default value: 10

AMD_UNIDIR_TRAFF
The alert is raised when unidirectional traffic exceeding the defined threshold is detected.

Characteristics
Name: Unidirectional traffic detected
Type: diagnostics
Status (default): enabled
Detector: built-in, SQL

Message
On AMD {getipaddress(4)}:{5} - {6} unidirectional sessions detected ({7}% of sessions). Alert generated for the time interval from ({getdate(1)}) to ({getdate(2)}) based on {3} amdstats file(s).

Detector parameters

Parameter: Percentage of unidirectional sessions threshold
Valid values: number
Description: Threshold value describing the maximum acceptable percentage of unidirectional sessions. Default value: 2

Parameter: Number of unidirectional sessions threshold
Valid values: number
Description: Threshold value describing the maximum acceptable number of unidirectional sessions. Default value: 1000

APPL_ABNOR
This alert is triggered by abnormal traffic volume for a software service.

Characteristics
Name: Abnormal traffic volume for software service
Type: anomalies
Status (default): disabled
Detector: SQL-based

Message
Abnormal traffic volume for software service software_service_name (protocol) on port port. Volume values - current: volume B, reference_type: reference_value B. The increment of the volume relative to the reference_type: excess_volume_percentage%. The number of users using the software service: number_of_users.

Detector parameters

Parameter: Multiplier of the normal volume value
Valid values: number
Description: A coefficient for scaling the normal value of traffic volume. The scaled value constitutes the lower limit of the unacceptable traffic values.

Parameter: Lower limit of the unacceptable volume [B]
Valid values: number
Description: The lower limit of the unacceptable traffic volume, expressed in bytes. It is used when the normal value is known. The alert is generated only if both limits are exceeded: the one defined by coefficient {0} combined with the normal value, and the one supplied in parameter {1}.

Parameter: Lower limit of volume when the normal value unknown [B]
Valid values: number
Description: The lower threshold of unacceptable traffic volume, expressed in bytes. It is used when the normal value is not known.

AVL_DROP_4_APPL
This alert reports on applications for which availability errors have been observed.

Characteristics
Name: Software service availability problem
Type: performance
Status (default): enabled
Detector:

Message
Availability for software service {1} too low. Observed percentage of availability was: {2}%. The number of active software service users: {3}.

Detector parameters

Parameter: Upper limit of availability [%]
Valid value: number
Description: The availability threshold representing the upper limit of a set of unacceptable availability quotients. It is expressed in percent; the default value is 75%.

Parameter: Lower limit of the unacceptable number of required connections
Valid value: number
Description: The connection threshold representing the lower limit of a set of unacceptable totals of RTT measurements, of connection establishment timeout errors, and of connection refused errors. The default value is 40.

Parameter: Lower limit of the unacceptable attempt number
Valid value: number
Description: The attempt threshold representing the lower limit of a set of unacceptable attempt numbers. The default value is 20.

DATABASE_SIZE
This alert is triggered when there is too little free space for the server database. It is available in CAS and ADS.

Characteristics
Name: Insufficient disk space for report server database
Type: diagnostics
Status (default): enabled
Detector: built-in, non-sql

Message
The reserve the database maintains for {1} is insufficient. This data is stored on disk(s) {2}. The database size is {3} MB, the reserve size is: {4} MB ({5}%). The oldest batch of monitoring data will be purged from the database to make space for new data. See server.log for details on what data has been purged.

Detector parameters

Parameter: Reserve Threshold [% of Database Size]
Valid value: number as percentage
Description: The threshold value of the remaining free space, expressed as a percentage of the current size of the database. If the current database is 200 MB and this value is set to 50, this alert is triggered when the amount of free space on the database server falls below 100 MB. This number is represented in the alert message as parameter {5}.

DISKS_STORAGE
This alert is triggered when there is too little free space on the server hard drives. It is not dependent on AMD data; all the disks installed in the server are checked. The alert is available in Central Analysis Server and Advanced Diagnostics Server.

Characteristics
Name: Disks storage
Type: diagnostics
Status (default): enabled
Detector: built-in, non-sql

Message
Not enough space on disk hard_drive. Free disk space: disk_space MB (percentage_of_free_disk_space%). Total disk storage: disk_space MB.

Detector parameters

Parameter: External disk checker
Valid value: valid file path
Description: Name of the executable file used for testing available free storage size: df.exe

Parameter: Threshold [% of free space]
Valid value: number as percentage
Description: Minimum percentage value of free storage. For example, 10.

EXC_ACT
This alert is triggered when a user generates a high volume of traffic that may be considered excessive or suspicious activity.

Characteristics
Name: Excessive number of servers used by user. Top software service identified.
Type: anomalies

Status (default): enabled
Detector: SQL-based

Message
Excessive activity of user IP_address (user_name). The top software service software_service_name. The number of servers connected through the software service: #server_count (server_percentage%). The total number of servers the user has connected to: #user_servers. The number of servers has exceeded the #limit.

Detector parameters
Overriding values can be specified in the <install_dir>\config\alarmdetectorparams-rtm.config configuration file.

Parameter: Multiplier of the normal number of servers
Valid value: number
Description: Scaling factor for the normal value of the number of servers for monitored traffic. The scaled value constitutes the lower limit of unacceptable values. Default value: 5.

Parameter: Lower limit of the unacceptable number of servers
Valid value: number
Description: The lower limit of unacceptable values of the current number of servers for monitored traffic. Default value: 20.
Note: Both limits have to be exceeded for the alert to be generated: the one defined by scaling factor {0} combined with the normal value, and the one supplied in parameter {1}.

Parameter: Alternative lower limit of the unacceptable number of servers
Valid value: number
Description: The lower limit of unacceptable values of the current number of servers for whole traffic. The limit is used if the current number of servers for monitored traffic is not an unacceptable value according to criteria based on parameters {0} and {1}. Default value: 100.

EXC_ACT2
This alert detects when a user generates a high volume of traffic (excessive activity). It is similar to EXC_ACT, but it uses a non-SQL detector, which is faster but not configurable.

Characteristics
Name: Excessive number of servers used by user. Top software service identified. Non-SQL detector.
Type: anomalies
Status (default): enabled
Detector: non-sql-based

Message
Excessive activity of user IP_address (user_name). The top software service software_service_name. The number of servers connected through the software service: #server_count (server_percentage%). The total number of servers the user has connected to: #user_servers. The number of servers has exceeded the #limit.

Detector parameters
Overriding values can be specified in the <install_dir>\config\alarmdetectorparams-rtm.config configuration file.

Parameter: Multiplier of the normal number of servers
Valid value: number
Description: Scaling factor for the normal value of the number of servers for monitored traffic. The scaled value constitutes the lower limit of unacceptable values. Default value: 5.

Parameter: Lower limit of the unacceptable number of servers
Valid value: number
Description: The lower limit of unacceptable values of the current number of servers for monitored traffic. Default value: 20.
Note: Both limits have to be exceeded for the alert to be generated: the one defined by scaling factor {0} combined with the normal value, and the one supplied in parameter {1}.

Parameter: Alternative lower limit of the unacceptable number of servers
Valid value: number
Description: The lower limit of unacceptable values of the current number of servers for whole traffic. The limit is used if the current number of servers for monitored traffic is not an unacceptable value according to criteria based on parameters {0} and {1}. Default value: 100.

EXC_ACT_SIMPLE
This alert provides simplified information similar to that supplied by the EXC_ACT (excessive activity) alert.

Characteristics
Name: Excessive number of servers used by user
Type: anomalies
Status (default): disabled
Detector: SQL-based

Message
Excessive activity of user IP_address (user_name). The total number of servers the user has connected to: user_servers.

The number of servers has exceeded the limit.

Detector parameters
Overriding values can be specified in the <install_dir>\config\alarmdetectorparams-rtm.config configuration file.

Parameter: Multiplier of the normal number of servers
Valid value: number
Description: Scaling factor for the normal value of the number of servers for monitored traffic. The scaled value constitutes the lower limit of unacceptable values. Default value: 5.

Parameter: Lower limit of the unacceptable number of servers
Valid value: number
Description: The lower limit of unacceptable values of the current number of servers for monitored traffic. Default value: 20.
Note: Both limits (the one defined by scaling factor {0} combined with the normal value, and the one supplied in parameter {1}) have to be exceeded for the alert to be generated.

Parameter: Alternative lower limit of the unacceptable number of servers
Valid value: number
Description: The lower limit of unacceptable values of the current number of servers for whole traffic. The limit is used if the current number of servers for monitored traffic is not an unacceptable value according to criteria based on parameters {0} and {1}. Default value: 100.

FLOW_DROP_4_CL_LOC
This alert is triggered when a certain percentage of traffic to or from client sites is adversely affected by traffic volume (when at least one traffic metric exceeds a threshold value). The triggering metrics are:
downstream ACK RTT
downstream RTT
downstream realized bandwidth
end-to-end downstream loss rate
end-to-end upstream loss rate
upstream ACK RTT
upstream RTT

Characteristics
Name: Affected volume too high for site
Type: performance
Status (default): disabled
Detector: SQL-based

Message
The percentage of affected traffic to/from site site (site_dns_name) has exceeded the permitted level. Observed percentage of affected traffic volume was: percentage_of_affected_volume%.

Detector parameter

Parameter: Lower limit of the percentage of the affected volume
Valid value: percentage as number
Description: Threshold value of the percentage of affected traffic volume. The alert is generated if the percentage of affected traffic volume exceeds the value given in parameter {0}. Default: 30.

HOT_IP
This alert is triggered when the number of operations executed by a single user in the past reporting intervals exceeds a certain threshold.

Characteristics
Name: Excessive number of operations. Non-SQL detector.
Type: anomalies
Status (default): disabled
Detector: built-in, non-sql

Message
Excessive activity of user IP_address (user_name). Non-SQL detector. Number of operations: number_of_operations.

IMPORTANT
This alert does not track the activity of clients that use IPv6 addresses.

Detector parameter

Parameter: Number of operations threshold
Valid value: number
Description: Absolute number of operations performed from the reported IP address. Default: 100.

HTTP_SERV_EFF
This alert is triggered when the server time and the number of slow operations for a particular software service reach unacceptable levels.

Characteristics
Name: Server time and number of slow operations unacceptable for service
Type: performance
Status (default): disabled
Detector: SQL-based

Message
Excessive activity of user IP_address (user_name). Non-SQL detector. Number of operations: number_of_operations.

Detector parameters

Parameter: Relative increment of normal values of the server time
Valid value: percentage as number
Description: An increment added to 100% (that is, to one unit) to calculate the scaled normal value of the server time. This scaled normal value is used as the outer lower bound of an unacceptable server time. The increment is expressed in percent. Default: 100.

Parameter: Lower limit of the unacceptable server time [ms]
Valid value: number
Description: The auxiliary outer lower bound of unacceptable server times. It is used when the normal value of the server time is known and thus it guarantees that the value of the server time is significant. It is expressed in milliseconds. Default: 50.

Parameter: Alternative lower limit of the unacceptable server time [ms]
Valid value: number
Description: This threshold is exceeded when the server time is greater than the value set. It is used when the normal value of the server time is not known. It is expressed in milliseconds. Default: 100.

Parameter: Lower limit of the percentage of slow pages
Valid value: percentage as number
Description: This threshold is exceeded when more than this percent of all operations are slow. Default: 20.

Parameter: Lower limit of the percentage of users receiving slow pages
Valid value: percentage as number
Description: This threshold is exceeded when more than this percent of all users of this service are receiving slow operations. The threshold is evaluated when greater than zero. Default: 0.

INCORR_LOGIN
This alert is triggered by failed attempts to log in. It is typically used to warn of unauthorized log-in attempts.

Characteristics
Name: Incorrect login for user
Type: diagnostics
Status (default): enabled

Detector: built-in, non-sql

Message
Login errors detected for user user from terminal.

Detector parameters
This alert has no parameters.

LOAD_TIME_4_URL
This alert is triggered if the operation time for the entire page (including frames, linked style sheets, images, and other page components) exceeds the threshold value and if there have been at least a specified number of such page loads. The alert is similar to LOAD_TIME_4_URL_4_CLI.

Characteristics
Name: High operation time for URL pattern occurred more than N times per client
Type: performance
Status (default): disabled
Detector: SQL-based

Message
URL Pattern: URL_pattern
Metric: Page Download Time
Threshold: time ms.
Description: at least page_downloads page downloads of total lasted more than time_threshold ms. The mean time for number downloads was time ms.
Summary Statistics: URL

Detector parameters

Parameter: thrs
Valid value: number
Description: Threshold value of the operation time, expressed in milliseconds.

Parameter: noo
Valid value: number
Description: Threshold value of the number of operations per user.

All the alert parameters are defined in the <install_dir>\config\alarmdetectorparams-rtm.config file in the following form:
LOAD_TIME_4_URL String: url_pattern = threshold_in_ms, page_loads_per_user
Example:
LOAD_TIME_4_URL String: " = 20000,2
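The entry form and the trigger condition described above can be illustrated with a minimal Python sketch. It is not the product's parser or detector. The URL pattern in the usage lines, the helper names, and the reading of "at least" as "greater than or equal to" are assumptions made for illustration only, and the real configuration file may accept syntax that this sketch does not handle.

```python
import re
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class UrlLoadRule(NamedTuple):
    url_pattern: str
    threshold_ms: int    # thrs: operation time threshold in milliseconds
    loads_per_user: int  # noo: number of operations per user


# Hypothetical pattern for one entry of the documented form.
# URL patterns containing '"' or '=' are not handled by this sketch.
_ENTRY = re.compile(
    r'^LOAD_TIME_4_URL\s+String:\s*"?(?P<pattern>[^"=]*?)"?\s*=\s*'
    r'(?P<thrs>\d+)\s*,\s*(?P<noo>\d+)\s*$'
)


def parse_rule(line: str) -> UrlLoadRule:
    """Split one configuration entry into pattern, threshold and count."""
    match = _ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    return UrlLoadRule(
        match["pattern"].strip(), int(match["thrs"]), int(match["noo"])
    )


def users_over_limit(
    rule: UrlLoadRule, loads: Iterable[Tuple[str, int]]
) -> List[str]:
    """Return users with at least `loads_per_user` slow page loads.

    `loads` holds (user, load_time_ms) pairs for pages that already
    match the rule's URL pattern; pattern matching is out of scope here.
    """
    slow = Counter(user for user, ms in loads if ms > rule.threshold_ms)
    return sorted(u for u, n in slow.items() if n >= rule.loads_per_user)


# Usage with a made-up URL pattern and made-up measurements.
rule = parse_rule('LOAD_TIME_4_URL String: "http://www.example.com/shop" = 20000,2')
loads = [("alice", 25000), ("alice", 21000), ("bob", 30000), ("bob", 500)]
flagged = users_over_limit(rule, loads)
```

In this made-up data set only the first user has two page loads above 20000 ms, so only that user satisfies both conditions.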

LOAD_TIME_4_URL_4_CLI
This alert is triggered if the operation time for the entire page (including frames, linked style sheets, images, and other page components) exceeds the threshold value and if there have been at least a specified number of such page loads for a specified client name or client IP address. The alert is very similar to LOAD_TIME_4_URL.

Characteristics
Name: High operation time for URL pattern and for client IP observed too often
Type: performance
Status (default): disabled
Detector: SQL-based

Message
URL Pattern: URL_pattern
Client name: client_name
Metric: Page download time
Threshold: time ms.
Description: at least one page download of page_downloads lasted more than time_threshold ms. The mean time for number downloads was time ms.
Summary Statistics: URL
Note that when the client name cannot be resolved, a client IP address is reported instead.

Detector parameters

Parameter: thrs
Valid value: number
Description: Threshold value of the operation time, expressed in milliseconds.

Parameter: noo
Valid value: number
Description: Threshold value of the number of operations per user.

All the alert parameters are defined in the <install_dir>\config\alarmdetectorparams-rtm.config file in the following form:
LOAD_TIME_4_URL_4_CLI String: client_name:url_pattern = threshold_in_ms, page_loads_per_user
Example:
LOAD_TIME_4_URL_4_CLI String: "client286: = 20000,2

LOC_CL_UP_STR_EFF
This alert is triggered if a site reports unacceptable metrics related to server loss rate, client RTT, or slow operations. In particular, the alert is triggered by unacceptable values for:

Server loss rate (compared to normal values for the monitoring interval and also using additional thresholds)
Client RTT (compared to normal values for the monitoring interval and also using additional thresholds)

Characteristics
Name: Server loss rate, client RTT, and server realized bandwidth unacceptable for site
Type: performance
Status (default): disabled
Detector: SQL-based

Message
Server loss rate, client RTT, and server realized bandwidth unacceptable for the site site. Values of the loss rate - loss_rate: percentage%, current: percentage%. Values of the RTT - RTT: RTT ms, current: RTT ms. Values of the realized bandwidth - rband: rband bps, current: rband bps. The number of active site users: users.

Detector parameters

Parameter: Relative increment of normal values of the server loss rate
Valid value: number
Description: The coefficient for scaling the normal value of the loss rate. The scaled value constitutes the lower limit of the unacceptable loss rate value.

Parameter: Lower limit of the unacceptable server loss rate
Valid value: number
Description: The lower limit of the unacceptable loss rate. It is used when the normal value is known. The preceding two limits (the one defined by coefficient {0} combined with the normal value and the one supplied in parameter {1}) must both be exceeded for the alert to be generated.

Parameter: Alternative lower limit of the unacceptable server loss rate
Valid value: number
Description: The lower threshold of the unacceptable loss rate. It is used when the normal value is not known. The preceding two limits (the one defined by coefficient {0} combined with the normal value and the one supplied in parameter {1}) must both be exceeded for the alert to be generated.

Parameter: Relative increment of normal values of the client RTT
Valid value: number
Description: The coefficient for scaling the normal value of RTT. The scaled value constitutes the lower limit of the unacceptable RTT value.

Parameter: Lower limit of the unacceptable client RTT [ms]
Valid value: number
Description: The lower limit of the unacceptable RTT. It is used when the normal value is known.

Parameter: Alternative lower limit of the unacceptable client RTT [ms]
Valid value: number
Description: The lower threshold of unacceptable RTT. It is used when the normal value is not known.

Parameter: Relative decrement of normal values of the server realized bandwidth
Valid value: number
Description: The coefficient for scaling the normal value of realized bandwidth. The scaled value constitutes the upper limit of the unacceptable realized bandwidth value.

Parameter: Lower limit of the unacceptable server realized bandwidth [bps]
Valid value: number
Description: The upper limit of the unacceptable realized bandwidth. It is used when the normal value is known. The preceding two limits (the one defined by coefficient {0} combined with the normal value and the one supplied in parameter {1}) must both be exceeded for the alert to be generated.

Parameter: Alternative lower limit of the unacceptable server realized bandwidth [bps]
Valid value: number
Description: The upper threshold of unacceptable realized bandwidth. It is used when the normal value is not known.

LOC_HTTP_STR_EFF
This alert is triggered if a site reports unacceptable metrics related to server loss rate, client RTT, or slow operations. In particular, the alert is triggered by unacceptable values for:
Server loss rate (compared to normal values for the monitoring interval and also using additional thresholds)
Client RTT (compared to normal values for the monitoring interval and also using additional thresholds)

Characteristics
Name: Server loss rate, client RTT, and number of slow operations unacceptable for site
Type: performance
Status (default): disabled
Detector: SQL-based

Message
Server loss rate, client RTT, and number of slow pages unacceptable for the site site. Values of the loss rate - loss_rate: percentage%, current: percentage%. Values of the RTT - RTT: RTT ms, current: RTT ms.

The number of pages - slow: slow_pages, whole: slow_pages. The number of active site users: users.

Detector parameters

Parameter: Relative increment of normal values of the server loss rate
Valid value: number
Description: The coefficient for scaling the normal value of the loss rate. The scaled value constitutes the lower limit of the unacceptable loss rate value.

Parameter: Lower limit of the unacceptable server loss rate
Valid value: number
Description: The lower limit of the unacceptable loss rate. It is used when the normal value is known. The preceding two limits (the one defined by coefficient {0} combined with the normal value and the one supplied in parameter {1}) must be exceeded for the alert to be generated.

Parameter: Alternative lower limit of the unacceptable server loss rate
Valid value: number
Description: The lower threshold of unacceptable loss rate. It is used when the normal value is not known.

Parameter: Relative increment of normal values of the client rtt
Valid value: number
Description: The coefficient for scaling the normal value of RTT. The scaled value constitutes the lower limit of the unacceptable RTT value.

Parameter: Lower limit of the unacceptable client rtt [ms]
Valid value: number
Description: The lower limit of the unacceptable RTT. It is used when the normal value is known. The preceding two limits (the one defined by coefficient {0} combined with the normal value and the one supplied in parameter {1}) must be exceeded for the alert to be generated.

Parameter: Alternative lower limit of the unacceptable client rtt [ms]
Valid value: number
Description: The lower threshold of unacceptable RTT. It is used when the normal value is not known.

Parameter: Lower limit of the percentage of slow pages
Valid value: number
Description: The lower limit of the unacceptable number of slow operations.

Parameter: Lower limit of the percentage of users receiving slow pages
Valid value: number
Description: The lower limit of the unacceptable number of users experiencing the slow operations. If this number is greater than 0, the number of users experiencing slow operations is checked.
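The parameter descriptions above follow a recurring pattern: when a normal (baseline) value is known, the current value must exceed both the scaled normal value and the lower limit; when it is not known, the alternative limit is used instead. The following Python sketch shows that pattern for a single metric. It is an illustration only. The function name, the treatment of the coefficient as a plain multiplier, and all numeric values are assumptions, and the sketch does not model how the detector combines the loss rate, RTT, and slow page checks.

```python
from typing import Optional


def is_unacceptable(
    current: float,
    normal: Optional[float],
    coefficient: float,
    lower_limit: float,
    alternative_limit: float,
) -> bool:
    """Check one metric against the limits described above.

    `normal` is None when no normal (baseline) value is known.
    Treating `coefficient` as a multiplier of the normal value is an
    assumption of this sketch, not a statement about the product.
    """
    if normal is not None:
        # Normal value known: both limits must be exceeded.
        return current > coefficient * normal and current > lower_limit
    # Normal value not known: fall back to the alternative limit.
    return current > alternative_limit


# Usage with made-up values (loss rate in percent).
known_and_high = is_unacceptable(6.0, 2.0, 2.0, 5.0, 10.0)
known_but_below_limit = is_unacceptable(4.5, 2.0, 2.0, 5.0, 10.0)
unknown_normal = is_unacceptable(12.0, None, 2.0, 5.0, 10.0)
```

With these made-up values, the second call fails the check because 4.5 exceeds the scaled normal value (4.0) but not the lower limit (5.0).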
LOSS_RATE
This alert is triggered when the loss rate reported for a site is excessive.

Characteristics
Name: Loss rate too high for site
Type: performance

Status (default): disabled
Detector: SQL-based

Message
Server loss rate, client RTT, and number of slow pages unacceptable for the site site. Values of the loss rate - loss_rate: percentage%, current: percentage%. Values of the RTT - RTT: RTT ms, current: RTT ms. The number of pages - slow: slow_pages, whole: slow_pages. The number of active site users: users.

Detector parameters
Overriding values can be specified in the <install_dir>\config\alarmdetectorparams-rtm.config configuration file.

Parameter: Multiplier of normal values of the loss rate
Valid value: number
Description: A factor scaling normal values of the upstream and the downstream loss rate for a site. The scaled normal values constitute the lower limits of unacceptable values of the current upstream and downstream loss rates. If the factor equals 0, comparisons to normal values are omitted.

Parameter: Lower limit of the unacceptable upstream loss rate
Valid value: percentage as number
Description: A lower bound of unacceptable values of the current upstream loss rate for a site.

Parameter: Lower limit of the unacceptable downstream loss rate
Valid value: percentage as number
Description: A lower bound of unacceptable values of the current downstream loss rate of a site. Both limits (the one defined by coefficient {0} combined with the normal values and the ones supplied in parameters {1} or {2}) must be exceeded for the alert to be generated.

LOW_OPER_4_CAP_MOD
This alert is triggered by problems with memory utilization, average time of processing data files, or nightly task execution time. The alert is raised only once each time the threshold for any of the three capacity criteria is exceeded. If more than one threshold is exceeded at the same time, the alert is still triggered only once, but the alert message reports all existing alert causes.

Characteristics
Name: Capacity problems of reporting server module
Type: diagnostics
Status (default): enabled

Detector: built-in, non-sql

Message
Capacity problem detected for the reporting server module: module_name, sub-module: module_name. Capacity problem cause: cause.

Detector parameters
This alert has no parameters.

LOW_OPER_4_SYS_MOD
This alert is triggered by problems with different report server subsystems. This alert appears in Central Analysis Server and Advanced Diagnostics Server. For a list of all the module and sub-module names for your report server installation, see Tools > Diagnostics > System Status.

Characteristics
Name: Inoperability of reporting server module
Type: diagnostics
Status (default): enabled
Detector: built-in, non-sql

Message
Inoperability detected for reporting server module: module_name, sub-module: module_name. Inoperability cause: cause.

Detector parameters
This alert has no parameters.

METRIC_ALM_2
This alert is raised if the value of the application performance metric for an application drops below a specified threshold.

Characteristics
Name: Application performance of an application (front-end)
Type: generic performance
Status (default): enabled
Detector: built-in, metric-based

Message
Application performance of application Application (front-end only) has dropped below Threshold%. Current value of application performance: Metric_value%. Number of affected users: Auxiliary_metric_value.

Detector parameters

Parameter: Threshold
Valid value: number as percentage
Description: Threshold describing the minimum acceptable application performance for an application. Default: 80%.

Parameter: Auxiliary threshold
Valid value: number
Description: The alert is not raised unless performance falls below the performance threshold and at least this many users are affected by poor performance. Default value: 5.

For information on modifying the detector settings and on filtering, see Configuring Trigger Conditions for Alerts [p. 26].

METRIC_ALM_3
This alert is raised if the value of the application performance metric for a transaction drops below a specified threshold.

Characteristics
Name: Application performance of a transaction (front-end)
Type: generic performance
Status (default): enabled
Detector: built-in, metric-based

Message
Application performance of transaction Transaction (front-end only) has dropped below Threshold%. Current value of application performance: Metric_value%. Number of affected users: Auxiliary_metric_value.

Detector parameters

Parameter: Threshold
Valid value: number as percentage
Description: Threshold describing the minimum acceptable application performance for a transaction. Default: 80%.

Parameter: Auxiliary threshold
Valid value: number > 0
Description: The alert is not raised unless performance falls below the performance threshold and at least this many users are affected by poor performance. Default value: 5.
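The interaction between Threshold and Auxiliary threshold in the metric-based alerts can be sketched as follows. This is a minimal Python illustration, not the detector's implementation. The function and argument names are hypothetical, and the default values are the ones documented above for METRIC_ALM_3.

```python
def should_raise(
    metric_value_pct: float,
    affected_users: int,
    threshold_pct: float = 80.0,   # documented default for Threshold
    auxiliary_threshold: int = 5,  # documented default for Auxiliary threshold
) -> bool:
    """Return True only when both documented conditions hold.

    The metric must have dropped below the threshold, and at least
    `auxiliary_threshold` users must be affected.
    """
    below_threshold = metric_value_pct < threshold_pct
    enough_users = affected_users >= auxiliary_threshold
    return below_threshold and enough_users


# Usage with made-up measurements.
raised = should_raise(75.0, 5)
too_few_users = should_raise(75.0, 4)
metric_ok = should_raise(85.0, 50)
```

The auxiliary condition keeps a poor metric value that affects only a handful of users from raising the alert.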

For information on modifying the detector settings and on filtering, see Configuring Trigger Conditions for Alerts [p. 26].

METRIC_ALM_4
This alert is raised if the value of the availability metric for an application drops below a specified threshold.

Characteristics
Name: Availability of an application (front-end)
Type: generic performance
Status (default): enabled
Detector: built-in, metric-based

Message
Availability of application Application (front-end only) has dropped below Threshold%. Current value of availability: Metric_value%. Number of affected users: Auxiliary_metric_value.

Detector parameters

Parameter: Threshold
Valid value: number as percentage
Description: Threshold describing the minimum acceptable availability of an application. Default: 95%.

Parameter: Auxiliary threshold
Valid value: number > 0
Description: The alert is not raised unless availability falls below the availability threshold and at least this many users are affected by poor availability. Default value: 5.

For information on modifying the detector settings and on filtering, see Configuring Trigger Conditions for Alerts [p. 26].

METRIC_ALM_5
This alert is raised if the value of the network performance of a site drops below a specified threshold.

Characteristics
Name: Network performance of a site
Type: generic performance
Status (default): enabled

Detector: built-in, metric-based

Message
Network performance of site Site has dropped below Threshold%. Current value of network performance: Metric_value%. Number of affected users: Auxiliary_metric_value.

Detector parameters

Parameter: Threshold
Valid value: number as percentage
Description: Threshold describing the minimum acceptable network performance of a site. Default: 50%.

Parameter: Auxiliary threshold
Valid value: number > 0
Description: The alert is not raised unless performance falls below the performance threshold and at least this many users are affected by poor performance. Default value: 5.

For information on modifying the detector settings and on filtering, see Configuring Trigger Conditions for Alerts [p. 26].

NEW_APP
If a new software service is detected on a number of servers, this alert returns the first detected server.

Characteristics
Name: New software service detected
Type: new objects
Status (default): disabled
Detector: SQL-based

Message
New software service software_service_name has been detected. One of the servers on which the software service was detected was: IP_address.

Detector parameter
This alert has no parameters.

NEW_SERVER
If a server hosts several known software services, this alert returns the first detected software service.

Characteristics
Name: New server detected
Type: new objects
Status (default): disabled
Detector: SQL-based

Message
New server IP_address has been detected. One of the software services which was detected on the server was: software_service_name.

Detector parameter
This alert has no parameters.

NEW_SERVICE
This alert is triggered if a new software service appears on a server and some traffic to this service is detected.

Characteristics
Name: New service detected
Type: new objects
Status (default): disabled
Detector: SQL-based

Message
A new service has been detected. Software service software_service_name was used for connecting to server IP_address.

Detector parameter
This alert has no parameters.

NEW_USER
This alert is triggered if a new user registers on the monitored network and starts using a software service.

Characteristics
Name: New user detected
Type: new objects
Status (default): disabled

Detector: SQL-based

Message
New user user (IP_address) connected to: software_service_name running on server: IP_address (DNS_name). Usage: usage B. User agent: agent.

Detector parameter
This alert has no parameters.

NEW_WORKSTATION
This alert is triggered when a new client IP address is detected.

Characteristics
Name: New workstation detected
Type: new objects
Status (default): disabled
Detector: SQL-based

Message
New workstation workstation (IP_address) has been detected.

Detector parameter
This alert has no parameters.

OP_GAP_4_SRV
This alert is intended to trace software services for which insufficient operations have been observed.

Characteristics
Name: Insufficient number of operations for a software service
Type: performance
Status (default): disabled
Detector: built-in

Message
Software service {1} provided not enough operations. Number of operations: {2} and of service clients: {3}.

Detector parameters

Parameter: Upper limit of operations
Valid values: number
Description: The upper limit of insufficient values of the number of operations. The default value is 100.

PAGE_LOAD
This alert is triggered when at least one client-bound operation load time is greater than a predefined threshold.

Characteristics
Name: Long operation time for URL
Type: performance
Status (default): enabled
Detector: SQL-based

Message
Long page load detected for URL URL addressed to server IP_address (DNS_name) via software service software_service_name and user agent agent. Page load time values - affected: time ms, total time ms. Number of affected users: number_of_users (percentage %).

Detector parameter

Parameter: Threshold [ms]
Valid value: number
Description: The operation time threshold. Default: (milliseconds).

RBAND
This alert is triggered when the realized bandwidth is too low for a site.

Characteristics
Name: Realized bandwidth too low for a site
Type: performance
Status (default): disabled
Detector: SQL-based

Message
Realized bandwidth too low for site: site_name.

Normal is: real_bndw bps, currently: real_bndw bps, active users: client_count

Detector parameters
Overriding values can be specified in the <install_dir>\config\alarmdetectorparams-rtm.config configuration file.

Parameter: Comparison toggle [0/1]
Valid value: 0 or 1
Description: The comparison method selector (default 0). If it equals 0, the realized bandwidth for a site, scaled by parameter Normal multiplier/threshold, is compared to the normal value of the realized bandwidth. Otherwise, the current realized bandwidth is compared to parameter Normal multiplier/threshold.

Parameter: Normal multiplier/threshold
Valid value: number
Description: If Comparison toggle [0/1] equals 0, then Normal multiplier/threshold is a multiplier for the normal value of the site realized bandwidth (by default 0.9). If Comparison toggle [0/1] equals 1, Normal multiplier/threshold is the upper limit of unacceptable values of realized bandwidth.

SERVC_PERF
This alert is triggered when the average server time exceeds a predefined threshold value.

Characteristics
Name: High server time for service
Type: performance
Status (default): disabled
Detector: SQL-based

Message
The server time of the service running on server IP_address (server_name) via software service software_service_name was too high. Server time values - current: server_time ms, baseline: baseline_server_time ms. The number of users connecting to the service: number_of_users.

Detector parameters
All three conditions have to be satisfied for the alert to be generated: the condition defined by the coefficient Baseline multiplier combined with the normal value, the condition based on parameter Noise threshold [ms], and the condition based on parameter Baseline threshold [ms].

Parameter: Baseline multiplier
Valid value: number
Description: The factor for scaling the normal value of the average server time (by default, 2). The scaled value constitutes the lower limit of unacceptable values of the server time.

Parameter: Noise threshold [ms]
Valid value: number
Description: The lower limit of unacceptable values of the server time. The limit is used to eliminate services with the server time too low to be of concern. Default: 10 (milliseconds).

Parameter: Baseline threshold [ms]
Valid value: number
Description: The lower limit of the normal value for it to be used in comparisons. Default: 5 (milliseconds).

SRV_ERR_GROW_4_HTTP_REQS
This alert reports HTTP services for which an increase in the number of errors has been observed. An HTTP service is a network station that offers a set of resources over HTTP. This alert is triggered when the number of HTTP 5xx errors observed in a specified interval and for a particular HTTP service attains an unacceptable level in comparison to a baseline (normal) value.

Characteristics
Name: HTTP server errors unacceptable for URL
Type: anomalies
Status (default): enabled
Detector: SQL-based

Message
Abnormal traffic volume for software service software_service_name (protocol) on port port. Volume values - current: volume B, reference_type: reference_value B. The increment of the volume relative to the reference_type: excess_volume_percentage%. The number of users using the software service: number_of_users.

Detector parameters

Parameter: Multiplier of normal values of the number of server errors [%]
Valid values: number as percentage
Description: The coefficient used for scaling the normal value of the number of HTTP 5xx errors. The scaled normal constitutes an outer lower-bound of the set of unacceptable numbers of errors. It is expressed in percent. Default: 200.

Parameter: Lower limit of the unacceptable error number
Valid values: number
Description: The noise threshold representing an auxiliary outer lower-bound of the above set. It is used when the normal value is known and thus it guarantees that the number of errors is significant. Default: 20.

Parameter: Independent lower limit of the unacceptable error number
Valid values: number
Description: An independent threshold, representing an outer lower-bound of the set of unacceptable numbers of errors. This is the value with which the number of errors is compared, when the normal value is not known. Default: 50.

Parameter: Global lower limit of the unacceptable error number
Valid values: number
Description: A common threshold representing an outer lower-bound of the set of unacceptable numbers of errors. This is used when the previous criteria are not satisfied. Default:

SSL_APPL_INOPER
This alert is triggered when the SSL connection setup time for a software service on a server is too long.

Characteristics
Name: SSL connection setup time too long for SSL service
Type: performance

Status (default): enabled
Detector: SQL-based

Message
The SSL connection setup time of software service software_service_name on server IP_address (server_name) was too long. The setup time value: time ms.

Detector parameters
Overriding values can be specified in the <install_dir>\config\alarmdetectorparams-rtm.config configuration file.

Parameter: Selector of the SSL connection setup time (0 - per session / 1 - per page)
Valid value: 0 or 1
Description: Toggle defining which metric is used for analysis. If the parameter equals 0, the current plain SSL connection setup time is used; otherwise, the current transaction-weighted SSL connection setup time is used.

Parameter: Lower limit of the unacceptable setup time [ms]
Valid value: number
Description: Lower limit of unacceptable values of the SSL setup time, expressed in milliseconds.

SUSP_CLI_TRAFF
This alert is triggered when a specific user registers from multiple IP addresses during a single monitoring interval or a defined period of activity.

Characteristics
Name: Multiple IP addresses used by a user
Type: anomalies
Status (default): disabled
Detector: built-in, non-sql

Message
Multiple IP addresses used by user user_name. The total IP addresses used by the user: number_of_ips in last number_of_minutes minutes.

IMPORTANT
This alert does not track the activity of clients that use IPv6 addresses.

Detector parameters

Parameter: Activity threshold (number of IP addresses)
Valid values: number
Description: The number of IP addresses to trigger the alert. Default: 5.

Parameter: Activity timeout (minutes)
Valid values: number
Description: The number of additional minutes over which the condition is measured. The condition is measured over the length of one monitoring interval plus the number of minutes specified here, rounded down to an integer number of monitoring intervals. Default: 0. The default value means that the condition is measured over single monitoring intervals. Entering a value of, for example, 7 and assuming that the monitoring interval is configured to 5 minutes, would cause an additional 5 minutes to be added to the time over which the condition is measured.

SUSP_URL_TRAFF
This alert is triggered when abnormal traffic for a software service user is detected for a specified URL.

IMPORTANT
This alert heavily degrades report server performance. Create only a limited number of such alerts.

Characteristics
Name: Abnormal URL traffic for software service user
Type: anomalies
Status (default): enabled
Detector: built-in, non-sql

Message
Abnormal URL traffic for user IP_address (name), software service software_service_name. The total of URLs requested by the user: number_of_urls. The percentage of the restricted URLs: percentage_restricted_urls%. The distribution of the restricted URLs: distribution_restricted_urls.

IMPORTANT
This alert does not track the activity of clients that use IPv6 addresses.

Detector parameters

Parameter: Lower limit of the unacceptable number of URLs
Valid values: number
Description: The lower limit of unacceptable values of the number of hits. Default: 100.

Parameter: Lower limit of the unacceptable number of restricted URLs [%]
Valid values: number
Description: The lower limit of the ratio of URL traffic to restricted resources and URL traffic to all resources. Default: 50%.

Parameter: Restricted URLs
Valid values: string
Description: The set of URLs representing the restricted resources. The set is composed of URLs separated by space character.

Parameter: Toggle [0/1] to show the distribution of the restricted URLs
Valid values: number
Description: The default value of 0 does not show the distribution of the restricted URLs.

SVR_TIME_4_URL
This alert detects performance problems with a specified URL pattern. The alert is triggered when the average server time per URL exceeds the threshold value and if the number of such page loads is greater than a specified threshold.

Characteristics
Name: High server time for URL pattern occurred more than N times
Type: performance
Status (default): disabled
Detector: SQL-based

Message
URL Pattern: URL_pattern
Metric: Server Time
Threshold: threshold ms
Description: at least one server time of number downloads lasted more than time_threshold ms. The mean server time for downloads downloads was time ms.
Summary Statistics: URL

Detector parameters
Overriding values can be specified in the <install_dir>\config\alarmdetectorparams-rtm.config configuration file. Syntax of configuration is as follows:

SVR_TIME_4_URL String: "URL_pattern" = threshold_in_ms,number_of_occurrences_per_pattern
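The syntax above can be read mechanically. The following Python sketch parses one such line into its three fields. The regular expression, the tolerance for extra whitespace, the class and function names, and the sample URL pattern in the usage note are assumptions made for illustration; how the report server itself parses the file is not described here.

```python
import re
from typing import NamedTuple


class UrlThreshold(NamedTuple):
    url_pattern: str       # text between the double quotes
    threshold_ms: int      # server time threshold, in milliseconds
    min_occurrences: int   # number of occurrences per pattern


# Mirrors: SVR_TIME_4_URL String: "URL_pattern" = threshold_in_ms,number_of_occurrences_per_pattern
_LINE_RE = re.compile(
    r'^\s*SVR_TIME_4_URL\s+String:\s*"(?P<pattern>[^"]*)"'
    r'\s*=\s*(?P<ms>\d+)\s*,\s*(?P<count>\d+)\s*$'
)


def parse_svr_time_4_url(line: str) -> UrlThreshold:
    """Parse a single SVR_TIME_4_URL override line (hypothetical helper)."""
    match = _LINE_RE.match(line)
    if match is None:
        raise ValueError(f"not a SVR_TIME_4_URL entry: {line!r}")
    return UrlThreshold(match["pattern"], int(match["ms"]), int(match["count"]))
```

For instance, a line with the made-up pattern "http://www.example.com/checkout" followed by = 20000,200 would yield a threshold of 20000 ms and 200 occurrences.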

Example

SVR_TIME_4_URL String: " = 20000,200

TFC_LVL
This alert is triggered when the traffic level for a server is out of bounds.

Characteristics
Name: Traffic limit exceeded for a server
Type: anomalies
Status (default): disabled
Detector: SQL-based

Message
Traffic level out of bounds for server IP_address (server_name). Traffic level values - current: traffic_current bps, normal: traffic_normal bps. The number of users connecting to the server: number_of_users.

Detector parameters
Overriding values can be specified in the <install_dir>\config\alarmdetectorparams-rtm.config configuration file only for servers that use IPv4 addresses.

Parameter: Multiplier of the normal traffic
Valid values: number
Description: The factor scaling the normal traffic level. The scaled level constitutes the lower limit of unacceptable values of traffic. Default: 10.

Parameter: Threshold of the volume [B]
Valid values: number
Description: The lower limit of unacceptable values of traffic volume. The limit is used to eliminate servers for which the volume is too low to cause concern. Default: B, that is, 100 kb.

TFC_SUSP
This alert is triggered when the average packet size differs too much from the normal (baseline) value. The alert is raised if the traffic normal average packet size is greater than its specified lower limit, and the current average packet size is smaller than its specified upper limit, while, at the same time, the current server traffic volume is greater than a predefined threshold.

Characteristics
Name: High traffic volume and big average packet size for a server

Type: anomalies
Status (default): disabled
Detector: SQL-based

Message
Suspicious traffic characteristics on server: IP_address (server_name). Mean packet size values - current: packet_size B, normal: packet_size B. The number of users connecting to the server: number_of_users.

Detector parameters
Overriding values can be specified in the <install_dir>\config\alarmdetectorparams-rtm.config configuration file only for servers that use IPv4 addresses.

Parameter: Threshold of the normal mean packet size [B]
Valid values: number
Description: The lower limit of unacceptable values of the normal value of average packet size. The limit is used to eliminate servers for which the average packet size is not significant. Default value: 100 (bytes).

Parameter: Threshold of the mean packet size [B]
Valid values: number
Description: The upper limit of unacceptable values of the average packet size. The limit is used to eliminate servers for which the average packet size is too large to cause concern. Default value: 70 (bytes).

Parameter: Threshold of the volume [B]
Valid values: number
Description: The lower limit of unacceptable values of the traffic volume. The limit is used to eliminate servers for which volume is too low to be of concern. Default: bytes, that is, 3 MB.

TRANSMETRIC_ALM_2
This alert is raised if the value of the application performance metric for an application (synthetic traffic, active monitoring) drops below a specified threshold.

Characteristics
Name: Application performance of an application (synthetic traffic, active monitoring)
Type: performance
Status (default): enabled
Detector: built-in, metric-based

Message
Application performance of application Application (synthetic traffic, active monitoring) has dropped below Threshold%.

Current value of application performance: Metric_value%. Number of slow transactions: Auxiliary_metric_value.

Detector parameters

Parameter: Threshold
Valid value: number as percentage
Description: Threshold describing the minimum acceptable application performance level for an application. Default: 95%.

For information on modifying the detector settings and on filtering, see Configuring Trigger Conditions for Alerts [p. 26].

TRANSMETRIC_ALM_3
This alert is raised if the value of the availability metric for an application (synthetic traffic, active monitoring) drops below a specified threshold.

Characteristics
Name: Availability of an application (synthetic traffic, active monitoring)
Type: performance
Status (default): enabled
Detector: built-in, metric-based

Message
Availability of application Application (synthetic traffic, active monitoring) has dropped below Threshold%. Current value of availability: Metric_value%. Number of failed transactions: Auxiliary_metric_value.

Detector parameters

Parameter: Threshold
Valid value: number as percentage
Description: Threshold describing the minimum acceptable availability level for an application. Default: 95%.

For information on modifying the detector settings and on filtering, see Configuring Trigger Conditions for Alerts [p. 26].

TRANSMETRIC_ALM_4
This alert is raised if the value of the application performance metric for a site (synthetic traffic, active monitoring) drops below a specified threshold.

Characteristics
Name: Application performance of a site (synthetic traffic, active monitoring)
Type: performance
Status (default): enabled
Detector: built-in, metric-based

Message
Application performance of site Site (synthetic traffic, active monitoring) has dropped below Threshold%. Current value of availability: Metric_value%. Number of slow transactions: Auxiliary_metric_value.

Detector parameters

Parameter: Threshold
Valid value: number as percentage
Description: Threshold describing the minimum acceptable application performance level for a site. Default: 80%.

For information on modifying the detector settings and on filtering, see Configuring Trigger Conditions for Alerts [p. 26].

TRANSMETRIC_ALM_5
This alert is raised if the value of the availability metric for a site (synthetic traffic, active monitoring) drops below a specified threshold.

Characteristics
Name: Availability of a site (synthetic traffic, active monitoring)
Type: performance
Status (default): enabled
Detector: built-in, metric-based

Message
Availability of site Site (synthetic traffic, active monitoring) has dropped below Threshold%. Current value of availability: Metric_value%. Number of slow transactions: Auxiliary_metric_value.

Detector parameters

Parameter: Threshold
Valid value: number as percentage
Description: Threshold describing the minimum acceptable application availability level for a site. Default: 80%.

For information on modifying the detector settings and on filtering, see Configuring Trigger Conditions for Alerts [p. 26].

URL_RESP_EFF
This alert is triggered when the server time and the number of slow operations (pages) exceed thresholds.

Characteristics
Name: Server time and number of slow operations unacceptable for URL
Type: performance
Status (default): enabled
Detector: SQL-based

Message
Server time and number of slow pages unacceptable for URL URL requested on server IP_address via software service software_service_name. Values of the server time - type: normal_server_time ms, current: current_server_time ms. The number of pages - slow: slow_pages, whole: pages. The number of active users requesting the URL: users.

Detector parameters

Parameter: Relative increment of normal values of the server time
Valid value: percentage as number
Description: An increment added to 100% (that is, to one unit) to calculate the scaled normal value of the server time. This scaled normal value is used as the outer lower bound of an unacceptable server time. The increment is expressed in percent. Default: 100.

Parameter: Lower limit of the unacceptable server time [ms]
Valid value: number
Description: The auxiliary outer lower bound of unacceptable server times. It is used when the normal value of the server time is known and thus it guarantees that the value of the server time is significant. It is expressed in milliseconds. Default:

Parameter: Alternative lower limit of the unacceptable server time [ms]
Valid value: number
Description: This threshold is exceeded when the server time is greater than the value set. It is used when the normal value of the server time is not known. It is expressed in milliseconds. Default: 100.

Parameter: Lower limit of the percentage of slow pages
Valid value: percentage as number
Description: This threshold is exceeded when more than this percent of all operations are slow. Default: 20.

Parameter: Lower limit of the percentage of users receiving slow pages
Valid value: percentage as number
Description: This threshold is exceeded when more than this percent of all users of this service are receiving slow operations. The threshold is evaluated when greater than zero. Default: 0.

USER_AVAILABILITY
This alert is triggered when a user has connectivity problems (Connection Refused or Connection Establishment Timeout errors) and no bytes have been transferred.

Characteristics
Name: Service availability problem for a user
Type: performance
Status (default): disabled
Detector: SQL-based

Message
Availability problem: user user_name connected to: software_service_name running on server: IP_address (DNS_name). User agent: user_agent. Values of TCP errors - refused: connection_refused, connection establishment timeout: connection_timeout.

Detector parameters
This alert has no parameters.

VPN_DROP_OFF
This is a template to be used as a basis for custom metric alerts.

Characteristics
Name: VPN gateway drop off
Type: performance
Status (default): disabled
Detector: built-in, non-sql

Message
By default, the VPN_DROP_OFF message template is empty. You have to create the message based on your needs. The following message template could be used as a starting point:

Threshold has been exceeded for URL: URL Reporting Group: repgroup, Server: IPaddress (DNSServer), Location: location, Metric value: metval. Threshold settings: threshsett, Auxiliary metric: auxmetric.

Detector parameters
For more information, see Configuring Trigger Conditions for Alerts [p. 26].


Glossary

The following glossary contains definitions of terms used across the DC RUM documentation. For definitions of metrics provided by DC RUM in DMI data views, see Central Analysis Server Data Views in the Data Center Real User Monitoring Central Analysis Server User Guide.

alert
An event notification generated by the report server when certain predefined events occur or when selected parameters related to user sessions, applications, and server activity reach predefined threshold levels.

All other
The object classification assigned to all clients who have not been assigned to an explicit site.

analyzer
Software component provided by Dynatrace to perform monitoring and traffic analysis. The report server uses analyzers to monitor operations for specific software services based on popular protocols, such as HTTP, provided that the underlying transport protocol is TCP, or UDP only in case of DNS-based software services. The report server can also analyze and report statistical information on non-transactional UDP-based or IP-based protocols. For more information, see Concept of Protocol Analyzers in the Data Center Real User Monitoring Administration Guide.
Synonyms: decode

application
In DC RUM reports, a universal container that can accommodate transactions. Each application can contain one or more transactions. For more information, see Managing Business Units in the Data Center Real User Monitoring Administration Guide.

area
In the context of the DC RUM report server, a collection of sites. An area has the same properties as a site, but refers to a larger entity. Areas cannot overlap. Any given site can belong to one and only one area. See also site and region.

bandwidth usage
A measurement calculated as the number of bits transferred during a specific time interval divided by the time interval. This measurement does not take into account factors such as inactive periods when the application was not attempting to transfer data, or transmission loss rate.

baseline data
Data from the last several days (usually nine days) aggregated into one average or typical day. Baselines are necessary for considering the variations in traffic on different days of the week, random anomalies in traffic load, or to compare traffic with a known baseline from a specific point in time.
Baseline data is generated once a day after the arrival of data from the first monitoring interval after 00:10 am (in the background). Baseline data is not averaged over the day within each day and therefore may vary rapidly depending on the time of day just as monitored data would. Each monitoring interval is assigned the value averaged over the nine-day period for this specific monitoring interval.
Requesting baseline data for Yesterday will yield the same results as requesting baseline data for Today, because baseline data for yesterday will still be calculated over the last nine days counting from today.

Class of Service (CoS)
The name identifying a Type of Service value.
The mapping of Class of Service names to different values of Type of Service is defined in the report server configuration. See also Type of Service.

client
In the context of the DC RUM report server, the IP address of a user. Users can be identified by their IP addresses or in a number of other ways, such as by HTTP cookie contents or VPN login names.

client internal IP address
Term used by the report server in relation to virtual private networks where external users of the network appear inside the network under different (internal) IP addresses.

custom metric
A user-defined metric that extracts values from HTTP or XML requests (for example, HTML pages or SOAP messages). Each custom metric can be displayed as a sum of values or as their average. The sum metrics can be used to trace users or resources that use the most or least resources (for example, clients who make the largest money transfers in a bank or purchase large quantities of items from an online bookstore). The average metrics can help in observing trends. For information on defining custom metrics, refer to the RUM Console Online Help.

custom tier
A tier that can be modified by a user. See also tier.

decode
A synonym for analyzer.

Default Data Center site
The classification for any server that has not been assigned to an explicit site.

downstream
In the context of the report server, the direction of traffic to a given region, area, site, or host.

front-end tier
In a user-defined configuration, the system architecture layer that is closest to the end user. See also tier.

host
A system component that participates in data exchange. A host can be either a server or a client machine, depending on the context and the direction of the monitored traffic.

local
The specified site for which the report server is displaying data.
Local and remote are defined in the context of a particular site, area or region. When displaying data about a specified site, area or region, the report server refers to the site as local and to other sites as remote. If a report contains sections that focus on data from different sites, each site in turn will be designated as local.

monitoring interval
In the context of Global Configuration of the report server, the length of the shortest individual traffic-monitoring period. This period is usually a short interval of a few minutes. The latest values in a report are from the last closed monitoring interval, that is, from the last traffic-monitoring period. The monitoring interval is not the total time interval covered by the report.

monitored session
The session identified by application, server IP address, client IP address, and operation.

normal value
A baseline value collected based on the last several days (usually nine) and aggregated to calculate a typical value of a measure. For more information, see baseline data [p. 164].
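As a rough illustration of how a normal value is derived from baseline data, the Python sketch below averages each monitoring interval across several days of history, following the per-interval averaging described in the baseline data entry. The data layout (one list of per-interval values for each day) and the function name are assumptions; the report server's own aggregation is not specified in more detail here.

```python
from statistics import mean
from typing import List, Sequence


def baseline_day(history: Sequence[Sequence[float]]) -> List[float]:
    """Build one 'typical day' from several days of measurements.

    history[d][i] is the value measured on day d in monitoring interval i.
    Every day must contain the same number of monitoring intervals.
    """
    if not history:
        raise ValueError("at least one day of history is required")
    intervals = len(history[0])
    if any(len(day) != intervals for day in history):
        raise ValueError("all days must have the same number of intervals")
    # Average interval i over all days; values are not averaged within a day.
    return [mean(day[i] for day in history) for i in range(intervals)]
```

Because each interval is averaged only against the same interval on other days, the resulting baseline still varies over the course of the day, as the baseline data entry notes.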

network ID
The unique identifier assigned to a user for logging in to the network. Depending on the report server configuration, the network ID may be an IP address, HTTP authorization ID, HTTP cookie-based ID, a VPN ID, or static user name mapping.

network performance
The percentage of total traffic that did not experience network-related problems (traffic in which the values of loss rate and RTT did not exceed configured thresholds). For more information, see Network Performance Calculations in the Data Center Real User Monitoring Central Analysis Server User Guide.

not monitored TCP
TCP traffic that is not associated with a monitored application. This term is related to smart application monitoring. If smart application monitoring is enabled, application session information captured and reported by the AMD is not stored immediately in the report server database; it has to meet smart application monitoring thresholds before it is stored.

not monitored UDP
UDP traffic that is not associated with a monitored application. This term is related to smart application monitoring. If smart application monitoring is enabled, application session information captured and reported by the AMD is not stored immediately in the report server database; it has to meet smart application monitoring thresholds before it is stored.

Privacy Enhanced Mail (PEM)
Base64 encoded DER certificate, enclosed between -----BEGIN CERTIFICATE----- and -----END CERTIFICATE-----

protocol
In the context of the report server, layer 4 protocols according to the OSI model. The report server recognizes UDP and TCP-based protocols.

realized bandwidth
The actual transfer rate of application data when the transfer attempt occurred. This measurement takes into account factors such as loss rate and retransmission. The realized bandwidth is calculated as the size of the actual transfer divided by the transfer time.
This metric reflects transient conditions on the network during the times when the transfer occurred. When the metric is averaged over a longer time interval, the average value is calculated only for those time sub-intervals in which actual data transfer attempts took place.

region
In the context of the report server, a collection of areas. A region has the same properties as an area, but refers to a larger entity. Regions cannot overlap. Any given area can belong to one and only one region. See also area and site.

remote
A site other than the specified site for which the report server is displaying data.

Local and remote are defined in the context of a particular site, area or region. When displaying data about a specified site, area or region, the report server refers to the site as local and to other sites as remote. If a report contains sections that focus on data from different sites, each site in turn will be designated as local.

report server
A common name for Central Analysis Server (CAS) or Advanced Diagnostics Server (ADS). The report server is the part of Data Center Real User Monitoring responsible for measurement data processing, storage, and report generation. It connects to one or more AMDs and processes the measurement data into a relational database of measurements. The database is then used to serve interactive reports to the Data Center Real User Monitoring system user.

reporting group
A universal container that can accommodate software services, servers, operations, or any combination of these. Reporting groups can contain software services of every type but they were designed especially for HTTP-based services.

Riverbed Steelhead
A third-party appliance based on technology that optimizes the performance of TCP applications operating in a WAN environment. Steelhead combines data streamlining, transport streamlining, and application streamlining to improve WAN traffic performance. The software that runs a Steelhead appliance is called the Riverbed Optimization System (RiOS). Steelhead is generally deployed as a physical or virtual appliance. Mobile and software versions are also available.

server
In the context of the report server, the recipient of a TCP session or request (SYN packet), TCP, or UDP. Servers listen on specified TCP/UDP ports, accept incoming requests, and reply to them. Usually, but not always, a server is a computer running software that offers a service or a number of services on one or more of the computer's ports. Servers are said to host software services. A server is identified by a unique IP address.
This IP address appears on reports, unless the server's name can be resolved by means of a Domain Name Server (DNS), in which case the server's name is used instead.

server from site
The category assigned to application session information that does not meet smart application monitoring thresholds. If smart application monitoring is enabled, application session information captured and reported by AMD is not stored immediately in the report server database. It has to meet smart application monitoring thresholds. Sessions that meet the thresholds are stored under their server IP addresses, while those that do not are stored as server from site.
An example is network scanning by a workstation infected by a virus. Such a workstation will scan a large number of IP addresses. These addresses will not be reported individually, but on a per-site basis.

site
An IP network from which users log in to a monitored network.

A site can be a range of IP addresses set manually, referred to as a class-c IP network; an automatically set class-b network; a range of addresses defined by a customized network mask; or a set of IP networks that is based on the BGP routing table analysis. Sites can be grouped together into areas, which in turn can be grouped together into regions. See also area, region, and All other.

site realized bandwidth
A weighted average of the software service realized bandwidth values for all services accessed from a particular site, weighted by the number of operations.

software service
A service, implemented by a specific piece of software, offered on a TCP or UDP port of one or more servers and identified by a particular TCP port number. Software services are identified on reports by either port numbers or assigned names. It is possible to configure the report server to define software services as services on particular ports of particular servers. In this case, a software service is identified by a combination of port number and server IP address.

synthetic agent
A simulator of user traffic to a given web site. Synthetic agents are designed to measure web site availability and performance. They are usually distributed over a number of different geographical locations. The report server is able to distinguish synthetic traffic from real user traffic.

TCP availability
The percentage of successfully completed connection attempts from the region, area, or site. By default, the measurement algorithm for this metric is based only on traffic that is generated by recognized applications or scanning attempts, which means that not monitored or unknown traffic is not taken into account.

TCP session
A collection of TCP packets exchanged between a given pair of client and server addresses, using a specific server port and client ports.

tier
A specific point where DC RUM collects performance data.
For more information, see Multi-Tier Reporting in the Data Center Real User Monitoring Central Analysis Server User Guide.

time
The report server uses a granular concept of time, where events are recorded as having occurred at the beginning of their monitoring intervals: that is, all events that have occurred during a monitoring interval are time-stamped with the time corresponding to the beginning of that monitoring interval.
If you need to specify time in a report server input field, you should do so according to the format defined in the operating system settings on the report server computer.
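The time-stamping rule in the time entry amounts to rounding each event time down to the start of its monitoring interval. A minimal Python sketch follows. It assumes that monitoring intervals are aligned to midnight and uses a hypothetical 5-minute default; neither assumption is stated in this guide, and the function name is illustrative.

```python
from datetime import datetime, timedelta


def interval_start(event_time: datetime, interval_minutes: int = 5) -> datetime:
    """Return the beginning of the monitoring interval containing event_time."""
    if interval_minutes <= 0:
        raise ValueError("interval length must be positive")
    # Assumption: the first interval of each day starts at 00:00.
    midnight = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    interval = timedelta(minutes=interval_minutes)
    # Count the whole intervals elapsed since midnight and drop the remainder.
    whole_intervals = (event_time - midnight) // interval
    return midnight + whole_intervals * interval
```

Under these assumptions, every event observed between 10:05:00 and 10:09:59 receives the time stamp 10:05:00.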

transaction
Any of the following:
- A single operation, such as a web page load.
- A sequence of operations. DC RUM monitors sequences of web page loads and sequences of XML calls, and it reports both on these sequences as transactions and on individual operations within the transactions.
- Defined collections of non-sequenced operations.
A transaction defines a logical business goal, such as registration in an online store. One or more transactions together constitute an application. A transaction can have only one parent application. Data for a transaction can come from an Enterprise Synthetic agent or an AMD. The same transaction can contain data from different data sources at the same time. However, metrics for each of the data sources are aggregated separately. For more information, see Managing Business Units in the Data Center Real User Monitoring Administration Guide.

Type of Service (ToS)
A traffic identifier contained in an 8-bit field in the IP packet header (comprising a 6-bit Differentiated Services Code Point (DSCP) field and a 2-bit Explicit Congestion Notification (ECN) field). The contents of this field can be detected by the report server and displayed in reports. The use of this field is application-specific: it is used by applications to denote special types of traffic. See also Class of Service.

unknown TCP proto
TCP traffic that has not been recognized as belonging to a particular application. This situation can occur if the traffic is not defined in the Monitoring Configuration as belonging to a particular application, and the traffic has not been classified automatically by the autodiscovery mechanism.

unknown UDP proto
UDP traffic that has not been recognized as belonging to a particular application. This situation can occur if the traffic is not defined in the Monitoring Configuration as belonging to a particular application, and the traffic has not been classified automatically by the autodiscovery mechanism.
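For the Type of Service (ToS) entry above, the following Python sketch shows how the 8-bit field divides into its two parts: the DSCP occupies the six most significant bits and the ECN the two least significant bits. The function name is hypothetical, and the sketch illustrates only the bit layout, not how the report server reads the field.

```python
from typing import Tuple


def split_tos(tos: int) -> Tuple[int, int]:
    """Split an 8-bit ToS value into (dscp, ecn)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in 8 bits")
    dscp = tos >> 2     # upper six bits
    ecn = tos & 0b11    # lower two bits
    return dscp, ecn
```

For example, the ToS byte 0xB8 carries DSCP 46 with the ECN bits cleared.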
upstream
In the context of the report server, the direction of traffic from a given region, area, site or host.

URI
Uniform Resource Identifier. A URI provides a way to identify abstract or physical resources on the World Wide Web. It is a syntax for encoding the names and addresses of objects. The URI is a general form for creating some kind of address. A URL (Uniform Resource Locator) is a specific address used with some protocol such as HTTP or FTP that follows the general URI format. See also URL.

URL
Uniform Resource Locator. The URL provides a standard way of specifying the location of a resource on the Internet: it is an Internet address. Resources are often web pages (HTML documents), but they can also be text or PDF documents, images, downloadable files, services, electronic mailboxes, or many other objects. URLs make resources available under a variety of naming schemes and access methods (such as HTTP, FTP, and ) addressable by one simple, uniform method.

user
Users can be identified by their IP addresses or in a number of other ways, such as by HTTP cookie contents or VPN login names. The term client in the context of report server refers to the IP address of a user. See also client.

user session
A collection of transactions identified by a specific cookie value. A new cookie value sent by the client starts a new user session. A new cookie value issued by the server does not signify the start of a new session.
The report server distinguishes between different user sessions by analyzing HTTP cookie information, that is, the contents of a particular named cookie or, depending on the report server configuration, the contents of all the cookies in HTTP transactions.
For example, a user sends requests with cookie ABCD=1234. In one of the responses, the server changes the value to ABCD=5678. The report server recognizes subsequent requests with cookie value ABCD=5678 as a continuation of the session: no session count is increased.

virtual IP address (VIP)
A network interface that enables users to use IP addresses not directly related to the actual physical hardware. In systems that do not use virtual IP addresses, if an interface fails, any connections to that interface are lost.
With virtual IP addressing on the system and routing protocols within the network providing automatic reroute, recovery from failures occurs without disruption to the existing user connections that are using the virtual interface, as long as packets can arrive through another physical interface.

Virtual Private Network (VPN)
The provision of private voice and data networking from the public switched network through advanced public switches. The network connection appears to the user as an end-to-end circuit, without actually involving a permanent physical connection as in the case of a leased line. VPNs retain the advantages of private networks but add benefits like capacity on demand.
The report server can monitor multiple VPNs. There is no fixed limit to the number of monitored VPNs and remote users; however, the capacity of the monitoring software depends on the overall system performance and on the VPN traffic.

WAN Optimization
A Wide Area Network deployment in which software and network services are optimized through at least two or more WAN Optimization Controllers (WOCs). The goal of WAN optimization is to improve application response time and reduce the required bandwidth over a WAN connection by using a WAN controller on each end of the WAN link.

A WAN Optimizer is deployed on either end of a WAN connection to optimize the traffic sent over the WAN. The WAN Optimizer classifies, prioritizes, and compresses network data, caches network traffic, and streamlines protocols to maximize the performance of a service delivered over a distributed network.
The most common optimization techniques involve:

Transport (TCP) optimization
TCP flow-control round trips are reduced by:
- Fast error recovery
- Mitigated slow-start
- Window scaling
- Pre-established TCP connection pools between the WAN-optimizing appliances

Payload Optimization
The TCP payload is indexed and stored on disk on each side of the WAN:
- Data segments (blocks) are replaced with references to this data
- Byte-level indexing is independent of the application or file

Application Acceleration
Application-specific acceleration is used to reduce application traffic. In Common Internet File System (CIFS) SMB emulation is used:
- By spoofing the CIFS protocol
- By reading ahead and writing behind
Specific modules can be made available from individual vendors for a specific application.

Using a combination of these techniques and setting up the acceleration appliances to act as proxy servers can accelerate end-user experience significantly.

WAN Optimization Controller (WOC)
WAN optimization controllers (WOCs) are physical devices that transparently intercept local network traffic, optimize it, and send the optimized traffic over the WAN link to the receiving controller. On the other side of the WAN, the receiving WOC transparently converts the optimized traffic from the WAN link into normal network traffic.
The typical WAN optimization scenario involves at least two WOCs located between the data center (or a server) and a branch office (or a client).

Wide Area Application Engine (WAE)
A Cisco platform that consists of a portfolio of network appliances that host Cisco WAN optimization and application acceleration solutions. These solutions enable branch-office server consolidation and performance improvements for centralized applications and content across the WAN.

Wide Area Application Services (WAAS)
A Cisco technology that optimizes the performance of TCP-based applications operating in a WAN environment. WAAS combines WAN optimization, optimization of the Transmission Control Protocol (TCP), Data Redundancy Elimination (DRE, also known as de-duplication), and application protocol acceleration in a single appliance or blade. It runs on Wide Area Application Engine (WAE) hardware platforms, including stand-alone appliances and network modules (NME) for the Cisco Integrated Services Routers (ISRs).
