Monitoring Remedy with BMC Solutions

Overview

How does BMC Software monitor Remedy with its own solutions? The challenge is manifold with a solution like Remedy, and it applies not only to Remedy but also to competing solutions and to other web-based enterprise solutions.

Analysis

Performance (or the lack of it) can be attributed to a large variety of factors: some within the software itself, some within the immediate infrastructure, and some within the wider environment, e.g. the internet.

BMC Solutions

As of June 2014, the following stack of solutions can be used for full-stack monitoring of Remedy. Note that these solutions help pinpoint where the issues are (slow performance is usually caused by a combination of issues, a set of bottlenecks) so that they can be addressed; they may not in themselves resolve the issue.

BMC has multiple modules that can monitor Remedy (7.6.04 is the oldest version we monitor). Complete Remedy stack monitoring should include the following:

• From the end-user perspective
  o BMC APM EUEM (web interface only, for http/https traffic); additional watchpoints may be required
  o Borland Silk Performer Synthetic Transaction Monitoring for BMC Software (newer replacement of TM-ART)
• Mid-tier/application tier
  o BMC APM - Application diagnostics
  o BPPM for Internet Servers (monitors the web server, such as Apache, Microsoft IIS etc.)
• Remedy Applications (i.e. Incident Management, Change Management etc.)
  o BMC PATROL Knowledge Module for Remedy AR Server
• Back-end database (Oracle, Sybase, MS SQL Server etc.)
  o BPPM for Databases (monitors all databases that Remedy supports)
• Any network equipment, such as F5 load balancers
  o Entuity or any other network monitoring tool
• Operating system on which Remedy is running (i.e. Windows, Linux, VMs etc.)
  o BPPM for Servers/Virtual Servers (monitors health of the system, like CPU, Memory, Disk, Remedy processes/services, logs)
• Hardware platform, storage
  o BMC Performance Manager for Hardware by Sentry Software
Some Causes of Performance Issues

This is NOT an exhaustive list, but it illustrates the complexity and some of the points that can cause performance issues. The causes are not mutually exclusive; quite probably the reverse, with a combination of them causing the solution to be slow.

• Browser version / type / cache: some browsers, especially older versions, are significantly slower than others. Internet Explorer is slower than Chrome. Caching settings may alter the performance
• LAN / WAN / Internet connectivity is another recurrent source of performance issues
• Server set-up, clustering, number of users per JVM in the mid-tier configuration
• Hardware specification and balancing (memory, CPU, storage)
• Database hardware / configuration: indices, field settings, I/O, values and filters
• Frequency of polling / cron tasks
• Query efficiency
• Default queries pulling too many records at one time
• Reporting requirements and efficiency of queries underlying reports

All of the above could have an impact, and each can be examined in far greater depth in order to resolve performance issues.

The Reality

Before looking at an example of how Remedy is monitored, it is important to understand that there is no single solution, and one that is good today may not be so good tomorrow. Things change, such as (not an exhaustive list):

• Other traffic on the network
• Additional volume of records
• Alterations to configuration
  o With or without change management
  o New / updated reports are written
  o New / updated queries are deployed
  o New functionality deployed
• Archiving may be performed periodically
• Use of new / different browsers / caching settings

Furthermore, even with just two deployments of Remedy in similar organizations, there are enough potential differences in environment and configuration that a monitoring setting that works in environment 1 will not necessarily be as useful in environment 2. Again, differences could be:

• Geographic coverage
• Other network traffic
• Hardware differences (e.g. a different manufacturer's database or version of database is deployed)
• Different settings (SLAs may be heavily deployed in one environment and not so in another)
• Volumes of data transacted as well as stored (archiving may not be taking place)
• Different reports and KPIs configured with more or less efficient queries
• Different levels of automation deployed
• Complexity of the deployment (e.g. approval process)
Levels of notifications

As a consequence, the details below must be viewed as a guideline and no more when determining what to monitor, how to monitor it, and which solution to use. The details apply to deploying BMC Software monitoring solutions, but could probably be adapted to other monitoring solutions. This document covers each type of monitoring at a high level for the Production environment.

AR SERVER MONITORING:

The following OS KM parameters are set to alert when the set thresholds are breached.

Windows OS Monitoring

Parameter                         Major Event  Critical Event  Occurrences  Incident Ticket  Polling Cycle
Logical Disks [Free space %] D:   < 15%        < 10%           Immediate    Yes              2 mins
Logical Disks [Free space %] C:   < 15%        < 10%           Immediate    Yes              2 mins
Memory Used in %                  > 85%        > 95%           11           Yes              2 mins
Total Processor Utilization in %  > 85%        > 95%           9            Yes              2 mins

AR Services Status (Up/down)

Service                                           Major Event  Critical Event
BMC Remedy Action Request System Server onbmc-s   -            Service down
BMC Remedy Flashboards Server - onbmc-s           -            Service down
McAfee Framework Service                          -            Service down
McAfee McShield                                   -            Service down
McAfee Task Manager                               -            Service down
Remote Procedure Call (RPC)                       -            Service down
BMC Remedy Email Engine - onbmc-s 1               -            Service down

Email monitoring using the Email Script, for servers on which the Email Engine is running

Parameter                                                         Critical Event  Occurrences  Incident Ticket  Polling Cycle
Number of emails that have been incorrectly flagged as delivered  > 0             Immediate    Yes              2 mins
Time since oldest pending email                                   > 900 seconds   Immediate    Yes              2 mins
Emails that are pending delivery                                  > 0             Immediate    Yes              2 mins

Other checks

Parameter                    Critical Event  Occurrences  Incident Ticket  Polling Cycle
AR KM Monitoring             Service down    Immediate    Yes              2 mins
LDAP Port Monitoring         Port down       Immediate    Yes              2 mins
TCP Established Connections  > 3000          Immediate    Yes              2 mins
Tomcat SSO                   Down

The Assignment Engine, the Approval Engine and Reconciliation Jobs are monitored on demand.
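To make the relationship between the Major/Critical thresholds and the Occurrences column concrete, here is a minimal Python sketch. It is not how the PATROL KM or BPPM is implemented: the class name ThresholdRule and the rule that "occurrences" means consecutive breaching polls are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThresholdRule:
    """Hypothetical rule: two thresholds plus a required run of breaching polls."""
    major: float
    critical: float
    occurrences: int = 1           # 1 models an "Immediate" alert
    higher_is_worse: bool = True   # set False for free-space style metrics
    _streak: int = field(default=0, init=False, repr=False)

    def _severity(self, value: float) -> Optional[str]:
        # Check the critical threshold first so it wins over major.
        if self.higher_is_worse:
            if value > self.critical:
                return "CRITICAL"
            if value > self.major:
                return "MAJOR"
        else:
            if value < self.critical:
                return "CRITICAL"
            if value < self.major:
                return "MAJOR"
        return None

    def observe(self, value: float) -> Optional[str]:
        """Feed one polling sample; return a severity once enough
        consecutive samples have breached a threshold (assumed semantics)."""
        severity = self._severity(value)
        if severity is None:
            self._streak = 0       # a healthy sample resets the run
            return None
        self._streak += 1
        return severity if self._streak >= self.occurrences else None


# Values taken from the table above.
memory = ThresholdRule(major=85, critical=95, occurrences=11)
disk_d = ThresholdRule(major=15, critical=10, higher_is_worse=False)
```

With a 2-minute polling cycle, a rule requiring 11 occurrences would, under this assumption, fire only after roughly 22 minutes of sustained high memory use, which filters out short spikes.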
In addition, many other monitored parameters are collected and used for analysis.
MID TIER MONITORING:

Individual mid tiers are monitored in the same way as the AR servers, with an additional set of processes and services specific to the Mid-Tier server.

Windows OS Monitoring

Parameter                         Major Event  Critical Event  Occurrences  Incident Ticket  Polling Cycle
Logical Disks [Free space %] D:   < 15%        < 10%           Immediate    Yes              2 mins
Logical Disks [Free space %] C:   < 15%        < 10%           Immediate    Yes              2 mins
Memory Used in %                  > 85%        > 95%           11           Yes              2 mins
Total Processor Utilization in %  > 85%        > 95%           9            Yes              2 mins

Mid-tier Services Status (Up/down)

Service                       Major Event  Critical Event
Apache Tomcat Tomcat6         -            Service down
McAfee Framework Service      -            Service down
McAfee McShield               -            Service down
McAfee Task Manager           -            Service down
Remote Procedure Call (RPC)   -            Service down

Other checks

Parameter                    Critical Event  Occurrences  Incident Ticket  Polling Cycle
TCP Established Connections  > 3000          Immediate    Yes              10 mins
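As an illustration of the TCP Established Connections check, the sketch below counts ESTABLISHED TCP entries in netstat-style text and compares the count with the 3000 limit. This is only an assumption about how such a count could be obtained; the document does not say how the OS KM collects it, and the function names and sample addresses are hypothetical.

```python
def count_established(netstat_output: str) -> int:
    """Count TCP connections in the ESTABLISHED state in netstat-style text."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Accept "TCP" (Windows) and "tcp"/"tcp6" (Linux) as the protocol column.
        if fields and fields[0].lower().startswith("tcp") and "ESTABLISHED" in fields:
            count += 1
    return count


def tcp_alert(netstat_output: str, limit: int = 3000) -> bool:
    """True when the established-connection count exceeds the limit."""
    return count_established(netstat_output) > limit


# Hypothetical sample in the layout printed by "netstat -an" on Windows.
SAMPLE = """\
  TCP    10.0.0.5:8080    10.0.0.9:51234    ESTABLISHED
  TCP    10.0.0.5:8080    10.0.0.7:51300    TIME_WAIT
  TCP    0.0.0.0:135      0.0.0.0:0         LISTENING
  TCP    10.0.0.5:8080    10.0.0.8:49811    ESTABLISHED
"""
```

Only the two ESTABLISHED rows of the sample are counted; connections in TIME_WAIT or LISTENING do not contribute to the threshold.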
Many other parameters are also monitored using the OS KM.
DASHBOARD AND ANALYTICS SERVER MONITORING:

Along with the standard monitoring of the OS, the following services and processes are monitored for the Dashboard servers:

Processes Status (Up/down)

Process                                 Major Event  Critical Event
CIA                                     NA           Process down

Dashboard Services Status (Up/down)

Service                                 Major Event  Critical Event
Apache Tomcat                           NA           Service down
BOE120MySQL                             NA           Service down
BMC Atrium DIL Repository               NA           Service down
McAfee Framework Service                NA           Service down
McAfee McShield                         NA           Service down
McAfee Task Manager                     NA           Service down
Remote Procedure Call (RPC)             NA           Service down
BMC Atrium DIL Server                   NA           Service down
Server Intelligence Agent (onbmc_ada)   NA           Service down

Report Execution (On Demand): a report is configured and executed at regular intervals to identify whether there is an issue with the BO and DB.

F5 LOAD BALANCER MONITORING USING CUSTOMIZED PATROL KM:

With the help of the F5 load balancer KM, we monitor the status of active pool members in F5. If any pool member goes down, an alert is sent to BPPM.

DATABASE MONITORING:

The SQL database is monitored for multiple parameters; the ones below are used for alerting, to keep an eye on the heart of the AR System.

Parameter                      Alarm condition               Alarm  Occurrences  Incident Ticket  Polling Cycle
Suspect Database               Any database                  Yes    Immediate    Yes              4 hours
SQL Server Agent Job Failures  Any job failure               Yes    Immediate    Yes              15 mins
Blocker Procs                  Blocking persists > 30 secs   Yes
SQL Agent Status               Service is down               Yes
SQL Server Status              Service is down               Yes
Cache Hit Ratio                < 90                          Yes
Long Running Trans             > 300 sec                     Yes
Deadlock                       Any deadlock (warning alert)  Yes

Notes:
• Long Running Trans: the alert should contain the session id executing the transaction, along with the user name for the session.
• Deadlock: the alert needs to contain the session ids of all the sessions involved in the deadlock.

Disk Space Monitoring for Databases:

• Maintain 50% free storage space for all production DB servers
• Warning alert threshold set at 25% free space
  o Email, alerting and escalation
• Critical alert threshold set at 20% free space
  o Email, alerting and escalation

TMART MONITORING:

We run a synthetic transaction named HLAL, which goes through Homepage, Login, Application Listing and Logout, to check that the AR application is working and to measure any performance degradation. Availability is checked from within the data centre, and on a case-by-case basis we also run the transactions from remote data centres.
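The shape of the HLAL transaction (four ordered steps, each timed, with availability lost as soon as one step fails) can be sketched as follows. This is an illustrative Python outline, not the TMART or Silk Performer script itself; run_transaction and the placeholder step actions are hypothetical, and a real script would drive a browser or HTTP client in each step.

```python
import time
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_transaction(steps: List[Step]) -> Dict[str, object]:
    """Run the steps in order, timing each one; stop at the first failure."""
    timings: Dict[str, float] = {}
    available = True
    for name, action in steps:
        start = time.perf_counter()
        try:
            ok = action()
        except Exception:
            ok = False             # an exception counts as a failed step
        timings[name] = time.perf_counter() - start
        if not ok:
            available = False
            break                  # later steps depend on earlier ones
    return {"available": available, "timings": timings}


# Placeholder actions; each would perform the real page interaction.
hlal: List[Step] = [
    ("Homepage", lambda: True),
    ("Login", lambda: True),
    ("Application Listing", lambda: True),
    ("Logout", lambda: True),
]
result = run_transaction(hlal)
```

The per-step timings are what a performance rule on login response time would consume, while the available flag corresponds to the availability measurement.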
ALERTING:

Production:

• HLAL: Availability or Accuracy < 100% for 2 consecutive cycles. Requiring two cycles avoids an increase in false alerts caused by a network glitch or by a browser or monitoring-application issue.
• HLAL_perf: Login Response Time > 10 seconds for 5 consecutive cycles. An analysis across multiple ITSM systems found a 10-second login time to be the benchmark for deciding whether performance is indeed getting worse. We run the script every 2 seconds.

Dev & QA:

• HLAL: Availability or Accuracy < 100% for 2 consecutive cycles. There is no performance alerting for Dev and QA URLs.

INDIVIDUAL AR SERVER MONITORING USING AR SERVER KM:

The PATROL Knowledge Module for AR Server is used to monitor the availability of individual AR Servers. It is configured when an AR Server group is implemented. The KM uses Java-based drivers to connect to each individual AR Server and detects its basic performance and availability.

DEV & QA ENVIRONMENT MONITORING:

Development and Quality Assurance environments are monitored for availability using TMART; only disk utilization and McAfee services are monitored in BPPM. Availability of URLs is monitored using the TMART transaction HLAL, which goes to the Homepage, performs a Login, retrieves the Application Listing and finally performs a Logout.
BPPM is used only to monitor disk space and McAfee-related services. The following parameters are monitored.

Windows OS Monitoring

Parameter                         Major Event  Critical Event  Occurrences  Incident Ticket  Polling Cycle
Logical Disks [Free space %] D:   < 15%        < 10%           Immediate    Yes              2 mins
Logical Disks [Free space %] C:   < 15%        < 10%           Immediate    Yes              2 mins

Services Status (Up/down)

Service                    Major Event  Critical Event
McAfee Framework Service   -            Service down
McAfee McShield            -            Service down
McAfee Task Manager        -            Service down

Acknowledgements

With thanks to:

• Franco Ferrero
• Bob Mosely
• Nick Goff
• Theodore Cory
1. For each of the monitoring layers outlined below, we want to know the specific monitoring targets and their default thresholds.

   a. Mid-tier/application tier
      i. BMC APM - Application diagnostics
         What app threshold do you monitor for?
      ii. BPPM for Internet Servers (monitors the web server, such as Apache, Microsoft IIS etc.)
         Which Apache thresholds? JMX monitoring points? Etc.

   b. Remedy Applications (i.e. Incident Management, Change Management etc.)
      i. BMC PATROL Knowledge Module for Remedy AR Server
         What specific things is the KM for Remedy AR Server monitoring? We want the details and default thresholds, please.
         Answer: The KM monitors AR Application status and AR Server Statistics. As of now there are no thresholds set.
         Metrics of AR Server Statistics

   c. Back-end database (Oracle, Sybase, MS SQL Server etc.)
      i. BPPM for Databases (monitors all databases that Remedy supports)
         We use Oracle Enterprise Manager (OEM) for DB monitoring. We want to ensure we have the default Oracle DB monitoring points and thresholds provided, so that we can sync them.
         Answer: Database monitoring includes availability of the database, tablespace usage, and utilization of database-related filesystems.

   d. Operating system on which Remedy is running (i.e. Windows, Linux, VMs etc.)
      i. BPPM for Servers/Virtual Servers (monitors health of the system, like CPU, Memory, Disk, Remedy processes/services, logs)
         Again, what Remedy processes, services and logs are monitored, and what are the default thresholds?
         Answer: OS monitoring for Windows includes Total CPU, % of Memory used, and Disk Freespace. Processes include arcmdbd, armonitor, arplugin, arrecond, arserver, arsvcdsp, slmbrsvc and slmcollsvc. The arerror log is monitored for plugin errors.
         Default OS threshold for Windows
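The process check described in the last answer amounts to comparing the set of running processes with the expected list. A minimal Python sketch follows, assuming the running process names are already available as strings; how BPPM gathers them is not covered here, and missing_processes is a hypothetical helper.

```python
from typing import Iterable, List, Set

# Process names exactly as listed in the answer above.
EXPECTED_PROCESSES = (
    "arcmdbd", "armonitor", "arplugin", "arrecond",
    "arserver", "arsvcdsp", "slmbrsvc", "slmcollsvc",
)


def missing_processes(running: Iterable[str]) -> List[str]:
    """Return the expected processes that are absent from the running list.

    Matching ignores case and a trailing ".exe", since Windows reports
    image names such as "arserver.exe" (an assumption about the input).
    """
    seen: Set[str] = set()
    for name in running:
        name = name.strip().lower()
        if name.endswith(".exe"):
            name = name[:-4]
        seen.add(name)
    return [proc for proc in EXPECTED_PROCESSES if proc not in seen]
```

For example, calling missing_processes(["arserver.exe", "ARMonitor.exe"]) returns the six other process names, each of which would be a candidate for a "process down" event.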
Apart from this, we use a custom PATROL KM to create blackouts for Change suppression.
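The idea behind a blackout is that events falling inside an approved change window are not raised. The sketch below illustrates that idea only; it does not describe the custom KM, and the names in_blackout and should_raise, as well as the half-open window convention, are assumptions.

```python
from datetime import datetime
from typing import Iterable, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of an approved change


def in_blackout(moment: datetime, windows: Iterable[Window]) -> bool:
    """True if the moment falls inside any window, treated as [start, end)."""
    return any(start <= moment < end for start, end in windows)


def should_raise(event_time: datetime, windows: Iterable[Window]) -> bool:
    """Suppress events that occur during a change window."""
    return not in_blackout(event_time, windows)


# Hypothetical change window: 22:00 to 23:30 on 14 June 2014.
windows = [(datetime(2014, 6, 14, 22, 0), datetime(2014, 6, 14, 23, 30))]
```

An event at 22:15 on that date would be suppressed, while one at 23:30 would be raised again because the end of the window is exclusive in this sketch.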