SCALEXTREME'S SERVER MONITORING ESSENTIALS




A WHITEPAPER BY SCALEXTREME

SCALEXTREME'S SERVER MONITORING ESSENTIALS COVERS:

INTRODUCTION
UNDERSTANDING IT INFRASTRUCTURE PERFORMANCE METRICS
WHAT IS AN INFRASTRUCTURE METRIC?
IT PERFORMANCE METRICS AND INDICATORS
WHICH INFRASTRUCTURE PERFORMANCE METRICS SHOULD I MONITOR?
PROTECTING YOUR CRITICAL IT INFRASTRUCTURE WITH MONITORING ALERTS
WHAT CAUSES AVAILABILITY VARIANCE AND IT INFRASTRUCTURE ALERTS?
DETERMINING WHO NEEDS TO GET AN IT SYSTEMS ALERT
WHAT TO DO AFTER AN IT INFRASTRUCTURE ALERT

INTRODUCTION

Server monitoring is a critical part of the daily work of systems administrators and IT operators. It encompasses performance measurement, diagnostics, and troubleshooting. Monitoring can be active, requiring an IT professional to read dashboards, consult log files, or watch performance numbers. It can also be passive: passive systems monitoring relies on specialized software to provide alerts when something goes wrong. Here we provide an overview of what you need to know to begin effectively monitoring your IT infrastructure:

Understanding IT Infrastructure Performance Metrics
IT Performance Metrics and Indicators
Which Infrastructure Performance Metrics Should I Monitor?
Protecting Your Critical IT Infrastructure with Monitoring Alerts
What Causes Availability Variance and IT Infrastructure Alerts?
Determining Who Needs to Get an IT Systems Alert
What to Do After an IT Infrastructure Alert

ScaleXtreme offers a comprehensive server monitoring product with powerful features to help you ensure your Linux and Windows systems are available and running smoothly. You can monitor the performance of server hardware, operating systems (Windows and Linux), and critical applications with our extensive selection of IT performance metrics. You can create monitoring alerts based on a wide range of metric thresholds and conditions and receive instant notifications when an alert is triggered. You can define event-based actions with API calls, create custom performance metrics, and see multiple detailed monitoring views, all through a single product. Good server monitoring is particularly important if you're working with both traditional, on-premise IT infrastructure and new cloud computing instances.

UNDERSTANDING IT INFRASTRUCTURE PERFORMANCE METRICS

Keeping your computer systems in top working condition depends on the carefully balanced interaction of multiple different components.
Computer performance metrics provide a means of monitoring the performance of these components, so you can proactively find and address problems before they interfere with your organization's technical and business processes. Understanding performance metrics is key to implementing a successful monitoring strategy that can help you protect and optimize your critical IT infrastructure.

WHAT IS AN INFRASTRUCTURE METRIC?

A metric is a measure of the performance of a system component or subsystem. System administrators use metrics to monitor the behavior of these components. Understanding metrics can help you prevent, diagnose, and fix underlying issues that may cause bottlenecks and lead to costly system failures and downtime.

Computer systems include the following interactive components:

Machine components: "Hardware" devices, such as CPU, physical memory, disk, network interface, etc.
Operating system components: Software that facilitates the interaction between machine/hardware, application programs, and users; includes managing processes, file systems, memory allocation, I/O operations, etc.
Application components: Software that performs specialized functions for users, such as web server, database, word processor, etc.

Each of these components has unique performance characteristics that can be measured to determine if the component is performing normally when compared to established baseline values. If metric values are too high, too low, or fall outside an expected range, then an issue may exist with the server that requires further attention.

EXAMPLE

Let's look at a basic performance metric: % CPU utilization. This metric tells you the overall percentage of your processor that is currently in use. The higher the percentage utilization, the more likely the system is to experience performance problems. If your baseline for CPU utilization is between 15-25% under normal conditions, then a reading above 25% indicates an abnormally high level of usage. If usage becomes too high, server performance may suffer, the system may become sluggish, and it may even crash. In such a case, you should attempt to identify the cause of the abnormally high metric and take corrective action to reduce the load on the processor. This might involve re-configuring the server or adding additional server resources.

It's a good idea to learn as much as possible about how your computer hardware, operating system, and applications actually work, so that you can effectively interpret metric data. It's the first step to implementing monitoring, the process that gives you visibility into your IT systems and the power to protect them.
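The baseline comparison described above can be sketched in a few lines of Python. The function name and the 15-25% range are illustrative, not part of any particular product:

```python
def check_metric(value, baseline_low, baseline_high):
    """Compare a sampled metric against its established baseline range.

    Returns "OK" when the value falls inside the range, otherwise a
    direction hint that suggests where to start investigating.
    """
    if value < baseline_low:
        return "LOW"   # e.g. traffic dropped off, a service may be down
    if value > baseline_high:
        return "HIGH"  # e.g. CPU under abnormal load
    return "OK"

# Using the 15-25% CPU-utilization baseline from the example above:
print(check_metric(18.0, 15.0, 25.0))  # OK
print(check_metric(92.5, 15.0, 25.0))  # HIGH
```

The same three-way comparison applies to any metric with an established range, whether CPU, free memory, or request latency.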
From there, you can begin to work with pre-set alert thresholds based on established baseline performance metrics that will help ensure your systems are running efficiently and reliably.

IT PERFORMANCE METRICS AND INDICATORS

Determining what's normal is one of the first tasks an IT professional will tackle in a new operational environment. It's critical to establish baselines to then be able to interpret monitoring output and metric data effectively. Only then can an IT professional take corrective action. The actual performance of your system will depend on a variety of factors, including the overall system architecture, type of CPU, amount of memory, disk space, I/O devices, and applications in use, as well as network traffic and demand on resources. These factors can vary widely based on any number of conditions. Still, you can determine operational norms for your particular system by measuring baseline values for relevant performance metrics. You can then compare metric data to these baselines to identify deviations that might indicate a problem with your system.
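As a sketch of how such a baseline might be derived, the following computes an expected range (mean plus or minus two standard deviations) from a set of hypothetical CPU-load samples; the sample values and the two-sigma choice are illustrative assumptions:

```python
import statistics

def baseline_range(samples, k=2.0):
    """Derive a baseline range (mean +/- k standard deviations) from
    metric samples collected at different times and load levels."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    # Clamp at zero: utilization-style metrics cannot go negative.
    return max(0.0, mean - k * sd), mean + k * sd

# Hypothetical CPU-utilization samples (%): early morning, midday peak,
# overnight, and a payroll-processing run, gathered over several weeks.
cpu_samples = [12, 14, 35, 38, 41, 8, 9, 33, 36, 52]
low, high = baseline_range(cpu_samples)
print(f"expected range: {low:.1f}-{high:.1f}")
```

Readings outside the computed range are candidates for investigation; readings inside it are treated as normal variation.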

If you see that some metric is way out of whack, you'll have a good indication of where to begin working toward a solution.

ESTABLISHING BASELINE IT INFRASTRUCTURE PERFORMANCE METRICS

To establish baseline performance metrics for your servers, you will need to measure server performance at various times and under different load conditions. This will give you an idea of the expected range of performance for your system under normal circumstances.

EXAMPLE

IT professionals sample system loads under various conditions to establish baselines. They might record CPU load, a common metric of infrastructure utilization, at the beginning of the workday, when there are only a few system users; in the middle of the workday, when there are many users; and in the middle of the night, when there are no system users. They might also sample metrics during critical business cycles, such as payroll processing, or over the holidays, when activity may spike. A good sample size can be collected over the course of several weeks. You can then compare these baselines with subsequent performance metric data to determine how well a server or system performs. Performance metrics that deviate dramatically from baseline measurements may indicate areas where a server needs to be optimized or reconfigured. Baselines also give you a point of departure for setting useful monitoring alert thresholds. Once you understand how your system performs, you can begin to set alerts that accurately reflect trends and/or conditions that are a departure from normal system behavior and may jeopardize your system.

WHICH INFRASTRUCTURE PERFORMANCE METRICS SHOULD I MONITOR?

There are a wide variety of metrics available for monitoring on both Linux and Windows systems. The specific metrics you choose to monitor will depend, of course, on the specific architecture and function of your system. Suppose, for example, you are running a multi-tier web site.
In this case, you'll want to monitor the performance of each server in the system, with attention to network activity, both on the public Internet and between front- and back-end services. You'll also want to monitor metrics for each of your application programs (including web server, database, load balancer, etc.), checking metric data against baseline readings to verify your applications are performing as expected and delivering the highest quality end-user experience. We divide infrastructure performance metrics into two general categories:

System Metrics: System metrics measure the performance of machine (hardware) components and operating system components.
Application Metrics: Application metrics measure the performance of application programs.

SYSTEM METRICS

Monitoring the following types of system metrics can help you avoid many issues frequently implicated in poor performance and system failures:

CPU: The CPU (Central Processing Unit) is the central component, or brain, of your computer system. Its function is to run all program instructions for the system via arithmetical, logical, and input/output operations. CPU performance is critically important. If the CPU has problems or fails, the system is likely to run poorly, crash frequently, or not start up at all. During periods of high use, the CPU can come under a great deal of stress, making it more vulnerable to failure. Thus it's essential to monitor the performance of your CPU and take action when necessary to maintain optimal CPU performance. There are a wide variety of CPU monitoring metrics available to help you keep the brains of your operation in top working order.

Physical Memory: Physical memory, also known as RAM (Random Access Memory), is a temporary hardware storage area where frequently used data, such as operating system and program files, is loaded before being accessed by the CPU. Adding as much RAM as possible to your system is highly recommended. Insufficient RAM can lead to sluggish behavior. This is because most computers use an overflow mechanism known as virtual memory to extend RAM capacity. When RAM fills up, the computer swaps data out to a partition on the hard disk. Swapping data can cause performance issues, as data transfer from disk is much slower than from physical memory. Monitoring metrics such as available memory and swap space can help you ensure physical memory is used as efficiently as possible.

Disk Usage: There are a number of hard disk performance characteristics that can impact overall system performance. Most of these relate to the speed of data input/output (I/O) from disk. In general, the faster your hard disk can read and write data, the better your system will perform.
Another important measure is the amount of available space on the disk. The more packed your disk becomes, the more sluggish your performance will be. A crowded disk can also cause fragmentation, where blocks of data are split up across different locations on the drive. This can really slow down the action and even jeopardize the safety of your data. And remember: you should routinely back up the data on your hard disk to ensure that your irreplaceable data is safe and sound.

Network: Network metrics provide a way to evaluate the performance of your system's network operations. This is done by measuring characteristics of TCP/IP connections over your system's network interface card (NIC). These metrics, including total number of packets, total packets received/sent, and number of packets dropped, can help you identify issues that may be hampering the flow of traffic on Internet connections and on your internal network.

Processes: A process is a running program instance. Computer systems have multiple processes running concurrently, and each process may consist of multiple threads, also running concurrently. Certain processes are initiated at the hardware/operating system level, while others are initiated by applications. If you have too many processes running, system performance can suffer, as all of these processes use CPU resources. Monitoring processes can help you determine if you have too many processes running, and help you identify processes that you may be able to terminate to reduce load on the CPU.
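A few of these system metrics can be read with nothing but the Python standard library. This is an illustrative sketch (the load-average call is POSIX-only), not how any particular monitoring agent works:

```python
import os
import shutil

def disk_usage_percent(path="/"):
    """Percentage of disk space in use at the given mount point."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def load_average():
    """1-, 5-, and 15-minute run-queue load averages (POSIX only)."""
    return os.getloadavg()

print(f"disk used: {disk_usage_percent():.1f}%")
print(f"load avg (1m, 5m, 15m): {load_average()}")
```

In practice a monitoring agent samples values like these on a schedule and ships them to a central store for baseline comparison.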

APPLICATION METRICS

Application metrics measure the performance of your system's application programs. These metrics collect data on quantifiable characteristics of user-to-application and application-to-application interactions, including events such as accesses, requests, queries, and so on. IT professionals use application metrics to monitor the availability and responsiveness of software applications in order to ensure that performance meets the needs of both end users and the business. Application metrics can also provide insight into how efficiently applications are using underlying system components, including CPU, memory, disk, network, etc. In addition, through root cause analysis, certain application monitoring solutions can expose performance issues in application code. As IT systems become increasingly distributed across physical, virtual, and cloud-based systems, the ability to manage application performance over geographically dispersed and technically diverse infrastructure has emerged as an important IT capability. This has sparked thoughtful debate in the IT analytics community as to what constitutes effective Application Performance Management (APM).

MONITORING A MULTI-TIER APPLICATION ARCHITECTURE

A multi-tier e-commerce web site provides a good example of how monitoring application metrics can help you maintain a positive experience for end users while meeting the expectations of your business organization. A typical multi-tier architecture would involve an end user interacting via a client application (such as a web browser) over the Internet with a web server application. In turn, the web server application interacts with (queries) a database application to get data that satisfies the user's request, such as product information, prices, etc.
The web server may also interact with other applications, such as a store program that enables purchasing of requested items, or a load balancer that helps distribute the workload across multiple servers. If any one of the applications in this interconnected system develops a performance issue or fails, the entire site may become agonizingly slow or crash altogether. And nothing is worse for the user experience, or your business, than a slow or unavailable web site.

WEB SERVER AND DATABASE METRICS

Each application in the e-commerce web site should be monitored to ensure an optimal user experience. In particular, it's a good idea to monitor web server metrics, such as uptime, to make sure the site is consistently available. You'll also want to monitor total accesses, to make sure that transaction volume is consistent with baseline metrics. If the number of total accesses is significantly below baseline, this could indicate an issue with the application configuration, CPU load, or network connection that requires prompt attention. You might also want to check requests per second to evaluate load capability; in general, the greater the number of requests per unit time, the more efficiently the application is using the underlying resources. It's equally important to monitor database application metrics, such as average queries, connections, threads, and uptime, to make sure that response times, transaction volumes, and availability are consistent with baselines.
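A requests-per-second figure like the one mentioned above can be derived from web server access-log timestamps. The log format below is hypothetical; the averaging approach is the point:

```python
from datetime import datetime

# Hypothetical access-log lines; only the timestamp fields matter here.
LOG_LINES = [
    "2012-03-02 10:00:00 GET /products 200",
    "2012-03-02 10:00:00 GET /cart 200",
    "2012-03-02 10:00:01 GET /products/42 200",
    "2012-03-02 10:00:09 POST /checkout 200",
]

def requests_per_second(lines):
    """Average request rate over the time span covered by the log lines."""
    stamps = [datetime.strptime(" ".join(l.split()[:2]), "%Y-%m-%d %H:%M:%S")
              for l in lines]
    span = (max(stamps) - min(stamps)).total_seconds() or 1.0
    return len(lines) / span

print(f"{requests_per_second(LOG_LINES):.2f} req/s")
```

Comparing this computed rate against the baseline rate is what flags a transaction-volume anomaly.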

Monitoring these database metrics will help you ensure that the user experience is seamless and free from unexpected delays or other performance issues originating on the data tier of your web site.

PROTECTING YOUR CRITICAL IT INFRASTRUCTURE WITH MONITORING ALERTS

Most IT infrastructure monitoring solutions provide alerts that notify you when conditions arise that may indicate a problem with your system. These alerts keep you informed on the status of key system metrics and help you pinpoint the cause of performance issues, so you can address them proactively before they cascade into systemic problems or failures.

THRESHOLDS, ALERT STATES, AND ACTIONS

Alerts are based on pre-set threshold values pertaining to a metric for a particular component, such as CPU, memory, disk, network, web server, database, etc. When threshold conditions are satisfied, the alert state changes, for example from the OK state to the PROBLEM state, and an action is taken, such as sending an email notification. Alert state changes can also initiate web hooks that perform custom actions via pre-defined API calls. Web hooks enable programmatic integration of monitoring into your organization's existing processes. Once an alert is triggered, there are a wide variety of corrective actions the system administrator will want to consider to bring the system back into balance. Setting up an alert involves these simple steps:

1. Select the specific type of system or application metric on which the alert is to be based, for example high CPU load, physical memory, disk usage, network, web server, database, etc.
2. Set the metric threshold value that will trigger the alert. The value you set for this threshold should be based on established baseline metrics for your infrastructure, as well as generally recommended threshold values for the particular metric.
3. Select the actions that will be taken when the alert is triggered.
This may include email notifications to relevant parties, as well as custom actions, such as web hooks, that enable programmatic integration with other systems via API calls.

CHOOSING ALERT THRESHOLDS

Choosing proper alert thresholds is key to setting up a monitoring system that can provide genuine protection for your system. Set thresholds too low and you may end up getting spammed by false alerts. Set thresholds too high, and the only notification you receive might be "Sorry, system failure." It is important to base your monitoring alert thresholds on established baseline metric values that you have gathered for your system. This ensures that your alerts are calibrated to your system architecture. Properly calibrated alert thresholds are more likely to detect actual deviations from normal system behavior, and thus provide your systems a measure of protection.
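The OK-to-PROBLEM state transitions described in this section might look like the following sketch. The class name and payload fields are illustrative, not ScaleXtreme's actual API:

```python
import json

class Alert:
    """Minimal alert: tracks OK/PROBLEM state against a threshold and
    emits a webhook-style JSON payload on each state change (a sketch;
    a real product would POST this JSON to a configured endpoint)."""

    def __init__(self, metric, threshold):
        self.metric = metric
        self.threshold = threshold
        self.state = "OK"

    def evaluate(self, value):
        new_state = "PROBLEM" if value > self.threshold else "OK"
        if new_state != self.state:
            self.state = new_state
            return json.dumps({"metric": self.metric,
                               "state": new_state,
                               "value": value})
        return None  # no transition, no action

cpu_alert = Alert("cpu_load", threshold=2.0)  # 2x a baseline of 1.0
print(cpu_alert.evaluate(1.2))  # None: below threshold, still OK
print(cpu_alert.evaluate(3.5))  # JSON payload for the OK -> PROBLEM change
```

Acting only on state changes, rather than on every over-threshold sample, is one simple way to avoid the alert spam described above.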

You can then refine alert threshold values as needed based on subsequent monitor readings and/or changes in your system architecture.

CUSTOM ACTIONS: WEB HOOKS AND INTEGRATION

Some monitoring programs (including ScaleXtreme) support custom actions via web hooks. A web hook is an event-based API call: when a specified event occurs, such as an alert being triggered, a pre-defined API call is performed. Web hooks enable programmatic integration of monitoring into an organization's pre-existing processes, including ticketing systems, resource provisioning (auto-scaling), and any number of other workflow processes. For example, you might have a high-CPU-load alert with a threshold set to twice your established baseline value for CPU load average. Using a web hook, when this threshold is reached, alert notification data could be sent to your ticketing system via HTTP POST to programmatically generate a new support ticket, and additional server resources could be launched via your company's provisioning processes.

WHAT CAUSES AVAILABILITY VARIANCE AND IT INFRASTRUCTURE ALERTS?

Systems monitoring and management products are designed to alert IT operators and systems administrators when something goes wrong. The hope is that the people responsible for a business's compute infrastructure can fix problems before they lead to outages or system downtime. Every enterprise hits its own infrastructure speed bumps. Some can be easily handled with either programmatic or manual intervention. Others require notification up the chain of command and a multi-faceted response. Developing a solid incident and alert response process entails understanding some of the most common triggers for IT alerting:

Seasonality: Some businesses experience seasonal variation in the use of their systems. Online retailers such as Amazon.com, eBay.com, and Walmart.com see dramatic traffic increases around Christmas time, which puts strain on their IT resources.
Other businesses experience different seasonality. Accounting firms see usage increase before the end of each quarter. Personal finance and tax preparation groups experience additional loads before federal income taxes come due in April.

Usage/Traffic Spikes: Seasonality can be predicted in many cases, but usage and traffic spikes are often unplanned and characterized by explosive growth in the demands on IT. The owners of an application that gets mentioned in TechCrunch, or of a website featured on Good Morning America, are familiar with the effects of such a spike. A spike can set off alerting software and require IT administrators to either rapidly spin up new machines or face a potential outage.

Configuration Issues: Poor system performance can often be attributed to configuration issues, especially in a distributed and highly heterogeneous operating environment. An application may require access to a certain database in order to record interactions with customers. A load balancer has to know where to send traffic. As systems become virtualized and incorporate cloud computing, IT administrators have to be aware of how their applications work together with infrastructure.

Equipment Failure: This is perhaps one of the most common reasons an IT alert is triggered. A piece of equipment fails. A server burns out or loses power. A public cloud instance terminates. Someone in the organization de-provisions a machine that is in use someplace else. Virtualization and cloud computing make it much more difficult to know where applications are running. A good systems monitoring and management tool can provide clarity into where systems are and what applications are running on top of them. That way, when there's an outage or a failure, a team can rapidly respond.

DETERMINING WHO NEEDS TO GET AN IT SYSTEMS ALERT

It isn't always easy to decide who in your organization should get a notification when a key IT performance metric triggers an alert. Most IT organizations end up with an ad hoc notification plan cobbled together from a series of miscommunications: someone who should have been told of a system issue didn't get a message and complained, or someone who doesn't need to get an alert finds his or her inbox filled with incomprehensible messages. Either way, it ends up as a headache for IT and anyone who has to interact with infrastructure. Establishing clear responsibilities and escalation pathways can radically simplify the process of deciding who needs to receive metric-based notifications. There are five fundamental groups that require some form of alert:

1. First Responders: These are the people who first see an automated alert or constantly watch key metrics for major changes. They work through a pre-determined process to drill down on what the problem is or what functional group it affects. First responders need the most information and the earliest insights.

2. Fixers: When a problem arises, someone has to fix it. A fixer, or fixers, can troubleshoot it and create a solution. These people are technical experts in whatever system needs attention. Some organizations combine fixers with first responders.
Fixers need to know about problems before they result in outages and may want to be alerted as soon as benchmarks are exceeded.

3. Owners: For each system there's someone responsible for its uptime and availability. This person hears from the first responders, oversees the fixers, and interacts with the people who rely on the system to do their jobs. The owner needs to know when a system is in trouble and should be aware of a problem before there's an outage. Owners also may wish to get periodic reports on system health.

4. Users: These are the people who rely on a given system to either execute their job functions or to utilize a certain service. Users need to know when a system is unavailable and may need some explanation or assurance against future outages.

5. The Informed: In every enterprise, there's a class of people who have to be told when a problem arises. They can be compliance officers, executive management, public relations representatives - every organization has a different set of people who have to take action when something goes wrong. These people neither need nor want overly technical explanations and certainly need not be included in the alerts generated by an automated monitoring solution. Owners typically work directly with those who have to be informed.

Establishing communications channels for systems alerts in advance can dramatically improve an organization's response and overall uptime. It prevents headaches by establishing protocols and ensures that the organization gets the biggest benefit from its systems monitoring solution.
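One way to encode these five groups is a simple severity-to-recipients routing table. The severity levels and group names below are hypothetical; the point is that routing decisions are made once, in advance, rather than ad hoc during an incident:

```python
# Hypothetical routing table mapping alert severity to the groups that
# should be notified, following the five roles described above.
ROUTING = {
    "warning": ["first_responders"],
    "problem": ["first_responders", "fixers", "owners"],
    "outage":  ["first_responders", "fixers", "owners", "users", "informed"],
}

def recipients(severity):
    """Groups to notify for a given alert severity; unknown severities
    fall back to the first responders, who can triage."""
    return ROUTING.get(severity, ["first_responders"])

print(recipients("problem"))
```

Note that "informed" appears only at the outage level, matching the observation above that this group should not receive routine technical alerts.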

WHAT TO DO AFTER AN IT INFRASTRUCTURE ALERT

You've set up a systems monitoring solution and calibrated alerts to make sure that you discover problems before they escalate. You've determined who needs to receive alerts, who needs to take action, and who else needs to be informed. Now what? Monitoring and alerts are only as useful as your ability to do something about the information you get. IT architects can either empower systems administrators to take the actions most appropriate for the situation or prescribe a certain set of processes for troubleshooting. Often this decision comes down to the size of the organization and its IT infrastructure. Some of the most common actions systems administrators have to take when responding to an issue surfaced by monitoring or IT alerting include:

Provisioning New Machines: If systems administrators see high CPU or disk usage, it may be a signal that they are running at or near capacity and need to provision new virtual machines, physical servers, or public cloud instances in order to assure application availability. This requires either reconstituting the application stack on a new machine from a static or "gold" image, or dynamically assembling the server from the building blocks defined through a cloud systems management tool.

Restarting a Server: When monitoring tools deliver an alert that a server has been non-responsive for several minutes, it may indicate a need to restart the server.
Although this sounds simple enough, competent systems administrators will consider what knock-on effects shutting down a certain stack will have on the entire system. They may decide to bring the stack up on a backup server before restarting the original machine.

Running Scripts: Scripts are the fundamental tools of systems administration and can be defined and canonicalized for an organization, or created on the fly to deal with certain types of problems. If your organization uses the same types of scripts over and over again, it can be valuable to put them in a library and share access to them across the groups with functional responsibility for IT.

Applying Patches: Systems tools can indicate that a server requires a Windows patch or a Linux update. Indeed, not patching systems can be the cause of underperformance or even a security breach. Patch management requires careful oversight and a coordination of resources across functional groups.

Producing Reports: Alerts can be a sign of serious problems and may require those with functional responsibility to create reports that demonstrate their response to the alert and/or the ongoing system health for others in the organization.
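A canonical script library like the one described above can be as simple as a dispatch table mapping alert categories to shared, reviewed response handlers. The category names and handler bodies here are illustrative stand-ins for real provisioning and restart scripts:

```python
# Sketch of a shared response library: each common action lives behind a
# named entry so every team runs the same, reviewed handler.
def provision_machine(alert):
    return f"provisioning capacity for {alert['metric']}"

def restart_server(alert):
    return f"restarting host {alert['host']}"

ACTIONS = {
    "capacity": provision_machine,
    "unresponsive": restart_server,
}

def respond(alert):
    """Look up and run the canned response for an alert category;
    anything unrecognized is escalated to the system owner."""
    handler = ACTIONS.get(alert["category"])
    return handler(alert) if handler else "escalate to owner"

print(respond({"category": "unresponsive", "host": "web01", "metric": "ping"}))
```

Keeping the table in one place gives the functional groups a single point of review when a response procedure changes.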

CONTACT SCALEXTREME:

4 West 4th Avenue, Suite 401
San Mateo, CA 94402
info@scalextreme.com
877.972.2539