Wait, How Many Metrics? Monitoring at Quantcast

    Count what is countable, measure what is measurable, and what is not measurable, make measurable.
    Galileo Galilei

Quantcast offers free direct audience measurement for hundreds of millions of web destinations and powers the ability to target any online audience across tens of billions of events per day. Operating at this scale requires Quantcast to have an expansive and reliable monitoring platform. To stay on top of its operations, Quantcast collects billions of monitoring metrics per day. The Ganglia infrastructure we've developed lets us collect these metrics from a wide variety of sources and deliver them to several different kinds of consumers; analyze, report, and alert on them in real time; and provide our product teams with a platform for performing their own analysis and reporting.

The monitoring infrastructure at Quantcast collects and stores about 2 million unique metrics from a number of datacenters around the world, for a total of almost 12 billion metric values per day, all of which are made available to our monitoring and visualization tools within seconds of their collection. This infrastructure rests on Ganglia's Atlas-like shoulders.

Some of the sources that generate these metrics are:

- operating system metrics (CPU/memory/disk utilization)
- application metrics (queries per second)
- infrastructure metrics (network hardware throughput, UPS power utilization)
- derived metrics (SLA compliance)
- business metrics (spend and revenue)

We have a similarly broad spectrum of consumers of monitoring data:

- alerting tools (Nagios)
- business analysis of historical trends
- performance analysis by product teams

Ganglia gives us a system that can listen to all of these sources, serve all of these consumers, and remain performant, reliable, and quick to update.
Reporting, Analysis, and Alerting

Alerting Integration

Like many companies, we use Nagios to monitor our systems and alert us when something is not working correctly. Instead of running a Nagios agent on each of our hosts, we use Ganglia to report system and application metrics back to our central monitoring servers. Not only does this reduce system overhead, it also reduces the proliferation of protocols and applications running over our WAN.

Nagios runs a Perl script called check_qcganglia, which has access to every metric in the RRD files we store on a ramdisk. This allows us to have individual host checks as well as health checks for entire clusters and grids using Ganglia's summary data. Another great benefit is that we can configure checks and alerts on the application-level data that our developers put into Ganglia, and alert on aggregated business metrics such as spend and revenue.

We have also implemented a custom Nagios alerting script which, based on the configured alert type, determines which Ganglia graphs would be useful for an operator to see right away and attaches those graphs as images to the outbound alert email. Typically these include CPU and memory utilization graphs for the individual host, as well as application graphs for the host, cluster, and grid. They help the on-call operator immediately assess the impact of any given alert on the overall performance of the system.
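The real check_qcganglia is a Perl script and its internals aren't reproduced here, but the idea is simple enough to sketch. The hypothetical Python check below reads the most recent value of a single metric straight from its RRD file on the ramdisk (assuming gmetad's usual cluster/host/metric.rrd layout) and maps it to a Nagios exit code; the root path and thresholds are placeholders, not our production values.

```python
#!/usr/bin/env python
# Hypothetical sketch of a check_qcganglia-style Nagios check (the real
# check is a Perl script).  It reads the latest value of one metric from
# the RRD tree that gmetad keeps on the ramdisk and maps it to a Nagios
# exit code.
import subprocess
import sys

RRD_ROOT = "/mnt/ganglia-tmpfs/rrds"   # assumed tmpfs location

def last_value(cluster, host, metric):
    """Return the most recent datapoint stored in the metric's RRD file."""
    rrd = "%s/%s/%s/%s.rrd" % (RRD_ROOT, cluster, host, metric)
    out = subprocess.check_output(["rrdtool", "lastupdate", rrd])
    # Output looks like:  " sum\n\n1332444720: 42.0\n"
    last_line = out.decode().strip().splitlines()[-1]
    return float(last_line.split(":", 1)[1])

def main():
    cluster, host, metric, warn, crit = sys.argv[1:6]
    value = last_value(cluster, host, metric)
    if value >= float(crit):
        print("CRITICAL - %s=%s" % (metric, value)); sys.exit(2)
    if value >= float(warn):
        print("WARNING - %s=%s" % (metric, value)); sys.exit(1)
    print("OK - %s=%s" % (metric, value)); sys.exit(0)

if __name__ == "__main__":
    main()
```

Cluster- and grid-level checks follow the same pattern against the summary RRDs that gmetad maintains.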
Holt-Winters Aberrance Detection

Because Ganglia is built on RRDtool, we're able to leverage some of its most powerful (if intimidating) features as well. One of these is its ability to do Holt-Winters forecasting and aberrance detection. The Holt-Winters algorithm works on data that is periodic and generally consistent over time. It derives upper and lower bounds from historical data and uses them to make predictions about current and future data.

We have several metrics that are well suited for Holt-Winters aberrance detection, such as HTTP requests per second. This traffic varies significantly over the course of a single day and across the days of the week, but from one Monday to the next it is quite consistent. A large swing in traffic from one minute to the next typically indicates a problem. We use Nagios and RRDtool to monitor the Holt-Winters forecast for our traffic and alert if the traffic varies outside of the expected range. This lets us see and respond to network and application problems very quickly, using dynamically derived thresholds that are always up to date.

One example of an aberrance we caught this way: we took a datacenter offline for maintenance, and the resulting traffic shift triggered aberrance detection and an unexpected-aberrance alert.
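RRDtool records the outcome of its aberrance detection in a FAILURES RRA that sits alongside the HWPREDICT forecast. As a rough sketch of the kind of check we drive from Nagios (not our production plugin), the script below assumes the traffic RRD already carries HWPREDICT/FAILURES RRAs and alerts if any recent interval was flagged as aberrant; the RRA parameters shown in the comment are illustrative only.

```python
#!/usr/bin/env python
# Sketch of a Holt-Winters aberrance check.  It assumes the RRD already has
# an HWPREDICT/FAILURES RRA pair (e.g. "RRA:HWPREDICT:1440:0.1:0.0035:288",
# with illustrative alpha/beta/seasonal-period values); the FAILURES RRA
# stores 1 for intervals RRDtool judged aberrant and 0 otherwise.
import subprocess
import sys

def recent_failures(rrd, window="-900"):
    """Count aberrant datapoints recorded over the last `window` seconds."""
    out = subprocess.check_output(
        ["rrdtool", "fetch", rrd, "FAILURES", "--start", window])
    failures = 0
    for line in out.decode().splitlines():
        if ":" not in line:
            continue                      # skip the DS-name header line
        for field in line.split(":", 1)[1].split():
            if field not in ("nan", "-nan", "NaN") and float(field) >= 1:
                failures += 1
    return failures

if __name__ == "__main__":
    rrd = sys.argv[1]
    n = recent_failures(rrd)
    if n:
        print("CRITICAL - %d aberrant intervals in %s" % (n, rrd))
        sys.exit(2)
    print("OK - traffic within the Holt-Winters confidence band")
    sys.exit(0)
```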
Ganglia as an Application Platform

Since a gmond daemon is running on every machine in the company, we encourage our application teams to use Ganglia as the pathway for reporting performance data about their applications. This has several benefits:

- Easy integration for performance monitoring. The Ganglia infrastructure is already configured, so application developers don't have to do any special work to make use of it. Especially in concert with the json2gmetrics tool described below, it's easy for an application to generate and report any metrics its developers think would be useful to see. Developers also don't have to worry about opening firewall ports, parsing or rotating log files, or running other services devoted to monitoring, such as JMX.

- Powerful analysis tools. By submitting metrics to Ganglia in this way, application developers get easy access to very powerful analysis tools (namely the Ganglia web interface). This is particularly useful when troubleshooting application problems that correlate with operating system metrics, e.g., attributing a drop in requests per second to a machine swapping the application out of memory.

- Simple and flexible alerting. Our Ganglia infrastructure is tied in with our alerting system. Once an alert is configured, an application can trigger an alert state by generating a specific metric with a sentinel value. This lets us centralize alert generation and event correlation; for instance, we can prevent an application from generating alerts when the system it runs on is in a known maintenance window. This is much better than the alternative of each application reinventing the wheel: monitoring for an invalid state and sending its own emails.

Best Practices

Hard-won experience has given us some best practices for using Ganglia. Following these has helped us scale Ganglia up to a truly impressive level while retaining high reliability and real-time performance.

Using tmpfs to handle high IOPS

Collecting the sheer number of metrics we do and writing them all to a spinning disk would be impossible. To solve the IOPS problem, we write all our metrics to a ramdisk (Linux tmpfs) and then back them up to spinning disk with a cron job. Since we have multiple gmetad hosts in separate locations, we're able to protect against an outage at either location. The cron job also runs frequently enough that if the tmpfs is cleared (for instance, when the server reboots), our window of lost data is small.
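The backup itself needs nothing fancy. Below is a hypothetical sketch of the kind of script the cron job can run; the paths and the schedule hinted at in the comment are stand-ins for illustration rather than our actual configuration.

```python
#!/usr/bin/env python
# Hypothetical sketch of the cron-driven tmpfs backup described above.
# Paths are illustrative, not our production values; a crontab entry such as
# "*/10 * * * * /usr/local/bin/backup_rrds.py" would run it periodically.
import subprocess
import sys

TMPFS_RRDS = "/mnt/ganglia-tmpfs/rrds/"       # where gmetad writes (ramdisk)
DISK_RRDS  = "/var/lib/ganglia/rrds-backup/"  # persistent copy on disk

def sync(src, dst):
    # rsync preserves timestamps and only copies changed files, which keeps
    # the periodic backup cheap even with a very large RRD tree.
    subprocess.check_call(["rsync", "-a", "--delete", src, dst])

if __name__ == "__main__":
    if "--restore" in sys.argv:
        # After a reboot the tmpfs starts out empty: copy the last backup in.
        sync(DISK_RRDS, TMPFS_RRDS)
    else:
        sync(TMPFS_RRDS, DISK_RRDS)
```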
Sharding/instancing

Another way we deal with our large number of metrics is to split our gmetad collectors into several instances. We logically divided our infrastructure into several groups, such as webservers, internal systems, the map-reduce cluster, and realtime bidding. Each group has its own gmetad instance, and the only place all the graphs come together is on web dashboards that embed graphs from each instance. This gives us the best of both worlds: a seamless interface to a large volume of data, with the added reliability and performance of multiple instances.

Tools

We've developed several tools to let us get more use out of Ganglia.

snmp2ganglia

snmp2ganglia is a daemon we wrote that translates SNMP OIDs into Ganglia metrics. It takes a configuration file that maps a given OID (and whatever adjustments it might need) into a Ganglia metric, with a name, value, type, and units label. The configuration file also associates each OID with one or more hosts, such as network routers or UPS devices. On a configurable schedule, snmp2ganglia polls each host for the OIDs associated with it and delivers those values to Ganglia as metrics. We use Ganglia's ability to "spoof" a metric's source IP and DNS name to create a pseudo-host for each device.

json2gmetrics

json2gmetrics is a Perl script that wraps the Ganglia::Gmetric library to make it easy to import a lot of metrics from almost any source. The script takes a JSON string containing a list of mappings of the form "name: <metric>, value: <value>", and so on. With this tool, it's trivial for any program or system to generate Ganglia metrics, and each import requires only one fork (as opposed to one fork per metric with the built-in gmetric tool).

gmond plugins

We make extensive use of gmond-style plugins, especially for operating system metrics. For instance, some of the plugins we've written collect:

- SMART counters covering disk health, temperature, and so on
- network interface traffic (packets and bytes) and errors
- CPU and memory consumed by each user account
- mail queue length and statistics
- service uptimes

These metrics are a powerful tool for troubleshooting and analyzing system performance, especially when correlating systems that might seem unrelated through the Ganglia web UI.
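Plugins of this sort are commonly written against gmond's Python module interface: a module exposes metric_init(), which returns metric descriptors, a callback that gmond invokes on every collection cycle, and metric_cleanup(). The sketch below, a single uptime-style metric, shows the shape such a plugin takes; it is an illustration, not one of the production plugins listed above.

```python
# Sketch of a gmond Python metric module, in the style of the plugins listed
# above.  This is an illustrative example (one "system uptime" metric), not
# one of our production plugins.

def uptime_handler(name):
    """Callback gmond invokes on each collection cycle; returns the value."""
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    return [{
        "name": "sys_uptime",
        "call_back": uptime_handler,
        "time_max": 90,
        "value_type": "float",
        "units": "seconds",
        "slope": "both",
        "format": "%f",
        "description": "Seconds since boot",
        "groups": "system",
    }]

def metric_cleanup():
    """Called when gmond shuts down; nothing to clean up here."""
    pass

if __name__ == "__main__":
    # Quick standalone test outside of gmond.
    for d in metric_init({}):
        print("%s = %s %s" % (d["name"], d["call_back"](d["name"]), d["units"]))
```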
RRD management scripts

We have a small army of scripts that work with RRD files:

- adding new RRAs to existing files, such as a new MAX or MIN RRA
- adjusting the parameters of existing RRAs
- merging data from multiple files into one coherent RRD
- smoothing transient spikes in RRD data that throw off average calculations (particularly common with SNMP OID counters)
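As one concrete example of the last item, spikes can be smoothed by dumping the RRD to XML, capping implausible values, and restoring the result. The sketch below shows that dump/restore round trip in simplified form; it is not our production script, and the cap and output naming are placeholders.

```python
#!/usr/bin/env python
# Simplified sketch of spike smoothing: dump the RRD to XML, blank out any
# datapoint above a cap, and restore the result.  The production scripts are
# more careful (per-RRA handling, backups, and so on); this only shows the
# dump/restore round trip the approach relies on.
import os
import re
import subprocess
import sys
import tempfile

def smooth(rrd_path, cap):
    xml = subprocess.check_output(["rrdtool", "dump", rrd_path]).decode()

    def clamp(match):
        try:
            if float(match.group(1)) > cap:
                return "<v>NaN</v>"   # unknown values drop out of averages
        except ValueError:
            pass                      # value wasn't numeric; leave it alone
        return match.group(0)

    # <v>...</v> elements hold the stored datapoints in the dump output.
    xml = re.sub(r"<v>([^<]+)</v>", clamp, xml)

    with tempfile.NamedTemporaryFile(mode="w", suffix=".xml",
                                     delete=False) as tmp:
        tmp.write(xml)
    subprocess.check_call(["rrdtool", "restore", tmp.name,
                           rrd_path + ".smoothed"])
    os.unlink(tmp.name)

if __name__ == "__main__":
    smooth(sys.argv[1], float(sys.argv[2]))
```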
Drawbacks

Although Ganglia is both powerful and flexible, it does have its weak points. Each of the drawbacks below describes a challenge we had to overcome to run Ganglia at our scale.

Necessity of sharding

As our infrastructure grew, we discovered that a single gmetad process just couldn't keep up with the number of metrics we were trying to collect. Updates would be skipped because the gmetad process couldn't write them to the RRD files in time, and the CPU and memory pressure on a single machine was overwhelming. We solved this problem by scaling horizontally: setting up a series of monitoring servers, each of which runs a gmetad process that collects some specific part of the whole set of metrics. This way, we can run as many monitoring servers as we need to reach the scale of our operations. The downside of this solution is significantly increased management cost; we have to manage and coordinate multiple monitoring servers and track which metrics are being collected where. We've mitigated this problem somewhat by setting up the Ganglia web UI on each server to redirect requests to the appropriate server for each category of metrics, so end users see a unified system that hides the sharding.

RRD data consolidation

RRD files offer an appealing property: they keep an unbounded amount of time-series data in a constant amount of space. They achieve this by consolidating older data with a consolidation function (such as MIN, MAX, or AVERAGE) to make room for newer data at full resolution.

For example, an RRD file might be configured to keep a datapoint at 15-second intervals for an hour. This means keeping 240 unique values (4 datapoints/minute * 60 minutes); a whole day of data at this resolution would require 5,760 datapoints! Instead of keeping all of those datapoints, the file might be configured to keep just one averaged datapoint per hour beyond the most recent hour. As new updates are processed, a whole hour's worth of 15-second datapoints is averaged into a single datapoint that represents the entire hour. The file then contains high-resolution data for the last hour and a consolidated representation of that data for the rest of the day.

For Quantcast, we often need to analyze events in a very precise time window. If we saw a CPU load spike for just one minute, for instance, RRDtool's data consolidation would eventually average that spike with the rest of the data for its hour into a single value that did not reflect the spike; the precise data about the event would be lost.
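The consolidation behaviour in that example comes straight from the RRAs defined when the file is created. A hypothetical definition matching the numbers above (a 15-second step, an hour at full resolution, and a day of hourly averages) would look like the following; it illustrates the trade-off rather than our production RRA layout.

```python
#!/usr/bin/env python
# Hypothetical RRD definition matching the worked example above: a 15-second
# step, one RRA keeping an hour of full-resolution data (240 rows) and one
# keeping a day of hourly averages (each consolidating 240 primary points).
import subprocess

subprocess.check_call([
    "rrdtool", "create", "example.rrd",
    "--step", "15",
    "DS:sum:GAUGE:120:U:U",      # "sum" is the data source name gmetad uses
    "RRA:AVERAGE:0.5:1:240",     # 240 x 15s = 1 hour at full resolution
    "RRA:AVERAGE:0.5:240:24",    # 24 x 1h   = 1 day of hourly averages
])
```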
Coordination over a WAN

Quantcast is a global operation, so we needed to make some improvements to Ganglia to get it to a global scale. One of these improvements involved the process of collecting data from a gmond daemon. When a gmetad process collects metrics from a gmond, the data is emitted as an XML tree with a node for each metric, which amounts to a whole lot of highly compressible plaintext. To save bandwidth over our WAN links, we patched gmond to emit a compressed version of that tree, and we patched gmetad to work with compressed streams. This substantially reduced the amount of monitoring data we send over the WAN links at each site, leaving more room for business-critical data.

Excessive IOPS for RRD updates

Whenever a gmetad process writes a metric to an RRD file, the update requires several accesses to the same file: one read in the header area, one write in the header area, and at least one update in the data area. RRDtool attempts to speed up this process by issuing fadvise/madvise system calls to prevent the filesystem from doing normally clever things that interfere with this specific access pattern. Since we write our RRD files to a tmpfs, the fadvise/madvise calls were just slowing things down. We wrote an LD_PRELOAD library that hooks those calls and negates them (returning 0). This allowed us to collect a much larger number of metrics with each gmetad shard, and it significantly reduced CPU usage by the gmetad process.

Conclusions

Ganglia is an incredibly powerful and flexible tool, but at its heart it is designed to collect performance monitoring data about the machines in a single high-performance computing cluster and display them as a series of graphs in a web interface. Quantcast has taken that basic tool and stretched it to its limit: growing it to an exceptionally large scale, extending it across a global infrastructure, and scouring the hidden corners of Ganglia's potential to find new tools and functionality. Most importantly, though, we've made Ganglia into a platform upon which other services rely. From Nagios alerts and rich alert emails to integration with application development, it has become a cornerstone of our production operations.