ROCANA WHITEPAPER

How to Investigate an Infrastructure Performance Problem
INTRODUCTION

As IT infrastructure has grown more complex, IT administrators and operators have struggled to retain control. Gone are the days of single-server applications running on self-contained servers. Today, the typical application is made up of many backend services and banks of front-end servers. As early as 2001, a single Google query touched thousands of machines [1]. While Google has always pushed the limits of distributed computing, the architectures it pioneered are increasingly becoming the norm.

Even at the micro level, systems are becoming more complex. The typical Internet application operating at any significant scale requires a minimum of two to three machines (a backend database, an application server, and a front-end web server). Modern IT departments are often responsible for dozens or hundreds of applications across thousands of hosts at multiple locations. Monitoring and diagnosing performance problems on enterprise-scale distributed architectures requires new tools and methodologies optimized for the unique problems in this space.

A TRADITIONAL APPROACH

To dive deeper into this topic, let's look at a typical scenario that affects almost any IT organization. Given a user-facing web application, how do we not only monitor its performance but also find the root cause of performance problems before they affect a large number of users?

The traditional solution to this problem has been a combination of monitoring, alerting, and query systems carefully configured and managed by the IT operations team. This may mean running a system like Nagios to monitor individual servers and alert on well-known error conditions, such as a full disk partition, typically combined with performance monitoring tools like Ganglia and Graphite to visualize key performance metrics. While powerful, these brute-force tools begin to break down when dealing with large-scale, distributed infrastructure.

Take the example of an application with multiple front-end web servers, and consider what happens when a single web server starts reporting higher than average latency for user requests. This type of performance problem is typically found by monitoring a dashboard that shows a latency spike. While this model of operation works well for applications that run on one server or even a small number of servers, it fails when an application is deployed across hundreds or thousands of servers. To find this problem, you need to be looking at the right dashboard, one that includes the specific web server, around the time the performance problem starts. Even if you see the problem, you then have to manually employ many search tools to figure out why this specific server is misbehaving. Compounding this problem is the reality of modern IT operations, in which operators are responsible for multiple applications and often interact with hosts in multiple data centers. Tools that cannot draw connections across applications, hosts, and data centers are simply inadequate.
GUIDED ROOT CAUSE ANALYSIS

At Rocana, we take a different approach to solving these types of problems. Instead of relying on a patchwork of monitoring tools built for simpler times, we've built next-generation IT operations analytics tools around the concept of guided root cause analysis. A methodology for solving problems, guided root cause analysis brings all of your operations data into a single platform that features anomaly detection driven by machine learning, flexible visualization, and search tools specifically designed to enable IT operators to drill into performance and other operations problems. Rather than trying to replace IT operators with purely automated tools that can't respond to real-world situations, we seek to augment the skills of the IT operator with tools designed from the ground up to deal with the large-scale data problems created by modern IT infrastructure.

The scale and processing capacity of modern data centers, combined with drastic leaps in user traffic and the complexity of modern applications, put accurate monitoring out of the reach of legacy solutions. Large-scale operations require Big Data monitoring and analytics capacity: the means to accumulate and process billions of events per hour in real time, to explore the logs of thousands of machines with speed and specificity, and to make sense of the data the operator is evaluating.

Taking our example of a performance problem plaguing a single web server that is part of a much larger deployment, a truly effective solution must be able to detect the problem in minutes (not hours or days) and help a single IT operator sniff out the source of that problem without needing specialized familiarity with that application or its immediate environment. These are the criteria Rocana Ops sets out to satisfy. Let's see how it can be used to highlight a potential problem and guide the operator toward the root cause, all using a single interface on a unified data gathering and analytics platform.

DETECTING PERFORMANCE PROBLEMS

One of the ways Rocana Ops augments the IT staff is by employing machine learning to highlight potential problems before they cause a negative experience for end users. Rocana Ops includes anomaly detection methods that compare the current value of key metrics against historical models, which allows it to highlight when key elements of your infrastructure start behaving unexpectedly.

Since response time is a key factor in the success of any user-facing web application, IT operators typically have strict service level agreements (SLAs) declaring acceptable response times. But by the time the application is violating those SLAs, the end user's experience may already be suffering. The advantage of anomaly detection is that it is not limited to drawing attention to issues that we already know are outside of our SLAs or pre-defined thresholds; it can highlight activity that is outside what we would consider normal, potentially helping us get ahead of problems that would violate the standards fixed in an SLA.
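The models Rocana Ops maintains are more sophisticated than this, but the underlying idea can be sketched in a few lines: build a baseline from a metric's own history and flag observations that fall well outside it. The function, sample response times, and threshold below are illustrative assumptions for this whitepaper, not Rocana Ops internals.

    from statistics import mean, stdev

    def is_anomalous(history, current, threshold=3.0):
        """Flag `current` when it sits more than `threshold` standard
        deviations away from the mean of the historical samples."""
        if len(history) < 2:
            return False                    # not enough data for a baseline
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return current != mu            # flat history: any change is unusual
        return abs(current - mu) / sigma > threshold

    # Hypothetical response-time samples (ms) for web-02 over the past hour,
    # followed by two candidate observations.
    baseline = [112, 118, 109, 121, 115, 117, 110, 119, 114, 116]
    print(is_anomalous(baseline, 186))      # True: far outside the usual range
    print(is_anomalous(baseline, 120))      # False: consistent with history

In production, of course, the baseline would be rebuilt continuously from streaming metric data rather than from a fixed list, and the notion of "normal" would account for daily and weekly cycles.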
Let's take a look at the Rocana Ops home page and see what catches our eye. Right away, we can see that one of our web servers is reporting an anomaly in HTTP request response time. The time reported is still well within our SLA, and it's only affecting a single host, but we'd like to get to the bottom of this before it impacts our users further.

DIAGNOSING POTENTIAL CAUSES

Like most operations teams, we have established some custom dashboards to help us monitor and diagnose key host metrics that might impact performance, such as disk I/O, memory, and CPU utilization. After a quick glance, it's clear that app-02.int.rocana.com and db-02.int.rocana.com are experiencing higher than average CPU utilization. This certainly seems like a probable cause, but is there anything we can do to increase our level of certainty? Since we've been extracting response time as a metric for anomaly detection, we're able to edit this dashboard and graph the response time alongside the other metrics.
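To see why overlaying the two metrics is so persuasive, consider a quick back-of-the-envelope check: if the web server's response time rises and falls with the application server's CPU utilization, the two series will be strongly correlated. The sample values below are invented for illustration; in practice the numbers would come from the same metrics the dashboard plots.

    from statistics import mean

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length series."""
        mx, my = mean(xs), mean(ys)
        cov  = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        varx = sum((x - mx) ** 2 for x in xs)
        vary = sum((y - my) ** 2 for y in ys)
        return cov / (varx * vary) ** 0.5

    # Invented samples over the same eight intervals this morning:
    cpu_app02  = [22, 25, 24, 61, 78, 83, 80, 76]          # % CPU, app-02
    resp_web02 = [110, 114, 112, 151, 178, 186, 181, 174]  # ms, web-02
    print(f"r = {pearson(cpu_app02, resp_web02):.2f}")     # close to 1.0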
Let's see what happens when we graph the response time alongside those host metrics. This definitely helps our case, and points to a likely direct cause of the higher response times. The remaining question is: what's causing the CPU spike?

FINDING THE ROOT CAUSE

From the dashboard, we can see that the CPU spike and the response time spike both occurred around 7:23 this morning. To dig in further, we can move to the search tab and see if we can find any interesting events that coincide with that time frame. Let's start by looking at the related hosts over a 30-minute time window centered on the anomaly. While this gets us close, we're still left with a lot of events to sort through. Before looking at the individual events, maybe there's something we can tell from the event volume broken down by the hosts where we saw changes in metrics correlated with the spike in response time: specifically, web-02.int.rocana.com (where we saw the latency spike), app-02.int.rocana.com (where we saw a large CPU utilization spike), and db-02.int.rocana.com (where we saw a smaller, but still significant, CPU utilization spike).
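Conceptually, this step amounts to restricting the event stream to a 30-minute window centered on the anomaly and counting events per host per minute. The sketch below illustrates that reduction with hypothetical event tuples and an assumed anomaly timestamp; in the walkthrough itself, the equivalent aggregation happens through the Rocana Ops search interface.

    from collections import Counter
    from datetime import datetime, timedelta

    ANOMALY_TIME = datetime(2015, 8, 14, 7, 23)   # hypothetical date and time
    HALF_WINDOW  = timedelta(minutes=15)
    HOSTS = {
        "web-02.int.rocana.com",
        "app-02.int.rocana.com",
        "db-02.int.rocana.com",
    }

    def per_host_volume(events):
        """events: iterable of (timestamp, host, message) tuples.
        Returns event counts per (host, minute) inside the 30-minute window."""
        counts = Counter()
        for ts, host, _msg in events:
            if host in HOSTS and abs(ts - ANOMALY_TIME) <= HALF_WINDOW:
                counts[(host, ts.replace(second=0, microsecond=0))] += 1
        return counts

    sample = [
        (datetime(2015, 8, 14, 7, 22, 30), "app-02.int.rocana.com", "restarting"),
        (datetime(2015, 8, 14, 7, 22, 45), "app-02.int.rocana.com", "loading modules"),
        (datetime(2015, 8, 14, 8, 15, 0),  "db-02.int.rocana.com",  "routine checkpoint"),
    ]
    print(per_host_volume(sample))   # the app-02 bucket near 07:22 stands out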
Now we're onto something. The event volume for the app-02.int.rocana.com host shows noticeable activity around the same time as the detected anomaly. Let's narrow our search and focus on the events from that single host in the five minutes centered on the anomaly. We're getting close: we can see that the application server was restarted around the time we saw the anomaly and CPU utilization spiked on db-02.int.rocana.com and app-02.int.rocana.com.

Let's conduct one more search, looking for the startup log message from the application server across a wider range of time. By skimming these events, we can see that the startup event for the recent restart of the application server on app-02.int.rocana.com shows a new version. It looks like someone accidentally deployed development code to one of the production application servers. We can even go a step further and compare the version on this server with the rest of the application servers used for this application. Finally, we can work with the dev team to get the application code reverted to the stable version.
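As a rough illustration of that final comparison, the sketch below pulls a version string out of hypothetical application startup log lines and flags any host whose deployed version differs from the rest of the fleet. The log format, host names, and version numbers are invented for illustration and are not Rocana Ops output.

    import re
    from collections import Counter

    # Hypothetical startup log lines, one per production application server.
    startup_lines = {
        "app-01.int.rocana.com": "Application server started, version 2.4.1",
        "app-02.int.rocana.com": "Application server started, version 2.5.0-SNAPSHOT",
        "app-03.int.rocana.com": "Application server started, version 2.4.1",
    }

    def deployed_versions(lines_by_host):
        """Extract the version token from each host's startup message."""
        versions = {}
        for host, line in lines_by_host.items():
            match = re.search(r"version\s+(\S+)", line)
            if match:
                versions[host] = match.group(1)
        return versions

    versions = deployed_versions(startup_lines)
    fleet_version, _ = Counter(versions.values()).most_common(1)[0]
    outliers = {h: v for h, v in versions.items() if v != fleet_version}
    print(outliers)   # {'app-02.int.rocana.com': '2.5.0-SNAPSHOT'}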
CONCLUSION

IT infrastructure is more complex than it has ever been, and as the cloud, containerization, and microservices become more popular, that complexity only grows. Traditional, brute-force IT monitoring tools only scratch the surface when it comes to handling this complexity. IT operators have always needed tools that can highlight anomalous behavior before it becomes problematic and enable them to quickly zero in on root causes. This ability to connect the necessary dots depends on the capacity to zoom out and back in, from service, to host, to data center, and back again. In a large-scale, diversified IT operation, this now means combining unprecedented data capacity with surgical specificity, a combination that has left the last generation's solutions behind.

This simple but powerful example of a guided root cause analysis process demonstrates the effectiveness of Rocana's novel approach to IT operations analytics. At Rocana, we believe this approach is spearheading a shift toward augmented IT operations, putting the ability to explore the largest, most complex next-generation IT environments within a single browser session, without requiring any special syntax knowledge or extensive familiarity with the tool.

ABOUT ROCANA

Rocana (formerly known as ScalingData) provides enterprises with the ability to maintain control of their modern, global-scale infrastructure. By using Big Data and advanced analytics, Rocana augments staff skills to increase efficiency and awareness, thereby improving service assurance. Unlike brute-force, legacy log management tools that lack scalability, are slow, and have poor cost-to-value ratios, Rocana Ops is optimized to manage huge amounts of data and encourage analysis that shows a complete picture of IT operations.

[1] Dean, Jeffrey. "Challenges in Building Large-Scale Information Retrieval Systems." WSDM '09, Imagina Building, Barcelona, Spain, 10 February 2009. Invited talk.

Rocana, Inc.
548 Market St #22538, San Francisco, CA 94104
+1 (877) ROCANA1
info@rocana.com
www.rocana.com

© 2015 Rocana, Inc. All rights reserved. Rocana and the Rocana logo are trademarks or registered trademarks of Rocana, Inc. in the United States and/or other countries. WP-IPP-0815