Big Insights from Little Data: A Spotlight on Unlocking Insights from the Log Data That Matters
Introduction

Logentries is a leading SaaS provider for collecting, analyzing and managing machine-generated log data, processing billions of log events every day from users across 100 countries. In this report, researchers at Logentries show how they analyzed a sample of this data over a 14-day period and highlight how the challenge of dealing with big data can be especially difficult when a user is looking to identify specific events in large volumes of data.

The research team analyzed over 22 billion events produced across a 14-day period by 6,000 Heroku apps, a sample of the Logentries user base. The team analyzed these events from the perspective of a DevOps professional: someone responsible for building and running cloud applications. They determined that a significant amount of the most useful and actionable data for a given use case goes unnoticed, because it is a vanishingly small percentage of the overall dataset.

Why Log Data Matters

DevOps professionals are commonly tasked with building, deploying, managing and maintaining applications, both in the cloud and on premise. To help ensure their applications are performing properly, DevOps often rely on log data for troubleshooting, diagnostics and application systems monitoring. However, they are challenged by the large amount of log data an application generates during the normal course of business: even a relatively small application can generate millions of log events per day. The challenge for DevOps is not only to manage increasing volumes of log data, but also to quickly and efficiently sift through the millions or even billions of log events collected to find the individual log entries containing the important information they're looking for. In many instances, this information includes the application errors, exceptions, warnings and other critical log events related to application and system performance needed to monitor and improve performance or reliability.
By reviewing and taking action based on these types of events, DevOps can react to potential issues in real time, preventing outages by proactively addressing them before they adversely impact an application.

This report highlights the challenge facing DevOps teams today. After looking across 6,000 applications producing over 22 billion Heroku log events, the report finds that on average 99.82% of log events are noise (for the given use case), with only 0.18% of log events containing critical information for DevOps concerned with application performance and reliability (i.e. application/platform errors, warnings or exceptions). Given that only 0.18% of log data contains valuable information for this particular end user, it demonstrates that while big data receives most of the
attention in the media, it's actually the little data that counts and which often provides the big insights. The report also shows that this issue is significantly exacerbated as systems grow in size: for larger applications, only 0.02% (a 50th of 1%) of log data contained valuable information for this DevOps use case (i.e. 99.98% noise), so the little data that matters becomes an even smaller percentage of overall log volumes. The report also gives a more fine-grained analysis of the breakdown of this data and gives insight into the characteristics of the log data produced by Heroku apps of varying sizes.

The Modern Needle in the Big Data Haystack

Over a 14-day period, Logentries analyzed log data from a sample of over 6,000 Heroku applications, providing a view into typical log volumes, error rates and performance characteristics. For the purposes of the analysis, the Heroku applications were split into two groups: moderately sized applications, defined as those that generated less than 2GB of log data per day, and large applications, defined as those that generated more than 2GB of log data per day. Relevant events from a DevOps perspective were defined as entries that contained any Heroku error code or application exceptions that occurred in an application.

[Figure: Log event breakdown across all applications — Noise, Warnings 0.088%, Critical Events 0.052%, Exceptions 0.031%, Fatal Events 0.010%]
Within the 0.181% of events containing relevant information for DevOps as defined above, 0.088% were warnings, 0.052% were critical events, 0.010% were fatal and 0.031% were application exceptions.

[Figure: Error events breakdown — Warnings 49%, Critical Events 29%, Exceptions 17%, Fatal Events 5%]

Looking at the total amount of relevant information as a whole, 49% were warnings, 29% were critical events, 5% were fatal and 17% were application exceptions.

Sample 1: Moderately Sized Heroku Applications

Moderately sized Heroku applications were defined as those applications that generated less than 2 GB of log data per day. Based on this definition, our sampling included 11.4 billion log events. Within this sample, 99.67% of the events were noise, burying the 0.33% of log events that a DevOps person would likely be most interested in.
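To put these sample figures in absolute terms, a quick back-of-the-envelope calculation (a sketch using only the percentages quoted in this report; the exact counts are illustrative) shows the scale of the haystack:

```python
# Back-of-the-envelope calculation from the moderate-app sample figures
# quoted above: 11.4 billion events over 14 days, of which 0.33% were
# relevant. The report gives only percentages, so counts are approximate.
total_events = 11_400_000_000   # moderate-app sample, 14 days
signal_rate = 0.0033            # 0.33% of events were relevant

signal_events = round(total_events * signal_rate)
noise_events = total_events - signal_events

print(f"signal: {signal_events:,}")   # ~37.6 million relevant events
print(f"noise:  {noise_events:,}")    # ~11.36 billion noise events
```

Even in the "moderate" sample, a DevOps engineer is hunting for tens of millions of relevant entries buried in more than eleven billion lines of noise.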
[Figure: Log event breakdown, moderate Heroku app — Noise, Warnings, Critical Events, Fatal Events, Exceptions]

Within the 0.33% of relevant events, 49% were warnings, 27% were critical events, 6% were fatal and 18% were application exceptions.

[Figure: Moderate Heroku app, error events breakdown — Warnings 49%, Critical Events 27%, Exceptions 18%, Fatal Events 6%]
Sample 2: Larger Heroku Applications

Larger Heroku applications were defined as those applications that generated more than 2 GB of log data per day. Based on this definition, our sampling included approximately 11 billion log events. Within this sample, 99.98% of the events were noise, burying the 0.02% of log events that a DevOps person would likely be most interested in.

[Figure: Log event breakdown, large Heroku app — Noise, Warnings, Critical Events, Fatal Events, Exceptions]

Within the 0.02% of relevant events, 35% were warnings, 59% were critical events, 1% were fatal and 5% were exceptions.

[Figure: Large Heroku app, error events breakdown — Critical Events 59%, Warnings 35%, Exceptions 5%, Fatal Events 1%]
Examining the Needle

Relevant events were defined with input from Heroku engineers and were grouped based on Heroku error codes. Note that HTTP errors start with the letter H, runtime errors with R, and logging errors with L. The individual events were categorized as follows:

Warnings: Warning error codes relate to less severe issues, including refused connections and timeouts, that can often ultimately lead to critical or fatal errors. They include H20 App boot timeout, R13 Attach error, H21 Backend connection refused, H19 Backend connection timeout, H22 Connection limit reached, R16 Detached, L10 Drain buffer overflow, R12 Exit timeout, H80 Maintenance mode, H17 Poorly formatted HTTP response, H16 Redirect to herokuapp.com, H18 Request interrupted, and L11 Tail buffer overflow.

Critical events: Critical events were error codes related to degraded performance and possibly some dropped requests. They included H12 Request timeout, R14 Memory quota exceeded, H15 Idle connection, H13 Connection closed without response, and H11 Backlog too deep.

Fatal: Fatal events were error codes related to an application or dyno crashing, i.e. a user request will not be served. They included H10 App crashed, R10 Boot timeout, H99 Platform error, and R15 Memory quota vastly exceeded. Note that H99 represents an error on the Heroku platform itself; unlike all of the other errors, it does not require action from you to correct.

Exceptions: This category included all application exceptions, i.e. exceptions generated by individual applications.

Digging into the data a little more shows how averages can sometimes be misleading. Many applications did not have events from any of the above categories.
In fact, looking at a single day's worth of log data for all 6,000 applications, the total number of applications producing at least one event from each event category was as follows — Warnings: 3,357; Critical events: 1,415; Fatal: 462; Application exceptions:

[Figure: applications producing at least one event per category — Warning, Critical, Fatal, Exception]
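Grouping raw log lines by these Heroku error codes can be sketched with a small classifier. The `code=` token follows Heroku's router log format; the function name and structure here are illustrative, not the Logentries implementation:

```python
import re

# Error-code groupings as defined in this report. H = HTTP, R = runtime,
# L = logging errors, per Heroku's error-code naming convention.
WARNINGS = {"H20", "R13", "H21", "H19", "H22", "R16", "L10", "R12",
            "H80", "H17", "H16", "H18", "L11"}
CRITICAL = {"H12", "R14", "H15", "H13", "H11"}
FATAL    = {"H10", "R10", "H99", "R15"}

# Heroku router error lines carry the error code in a "code=H12" token.
CODE_RE = re.compile(r"\bcode=([HRL]\d+)\b")

def categorize(line: str) -> str:
    """Return this report's category for a single Heroku log line."""
    m = CODE_RE.search(line)
    if not m:
        return "noise"
    code = m.group(1)
    if code in WARNINGS:
        return "warning"
    if code in CRITICAL:
        return "critical"
    if code in FATAL:
        return "fatal"
    return "noise"
```

For example, `categorize('at=error code=H12 desc="Request timeout"')` files the line under critical events, while a line with no error code falls into the noise bucket.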
Warnings were the most prevalent category, appearing in over 50% of the applications. This is not surprising, since warnings can often be safely ignored and left unaddressed by developers as they build their apps. Critical events, fatal events and application exceptions appeared more sporadically, across smaller numbers of apps, and are in general likely the result of a more severe application error. The volume of events in these categories also tended to increase as application log volumes grew, as systems under load usually produce more exceptions when something goes awry.

Large spikes in error events in your logs can often mean something serious is happening in your system. While undesirable, a major outage is usually pretty noticeable, and log management technologies can be used to assess, diagnose and resolve the issue. Intermittent issues and anomalies in your log data, however, can be much more difficult to track down and can often go unnoticed for long periods, resulting in lost business and a poor user experience. While log management solutions have largely focused on search capabilities, it is not possible to search for something you are unaware of. The anomalies in the little data are the small percentage of key insights that you really need to know about. Typical examples of such issues from a DevOps perspective include:

A small percentage of request response times falling outside an acceptable range, for example fewer than 1% of requests taking more than 3 seconds. This could easily go unnoticed during testing but will hurt your bottom line if a section of your user base goes elsewhere due to a poor user experience.

A fraction of users receiving request timeouts (i.e. the site looks like it is down). Similar to the issue above: if a small section of your user base is having problems with your service, you may never find out about it, and your system may be failing silently.
Ongoing warnings that go unaddressed and may lead to more severe problems. A typical example is a memory leak, which continues to leak until you get a dreaded out-of-memory error, resulting in a much more serious outage.

An example outside of the DevOps scenario could be a change in the rate of signups, or in any KPI for that matter.

From our analysis above, if on average less than 0.18% of log data contains useful information, only a fraction of that 0.18% relates to these hard-to-find issues. Thus finding these nuggets of information within billions of events can prove extremely challenging without the right tools. Unfortunately, most log management tools to date have focused on providing powerful search capabilities, which does not greatly help to solve the issue of finding the anomalies in your data.
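The first example above, a slow tail of response times, can be checked in a few lines of code. This sketch assumes Heroku router log lines carrying a `service=NNNms` field; the 3-second threshold, 1% cutoff and function names are illustrative assumptions:

```python
import re

# Heroku router log lines include the request service time, e.g. "service=15ms".
SERVICE_RE = re.compile(r"\bservice=(\d+)ms\b")

def slow_fraction(lines, threshold_ms=3000):
    """Fraction of router log lines whose service time exceeds threshold_ms."""
    times = [int(m.group(1)) for line in lines
             if (m := SERVICE_RE.search(line))]
    if not times:
        return 0.0
    return sum(t > threshold_ms for t in times) / len(times)

def tail_is_unhealthy(lines, threshold_ms=3000, max_fraction=0.01):
    """True when more than max_fraction of requests exceed threshold_ms."""
    return slow_fraction(lines, threshold_ms) > max_fraction
```

A check like this surfaces the "1% of requests over 3 seconds" problem that a plain search would never find, because no single log line is itself an error.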
Searching assumes you know what you are looking for. However, for intermittent issues or anomalies in your data, you need different capabilities to identify such problems easily. To find the little data that has the big impact, our strong belief is that building a log management solution solely around searching is fundamentally the wrong approach; rather, capabilities such as visualization, pre-analysis of data and the ability to plug in intelligence for different use cases are key to identifying the little data that matters.

Heroku

Heroku (pronounced her-OH-koo) is a cloud platform as a service (PaaS) that enables application developers to build applications without having to deploy, operate or scale the underlying hardware or software. Supporting Ruby, Node.js, Clojure, Java, Python and Scala, Heroku developers can choose between 1X and 2X dynos to scale their applications based on growth and scale needs. Apps can be deployed in US and European geographic regions. Heroku was founded in 2007 and is owned by Salesforce.com.

DevOps

DevOps is the blending of tasks performed by a company's application development and systems operations teams. Traditional application development teams are in charge of gathering business requirements and building the application. The development team tests the program in a development environment for quality assurance (QA), and once the appropriate testing is completed, the code is released to operations for use. The traditional operations team is tasked with deploying and maintaining the software, and ensuring the application is available for end users and runs smoothly. The challenge with the traditional model, where the application development and operations teams are separate, is that the development team may not be aware of operational roadblocks that prevent the program from working as anticipated, and the operations team may not be aware of how the application was designed, architected and built.
DevOps is both a philosophy that promotes better communication, collaboration, and integration between the two teams, and also an emerging and critical role where individuals possess the skills and operate as both a developer
and a systems operations engineer. DevOps seeks to help organizations more rapidly produce applications and update systems to meet changing market and customer needs. As an example, in many software companies today, software development teams deploy multiple software updates per day to their operational systems.

Conclusions & Recommendations

After taking a sample of our entire user base and analyzing 6,000 Heroku applications, which produced over 22 billion log events across a 14-day period, the research team at Logentries has provided a real-world view into how the challenge of big data is not simply scaling to consume and manage it, but rather having the ability to easily and quickly find the critical information needed to take action. Time to insight is a key driver in how quickly DevOps can react to system issues and prevent major outages.

In this case, Logentries highlighted the role of the DevOps professional in building and running cloud-based applications, and ensuring those cloud applications are meeting user needs. By sampling its Heroku community and the roughly 22 billion log events generated by it over a two-week period, Logentries illustrated the unique challenges posed to DevOps in consuming and processing large quantities of machine-generated data. More specifically, the data showed that for individuals looking to maintain application health and performance, on average across 6,000 Heroku applications, 99.82% of log data is noise. It's the 0.18% of log events that contain the critical application health and performance information needed by DevOps: the cloud application equivalent of a needle in the log data haystack.

For DevOps building and managing Heroku applications, we recommend:

Log your Heroku events to a log management tool and use that tool to track and diagnose application health and performance.
If you aren't using a solution today, we'd obviously recommend Logentries, as it uses a real-time analytics engine and a collective intelligence model to automatically pre-process, index and tag Heroku log events in real time, and provides users with a view into their Heroku logs out of the box. Logentries automatically identifies Heroku error codes and categorizes them into the groups outlined in this report (warnings, critical events, fatal and application exceptions), and provides a view into these events through the Logentries Heroku Dashboard, which helps you identify anomalies in your data. Logentries doesn't require any programming, advanced search queries or costly setup to start using.

Review your specific Heroku applications and the error codes being reported in your event logs. Use these error codes to investigate potential issues and their underlying sources to determine changes that should be made.
Evaluate whether your Heroku applications are performing in line with acceptable norms and compare them against the averages in this report. You can also evaluate your application against either the moderately sized application sampling statistics or the larger application sampling statistics. If an application is higher than the averages reported, we recommend you bring it within Heroku norms.

Set up real-time monitoring and alerting on your Heroku log stream so you can be proactively notified when an error occurs. You can set the frequency and duration for receiving notifications of errors logged.
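As a minimal illustration of this last recommendation, the sketch below counts Heroku error codes in a sliding time window and flags when a threshold is crossed. The window size, threshold and class name are assumptions for illustration, not a description of any particular tool:

```python
import re
import time
from collections import deque

# Matches any Heroku error code token, e.g. "code=H12" or "code=R14".
ERROR_RE = re.compile(r"\bcode=[HRL]\d+\b")

class ErrorRateAlerter:
    """Flag when more than max_errors error events occur within the window."""

    def __init__(self, window_seconds=60, max_errors=5, clock=time.time):
        self.window = window_seconds
        self.max_errors = max_errors
        self.clock = clock          # injectable clock eases testing
        self.timestamps = deque()   # arrival times of recent error events

    def observe(self, line: str) -> bool:
        """Feed one log line; return True when an alert should fire."""
        now = self.clock()
        if ERROR_RE.search(line):
            self.timestamps.append(now)
        # Drop events that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_errors
```

In practice you would feed this from a Heroku log drain and replace the boolean return with a notification hook, but the core idea, a threshold on error frequency rather than a search query, is what turns passive logs into proactive alerts.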
About the Authors

Trevor Parsons, PhD, Co-Founder, Chief Scientist

Trevor Parsons is Chief Scientist and Co-founder of Logentries. Trevor has over 10 years' experience in enterprise software and in particular has specialized in developing enterprise monitoring and performance tools for distributed systems. He is also a research fellow at the Performance Engineering Lab Research Group and was formerly a scientist at the IBM Center for Advanced Studies. Trevor holds a PhD from University College Dublin, Ireland.

Benoit Guadin, PhD, Sr. System Architect

Benoit obtained his PhD at the University of Rennes in France in 2004, on controlling discrete event systems. After holding a teaching position for a year in France and a postdoctoral position at the Fraunhofer Institute in Berlin, Benoit came to Ireland, where he was a postdoctoral fellow at University College Dublin and then a research fellow at the University of Limerick. Benoit joined Logentries in October 2012, bringing expertise he previously gathered in topics such as formal methods, control theory, autonomic systems, dynamic analysis, automatic system modeling and ontologies.

About Logentries

Logentries is a SaaS offering for collecting and analyzing huge quantities of machine-generated log data and making that data easily accessible to individual developers, small teams, and enterprise customers. While traditional log management solutions require advanced technical skills to use or are costly to set up, Logentries provides a simple, accessible alternative. With Logentries, your log files are filtered, pre-processed and correlated up front for quicker and easier retrieval of the individual log entries that matter most. In turn, this pre-processing is combined with a collective insights model that enables important information to be dynamically tagged, automatically routed, and easily shared across teams and computing platforms. Logentries eliminates the need for an in-house data expert or
specialist to interpret and use the data. We are easy to set up and free to use, with flexible pricing options available.

Resources

Try Logentries, it's free: https://logentries.com/quick-start/

Build a Heroku application today: https://www.heroku.com/

Check out an overview of the Heroku error codes.

Concerned your Heroku app is too slow? Check out our blog post, "How do I know if my Heroku app is slow" (H12 and H11 errors).

Don't let your Heroku apps fail silently: check out another classic Logentries blog post on Heroku error monitoring.

Additional information on Heroku errors can be found at the following:

H12, request timeout errors: https://devcenter.heroku.com/articles/request-timeout

H70, Access to Bamboo HTTP endpoint denied errors: https://devcenter.heroku.com/articles/custom-domains#custom-subdomains