Automated Anomaly Detection and Performance Modeling of Enterprise Applications

Size: px
Start display at page:

Download "Automated Anomaly Detection and Performance Modeling of Enterprise Applications"

Transcription

1 Automated Anomaly Detection and Performance Modeling of Enterprise Applications 6 LUDMILA CHERKASOVA and KIVANC OZONAT Hewlett-Packard Labs NINGFANG MI Northeastern University JULIE SYMONS Hewlett-Packard and EVGENIA SMIRNI College of William and Mary Automated tools for understanding application behavior and its changes during the application lifecycle are essential for many performance analysis and debugging tasks. Application performance issues have an immediate impact on customer experience and satisfaction. A sudden slowdown of enterprise-wide application can effect a large population of customers, lead to delayed projects, and ultimately can result in company financial loss. Significantly shortened time between new software releases further exacerbates the problem of thoroughly evaluating the performance of an updated application. Our thesis is that online performance modeling should be a part of routine application monitoring. Early, informative warnings on significant changes in application performance should help service providers to timely identify and prevent performance problems and their negative impact on the service. We propose a novel framework for automated anomaly detection and application change analysis. It is based on integration of two complementary techniques: (i) a regression-based transaction model that reflects a resource consumption model of the application, and (ii) an application performance signature that provides a compact model of runtime behavior of the application. The proposed integrated framework provides a simple and powerful solution for anomaly detection and analysis of essential performance changes in application behavior. An additional benefit of the proposed approach is its simplicity: It is not intrusive and is based on monitoring data that is typically available in enterprise production environments. The introduced E. Smirni has been partially supported by NSF grants CSR and CCF Author s addresses: L. Cherkasova, HP Labs, Palo Alto, CA; lucy.cherkasova@hp.com; K. Ozonat, HP Labs, Palo Alto, CA; kivanc.ozonat@hp.com; N. Mi, Northeastern University, Boston, MA; n.mi@neu.edu; J. Symons, HP, Cupertino, CA; julie.symons@hp.com; E. Smirni, College of William and Mary, Williamsburg, VA; esmirni@cs.wm.edu. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY USA, fax +1 (212) , or permissions@acm.org. C 2009 ACM /2009/11-ART6 $10.00 DOI /

2 6:2 L. Cherkasova et al. solution further enables the automation of capacity planning and resource provisioning tasks of multitier applications in rapidly evolving IT environments. Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems; C.4 [Special-Purpose and Application-Based Systems]: Performance of Systems General Terms: Experimentation, Management, Performance Additional Key Words and Phrases: Anomaly detection, online algorithms, performance modeling, capacity planning, multitier applications ACM Reference Format: Cherkasova, L., Ozonat, K., Mi, N., Symons, J., and Smirni, E Automated anomaly detection and performance modeling of enterprise applications. ACM Trans. Comput. Syst. 27, 3, Article 6 (November 2009), 32 pages. DOI = / INTRODUCTION Today s IT and services departments are faced with the difficult and challenging task of ensuring that enterprise business-critical applications are always available and provide adequate performance. As the complexity of IT systems increases, performance analysis becomes time consuming and labor intensive for support teams. For larger IT projects, it is not uncommon for the cost factors related to performance tuning, performance management, and capacity planning to result in the largest and least controlled expense. We address the problem of efficiently diagnosing essential performance changes in application behavior in order to provide timely feedback to application designers and service providers. Typically, preliminary performance profiling of an application is done by using synthetic workloads or benchmarks which are created to reflect a typical application behavior for typical client transactions. While such performance profiling can be useful at the initial stages of design and development of a future system, it may not be adequate for analysis of performance issues and observed application behavior in existing production systems. For one thing, an existing production system can experience a very different workload compared to the one that has been used in its testing environment. Secondly, frequent software releases and application updates make it difficult and challenging to perform a thorough and detailed performance evaluation of an updated application. When poorly performing code slips into production and an application responds slowly, the organization inevitably loses productivity and experiences increased operating costs. Automated tools for understanding application behavior and its changes during the application lifecycle are essential for many performance analysis and debugging tasks. Yet, such tools are not readily available to application designers and service providers. The traditional reactive approach is to set thresholds for observed performance metrics and raise alarms when these thresholds are violated. This approach is not adequate for understanding the performance changes between application updates. Instead, a proactive approach that is based on continuous application performance evaluation may assist enterprises in reducing loss of productivity by time-consuming diagnosis of essential performance changes in application performance. With

3 Automated Anomaly Detection and Performance Modeling 6:3 complexity of systems increasing and customer requirements for QoS growing, the research challenge is to design an integrated framework of measurement and system modeling techniques to support performance analysis of complex enterprise systems. Our goal is to design a framework that enables automated detection of application performance changes and provides useful classification of the possible root causes. There are a few causes that we aim to detect and classify. Performance Anomaly. By performance anomaly we mean that the observed application behavior (e.g., current CPU utilization) cannot be explained by the observed application workload (e.g., the type and volume of transactions processed by the application suggests a different level of CPU utilization). Typically, it might point to either some unrelated resource-intensive process that consumes system resources or some unexpected application behavior caused by not fully debugged application code. Application Transaction Performance Change. By transaction performance change we mean an essential change (increase or decrease) in transaction processing time, for example, as a result of the latest application update. If the detected change indicates an increase of the transaction processing time then an alarm is raised to assess the amount of additional resources needed and provides the feedback to application designers on the detected change (e.g., is this change acceptable or expected?). It is important to distinguish between performance anomaly and workload change. A performance anomaly is indicative of abnormal situation that needs to be investigated and resolved. On the contrary, a workload change (i.e., variations in transaction mix and load) is typical for Web-based applications. Therefore, it is highly desirable to avoid false alarms raised by the algorithm due to workload changes, though information on observed workload changes can be made available to the service provider. Effective models of complex enterprise systems are central to capacity planning and resource provisioning. In Next Generation Data Centers (NGDC), where server virtualization provides the ability to slice larger, underutilized physical servers into smaller, virtual ones, fast and accurate performance models become instrumental for enabling applications to automatically request necessary resources and support design of utility services. Performance management and maintenance operations come with more risk and an increased potential for serious damage if they are done wrong, and/or decision making is based on erroneous data. The proposed approach aims to automatically detect the performance anomalies and application changes. It effectively serves as the foundation for an accurate, online performance modeling, automated capacity planning, and provisioning of multitier applications using simple regressionbased analytic models [Zhang et al. 2007b]. The rest of the article is organized as follows. Section 2 introduces client versus server transactions. Section 3 provides two motivating examples. Section 4 introduces a regression-based modeling technique for performance anomaly detection. Section 5 presents a case study to validate the proposed technique

4 6:4 L. Cherkasova et al. and discusses its limitations. Section 6 introduces a complementary technique to address observed limitations of the first method. Section 6.3 extends the earlier case study to demonstrate the additional power of the second method, and promotes the integrated solution based on both techniques. Section 7 applies the results of the proposed technique to automate and improve accuracy of capacity planning in production systems. Section 8 describes related work. Finally, a summary and conclusions are given in Section CLIENT VS. SERVER TRANSACTIONS The term transaction is often used with different meanings. In our work, we distinguish between a client transaction and a server transaction. A client communicates with a Web service (deployed as a multitier application) via a Web interface, where the unit of activity at the client side corresponds to a download of a Web page. In general, a Web page is composed of an HTML file and several embedded objects such as images. This composite Web page is called a client transaction. Typically, the main HTML file is built via dynamic content generation (e.g., using Java servlets or JavaServer Pages) where the page content is generated by the application server to incorporate customized data retrieved via multiple queries from the back-end database. This main HTML file is called a server transaction. Typically, the server transaction is responsible for most latency and consumed resources [Cherkasova et al. 2003] (at the server side) during client transaction processing. A client browser retrieves a Web page (client transaction) by issuing a series of HTTP requests for all the objects: First it retrieves the main HTML file (server transaction) and after parsing it, the browser retrieves all the embedded objects. Thus, at the server side, a Web page retrieval corresponds to processing multiple requests that can be retrieved either in sequence or via multiple concurrent connections. It is common that a Web server and application server reside on the same hardware, and shared resources are used by the application and Web servers to generate main HTML files as well as to retrieve page-embedded objects. Since the HTTP protocol does not provide any means to delimit the beginning or the end of a Web page, it is very difficult to accurately measure the aggregate resources consumed due to Web page processing at the server side. There is no practical way to effectively measure the service times for all page objects, although accurate CPU consumption estimates are required for building an effective application provisioning model. To address this problem, we define a client transaction as a combination of all the processing activities at the server side to deliver an entire Web page requested by a client, that is, generate the main HTML file as well as retrieve embedded objects and perform related database queries. We use client transactions for constructing a resource consumption model of the application. The server transactions reflect the main functionality of the application. We use server transactions for analysis of the application performance changes (if any) during the application lifecycle.

5 Automated Anomaly Detection and Performance Modeling 6:5 Observed Predicted 30 Utilization (%) Time (day) Fig. 1. CPU utilization of OVSD service. 3. TWO MOTIVATING EXAMPLES Frequent software updates and shortened application development time dramatically increase the risk of introducing poorly performing or misconfigured applications to production environment. Consequently, the effective models for online, automated detection of whether application performance deviates from its normal behavior pattern become a high-priority item on the service provider s wish list. Example 1. Resource Consumption Model Change. In earlier papers [Zhang et al. 2007a, 2007b], a regression-based approach is introduced for resource provisioning of multitier applications. The main idea is to use a statistical linear regression for approximating the CPU demands of different transaction types (where a transaction is defined as a client transaction). However, the accuracy of the modeling results critically depends on the quality of monitoring data used in the regression analysis: If collected data contain periods of performance anomalies or periods when an updated application exhibits very different performance characteristics, then this can significantly impact the derived transaction cost and can lead to an inaccurate provisioning model. Figure 1 shows the CPU utilization (red line) of the HP Open View Service Desk (OVSD) over a duration of 1 month (each point reflects an 1-hour monitoring period). Most of the time, CPU utilization is under 10%. Note that for each weekend, there are some spikes of CPU utilization (marked with circles in Figure 1) which are related to administrator system management tasks and which are orthogonal to transaction processing activities of the application. Once provided with this information, we can use only weekdays monitoring data for deriving CPU demands of different transactions of the OVSD service. As a result, the derived CPU cost accurately predicts CPU requirements of the application and can be considered as a normal resource consumption model of the application. Figure 1 shows predicted CPU utilization which is computed using the CPU cost of observed transactions. The predicted CPU utilization accurately models the observed CPU utilization with an exception of weekends system management periods. However, if we were not aware of

6 6:6 L. Cherkasova et al. trans latency (ms) Tr1 Tr time (mins) Fig. 2. The transaction latency measured by HP (Mercury) Diagnostics tool. performance anomalies over weekends, and would use all the days (i.e., including weekends) of the 1-month dataset, the accuracy of regression would be much worse (the error would double) and this would significantly impact the modeling results. Example 2. Updated Application Performance Change. Another typical situation that requires a special handling is the analysis of the application performance when it was updated or patched. Figure 2 shows the latency of two application transactions, Tr1 and Tr2, over time (here, a transaction is defined as a server transaction). Typically, tools like HP (Mercury) Diagnostics [Mercury Diagnostics] are used in IT environments for observing latencies of the critical transactions and raising alarms when these latencies exceed the predefined thresholds. While it is useful to have insight into the current transaction latencies that implicitly reflect the application and system health, this approach provides limited information on the causes of the observed latencies and cannot be used directly to detect the performance changes of an updated or modified application. The latencies of both transactions vary over time and get visibly higher in the second half of the figure. This does not look immediately suspicious because the latency increase can be a simple reflection of a higher load in the system. The real story behind this figure is that after timestamp 160, we began executing an updated version of the application code where the processing time of transaction Tr1 is increased by 10 ms. However, by looking at the measured transaction latency over time we can not detect this: The reported latency metric does not provide enough information to detect this change. Problem Definition. Application servers are a core component of a multitier architecture. A client communicates with a service deployed as a multitier application via request-reply transactions. A typical server reply consists of the Web page dynamically generated by the application server. The application server may issue multiple database calls while preparing the reply. Understanding the cascading effects of the various tasks that are sprung by a single request-reply

7 Automated Anomaly Detection and Performance Modeling 6:7 transaction is a challenging task. Furthermore, significantly shortened time between new software releases further exacerbates the problem of thoroughly evaluating the performance of an updated application. Typically, a performance analyst needs to perform additional data-cleansing procedures to identify and filter out time periods that correspond to abnormal application performance. These procedures are manual and time consuming, and are based on offline data analysis and ad hoc techniques [Arlitt and Farkas 2005]. Inability to extract correct data while eliminating erroneous measurements can lead to inappropriate decisions that have significant technical and business consequences. At the same time, such a cleansing technique cannot identify periods when the updated or patched application performs normally but its performance exhibits a significant change. While there could be a clear difference in system performance before and after such an application update, the traditional data-cleansing techniques fail to identify such a change. In this way, while collected measurements are correct and accurate, the overall dataset has measurements related to two, potentially very different, systems: before the application update and after the application update. For accurate and efficient capacity planning and modeling future system performance, one needs to identify a subset of measurements that reflect the updated/modified application. The goal of this article is to design an online approach that automatically detects the performance anomalies and application changes. Such a method enables a set of useful performance services: early warnings on deviations in expected application performance, raise alarms on abnormal resource usage, create a consistent dataset for capacity planning and modeling future application resource requirements (by filtering out performance anomalies and pointing out the periods of changed application performance). The next sections present our solution that is based on integration of two complementary techniques: (i) a regression-based transaction model that correlates processed transactions and consumed CPU time to create a resource consumption model of the application; and (ii) an application performance signature that provides a compact model of runtime behavior of the application. 4. REGRESSION-BASED APPROACH FOR DETECTING MODEL CHANGES AND PERFORMANCE ANOMALIES We use statistical learning techniques to model the CPU demand of the application transactions (client transactions) on a given hardware configuration, to find the statistically significant transaction types, to discover the time segments where the resource consumption of a given application can be approximated by the same regression model, to discover time segments with performance anomalies, and to differentiate among application performance changes and workload-related changes as transactions are accumulated over time. Prerequisite to applying regression is that a service provider collects the application server access log that reflects all processed client transactions

8 6:8 L. Cherkasova et al. (i.e., client Web page accesses), and the CPU utilization of the application server(s) in the evaluated system. 4.1 Regression-Based Transaction Model To capture the site behavior across time we observe a number of different client transactions over a monitoring window t of fixed length L. The transaction mix and system utilization are recorded at the end of each monitoring window. Assuming that there are totally n transaction types processed by the server, we use the following notation. T m denotes the time segment for monitored site behavior and T m denotes the cardinality of the time segment T m, that is, the number of monitoring windows in T m ; N i,t is the number of transactions of the ith type in the monitoring window t, where 1 i n; U CPU,t is the average CPU utilization of application server during this monitoring window t T m ; D i is the average CPU demand of transactions of the ith type at application server, where 1 i n; D 0 is the average CPU overhead related to activities that keep the system up. There are operating system processes or background jobs that consume CPU time even when there are no transactions in the system. From the utilization law, one can easily obtain Eq. (1) for each monitoring window t. n D 0 + N i,t D i = U CPU,t L (1) i=1 Let C i,m denote the approximated CPU cost of D i for 0 i n in the time segment T m. Then, an approximated utilization U CPU,t can be calculated as n U CPU,t = C i=1 0,m + N i,t C i,m, (2) L To solve for C i,m, one can choose a regression method from a variety of known methods in the literature. A typical objective for a regression method is to minimize either the absolute error or the squared error. In all experiments, we use the nonnegative Least Squares Regression (nonnegative LSQ) provided by MATLAB to obtain C i,m. This nonnegative LSQ regression minimizes the error ɛ m = (U CPU,t U CPU,t) 2. t T m such that C i,m 0. When solving a large set of equations with collected monitoring data over a large period of time, a direct (naive) linear regression approach would attempt to set nonzero values for as many transactions as it can to minimize the error when the model is applied to the training set. However, this may lead to poor prediction accuracy when the model is later applied to other datasets, as the

9 Automated Anomaly Detection and Performance Modeling 6:9 model may have become too finely tuned to the training set alone. In statistical terms, the model may overfit the data if it sets values to some coefficients to minimize the random noise in the training data rather than to correlate with the actual CPU utilization. In order to create a model which utilizes only the statistically significant transactions, we use stepwise linear regression [Draper and Smith 1998] to determine which set of transactions are the best predictors for the observed CPU utilization. The algorithm initializes with an empty model which includes none of the transactions. At each following iteration, a new transaction is considered for inclusion in the model. The best transaction is chosen by adding the transaction which results in the lowest mean squared error when it is included. For each N i (0 i n ) we try to solve the set of equations in the form D 0 + N i,t D i = U CPU,t L, (3) while minimizing the error ɛ i = (U CPU,t U CPU,t) 2. t T m Once we performed this procedure for all the transactions, we select the transaction N k (0 k n ) which results in the lowest mean squared error, that is, such that ɛ k = min ɛ i. 0 i N Then, transaction N k is added to the empty set. After that, the next iteration is repeated to choose the next transaction from the remaining subset to add to the set in a similar way. Before the new transaction is included in the model, it must pass an F-test which determines if including the extra transaction results in a statistically significant improvement in the model s accuracy. If the F-test fails, then the algorithm terminates since including any further transactions cannot provide a significant benefit. The coefficients for the selected transactions are calculated using the linear regression technique described before. The coefficient for the transactions not included in the model is set to zero. Typically, for an application with n transactions, one needs at least n + 1 samples to do regression using all n transactions. However, since we do transaction selection using a stepwise linear regression and an F-test, we can do regression by including only a subset of n transactions in the regression model. This allows us to apply regression without having to wait all n + 1 samples. 4.2 Algorithm Outline Using statistical regression, we can build a model that approximates the overall resource cost (CPU demand) of application transactions on a given hardware configuration. However, the accuracy of the modeling results critically depends on the quality of monitoring data used in the regression analysis: If the collected data contain periods of performance anomalies or periods when an updated application exhibits very different performance characteristics, then this

10 6:10 L. Cherkasova et al. Fig. 3. Finding optimal segmentation and detecting anomalies. Fig. 4. Model reconciliation. can significantly impact the derived transaction cost and can lead to an inaccurate approximation model. The challenge is to design an online method that alarms service providers of model changes related to performance anomalies and application updates. Our method has the following three phases. Finding the Optimal Segmentation. This stage of the algorithm identifies the time points when the transaction cost model exhibits a change. For example, as shown in Figure 3, the CPU costs of the transactions (Tr 1, Tr 2,..., Tr n ) during the time interval (T 0, T 1 ) are defined by a model (C 0, C 1, C 2,..., C n ). After that, for a time interval (T 1, T 2 ) there was no a single regression model that provides the transaction costs within a specified error bound. This time period is signaled as having anomalous behavior. As for time interval (T 2, T 3 ), the transaction cost function is defined by a new model (C 0, C 1, C 2,..., C n ). Filtering Out the Anomalous Segments. Our goal is to continuously maintain the model that reflects a normal application resource consumption behavior. At this stage, we filter out anomalous measurements identified in the collected dataset, for example, the time period (T 1, T 2 ) that corresponds to anomalous time fragment as shown in Figure 3. Model Reconciliation. After anomalies have been filtered out, one would like to unify the time segments with no application change/update/modification by using a single regression model: We attempt to reconcile two different segments (models) by using a new common model as shown in Figure 4. We try to find a new solution (new model) for combined transaction data in (T 0, T 1 ) and (T 2, T 3 ) with a given (predefined) error. If two models can be reconciled then an observed model change is indicative of the workload change and not of the application change. We use the reconciled model to represent application behavior across different workload mixes. If the model reconciliation does not work, then it means these models indeed describe different consumption models of application over time, and it is indicative of an actual application performance change.

11 Automated Anomaly Detection and Performance Modeling 6: Online Algorithm Description This section describes the three phases of the online model change and anomaly detection algorithm in more detail. (1) Finding the optimal segmentation. This stage of the algorithm identifies the time points where the transaction cost model exhibits a change. In other words, we aim to divide a given time interval T into time segments T m (T = T m ) such that within each time segment T m the application resource consumption model and the transaction costs are similar. We use a cost-based statistical learning algorithm to divide the time into segments with a similar regression model. The algorithm is composed of two steps: construction of weights for each time segment T m ; dynamic programming to find the optimum segmentation (that covers a given period T) with respect to the weights. The algorithm constructs an edge with a weight, w m, for each possible time segment T m T. This weight represents the cost of forming the segment T m. Intuitively, we would like the weight w m to be small if the resource cost of transactions in T m can be accurately approximated with the same regression model, and to be large if the regression model has a poor fit for approximating the resource cost of all the transactions in T m. The weight function, w m is selected as a Lagrangian sum of two cost functions: w 1,m and w 2,m, where the function w 1,m is the total regression error over T m : w 1,m = (U CPU,t U CPU,t) 2, t T m the function w 2,m is a length penalty function. A length penalty function penalizes shorter time intervals over longer time intervals and discourages the dynamic programming from breaking the time into segments of very short length (since the regression error can be significantly smaller for a shorter time segments). It is a function that decreases as the length of the interval T m increases. We set it to a function of the entropy of segment length as w 2,m = ( T m ) log( T m / T ). Our goal is to divide a given time interval T into time segments T m (T = Tm ) that minimize the Lagrangian sum of w 1,m and w 2,m over the considered segments, that is, the segmentation that minimiz W 1 (T) + λw 2 (T), (4) where the parameter λ is the Lagrangian constant that is used to control the average regression error ɛ allow (averaged over T) allowed in the model, and W 1 (T) = w 1,m and W 2 (T) = w 2,m. m m We denote the set of all possible segmentations of T into segments by S. For instance, for T with three monitoring windows, t 1,t 2,t 3, S would have 4 elements: (t 1,t 2,t 3 ), (t 1 t2,t 3 ), (t 1, t 2 t3 ), and (t 1 t2 t3 ). Each element of the

12 6:12 L. Cherkasova et al. set S consists of segments, that is, T m s. For instance, (t 1, t 2 t3 ) consists of the segments t 1 and t 2 t3. The minimization problem defined by formula (4) can be expressed as s = argmin s S W (s) = W 1 (s) + λw 2 (s), (5) where λ is the fixed Lagrangian constant, s denotes an element of S, and W 1 (s) = T m s w 1,m and W 2 (s) = T m s w 2,m. We first provide the segmentation algorithm for a fixed value of λ (we have an explanation after the algorithm, how to find the right value of λ to be used in the algorithm for a given, allowable regression error ɛ). In the algorithm, l represents the index of the monitoring window, where the monitoring windows are denoted by t l. We represent segments using the following form: [t j, t k ]. It indicates the segment that extends from monitoring window t j to monitoring window t k. W t l denotes the minimum Lagrangian weight sum for segmentation of the first l time points, that is, t 1,t 2,... t l. (1) Set W t 1 = 0, and set l = 2. (2) For 1 j < l, set w 1,[t j,t k ] to the total regression error when a regression model is fit over the time samples in the segment [t j, t k ], then set w 2,[t j,t k ] = t k t j log( t k t j / T ) and, finally, set w [t j,t k ] = w 1,[t j,t k ] + λw 2,[t j,t k ]. (3) Set W t l = min 1 j l 1 (W t j + w [t j,t l ]). (4) Set j = argmin 1 j l 1 (W t j + w [t j,t l ]). (5) Then the optimum segmentation is the segmentation result for j (already obtained in previous iterations) augmented by the single segment from j to l. (6) Set l = l + 1, and return to step 2. In the preceding algorithm, step 2 sets the values for the w 1, w 2, and w terms, and step 3 and step 4 apply dynamic programming to find the minimum cost W l of segmentation up to the l th time sample. The algorithm shows the best segmentation for a fixed value of λ. Here is the additional sequence of steps to find the appropriate value of λ for the use in the algorithm. In our implementation, we first decide on an allowable regression error ɛ (typically, provided by the service provider), and then seek the λ that gives the best segmentation for that allowable error ɛ by iterating over different values of λ = λ 0, λ 1,..., λ k. In particular, if the algorithm for λ 0 = 1 results in the optimal segmentation with regression error greater than ɛ, then we choose λ 1 = 2 λ 0, and repeat the segmentation algorithm. Once we find a value λ k

13 Automated Anomaly Detection and Performance Modeling 6:13 that results in the optimal segmentation with regression error smaller than ɛ, then we use a binary search between values λ k 1 and λ k to find the best value of λ for a use in the algorithm with a given allowable regression error ɛ. Let us consider an example to explain the intuition for how the Eq. (5) works. Let us first consider the time interval T with no application updates or changes. Let time interval T be divided into two consecutive time segments T 1 and T 2. First of all, W 1 (T 1 ) + W 1 (T 2 ) W (T), hence there are two possibilities. One possibility is that a regression model constructed over T is also a good fit over time segments T 1 and T 2, and the combined regression error of this model over time segments T 1 and T 2 is approximately equal to the total regression error over the original time interval T. The other possibility is that there could be different regression models that are constructed over shorter time segments T 1 and T 2 with the sum of regression errors smaller than a regression error obtained when a single regression model is constructed over T. For the second possibility, the question is whether the difference is due to a noise or small outliers in T, or do segments T 1 and T 2 indeed represent different application behaviors, that is, before and after the application modification and update. This is where the W 2 function in Eq. (4) comes into play. The term log( T m / T ) is a convex function of T m. Therefore, each time a segment is split into multiple segments, W 2 increases. This way, the original segment T results in the smallest W 2 compared to any subset of its segments, and λ can be viewed as a parameter that controls the amount of regression error allowed in a segment. By increasing the value of λ, we allow a larger W 1, regression error, in the segment. This helps in reconciling T 1 and T 2 into a single segment representation T. In such a way, by increasing the value of λ one can avoid the incorrect segmentations due to noise or small outliers in the data T. When an average regression error over a single segment T is within the allowable error ɛ allow (ɛ allow is set by a service provider), the overall function (4) results in the smallest value for the single time segment T compared to the values computed to any of its subsegments, for instance, T 1 and T 2. Therefore, our approach groups all time segments defined by the same CPU transaction cost (or the same regression model) into a single segment. By decreasing the value of λ, one can prefer the regression models with a smaller total regression error on the data, while possibly increasing the number of segments over the data. There is a trade-off between the allowable regression error (it is a given parameter for our algorithm) and the algorithm outcome. If the allowable regression error is set too low then the algorithm may result in a high number of segments over data, with many segments being neither anomalies or application changes (these are the false alarms, typically caused by significant workload changes). From the other side, by setting the allowable regression error too high, one can miss a number of performance anomalies and application changes that happened in these data and masked by the high allowable error.

14 6:14 L. Cherkasova et al. (2) Filtering out the anomalous segments. An anomalous time segment is one where observed CPU utilization cannot be explained by an application workload, that is, measured CPU utilization cannot be accounted for by the transaction CPU cost function. This may happen if an unknown background process(es) is using the CPU resource either at a constant rate (e.g., using 40% of the CPU at every monitoring window during some time interval) or randomly (e.g., the CPU is consumed by the background process at different rates at every monitoring window). It is important to be able to detect and filter out the segments with anomalous behavior, as otherwise the anomalous monitoring windows will corrupt the regression estimations of the time segments with normal behavior. Furthermore, detecting anomalous time segments provides an insight into the service problems and a possibility to correct the problems before they cause major service failure. We consider a time segment T m as anomalous if one of the following conditions take place. The constant coefficient, C 0,m, is large. Typically, C 0,m is used in the regression model to represent the average CPU overhead related to idle system activities. There are operating system processes or system background jobs that consume CPU time even when there is no transaction in the system. The estimate for the idle system CPU overhead over a monitoring window is set by the service provider. When C 0,m exceeds this threshold a time segment T m is considered as anomalous. The segment length of T m is short, indicating that a model does not have a good fit that ensures the allowed error threshold. Intuitively, the same regression model should persist over the whole time segment between the application updates/modifications unless something else, anomalous, happens to the application consumption model and it manifests itself via the model changes. (3) Model reconciliation. After anomalies have been filtered out, one would like to unify the time segments with no application change/update/modification by using a single regression model. This way, it is possible to differentiate between the segments with application changes from the segments which are the parts of the same application behavior and were segmented out by the anomalies in between. In such cases, the consecutive segments can be reconciled into a single segment after the anomalies in the data are removed. If there is an application change, on the other hand, the segments will not be reconciled, since the regression model that fits to the individual segments will not fit to the overall single segment without exceeding the allowable error (unless the application performance change is so small that it still fits within the allowable regression error). 4.4 Algorithm Complexity The complexity of the algorithm is O(M 2 ), where M is the number of time samples collected so far. This is problematic since the complexity is quadratic in a term that increases as more time samples are collected. In our case study, we have not experienced a problem as we used only 30 hours of data with 1-minute

15 Automated Anomaly Detection and Performance Modeling 6:15 Table I. Testbed Components Processor RAM Clients (Emulated-Browsers) Pentium D / 6.4 GHz 4GB Front Server - Apache/Tomcat 5.5 Pentium D / 3.2 GHz 4GB Database Server - MySQL5.0 Pentium D / 6.4 GHz 4GB intervals, namely M = However, in measuring real applications over long periods of time, complexity is likely to become challenging. To avoid this, one solution might retain only the last X samples, where X should be a few orders larger than the number of transaction types n and/or cover a few weeks/months of historic data. This way, one would have a sufficiently large X to get accurate regression results, yet the complexity will not be too large. 5. CASE STUDY In this section, we validate the proposed regression-based transaction model as an online solution for anomaly detection and performance changes in application behavior and discuss limitations of this approach. The next subsection describes the experimental environment used in the case study as well as a specially designed workload for evaluating the proposed approach. 5.1 Experimental Environment In our experiments, we use a testbed of a multitier e-commerce site that simulates the operation of an online bookstore, according to the classic TPC-W benchmark [TPC-W Benchmark]. This allows to conduct experiments under different settings in a controlled environment in order to evaluate the proposed anomaly detection approach. We use the terms front server and application server interchangeably in this article. Specifics of the software/hardware used are given in Table I. Typically, client access to a Web service occurs in the form of a session consisting of a sequence of consecutive individual requests. According to the TPC-W specification, the number of concurrent sessions (i.e., customers) or emulated browsers (EBs) is kept constant throughout the experiment. For each EB, the TPC-W benchmark statistically defines the user session length, the user think time, and the queries that are generated by the session. The database size is determined by the number of items and the number of customers. In our experiments, we use the default database setting, that is, the one with 10,000 items and 1,440,000 customers. TPC-W defines 14 different transactions which are classified as either of browsing or ordering types as shown in Table II. We assign a number to each transaction (shown in parentheses) according to their alphabetic order. Later, we use these transaction ids for presentation convenience in the figures. According to the weight of each type of activity in a given traffic mix, TPC-W defines 3 types of traffic mixes as follows: the browsing mix with 95% browsing and 5% ordering; the shopping mix with 80% browsing and 20% ordering; the ordering mix with 50% browsing and 50% ordering.

16 6:16 L. Cherkasova et al. Table II. 14 Basic Transactions and Their Types in TPC-W Browsing Type Ordering Type Home (8) Shopping Cart (14) New Products (9) Customer Registration (6) Best Sellers (3) Buy Request (5) Product detail (12) Buy Confirm (4) Search Request (13) Order Inquiry (11) Execute Search (7) Order Display (10) Admin Request (1) Admin Confirm (2) Fig. 5. The pseudocode for the random process. One drawback of directly using the transaction mixes described previously in our experiments is that they are stationary, that is, the transaction mix and load do not change over time. Since real enterprise and e-commerce applications are typically characterized by nonstationary transaction mixes (i.e., with changing transaction probabilities in the transaction mix over time) under variable load [Douglis et al. 1997; Cherkasova and Karlsson 2001; Stewart et al. 2007] we have designed an approach that enables us to generate nonstationary workloads using the TPC-W setup. To generate a nonstationary transaction mix with variable transaction mix and load we run 4 processes as follows: the three concurrent processes each executing one of the standard transaction mixes (i.e., browsing, shopping and ordering respectively) with the arbitrary fixed number of EBs (e.g, 20, 30, and 50 EBs, respectively). We call them base processes; the fourth, so-called random process executes one of the standard transaction mixes (in our experiments, it is the shopping mix) with a random execution period while using a random number of EBs for each period. To navigate and control this random process we use specified ranges for the random parameters in this workload. The pseudocode of this random process is shown in Figure 5 (the code also shows parameters we use for the nonstationary mix in this article). Due to the fourth random process the workload is nonstationary and the transaction mix and load vary significantly over time.

17 Automated Anomaly Detection and Performance Modeling 6:17 Fig hour TPC-W workload used in the case study. In order to validate the online anomaly detection and application change algorithm, we designed a special 30-hour experiment with TPC-W that has 7 different workload segments shown in Figure 6, which are defined as follows. (1) The browsing mix with the number of EBs equal to 200, 400, 600, 800, and 1000 respectively. (2) In order to validate whether our algorithm correctly detects performance anomalies, we generated a special workload with nonstationary transaction mix as described earlier and an additional CPU process (that consumes random amount of CPU) on a background. (3) The shopping mix with the number of EBs equal to 200, 400, 600, 800, and 1000 respectively. (4) The ordering mix with the number of EBs equal to 200, 400, 600, 800, and 1000 respectively. (5) The nonstationary TPC-W transaction mix described earlier in this section. (6) In order to validate whether we can automatically recognize the application change, we modified the source code of the Home transaction (the 8th transaction) in TPC-W by inserting a controlled CPU loop into the code of this transaction and increasing its service time by 5 ms. Using this modified code, we performed experiment with the nonstationary TPC-W transaction mix described before. (7) Another experiment with the modified TPC-W benchmark, where the service time of the Home transaction is increased by 10 ms and the nonstationary TPC-W transaction mix described previously. 5.2 Validation of Online Regression-Based Algorithm We applied our online regression-based algorithm to the special 30-hour workload shown in Figure 6. The expectations are that the algorithm should exhibit a model change and issue an alarm for anomaly workload described by segment 2 in Figure 6. Then a single model should represent workload segments

18 6:18 L. Cherkasova et al. Fig. 7. Model changes in the studied workload. Fig. 8. Model reconciliation in the studied workload. 3, 4, and 5 as shown in Figure 6. Moreover, the algorithm should reconcile segments 1, 3, 4, and 5 with a single model (since they are all generated with the same application operating under different workloads), while issuing the model changes for application modifications described by segments 6 and 7. We experimented with two values of allowable error in our experiments: ɛallow 1 = 3% and ɛ2 allow = 1% to demonstrate the impact of error setting and stress the importance of tuning this value. When we used ɛallow 1 = 3%, the algorithm had correctly identified 4 major model changes as shown in Figure 7. In fact, for the second segment there were 42 model changes (not shown in this figure to simplify the presentation) with maximum segment being 5 epochs. The algorithm accurately detected that the whole segment 2 is anomalous. Then the tool correctly performed the model reconciliation for the consecutive segments around the anomalous segment as shown in Figure 8. Finally, the algorithm correctly raised alarms on the application change when the regression model has changed and could not be reconciled (last two segments in Figure 8).

19 Automated Anomaly Detection and Performance Modeling 6:19 When we used ɛallow 1 = 1%, the algorithm had identified 6 major model changes: In addition to 4 model changes shown in Figure 7 the algorithm reported 2 extra segments at timestamps 790 and 1030 (separating apart segments 3, 4, and 5 shown in Figure 6) that correspond to workload changes and that are false alarms. It is important to use the appropriate error setting to minimize the number of false alarms. When the allowable error is set too small the proposed algorithm often overreacts and identifies model changes that correspond to significant workload changes under the same, unmodified application. In the next section, we will introduce a complementary technique that helps to get an insight in whether the model change is indeed an application change or whether it rather corresponds to a workload change. This technique can also be used for error tuning in the earlier described online algorithm. The power of the introduced regression-based approach is that it is sensitive to accurately detect a difference in the CPU consumption model of application transactions. However, it cannot identify the transactions that are responsible for this resource consumption difference. To complement the regression-based approach and to identify the transactions that cause the model change we use a different method described in the next section and which is based on building a representative application performance signature. 6. DETECTING TRANSACTION PERFORMANCE CHANGE Nowdays there is a new generation of monitoring tools, both commercial and research prototypes, that provide useful insights into transaction activity tracking and latency breakdown across different components in multitier systems. However, typically such monitoring tools just report the measured transaction latency and provide an additional information on application server versus database server latency breakdown. Using this level of information it is often impossible to decide whether an increased transaction latency is a result of a higher load in the system or whether it can be an outcome of the recent application modification, and is directly related to the increased processing time for this transaction type. In this section, we describe an approach based on an application performance signature that provides a compact model of runtime behavior of the application. The application signature is built based on new concepts: the transaction latency profiles and transaction signatures. These become instrumental for creating an application signature that accurately reflects important performance characteristics. Comparing new application signature against the old application signature allows detecting transaction performance changes. 6.1 Server Transaction Monitoring Many enterprise applications are implemented using the J2EE standard, a Java platform which is used for Web application development and designed to meet the computing needs of large enterprises. For transaction monitoring we use the HP (Mercury) Diagnostics [Mercury Diagnostics] tool which offers a monitoring solution for J2EE applications. The Diagnostics tool consists of

20 6:20 L. Cherkasova et al. Fig. 9. Multitier application configuration with the Diagnostics tool. two components: the Diagnostics Probe and the Diagnostics Server as shown in Figure 9. The Diagnostics tool collects performance and diagnostic data from applications without the need for application source-code modification or recompilation. It uses bytecode instrumentation and industry standards for collecting system and JMX metrics. Instrumentation refers to bytecode that the Probe inserts into the class files of the application as they are loaded by the class loader of the virtual machine. Instrumentation enables a Probe to measure execution time, count invocations, retrieve arguments, catch exceptions, and correlate method calls and threads. The J2EE Probe shown in Figure 9 is responsible for capturing events from the application, aggregating the performance metrics, and sending these captured performance metrics to the Diagnostics Server. We have implemented a Java-based processing utility for extracting performance data from the Diagnostics Server in real time and creating a so-called application log that provides a complete information on all transactions processed during the monitoring window, such as their overall latencies, outbound calls, and the latencies of the outbound calls. In a monitoring window, Diagnostics collects the following information for each transaction type: a transaction count; an average overall transaction latency for observed transactions. 1 This overall latency includes transaction processing time at the application server as well as all related query processing at the database server, that is, latency is measured from the moment of the request arrival at the application server to the time when a prepared reply is sent back by the application server; see Figure 10; a count of outbound (database) calls of different types; an average latency of observed outbound calls (of different types). The average latency of an outbound call is measured from the moment the database 1 Note that here a latency is measured for the server transaction (see the difference between client and server transactions described in Section 2).

Automated Anomaly Detection and Performance Modeling of Enterprise Applications
