Automated Anomaly Detection and Performance Modeling of Enterprise Applications

Size: px
Start display at page:

Download "Automated Anomaly Detection and Performance Modeling of Enterprise Applications"

Transcription

1 Automated Anomaly Detection and Performance Modeling of Enterprise Applications 6 LUDMILA CHERKASOVA and KIVANC OZONAT Hewlett-Packard Labs NINGFANG MI Northeastern University JULIE SYMONS Hewlett-Packard and EVGENIA SMIRNI College of William and Mary Automated tools for understanding application behavior and its changes during the application lifecycle are essential for many performance analysis and debugging tasks. Application performance issues have an immediate impact on customer experience and satisfaction. A sudden slowdown of enterprise-wide application can effect a large population of customers, lead to delayed projects, and ultimately can result in company financial loss. Significantly shortened time between new software releases further exacerbates the problem of thoroughly evaluating the performance of an updated application. Our thesis is that online performance modeling should be a part of routine application monitoring. Early, informative warnings on significant changes in application performance should help service providers to timely identify and prevent performance problems and their negative impact on the service. We propose a novel framework for automated anomaly detection and application change analysis. It is based on integration of two complementary techniques: (i) a regression-based transaction model that reflects a resource consumption model of the application, and (ii) an application performance signature that provides a compact model of runtime behavior of the application. The proposed integrated framework provides a simple and powerful solution for anomaly detection and analysis of essential performance changes in application behavior. An additional benefit of the proposed approach is its simplicity: It is not intrusive and is based on monitoring data that is typically available in enterprise production environments. The introduced E. Smirni has been partially supported by NSF grants CSR and CCF Author s addresses: L. Cherkasova, HP Labs, Palo Alto, CA; lucy.cherkasova@hp.com; K. Ozonat, HP Labs, Palo Alto, CA; kivanc.ozonat@hp.com; N. Mi, Northeastern University, Boston, MA; n.mi@neu.edu; J. Symons, HP, Cupertino, CA; julie.symons@hp.com; E. Smirni, College of William and Mary, Williamsburg, VA; esmirni@cs.wm.edu. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY USA, fax +1 (212) , or permissions@acm.org. C 2009 ACM /2009/11-ART6 $10.00 DOI /

2 6:2 L. Cherkasova et al. solution further enables the automation of capacity planning and resource provisioning tasks of multitier applications in rapidly evolving IT environments. Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems; C.4 [Special-Purpose and Application-Based Systems]: Performance of Systems General Terms: Experimentation, Management, Performance Additional Key Words and Phrases: Anomaly detection, online algorithms, performance modeling, capacity planning, multitier applications ACM Reference Format: Cherkasova, L., Ozonat, K., Mi, N., Symons, J., and Smirni, E Automated anomaly detection and performance modeling of enterprise applications. ACM Trans. Comput. Syst. 27, 3, Article 6 (November 2009), 32 pages. DOI = / INTRODUCTION Today s IT and services departments are faced with the difficult and challenging task of ensuring that enterprise business-critical applications are always available and provide adequate performance. As the complexity of IT systems increases, performance analysis becomes time consuming and labor intensive for support teams. For larger IT projects, it is not uncommon for the cost factors related to performance tuning, performance management, and capacity planning to result in the largest and least controlled expense. We address the problem of efficiently diagnosing essential performance changes in application behavior in order to provide timely feedback to application designers and service providers. Typically, preliminary performance profiling of an application is done by using synthetic workloads or benchmarks which are created to reflect a typical application behavior for typical client transactions. While such performance profiling can be useful at the initial stages of design and development of a future system, it may not be adequate for analysis of performance issues and observed application behavior in existing production systems. For one thing, an existing production system can experience a very different workload compared to the one that has been used in its testing environment. Secondly, frequent software releases and application updates make it difficult and challenging to perform a thorough and detailed performance evaluation of an updated application. When poorly performing code slips into production and an application responds slowly, the organization inevitably loses productivity and experiences increased operating costs. Automated tools for understanding application behavior and its changes during the application lifecycle are essential for many performance analysis and debugging tasks. Yet, such tools are not readily available to application designers and service providers. The traditional reactive approach is to set thresholds for observed performance metrics and raise alarms when these thresholds are violated. This approach is not adequate for understanding the performance changes between application updates. Instead, a proactive approach that is based on continuous application performance evaluation may assist enterprises in reducing loss of productivity by time-consuming diagnosis of essential performance changes in application performance. With

3 Automated Anomaly Detection and Performance Modeling 6:3 complexity of systems increasing and customer requirements for QoS growing, the research challenge is to design an integrated framework of measurement and system modeling techniques to support performance analysis of complex enterprise systems. Our goal is to design a framework that enables automated detection of application performance changes and provides useful classification of the possible root causes. There are a few causes that we aim to detect and classify. Performance Anomaly. By performance anomaly we mean that the observed application behavior (e.g., current CPU utilization) cannot be explained by the observed application workload (e.g., the type and volume of transactions processed by the application suggests a different level of CPU utilization). Typically, it might point to either some unrelated resource-intensive process that consumes system resources or some unexpected application behavior caused by not fully debugged application code. Application Transaction Performance Change. By transaction performance change we mean an essential change (increase or decrease) in transaction processing time, for example, as a result of the latest application update. If the detected change indicates an increase of the transaction processing time then an alarm is raised to assess the amount of additional resources needed and provides the feedback to application designers on the detected change (e.g., is this change acceptable or expected?). It is important to distinguish between performance anomaly and workload change. A performance anomaly is indicative of abnormal situation that needs to be investigated and resolved. On the contrary, a workload change (i.e., variations in transaction mix and load) is typical for Web-based applications. Therefore, it is highly desirable to avoid false alarms raised by the algorithm due to workload changes, though information on observed workload changes can be made available to the service provider. Effective models of complex enterprise systems are central to capacity planning and resource provisioning. In Next Generation Data Centers (NGDC), where server virtualization provides the ability to slice larger, underutilized physical servers into smaller, virtual ones, fast and accurate performance models become instrumental for enabling applications to automatically request necessary resources and support design of utility services. Performance management and maintenance operations come with more risk and an increased potential for serious damage if they are done wrong, and/or decision making is based on erroneous data. The proposed approach aims to automatically detect the performance anomalies and application changes. It effectively serves as the foundation for an accurate, online performance modeling, automated capacity planning, and provisioning of multitier applications using simple regressionbased analytic models [Zhang et al. 2007b]. The rest of the article is organized as follows. Section 2 introduces client versus server transactions. Section 3 provides two motivating examples. Section 4 introduces a regression-based modeling technique for performance anomaly detection. Section 5 presents a case study to validate the proposed technique

4 6:4 L. Cherkasova et al. and discusses its limitations. Section 6 introduces a complementary technique to address observed limitations of the first method. Section 6.3 extends the earlier case study to demonstrate the additional power of the second method, and promotes the integrated solution based on both techniques. Section 7 applies the results of the proposed technique to automate and improve accuracy of capacity planning in production systems. Section 8 describes related work. Finally, a summary and conclusions are given in Section CLIENT VS. SERVER TRANSACTIONS The term transaction is often used with different meanings. In our work, we distinguish between a client transaction and a server transaction. A client communicates with a Web service (deployed as a multitier application) via a Web interface, where the unit of activity at the client side corresponds to a download of a Web page. In general, a Web page is composed of an HTML file and several embedded objects such as images. This composite Web page is called a client transaction. Typically, the main HTML file is built via dynamic content generation (e.g., using Java servlets or JavaServer Pages) where the page content is generated by the application server to incorporate customized data retrieved via multiple queries from the back-end database. This main HTML file is called a server transaction. Typically, the server transaction is responsible for most latency and consumed resources [Cherkasova et al. 2003] (at the server side) during client transaction processing. A client browser retrieves a Web page (client transaction) by issuing a series of HTTP requests for all the objects: First it retrieves the main HTML file (server transaction) and after parsing it, the browser retrieves all the embedded objects. Thus, at the server side, a Web page retrieval corresponds to processing multiple requests that can be retrieved either in sequence or via multiple concurrent connections. It is common that a Web server and application server reside on the same hardware, and shared resources are used by the application and Web servers to generate main HTML files as well as to retrieve page-embedded objects. Since the HTTP protocol does not provide any means to delimit the beginning or the end of a Web page, it is very difficult to accurately measure the aggregate resources consumed due to Web page processing at the server side. There is no practical way to effectively measure the service times for all page objects, although accurate CPU consumption estimates are required for building an effective application provisioning model. To address this problem, we define a client transaction as a combination of all the processing activities at the server side to deliver an entire Web page requested by a client, that is, generate the main HTML file as well as retrieve embedded objects and perform related database queries. We use client transactions for constructing a resource consumption model of the application. The server transactions reflect the main functionality of the application. We use server transactions for analysis of the application performance changes (if any) during the application lifecycle.

5 Automated Anomaly Detection and Performance Modeling 6:5 Observed Predicted 30 Utilization (%) Time (day) Fig. 1. CPU utilization of OVSD service. 3. TWO MOTIVATING EXAMPLES Frequent software updates and shortened application development time dramatically increase the risk of introducing poorly performing or misconfigured applications to production environment. Consequently, the effective models for online, automated detection of whether application performance deviates from its normal behavior pattern become a high-priority item on the service provider s wish list. Example 1. Resource Consumption Model Change. In earlier papers [Zhang et al. 2007a, 2007b], a regression-based approach is introduced for resource provisioning of multitier applications. The main idea is to use a statistical linear regression for approximating the CPU demands of different transaction types (where a transaction is defined as a client transaction). However, the accuracy of the modeling results critically depends on the quality of monitoring data used in the regression analysis: If collected data contain periods of performance anomalies or periods when an updated application exhibits very different performance characteristics, then this can significantly impact the derived transaction cost and can lead to an inaccurate provisioning model. Figure 1 shows the CPU utilization (red line) of the HP Open View Service Desk (OVSD) over a duration of 1 month (each point reflects an 1-hour monitoring period). Most of the time, CPU utilization is under 10%. Note that for each weekend, there are some spikes of CPU utilization (marked with circles in Figure 1) which are related to administrator system management tasks and which are orthogonal to transaction processing activities of the application. Once provided with this information, we can use only weekdays monitoring data for deriving CPU demands of different transactions of the OVSD service. As a result, the derived CPU cost accurately predicts CPU requirements of the application and can be considered as a normal resource consumption model of the application. Figure 1 shows predicted CPU utilization which is computed using the CPU cost of observed transactions. The predicted CPU utilization accurately models the observed CPU utilization with an exception of weekends system management periods. However, if we were not aware of

6 6:6 L. Cherkasova et al. trans latency (ms) Tr1 Tr time (mins) Fig. 2. The transaction latency measured by HP (Mercury) Diagnostics tool. performance anomalies over weekends, and would use all the days (i.e., including weekends) of the 1-month dataset, the accuracy of regression would be much worse (the error would double) and this would significantly impact the modeling results. Example 2. Updated Application Performance Change. Another typical situation that requires a special handling is the analysis of the application performance when it was updated or patched. Figure 2 shows the latency of two application transactions, Tr1 and Tr2, over time (here, a transaction is defined as a server transaction). Typically, tools like HP (Mercury) Diagnostics [Mercury Diagnostics] are used in IT environments for observing latencies of the critical transactions and raising alarms when these latencies exceed the predefined thresholds. While it is useful to have insight into the current transaction latencies that implicitly reflect the application and system health, this approach provides limited information on the causes of the observed latencies and cannot be used directly to detect the performance changes of an updated or modified application. The latencies of both transactions vary over time and get visibly higher in the second half of the figure. This does not look immediately suspicious because the latency increase can be a simple reflection of a higher load in the system. The real story behind this figure is that after timestamp 160, we began executing an updated version of the application code where the processing time of transaction Tr1 is increased by 10 ms. However, by looking at the measured transaction latency over time we can not detect this: The reported latency metric does not provide enough information to detect this change. Problem Definition. Application servers are a core component of a multitier architecture. A client communicates with a service deployed as a multitier application via request-reply transactions. A typical server reply consists of the Web page dynamically generated by the application server. The application server may issue multiple database calls while preparing the reply. Understanding the cascading effects of the various tasks that are sprung by a single request-reply

7 Automated Anomaly Detection and Performance Modeling 6:7 transaction is a challenging task. Furthermore, significantly shortened time between new software releases further exacerbates the problem of thoroughly evaluating the performance of an updated application. Typically, a performance analyst needs to perform additional data-cleansing procedures to identify and filter out time periods that correspond to abnormal application performance. These procedures are manual and time consuming, and are based on offline data analysis and ad hoc techniques [Arlitt and Farkas 2005]. Inability to extract correct data while eliminating erroneous measurements can lead to inappropriate decisions that have significant technical and business consequences. At the same time, such a cleansing technique cannot identify periods when the updated or patched application performs normally but its performance exhibits a significant change. While there could be a clear difference in system performance before and after such an application update, the traditional data-cleansing techniques fail to identify such a change. In this way, while collected measurements are correct and accurate, the overall dataset has measurements related to two, potentially very different, systems: before the application update and after the application update. For accurate and efficient capacity planning and modeling future system performance, one needs to identify a subset of measurements that reflect the updated/modified application. The goal of this article is to design an online approach that automatically detects the performance anomalies and application changes. Such a method enables a set of useful performance services: early warnings on deviations in expected application performance, raise alarms on abnormal resource usage, create a consistent dataset for capacity planning and modeling future application resource requirements (by filtering out performance anomalies and pointing out the periods of changed application performance). The next sections present our solution that is based on integration of two complementary techniques: (i) a regression-based transaction model that correlates processed transactions and consumed CPU time to create a resource consumption model of the application; and (ii) an application performance signature that provides a compact model of runtime behavior of the application. 4. REGRESSION-BASED APPROACH FOR DETECTING MODEL CHANGES AND PERFORMANCE ANOMALIES We use statistical learning techniques to model the CPU demand of the application transactions (client transactions) on a given hardware configuration, to find the statistically significant transaction types, to discover the time segments where the resource consumption of a given application can be approximated by the same regression model, to discover time segments with performance anomalies, and to differentiate among application performance changes and workload-related changes as transactions are accumulated over time. Prerequisite to applying regression is that a service provider collects the application server access log that reflects all processed client transactions

8 6:8 L. Cherkasova et al. (i.e., client Web page accesses), and the CPU utilization of the application server(s) in the evaluated system. 4.1 Regression-Based Transaction Model To capture the site behavior across time we observe a number of different client transactions over a monitoring window t of fixed length L. The transaction mix and system utilization are recorded at the end of each monitoring window. Assuming that there are totally n transaction types processed by the server, we use the following notation. T m denotes the time segment for monitored site behavior and T m denotes the cardinality of the time segment T m, that is, the number of monitoring windows in T m ; N i,t is the number of transactions of the ith type in the monitoring window t, where 1 i n; U CPU,t is the average CPU utilization of application server during this monitoring window t T m ; D i is the average CPU demand of transactions of the ith type at application server, where 1 i n; D 0 is the average CPU overhead related to activities that keep the system up. There are operating system processes or background jobs that consume CPU time even when there are no transactions in the system. From the utilization law, one can easily obtain Eq. (1) for each monitoring window t. n D 0 + N i,t D i = U CPU,t L (1) i=1 Let C i,m denote the approximated CPU cost of D i for 0 i n in the time segment T m. Then, an approximated utilization U CPU,t can be calculated as n U CPU,t = C i=1 0,m + N i,t C i,m, (2) L To solve for C i,m, one can choose a regression method from a variety of known methods in the literature. A typical objective for a regression method is to minimize either the absolute error or the squared error. In all experiments, we use the nonnegative Least Squares Regression (nonnegative LSQ) provided by MATLAB to obtain C i,m. This nonnegative LSQ regression minimizes the error ɛ m = (U CPU,t U CPU,t) 2. t T m such that C i,m 0. When solving a large set of equations with collected monitoring data over a large period of time, a direct (naive) linear regression approach would attempt to set nonzero values for as many transactions as it can to minimize the error when the model is applied to the training set. However, this may lead to poor prediction accuracy when the model is later applied to other datasets, as the

9 Automated Anomaly Detection and Performance Modeling 6:9 model may have become too finely tuned to the training set alone. In statistical terms, the model may overfit the data if it sets values to some coefficients to minimize the random noise in the training data rather than to correlate with the actual CPU utilization. In order to create a model which utilizes only the statistically significant transactions, we use stepwise linear regression [Draper and Smith 1998] to determine which set of transactions are the best predictors for the observed CPU utilization. The algorithm initializes with an empty model which includes none of the transactions. At each following iteration, a new transaction is considered for inclusion in the model. The best transaction is chosen by adding the transaction which results in the lowest mean squared error when it is included. For each N i (0 i n ) we try to solve the set of equations in the form D 0 + N i,t D i = U CPU,t L, (3) while minimizing the error ɛ i = (U CPU,t U CPU,t) 2. t T m Once we performed this procedure for all the transactions, we select the transaction N k (0 k n ) which results in the lowest mean squared error, that is, such that ɛ k = min ɛ i. 0 i N Then, transaction N k is added to the empty set. After that, the next iteration is repeated to choose the next transaction from the remaining subset to add to the set in a similar way. Before the new transaction is included in the model, it must pass an F-test which determines if including the extra transaction results in a statistically significant improvement in the model s accuracy. If the F-test fails, then the algorithm terminates since including any further transactions cannot provide a significant benefit. The coefficients for the selected transactions are calculated using the linear regression technique described before. The coefficient for the transactions not included in the model is set to zero. Typically, for an application with n transactions, one needs at least n + 1 samples to do regression using all n transactions. However, since we do transaction selection using a stepwise linear regression and an F-test, we can do regression by including only a subset of n transactions in the regression model. This allows us to apply regression without having to wait all n + 1 samples. 4.2 Algorithm Outline Using statistical regression, we can build a model that approximates the overall resource cost (CPU demand) of application transactions on a given hardware configuration. However, the accuracy of the modeling results critically depends on the quality of monitoring data used in the regression analysis: If the collected data contain periods of performance anomalies or periods when an updated application exhibits very different performance characteristics, then this

10 6:10 L. Cherkasova et al. Fig. 3. Finding optimal segmentation and detecting anomalies. Fig. 4. Model reconciliation. can significantly impact the derived transaction cost and can lead to an inaccurate approximation model. The challenge is to design an online method that alarms service providers of model changes related to performance anomalies and application updates. Our method has the following three phases. Finding the Optimal Segmentation. This stage of the algorithm identifies the time points when the transaction cost model exhibits a change. For example, as shown in Figure 3, the CPU costs of the transactions (Tr 1, Tr 2,..., Tr n ) during the time interval (T 0, T 1 ) are defined by a model (C 0, C 1, C 2,..., C n ). After that, for a time interval (T 1, T 2 ) there was no a single regression model that provides the transaction costs within a specified error bound. This time period is signaled as having anomalous behavior. As for time interval (T 2, T 3 ), the transaction cost function is defined by a new model (C 0, C 1, C 2,..., C n ). Filtering Out the Anomalous Segments. Our goal is to continuously maintain the model that reflects a normal application resource consumption behavior. At this stage, we filter out anomalous measurements identified in the collected dataset, for example, the time period (T 1, T 2 ) that corresponds to anomalous time fragment as shown in Figure 3. Model Reconciliation. After anomalies have been filtered out, one would like to unify the time segments with no application change/update/modification by using a single regression model: We attempt to reconcile two different segments (models) by using a new common model as shown in Figure 4. We try to find a new solution (new model) for combined transaction data in (T 0, T 1 ) and (T 2, T 3 ) with a given (predefined) error. If two models can be reconciled then an observed model change is indicative of the workload change and not of the application change. We use the reconciled model to represent application behavior across different workload mixes. If the model reconciliation does not work, then it means these models indeed describe different consumption models of application over time, and it is indicative of an actual application performance change.

11 Automated Anomaly Detection and Performance Modeling 6: Online Algorithm Description This section describes the three phases of the online model change and anomaly detection algorithm in more detail. (1) Finding the optimal segmentation. This stage of the algorithm identifies the time points where the transaction cost model exhibits a change. In other words, we aim to divide a given time interval T into time segments T m (T = T m ) such that within each time segment T m the application resource consumption model and the transaction costs are similar. We use a cost-based statistical learning algorithm to divide the time into segments with a similar regression model. The algorithm is composed of two steps: construction of weights for each time segment T m ; dynamic programming to find the optimum segmentation (that covers a given period T) with respect to the weights. The algorithm constructs an edge with a weight, w m, for each possible time segment T m T. This weight represents the cost of forming the segment T m. Intuitively, we would like the weight w m to be small if the resource cost of transactions in T m can be accurately approximated with the same regression model, and to be large if the regression model has a poor fit for approximating the resource cost of all the transactions in T m. The weight function, w m is selected as a Lagrangian sum of two cost functions: w 1,m and w 2,m, where the function w 1,m is the total regression error over T m : w 1,m = (U CPU,t U CPU,t) 2, t T m the function w 2,m is a length penalty function. A length penalty function penalizes shorter time intervals over longer time intervals and discourages the dynamic programming from breaking the time into segments of very short length (since the regression error can be significantly smaller for a shorter time segments). It is a function that decreases as the length of the interval T m increases. We set it to a function of the entropy of segment length as w 2,m = ( T m ) log( T m / T ). Our goal is to divide a given time interval T into time segments T m (T = Tm ) that minimize the Lagrangian sum of w 1,m and w 2,m over the considered segments, that is, the segmentation that minimiz W 1 (T) + λw 2 (T), (4) where the parameter λ is the Lagrangian constant that is used to control the average regression error ɛ allow (averaged over T) allowed in the model, and W 1 (T) = w 1,m and W 2 (T) = w 2,m. m m We denote the set of all possible segmentations of T into segments by S. For instance, for T with three monitoring windows, t 1,t 2,t 3, S would have 4 elements: (t 1,t 2,t 3 ), (t 1 t2,t 3 ), (t 1, t 2 t3 ), and (t 1 t2 t3 ). Each element of the

12 6:12 L. Cherkasova et al. set S consists of segments, that is, T m s. For instance, (t 1, t 2 t3 ) consists of the segments t 1 and t 2 t3. The minimization problem defined by formula (4) can be expressed as s = argmin s S W (s) = W 1 (s) + λw 2 (s), (5) where λ is the fixed Lagrangian constant, s denotes an element of S, and W 1 (s) = T m s w 1,m and W 2 (s) = T m s w 2,m. We first provide the segmentation algorithm for a fixed value of λ (we have an explanation after the algorithm, how to find the right value of λ to be used in the algorithm for a given, allowable regression error ɛ). In the algorithm, l represents the index of the monitoring window, where the monitoring windows are denoted by t l. We represent segments using the following form: [t j, t k ]. It indicates the segment that extends from monitoring window t j to monitoring window t k. W t l denotes the minimum Lagrangian weight sum for segmentation of the first l time points, that is, t 1,t 2,... t l. (1) Set W t 1 = 0, and set l = 2. (2) For 1 j < l, set w 1,[t j,t k ] to the total regression error when a regression model is fit over the time samples in the segment [t j, t k ], then set w 2,[t j,t k ] = t k t j log( t k t j / T ) and, finally, set w [t j,t k ] = w 1,[t j,t k ] + λw 2,[t j,t k ]. (3) Set W t l = min 1 j l 1 (W t j + w [t j,t l ]). (4) Set j = argmin 1 j l 1 (W t j + w [t j,t l ]). (5) Then the optimum segmentation is the segmentation result for j (already obtained in previous iterations) augmented by the single segment from j to l. (6) Set l = l + 1, and return to step 2. In the preceding algorithm, step 2 sets the values for the w 1, w 2, and w terms, and step 3 and step 4 apply dynamic programming to find the minimum cost W l of segmentation up to the l th time sample. The algorithm shows the best segmentation for a fixed value of λ. Here is the additional sequence of steps to find the appropriate value of λ for the use in the algorithm. In our implementation, we first decide on an allowable regression error ɛ (typically, provided by the service provider), and then seek the λ that gives the best segmentation for that allowable error ɛ by iterating over different values of λ = λ 0, λ 1,..., λ k. In particular, if the algorithm for λ 0 = 1 results in the optimal segmentation with regression error greater than ɛ, then we choose λ 1 = 2 λ 0, and repeat the segmentation algorithm. Once we find a value λ k

13 Automated Anomaly Detection and Performance Modeling 6:13 that results in the optimal segmentation with regression error smaller than ɛ, then we use a binary search between values λ k 1 and λ k to find the best value of λ for a use in the algorithm with a given allowable regression error ɛ. Let us consider an example to explain the intuition for how the Eq. (5) works. Let us first consider the time interval T with no application updates or changes. Let time interval T be divided into two consecutive time segments T 1 and T 2. First of all, W 1 (T 1 ) + W 1 (T 2 ) W (T), hence there are two possibilities. One possibility is that a regression model constructed over T is also a good fit over time segments T 1 and T 2, and the combined regression error of this model over time segments T 1 and T 2 is approximately equal to the total regression error over the original time interval T. The other possibility is that there could be different regression models that are constructed over shorter time segments T 1 and T 2 with the sum of regression errors smaller than a regression error obtained when a single regression model is constructed over T. For the second possibility, the question is whether the difference is due to a noise or small outliers in T, or do segments T 1 and T 2 indeed represent different application behaviors, that is, before and after the application modification and update. This is where the W 2 function in Eq. (4) comes into play. The term log( T m / T ) is a convex function of T m. Therefore, each time a segment is split into multiple segments, W 2 increases. This way, the original segment T results in the smallest W 2 compared to any subset of its segments, and λ can be viewed as a parameter that controls the amount of regression error allowed in a segment. By increasing the value of λ, we allow a larger W 1, regression error, in the segment. This helps in reconciling T 1 and T 2 into a single segment representation T. In such a way, by increasing the value of λ one can avoid the incorrect segmentations due to noise or small outliers in the data T. When an average regression error over a single segment T is within the allowable error ɛ allow (ɛ allow is set by a service provider), the overall function (4) results in the smallest value for the single time segment T compared to the values computed to any of its subsegments, for instance, T 1 and T 2. Therefore, our approach groups all time segments defined by the same CPU transaction cost (or the same regression model) into a single segment. By decreasing the value of λ, one can prefer the regression models with a smaller total regression error on the data, while possibly increasing the number of segments over the data. There is a trade-off between the allowable regression error (it is a given parameter for our algorithm) and the algorithm outcome. If the allowable regression error is set too low then the algorithm may result in a high number of segments over data, with many segments being neither anomalies or application changes (these are the false alarms, typically caused by significant workload changes). From the other side, by setting the allowable regression error too high, one can miss a number of performance anomalies and application changes that happened in these data and masked by the high allowable error.

14 6:14 L. Cherkasova et al. (2) Filtering out the anomalous segments. An anomalous time segment is one where observed CPU utilization cannot be explained by an application workload, that is, measured CPU utilization cannot be accounted for by the transaction CPU cost function. This may happen if an unknown background process(es) is using the CPU resource either at a constant rate (e.g., using 40% of the CPU at every monitoring window during some time interval) or randomly (e.g., the CPU is consumed by the background process at different rates at every monitoring window). It is important to be able to detect and filter out the segments with anomalous behavior, as otherwise the anomalous monitoring windows will corrupt the regression estimations of the time segments with normal behavior. Furthermore, detecting anomalous time segments provides an insight into the service problems and a possibility to correct the problems before they cause major service failure. We consider a time segment T m as anomalous if one of the following conditions take place. The constant coefficient, C 0,m, is large. Typically, C 0,m is used in the regression model to represent the average CPU overhead related to idle system activities. There are operating system processes or system background jobs that consume CPU time even when there is no transaction in the system. The estimate for the idle system CPU overhead over a monitoring window is set by the service provider. When C 0,m exceeds this threshold a time segment T m is considered as anomalous. The segment length of T m is short, indicating that a model does not have a good fit that ensures the allowed error threshold. Intuitively, the same regression model should persist over the whole time segment between the application updates/modifications unless something else, anomalous, happens to the application consumption model and it manifests itself via the model changes. (3) Model reconciliation. After anomalies have been filtered out, one would like to unify the time segments with no application change/update/modification by using a single regression model. This way, it is possible to differentiate between the segments with application changes from the segments which are the parts of the same application behavior and were segmented out by the anomalies in between. In such cases, the consecutive segments can be reconciled into a single segment after the anomalies in the data are removed. If there is an application change, on the other hand, the segments will not be reconciled, since the regression model that fits to the individual segments will not fit to the overall single segment without exceeding the allowable error (unless the application performance change is so small that it still fits within the allowable regression error). 4.4 Algorithm Complexity The complexity of the algorithm is O(M 2 ), where M is the number of time samples collected so far. This is problematic since the complexity is quadratic in a term that increases as more time samples are collected. In our case study, we have not experienced a problem as we used only 30 hours of data with 1-minute

15 Automated Anomaly Detection and Performance Modeling 6:15 Table I. Testbed Components Processor RAM Clients (Emulated-Browsers) Pentium D / 6.4 GHz 4GB Front Server - Apache/Tomcat 5.5 Pentium D / 3.2 GHz 4GB Database Server - MySQL5.0 Pentium D / 6.4 GHz 4GB intervals, namely M = However, in measuring real applications over long periods of time, complexity is likely to become challenging. To avoid this, one solution might retain only the last X samples, where X should be a few orders larger than the number of transaction types n and/or cover a few weeks/months of historic data. This way, one would have a sufficiently large X to get accurate regression results, yet the complexity will not be too large. 5. CASE STUDY In this section, we validate the proposed regression-based transaction model as an online solution for anomaly detection and performance changes in application behavior and discuss limitations of this approach. The next subsection describes the experimental environment used in the case study as well as a specially designed workload for evaluating the proposed approach. 5.1 Experimental Environment In our experiments, we use a testbed of a multitier e-commerce site that simulates the operation of an online bookstore, according to the classic TPC-W benchmark [TPC-W Benchmark]. This allows to conduct experiments under different settings in a controlled environment in order to evaluate the proposed anomaly detection approach. We use the terms front server and application server interchangeably in this article. Specifics of the software/hardware used are given in Table I. Typically, client access to a Web service occurs in the form of a session consisting of a sequence of consecutive individual requests. According to the TPC-W specification, the number of concurrent sessions (i.e., customers) or emulated browsers (EBs) is kept constant throughout the experiment. For each EB, the TPC-W benchmark statistically defines the user session length, the user think time, and the queries that are generated by the session. The database size is determined by the number of items and the number of customers. In our experiments, we use the default database setting, that is, the one with 10,000 items and 1,440,000 customers. TPC-W defines 14 different transactions which are classified as either of browsing or ordering types as shown in Table II. We assign a number to each transaction (shown in parentheses) according to their alphabetic order. Later, we use these transaction ids for presentation convenience in the figures. According to the weight of each type of activity in a given traffic mix, TPC-W defines 3 types of traffic mixes as follows: the browsing mix with 95% browsing and 5% ordering; the shopping mix with 80% browsing and 20% ordering; the ordering mix with 50% browsing and 50% ordering.

16 6:16 L. Cherkasova et al. Table II. 14 Basic Transactions and Their Types in TPC-W Browsing Type Ordering Type Home (8) Shopping Cart (14) New Products (9) Customer Registration (6) Best Sellers (3) Buy Request (5) Product detail (12) Buy Confirm (4) Search Request (13) Order Inquiry (11) Execute Search (7) Order Display (10) Admin Request (1) Admin Confirm (2) Fig. 5. The pseudocode for the random process. One drawback of directly using the transaction mixes described previously in our experiments is that they are stationary, that is, the transaction mix and load do not change over time. Since real enterprise and e-commerce applications are typically characterized by nonstationary transaction mixes (i.e., with changing transaction probabilities in the transaction mix over time) under variable load [Douglis et al. 1997; Cherkasova and Karlsson 2001; Stewart et al. 2007] we have designed an approach that enables us to generate nonstationary workloads using the TPC-W setup. To generate a nonstationary transaction mix with variable transaction mix and load we run 4 processes as follows: the three concurrent processes each executing one of the standard transaction mixes (i.e., browsing, shopping and ordering respectively) with the arbitrary fixed number of EBs (e.g, 20, 30, and 50 EBs, respectively). We call them base processes; the fourth, so-called random process executes one of the standard transaction mixes (in our experiments, it is the shopping mix) with a random execution period while using a random number of EBs for each period. To navigate and control this random process we use specified ranges for the random parameters in this workload. The pseudocode of this random process is shown in Figure 5 (the code also shows parameters we use for the nonstationary mix in this article). Due to the fourth random process the workload is nonstationary and the transaction mix and load vary significantly over time.

17 Automated Anomaly Detection and Performance Modeling 6:17 Fig hour TPC-W workload used in the case study. In order to validate the online anomaly detection and application change algorithm, we designed a special 30-hour experiment with TPC-W that has 7 different workload segments shown in Figure 6, which are defined as follows. (1) The browsing mix with the number of EBs equal to 200, 400, 600, 800, and 1000 respectively. (2) In order to validate whether our algorithm correctly detects performance anomalies, we generated a special workload with nonstationary transaction mix as described earlier and an additional CPU process (that consumes random amount of CPU) on a background. (3) The shopping mix with the number of EBs equal to 200, 400, 600, 800, and 1000 respectively. (4) The ordering mix with the number of EBs equal to 200, 400, 600, 800, and 1000 respectively. (5) The nonstationary TPC-W transaction mix described earlier in this section. (6) In order to validate whether we can automatically recognize the application change, we modified the source code of the Home transaction (the 8th transaction) in TPC-W by inserting a controlled CPU loop into the code of this transaction and increasing its service time by 5 ms. Using this modified code, we performed experiment with the nonstationary TPC-W transaction mix described before. (7) Another experiment with the modified TPC-W benchmark, where the service time of the Home transaction is increased by 10 ms and the nonstationary TPC-W transaction mix described previously. 5.2 Validation of Online Regression-Based Algorithm We applied our online regression-based algorithm to the special 30-hour workload shown in Figure 6. The expectations are that the algorithm should exhibit a model change and issue an alarm for anomaly workload described by segment 2 in Figure 6. Then a single model should represent workload segments

18 6:18 L. Cherkasova et al. Fig. 7. Model changes in the studied workload. Fig. 8. Model reconciliation in the studied workload. 3, 4, and 5 as shown in Figure 6. Moreover, the algorithm should reconcile segments 1, 3, 4, and 5 with a single model (since they are all generated with the same application operating under different workloads), while issuing the model changes for application modifications described by segments 6 and 7. We experimented with two values of allowable error in our experiments: ɛallow 1 = 3% and ɛ2 allow = 1% to demonstrate the impact of error setting and stress the importance of tuning this value. When we used ɛallow 1 = 3%, the algorithm had correctly identified 4 major model changes as shown in Figure 7. In fact, for the second segment there were 42 model changes (not shown in this figure to simplify the presentation) with maximum segment being 5 epochs. The algorithm accurately detected that the whole segment 2 is anomalous. Then the tool correctly performed the model reconciliation for the consecutive segments around the anomalous segment as shown in Figure 8. Finally, the algorithm correctly raised alarms on the application change when the regression model has changed and could not be reconciled (last two segments in Figure 8).

19 Automated Anomaly Detection and Performance Modeling 6:19 When we used ɛallow 1 = 1%, the algorithm had identified 6 major model changes: In addition to 4 model changes shown in Figure 7 the algorithm reported 2 extra segments at timestamps 790 and 1030 (separating apart segments 3, 4, and 5 shown in Figure 6) that correspond to workload changes and that are false alarms. It is important to use the appropriate error setting to minimize the number of false alarms. When the allowable error is set too small the proposed algorithm often overreacts and identifies model changes that correspond to significant workload changes under the same, unmodified application. In the next section, we will introduce a complementary technique that helps to get an insight in whether the model change is indeed an application change or whether it rather corresponds to a workload change. This technique can also be used for error tuning in the earlier described online algorithm. The power of the introduced regression-based approach is that it is sensitive to accurately detect a difference in the CPU consumption model of application transactions. However, it cannot identify the transactions that are responsible for this resource consumption difference. To complement the regression-based approach and to identify the transactions that cause the model change we use a different method described in the next section and which is based on building a representative application performance signature. 6. DETECTING TRANSACTION PERFORMANCE CHANGE Nowdays there is a new generation of monitoring tools, both commercial and research prototypes, that provide useful insights into transaction activity tracking and latency breakdown across different components in multitier systems. However, typically such monitoring tools just report the measured transaction latency and provide an additional information on application server versus database server latency breakdown. Using this level of information it is often impossible to decide whether an increased transaction latency is a result of a higher load in the system or whether it can be an outcome of the recent application modification, and is directly related to the increased processing time for this transaction type. In this section, we describe an approach based on an application performance signature that provides a compact model of runtime behavior of the application. The application signature is built based on new concepts: the transaction latency profiles and transaction signatures. These become instrumental for creating an application signature that accurately reflects important performance characteristics. Comparing new application signature against the old application signature allows detecting transaction performance changes. 6.1 Server Transaction Monitoring Many enterprise applications are implemented using the J2EE standard, a Java platform which is used for Web application development and designed to meet the computing needs of large enterprises. For transaction monitoring we use the HP (Mercury) Diagnostics [Mercury Diagnostics] tool which offers a monitoring solution for J2EE applications. The Diagnostics tool consists of

20 6:20 L. Cherkasova et al. Fig. 9. Multitier application configuration with the Diagnostics tool. two components: the Diagnostics Probe and the Diagnostics Server as shown in Figure 9. The Diagnostics tool collects performance and diagnostic data from applications without the need for application source-code modification or recompilation. It uses bytecode instrumentation and industry standards for collecting system and JMX metrics. Instrumentation refers to bytecode that the Probe inserts into the class files of the application as they are loaded by the class loader of the virtual machine. Instrumentation enables a Probe to measure execution time, count invocations, retrieve arguments, catch exceptions, and correlate method calls and threads. The J2EE Probe shown in Figure 9 is responsible for capturing events from the application, aggregating the performance metrics, and sending these captured performance metrics to the Diagnostics Server. We have implemented a Java-based processing utility for extracting performance data from the Diagnostics Server in real time and creating a so-called application log that provides a complete information on all transactions processed during the monitoring window, such as their overall latencies, outbound calls, and the latencies of the outbound calls. In a monitoring window, Diagnostics collects the following information for each transaction type: a transaction count; an average overall transaction latency for observed transactions. 1 This overall latency includes transaction processing time at the application server as well as all related query processing at the database server, that is, latency is measured from the moment of the request arrival at the application server to the time when a prepared reply is sent back by the application server; see Figure 10; a count of outbound (database) calls of different types; an average latency of observed outbound calls (of different types). The average latency of an outbound call is measured from the moment the database 1 Note that here a latency is measured for the server transaction (see the difference between client and server transactions described in Section 2).

Automated Anomaly Detection and Performance Modeling of Enterprise Applications
