Statistical Techniques for Online Anomaly Detection in Data Centers
Chengwei Wang, Krishnamurthy Viswanathan, Lakshminarayan Choudur, Vanish Talwar, Wade Satterfield, Karsten Schwan

HP Laboratories HPL

Keyword(s): Anomaly Detection, Data Center Management, Statistics, Algorithms

Abstract: Online anomaly detection is an important step in data center management, requiring light-weight techniques that provide sufficient accuracy for subsequent diagnosis and management actions. This paper presents statistical techniques based on the Tukey and Relative Entropy statistics, and applies them to data collected from a production environment and to data captured from a testbed for multi-tier web applications running on server class machines. The proposed techniques are lightweight and improve over standard Gaussian assumptions in terms of performance.

External Posting Date: May 21, 2011 [Fulltext] - Approved for External Publication
Internal Posting Date: May 21, 2011 [Fulltext]

To be published and presented at the IFIP/IEEE International Symposium on Integrated Network Management (IM) 2011, 23-27 May 2011. Copyright IFIP/IEEE International Symposium on Integrated Network Management (IM) 2011.
Chengwei Wang (CERCS, Georgia Institute of Technology), Krishnamurthy Viswanathan (Hewlett-Packard), Lakshminarayan Choudur (Hewlett-Packard), Vanish Talwar (Hewlett-Packard), Wade Satterfield (Hewlett-Packard), Karsten Schwan (CERCS, Georgia Institute of Technology)

I. INTRODUCTION

Commercial data center environments are increasingly characterized by extremely large scale and complexity. Individual applications like Hadoop MapReduce and Web 2.0 services can involve thousands of servers. Utility clouds like Amazon EC2 or Google App Engine serve more than 2 million businesses running their own applications, each of which may have different workload characteristics. These facts make data center management a difficult task, especially in systems where malfunctions can lead to extensive losses in profit due to lack of responsiveness or availability. Our work seeks to improve system performance and availability by developing online, closed-loop management solutions that (i) detect problems, (ii) diagnose them to determine potential remedies or mitigation methods, and (iii) trigger and carry out such solutions. A key element of such research, and the topic of this paper, is online anomaly detection: determining whether a system is behaving as expected or in ways that are unusual and deserve further investigation and diagnosis.
Anomaly detection is demanding because it must be performed continuously, for as long as a system is running, and at the scale of entire data center systems. We are interested in online anomaly detection solutions that can be applied to continuous monitoring scenarios, as opposed to methods that rely on static profiling or limited sets of historical data. There are several challenges in designing effective solutions for such online anomaly detection in large data centers. The first is scale: the anomaly detection methods must be lightweight, both in terms of the number of metrics they require (the volume of monitoring data continuously captured and used) and in terms of the runtime complexity of executing the detection methods. Next, these methods should be able to handle multiple metrics at the different levels of abstraction present in data centers: hardware, system software, middleware, and applications. Furthermore, the methods need to accommodate workload characteristics and patterns, including day-of-week and hour-of-day patterns of workload behavior. The methods also need to address the dynamic nature of data center systems and applications, including application arrivals and departures, changes in workload, and system-level load balancing through, say, virtual machine migration. Finally, the solutions must exhibit good accuracy and low false alarm rates for meaningful results. The state-of-the-art approaches for anomaly detection deployed in today's data centers apply a fixed threshold to the metrics being monitored. These thresholds are usually computed offline from training data and remain constant during the entire process of anomaly detection. Often these thresholds are applied to each individual measurement separately.
Variants such as Multivariate Adaptive Statistical Filtering (MASF) [1] additionally maintain a separate threshold for data segmented and aggregated by time (e.g., hour of day, day of week). However, these existing techniques assume the data distribution to be Gaussian when determining the threshold values, an assumption that is frequently violated in practice. Furthermore, fixed thresholds cannot adapt to loads that change over time or to intermittent bursts, nor can they react to anomalous behavior that does not show up as extremely large or small values in the data. All of these lead to false alarms and reduced accuracy with existing techniques. In this paper we propose statistical techniques that overcome these limitations, improving both accuracy and the insights provided by anomaly flags. We make the following contributions. We select two statistical techniques, the Tukey method and the multinomial goodness-of-fit test based on the Relative Entropy statistic, and adapt them to the specific needs of anomaly detection in data centers. We propose algorithms using these techniques that compute statistics on data over multiple time dimensions - entire past, recent past, and context based on hour of day and day of week. These statistics are then employed to determine whether specific points or windows are anomalous. The proposed algorithms have low complexity and are scalable to
process large amounts of data. We have evaluated the proposed algorithms on data from a 3-tier RUBiS (internet service) testbed with injected performance and configuration anomalies, as well as on data from production testbeds. Our results show improved accuracy and reduced false alarm rates compared to state-of-the-art Gaussian techniques. Further, our techniques are shown to be more adaptable to workload changes, and they learn the multiple states of operation of the workload over time. We also illustrate how results from multiple time dimensions and multiple techniques can be combined to provide administrators with improved insights into application behavior. The remainder of this paper is organized as follows. Section II provides background information on existing techniques. Section III describes our proposed statistical techniques, which overcome limitations of existing approaches. Experimental results are presented in Section IV. Section V discusses combining results from multiple time dimensions. Section VI discusses related work, and we conclude the paper in Section VII.

II. BACKGROUND

Anomalies manifest in data in a variety of ways. Two common manifestations are those in which a data point is atypical with respect to the normal behavior of the data distribution, and those in which the distribution of the data changes over time. One way of dealing with the first problem is to use the distribution of the data to compute thresholds. Typical data under normal process behavior will oscillate within the threshold limits. Data points falling above the upper threshold or below the lower threshold are flagged as anomalies. These thresholds are obtained under certain assumptions about the behavior (shape) of the distribution. An understanding and quantitative representation of the data distribution is obtained by studying the historical data. This approach is known as parametric thresholding.
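As a concrete illustration, parametric thresholding under a Gaussian assumption reduces to estimating μ and σ from historical data and flagging points outside μ ± kσ. A minimal sketch in Python (the function names are illustrative, not from the paper):

```python
import math

def gaussian_limits(history, k=3.0):
    """Estimate mu +/- k*sigma control limits from historical samples."""
    n = len(history)
    mu = sum(history) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in history) / n)
    return mu - k * sigma, mu + k * sigma

def flag_point(x, lower, upper):
    """A point beyond either threshold is flagged as an anomaly."""
    return x < lower or x > upper
```

In an MASF-style deployment these limits would be computed separately per time segment (e.g., per hour of day), as discussed below.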
To address the second problem, the variability in the process over an extended period of time is studied and analyzed using the historical repository of data. From this, the variability of the data is characterized and threshold limits are computed. This approach avoids the need to make assumptions about the shape of the distribution of the data. Such methods are known as non-parametric methods. In many applications involving statistical analysis of data, the Gaussian distribution is the assumed underlying probability model [10]. For example, a popular method for anomaly detection in data centers is MASF [1], which relies on the Gaussian law. MASF first segments the data by hour of day and day of week. Subsequently, threshold limits are computed based on the standard deviation (σ) of this segregated data. Under Gaussian assumptions, 95% of the data falls within 2 standard deviations (σ) of the mean (μ), and 99.7% of the data falls within 3 standard deviations of the mean. A data point falling outside the 3σ range occurs 27 times out of 10,000 opportunities, which is deemed a rare event and thus is flagged as an anomaly. The limits can be adjusted to 3σ or 4σ, etc.; the typical limits are μ ± 3σ. Although the Gaussian assumption often holds in general, some data do not conform to Gaussian behavior. While this may not always be detrimental, it is recommended that other methods that do not rely on restrictive normality assumptions be considered. We present such methods in Section III. Finally, we point out that prior to the application of an anomaly detection method, pre-processing of the data is usually performed. This includes data cleansing to remove invalid and spurious data, as well as smoothing of the data using procedures such as (1) moving average smoothing, (2) smoothing by Fourier transform, or (3) the wavelet transform. III.
STATISTICAL APPROACHES

In this section, we examine two classes of statistical procedures for anomaly detection. The first class is based on applying thresholds to individual data points. The second class measures changes in distribution by windowing the data and uses those changes to determine anomalies.

A. Point thresholds

Within the class of point threshold techniques, we propose the Tukey method [9] for anomaly detection. Like the Gaussian methods, this method constructs a lower threshold and an upper threshold to flag data as anomalous. However, unlike the Gaussian method, the Tukey procedure makes no distributional assumptions about the data. The Tukey method is a simple but effective procedure for identifying anomalies. We describe it below. Let x_1, x_2, ..., x_n be a series of observations, such as the CPU utilization of a server. The data is arranged in ascending order, from the smallest to the largest observation. The ordered data is broken into four quarters, with the boundaries of the quarters defined by Q_1, Q_2, and Q_3, called the 1st, 2nd, and 3rd quartiles respectively. The difference Q_3 - Q_1 is called the inter-quartile range. The Tukey lower and upper thresholds for anomalies are, respectively, ltl = Q_1 - 3(Q_3 - Q_1) and utl = Q_3 + 3(Q_3 - Q_1). Observations falling beyond these limits are called serious anomalies, and any observation x_i, i = 1, 2, ..., n, such that Q_3 + 1.5(Q_3 - Q_1) <= x_i <= Q_3 + 3(Q_3 - Q_1) is called a possible anomaly; similarly, an observation with Q_1 - 3(Q_3 - Q_1) <= x_i <= Q_1 - 1.5(Q_3 - Q_1) is a possible anomaly on the lower side. The method allows the user flexibility in setting the threshold limits. For example, depending on a user's experience, the upper and lower anomaly limits can be set to Q_3 + k(Q_3 - Q_1) and Q_1 - k(Q_3 - Q_1), where k is an appropriately chosen scalar. The lower and upper Tukey limits correspond to a distance of 4.5σ (standard deviations) from the sample mean if the distribution of the data is Gaussian. B.
Windowing approaches

Identifying individual data points as anomalous can result in false alarms when, for example, sudden instantaneous spikes in CPU or memory utilization are flagged. To avoid this,
one might seek to examine windows of data consisting of a collection of points and then make a determination on the whole window. A simple extension of the point threshold approach to windowed data is to apply the threshold to the mean of the window. The thresholds could be computed from the entire past history or merely from the past few windows. While simple, this approach fails to capture a large fraction of anomalies (as our results will show). Hence, we instead propose a new approach to detect anomalies that manifest as changes in system behavior inconsistent with what is expected. Our approach is based on a classical question in hypothesis testing [16], namely determining whether the observed data is consistent with a given distribution. In the classical hypothesis testing problem, we are required to determine which of two hypotheses, the null hypothesis and the alternate hypothesis, best describes the data. The two hypotheses are each defined by a single distribution or a collection of distributions. Formally, let X_1, X_2, ..., X_n denote the observed sample of size n. Let P_0 denote the distribution representing the null hypothesis and let P_1 denote the alternate hypothesis. Then the optimal test for minimizing the probability of falsely rejecting the null hypothesis, under a constraint on the probability of incorrectly accepting the null hypothesis, is given by the Neyman-Pearson theorem. The theorem states that the optimal test is given by determining whether the likelihood ratio

    P_0(X_1, X_2, ..., X_n) / P_1(X_1, X_2, ..., X_n) >= T,    (1)

where P_0(X_1, X_2, ..., X_n) is the probability assigned to the observed sample by P_0, P_1(X_1, X_2, ..., X_n) is the corresponding probability assigned by P_1, and T is a threshold that can be determined based on the constraint on the probability of incorrectly accepting the null hypothesis.
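For independent, identically distributed samples from discrete distributions, the likelihood-ratio test in (1) is straightforward to state in code. A minimal sketch, assuming the distributions are given as value-to-probability dictionaries (the function names are ours, for illustration):

```python
import math

def log_likelihood_ratio(sample, p0, p1):
    """log [ P0(sample) / P1(sample) ] for i.i.d. draws, where p0 and p1
    map each discrete value to its probability under each hypothesis."""
    return sum(math.log(p0[x]) - math.log(p1[x]) for x in sample)

def neyman_pearson_test(sample, p0, p1, log_t):
    """Accept the null hypothesis when the log likelihood ratio >= log T."""
    return log_likelihood_ratio(sample, p0, p1) >= log_t
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.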
Sometimes the alternate hypothesis is merely the statement that the observed data is not drawn from the null hypothesis. Tests designed for this purpose are called goodness-of-fit tests. For our problem, we invoke a particular test known as the multinomial goodness-of-fit test (see, e.g., [16]), which applies to a scenario where the data X_i are discrete random variables that can take at most k values, say {1, 2, ..., k}. Let P_0 = (p_1, p_2, ..., p_k), where Σ_i p_i = 1, denote the distribution corresponding to the null hypothesis (p_i denotes the probability of observing i). Let n_i denote the number of times i was observed in the sample X_1, X_2, ..., X_n. Let P̂ = (p̂_1, p̂_2, ..., p̂_k), where p̂_i = n_i/n, denote the empirical distribution of the observed sample X_1, X_2, ..., X_n. Then the likelihood ratio in (1) reduces to

    L = log [ Π_{i=1}^k p̂_i^{n_i} / Π_{i=1}^k p_i^{n_i} ] = n Σ_{i=1}^k p̂_i log (p̂_i / p_i).

Note that the relative entropy (also known as the Kullback-Leibler divergence [15]) between two distributions Q = (q_1, q_2, ..., q_k) and P = (p_1, p_2, ..., p_k) is given by

    D(Q || P) = Σ_i q_i log (q_i / p_i).

Thus the likelihood ratio is L = n D(P̂ || P). The multinomial goodness-of-fit test relies on the observation that if the null hypothesis P is true, then as the number of samples n grows, 2n D(P̂ || P) converges to a chi-squared distribution(1) with k - 1 degrees of freedom. Therefore the test is performed by comparing 2n D(P̂ || P) with a threshold that is determined based on the cumulative distribution function (cdf) of the chi-squared distribution and a desired upper bound on the false alarm probability. To apply the multinomial goodness-of-fit test for the purpose of anomaly detection, we first quantize the metric being measured, discretizing it to a few values. For example, the percentage of CPU utilization, which takes values between 0 and 100, can be quantized into 10 buckets, each of width 10. The quantized series of CPU utilization values then takes one of 10 different values.
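The quantization step and the 2nD(P̂ || P_0) statistic can be sketched as follows, assuming natural logarithms, the convention 0 log 0 = 0, and a strictly positive null hypothesis. For 10 bins (9 degrees of freedom), the 0.95 critical value of the chi-squared distribution is approximately 16.92 (e.g., from scipy.stats.chi2.ppf(0.95, 9)):

```python
import math

def quantize(u, u_min, u_max, n_bins):
    """Map a metric value to a bin index in [0, n_bins - 1]."""
    step = (u_max - u_min) / n_bins
    return min(max(int((u - u_min) / step), 0), n_bins - 1)

def gof_statistic(window_bins, p0, n_bins):
    """2 n D(P_hat || P0) for one window of bin indices.

    p0 is the null-hypothesis distribution over bins; it must have
    strictly positive entries."""
    n = len(window_bins)
    counts = [0] * n_bins
    for b in window_bins:
        counts[b] += 1
    p_hat = [c / n for c in counts]
    d = sum(q * math.log(q / p) for q, p in zip(p_hat, p0) if q > 0)
    return 2 * n * d
```

A window is then flagged as anomalous when the statistic exceeds the chosen chi-squared critical value.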
We then window the data and perform anomaly detection on each window. To do so, we take the quantized data observed in a window, decide on a choice of the null hypothesis P, and perform the multinomial goodness-of-fit test with a threshold based on an acceptable false alarm probability. If the null hypothesis is rejected, we raise an alarm that the window contained an anomaly. If it is accepted, we declare that the window did not contain an anomaly. Depending on the choice of the null hypothesis, one can obtain a variety of different tests. We consider two different choices in this paper. The first choice sets P = (p_1, p_2, ..., p_k), where p_i is the fraction of times i appeared in the entire past, i.e., before the current window. Intuitively, this choice declares a window to be anomalous if the distribution of the metric values in that window differs significantly from the distribution of metric values in the past. In the second choice, p_i is set to the fraction of times i appears in the past few windows. This choice declares a window to be anomalous if the distribution of the metric values in that window differs significantly from the distribution of metric values in the recent past. Unlike the first choice, this choice distinguishes between the case where the distribution of the metric being monitored changes gently and the case where it changes abruptly. In many real-life applications, the nature of the load on a data center is not constant; in fact, it is strongly related to the day of the week and the hour of the day. Thus, applying the hypothesis tests as described above directly may not yield the desired results. In such cases it is necessary to contextualize the data, namely, to compute the null hypothesis based on the hour of day or the day of the week. Furthermore, some systems may operate in multiple states.
For example, the system could encounter very small or no load most of the time, and higher loads in intermittent bursts. In that case, it is likely that the relative entropy based approaches outlined here would flag the second state as anomalous, which may not be desirable. We present an extension of the goodness-of-fit approach as a way to ameliorate such problems.

(1) A chi-squared distribution with m degrees of freedom is the distribution of the sum of the squares of m independent, identically distributed zero-mean, unit-variance Gaussian random variables.

In this
extension, the test statistic is computed against several null hypotheses as opposed to a single one. We formally describe how the null hypotheses are selected and how the anomaly detection is performed in Figure 1. We first describe the inputs and intermediate variables of our algorithm.

Inputs:
- Util: the time series of the metric on which anomalies are to be detected
- N_bins: the number of bins into which Util is to be quantized
- Util_min and Util_max: the minimum and maximum values that Util can take
- n: the length of the time series being monitored
- W: the window size
- T: the threshold against which the test statistic is compared; it is usually set to the point of the chi-squared cdf with N_bins - 1 degrees of freedom that corresponds to 0.95 or 0.99
- c_th: a threshold against which c_i is compared to determine whether a hypothesis has occurred frequently enough

Intermediate variables:
- m: tracks the current number of null hypotheses
- Stepsize: the step size in time series quantization
- Util_current: the current window of utilization values
- B_current: the current window of bin values obtained by quantizing the utilization values
- P̂: the empirical frequency of the current window, based on B_current
- c_i: tracks the number of windows that agree with hypothesis P_i

Algorithm ANOMALY DETECTION USING MULTINOMIAL GOODNESS-OF-FIT
Input: (Util, N_bins, Util_min, Util_max, n, W, T, c_th)
1) Set m = 0
2) Set W_index = 1
3) Set Stepsize = (Util_max - Util_min) / N_bins
4) While (W_index * W < n):
   a) Set Util_current = Util((W_index - 1) * W + 1 : W_index * W)
   b) Set B_current = floor((Util_current - Util_min) / Stepsize)
   c) Compute P̂
   d) If m = 0: set P_1 = P̂, m = 1, c_1 = 1
   e) Else if 2 W D(P̂ || P_i) < T for some hypothesis P_i, i <= m (if more than one such i exists, select the one with the smallest D(P̂ || P_i)):
      - Increment c_i by 1
      - If c_i > c_th, declare the window to be non-anomalous; else declare the window to be anomalous
   f) Else:
      - Declare the window to be anomalous
      - Increment m by 1, set P_m = P̂, and c_m = 1
   g) Increment W_index by 1

Fig. 1.
Algorithm for anomaly detection with multiple null hypotheses.

The algorithm works as follows. The current window is selected in Step 4a) and the values in the window are quantized in Step 4b). The algorithm computes the empirical frequency P̂ of the current window in Step 4c). Observe that P_1, P_2, ..., P_m denote the m null hypotheses at any point in time. A test statistic involving P̂ and each of the P_i's is computed and compared to the threshold T in Step 4e). If the test statistic exceeds the threshold for all of the null hypotheses, the window is declared anomalous, m is incremented by 1, and a new hypothesis is created based on the current window. If the smallest test statistic is below the threshold but corresponds to a hypothesis that was accepted fewer than c_th times in the past, then the window is declared anomalous, but the count of appearances of that hypothesis is incremented. If neither of the two conditions holds, the window is declared non-anomalous and the appropriate book-keeping is performed. The relative entropy based approaches described here have several advantages. They are non-parametric, i.e., they do not assume a particular form for the distribution of the data. They can easily be extended to handle multi-dimensional time series, thereby incorporating correlation, and can be adapted to workloads whose distribution changes over time. While relative entropy and hypothesis testing approaches have been used for anomaly detection before, we are not aware of any work that uses them in the context of the multinomial goodness-of-fit test. The latter provides a systematic way (based on the cdf of the chi-squared distribution) to choose the threshold above which alarms are raised, a feature not always present in other works using the relative entropy metric. We also extend the technique in a novel manner to adapt to multiple operating states and to detect contextual anomalies. All the algorithms that we propose are computationally lightweight.
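A simplified rendering of the Fig. 1 procedure is sketched below. It follows the steps above, with two assumptions of ours: the first window (which seeds hypothesis P_1) is treated like any other newly created hypothesis and flagged, and a small epsilon guards against zero bin probabilities in the divergence:

```python
import math

def rel_entropy(q, p, eps=1e-12):
    """D(Q || P) with natural logs; zero-probability terms in Q contribute 0."""
    return sum(qi * math.log(qi / max(pi, eps))
               for qi, pi in zip(q, p) if qi > 0)

def multi_hypothesis_detect(util, n_bins, u_min, u_max, w, t, c_th):
    """Flag each full window of `util` as anomalous (True) or not (False)."""
    step = (u_max - u_min) / n_bins
    hyps = []                                   # list of [P_i, c_i] pairs
    flags = []
    for s in range(0, len(util) - w + 1, w):
        counts = [0] * n_bins
        for u in util[s:s + w]:
            counts[min(max(int((u - u_min) / step), 0), n_bins - 1)] += 1
        p_hat = [c / w for c in counts]
        # find the closest hypothesis accepted by the chi-squared test
        best = None
        for h in hyps:
            d = rel_entropy(p_hat, h[0])
            if 2 * w * d < t and (best is None or d < best_d):
                best, best_d = h, d
        if best is None:            # rejected by every hypothesis: new one
            hyps.append([p_hat, 1])
            flags.append(True)
        else:                       # fits, but must recur to be trusted
            best[1] += 1
            flags.append(best[1] <= c_th)
    return flags
```

On a steady workload, the detector alarms until the corresponding hypothesis has been accepted more than c_th times, then stays quiet for recurring behavior.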
The algorithms require computing statistics for the current window and updating historical statistics. The computational overhead for the current window statistics is negligible. Updating quantities such as the mean and standard deviation can clearly be performed in linear time. Further, this can be done in an online manner with very little memory, as only the sum of the observed values and the sum of the squares of the observed values need to be maintained. For the Tukey method, we need to compute quantiles, and for the relative entropy method, the empirical frequencies of previous windows. These can also be computed in time linear in the input. Further, these quantities can be well-approximated in one pass with limited memory, without having to store all the observed data points [22].

IV. THE RESULTS

We tested the algorithms discussed in Section III on two different types of data. The first data set was one into which we could inject anomalies and thereby validate the results returned by our algorithms. It was obtained from an experimental setup with a representative internet service - RUBiS [17], a distributed online service implementing the core functionality of an auction site. The second type of data set was collected from production data centers.
TABLE I. Recall and false positive rates for different techniques with RUBiS data. Rows: Gaussian (windowed) over the entire past; Gaussian (windowed) over recent windows; Gaussian (smoothed) over the entire past; Tukey (smoothed) over the entire past; relative entropy over the entire past; relative entropy over the recent past; relative entropy with multiple hypotheses.

A. RUBiS Testbed Results

The RUBiS testbed uses 5 virtual machines (VM1 to VM5) on the Xen platform, hosted on two physical servers (Host1 and Host2). VM1, VM2, and VM3 are created on Host1. The frontend server, processing or redirecting service requests, runs in VM1. The application server, handling the application logic, runs in VM2. The database backend server is deployed on VM3. The deployment is typical in its use of multiple VMs and their consolidation onto a smaller number of hosts. A request load generator and an anomaly injector run on two further virtual machines, VM4 and VM5, on Host2. The generator creates 10 hours' worth of service request load for Host1, where the auction site resides. The load emulates concurrent clients, sessions, and human activities. During the experiment, the anomaly injector injects 50 anomalies into the RUBiS online service on Host1. These 50 anomalies represent major sources of failures and performance issues in online services [18]. We inject them into the testbed using a uniform distribution. The virtual machine and host metrics are collected and analyzed in an anomaly detector. We present the results of analyzing the CPU utilizations in Table I. The CPU utilization data was quantized into 33 equally spaced bins, and for the windowing techniques, the window length was set to 100. For algorithms that use only the recent past, 10 windows of data were used. For the Gaussian methods, anomaly detection was performed for each window on the CPU utilization of each of the three virtual servers, and an alarm was raised if an anomaly was detected in at least one of them.
For the relative entropy methods, the sum of the test statistics of the servers is compared to a threshold (computed based on the fact that a sum of chi-squared random variables is itself a chi-squared random variable). As mentioned, a total of 50 anomalies were injected, and in our evaluation an anomaly is said to be detected if an alarm is raised in a window containing the anomaly. The results are presented in terms of two statistical metrics, Recall and False Positive Rate (FPR):

    Recall = (# of successful detections) / (# of total anomalies),    (2)

    False Positive Rate (FPR) = (# of false alarms) / (# of total alarms).    (3)

From Table I, it is clear that the windowed Gaussian techniques perform poorly when applied to the RUBiS data. The relative entropy-based algorithms and the Tukey method perform much better, with the multiple hypothesis relative entropy technique giving the best results (detecting 86% of the anomalies with a minimal false positive rate). Also, using the past few windows to select the null hypothesis provides better accuracy than using the entire past.

B. Production Data Center Results

In this section, we report results from analyzing data collected from two different real-world customer production data centers over 30-day and 60-day periods respectively (henceforth CUST1 and CUST2) [19]. Metrics such as CPU and memory utilization are sampled every 5 minutes. We segmented this data by hour of day and day of week for both CUST1 and CUST2, and applied the anomaly detection methods to the segments, thus performing context-based evaluations; we primarily report results on these context-based evaluations. Also, since this is data from production data centers, we had no control over the anomalies that manifested, nor do we have ground-truth knowledge about them. In our evaluations we therefore simply report the number of anomalies detected by the various techniques and compare among them.
The implementation of the Gaussian techniques is based on the industry-standard MASF approach [1]. In Figure 2, we present a representative plot of the windowed techniques on the data of one of the servers of CUST1, contextualizing the data based on hour of day, as explained in Section III. The results shown are for a fixed hour of the day and for CPU utilization data (measured in terms of the number of cores utilized, ranging from 0 to 8). The alarms raised by the different techniques are shown. To interpret the plots, note that if any of the curves is non-zero, then an alarm was raised by the technique corresponding to that curve. The heights of the curves carry no further meaning. We observe that the Gaussian threshold based method does not detect the first unusual increase in CPU utilization as soon as the goodness-of-fit based methods do. The multiple hypothesis version learns the behavior of the system and does not raise an alarm during similar subsequent increases.

Fig. 2. Anomaly detection on real-world data (CUST1)

Next, we present some results from the CUST2 data. We examined the CPU utilization of one of the servers and interleaved the data according to hour of day. More specifically, for each x in {0, 1, 2, ..., 23}, we grouped together all the measurements made between hour x and hour x+1 on weekdays. We tested the pointwise Gaussian and Tukey techniques, as well as the relative entropy technique with multiple hypotheses. To determine the limits for the pointwise Gaussian and Tukey techniques, we used the first half of the data set to estimate the mean, the standard deviation, and the relevant quantiles. We then applied the thresholds to the second half of the data set to detect anomalies. The server in question had been allocated a maximum of 16 cores, and thus the values of CPU utilization ranged between 0.0 and 16.0. For the relative entropy detector, we used a bin width of 2.0, which corresponds to 8 bins. The window size was 12 data points, which corresponds to an hour; thus the relative entropy detector, at the end of each hour, declares that window to be anomalous or not. To facilitate an apples-to-apples comparison, we processed the alarms raised by the Gaussian and Tukey methods as follows: for each technique, if at least one alarm is raised within a window, then the window is deemed anomalous. This way we are able to compare the number of windows that each technique flags as anomalous. The above comparison is performed for each of the 24 hours. The results are presented in Table II.

TABLE II. Comparison of the techniques on real customer data (CUST2): number of anomalies detected by each technique in each hour for weekdays. Columns: hour of day; # anomalies (Relative Entropy, RE); # anomalies (Gaussian); # anomalies (Tukey); common anomalies; unique to RE; unique to Gaussian; unique to Tukey.

The first column in Table II specifies the hour of day that was analyzed. The next three columns indicate the number of anomalies detected by each of the three techniques.
The fifth column states the number of anomalies detected by all of them, and the sixth, seventh, and eighth columns the anomalies uniquely detected by each technique. For instance, the sixth column records the number of anomalies detected by the relative entropy technique but not by the other two. To briefly summarize the results: in most cases the relative entropy technique identifies the anomalies detected by the other two techniques, and flags a few more as well. There are three notable exceptions, among them hours 1000 and 1500, where the Tukey method flags 6 to 7 more anomalies than the other techniques. A closer examination of the data corresponding to these hours reveals that the CPU utilization during these hours was mostly very low (< 0.5), leading to a very small inter-quartile range and therefore a very low upper threshold. As a result, the algorithm flagged a large number of windows as anomalous, and it is reasonable to surmise that some of these may be false alarms. On the other hand, the data corresponding to hour 1100 results in the relative entropy technique returning 5 anomalies that the other two techniques do not flag. Some of these anomalies are interesting when examined closely. The typical behavior of the CPU utilization is as follows: it hovers between 0.0 and 2.0 with occasional spikes. Often, after spiking to a value greater than 6.0 for a few measurements, it drops back down to a value around 2.0. But in a couple of cases the value does not drop back down entirely and remains between 4.0 and 6.0, indicating perhaps a stuck thread. The Gaussian and Tukey methods do not catch this anomaly, as it does not manifest as an extreme value; it is interesting that the relative entropy technique is able to catch it. Finally, we present results for data segmented by hour for a given day of the week only. In other words, we string together the data for each hour of a given day over the entire data collection period.
The results we present consist of analyzing the hours between 10 AM and 5 PM on Mondays only over the collection period. Both CPU utilization and memory utilization metrics were analyzed.

TABLE III: COMPARISON OF GAUSSIAN AND TUKEY METHODS FOR MONDAYS ONLY (CUST2). THE DATA IS ANALYZED BY HOUR OF DAY AND DAY OF WEEK. [Table values omitted: for each hour from 10 AM to 5 PM and each parameter (CPU, Memory), columns give the number of anomalies flagged by the Gaussian technique, by the Tukey technique, and the anomalies common to both.]

The summary of the analysis is that while fewer anomalies are flagged, there are some significant blips. This is presumably because server utilization over short, hourly periods tends to be stable with sudden bursts of activity. The data is summarized in Table III. It is clear from the tabulated results that CPU utilization shows more variability than memory utilization. The two anomaly detection techniques flag few anomalies common to each other; only in the hours of 12, 13, and 15 do they trigger common anomalies, albeit few. Note that there is significant variability in the performance of the two techniques. In the 10 A.M. hour, the Gaussian technique flags 12 anomalies for memory utilization while the Tukey technique triggers none. On the other hand, in the 11 A.M. hour, the Gaussian technique does not flag any anomalies, while the Tukey technique flags as many as 13 each of lower and upper anomalies in CPU utilization (26 in total). Although the Tukey method flags more anomalies overall (51, as opposed to 46 by the Gaussian technique), the Tukey method is better suited to anomaly detection because of its generality, flexibility, and ease of computation.

V. DISCUSSION

As seen, the statistical algorithms for anomaly detection analyze monitoring data over historical periods to make their decisions. This historical period could be the entire past of the workload operation, or it could be restricted to recent past values only. In either case, the considered data could be organized based on continuous history over time, or it could be segregated by context, e.g., hour of day or day of week.
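Segregating data by context, as above, amounts to keying each measurement by its calendar coordinates and maintaining a separate baseline per key; a minimal sketch (the dates and the 5-minute sampling interval are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def segment_by_context(samples):
    """Group (timestamp, value) pairs by (day-of-week, hour-of-day), so each
    calendar context can get its own baseline and thresholds, as in the
    Mondays-only, hour-by-hour analysis."""
    contexts = defaultdict(list)
    for ts, value in samples:
        contexts[(ts.weekday(), ts.hour)].append(value)
    return contexts

# Illustrative data: measurements every 5 minutes starting 10 AM on a Monday
# land in two contexts, (Monday, 10) and (Monday, 11), with 12 samples each.
start = datetime(2011, 5, 23, 10, 0)   # 2011-05-23 was a Monday
samples = [(start + timedelta(minutes=5 * i), float(i)) for i in range(24)]
contexts = segment_by_context(samples)
```

Each per-context list can then be fed to whichever detector is in use (Gaussian, Tukey, or relative entropy) to produce context-specific thresholds or null hypotheses.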
Which historical period and organization type is chosen has an effect on the results of the anomaly detection algorithms. For example, for the Gaussian and Tukey methods, this choice affects the value of the thresholds, and for relative entropy, it affects the null hypotheses against which the current window is evaluated. Several factors play a role in selecting the appropriate historical period. Workload behavior and pattern is one of them. If the workload has a periodic behavior, then recent past values that can give enough statistical significance might suffice. Further, if the pattern is based on weekly or hourly behavior, then organizing the data by context is preferable. If the workload is long running and has exhibited aperiodic behavior in the past, the entire past data would help smooth out averages. If the workload has varying behavior that is time-dependent, or if it originates from virtual machines that have been migrated or subjected to varying resource allocation in shared environments, considering only recent windows might help to eliminate noise from past variable behavior. In enterprise environments with dedicated infrastructures, it might be straightforward to select the appropriate historical period and organization of data. However, in cloud and utility computing environments with limited prior knowledge of workload behavior, high churn, and infrastructure shared across workloads, the most appropriate historical period and organization of data may be challenging and expensive to determine at large scale. Instead, we suggest an alternative approach in which, at any given time, we do multiple runs of the anomaly detection algorithm in parallel, each run leveraging data from a different historical period. The results from these individual runs could then be combined to provide overall insight and a system status indicator to the administrator. For multiple runs to be feasible, the algorithm has to be lightweight.
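The parallel-runs idea can be made concrete with a small combiner: each run contributes a 0/1 anomaly flag, and a weighted sum of the flags is mapped to a status color. The bucket thresholds below follow Table IV(b); the specific weights passed in are the illustrative choices discussed with Table IV(a).

```python
def system_status(flags, weights):
    """Map a weighted sum of per-run anomaly flags (0 or 1) to a status color,
    following the bucket thresholds of Table IV(b). The runs correspond to
    different historical periods (e.g., entire past with context, recent past
    without context, recent past with context)."""
    w = sum(flag * weight for flag, weight in zip(flags, weights))
    if w >= 3:
        return "Red"      # very severe
    if w >= 2:
        return "Orange"   # severe
    if w >= 1:
        return "Yellow"   # mild
    return "Green"        # none

# Equal weights: all three runs flagging yields Red; a single flag only Yellow.
status_all = system_status([1, 1, 1], [1, 1, 1])   # "Red"
status_one = system_status([0, 1, 0], [1, 1, 1])   # "Yellow"
```

The same combiner extends to two levels, first across historical periods for one algorithm and then across algorithms, by feeding the per-algorithm results back in as flags.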
As discussed in previous sections, our proposed algorithms are lightweight and could be used for this approach. Table IV(a) illustrates how the approach would work. The table shows a sample trace and output of an anomaly detection algorithm over time, using data from different historical periods (entire past, recent past) with and without consideration of context. An output of 1 indicates that an anomaly flag is raised for that time period, and 0 indicates otherwise. To combine the results, we propose a weighted combination of these output flags. This weighted combination can be mapped to the coloring scheme shown in Table IV(b). The color represents the overall status of the system being monitored. Table IV(a) shows the results when different weights are used. Column 5 shows the system status with equal weights for the different historical periods, Column 6 shows the system status when workload patterns (hence context) are given higher weight (1, 0.5, 1 for Columns 2, 3, and 4 respectively), and Column 7 shows the system status when historical knowledge is given higher weight (1, 0.5, 0.5 for Columns 2, 3, 4). While not shown, it is easy to see that the same approach can be extended to combine results from multiple algorithms as well. This could lead to two levels of combination: combinations across different historical periods for a given algorithm, and then combinations of those across multiple algorithms. The feasibility of this hybrid approach would, however, depend on the algorithms being simple enough that multiple of them can be executed on online data. More broadly, the discussion in this section points us to the different dimensions that anomaly detection algorithms have to consider, and how a systematic approach can enable us to obtain an improved understanding of system behavior in complex, large-scale shared computing infrastructures such as those represented by utility clouds.

TABLE IV: ILLUSTRATION OF COMBINED SYSTEM STATUS.

(a) Anomaly flags and combined system status [individual flag columns for the entire past (with context), recent past (no context), and recent past (with context) omitted]:

  Time   Equal weight   Weighted by context   Weighted by history
  t1     Green          Green                 Green
  t2     Green          Green                 Green
  t3     Yellow         Yellow                Yellow
  t4     Orange         Yellow                Yellow
  t5     Red            Orange                Orange
  t6     Red            Orange                Orange
  t7     Orange         Orange                Yellow

(b) Color mapping of weighted sum:

  Weighted Sum (w)   Color    Problem Intensity
  w >= 3             Red      Very Severe
  2 <= w < 3         Orange   Severe
  1 <= w < 2         Yellow   Mild
  w < 1              Green    None

VI. RELATED WORK

Most current industry monitoring tools use fixed thresholds for anomaly detection: fixed upper and lower bounds are determined a priori and remain constant during the entire process of anomaly detection. MASF [1] is one of the popular threshold-based techniques adopted in industry. MASF applies thresholds to data segmented by hour of day and day of week. As discussed earlier, these techniques have limitations in accuracy and false alarm rates due to their assumed data distributions and limited adaptability to changing workloads. They also scale poorly and lack correlation analysis. Prior academic work has developed many useful methods for anomaly detection, typically based on statistical techniques [2], [3], [4], [5], [6], [7], [8]; a summary is provided by [21]. However, few of them can operate at the scale of future data center or cloud computing systems and/or have the lightweight characteristic desired for online operation. Reasons include their use of statistical algorithms with high computing overheads and their use of onerously large amounts of raw metric information. In addition, they may require prior knowledge about application SLOs, service implementations, or request semantics, or they are focused on solving certain well-defined problems at specific levels of abstraction (i.e., metric levels) in data center systems.
In contrast, the techniques we propose are lightweight and systematically improve on state-of-the-art techniques widely adopted in industry. We have proposed techniques to improve on current pointwise fixed-threshold approaches, and we have also developed new windowing-based approaches that observe changes in the distribution of data. Our techniques are adaptable and can learn workload characteristics over time, improving accuracy and reducing false alarms. They also meet the scalability needs of future data centers, are applicable to multiple contexts of data (catching contextual anomalies), and can be applied to multiple metrics in data centers.

VII. CONCLUSIONS AND FUTURE WORK

Anomaly detection is an important component of closed loop management in data centers. In this paper, we presented statistical approaches for online detection of anomalies. Specifically, we have presented methods based on the Tukey and relative entropy statistics, and experimentally evaluated them. The proposed approaches are lightweight and improve upon prevalent Gaussian-based approaches. As ongoing work, we are performing more evaluations on synthetic as well as customer data. We are also exploring further refinements to the techniques presented in this paper for multiple metrics and for aggregation across multiple machines at large scale. We also plan to do evaluations with other cloud workloads and benchmarks.

REFERENCES

[1] J. P. Buzen et al., MASF: Multivariate Adaptive Statistical Filtering, CMG Conference.
[2] S. Agarwala et al., E2EProf: Automated End-to-End Performance Management for Enterprise Systems, DSN.
[3] S. Agarwala and K. Schwan, SysProf: Online Distributed Behavior Diagnosis through Fine-grain System Monitoring, ICDCS.
[4] I. Cohen et al., Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control, OSDI.
[5] P. Bahl et al., Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies, SIGCOMM.
[6] M. Y. Chen et al., Pinpoint: Problem Determination in Large, Dynamic Internet Services, DSN.
[7] P. Barham et al., Using Magpie for Request Extraction and Workload Modelling, OSDI '04.
[8] M. K. Aguilera et al., Performance Debugging for Distributed Systems of Black Boxes, SOSP '03.
[9] J. Tukey, Exploratory Data Analysis, Addison-Wesley, MA.
[10] D. Montgomery et al., Engineering Statistics, Wiley.
[11] I. Daubechies, Ten Lectures on Wavelets, Applied Mathematics.
[12] R. Baeza-Yates et al., Modern Information Retrieval, 1999.
[13] R. Johnson et al., Applied Multivariate Statistical Analysis, 2001.
[14] S. Mallat, A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, Academic Press.
[15] T. M. Cover and J. A. Thomas, Elements of Information Theory.
[16] E. L. Lehmann et al., Testing Statistical Hypotheses, Springer.
[17] E. Cecchet et al., Performance and Scalability of EJB Applications, OOPSLA.
[18] C. Wang et al., Online Detection of Utility Cloud Anomalies Using Metric Distributions, NOMS.
[19] Anonymized Production Data Center Logs, HP Internal Correspondence.
[20] C. Bishop, Neural Networks for Pattern Recognition.
[21] V. Chandola et al., Anomaly Detection: A Survey, ACM Computing Surveys, 2009.
[22] G. Cormode et al., An Improved Data Stream Summary: The Count-Min Sketch and Its Applications, Journal of Algorithms, 2005.
Database Marketing, Business Intelligence and Knowledge Discovery
Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski
