On Correlating Performance Metrics


Yiping Ding and Chris Thornley
BMC Software, Inc.

Kenneth Newman
BMC Software, Inc.
University of Massachusetts, Boston

Performance metrics and their measurements are the basis for identifying and addressing computer performance issues, from monitoring to capacity planning. IT professionals and performance analysts often face the questions of which metrics to look at, whether those metrics are related, and how to group them in a meaningful way for further analysis, e.g., workload characterization. In this paper we present a number of techniques for correlating performance metrics that may be sampled at different rates, over different lengths of time and overlapping intervals, and with different response times to a common system event. We also discuss how to assess the reliability of the correlations we compute and how to efficiently find relationships of interest.

1. Introduction

For any given computer system it is likely that many different performance metrics are available. IT staff and analysts often face the questions of which metrics to consider, whether they are related, which ones to monitor, and how to group them in meaningful ways for further analysis.
For instance, you might want to know which disks are being used by a particular software package or application; you might want to figure out which resources are used by which HTTP requests; you might want to know which remote node is sending packets to which remote node; you might want to investigate which performance metric values are affecting the application response time and causing it to exceed its threshold; you might want to identify the related resources used on different nodes to compute the end-to-end response time for an application; you might want to find out the time delays among the monitored values of different performance metrics for a given system event; you might need to know which processes are related, so that you can properly describe each workload when doing workload characterization. The list can go on and on.

One common characteristic of the above is that you want to link one object or phenomenon to another. While you may be able to answer some of these questions with probes or other specialized hardware or software, the power of the technique discussed here is that it has the potential to answer all of these questions, and many others like them, with relatively little prior knowledge. The basic idea, searching for events with high correlation coefficients, was described in another paper [1]. In this paper the emphasis is on addressing the

problems that arise when correlating real-world data and how they can be handled. Many of the problems that occur during analysis happen because data has not been reliably produced. For example, performance metrics are often collected at different rates. A preprocessor is needed in this case before correlation can be performed. (This is discussed in Section 3.) Or, more seriously, some metrics may not be collected continuously, due to hardware and software interruptions. Another complication is that some metrics have large variations and some only minor ones. Some may even be constant for a whole measurement interval. These differing characteristics create interesting challenges for identifying the relationships among performance metrics. We will deal with these issues in this paper. Let us first start with some definitions that will help us to quantify the discussion.

1.1. Definitions

The following concepts are used throughout this paper. Let X and Y be any two random variables with means (averages) $\bar{x}$ and $\bar{y}$, and positive variances $\sigma_x^2$ and $\sigma_y^2$. A typical example of the random variables that we will be comparing is the CPU consumption of two processes. Another typical comparison is between process page fault counts and disk activity.

Definition 1. Variance: Let X be a random variable with distribution $F(x) := \Pr(X \le x)$. The variance of the distribution is defined by
\[ \sigma_x^2 := E\big([X - E(X)]^2\big) = \overline{x^2} - \bar{x}^2, \]
where $E(X) = \bar{x}$ is the expectation (mean) of X. The variance is a measure of the spread or scattering of a distribution.

Definition 2. Covariance: Assume that X and Y have finite variances. Their covariance is defined as
\[ \mathrm{COV}(X, Y) = \overline{xy} - \bar{x}\,\bar{y}. \]
From the above definition we know that, if X and Y are independent, so that $\overline{xy} = \bar{x}\,\bar{y}$, then COV(X, Y) = 0. (The converse is not true.) Since covariance is not a normalized number, when it is not equal to 0 we do not know how closely the two random variables are related.
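As a small illustration of the variance and covariance definitions, the following Python sketch evaluates both from the moment formulas. The sample values and the metric names (`cpu_seconds`, `disk_reads`) are hypothetical and not taken from our measurements; the only point is that the covariance changes with the unit of measurement, which is why a normalized indicator is needed.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def variance(xs):
    """Variance as mean(x^2) - mean(x)^2 (population form)."""
    m = mean(xs)
    return mean([x * x for x in xs]) - m * m


def covariance(xs, ys):
    """Covariance as mean(x*y) - mean(x)*mean(y)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


# Hypothetical samples of two metrics over four intervals.
cpu_seconds = [1.0, 2.0, 4.0, 3.0]
disk_reads = [10.0, 30.0, 70.0, 50.0]

cov_seconds = covariance(cpu_seconds, disk_reads)
# Same data, with CPU time expressed in milliseconds instead of seconds.
cov_millis = covariance([1000.0 * x for x in cpu_seconds], disk_reads)

print(variance(cpu_seconds), cov_seconds, cov_millis)
```

The two covariance values differ by the unit factor of 1000 even though the relationship between the metrics is unchanged.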
If we normalize the two variables by
\[ X^* = (X - \bar{x})/\sigma_x \quad \text{and} \quad Y^* = (Y - \bar{y})/\sigma_y, \]
then $X^*$ and $Y^*$ have mean 0 and variance 1, which leads to a more informative relationship indicator:

Definition 3. Correlation Coefficient: The correlation coefficient of X and Y is defined by
\[ C(X, Y) = \mathrm{COV}(X^*, Y^*) = \frac{\mathrm{COV}(X, Y)}{\sigma_x \sigma_y}. \]
In other words, the correlation coefficient of two random variables is the covariance of the two normalized variables. Note that $|C(X, Y)| \le 1$ and that $C(X, Y) = \pm 1$ if and only if Y and X are related linearly, i.e., $Y = aX + b$. Note further that the correlation coefficient is independent of the origins and units of measurement. In other words, for any constants a, b, c, d with a > 0 and c > 0, we have
\[ C(aX + b, cY + d) = C(X, Y). \]
This property allows us to manipulate the scale of the data for a better visual representation of the relationship among performance metrics. One way to scale the measured data is to normalize it, so that the values are between 0 and 1. For instance, let

\[ a = \frac{1}{X_{\max} - X_{\min}} \quad \text{and} \quad b = \frac{-X_{\min}}{X_{\max} - X_{\min}}, \qquad (1) \]
where $X_{\max}$ ($X_{\min}$) is the largest (smallest) value of the data. Then the linear transformation $X' = aX + b$ will normalize the data, but not change the correlation coefficient. In other words, $C(X', Y) = C(X, Y)$. Y can be rescaled similarly. Please note that a and c have to be positive. Otherwise the manipulation will cause the curve to change its direction (first derivative), which will change the sign of the correlation coefficient. (If both a and c are negative, the changes cancel out.)

Intuitively, high positive correlation of two random variables implies that the peaks (and valleys) of their values tend to occur together. A linear transformation will not change this visual property. We will see in a later section that the correlation coefficient is strongly affected by the trending or shape, i.e., the second derivative, of the performance curves.

The basic technique here for finding relationships among different performance metrics is to find the correlation coefficient between pairs. When the correlation is high (close to 1), we can assume that the two metrics represent resources that were working on the same task. Because correlation coefficients are invariant under linear transformations, the units of the metrics do not matter (that is, changing a metric from, say, bytes to MBytes will not change its correlation coefficient with another metric). However, correlations are sensitive to the time phase of the data, since there might be a delay between the various activities that complete a transaction. For this reason, the measurement interval should not be too small. On the other hand, the measurement or sample interval should not be too large either. A very large sample interval not only reduces the potential number of samples for a given time interval, but could also link otherwise unrelated events.
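The invariance property and the normalization of equation (1) can be checked numerically. The sketch below is a minimal illustration in Python, assuming the usual sample form of the correlation coefficient; the helper names `pearson` and `min_max` and the data values are our own hypothetical choices.

```python
import math


def pearson(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def min_max(xs):
    """Linear map X' = a*X + b with a and b as in equation (1)."""
    lo, hi = min(xs), max(xs)
    a = 1.0 / (hi - lo)
    b = -lo / (hi - lo)
    return [a * x + b for x in xs]


# Hypothetical samples of two metrics.
x = [3.0, 7.0, 5.0, 9.0, 4.0]
y = [10.0, 22.0, 15.0, 30.0, 11.0]

base = pearson(x, y)
normalized = min_max(x)
# Positive scale factors (a > 0, c > 0): the coefficient is unchanged.
scaled = pearson(normalized, [1024.0 * v + 5.0 for v in y])
# One negative scale factor: the coefficient changes sign.
flipped = pearson([-v for v in x], y)

print(base, scaled, flipped)
```

Rescaling with positive factors leaves the coefficient unchanged up to rounding, while negating one series flips its sign.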
As a rule of thumb, the interval should be greater than times the average delay at different resources/devices.

1.2. A Model for Presenting Performance Data

Before we address the issues related to performance data correlation, let us first define a time-series data stream, D:
\[ D = \{(d_1, t_1), (d_2, t_2), \ldots, (d_n, t_n)\}. \]
Generally, in performance data collection or measurement, it is true that
\[ t_{i+1} - t_i = t_{i+2} - t_{i+1}, \quad \text{for } i = 1, 2, \ldots, n-2. \]
In other words, data or events are sampled at regular intervals. Unless stated otherwise, we use this assumption in this paper. Now let us consider another data set, D':
\[ D' = \{(d'_1, t'_1), (d'_2, t'_2), \ldots, (d'_{n'}, t'_{n'})\}. \]
In general, $t'_i$ may not equal $t_i$. When $\{t'_i\}$ is a subsequence of $\{t_i\}$ (or vice versa), we will say the two sequences are in sync. When the sequences are not in sync, it is still possible to compute a correlation, but the results are less reliable. In the following discussion (unless stated otherwise) we assume that performance data is collected in sync.

2. Not All Metrics are the Same

Consider a data set $\{(d_1, t_1), (d_2, t_2), \ldots, (d_n, t_n)\}$. The first complexity we need to deal with is that the $\{d_i\}$ can have different meanings for different performance metrics. They might represent:

1. the current value of a fluctuating metric (e.g., memory utilization); or
2. the average value of a fluctuating metric since time $t_{i-1}$; or

3. the current value of a cumulating metric (e.g., disk reads since system start up); or
4. an event count since $t_{i-1}$.

When computing correlation coefficients it is best if:

A. the data represents events that occurred during the same period of time, i.e., in sync;
B. the data is of the same type or has been changed to the same type.

For instance, if you try to compare an average value with an instantaneous value, you will not get the sharpest possible correlation, since the average represents work that occurred over a longer period of time. If both values are averages, then the correlation coefficient is more reliable when the averages are over the same number of samples.

Another issue comes with cumulative metrics (which is how the majority of metrics are presented). Such metrics, of course, are always increasing (except when they roll over). Further, they represent events that happened before the current interval as well as events in the current interval. Because of this, correlating such metrics directly will underplay their variations. This means that, when computing correlations, one should transform such metrics into a sequence of differences and not use the metric itself. The correlation coefficients for the sequence of differences and for the original cumulative metrics are often quite different. Here is a simple example: consider two cumulative sequences whose correlation coefficient (CC) is .998 (see Figure 1), while the correlation coefficient of the sequences of their differences is -.49 (see Figure 2).

Figure 1. Two cumulative sequences appear to be highly correlated, with correlation coefficient CC = .998.

Figure 2. Although the two cumulative sequences in Figure 1 appear to be highly correlated, the sequences of their differences are not. In fact, the CC = -.49.

It is appropriate that these latter two sequences have a negative correlation, since more often than not, one tends to increase its activity whenever the other decreases. It is interesting to note (though not directly relevant to this paper) that although two cumulative metrics are more likely to produce a positive correlation, this is not always true. In fact, two cumulative metrics could have close to zero correlation. For example, the two curves in Figure 3 have a correlation coefficient of .7.

Figure 3. Two cumulative sequences with correlation coefficient CC = .7.

In theory, the two cumulative data sets could have a correlation coefficient anywhere from 0 to 1 (two straight lines, for instance, give 1). Figure 4 shows two cumulative series with correlation coefficient CC = .55. We could expect that, as the two curves' knees become closer, their correlation coefficient should approach 1.

Figure 4. Two cumulative series with correlation coefficient CC = .55.

As we have mentioned before, two positively correlated cumulative performance data sets may not, in reality, be related. For each cumulative data set, if we take the difference of consecutive samples and form a new sequence based on the differences, we will have a new data set in which each data sample captures the change since the previous sample. It is this new data set that should be used for computing the correlation coefficient. Correlating cumulative metrics directly would not give useful information.

Figure 5. The two data series derived by taking differences of consecutive cumulative samples (panel title: Two Series with CC=-.7). The CC for the two series is .77, which is very different from the CC for the two original cumulative data sets, which have CC = .55 (see Figure 4).
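The effect of differencing can be reproduced with a few lines of Python. This is a sketch on hypothetical data, not the series plotted in the figures: two counters whose per-interval activity tends to move in opposite directions still look strongly correlated while they are in cumulative form.

```python
import math
from itertools import accumulate


def pearson(xs, ys):
    """Correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def differences(cumulative):
    """Turn a cumulative counter into per-interval changes."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


# Hypothetical per-interval activity: one metric is busy when the other is idle.
activity_a = [5, 1, 6, 2, 7, 1, 6, 2]
activity_b = [1, 6, 2, 5, 1, 7, 2, 6]

# What a collector reporting "count since start-up" would record.
cumulative_a = list(accumulate(activity_a))
cumulative_b = list(accumulate(activity_b))

cc_cumulative = pearson(cumulative_a, cumulative_b)
cc_differences = pearson(differences(cumulative_a), differences(cumulative_b))

print(cc_cumulative, cc_differences)
```

The shared upward trend dominates the cumulative form, so its coefficient is high and positive, while the differenced series expose the negative relationship.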

3. Correlating Metrics with Different Sampling Rate

Performance metrics are often sampled at different rates. Experience shows that even collectors programmed to sample at the same rate will eventually drift or skip samples for any number of reasons. When correlating two metrics with different sampling rates, one has to find a common interval in order to correlate them. First, we will assume that one interval is a multiple of the other. Let the two data sets be $D_1$ and $D_2$:
\[ D_1 = \{[(d_1, t_1), (d_2, t_2), \ldots, (d_n, t_n)],\ [(d_{n+1}, t_{n+1}), \ldots, (d_{2n}, t_{2n})],\ \ldots,\ [(d_{(m-1)n+1}, t_{(m-1)n+1}), \ldots, (d_{mn}, t_{mn})]\} \]
and
\[ D_2 = \{(d'_1, t'_1), (d'_2, t'_2), \ldots, (d'_m, t'_m)\}, \]
where for each data sample of $D_2$ there are n data samples in $D_1$. We can only correlate data that represents the same period of time. So before we compute the correlation, the first sequence above must be modified so that the n data samples that correspond to a particular data sample in the second sequence are merged or aggregated. This generally means that either one difference is computed or a sum (or average) is taken, depending on the nature of the first metric.

Different ways of handling the additional information gained through the extra n samples in data set $D_1$ for each sample in data set $D_2$ will usually yield different correlation coefficient values. Since data set $D_2$ has coarse sample intervals, the nature of the metric (cumulative, difference, or average) will determine how the n data samples in $D_1$ that correspond to a particular data sample in $D_2$ are merged or aggregated. For instance, if both $D_1$ and $D_2$ are cumulative, then we need to take the differences between samples for both data sets. In this case, $d'_2 - d'_1$ corresponds to $d_{n+1} - d_1$, $d'_3 - d'_2$ corresponds to $d_{2n+1} - d_{n+1}$, and so on. For the purpose of computing their correlation coefficient, the values of the skipped samples in $D_1$ are irrelevant.
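A minimal Python sketch of this preprocessing step follows. The function name `aggregate`, the `kind` labels, and the data are hypothetical; the sketch assumes that the finer interval divides the coarser one exactly (factor n) and that both series start at the same time.

```python
def aggregate(fine, n, kind):
    """Reduce a finely sampled series so each value covers n fine intervals."""
    if kind == "cumulative":
        # Keep every n-th sample and difference them: d[n] - d[0], d[2n] - d[n], ...
        # The skipped samples in between do not affect the result.
        kept = fine[::n]
        return [later - earlier for earlier, later in zip(kept, kept[1:])]
    groups = [fine[i:i + n] for i in range(0, len(fine) - n + 1, n)]
    if kind == "count":
        return [sum(group) for group in groups]        # events per coarse interval
    if kind == "average":
        return [sum(group) / n for group in groups]    # mean over the coarse interval
    raise ValueError("unknown metric kind: " + kind)


# Hypothetical data: a cumulative metric sampled three times as often (n = 3)
# as a second cumulative metric.
fine_cumulative = [0, 2, 3, 7, 8, 12, 13, 14, 15, 16, 20, 25, 31]
coarse_cumulative = [50, 64, 75, 82, 110]

fine_deltas = aggregate(fine_cumulative, 3, "cumulative")
coarse_deltas = aggregate(coarse_cumulative, 1, "cumulative")

print(fine_deltas, coarse_deltas)
```

After this step both series contain one value per coarse interval and can be passed to any correlation routine.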
4. Correlating Data with Holes

It would be simplest if the data we wanted to correlate had been collected continuously. But, quite frequently, there are interruptions in data collection, for either short periods (a few samples) or long periods (perhaps hours or days). In the cases where some sections of data (for either metric) are not measured or collected for more than one sample interval, those time sections should be removed from the correlation computation for both metrics. We define a period that is not sampled for more than one sample interval as a hole in the measurement data set (Figure 6).

Figure 6. Two data series, each with one hole in it. The duration of each hole is represented by a value.

The question now becomes: what is the logical way of computing the correlation coefficient? There are several ways to compute it. One method is to remove samples from both data sets whenever either metric contains a hole and correlate the remaining points. Another is to compute the correlation coefficient for each continuous section and then take the weighted average of those coefficients. While it may not be obvious, the first method will give more accurate results, as averaging correlation coefficients over pieces of intervals will give misleading answers. As a simple example (Figure 7), it could be that two metrics have a positive linear relationship on each of two subintervals (and thus would have a correlation coefficient of 1.0 on each subinterval), but that the parameters of the linear relationship differ on the two subintervals, which means that overall the relationship is not linear and the correlation coefficient is less than 1.0. In the example presented in Figure 7, the two data series are not correlated overall.

Figure 8 shows a scenario, based on the example in Figure 6, where holes are removed and continuous sections are connected for analysis. Note that, as we remove holes, the other metric's data intervals corresponding to the holes are removed as well. In this example, the two holes were artificially introduced. Before they were introduced, the correlation coefficient for the two data sets was .53. After the holes are removed, the two data sets presented in Figure 8 have correlation coefficient .5, which is close to the original one.

Figure 7. Two series with linear subintervals. Although there are two subintervals in which the two data sets are linearly correlated, i.e., with correlation coefficient CC = 1, overall the two data series are not correlated. Therefore, using the correlation coefficients of the subintervals to compute the overall correlation coefficient through, say, a weighted average is not accurate.
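Both points, removing holes from both series and the weakness of averaging per-section coefficients, can be seen in a short Python sketch. The data is hypothetical and `None` is our own choice of marker for a sample that was not collected.

```python
import math


def pearson(xs, ys):
    """Correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def drop_holes(xs, ys):
    """Keep only the sample times at which both metrics were collected."""
    kept = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [x for x, _ in kept], [y for _, y in kept]


# Hypothetical series; None marks a hole. The relationship is y = 2x before
# the hole and y = 0.5x - 2.5 after it.
series_1 = [1, 2, 3, 4, None, None, 5, 6, 7, 8]
series_2 = [2, 4, 6, 8, 9, None, 0, 0.5, 1, 1.5]

clean_1, clean_2 = drop_holes(series_1, series_2)

# Method 1: correlate all remaining points together.
overall = pearson(clean_1, clean_2)

# Method 2: correlate each continuous section, then take the weighted average.
first = pearson(clean_1[:4], clean_2[:4])
second = pearson(clean_1[4:], clean_2[4:])
weighted = (4 * first + 4 * second) / 8

print(overall, weighted)
```

Each section is perfectly linear, so the weighted average is 1, while the coefficient over all remaining points is negative; the averaged value hides the change in the relationship.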

Figure 8. The two data series after removing the holes that existed in Figure 6. For the time intervals in which either hole exists, samples for both data sets are removed.

Note that it is not generally useful to interpolate or otherwise attempt to 'fill' a hole. In the first place, while it is easier to use data without holes, a hole does not have any impact on the reliability of the calculation of the correlation coefficient. Secondly, interpolation will tend to give more weight to the values bordering the holes: an 'advantage' that cannot be justified.

5. Correlating Events not Occurring at the Same Time

Software execution is a chain of events. Even with parallel processing on a multiprocessor machine, some events happen before others. That is, correlated events may well happen with a delay. We could identify those potentially correlated events by systematically shifting the correlation interval. In other words, we could correlate the values of one metric with those of another, where the values of the second metric were collected at a later time than those of the first. When the correlation coefficient goes up after a shift, it is an indication that the two metrics support the same activity and that one runs with a delay relative to the other. The amount of the delay can be estimated by finding what shift gives the maximal correlation coefficient. Note that, in general, the spill interval will be much larger than the delay, so the scenario described here will not happen very often. A spill interval is defined as one or more (usually more) consecutive sample intervals for which statistical aggregations, such as the average, are often computed for the samples.

Note also that, since the length of the delay is affected by the system performance, i.e., how busy the system is, it is difficult to identify a constant delay.
However, if one does find the amount of delay by shifting correlation intervals, the delay information could be used for many other performance management activities, such as performance prediction for certain metrics.

One problem in looking for delays by shifting is that, when you are looking at a large number of metrics, it is hard to know which ones to test and how much to shift. Trying everything is too expensive. In order to discover metrics that are likely to be related with a delay, we decided to combine intervals (that is, make two or more neighboring intervals into one by adding the utilizations of those intervals) and see whether the correlation between the two metrics goes up significantly. If it does, then there is a reasonable chance that the two metrics are related via a delay. Note that under ordinary circumstances the correlation will go up somewhat when fewer intervals are used. So to suspect that there is a delayed correlation, one must see a significant jump in the correlation coefficient. Of course, this must be confirmed directly, i.e., by correlating the metrics after a shift.

For instance, we collected data at -second intervals for about half an hour and found that the System metric Exception Dispatches and the CPU utilization of a process called LSASS had a correlation coefficient of -.45, i.e., it appeared that these two metrics were not correlated (Figure 9).
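The two steps, screening by combining intervals and then confirming by shifting, can be sketched in Python as follows. The function names and the data are hypothetical (the second series is simply the first one delayed by one sample), so the values are not those of the LSASS measurement.

```python
import math


def pearson(xs, ys):
    """Correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def merge_intervals(values, k):
    """Combine every k neighboring intervals into one by adding their values."""
    return [sum(values[i:i + k]) for i in range(0, len(values) - k + 1, k)]


def lagged_pearson(xs, ys, lag):
    """Correlate xs[i] with ys[i + lag], i.e. ys reacts `lag` samples later."""
    if lag == 0:
        return pearson(xs, ys)
    return pearson(xs[:-lag], ys[lag:])


def best_lag(xs, ys, max_lag):
    """Shift (0..max_lag) that gives the maximal correlation coefficient."""
    return max(range(max_lag + 1), key=lambda lag: lagged_pearson(xs, ys, lag))


# Hypothetical data: y repeats the activity of x one sample later.
x = [1, 4, 2, 8, 3, 9, 2, 7, 1, 6, 2, 5]
y = [2, 1, 4, 2, 8, 3, 9, 2, 7, 1, 6, 2]

cc_original = lagged_pearson(x, y, 0)                             # looks unrelated
cc_merged = pearson(merge_intervals(x, 2), merge_intervals(y, 2)) # screening step
cc_shifted = lagged_pearson(x, y, 1)                              # confirmation
delay = best_lag(x, y, 3)

print(cc_original, cc_merged, cc_shifted, delay)
```

On this data the unshifted coefficient is negative, merging pairs of intervals raises it markedly, and shifting by one sample recovers the exact relationship. Merging is only a cheap screen: it helps when the delay mostly falls inside the merged interval, which is why the shift must still be checked directly.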

Figure 9. The metric Exception Dispatches and the CPU utilization of the process LSASS had a correlation coefficient of -.45. Data is sampled at -second intervals.

But when we converted the data into -second intervals, we found that the correlation coefficient jumped to .94 (Figure 10). We confirmed that there was a delayed correlation by going back to the original data and matching the process metrics with the Exception Dispatches numbers that were recorded one interval later. The correlation coefficient was .97. Now if we go back to Figure 9 and look at the activities of the two metrics, we can clearly see that, more or less, Exception Dispatches numbers react to LSASS process activity one interval later.

Another scenario to consider is that two metrics could at times correlate with a delay and at other times correlate without a delay or with a different delay. Which happens might depend on how busy the machine is. This means that one has to be cautious when using data taken at a high sampling rate. Merging the data into an effectively lower sampling rate might be an appropriate standard procedure. This should allow correlated metrics with small varying delays to be identified effectively.

Figure 10. When data is sampled at -second intervals, the metric Exception Dispatches and the CPU utilization of the process LSASS had a correlation coefficient (CC) of .94. Here we have normalized the data. (Note that normalization does not change the CC.)

6. Correlation Quality

Let us define the ideal correlation coefficient between two metrics as the limit of their correlation coefficient when we compute it at an unlimited number of sample points for a given time interval. Of course, we cannot expect to find precisely the ideal correlation coefficient. We discuss here the relationship between the accuracy of computing the correlation coefficient (how close we are to the ideal correlation coefficient) and the number of samples used to compute it.
We look at this in two ways:

1. How many samples do you need to use to ensure that the computed correlation coefficient is within a predetermined margin of the ideal correlation coefficient 95% of the time?
2. If we use N samples to compute the correlation coefficient, what is the 95% confidence interval?

Generally, for our purposes, such as the performance analysis activities listed in Section 1, it is not essential that the correlation coefficient be computed very accurately. However, it is useful to know that it is reasonably correct. The reason the second question is of special interest is described in Section 6.2. Note that one can never be absolutely sure that a computed correlation coefficient has any given accuracy. All we can say is that there is a certain probability of having a particular accuracy.

6.1. How Many Samples to Use

In general, based on the central limit theorem [2], the accuracy grows with the square root of the number of samples. That is, if you use 4 times as many samples, the accuracy is doubled. For instance, if we use samples to compute a correlation coefficient and obtain a value of .8, then the 95% confidence interval is ±., and if we use samples it is ±.5 and with 4 samples it is ±.3. As high accuracy is not essential for our purposes, generally 5- samples will give an acceptable result.

The following example should give some sense of how increasing the number of samples improves the accuracy of the computation. We computed sets of estimated correlation coefficients as follows: we started with data that had 6 samples. Then for iterations we chose samples at random and computed an estimated correlation coefficient for each. We repeated the experiment for 4 samples, for 3 samples and 8 samples. Finally, we put the coefficients into histogram buckets of size .5 and drew the graphs in Figure 11. As you can see, the graphs get sharper as we increase the number of samples. That is, the standard deviation decreases, which means our confidence in any one computation increases.

With 4 data samples or more, as we can see in Figure 11, the graph of correlation coefficients becomes normal-like and therefore starts to become fairly reliable.
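An experiment of this kind is easy to repeat. The Python sketch below is an illustration under our own assumptions, not a reproduction of the data behind the figure: the data set is synthetic (600 points with an ideal correlation of about 0.8), and the sample sizes (10, 40, 160) and the iteration count (200) are hypothetical choices.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic data set: y = 0.8*x + 0.6*noise gives a correlation near 0.8.
xs = [rng.gauss(0.0, 1.0) for _ in range(600)]
ys = [0.8 * x + 0.6 * rng.gauss(0.0, 1.0) for x in xs]


def spread(sample_size, iterations=200):
    """Standard deviation of the coefficient estimated from random subsets."""
    estimates = []
    for _ in range(iterations):
        chosen = rng.sample(range(len(xs)), sample_size)
        estimates.append(pearson([xs[i] for i in chosen], [ys[i] for i in chosen]))
    return statistics.stdev(estimates)


spreads = {size: spread(size) for size in (10, 40, 160)}
for size, value in spreads.items():
    print(size, round(value, 3))
```

The spread of the estimates shrinks as the sample size grows, roughly halving for each fourfold increase, in line with the square-root behavior described above.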
For instance, if the correlation threshold had been chosen as .55, then the graph implies that data whose ideal correlation coefficient is around .85 would almost certainly be identified as correlated. But we can infer that, as the ideal correlation coefficient comes closer to the correlation threshold, there is an increasing chance that we will erroneously compute the correlation coefficient as less than the threshold and thus mistakenly conclude that the two metrics are not correlated. However, there is a significant degree of uncertainty in choosing correlation thresholds, and an error of this type is equivalent to choosing a slightly different threshold.

Figure 11. The distribution of the computed correlation coefficient when, 4, 3 and 8 samples are each chosen times.

The example of samples is especially interesting. With these few samples, the graph is very far from normal, as it is more likely than with more samples that the correlation coefficient will be near one. This is because it is more likely, with a small number of samples, that the chosen samples are linearly (or nearly linearly) related. The lesson is: samples are much too few to give reliable results.

Note that these computations were done with randomly chosen samples. It is logical to assume that if the samples were chosen more cleverly, then the correlation coefficient that we compute would

be more accurate. However, it would be difficult to do this in a way that is more efficient than just taking more samples.

6.2. When There are Too Many Metrics

As we have mentioned in Section 1, IT staff and analysts often face many, sometimes too many, performance metrics. Identifying the relationships among the metrics is a daunting task, even for computers (if it is to finish within a reasonable amount of time). When there are so many metrics, it is impractical to compute the correlation coefficient between all pairs of metrics with the desired accuracy. However, since we only care when the correlation coefficient is above a predefined correlation threshold, we can do a preliminary calculation to determine which pairs appear to be poorly correlated. If we can eliminate many such pairs with a quick computation (one that does not eliminate pairs of interest), we will have greatly sped up our calculations.

The basic idea is to compute the correlation coefficient initially with substantially fewer samples than we intend to use. While this will give us a less reliable value for the correlation coefficient, as long as we adjust for the greater margin of error, it should be reliable enough to eliminate many metric pairs that are not of interest, i.e., pairs unlikely to have an ideal correlation coefficient above the correlation threshold. On the other hand, once we decide a pair might be of interest, we can compute the correlation coefficient with more samples to get a more reliable value.

Let us assume that we have chosen a correlation threshold of T to determine whether or not a significant correlation exists between two metrics. That is, if the correlation coefficient is below T, we will assume that there is no relationship between the metrics and, if it is above T, we will assume that there is a relationship. We wish to use N samples (with N being small) to estimate the correlation coefficient.
What lower threshold should we use to decide whether it is still possible that the ideal correlation coefficient is T or greater? If it might be greater than T, we will then make a more accurate computation on a greater number of samples. But if we can be reasonably sure that it is less than T, we can proceed to other pairs of metrics, and will have saved a considerable amount of computation. Let us require that, to discard a pair, we have to be 95% certain that its ideal correlation coefficient is less than T. Then, if we assume that the sequence is normally distributed, the algorithm is:

1. Compute the statistic
\[ Z = \frac{1}{2} \ln \frac{1 + T}{1 - T}. \]
This is Fisher's Z transformation, which produces an approximately normal statistic.
2. Compute its standard deviation, which is
\[ \sigma_z = \frac{1}{\sqrt{N - 3}}. \]
3. Then the lower threshold will be
\[ \frac{\exp\big(2 (Z - 1.64\,\sigma_z)\big) - 1}{\exp\big(2 (Z - 1.64\,\sigma_z)\big) + 1}. \]
(This is the first equation solved for T, with Z replaced by $Z - 1.64\,\sigma_z$. The value 1.64 is taken from a table and is the multiplier for the standard deviation to get 95% certainty.)

For instance, say that we wish to consider pairs of metrics only if the correlation coefficient exceeds T = 0.8 and that we wish to estimate the correlation coefficient by looking at 20 samples. Then Z will be 1.0986, $\sigma_z$ will be 0.2425, and we should set the lower threshold at 0.6049. That is, if the correlation coefficient computed using 20 samples is less than 0.6049, we can be 95% certain that the ideal correlation coefficient is less than our threshold of 0.8, so we can assume that we do not have to consider the pair further.
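The three steps translate directly into a few lines of Python. The function name `lower_threshold` and its parameter names are our own; the default multiplier of 1.64 is the one-sided 95% value used in the text.

```python
import math


def lower_threshold(threshold, samples, multiplier=1.64):
    """Screening threshold for a coefficient computed from few samples.

    threshold  -- correlation threshold T on the ideal coefficient
    samples    -- number of samples N used for the quick estimate (N > 3)
    multiplier -- standard deviations for the desired one-sided certainty
    """
    # Step 1: Fisher's Z transformation of the threshold T.
    z = 0.5 * math.log((1.0 + threshold) / (1.0 - threshold))
    # Step 2: standard deviation of Z for N samples.
    sigma_z = 1.0 / math.sqrt(samples - 3)
    # Step 3: transform Z - multiplier * sigma_z back to a correlation value.
    shifted = math.exp(2.0 * (z - multiplier * sigma_z))
    return (shifted - 1.0) / (shifted + 1.0)


# A pair whose quick estimate falls below this value can be discarded.
print(round(lower_threshold(0.8, 20), 4))
```

With more samples the screening threshold moves closer to T, so fewer pairs need the expensive second computation; with very few samples it drops so low that almost nothing can be discarded.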

7. Summary

To understand what is happening within a single computer or a network of computers, it is often necessary to make connections among events and metrics. Generally, the metrics made available by the operating system give little help. A priori, computing correlation coefficients appears to be an important technique to discover this information. However, real-world data creates many issues that need to be dealt with before this technique can be used effectively. The solutions that we have presented should make correlation coefficients a practical tool in the arsenal of the performance analyst.

8. References

[1] Ding, Yiping and Newman, Kenneth, Automatic Workload Characterization, CMG Proceedings, Vol. pp 53-66, Orlando, Florida, December.
[2] Kachigan, Sam Kash, Statistical Analysis: An Interdisciplinary Introduction to Univariate & Multivariate Methods, Radius Press, New York, 1986.