On Correlating Performance Metrics


Yiping Ding and Chris Thornley, BMC Software, Inc.
Kenneth Newman, BMC Software, Inc., University of Massachusetts, Boston

Performance metrics and their measurements are the basis for identifying and addressing computer performance issues, from monitoring to capacity planning. IT professionals and performance analysts often face the questions of which metrics to look at, whether those metrics are related, and how to group them in meaningful ways for further analysis, e.g., workload characterization. In this paper we present a number of techniques for correlating performance metrics that may be sampled at different rates, over different lengths and overlapping intervals, and with different response times to a common system event. We also discuss how to assess the reliability of the correlations we compute and how to efficiently find relationships of interest.

1. Introduction

For any given computer system it is likely that many different performance metrics are available. IT staffs and analysts often face the questions of which to consider, whether they are related, which ones to monitor, and how to group them in meaningful ways for further analysis.
For instance:

- You might want to know which disks are being used by a particular software package or application;
- You might want to figure out which resources are used by which HTTP requests;
- You might want to know which node is sending packets to which remote node;
- You might want to investigate which performance metric values are affecting the application response time and causing it to exceed its threshold;
- You might want to identify the related resources used on different nodes to compute the end-to-end response time for an application;
- You might want to find out the time delays among the monitored values of different performance metrics for a given system event;
- You might need to know which processes are related, so you can properly describe each workload when doing workload characterization.

The list can go on and on. One common characteristic of the above is that you want to link one object or phenomenon to another. While you may be able to answer some of these questions with probes or other specialized hardware or software, the power of the technique discussed here is that it has the potential to answer all of these questions, and many others like them, with relatively little prior knowledge. The basic idea, searching for events with high correlation coefficients, was described in another paper [1]. In this paper the emphasis is on addressing the problems that arise when correlating real-world data and how they can be handled.

Many of the problems that occur during analysis happen because data has not been reliably produced. For example, performance metrics are often collected at different rates; a preprocessor is then needed before correlation can be performed (this is discussed in Section 3). More seriously, some metrics may not be collected continuously, due to hardware and software interruptions. Another complication is that some metrics have large variations and some only minor ones; some may even be constant for a whole time/measurement interval. These differing characteristics create interesting challenges for identifying the relationships among performance metrics. We deal with these issues in this paper. Let us first start with some definitions that will help us quantify the discussion.

1.1. Definitions

The following concepts are used throughout this paper. Let X and Y be any two random variables with means (averages) E(X) and E(Y), and positive variances σ_x² and σ_y². A typical example of the random variables that we will be comparing is the CPU consumption of two processes. Another typical comparison is between process page fault counts and disk activity.

Definition 1. Variance: Let X be a random variable with distribution F(x) := Pr(X ≤ x). The variance of the distribution is defined by

    σ_x² := E([X - E(X)]²) = E(X²) - (E(X))²,

where E(X) is the expectation (mean) of X. The variance is a measure of the spread or scattering of a distribution.

Definition 2. Covariance: Assume that X and Y have finite variances. Their covariance is defined as

    COV(X, Y) := E(XY) - E(X)E(Y).

From the above definition we know that if X and Y are independent, i.e., E(XY) = E(X)E(Y), then COV(X, Y) = 0. (The converse is not true.) Since covariance is not a normalized number, when it is not equal to 0 we do not know how closely the two random variables are related.
If we normalize the two variables by X* = (X - E(X))/σ_x and Y* = (Y - E(Y))/σ_y, then X* and Y* have mean 0 and variance 1, which leads to a more informative relationship indicator:

Definition 3. Correlation Coefficient: The correlation coefficient of X and Y is defined by

    C(X, Y) := COV(X*, Y*) = COV(X, Y) / (σ_x σ_y).

In other words, the correlation coefficient of two random variables is the covariance of the two normalized variables. Note that |C(X, Y)| ≤ 1, and that C(X, Y) = ±1 if and only if Y and X are related linearly, i.e., Y = aX + b. Note further that the correlation coefficient is independent of the origins and units of measurement. In other words, for any constants a, b, c, d with a > 0 and c > 0, we have

    C(aX + b, cY + d) = C(X, Y).

This property allows us to manipulate the scale of the data for a better visual representation of the relationship among performance metrics. One way to scale the measured data is to normalize it, so that the values are between 0 and 1. For instance, let

    a = 1/(X_max - X_min)  and  b = -X_min/(X_max - X_min),    (1)

where X_max (X_min) is the largest (smallest) value of the data. Then the linear transformation X' = aX + b will normalize the data, but not change the correlation coefficient. In other words, C(X', Y) = C(X, Y). Y can be transformed similarly. Please note that a and c have to be positive. Otherwise the transformation will cause the curve to change its direction (first derivative), which will change the correlation coefficient. (Though if both a and c are negative, the changes will cancel out.) Intuitively, a high positive correlation of two random variables implies that the peaks (and valleys) of their values tend to occur together. A linear transformation will not change this visual property. We will see in a later section that the correlation coefficient is strongly affected by the trending or shape, i.e., the second derivative, of the performance curves.

The basic technique here for finding relationships among different performance metrics is to find the correlation coefficient between pairs. When the correlation is high (close to 1), we can assume that the two metrics represent resources that were working on the same task. Because correlation coefficients are invariant under linear transformations, the units of the metrics do not matter (that is, changing a metric from, say, bytes to MBytes will not change its correlation coefficient with another metric). However, correlations are sensitive to the time phase of the data. For this reason, the measurement interval should not be too small: there might be a delay between the various activities that complete a transaction. On the other hand, the measurement or sample interval should not be too large either. A very large sample interval not only reduces the potential number of samples for a given time interval, but also could link otherwise unrelated events.
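To make Definition 3 and the invariance property concrete, here is a minimal Python sketch; the `pearson` helper and the sample values are ours, invented purely for illustration:

```python
import math

def pearson(xs, ys):
    """Correlation coefficient per Definition 3: COV(X, Y) / (sigma_x * sigma_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# Hypothetical CPU-consumption samples for two processes.
x = [3.0, 5.0, 2.0, 8.0, 6.0, 4.0]
y = [30.0, 52.0, 25.0, 79.0, 61.0, 44.0]

c = pearson(x, y)

# Invariance: for a > 0 and c > 0, C(aX + b, cY + d) = C(X, Y),
# so changing units (e.g., bytes to MBytes) does not matter.
c_scaled = pearson([2.5 * v + 7 for v in x], [0.001 * v - 3 for v in y])
assert abs(c - c_scaled) < 1e-9

# Min-max normalization per equation (1) also leaves the coefficient unchanged.
lo, hi = min(x), max(x)
x_norm = [(v - lo) / (hi - lo) for v in x]
assert abs(pearson(x_norm, y) - c) < 1e-9
```

The same `pearson` helper is reused in the sketches that follow.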
As a rule of thumb, the interval should be several times greater than the average delay at the different resources/devices.

2. A Model for Presenting Performance Data

Before we address the issues related to performance data correlation, let us first define a time-series data stream D:

    D = {(d_1, t_1), (d_2, t_2), ..., (d_n, t_n)}.

Generally, in performance data collection or measurement, it is true that t_{i+1} - t_i = t_{i+2} - t_{i+1} for i = 1, 2, ..., n-2. In other words, data or events are sampled at regular intervals. Unless stated otherwise we use this assumption in this paper. Now let us consider another data set D':

    D' = {(d'_1, t'_1), (d'_2, t'_2), ..., (d'_{n'}, t'_{n'})}.

In general, t'_i may not equal t_i. When t'_i is a subsequence of t_i (or vice versa) we will say the two sequences are in sync. When the sequences are not in sync, it is still possible to compute a correlation, but the results are less reliable. In the following discussion (unless stated otherwise) we assume that performance data is collected in sync.

2.1. Not All Metrics are the Same

Consider a data set {(d_1, t_1), (d_2, t_2), ..., (d_n, t_n)}. The first complexity we need to deal with is that the {d_i} can have different meanings for different performance metrics. They might represent:

1. the current value of a fluctuating metric (e.g., memory utilization); or
2. the average value of a fluctuating metric since time t_{i-1}; or
3. the current value of a cumulative metric (e.g., disk reads since system start-up); or
4. an event count since t_{i-1}.

When computing correlation coefficients it is best if:

A. the data represents events that occurred during the same period of time, i.e., in sync;
B. the data is of the same type or has been changed to the same type.

For instance, if you try to compare an average value with an instantaneous value, you will not get the sharpest possible correlation, since the average represents work that occurred over a longer period of time. If both values are averages, then the correlation coefficient is more reliable when the averages are over the same number of samples.

Another issue comes with cumulative metrics (which is how the majority of metrics are presented). Such metrics, of course, are always increasing (except when they roll over). Further, they represent events that happened before the current interval as well as events in the current interval. Because of this, correlating such metrics directly will underplay their variations. This means that, when computing correlations, one should transform such metrics into a sequence of differences and not use the metric itself. The correlation coefficients for the sequence of differences and for the original cumulative metrics are often quite different. Here is a simple example: consider two cumulative sequences whose correlation coefficient (CC) is 0.998 (see Figure 1).

Figure 1. Two cumulative sequences appear to be highly correlated, with correlation coefficient CC = 0.998.

The correlation coefficient of the sequences of their differences, however, is -0.49 (see Figure 2).

Figure 2. Although the two cumulative sequences in Figure 1 appear to be highly correlated, the sequences of their differences are not; in fact, their CC = -0.49.

It is appropriate that these latter two sequences have a negative correlation, since more often than not, one tends to increase its activity whenever the other decreases. It is interesting to note (though not directly relevant to this paper) that although two cumulative metrics are more likely to produce positive correlation, this is not always true. In fact, two cumulative metrics could have close to zero correlation. For example, the two curves in Figure 3 have a correlation coefficient of 0.07.

Figure 3. Two cumulative sequences with correlation coefficient CC = 0.07.

In theory, two cumulative data sets could have a correlation coefficient anywhere from -1 to 1 (two straight lines, for instance). Figure 4 shows two cumulative series with correlation coefficient CC = 0.55. We could expect that, as the two curves' knees become closer, their correlation coefficient should approach 1.

Figure 4. Two cumulative series with correlation coefficient CC = 0.55.

As we have mentioned before, two positively correlated cumulative performance data sets may not, in reality, be related. For each cumulative data set, if we take the difference of consecutive samples and form a new sequence based on the differences, we will have a new data set in which each data sample captures the change since the previous sample. It is this new data set that should be used for computing the correlation coefficient. Correlating cumulative metrics directly would not give useful information.

Figure 5. The two data series derived by taking differences of consecutive cumulative samples. The CC for the two series is -0.77, which is very different from the CC for the two original cumulative data sets, which have CC = 0.55 (see Figure 4).
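The differencing transformation can be sketched as follows. The counter values below are invented, but they reproduce the qualitative effect described above: the raw cumulative series correlate highly because both trend upward, while their per-interval differences are strongly negatively correlated:

```python
def pearson(xs, ys):
    """Correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def differences(cumulative):
    """Per-interval activity recovered from a cumulative counter."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]

# Hypothetical cumulative counters (e.g., disk reads since boot) for two devices.
c1 = [10, 30, 35, 60, 80, 82, 110, 140]
c2 = [5, 8, 30, 33, 36, 60, 63, 66]

high = pearson(c1, c2)  # high: the shared upward trend dominates
low = pearson(differences(c1), differences(c2))  # strongly negative
```

Here the two devices invented for the example are busy in alternating intervals, so the differenced series, which is the one that actually reflects activity, reveals the negative relationship that the cumulative series hides.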
3. Correlating Metrics with Different Sampling Rates

Performance metrics are often sampled at different rates. Experience shows that even collectors programmed to sample at the same rate will eventually drift or skip for any number of reasons. When correlating two metrics with different sampling rates, one has to find a common interval in order to correlate them. First, we will assume that one interval is a multiple of the other. Let the two data sets be D1 and D2:

    D1 = {[(d1_1, t1_1), ..., (d1_n, t1_n)],
          [(d1_{n+1}, t1_{n+1}), ..., (d1_{2n}, t1_{2n})],
          ...,
          [(d1_{(m-1)n+1}, t1_{(m-1)n+1}), ..., (d1_{mn}, t1_{mn})]}

and

    D2 = {(d2_1, t2_1), (d2_2, t2_2), ..., (d2_m, t2_m)},

where for each data sample of D2 there are n data samples in D1. We can only correlate data that represents the same period of time. So before we compute the correlation, the first sequence above must be modified so that the n data samples that correspond to a particular data sample in the second sequence are merged or aggregated. This generally means that either one difference is computed or a sum (or average) is taken; which, depends on the nature of the first metric. Different ways of handling the additional information gained through the extra n samples in data set D1 for each sample in data set D2 will usually yield different correlation coefficient values. Since data set D2 has coarse sample intervals, the nature of the metric (cumulative, difference, or average) will determine how the n data samples in D1 that correspond to a particular data sample in D2 are merged or aggregated. For instance, if both D1 and D2 are cumulative, then we need to take the differences between samples for both data sets. In this case, d2_2 - d2_1 corresponds to d1_{n+1} - d1_1, d2_3 - d2_2 corresponds to d1_{2n+1} - d1_{n+1}, and so on. For the purpose of computing their correlation coefficient, the values of the skipped samples in D1 are irrelevant.
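The merge step for Section 3, assuming both metrics are cumulative and that D1 is sampled n times for each sample of D2, can be sketched as follows; the counter values and the factor n = 3 are invented for illustration:

```python
def pearson(xs, ys):
    """Correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def to_coarse_cumulative(fine, n):
    """Keep only the fine-rate samples that line up with the coarse boundaries.
    For a cumulative metric the skipped interior samples are irrelevant."""
    return fine[n - 1::n]

def differences(series):
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical cumulative counters: d1 is sampled 3x as often as d2.
d1 = [2, 5, 9, 12, 13, 14, 20, 28, 30, 31, 33, 40]  # fine rate
d2 = [3, 8, 25, 36]                                  # coarse rate
n = 3

# Align to the common interval, then difference both cumulative series.
cc = pearson(differences(to_coarse_cumulative(d1, n)), differences(d2))
```

For an averaged (rather than cumulative) fine-rate metric, the group of n samples would instead be averaged before correlating, as the text describes.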
4. Correlating Data with Holes

It would be simplest if the data we wanted to correlate had been collected continuously. But, quite frequently, there are interruptions in data collection, for either short periods (a few samples) or long periods (perhaps hours or days). In cases where some sections of data (for either metric) are not measured/collected for more than one sample interval, those (time) sections should be removed from the correlation computation for both metrics. We define a period that is not sampled for more than one sample interval as a hole in the measurement data set (Figure 6).

Figure 6. Two data series, each with one hole in it. The durations of the holes are represented by a value of 0.
The question now becomes: what is the logical way of computing the correlation coefficient? There are several ways it can be computed. One method is to remove samples from the data sets wherever either metric contains a hole and correlate the remaining points. Another is to compute the correlation coefficient for each continuous section and then take the weighted average of those coefficients. While it may not be obvious, the first method gives more accurate results, as averaging correlation coefficients over pieces of intervals can give misleading answers. As a simple example (Figure 7), it could be that two metrics have a positive linear relationship on each of two subintervals (and thus would have a correlation coefficient of 1.0 on each subinterval), but the parameters of the linear relationship differ between the two subintervals, which means that overall the relationship is not linear and the correlation coefficient is less than 1.0. In the example presented in Figure 7, the overall correlation coefficient for the two data sets is well below 1.

Figure 7. Although there are two subintervals in which the two data sets are linearly correlated, i.e., with correlation coefficient CC = 1, overall the two data series are not correlated.

Figure 8 shows a scenario, based on the example in Figure 6, where holes are removed and continuous sections are connected for analysis. Note that, as we remove holes, the other metric's data intervals corresponding to the holes are removed as well. In this example, the two holes were artificially introduced. Before they were introduced, the correlation coefficient for the two data sets was 0.53. After the holes are removed, the two data sets presented in Figure 8 have correlation coefficient 0.5, which is close to the original one. Therefore, using the correlation coefficients of the subintervals to compute the overall correlation coefficient through, say, a weighted average is not accurate.
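The hole-removal method, dropping an interval from both series whenever either has a hole, can be sketched as follows; here `None` marks unmeasured samples, and the values are invented:

```python
def pearson(xs, ys):
    """Correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def drop_holes(xs, ys):
    """Keep only the intervals in which BOTH metrics were measured; when either
    series has a hole, the other metric's samples for that period go too."""
    kept = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [x for x, _ in kept], [y for _, y in kept]

# Hypothetical series, each with one hole (None = not collected).
xs = [1.0, 2.0, None, None, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 5.0, 6.0, None, 10.0, 12.0]

x_clean, y_clean = drop_holes(xs, ys)  # 4 intervals survive
cc = pearson(x_clean, y_clean)
```

Note that the sketch deliberately does not interpolate across the holes, for the reasons given below: a hole removed this way does not bias the computation, whereas interpolated values would over-weight the samples bordering the hole.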
Figure 8. The two data series after removing the holes that existed in Figure 6. For the time intervals where either hole exists, samples for both data sets are removed.

Note that it is not generally useful to interpolate or otherwise attempt to 'fill' a hole. In the first place, while it is easier to use data without holes, a hole does not have any impact on the reliability of the calculation of the correlation coefficient. Secondly, interpolation will tend to give more weight to the values bordering the holes: an 'advantage' that cannot be justified.

5. Correlating Events not Occurring at the Same Time

Software execution is a chain of events. Even with parallel processing on a multiprocessor machine, some events happen before others. That is, correlated events may well happen with a delay. We can identify such potentially correlated events by systematically shifting the correlation interval. In other words, we can correlate the values of one metric with those of another, where the values of the second metric were collected at a later time than those of the first. When the correlation coefficient goes up after a shift, it is an indication that the two metrics support the same activity and that one runs with a delay relative to the other. The amount of the delay can be estimated by finding what shift gives the maximal correlation coefficient. Note that, in general, the spill interval will be much larger than the delay, so the scenario described here will not happen very often. (A spill interval is defined as one or more, usually more, consecutive sample intervals over which statistical aggregations, such as averages, are computed for the samples.) Note also that, since the length of the delay is affected by the system performance, i.e., how busy the system is, it is difficult to identify a constant delay.
However, if one does find the amount of delay by shifting correlation intervals, the delay information can be used for many other performance-management activities, such as performance prediction for certain metrics. One problem in looking for delays by shifting is that, when you are looking at a large number of metrics, it is hard to know which ones to test and how much to shift; trying everything is too expensive. In order to discover metrics that are likely to be related with a delay, we decided to combine intervals (that is, make two or more neighboring intervals into one by adding the utilizations of those intervals) and see if the correlation between the two metrics goes up significantly. If it does, then there is a reasonable chance that the two metrics are related via a delay. Note that under ordinary circumstances the correlation will go up somewhat when fewer intervals are used. So, to suspect that there is a delayed correlation, one must see a significant jump in the correlation coefficient. Of course, this must be confirmed directly, i.e., by correlating the metrics after a shift. For instance, we collected data at short, regular sample intervals for about half an hour and found that the system metric Exception Dispatches and the CPU utilization of a process called LSASS had a correlation coefficient near zero, i.e., it appeared that these two metrics were not correlated (Figure 9).
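Confirming a suspected delay by shifting can be sketched as follows; the series are invented so that the second metric simply repeats the first one interval later, mimicking the Exception Dispatches / LSASS behavior described here:

```python
def pearson(xs, ys):
    """Correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def best_lag(xs, ys, max_lag):
    """Correlate xs against ys shifted earlier by 0..max_lag intervals and
    return the shift giving the maximal correlation coefficient."""
    scores = {k: pearson(xs[:len(xs) - k], ys[k:]) for k in range(max_lag + 1)}
    lag = max(scores, key=scores.get)
    return lag, scores[lag]

# Hypothetical utilizations: ys reacts to xs exactly one interval later.
xs = [0, 5, 0, 0, 7, 0, 3, 0, 0, 6, 0, 0]
ys = [0] + xs[:-1]

lag, cc = best_lag(xs, ys, max_lag=3)  # expect lag 1 with a near-perfect CC
```

In practice one would run `best_lag` only on pairs flagged by the cheaper interval-combining screen described above, since shifting every pair of metrics by every plausible delay is too expensive.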
Figure 9. The metric Exception Dispatches and the CPU utilization of the process LSASS appear uncorrelated at the original sampling rate.

But when we converted the data into longer sample intervals, we found that the correlation coefficient jumped to 0.94 (Figure 10). We confirmed that there was a delayed correlation by going back to the original data and matching the process metrics with the Exception Dispatches numbers that were recorded one interval later; the correlation coefficient was then 0.97. Now, if we go back to Figure 9 and look at the activities of the two metrics, we can clearly see that, more or less, the Exception Dispatches numbers react to LSASS process activity one interval later.

Another scenario to consider is that two metrics could at times correlate with a delay and at other times correlate without a delay, or with a different delay. Which happens might depend on how busy the machine is. This means that one has to be cautious when using data taken at a high sampling rate. Merging the data into an effectively lower sampling rate might be an appropriate standard procedure. This should allow correlated metrics with small, varying delays to be identified effectively.

Figure 10. When the data is merged into longer sample intervals, the metric Exception Dispatches and the CPU utilization of the process LSASS have a correlation coefficient (CC) of 0.94. Here we have normalized the data to [0, 1]. (Note that normalization does not change the CC.)

6. Correlation Quality

Let us define the ideal correlation coefficient between two metrics as the limit of their correlation coefficient when we compute it over an unlimited number of sample points for a given time interval. Of course, we cannot expect to find precisely the ideal correlation coefficient. We discuss here the relationship between the accuracy of the computed correlation coefficient (how close we are to the ideal correlation coefficient) and the number of samples used to compute it.
We look at this in two ways:

1. How many samples do you need to use to ensure that the computed correlation coefficient is within a predetermined tolerance of the ideal correlation coefficient 95% of the time?
2. If we use N samples to compute the correlation coefficient, what is the 95% confidence interval?
Generally, for our purposes, such as the performance analysis activities listed in Section 1, it is not essential that the correlation coefficient be computed very accurately. However, it is useful to know that it is reasonably correct. The reason the second question is of special interest is described in Section 6.2. Note that one can never be absolutely sure that a computed correlation coefficient has any given accuracy; all we can say is that there is a certain probability of having a particular accuracy.

6.1. How Many Samples to Use

In general, based on the central limit theorem [2], the accuracy grows with the square root of the number of samples. That is, if you use 4 times as many samples, the accuracy is doubled: each quadrupling of the sample count roughly halves the width of the 95% confidence interval around a computed correlation coefficient. As high accuracy is not essential for our purposes, a modest number of samples will generally give an acceptable result.

The following example should give some sense of how increasing the number of samples improves the accuracy of the computation. We computed sets of estimated correlation coefficients as follows: starting from a long pair of data series, we repeatedly chose a fixed number of samples at random and computed an estimated correlation coefficient from each draw. We then repeated the experiment for several progressively larger sample counts. Finally, we put the coefficients into small histogram buckets and drew the graphs in Figure 11. As you can see, the graphs get sharper as we increase the number of samples; that is, the standard deviation decreases, which means our confidence in any one computation increases. With enough data samples, as we can see in Figure 11, the distribution of computed correlation coefficients becomes normal-like, and the estimates therefore start to become fairly reliable.
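The subsampling experiment can be reproduced in miniature. The synthetic series below stand in for the measured data, and the sample counts are illustrative only; the point is that quadrupling the subsample size roughly halves the spread of the estimates:

```python
import random

def pearson(xs, ys):
    """Correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

random.seed(1)

# Two long synthetic metrics sharing a common component (ideal CC about 0.74).
N = 1600
base = [random.gauss(0, 1) for _ in range(N)]
xs = [b + random.gauss(0, 0.6) for b in base]
ys = [b + random.gauss(0, 0.6) for b in base]

def estimate_spread(sample_size, trials=200):
    """Standard deviation of CC estimates computed from random subsamples."""
    ccs = []
    for _ in range(trials):
        idx = random.sample(range(N), sample_size)
        ccs.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    m = sum(ccs) / trials
    return (sum((c - m) ** 2 for c in ccs) / trials) ** 0.5

s20, s80 = estimate_spread(20), estimate_spread(80)  # s20 should be ~2x s80
```

Bucketing the per-trial coefficients into a histogram, as in Figure 11, would show the same sharpening as the subsample size grows.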
For instance, if the correlation threshold had been chosen as 0.55, then the graph implies that data whose ideal correlation coefficient is around 0.85 would almost certainly be identified as correlated. But we can infer that, as the ideal correlation coefficient comes closer to the correlation threshold, there is an increasing chance that we will erroneously compute the correlation coefficient as less than the threshold and thus mistakenly conclude that the two metrics' correlation is below it. However, there is a significant degree of uncertainty in choosing correlation thresholds, and an error of this type is equivalent to choosing a slightly different threshold.

Figure 11. The distribution of the computed correlation coefficient for each of four subsample sizes, each drawn repeatedly at random.

The case with the fewest samples is especially interesting. With very few samples, the graph is far from normal, as it is more likely than with larger samples that the correlation coefficient will be near one. This is because, with a small number of samples, it is more likely that the chosen samples happen to be linearly (or nearly linearly) related. The lesson is that a handful of samples is much too few to give reliable results. Note that these computations were done with randomly chosen samples. It is logical to assume that if the samples were chosen more cleverly, then the correlation coefficient that we compute would be more accurate. However, it would be difficult to do this in a way that is more efficient than just taking more samples.

6.2. When There are Too Many Metrics

As we mentioned in Section 1, IT staffs and analysts often face many, sometimes too many, performance metrics. Identifying the relationships among the metrics is a daunting task, even for computers (to finish within a reasonable amount of time). When there are very many metrics, it is impractical to compute the correlation coefficient between all pairs with the desired accuracy. However, since we only care whether the correlation coefficient is above a predefined correlation threshold, we can do a preliminary calculation to determine which pairs appear to be poorly correlated. If we can eliminate many such pairs with a quick computation (one that does not eliminate pairs of interest), we will have greatly sped up our calculations. The basic idea is to compute the correlation coefficient initially with substantially fewer samples than we ultimately intend to use. While this will give us a less reliable value for the correlation coefficient, as long as we adjust for the greater margin of error, it should be reliable enough to eliminate many metric pairs that are not of interest, i.e., unlikely to have an ideal correlation coefficient above the correlation threshold. Once we decide a pair might be of interest, we can compute the correlation coefficient with more samples to get a more reliable value.

Let us assume that we have chosen a correlation threshold of T to determine whether or not a significant correlation exists between two metrics. That is, if the correlation coefficient is below T, we will assume that there is no relationship between the metrics and, if it is above T, we will assume that there is one. We wish to use N samples (with N being small) to estimate the correlation coefficient.
What lesser threshold should we use to determine whether or not it is possible that the ideal correlation coefficient is below T? If it might be greater than T, we will then make a more accurate computation on a greater number of samples. But if we can be reasonably sure that it is less than T, we can proceed to other pairs of metrics, and will have saved a considerable amount of computation. Let us require that, to discard a pair, we have to be 95% certain that its ideal correlation coefficient is less than T. Then, if we assume that the sequence is normally distributed, the algorithm is:

1. Compute the statistic

       Z = (1/2) ln[(1 + T)/(1 - T)].

   This is Fisher's Z transformation, which produces an approximately normal statistic.

2. Compute its standard deviation, which is

       σ_z = 1/sqrt(N - 3).

3. The lower threshold is then

       [exp(2(Z - 1.645 σ_z)) - 1] / [exp(2(Z - 1.645 σ_z)) + 1].

   (This is the first equation solved for T, with Z replaced by Z - 1.645 σ_z; equivalently, it is tanh(Z - 1.645 σ_z). The multiplier 1.645 is taken from a table of the normal distribution and gives 95% one-sided certainty.)

For instance, say that we wish to consider pairs of metrics only if the correlation coefficient exceeds T = 0.8. Then Z = 1.0986, and for a quick estimate based on N samples the lower threshold is tanh(1.0986 - 1.645/sqrt(N - 3)). If the correlation coefficient computed from those few samples falls below this lower threshold, we can be 95% certain that the ideal correlation coefficient is less than our threshold of 0.8, and we do not have to consider the pair further.
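The three-step screening computation can be sketched directly; the quick-sample count of 30 below is an arbitrary illustrative choice, not a value from the paper:

```python
import math

def screening_threshold(t, n):
    """Lower cutoff for a quick CC estimate based on n samples: below it, we are
    ~95% certain the ideal correlation coefficient is under the full threshold t."""
    z = 0.5 * math.log((1 + t) / (1 - t))  # step 1: Fisher's Z transformation of t
    sigma = 1 / math.sqrt(n - 3)           # step 2: std deviation of the Z statistic
    # Step 3: back-transform Z - 1.645*sigma (tanh is the inverse of Fisher's Z).
    return math.tanh(z - 1.645 * sigma)

# e.g. a full threshold of T = 0.8 screened with 30 quick samples per pair:
cutoff = screening_threshold(0.8, 30)
# Pairs whose quick estimate falls below `cutoff` are discarded; only the rest
# are recomputed with the full number of samples.
```

As expected, the cutoff tightens toward T as n grows: with more quick samples the estimate is more trustworthy, so fewer borderline pairs need the expensive full recomputation.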
7. Summary

To understand what is happening within a single computer or a network of computers, it is often necessary to make connections among events and metrics. Generally, the metrics made available by the operating system give little help. A priori, computing correlation coefficients appears to be an important technique for discovering this information. However, real-world data creates many issues that need to be dealt with before the technique can be used effectively. The solutions we have presented should make correlation coefficients a practical tool in the performance analyst's arsenal.

8. References

[1] Ding, Yiping and Newman, Kenneth, "Automatic Workload Characterization," CMG Proceedings, pp. 53-66, Orlando, Florida, December.

[2] Kachigan, Sam Kash, Statistical Analysis: An Interdisciplinary Introduction to Univariate & Multivariate Methods, Radius Press, New York, 1986.
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationCost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
More informationSimple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
More informationSan Jose State University Engineering 10 1
KY San Jose State University Engineering 10 1 Select Insert from the main menu Plotting in Excel Select All Chart Types San Jose State University Engineering 10 2 Definition: A chart that consists of multiple
More informationSimple Linear Regression
STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze
More informationThe Association of System Performance Professionals
The Association of System Performance Professionals The Computer Measurement Group, commonly called CMG, is a not for profit, worldwide organization of data processing professionals committed to the measurement
More informationSIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID
SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID Renewable Energy Laboratory Department of Mechanical and Industrial Engineering University of
More informationThe Method of Least Squares
The Method of Least Squares Steven J. Miller Mathematics Department Brown University Providence, RI 0292 Abstract The Method of Least Squares is a procedure to determine the best fit line to data; the
More informationMultivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #47/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
More informationCORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREERREADY FOUNDATIONS IN ALGEBRA
We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREERREADY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical
More informationTHE STATISTICAL TREATMENT OF EXPERIMENTAL DATA 1
THE STATISTICAL TREATMET OF EXPERIMETAL DATA Introduction The subject of statistical data analysis is regarded as crucial by most scientists, since errorfree measurement is impossible in virtually all
More informationCALCULATIONS & STATISTICS
CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 15 scale to 0100 scores When you look at your report, you will notice that the scores are reported on a 0100 scale, even though respondents
More informationECON20310 LECTURE SYNOPSIS REAL BUSINESS CYCLE
ECON20310 LECTURE SYNOPSIS REAL BUSINESS CYCLE YUAN TIAN This synopsis is designed merely for keep a record of the materials covered in lectures. Please refer to your own lecture notes for all proofs.
More informationA synonym is a word that has the same or almost the same definition of
SlopeIntercept Form Determining the Rate of Change and yintercept Learning Goals In this lesson, you will: Graph lines using the slope and yintercept. Calculate the yintercept of a line when given
More informationAn introduction to ValueatRisk Learning Curve September 2003
An introduction to ValueatRisk Learning Curve September 2003 ValueatRisk The introduction of ValueatRisk (VaR) as an accepted methodology for quantifying market risk is part of the evolution of risk
More informationUniversal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.
Universal hashing No matter how we choose our hash function, it is always possible to devise a set of keys that will hash to the same slot, making the hash scheme perform poorly. To circumvent this, we
More information2 Integrating Both Sides
2 Integrating Both Sides So far, the only general method we have for solving differential equations involves equations of the form y = f(x), where f(x) is any function of x. The solution to such an equation
More informationIntroduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
More informationCredibility and Pooling Applications to Group Life and Group Disability Insurance
Credibility and Pooling Applications to Group Life and Group Disability Insurance Presented by Paul L. Correia Consulting Actuary paul.correia@milliman.com (207) 7711204 May 20, 2014 What I plan to cover
More informationNumerical Summarization of Data OPRE 6301
Numerical Summarization of Data OPRE 6301 Motivation... In the previous session, we used graphical techniques to describe data. For example: While this histogram provides useful insight, other interesting
More information4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4
4. Simple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/4 Outline The simple linear model Least squares estimation Forecasting with regression Nonlinear functional forms Regression
More information1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number
1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x  x) B. x 3 x C. 3x  x D. x  3x 2) Write the following as an algebraic expression
More informationAppendix E: Graphing Data
You will often make scatter diagrams and line graphs to illustrate the data that you collect. Scatter diagrams are often used to show the relationship between two variables. For example, in an absorbance
More informationSTATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI
STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members
More informationSection A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1. Page 1 of 11. EduPristine CMA  Part I
Index Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1 EduPristine CMA  Part I Page 1 of 11 Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting
More information2. Simple Linear Regression
Research methods  II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationData Analysis Tools. Tools for Summarizing Data
Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool
More informationThe Trip Scheduling Problem
The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems
More informationCopyrighted Material. Chapter 1 DEGREE OF A CURVE
Chapter 1 DEGREE OF A CURVE Road Map The idea of degree is a fundamental concept, which will take us several chapters to explore in depth. We begin by explaining what an algebraic curve is, and offer two
More informationHigh School Algebra Reasoning with Equations and Inequalities Solve equations and inequalities in one variable.
Performance Assessment Task Quadratic (2009) Grade 9 The task challenges a student to demonstrate an understanding of quadratic functions in various forms. A student must make sense of the meaning of relations
More informationGCSE Statistics Revision notes
GCSE Statistics Revision notes Collecting data Sample This is when data is collected from part of the population. There are different methods for sampling Random sampling, Stratified sampling, Systematic
More informationStatistics 104: Section 6!
Page 1 Statistics 104: Section 6! TF: Deirdre (say: Deardra) Bloome Email: dbloome@fas.harvard.edu Section Times Thursday 2pm3pm in SC 109, Thursday 5pm6pm in SC 705 Office Hours: Thursday 6pm7pm SC
More informationamleague PROFESSIONAL PERFORMANCE DATA
amleague PROFESSIONAL PERFORMANCE DATA APPENDIX 2 amleague Performance Ratios Definition Contents This document aims at describing the performance ratios calculated by amleague: 1. Standard Deviation 2.
More informationIntroduction to time series analysis
Introduction to time series analysis Margherita Gerolimetto November 3, 2010 1 What is a time series? A time series is a collection of observations ordered following a parameter that for us is time. Examples
More informationChicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011
Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this
More informationCapacity Planner Technical FAQ
Capacity Planner Technical FAQ Table of Contents Requirements... 2 Installation... 3 Discovery and Collection... 3 Utilization and Performance Counters... 5 Hyperthreading... 7 Application Conflicts...
More informationData reduction and descriptive statistics
Data reduction and descriptive statistics dr. Reinout Heijungs Department of Econometrics and Operations Research VU University Amsterdam August 2014 1 Introduction Economics, marketing, finance, and most
More informationLaboratory work in AI: First steps in Poker Playing Agents and Opponent Modeling
Laboratory work in AI: First steps in Poker Playing Agents and Opponent Modeling Avram Golbert 01574669 agolbert@gmail.com Abstract: While Artificial Intelligence research has shown great success in deterministic
More informationEngineering Problem Solving and Excel. EGN 1006 Introduction to Engineering
Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques
More informationOverview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification
Introduction Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Advanced Topics in Software Engineering 1 Concurrent Programs Characterized by
More informationMeasurement Information Model
mcgarry02.qxd 9/7/01 1:27 PM Page 13 2 Information Model This chapter describes one of the fundamental measurement concepts of Practical Software, the Information Model. The Information Model provides
More informationLocal outlier detection in data forensics: data mining approach to flag unusual schools
Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential
More informationThe Big 50 Revision Guidelines for S1
The Big 50 Revision Guidelines for S1 If you can understand all of these you ll do very well 1. Know what is meant by a statistical model and the Modelling cycle of continuous refinement 2. Understand
More informationSimple Regression Theory II 2010 Samuel L. Baker
SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the
More informationCharacteristics and statistics of digital remote sensing imagery
Characteristics and statistics of digital remote sensing imagery There are two fundamental ways to obtain digital imagery: Acquire remotely sensed imagery in an analog format (often referred to as hardcopy)
More informationImproved metrics collection and correlation for the CERN cloud storage test framework
Improved metrics collection and correlation for the CERN cloud storage test framework September 2013 Author: Carolina Lindqvist Supervisors: Maitane Zotes Seppo Heikkila CERN openlab Summer Student Report
More informationUSING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE FREE NETWORKS AND SMALLWORLD NETWORKS
USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE FREE NETWORKS AND SMALLWORLD NETWORKS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA natarajan.meghanathan@jsums.edu
More informationRUTHERFORD HIGH SCHOOL Rutherford, New Jersey COURSE OUTLINE STATISTICS AND PROBABILITY
RUTHERFORD HIGH SCHOOL Rutherford, New Jersey COURSE OUTLINE STATISTICS AND PROBABILITY I. INTRODUCTION According to the Common Core Standards (2010), Decisions or predictions are often based on data numbers
More informationQuantitative Methods for Finance
Quantitative Methods for Finance Module 1: The Time Value of Money 1 Learning how to interpret interest rates as required rates of return, discount rates, or opportunity costs. 2 Learning how to explain
More informationFlorida Math for College Readiness
Core Florida Math for College Readiness Florida Math for College Readiness provides a fourthyear math curriculum focused on developing the mastery of skills identified as critical to postsecondary readiness
More informationHow to Win the Stock Market Game
How to Win the Stock Market Game 1 Developing ShortTerm Stock Trading Strategies by Vladimir Daragan PART 1 Table of Contents 1. Introduction 2. Comparison of trading strategies 3. Return per trade 4.
More informationCase Study I: A Database Service
Case Study I: A Database Service Prof. Daniel A. Menascé Department of Computer Science George Mason University www.cs.gmu.edu/faculty/menasce.html 1 Copyright Notice Most of the figures in this set of
More informationGeostatistics Exploratory Analysis
Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt
More informationSkewness and Kurtosis in Function of Selection of Network Traffic Distribution
Acta Polytechnica Hungarica Vol. 7, No., Skewness and Kurtosis in Function of Selection of Network Traffic Distribution Petar Čisar Telekom Srbija, Subotica, Serbia, petarc@telekom.rs Sanja Maravić Čisar
More information2013 MBA Jump Start Program. Statistics Module Part 3
2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just
More informationThe Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy
BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.
More informationUtah Core Curriculum for Mathematics
Core Curriculum for Mathematics correlated to correlated to 2005 Chapter 1 (pp. 2 57) Variables, Expressions, and Integers Lesson 1.1 (pp. 5 9) Expressions and Variables 2.2.1 Evaluate algebraic expressions
More informationRisk Analysis Using Monte Carlo Simulation
Risk Analysis Using Monte Carlo Simulation Here we present a simple hypothetical budgeting problem for a business startup to demonstrate the key elements of Monte Carlo simulation. This table shows the
More informationAMS 5 CHANCE VARIABILITY
AMS 5 CHANCE VARIABILITY The Law of Averages When tossing a fair coin the chances of tails and heads are the same: 50% and 50%. So if the coin is tossed a large number of times, the number of heads and
More informationLecture 2. Marginal Functions, Average Functions, Elasticity, the Marginal Principle, and Constrained Optimization
Lecture 2. Marginal Functions, Average Functions, Elasticity, the Marginal Principle, and Constrained Optimization 2.1. Introduction Suppose that an economic relationship can be described by a realvalued
More information1.2 ERRORS AND UNCERTAINTIES Notes
1.2 ERRORS AND UNCERTAINTIES Notes I. UNCERTAINTY AND ERROR IN MEASUREMENT A. PRECISION AND ACCURACY B. RANDOM AND SYSTEMATIC ERRORS C. REPORTING A SINGLE MEASUREMENT D. REPORTING YOUR BEST ESTIMATE OF
More informationInformation Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture  17 ShannonFanoElias Coding and Introduction to Arithmetic Coding
More informationMultiple Linear Regression in Data Mining
Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple
More informationCanonical Correlation
Chapter 400 Introduction Canonical correlation analysis is the study of the linear relations between two sets of variables. It is the multivariate extension of correlation analysis. Although we will present
More informationFactoring Trinomials: The ac Method
6.7 Factoring Trinomials: The ac Method 6.7 OBJECTIVES 1. Use the ac test to determine whether a trinomial is factorable over the integers 2. Use the results of the ac test to factor a trinomial 3. For
More informationComparing Alternate Designs For A MultiDomain Cluster Sample
Comparing Alternate Designs For A MultiDomain Cluster Sample Pedro J. Saavedra, Mareena McKinley Wright and Joseph P. Riley Mareena McKinley Wright, ORC Macro, 11785 Beltsville Dr., Calverton, MD 20705
More informatione = random error, assumed to be normally distributed with mean 0 and standard deviation σ
1 Linear Regression 1.1 Simple Linear Regression Model The linear regression model is applied if we want to model a numeric response variable and its dependency on at least one numeric factor variable.
More informationChapter 6: The Information Function 129. CHAPTER 7 Test Calibration
Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the
More informationPricing complex options using a simple Monte Carlo Simulation
A subsidiary of Sumitomo Mitsui Banking Corporation Pricing complex options using a simple Monte Carlo Simulation Peter Fink Among the different numerical procedures for valuing options, the Monte Carlo
More informationIndustry Environment and Concepts for Forecasting 1
Table of Contents Industry Environment and Concepts for Forecasting 1 Forecasting Methods Overview...2 Multilevel Forecasting...3 Demand Forecasting...4 Integrating Information...5 Simplifying the Forecast...6
More informationDATA INTERPRETATION AND STATISTICS
PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE
More informationPrentice Hall Algebra 2 2011 Correlated to: Colorado P12 Academic Standards for High School Mathematics, Adopted 12/2009
Content Area: Mathematics Grade Level Expectations: High School Standard: Number Sense, Properties, and Operations Understand the structure and properties of our number system. At their most basic level
More informationForecasting in supply chains
1 Forecasting in supply chains Role of demand forecasting Effective transportation system or supply chain design is predicated on the availability of accurate inputs to the modeling process. One of the
More informationNorthumberland Knowledge
Northumberland Knowledge Know Guide How to Analyse Data  November 2012  This page has been left blank 2 About this guide The Know Guides are a suite of documents that provide useful information about
More informationYear 9 set 1 Mathematics notes, to accompany the 9H book.
Part 1: Year 9 set 1 Mathematics notes, to accompany the 9H book. equations 1. (p.1), 1.6 (p. 44), 4.6 (p.196) sequences 3. (p.115) Pupils use the Elmwood Press Essential Maths book by David Raymer (9H
More informationMuse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0
Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without
More informationIn recent years, Federal Reserve (Fed) policymakers have come to rely
LongTerm Interest Rates and Inflation: A Fisherian Approach Peter N. Ireland In recent years, Federal Reserve (Fed) policymakers have come to rely on longterm bond yields to measure the public s longterm
More informationGamma Distribution Fitting
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
More informationMath Review. for the Quantitative Reasoning Measure of the GRE revised General Test
Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important
More informationMeasuring Line Edge Roughness: Fluctuations in Uncertainty
Tutor6.doc: Version 5/6/08 T h e L i t h o g r a p h y E x p e r t (August 008) Measuring Line Edge Roughness: Fluctuations in Uncertainty Line edge roughness () is the deviation of a feature edge (as
More informationReport on application of Probability in Risk Analysis in Oil and Gas Industry
Report on application of Probability in Risk Analysis in Oil and Gas Industry Abstract Risk Analysis in Oil and Gas Industry Global demand for energy is rising around the world. Meanwhile, managing oil
More information4. Introduction to Statistics
Statistics for Engineers 41 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation
More information! Solve problem to optimality. ! Solve problem in polytime. ! Solve arbitrary instances of the problem. !approximation algorithm.
Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NPhard problem What should I do? A Theory says you're unlikely to find a polytime algorithm Must sacrifice one of
More informationWe have discussed the notion of probabilistic dependence above and indicated that dependence is
1 CHAPTER 7 Online Supplement Covariance and Correlation for Measuring Dependence We have discussed the notion of probabilistic dependence above and indicated that dependence is defined in terms of conditional
More informationInferential Statistics
Inferential Statistics Sampling and the normal distribution Zscores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are
More informationOutline. Correlation & Regression, III. Review. Relationship between r and regression
Outline Correlation & Regression, III 9.07 4/6/004 Relationship between correlation and regression, along with notes on the correlation coefficient Effect size, and the meaning of r Other kinds of correlation
More informationSpatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
More information