Using Statistical Data to Make Decisions
Module 3: Correlation and Covariance
Tom Ilvento, University of Delaware
Dr. Mugdim Pašić, Sarajevo Graduate School of Business

Often our interest in data analysis is how two or more variables influence each other. We may be searching for a driver that helps explain sales, profits, or revenues; we may be interested in factors that better explain the performance of employees; or in which marketing method has the most impact on sales. A basic starting point for understanding a relationship between two variables is covariance, or the more common and standardized measure, correlation. Covariance and correlation are both measures of association that show the linear relationship between two variables. Each provides a single summary measure of association that is easily interpreted, and each provides a building block for more advanced techniques, such as regression. You will see that correlation and covariance are closely related concepts and are connected mathematically. However, of the two terms, correlation is used more often in everyday language. When we say two things are correlated, we mean that the two things are related to each other. The correlation can be strong or weak, but we understand it as a relationship. In statistics, correlation has the same meaning, but it is expressed in mathematical terms with a specific interpretation, direction (positive or negative), and strength. In particular, the correlation coefficient provides a good starting point for more advanced data analysis. Along with scatter plots, the correlation coefficient provides insight into bivariate, or two-variable, relationships. It is a flexible measure of association which can be used with continuous variables, ordinal variables, and dummy variables. I think you will find the correlation coefficient an intuitive and useful tool for summarizing a relationship between two variables.
It also has a direct connection with bivariate regression.

Key Objectives:
- Understand the properties of measures of association
- Understand covariance and correlation as bivariate measures of association
- Understand how to interpret the correlation coefficient and how to read and interpret a correlation matrix
- Understand how to use correlations as an intermediate step in data analysis

In this Module We Will:
- Describe measures of association
- Look at covariance and correlation matrices, along with corresponding scatter plots
- Begin the linkage of correlation with regression

For more information, contact: Tom Ilvento, 213 Townsend Hall, Newark, DE 19717, 302-831-6773, ilvento@udel.edu
MEASURES OF ASSOCIATION

Measures of association show the relationship between two variables. A measure of association is numerical and in most cases a single measure (although it can be several numbers). Most often, measures of association focus on how two variables vary together (or not). There are many measures of association in statistics, developed for their usefulness with different types of data and different situations. Some of them have inferential properties and some are useful solely for their ability to help describe a relationship. Example measures of association include the correlation coefficient, an odds ratio, R² in regression, and the regression coefficient.

A good starting point for a discussion of measures of association is to understand some criteria of any measure of association. These criteria are used to evaluate and compare various measures of association, and as such they help us interpret the measure. The criteria focus on the range of the measure, whether it is bounded by an upper or lower level, whether it is symmetrical, and how to interpret the measure. Each is discussed briefly below.

Criteria for Measures of Association:
- What is the range?
- Is it bounded?
- Is it symmetrical?
- How do we interpret it?

What is the range (from high to low)? We want to know the possible range of a measure of association in order to gain some sense of what is a high or low value. We might ask whether it can take on negative values or only positive ones; whether it is centered around a natural midpoint; and whether the upper and lower values are the same whenever it is calculated.

Is it bounded? Similar to the last point, we want to know if there is a natural upper or lower bound to our measure of association. Some measures of association (such as an odds ratio) have a lower bound but no upper bound.
As a result, an odds ratio can be very large. Other measures of association do have natural upper and lower bounds, which makes it easier to interpret whether there is a strong or weak relationship. In some cases, statisticians have been able to reformulate a measure of association to create an upper and lower bound.

Is it symmetrical? If a measure of association is symmetrical, it means that the relationship between two variables, say X and Y, is the same whether we specify it as X to Y or Y to X. This implies that we do not have to designate one variable as preliminary, independent, or as necessarily influencing the other.
How to interpret? Interpretation should be the key criterion for any measure of association: what does it mean for my data? We usually start by trying to understand the extremes. What does it mean to have a perfect relationship (the highest value or the lowest value)? What does it mean if there is no relationship? If you can identify a clear understanding of the extremes, you can begin to gain a sense of what an intermediate value means.

The next section will begin to discuss covariance and then correlation. We will return to these criteria as a way to interpret and compare these two measures of association.

COVARIANCE

We have already used the concept of how a single variable varies about its mean as a measure of the spread of the data. We identified the variance as the total sum of squared deviations about the mean (the Total Sum of Squares) divided by n - 1 (the degrees of freedom). We will use a similar concept to talk about how two variables vary about their means together.

The formula for covariance is given below. If you focus on the numerator, it shows that we are looking at how two variables vary about their means together.

    Cov_XY = [ Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ) ] / n

Another way to express the formula uses SS_XY, the sum of squares cross-product:

    Cov_XY = SS_XY / n

Let me use an illustration to show how covariance works, and then we will use a data example. The following graph (Figure 1) represents a scatter plot between X (on the horizontal axis) and Y (on the vertical axis). I have marked the Y-mean and the X-mean values on the graph with lines which divide the graph into four quadrants. A data point that is above the mean for both X and Y will fall in the first quadrant, and a data point that is below the mean for both Y and X will fall in the third quadrant.
If a scatter plot tends to have values that fall mainly in the first and third quadrants, the covariance between the two variables will be positive: values of X tend to vary about their mean in the same way that values of Y vary about their mean. Likewise, if values tend to fall in the second and fourth quadrants, it means that deviations of X values about the X-mean tend to be in a different direction than deviations of Y values about the Y-mean. This is associated with negative covariance.
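The quadrant logic can be sketched in a few lines of Python. This is a hypothetical illustration with made-up numbers, not the module's data: each cross-product term is positive for a point in quadrant I or III (both deviations share a sign) and negative for a point in quadrant II or IV.

```python
def covariance(x, y):
    """Population covariance: sum of cross-products of deviations, divided by n."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Each term (xi - x_mean) * (yi - y_mean) is positive when the point lies
    # in quadrant I or III, and negative when it lies in quadrant II or IV.
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(covariance(x, y))  # positive: points fall mostly in quadrants I and III
```

If the y values were reversed so that large X paired with small Y, the cross-products would turn negative and so would the covariance.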
Figure 1. Graphic Depiction of Covariance Between Two Variables, X and Y (quadrants I-IV formed by the X-mean and Y-mean lines)

Let's look at a data example. The following is some data about mid-level managers in a company. The variables are RATING, a rating scale of the managers from 0 to 10; SALARY, the salary of the manager (in $1,000s); YEARS, years of service at the company; and ORIGIN, a dummy variable indicating whether they were promoted from inside the company (coded as 0) or were recruited from outside the company (coded as 1). The descriptive statistics for these variables are given below.

                     RATING    SALARY     YEARS    ORIGIN
Mean                   5.90     71.63      8.14      0.59
Standard Error         0.12      0.87      0.32      0.04
Median                 5.80     71.00      7.00      1.00
Mode                   5.00     76.00      5.00      1.00
Standard Deviation     1.49     10.70      3.94      0.49
Sample Variance        2.21    114.57     15.49      0.24
Kurtosis              -0.30     -0.17      0.26     -1.90
Skewness               0.08      0.37      0.83     -0.36
Range                  7.30     55.00     21.00      1.00
Minimum                1.80     48.00      0.00      0.00
Maximum                9.10    103.00     21.00      1.00
Sum                  885.10  10745.00   1221.00     88.00
Count                   150       150       150       150

Table 1. Descriptive Statistics in the Manager Salary Example
Using Statistical Data to Make Decisions: Correlation and Covariance Page 5 The mean salary level is 71.63, or $71,630. The mean for ORIGIN is.59, indicating that 59 percent of the managers were recruited from outside the company. The mean and the median levels for all the variables are very close to each other, indicating no great skew in any of the variables. The coefficients of variation (data not shown) indicate that the most variability is with the variable YEARS (CV = 48%). The covariance matrix is given in Table 2. A covariance matrix shows the covariance of each variable with the other variables and itself. It is a symmetric matrix (the of covariance of X with Y is the same as the covariance of Y with X). As a result, you generally only see half the matrix presented as output (the rest is redundant). The values on the diagonal are the covariance of each variable with itself -in other words, the variances. If you compare these values with the variances in the descriptive statistics tables you will notice a slight difference. For example, the covariance of RATING with itself is 2.193 and the variance is given as 2.21. The slight difference is because the descriptive statistics use the sample formula for the covariance which is divided by n-1. Table 2. Covariance Matrix of Manager Salary Data RATING SALARY YEARS ORIGIN RATING 2.193 SALARY 10.801 113.806 YEARS 0.393-13.509 15.387 ORIGIN -0.174-2.778 1.038 0.242 Limitations of Covariance Covariance is measured in squared cross-products terms The upper bound is not known Hard to interpret and compare The covariance values in Table 2 point out some of the problems with using covariance as a measure of association. The values are is squared cross-product terms and are hard to interpret. There is a sign to the values (either positive or negative), but it is not clear how to interpret something in squared, cross-product terms. 
Covariances are also unbounded, and thus it is difficult to determine whether a value is large or small. As a result, interpretation is difficult. Most of these problems are solved by transforming the covariance into a correlation coefficient. However, the covariance is the building block for regression and many other multivariate analyses. It is important to at least grasp the basic concept of covariance: that it is based on how two variables vary about their means together; that it is similar to the variance and seeks to place the measure of association in the context of the variability of the variables; and that it is a symmetric measure of association.
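The divisor difference mentioned above (the covariance matrix divides by n, the sample variance by n - 1) can be made concrete with a small hedged sketch. The numbers are made up, not the manager data; rescaling the n-divisor result by n/(n - 1) reconciles the two conventions.

```python
def cov(x, y, sample=False):
    """Covariance of x and y; divides by n - 1 if sample=True, else by n."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # sum of squares cross-product
    return ss_xy / (n - 1 if sample else n)

x = [48.0, 55.0, 71.0, 76.0, 103.0]         # made-up salary-style values
pop = cov(x, x)                              # divide by n, as in the covariance matrix
samp = cov(x, x, sample=True)                # divide by n - 1, as in the descriptive statistics
print(pop, samp, pop * len(x) / (len(x) - 1))  # the rescaled value matches samp
```

The covariance of a variable with itself is simply its variance, which is why the diagonal of Table 2 holds the (n-divisor) variances.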
CORRELATION

If we divide the covariance by the product of the two standard deviations, we generate a new measure of association, the correlation coefficient (often designated by r). The correlation coefficient is a standardized version of the covariance. It is bounded between -1 and 1, and zero means there is no linear relationship between the two variables. Correlation coefficients provide an easy way to summarize the relationship between two variables, and that is why they are so often used. You should note that the correlation coefficient requires an equal sample size for both variables, and a missing value on either variable will cause that observation to be removed from the calculation (this is called pair-wise deletion).

The formula for the correlation coefficient (also known as the Pearson Product Moment Correlation Coefficient) is given below.

    r = Cov_XY / (σ_X σ_Y)

The correlation coefficient (r) has many nice properties:
- It is bounded between -1 and 1
- It is a symmetric measure of association
- It is a standardized measure and easy to compare
- It is invariant to scale

r has a range from -1 to 1. A value of -1 means perfect negative correlation, a value of 1 means perfect positive correlation, and a value of 0 means no linear association. Thus, it is bounded between -1 and 1. If you obtain a value greater than 1 or less than -1, something is wrong!

The correlation coefficient is a symmetrical measure of association. The correlation between X and Y is the same as the correlation between Y and X (r_XY = r_YX).

The correlation coefficient is invariant to scale. By this I mean that if you add or subtract a constant to each value in the data set, or you multiply or divide by a constant, it does not change the correlation between the two variables.
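Both the standardization and the scale invariance can be checked with a short sketch. The income and sales figures below are toy numbers chosen for illustration, not data from the module.

```python
import statistics

def pearson_r(x, y):
    """Pearson r: population covariance divided by the two population std devs."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

income = [48, 55, 71, 76, 103]   # made-up values
sales = [5, 6, 9, 8, 12]
r1 = pearson_r(income, sales)
r2 = pearson_r([v / 1000 for v in income], sales)  # rescale income per $1,000
print(round(r1, 4), abs(r1 - r2) < 1e-9)  # rescaling leaves r essentially unchanged
```

Dividing by the standard deviations strips the units from the covariance, which is exactly why r is comparable across data sets while the raw covariance is not.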
For example, if you express income per $1,000, it will not change the correlation of income and sales. As with covariance, the correlation matrix is usually presented as half a matrix because the values are symmetrical. Table 3 contains the correlations for the Manager Salary data.
Table 3. Correlation Coefficients for the Manager Salary Data

          RATING    SALARY     YEARS    ORIGIN
RATING     1.000
SALARY     0.684     1.000
YEARS      0.068    -0.323     1.000
ORIGIN    -0.238    -0.529     0.537     1.000

The values on the diagonal are all 1, indicating each variable is perfectly correlated with itself. The value of .684 shows the correlation between RATING and SALARY. Its interpretation is that managers with higher salaries tend to get higher ratings. The correlation is not perfect, but it is moderately large (we will see a scatter plot of these two variables to get a better sense of what a correlation of .684 looks like).

Any correlation with a dummy variable (one which has only two values, zero and one) has a very simple interpretation. Since the dummy variable takes on only two values, the sign of the correlation coefficient reflects which group has the higher average level of the other variable. For example, the correlation between ORIGIN and SALARY is -.529. This means that managers who were recruited from outside the company (ORIGIN = 1) have, on average, lower salaries.

The correlation coefficient is a useful summary measure of a relationship between two variables. With a single value you can talk about the strength and direction of the relationship. However, we need to be cautious in its use. For one thing, it is a linear measure of association between two variables. A correlation of zero means there is no linear relationship between two variables; it would be represented by a flat line in a graphical representation. However, if the relationship is nonlinear, the correlation coefficient fails to capture the full relationship. Figure 2 shows a graphical depiction of an obvious and perfect nonlinear relationship. Such a relationship would most likely have a correlation near zero.

A correlation between a continuous variable and a dummy variable has the following interpretation.
If the correlation is positive, the category of the dummy variable represented by one tends to have higher average values of the continuous variable. If the correlation is negative, the dummy group represented by one has lower average values.

Figure 2. Graphic of a Non-Linear Relationship
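This dummy-variable interpretation can be verified with a small sketch. The salary numbers below are invented; the 0/1 coding mimics the ORIGIN variable in the text (0 = promoted inside, 1 = recruited outside).

```python
import statistics

def pearson_r(x, y):
    """Pearson r via population covariance and population std devs."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

origin = [0, 0, 0, 1, 1, 1]         # 0 = promoted inside, 1 = recruited outside
salary = [80, 85, 90, 60, 65, 70]   # made-up: outsiders earn less here
r = pearson_r(origin, salary)
inside = statistics.fmean([s for o, s in zip(origin, salary) if o == 0])
outside = statistics.fmean([s for o, s in zip(origin, salary) if o == 1])
print(r < 0, outside < inside)  # a negative r means the group coded 1 has the lower mean
```

Flipping the salaries so that the group coded 1 earns more would flip the sign of r, which is all the correlation with a dummy variable reports.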
A second caution with correlations is that they do not reflect causality; the fact that two things are correlated does not mean one variable causes the other. This is an easy trap to fall into, but as we will see in multiple regression, bivariate relationships can be deceiving. For example, in the summer, there is a correlation between ice cream sales and the number of people who drown in cities and towns across America. This does not mean that eating ice cream causes people to drown; the two things simply tend to happen more in the summer, and the season is a third variable that is related to both of the others. Correlation does not imply causality, so be careful not to imply a causal relationship when using correlation coefficients.

GRAPHICAL EXAMPLES OF CORRELATIONS

A value of 1 or -1, or a value of zero, is relatively easy to interpret. A value of 1 or -1 reflects a perfect linear relationship between two variables. A value of zero reflects no linear relationship. If we drew a line on a scatter plot for a correlation of zero, it would be a flat line: any change in the value of X does not influence the value of Y. However, intermediate values of correlations are not as easy to interpret. Often what is large or small depends upon the data you are using and the discipline you are involved with. When the units of analysis are people, correlations of .5 to .6 are relatively large. However, when looking at data over time, correlations tend to be much higher, .90 to .99.

Figure 3. Scatter Plot of Salary Versus Rating

Scatter plots are a useful way to look at the relationship between two variables. Figure 3 shows the scatter plot of the relationship between SALARY (Y-axis) and RATING (X-axis). Earlier we noted that the correlation between these two variables was .684.
From the graph we can see that the relationship is linear, but not perfect. If we fit a line to the data, not all the points would fall on the line.
Excel will allow you to fit a best-fitting line to the scatter plot. This line is a regression line.

Figure 4. Scatter Plot of Salary Versus Rating with Trendline, Equation (y = 4.9246x + 42.575), and R² (0.4674)

In fact, Excel will allow us to fit a best-fitting line, which is generated from a regression of SALARY on RATING. Using options within the Chart feature in Excel we can add a trendline, include the equation of the line on the chart, and include a measure of association called R². Figure 4 shows the same graph with these options. The options can be accessed by selecting the graph in Excel, clicking on Chart in the menu bar, and then clicking on Add Trendline. Once in Trendline you should click on Linear, and then you can access the options to include the equation and R².

The best-fitting line in Figure 4 is actually a regression line. From the graph we can see that the line fits the data fairly well. The equation for the line follows the classic formula for a line with an intercept term (a) and a slope coefficient (b): Y = a + b(X). Our line is not a perfect deterministic function (there is scatter around the line), so I am expressing it as an estimate.

    Estimated Y = 42.575 + 4.9246(X)

The R² given on the graph is a measure of association from regression. More will be said about this in the next module on regression. For now we can say that R² shows how much of the variability in the dependent variable (in this case SALARY) is explained by knowing something about the independent variable. It ranges from zero to one. In this case, an R² of .4674 means that 46.7 percent of the variability in SALARY is explained by knowing the RATING of the employee. You should also note that if we squared the correlation coefficient it would equal R² (r² = R² for a bivariate regression). Try it and see.
Thus, another interpretation of the squared correlation coefficient is how much of the variability in one variable is explained by knowing something about the other variable.
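Here is a minimal sketch of that "try it and see" check, using toy rating/salary-style numbers rather than the actual manager data set: for an ordinary least squares fit of y on x, the squared Pearson correlation equals R².

```python
import statistics

def pearson_r(x, y):
    """Pearson r from sums of squares and the cross-product."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return ss_xy / (ss_x * ss_y) ** 0.5

def r_squared(x, y):
    """R² from an ordinary least squares fit of y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

rating = [2, 4, 5, 6, 8, 9]       # made-up values
salary = [50, 60, 68, 70, 82, 88]
r = pearson_r(rating, salary)
print(abs(r ** 2 - r_squared(rating, salary)) < 1e-9)  # True: r² equals R²
```

This identity holds only for bivariate regression; with more than one predictor, R² is no longer the square of any single correlation.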
Scatter plots are a good way to see the correlation between two variables.

Figure 5. Average State Verbal SAT Scores Versus Math Scores, 2001 (trendline y = 0.9555x + 22.603, R² = 0.9411)

Let's look at a few other graphic depictions of correlations to better see what a high or low correlation looks like. In Figure 5 we have a scatter plot of average state verbal versus math SAT scores. The correlation is very high, .970. You can see that the pattern is linear and there is very little scatter of the data points around the best-fitting line. The positive correlation tells us that states with higher average verbal scores also tend to have higher average math scores, as might be expected. Notice also that R² for this line is very high, .941: 94 percent of the variability in verbal scores is explained by knowing the math scores.

A scatter plot can show the strength and direction of the relationship, as well as whether the relationship is in fact linear. Figure 6 shows a strong negative correlation between the average state SAT scores (verbal plus math) and the percent of the high school class that took the SAT test. The correlation between these two variables is -.875. The scatter plot shows the downward slope of the relationship and that the fit of the line is good, but not perfect.

Figure 6. Average SAT Scores (Math + Verbal) Versus Percent of High School Class Taking SAT, 2001 (trendline y = -2.133x + 1145.3, R² = 0.7657)
Figure 7. Scatter Plot of a Low Correlation Between Salary and Years of Service (trendline y = -0.8779x + 78.78, R² = 0.1042)

Finally, the last graph shows a weak correlation between two variables (Figure 7). The correlation between the managers' salary and years of service is -.323. The more years of service, the lower the salary, but the relationship is weak. Figure 7 shows far more scatter around the best-fitting line than the earlier graphs; we can see the relationship, but there is considerably more scatter in the data than in the other examples.

CONCLUSIONS

Measures of association are useful summary statistics to describe a relationship between two or more variables. In this module we looked at covariance and correlation as two measures of linear association between two variables. Both of these measures are related to each other and to regression. The correlation coefficient is a standardized version of the covariance, so it has a known range: it is bounded between -1 and 1, with zero indicating no linear relationship. In a single number, the correlation coefficient provides an indication of the strength and direction of the relationship. A useful next step in data analysis is to examine bivariate relationships with correlation coefficients and to graph these relationships.

We also noted that caution should be taken with correlation coefficients in two main areas. First, the correlation coefficient is a linear measure of association. We cannot assume that a low value of a correlation means that there is no association, only that there is no linear association. Second, be careful not to imply causation when dealing with correlation coefficients. While we can establish that two variables are related to each other, care should be taken not to say that one variable causes the other.