How to measure association I: Correlation

1. Measuring association using correlation and regression

We often would like to know how one variable, such as a mother's weight, is related to another variable, such as a baby's birthweight. We might be interested in the relationship between a patient's blood pressure and the amount of drug the patient takes per day. Suppose that we have data on blood pressure and drug dose such as in Table <BP-drug dose>.

Table <BP-drug dose>.

Drug dose (mg per day)    Blood pressure
 5                        151
 5                        145
 5                        136
10                        137
10                        124
10                        124
15                        111
15                        105
20                        110
20                         98

[Figure: Drug dose versus blood pressure, R = -0.92. x-axis: Drug dose (mg per day); y-axis: Blood pressure.]

Two questions we may ask are:

1. If I know how much drug the patient was given, how well can I predict their blood pressure? Put another way, we can ask how much of the variability in blood pressure can be explained by differences in the amount of drug the patient takes. To answer this question, we use correlation, which we discuss in this chapter.

2. For a unit change in the amount of drug given, how much change in blood pressure do we expect? To answer this question, we use regression, which we discuss in the next chapter.

As another example of where we use correlation or regression, suppose that we are interested in babies who are born with low birthweights, and want to examine factors that affect birthweight. We might have data on mother's weight and baby's birthweight as in Table <Birthweights>.

Table <Birthweights>.

Mother's weight    Child's birthweight
100                 5.10
115                 4.50
120                 6.00
125                 6.80
140                 7.50
150                 8.10
155                 7.50
160                 8.10
180                 9.00
200                11.00

[Figure: Child's weight vs. mother's weight, R = 0.96. x-axis: Mother's weight; y-axis: Child's weight.]

Again, two questions we may ask are:

1. If I know the mother's weight, how well can I predict the baby's weight? Put another way, we can ask how much of the variability in baby's weight can be explained by differences in the mother's weight. To answer this question, we use correlation.

2. For a unit change in the mother's weight (a one pound increase), how much change in baby's weight do we expect? To answer this question, we use regression.

You may notice that all the variables we are considering (blood pressure, weight, dose) are measured on a continuous scale, and these are suitable for correlation and regression. If we want to measure association between categorical variables (such as male/female, Republican/Democrat, pass/fail, yes/no, and so on), we use statistics such as the chi-square test, which we'll look at in a later chapter.

We are going to focus mainly on the most widely used correlation measure, which is R, the Pearson linear correlation coefficient. Later on, we'll look at another correlation measure, the Spearman rank correlation coefficient, which is sometimes better to use than the Pearson measure.

2. Correlation can be positive, zero, or negative (ranging from 1.0 to -1.0)

Correlation can be positive, as in the birthweight example, or negative, as in the drug/blood pressure example. By definition, using the formula we'll see in the next section, the maximum (positive) correlation is 1.0. In the birthweight example, correlation was nearly perfect at R = 0.96. The minimum possible (negative) correlation is -1.0. In the drug versus blood pressure example, correlation was strongly negative with R = -0.92. Correlation can also be near zero, as shown in Table <Scrambled birthweights>, where we have scrambled the children's birthweights, and see R = 0.03.

Table <Scrambled birthweights>.

Mother's weight    Child's weight
100                 8.10
115                 7.50
120                 6.00
125                 6.80
140                 7.50
150                 8.10
155                11.00
160                 4.50
180                 5.10
200                 9.00

[Figure: Child's weight (scrambled) vs. mother's weight, R = 0.03. x-axis: Mother's weight; y-axis: Child's weight.]

3. How to calculate the Pearson linear correlation coefficient

We'll first define the Pearson linear correlation coefficient, and then look at how to interpret it. Recall the formula for variance from the chapter on descriptive statistics. Variance describes variability around the mean value.

$$\mathrm{Variance}(x) = \frac{\sum (x - \bar{x})^2}{N}$$

Covariance has a formula similar to that for the variance.

$$\mathrm{Covariance}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{N}$$

Correlation uses the covariance of two variables. The correlation of two variables, x and y, is equal to the covariance of x and y divided by a number that keeps the correlation between -1.0 and 1.0.

$$\mathrm{Correlation}(x, y) = R = \frac{\mathrm{Covariance}(x, y)}{\sqrt{\mathrm{Var}(x) \cdot \mathrm{Var}(y)}}$$

The term in the denominator, the square root of Var(x) * Var(y), just forces the correlation coefficient to be between -1.0 and 1.0; it doesn't affect how we interpret the correlation coefficient, so we won't look at it any further.
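To make the three formulas concrete, here is a minimal Python sketch that applies them to the birthweight data from Table <Birthweights>. The function names (`mean`, `variance`, `covariance`, `pearson_r`) are our own choices for illustration, and we assume the divide-by-N form of variance and covariance used in the formulas above.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def variance(values):
    # Average squared deviation from the mean (divides by N, as in the text).
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def covariance(xs, ys):
    # Average product of the deviations of x and y from their means.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson_r(xs, ys):
    # Covariance scaled by the square root of Var(x) * Var(y).
    return covariance(xs, ys) / sqrt(variance(xs) * variance(ys))

# Data from Table <Birthweights>.
mother_weight = [100, 115, 120, 125, 140, 150, 155, 160, 180, 200]
birthweight = [5.10, 4.50, 6.00, 6.80, 7.50, 8.10, 7.50, 8.10, 9.00, 11.00]

print(round(covariance(mother_weight, birthweight), 2))  # 50.58
print(round(pearson_r(mother_weight, birthweight), 2))   # 0.96
```

The covariance of 50.58 depends on the units of the two variables, while the correlation of 0.96 does not, which is why we divide by the denominator.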

4. How to interpret the correlation coefficient

Let's look at what the correlation coefficient tells us. We'll start with just four points, one from each quadrant, as shown in Table <Points in 4 quadrants>. Quadrant 1 is labeled here as (1, 1), quadrant 2 is labeled (-1, 1), quadrant 3 is labeled (-1, -1), and quadrant 4 is labeled (1, -1).

For any data set, we can force the mean to be at (0, 0) by subtracting the mean of all the x values from the x value for each point and the mean of all the y values from the y value for each point. For these "mean corrected" values, the mean is now at (0, 0), and every point must fall into one of the four quadrants relative to the mean.

Table <Points in 4 quadrants>.

x value    y value
 1          1
-1          1
 1         -1
-1         -1

[Figure <Points in 4 quadrants>: the four points (1, 1), (-1, 1), (-1, -1), and (1, -1) plotted around the origin.]

Now, let's look again at the formula for covariance.

$$\mathrm{Covariance}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{N}$$

We've specified that we subtract the means, so the new mean value of x is 0 and the new mean value of y is 0, and the formula for covariance then simplifies as follows.

$$\mathrm{Covariance}(x, y) = \frac{\sum (x)(y)}{N}$$

Consider a point in quadrant 1 in Figure <Points in 4 quadrants>, such as the point (1, 1). In the formula for covariance, we put the point (1, 1) into the term (x)(y), and we get 1 * 1 = 1, which is a positive number. For the term (x)(y), every point in quadrant 1 will give a positive value, because we are multiplying two positive numbers.

Next, consider a point in quadrant 3, such as (-1, -1). In the formula for covariance, we put the point (-1, -1) into the term (x)(y), which gives us -1 * -1 = 1, which is again a positive number. For the term (x)(y), every point in quadrant 3 will give a positive value, because we are multiplying two negative numbers.

Points in quadrants 2 and 4 will give us negative values for the term (x)(y). In quadrant 2, we see that -1 * 1 = -1, and in quadrant 4, we see that 1 * -1 = -1.

If all the points in our data set fall into quadrant 1 or quadrant 3 with respect to the mean, then every point will contribute a positive value to the covariance, which will in turn give us a large positive correlation. In contrast, if all the points in our data set fall into quadrant 2 or quadrant 4 with respect to the mean, then every point will contribute a negative value to the covariance, which will in turn give us a large negative correlation. If points are scattered across all four quadrants, we will get a mixture of positive and negative terms that tend to cancel each other out, giving a correlation near zero.
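The quadrant argument can be checked directly by counting the signs of the mean-corrected products. The following is a small Python sketch under our own naming (`centered_products` and `count_signs` are hypothetical helpers, not part of any library), using the original and scrambled birthweight tables from earlier in the chapter.

```python
def centered_products(xs, ys):
    # Subtract the means, then multiply the corrected x and y for each point.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return [(x - mx) * (y - my) for x, y in zip(xs, ys)]

def count_signs(products):
    # Positive products come from quadrants 1 and 3, negative from 2 and 4.
    positive = sum(1 for p in products if p > 0)
    negative = sum(1 for p in products if p < 0)
    return positive, negative

mother_weight = [100, 115, 120, 125, 140, 150, 155, 160, 180, 200]
birthweight = [5.10, 4.50, 6.00, 6.80, 7.50, 8.10, 7.50, 8.10, 9.00, 11.00]
scrambled = [8.10, 7.50, 6.00, 6.80, 7.50, 8.10, 11.00, 4.50, 5.10, 9.00]

# The four corner points from Table <Points in 4 quadrants>.
print(centered_products([1, -1, 1, -1], [1, 1, -1, -1]))  # [1.0, -1.0, -1.0, 1.0]

print(count_signs(centered_products(mother_weight, birthweight)))  # (9, 1)
print(count_signs(centered_products(mother_weight, scrambled)))    # (5, 5)
```

For the original data, nine of the ten points add a positive term, which goes with R = 0.96. For the scrambled data the signs split evenly and the terms largely cancel, which goes with R = 0.03.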

5. Potential problems with Pearson linear correlation

The Pearson linear correlation coefficient can be greatly affected by a single observation. In particular, a single point (an outlier) that falls a long way from other points in the x-y plane can greatly increase or decrease the Pearson R. For example, let's look again at the data on drug dose versus blood pressure, but suppose that the last patient, instead of having a blood pressure measurement of 98, has a value of 150, as in Table <Outlier in BP-drug dose>. For these data, the Pearson correlation coefficient is R = -0.47, which is a large change from the R = -0.92 we had before changing this single point. When we see an individual point that is so influential in determining the value of our statistic, we should consider the possibility that there was an error in the measurement, and make sure that we are not being misled.

Table <Outlier in BP-drug dose>.

Drug dose (mg per day)    Blood pressure
 5                        151
 5                        145
 5                        136
10                        137
10                        124
10                        124
15                        111
15                        105
20                        110
20                        150

[Figure: An outlier in blood pressure measurement at (20, 150). x-axis: Drug dose; y-axis: Blood pressure.]
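We can reproduce this change with a short Python sketch. The `pearson_r` helper below is our own illustration of the formula from section 3 (the N in the covariance and variances cancels, so it works with sums directly); it is not a library function.

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson R from sums of mean-corrected products and squares.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

dose = [5, 5, 5, 10, 10, 10, 15, 15, 20, 20]
bp_original = [151, 145, 136, 137, 124, 124, 111, 105, 110, 98]

# Replace only the last measurement (98) with the outlying value 150.
bp_outlier = bp_original[:-1] + [150]

print(round(pearson_r(dose, bp_original), 2))  # -0.92
print(round(pearson_r(dose, bp_outlier), 2))   # -0.47
```

Changing one value out of ten roughly halves the strength of the correlation.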

A single outlier can also make a weak correlation appear much stronger. For the data in <Table no-outlier>, the correlation coefficient is quite small, R = -0.05.

<Table no-outlier>

x value    y value
1          4
1          1
2          3
2          3
3          1
3          4
4          3
4          2

[Figure: No outlier, R = -0.05. x-axis: X; y-axis: Y.]

However, if we add a single observation at (10, 10), as shown in <Table Single-outlier> and <Figure Single-outlier>, we change the correlation coefficient from R = -0.05 to R = 0.81.

<Table Single-outlier>

x value    y value
 1          4
 1          2
 2          3
 2          3
 3          1
 3          4
 4          3
 4          2
10         10

[<Figure Single-outlier>: Single outlier at (10, 10), R = 0.8. x-axis: X; y-axis: Y.]

So we see that the Pearson linear correlation coefficient may be very sensitive to a single point. In such situations, we may choose to use an alternative association measure, the Spearman rank correlation coefficient, which we'll look at shortly.

The Pearson linear correlation coefficient is not good at detecting genuine but non-linear associations between variables. Suppose that we have values such as those in <Table non-linear relation> and <Figure non-linear relation>. Although there is clearly a relationship between x and y, the correlation coefficient is R = 0.0. This example shows that it is always a good idea to graph your data, and not to rely completely on a statistic.

<Table non-linear relation>

x value    y value
-5         25
-4         16
-3          9
-2          4
-1          1
 0          0
 1          1
 2          4
 3          9
 4         16
 5         25

[<Figure non-linear relation>: Non-linear association with R = 0. x-axis: X; y-axis: Y.]
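A quick Python sketch confirms the result for these data, where every y value is exactly the square of its x value. The `pearson_r` helper is our own illustration of the formula from section 3, not a library function.

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson R from sums of mean-corrected products and squares.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = list(range(-5, 6))      # -5, -4, ..., 5
ys = [x * x for x in xs]     # y is completely determined by x

print(pearson_r(xs, ys))     # 0.0
```

The points to the left of the mean contribute negative products and the points to the right contribute matching positive products, so the terms cancel exactly even though y can be predicted perfectly from x.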

6. Spearman rank correlation: an alternative to Pearson correlation

We saw that the Pearson correlation coefficient may be greatly affected by single influential points (outliers). Sometimes we would like to have a measure of association that is not so sensitive to single points, and at those times we can use Spearman rank correlation.

Recall that, when we calculate the mean of a set of numbers, a single extreme value can greatly increase the mean. But when we calculate the median, which is based on ranks, extreme values have very little influence. The same idea applies to Pearson and Spearman correlation. Pearson uses the actual values of the observations, while Spearman uses only the ranks of the observations, and thus, like the median, is not much affected by outliers.

Most statistics packages will calculate either Pearson or Spearman, but Excel will only do Pearson. The easiest way to get Spearman is to replace each observation by its rank, and then calculate the Pearson coefficient using the ranks.

For the outlier examples, recall that the Pearson correlation is R = -0.05 excluding the outlier and R = 0.81 including the outlier. For these data, the Spearman rank correlation is R_s = -0.10 excluding the outlier and R_s = 0.24 including the outlier.

Let's do the calculations. Here are the data excluding the single outlier. I've assigned the rank to each value, with ties given the average rank.

x value    x rank    y value    y rank
1          1.5       4          7.5
1          1.5       1          1.5
2          3.5       3          5
2          3.5       3          5
3          5.5       1          1.5
3          5.5       4          7.5
4          7.5       3          5
4          7.5       2          3

We can extract the ranks, and calculate the Pearson coefficient for the ranks, getting R_s = -0.10 excluding the outlier.

x rank    y rank
1.5       7.5
1.5       1.5
3.5       5
3.5       5
5.5       1.5
5.5       7.5
7.5       5
7.5       3

Here are the data with the single outlier included. Again, I've assigned the rank to each value, with ties given the average rank.

x value    x rank    y value    y rank
 1         1.5        4         7.5
 1         1.5        1         1.5
 2         3.5        3         5
 2         3.5        3         5
 3         5.5        1         1.5
 3         5.5        4         7.5
 4         7.5        3         5
 4         7.5        2         3
10         9         10         9

We can extract the ranks, and calculate the Pearson coefficient for the ranks, getting R_s = 0.24 with the outlier included.

x rank    y rank
1.5       7.5
1.5       1.5
3.5       5
3.5       5
5.5       1.5
5.5       7.5
7.5       5
7.5       3
9         9

The Spearman coefficient is much less affected by the single influential point than is the Pearson correlation coefficient.
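The rank-then-Pearson recipe can be sketched in a few lines of Python. The helper names (`average_ranks`, `pearson_r`, `spearman_r`) are our own, and the data are the x and y values from the rank tables above; this is one simple way to do the calculation, not the only one.

```python
from math import sqrt

def average_ranks(values):
    # Rank from 1 (smallest); tied values share the average of their ranks.
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # rank of the first tied value
        count = ordered.count(v)       # how many values are tied
        ranks.append(first + (count - 1) / 2)
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman_r(xs, ys):
    # Spearman is the Pearson coefficient calculated on the ranks.
    return pearson_r(average_ranks(xs), average_ranks(ys))

x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [4, 1, 3, 3, 1, 4, 3, 2]

print(average_ranks(y))                            # [7.5, 1.5, 5.0, 5.0, 1.5, 7.5, 5.0, 3.0]
print(round(spearman_r(x, y), 2))                  # -0.1
print(round(spearman_r(x + [10], y + [10]), 2))    # 0.24
```

Because the outlier at (10, 10) is replaced by the rank pair (9, 9), it sits only one step beyond the next largest rank, which is why it moves R_s far less than it moves the Pearson R.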