1 Correlation and Regression Analysis

In this section we investigate the relationship between two continuous variables, such as height and weight, the concentration of an injected drug and heart rate, or the consumption level of some nutrient and weight gain. The tools used to explore this relationship are regression and correlation analysis. These tools can be used to find out whether the outcome of one variable depends on the value of the other variable, which would mean a dependency of one variable on the other. Regression and correlation analysis can be used to describe the nature and strength of the relationship between two continuous variables.

1.1 Scatterplot

The first step in the investigation of the relationship between two continuous variables is a scatterplot. Create a scatterplot for the two variables and evaluate the quality of the relationship.

Example: Does the number of years invested in schooling pay off in the job market? Apparently so: the better educated you are, the more money you will earn. The data in the following table give the median annual income of full-time workers age 25 or older by the number of years of schooling completed.

    x = years of schooling    y = salary (dollars)
     8                        18,000
    10                        20,500
    12                        25,000
    14                        28,100
    16                        34,500
    19                        39,700

Start off by creating a scatterplot for x and y.
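As a rough illustration of this first step, the table can be drawn as a character-grid scatterplot using only the Python standard library. This is a minimal sketch, not the plot from the notes: the function name `text_scatter` and the grid size are assumptions made for the example, and in practice a plotting library would normally be used instead.

```python
# Data from the table above: years of schooling (x) and median salary in dollars (y).
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]


def text_scatter(xs, ys, width=40, height=12):
    """Return a rough character-grid scatterplot as a list of strings.

    The top row corresponds to the largest y value. Assumes xs and ys
    each contain at least two distinct values (hypothetical helper).
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each point to a column and row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"
    return ["".join(line) for line in grid]


rows = text_scatter(years, salary)
print("\n".join(rows))
```

The points rise from the lower left to the upper right, which is the pattern to look for before computing any summary number.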
The scatterplot shows a strong, positive, linear association between years and salary.

Questions to be answered with the help of the scatterplot:

1. Does a relationship exist that can be described by a straight line (that is, is there a linear relationship)?
2. Is there a relationship that is not linear?
3. If the scatterplot of the variables looks like a cloud, there is no relationship between the two variables and one would stop at this point.

1.2 Correlation

If the scatterplot shows a reasonable linear relationship (straight line), calculate Pearson's correlation coefficient to evaluate the strength of the linear relationship.

Notation: Let \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) denote a sample of \(n\) \((x, y)\) pairs.

Definition: Given the following sums of squares

\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}, \qquad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}, \]

Pearson's correlation coefficient can be calculated as

\[ r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}. \]

Pearson's correlation coefficient (named after Karl Pearson, 1857-1936) is a number between -1 and 1 that measures the strength of a linear relationship between two continuous variables. The absolute value of the coefficient measures how closely the variables are related. The closer it is to 1, the closer the relationship. A correlation coefficient with absolute value over 0.8 indicates a strong correlation between the variables.

[Figure: Data patterns and Pearson's correlation coefficient]
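The definition of r in terms of the three sums of squares can be sketched as a small Python function. The function name `pearson_r` and the three toy data patterns are assumptions made up for illustration; they are not data from these notes.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy) (hypothetical helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up illustrative patterns (not data from the text).
xs = [-2, -1, 0, 1, 2]
rising = [2 * x + 1 for x in xs]    # points exactly on an increasing line
falling = [5 - 3 * x for x in xs]   # points exactly on a decreasing line
curved = [x * x for x in xs]        # a clear but non-linear pattern

print(pearson_r(xs, rising))    # 1.0
print(pearson_r(xs, falling))   # -1.0
print(pearson_r(xs, curved))    # 0.0
```

The last case shows why the scatterplot comes first: y depends completely on x there, yet r is 0 because the relationship is not linear.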
The sign of the correlation coefficient tells you the trend in the relationship. A positive (negative) coefficient means that one variable increases (decreases) when the other increases.

Example continued: Calculate Pearson's correlation coefficient for years and salary. First find \(\bar{x} = 13.17\), \(s_x = 4.02\) and \(\bar{y} = 27633\), \(s_y = 8290\).

    x_i = years of schooling   y_i = salary (dollars)   x_i y_i   x_i^2   y_i^2
     8                         18,000                   144000      64      324,000,000
    10                         20,500                   205000     100      420,250,000
    12                         25,000                   300000     144      625,000,000
    14                         28,100                   393400     196      789,610,000
    16                         34,500                   552000     256    1,190,250,000
    19                         39,700                   754300     361    1,576,090,000

So that \(\sum x_i = 79\), \(\sum y_i = 165800\), \(\sum x_i y_i = 2348700\), \(\sum x_i^2 = 1121\), \(\sum y_i^2 = 4{,}925{,}200{,}000\). This leads to

\[ S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 2348700 - \frac{(79)(165800)}{6} = 165666.667 \]

\[ S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 1121 - \frac{(79)^2}{6} = 80.8333 \]

\[ S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 4{,}925{,}200{,}000 - \frac{(165800)^2}{6} = 343593333.333 \]

so that

\[ r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} = \frac{165666.667}{\sqrt{80.8333 \cdot 343593333.333}} = 0.994. \]
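The column sums and the resulting coefficient can be checked with a few lines of Python. This is only a sketch that repeats the hand calculation; the variable names are chosen for the example.

```python
import math

# Data from the schooling and salary example.
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]
n = len(years)

# Column sums of the table.
sum_x = sum(years)
sum_y = sum(salary)
sum_xy = sum(x * y for x, y in zip(years, salary))
sum_x2 = sum(x * x for x in years)
sum_y2 = sum(y * y for y in salary)

# Sums of squares and Pearson's correlation coefficient.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
r = s_xy / math.sqrt(s_xx * s_yy)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 79 165800 2348700 1121 4925200000
print(round(r, 3))                           # 0.994
```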
The Pearson correlation coefficient of years of schooling and salary is r = 0.994. A correlation of 0.994 is very high and shows a strong, positive, linear association between years of schooling and salary.

1.3 Linear Regression

In the example we might want to predict the expected salary for different lengths of schooling, or calculate the increase in salary for every year of schooling. For this purpose we can do a regression analysis.

Terms and Definition: If we want to use a variable x to draw conclusions concerning a variable y:

y is called the dependent or response variable.
x is called the independent, predictor, or explanatory variable.

If the relationship between two variables is linear, it can be summarized by a straight line. A straight line can be described by an equation:

\[ y = a + b x \]

a is called the intercept and b the slope of the equation. The slope is the amount by which y increases when x increases by 1 unit.

Fitting a straight line

Given data points \((x_i, y_i)\), a and b shall now be chosen in such a way that the corresponding line has the best fit for the given data. The criterion for best fit used in regression analysis is the sum of the squared differences between the data points and the line itself, that is, the y deviations. For data points \((x_i, y_i)\), \(1 \le i \le n\), this can be written as

\[ \min_{a,b} \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2 \]

In words: minimize the sum by choosing the appropriate parameters a and b. The resulting line is called the least squares line or sample regression line. Once the problem is stated, it can be solved mathematically, and the results are formulas for calculating the best parameters:

\[ b = \frac{S_{xy}}{S_{xx}} \qquad \text{and} \qquad a = \bar{y} - b\,\bar{x}. \]

Write the equation of the least squares line as

\[ \hat{y} = a + b x \]

\(\hat{y}\) gives an estimate of y for a given value of x.
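The formulas for a and b, and the claim that they minimize the sum of squared y deviations, can be sketched in Python. The function names `least_squares_line` and `sum_squared_deviations` and the five data points are assumptions invented for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = S_xy / S_xx and a = y-bar - b * x-bar."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sum_squared_deviations(xs, ys, a, b):
    """The criterion being minimized: the sum of (y_i - (a + b*x_i))^2."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up points lying close to y = 1 + 2x (illustration only).
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

a, b = least_squares_line(xs, ys)
best = sum_squared_deviations(xs, ys, a, b)
print(round(a, 2), round(b, 2))  # 1.06 1.97

# Nudging either parameter away from the fitted values increases the sum.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert sum_squared_deviations(xs, ys, a + da, b + db) > best
```

The loop at the end is only a spot check at four nearby parameter values, not a proof that the formulas give the minimum.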
Example continued: Since salary and years of schooling show such a strong linear relationship and salary can be viewed as depending on years of schooling, do a linear regression analysis with salary as the response variable and years of schooling as the predictor variable. Calculate

\[ b = \frac{S_{xy}}{S_{xx}} = \frac{165666.667}{80.8333} = 2049.48 \]

and

\[ a = \bar{y} - b\,\bar{x} = 27633.33 - 2049.48 \cdot 13.1667 = 648.45 \]

(computed with unrounded intermediate values; only the displayed figures are rounded). Our result is the least squares line

\[ \hat{y} = a + b x = 648.45 + 2049.48\,x \]

The slope equals 2049.48 dollars; that is, for every additional year of schooling the average salary increases by this amount. To estimate the average salary after 18 years of schooling we calculate \(\hat{y}\) with x = 18:

\[ \hat{y} = 648.45 + 2049.48 \cdot 18 \approx 37{,}539 \text{ dollars} \]

Do not use the regression line for values outside the range of the observed values. The model has only been shown to be valid for the given range.

Properties of the regression or least squares line

1. The least squares line always passes through the balance point \((\bar{x}, \bar{y})\) of the data set.
2. The regression line of y on x should not be used to predict x, since it is not the line that minimizes the sum of squared x deviations.

Assessing the fit of a line

Once the least squares line has been obtained, it is natural to examine how effectively the line summarizes the relationship between x and y. The first question to be answered is whether the line is an appropriate way to summarize the relationship. To answer this question, we calculate the coefficient of determination \(r^2\).

Definition: The coefficient of determination for the regression of y on x is

\[ r^2 = \frac{S_{xy}^2}{S_{xx}\, S_{yy}}, \]

the square of Pearson's correlation coefficient. It gives the proportion of variation in y that can be attributed to a linear relationship between x and y. If \(r^2\) is greater than 0.8, the model has a good fit and can be used to calculate reliable predictions of the dependent variable from the independent variable.

In the example, the variable years of schooling explains \(r^2 = 98.8\%\) of the variation in the variable salary, which is very high. The plot showed that the data points lie almost on a straight line.

Use the least squares line to predict the annual salary of a person with 13 years of schooling:

\[ \hat{y}(13) = a + b \cdot 13 = 648.45 + 2049.48 \cdot 13 \approx 27{,}292 \text{ dollars} \]

This is just an estimate; from the other parts of the class, we know that a confidence interval can be found that gives more information.
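The whole regression example can be reproduced with a short Python sketch. The helper name `predict` and its range check are assumptions added for illustration. Because the code carries full precision throughout, its results can differ slightly from figures obtained by hand with rounded intermediate values.

```python
# Data from the schooling and salary example.
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]
n = len(years)

x_bar = sum(years) / n
y_bar = sum(salary) / n
s_xy = sum(x * y for x, y in zip(years, salary)) - sum(years) * sum(salary) / n
s_xx = sum(x * x for x in years) - sum(years) ** 2 / n
s_yy = sum(y * y for y in salary) - sum(salary) ** 2 / n

b = s_xy / s_xx                         # slope
a = y_bar - b * x_bar                   # intercept
r_squared = s_xy ** 2 / (s_xx * s_yy)   # coefficient of determination


def predict(x):
    """Estimated salary from the fitted line; refuses x outside the observed range."""
    if not min(years) <= x <= max(years):
        raise ValueError("x is outside the observed range of years of schooling")
    return a + b * x


print(round(b, 2), round(a, 2))                 # 2049.48 648.45
print(round(r_squared, 3))                      # 0.988
print(round(predict(13)), round(predict(18)))   # 27292 37539

# Property 1: the line passes through the balance point (x-bar, y-bar).
assert abs((a + b * x_bar) - y_bar) < 1e-6
```

Raising an error for x outside 8 to 19 is one way to enforce the warning against extrapolating beyond the observed values.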