Ordinary Least Squares Regression Vartanian: SW 540


When to Use Ordinary Least Squares Regression Analysis

A. Variable types

1. When you have an interval/ratio scale dependent variable.
2. When your independent variables are either interval/ratio scale or dummy variables.

B. Types of relationships

We use ordinary least squares regression when we are interested in determining cause-and-effect relationships. Thus, if we believe that there is a positive relationship between the unemployment rate in a community and time on welfare (we believe that high unemployment causes people to spend a relatively long time on welfare), then we use ordinary least squares regression analysis.

The Process of Using OLS Regression Analysis

When examining the relationship between an independent and dependent variable in a scattergram, the line that fits these points best is known as the least squares line. This line is chosen by minimizing the distance between all of these points and the line. In other words, we're choosing a line that is closest to all the data points.

How do we form the line that goes through the data points (in the scattergram)? We do this by minimizing the sum of the squared deviations from any line we could draw through the points. We thus will choose a line that minimizes the following expression:

$$\sum (Y_i - Y_p)^2$$

Here, $Y_i$ are the actual values of Y (for each of the sample members) and $Y_p$ is the predicted value of Y (or the line we'll be drawing through the scattering of points). We're trying to minimize the sum of the squared deviations of the actual (sample) values of Y ($Y_i$) from the best line we can draw through all of the $Y_i$ points. This $\sum (Y_i - Y_p)^2$ expression is known as the unexplained sums of squares or the error sums of squares.

The total sums of squares given below can be broken up into explained and unexplained sums of squares.
$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - Y_p)^2 + \sum (Y_p - \bar{Y})^2$$

Or

Total SS = Unexplained SS + Explained SS

The expression to the left of the equals sign is the total sums of squares. The first expression after the equals sign is the unexplained sums of squares and the second expression after the equals sign is the explained sums of squares.

C:\WP60_1\LECT1.PHD\OLSReg\Regression.Explained.wpd Page 1
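As a rough illustration of what minimizing the sum of squared deviations means, the sketch below computes the unexplained sums of squares for a handful of candidate lines. The five data points are the kids/spell-length values from the example worked later in these notes. The candidate lines and the helper name `sum_squared_deviations` are assumptions made only for this sketch.

```python
# Data borrowed from the kids / AFDC spell-length example later in these notes.
xs = [1, 2, 3, 4, 5]   # X: number of kids
ys = [3, 5, 6, 5, 8]   # Y: months on AFDC


def sum_squared_deviations(a, b, xs, ys):
    """Unexplained sums of squares, sum((Y_i - Y_p)^2), for the line Y_p = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Candidate lines as (intercept, slope). These are hypothetical alternatives;
# the first one is the least squares line for these five points.
candidates = [(2.4, 1.0), (2.0, 1.0), (3.0, 1.0), (2.4, 1.2), (3.0, 0.8)]

for a, b in candidates:
    ss = sum_squared_deviations(a, b, xs, ys)
    print(f"Y_p = {a} + {b}X  ->  unexplained SS = {ss:.2f}")

# The line with the smallest unexplained SS among the candidates.
best = min(candidates, key=lambda ab: sum_squared_deviations(ab[0], ab[1], xs, ys))
print("best candidate:", best)
```

Among these candidates, the least squares line has the smallest unexplained sums of squares. Any other line drawn through the points does worse.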

Unexplained: Our error in predicting what Y will be by using the regression line.

Explained: What we gain by using $Y_p$ instead of $\bar{Y}$.

What we're trying to do is predict the value of Y, the dependent variable, given that we know something about the person: the independent variable, X. If we knew nothing about the person, our best guess of what Y is would be $\bar{Y}$, the mean of Y. We are trying to improve on $\bar{Y}$ in predicting the value of Y. We'll do this with our knowledge of the independent variable, X.

The $Y_p$ line will allow us to predict the value of the dependent variable, Y, for any value of X, the independent variable. For example, we may know that a particular state has welfare payments of $500/month. We may wish to predict how long a person will stay on AFDC if they live in such a state. By knowing the $Y_p$ line, we'll be able to predict how long a person stays on AFDC. We may not be perfectly right in our prediction, for instance, if the points around the line are highly dispersed. But if the points are concentrated around the line, then we can predict fairly accurately how long someone will spend on AFDC for a given AFDC payment level within the state.

We are able to determine this ordinary least squares line by examining each X value and determining the mean value of Y at each X. We then connect each of these mean values, at each X value, to form the OLS regression line. If we were examining the effect of the number of children on income, we would examine the mean value of Y at each X value, or each number of children. We then connect these points to form the OLS regression line. Not all of the sample points will be located on the OLS regression line: some will be below the line and some will be above the line. The closer the points are to this line, the better the independent variable will be as a predictor of the dependent variable.
We can determine the $Y_p$ line by the following equation:

$$Y_p = a + bX$$

Here, a is the intercept, b is the slope coefficient, and X is the independent variable. $Y_p$ is the predicted value of Y for a given value of X. The formulas for determining the intercept (a) and the slope (b) are given below.

We can define the a and b coefficients as the following:

a, or the intercept, is the point where we cross the Y axis when the value of X is 0. We know this because if we give X a value of 0, $Y_p = a$.

b, or the slope coefficient, tells us how much $Y_p$ changes for a one-unit change in X. A positive value for b indicates that there is a positive relationship between the independent and

dependent variable. A negative value for b indicates that there is a negative relationship between the independent and dependent variable. A value of 1 for b indicates that for every 1 unit increase in the independent variable, the dependent variable increases by 1 unit. If b = 2, this indicates that for a one unit increase in the independent variable, the dependent variable increases by 2 units. If b = -9, this indicates that for every 1 unit increase in the independent variable, the dependent variable would decrease by 9 units. Thus,

$$b = \frac{\text{Change in } Y}{\text{1 Unit Increase in } X}$$

The slope is generally defined as $\Delta Y / \Delta X$, the change in Y divided by the change in X.

Let's say we have the following 5 observations, where X, the independent variable, is the number of children in the household, and Y, the dependent variable, is the time in months on AFDC.

X    Y
1    1
2    2
3    3
4    4
5    5

The formula for determining the slope, or the b coefficient estimate, is

$$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2}$$

The formula for the intercept, or a coefficient estimate, is

$$a = \frac{\sum Y - b \sum X}{N}$$

or

$$a = \bar{Y} - b\bar{X}$$

In the example given, N = 5.

$\sum XY = 55$
$\sum X = 15$
$\sum Y = 15$
$\sum X^2 = 55$
$\left(\sum X\right)^2 = 15^2 = 225$

$$b = \frac{5(55) - 15(15)}{5(55) - 225} = \frac{50}{50} = 1$$

$$a = \frac{15 - 1(15)}{5} = \frac{0}{5} = 0$$

So, $Y_p = 0 + 1(X)$.

The b coefficient estimate tells us that for every 1 unit increase in X, the predicted value for the dependent variable will increase by 1 unit. The a coefficient estimate tells us that when X = 0, the predicted value of the dependent variable is 0. When X = 1, $Y_p = 1$. We could graph this line to see the relationship between the two variables, the independent and the dependent. It turns out that in this case we have a perfect relationship, since all of the points lie on the $Y_p$ line. If we were to determine a correlation coefficient (r), it would be r = 1.

To graph this relationship, we could determine the value of $Y_p$ for each X.

X    Y_p
1    1
2    2
3    3
4    4
5    5

Let's say we have the following 5 cases for a second example.

[Table of X and Y values missing in the source]

N = 5

To determine b:

[Computation missing in the source]

The regression equation is therefore $Y_p = 6 - (1)X$, or $Y_p = 6 - X$.

The b coefficient estimate, or the slope coefficient, for this example is -1. The a coefficient estimate, or the intercept, is 6. Thus, when X = 0, $Y_p$, the predicted value of Y, is 6. If X = 1, then the predicted value of Y is 5.

In this second situation, we again would find a perfect relationship between the two variables: all of the points are on the regression line. If we were to determine the correlation coefficient (r) for this example, it would be r = -1.

To graph this we could determine the value of $Y_p$ for each X value. We again use the $Y_p$ equation from above.

[Table of X and Y_p values missing in the source]
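The slope and intercept calculations in the two examples above can be sketched in a few lines of Python. The function name `ols_from_sums` is hypothetical. The first data set (X = Y = 1 to 5) is an assumption that reproduces the sums quoted for the first example. The second data set is purely illustrative: it was chosen to lie on the line $Y_p = 6 - X$, because the actual values for that example are not shown.

```python
def ols_from_sums(xs, ys):
    """Return (a, b) for Y_p = a + b*X using the sums-based formulas:
    b = (N*sum(XY) - sum(X)*sum(Y)) / (N*sum(X^2) - (sum(X))^2)
    a = (sum(Y) - b*sum(X)) / N
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# First example (assumed data matching sum(XY)=55, sum(X)=15, sum(Y)=15, sum(X^2)=55).
a1, b1 = ols_from_sums([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(a1, b1)   # intercept 0, slope 1

# Second example (hypothetical data lying exactly on Y_p = 6 - X).
a2, b2 = ols_from_sums([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(a2, b2)   # intercept 6, slope -1
```

Because every point in both data sets lies exactly on a line, the fitted coefficients reproduce that line with no error.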

We will rarely find a perfect relationship between two variables as we have in the two examples above. For example, if we had the following 5 cases, we would not find a perfect relationship between the two variables.

[Table of X and Y values missing in the source]

N = 5

To determine b:

[Computation missing in the source]

The regression equation is therefore $Y_p = 3.6 + .8(X)$, where b = .8 and a = 3.6. Thus, when X = 0, the predicted value for $Y_p$ is 3.6: replace X with a value of 0 in the above $Y_p$ equation. When X = 1, the predicted value for $Y_p$ = 4.4: replace X with a value of 1 in the above

$Y_p$ equation. When X = 10, the predicted value for $Y_p$ = 11.6.

A final example examines a sample of people who have been on AFDC to determine the relationship between time on AFDC (in months) and the unemployment rate in the area where the AFDC recipient lives. We come up with the following a and b coefficients: a = 3, b = 4. In other words,

$$Y_p = 3 + 4X$$

Here, X = the unemployment rate in the area of residence of the AFDC recipient. What we can do is put in different values of X to see what we predict about the dependent variable. If X = 0 (or the unemployment rate is at 0%), we would predict that AFDC recipients will spend 3 months on AFDC: $Y_p = 3 + 4(0) = 3$. If X = 1, we would predict that AFDC spell length would be 7 months: $Y_p = 3 + 4(1) = 7$. If X = 2 (the unemployment rate is at 2%), we would predict that AFDC recipients would spend 11 months on AFDC: $Y_p = 3 + 4(2) = 11$.

The Disturbance or Residual Term

The vertical distances of the points above and below the regression line from that line constitute what is called the disturbance or residual. We can determine the value of the disturbance or residual for each of the observations in the sample. The value of Y ($Y_i$) for each sample member is determined by the following equation:

$$Y_i = a + bX_i + e_i$$

where $e_i$ is the disturbance or residual. The disturbance or residual measures:

1. Variables that have not been used in the equation that should have been used. Theory states that you need particular variables in your equation but you fail to include these variables.
2. Unknown variance in the measurement of the dependent variable.

In order to get unbiased and efficient estimates for the coefficient estimate, b, the following must be true:

1. The expected value of the residual or disturbance term is zero: $E(e_i) = 0$.
2. $e_i$ is normally distributed.
3. $e_i$ is independent of X. That is, $e_i$ and X are uncorrelated.

If you omit variables from your equation that are necessary in determining Y, these will be picked up by the residual or disturbance term, $e_i$. If these omitted variables are correlated with any of the $X_i$, then your $X_i$'s will be correlated with the disturbance term and you will be violating rule #3 above. If this is the case, you will not have unbiased estimates of your b coefficients. It is important that your theory capture the necessary variables to estimate Y and that you include these variables in your statistical models.

Determining the disturbance for each of the observations:

Example:

kids (X)    Spell Length (Y)
1           3
2           5
3           6
4           5
5           8

From here we could determine the disturbance or residual for each of the observations, using $Y_i = a + bX_i + e_i$:

kids (X)    Spell Length (Y_i)    Disturbance (e)    Y_p
1           3                     -.4                3.4
2           5                      .6                4.4
3           6                      .6                5.4
4           5                     -1.4               6.4
5           8                      .6                7.4

$Y_p = a + bX$. In this example, b = 1 and a = 2.4.

For the first observation, $3 = 2.4 + 1(1) + e_1$

$3 = 3.4 + e_1$, so $e_1 = -.4$ (with $Y_p = 2.4 + 1(1) = 3.4$).

For the second observation, $5 = 2.4 + 1(2) + e_2$; $5 = 4.4 + e_2$, so $e_2 = .6$ (with $Y_p = 2.4 + 1(2) = 4.4$).

For the third observation, $6 = 2.4 + 1(3) + e_3$; $6 = 5.4 + e_3$, so $e_3 = .6$ (with $Y_p = 2.4 + 1(3) = 5.4$).

For the fourth observation, $5 = 2.4 + 1(4) + e_4$; $5 = 6.4 + e_4$, so $e_4 = -1.4$ (with $Y_p = 2.4 + 1(4) = 6.4$).

For the fifth observation, $8 = 2.4 + 1(5) + e_5$; $8 = 7.4 + e_5$, so $e_5 = .6$ (with $Y_p = 2.4 + 1(5) = 7.4$).

You could also find the $e_i$ values by the formula $e_i = Y_i - Y_p$.

Once you determine these residuals, you can see that the mean, or the expected value, for the residuals is equal to zero (add up all of the residuals: $-.4 + .6 + .6 - 1.4 + .6 = 0$). You could also determine the error sums of squares or the unexplained sums of squares (they're the same thing) by squaring each of the disturbance terms and adding them up:

$$(-.4)^2 + (.6)^2 + (.6)^2 + (-1.4)^2 + (.6)^2 = .16 + .36 + .36 + 1.96 + .36 = 3.2$$

To determine the explained sums of squares, first determine $\bar{Y}$ (here, $\bar{Y} = 5.4$), then subtract it from $Y_p$ for each of the observations and square this difference.

Explained sums of squares:

1: $(3.4 - 5.4)^2 = 4$
2: $(4.4 - 5.4)^2 = 1$
3: $(5.4 - 5.4)^2 = 0$
4: $(6.4 - 5.4)^2 = 1$
5: $(7.4 - 5.4)^2 = 4$

The explained sums of squares = 10 and the unexplained sums of squares = 3.2. Therefore, the total sums of squares = 13.2 (Total SS = Unexplained SS + Explained SS).

From this information, you can determine how much of the variation in the dependent variable is being explained by the independent variable. This is called the $R^2$ value and can be determined by using the following formula:

$$R^2 = \frac{\text{Explained SS}}{\text{Total SS}}$$
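The residual and sums-of-squares arithmetic above can be checked with a short Python sketch. The data values and the coefficients a = 2.4 and b = 1 are the ones used in the worked calculations. The variable names are choices made for this sketch only.

```python
# Kids / AFDC spell-length example from the worked calculations above.
xs = [1, 2, 3, 4, 5]   # X: number of kids
ys = [3, 5, 6, 5, 8]   # Y_i: spell length in months
a, b = 2.4, 1.0        # intercept and slope given in the notes

y_mean = sum(ys) / len(ys)                           # Y-bar
predicted = [a + b * x for x in xs]                  # Y_p for each observation
residuals = [y - yp for y, yp in zip(ys, predicted)]  # e_i = Y_i - Y_p

unexplained_ss = sum(e ** 2 for e in residuals)              # error sums of squares
explained_ss = sum((yp - y_mean) ** 2 for yp in predicted)   # explained sums of squares
total_ss = sum((y - y_mean) ** 2 for y in ys)                # total sums of squares
r_squared = explained_ss / total_ss

print("residuals:", [round(e, 2) for e in residuals])
print("sum of residuals:", round(sum(residuals), 10))
print("unexplained SS:", round(unexplained_ss, 4))
print("explained SS:", round(explained_ss, 4))
print("total SS:", round(total_ss, 4))
print("R^2:", round(r_squared, 4))
```

Computing the total sums of squares directly from $Y_i - \bar{Y}$, rather than by adding the two parts, gives an independent check that Total SS = Unexplained SS + Explained SS.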

In this case, the $R^2$ value = 10/13.2 ≈ .7576, or about 75.8% of the variation in the dependent variable, AFDC spell length, is explained by the independent variable, number of kids.

Testing to Determine if the Relationship Between the Independent and Dependent Variables is Significant or Testing the Significance of the b coefficient estimate

You will generally be testing a null hypothesis that states that there is no relationship between the independent and dependent variables. In other words, you'll be testing the following: $H_0: B = 0$.

If you're testing for a positive relationship between the independent and dependent variables, your one-tailed research hypothesis will be: $H_R: B > 0$.

A negative research hypothesis will be: $H_R: B < 0$.

A two-tailed research hypothesis will be: $H_R: B \neq 0$.

In order to test for the significance of the b coefficient, you will have to know the standard error for the b coefficient. The standard error for the coefficient is very similar to a standard deviation: it measures the spread of the distribution. We will use a Student's t distribution to test the b coefficient, to determine if there is in all likelihood a relationship between the independent and dependent variables.

The Student's t value is very similar to a z value (related to the normal distribution) that we learned earlier. The t is telling us how many standard error units we are away from our hypothesized value. The hypothesized value we're examining is the null hypothesis, a value of B = 0. We found that for the normal distribution, when we were 1.96 units away from the mean of the distribution (where z = 1.96), we were in the .025 tail of the normal distribution. When sample sizes get relatively large, it will again take around 1.96 units (now standard error units measured in t values rather than z values) for us to be in the .025 tail-end of the distribution. In other words, when sample sizes get large, the Student's t distribution approaches the normal distribution.

The t value is determined by the formula below.

$$t_{n-k-1} = \frac{b}{SE_b}$$

For now, we'll determine the standard error of the b estimate by using the following formula:

$$SE_b = \sqrt{\frac{ESS / (n - k - 1)}{\sum (X_i - \bar{X})^2}}$$

where ESS stands for the error sums of squares or the unexplained sums of squares.

The n-k-1 part of the t formula indicates the degrees of freedom. Here, n is equal to the number of observations, k is equal to the number of independent variables, and $SE_b$ is the standard error for the b coefficient estimate. If we had 5 observations and 1 independent variable, we would have 3 degrees of freedom. We would use this degrees of freedom in a table of critical values for t to determine if the t value is greater than or equal to the critical value. If the t value is greater than the critical value, you will reject the null hypothesis. If the t value is less than the critical value, you will fail to reject the null hypothesis.

Let's say that you determine that the b coefficient estimate = 4. You also determine that the standard error for the b coefficient estimate is 2, with n = 42 (or you're examining 42 cases). Let's also say you're examining a one-tailed hypothesis at the .05 level of significance. Your t statistic would be the following:

$$t_{40} = \frac{4}{2} = 2$$

This indicates that the t value = 2, with 40 degrees of freedom. The critical value is 1.684. Because the t value is greater than the critical value, you would reject the null hypothesis at the .05 level, for a one-tailed test.

If you were testing this hypothesis at the .05 level for a two-tailed test, the critical value = 2.021. Because the t value is less than the critical value, you would fail to reject the null hypothesis.
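The decision rule in this example can be written out as a small Python sketch. The function names are hypothetical, and the two critical values are hard-coded from a standard t table for 40 degrees of freedom rather than computed, which is an assumption of this sketch.

```python
def t_statistic(b, se_b):
    """t = b / SE_b, the number of standard errors b lies from the null value B = 0."""
    return b / se_b


def degrees_of_freedom(n, k):
    """n observations, k independent variables."""
    return n - k - 1


b, se_b, n, k = 4, 2, 42, 1
t = t_statistic(b, se_b)
df = degrees_of_freedom(n, k)

# Critical values for 40 degrees of freedom at the .05 level,
# taken from a standard t table (assumed here, not computed).
critical_one_tailed = 1.684
critical_two_tailed = 2.021

reject_one_tailed = t > critical_one_tailed        # H_R: B > 0
reject_two_tailed = abs(t) > critical_two_tailed   # H_R: B != 0

print(f"t = {t}, df = {df}")
print("one-tailed test, reject H0:", reject_one_tailed)
print("two-tailed test, reject H0:", reject_two_tailed)
```

The same t value leads to different conclusions because the two-tailed test splits the .05 across both tails, which pushes the critical value further from zero.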

A MORE COMPLICATED EXAMPLE

You're examining the relationship between age and wage. You have the following 4 observations:

Obs    Age (X)    Wage (Y)
1      20         5.5
2      30         6.5
3      40         7.5
4      50         8

From this information, we could determine the a and b coefficients: a = 3.9, b = .085.

$$Y_p = 3.9 + .085X$$

Therefore, the predicted values of Y and the residual or disturbance terms ($Y_i - Y_p$) will be:

1: $Y_p = 3.9 + .085(20) = 5.6$; $e_1 = 5.5 - 5.6 = -.1$
2: $Y_p = 3.9 + .085(30) = 6.45$; $e_2 = 6.5 - 6.45 = .05$
3: $Y_p = 3.9 + .085(40) = 7.3$; $e_3 = 7.5 - 7.3 = .2$
4: $Y_p = 3.9 + .085(50) = 8.15$; $e_4 = 8 - 8.15 = -.15$

If we square each of these residuals and add them up, we get $.01 + .0025 + .04 + .0225 = .075$. This is the value for the unexplained sums of squares. If we divide this value by n-k-1, or 2, we get .0375. This value, $ESS / (n - k - 1)$, is called the Mean Square Error (MSE). It is the unexplained sums of squares divided by the degrees of freedom. We'll use the MSE again when examining the relationship between the entire model of independent variables and the dependent variable.

To determine the standard error of the b estimate, use the formula:

$$SE_b = \sqrt{\frac{ESS / (n - k - 1)}{\sum (X_i - \bar{X})^2}}$$

We've just determined that $ESS / (n - k - 1) = .0375$. To determine the rest of this formula, do the following:

$\bar{X} = (20 + 30 + 40 + 50)/4 = 35$

$\sum (X_i - \bar{X})^2 = (-15)^2 + (-5)^2 + (5)^2 + (15)^2 = 225 + 25 + 25 + 225 = 500$

$$SE_b = \sqrt{\frac{.0375}{500}} = \sqrt{.000075} \approx .00866$$

We can then determine whether the b coefficient is significant by using the t formula: $t_2 = .085 / .00866 = 9.8$. At two degrees of freedom for a .05, two-tailed test, the critical value is 4.3. Because the t value is greater than the critical value, reject the null hypothesis.
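The whole age and wage example can be reproduced end to end with the sketch below. The wage values are the ones implied by the predicted values and residuals listed in the notes. The slope is computed in deviation form, $b = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$, which is algebraically equivalent to the sums-based formula used earlier. The variable names are choices made for this sketch.

```python
import math

ages = [20, 30, 40, 50]        # X
wages = [5.5, 6.5, 7.5, 8.0]   # Y, implied by the Y_p values and residuals in the notes
n, k = len(ages), 1            # 4 observations, 1 independent variable

x_mean = sum(ages) / n
y_mean = sum(wages) / n

# Deviation form of the slope, equivalent to the sums-based formula.
sxx = sum((x - x_mean) ** 2 for x in ages)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(ages, wages))
b = sxy / sxx
a = y_mean - b * x_mean

predicted = [a + b * x for x in ages]
residuals = [y - yp for y, yp in zip(wages, predicted)]

ess = sum(e ** 2 for e in residuals)   # unexplained (error) sums of squares
mse = ess / (n - k - 1)                # mean square error
se_b = math.sqrt(mse / sxx)            # standard error of b
t = b / se_b                           # t with n - k - 1 = 2 degrees of freedom

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"ESS = {ess:.4f}, MSE = {mse:.4f}")
print(f"SE_b = {se_b:.5f}, t = {t:.2f}")
```

The resulting t value can then be compared with the critical value of 4.3 quoted above for 2 degrees of freedom.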