Economics of Strategy (ECON 4550) Maymester 2015 Applications of Regression Analysis


Reading: ACME Clinic (ECON 4550 Coursepak, Page 47) and Big Suzy's Snack Cakes (ECON 4550 Coursepak, Page 51)

Definitions and Concepts:

Sample Maximum: the largest realized value of a variable.

Sample Minimum: the smallest realized value of a variable.

Dummy Variable: a variable that indicates whether an observation is characterized by a particular attribute (typically equal to 1 if the attribute is true and equal to 0 otherwise).

Omitted Variable Bias: a problem of distorted regression results arising from specifying a model which leaves out one or more important independent variables (i.e., a specification of the true model which is wrong because not all of the relevant X variables were included).
For such a bias to arise in linear regression, the omitted variable must (i) be a true determinant of the dependent variable and (ii) be strongly correlated with one or more of the included independent variables.
If such a relevant independent variable is omitted, then the estimated coefficient on the strongly correlated (included) independent variable is partly measuring the impact of the omitted variable.

Note: an Excel file containing the data used in each of the examples discussed in lecture is posted on the course webpage.

1. Estimating an Average Cost Function

Consider an automobile manufacturer trying to estimate ATC(q), based on past realizations of Average Total Costs for different levels of Output.

Assume \( ATC(q) = b_0 + b_1 q + b_2 q^2 \)

We have data on Average Costs and Quantity of Output for each of the past 6 weeks as follows:

[Data table: Average Costs and Quantity, one row per week. The values did not survive transcription; they are in the Excel file posted on the course webpage.]

Start by computing some descriptive statistics for the variables in our data set: sample mean, sample standard deviation, sample maximum, and sample minimum. In practice, this partly serves as a check to identify potential errors in the dataset.

[Descriptive Statistics table: Mean, Std Dev, Maximum, and Minimum of Average Costs and Quantity. Only leading digits of the Average Costs column survive: Mean 33,…; Std Dev 15,…; Maximum 71,…; Minimum 1,…]

In order for our data to match the assumed functional form for Average Costs, we need to do a non-linear transformation of Quantity (i.e., compute Quantity Squared for each observation).

Regression results from Excel:

Example 1: Estimating an Average Cost Function

SUMMARY OUTPUT
[Excel regression output: Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations 6), ANOVA table, and coefficient table (Coefficients, Standard Error, t Stat, P value, Lower 95%, Upper 95%) for Intercept, X Variable 1 (q), and X Variable 2 (q squared). Most numerical entries did not survive transcription.]

Estimated equation: \( \hat{b}_0 + \hat{b}_1 q + \hat{b}_2 q^2 \), with \( \hat{b}_0 \) = 41,… and \( \hat{b}_2 = 0.1781 \) (the value of \( \hat{b}_1 \) is lost).

Note, all p-values are small enough so that each estimated coefficient is statistically significant at a 0.1% error level.

\( R^2 = .6858 \)
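The transformation-of-variables step (adding Quantity Squared as a second X variable and then running an ordinary linear regression) can be sketched in Python as below. This is only an illustration, not the Excel workflow used in lecture: the function name `fit_quadratic_atc` and the demo data are hypothetical, and the data are generated without noise from a known quadratic so that the fit can be checked.

```python
import numpy as np

def fit_quadratic_atc(quantity, avg_cost):
    """Regress average cost on q and q^2; return (b0, b1, b2)."""
    q = np.asarray(quantity, dtype=float)
    y = np.asarray(avg_cost, dtype=float)
    # Columns: intercept, q, and the transformed variable q^2
    X = np.column_stack([np.ones_like(q), q, q ** 2])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return tuple(coeffs)

# Hypothetical, noise-free data (NOT the lecture dataset)
q_demo = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 600.0])
atc_demo = 40000.0 - 120.0 * q_demo + 0.18 * q_demo ** 2

b0, b1, b2 = fit_quadratic_atc(q_demo, atc_demo)
print(b0, b1, b2)
```

Because the model is linear in the coefficients, the squared term is treated as just another regressor; nothing about the estimation itself is non-linear.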

(?) What is the Efficient Scale of Production for this firm?

(A) Recall, the Efficient Scale of Production is the quantity of output that minimizes Average Total Costs of Production. We have estimated Average Total Costs of Production to be:

\( ATC(q) = \hat{b}_0 + \hat{b}_1 q + (0.1781) q^2 \)

From here, we have \( ATC'(q) = \hat{b}_1 + (0.3562) q \), with \( ATC'(q) < 0 \) for small quantities and \( ATC'(q) > 0 \) for large quantities (the estimated \( \hat{b}_1 \) is negative).

Average Total Costs are minimized where:

\( ATC'(q) = \hat{b}_1 + (0.3562) q = 0 \iff (0.3562) q = -\hat{b}_1 \iff q = -\hat{b}_1 / 0.3562 \)

Thus, the Efficient Scale of Production is roughly 338 units of output.
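As a quick check of the first-order condition, the minimizer of a quadratic average cost function can be computed directly. In the sketch below the helper name `efficient_scale` is hypothetical, and the linear coefficient of -120.4 is an assumed value chosen only so that the result lands near the reported 338 units; the quadratic coefficient 0.1781 is the one reported in the notes.

```python
def efficient_scale(b1, b2):
    """Return the q that minimizes b0 + b1*q + b2*q**2 (requires b2 > 0)."""
    if b2 <= 0:
        raise ValueError("ATC has no interior minimum unless b2 > 0")
    # Setting ATC'(q) = b1 + 2*b2*q equal to zero and solving for q
    return -b1 / (2.0 * b2)

# b1 = -120.4 is an assumption for illustration; b2 = 0.1781 is from the notes
q_star = efficient_scale(b1=-120.4, b2=0.1781)
print(round(q_star))
```

Note that the intercept plays no role in locating the minimum; it only shifts the level of average cost.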

2. Estimating Demand

Consider a coffee house with retail outlets in 3 markets. For each market they have data on annual quantity sold, price per unit, average income, and price set by a rival.

[Data table: Store Number, Quantity Sold, Price, Average Income, Rival Price, one row per store. The values did not survive transcription; they are in the Excel file posted on the course webpage.]

[Descriptive Statistics table: Mean, Std Dev, Maximum, and Minimum of Quantity, Price, Income, and Rival Price. Most entries did not survive transcription; surviving leading digits for Quantity: Mean 494,…; Std Dev 86,…]

Suppose they conjecture that:

\( quantity = A \, (price)^{B_1} (income)^{B_2} (rival\_price)^{B_3} \)

Note: \( \ln(x^a y^b z^c) = a \ln(x) + b \ln(y) + c \ln(z) \)

Thus, the demand relation above can be expressed as:

\( \ln(quantity) = \ln(A) + B_1 \ln(price) + B_2 \ln(income) + B_3 \ln(rival\_price) \)

or, writing \( B_0 = \ln(A) \),

\( \ln(quantity) = B_0 + B_1 \ln(price) + B_2 \ln(income) + B_3 \ln(rival\_price) \)

We can do a transformation of variables and run a linear regression! Regression results from Excel (see following page).

From here, we can essentially undo the previous transformation of variables. Since \( B_0 = \ln(A) \), it follows that \( \hat{A} = \exp\{\hat{B}_0\} \) (the numerical values of \( \hat{B}_0 \) and \( \hat{A} \) are truncated in this transcription; \( \hat{A} \) begins 1,…).

So, our estimated equation is:

\( quantity = \hat{A} \, (price)^{-1.681} (income)^{.6309} (rival\_price)^{.7067} \)

Recognize that, fixing income and rival price, this demand function is of the constant elasticity form
=> price elasticity of demand is \( \varepsilon_p = -1.681 \), which exceeds 1 in absolute value => Elastic Demand.

Further, Income Elasticity of Demand is \( \varepsilon_I = .6309 > 0 \) => Normal Good.

And Cross Price Elasticity of Demand (with respect to rival price) is \( \varepsilon_{X, p_Y} = .7067 > 0 \) => the good in question is a Substitute for the good being sold by the rival firm.
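The log transformation and the step of undoing it afterwards can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `fit_loglog_demand` and all data values are hypothetical (not the coffee house dataset), and the quantities are generated without noise from a known constant-elasticity demand function so the recovered elasticities can be verified.

```python
import numpy as np

def fit_loglog_demand(quantity, price, income, rival_price):
    """Regress ln(quantity) on ln(price), ln(income), ln(rival_price)."""
    y = np.log(np.asarray(quantity, dtype=float))
    X = np.column_stack([
        np.ones(len(y)),
        np.log(np.asarray(price, dtype=float)),
        np.log(np.asarray(income, dtype=float)),
        np.log(np.asarray(rival_price, dtype=float)),
    ])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    b0, b1, b2, b3 = coeffs
    # Undo the transformation of the intercept: A = exp(B0)
    return {"A": float(np.exp(b0)), "price": float(b1),
            "income": float(b2), "rival_price": float(b3)}

# Hypothetical data generated from quantity = 1000 * p^-1.5 * I^0.6 * pr^0.7
price = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
income = np.array([30000.0, 42000.0, 35000.0, 50000.0, 38000.0, 45000.0])
rival = np.array([2.2, 2.0, 3.1, 2.8, 3.9, 3.3])
quantity = 1000.0 * price ** -1.5 * income ** 0.6 * rival ** 0.7

est = fit_loglog_demand(quantity, price, income, rival)
print(est)
```

The slope coefficients of the log-log regression are read directly as elasticities, which is why no further conversion is applied to them.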

Example 2: Estimating Demand

SUMMARY OUTPUT
[Excel regression output: Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations 3), ANOVA table, and coefficient table (Coefficients, Standard Error, t Stat, P value, Lower 95%, Upper 95%) for Intercept and X Variables 1 through 3. Most numerical entries did not survive transcription.]

Estimated equation of \( \hat{B}_0 + \hat{B}_1 \ln(price) + \hat{B}_2 \ln(income) + \hat{B}_3 \ln(rival\_price) \) is:

\( \hat{B}_0 - 1.681 \ln(price) + .6309 \ln(income) + .7067 \ln(rival\_price) \)

(the numerical value of \( \hat{B}_0 \) is lost).

Note, all p-values are small enough so that each estimated coefficient is statistically significant at a 5% error level.

\( R^2 = .5104 \)

3. ACME Clinic (Page 47 in Coursepak)

1. Based upon Exhibit A, are male nurses paid less than female nurses? If so, by how much? Is that difference statistically significant?

2. What about the clinic's claim that Mr. Jones is appropriately paid if you account for his below average education? Is that supported by the data? If education is the only determinant of compensation, what is a fair estimate of what Mr. Jones' salary should be?

3. After conducting your preliminary analysis, you interview supervisors in the clinic and find that years of experience are also highly valued by the clinic. Based on that observation, you request data on the experience of the nurses and receive data contained in Exhibit B. How is your analysis altered if you consider experience as a factor that determines compensation? Is Mr. Jones underpaid according to this analysis? Why not?

4. How do you reconcile the apparent contradiction between your answers above?

Exhibit A (with Gender=1 for female and Gender=0 for male)

[Data table: ID #, Salary, Education, Gender, one row per nurse. The values did not survive transcription; they are in the Excel file posted on the course webpage.]

[Descriptive Statistics table: Mean, Std Dev, Maximum, and Minimum of Salary, Education, and Gender. Only leading digits of the Salary column survive: Mean 41,…; Std Dev 11,…; Maximum 64,…; Minimum 4,…]

1. Based upon Exhibit A, are male nurses paid less than female nurses? If so, by how much? Is that difference statistically significant?

Observe that from the dataset we can compute that the Average Salary of Female nurses is $49,158.33, while the Average Salary of Male nurses is only $31,180 => Male nurses are paid $17,978.33 less!

If we run a regression to estimate the equation \( salary = b_0 + b_1 (female) \), we get the results labeled Example 3 ACME Clinic [Regression (i)].

So, based upon the results of this regression, it appears as if Male nurses are paid less ($17,978.33 less!) than Female nurses. Further, this difference is statistically significant at a .01% error level.

2. What about the clinic's claim that Mr. Jones is appropriately paid if you account for his below average education? Is that supported by the data? If education is the only determinant of compensation, what is a fair estimate of what Mr. Jones' salary should be?

To determine the relation between education and salary (assuming education is the only determinant of salary), run a regression on the equation \( salary = b_0 + b_1 (education) \). Doing so, we get the results labeled Example 3 ACME Clinic [Regression (ii)].

So, based upon the results of this regression, it appears that nurses with more education are paid higher salaries. Mr. Jones' education level (only 2 years) is slightly below the sample mean of (2.8). But, by the estimated equation 31,… + …,906.17(education), the expected salary of a nurse with 2 years of education should be 31,… + …,906.17(2) = 38,… => Mr. Jones' salary of only $29,980 is well below this amount.

Thus, the Clinic's claim that Mr. Jones' low salary is accounted for by his below average education is not supported by the data.

3. After conducting your preliminary analysis, you interview supervisors in the clinic and find that years of experience are also highly valued by the clinic. Based on that observation, you request data on the experience of the nurses and receive data contained in Exhibit B. How is your analysis altered if you consider experience as a factor that determines compensation? Is Mr. Jones underpaid according to this analysis? Why not?

We now have Exhibit B:

[Data table: ID #, Salary, Education, Female, Experience, one row per nurse. The values did not survive transcription; they are in the Excel file posted on the course webpage.]

[Descriptive Statistics table: Mean, Std Dev, Max, and Min of Salary, Education, Gender, and Experience. Only leading digits of the Salary column survive: Mean 41,…; Std Dev 11,…; Max 64,…; Min 4,…]

To determine the relation between salary and all three independent variables, run a regression on \( salary = b_0 + b_1 (education) + b_2 (female) + b_3 (experience) \). Doing so, we get the results labeled Example 3 ACME Clinic [Regression (iii)].

So, based upon the results of this regression, there is no statistically significant difference in salaries of females versus males.

Accounting for Mr. Jones' education level (only 2 years) and experience (only 3 years), his expected salary is

19,… + 2,054.64(2) + \( \hat{b}_2 \)(0) + 1,855.88(3) = 29,…

His actual salary of $29,980 is greater than this estimated expected salary (an estimate that takes into account his level of education and experience) => if anything, he's slightly overpaid.

4. How do you reconcile the apparent contradiction between your answers above?

To answer Question (1) we ran a regression for the equation \( salary = b_0 + b_1 (female) \) and found the impact of female to be statistically significant.

To answer Question (3) we ran a regression for \( salary = b_0 + b_1 (education) + b_2 (female) + b_3 (Experience) \) and found the impact of education and experience to be statistically significant, but the impact of female to not be statistically significant.

When running this latter regression, we are determining the impact of changes in each independent variable, controlling for differences in each of the other independent variables (recall, for multiple regression the interpretation of each coefficient is along the lines of "all other factors fixed").

The regression we ran to answer Question (1) suffers from an Omitted Variable Bias, due to the fact that for this population there is a strong, positive correlation between Female and Experience.

Recall, definition of Correlation Coefficient: \( \rho_{XY} = \dfrac{cov(X, Y)}{s_X s_Y} \)

Value of the correlation coefficient between each pair of independent variables:

[Correlation table for Education, Female, and Experience. Only the Female/Experience entry survives in this transcription: .7999]

The Correlation Coefficient between Experience and Female is (.7999), which is fairly close to the upper bound of (1).

Recall that an omitted variable bias arises when a true determinant of the dependent variable is left out and is strongly correlated with an included independent variable. For the regression we ran to answer Question (1), this was precisely the case. The specified equation for this regression was \( salary = b_0 + b_1 (female) \). We omitted Experience, which is highly correlated with Female => when doing so, the estimated coefficient for Female is actually providing a measure of both gender and the highly correlated experience.

Once we include both Female and Experience, the coefficient on Female only measures the impact of gender and not the impact of experience => from these results we see that experience has a statistically significant impact on salary, while gender does not.

Thus, the better results in this case are those from the regression which includes all three potential determinants of salary (i.e., results for the estimation of the equation \( salary = b_0 + b_1 (education) + b_2 (female) + b_3 (Experience) \), as estimated within our answer to Question 3) => these results do NOT suffer from this Omitted Variable Bias.
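The mechanism described above can be reproduced on a small constructed example. The sketch below uses hypothetical numbers (not the ACME Clinic data) in which salary depends on experience only, while the female dummy is strongly correlated with experience. The helper name `ols` is an assumption for illustration.

```python
import numpy as np

def ols(y, *regressors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coeffs

# Hypothetical data: women in this sample happen to have more experience
female = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
experience = np.array([2, 3, 4, 5, 9, 10, 12, 13], dtype=float)
# By construction, gender has NO direct effect on salary
salary = 20000.0 + 2000.0 * experience

short_fit = ols(salary, female)              # omits experience
long_fit = ols(salary, female, experience)   # includes experience
corr = np.corrcoef(female, experience)[0, 1]

print("correlation(female, experience):", corr)
print("female coefficient, experience omitted: ", short_fit[1])
print("female coefficient, experience included:", long_fit[1])
```

With experience omitted, the female coefficient equals 2,000 times the gap in average experience between the two groups (7.5 years here), even though gender plays no role in how the salaries were generated. Once experience is included, the female coefficient falls to zero.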

Example 3: ACME Clinic [Regression (i)]

SUMMARY OUTPUT
[Excel regression output: Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations 20), ANOVA table, and coefficient table (Coefficients, Standard Error, t Stat, P value, Lower 95%, Upper 95%) for Intercept and X Variable 1. Most numerical entries, including the value of R Square, did not survive transcription.]

Estimated equation: \( \hat{b}_0 + \hat{b}_1 (female) = 31{,}180 + 17{,}978.33 (female) \)

=> if we run a regression with only one X variable that happens to be a dummy, then \( \hat{b}_0 \) is equal to the average value of the observations with (dummy)=(0) and \( \hat{b}_1 \) is equal to the difference between the average value of the observations with (dummy)=(1) and the average value of the observations with (dummy)=(0).

Each estimated coefficient is significant at a .01% error level.
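The relation between a dummy-only regression and the two group averages can be verified numerically. The salaries below are hypothetical values chosen for illustration, not the Exhibit A data.

```python
import numpy as np

# Hypothetical salaries; female = 1 for female, 0 for male
salary = np.array([30000.0, 32000.0, 31000.0, 48000.0, 50000.0, 49000.0, 51000.0])
female = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Regress salary on an intercept and the dummy
X = np.column_stack([np.ones_like(female), female])
(b0, b1), *_ = np.linalg.lstsq(X, salary, rcond=None)

mean_male = salary[female == 0].mean()
mean_female = salary[female == 1].mean()

print(b0, mean_male)                 # intercept vs. average of dummy = 0 group
print(b1, mean_female - mean_male)   # slope vs. difference in group averages
```

This is why the regression in Question 1 reproduces the simple comparison of average salaries exactly.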

Example 3: ACME Clinic [Regression (ii)]

SUMMARY OUTPUT
[Excel regression output: Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations 20), ANOVA table, and coefficient table (Coefficients, Standard Error, t Stat, P value, Lower 95%, Upper 95%) for Intercept and X Variable 1. Most numerical entries, including the value of R Square, did not survive transcription.]

Estimated equation: \( \hat{b}_0 + \hat{b}_1 (education) \) = 31,… + …,906.17(education)

Each estimated coefficient is significant at a .01% error level.

Example 3: ACME Clinic [Regression (iii)]

SUMMARY OUTPUT
[Excel regression output: Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations 20), ANOVA table, and coefficient table (Coefficients, Standard Error, t Stat, P value, Lower 95%, Upper 95%) for Intercept and X Variables 1 through 3. Most numerical entries, including the value of R Square, did not survive transcription.]

Estimated equation: \( \hat{b}_0 + \hat{b}_1 (education) + \hat{b}_2 (female) + \hat{b}_3 (Experience) \) = 19,… + 2,054.64(education) + …(female) + 1,855.88(Experience)

However, the coefficient for the Female dummy variable is no longer statistically significant (p-value of .8335).

Multiple Choice Questions:

1. ______ refers to a problem of distorted regression results arising from specifying a model which leaves out one or more important independent variables.
A. Selection Bias
B. A Dummy Variable
C. Omitted Variable Bias
D. A Log-Transformation

2. A Dummy Variable
A. is typically defined in such a way that it can take on any value between (-1) and (1), but cannot take on values less than (-1) or greater than (1).
B. can only ever be included in a regression as the Y variable (and never as one of the X variables).
C. indicates whether an observation is characterized by a particular attribute.
D. More than one (perhaps all) of the above answers is correct.

3. Henry ran a regression to estimate \( q_x = b_0 + b_1 \ln(p_x) + b_2 \ln(p_y) + b_3 \ln(Inc) \), where \( q_x \) denotes quantity of good x, \( p_x \) denotes price of good x, \( p_y \) denotes price of good y, and Inc denotes per capita Income in the market for good x. His estimated coefficient values are \( \hat{b}_0 \), \( \hat{b}_1 \), \( \hat{b}_2 \), and \( \hat{b}_3 \) (the numerical values did not survive transcription). These results would suggest that
A. good x is an inferior good.
B. good x is a substitute for good y.
C. good x is a complement to good y.
D. More than one (perhaps all) of the above answers is correct.

4. Suppose you have the following observations on the value of variable X1: 9, 10, 1, 16, 13, 18, 7, 10, 9, 6, 14, and 11. For these observations, the Sample Minimum is
A. 6.
B. 1.
C. 1.
D. 144.

Problem Solving or Short Answer Questions:

1. John is planning on running a regression in order to determine the factors influencing salaries of public school teachers in the state of Georgia. He has obtained data on current salary, level of education, number of years of teaching experience, age, and gender for a random sample of ,457 teachers in the state. Every teacher in his sample has at least a Bachelor's degree, but some have a Master's Degree or Doctorate. He has created a dummy variable (named AdvDeg) to indicate whether or not each individual has one of these advanced degrees. He has also created a dummy variable (named Male) to indicate the gender of each individual. Before running his regression, he computed Descriptive Statistics for each variable, as reported below:

[Descriptive Statistics table: Mean, Std Dev, Max, and Min of Salary, AdvDeg, Experience, Age, and Male. Only leading digits of the Salary column survive: Mean 45,…; Std Dev 15,…; Max 7,…; Min 8,…]

Based upon these reported values, do you have any observations to offer about his dataset? Explain.

2. Amy ran a regression to estimate the parameters in the equation \( y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 \). In part, her regression results are:

[Partial regression output: R Square, Adjusted R Square, Observations 357, and Coefficients and P values for Intercept and X Variables 1 through 4. The numerical entries did not survive transcription.]

2A. Based upon her reported p-values, which of her coefficient estimates are statistically significant at a 5% error level? Which of her coefficient estimates are statistically significant at a 1% error level?

2B. Do you have any concerns with her reported regression results? If so, explain.

Answer Questions 3 through 5 using the data posted online at:

3. You have been hired by Jim Highland Homes (a custom home builder operating in northern Georgia, northeastern Alabama, northwestern South Carolina, and southwestern North Carolina) to conduct an analysis to determine the factors influencing the price of homes. Specifically, you are given the data contained in the worksheet titled Data for Question 3. This dataset contains observations on Selling Price, Square Footage, and Lot Size (in acres), for a sample of 88 recently sold new homes in a market where Jim Highland Homes is considering starting a new development. Some of these properties were also on either a waterfront lot or a golf course lot, as indicated in the dataset.

3A. Determine the value of Sample Mean, Sample Standard Deviation, Sample Maximum, and Sample Minimum for each of the variables in this dataset.

3B. Run a regression on the equation \( (price) = b_0 + b_1 (SqFootage) + b_2 (LotSize) + b_3 (Waterfront) + b_4 (GolfCourse) \) and state the estimated coefficient values for this regression.

3C. Based upon the estimated coefficient values, how much of a premium are people willing to pay for a Waterfront Lot? How much of a premium are people willing to pay for a Golf Course Lot?

3D. Which coefficient estimates are statistically significant at a 10% error level? Which coefficient estimates are statistically significant at a 1% error level?

4. Mo, Caleb, and Gene have been hired by the U.S. Federal Trade Commission to conduct a study on the impact of market power on the pricing patterns of firms. They have been provided with the data in the worksheet titled Data for Question 4. This dataset contains observations on Price, Marginal Cost, the value of C4, and the value of HHI for 100 firms operating in 9 different industries within the U.S.

4A. Mo claims, "I know from my economics classes that firms with substantial market power charge higher prices than firms with less market power. Since C4 is a good measure of market power, we should run a regression on the equation \( (price) = b_0 + b_1 (C4) \). I am very confident that we will get good results, with \( \hat{b}_1 > 0 \)." Run the regression suggested by Mo. Based upon the resulting value of \( R^2 \) and the resulting p-values, would Mo obtain the results that he expects? Explain.

4B. Caleb says, "It is true that firms with substantial market power charge higher prices than firms with less market power. But, C4 is not a good measure of market power; HHI is a superior measure. We should run a regression on the equation \( (price) = b_0 + b_1 (HHI) \). For this regression we are sure to get good results, with \( \hat{b}_1 > 0 \)." Run the regression suggested by Caleb. Based upon the resulting value of \( R^2 \) and the resulting p-values, would Caleb obtain the results that he expects? Explain.

4C. Gene storms out of the room yelling, "I can't work with you idiots. IEPR! IEPR!!! Don't you remember anything from your economics classes!? With this data, if you are going to run a regression it should be on an equation along the lines of either \( \frac{p - MC}{p} \cdot 100 = b_0 + b_1 (C4) \) or \( \frac{p - MC}{p} \cdot 100 = b_0 + b_1 (HHI) \). IEPR!!! IEPR!!!!!" What is this "IEPR" that Gene is ranting about? Run the regressions suggested by Gene. Based upon the resulting values of \( R^2 \) and the resulting p-values, are the results of these regressions better than those suggested in parts (4A) and (4B)?

4D. Using the results of the first regression suggested by Gene, what would be the impact on firm pricing of a change in market structure that increases the value of C4 by (5)? Explain.

5. Professor Tufnel teaches an introductory marketing class at a small university near Des Moines, Indiana. He has been accused of gender discrimination (specifically, of giving female students lower grades than male students). Using the data in the worksheet titled Data for Problem 5, you need to evaluate the validity of this accusation. This spreadsheet provides a summary of the Semester Average, Combined SAT Score, Age (a dummy variable indicating if the student is over the age of 25), Gender (a dummy variable indicating if the student is male), and Major of each of the 61 students enrolled in his class during the most recent semester.

5A. Determine the Mean of Semester Average for male students and for female students. How do these two values compare to each other?

5B. Run a regression on the equation \( (SemAvg) = b_0 + b_1 (SAT) + b_2 (Over25) + b_3 (Male) \). Based upon the results of this regression, is there evidence of gender discrimination? Is the difference in assigned grades between genders statistically significant at a 1% error level? Explain.

5C. After receiving a report of your results from the regression in part (5B), Professor Tufnel discussed your findings with Professors St. Hubbins and Smalls, two econometricians in his college. They think that there is a major error with the analysis above. They suggest that a regression should be run on the equation \( (SemAvg) = b_0 + b_1 (SAT) + b_2 (Over25) + b_3 (Male) + b_4 (Bus) \), where (Bus) is a dummy variable indicating whether the student is majoring in one of the three business majors (Economics, Finance, or Marketing) offered by their college. (To assist in the construction of this dummy variable, the business majors have been color-coded light green in Column E of the spreadsheet.) After running the regression suggested by Professors St. Hubbins and Smalls, does there appear to be any evidence of gender discrimination? Explain.

5D. Determine the value of the correlation coefficient between each pair of the variables (SAT), (Over25), (Male), and (Bus). Based upon these values, explain the apparent discrepancy between the regression results from (5B) and the regression results from (5C).

Answers to Multiple Choice Questions:

1. C
2. C
3. D
4. A

Answers to Problem Solving or Short Answer Questions:

1. Based upon the reported Descriptive Statistics, there appear to be some errors in his dataset. First, the reported minimum values for Salary and Age are each negative. These values do not make sense, since each of these variables should always be positive in value. Second, AdvDeg is a dummy variable, which should only take on a value of either (0) or (1). Thus, the reported maximum value of (15) cannot be correct. Finally, the reported maximum value for Experience is (94). Since this variable is measuring number of years of teaching experience, this reported value is most certainly a mistake.

2A. Based upon the reported p-values, her estimates for \( b_0 \), \( b_1 \), and \( b_3 \) are statistically significant at a 5% error level (while the estimates for \( b_2 \) and \( b_4 \) are not). Further, her estimates for \( b_1 \) and \( b_3 \) are statistically significant at a 1% error level (while the estimates for \( b_0 \), \( b_2 \), and \( b_4 \) are not).

2B. Her reported value for \( R^2 \) is greater than one (the exact reported value is lost in this transcription). The mathematical upper bound for \( R^2 \) is a value of (1). Thus, there would seem to be some sort of error with her reported results.

3A.
[Descriptive Statistics table: mean, std dev, max, and min of Price, Sq Footage, Lot Size, Waterfront Lot, and Golf Course Lot. Surviving fragments: mean 43,464.98, ; std dev 58, ; max 40,950 4, ; min 143,800 1,]

3B. The estimated coefficients are \( \hat{b}_0 \), \( \hat{b}_1 \), \( \hat{b}_2 \), \( \hat{b}_3 \), and \( \hat{b}_4 \) (surviving fragments of the values: \( \hat{b}_0 \) 3,…; \( \hat{b}_2 \) 0,…; \( \hat{b}_3 \) 8,…; \( \hat{b}_4 \) 6,…).

3C. These results imply that a home on a Waterfront Lot will sell for a premium equal to \( \hat{b}_3 \) (printed as $8,17.1 in this transcription), while a home on a Golf Course Lot will sell for a premium equal to \( \hat{b}_4 \) (printed as $6,…).

3D. Based upon the obtained p-values, the estimates for \( b_1 \), \( b_2 \), \( b_3 \), and \( b_4 \) are statistically significant at a 10% error level (while the estimate for \( b_0 \) is not). Further, only the estimates for \( b_1 \) and \( b_3 \) are statistically significant at a 1% error level.
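The kind of screening used in answer 1 above (negative values for variables that must be positive, a dummy outside 0 and 1, an implausibly large maximum) can be automated. The sketch below is one possible way to do so; the function name `check_summary`, its parameters, the 60-year plausibility cap, and the example minimum salary are all assumptions made for illustration, while the maximum values of 15 and 94 are the ones cited in the answer.

```python
def check_summary(name, minimum, maximum, *, is_dummy=False,
                  nonnegative=True, upper=None):
    """Return a list of warnings based on a variable's sample min and max."""
    problems = []
    if is_dummy:
        # A dummy variable should only take the values 0 and 1
        if minimum < 0 or maximum > 1:
            problems.append(f"{name}: dummy variable outside the range 0 to 1")
    elif nonnegative and minimum < 0:
        problems.append(f"{name}: negative minimum ({minimum})")
    if upper is not None and maximum > upper:
        problems.append(f"{name}: maximum ({maximum}) exceeds plausible bound ({upper})")
    return problems

# Max values 15 and 94 are from the answer; the other numbers are hypothetical
warnings = (
    check_summary("Salary", -8000, 70000)
    + check_summary("AdvDeg", 0, 15, is_dummy=True)
    + check_summary("Experience", 0, 94, upper=60)
    + check_summary("Male", 0, 1, is_dummy=True)
)
for w in warnings:
    print(w)
```

The same idea applies to regression output: a reported R Square above 1, as in answer 2B, signals an error before any coefficient is interpreted.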

4A. For this regression, the value of \( R^2 \) is low (the exact value is lost in this transcription). The p-values of (.14969) and (.55740) imply that neither \( \hat{b}_0 \) nor \( \hat{b}_1 \) is statistically significant. So, no, the results of this regression are not good.

4B. For this regression, the value of \( R^2 \) is again low (the exact value is lost in this transcription). The p-value of (.055) implies that \( \hat{b}_1 \) is not statistically significant. So again, no, the results of this regression are not good.

4C. Gene is ranting about the "Inverse Elasticity Pricing Rule." Recall, this rule states that in order to be maximizing profit, a firm must be operating where

\( \dfrac{p - MC}{p} = \dfrac{1}{|\varepsilon_p|} \)

That is, where the markup of price over Marginal Costs (as a percentage of price) is equal to the inverse of the absolute value of Price Elasticity of Demand. Since firms with more market power would tend to face demand for their output that is less elastic (so that the inverse of the absolute value of elasticity is greater in value), we could reasonably expect there to be a positive relation between either C4 or HHI (recall, these are measures of market structure for which a larger value corresponds to a market that is less competitive, in which case firms have more market power) and any increasing function of \( \frac{p - MC}{p} \). By considering \( \frac{p - MC}{p} \cdot 100 \), Gene is simply suggesting that this percentage markup be stated in such a way to make the values be between (0) and (100).

For the regression on \( \frac{p - MC}{p} \cdot 100 = b_0 + b_1 (C4) \), we obtain \( R^2 = .78061 \), along with p-values of (.00574) and (4.76E-34). These results are much better than those in part (4A). Finally, \( \hat{b}_1 > 0 \) (the exact value is lost in this transcription), suggesting a positive relation between (C4) and the percentage markup (as expected).

For the regression on \( \frac{p - MC}{p} \cdot 100 = b_0 + b_1 (HHI) \), we obtain \( R^2 = .7878 \), along with p-values of (5.0E-10) and (9.E-35). These results are much better than those in part (4B). Finally, \( \hat{b}_1 > 0 \) (the exact value is lost in this transcription), suggesting a positive relation between (HHI) and the percentage markup (as expected).

4D. If there were a change in market structure causing C4 to increase in value by (5), we see that, using the value of \( \hat{b}_1 \) from the results of the first regression in part (4C), firms in the industry would increase their expected percentage markup by approximately \( 5 \hat{b}_1 \) (the numerical value is lost in this transcription).

5A. There are a total of 35 male students in the sample. These students have a semester average of \( \frac{2{,}810}{35} \approx 80.29 \). There are a total of 26 female students in the sample. These students have a semester average of \( \frac{1{,}999}{26} \approx 76.88 \). Thus, a simple comparison of sample means between genders shows that the mean semester average of male students is higher than that of female students.

5B. Running a regression for \( (SemAvg) = b_0 + b_1 (SAT) + b_2 (Over25) + b_3 (Male) \), we obtain a positive \( \hat{b}_3 \) (the exact value is lost in this transcription). Based upon the p-value for this estimated coefficient (of .0019), this estimate is statistically significant at the 1% error level. Thus, these results would seem to provide evidence of gender discrimination, since the expected semester average of a male student is \( \hat{b}_3 \) points above that for a female student, even after controlling for SAT Score and Age.

5C. Running a regression for \( (SemAvg) = b_0 + b_1 (SAT) + b_2 (Over25) + b_3 (Male) + b_4 (Bus) \), we obtain an estimate \( \hat{b}_3 \) with a p-value of .8345. Based upon this p-value, gender no longer has a statistically significant impact on semester average. That is, once we control for SAT Score, Age, and Major, there no longer appears to be a difference in grades between male and female students.

5D. The numerical values of the six relevant correlation coefficients are:

[Correlation table for SAT, Over25, Male, and Business. The numerical entries did not survive transcription.]

Note that there is a strong, positive correlation between being male and being a business major. The regression results from (5C) suggest that while semester averages in this marketing course do not differ between male and female students, there is a substantial, statistically significant difference in performance between business majors and non-business majors (the coefficient attached to (Bus), \( \hat{b}_4 \), has a p-value on the order of E-06). When the dummy variable identifying college major is left out of the regression (as was done in part (5B)), the results suffer from an omitted variable bias, since the estimated coefficient for (Male) (with a p-value of .0019) is partly capturing this difference in performance resulting from chosen major. In summary, once we control for SAT Score, Age, and Major, there is no longer any evidence of gender discrimination.
Perhaps a better explanation is simply that students who choose to major in a business discipline are likely to be more interested in and perform better in a marketing class (compared to students who have chosen to major in a non-business discipline).
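For part (5D), each pairwise correlation coefficient follows the definition used earlier in these notes, cov(X, Y) divided by the product of the two sample standard deviations. The sketch below implements that definition directly; the function name `correlation` and the two dummy series are hypothetical and are not the Problem 5 data.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient: cov(X, Y) / (s_X * s_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (s_x * s_y)

# Hypothetical dummies: most men are business majors, most women are not
male = [1, 1, 1, 1, 0, 0, 0, 0]
bus = [1, 1, 1, 0, 0, 0, 0, 1]

r = correlation(male, bus)
print(r)
```

A value well above zero for the (Male, Bus) pair is the warning sign that leaving (Bus) out of the regression will load its effect onto the coefficient for (Male).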