ECO 22000 McRAE
SELF-TEST: SIMPLE REGRESSION

Note: Questions marked (N) are unlikely to appear in this form on an in-class examination, but you should be able to describe the procedures used to get an answer and to interpret the answers.

1. What are the assumptions involved in simple linear regression?
2. What line does the method of least squares actually find?
3. What information might we get from a scatter plot of y against x?
4. Describe how to use Excel to create a scatter diagram.
5. Describe how to use Excel to calculate a regression line.
6. The regression equation of starting salary on GPA for a sample of recent graduates of RCCC is salary = 8000 + 3500 * GPA. Randy just graduated with a GPA of 2.6; what starting salary would the regression equation predict for him?
7. For a cross-section of companies, a marketing analyst regressed sales on advertising expenditures, resulting in the following Excel output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9
R Square             0.81
Adjusted R Square    0.80
Standard Error       100
Observations         45

ANOVA
              df    SS        MS       F     Significance F
Regression    1     48000     48000    16    0.000245081
Residual      43    129000    3000
Total         44    173000

               Coefficients    Standard Error    t Stat    P-value
Intercept      400             75                5.33      3.41301E-06
Advertising    0.5             0.125             4         0.000245081

   a) Write out the regression equation, showing sales as a function of advertising expenditures.
   b) Give a point prediction for sales for a company whose advertising expenditures equal $7,000.
   c) Give a 95% confidence interval for the average sales level for a company spending $2,000 on advertising. Assume the mean advertising expenditure = $4,000.
   d) Give a 95% confidence interval for a specific value of y for a company spending $2,000 on advertising, with x̄ = $4,000.
   e) Explain why the intervals in c) and d) are not the same.
Stats II, Regression, page 2

8. What shape do confidence intervals for y values at given x values have? What does this imply about predicted values far from the mean value of x?
9. Say whether the following statement is true or false and explain your answer: If a regression equation has a high r², statisticians see no problem with making extrapolations well beyond the observed range of x and y values.
10. What does the coefficient of correlation measure? How is it related to a regression line?
11. Find the coefficient of correlation between x and y:

    x    y
    2    5
    1    7
    6    3

12. To test whether a correlation between x and y is significant, we should test the null hypothesis ______ with alternative hypothesis ______; the test statistic is a ______ with ______ d.f.
13. Describe three different ways to find the correlation coefficient using Excel.
14. Comment on the following: Among the industrial nations, there is a negative correlation between average medical expenditures and life expectancy; this proves that medical care causes people to live shorter lives.
15. r² is called the ______; it is interpreted as giving the ______ in y which is ______ by variation in x.
16. Generally speaking, what does r-squared tell us about a regression equation?
17. ART's engineers regressed production costs on output and found the regression equation: cost = 4000 + 2 * output. In the regression results, s_y.x = 1800 and s_b = 0.6; the regression was based on a sample of 40 days' output and costs. Give a 98% confidence interval for β₁.
18. Using the data of the preceding question, formulate and conduct an appropriate test for the significance of the regression coefficient.
19. The following Excel output was generated by regressing percentage rates of inflation on percentage rates of increase in the money supply:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.7
R Square             0.49
Adjusted R Square    0.46
Standard Error       1
Observations         62

ANOVA
              df    SS      MS    F    Significance F
Regression    1     900
Residual      60    6000
Total         61    6900

                Coefficients    Standard Error    t Stat    P-value
Intercept       -1              0.2               -5        0.00032
X Variable 1    1.2             0.4
   a) What is the simple correlation coefficient between prices and money?
   b) In a t test of H₀: ρ = 0, what is the calculated value of t?
   c) In a t test of H₀: β₁ = 0, what is the calculated value of t? At α = 0.01, what should we do with the null hypothesis?
   d) In an ANOVA test of this regression equation, what is the critical value of F for α = 0.025? (Use FINV to find the critical value.)
   e) What is the calculated value of F in an ANOVA test? Should we accept or reject the null hypothesis of no linear relation between money and inflation?
20. In a regression ANOVA table, how are the following terms defined? Regression sum of squares; residual sum of squares; total sum of squares. What does each represent?
21. In a regression of managers' salaries on firm size, researchers estimated the equation salary = 20000 + 5000 * sales, where sales were measured in millions of dollars. Observation number 42 works at a firm with annual sales of 8 million dollars, and he makes $53,000 a year. What is the residual for observation 42?
22. How could a graph of the residuals from a regression equation help in determining whether ε is normally distributed?
23. How might you use a histogram of the residuals from a regression equation?

A CPA has gathered the following data for a sample of twelve corporations:

Observation #    Long-Term Assets    Long-Term Debt
1                54                  28
2                47                  26
3                60                  39
4                56                  43
5                64                  24
6                26                  16
7                47                  30
8                69                  38
9                62                  43
10               45                  24
11               48                  36
12               39                  20

24. (N) Suppose that we wish to know whether acquiring long-term assets is done primarily by acquiring long-term debt.
   a) Designating assets as y and debt as x, use your spreadsheet to find the regression equation of assets on debt; state this equation in algebraic notation.
   b) What does the x coefficient tell you about the relation between assets and debt?
   c) What is the correlation between assets and debt? Use a t test to find whether we can consider this significant.
   d) Use an appropriate t test to test whether the slope of the regression line can be considered different from 0; set your significance level at 5%.
   e) At 1% significance, use ANOVA to test H₀: there is no significant linear relation between assets and debt.
   f) Make a point prediction of assets for a corporation which has 25 million dollars of long-term debt.
   g) Give a 95% prediction interval for the assets of a corporation with 25 million dollars of debt.
   h) Give a 95% confidence interval for the average of all corporations that have 25 million dollars of debt.
   i) Compute and interpret the residual for observation #9.
   j) Give a 90% confidence interval for the value of β.
25. What would you look for in a residual plot that would be a clue to the presence of each of the following conditions?
   a) non-normality of the residuals
   b) heteroscedasticity
   c) non-linearity of the relation between x and y
   d) autocorrelation
26. In the ANOVA table, the regression sum of squares is defined as SSR = Σ(ŷ - ȳ)²; explain why that represents the variation in y which is explained by variation in x.
27. The residual sum of squares, or error sum of squares, is defined as SSE = Σ(y - ŷ)²; explain why this term represents the variation in y which is NOT explained by variation in x.
28. r² is defined as r² = 1 - Σ(y - ŷ)² / Σ(y - ȳ)². Explain how this definition leads to the interpretation usually given of r².
29. What condition is indicated by each of the following residual plots?
   A.   B.   C.   D.   (residual plots not reproduced)
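As an illustration of the definitions in questions 26 to 28, the following Python sketch fits the least-squares line to the twelve-corporation data of question 24 and computes the three sums of squares. The function name simple_regression and the dictionary keys are illustrative choices, not part of the course materials; this is a sketch of the arithmetic under those assumptions, not a replacement for the Excel procedure the questions ask you to describe.

```python
# Data from question 24: long-term debt (x) and long-term assets (y),
# listed in observation order 1 to 12.
debt = [28, 26, 39, 43, 24, 16, 30, 38, 43, 24, 36, 20]
assets = [54, 47, 60, 56, 64, 26, 47, 69, 62, 45, 48, 39]


def simple_regression(x, y):
    """Least-squares fit of y on x, plus the ANOVA sums of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    return {
        "slope": slope,
        "intercept": intercept,
        "ssr": ssr,
        "sse": sse,
        "sst": sst,
        "r2": 1 - sse / sst,  # definition from question 28
    }


fit = simple_regression(debt, assets)
print(f"y-hat = {fit['intercept']:.2f} + {fit['slope']:.2f} x")
print(f"SSR = {fit['ssr']:.2f}, SSE = {fit['sse']:.2f}, SST = {fit['sst']:.2f}")
print(f"r squared = {fit['r2']:.3f}")
```

Because SSR + SSE = SST for a least-squares line with an intercept, 1 - SSE/SST equals SSR/SST, which is why r² reads as the proportion of the variation in y explained by variation in x.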
SELF-TEST: MULTIPLE REGRESSION

1. Marketing researchers at ART, Inc., have regressed their sales on Gross Domestic Product and their own advertising expenditures with the following result:

   Sales = 400,000 + 4,000 GDP + 7,000 A

   a) What could we predict ART's sales to be if GDP = 6.5 trillion and advertising expenditures = 20 million?
   b) If GDP rose to 6.8 trillion, by how much would we expect sales to change?
   c) ART wishes to increase its unit sales by 21,000; by how much will they need to increase their advertising budget?
2. Why is the use of adjusted R² preferred to the use of plain R² in multiple regression? What is it we're adjusting for?
3. When is it important to use adjusted R²? When is it not important?
4. R² can be thought of as the proportion of ______ in y which is ______ by ______ in the x's. State the definition of R² and explain why that definition leads to this interpretation.
5. In performing a t test on a coefficient from multiple regression, what null and alternative hypotheses are we testing?

The following Excel output is for questions 6 to 12:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.774597
R Square             0.6
Adjusted R Square    0.52
Standard Error       10.00
Observations         16

ANOVA
              df    SS      MS    F    Significance F
Regression    3     1800
Residual      12    1200
Total         15    3000

                Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept       20.00           10                2         0.06866
X Variable 1    10.00           5
X Variable 2    5.00            1.5
X Variable 3    3.00            0.5

6. What is the regression equation?
7. How many degrees of freedom are there in the t Stats?
8. According to the t ratios, which of the regression coefficients would be significant at the 5% level? Which at the 10% level?
9. What is the F ratio? What null hypothesis would be tested with this value? At α = 0.01, can we reject the null hypothesis? Can we reject at α = 0.05?
10. Suppose x₁ = 6, x₂ = 0, and x₃ = 2; what is ŷ?
11. Observation #8 had x₁ = 8, x₂ = 2, and x₃ = 4; for that observation, y = 108. What is the residual for this observation?
12. Find a 95% confidence interval for β₁, the coefficient on variable X₁.
13. What is a dummy variable?
14. A marketing researcher has created a dummy variable for "Owns own home." John lives in an apartment; what value will this dummy have for him? Mary is paying off the mortgage on her condominium; what value will this dummy have for her?
15. In a regression of monthly entertainment expenditures on several things, the coefficient on the dummy of q. 14 had the value -$21. Explain the meaning of this number.
16. What is multicollinearity? How can we detect it?
17. What are the effects of multicollinearity in regression analyses?
18. Suppose the relation between x and y is not linear: how could you detect this nonlinearity?
19. (N) A researcher wishes to be able to predict the number of movies attended in a year's time on the basis of four explanatory variables: age, education, income, and sex. A sample of ten people yields the following data:

No. of Movies    Age    Education    Income    Sex Dummy (Male = 1)
25               18     11           35        1
12               35     13           38        0
21               21     14           35        1
9                35     16           50        0
18               25     14           36        0
27               21     13           39        1
4                39     13           37        0
17               31     12           34        0
17               20     14           41        1
7                40     12           29        0

   a) Using your spreadsheet, find the regression equation and write it out in algebraic notation.
   b) Explain what each of the regression X coefficients means.
   c) Using an appropriate t test, at 5% significance test H₀: βᵢ = 0 for i = 1 to 4.
   d) What is the adjusted R²? How would we interpret that number? Why is there so much difference in this case between R² and adjusted R²?
   e) Using ANOVA, state and test the appropriate null hypothesis to test whether there is a significant linear relation among these variables.
   f) Predict how many movies will be seen by a 37-year-old female high-school graduate whose family income is $43,000 a year.
   g) State the 95% confidence interval for each X coefficient.
   h) Calculate a 98% confidence interval for β₂.
   i) Find the residual for the first observation (25 movies, age 18, and so on).
   j) In examining the residual plots generated by Excel, do you detect any problems or violations of the regression assumptions?
   k) Does there appear to be significant multicollinearity among the X variables? How do you know that?
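For readers who want to see what the spreadsheet does in question 19, here is a minimal Python sketch that estimates the coefficients by solving the normal equations (X'X)b = X'y on the ten observations listed in that question. The helper solve and all variable names are illustrative assumptions, and Gaussian elimination is used only to keep the example free of external libraries; Excel's regression tool may use a different numerical method.

```python
# Rows from question 19: (movies, age, education, income, male dummy).
rows = [
    (25, 18, 11, 35, 1),
    (12, 35, 13, 38, 0),
    (21, 21, 14, 35, 1),
    (9, 35, 16, 50, 0),
    (18, 25, 14, 36, 0),
    (27, 21, 13, 39, 1),
    (4, 39, 13, 37, 0),
    (17, 31, 12, 34, 0),
    (17, 20, 14, 41, 1),
    (7, 40, 12, 29, 0),
]

y = [float(r[0]) for r in rows]
# Leading 1.0 in each row of X produces the intercept term.
X = [[1.0] + [float(v) for v in r[1:]] for r in rows]


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, size))
        z[r] = (m[r][size] - tail) / m[r][r]
    return z


k = len(X[0])  # number of estimated coefficients, intercept included
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
coef = solve(xtx, xty)  # order: intercept, age, education, income, male

fitted = [sum(b * v for b, v in zip(coef, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

n = len(y)
y_bar = sum(y) / n
sse = sum(e ** 2 for e in residuals)
sst = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)  # penalizes the four regressors

for name, b in zip(["intercept", "age", "education", "income", "male"], coef):
    print(f"{name:>10}: {b:.3f}")
print(f"R squared = {r2:.3f}, adjusted R squared = {adj_r2:.3f}")
```

With only ten observations and five estimated coefficients, the factor (n - 1)/(n - k) is 9/5, which is why R² and adjusted R² differ noticeably in part d).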
Selected Answers

Simple Regression:
6. 17,100
7. a. sales = 400 + 0.5 adv
   b. 3900
   c. 1400 ± 505.07
   d. 1400 ± 543.84
11. -0.945
17. 2 ± 1.4574
18. H₀: β = 0; t = 3.33; p-value = 0.0019
19. a. 0.7
   b. 7.59
   c. 3; reject
   d. 5.29
   e. 9; reject
21. -$7,000
24. a. y-hat = 22.62 + 0.94 X
   b. for each one-dollar increase in debt, assets increase by 94 cents
   c. 0.71; since p value = 0.0092, we can reject at 1% significance the hypothesis that population correlation = 0.
   d. for α = 0.05, critical t = 2.228 < calculated 3.219, so reject the null that β = 0. (Alternatively, since p < 0.05, reject.)
   e. Critical F = 10.04 < 10.359, so reject null and conclude there is a significant relation. (Alternatively, in ANOVA table p < 0.01, so reject null.)
   f. 46.16
   g. 46.16 ± 20.71
   h. 46.16 ± 6.72
   i. -1.107
   j. 0.41 ≤ β₁ ≤ 1.47
29. a. nothing in particular
   b. autocorrelation
   c. non-linearity
   d. heteroscedasticity

Multiple Regression:
1. 566,000; +1,200; $3 million
6. ŷ = 20 + 10x₁ + 5x₂ + 3x₃
7. 12
8. β₂ and β₃ at 5%; all at 10%
9. F = 6; with 3 and 12 d.f., F.01 = 5.95, so reject H₀ at 1% and 5%
10. 86
11. -14
12. 10 ± 10.89
14. 0; 1
15. homeowners typically spend $21 a month less on entertainment
19. a. movies = 56.71 - 0.93 × age - 1.30 × educ + 0.096 × inc - 2.28 × male
   b. movies attended falls by 0.93 for each year age increases, falls by 1.3 for each extra year of education, and increases by about 0.1 for each extra thousand dollars of family income; other things being equal, males attend 2.28 fewer movies a year than females
   c. reject H₀ for β₁ since p = 0.024; fail to reject for i = 2 to 4 since all p values > 0.05
   d. Adj. R² = 0.77; these four variables explain 77% of the observed variation in movie attendance.
   e. H₀: β₁ = β₂ = β₃ = β₄ = 0 vs. H₁: at least one equality not true. F = 8.549 with p value = 0.018, so at 2% significance we reject null and conclude there is a significant linear relation with at least one of the x variables.
   f. y-hat = 10.82
   g. see the Lower 95% and Upper 95% columns of the output
   h. 3.37 ± 4.86
   i. since y-hat = 26.72, residual = -1.72
   j. no
   k. yes; education is highly correlated with income and sex with age; use Data Analysis Correlation tool
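Several of the selected answers follow from short arithmetic on numbers given in the questions. The Python sketch below works through a few of them; the variable names are illustrative, and every input is copied from the question text or the Excel output it quotes. A residual is computed as actual minus predicted, so it is negative when the observation lies below the fitted line.

```python
import math

# Simple regression, question 11: correlation for the three (x, y) pairs.
x = [2, 1, 6]
y = [5, 7, 3]
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)
r_q11 = s_xy / math.sqrt(s_xx * s_yy)

# Simple regression, question 19: values read from the Excel output.
r, n = 0.7, 62
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # part b
t_slope = 1.2 / 0.4                                    # part c: coefficient / std. error
f_ratio = (900 / 1) / (6000 / 60)                      # part e: MSR / MSE

# Simple regression, question 21: residual = actual - predicted.
residual_q21 = 53000 - (20000 + 5000 * 8)

# Multiple regression, questions 10 and 11: y-hat = 20 + 10 x1 + 5 x2 + 3 x3.
def y_hat(x1, x2, x3):
    return 20 + 10 * x1 + 5 * x2 + 3 * x3

prediction_q10 = y_hat(6, 0, 2)
residual_q11 = 108 - y_hat(8, 2, 4)

print(f"Q11 r = {r_q11:.3f}")
print(f"Q19 t (correlation) = {t_corr:.2f}, t (slope) = {t_slope:.2f}, F = {f_ratio:.2f}")
print(f"Q21 residual = {residual_q21}")
print(f"Multiple regression Q10 y-hat = {prediction_q10}, Q11 residual = {residual_q11}")
```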