Regression through the Origin

Original Article. Blackwell Publishing Ltd, Oxford, UK. Teaching Statistics, ISSN 0141-982X. Volume 25, Number 3, Autumn 2003.

KEYWORDS: Teaching; Regression; Analysis of variance; Statistical software packages.

Joseph G. Eisenhauer
Canisius College, Buffalo, USA.
e-mail: eisenhauer@canisius.edu

Summary
This article describes situations in which regression through the origin is appropriate, derives the normal equation for such a regression and explains the controversy regarding its evaluative statistics. Differences between three popular software packages that allow regression through the origin are illustrated using examples from previous issues of Teaching Statistics.

INTRODUCTION
Although ordinary least-squares (OLS) regression is one of the most familiar statistical tools, far less has been written, especially in the pedagogical literature, on regression through the origin (RTO). Indeed, the subject is surprisingly controversial. The present note highlights situations in which RTO is appropriate, discusses the implementation and evaluation of such models and compares RTO functions among three popular statistical packages. Some examples gleaned from past Teaching Statistics articles are used as illustrations. For expository convenience, OLS and RTO refer here to linear regressions obtained by least-squares methods with and without a constant term, respectively.

MODEL SELECTION: WHEN IS RTO APPROPRIATE?
Textbooks rarely discuss RTO other than to caution against dropping the constant term from a regression, on the grounds that imposing any such restriction can only diminish the model's fit to the data. There are, however, circumstances in which RTO is appropriate or even necessary. First, RTO may be unavoidable if transformations of the OLS model are needed to correct violations of the Gauss-Markov assumptions. Consider, for example, the simple linear regression of $Y$ on $x$
\[ Y_i = \beta_0 + \beta_1 x_i + e_i \tag{1} \]
where $\beta_0$ is the intercept, $\beta_1$ is the slope and $e_i$ denotes the $i$th residual. Lagging observations and taking first differences (i.e.
subtracting each observation from its successor) to correct for serial correlation in the errors requires transforming equation (1) into an RTO equation of the form
\[ Y_i - Y_{i-1} = \beta_1 (x_i - x_{i-1}) + (e_i - e_{i-1}) \]
Alternatively, applying weighted least squares to correct for heteroscedasticity will result in a model with no intercept if the weighting factor ($z_i$) is not an independent variable. In that case, $\beta_0$ becomes a coefficient and equation (1) is replaced by a multiple linear regression without a constant:
\[ Y_i/z_i = \beta_0 (1/z_i) + \beta_1 (x_i/z_i) + (e_i/z_i) \]

Even without such transformations, however, there are often strong a priori reasons for believing that $Y = 0$ when $x = 0$, and therefore omitting the constant. Indeed, Theil (1971, p. 176) contends "From an economic point of view, a constant term usually has little or no explanatory virtues." While that may be a slight exaggeration (it is easy to find examples in which an intercept does matter), there are certainly cases in which economic theory posits the absence of a constant. The widely used Cobb-Douglas production function, for example, relates output ($Y$) to capital ($K$) and labour ($L$) according to $Y = K^{\beta_1} L^{\beta_2}$, and taking logarithms yields $\ln Y = \beta_1 \ln K + \beta_2 \ln L$; imposing a constant
on this model would imply an unrealistic ability to manufacture goods without resources. An agricultural example is provided by Chambers and Dunstan (1986), who regress sugar cane harvests on farmland acreage; clearly, if no land is cultivated, there will be no crop. Casella (1983, p. 150) suggests an engineering example in which gasoline usage is a simple linear function of vehicular weight; he reasons that, in principle, a weightless vehicle would consume no fuel, so "considering the physical constraints ... it seems most appropriate to fit a line through the origin." And Adelman and Watkins (1994) apply RTO to the valuation of mineral deposits. Of course, similar instances can be found in almost any discipline; some ornithological and nutritional examples are discussed below.

Even when theory proscribes a constant, however, careful consideration of the observed range of data is needed. As Hocking (1996, p. 177) points out, "if the data are far from the origin, we have no evidence that the linearity applies over this expanded range. For example, the response may increase exponentially near the origin and then stabilize into a near linear response in the region of typical inputs." Alternatively, observations at the origin may represent a discontinuity from an otherwise linear function with a positive or negative intercept. Under those circumstances, knowing that $Y = 0$ when $x = 0$ is insufficient justification for RTO.

If there is uncertainty regarding the appropriateness of including an intercept, several diagnostic devices can provide guidance. Most obviously, one can run the OLS regression and test the null hypothesis $H_0\colon \beta_0 = 0$ using the Student's $t$ statistic to determine whether the intercept is significant. Alternatively, Hahn (1977) suggests running the regression with and without an intercept, and comparing the standard errors to decide whether OLS or RTO provides a superior fit. And Casella (1983) suggests artificially creating an extra observation, a leverage point, that pulls the OLS regression line naturally through the origin.
Unless the data set is small and the observations cluster near the origin, any such leverage point is likely to be an outlier but, if it appears to be a plausible extrapolation of the actual data, one may conclude that RTO is an acceptable model. Unfortunately, there are infinitely many such leverage points that could be chosen for that exercise, and the reasonableness of RTO will depend on which point is used.

IMPLEMENTATION AND EVALUATION OF RTO
In one respect, RTO is merely a special case of OLS, and the absence of the constant is actually a simplification. Indeed, minimizing the sum of squared errors for the simple linear RTO model $Y_i = \beta x_i + e_i$ involves far less calculation than it does for the OLS model of equation (1). The problem
\[ \min_{\beta} \sum (Y_i - \beta x_i)^2 = \sum Y_i^2 - 2\beta \sum x_i Y_i + \beta^2 \sum x_i^2 \]
has only one normal equation or first-order condition
\[ -2 \sum x_i Y_i + 2\hat{\beta} \sum x_i^2 = 0 \]
and the easily derived second-order condition, $2\sum x_i^2 > 0$, clearly guarantees a minimum. From the normal equation, the estimated slope of the regression line is
\[ \hat{\beta} = \frac{\sum x_i Y_i}{\sum x_i^2} \]
as noted by, for example, Pettit and Peers (1991). (For weighted versions, see Turner, 1960.) Unfortunately, the RTO residuals will usually have a nonzero mean, because forcing the regression line through the origin is generally inconsistent with the best fit.

The proper method for evaluating RTO has long been disputed (see, for example, Marquardt and Snee 1974; Maddala 1977; Gordon 1981). To appreciate the controversy, note the familiar identity
\[ (Y_i - \bar{y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{y}) \tag{2} \]
where $\bar{y}$ denotes the mean of the dependent variable and $\hat{Y}_i$ is the $i$th fitted value. Squaring both sides and summing across all observations gives
\[ \sum (Y_i - \bar{y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{y})^2 + 2\sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{y}) \]
but, as is well known, the cross-product term is equal to zero in the case of OLS. The remaining terms therefore constitute the usual analysis of variance decomposition
\[ \sum (Y_i - \bar{y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{y})^2 \tag{3} \]
where the left-hand side is the sum of squares total (SST), the first term on the right is the sum of squares due to error (SSE) and the final term is the sum of squares due to regression (SSR). The coefficient of determination for OLS is then defined by the ratio of SSR to SST or equivalently
\[ R^2 = \frac{\sum (\hat{Y}_i - \bar{y})^2}{\sum (Y_i - \bar{y})^2} = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{y})^2} \tag{4} \]

Some authors maintain that because this diagnostic measure is based on an identity, it should not depend on the inclusion or exclusion of a constant term in the regression. From that perspective, equation (4) is equally valid for RTO and OLS. However, when there is no constant in the regression, $\sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{y})$ will generally take a nonzero value, so equation (3) is not a valid basis for analysis of variance in RTO. And if the RTO model provides a sufficiently poor fit, the data may exhibit more variation around the regression line than around $\bar{y}$, in which case $\sum (Y_i - \hat{Y}_i)^2 > \sum (Y_i - \bar{y})^2$. Heedlessly applying equation (4) would then result in an implausibly negative (and thus uninterpretable) coefficient of determination as well as a negative $F$ ratio.

Moreover, it is often argued that defining SST as the sum of squared deviations from the mean is inappropriate when the regression line is forced through the origin but does not necessarily pass through $(\bar{x}, \bar{y})$; when so viewed, equation (2) is replaced by the identity
\[ (Y_i - 0) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - 0) \tag{2'} \]
Squaring and summing yields
\[ \sum Y_i^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum \hat{Y}_i^2 + 2\sum \hat{Y}_i (Y_i - \hat{Y}_i) \]
but the final (cross-product) term in this equation equals zero under RTO, because
\[ \sum \hat{Y}_i (Y_i - \hat{Y}_i) = \sum \hat{\beta} x_i (Y_i - \hat{\beta} x_i) = \hat{\beta}\left[\sum x_i Y_i - \hat{\beta} \sum x_i^2\right] = \hat{\beta}\left[\sum x_i Y_i - \left(\frac{\sum x_i Y_i}{\sum x_i^2}\right) \sum x_i^2\right] = 0 \]
Thus, equation (3) is replaced by
\[ \sum Y_i^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum \hat{Y}_i^2 \tag{3'} \]
Applying equation (3') rather than equation (3) to RTO, one finds that SSE is unchanged, but $\mathrm{SST} = \sum Y_i^2$ and $\mathrm{SSR} = \sum \hat{Y}_i^2$. Redefining SST and SSR in this manner results in
\[ R^2 = \frac{\sum \hat{Y}_i^2}{\sum Y_i^2} \tag{4'} \]
a strictly non-negative coefficient of determination that equals or exceeds the measure in equation (4). Of course, these definitions also affect the adjusted $R^2$ and $F$ statistics, but do not alter the standard error of the regression ($S_e$).
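The algebra above can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up data set (the values of `x` and `y` are illustrative and are not data from this article) and hypothetical function names. It fits a line through the origin and evaluates both cross-product terms: the one from the decomposition about the mean should generally be nonzero, while the one from the decomposition about the origin should vanish up to rounding error.

```python
def rto_slope(x, y):
    """Least-squares slope of a line through the origin: sum(x*Y) / sum(x^2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def decompose(x, y):
    """Sums of squares and cross-product terms for an RTO fit."""
    b = rto_slope(x, y)
    fitted = [b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    y_bar = sum(y) / len(y)
    return {
        "slope": b,
        "sse": sum(e * e for e in resid),
        # Decomposition about the mean, equations (2) and (3).
        "sst_mean": sum((yi - y_bar) ** 2 for yi in y),
        "ssr_mean": sum((fi - y_bar) ** 2 for fi in fitted),
        "cross_mean": sum(e * (fi - y_bar) for e, fi in zip(resid, fitted)),
        # Decomposition about the origin, equations (2') and (3').
        "sst_origin": sum(yi * yi for yi in y),
        "ssr_origin": sum(fi * fi for fi in fitted),
        "cross_origin": sum(fi * e for e, fi in zip(resid, fitted)),
    }


# Illustrative data (an assumption): the best-fitting line has a nonzero intercept.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 3.9, 5.2, 5.8, 7.1]
d = decompose(x, y)

r2_mean = 1.0 - d["sse"] / d["sst_mean"]       # second form of equation (4)
r2_origin = d["ssr_origin"] / d["sst_origin"]  # equation (4')

print("slope:", d["slope"])
print("cross-product about the mean:", d["cross_mean"])
print("cross-product about the origin:", d["cross_origin"])
print("R^2 by (4):", r2_mean, " R^2 by (4'):", r2_origin)
```

On data of this kind the value from equation (4') should come out at least as large as the value from equation (4), since the two share the same SSE and the sum of squares about the origin can never be smaller than the sum of squares about the mean.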
Note that, without a constant, the degrees of freedom for SST, SSR and SSE are $n$, $k$ and $n - k$, respectively, where $n$ is the sample size and $k$ is the number of independent variables; thus, $S_e = \sqrt{\mathrm{SSE}/(n - k)}$ regardless of how SST is defined.

The controversy over SST is not merely academic: practitioners (and students) running RTO will obtain various outputs depending on which computer packages they use. Indeed, as Prvan et al. (2002, p. 74) observed in a recent comparison of Minitab, SPSS and Excel, "Obtaining a simple linear regression is easy in all three packages, and all three give the standard output options (such as regression through the origin)." But in fact the three packages all give different outputs for RTO! Two illustrative examples are provided below.

EXAMPLES
In Kimber's essay on the shape of birds' eggs, egg height is regressed on width, both with and without an intercept (Kimber 1995). Her study of 81 species can be approximately replicated using the 96 observations provided in the Data Bank section of the Summer 1990 issue of Teaching Statistics (Data Bank 1990). Regardless of which computer package is used, OLS yields the following output:

Height = 1.774 + 1.444 Width
         [0.001]
$S_e$ = .13, $F$ = 570.076, $R^2$ = 0.984, adjusted $R^2$ = 0.984
where two-tailed p-values are shown in brackets below the estimates. Notice that the intercept is statistically significant; although it is, of course, impossible for an egg to have a zero width, the intercept may nevertheless be important, as it represents the extrapolation of the regression line back to the vertical axis. The effect of removing the intercept can be seen by running RTO. If Excel is used, as it was by Kimber, RTO yields

Height = 1.38 Width
$S_e$ = .51, $F$ = 5073.616, $R^2$ = 0.9816, adjusted $R^2$ = 0.9711

which indicates a poorer fit by all diagnostic measures: the standard error, $F$ and $R^2$ (adjusted and unadjusted). However, the SPSS linear regression procedure without an intercept yields

Height = 1.38 Width
$S_e$ = .51, $F$ = 4,83.995, $R^2$ = 0.996, adjusted $R^2$ = 0.996

Notice that the regression equation and standard error are the same in the two programs, but the $F$ and $R^2$ statistics are different. Indeed, in SPSS these statistics seem to indicate a better fit without the intercept than with it. The discrepancy between software packages arises because Excel is based on equations (3) and (4), while the RTO function in SPSS uses equations (3') and (4'). The SPSS output, however, is accompanied by the disclaimer "For regression through the origin (the no-intercept model), R Square measures the proportion of the variability in the dependent variable about the origin explained by regression. This CANNOT be compared to R Square for models which include an intercept" [emphasis in original].

To make matters more confusing still, SPSS offers a nonlinear regression option, which requires a model statement and initial parameter values. If one uses the nonlinear option but specifies the linear model and a reasonable initial value for the slope, this option yields results identical to those for Excel; that is, it applies equation (4) to compute $R^2$! Meanwhile, the Minitab option for RTO gives the same regression equation and standard error as Excel and SPSS, but reports neither the $F$ nor the $R^2$ statistic. However, Minitab's ANOVA table, from which $F$ and $R^2$ would be derived, is based on equation (3').
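The pattern described above (identical slope and standard error, different $F$ and $R^2$) can be reproduced without any of the three packages. The sketch below is plain Python with made-up width and height values (not the Data Bank observations) and a hypothetical function name. It assumes $k$ and $n - k$ degrees of freedom for regression and error under both conventions, and it takes SSR as SST minus SSE in each case; the exact internal formulas of each package are not documented here, so this only illustrates the two definitions of SST.

```python
import math


def rto_diagnostics(x, y):
    """Fit Y = b*x by least squares and evaluate it under both SST conventions."""
    n, k = len(y), 1  # one regressor, no constant
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    sse = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    y_bar = sum(y) / n
    sst_mean = sum((yi - y_bar) ** 2 for yi in y)  # SST as in equation (3)
    sst_origin = sum(yi * yi for yi in y)          # SST as in equation (3')

    def summary(sst):
        ssr = sst - sse  # assumption: SSR taken as SST - SSE
        return {"r2": ssr / sst, "f": (ssr / k) / (sse / (n - k))}

    return {
        "slope": b,
        "s_e": math.sqrt(sse / (n - k)),  # does not depend on how SST is defined
        "about_mean": summary(sst_mean),
        "about_origin": summary(sst_origin),
    }


# Illustrative data (an assumption), roughly height = 2 + 1.44 * width.
width = [20.0, 25.0, 30.0, 35.0, 40.0]
height = [30.5, 38.4, 45.0, 52.9, 59.3]

result = rto_diagnostics(width, height)
print("slope:", result["slope"], " S_e:", result["s_e"])
print("SST about the mean:  ", result["about_mean"])
print("SST about the origin:", result["about_origin"])
```

Because only the definition of SST changes between the two summaries, a single slope and a single $S_e$ are reported, while $R^2$ and $F$ are reported twice.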
Because Excel and the nonlinear option in SPSS apply equation (4) regardless of whether an intercept is present, it is easy (and perhaps instructive) for students to construct examples that generate negative $R^2$ and $F$ statistics for regressions through the origin using these packages. (One need only construct a line with a large intercept and then estimate it without the intercept.) Extreme cases of that sort can provide a springboard for discussion, and make a compelling argument for using equation (4') rather than equation (4) to evaluate RTO.

The same issues arise, of course, in multiple linear regressions. Consider the nutritional study conducted by Johnson (1995): the caloric contents of various foods are regressed on their fat, protein and carbohydrate contents. For the 13 foods in his sample, OLS yields

Calories = 4.446 + 8.715 Fat + 4.044 Protein + 3.841 Carbohydrates
           [0.395]
$S_e$ = 6.97, $F$ = 3, $R^2$ = 0.987, adjusted $R^2$ = 0.983

regardless of which statistical software is used. But here the constant is insignificant and, as Johnson observes, nutritional theory indicates that a constant is inappropriate for this regression. In SPSS, removing the constant gives

Calories = 8.888 Fat + 4.66 Protein + 3.978 Carbohydrates
$S_e$ = 6.90, $F$ = 1459.66, $R^2$ = 0.998, adjusted $R^2$ = 0.997

with all diagnostics indicating an improved fit. Minitab and Excel produce the same equation but different diagnostics. Minitab again reports only $S_e$, while Excel generates

Calories = 8.888 Fat + 4.66 Protein + 3.978 Carbohydrates
$S_e$ = 6.895, $F$ = 36.5, $R^2$ = 0.986, adjusted $R^2$ = 0.883

In contrast to the previous example, the Excel output now seems more confusing than the SPSS output. Notice that Excel's $R^2$ and adjusted $R^2$ statistics for RTO indicate a worse fit, while its $S_e$ and $F$ statistics indicate a better fit, compared to the OLS model.
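The construction mentioned at the start of this discussion, a line with a large intercept that is then estimated without one, can be tried by hand. The following is a minimal sketch in plain Python; the data (an exact line with intercept 100 and slope 1) and the function name are assumptions made for illustration, and the $F$ ratio is computed with SSR taken as SST minus SSE and with $k$ and $n - k$ degrees of freedom.

```python
def rto_r2_and_f(x, y):
    """R^2 and F for a no-intercept fit, with SST about the mean and about the origin."""
    n, k = len(y), 1
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    sse = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    y_bar = sum(y) / n
    out = {}
    for label, sst in (
        ("about_mean", sum((yi - y_bar) ** 2 for yi in y)),  # equation (4)
        ("about_origin", sum(yi * yi for yi in y)),          # equation (4')
    ):
        ssr = sst - sse  # assumption: SSR taken as SST - SSE
        out[label] = {"r2": 1.0 - sse / sst, "f": (ssr / k) / (sse / (n - k))}
    return out


# Constructed example (an assumption): Y = 100 + x exactly, fitted through the origin.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [100.0 + xi for xi in x]

stats = rto_r2_and_f(x, y)
print("SST about the mean:  ", stats["about_mean"])
print("SST about the origin:", stats["about_origin"])
```

With these values the coefficient of determination based on deviations from the mean is about -908 and its $F$ ratio is negative, whereas the measure based on deviations from the origin is about 0.83.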
Given these inconsistencies, Hocking (1996, p. 178) notes: "It is natural to ask if there is a measure analogous to $R^2$ for the no-intercept model. We suggest the square of the sample correlation between observed and predicted values." It can easily be shown that this measure is equal to the unadjusted coefficient of determination for the OLS model. It therefore gives an interpretable measure of the quality of an RTO model, but does not help in comparing RTO with OLS. For that purpose, the best measures appear to be the p-value of the OLS constant and the standard errors of the OLS and RTO regressions. Using these measures, the constant should be retained in the eggs example given above, but not in the nutrition example.

CONCLUSION
Regression through the origin is an important and useful tool in applied statistics, but it remains a subject of pedagogical neglect, controversy and confusion. Hopefully, this synthesis provides some clarity. However, in the light of the unresolved debate, perhaps the strongest conclusion to be drawn from this review is that the practice of statistics remains as much an art as it is a science, and the development of statistical judgment is therefore as important as computational skill.

Acknowledgements
The author would like to thank Donald Dale, Scott Trees and Luigi Ventura, whose discussions of results in another paper prompted him to write this one. He also thanks the editor and anonymous referees for providing helpful comments. Any errors are his own.

References
Adelman, M.A. and Watkins, G.C. (1994). Reserve asset values and the Hotelling valuation principle: further evidence. Southern Economic Journal, 61(1), 664–73.
Casella, G. (1983). Leverage and regression through the origin. American Statistician, 37(2), 147–52.
Chambers, R.L. and Dunstan, R. (1986). Estimating distribution functions from survey data. Biometrika, 73(3), 597–604.
Data Bank (1990). Birds, eggs, and databases. Teaching Statistics, 12(2), 6 3.
Gordon, H.A. (1981). Errors in computer packages: least squares regression through the origin. The Statistician, 30(1), 23–9.
Hahn, G.J. (1977).
Fitting regression models with no intercept term. Journal of Quality Technology, 9(2), 56–61.
Hocking, R.R. (1996). Methods and Applications of Linear Models: Regression and Analysis of Variance. New York: John Wiley.
Johnson, R. (1995). A multiple regression project. Teaching Statistics, 17(2), 64–6.
Kimber, H. (1995). The golden egg. Teaching Statistics, 17(2), 34–7.
Maddala, G.S. (1977). Econometrics. New York: McGraw-Hill.
Marquardt, D.W. and Snee, R.D. (1974). Test statistics for mixture models. Technometrics, 16(4), 533–7.
Pettit, L.I. and Peers, H.W. (1991). An example not to be followed? Teaching Statistics, 13(1), 8.
Prvan, T., Reid, A. and Petocz, P. (2002). Statistical laboratories using Minitab, SPSS, and Excel: a practical comparison. Teaching Statistics, 24(2), 68–75.
Theil, H. (1971). Principles of Econometrics. New York: John Wiley.
Turner, M.E. (1960). Straight line regression through the origin. Biometrics, 16(3), 483–5.